Holy crap locales!

Here’s something fun to try. Create a text file that looks like this (note: it must be UTF-8 encoded!):

A
ß
C
ßa
ßz
a
B
b
c
S
s
SS
ss
SA
sa
SZ
sz

Just to be really clear, here are the exact bytes I’m talking about:

$ hexdump -C  /tmp/foo.txt 
00000000  41 0a c3 9f 0a 43 0a c3  9f 61 0a c3 9f 7a 0a 61  |A....C...a...z.a|
00000010  0a 42 0a 62 0a 63 0a 53  0a 73 0a 53 53 0a 73 73  |.B.b.c.S.s.SS.ss|
00000020  0a 53 41 0a 73 61 0a 53  5a 0a 73 7a 0a           |.SA.sa.SZ.sz.|
0000002d
$ md5sum /tmp/foo.txt
ac2be5e453dd79c070da74d0e67aa6b2 /tmp/foo.txt
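
If you’d rather not trust copy and paste with that ß, here’s one way to write the same bytes straight from the shell. Treat it as a sketch: it assumes a POSIX-style printf, where \303\237 is the octal spelling of the c3 9f pair shown in the hexdump, and it reuses the /tmp/foo.txt path from above.

```shell
# Recreate the test file byte-for-byte.
# \303\237 are the two UTF-8 bytes of ß (hex c3 9f), written as octal escapes.
out=/tmp/foo.txt
printf 'A\n\303\237\nC\n\303\237a\n\303\237z\na\nB\nb\nc\nS\ns\nSS\nss\nSA\nsa\nSZ\nsz\n' > "$out"

# Quick sanity check: print the size of the file in bytes.
wc -c < "$out"
```

The size should come out as 45 bytes (0x2d), matching the hexdump; comparing against the md5sum above is the stricter check.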

Now, compare the output of the following commands:

$ sort /tmp/foo.txt
$ LC_ALL='en_US' sort /tmp/foo.txt
$ LC_ALL='en_US.utf8' sort /tmp/foo.txt
$ LC_ALL='en_US.iso88591' sort /tmp/foo.txt
$ LC_ALL='C' sort /tmp/foo.txt
$ LC_ALL='de_DE.utf8' sort /tmp/foo.txt

How’s that for rocking your world? So, the next time your friend says “hey, can you return those results sorted for me?” then you’ll have something really fun to think about when you can’t sleep at night.

And just when you thought “Oh, well great, at least all the UTF-8 versions sort the same” then comes along this little gem:

$ LC_ALL="jp_JP.utf8" sort /tmp/foo.txt

Oh, and just when you thought “Well, I guess I’ll be OK with en_US.utf8 and at least English will sort the way I want worldwide!” then along come your friends to the north with this awesome zinger:

$ LC_ALL="en_CA.utf8" sort /tmp/foo.txt
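
Eyeballing that many sort outputs gets old fast, so here’s a rough sketch that prints one checksum per locale; identical orderings show up as identical hashes. A few assumptions: compare_locales is a made-up helper name, md5sum is the GNU coreutils one, and the check against `locale -a` assumes you spell each locale exactly the way your system lists it (glibc prints en_US.utf8, other systems may print en_US.UTF-8). It also flags locales that aren’t installed, since as far as I can tell sort doesn’t complain about those and just sorts as if you’d asked for C.

```shell
# compare_locales FILE LOCALE...
# Prints one line per locale: its name and the md5 of FILE sorted under it.
compare_locales() {
    file=$1
    shift
    for loc in "$@"; do
        # Only trust the result if the locale is actually installed.
        if locale -a 2>/dev/null | grep -qxF -- "$loc"; then
            sum=$(LC_ALL="$loc" sort "$file" | md5sum | cut -d' ' -f1)
            printf '%-16s %s\n' "$loc" "$sum"
        else
            printf '%-16s %s\n' "$loc" "(not installed)"
        fi
    done
}

# Run it over the locales used in this post, if the test file exists.
if [ -f /tmp/foo.txt ]; then
    compare_locales /tmp/foo.txt en_US en_US.utf8 en_US.iso88591 C \
        de_DE.utf8 jp_JP.utf8 en_CA.utf8
fi
```

Any two locales that print the same hash sorted the file identically, and anything marked “(not installed)” is worth a second look before you draw conclusions from it.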

7 thoughts on “Holy crap locales!”

  1. SPOILER ALERT

    I know locale affects sort order, but I’m not seeing many variations here. Using GNU sort 8.13 on Ubuntu 12.04 I get the following. I’m assuming that based on your inclusion of the Eszett and de_DE.utf8, that you expected more variations?

    C, en_US, en_CA, en_US.iso88591, de_DE.utf8, jp_JP.utf8: Byte order
    en_US.utf8: Case insensitive, lowercase first
    en_CA.utf8: Case insensitive, uppercase first (I’m Canadian and didn’t know this; who decides this stuff?)

    I guess the lesson here is that you can’t rely on locale-aware code to behave the same on different computers/environments.

  2. Oh, hm. Running `locale -a` on my machine only includes C and en_*. I see Ubuntu ships each language separately and I only have English installed, so that explains that. Again, my takeaway is that locale-specific code could be randomly “broken” on different machines! Scary.

  3. Did you include the character “ß” in your set of strings that was being sorted? In short, I found:

    I have 73 available locales on my system (viewable via ‘locale -a’) and they sort the file above 5 different ways:

    61 of them: sort the same as “en_US.utf8”, in which lowercase precedes uppercase: ['a', 'A', 'b', 'B'...], and ß sorts between SS and ST. This includes de_DE.utf8 and zh_CN.utf8. It would be interesting to have more languages in the test file, as I suspect many of them have different orderings.

    en_CA.utf8, fr_CA.utf8, nb_NO.utf8, nn_NO.utf8: Sort the same as en_US, but uppercase precedes lowercase: ['A', 'a', 'B', 'b', ...]

    C, C.UTF-8, jp_JP.utf8 and POSIX all sort in byte order. ['A', 'B', ..., 'a', 'b', ...]

    en_US, en_US.iso88591, en_US.iso885915: Sort with lower first, but ß comes between A and B. ['a', 'A', 'ß', 'b', 'B', ...]

  4. And yes, an invalid locale string is treated as “C”. So, LC_ALL='craphole' sort foo.txt will sort the same as “C”, which is troublesome when it’s a valid locale that’s not installed.

  5. FYI I’m running something like this:

    $ for locale in $(locale -a); do LC_ALL=$locale sort ./foo.txt > $locale.txt; done
    $ md5sum *.txt > sums
    $ md5sum *.txt | awk '{print $1}' | sort | uniq -c

    Which will show the number of different results, then you can grep the sums file for which locales sort in which order(s).

  6. Just tested: Including some Japanese and Chinese characters in the test file makes ja_JP.utf8 and zh_CN.utf8 sort differently than en_US.utf8.

  7. Yes, I used the exact same file you posted, with the exact same MD5. The difference is that my Linux install has far fewer locales installed than yours. I only have 24 listed, and strangely it does *not* include “en_US”, only “en_US.utf8”, so I think that explains the differences.
