difference in unix vs. linux sort

aj.schaeffer · August 18, 2010, 4:42pm

Hi,

I am using some code that has been ported from unix to linux, and the sorting no longer produces the desired ordering. I'm hoping to find a way to mimic the unix sort command in linux. The input file is structured as follows:

$> cat file.txt
US;KSU1;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;KSU1;10;LH2;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;KSU1;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;XX;LH1;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1;HR;LH2;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LH1;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;BSU2;10;LHN;2006-07-26;17:41:00;2999-12-31;23:59:59
US;BSU2; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;BSU2;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;XX;LHE;2006-07-26;17:41:00;2999-12-31;23:59:59
US;BSU2;HR;LHN;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHE;2006-06-01;00:00:00;2999-12-31;23:59:59

It is semicolon-separated (although that doesn't particularly matter). Please note that in the 3rd and 10th rows, column three appears to be "missing" a value. It isn't; it is simply two blanks, "<space><space>". This is a real entry in this file. The output should be sorted into a specified format, where it is keyed in order on each column. In unix, the default sort command (with -u to remove duplicate lines) is what we've always used. The result is

unix> cat file.txt | sort -u
US;BSU2; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;BSU2;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;BSU2;10;LHN;2006-07-26;17:41:00;2999-12-31;23:59:59
US;BSU2;HR;LHE;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHN;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;XX;LHE;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;KSU1;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;KSU1;10;LH2;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1;HR;LH1;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LH2;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;XX;LH1;2006-07-26;17:41:00;2999-12-31;23:59:59

There are two main entries, as determined by columns one and two: "US BSU2" and "US KSU1". For each of these, the "blank" in column three is sorted first, followed by the numeric values, then the alphabetical values. This is the correct ordering for this file. However, if I run the equivalent command on linux, the output is quite different.

linux$> cat file.txt | sort -t';' -u
US;BSU2;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;BSU2;10;LHN;2006-07-26;17:41:00;2999-12-31;23:59:59
US;BSU2;HR;LHE;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHN;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;BSU2;XX;LHE;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;KSU1;10;LH2;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1;HR;LH1;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LH2;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;KSU1;XX;LH1;2006-07-26;17:41:00;2999-12-31;23:59:59

In this case, the rows with the "blanks" are no longer sorted first, and instead slot between HR and XX.

Is there a way to emulate the behaviour of the unix sort command within linux? I imagine there is a difference in the precedence of the characters, but I don't know how the <space><space> is interpreted so that it fits between HR and XX.

Thanks for any help.

Corona688 · August 18, 2010, 5:28pm

I think something's up with your data files. I can't get a sorting order anything like any of those.

aj.schaeffer · August 18, 2010, 5:34pm

Hmmm.

I just copied and pasted the output of "cat file.txt" from my browser into an empty gedit window, saved it, then ran "cat file.txt | sort -t';' -u" from the command line, and it produced the same output as the bottom one (the linux one). I'm not sure what I'm doing to the files by copying and pasting.

What sorted output did you get?

Corona688 · August 18, 2010, 5:36pm

$ sort -u < file.txt
US;BSU2; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;BSU2;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;BSU2;10;LHN;2006-07-26;17:41:00;2999-12-31;23:59:59
US;BSU2;HR;LHE;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHN;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;XX;LHE;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;KSU1;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;KSU1;10;LH2;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1;HR;LH1;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LH2;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;XX;LH1;2006-07-26;17:41:00;2999-12-31;23:59:59
$ sort -t';' -u < file.txt
US;BSU2; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;BSU2;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;BSU2;10;LHN;2006-07-26;17:41:00;2999-12-31;23:59:59
US;BSU2;HR;LHE;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHN;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;BSU2;XX;LHE;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1; ;LHZ;2008-12-31;17:41:00;2999-12-31;23:59:59
US;KSU1;00;LHZ;2006-07-26;17:41:00;2008-12-31;23:59:59
US;KSU1;10;LH2;2006-07-26;17:41:00;2999-12-31;23:59:59
US;KSU1;HR;LH1;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LH2;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;HR;LHZ;2006-06-01;00:00:00;2999-12-31;23:59:59
US;KSU1;XX;LH1;2006-07-26;17:41:00;2999-12-31;23:59:59

I'm guessing that maybe those spaces aren't actually spaces. Try transferring the data with scp instead of copy-paste.
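One way to test that guess without eyeballing the file is to count the bytes that are neither printable ASCII nor a newline. This is only a sketch: count_odd_bytes is a made-up helper name, and the sample lines are shortened stand-ins for the real rows. Ordinary spaces (0x20) are not counted; tabs and non-breaking spaces are.

```shell
# Hypothetical helper: count bytes on stdin that are neither printable
# ASCII (octal 040-176) nor a newline. LC_ALL=C makes tr work on raw bytes.
count_odd_bytes() {
    LC_ALL=C tr -d '\040-\176\n' | wc -c
}

# Two ordinary spaces in field three: nothing suspicious is counted.
printf 'US;KSU1;  ;LHZ\n' | count_odd_bytes

# A tab in the same field is counted as one suspicious byte.
printf 'US;KSU1;\t;LHZ\n' | count_odd_bytes
```

To check the real data, redirect the file into the helper: count_odd_bytes < file.txt. A result of 0 would mean the blanks really are plain spaces and the cause lies elsewhere.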

aj.schaeffer · August 18, 2010, 5:48pm

I created the file file.txt within linux in gedit, manually putting in two spaces. I then ran the sort on it, with the same incorrect output.

I then scp'd the file to a sun computer, ran sort, and the output was sorted in the same order as yours.

Then I scp'd the file back to a new file on the linux computer, and ran sort again, but got the same output as the original sort I had carried out on linux (Ubuntu 9.10 x86_64). What *nix are you running?

Corona688 · August 18, 2010, 5:50pm

How about, instead of copy/pasting into gedit, you copy the original file from the original source and try it on that? Not copy-paste, get the file itself. Copy-pasting is likely where the change is happening. Character sets might be getting changed slightly (or maybe you have a different character set than me), or whitespace mangled (things like tabs usually get copy/pasted as spaces). Copy-pasting from a web browser especially tends to eat multiple spaces.

Gentoo linux.

$ sort --version
sort (GNU coreutils) 7.5
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

aj.schaeffer · August 19, 2010, 6:36am

Just to clarify, the text file created on the sun computer was created by hand (i.e., typed out) simply to test some of the routines. The original full text files are station lists that are obtained by copying and pasting from a web browser into a text file. In the past, on the sun system, this had never been a problem. After I scp'd the files from the sun system to the linux system, the sort command no longer ordered them the same way.

I've scp'd the file from the sun computer (generated there) to the linux computer and run the sort, but it still produces the same incorrect output. As far as I know, there is no copy and paste going on now.

$ sort --version
sort (GNU coreutils) 7.4
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and Paul Eggert.

I can't imagine that this is caused by the difference between GNU 7.4 and 7.5.

Thanks!

---------- Post updated 08-19-10 at 11:36 AM ---------- Previous update was 08-18-10 at 11:10 PM ----------

I have managed to get this figured out now, after coming across the following post: sort (GNU coreutils) 7.4 not sorting in ascii order (asked and answered).

My LC_COLLATE environment variable was set to "en_IE.utf8". I set it to "C" instead, which gives the POSIX ASCII (byte-order) sort we had in our Unix environment, and the sort command now works as I would expect on my input file.
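For anyone hitting the same thing, here is a minimal sketch of the fix applied to a single command. The three sample lines are shortened from the file above and use two literal spaces in field three. The sketch uses LC_ALL rather than LC_COLLATE, purely as an assumption for robustness: LC_ALL overrides the other locale variables, so it works even if one of them is already set.

```shell
# Shortened sample rows; field three of the second row is two spaces.
sample=$(printf '%s\n' 'US;KSU1;HR;LHZ' 'US;KSU1;  ;LHZ' 'US;KSU1;00;LHZ')

# LC_ALL=C applies to this one sort invocation only and forces plain
# byte-order comparison: space (0x20) < digits < uppercase letters.
sorted=$(printf '%s\n' "$sample" | LC_ALL=C sort -u)

# Expected order: the two-space row, then 00, then HR.
printf '%s\n' "$sorted"
```

To apply it to a whole session instead, export LC_COLLATE=C does the same job as long as LC_ALL is not set. Running locale with no arguments shows which values are currently in effect.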

Thanks for your help!