Sort command - strange behaviour

miwinter · May 28, 2008, 4:51am

Hi guys,

I have the following example data:

A;00:00:19
B;00:01:02
C;00:00:13
D;00:00:16
E;00:02:27
F;00:00:12
G;00:00:21
H;00:00:19
I;00:00:13
J;00:13:22

I run the following sort against it, yet the output is as follows:

sort -t";" +1 -nr example_data.dat

A;00:00:19
B;00:01:02
C;00:00:13
D;00:00:16
E;00:02:27
F;00:00:12
G;00:00:21
H;00:00:19
I;00:00:13
J;00:13:22

I'd expect it to recognise the field delimiter, skipping field 1 and then sorting numerically on the second field (i.e. to put the longest time first and the shortest last). Any ideas please? Is the ":" in this field causing an issue?

I've proven it is recognising the second field as the sort key by changing to a dictionary-based sort too:

sort -t"," +1 -dr example_data.dat

J,00:13:22
E,00:02:27
B,00:01:02
G,00:00:21
A,00:00:19
H,00:00:19
D,00:00:16
C,00:00:13
I,00:00:13
F,00:00:12

This produces the desired output against this subset of data, but when I run it against the 'live' data (of much larger volume) it isn't any use as it will start with all times beginning with '9' first and then in descending order, it has to be a numeric-based sort ultimately.

Thanks in advance,

Mark

krishmaths · May 28, 2008, 4:56am

sort -t";" -rk2,2 example_data.dat

miwinter · May 28, 2008, 5:10am

Thanks for the uber-fast reply Krish. I looked at the key definition thing (k switch) but it didn't seem to to work either. Using what you gave does the right thing, only, when I transpose that command to use on my live data, it doesn't. Here's an example (first 10 lines out of the newly sorted file):

sort -t";" rk2,2 mwreport_joined.txt > mwreport_sorted.txt

GLMLRP_ComparisonJob;989:13:42
GLMLRP_Diff_HighlighterJob;989:08:56
AD046;988:44:15
GleamMIPostCanadaExtractJob;9196:53:12
GleamMIAGREERepAllBackOutJob;9025:39:12
GleamMIAGREEProdFacilCombJob;9025:29:36
GleamMIAGREEExcRateHistExtractJob;9025:21:26
GleamMIAGREEDynamicParamJob;9025:19:10
GleamMIAGREEClassExtractJob;9025:11:35
GleamMIAGREEClassPODLoadJob;9025:09:43

As you can see above, the "9196:53:12" value in the fourth record should be top of the list as it is the largest numerically

zaxxon · May 28, 2008, 9:06am

Is this what you want?

sort -t";" -rn -k2,2 mwreport_joined.txt

GleamMIPostCanadaExtractJob;9196:53:12
GleamMIAGREERepAllBackOutJob;9025:39:12
GleamMIAGREEProdFacilCombJob;9025:29:36
GleamMIAGREEExcRateHistExtractJob;9025:21:26
GleamMIAGREEDynamicParamJob;9025:19:10
GleamMIAGREEClassPODLoadJob;9025:09:43
GleamMIAGREEClassExtractJob;9025:11:35
GLMLRP_Diff_HighlighterJob;989:08:56
GLMLRP_ComparisonJob;989:13:42
AD046;988:44:15

miwinter · May 28, 2008, 9:09am

zaxxon:

Is this what you want?

sort -t";" -rn -k2,2 mwreport_joined.txt

GleamMIPostCanadaExtractJob;9196:53:12
GleamMIAGREERepAllBackOutJob;9025:39:12
GleamMIAGREEProdFacilCombJob;9025:29:36
GleamMIAGREEExcRateHistExtractJob;9025:21:26
GleamMIAGREEDynamicParamJob;9025:19:10
GleamMIAGREEClassPODLoadJob;9025:09:43
GleamMIAGREEClassExtractJob;9025:11:35
GLMLRP_Diff_HighlighterJob;989:08:56
GLMLRP_ComparisonJob;989:13:42
AD046;988:44:15

That's closer yes... although, I've highlighted above where records are out of line:

9025:11:35 - this should sit between 9025:19:10 and 9025:09:43

989:13:42 - this should be above 989:08:56

zaxxon · May 28, 2008, 9:19am

Wrote nonsense - will come up with a better idea, brb

Here it is:

awk -F ";" '{print $2}' mwreport_joined.txt| sort -t ";" -rn -k 1,1 -k 2,2 -k 3,3 | xargs -I {} grep {} mwreport_joined.txt

Maybe not nice but works on Debian Linux. Not sure about the -I {} on xargs for other OS'es. On AIX I usually just leave it out iirc.

miwinter · May 28, 2008, 10:21am

Cheers again That seems to work although it returns each line multiple times, meaning the output file becomes much larger than the original

ripat · May 28, 2008, 11:44am

Using option g?

sort -t";" -k 2,2rg file.to.sort

zaxxon · May 29, 2008, 1:07am

@ripat
Nope doesn't work - check the second column when having the ":" as delimeter. It only sorts the 1st number in front of the 1st ":".

@miwinter
Hm, I tested it with mawk on Debian Linux and on AIX with awk and no duplicate output. Not sure what you have to add to supress the duplicates... if you find no option to tell it awk, maybe pipe it into "uniq" at the end.

era · May 29, 2008, 2:35am

Or split the input to make it easier for sort to parse the fields you want.

vnix$*awk -F';' '{ split($2,t,":"); print t[1] ";" t[2] ";" t[3] ";" $0 }' <<HERE |
> GleamMIPostCanadaExtractJob;9196:53:12
> GleamMIAGREERepAllBackOutJob;9025:39:12
> GleamMIAGREEProdFacilCombJob;9025:29:36
> GleamMIAGREEExcRateHistExtractJob;9025:21:26
> GleamMIAGREEDynamicParamJob;9025:19:10
> GleamMIAGREEClassPODLoadJob;9025:09:43
> GleamMIAGREEClassExtractJob;9025:11:35
> GLMLRP_Diff_HighlighterJob;989:08:56
> GLMLRP_ComparisonJob;989:13:42
> AD046;988:44:15
> HERE
> sort -t ';' -rn | cut -d ';' -f4-
GleamMIPostCanadaExtractJob;9196:53:12
GleamMIAGREERepAllBackOutJob;9025:39:12
GleamMIAGREEProdFacilCombJob;9025:29:36
GleamMIAGREEExcRateHistExtractJob;9025:21:26
GleamMIAGREEDynamicParamJob;9025:19:10
GleamMIAGREEClassExtractJob;9025:11:35
GleamMIAGREEClassPODLoadJob;9025:09:43
GLMLRP_ComparisonJob;989:13:42
GLMLRP_Diff_HighlighterJob;989:08:56
AD046;988:44:15

Take away the final cut to see what it's really doing.

ripat · May 29, 2008, 3:53am

sort has a strange behavior. When you pipe him the second field using awk, the sort is ok:

$ awk -F";" '{print $2}' f | sort -rn
9196:53:12
9025:39:12
9025:29:36
9025:21:26
9025:19:10
9025:11:35
9025:09:43
989:13:42
989:08:56
988:44:15

But when you ask sort to do the very same i.e to split on the ; and sort on the second key, the output is, indeed, not as expected:

$ sort -t";" -k2rn f
GleamMIPostCanadaExtractJob;9196:53:12
GleamMIAGREEClassExtractJob;9025:11:35
GleamMIAGREEClassPODLoadJob;9025:09:43
GleamMIAGREEDynamicParamJob;9025:19:10
GleamMIAGREEExcRateHistExtractJob;9025:21:26
GleamMIAGREEProdFacilCombJob;9025:29:36
GleamMIAGREERepAllBackOutJob;9025:39:12
GLMLRP_ComparisonJob;989:13:42
GLMLRP_Diff_HighlighterJob;989:08:56
AD046;988:44:15

Very strange.

zaxxon · May 29, 2008, 4:02am

And one more...
Forgot about that you can use more than one Field Separator - so here a more compact version (ok, some cheating on the delimiters with sed in the end):

awk -F ";|:" 'BEGIN{OFS=":"} {print $1,$2,$3,$4}' infile| sort -t ":" -rn -k 2,2 -k 3,3 -k 4,4 | sed -n 's/\:/\;/p'

@era
Btw, funny awards on your site

ripat · May 29, 2008, 4:18am

Got it! I had to read all possible options, even the one that are not well documented!

$ sort -t";" -s -k2,2rn f
GleamMIPostCanadaExtractJob;9196:53:12
GleamMIAGREERepAllBackOutJob;9025:39:12
GleamMIAGREEProdFacilCombJob;9025:29:36
GleamMIAGREEExcRateHistExtractJob;9025:21:26
GleamMIAGREEDynamicParamJob;9025:19:10
GleamMIAGREEClassExtractJob;9025:11:35
GleamMIAGREEClassPODLoadJob;9025:09:43
GLMLRP_ComparisonJob;989:13:42
GLMLRP_Diff_HighlighterJob;989:08:56
AD046;988:44:15

Option -s did the trick! Found solution by reading this man page:

   Finally, as a last resort when all keys compare equal (or if no
   ordering options were specified at all), `sort' compares the entire
   lines.  The last resort comparison honors the `-r' global option.  The
   `-s' (stable) option disables this last-resort comparison so that lines
   in which all fields compare equal are left in their original relative
   order.  If no fields or global options are specified, `-s' has no
   effect.

sort Linux Commands: Sort text files

zaxxon · May 29, 2008, 4:29am

Nice solution!

Edit:
Just saw that sort does not have that switch on AIX (didn't look for a similar one) but on Linux, nice though

miwinter · May 29, 2008, 6:32am

Aargh! Why on Earth does AIX not provide this option! It's just what I'm after

era · May 29, 2008, 6:44am

If you split on the colons as well, sort should work. (We assume you don't really require a stable sort; if two lines have exactly the same sort key, their order should not matter; correct?) Did you try the awk script I posted before?

miwinter · May 29, 2008, 7:19am

Sorry era, yep, I have gone with that solution and it looks to be working a dream on my entire live data (about 43,000 rows). Thanks very much to all who have contributed in this thread, your help is much appreciated. I thought I was going mad when I saw the odd behaviour we've witnessed with sort in AIX here!