Why is sort -k not working all the time?

alan · August 8, 2012, 6:51pm

I have a script that puts a list of files in two separate arrays:

First, I get a file list from a ZIP file and fill `FIRST_Array()` with it. Second, I get a file list from a control file within the ZIP file and fill `SECOND_Array()` with it.

    while read length date time filename
    do
        FIRST_Array+=( "$filename" )
        echo "$filename" >> FIRST.report.out
    done < <(/usr/bin/unzip -qql AAA.ZIP | sort -k12 -t~)

Third, I compare both arrays like so:

    diff -q <(printf "%s\n" "${FIRST_Array[@]}") <(printf "%s\n" "${SECOND_Array[@]}") |wc -l

I can tell that `diff` fails because I write each array to a file, and `FIRST.report.out` and `SECOND.report.out` are simply not sorted properly.

1) FIRST.report.out (what's inside the ZIP file)

JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML

2) SECOND.report.out (what's inside the ZIP's control file)

JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML

Using `sort -k12 -t~` made sense since `~` is the delimiter for the file's date field ("20120808": 12th position). But it is not working consistently.

The sort is worse when my script processes bigger ZIP files. Why is `sort -k` not working all the time? How can I sort both arrays?

Chubler_XL · August 8, 2012, 7:21pm

If your sort supports it, try the --debug option to see what it's sorting by. Field 12 is the filename.

Try sort -t~ -k10,11 to sort by the 12-digit number and then the date.
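To see what each key actually covers, it can help to cut the fields out of the names posted in the question. The sketch below assumes the `~`-separated names are the whole sort input (no `unzip` columns in front of them) and pins the locale with `LC_ALL=C` so the result does not depend on local collation; the variable names `names` and `sorted` are only for illustration. A key written as `-k12` has no end field, so it runs from field 12 to the end of the line, whereas `-k10,11` stops at the end of field 11.

```shell
# The four file names from the question, in the FIRST.report.out order.
names='JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML'

# Show fields 10, 11 and 12 of every name: the 12-digit number,
# the date, and the trailing "240914.XML".
printf '%s\n' "$names" | cut -d'~' -f10-12

# Sort on fields 10 through 11 only.
sorted=$(printf '%s\n' "$names" | LC_ALL=C sort -t'~' -k10,11)
printf '%s\n' "$sorted"
```

With this input the names come out ordered by the 12-digit number, so the X-MooreBe line is first and the X-GuinE line is last. With GNU sort, adding `--debug` to the same command marks the part of each line that was used as the key.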

spacebar · August 8, 2012, 7:24pm

Don't you want to sort the whole file name (from both lists) so both lists will be in the same order?

Don_Cragun · August 8, 2012, 8:32pm

What shell are you using? I'm not familiar with a redirection operator of the form:

< <(command pipeline)

Sorry. Never mind. It is a bash feature. This form is not available in ksh and is an extension to the POSIX shell.

alister · August 8, 2012, 9:37pm

don cragun:

What shell are you using? I'm not familiar with a redirection operator of the form:
< <(command pipeline)
Sorry. Never mind. It is a bash feature. This form is not available in ksh and is an extension to the POSIX shell.

ksh88 and ksh93 both support that construct. The process substitution, <(cmd ...) is replaced with /dev/fd/n , and the first < is the usual input redirection operator.
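The expansion described above can be checked directly. This is a minimal sketch that assumes `bash` is installed and calls it explicitly, since the shell running the snippet may not accept `<(...)` itself; the variable names `path` and `count` are illustrative only.

```shell
# Ask bash to print what <(...) expands to. On systems with /dev/fd this
# is a path such as /dev/fd/N; the exact name can vary.
path=$(bash -c 'echo <(true)')
printf 'expands to: %s\n' "$path"

# "< <(cmd)" is then an ordinary "<" redirection from that path, so the
# loop runs in the invoking shell and the counter survives the loop.
count=$(bash -c '
  n=0
  while read -r line; do n=$((n + 1)); done < <(printf "a\nb c\nd e f\n")
  echo "$n"
')
printf 'lines read: %s\n' "$count"
```

Because the loop is not on the right-hand side of a pipe, `n` still holds the number of lines read once the loop has finished.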

Regards,
Alister

Don_Cragun · August 8, 2012, 10:07pm

That's interesting. The ksh on Mac OS X Lion (Version M 1993-12-28 s+ $) says:

when given the command:

while read line;do echo x $line;done < <(printf "a\nb c\nd e f\n")

The New KornShell Command and Programming Language book by Bolsky and Korn does say that process substitution using this form is a "Possible Extension" that

and OS X does support /dev/fd. It looks like someone at Apple failed to configure OS X's ksh to give us access to this feature.

alan · August 9, 2012, 12:38pm

Yes, I do want to sort the whole file name.

I tried to do that by using "|sort" instead of "sort -k..." but the problem remains.

Don_Cragun · August 9, 2012, 2:39pm

Your problem statement says sort -k isn't working, but you're not showing us what you're trying to sort. Please show us the output of the command:

/usr/bin/unzip -qql AAA.ZIP

Then show us the output of the command:

/usr/bin/unzip -qql AAA.ZIP |sort -k12 -t~

and the command:

/usr/bin/unzip -qql AAA.ZIP |sort -t~

If what you showed as the contents of FIRST.report.out is the output from the unzip and SECOND.report.out is the output from feeding FIRST.report.out through

sort -k12 -t~

then the output is correct. Your primary sort key is field 12, which is "20120808" in every input line. So, sort processes the entire line as a sort key to resolve the ambiguity and finds the first differences in the second field. It has correctly sorted AT1 before FUN and FUN before PRE. Since the last two lines match on PRE, sort continues looking for differences and finally sorts the last two lines so GuinE comes before MooreBe. Since field 12 is identical in every record, the output from sort -k12 and sort without -k12 is identical. What are we missing?
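The whole-line tie-break matters if the lines fed to sort carry more than the name. The sketch below is only an illustration of that effect: it assumes the sort input looks like a listing with length, date and time columns in front of each name, which is what the `read length date time filename` loop in the question suggests. The lengths and timestamps are invented, the column layout is simplified to single spaces, and `listing`, `by_key` and `by_name` are hypothetical names.

```shell
# Simplified stand-ins for listing lines: "length date time name".
# Lengths and timestamps are made up; the names come from the thread.
listing='300 08-08-12 10:15 JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML
100 08-08-12 10:15 JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML
200 08-08-12 10:15 JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML'

# The -k12 key is identical on every line, so the whole-line tie-break
# decides the order, and the whole line starts with the length column.
by_key=$(printf '%s\n' "$listing" | LC_ALL=C sort -t'~' -k12 | cut -d' ' -f4)

# Dropping the first three columns before sorting orders by name instead.
by_name=$(printf '%s\n' "$listing" | cut -d' ' -f4 | LC_ALL=C sort)

printf 'sorted with -k12 on the full listing:\n%s\n\n' "$by_key"
printf 'sorted on the name alone:\n%s\n' "$by_name"
```

With this made-up input the first list comes out in length order (X-GuinE, X-RossA, X-BanhT) and the second in name order (X-BanhT, X-RossA, X-GuinE). Whether the real data behaves this way depends on the actual `unzip -qql` output requested above.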

spacebar · August 9, 2012, 8:51pm

OK, then that means something is different per record in each array.
For testing, can you write the data from each array to a file, sort the files, and then run the diff?

I put your example data (below) into two files, sorted the whole records, and diff reported them as the same:

1) FIRST.report.out (what's inside the ZIP file)
JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML 
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML 
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML 
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML
2) SECOND.report.out (what's inside the ZIP's control file)
JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML 
JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML 
JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML 
JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML
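
The test described above can be reproduced roughly as follows. This is a sketch under a few assumptions: the names are entered without trailing whitespace, the files are written to a scratch directory made with `mktemp -d` rather than to the real report files, and both sorts use `LC_ALL=C` so they collate identically. The variable names are illustrative.

```shell
# Scratch directory so the real report files are not touched.
tmpdir=$(mktemp -d)

# The four names from the thread.
banh='JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~6~P~1100~HR24-500~033072053326~20120808~240914.XML'
guin='JGS-Memphis~PRE~DTV_PREP~X-GuinE~JGMDTV069~6~P~1100~H24-700~033081107519~20120808~240914.XML'
moore='JGS-Memphis~PRE~DTV_PREP~X-MooreBe~JGM98745~40~P~1100~H21-200~029264526103~20120808~240914.XML'
ross='JGS-Memphis~FUN~Pre-Test~X-RossA~jgmdtv168~2~P~1100~H21-200~029415655926~20120808~240914.XML'

# Write them in the two orders that were posted.
printf '%s\n' "$banh" "$guin" "$moore" "$ross" > "$tmpdir/FIRST.report.out"
printf '%s\n' "$banh" "$ross" "$guin" "$moore" > "$tmpdir/SECOND.report.out"

# Sort whole records, using the same locale on both sides.
LC_ALL=C sort "$tmpdir/FIRST.report.out"  > "$tmpdir/FIRST.sorted"
LC_ALL=C sort "$tmpdir/SECOND.report.out" > "$tmpdir/SECOND.sorted"

# diff -q exits 0 when the two files are identical.
if diff -q "$tmpdir/FIRST.sorted" "$tmpdir/SECOND.sorted" >/dev/null; then
    result=same
else
    result=different
fi
echo "$result"

rm -rf "$tmpdir"
```

For these four names the script prints `same`. If the real lists still differ after a whole-record sort, the difference is in the records themselves rather than in their order.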