Sort with UNIX

aydj · October 18, 2010, 7:38am

I want to sort unique values of column 2 that has the maximum value at column 7. this is my file and desired output

I have a file like this:

AD008 AD0081010180947 101018 0947 0950 1010180947 1010180950
AD008 AD0081010180947 101018 0947 0956 1010180947 1010180956
AD008 AD0081010180947 101018 0947 1014 1010180947 1010181014
AD008 AD0081010180947 101018 0947 0956 1010180947 1010180956
ED006 ED0061010180949 101018 0949 0956 1010180949 1010180956
ED006 ED0061010180949 101018 0949 0956 1010180949 1010180956
ED007 ED0071010180949 101018 0949 0956 1010180949 1010180956
ED011 ED0111010180949 101018 0949 0956 1010180949 1010180956
ED011 ED0111010180949 101018 0949 1201 1010180949 1010181201
ED011 ED0111010180949 101018 0949 0956 1010180949 1010180956
ED011 ED0111010180949 101018 0949 0956 1010180949 1010180956
ED011 ED0111010180949 101018 0949 0956 1010180949 1010180956
ED011 ED0111010180949 101018 0949 0956 1010180949 1010180956
KW005 KW0051010180950 101018 0950 0950 1010180950 1010180950
KW005 KW0051010180950 101018 0950 0956 1010180950 1010180956
OG010 OG0101010181002 101018 1002 0956 1010181002 1010180956
OG010 OG0101010181002 101018 1002 1123 1010181002 1010181123
OG010 OG0101010181002 101018 1002 0956 1010181002 1010180956

to give:

ED011 ED01171010180949 101018 0949 1201 1010180949 1010181201
OG010 OG01001010181002 101018 1002 1123 1010181002 1010181123
AD008 ED00801010180947 101018 0947 1014 1010180947 1010181014
ED006 ED00631010180949 101018 0949 0956 1010180949 1010180956
ED007 ED00741010180949 101018 0949 0956 1010180949 1010180956
KW005 KW00521010180950 101018 0950 0956 1010180950 1010180956

Scrutinizer · October 18, 2010, 8:01am

Like this?

$ awk '$7>M[$1]{M[$1]=$7;A[$1]=$0}END{for(i in A)print A}' infile427
KW005 KW0051010180950 101018 0950 0956 1010180950 1010180956
OG010 OG0101010181002 101018 1002 1123 1010181002 1010181123
ED011 ED0111010180949 101018 0949 1201 1010180949 1010181201
ED006 ED0061010180949 101018 0949 0956 1010180949 1010180956
AD008 AD0081010180947 101018 0947 1014 1010180947 1010181014
ED007 ED0071010180949 101018 0949 0956 1010180949 1010180956

ctsgnb · October 18, 2010, 8:29am

scrutinizer:

Like this?

$ awk '$7>M[$1]{M[$1]=$7;A[$1]=$0}END{for(i in A)print A}' infile427
KW005 KW0051010180950 101018 0950 0956 1010180950 1010180956
OG010 OG0101010181002 101018 1002 1123 1010181002 1010181123
ED011 ED0111010180949 101018 0949 1201 1010180949 1010181201
ED006 ED0061010180949 101018 0949 0956 1010180949 1010180956
AD008 AD0081010180947 101018 0947 1014 1010180947 1010181014
ED007 ED0071010180949 101018 0949 0956 1010180949 1010180956

OMG awk is darn powerful ... i really should get deeper into it

Could someone kindly comment the code pls ?

do awk execute M[$1]=$7 before the checking condition $7>M[$1] ?
for(i in A)print A [i]i doesn't need to be initialized ? it automatically scan the A array ?
how does it know which [$1] index value to take to scan this array ?

Thanks in advance for you lightening

rdcwayx · October 18, 2010, 8:31am

should be compare on $2

awk '$7>M[$2]{M[$2]=$7;A[$2]=$0}END{for(i in A)print A}' infile

Scrutinizer · October 18, 2010, 8:38am

I adjusted $1 to $2 in your question.

M[$2] does not exist when $7>M[$2] gets executed the first time. But in awk this is no problem and it evaluates to $7 being bigger.
yes
The arrays in awk are associative. The for..in construct iterates over the elements present.

ctsgnb · October 18, 2010, 9:40am

awk Array are associative ?...

Does it mean that we don't care about the index value [$2] ?
I mean : this $2 value is in fact not the value of the index itself : it is mapped to an internal increment** of the current A [i]element ?
(**) this internal increment would be the real one over which the for construct iterates.

Is that correct or am i missing something ?

Scrutinizer · October 18, 2010, 10:00am

Hi, no in the END section, the array element indices are equal to the $2 fields of the input records.

ctsgnb · October 18, 2010, 10:16am

LoL OK then ...

so if we consider

M[$2]=$7

when the value of $2 takes - for example - the value 5 and then 3

The array looks like this :
when $2=5

the M array is
M[1] = unknown
M[2] = unknown
M[3] = unknown
M[4] = unknown
M[5] = $7
M[6] = unknown
...

and when the
$2=3
the M array is
M[1] = unknown
M[2] = unknown
M[3] = $7
M[4] = unknown
M[5] = $7
M[6] = unknown
...

?

or does it build an array such as :
M[5]=$7
M[3]=$7

So that in a way or another, is has an internal mapping table or whatever, in which it hold the order in which the [n] has been entered, so it start by scanning M[5] before M[3] whereas 3<5 ...

???

So in short :

i takes the values of $2 in the for construct, but awk still manage a mapping with an internal index mapping or whatever, which helps it to keep the order in which the M[$2] value are entered (even if $2 does not have ordered values)

Is that correct ? or should i go to buy another brain ?

Scrutinizer · October 18, 2010, 10:22am

After line 1 M["AD0081010180947"] contains the value 1010180950
After line 2 M["AD0081010180947"] contains the value 1010180956
after line 3 M["AD0081010180947"] contains the value 1010181014
after line 4 M["AD0081010180947"] still contains the value 1010181014
after line 5 M["AD0081010180947"] contains the value 1010181014 and M["ED0061010180949"] contains the value 1010180956
etc.

radoulov · October 18, 2010, 10:24am

Yes,
it's just a sparse associative (keyed by strings) array.

Unless you're using WHINY_USERS (a GNU awk undocumented extention), the internal order of the keys cannot be controlled:

DGPickett · October 18, 2010, 10:31am

Simpler if not as resource-sparing: sort all by columns 7 reverse (not unique) pipe sort unique column 2. The sort -u saves the first holder of a key value. If you want a sequential presentation on another column, that is a third sort.

sort -r +6 -7 your_file | sort -u +1 -2

ctsgnb · October 18, 2010, 10:47am

Ok, i think i start to begin to understand ...

Thank you guys for your help & patience sharing your knowledge !

aydj · October 18, 2010, 12:13pm

Thanks to everybody, i can now complete my script.