mantis
October 29, 2012, 5:37pm
1
Hi,
I have a huge text file of filenames which look like the following, i.e. uniquenumber_version_filename:
e.g.
1234_1_xxxx
1234_2_vfvfdbb
343333_1_vfvfdvd
2222222_1_ggggg
55555_1_xxxxxx
55555_2_vrbgbgg
55555_3_grgrbr
What I need to do is examine the file, find the duplicate uniquenumbers, and keep only the entry with the highest version for each of them, so for the example above the output would be:
1234_2_vfvfdbb
55555_3_grgrbr
Is there a scripted method by which I can do this?
Thanks in advance
Mantis
Yoda
October 29, 2012, 6:22pm
2
awk -F_ '$1 != prev { if (n > 1) print last; n = 0 } { prev = $1; last = $0; n++ } END { if (n > 1) print last }' infile
mantis
October 29, 2012, 7:07pm
3
That's an amazing solution sir, but it looks a bit complicated.
Also, will I be able to apply it to a huge file of uniquenumber_version_filename entries with thousands of rows?
Thanks
Another awk solution:
awk -F_ '{C[$1]++} $2>0+V[$1]{V[$1]=$2;F[$1]=$0} END{for(k in V) if(C[k]>1) print F[k]}' infile
This should work for quite large files; note that the output will be unsorted, and you didn't say whether the original file order matters.
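To see the hash-based approach in action, here is a small self-contained demo run against the sample data from the first post (the filename sample.txt and the trailing sort, added so the output order is predictable, are my own additions; the C[] count is used to skip uniquenumbers that appear only once):

```shell
#!/bin/sh
# Build the sample file from the original post.
cat > sample.txt <<'EOF'
1234_1_xxxx
1234_2_vfvfdbb
343333_1_vfvfdvd
2222222_1_ggggg
55555_1_xxxxxx
55555_2_vrbgbgg
55555_3_grgrbr
EOF

# C[] counts lines per uniquenumber; V[]/F[] remember the highest
# version seen so far and the full line that carried it.
awk -F_ '{C[$1]++}
         $2 > 0+V[$1] {V[$1]=$2; F[$1]=$0}
         END {for (k in V) if (C[k] > 1) print F[k]}' sample.txt | sort
```

This prints 1234_2_vfvfdbb and 55555_3_grgrbr, matching the expected output, and does not require the file to be sorted or grouped in any way.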
If the lines are grouped by "uniquenumber" with "version" in ascending order (as shown in your example), you can use:
awk -F_ '{ if ( ($1 != u) && (v > 1) ) { print l } u=$1; v=$2; l=$0 } END { if (v > 1) print l }' yourfile
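A quick sketch of the single-pass idea on the sample data, assuming the grouped/ascending layout holds (the filename grouped.txt is my own; the script remembers the last line of each group and prints it only when the group had more than one entry, with an END rule so the final group is not lost):

```shell
#!/bin/sh
# Sample input, grouped by uniquenumber with versions ascending.
cat > grouped.txt <<'EOF'
1234_1_xxxx
1234_2_vfvfdbb
343333_1_vfvfdvd
2222222_1_ggggg
55555_1_xxxxxx
55555_2_vrbgbgg
55555_3_grgrbr
EOF

# When the uniquenumber changes, emit the previous group's last line
# if that group had more than one member; END handles the final group.
awk -F_ '$1 != prev { if (n > 1) print last; n = 0 }
         { prev = $1; last = $0; n++ }
         END { if (n > 1) print last }' grouped.txt
```

Because it only ever keeps one line in memory, this variant scales to files of any size, and its output preserves the file order, unlike the hash-based version.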