Need a quick and dirty solution

Finja · June 28, 2012, 5:46pm

I have a list of multiple versions of software. The list is formatted as follows:

NAME VERSION

I simply need to pull out the highest version of each software, for example:

Original File

a v1.0
a v1.1
a v1.2
b v2.1
b v2.2
b v2.21
b v3.0

Output

a v1.2
b v3.0

Corona688 · June 28, 2012, 6:18pm

$ awk '(!A[$1]) || (A[$1] < $2) { A[$1]=$2 }; END { for (X in A) { print X, A[X] } }' <<EOF
a v1.0
a v1.1
a v1.2
b v2.1
b v2.2
b v2.21
b v3.0
EOF

a v1.2
b v3.0

$

alister · June 28, 2012, 7:04pm

Given that the sample version data contains components of varying widths, e.g. 2.2 and 2.21, I don't think this code is sufficient.

The expression (A[$1] < $2) is a string comparison, so v3.9 will be considered greater than v3.10. If the comparison were made using numeric substrings instead, 3.9 would still be greater than 3.10.
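
A minimal sketch of that point, assuming a POSIX awk; the function name compare_versions is made up for the illustration. It compares v3.9 and v3.10 as strings, as decimal numbers, and component by component; only the last picks v3.10.

```shell
# Hypothetical helper: compare v3.9 and v3.10 three different ways.
compare_versions() {
    awk 'BEGIN {
        a = "v3.9"; b = "v3.10"
        # String comparison: "9" collates after "1", so v3.9 wins.
        print "string:", (a > b ? a : b)
        # Numeric comparison of the text after "v": 3.9 > 3.10 (which is 3.1).
        x = substr(a, 2) + 0; y = substr(b, 2) + 0
        print "numeric:", (x > y ? a : b)
        # Component-wise: split on "." and compare each part as an integer.
        split(substr(a, 2), p, "."); split(substr(b, 2), q, ".")
        newer = (p[1] + 0 > q[1] + 0) || (p[1] + 0 == q[1] + 0 && p[2] + 0 > q[2] + 0)
        print "by component:", (newer ? a : b)
    }'
}
compare_versions
```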

Regards,
Alister

---------- Post updated at 07:04 PM ---------- Previous update was at 07:00 PM ----------

Quick and dirty, as per the topic:

tr . '\t' < file | sort -k1,1 -k2.2b,2nr -k3,3nr | awk 'p!=$1 {p=$1; print}' | tr '\t' .

Assumes there is only the one dot in the version numbers, that there are no tabs in the file, and that your field delimiter is a space. If the delimiter is a tab and there are no spaces, change \t to a space.

Regards,
Alister

drl · June 28, 2012, 9:51pm

Hi.

Using available (but non-standard) utility msort:

#!/usr/bin/env bash

# @(#) s1	Demonstrate msort hybrid sort.
# See: http://freecode.com/projects/msort

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C msort awk

FILE=${1-data1}
pl " Input data $FILE:"
cat $FILE

pl " Results:"
msort -q -l -n 2,2 -I -c hybrid $FILE |
tee f1 |
awk '!a[$1]++'

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
msort 8.44
awk GNU Awk 3.1.5

-----
 Input data data1:
d v4.0.9
d v4.0.10
a v1.0
a v1.1
a v1.2
b v2.1
b v2.2
b v2.21
b v3.0
c v3.10
c v3.9

-----
 Results:
d v4.0.10
c v3.10
b v3.0
a v1.2

The intermediate results are on file f1. The awk prints the first occurrence of an item, skipping the rest.
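
The first-occurrence idiom can be tried on its own. This is only a sketch, with the input already ordered newest-first by hand and a made-up function name (first_per_name):

```shell
# Hypothetical helper: keep only the first line seen for each name in column 1.
first_per_name() {
    # seen[$1]++ evaluates to 0 the first time a name appears, so the
    # negated expression is true exactly once per name and awk prints the line.
    awk '!seen[$1]++'
}
printf '%s\n' 'b v3.0' 'b v2.21' 'a v1.2' 'a v1.1' | first_per_name
```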

If msort is not in your repository, see link in script for binary or source to compile.

Best wishes ... cheers, drl

elixir_sinari · June 29, 2012, 2:29am

awk -F"[ v]" 'a[$1]+0<$3+0{a[$1]=$3} END{for(i in a) print i" v"a[i]}' inputfile

expert · June 29, 2012, 3:01am

try below

for var in `awk -F " " '{print $1}' test_file_data | uniq`
do
  grep $var test_file_data | sort | tail -1
done

hope this would work out as simple..

jayan_jay · June 29, 2012, 3:06am

$ sort -nr infile | nawk '!x[$1]++'
b v3.0
a v1.2
$

alister · June 29, 2012, 6:50am

The suggestions mentioned below are all incorrect for reasons stated in post #3.

Regarding elixir_sinari's awk one-liner: incorrect numeric comparison. Numerically, 1.10 is less than 1.9, but as a version number, 1.10 is greater than 1.9. To perform the comparison numerically and correctly, you'd have to transform the version number in some way. Perhaps something like a*1000 + b, where a is the first component of the version string and b is the second. Obviously, though, b can never be greater than or equal to the constant multiplier, or there would be some ambiguity.
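
A rough sketch of the a*1000 + b idea, assuming every version looks like vMAJOR.MINOR with MINOR below 1000. The function name highest_by_weight and the trailing sort (added only to get a stable output order) are illustrative, not part of any posted solution.

```shell
# Hypothetical helper: pick the highest version per name using a*1000 + b.
highest_by_weight() {
    awk '{
        # Drop the leading "v", then split MAJOR.MINOR on the dot.
        split(substr($2, 2), part, ".")
        weight = part[1] * 1000 + part[2]
        # Remember the whole line with the largest weight for each name.
        if (!($1 in best) || weight > best[$1]) { best[$1] = weight; line[$1] = $0 }
    }
    END { for (name in line) print line[name] }' "$@" | sort
}
printf '%s\n' 'a v1.0' 'a v1.2' 'b v2.21' 'b v3.0' 'c v3.10' 'c v3.9' | highest_by_weight
```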

Regarding expert's sort | tail -1 loop: a lexicographical sort is inappropriate for a version string. Comparing as strings, 1.9 is greater than 1.10 since 9 follows 1 in the collation sequence of every locale of which I'm aware. Further, lexicographical sort results can vary with locale.

Why specify the field separator in awk -F " " ? A single space is already the default value for FS. awk treats that specially. A single space ignores leading and trailing blank characters and splits on sequences of blanks (in the C/POSIX locale, spaces and tabs). If you intended to split on spaces only, you must use a regular expression bracket expression: awk -F '[ ]' .
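
The difference is easy to see by counting fields on a line with leading and doubled spaces. A small sketch; count_fields is a made-up helper name:

```shell
# Hypothetical helper: $1 is the value passed to awk -F; prints NF for a
# test line that has two leading spaces and two spaces between the fields.
count_fields() {
    printf '  a  v1.0\n' | awk -F "$1" '{ print NF }'
}
count_fields ' '     # single space: default splitting on runs of blanks
count_fields '[ ]'   # bracket expression: every single space is a separator
```

With the single space, awk should report 2 fields; with the bracket expression, each space separates, so the empty strings between spaces are counted too and it should report 5.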

grep $var is vulnerable to regular expression metacharacters in the first column of the data. This may or may not be a concern, depending on what the real data in that first column looks like. Regardless, fixed string matching, -F , is the correct approach. You do not want to treat the contents of that first column as a regular expression that can match strings that are not exactly identical to itself. You want to treat that column literally. Consider it a bonus that fixed string matching is also simpler and faster.
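
A small sketch of the problem, using a hypothetical package name that contains a dot. Note that -F still matches the string anywhere in the line, so the last command shows an exact first-field test in awk as an alternative.

```shell
# Hypothetical sample data: the name "a.b" contains a regex metacharacter.
sample() { printf '%s\n' 'a.b v1.0' 'axb v9.9'; }

sample | grep -c 'a.b'       # regex: "." matches any character, both lines match
sample | grep -c -F 'a.b'    # fixed string: only the literal name matches
sample | awk -v name='a.b' '$1 == name'   # exact comparison on the first field
```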

Regarding jayan_jay's sort -nr pipeline: that is performing a lexicographical sort. See the previous example's critique for why a lexicographical sort is inappropriate.

Why? sort -nr will look for a numeric string at the very beginning of the line. If it doesn't find one, it will behave as if it read a zero. Since the first field in the sample data is alphabetic, all lines will numerically compare as equal to zero and to each other, which then requires sort to break that tie by performing a lexicographical sort on the entire line. Long story short, -n may as well not have been specified; given the sample data (and any data set where the first non-blank sequence is not a valid numeric string), sort -nr and sort -r give identical results.
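
This is easy to check on a few lines of the sample data. A minimal sketch; LC_ALL=C is set only to keep the collation predictable:

```shell
# A few lines in the thread's NAME VERSION format.
sample() { printf '%s\n' 'a v1.0' 'a v1.2' 'b v2.21' 'b v3.0'; }

numeric=$(sample | LC_ALL=C sort -nr)
plain=$(sample | LC_ALL=C sort -r)

# Every line starts with a letter, so -n sees 0 everywhere and the
# whole-line tie-break decides the order in both cases.
if [ "$numeric" = "$plain" ]; then echo "identical"; else echo "different"; fi
```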

Regards,
Alister

elixir_sinari · June 29, 2012, 8:33am

Is there any problem in this solution, alister?

awk -F"[ v.]" '$3+0==vleft[$1]+0 && $4"">vright[$1]"" {vright[$1]=$4} $3+0>vleft[$1]+0 {vleft[$1]=$3;vright[$1]=$4} END {for(i in vleft) print i" v"vleft[i]"."vright[i]}' inputfile

ctsgnb · June 29, 2012, 8:35am

I hadn't noticed that tricky difference between -F " " and -F "[ ]".

Thanks for pointing this out!

alister · June 29, 2012, 10:01am

elixir_sinari:

Is there any problem in this solution, alister?

awk -F"[ v.]" '$3+0==vleft[$1]+0 && $4"">vright[$1]"" {vright[$1]=$4}
$3+0>vleft[$1]+0 {vleft[$1]=$3;vright[$1]=$4} END {for(i in vleft) print i" v"vleft[i]"."vright[i]}' inputfile

Hi, elixir:

Is it my imagination, or did you change +0 to "" in the expression $4"">vright[$1]""? (I had looked at it earlier and thought it looked good, but was sidetracked and did not post.)

That string comparison is incorrect. The +0 variant, forcing numeric comparison of $4, should work correctly.

Regards,
Alister

drl · June 29, 2012, 10:15am

Hi, elixir_sinari.

elixir_sinari:

Is there any problem in this solution, alister?

awk -F"[ v.]" '$3+0==vleft[$1]+0 && $4"">vright[$1]"" {vright[$1]=$4} $3+0>vleft[$1]+0 {vleft[$1]=$3;vright[$1]=$4} END {for(i in vleft) print i" v"vleft[i]"."vright[i]}' inputfile

Well, let us try it:

#!/usr/bin/env bash

# @(#) user2	Demonstrate proposed awk solution for "version" strings.

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk

FILE=${1-data1}
pl " Input data $FILE:"
cat $FILE

pl " Results:"
# awk -F"[ v]" 'a[$1]+0<$3+0{a[$1]=$3} END{for(i in a) print i" v"a[i]}' $FILE
awk -F"[ v.]" '$3+0==vleft[$1]+0 && $4"">vright[$1]"" {vright[$1]=$4} $3+0>vleft[$1]+0 {vleft[$1]=$3;vright[$1]=$4} END {for(i in vleft) print i" v"vleft[i]"."vright[i]}' $FILE

exit 0

producing:

% ./user2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5

-----
 Input data data1:
d v4.0.9
d v4.0.10
a v1.0
a v1.1
a v1.2
b v2.1
b v2.2
b v2.21
b v3.0
c v3.10
c v3.9

-----
 Results:
a v1.2
b v3.0
c v3.9
d v4.0

That does not look correct to me. Did I paste your code faithfully? Are your results different?

Best wishes ... cheers, drl

---------- Post updated at 09:15 ---------- Previous update was at 09:08 ----------

Hi.

With alister's correction:

#!/usr/bin/env bash

# @(#) user2	Demonstrate proposed awk solution for "version" strings.

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk

FILE=${1-data1}
pl " Input data $FILE:"
cat $FILE

pl " Results:"
# awk -F"[ v]" 'a[$1]+0<$3+0{a[$1]=$3} END{for(i in a) print i" v"a[i]}' $FILE
# awk -F"[ v.]" '$3+0==vleft[$1]+0 && $4"">vright[$1]"" {vright[$1]=$4} $3+0>vleft[$1]+0 {vleft[$1]=$3;vright[$1]=$4} END {for(i in vleft) print i" v"vleft[i]"."vright[i]}' $FILE
awk -F"[ v.]" '$3+0==vleft[$1]+0 && $4+0>vright[$1]+0 {vright[$1]=$4} $3+0>vleft[$1]+0 {vleft[$1]=$3;vright[$1]=$4} END {for(i in vleft) print i" v"vleft[i]"."vright[i]}' $FILE

exit 0

producing:

% ./user2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
awk GNU Awk 3.1.5

-----
 Input data data1:
d v4.0.9
d v4.0.10
a v1.0
a v1.1
a v1.2
b v2.1
b v2.2
b v2.21
b v3.0
c v3.10
c v3.9

-----
 Results:
a v1.2
b v3.0
c v3.10
d v4.0

Seems OK for 2-part version strings ... cheers, drl
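
For version strings with more than two parts (such as the d v4.0.9 and d v4.0.10 lines in data1), one possible approach is to compare the dot-separated components one at a time. The following is only a sketch under the assumption that each version is a single "v" followed by dot-separated integers; the names highest_version and newer are made up for the example, and the trailing sort is there only to make the output order predictable.

```shell
# Hypothetical helper: highest version per name, any number of components.
highest_version() {
    awk '
    # Return 1 if version x is newer than y. Components are compared as
    # integers from left to right; a missing component counts as 0.
    function newer(x, y,    n, m, i, p, q) {
        n = split(substr(x, 2), p, ".")
        m = split(substr(y, 2), q, ".")
        for (i = 1; i <= n || i <= m; i++)
            if (p[i] + 0 != q[i] + 0)
                return p[i] + 0 > q[i] + 0
        return 0
    }
    # Keep the first version seen for a name, or any newer one after that.
    !($1 in best) || newer($2, best[$1]) { best[$1] = $2 }
    END { for (name in best) print name, best[name] }
    ' "$@" | sort
}
printf '%s\n' 'd v4.0.9' 'd v4.0.10' 'a v1.2' 'b v2.21' 'b v3.0' 'c v3.10' 'c v3.9' | highest_version
```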

ctsgnb · June 29, 2012, 10:41am

If your infile is already sorted, maybe you can give a try with:

tail -r infile | nawk '!x[$1]++' | sort

or

tac infile | nawk '!x[$1]++' | sort

elixir_sinari · June 29, 2012, 11:55am

Yes, I had used numeric comparison earlier, but then happened to cook up some pretty unusual (and unrealistic) input data, which made me change it to a string comparison...

The numeric comparisons should work in this case as you said...