how to delete duplicate rows based on last column

Hi, I have a huge amount of data stored in a file. I need to remove duplicate rows where only the last column differs: among each set of duplicates I must find the greatest value in the last column and print that row, keeping just one of the duplicate entries. For example, given a file which looks like this:
1902 8 22 3 40.0000 77.0000 8.60
1902 8 22 3 40.0000 76.5000 8.20
1902 8 22 3 40.0000 76.5000 8.30
1902 8 22 3 40.0000 77.0000 8.40
1902 8 22 3 39.8000 76.2000 8.10
1902 9 30 6 38.5000 67.0000 7.70
1902 9 30 6 38.5000 67.0000 6.30
1902 10 6 9 36.5000 70.5000 7.20
1902 12 4 22 37.8000 65.5000 4.90

Now I want the output for such a file to be as below:
1902 8 22 3 40.0000 77.0000 8.60
1902 8 22 3 40.0000 76.5000 8.30
1902 8 22 3 39.8000 76.2000 8.10
1902 9 30 6 38.5000 67.0000 7.70
1902 10 6 9 36.5000 70.5000 7.20
1902 12 4 22 37.8000 65.5000 4.90
------

Why does the output have

1902  8 22  3  40.0000  76.5000 8.20

instead of

1902  8 22  3  40.0000  76.5000 8.30

given your condition (last column value > previous value for the same data)?

Oh sorry, my mistake; I will correct my output in a minute...

---------- Post updated at 04:18 AM ---------- Previous update was at 04:16 AM ----------

OK, now tell me how...

something like this :

awk '{ va=$NF;$NF="";if ($0 in a) { if (va > a[$0]) {a[$0]=va} } else {a[$0]=va} } END { for ( i in a ) print i" "a[i] }'  file_name.txt

need to check further as the order of the elements in associative array is not the same.
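For anyone who wants to try this at home, here is a self-contained run of that approach against the sample data (note the `a[i]` lookup in the END block; printing a bare `a` is an error in most awks). The file names `sample.txt` and `out.txt` are just placeholders:

```shell
# Recreate the sample file from the thread (sample.txt is a placeholder name)
cat > sample.txt <<'EOF'
1902 8 22 3 40.0000 77.0000 8.60
1902 8 22 3 40.0000 76.5000 8.20
1902 8 22 3 40.0000 76.5000 8.30
1902 8 22 3 40.0000 77.0000 8.40
1902 8 22 3 39.8000 76.2000 8.10
1902 9 30 6 38.5000 67.0000 7.70
1902 9 30 6 38.5000 67.0000 6.30
1902 10 6 9 36.5000 70.5000 7.20
1902 12 4 22 37.8000 65.5000 4.90
EOF

# Key on everything but the last column; keep the numerically largest last column.
awk '{ va = $NF; $NF = ""
       if ($0 in a) { if (va + 0 > a[$0] + 0) a[$0] = va }
       else a[$0] = va }
     END { for (i in a) print i a[i] }' sample.txt > out.txt

cat out.txt
```

As noted above, `for (i in a)` does not preserve input order, so the result may need a final sort.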

Thanks a lot, it's working, but the first few lines have been deleted from my file...

One more thing: for the same data, if I need the output as

1902 8 22 3 40.0000 77.0000 8.60
1902 9 30 6 38.5000 67.0000 7.70
1902 10 6 9 36.5000 70.5000 7.20
1902 12 4 22 37.8000 65.5000 4.90

that is, just check whether the first 4 columns are equal, and for the other columns take the largest values, as shown above.

Something like this :

awk '{ va2=$NF;va1=$(NF-1);va=$(NF-2);$NF="";$(NF-1)="";$(NF-2)="";if ($0 in a) { if (va" "va1" "va2 > a[$0]) {a[$0]=va" "va1" "va2} } else {a[$0]=va" "va1" "va2} } END { for ( i in a ) print i" "a[i] }'  file_name.txt

As i said already :

need to check further as the order of the elements in associative array is not the same.

Another way...

For the 1st one...

 
sort -n +6 infile | awk '{t[$1" "$2" "$3" "$4" "$5" "$6]=$7}END{for (i in t){print i,t[i]}}'

For the 2nd one...

 
sort -n +4 infile | awk '{t[$1" "$2" "$3" "$4]=$5" "$6" "$7}END{for (i in t){print i,t[i]}}'
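The trick in these pipelines is that an ascending numeric sort on the magnitude column means the last assignment into the array for each key wins, so no explicit comparison is needed inside awk. A sketch of the first variant, with the `t[i]` lookup in the print (`infile` and `out2.txt` are placeholder names; `-k7,7n` is the modern spelling of the historical `+6`):

```shell
cat > infile <<'EOF'
1902 8 22 3 40.0000 76.5000 8.20
1902 8 22 3 40.0000 76.5000 8.30
1902 9 30 6 38.5000 67.0000 7.70
1902 9 30 6 38.5000 67.0000 6.30
EOF

# Sort ascending on column 7, then let later (larger) values overwrite earlier ones.
sort -k7,7n infile |
awk '{ t[$1" "$2" "$3" "$4" "$5" "$6] = $7 }
     END { for (i in t) print i, t[i] }' > out2.txt

cat out2.txt
```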

It's not exactly working.
To clarify: my data has different values in the first column, not all the same as I had shown in the question, and the data in my file looks somewhat like this:

1900  2  7  0   9.5000  76.5000 4.30
1900  2  7  0   9.5000  76.5000 6.00
1901  2 15  0  26.0000 100.0000 6.00
1901  4 27  0  12.0000  75.0000 5.00
1901  4 17 21  40.0000  71.0000 5.90
1902  4 17 21  40.0000  71.0000 5.90
1902  8 12 17  39.5000  68.5000 6.20
1902  8 22  3  40.0000  77.0000 8.60
1902  8 22  3  40.0000  76.5000 8.20
1902  8 22  3  40.0000  76.5000 8.30
1902  8 22  3  40.0000  77.0000 8.20
1903  8 30 21  37.0000  71.0000 7.70
1904  9 20  6  38.5000  67.0000 6.30

The output which I need is exactly like this:

1900  2  7  0   9.5000  76.5000 6.00
1901  2 15  0  26.0000 100.0000 6.00
1901  4 27  0  12.0000  75.0000 5.00
1901  4 17 21  40.0000  71.0000 5.90
1902  8 12 17  39.5000  68.5000 6.20
1902  8 22  3  40.0000  77.0000 8.60
1902  8 22  3  40.0000  76.5000 8.30
1903  8 30 21  37.0000  71.0000 7.70
1904  9 20  6  38.5000  67.0000 6.30

To keep the forums high quality for all users, please take the time to format your posts correctly.

First of all, use Code Tags when you post any code or data samples so others can easily read your code. You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags [code] and [/code] by hand.)

Second, avoid adding color or different fonts and font size to your posts. Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.

Third, be careful when you cut-and-paste: edit out any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums

Reva,

it's working properly for me; of course, with a sort you can put the sequence in order.

something like this :

awk '{ va2=$NF;va1=$(NF-1);va=$(NF-2);$NF="";$(NF-1)="";$(NF-2)="";if ($0 in a) { if (va" "va1" "va2 > a[$0]) {a[$0]=va" "va1" "va2} } else {a[$0]=va" "va1" "va2} } END { for ( i in a ) print i" "a[i] }'  file_name.txt | sort +1n

Ya, I will follow that from the next post...

---------- Post updated 08-26-09 at 04:34 AM ---------- Previous update was 08-25-09 at 08:49 AM ----------

Thanks for the help i got it...

---------- Post updated at 04:45 AM ---------- Previous update was at 04:34 AM ----------

Hi,
Now if I have data like that shown below, how do I sort it out? I mean, delete duplicate entries in such a way that it takes the largest value in the last column and chooses the row that has the most sets of values in it.
For example the data in my file is

 
1900  2  7  0   9.5000  76.5000 0.00 4.30 0.00 0.00 0.00 4.30
1900  2  7  0  10.8000  76.8000 0.00 6.00 0.00 0.00 0.00 6.00
1901 12  1  0  37.8000  66.0000 0.00 5.00 0.00 0.00 0.00 5.00
1901 12  1  0  37.8000  66.0000 0.00 4.60 3.00 3.50 3.50 4.60
1902  4 17 21  40.0000  71.0000 0.00 5.80 0.00 5.90 5.70 5.90
1902  8 12 17  39.5000  68.5000 0.00 6.00 0.00 6.20 5.90 6.20
1902  8 22  3  40.0000  77.0000 0.00 0.00 0.00 8.00 8.60 8.60
1902  8 22  3  40.0000  76.5000 0.00 0.00 0.00 0.00 8.20 8.20
1902  8 22  3  40.0000  76.5000 0.00 0.00 0.00 0.00 8.30 8.30
1903  5 16  6   5.3600  80.0000 0.00 4.50 0.00 5.00 0.00 5.00
1903  5 16  6   5.3600  80.0000 0.00 4.30 0.00 3.00 0.00 4.30

The output for it is

1900  2  7  0  10.8000  76.8000 0.00 6.00 0.00 0.00 0.00 6.00
1901 12  1  0  37.8000  66.0000 0.00 4.60 3.00 0.00 3.50 4.60
1902  4 17 21  40.0000  71.0000 0.00 5.80 0.00 5.90 5.70 5.90
1902  8 12 17  39.5000  68.5000 0.00 6.00 0.00 6.20 5.90 6.20
1902  8 22  3  40.0000  77.0000 0.00 0.00 0.00 8.00 8.60 8.60
1903  5 16  6   5.3600  80.0000 0.00 4.50 0.00 5.00 0.00 5.00

Here it removes duplicates, choosing the longest row (the one with the most values) and the largest value in the last column.
If anyone has an idea, help me out.
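One hedged way to attack this, assuming "many sets of values" means the count of non-zero magnitude columns (columns 7 onward): keep, per date/hour key, the row with the most non-zero fields, breaking ties on the last column. The file names and that interpretation are my assumptions, not from the thread:

```shell
cat > quakes.txt <<'EOF'
1900  2  7  0   9.5000  76.5000 0.00 4.30 0.00 0.00 0.00 4.30
1900  2  7  0  10.8000  76.8000 0.00 6.00 0.00 0.00 0.00 6.00
1901 12  1  0  37.8000  66.0000 0.00 5.00 0.00 0.00 0.00 5.00
1901 12  1  0  37.8000  66.0000 0.00 4.60 3.00 3.50 3.50 4.60
EOF

awk '{
  k = $1" "$2" "$3" "$4                            # duplicate key: date and hour
  n = 0
  for (f = 7; f <= NF; f++) if ($f + 0 != 0) n++   # count non-zero magnitude fields
  # prefer the row with more non-zero fields; break ties on the largest last column
  if (!(k in best) || n > cnt[k] || (n == cnt[k] && $NF + 0 > last[k])) {
    best[k] = $0; cnt[k] = n; last[k] = $NF + 0
  }
}
END { for (i in best) print best[i] }' quakes.txt > out3.txt

cat out3.txt
```

On this sample it keeps the 6.00 row for 1900-2-7 (tie on count, larger last column) and the 4.60 row for 1901-12-1 (more non-zero fields), matching the output asked for above.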

Where did you get:

1901 12  1  0  37.8000  66.2000 0.00 4.60 3.00 0.00 3.50 4.60

in the output you mentioned?

I hope that with the code we have given, you can experiment a bit further to achieve your task.

Ya, I have corrected my output; just check it once now...

If I have 19 columns and I need to check duplicates only on columns 1, 2, 3 and 4, taking the largest value of column 18, then how do I use awk? Help me out, and please explain the code as well; I am very new to Unix, to tell the truth.
Thanks in advance

Post a sample input and the expected output, at least a few lines to test.

The sample input is

 SIG  2007  3 24  4 35 45.80   5.2600  94.3100  58   0 5.20   0 0.00 5.00 0.00   0 0.00 5.20   0
 SSS  2007  3 24  9  3 37.40  36.5600  71.4800 152   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0
 SIG  2008  3 25 18 29 33.15   1.7700  99.3400 163   0 4.60   0 0.00 0.00 0.00   0 0.00 4.60   0
 SEG  2008  3 25 18 27 35.06   1.7700  99.3400  89   0 5.00   0 0.00 0.00 0.00   0 0.00 5.00   0
PDE-Q 2009  7  2 22 36 45.17  37.4800  71.7400  20   0 4.60   0 0.00 0.00 0.00   0 0.00 4.60   0 
PDE-Q 2009  7  2 23 50 49.20  37.4800  71.7400 108   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0 
PDE-Q 2009  7  3  4 42 32.83  34.4600  24.1200  41   0 4.50   0 0.00 0.00 0.00   0 0.00 4.50   0 
PDE-Q 2009  7  5  9 45 48.77  36.4600  71.0700 248   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  5 12 25 37.44   1.3300  99.7800 185   0 4.50   0 0.00 0.00 0.00   0 0.00 4.60   0
PDE-Q 2009  7  5 12 25 37.44   1.3300  99.7800 185   0 4.50   0 0.00 0.00 0.00   0 0.00 4.50   0
PDE-Q 2009  7  6 16  0 38.96   3.0400  93.3500  34   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  7  0 32 47.11  34.1600  25.5100  13   0 0.00   0 0.00 0.00 0.00   0 0.00 0.00   0
PDE-Q 2009  7  7  1  2  0.48  34.1600  25.5100  25   0 4.80   0 0.00 0.00 0.00   0 3.00 4.80   0

The sample output is

 SIG  2007  3 24  4 35 45.80   5.2600  94.3100  58   0 5.20   0 0.00 5.00 0.00   0 0.00 5.20   0
 SEG  2008  3 25 18 27 35.06   1.7700  99.3400  89   0 5.00   0 0.00 0.00 0.00   0 0.00 5.00   0
PDE-Q 2009  7  2 23 50 49.20  37.4800  71.7400 108   0 4.70   0 0.00 0.00 0.00   0 0.00 4.70   0 
PDE-Q 2009  7  3  4 42 32.83  34.4600  24.1200  41   0 4.50   0 0.00 0.00 0.00   0 0.00 4.50   0 
PDE-Q 2009  7  5  9 45 48.77  36.4600  71.0700 248   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  5 12 25 37.44   1.3300  99.7800 185   0 4.50   0 0.00 0.00 0.00   0 0.00 4.60   0
PDE-Q 2009  7  6 16  0 38.96   3.0400  93.3500  34   0 4.90   0 0.00 0.00 0.00   0 0.00 4.90   0
PDE-Q 2009  7  7  1  2  0.48  34.1600  25.5100  25   0 4.80   0 0.00 0.00 0.00   0 3.00 4.80   0

something like this :

awk '{ k=$1" "$2" "$3" "$4; if (k in a) { if (max[k] < $(NF-1)+0) {a[k]=$0; max[k]=$(NF-1)+0} } else { a[k]=$0; max[k]=$(NF-1)+0 } } END { for ( i in a ) print a[i] }' file_name.txt | sort +1n
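To see why the running maximum needs to be per key (a single scalar `va` would leak across interleaved keys), here is a runnable sketch that keeps it in its own array `max[k]`. The toy file `ev.txt` and its simplified column layout are made up for illustration; in the real data the magnitude is still `$(NF-1)`:

```shell
cat > ev.txt <<'EOF'
A 2009 7 2 22 36 4.60 0
A 2009 7 2 23 50 4.70 0
B 2009 7 3 4 42 4.50 0
EOF

# Key on columns 1-4; keep the row whose next-to-last column is largest.
awk '{ k = $1" "$2" "$3" "$4
       if (!(k in a) || $(NF-1) + 0 > max[k]) { a[k] = $0; max[k] = $(NF-1) + 0 } }
     END { for (i in a) print a[i] }' ev.txt > out4.txt

cat out4.txt
```

The `!(k in a) || ...` form folds the first-seen and larger-value cases into one condition, which is a little easier to read than the nested ifs above.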