Delete duplicate data and retain the latest month data.

vee_789 · April 1, 2011, 1:18am

Hi, I have a file with the following records.
It contains three months of data, and some records are duplicated; I need to keep the latest record from each set of duplicates.
For example, I have the following data:

"200","0","","11722","-63","","","","11722","JUL","09"
"200","0","","11722","-63","","","","11722","JUL","09"
"200","0","","11722","-63","","","","11722","JUL","09"
"200","0","","11722","-63","","","","11722","JUN","09"
"200","0","","11722","-63","","","","11722","JUN","09"

As can be seen, the records are identical except for the month. I want to delete the duplicate records and keep the record with the latest month value.
For example, the 3rd and 5th records are the same in terms of data, but I need the latest one to remain in the file, which in this case is JUL 09.
The problem is that if I sort the data I get the JUN 09 record, because JUN comes before JUL alphabetically, whereas I need the JUL 09 record. If I sort in descending order, the same problem occurs for other months. The uniq command is not giving me the right output either.
My idea was to convert the month name to a month number, concatenate it with the year column, then sort and delete the duplicate lines, but it is not working properly.
Could you please suggest a shell script for this scenario?
I have data up to 2011.
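
The approach described above (turn the month into a number, combine it with the year, sort, then drop duplicates) can be sketched roughly as follows. This is only a sketch under some assumptions that the post does not confirm: the function name keep_latest is made up for illustration, fields contain no embedded commas, month names are three-letter English abbreviations, all two-digit years fall in the same century, and a "duplicate" means the first nine fields match. The key puts the year before the month so that numeric ordering follows calendar ordering.

```shell
#!/bin/sh
# keep_latest (hypothetical name): reads the quoted CSV on stdin.
# Assumes no embedded commas, English month abbreviations in field 10,
# and two-digit years in field 11 that all belong to the same century.
keep_latest() {
    tab=$(printf '\t')
    awk -F, '
    BEGIN {
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        for (i = 1; i <= n; i++) num[name[i]] = i
    }
    {
        m = $10; y = $11
        gsub(/"/, "", m); gsub(/"/, "", y)
        # Decorate: prefix a sortable YYMM key, e.g. 0907 for JUL 09
        printf "%02d%02d\t%s\n", y, num[toupper(m)], $0
    }' |
    sort -t "$tab" -k1,1nr |
    awk -F'\t' '
    {
        key = $2
        sub(/,"[^"]*","[^"]*"$/, "", key)   # fields 1-9 identify a record
        if (!seen[key]++) print $2          # first hit is the latest month
    }'
}
```

It would be called as keep_latest < inputfile. Note that the output comes out ordered by date, newest first, not in the original file order.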

michaelrozar17 · April 1, 2011, 2:32am

Does the command below help? If not, please provide a few more sample records.

awk -F, 'NR==FNR{a[$10","$11]=$1","$2","$3","$4","$5","$6","$7","$8","$9",";next}END{for(i in a)print a[i] i}' inputfile inputfile

cgkmal · April 1, 2011, 4:46am

Hi vee_789,

With sort:

sort -uk1.52,1.53nr -k1.46,1.49Mr Inputfile | sort -uk1.52,1.53nr
"200","0","","11722","-63","","","","11722","OCT","11"
"200","0","","11722","-63","","","","11722","DEC","10"
"200","0","","11722","-63","","","","11722","JUL","09"

Hope it helps,

Regards

vee_789 · April 1, 2011, 4:50am

Hi, thanks, but your code is not giving me the desired output.
What I want is this.
Consider, for example, these records:

"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-264","","","","13011","JUN","09"
"200","0","","13011","-263","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","AUG","09"

This should give me the following output:

"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","AUG","09"

From this it can be seen that the output contains only the latest record of each group.
That is, the duplicates are deleted, and among the duplicate records the one with the latest month value is kept.

kato · April 1, 2011, 5:10am

a bit clumsy but...

sort -r file | awk -F, '{if(date==$10$11){next}else{date=$10$11;print}}' OFS=,

Or did you not want the JUN record?

vee_789 · April 1, 2011, 5:18am

No, I don't need the June record: it is a duplicate of the July record, and since July is the later month of the two, I need the July record. I understand it's a bit confusing.

pravin27 · April 1, 2011, 5:48am

Try this,

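 awk -F"," 'BEGIN{d["JAN"] = 1
        d["FEB"] = 2
        d["MAR"] = 3
        d["APR"] = 4
        d["MAY"] = 5
        d["JUN"] = 6
        d["JUL"] = 7
        d["AUG"] = 8
        d["SEP"] = 9
        d["OCT"] = 10
        d["NOV"] = 11
        d["DEC"] = 12}
{
        y=$0                                # keep the untouched record
        gsub(/"/,"",$10);gsub(/"/,"",$11)   # strip the quotes from month and year
        k=$11*100+d[$10]                    # numeric key, year first: JUL 09 gives 907
        if(NR>1&&b==$5){if(k>c){a=y;c=k}}   # same group (field 5): keep the later record
        else{if(NR>1)print a;a=y;b=$5;c=k}  # new group: print the previous winner
}
END{if(NR)print a}' OFS="," inputfile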
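
For anyone following the logic, here is a rough single-pass sketch of the same idea that does not depend on the duplicates sitting next to each other in the file. It is not a definitive solution, and it rests on assumptions not confirmed in the thread: the function name latest_per_record is made up, fields contain no embedded commas, the first nine fields together identify a record, month names are English abbreviations, and all two-digit years belong to the same century.

```shell
#!/bin/sh
# latest_per_record (hypothetical name): reads the quoted CSV on stdin
# and prints, for each distinct set of fields 1-9, the record with the
# latest year and month. No pre-sorting is required.
latest_per_record() {
    awk -F, '
    BEGIN {
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        for (i = 1; i <= n; i++) num[name[i]] = i
    }
    {
        m = $10; y = $11
        gsub(/"/, "", m); gsub(/"/, "", y)
        stamp = y * 100 + num[toupper(m)]    # year first, so 908 beats 907

        key = $0
        sub(/,"[^"]*","[^"]*"$/, "", key)    # drop the month and year fields

        if (!(key in best)) order[++count] = key   # remember first-seen order
        if (!(key in best) || stamp > best[key]) {
            best[key] = stamp                # latest date seen for this key
            line[key] = $0                   # and the record that carries it
        }
    }
    END { for (i = 1; i <= count; i++) print line[order[i]] }'
}
```

Called as latest_per_record < inputfile, it prints one line per distinct record, in the order each record first appears in the input.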

vee_789 · April 1, 2011, 6:06am

Hi Pravin27,
Could you please explain the logic?

cgkmal · April 1, 2011, 6:21am

Hi vee_789,

With sort, based on the sample:

sort -t "$(/bin/echo ",")" -uk10,10 -k10Mr inputfile | sort -t "$(/bin/echo ",")" -uk5,5r
"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","AGO","09"

Regards

vee_789 · April 1, 2011, 7:13am

Hi cgkmal, the command isn't working.

cgkmal · April 1, 2011, 4:56pm

Hi vee_789,

What is the inputfile you are using?

With your last sample:

"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-264","","","","13011","JUN","09"
"200","0","","13011","-263","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","AUG","09"

and using the code I posted before:

sort -t "$(/bin/echo ",")" -uk10,10 -k10Mr inputfile | sort -t "$(/bin/echo ",")" -uk5,5r

I get the output you said:

"200","0","","13011","-264","","","","13011","JUL","09"
"200","0","","13011","-263","","","","13011","AUG","09"

A more representative sample would be helpful to work with.

Regards