Find duplicate based on 'n' fields and mark the duplicate as 'D'

machomaddy · January 28, 2012, 2:07am

Hi,

In a file, I have to mark duplicate records as 'D' and the latest record alone as 'C'.

In the below file, I have to identify if duplicate records are there or not based on Man_ID, Man_DT, Ship_ID and I have to mark the record with latest Ship_DT as "C" and other as "D" (I have to create a new field at the end of the records as "C" or "D")

File 1
====
Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2009-11-31|111|2|Jackets
001|2010-12-31|11|2011-12-31|111|2|Jackets
003|2010-11-01|13|2011-12-31|111|1|Shoes

Expected Output

File 1
=====
Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
 001|2010-12-31|11|2010-12-31|111|2|Jackets|D
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2009-11-31|111|2|Jackets|D
 001|2010-12-31|11|2011-12-31|111|2|Jackets|C
003|2010-11-01|13|2011-12-31|111|1|Shoes

parthmittal2007 · January 28, 2012, 4:51am

hi
you have provided the wrong input.
The output is not according to the input

machomaddy · January 28, 2012, 5:34am

Thanks parth!!

rdcwayx · January 28, 2012, 6:34am

awk '{s=$1 FS $2 FS $3} 
     NR==FNR{a++;b=FNR;next}
     FNR==1{print;next} 
     {if (a<2)
           {print}
      else
           {print (b==FNR)?$0 "|C":$0 "|D"}}' FS=\| OFS=\| infile infile

machomaddy · January 28, 2012, 8:32am

Thanks, rdcwayx!! Could please explain the code? It will be very helpful

parthmittal2007 · January 28, 2012, 11:41am

hi rdcwayx:

your code does not work in this case:

Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2012-11-31|111|2|Jackets
001|2010-12-31|11|2011-12-31|111|2|Jackets
003|2010-11-01|13|2011-12-31|111|1|Shoes

---------- Post updated at 11:38 AM ---------- Previous update was at 11:36 AM ----------

According to this data the output should be-:
Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets|D
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2012-11-31|111|2|Jackets|C
001|2010-12-31|11|2011-12-31|111|2|Jackets|D
003|2010-11-01|13|2011-12-31|111|1|Shoes

---------- Post updated at 11:41 AM ---------- Previous update was at 11:38 AM ----------

and from your code the output is coming:

Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets|D
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2012-11-31|111|2|Jackets|D
001|2010-12-31|11|2011-12-31|111|2|Jackets|C
003|2010-11-01|13|2011-12-31|111|1|Shoes

machomaddy · January 28, 2012, 12:25pm

oh!!! am sorry...I missed it...My bad. Yes the code works as what parthmittal says!

rdcwayx · January 28, 2012, 5:26pm

parthmittal2007:

hi rdcwayx:

your code does not work in this case:

Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2012-11-31|111|2|Jackets
001|2010-12-31|11|2011-12-31|111|2|Jackets
003|2010-11-01|13|2011-12-31|111|1|Shoes

---------- Post updated at 11:38 AM ---------- Previous update was at 11:36 AM ----------

According to this data the output should be-:
Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets|D
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2012-11-31|111|2|Jackets|C
001|2010-12-31|11|2011-12-31|111|2|Jackets|D
003|2010-11-01|13|2011-12-31|111|1|Shoes

---------- Post updated at 11:41 AM ---------- Previous update was at 11:38 AM ----------

and from your code the output is coming:

Man_ID|Man_Dt|Ship_Id|Ship_Dt|ItemID|Noof ITEMS|ItemNam
001|2010-12-31|11|2010-12-31|111|2|Jackets|D
002|2010-12-31|12|2010-12-31|111|1|Caps
001|2010-12-31|11|2012-11-31|111|2|Jackets|D
001|2010-12-31|11|2011-12-31|111|2|Jackets|C
003|2010-11-01|13|2011-12-31|111|1|Shoes

Update the code to check the latest date.

awk '{s=$1 FS $2 FS $3}
     NR==FNR{a++;d=$4;gsub(/-/,"",d);max=(max>d)?max:d;if (max==d) b=FNR;next}
     FNR==1{print;next}
     {if (a<2)
           {print}
      else
           {print (b==FNR)?$0 "|C":$0 "|D"}}' FS=\| OFS=\|  infile infile