Read Two lines in a CSV File and Compare

Sheel · May 2, 2010, 12:12pm

Hi,
I have a CSV file (file.csv) with some data as
below:
A,1,abc,x,y,z,,xyz,20100101,99991231
A,1,abc,x,y,z,234,xyz,20100101,99991231
I have to delete the duplicate line based on the
unique identifiers, which are the values in
fields 2, 3, 4 and 8. These columns have the same
value in both rows. I have to delete the one that
does not have any value in the 7th column.
Please help.

Thanks,
Sheel

malcomex999 · May 3, 2010, 3:19am

Try this, if I understand you correctly:

awk -F, '$7 && !dup[$2$3$4$8]++' infile

aigles · May 3, 2010, 3:44am

Well done!

I suggest a small modification for a more robust result:

awk -F, '$7 && !dup[$2,$3,$4,$8]++' infile

Jean-Pierre.

malcomex999 · May 3, 2010, 3:48am

Just curious: what difference will that make?

aigles · May 3, 2010, 4:12am

An example:

A,1,abc,x,y,z,,xyz,20100101,99991231
A,1,ab,cx,y,z,234,xyz,20100101,99991231

With your solution, both lines have the same key: 1abcxxyz
Using comma-separated subscripts gives different keys for the two lines: 1,abc,x,xyz and 1,ab,cx,xyz (where , stands for SUBSEP).

Jean-Pierre.
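
The key collision described above can be reproduced with a small sketch. The variable names (sample, concat_keys, subsep_keys) and the seen array are made up for illustration; the two data lines are the ones from the example.

```shell
# Two lines whose fields 2-4 differ but concatenate to the same string.
sample='A,1,abc,x,y,z,,xyz,20100101,99991231
A,1,ab,cx,y,z,234,xyz,20100101,99991231'

# Plain concatenation: both lines collapse into the single key "1abcxxyz".
concat_keys=$(printf '%s\n' "$sample" |
    awk -F, '{ seen[$2 $3 $4 $8] = 1 } END { n = 0; for (k in seen) n++; print n }')

# Comma subscripts: awk joins the parts with SUBSEP, so the two keys stay distinct.
subsep_keys=$(printf '%s\n' "$sample" |
    awk -F, '{ seen[$2,$3,$4,$8] = 1 } END { n = 0; for (k in seen) n++; print n }')

echo "distinct keys with concatenation: $concat_keys"   # 1
echo "distinct keys with SUBSEP: $subsep_keys"          # 2
```

SUBSEP defaults to a non-printing character, so it is unlikely to appear in ordinary CSV data.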

ygemici · May 4, 2010, 4:45am

[root@sistem1lnx ~]# cat file
A,1,abc,x,y,z,,xyz,20100101,99991231
A,1,abc,x,y,z,1,xyz,20100101,99991231
A,1,abc,x,y,z,234,xyz,20100101,99991231
A,1,abc,x,y,z,a,xyz,20100101,99991231
A,1,abc,x,y,z,,xyz,20100101,99991231

[root@sistem1lnx ~]# ./allx
RESULTS
---------
A 1 abc x y z 1 xyz 20100101 99991231
A 1 abc x y z 234 xyz 20100101 99991231
A 1 abc x y z a xyz 20100101 99991231

 
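#!/bin/bash
# For every line of "file", scan the whole file again looking for lines
# that share fields 2, 3, 4 and 8 with it.
oifs=$IFS
var=0      # number of same-key lines whose 7th field is empty
COUNT=2    # makes the header below print exactly once

while IFS=, read -r one two three four five six seven eight nine ten
do
    while IFS=, read -r onee twoo threee fourr fivee sixx sevenn eightt ninee tenn
    do
        if [ "${two}" == "${twoo}" ] && [ "${three}" == "${threee}" ] &&
           [ "${four}" == "${fourr}" ] && [ "${eight}" == "${eightt}" ] ; then
            if [ "${one}" == "${onee}" ] && [ "${five}" == "${fivee}" ] && [ "${six}" == "${sixx}" ] &&
               [ "${seven}" == "${sevenn}" ] && [ "${nine}" == "${ninee}" ] && [ "${ten}" == "${tenn}" ] ; then
                itself=ok        # the inner line is identical to the outer line
            else
                itself=notok     # same key, but some other field differs
                if [ "${sevenn}" == "" ] ; then
                    ((++var))
                fi
            fi
        fi
    done < file

    # Print the header once, before the first result.
    while [ $(( COUNT -= 1 )) -gt 0 ]
    do
        echo "RESULTS"
        echo "-----------------------------------"
    done

    # Print the outer line when a same-key line with an empty 7th field was
    # found and the last same-key line compared was not the line itself.
    if [ $var -gt 0 ] && [ "$itself" == "notok" ]; then
        echo ${one} ${two} ${three} ${four} ${five} ${six} ${seven} ${eight} ${nine} ${ten}
    fi

    var=0
done < file
IFS=$oifs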
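
For comparison with the nested loop, here is a rough two-pass awk sketch. It assumes the goal is to drop a line with an empty 7th field only when another line with the same key (fields 2, 3, 4, 8) has a value there, and to keep everything else, including empty-7th lines that have no such sibling. That is a reading of the original request, not a drop-in replacement for the script above, whose output differs in some cases. The helper name drop_empty_dupes and the array name filled are hypothetical.

```shell
# drop_empty_dupes is a hypothetical helper; it takes the CSV path as $1.
# The file is read twice, so it must be a regular file, not a pipe.
drop_empty_dupes() {
    awk -F, '
        # Pass 1: remember every key that has a non-empty 7th field somewhere.
        NR == FNR { if ($7 != "") filled[$2,$3,$4,$8] = 1; next }
        # Pass 2: keep a line if its 7th field is set, or if no sibling has one.
        $7 != "" || !(($2,$3,$4,$8) in filled)
    ' "$1" "$1"
}

# Usage with a temporary copy of the sample data plus one unrelated line.
tmp=$(mktemp)
printf '%s\n' \
    'A,1,abc,x,y,z,,xyz,20100101,99991231' \
    'A,1,abc,x,y,z,234,xyz,20100101,99991231' \
    'B,2,def,x,y,z,,uvw,20100101,99991231' > "$tmp"
drop_empty_dupes "$tmp"
rm -f "$tmp"
```

With this input the first line is dropped, while the 234 line and the B line are kept.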

Sheel · May 5, 2010, 12:19pm

@malcomex999/Jean
Thank you both for the help. The logic is working fine and I have been able to do what I wanted.

awk -F, '$7 && !dup[$2,$3,$4,$8]++' infile

But I would like to understand the logic behind the -F operation. How does the deletion happen here? What is the logic behind choosing the row to be deleted?

Can I use more than one field to decide which line is deleted? If yes, how?

@ygemici
Thanks for the code. I have not been able to implement the logic yet. I will get back to you after using it.

Cheers,
Sheel
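
On the questions above, a short sketch may help. The -F, option only sets the field separator to a comma; it deletes nothing. The one-liner is a pattern with no action, so awk prints every line for which the pattern is true and silently skips the rest, and the input file itself is never modified. The pattern has two parts: $7 is true when the 7th field is non-empty (and not a numeric zero), and !dup[key]++ is true only the first time a key is seen. The version below spells this out, testing $7 != "" explicitly so that a 7th field of 0 is not treated as empty. The function names dedupe and dedupe_two_fields are hypothetical, and the use of field 9 in the second one is an arbitrary choice to show how more than one field can take part in the decision.

```shell
# dedupe is a hypothetical helper name; it reads CSV on stdin.
dedupe() {
    awk -F, '
        $7 == "" { next }            # skip lines whose 7th field is empty
        !seen[$2,$3,$4,$8]++         # print only the first line for each key
    '
}

# Same idea, but a line is skipped when field 7 OR field 9 is empty.
# Field 9 is only an illustration; combine any fields with || or &&.
dedupe_two_fields() {
    awk -F, '
        $7 == "" || $9 == "" { next }
        !seen[$2,$3,$4,$8]++
    '
}

printf '%s\n' \
    'A,1,abc,x,y,z,,xyz,20100101,99991231' \
    'A,1,abc,x,y,z,234,xyz,20100101,99991231' | dedupe
# prints: A,1,abc,x,y,z,234,xyz,20100101,99991231
```

To keep the result, redirect the output to a new file and replace the original once it has been checked.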