Remove all instances of duplicate records from the file

vukkusila · December 11, 2007, 8:25pm

Hi experts,
I am new to scripting. I have a requirement as below.

File1:

A|123|NAME1
A|123|NAME2
B|123|NAME3

File2:

C|123|NAME4
C|123|NAME5
D|123|NAME6

1) I have to merge both files.
2) Sort the merged data (the key fields are the first and second fields).
3) Remove all instances of duplicate records from the merged file and write all of these duplicate instances into one file.
4) The remaining records, which are unique in the original source files, need to be written to another file.

Output files:

File3:
A|123|NAME1
A|123|NAME2
C|123|NAME4
C|123|NAME5

File4:

B|123|NAME3
D|123|NAME6

Please help me with a solution, as this is really urgent. I appreciate your help.

Thank you

user_prady · December 12, 2007, 3:08am

If I am not mistaken, every record in File1 and File2 looks unique. I am looking at the last character of each line.

So if all the data is unique, all the records should go to File4, shouldn't they?

Please explain more clearly, so that you get a quick reply from this forum.
Believe me, there are really brilliant experts in this forum, ready to help you out at any time.

Excluding me.

Cheers
user_prady

Klashxx · December 12, 2007, 5:27am

Try something like:

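 sort -t'|' -k1,2 File1 File2|awk -F\| 'BEGIN{i=0}{
                              # count how many times each key (fields 1 and 2) occurs
                              pat=$1"|"$2
                              ocurrences[pat]++
                              # keep every line so it can be routed in END
                              line[i]=$0
                              i++
       }
       END {
          for (j=0;j<i;j++)
              {
              # assumes the key is the first 5 characters, as in the sample data
              pat=substr(line[j],1,5)
              if (ocurrences[pat]>1)
                print line[j]>>"File3"
              else
                print line[j]>>"File4"
              }
      }'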
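
A possible variation, in case the merged data is too large to hold in an awk array: sort into a scratch file and let awk read that file twice, counting keys on the first pass and routing lines on the second. This is only a sketch; merged.tmp is a made-up scratch file name, and the sample data is copied from the first post so the snippet can be run as is.

```shell
# Sample input from the question.
printf '%s\n' 'A|123|NAME1' 'A|123|NAME2' 'B|123|NAME3' > File1
printf '%s\n' 'C|123|NAME4' 'C|123|NAME5' 'D|123|NAME6' > File2

# merged.tmp is a hypothetical scratch file, not part of the original thread.
sort -t'|' -k1,2 File1 File2 > merged.tmp

# Remove stale outputs so a rerun never mixes old and new results.
rm -f File3 File4

# First read (NR == FNR): count each key made of fields 1 and 2.
# Second read: send each line to File3 or File4 depending on its key count.
awk -F'|' '
    NR == FNR { count[$1 FS $2]++; next }
    { print > (count[$1 FS $2] > 1 ? "File3" : "File4") }
' merged.tmp merged.tmp

rm -f merged.tmp
```

With the sample data this should leave the A and C records in File3 and the B and D records in File4. Only the key counts are held in memory, at the cost of reading the merged file twice.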

radoulov · December 12, 2007, 6:50am

Another sort/Awk solution
(if your files are not already sorted as the samples you posted):

sort -t\| -k1,2 file1 file2|awk '{
	x[$1,$2]++
	y[NR] = $0
} END {
	for (i = 1; i <= NR; i++)
		print y[i] > ((x[substr(y[i],1,5)] > 1) ? "file3" : "file4")
}' SUBSEP="|" FS="|"

Use nawk or /usr/xpg4/bin/awk on Solaris.

P.S. For variable column widths you should not use substr; use split instead, for example:

sort -t\| -k1,2 file1 file2|awk '{
	x[$1,$2]++
	y[NR] = $0
} END {
	for (i = 1; i <= NR; i++)
		{
			tmp = y[i]
			split(tmp,z)
			print tmp > ((x[z[1],z[2]] > 1) ? "file3" : "file4")
		}
}' SUBSEP="|" FS="|"
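
One more way to handle variable-width keys, offered only as a sketch: remember each line's key when it is read, so nothing needs to be split again in END. The array names key, count and line are my own, and the sample records below are invented to give the keys different widths; they are not from the original post.

```shell
# Hypothetical records whose key fields have different widths.
printf '%s\n' 'AB|1234|NAME1' 'AB|1234|NAME2' 'A|12|NAME3' > file1
printf '%s\n' 'C|9|NAME4' 'C|9|NAME5' 'DDD|9|NAME6' > file2

# Remove stale outputs so a rerun never mixes old and new results.
rm -f file3 file4

sort -t'|' -k1,2 file1 file2 | awk -F'|' '{
    key[NR] = $1 FS $2      # key of this line: fields 1 and 2
    count[key[NR]]++        # how many lines share that key
    line[NR] = $0           # keep the line for the END pass
} END {
    for (i = 1; i <= NR; i++)
        print line[i] > (count[key[i]] > 1 ? "file3" : "file4")
}'
```

Here the AB and C records should end up in file3, and the A and DDD records in file4, whatever the width of the first two fields.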