Remove duplicate values in a column(not in the file)

ratheeshjulk · August 22, 2016, 5:25pm

Hi Gurus,

I have a file(weblog) as below

 
 abc|xyz|123|agentcode=sample code abcdeeess,agentcode=sample code abcdeeess,agentcode=sample code abcdeeess|agentadd=abcd stereet 23343,agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss,agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd,agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt

I want to remove the duplicate values from column 4 and 5. There is a possibility that same value may repeat with comma delimited. Comma can also come in the data as well .
My algorithm is to take column 1, 2, 3 (makes record unique) and then split column 4 and 5 based on commas, and then remove duplicate, join them back with comma(so that comma in the record wont be lost)

Is there a command with awk or perl

Out put should be like

 
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

Don_Cragun · August 22, 2016, 6:10pm

Does the order of the resulting strings in fields 4 and 5 matter?
In other words, does it matter if the last line of the output shown above would be:

 twe|tez|555|dddddd,agentcode=sample3 code ddddddddd|agentadd=tttt stereet ttttt

instead of:

 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

It is easier and faster if the output order can be random; but it isn't hard to keep the input order if it matters.

vgersh99 · August 22, 2016, 6:16pm

could probably be simplified a bit, but it's a start...
awk -f rath.awk wevlogFile where rath.awk is:

BEGIN {
  FS=OFS="|"
  fA[4];fA[5]
}
function uniq(f,   s,a,at,i)
{
   s=""
   split($f, a, ",")
   for(i in a)
       at[a]
   for(i in at)
     s=(!s)? i:s "," i
   return(s)
}
{
   for(i=1; i<=NF; i++)
     printf("%s%s", (i in fA)?uniq(i):$i, (i==NF)?ORS:OFS)

}

Don_Cragun · August 22, 2016, 9:17pm

Hi ratheeshjulk,
On my system, with the sample input provided in post#1 in this thread, vgersh99's code produces the output:

 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|gentcode=sample1 code wqwdeeess,agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|dddddd,agentcode=sample3 code ddddddddd|agentadd=tttt street ttttt

(using random order for the subfields in the fields that are being processed for duplicate entries). Different versions of awk might produce different random orders.

The following similar awk script produces output with the order of subfields in the fields maintained with the first copy of a duplicated subfield kept in place and later copies of that subfield dropped from the output:

awk '
function nodup(field,	f, n, loop, seen) {
	n = split($field, f, SFS)
	seen[$field = f[1]]
	for(loop = 2; loop <= n; loop++)
		if(!(f[loop] in seen)) {
			$field = $field SFS f[loop]
			seen[f[loop]]
		}
}
BEGIN {	FS = OFS = "|"
	SFS = ","
	low = 4
	high = 5
}
{	for(i = low; i <= high; i++)
		nodup(i)
}
1' weblog

and produces the output:

 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt street ttttt

with the same input. Note that the text marked in red is shown in both of our outputs because there is a difference between agentcode and gentcode that causes both subfields to appear in the output even though it does not appear in the output you said you wanted.

You haven't said what operating system you're using. If you are using a Solaris/SunOS system, you'll need to use /usr/xpg4/bin/awk or nawk instead of awk for both of our suggestions.

My code also assumes that the fields to be processed will always be adjacent no matter how many fields in your real input files need to be processed; vgersh99's code lets you select any set (contiguous or non-contiguous) of fields to be processed. If the fields you want to process in your real files are not contiguous, but it is necessary to keep output subfields in the order in which they were found in the input; it would be easy to modify my code to use the same scheme vgersh99 used to identify fields to be processed.

ratheeshjulk · August 23, 2016, 3:27pm

Thanks.. solution worked..