Remove duplicates separated by delimiter

enrikS · May 21, 2018, 4:21pm

First post. I have been browsing for 3 days and have come up with nothing so far.

M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2

The output should be:

M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B2-B4,B4-B6,B1-B2

Columns 6 and 7 contain strings of the form Ax-Ax and Bx-Bx respectively. The strings are separated by a comma ",".

How can I remove strings that are duplicated in col 6 and in col 7?
For example, if A1-A2,A1-A2 is present in col 6, I want to keep only one.

awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ","; i=split("",a); print ""}' data

I saw a question like mine on SO, but I'm stuck.

What am I doing wrong?

awk ' 
BEGIN { FS="\t" } ;
{
  split($6, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $8

}'

After many failed attempts I came up with the idea of breaking up the delimiters to remove duplicate fields across each row, only to realize that I would then need to regroup all the As under col 6 and the Bs under col 7. Back to square one!

So the solution for me would be to remove duplicates separated by a delimiter within a column. I tried a Perl approach, but in vain.

Thank you for your help

Update:
Please note that in this example, col 6 and col 7 are not sorted.

RudiC · May 21, 2018, 6:02pm

Welcome to the forum.

Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.

RudiC · May 21, 2018, 6:20pm

Howsoever, how far would

awk '
function RMDUP(P1, T, TX, DL)   {for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
                                 for (t in T)   {TX = TX DL t
                                                 DL = ","
                                                }
                                 return TX
                                }
        {$6 = RMDUP($6)
         $7 = RMDUP($7)
        }
 1
' file
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B4,B6,B1-B2,B2-B4,B4-B6

get you?
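
A side note on how RMDUP above works, for readers new to awk: the extra names in a function's parameter list (T, TX and DL there) are never passed by the caller, so awk treats them as local variables that start out empty on every call. That is why the strings collected from $6 do not leak into the call for $7. Also, the order in which for (t in T) visits the array is not specified, so the output order can differ from the input order. A minimal sketch of the local-variable behaviour, with a hypothetical helper count_unique that is not part of the thread:

```shell
out=$(awk '
# seen, parts, n, i and c are extra parameters, so awk treats them as locals
function count_unique(list,    seen, parts, n, i, c) {
    n = split(list, parts, ",")
    for (i = 1; i <= n; i++)
        if (!(parts[i] in seen)) { seen[parts[i]]; c++ }
    return c + 0
}
BEGIN {
    print count_unique("A1-A2,A5-A6,A1-A2")   # 2 distinct strings
    print count_unique("A1-A2")               # 1: seen starts out empty again
}')
printf '%s\n' "$out"
```

If seen were a global array, the second call would already find A1-A2 in it and report 0 instead of 1.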

enrikS · May 22, 2018, 2:17am

RudiC asked: "Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line."

I made an error in the first post; it is updated now.

It worked. The way you wrote the code is self-explanatory, I'm baffled.
Thank you for your time.

Don_Cragun · May 22, 2018, 6:07am

Hi enrikS,
If keeping the input order of elements in fields 6 and 7 is important and you want <tab> as your output field separator (as shown in your second code snippet), you could also try:

awk '
BEGIN {	OFS = "\t"
}
function RMDUP(input,	i, n, NoDupArray, output, ValueArray) {
	n = split(input, ValueArray, /,/)
	NoDupArray[output = ValueArray[1]]
	for(i = 2; i <= n; i++)
		if(!(ValueArray[i] in NoDupArray)) {
			output = output "," ValueArray[i]
			NoDupArray[ValueArray[i]]
		}
	return output
}
{	$6 = RMDUP($6)
	$7 = RMDUP($7)
}
1' data

In addition to the change you have already made to your original post, note also that if you want the field 6 output to be:

A1-A2,A5-A6,A1-A4

you can't have the input be:

A1-A2,A5-A6,A1-A2,A1-4

The above code produces the output:

M3	C2	V5	D5	HH:FF	A1-A2,A5-A6,A1-4	B4-B6,B2-B4,B1-B2

from the sample input you provided in post #1.

For some hints as to why your second code snippet didn't work, note that your awk code specifies that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s). Therefore, your awk script sees only one input field, not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.
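
The field-separator point is easy to check from the shell. The sketch below feeds the sample line from post #1 to awk twice, once with FS set to a tab and once with the default separator, and prints NF each time; the variable names line, tab_nf and def_nf are only for illustration:

```shell
# Sample line from post #1; it contains spaces only, no tabs
line='M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2'

# With FS set to a tab the whole line is a single field
tab_nf=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\t" } { print NF }')

# With the default FS, runs of blanks separate the fields
def_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

echo "FS=tab: $tab_nf field(s); default FS: $def_nf field(s)"
```

With the tab separator NF is 1, so $6 is empty and there is nothing for split() to work on; with the default separator the same line has 7 fields.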

enrikS · May 22, 2018, 8:49am

I spent hours trying to figure out why it was not working, all because I mistook spaces for tabs.

As for col 6 and col 7, all my real strings are sorted (the ones used in this example are not). As the order of the output did not matter, I did not mind when I ran the test this morning. But it is good to know that the order can be preserved. I will edit the post to include that info.
One question regarding

for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
for (t in T)   {TX = TX DL t
                DL = ","
               }

I don't know how to formulate it properly, so I will just give an example:

M3    C2    A1    D5    HH:FF    A1-A2,A5-A6,A1-A4    B4-B6,B2-B4,B1-B2

Delete an element if $3 is present in it.
In this case $3 = A1, so A1-A2 and A1-A4 must be removed.

So basically, before I saw your method, I put the 3rd column in a new text file and searched for these strings. I was wondering whether using your method would be less complex. Hopefully I will give it a try this weekend.
I'm still learning how to write my code using different approaches. Three weeks ago I did not even know how to use Linux, lol. It has been so hard to comment and ask questions on SO without being labeled [witch-hunt]. Glad I found this forum.

RudiC · May 22, 2018, 10:17am

Not sure I understand correctly, but if you want to remove all elements from the array that match / contain $3, try (with your new sample code):

awk '
function RMDUP(P1, T, TX, DL)   {for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
                                 for (t in T) if (!(t ~ $3))    {TX = TX DL t
                                                                 DL = ","
                                                                }
                                 return TX
                                }
        {$6 = RMDUP($6)
         $7 = RMDUP($7)
        }
 1
' file
M3 C2 A1 D5 HH:FF A5-A6 B1-B2,B2-B4,B4-B

Don_Cragun · May 22, 2018, 4:08pm

When RudiC says he doesn't understand clearly what you are trying to do, that is an indication that your specification is not clear.

I also think we need a better specification of what is to be removed when $3 is not an empty field. If $3 is A, should every subfield in $6 be removed, or will $3 always contain a letter and a number? If a number is always present, should subfields matching that string be removed, or just strings that start with the same letter and the same number? (For example, if $3 contains A1, should it remove a subfield that starts with A12, or just subfields that are A1 or start with A1-? If $3 contains A5, should all subfields of $6 containing A5 be removed, such as A1-A5 and A49-A51, or should it just remove subfields that are A5 or start with A5- or end with -A5?)

enrikS · May 23, 2018, 11:54am

True, it was not clear at all. A1 can also match A12 (I learned that while using sed, lol), which in this case is a false positive, just like selecting A5 and removing A51. I saw RudiC's post, and it is in fact what I was trying to achieve.
@RudiC
What does

(t ~ $3)

stand for? I know that it is a reference to field $3, but why the "~"?

Yes indeed.
Thank you both for helping me.

RudiC · May 23, 2018, 12:30pm

That's awk's matching operator (man awk!). Read it as: (t matches / contains $3), e.g. "A1-A4" matches / contains "A1" but wouldn't match "A2". You could "sharpen" (narrow, restrict) the match with e.g. word boundaries like \b, \<, or \> (these may not be available in ALL regex engines): "A12-A14" ~ "\bA1\b" will yield FALSE.
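
Since the word-boundary operators differ between implementations (in GNU awk, for instance, the word-boundary operator is \y, because \b already means backspace), one portable way to narrow the match is to anchor on the start of the string, the end of the string, or the "-" between the two halves. The following is only a sketch under some assumptions: RMKEY is a hypothetical helper, the input line is made up for the test, and $3 is assumed to contain no regex metacharacters.

```shell
out=$(printf '%s\n' 'M3 C2 A1 D5 HH:FF A1-A2,A5-A6,A12-A14,A3-A1 B4-B6,B2-B4,B1-B2' |
awk '
# Hypothetical helper: drop sub-fields that have an endpoint exactly equal to key
function RMKEY(list, key,    parts, n, i, out, sep, re) {
    re = "(^|-)" key "(-|$)"        # key must sit between start, "-" or end
    n = split(list, parts, ",")
    for (i = 1; i <= n; i++)
        if (parts[i] !~ re) {       # keep only the sub-fields that do not match
            out = out sep parts[i]
            sep = ","
        }
    return out
}
{ $6 = RMKEY($6, $3); $7 = RMKEY($7, $3) }
1')
printf '%s\n' "$out"
```

With $3 = A1 this removes A1-A2 and A3-A1 but keeps A12-A14, and it keeps the remaining sub-fields in their input order.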