Delete duplicated fields in a line

Gr4wk · March 16, 2014, 3:27pm

Hi,

I have files with this kind of format (separator is space):

A1 B1 C1 D1 E1 F1 D1 C1 G1 H1
A2 B2 C2 D2 E2 F2 D2 C2 G2 H2 
A3 B3 C3 D3 E3 F3 G3 D3 C3 H3
A4 B4 C4 D4 E4 F4 G4 D4 C4 H4

I want the output to be:

A1 B1 E1 F1 G1 H1
A2 B2 E2 F2 G2 H2
A3 B3 E3 F3 G3 H3
A4 B4 E4 F4 G4 H4

Any clue? Can I use awk for this?

bartus11 · March 16, 2014, 3:34pm

Try:

awk '{for (i=1;i<=NF;i++) a[$i]++;for (i=1;i<=NF;i++) if (a[$i]==1) printf $i" ";printf "\n"}' file

Akshay_Hegde · March 16, 2014, 3:49pm

Try :

$ awk '{delete B;for(i=1;i<=NF;i++){if($i in B){$i=$(B[$i])=x}B[$i]=i};$0=$0;$1=$1}1' file

A1 B1 E1 F1 G1 H1
A2 B2 E2 F2 G2 H2
A3 B3 E3 F3 G3 H3
A4 B4 E4 F4 G4 H4

Gr4wk · March 16, 2014, 3:59pm

Hi Bartus and Akshay,

Both scripts work, but only for the first line; the remaining lines are not handled correctly.

Maybe they need a few modifications? The fields contain strings in different formats (characters, numbers, etc.).

Yoda · March 16, 2014, 5:06pm

Another approach that will work for posted data:

awk '
        {
                for ( i = 1; i <= NF; i++ )
                {
                        n = gsub ( "\\<"$i"\\>", "&", $0 )
                        if ( n > 1 )
                                gsub ( "\\<"$i"\\>", X, $0 )
                }
                $1 = $1
                print $0
        }
' file

Gr4wk · March 16, 2014, 5:08pm

Hi Scrutinizer,

It doesn't work.

This is the input format:

SEKK101 1C23.delay multiLink=0 dtx=0 sequence=1 >>> dtx=0 multiLink=0 sequence=0 >>>done.                      
SEKK106 1C22.delay multiLink=0 dtx=0 sequence=1 >>> dtx=0 multiLink=0 sequence=0 >>>done.                      
SEKK102 1C24.delay multiLink=0 dtx=0 sequence=1 >>> dtx=0 multiLink=0 sequence=0 >>>done.                      
SEKK101 1C20.delay multiLink=0 dtx=0 sequence=1 >>> dtx=0 multiLink=0 sequence=0 >>>done.                      
SEKK104 1C10.delay multiLink=0 dtx=0 sequence=1 >>> dtx=0 multiLink=0 sequence=0 >>>done.                      
SEKK104 1C11.delay multiLink=0 dtx=0 sequence=1 >>> dtx=0 multiLink=0 sequence=0 >>>done.                      
SEKK101 1C12.delay algoRithm=0 thresHold=10 upThresh=10 >>> upThresh=11 thresHold=10 algoRithm=0 >>>done.      
SEKK101 1C15.delay algoRithm=0 thresHold=10 upThresh=11 >>> upThresh=11 thresHold=11 algoRithm=0 >>>done.      
SEKK106 1C16.delay algoRithm=0 thresHold=10 upThresh=10 >>> upThresh=11 thresHold=10 algoRithm=0 >>>done.      
SEKK106 1C17.delay algoRithm=0 thresHold=10 upThresh=11 >>> upThresh=11 thresHold=11 algoRithm=0 >>>done.      
SEKK102 1C18.delay algoRithm=0 thresHold=10 upThresh=10 >>> upThresh=11 thresHold=10 algoRithm=0 >>>done.

Scrutinizer · March 16, 2014, 5:21pm

Hi Gr4wk, I had already deleted my post, since it was not foolproof anyway. Try this one instead:

awk '{for(i=1; i<NF; i++) for(j=i+1; j<=NF; j++) if($i==$j) $i=$j=x; $0=$0; $1=$1}1' file

Gr4wk · March 17, 2014, 1:13am

Thanks Scrutinizer, it works!

Can you explain what the code means?

SriniShoo · March 17, 2014, 1:21am

awk '{delete a; delete b; for(i = 1; i <= NF; i++) {a[i] = $i; b[$i]++}; for(i = 1; i <= length(a); i++) {if(b[$i] == 1) {printf "%s%s", a[i], FS}}; print ""}' file

Gr4wk · March 17, 2014, 1:27am

Good job SriniShoo, your code works too. Can you explain it, please?

SriniShoo · March 17, 2014, 1:44am

delete a; delete b

Clears arrays a and b, so that values from the previous line do not carry over.

for(i = 1; i <= NF; i++) {a[i] = $i; b[$i]++}

Walks through the line and stores each field value in two arrays:
a - to print the output in order
b - to check for duplicates

for(i = 1; i <= length(a); i++) {if(b[$i] == 1) {printf "%s%s", a[i], FS}}

After reading the line, I print the values from array a whenever array b says the value has no duplicate.

printf "%s%s", a[i], FS

Formats the output: each kept value is followed by the field separator.
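
The two-pass idea described above can also be spelled out in a longer form. This is only a minimal sketch, not SriniShoo's exact code: the function name drop_repeated_fields and the array name count are made up for illustration, and it assumes an awk where "delete count" clears a whole array (gawk and mawk both accept this).

```shell
# Hypothetical helper: print each input line with every field that occurs
# more than once on that line removed. Reads stdin or the files given.
drop_repeated_fields() {
    awk '
    {
        delete count                      # forget counts from the previous line
        for (i = 1; i <= NF; i++)         # pass 1: count each field value
            count[$i]++
        out = ""
        for (i = 1; i <= NF; i++)         # pass 2: keep values seen exactly once
            if (count[$i] == 1)
                out = (out == "" ? $i : out OFS $i)
        print out
    }' "$@"
}

# First sample line from the thread
printf 'A1 B1 C1 D1 E1 F1 D1 C1 G1 H1\n' | drop_repeated_fields
```

Building the line in a variable and printing it once avoids the trailing separator that a printf per field leaves behind.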

Akshay_Hegde · March 17, 2014, 4:18am

Small addition to my old code, which I missed yesterday

$ awk '{delete B;for(i=1;i<=NF;i++){if($i in B){$i=$(B[$i])=x}B[$i]=i}$0=$0;$1=$1}1' file

SEKK101 1C23.delay sequence=1 >>> sequence=0 >>>done.
SEKK106 1C22.delay sequence=1 >>> sequence=0 >>>done.
SEKK102 1C24.delay sequence=1 >>> sequence=0 >>>done.
SEKK101 1C20.delay sequence=1 >>> sequence=0 >>>done.
SEKK104 1C10.delay sequence=1 >>> sequence=0 >>>done.
SEKK104 1C11.delay sequence=1 >>> sequence=0 >>>done.
SEKK101 1C12.delay upThresh=10 >>> upThresh=11 >>>done.
SEKK101 1C15.delay thresHold=10 >>> thresHold=11 >>>done.
SEKK106 1C16.delay upThresh=10 >>> upThresh=11 >>>done.
SEKK106 1C17.delay thresHold=10 >>> thresHold=11 >>>done.
SEKK102 1C18.delay upThresh=10 >>> upThresh=11 >>>done.

---------- Post updated at 02:48 PM ---------- Previous update was at 02:44 PM ----------

Adding "delete a" to bartus11's approach makes it work. Here is the modified version of bartus11's code:

$ awk '{delete a;for (i=1;i<=NF;i++) a[$i]++;for (i=1;i<=NF;i++) if (a[$i]==1) printf "%s ", $i;print ""}' file
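
The reason the reset matters is that awk arrays keep their contents from one input line to the next, so counts from earlier lines leak into later ones. A small sketch of the effect follows; the function name unique_fields, its 0/1 argument and the two-line input are all made up for illustration.

```shell
# Hypothetical demo: unique_fields RESET, where RESET is 1 to clear the
# counts for every line and 0 to let them accumulate across lines.
unique_fields() {
    awk -v reset="$1" '
    {
        if (reset == 1) delete a          # without this, counts leak between lines
        for (i = 1; i <= NF; i++) a[$i]++
        line = ""
        for (i = 1; i <= NF; i++)
            if (a[$i] == 1) line = line (line == "" ? "" : " ") $i
        print line
    }'
}

printf 'k=0 x=1\nk=0 y=2\n' | unique_fields 0   # second line loses k=0
printf 'k=0 x=1\nk=0 y=2\n' | unique_fields 1   # both lines keep k=0
```

With the real input from this thread the same thing happens to values such as >>> and dtx=0, which recur on every line.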

Scrutinizer · March 17, 2014, 1:52pm

Sure:

awk '
{                              # For every line in file "file"
  for(i=1; i<NF; i++)          # Iterate variable "i" from 1 to the number of fields minus 1
    for(j=i+1; j<=NF; j++)     # Iterate variable "j" from i+1 to the number of fields
      if($i==$j) $i=$j=x       # If two of these fields are equal, set both to "" (x is unset)
  $0=$0                        # Re-split the record into fields; if fields were set to ""
                               # there are now fewer fields
  $1=$1                        # Rebuild the record, so that any amount of spacing between
                               # fields is converted to OFS, which is a single space
}
1                              # Print the record
' file                         # Read the file "file"

Hope this helps..
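
To watch what the two assignments at the end do, here is a small sketch that blanks two fields by hand and prints NF and the record after each step. The function name resplit_demo and the input line are made up for illustration.

```shell
# Hypothetical demo: blank two fields, then watch NF and the record change.
resplit_demo() {
    awk '{
        $3 = $4 = x                       # x is unset, so both fields become ""
        printf "%d [%s]\n", NF, $0        # still 5 fields, gaps remain
        $0 = $0                           # re-split: the empty fields disappear
        printf "%d [%s]\n", NF, $0        # fewer fields, spacing unchanged
        $1 = $1                           # rebuild the record with single OFS
        printf "%d [%s]\n", NF, $0
    }'
}

printf 'A B C C D\n' | resplit_demo
```

This should print "5 [A B   D]", then "3 [A B   D]", then "3 [A B D]": the field count drops at the $0=$0 step and the extra spaces go away at the $1=$1 step.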