Help with removing duplicated content

Input file:

hcmv-US25-2-3p	hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
hcmv-UL70-3p hsa-4486
hcms-US25 hsa-360-5
hcms-US25 hsa-4
hcms-US25 hsa-458
hcms-US25 hsa-44812
.
.

Desired Output file:

hcmv-US25-2-3p	hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
	     hsa-4486
hcms-US25 hsa-360-5
	  hsa-4
	  hsa-458
	  hsa-44812
.
.

I would like to blank out the repeated values in column 1, so that each value is shown only on the first line of its group.
Many thanks for any advice.

awk '{if ($1==x){gsub(/./," ",$1)}else{x=$1;}}1' file
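For anyone who wants to see what the one-liner does step by step, here is a longer sketch of the same idea. It is only an illustration, not a replacement: the function name blank_repeats is made up for this example, and it assumes whitespace-separated columns with repeated keys on consecutive lines. Instead of gsub it builds the padding with sprintf, which should behave the same for plain ASCII keys.

```shell
#!/bin/sh
# blank_repeats: hypothetical helper name, used only for this illustration.
# Reads lines on stdin and blanks column 1 when it repeats the previous line.
blank_repeats() {
    awk '
    {
        key = $1
        # Repeated key: overwrite column 1 with as many blanks as it is wide.
        # Assigning to $1 rebuilds the line with single-space separators.
        if (NF && NR > 1 && key == prev)
            $1 = sprintf("%" length(key) "s", "")
        prev = key   # remember the real key, not the blanked one
        print
    }'
}

# Demo with the sample data from the question.
blank_repeats <<'EOF'
hcmv-US25-2-3p hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
hcmv-UL70-3p hsa-4486
hcms-US25 hsa-360-5
hcms-US25 hsa-4
hcms-US25 hsa-458
hcms-US25 hsa-44812
EOF
```

Like the one-liner, this only compares each line with the one directly above it, so a key that shows up again later in the file (with other keys in between) is printed in full.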

Guru.


Hi,

Although you already have an answer, here is a solution using 'sed':

$ cat infile
hcmv-US25-2-3p  hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
hcmv-UL70-3p hsa-4486
hcms-US25 hsa-360-5
hcms-US25 hsa-4
hcms-US25 hsa-458
hcms-US25 hsa-44812
$ cat script.sed
:a

## Append the next line to the pattern space, unless this is the last line.
$! N

## If the first field after '\n' is the same as the first field after '^' ...
/^\([^ \t]\+[ \t]\+\).*\n\1.*$/ {
        :b
        ## Substitute each char with a space and repeat for all of them (instruction 'tb'). After that, read next line.
        s/\(\n[^ \t]*\)[^ \t][ \t]/\1  /
        tb
        ba
}

:c
## If there are two (or more) lines in pattern space... go to ':d' and read next line.
s/\n/\n/
td

## If last line, print it and exit.
$ { 
        P
        D
}

ba

## Print one line (until first '\n') and delete it.
:d
P
s/^[^\n]*\n//
tc
$ sed -f script.sed infile
hcmv-US25-2-3p  hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
             hsa-4486
hcms-US25 hsa-360-5
          hsa-4
          hsa-458
          hsa-44812

Regards,
Birei

Here's a Perl solution -

$ cat f9
hcmv-US25-2-3p  hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
hcmv-UL70-3p hsa-4486
hcms-US25 hsa-360-5
hcms-US25 hsa-4
hcms-US25 hsa-458
hcms-US25 hsa-44812
$ perl -lane 'print $F[0] eq $prev ? " " x length($prev) : $F[0], " ", $F[1]; $prev=$F[0]' f9
hcmv-US25-2-3p hsa-3160-5
hcmv-US33 hsa-47
hcmv-UL70-3p hsa-4508
             hsa-4486
hcms-US25 hsa-360-5
          hsa-4
          hsa-458
          hsa-44812

tyler_durden