Help with removing duplicate entries with awk or Perl

Amit_Pande · October 29, 2012, 10:25am

Hi,
I have a file which looks like this: chr1 11127067 11132181 89 chr1 11128023 11128311 chr1 11130990 11131025 chr1 11127067 11132181 89 chr1 11128023 11128311 chr1 11131583 11131618 chr1 11127067 11132181 89 chr1 11131908 11132010 chr1 11130990 11131025 chr1 11127067 11132181 89 chr1 11131908 11132010 chr1 11131583 11131618 chr1 11127067 11132181 89 chr1 11130992 11131108 chr1 11130990 11131025 chr1 11127067 11132181 89 chr1 11130992 11131108 chr1 11131583 11131618

The expected output should look like this:
chr1 11127067 11132181 89 chr1 11128023 11128311 chr1 11130990 11131025
I want the duplicate entries removed from all the columns.

elixir_sinari · October 29, 2012, 10:29am

Please use code tags to preserve formatting in the data samples. Your input is barely readable.
That one line? or multiple lines?
What's the expected output?

Amit_Pande · October 29, 2012, 10:37am

the duplicate entries in all the columns should be removed...and sorry for the bad post

itkamaraj · October 29, 2012, 10:38am

Can you see the post below and use CODE tags?

elixir_sinari · October 29, 2012, 10:48am

Like this?

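# Put each "chr1 ..." group on its own line, drop repeated groups, then rejoin.
sed 's/chr1/\
&/g' file | awk 'NF {
    # Rebuild the record with single spaces so spacing differences do not matter.
    a = ""
    for (i = 1; i <= NF; i++)
        a = a " " $i
    # Print a group only the first time it is seen.
    if (!(a in exists)) {
        print
        exists[a]
    }
}' | paste -sd\\0 -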

Amit_Pande · October 29, 2012, 10:55am

Thanks but this doesn't work.
all the duplicate entries in all the columns should be removed....

itkamaraj · October 29, 2012, 10:56am

Amit,

Can you post your data with code tags? Otherwise, we are all giving solutions based on assumptions.

elixir_sinari · October 29, 2012, 11:00am

Sorry for that.
I've edited my earlier post. Check it. Should work.

Amit_Pande · October 29, 2012, 11:10am

Thanks for your effort ....but still it does not work.

elixir_sinari · October 29, 2012, 11:13am

What do you mean by "duplicate entries"? Please define clearly in the context of your example.

I should've asked you this first (and paid heed to itkamaraj's implicit advice) and shouldn't have made an effort in the first place.
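
To make the question concrete: "duplicate entries" can be read in at least two ways, and they give very different results. A minimal sketch, assuming whitespace-separated fields; sample.txt and its three lines are made up for illustration and are not the poster's data.

```shell
# Hypothetical sample: line 2 repeats line 1 exactly, line 3 differs in field 2 only.
printf '%s\n' 'chr1 100 200' 'chr1 100 200' 'chr1 150 200' > sample.txt

# Reading 1: drop whole lines that have already appeared.
whole_line=$(awk '!seen[$0]++' sample.txt)

# Reading 2: within each column, blank a value that already appeared in that column.
per_column=$(awk '{ for (i = 1; i <= NF; i++) if (seen[i, $i]++) $i = "" } 1' OFS='\t' sample.txt)

printf '%s\n---\n%s\n' "$whole_line" "$per_column"
```

Reading 1 keeps two complete lines. Reading 2 keeps three lines, but the second is entirely blank and the third retains only the new value in column 2.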

Scott · October 29, 2012, 11:14am

Maybe you should repost your data using code tags. Then everyone could reasonably expect to know exactly what they're working with.

Amit_Pande · October 29, 2012, 11:48am

chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1

and the output should look like this for all the lines...sorry for the trouble friends...I am not an expert with computers

chr1  11127067  11132181  89  chr1  11128023  11128311  chr1  11130990  11131025  5
                              chr1  11131908  11132010
                              chr1  11130992  11131108  chr1  11131583  11131618  1

itkamaraj · October 29, 2012, 11:52am

Hmm, it would be better to attach your file (save it in .txt format): click "Go Advanced",

then choose "Manage Attachments" and attach your file.

Amit_Pande · October 29, 2012, 12:04pm

Hi,
I have attached the txt file...it contains all the details

itkamaraj · October 29, 2012, 12:37pm

save the below code as a.awk

!($2 in a){printf("%s %s ",$1,$2);a[$2]}
!($3 in b){printf("%s ",$3);b[$3]}
!($4 in c){printf("%s ",$4);c[$4]}
!($6 in d){printf("%s %s ",$5,$6);d[$6]}
!($7 in e){printf("%s ",$7);e[$7]}
!($9 in f){printf("%s %s ",$8,$9);f[$9]}
!($10 in g){printf("%s %s",$10,$11);g[$10]}
{printf("\n")}

execute the awk command by

awk -f a.awk input.txt

durden_tyler · October 29, 2012, 7:01pm

$
$ cat input
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1
$
$
$ perl -F"\t" -lane 'for ($i=0; $i<=$#F; $i++) {
                       if (not defined $tokens{$i.":".$F[$i]}) {push @x, $F[$i]}
                       else {push @x, ""}
                     }
                     if (join("",@x) ne "") {
                       for ($i=0; $i<=$#x; $i++) { $line .= sprintf ("%-10s", $x[$i]) }
                       print $line;
                     }
                     for ($i=0; $i<=$#F; $i++) { $tokens{$i.":".$F[$i]}++ };
                     $line = ""; @x = ();
                    ' input
chr1      11127067  11132181  89        chr1      11128023  11128311  chr1      11130990  11131025  5
                                                                                11131583  11131618  1
                                                  11131908  11132010
                                                  11130992  11131108
                                                  11128311  11128447
                                                  11130630  11130711
                                                  11130729  11130979
                                                  11131263  11131553
                                                  11131587  11131709
                                                  11132034  11132488
$
$
$

tyler_durden

Amit_Pande · October 30, 2012, 9:20am

Sorry it doesn't work

pamu · October 30, 2012, 10:02am

try

awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t" file
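
The one-liner is compact, so here is the same idea spelled out as a sketch. It assumes whitespace-separated input, and the function name dedup_columns is made up for illustration. The key is the pair (value, column number): a value is blanked only if it already appeared earlier in the same column.

```shell
# dedup_columns is a hypothetical helper name, not something from the thread.
dedup_columns() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            # X[$i, i] counts how often this value was seen in column i.
            # The post-increment returns the old count: 0 (false) on first sight.
            if (X[$i, i]++)
                $i = ""      # already seen in this column: blank it
        }
        print                # assigning to a field rebuilds $0 with OFS
    }' OFS='\t' "$@"
}

# Small made-up input: column 1 repeats, column 2 is new each time.
printf '%s\n' 'chr1 11128023' 'chr1 11131908' | dedup_columns
```

Lines where nothing was blanked keep their original spacing, because awk only rebuilds the record with OFS after a field has been assigned.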

Amit_Pande · October 30, 2012, 10:09am

Doesn't work...I have attached the result as output.txt. Kindly have a look.

pamu · October 30, 2012, 10:20am

Your expected output doesn't replicate what you say.

please look

chr1    11127067    11132181    89    chr1    11128023    11128311    chr1    11130990    11131025    5
                        chr1    11131908    11132010    chr1    11131583    11131618    1
                        chr1    11130992    11131108    
                        chr1    11128311    11128447

duplicate lines in column 1,2,3,5,6,7,8,9,10 should be removed while those that are not duplicate lines should be retained.

1) For columns 6 and 7, only 4 lines are printed in the expected output, although there are a few more unique values.
2) The chr1 entries on the later lines are duplicates too; why are they printed?
3) If you don't want column 4 considered, it should be present on every line, right?

Assuming you don't want to consider column 4 for duplicates:

$ cat file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131908        11132010        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130992        11131108        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11128311        11128447        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130630        11130711        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11130729        11130979        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131263        11131553        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11131587        11131709        chr1    11131583        11131618        1
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11130990        11131025        5
chr1    11127067        11132181        89      chr1    11132034        11132488        chr1    11131583        11131618        1

$ awk '{for(i=1;i<=NF;i++){if((X[$i,i]++) && i!=4){$i=""}}}1' OFS="\t" file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
                        89                                      11131583        11131618        1
                        89              11131908        11132010
                        89
                        89              11130992        11131108
                        89
                        89              11128311        11128447
                        89
                        89              11130630        11130711
                        89
                        89              11130729        11130979
                        89
                        89              11131263        11131553
                        89
                        89              11131587        11131709
                        89
                        89              11132034        11132488
                        89

And considering all the columns..

$ awk '{for(i=1;i<=NF;i++){if(X[$i,i]++){$i=""}}}1' OFS="\t" file
chr1    11127067        11132181        89      chr1    11128023        11128311        chr1    11130990        11131025        5
                                                                11131583        11131618        1
                                        11131908        11132010

                                        11130992        11131108

                                        11128311        11128447

                                        11130630        11130711

                                        11130729        11130979

                                        11131263        11131553

                                        11131587        11131709

                                        11132034        11132488
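
The all-columns run leaves an empty line wherever every field of a row was a duplicate. If those lines are unwanted, one possible variation is to count the surviving fields and print only when at least one is left. This is a sketch under that assumption; dedup_nonempty and the sample input are made up for illustration.

```shell
# dedup_nonempty is a hypothetical helper name, not something from the thread.
dedup_nonempty() {
    awk '{
        kept = 0
        for (i = 1; i <= NF; i++) {
            if (X[$i, i]++) $i = ""   # seen before in column i: blank it
            else kept++               # first occurrence: keep it
        }
        if (kept) print               # skip rows where nothing survived
    }' OFS='\t' "$@"
}

# Made-up input: the second line repeats the first, so all of its fields get blanked.
printf '%s\n' 'chr1 100 200' 'chr1 100 200' 'chr1 150 200' | dedup_nonempty
```

With this input the fully duplicated second line is dropped, and the third line keeps only its new value in column 2.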

Hope this helps you:)

pamu