remove consecutive duplicate rows

I have some data that looks like this:

1    3300665.mol   3300665    5177008   102.093
2    3300665.mol   3300665    5177008   102.093
3    3294015.mol   3294015    5131552   102.114
4    3294015.mol   3294015    5131552   102.114
5    3293734.mol   3293734    5129625   104.152
6    3293734.mol   3293734    5129625   104.152
7    3347497.mol   3347497    5510897   109.15
8    3294070.mol   3294070    5132258   113.096
9    3295423.mol   3295423    5141084   114.11
10   3295423.mol   3295423    5141084   114.11
11   3347551.mol   3347551    5511243   114.165
12   3347551.mol   3347551    5511243   114.165
13   3290635.mol   3290635    5108661   116.16
14   3290635.mol   3290635    5108661   116.16
15   3347550.mol   3347550    5511242   117.107
16   3347550.mol   3347550    5511242   117.107
17   3293127.mol   3293127    5119773   118.193
18   3293127.mol   3293127    5119773   118.193
19   3382893.mol   3382893    5728430   119.181
20   3382893.mol   3382893    5728430   119.181

with ~150,000 rows. I need to look for instances where the value in col $1 is the same in consecutive rows. When a duplicate row is found, I need to remove it and write it to a second file. I guess the simplest approach is to create two new files: one with a single copy of each line (whether the row occurs once or more than once in the input file), and a second file containing the rows that were found to be duplicates.

Minimal set of all rows (dups have been removed):

1    3300665.mol   3300665    5177008   102.093
3    3294015.mol   3294015    5131552   102.114
5    3293734.mol   3293734    5129625   104.152
7    3347497.mol   3347497    5510897   109.15
8    3294070.mol   3294070    5132258   113.096
9    3295423.mol   3295423    5141084   114.11
11   3347551.mol   3347551    5511243   114.165
13   3290635.mol   3290635    5108661   116.16
15   3347550.mol   3347550    5511242   117.107
17   3293127.mol   3293127    5119773   118.193
19   3382893.mol   3382893    5728430   119.181

Duplicates file:

 2    3300665.mol   3300665    5177008   102.093
 4    3294015.mol   3294015    5131552   102.114
 6    3293734.mol   3293734    5129625   104.152
 10   3295423.mol   3295423    5141084   114.11
 12   3347551.mol   3347551    5511243   114.165
 14   3290635.mol   3290635    5108661   116.16
 16   3347550.mol   3347550    5511242   117.107
 18   3293127.mol   3293127    5119773   118.193
 20   3382893.mol   3382893    5728430   119.181

I'm not sure how to go about this, and I can't do it in Excel, so some assistance would be appreciated. There could be instances of three or more identical rows in a row; I'm not sure. I guess it makes sense to just keep the first instance of each group.

LMHmedchem

Based on your sample file, your key is NOT in the first field, but rather in the SECOND.
This will create 2 files: myInput_dup and myInput_uniq

nawk '{print $0 >> (FILENAME (($2 in dup)?"_dup":"_uniq"));dup[$2]}' myInput
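If it helps, here is the same one-liner unrolled with comments. Note that it flags a repeat of the $2 key anywhere in the file, not just on adjacent lines, which comes to the same thing as long as the input is grouped like your sample:

nawk '{
    # pick the output file: FILENAME plus "_dup" if this $2 key
    # has been seen before, FILENAME plus "_uniq" otherwise
    out = FILENAME (($2 in dup) ? "_dup" : "_uniq")
    print $0 >> out
    # referencing dup[$2] is enough to create the key for next time
    dup[$2]
}' myInput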

Make all lines unique (each run of duplicate consecutive lines appears only once):

awk '{sub(".*"$2,$2)}1' yourfile | uniq 

Log duplicate lines to a file (only lines that appear two or more times consecutively get logged):

awk '{sub(".*"$2,$2)}1' yourfile | uniq -d >duplicate.txt

Log lines that appear no more than once consecutively to a file:

awk '{sub(".*"$2,$2)}1' yourfile | uniq -u >single.txt

Log all lines prefixed by the number of times they appear consecutively:

awk '{sub(".*"$2,$2)}1' yourfile | uniq -c >count.txt
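In case the sub() part looks cryptic: it replaces everything from the start of the line up to and including $2 with $2 itself, i.e. it strips the leading line-number column so that uniq compares only the real data. On your first sample line:

echo '1    3300665.mol   3300665    5177008   102.093' | awk '{sub(".*"$2,$2)}1'

prints:

3300665.mol   3300665    5177008   102.093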

You can renumber the lines if necessary with ... | cat -b or ... | cat -n

By the way, do you really care about the first field (line number), or can we get rid of it?

I was numbering the columns with $0 as the first column; is that not right? Now that I think about it, $0 is the whole line, if I remember right.

Will this work with awk, or do I need nawk?

I probably need an index field, but I probably don't need to preserve the values from the input file. I could just add another line of awk to create a new index.

awk 'BEGIN{OFS="\t"} {print (NR>1?NR-1:"id"),$0}'
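That should print the label "id" on the first line and NR-1 on every line after it, so it assumes a header row. Since my data has no header, plain NR would probably do:

awk 'BEGIN{OFS="\t"} {print NR, $0}' yourfile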

LMHmedchem

The sample you posted is line-numbered; is it the output of an awk command of yours?
If so, show it to us, and provide an example of the original input file you have, before any formatting.


When you state:

"I need to remove the duplicate row"

do you mean that two identical consecutive lines should:
a) appear only once?
or
b) not appear at all?


awk's field references are one-based.

If you are on Solaris, use either nawk or /usr/xpg4/bin/awk.
Anywhere else, awk should do (most likely).
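A quick way to see the numbering ($0 is the whole record, $1 the first field, and so on):

echo 'a b c' | awk '{print $0; print $1; print $2}'

which prints:

a b c
a
b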


The answer is b, needs to appear only once in the output. The formatting of the input file is pretty far back in the tool chain, and I don't see much value in redoing that part. It is just as easy to add a new column. The string in $1 is the index anyway.

LMHmedchem

"The answer is b, needs to appear only once in the output"

...
sorry, you are confusing me. Given:
a
b
b
c
c
a
d
should it appear as
a
b
c
a
d
or as
a
a
d
?

And I'm going to keep confusing you; sorry, I meant option a. It's been one of those days, since I'm trying to work on three different projects: one in Python, one in C++, and one in bash.

a
a
b
c
c
d
e
e

should appear in the unique file as:

a
b
c
d
e

and also in the duplicate log as:

a
c
e

LMHmedchem

Did you try what I suggested in my post #3:

awk '{sub(".*"$2,$2)}1' yourfile | uniq >output.txt
awk '{sub(".*"$2,$2)}1' yourfile | uniq -d >duplicate.txt
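On your letter example (no line-number column, so no awk stage needed), the two uniq stages give exactly the split you asked for:

printf 'a\na\nb\nc\nc\nd\ne\ne\n' | uniq       # a b c d e
printf 'a\na\nb\nc\nc\nd\ne\ne\n' | uniq -d    # a c e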

Have you checked the content of the generated *.txt files?

Hope this helps.

 
awk '{if (NR > 1 && $0 == v) print > "dup.txt"; else print > "file.txt"; v = $0}' inputfile

I haven't tried it, but I suppose something like this would do the work as well:

awk '{p=c; c=$0} !p {print c > "file.txt"} p {f = (p==c) ? "dup.txt" : "file.txt"; print c > f}' inputfile

This could be hardened a little: the bare p test is also false when p holds the string '0', which would misroute such a line. Testing length(p) instead avoids that:

awk '{p=c; c=$0} length(p)==0 {print c > "file.txt"} length(p) {f = (p==c) ? "dup.txt" : "file.txt"; print c > f}' inputfile
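A quick sanity check of that last version on the letter example (the file name letters is just for illustration; file.txt and dup.txt land in the current directory):

printf 'a\na\nb\nc\nc\nd\ne\ne\n' > letters
awk '{p=c; c=$0} length(p)==0 {print c > "file.txt"} length(p) {f = (p==c) ? "dup.txt" : "file.txt"; print c > f}' letters
cat file.txt    # a b c d e
cat dup.txt     # a c e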

Yet another way of doing the same thing (note that, like the first suggestion, this one keys on $2 seen anywhere in the file rather than on strictly consecutive repeats)...

awk '{print $0 >> (x[$2]?"dups":"nodups");x[$2]=$0}' file

Oops, I forgot something.