Remove duplicate lines

zhshqzyc · April 29, 2011, 4:39pm

Hi, I have a huge file, about 50GB, with many lines. The file format looks like this:

21 rs885550 0 9887804 C C T C C C C C C C
21 rs210498 0 9928860 0 0 C C 0 0 0 0 0 0
21 rs303304 0 9941889 A A A A A A A A A A 
22 rs303304 0 9941890 0 A A A A A A A A A

The problem is that there are a few duplicate rows. A row counts as a duplicate when its second column repeats. The entire line does not have to be identical, just the second column.
In the above example, lines 3 and 4 are duplicates. They have the same string

rs303304

We can remove either of them.
The delimiter could be a space or a tab.
After removing the duplicate lines, the result should be saved in a new file.

Thanks.

mirni · April 29, 2011, 6:16pm

awk '{a[$2]++}a[$2]==1' input > output

This prints only the first line of each set of duplicates and ignores the others.
This approach builds an array, though, so memory use grows with the number of distinct values in the second column. It can be done much more cheaply if we can guarantee that the duplicate lines are clustered together. In that case, something like this:

awk 'last!=$2{print}{last=$2}' input > output
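
To see how the two one-liners differ when the duplicates are not adjacent, here is a small sketch on made-up input (the keys rsA and rsB and the helper name `sample` are invented for illustration):

```shell
# Made-up sample: key rsA appears on lines 1 and 3, separated by rsB.
sample() { printf '%s\n' '21 rsA 0 1' '21 rsB 0 2' '22 rsA 0 3'; }

# Array version: remembers every key seen so far, so the late repeat is dropped.
sample | awk '{a[$2]++} a[$2]==1'
# prints:
# 21 rsA 0 1
# 21 rsB 0 2

# Adjacent-only version: remembers just the previous key, so the repeat survives.
sample | awk 'last!=$2{print}{last=$2}'
# prints all three lines
```

So the second form is only safe when equal keys are known to sit next to each other.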

tukuyomi · April 29, 2011, 6:26pm

awk 'A[$2]++==0' infile > newfile

zhshqzyc · May 2, 2011, 8:57am

There is no guarantee that the duplicate lines are clustered, and I'm worried about memory.
Can we use the uniq command?
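
On the uniq question: uniq only compares adjacent lines, and its -f option skips leading fields but still compares the rest of the line, so on its own it cannot match on column 2 only. One way around the memory worry is to sort on the key first, since implementations such as GNU sort spill to temporary files when the input is larger than their buffer. What follows is a rough sketch, not something tried on a 50GB file. The function name `dedup_by_col2` is made up, and it assumes there is enough free disk space for the temporary files of two sorts.

```shell
# dedup_by_col2 FILE: hypothetical helper. Keeps the first line for each
# distinct value of column 2 and preserves the original line order.
dedup_by_col2() {
    tab=$(printf '\t')
    # 1. Prefix each line with its line number and a copy of column 2.
    awk -v OFS='\t' '{print NR, $2, $0}' "$1" |
    # 2. Group equal keys together, earliest line first within each group.
    LC_ALL=C sort -t "$tab" -k2,2 -k1,1n |
    # 3. Keep only the first line of each group ("" forces string comparison).
    awk -F '\t' 'NR == 1 || ($2 "") != last {print} {last = $2 ""}' |
    # 4. Restore the original order, then strip the two helper columns.
    LC_ALL=C sort -t "$tab" -k1,1n |
    cut -f3-
}
```

Usage would be `dedup_by_col2 input > output`. Sorting on the key makes the duplicates clustered, which is exactly the case where the "remember only the previous key" awk filter is enough, so awk never holds more than one key at a time. With GNU sort, `-T DIR` can point the temporary files at a disk with more room.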

rdcwayx · May 2, 2011, 8:56pm

Even shorter:

awk '!a[$2]++' infile
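
For anyone puzzling over why `!a[$2]++` works: the array entry is 0 (false) the first time a key is seen, so the negation is true and awk prints the line, and the post-increment then marks the key as seen. The same logic spelled out step by step, as a sketch only (`first_by_col2` and `seen` are made-up names):

```shell
# first_by_col2 [FILE...]: hypothetical wrapper, reads stdin if no file is given.
first_by_col2() {
    awk '{
        key = $2                 # awk splits on runs of spaces and tabs by default
        if (!(key in seen)) {    # first time this key shows up
            seen[key] = 1        # remember it
            print                # and keep the line
        }
    }' "$@"
}
```

Calling `first_by_col2 infile > newfile` should give the same result as the one-liners above. Like them, it keeps one array entry per distinct key in memory.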