Remove rows with e column values

alpesh · June 15, 2012, 2:46pm

Hi All,

I have a big file with 232 columns and 9 million rows, I want to delete all rows with same column values in col3 through col232. Also the output should be sorted based on first 2 columns.

Here is a reduced example with 6 columns. I want to remove rows with duplicate values in col3 through col6.

chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

expected output

chr1 98   A A A T
chr1 123 A T T -
chr1 234 A T G C
chr2 123 A A T T

deleted rows

chr1 567 T T T T
chr2 46 T T T T

Thanks, please help.

joeyg · June 15, 2012, 3:01pm

Are you expecting to delete more, or keep more? Might have an effect on how this is approached.
Also, you are not concerned about rows deleted? That was just to show how you wanted to solve this?

alpesh · June 15, 2012, 3:07pm

I am expecting to keep more rows, and the rows with duplicate values in columns are not meaningful to my analysis.

joeyg · June 15, 2012, 3:40pm

But, in theory, it should work....
The first command seems to work correctly - building T T T T
But the second is not quite working.

$ cat sample17.txt | cut -d" " -f3- | sort | uniq -d >sample17a.txt

$ cat sample17.txt |  egrep -v -f sample17a.txt
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

alpesh · June 15, 2012, 4:05pm

the option uniq -d extracts duplicate rows, I want to compare the value of columns in a single row.

For example I want to delete all of the following lines

A A A A
T T T T
- - - -

Scrutinizer · June 15, 2012, 4:46pm

So do you mean to:

remove all rows that have exactly the same value in cols 3 through 232, and
remove all rows that have duplicate values in cols 3 through 6, and
sort the output on cols 1 an 2 ?

alpesh · June 15, 2012, 5:03pm

I want to remove all rows that have exactly the same value in cols 3 through 232.
Col 3 through 6 was just a shortened example of cols 3 through 232 in the main file.

This has nothing to do with duplicate rows, they should be there.

Yes, the output should be sorted by col2 first and then col1,
so all rows with value 'chr1' in col1 should always appear before rows with the value 'chr2' in col1.

Valid output

chr1 232 A A G C
chr1 789 T T T - 
chr2 345 A A G C
chr3 456 A A G C

Invalid output rows

chr3 678 A A A A
chr5 765 G G G G
chr6 433 - - - -

Scrutinizer · June 15, 2012, 5:43pm

With your sample of 6 cols, try:

awk 'gsub(FS $3,"&")<4' infile | sort -k2,2n -k1,1

(So 4 should be 230 for 232 cols..)

--
Please view this link to learn the use of code tags.

alpesh · June 15, 2012, 6:02pm

Thanks a lot.

I had to change the sort parameters to meet my requirements.

awk 'gsub(FS $3,"&")<4' infile | sort -k 1,1 -k2,2n

Scrutinizer · June 15, 2012, 6:06pm

OK, good, I had it sorted this way because of this remark: