Remove rows with e column values

Hi All,

I have a big file with 232 columns and 9 million rows, I want to delete all rows with same column values in col3 through col232. Also the output should be sorted based on first 2 columns.

Here is a reduced example with 6 columns. I want to remove rows with duplicate values in col3 through col6.

chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T 

expected output

chr1 98   A A A T
chr1 123 A T T -
chr1 234 A T G C
chr2 123 A A T T 

deleted rows

chr1 567 T T T T
chr2 46 T T T T

Thanks, please help.

Are you expecting to delete more, or keep more? Might have an effect on how this is approached.
Also, you are not concerned about rows deleted? That was just to show how you wanted to solve this?

I am expecting to keep more rows, and the rows with duplicate values in columns are not meaningful to my analysis.

But, in theory, it should work....
The first command seems to work correctly - building T T T T
But the second is not quite working.

$ cat sample17.txt | cut -d" " -f3- | sort | uniq -d >sample17a.txt

$ cat sample17.txt |  egrep -v -f sample17a.txt
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98   A A A T
chr2 46 T T T T
chr2 123 A A T T

the option uniq -d extracts duplicate rows, I want to compare the value of columns in a single row.

For example I want to delete all of the following lines

A A A A
T T T T
- - - -

So do you mean to:

  • remove all rows that have exactly the same value in cols 3 through 232, and
  • remove all rows that have duplicate values in cols 3 through 6, and
  • sort the output on cols 1 an 2 ?

I want to remove all rows that have exactly the same value in cols 3 through 232.
Col 3 through 6 was just a shortened example of cols 3 through 232 in the main file.

This has nothing to do with duplicate rows, they should be there.

Yes, the output should be sorted by col2 first and then col1,
so all rows with value 'chr1' in col1 should always appear before rows with the value 'chr2' in col1.

Valid output

chr1 232 A A G C
chr1 789 T T T - 
chr2 345 A A G C
chr3 456 A A G C

Invalid output rows

chr3 678 A A A A
chr5 765 G G G G
chr6 433 - - - -

With your sample of 6 cols, try:

awk 'gsub(FS $3,"&")<4' infile | sort -k2,2n -k1,1

(So 4 should be 230 for 232 cols..)

--
Please view this link to learn the use of code tags.

Thanks a lot.

I had to change the sort parameters to meet my requirements.

awk 'gsub(FS $3,"&")<4' infile | sort -k 1,1 -k2,2n

OK, good, I had it sorted this way because of this remark: