I have a big file with 232 columns and 9 million rows, I want to delete all rows with same column values in col3 through col232. Also the output should be sorted based on first 2 columns.
Here is a reduced example with 6 columns. I want to remove rows with duplicate values in col3 through col6.
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98 A A A T
chr2 46 T T T T
chr2 123 A A T T
expected output
chr1 98 A A A T
chr1 123 A T T -
chr1 234 A T G C
chr2 123 A A T T
Are you expecting to delete more, or keep more? Might have an effect on how this is approached.
Also, you are not concerned about rows deleted? That was just to show how you wanted to solve this?
But, in theory, it should work....
The first command seems to work correctly - building T T T T
But the second is not quite working.
$ cat sample17.txt | cut -d" " -f3- | sort | uniq -d >sample17a.txt
$ cat sample17.txt | egrep -v -f sample17a.txt
chr1 234 A T G C
chr1 567 T T T T
chr1 123 A T T -
chr1 98 A A A T
chr2 46 T T T T
chr2 123 A A T T
I want to remove all rows that have exactly the same value in cols 3 through 232.
Col 3 through 6 was just a shortened example of cols 3 through 232 in the main file.
This has nothing to do with duplicate rows, they should be there.
Yes, the output should be sorted by col2 first and then col1,
so all rows with value 'chr1' in col1 should always appear before rows with the value 'chr2' in col1.
Valid output
chr1 232 A A G C
chr1 789 T T T -
chr2 345 A A G C
chr3 456 A A G C
Invalid output rows
chr3 678 A A A A
chr5 765 G G G G
chr6 433 - - - -