Extract lines with unique value using a specific column

houkto · October 6, 2013, 10:50pm

Hi there,

I need a help with extracting data from tab delimited file which look like this

#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1
chr4 60584 . T G 1/1 1/1 0/1 1/1 0/0
chr10 7147815 . G A 0/0 1/1 0/0 0/0 ./.

I am only interested to what is unique to the mouse when compared the other species (the actual file have more species).

The desired output

#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1

I will be grateful for your help

N.B the data file is 5 Gb in size.

Thanks

sea · October 6, 2013, 11:14pm

This should work:

grep ^#CHROM file
grep chr[23] file

Where file refers to /path/to/file

houkto · October 6, 2013, 11:17pm

Hi sea,

Thanks for your reply. I might not fully explain what I want. I am interested in values in Mouse column which are unique (in a row) when compared to other species. I don't see that being address in your code.

Thanks

sea · October 6, 2013, 11:22pm

erm, then its:

grep chr[23] /path/to/file|awk '{print $9}'

Hope this helps

houkto · October 6, 2013, 11:25pm

Hi again,

Thanks for your fast reply. This might be confusing, but I am more interested in the value underneath the Mouse i.e 1/1 or whenever its unique to other species and when it does then I will be interested in column such as Chrom and POS REF ALT.

Thanks

Don_Cragun · October 7, 2013, 1:12am

houkto:

Hi there,

I need a help with extracting data from tab delimited file which look like this
#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1
chr4 60584 . T G 1/1 1/1 0/1 1/1 0/0
chr10 7147815 . G A 0/0 1/1 0/0 0/0 ./.
I am only interested to what is unique to the mouse when compared the other species (the actual file have more species).

The desired output
#CHROM POS ID REF ALT Human Cow Dog Mouse Lizard
chr2 3033 . G C 0/0 0/0 0/0 1/1 0/0
chr3 35040 . G T 0/0 0/0 ./. 1/1 0/1
I will be grateful for your help

N.B the data file is 5 Gb in size.

Thanks

You say that the file is tab delimited, but there are single space characters (rather than tabs) between fields in your sample file. Which is it?

How long is the longest line in your file? What operating system are you using and what is the LINE_MAX limit on your system. I.e. what is the output from the commands:

uname -a
getconf LINE_MAX

Are you saying that you want to print lines where the contents of the 9th field on the line is different from the contents of the 6th, 7th, 8th, and 10th fields? Is it always the 9th field that matters, is it always the field with the label Mouse in the 1st line in the file that matters, or is there some other way that your will let your script know which field matters?

Are the 1st five fields always ignored when comparing fields, or do the fields to be ignored vary?

Do you really want to print the entire line, or do you just want to print the 1st (#CHROM), 2nd (POS), 4th (REF), and 5th (ALT) fields from lines with unique Mouse data as indicated in your last message? Are those fields always in the same columns?

Subbeh · October 7, 2013, 4:04am

Based on your example input file, this should work:

awk '{for (i=6;i<=NF;i++) if ($i==$9 && i!=9) next}1' file

MadeInGermany · October 7, 2013, 4:08am

Print lines where the contents of the 9th field is different from the contents of the 6th, 7th, 8th, and 10th fields:

awk 'BEGIN{field=9} {p=1; for (species=6; species<=10; species++) if (species!=field && $species==$field) p=0} p' file

houkto · October 7, 2013, 12:07pm

Thanks MadeInGermany,

This worked like a charm. I am trying to spice things up and look for what is common between two species lets say 8th and 9th field against (that are unique) all other fields. How will this impact on your code ?

MadeInGermany · October 7, 2013, 5:59pm

Then it becomes

 awk '{p=1; if ($8!=$9) p=0; for (species=6; species<=10; species++) if (species!=8 && species!=9 && $species==$8) p=0} p'  file

Note that p=1 means to be printed, and there are two conditions that deny printing.
A bit faster is the brutal next meaning "jump to next cycle" and 1 meaning "{print}" - as was suggested by Subbeh:

awk '{if ($8!=$9) next; for (species=6; species<=10; species++) if (species!=8 && species!=9 && $species==$8) next} 1'  file