Grep or awk a unique and specific word across many fields

daashti · June 15, 2017, 5:23am

Hi there,

I have data with a structure similar to this:

CHR	START-SNP	END-SNP	REF	ALT	PATIENT1	PATIENT2	PATIENT3	PATIENT4
chr1	69511	69511	A	G	homo	hetero	homo	hetero
chr2	69513	69513	T	C	.	hetero	homo	hetero
chr3	69814	69814	G	C	.	.	homo	homo
chr4	69815	69815	C	A	hetero	.	.	hetero

Is there a way to print the whole line if a word such as homo or hetero is found across the patient columns, ignoring fields with a dot (.), which means unknown? The output should look like this:

CHR	START-SNP	END-SNP	REF	ALT	PATIENT1	PATIENT2	PATIENT3	PATIENT4
chr3	69814	69814	G	C	.	.	homo	homo
chr4	69815	69815	C	A	hetero	.	.	hetero

Thanks

RudiC · June 15, 2017, 5:58am

Not clear. You want lines to be printed to stdout if "words" (is that field contents?) other than "." occur twice (or more) / exactly twice in that line? Is that any field or fields starting from $6? Is that any "words" or just the "words" "homo" and "hetero"?

daashti · June 15, 2017, 6:18am

Lines should be printed if homo or hetero, which are field contents, is constant (more than 2) across that line, ignoring ".", starting from $6, and only for the words homo and hetero.

Thanks

RudiC · June 15, 2017, 6:40am

In the line starting with "chr1", the "homo" count is 2, as is the "hetero" count. Should that print or not, i.e. is more than one item allowed?

daashti · June 15, 2017, 6:45am

It shouldn't, as the next field has hetero. It should look at all fields from the 6th column onward.

I am trying to create two separate files, one with hetero and one with homo, if it's easier to code that way.

input:

CHR	START-SNP	END-SNP	REF	ALT	PATIENT1	PATIENT2	PATIENT3	PATIENT4
chr1	69511	69511	A	G	homo	hetero	homo	hetero
chr2	69513	69513	T	C	.	hetero	homo	hetero
chr3	69814	69814	G	C	.	.	homo	homo
chr4	69815	69815	C	A	hetero	.	.	hetero

when grep/awk for hetero
output 1:

CHR	START-SNP	END-SNP	REF	ALT	PATIENT1	PATIENT2	PATIENT3	PATIENT4
chr4	69815	69815	C	A	hetero	.	.	hetero

when grep/awk for homo
output 2:

CHR	START-SNP	END-SNP	REF	ALT	PATIENT1	PATIENT2	PATIENT3	PATIENT4
chr3	69814	69814	G	C	.	.	homo	homo

BTW, the file I have has many patient columns (PATIENT1 to PATIENT10000).

RudiC · June 15, 2017, 7:00am

Please become accustomed to providing decent context info for your problem.
It is always helpful to support a request with system info like OS and shell, related environment (variables, options), preferred tools, and adequate (representative) sample input and desired output data and the elaborate (!) logics connecting the two, to avoid ambiguities and keep people from guessing.

For your above problem, try

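awk '
        {split ("",  C)                         # clear the counts left over from the previous line
         for (i=6; i<=NF; i++) C[$i]++          # count each distinct value in fields 6..NF
         CM = C["homo"]                         # number of homo fields
         CR = C["hetero"]                       # number of hetero fields
        }
(CM > 1) && !CR ||                              # two or more homo and no hetero
(CR > 1) && !CM ||                              # two or more hetero and no homo
NR == 1                                         # always print the header line
' file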

daashti · June 16, 2017, 10:10am

Worked like a charm.

Just a question: what if I want the separation to include a count equal to or more than 1?

Do I have to modify the code to this:

awk '
        {split ("",  C)
         for (i=6; i<=NF; i++) C[$i]++
         CM = C["homo"]
         CR = C["hetero"]
        }
(CM > 0) && !CR ||
(CR > 0) && !CM ||
NR == 1
' file

Thanks

RudiC · June 16, 2017, 3:53pm

You mean ONE or more homo AND NO hetero or NO homo AND ONE or more hetero? Then, yes, that's how you do it.
I have to admit, using >= 2 (or, in your second case, 1 ) would make the logics way clearer. I might adapt to that in the future...
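
A minimal sketch of that >= form, assuming the threshold is passed in as a variable. The names min, sample.txt and filtered.txt are made up for this example and are not part of the code posted above; the sample rows are the ones from this thread, with spaces instead of tabs (awk's default field splitting accepts either).

```shell
# Write the sample data from the thread to a scratch file.
cat > sample.txt <<'EOF'
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr1 69511 69511 A G homo hetero homo hetero
chr2 69513 69513 T C . hetero homo hetero
chr3 69814 69814 G C . . homo homo
chr4 69815 69815 C A hetero . . hetero
EOF

# min is the smallest count that qualifies: 2 for the first request, 1 for the second.
awk -v min=2 '
NR == 1 { print; next }                 # always keep the header line
{
    homo = hetero = 0                   # reset the counters for every line
    for (i = 6; i <= NF; i++) {         # patient columns start at field 6
        if ($i == "homo") homo++
        else if ($i == "hetero") hetero++
    }
}
(homo >= min && hetero == 0) ||         # enough homo and no hetero
(hetero >= min && homo == 0)            # enough hetero and no homo
' sample.txt > filtered.txt

cat filtered.txt
```

With min=2 this should keep the header plus the chr3 and chr4 lines; rerunning with -v min=1 gives the "one or more" variant without touching the conditions.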

daashti · June 16, 2017, 6:12pm

Can awk work on ranges of columns, so that I can specify groups,
for example group 1 ranging from 6-60 and group 2 from 61-100?
My separation would then be based on hetero in group 1 and homo in group 2 in the same line, and vice versa. Is this separation achievable?

As always I appreciate your efforts in making SCIENCE great again

Don_Cragun · June 16, 2017, 8:18pm

daashti:

Worked like a charm.

Just a question: what if I want the separation to include a count equal to or more than 1?

Do I have to modify the code to this:
awk '
   {split ("",  C)
   for (i=6; i<=NF; i++) C[$i]++
   CM = C["homo"]
   CR = C["hetero"]
   }
(CM > 0) && !CR ||
(CR > 0) && !CM ||
NR == 1
' file
Thanks

Yes, the code above should work. In this specific case (i.e., >0 ), you could further simplify the lines:

(CM > 0) && !CR ||
(CR > 0) && !CM ||
NR == 1

to:

CM && !CR ||
CR && !CM ||
NR == 1

And, as long as the only multi-character strings in your input file (other than in the header line) are the strings homo and hetero , and you want to create two output files you could also try:

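awk -v mincnt=1 '
# mincnt is the minimum number of matches required; 1 is an assumed value, set it as needed
NR == 1 || ((nhomo = gsub(/homo/, "&")) >= mincnt && !gsub(/hetero/, "&")) {
	print > "homo.txt"
}
NR == 1 || (gsub(/hetero/, "&") >= mincnt && !nhomo) {
	print > "hetero.txt"
}' file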

Either of these suggested methods of processing your file could be modified to work on groups of fields. And, with the suggestions we have provided, you should be able to make an attempt to do so on your own. If you try it and can't quite get it to work, give us a clear specification of what you want to do, show us what you have tried to solve this update to the code we have provided, tell us where you're stuck, and we'll try to help you get to your goal.
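
For comparison, a rough sketch of a field-based way to produce the same two files. Unlike the gsub() approach, which counts matches anywhere on the line, this one only looks at fields 6 and up, so it does not depend on homo and hetero being the only multi-character strings. The variable min and the file sample.txt are assumptions made for this example; homo.txt and hetero.txt are the output names used above.

```shell
# Sample data from the thread (spaces instead of tabs; awk splits on either by default).
cat > sample.txt <<'EOF'
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr1 69511 69511 A G homo hetero homo hetero
chr2 69513 69513 T C . hetero homo hetero
chr3 69814 69814 G C . . homo homo
chr4 69815 69815 C A hetero . . hetero
EOF

awk -v min=2 '
NR == 1 {                               # the header goes into both output files
    print > "homo.txt"
    print > "hetero.txt"
    next
}
{
    homo = hetero = 0                   # reset the counters for every line
    for (i = 6; i <= NF; i++) {         # only the patient columns are counted
        if ($i == "homo") homo++
        else if ($i == "hetero") hetero++
    }
}
homo >= min && hetero == 0 { print > "homo.txt" }
hetero >= min && homo == 0 { print > "hetero.txt" }
' sample.txt
```

On the sample data this should leave the header plus the chr3 line in homo.txt and the header plus the chr4 line in hetero.txt.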

RudiC · June 17, 2017, 1:26am

Certainly. Logics might be more intricate. The for loop now runs from field 6 until the last field. Make it two loops, one from 6 to 60, the other from 61 to 100.
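
A sketch of the two-loop idea, under one reading of the request: print a line when group 1 holds hetero but no homo while group 2 holds homo but no hetero, or the other way round. That reading is an assumption, as are the variable names (a1, a2, b1, b2 for the group boundaries) and the file names. The chr5 and chr6 rows are invented purely for illustration, and the groups are shrunk to fields 6-7 and 8-9 so the example fits four patient columns; for the real file you would pass 6, 60, 61 and 100.

```shell
# Two rows from the thread plus two hypothetical rows (chr5, chr6) that satisfy the rule.
cat > groups.txt <<'EOF'
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr1 69511 69511 A G homo hetero homo hetero
chr4 69815 69815 C A hetero . . hetero
chr5 69900 69900 G T hetero hetero homo homo
chr6 69950 69950 T A homo . . hetero
EOF

# a1..a2 is group 1, b1..b2 is group 2 (field numbers).
awk -v a1=6 -v a2=7 -v b1=8 -v b2=9 '
NR == 1 { print; next }                         # always keep the header line
{
    homo1 = het1 = homo2 = het2 = 0             # reset all counters for every line
    for (i = a1; i <= a2 && i <= NF; i++) {     # first loop: group 1
        if ($i == "homo") homo1++
        else if ($i == "hetero") het1++
    }
    for (i = b1; i <= b2 && i <= NF; i++) {     # second loop: group 2
        if ($i == "homo") homo2++
        else if ($i == "hetero") het2++
    }
}
(het1 && !homo1 && homo2 && !het2) ||           # hetero only in group 1, homo only in group 2
(homo1 && !het1 && het2 && !homo2)              # or the reverse
' groups.txt > grouped.txt

cat grouped.txt
```

Here the header, chr5 and chr6 should be printed; chr1 is dropped because group 1 is mixed, and chr4 because both groups hold hetero.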