CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr1 69511 69511 A G homo hetero homo hetero
chr2 69513 69513 T C . hetero homo hetero
chr3 69814 69814 G C . . homo homo
chr4 69815 69815 C A hetero . . hetero
Is there a way to print the whole line if the same word, homo or hetero, is found across the columns, ignoring fields containing a dot (.), which means unknown? The output should look like this:
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr3 69814 69814 G C . . homo homo
chr4 69815 69815 C A hetero . . hetero
Not clear. You want lines to be printed to stdout if "words" (is that field contents?) other than "." occur twice (or more) / exactly twice in that line? Is that any field or fields starting from $6? Is that any "words" or just the "words" "homo" and "hetero"?
Lines should be printed if homo or hetero, which are field contents, are constant (more than 2) across the line, ignoring ".", starting from $6, and only for the words homo and hetero.
It shouldn't, as the next field has hetero. It should look at all fields from the 6th column on.
I am trying to create two separate files, one with hetero and one with homo, if it's easier to code that way.
input:
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr1 69511 69511 A G homo hetero homo hetero
chr2 69513 69513 T C . hetero homo hetero
chr3 69814 69814 G C . . homo homo
chr4 69815 69815 C A hetero . . hetero
when grep/awk for hetero
output 1:
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr4 69815 69815 C A hetero . . hetero
when grep/awk for homo
output 2:
CHR START-SNP END-SNP REF ALT PATIENT1 PATIENT2 PATIENT3 PATIENT4
chr3 69814 69814 G C . . homo homo
BTW, the file I have has many patient columns (PATIENT1 to PATIENT10000).
Please become accustomed to providing decent context info for your problem.
It is always helpful to support a request with system info like OS and shell, related environment (variables, options), preferred tools, and adequate (representative) sample input and desired output data, plus the elaborate (!) logic connecting the two, to avoid ambiguities and keep people from guessing.
For your above problem, try
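awk '
{split ("", C)                      # clear the per-line counts
for (i=6; i<=NF; i++) C[$i]++       # count each value in the patient fields
CM = C["homo"]                      # number of homo fields in this line
CR = C["hetero"]                    # number of hetero fields in this line
}
(CM > 1) && !CR ||                  # two or more homo and no hetero
(CR > 1) && !CM ||                  # two or more hetero and no homo
NR == 1                             # always print the header line
' file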
You mean ONE or more homo AND NO hetero or NO homo AND ONE or more hetero? Then, yes, that's how you do it.
I have to admit, using >= 2 (or, in your second case, >= 1) would make the logic way clearer. I might adapt to that in the future...
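To make the threshold explicit, the minimum count can be passed in as a variable. This is only a minimal sketch; the function name filter_uniform and the awk variable min are assumptions, not something from the posts above:

```shell
# filter_uniform MIN FILE
# (hypothetical helper; the function name and the "min" variable are assumptions)
# Prints the header plus every line whose patient fields (6..NF) contain
# at least MIN "homo" and no "hetero", or at least MIN "hetero" and no "homo".
filter_uniform() {
    awk -v min="$1" '
    NR == 1 { print; next }                 # always keep the header
    {
        cm = cr = 0                         # reset the counters for each line
        for (i = 6; i <= NF; i++) {
            if ($i == "homo")
                cm++
            else if ($i == "hetero")
                cr++                        # "." and anything else is ignored
        }
    }
    (cm >= min && !cr) || (cr >= min && !cm)
    ' "$2"
}
```

With this, filter_uniform 2 file should behave like the > 1 version above, and filter_uniform 1 file like the "one or more" case.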
Does awk work on ranges of columns, where I can specify groups?
For example: group 1 ranging from 6-60 and group 2 from 61-100,
and my separation will be based on hetero in group 1 and homo in group 2 in the same line, and vice versa. Is this separation achievable?
As always, I appreciate your efforts in making SCIENCE great again.
Yes, the code above should work. In this specific case (i.e., > 0), you could further simplify the lines:
(CM > 0) && !CR ||
(CR > 0) && !CM ||
NR == 1
to:
CM && !CR ||
CR && !CM ||
NR == 1
And, as long as the only multi-character strings in your input file (other than in the header line) are the strings homo and hetero, and you want to create two output files, you could also try:
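One way to write the two files, as a minimal sketch: the function name split_genotypes and the output names homo.txt and hetero.txt are assumptions, and this version counts the patient fields rather than relying on the multi-character-string condition.

```shell
# split_genotypes FILE
# (hypothetical helper; the output names homo.txt and hetero.txt are assumptions)
# Writes lines with only homo among the known patient fields to homo.txt and
# lines with only hetero to hetero.txt, both in the current directory.
split_genotypes() {
    awk '
    NR == 1 {                               # the header goes to both files
        print > "homo.txt"
        print > "hetero.txt"
        next
    }
    {
        cm = cr = 0                         # reset the counters for each line
        for (i = 6; i <= NF; i++) {
            if ($i == "homo")
                cm++
            else if ($i == "hetero")
                cr++
        }
    }
    cm && !cr { print > "homo.txt" }
    cr && !cm { print > "hetero.txt" }
    ' "$1"
}
```

Run on the sample input, this should put the chr3 line into homo.txt and the chr4 line into hetero.txt, each below a copy of the header.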
Either of these suggested methods of processing your file could be modified to work on groups of fields. And, with the suggestions we have provided, you should be able to make an attempt to do so on your own. If you try it and can't quite get it to work, give us a clear specification of what you want to do, show us what you have tried to solve this update to the code we have provided, tell us where you're stuck, and we'll try to help you get to your goal.
Certainly. The logic might be more intricate. The for loop now runs from field 6 until the last field. Make it two loops, one from 6 to 60, the other from 61 to 100.
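The two-loop idea could be sketched as follows. The function name filter_groups and the variables s1, e1, s2, e2 are assumptions, and so is the reading of the requirement: a line is printed when one group holds only hetero and the other only homo (dots ignored), in either order.

```shell
# filter_groups FILE START1 END1 START2 END2
# (hypothetical helper; all names and the matching rule are assumptions)
filter_groups() {
    awk -v s1="$2" -v e1="$3" -v s2="$4" -v e2="$5" '
    NR == 1 { print; next }                 # always keep the header
    {
        m1 = r1 = m2 = r2 = 0               # homo/hetero counts per group
        for (i = s1; i <= e1 && i <= NF; i++) {     # first loop: group 1
            if ($i == "homo")
                m1++
            else if ($i == "hetero")
                r1++
        }
        for (i = s2; i <= e2 && i <= NF; i++) {     # second loop: group 2
            if ($i == "homo")
                m2++
            else if ($i == "hetero")
                r2++
        }
    }
    # hetero only in group 1 and homo only in group 2, or the other way round
    (r1 && !m1 && m2 && !r2) || (m1 && !r1 && r2 && !m2)
    ' "$1"
}
```

For the ranges mentioned above, the call would be filter_groups file 6 60 61 100.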