awk to update file based on 5 conditions

cmccabe · January 21, 2017, 2:26pm

I am trying to use awk to update the below tab-delimited file based on 5 different rules/conditions. The final output is also
tab-delimited and each line in the file will meet one of the conditions. My attemp is below as well though I am not very confident in it. Thank you :).

Condition 1: The field Classification has a default value of "VUS" for all lines in file

Condition 2: The CLINSIG field updates Classification with the value in it if it hasa lenghth < 12 , else it is Conflicting is the result

since it is possible for this field to have multiple strings in it I used the greatest single value "Likely Benign" and if the value in the field exceeds 12 characters
then "Conflicting" is the result, the multiple values are also separated by | symbol

Condition 3: If the Func.IDP.refGene = UTR then the value of Classification is Likely Benign ,
unleess CLINSIG had a value already

Condition 4: If the PopFreqMax > .01 then If the Classification is Likely Benign else it is VUS ,
unleess CLINSIG had a value already

Condition 5: If Func.IDP.refGene = spicing AND GeneDetail.IDP.refGene has +/- > 10
then the value of Classification is Likely Benign , unleess CLINSIG had a value already

Thank you :).

file

R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    GeneDetail.IDP.refGene    AAChange.IDP.refGene    PopFreqMax    CLINSIG    CLNDBN    Classification    Quality
1    chr1    40562993    40562993    T    C    UTR5    NM_000310.3:c.-83A>G    .    0.9    .    .    .    15
2    chr5    125887685    125887685    C    T    splicing    NM_001201377.1:exon14:c.1233+28G>A    .    0.82    .    .    .    10
3    chr16    2105400    2105400    C    T    splicing    NM_000548.4:exon6:c.482-3C>T    .    0.21    not provided|not provided|not provided|not provided|other|Benign    TSC    .    25
4    chr16    2110805    2110805    G    A    exonic    .    TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q    .004    Pathogenic    TSC    .    40

Descri[tion of fields

awk 'NR==1{for(i=1;i<=NF;i++){print "Number of field in terms of NF is--> NF-" NF-i", value is-->" $i}}' file
Number of field in terms of NF is--> NF-13, value is-->R_Index
Number of field in terms of NF is--> NF-12, value is-->Chr
Number of field in terms of NF is--> NF-11, value is-->Start
Number of field in terms of NF is--> NF-10, value is-->End
Number of field in terms of NF is--> NF-9, value is-->Ref
Number of field in terms of NF is--> NF-8, value is-->Alt
Number of field in terms of NF is--> NF-7, value is-->Func.IDP.refGene
Number of field in terms of NF is--> NF-6, value is-->GeneDetail.IDP.refGene
Number of field in terms of NF is--> NF-5, value is-->AAChange.IDP.refGene
Number of field in terms of NF is--> NF-4, value is-->PopFreqMax
Number of field in terms of NF is--> NF-3, value is-->CLINSIG
Number of field in terms of NF is--> NF-2, value is-->CLNDBN
Number of field in terms of NF is--> NF-1, value is-->Classification
Number of field in terms of NF is--> NF-0, value is-->Quality

# default classification to "VUS" 
awk -F'\t' -v OFS='\t' 'NR>1{$(NF-1)="VUS"} 1' file > vus

# check clinvar
awk -F'\t' -v OFS='\t' '{if ($(NF-3=length(<12)=$NF-3) else "Conflicting" 1' vus > clinvar

# UTR check
awk -F'\t' -v OFS='\t' '{if ($(NF-7="UTR")="Likely Benign") else $NF-3} 1' clinvar > utr

# check PopFreq
awk -F'\t' -v OFS='\t' '{if ($(NF-4 > .01)($(NF-1}="Likely Benign")} 1' utr > popfreq

# splicing check
awk -F'\t' -v OFS='\t' '{if ($(NF-7="splicing") AND ($(NF-6)=+/1) else $NF-3} 1' popfreq > final

desired output

R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    GeneDetail.IDP.refGene    AAChange.IDP.refGene    PopFreqMax    CLINSIG    CLNDBN    Classification    Quality
1    chr1    40562993    40562993    T    C    UTR5    NM_000310.3:c.-83A>G    .    0.9    .    .    Likely Benign    15
2    chr5    125887685    125887685    C    T    splicing    NM_001201377.1:exon14:c.1233+28G>A    .    0.82    .    .    Likely Benign    10
3    chr16    2105400    2105400    C    T    splicing    NM_000548.4:exon6:c.482-3C>T    .    0.21    not provided|not provided|not provided|not provided|other|Benign    TSC    Conflicting    25
4    chr16    2110805    2110805    G    A    exonic    .    TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q    .004    Pathogenic    TSC    Pathogenic    40

Don_Cragun · January 21, 2017, 6:52pm

Do you need the five intermediate files or is one awk script that reads file and just produces final sufficient?

When you say "My attemp(sic) is below as well though I am not very confident in it.", do you mean that you're getting syntax errors that you are unable to fix, or do you mean that you're getting output but you aren't confident that the results meet your specifications?

cmccabe · January 21, 2017, 7:10pm

Just the final is sufficient. What I mean by my attempts is that when I run the first awk for condition 1, I get the desired result of all N$NF-1 fields defaulting to VUS . The other 4 awks do not run and have syntax errors in them.

The intermediate files were only to verify that everything was as expected. As I am not proficient and it seems to be good practice, I tend to write more code then is needed to make sure. Hopefully as I learn more and improve that will be less. Thank you :).

Don_Cragun · January 21, 2017, 10:59pm

Conditions 3 through 5 all say: "unleess(sic) CLINSIG had a value already". Please explain what this means.

First: every field has a value (it might be an empty string, but that is still a value; but there are no empty fields in your sample input either).

Second: the only field that is assigned any value in your scripts and in your requirements is the Classification field, so to what does already refer? Did I miss something where one of these other fields is supposed to be assigned a value while running your scripts?

Third: Condition 1 unconditionally sets Classification to the string VUS . Condition 2 unconditionally sets Classification either to the value of the CLINSIG field or the string Likely Benign . Then Condition 3 may reset Classification to the string Likely Benign . Then Condition 4 sets Classification to the string Likely Benign or to the string VUS . And, finally, Condition 5 may reset Classification to the string Likely Benign . From this description of what your five conditions do, there is no need to perform Conditions 1, 2, or 3 (since Condition 4 always sets one of two values and Condition 5 might change the value Condition 4 set). What am I missing?

cmccabe · January 22, 2017, 8:16am

If CLINSIG has a value in it of Benign, then for the particular line the Classification is Benign . So I guess if CLINSIG has a value in it then that is the Classification and none of the conditions are applied.

My actual dataset does have some fields with null values in it. but I filled in those with . , that is what I mean by there are no empty fields.

I am not sure what you mean by already, you are correct that only Classification is assigned a value.

The first condition does unconditionally set Classification to VUS . That seems to be working.

Condition 2 sets the Classification to the value of CLINSIG unless CLINSIG has multiple values in it like in line 3 of file (if there are multiple values in it they are separated by a | . If this is the case then Classification is Conflicting .

Condition 3 is only run on the lines where Func.IDP.refGene = UTR then the value of Classification is Likely Benign ,
if CLINSIG is set to VUS , if it is any other value it is not changed.

Condition 4 is if the PopFreqMax > .01 then the Classification is Likely Benign unless there is a value other
then VUS there. If there is that means CLINSIG had a value already.

Condition 5 is only run if Func.IDP.refGene = spicing AND GeneDetail.IDP.refGene has +/- symbol in it AND the number
after it is > 10 then the value of Classification is Likely Benign, unless there is a value other
then VUS there. If there is that means CLINSIG had a value already.

Thank you :).

Don_Cragun · January 22, 2017, 11:44pm

cmccabe:

If CLINSIG has a value in it of Benign, then for the particular line the Classification is Benign . So I guess if CLINSIG has a value in it then that is the Classification and none of the conditions are applied.

My actual dataset does have some fields with null values in it. but I filled in those with . , that is what I mean by there are no empty fields.

I am not sure what you mean by already, you are correct that only Classification is assigned a value.

The first condition does unconditionally set Classification to VUS . That seems to be working.

Condition 2 sets the Classification to the value of CLINSIG unless CLINSIG has multiple values in it like in line 3 of file (if there are multiple values in it they are separated by a | . If this is the case then Classification is Conflicting .

Condition 3 is only run on the lines where Func.IDP.refGene = UTR then the value of Classification is Likely Benign ,
if CLINSIG is set to VUS , if it is any other value it is not changed.

Condition 4 is if the PopFreqMax > .01 then the Classification is Likely Benign unless there is a value other
then VUS there. If there is that means CLINSIG had a value already.

Condition 5 is only run if Func.IDP.refGene = spicing AND GeneDetail.IDP.refGene has +/- symbol in it AND the number
after it is > 10 then the value of Classification is Likely Benign, unless there is a value other
then VUS there. If there is that means CLINSIG had a value already.

Thank you :).

There are so many conflicting requirements here that I am completely confused. From two of your paragraphs above:

The 2nd paragraph quoted above says that if a line in the input file contains an | character in the CLINSIG in a line, the Classification field in the output for that line is to be set to Conflicting . Otherwise, the Classification field in the output for that line is to be set to the string that was in the CLINSIG field in that input line (even if that string was an empty string or just contains a period character ( . ). And, according to the last sentence in the 1st paragraph above, once this has been done Conditions 1, 3, 4, and 5 are always to be ignored. I'm sure that isn't what you mean, but it is what you have repeatedly stated.

You talked about empty fields above, but not in earlier posts. Whether or not a field is empty, it has a value. And, you have stated that if a field's value contains less than 12 characters (which certainly includes 0 if that field's value is an empty string or 1 if that field's value is a period), then that value becomes the Classification field's value.

Since I can't make any sense out of your stated requirements, let me see if I can restate your requirements in a way that makes sense to me, hoping that I capture what you intended to say.

Requirements:
Perform the following tests in sequence until the stated condition for a test evaluates to TRUE. For the 1st test whose condition evaluates to TRUE, set the value of the Classification field in that output line to the corresponding stated value:

If the value of the CLINSIG field contains a | character, set the Classification to Conflicted .
If the value of the CLINSIG field is not an empty string, is not the string . , and is not the string VUS ; set the Classification field to the value of the CLINSIG field.
If the value of the CLINSIG field is the string VUS and the value of the Func.IDP.refGene field is the string UTR , set the Classification field to the string Likely Benign .
If the value of the CLINSIG field is the string VUS and the value of the PopFreqMax field is greater than .01 , set the Classification field to the string Likely Benign .
If the value of the CLINSIG field is the string VUS , the value of the Func.IDP.refGene is the string splicing or the string spicing , and the absolute value of the GeneDetail.IDP.refGene field is greater than 10 ; set the Classification field to the string Likely Benign .
If none of the above tests succeeded, set the Classification field to the string VUS .

Would code written to meet the above restatement of your requirements do what you want.

cmccabe · January 23, 2017, 2:34am

Requirements:
Perform the following tests in sequence until the stated condition for a test evaluates to TRUE. For the 1st test whose condition evaluates to TRUE, set the value of the Classification field in that output line to the corresponding stated value:

If the value of the CLINSIG field contains a | character, set the Classification to Conflicted .

If the value of the CLINSIG field is not an empty string, is not the string . , and is not the string VUS ; set the Classification field to the value of the CLINSIG field.

If the value of the CLINSIG field is the string VUS and the value of the Func.IDP.refGene field is the string UTR , set the Classification field to the string Likely Benign .

If the value of the CLINSIG field is the string VUS and the value of the PopFreqMax field is greater than .01 , set the Classification field to the string Likely Benign .

If the value of the CLINSIG field is the string VUS , the value of the Func.IDP.refGene is the string splicing or the string spicing , and the absolute value of the GeneDetail.IDP.refGene field is greater than 10 ; set the Classification field to the string Likely Benign .

If none of the above tests succeeded, set the Classification field to the string VUS .

Would code written to meet the above restatement of your requirements do what you want.

If the value of CLINSIG is nothing will the code execute? So if CLINSIG has no value and Func.IDP.refGene is UTR is CLASSIFICATION still Likely Benign ?

CLINSIG may not have a value in it, if it doesnt will the code execute. Otherwise it is perfect. Thank you very much :).

Don_Cragun · January 23, 2017, 3:01am

As I said before, there is no such thing as a field not having a value. If there are no characters in a field, the value of that field is an empty string. And, yes, test 2 explicitly fails if the CLINSIG field is an empty string. And since an empty string does not contain | and an empty string is not the same as the string VUS , tests 1, 3, 4, and 5 also fail leaving test 6 to set the Classification field to VUS .

Now that you have a clearer specification for your script, can you write the awk script to perform those tests and produce your desired output? Try to write it and let us see what you come up with. If you get stuck, we'll try to help you fix your problems.

cmccabe · January 23, 2017, 1:53pm

I will attempt to write an awk that can be used with these requirements. It may be similar to post one in structure, but that is only because I am not yet proficient enough, only the final classified file is needed. Thank you :).

---------- Post updated at 12:53 PM ---------- Previous update was at 05:07 AM ----------

Below is my awk for the first 3 conditions as well as the sixth. I know that it needs work but, hopefully I got the basic concepts. Thank you :).

awk

awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 == "|"));$(NF-1)="Conflicted"} 1' file > conflicted # condition 1
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 ~ /^(.|VUS)$/ );$(NF-1)=($(NF-3))} 1' conflicted > clinsig # condition 2
awk -F'\t' -v OFS='\t' 'NR>1{ if ($(NF-3 == "VUS"));$(NF-7)="UTR";$(NF-1)="Likely Benign"} 1' clinsig > utr # condition 3
awk -F'\t' -v OFS='\t' 'NR>1{$(NF-1)="VUS"} 1' #condition 6

Don_Cragun · January 24, 2017, 1:32pm

You still seem to be having trouble with awk syntax...

if ($(NF-3 == "|"))

does not look for a pipe symbol somewhere in the 3rd from the last field, it looks for a non-empty, non-zero value in field 0 (which with your sample input is ALWAYS true). This is because the number NF-3 will always be a number and will never be the character | ; so the value in parentheses will always be false (i.e. 0) and since $0 is not an empty string and is not a string that just contains (or evaluates to) the value 0, the expression evaluates to true. I used:

if(index($f["CLINSIG"], "|"))

for this test, but you could also just use:

if($f["CLINSIG"] ~ /|/)

All of your other if statement expressions have similar problems.

Something more like:

awk '
BEGIN {	# Set input and output field separators:
	FS = OFS = "\t"
	# Create list of needed field headers:
	nfh["Classification"]
	nfh["CLINSIG"]
	nfh["PopFreqMax"]
	nfh["Func.IDP.refGene"]
}
NR == 1 {
	# Create array to tranlate needed field headers to field numbers:
	for(i = 1; i <= NF; i++)
		if($i in nfh)
			f[$i] = i
	# Verify that all of the needed field headers were found:
	for(i in nfh)
		if(!(i in f)) {
			missing++
			printf("Needed field missing: %s\n", i)
		}
	# If one or more needed fields were not found, give up:
	if(missing)
		exit 1
}
NR > 1 {# Test #1:
	#for(i in nfh) printf("NR=%d: f[\"%s\",%d]=\"%s\"\n",
	#    NR, i, f, $f)
	if(index($f["CLINSIG"], "|"))
		$f["Classification"] = "Conflicting"
	else {	#Test #2:
		if($f["CLINSIG"] != "" && $f["CLINSIG"] != "." &&
		    $f["CLINSIG"] != "VUS")
		    	$f["Classification"] = $f["CLINSIG"]
		else	# Tests 3, 4, & 5:
			if($f["CLINSIG"] == "VUS" && ( \
			    ($f["Func.IDP.refGene"] == "UTR") || \
			    ($f["PopFreqMax"] > .01) || \
			    ($f["Func.IDP.refGene"] ~ /^spl?icing$/) \
			))	$f["Classification"] = "Likely Benign"
			else	# Test #6:
				$f["Classification"] = "VUS"
	}
	#printf(" out: f[\"%s\",%d]=\"%s\"\n", "Classification",
	#    f["Classification"], $f["Classification"])
}
1' file > final

with your sample input file (with each sequence of four <space> characters changed to a <tab> character) produces the output:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	GeneDetail.IDP.refGene	AAChange.IDP.refGene	PopFreqMax	CLINSIG	CLNDBN	Classification	Quality
1	chr1	40562993	40562993	T	C	UTR5	NM_000310.3:c.-83A>G	.	0.9	.	.	VUS	15
2	chr5	125887685	125887685	C	T	splicing	NM_001201377.1:exon14:c.1233+28G>A	.	0.82	.	.	VUS	10
3	chr16	2105400	2105400	C	T	splicing	NM_000548.4:exon6:c.482-3C>T	.	0.21	not provided|not provided|not provided|not provided|other|Benign	TSC	Conflicting	25
4	chr16	2110805	2110805	G	A	exonic	.	TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q	.004	Pathogenic	TSC	Pathogenic	40

or, with the debugging statements uncommented:

R_Index	Chr	Start	End	Ref	Alt	Func.IDP.refGene	GeneDetail.IDP.refGene	AAChange.IDP.refGene	PopFreqMax	CLINSIG	CLNDBN	Classification	Quality
NR=2: f["Classification",13]="."
NR=2: f["PopFreqMax",10]="0.9"
NR=2: f["Func.IDP.refGene",7]="UTR5"
NR=2: f["CLINSIG",11]="."
 out: f["Classification",13]="VUS"
1	chr1	40562993	40562993	T	C	UTR5	NM_000310.3:c.-83A>G	.	0.9	.	.	VUS	15
NR=3: f["Classification",13]="."
NR=3: f["PopFreqMax",10]="0.82"
NR=3: f["Func.IDP.refGene",7]="splicing"
NR=3: f["CLINSIG",11]="."
 out: f["Classification",13]="VUS"
2	chr5	125887685	125887685	C	T	splicing	NM_001201377.1:exon14:c.1233+28G>A	.	0.82	.	.	VUS	10
NR=4: f["Classification",13]="."
NR=4: f["PopFreqMax",10]="0.21"
NR=4: f["Func.IDP.refGene",7]="splicing"
NR=4: f["CLINSIG",11]="not provided|not provided|not provided|not provided|other|Benign"
 out: f["Classification",13]="Conflicting"
3	chr16	2105400	2105400	C	T	splicing	NM_000548.4:exon6:c.482-3C>T	.	0.21	not provided|not provided|not provided|not provided|other|Benign	TSC	Conflicting	25
NR=5: f["Classification",13]="."
NR=5: f["PopFreqMax",10]=".004"
NR=5: f["Func.IDP.refGene",7]="exonic"
NR=5: f["CLINSIG",11]="Pathogenic"
 out: f["Classification",13]="Pathogenic"
4	chr16	2110805	2110805	G	A	exonic	.	TSC2:NM_000548.4:exon11:c.1110G>A:p.Q370Q	.004	Pathogenic	TSC	Pathogenic	40

Note that I had a typo in Test 1 in post #6:

If the value of the CLINSIG field contains a | character, set the Classification to Conflicted.

should have been:

If the value of the CLINSIG field contains a | character, set the Classification to Conflicting.

The code above implements this corrected Test 1.

Note that the output for input lines 2 and 3 has the Classification field set to VUS because none of Tests 1 through 5 are met by the data in those lines. Your desired output requested that that field be set to Likely Benign for both of those lines, but I don't see how either of those lines meet your Conditions 2 through 5 (my Tests 3 through 5) which are the tests that would set the Classification field to Likely Benign .

Note also that my code processes the header line to locate the names of the fields that are used in the tests and works using field names instead of trying to decipher what is supposed to be in $(NF - whatever).

cmccabe · January 26, 2017, 8:29am

Thank you very much for your help