awk to print fields that match using conditions and a default value for non-matching in two files

cmccabe · March 18, 2017, 10:30am

I am trying to use awk to match the contents of each line in file1 with $5 in file2. Both files are tab-delimited, and the name being matched in file2 may contain a space or special character. For example, in file1 the name is BRCA1 but in file2 it is BRCA 1, or in file1 the name is BCR but in file2 it is BCR/ABL.

If there is a match in $5 of file2 and $7 of that line contains full gene sequence, then the matched name and $4 are printed, separated by a tab. If no match is found, then the name that was not matched and 279 are printed, separated by a tab. The awk below does execute, but the output is not correct. I am also not sure how to add the condition that $7 must contain full gene sequence.

The names in file2 may be a partial match to file1, but in file1 they will always be complete, as with BRCA1 in file1, which matches BRCA 1, BRCA2 in file2. The full gene sequence text in $7 may also be only part of the field in file2, as is the case for BRCA1. file2 is not a controlled document, so the case may differ, as in fbn1 from file1 matching FBN1 in file2. The awk seems close, but not all conditions are accounted for. Thank you :).

awk 'BEGIN{FS=OFS="\t"}
  FNR==NR{
      if(NR>1){
          gsub(" ","",$5)       #removing white space
          n=split($5,v,"/")
          d[v[1]] = $4          #from split, first element as key
      }
      next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1

current output:

BRCA1	279
BCR	806
SCN1A	279
fbn1	85

desired output:

BRCA1	81
BCR	806
SCN1A	279
fbn1	85

file1

BRCA1
BCR
SCN1A
fbn1

file2

Tier	explanation	.	List code	gene	gene name	methodology	disease
Tier 1	.	.	811	DMD	dystrophin	deletion analysis and duplication analysis, if performed Publication Date: January 1, 2014	Duchenne/Becker muscular dystrophy
Tier 1	.	Jan-16	81	BRCA 1, BRCA2	breast cancer 1 and 2	full gene sequence and full deletion/duplication analysis	hereditary breast and ovarian cancer
Tier 1	.	Jan-16	70	ABL1	ABL1	gene analysis variants in the kinse domane	acquired imatinib tyrosine kinase inhibitor
Tier 1	.	.	806	BCR/ABL 1 	t(9;22)	major breakpoint, qualitative or quantitative	chronic myelogenous leukemia CML
Tier 1	.	Jan-16	85	FBN1	Fibrillin	full gene sequencing	heart disease
Tier 1	.	Jan-16	95	FBN1	fibrillin	del/dup	heart disease

Chubler_XL · March 19, 2017, 4:34pm

Try these changes:

awk 'BEGIN{FS=OFS="\t"}
{$0=toupper($0)}
FNR==NR{
   if(NR>1 && ($7 ~ "FULL GENE SEQUENC")) {
          gsub(" ","",$5)       #removing white space
          n=split($5,v,"/")
          d[v[1]] = $4          #from split, first element as key
      }
      next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1

cmccabe · March 19, 2017, 5:08pm

I made a typo in one of the file1 lines: BRCA1 should be BRCA2.

file1

BRCA2
BCR
SCN1A
fbn1

current output:

awk 'BEGIN{FS=OFS="\t"}
{$0=toupper($0)}
FNR==NR{
   if(NR>1 && ($7 ~ "FULL GENE SEQUENC")) {
          gsub(" ","",$5)       #removing white space
          n=split($5,v,"/")
          d[v[1]] = $4          #from split, first element as key
      }
      next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1
BRCA2    279
BCR    279
SCN1A    279
FBN1    85

The full gene sequence text could also be in any case, so I added a check for that. Why isn't FULL GENE SEQUENCE used as the pattern? When I try that, I get all the names with a value of 279.

awk 'BEGIN{FS=OFS="\t"}
{$0=toupper($0)} {$7=toupper($7)}
FNR==NR{
   if(NR>1 && ($7 ~ "FULL GENE SEQUENC")) {
          gsub(" ","",$5)       #removing white space
          n=split($5,v,"/")
          d[v[1]] = $4          #from split, first element as key
      }
      next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1
BRCA2    279 
BCR    279
SCN1A    279
FBN1    85

desired output

BRCA2    81   - match in line 2 of $5 in file 2, BRCA 1, BRCA2
BCR    279     - match in line 2 of $5 in file but $7 is not full gene sequence
SCN1A    279
fbn1    85

Thank you :).

Chubler_XL · March 19, 2017, 5:34pm

The pattern is cut short at FULL GENE SEQUENC on purpose, so that it matches both full gene sequencing and full gene sequence, as both appear in the data.
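
As a quick check of that point, here is a minimal sketch that tests both patterns against the two spellings of $7 found in file2. The variable names s, a, b and out are only for illustration.

```shell
# Feed the two $7 spellings from file2 to awk and test both patterns.
out=$(printf '%s\n' \
  'full gene sequence and full deletion/duplication analysis' \
  'full gene sequencing' |
awk '{
    s = toupper($0)                                          # same case folding as in the thread
    a = (s ~ /FULL GENE SEQUENC/)  ? "match" : "no match"    # truncated pattern
    b = (s ~ /FULL GENE SEQUENCE/) ? "match" : "no match"    # full word
    print a, b
}')
printf '%s\n' "$out"
```

The first column is the truncated pattern and the second is the full word. Only the truncated pattern matches the full gene sequencing line, which is why the FBN1 row is lost when FULL GENE SEQUENCE is used.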

cmccabe · March 19, 2017, 5:48pm

Makes sense now, thank you.

In order to capture BRCA2 from file1 with BRCA 1, BRCA2 from file2, would:

 gsub(" ","",$5)       #removing white space

have to be changed to:

 gsub(" ","",""$5)       #removing white space

to capture the matching name anywhere in $5? I believe it is currently only looking at the first position, but BRCA2 is in position 2.

Or is the problem that $7 is full gene sequence and full deletion/duplication analysis, so that full gene sequence is only a partial match to the whole of $7? Thank you :).

Chubler_XL · March 20, 2017, 9:47pm

You could go through all the values in $5, splitting on comma (,) and slash (/) as separators.

awk 'BEGIN{FS=OFS="\t"}
{$0=toupper($0)}
FNR==NR{
   if(NR>1 && ($7 ~ "FULL GENE SEQUENC")) {
          gsub(" ","",$5)       #removing white space
          n=split($5,v,"[,/]")
          for(i=1; i<=n; i++)
             d[v[i]] = $4          # from split, use each element as a key
      }
      next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1
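
For anyone who wants to try this end to end, below is a self-contained sketch built on the sample data from this thread. It assumes the loop stores each split element with v[i]. It also differs from the script above in one way that is my own choice: it uppercases only the lookup key instead of $0, so the name from file1 is printed in its original case (fbn1 rather than FBN1). The scratch directory and the names genes, key, dir and out are only for illustration.

```shell
# Recreate the sample files from the thread in a scratch directory (tab-delimited).
dir=$(mktemp -d)
printf '%s\n' BRCA2 BCR SCN1A fbn1 > "$dir/file1"
{
printf 'Tier\texplanation\t.\tList code\tgene\tgene name\tmethodology\tdisease\n'
printf 'Tier 1\t.\t.\t811\tDMD\tdystrophin\tdeletion analysis and duplication analysis, if performed Publication Date: January 1, 2014\tDuchenne/Becker muscular dystrophy\n'
printf 'Tier 1\t.\tJan-16\t81\tBRCA 1, BRCA2\tbreast cancer 1 and 2\tfull gene sequence and full deletion/duplication analysis\thereditary breast and ovarian cancer\n'
printf 'Tier 1\t.\tJan-16\t70\tABL1\tABL1\tgene analysis variants in the kinse domane\tacquired imatinib tyrosine kinase inhibitor\n'
printf 'Tier 1\t.\t.\t806\tBCR/ABL 1 \tt(9;22)\tmajor breakpoint, qualitative or quantitative\tchronic myelogenous leukemia CML\n'
printf 'Tier 1\t.\tJan-16\t85\tFBN1\tFibrillin\tfull gene sequencing\theart disease\n'
printf 'Tier 1\t.\tJan-16\t95\tFBN1\tfibrillin\tdel/dup\theart disease\n'
} > "$dir/file2"

out=$(awk 'BEGIN { FS = OFS = "\t" }
FNR == NR {                                   # first file on the command line: file2
    if (FNR > 1 && toupper($7) ~ /FULL GENE SEQUENC/) {
        genes = toupper($5)
        gsub(" ", "", genes)                  # "BRCA 1, BRCA2" becomes "BRCA1,BRCA2"
        n = split(genes, v, "[,/]")           # split on comma or slash
        for (i = 1; i <= n; i++)
            d[v[i]] = $4                      # every gene name becomes a key
    }
    next
}
{                                             # second file: file1
    key = toupper($1)                         # case-insensitive lookup only
    print $1, (key in d ? d[key] : 279)       # name is printed as written in file1
}' "$dir/file2" "$dir/file1")
printf '%s\n' "$out"
rm -rf "$dir"
```

With the sample data this gives BRCA2 81, BCR 279, SCN1A 279 and fbn1 85, tab-separated, which is the desired output posted earlier in the thread. BCR stays at 279 because its row in file2 does not have full gene sequence in $7.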

cmccabe · March 23, 2017, 12:30pm

Thank you very much for all your help :).