I am trying to use awk to match the contents of each line in file1 with $5 in file2. Both files are tab-delimited, and there may be a space or special character in the name being matched in file2. For example, in file1 the name is BRCA1 but in file2 the name is BRCA 1, or in file1 the name is BCR but in file2 the name is BCR/ABL.

If there is a match with $5 of file2 and $7 has full gene sequence in it, then $5 and $4 are printed separated by a tab. If no match is found, then the name that was not matched and 279 are printed separated by a tab. The awk below does execute, but the output is not correct. Also, I am not sure how to add the condition that ensures $7 contains full gene sequence.
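
One way the $7 condition could be expressed, as a minimal sketch rather than a tested solution: lowercase the field and test it against the stem "full gene sequenc", so that both "full gene sequence ..." and "full gene sequencing" pass. The three sample strings are illustrative (the mixed-case one is made up to show the case folding), and each input line stands in for a $7 value.

```shell
# Each input line plays the role of one $7 value; only the first two should pass.
out=$(printf '%s\n' \
    'full gene sequence and full deletion/duplication analysis' \
    'Full Gene Sequencing' \
    'del/dup' |
awk '{
    # "full gene sequenc" is a stem, so "sequence" and "sequencing" both match
    ok  = (tolower($0) ~ /full gene sequenc/)
    tag = ok ? "keep" : "skip"
    print tag "\t" $0
}')
printf '%s\n' "$out"
```

In a real pass over file2 the same test would be applied to $7 instead of $0.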
The names in file2 may be a partial match to file1, but in file1 they will always be complete, as with BRCA1 in file1 matching BRCA 1, BRCA2 in file2. The full gene sequence text in $7 may also be partial in file2, as is the case for BRCA1. file2 is not a controlled document, so the case may differ, as in fbn1 from file1 matching FBN1 in file2. The awk seems close, but not all conditions are accounted for. Thank you :).
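
To make the name rules above concrete, here is a small sketch of how a $5 value might be turned into comparable keys: fold to upper case, drop spaces, then split on "/" and ",". Treating the comma as a separator is my assumption, based on the BRCA 1, BRCA2 example; each input line stands in for one $5 value.

```shell
out=$(printf '%s\n' 'BRCA 1, BRCA2' 'BCR/ABL' 'FBN1' |
awk '{
    key = toupper($0)                 # fold case so fbn1 and FBN1 compare equal
    gsub(/ /, "", key)                # "BRCA 1" becomes "BRCA1"
    n = split(key, part, "[/,]")      # "/" and "," both separate names
    line = $0 " ->"
    for (i = 1; i <= n; i++)
        line = line " " part[i]
    print line
}')
printf '%s\n' "$out"
```

Each piece in part[] could then serve as a lookup key, with the file1 name upper-cased the same way before the lookup.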
awk 'BEGIN{FS=OFS="\t"}
FNR==NR{
    if(NR>1){
        gsub(" ","",$5)       # removing white space
        n=split($5,v,"/")
        d[v[1]] = $4          # from split, first element as key
    }
    next
}
{print $1, ($1 in d?d[$1]:279)}' file2 file1
current output
BRCA1 279
BCR 806
SCN1A 279
fbn1 85
desired output
BRCA1 81
BCR 806
SCN1A 279
fbn1 85
file1
BRCA1
BCR
SCN1A
fbn1
file2
Tier explanation . List code gene gene name methodology disease
Tier 1 . . 811 DMD dystrophin deletion analysis and duplication analysis, if performed Publication Date: January 1, 2014 Duchenne/Becker muscular dystrophy
Tier 1 . Jan-16 81 BRCA 1, BRCA2 breast cancer 1 and 2 full gene sequence and full deletion/duplication analysis hereditary breast and ovarian cancer
Tier 1 . Jan-16 70 ABL1 ABL1 gene analysis variants in the kinse domane acquired imatinib tyrosine kinase inhibitor
Tier 1 . . 806 BCR/ABL 1 t(9;22) major breakpoint, qualitative or quantitative chronic myelogenous leukemia CML
Tier 1 . Jan-16 85 FBN1 Fibrillin full gene sequencing heart disease
Tier 1 . Jan-16 95 FBN1 fibrillin del/dup heart disease
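
Putting the stated rules together, a hedged sketch of the whole pass might look like the following. It rebuilds small tab-delimited copies of the two files in a temporary directory; the column split of file2 (code in $4, gene in $5, methodology in $7) and the header names are my assumptions, since the tabs are not visible in the post. Note that it applies the $7 condition strictly, so BCR falls through to 279 because that row's assumed $7 lacks the phrase. Both output listings in the post show BCR 806, so the condition may need to be relaxed for rows like that one.

```shell
tmp=$(mktemp -d)
printf '%s\n' BRCA1 BCR SCN1A fbn1 > "$tmp/file1"
# Assumed layout: 8 tab-separated fields (hypothetical header names)
{
  printf 'Tier\texplanation\tList\tcode\tgene\tgene name\tmethodology\tdisease\n'
  printf 'Tier 1\t.\tJan-16\t81\tBRCA 1, BRCA2\tbreast cancer 1 and 2\tfull gene sequence and full deletion/duplication analysis\thereditary breast and ovarian cancer\n'
  printf 'Tier 1\t.\t.\t806\tBCR/ABL\tt(9;22)\tmajor breakpoint, qualitative or quantitative\tchronic myelogenous leukemia CML\n'
  printf 'Tier 1\t.\tJan-16\t85\tFBN1\tFibrillin\tfull gene sequencing\theart disease\n'
  printf 'Tier 1\t.\tJan-16\t95\tFBN1\tfibrillin\tdel/dup\theart disease\n'
} > "$tmp/file2"

out=$(awk 'BEGIN { FS = OFS = "\t" }
FNR == NR {                                   # first file on the command line: file2
    # skip the header; keep only rows whose $7 mentions full gene sequenc(e|ing)
    if (FNR > 1 && tolower($7) ~ /full gene sequenc/) {
        key = toupper($5)                     # case-insensitive keys
        gsub(/ /, "", key)                    # "BRCA 1" becomes "BRCA1"
        n = split(key, part, "[/,]")          # one key per listed name
        for (i = 1; i <= n; i++)
            if (!(part[i] in code))
                code[part[i]] = $4            # keep the first qualifying code
    }
    next
}
{                                             # second file: file1, one name per line
    key = toupper($1)
    print $1, (key in code ? code[key] : 279) # 279 when nothing matched
}' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample rows above this prints BRCA1 with 81, BCR and SCN1A with 279, and fbn1 with 85. It prints the name as written in file1 rather than $5, which is what both output listings show.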