I am trying to merge the awk command below, which compares two files looking for a match in $2 and then prints the line if two conditions are met.
awk
awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2]}}' F113.txt F113_tvc.bed
This code was improved and provided by @RavinderSingh13, thank you very much. I have ~500 files to process, so I want to take every .txt file in /home/cmccabe/Desktop/comparison/missing and compare it to the file ending in .bed in /home/cmccabe/Desktop/comparison/test_tvc that has the same numerical prefix. Each .txt file shares a common prefix with its matching .bed file.
So if there are three files, the three .txt files in /home/cmccabe/Desktop/comparison/missing will look like:
F113.txt
H123.txt
S111.txt
and the three .bed files in /home/cmccabe/Desktop/comparison/test_tvc will look like:
F113_tvc.bed
H123_tvc.bed
S111_tvc.bed
So F113.txt would be compared to F113_tvc.bed; the matching prefix is F113.
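The pairing rule above can be sketched with shell parameter expansion. This is only a minimal sketch, assuming every .bed file is named <prefix>_tvc.bed as in the listing; the temporary directories and the `pairs` variable are hypothetical stand-ins, not part of the original scripts.

```shell
#!/bin/sh
# Stand-in directories (hypothetical); substitute the real paths.
work=$(mktemp -d)
InDir1="$work/missing"
InDir2="$work/test_tvc"
mkdir -p "$InDir1" "$InDir2"
touch "$InDir1/F113.txt" "$InDir1/H123.txt" "$InDir1/S111.txt"
# S111_tvc.bed is deliberately absent to exercise the "not found" branch.
touch "$InDir2/F113_tvc.bed" "$InDir2/H123_tvc.bed"

pairs=""
for file1 in "$InDir1"/*.txt
do
    p=${file1##*/}                  # strip the directory: F113.txt
    p=${p%.txt}                     # strip the extension: F113
    file2="$InDir2/${p}_tvc.bed"    # assumed naming scheme for the .bed file
    if [ -f "$file2" ]
    then
        pairs="$pairs$p.txt:${file2##*/} "
        printf '%s -> %s\n' "$p.txt" "${file2##*/}"
    else
        printf 'No .bed file found for %s\n' "$p.txt" >&2
    fi
done
rm -rf "$work"
```

Note that `${file1%%_*}` leaves a name such as F113.txt unchanged, because it contains no underscore; that is why the sketch strips the extension with `${p%.txt}` instead.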
If a match between the $2 values in each file is made and both conditions, if($10>30 && $11>49), are met, then the matching line from the .txt file is printed in out under "Match in both files and meet criteria:". If no match is found or the criteria are not met, then the line from the .txt file is printed in out under "Missing in comparison:".
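For a single pair of files, the rule above might look like the following. It is a sketch under stated assumptions, not a tested solution for the full data set: fields are split on whitespace, $10 and $11 refer to columns of the .bed file (as in the one-liner), and any .txt line whose $2 is not a number (such as a header) is skipped. The array names `line`, `order` and `ok` are made up for the example.

```shell
#!/bin/sh
# Sample rows copied from the post, written to a temporary directory.
work=$(mktemp -d)
cat > "$work/F113.txt" <<'EOF'
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
2 166245888 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5500G>T]+[=] 32
EOF
cat > "$work/F113_tvc.bed" <<'EOF'
Chrom Position Gene Sym Target ID Type Zygosity Genotype Ref Variant Var Freq Qual Coverage Ref Cov Var Cov
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
EOF

result=$(awk '
FNR == NR {                                   # first file: the .txt
    # Assumption: only lines with a numeric $2 are data lines.
    if ($2 ~ /^[0-9]+$/ && !($2 in line)) {
        line[$2] = $0                         # keep the whole .txt line
        order[++n] = $2                       # remember input order
    }
    next
}
# Second file: the .bed. Record a key only if it matches and passes both tests.
($2 in line) && $10 > 30 && $11 > 49 { ok[$2] }
END {
    print "Match in both files and meet criteria:"
    for (i = 1; i <= n; i++)
        if (order[i] in ok) print line[order[i]]
    print "Missing in comparison:"
    for (i = 1; i <= n; i++)
        if (!(order[i] in ok)) print line[order[i]]
}' "$work/F113.txt" "$work/F113_tvc.bed")
printf '%s\n' "$result"
rm -rf "$work"
```

Storing the keys in `order` keeps the output in the same order as the .txt file, which a plain `for (k in line)` loop would not guarantee.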
The code below, provided by @Don Cragun, works great, but since my data has changed a bit I made some updates to it.
(code that works perfectly)
IAm=${0##*/}
InDir1='/home/cmccabe/Desktop/comparison/reference/10bp'
InDir2='/home/cmccabe/Desktop/comparison/validation/files'
OutDir='/home/cmccabe/Desktop/comparison/ref_val'
cd "$InDir1"
for file1 in *.txt
do
    # Grab file prefix.
    p=${file1%%_*}
    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.vcf)
    if [ ! -f "$file2" ]
    then
        printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi
    # Create matching output filename.
    out=${file2##*/}
    out=${out%.vcf}_comparison.txt
    printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {
    FS = OFS = "\t"
}
{
    in1 = $1
    in2 = $2
    out = $3
    print "Reading from " in1
    while((getline < in1) == 1)
        f1[$2 OFS $4 OFS $5]
    close(in1)
    print "Reading from " in2
    while((getline < in2) == 1)
        f2[$2 OFS $4 OFS $5]
    close(in2)
    print "Writing to " out
    print "Match:" > out
    for(k in f1)
        if(k in f2) {
            print k > out
            delete f1[k]
            delete f2[k]
        }
    print "Missing in Reference but found in IDP:" > out
    for(k in f2) {
        print k > out
        delete f2[k]
    }
    print "Missing in IDP but found in Reference:" > out
    for(k in f1) {
        print k > out
        delete f1[k]
    }
    close(out)
    print "***"
}'
(updated version, which does not run; my changes are marked with --)
IAm=${0##*/}
InDir1='/home/cmccabe/Desktop/comparison/missing' -- updated path to .txt files
InDir2='/home/cmccabe/Desktop/comparison/test_tvc' -- updated path to .bed files
OutDir='/home/cmccabe/Desktop/comparison/final' -- updated path to output
cd "$InDir1"
for file1 in *.txt
do
    # Grab file prefix.
    p=${file1%%_*}
    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.bed) -- updated extension
    if [ ! -f "$file2" ]
    then
        printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi
    # Create matching output filename.
    out=${file2##*/}
    out=${out%.vcf}_final.txt -- updated output
    printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {
    FS = OFS = "\t"
}
{
    in1 = $1
    in2 = $2
    out = $3
    print "Reading from " in1
    while((getline < in1) == 1)
        f1[$2] -- updated to look for each $2 in the .txt file
    close(in1)
    print "Reading from " in2
    while((getline < in2) == 1)
        f2[$2] -- updated to look for each $2 from the .txt file in the matching .bed file
    close(in2)
    print "Writing to " out
    print "Match in both files and meet criteria:" > out
    for(k in f1)
        if(k in f2) {
            print k > out
            delete f1[k]
            delete f2[k]
        }
    print "Missing in comparison:" > out
    for(k in f2) {
        print k > out
        delete f2[k]
    }
    close(out)
    print "***"
}'
I am not sure how to apply the two if conditions to the matching $2 values. Below are two sample input files as well as the desired output.
file1 (F113.txt)
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
2 166245888 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5500G>T]+[=] 32
file2 (F113_tvc.bed)
Chrom Position Gene Sym Target ID Type Zygosity Genotype Ref Variant Var Freq Qual Coverage Ref Cov Var Cov
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
desired output
Match in both files and meet criteria:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
Missing in comparison:
2 166245888 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5500G>T]+[=] 32
I hope I have included enough information and thank you :).
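One possible way to keep the getline-driven structure of the script above and still apply both conditions is sketched here. It rests on several assumptions that are not confirmed by the post: data rows can be split on whitespace, the .bed files are named <prefix>_tvc.bed, non-numeric $2 values mark header lines, and the output name <prefix>_tvc_final.txt is what was intended. The temporary directories and the `row`, `c`, `ok` and `order` names are hypothetical.

```shell
#!/bin/sh
# Stand-in directories holding the sample rows from the post.
work=$(mktemp -d)
mkdir -p "$work/missing" "$work/test_tvc" "$work/final"
cat > "$work/missing/F113.txt" <<'EOF'
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
2 166245888 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5500G>T]+[=] 32
EOF
cat > "$work/test_tvc/F113_tvc.bed" <<'EOF'
Chrom Position Gene Sym Target ID Type Zygosity Genotype Ref Variant Var Freq Qual Coverage Ref Cov Var Cov
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
EOF

for file1 in "$work/missing"/*.txt
do
    p=${file1##*/}
    p=${p%.txt}                                   # F113.txt -> F113
    file2="$work/test_tvc/${p}_tvc.bed"
    [ -f "$file2" ] || continue
    # One tab-separated control line per pair: txt, bed, output.
    printf '%s\t%s\t%s\n' "$file1" "$file2" "$work/final/${p}_tvc_final.txt"
done | awk '
BEGIN { FS = "\t" }                               # control lines are tab-separated
{
    in1 = $1; in2 = $2; out = $3
    split("", f1); split("", ok); n = 0           # reset state for each pair
    # Read the .txt file, keeping each whole data line keyed by column 2.
    while ((getline row < in1) == 1) {
        split(row, c, " ")                        # whitespace split (assumption)
        if (c[2] ~ /^[0-9]+$/ && !(c[2] in f1)) {
            f1[c[2]] = row
            order[++n] = c[2]
        }
    }
    close(in1)
    # Read the .bed file; record a key only when both conditions hold.
    while ((getline row < in2) == 1) {
        split(row, c, " ")
        if ((c[2] in f1) && c[10] + 0 > 30 && c[11] + 0 > 49)
            ok[c[2]]
    }
    close(in2)
    print "Match in both files and meet criteria:" > out
    for (i = 1; i <= n; i++)
        if (order[i] in ok) print f1[order[i]] > out
    print "Missing in comparison:" > out
    for (i = 1; i <= n; i++)
        if (!(order[i] in ok)) print f1[order[i]] > out
    close(out)
}'
result=$(cat "$work/final/F113_tvc_final.txt")
printf '%s\n' "$result"
rm -rf "$work"
```

Using `getline row < file` instead of a bare `getline < file` leaves $1, $2 and $3 of the control line untouched while the data files are read.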