awk to match and apply conditions to matching files in directories

I am trying to merge the awk below, which compares two files looking for a match in $2 and then prints the line if two conditions are met.

awk

 awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2]}}' F113.txt F113_tvc.bed
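
For readers unfamiliar with the two-file idiom, the one-liner above can be spread over several lines with comments. This is only a re-layout of the same logic, not a new method; the function name print_matches is hypothetical, and whitespace-separated fields are assumed.

```shell
# print_matches TXT BED   (hypothetical helper name, for illustration only)
print_matches() {
    awk '
        # First file: FNR (line number within the file) equals NR (overall line number).
        FNR == NR { A[$2] = $0; next }   # remember each line, keyed by field 2

        # Second file: act only when field 2 was also seen in the first file.
        ($2 in A) {
            if ($10 > 30 && $11 > 49)    # both thresholds must be exceeded
                print A[$2]              # print the stored line from the first file
        }
    ' "$1" "$2"
}
```

It would be called as print_matches F113.txt F113_tvc.bed, which should behave like the original command.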

This code was improved and provided by @RavinderSingh13, thank you very much. I have ~500 files to process so I wanted to use all .txt files in /home/cmccabe/Desktop/comparison/missing and compare them to each matching numerical prefix in /home/cmccabe/Desktop/comparison/test_tvc all ending in .bed . Each filename in a directory will have a common numerical prefix:

So if there are three files, the three .txt files in /home/cmccabe/Desktop/comparison/missing will look like:

F113.txt
H123.txt
S111.txt

and the three .bed files in /home/cmccabe/Desktop/comparison/test_tvc will look like:

F113_tvc.bed
H123_tvc.bed
S111_tvc.bed

So F113.txt would be compared to F113_tvc.bed , the matching numerical prefix is F113 .

If a match between the $2 values in each file is made and both conditions ( $10>30 && $11>49 ) are met, then the matching line from the .txt file is printed in the output under Match in both files and meet criteria: . If no match is found or the criteria are not met, then the line in the .txt is printed in the output under Missing in comparison: .

The below code provided by @Don Cragun works great but since my data has changed a bit I made some updates to it:

 (code that works perfect)
IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/comparison/reference/10bp'
InDir2='/home/cmccabe/Desktop/comparison/validation/files'
OutDir='/home/cmccabe/Desktop/comparison/ref_val'

cd "$InDir1"
for file1 in *.txt
do    # Grab file prefix.
    p=${file1%%_*}

    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.vcf)
    if [ ! -f "$file2" ]
    then    printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi

    # Create matching output filename.
    out=${file2##*/}
    out=${out%.vcf}_comparison.txt

    printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {    FS = OFS = "\t"
}
{    in1 = $1
    in2 = $2
    out = $3
    print "Reading from " in1
    while((getline < in1) == 1)
        f1[$2 OFS $4 OFS $5]
    close(in1)
    print "Reading from " in2
    while((getline < in2) == 1)
        f2[$2 OFS $4 OFS $5]
    close(in2)
    print "Writing to " out
    print "Match:" > out
    for(k in f1)
        if(k in f2) {
            print k > out
            delete f1[k]
            delete f2[k]
        }
    print "Missing in Reference but found in IDP:" > out
    for(k in f2) {
        print k > out
        delete f2[k]
    }
    print "Missing in IDP but found in Reference:" > out
    for(k in f1) {
        print k > out
        delete f1[k]
    }
    close(out)
    print "***"
}'

updated version which does not run with comments marked by --

IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/comparison/missing'   -- updated path to .txt files
InDir2='/home/cmccabe/Desktop/comparison/test_tvc'  -- updated path to .bed files
OutDir='/home/cmccabe/Desktop/comparison/final'  -- updated path to output

cd "$InDir1"
for file1 in *.txt
do    # Grab file prefix.
    p=${file1%%_*}

    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.bed)  -- updated extension
    if [ ! -f "$file2" ]
    then    printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi

    # Create matching output filename.
    out=${file2##*/}
    out=${out%.vcf}_final.txt  -- updated output

    printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {    FS = OFS = "\t"
}
{  in1 = $1
    in2 = $2
    out = $3
    print "Reading from " in1
    while((getline < in1) == 1)
        f1[$2]  -- updated to look for each $2 in the .txt file
    close(in1)
    print "Reading from " in2
    while((getline < in2) == 1)
        f2[$2] -- updated to look for each $2  from the .txt file in the matching .bed file
    close(in2)
    print "Writing to " out
    print "Match in both files and meet criteria:" > out
    for(k in f1)
        if(k in f2) {
            print k > out
            delete f1[k]
            delete f2[k]
        }
    print "Missing in comparison:" > out
    for(k in f2) {
        print k > out
        delete f2[k]
    }
    close(out)
    print "***"
}'

I am not sure how to perform the two if statements on the matching $2 values. Below are two sample input files as well as the desired output.

file1 (F113.txt)

Missing in IDP but found in Reference:
2   166848646   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5139C>T]+[=] 52.94
2   166245888   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5500G>T]+[=] 32

file2 (F113.bed)

Chrom    Position    Gene Sym    Target ID    Type    Zygosity    Genotype    Ref    Variant    Var Freq    Qual    Coverage    Ref Cov    Var Cov
chr2    166245425   SCN2A   AMPL5155065355  SNP Het C/T C   T   54  100   50    23  27
chr2    166848646   SCN1A   AMPL1543060606  SNP Het        G/A   G  A   52.9411764706   100 68  32  36

desired output

Match in both files and meet criteria:
2   166848646   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5139C>T]+[=] 52.94
Missing in comparison:
2   166245888   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5500G>T]+[=] 32

I hope I have included enough information and thank you :).

Hello cmccabe,

Could you please try following and let me know how it goes then. I haven't tested it at all.

for file in "/home/cmccabe/Desktop/comparison/missing/*.txt"
do
	file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}"
	if [[ -f file1 ]]
	then
		 awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";delete A[$2]}} END{for(i in A){print A >> "out_no_match_found_values"}}'  $file $file1
	fi
done

Thanks,
R. Singh

for file in "/home/cmccabe/Desktop/comparison/missing/*.txt"
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}"
    if [[ -f file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";delete A[$2]}} END{for(i in A){print A >> "out_no_match_found_values"}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/${file}_final.txt
    fi
done

The redirection at the end of the awk command ( > /home/cmccabe/Desktop/comparison/final/${file}_final.txt ) was added to store the results of each comparison in the /home/cmccabe/Desktop/comparison/final directory.

Here is the error I get. Thank you :).

awk: cmd. line:1: fatal: cannot open file `/home/cmccabe/Desktop/comparison/test_tvc//home/cmccabe/Desktop/comparison/missing/*' for reading (No such file or directory)
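
The path in that error message points at two separate problems, sketched below. First, the quoted glob is never expanded, so the loop variable holds the literal pattern. Second, ${file%%.txt} removes only the suffix, so the directory part is carried into the new path. The scratch directory and the empty files are made up for the demonstration.

```shell
demo=$(mktemp -d)                 # scratch directory, for illustration only
touch "$demo/F113.txt" "$demo/H123.txt"

# Quoted pattern: the loop runs once, with the literal pattern as its value.
quoted_count=0
for file in "$demo/*.txt"; do
    quoted_count=$((quoted_count + 1))
    quoted_value=$file
done

# Unquoted glob (only the directory part is quoted): one iteration per file.
unquoted_count=0
for file in "$demo"/*.txt; do
    unquoted_count=$((unquoted_count + 1))
done

printf 'quoted:   %s iteration(s), value: %s\n' "$quoted_count" "$quoted_value"
printf 'unquoted: %s iteration(s)\n' "$unquoted_count"

# Suffix removal keeps the directory, so prefixing another directory doubles the path.
printf 'stripped: %s\n' "${quoted_value%%.txt}"
rm -r "$demo"
```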

Hello cmccabe,

Could you please try following and let me know if this helps.

cd /home/cmccabe/Desktop/comparison/missing 
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";delete A[$2]}} END{for(i in A){print A >> "out_no_match_found_values"}}'  $file $file1
    fi
done

Also, I am not sure why you are redirecting the awk command's output into a file. If that is what you need, then you should use the following command.

cd /home/cmccabe/Desktop/comparison/missing 
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A >> "out_no_match_found_values";print A}}'  $file $file1 > Output_final_file.txt
    fi
done

Again, I haven't tested it all so there may be a chance to tweak it a bit, kindly check it and let me know how it goes then.

Thanks,
R. Singh

I used the below:

cd /home/cmccabe/Desktop/comparison/missing
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}"
    if [[ -f file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A >> "out_no_match_found_values";print A}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

The output of the awk needs to be stored for each pair of files that are compared, as that match/difference is important to know. The command does run, but no output file is created for each pair. So if there are 3 files compared, say:

From /home/cmccabe/Desktop/comparison/missing the file F113.txt is compared to the /home/cmccabe/Desktop/comparison/test_tvc file F113_tvc.bed , and the matches and differences are then stored in /home/cmccabe/Desktop/comparison/final as F113_final.txt

From /home/cmccabe/Desktop/comparison/missing the file H123.txt is compared to the /home/cmccabe/Desktop/comparison/test_tvc file H123_tvc.bed , and the matches and differences are then stored in /home/cmccabe/Desktop/comparison/final as H123_final.txt

From /home/cmccabe/Desktop/comparison/missing the file S111.txt is compared to the /home/cmccabe/Desktop/comparison/test_tvc file S111_tvc.bed , and the matches and differences are then stored in /home/cmccabe/Desktop/comparison/final as S111_final.txt

I hope this helps and thank you very much :).

Hello cmccabe,

Could you please try following and let me know if this helps you.

cd /home/cmccabe/Desktop/comparison/missing
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "/path/to/file/Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A >> "/path/to/file/out_no_match_found_values";print A}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

Also, the files out_no_match_found_values and Match_in_both_files_and_meet_criteria are written to the current directory, /home/cmccabe/Desktop/comparison/missing , because no complete path is given for them, so you may not have seen them where you expected. In case you need them in any other path, please use an absolute path, e.g. /path/to/file/out_no_match_found_values , for these output files and then it should fly. Let me know how it goes then.

Thanks,
R. Singh

Here is what I have:

cd /home/cmccabe/Desktop/comparison/missing
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "/home/cmccabe/Desktop/comparison/missing/Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A >> "/home/cmccabe/Desktop/comparison/missing/out_no_match_found_values";print A}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

The code does run but there is no output files created in /home/cmccabe/Desktop/comparison/final .

Thank you very much :).

Please show us the output from the commands:

cd /home/cmccabe/Desktop/comparison; ls -l missing/*.txt test_tvc/*.bed

Hi,

Try changing

if [[ -f file1 ]]

to

if [[ -f $file1 ]]

Here is the code as well as the output of the ls

#!/bin/bash

for file in /home/cmccabe/Desktop/comparison/missing/*.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "/home/cmccabe/Desktop/comparison/Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A >> "/home/cmccabe/Desktop/comparison/out_no_match_found_values";print A}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done
cd /home/cmccabe/Desktop/comparison; ls -l missing/*.txt test_tvc/*.bed
-rw-rw-r-- 1 cmccabe cmccabe   756 Oct 11 16:43 missing/F113.txt
-rw-rw-r-- 1 cmccabe cmccabe  1214 Oct 11 16:43 missing/H123.txt
-rw-rw-r-- 1 cmccabe cmccabe   352 Oct 11 16:44 missing/S111.txt
-rw-rw-r-- 1 cmccabe cmccabe 12692 Oct 15 10:36 test_tvc/F113_tvc.bed
-rw-rw-r-- 1 cmccabe cmccabe 12183 Oct 11 16:33 test_tvc/H123_tvc.bed
-rw-rw-r-- 1 cmccabe cmccabe 11845 Oct 11 16:37 test_tvc/S111_tvc.bed

Taking your first .txt file as an example, let us see what your code is doing (remember that set -xv is your friend when trying to debug a shell script).

The for loop sets file to:

/home/cmccabe/Desktop/comparison/missing/F113.txt

Then you use the assignment:

    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"

which sets file1 to:

/home/cmccabe/Desktop/comparison/test_tvc//home/cmccabe/Desktop/comparison/missing/F113.bed

and then your if statement correctly determines that there is no file with that name and skips the awk statement.

So maybe you would have more luck finding files to process (and therefore producing output), if you would change:

    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"

to:

    file1=${file##*/}	# Strip off directory.
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file1%.txt}_tvc.bed"
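
The two assignments above can be checked step by step without touching any files. In this sketch the variable names base, prefix and out are introduced only for illustration, and the out name follows the F113_final.txt naming requested earlier in the thread.

```shell
file=/home/cmccabe/Desktop/comparison/missing/F113.txt

base=${file##*/}       # drop the longest leading match of "*/": F113.txt
prefix=${base%.txt}    # drop the ".txt" suffix: F113
file1="/home/cmccabe/Desktop/comparison/test_tvc/${prefix}_tvc.bed"
out="/home/cmccabe/Desktop/comparison/final/${prefix}_final.txt"

printf '%s\n' "$base" "$prefix" "$file1" "$out"
```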

I haven't even tried to figure out what your one-line awk script does, but I do note that with your sample directory listings you will be running this awk code three times and each time you run it, the output produced by the previous run will be destroyed. (Did you perhaps want >> instead of > as the redirection at the end of that script? Or maybe you want to redirect the output from the for loop to that file instead of repeatedly redirecting the output from the awk script. Which you want depends on whether you want to add to output from previous runs of your script or have each run of your script save only the results from that run.)
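
The redirection choices described above can be compared in a small sketch. The echo stands in for the awk command, and the scratch directory and file names are made up for the demonstration.

```shell
work=$(mktemp -d)                  # scratch directory, for illustration only

# ">" inside the loop: each iteration truncates the file, so only the last result survives.
for n in F113 H123 S111; do
    echo "result for $n" > "$work/inside.txt"
done

# Redirecting the loop itself: the file is opened once and collects every iteration.
for n in F113 H123 S111; do
    echo "result for $n"
done > "$work/whole_loop.txt"

# One output file per iteration, named after the prefix.
for n in F113 H123 S111; do
    echo "result for $n" > "$work/${n}_final.txt"
done

inside_lines=$(( $(wc -l < "$work/inside.txt") ))
loop_lines=$(( $(wc -l < "$work/whole_loop.txt") ))
per_pair_files=0
for f in "$work"/*_final.txt; do
    per_pair_files=$((per_pair_files + 1))
done

printf 'inside: %s line(s), whole loop: %s line(s), per-pair files: %s\n' \
    "$inside_lines" "$loop_lines" "$per_pair_files"
rm -r "$work"
```

Using >> inside the loop would also keep all three results, but it would keep appending across separate runs of the script as well.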

[Struck out by the author as incorrect:] And, despite what greet_sed said, the if statements:

    if [[ -f $file1 ]]
    if [[ -f file1 ]]

(with or without the $ ) should have exactly the same effect when using the double square bracket conditional expressions. [End of struck-out text.]

greet_sed was correct in saying that you need to use:

    if [[ -f $file1 ]]

instead of:

    if [[ -f file1 ]]

If you had been using one of the test commands:

    if [ -f "$file1" ]
    if test -f "$file1"

instead of conditional expressions, then not only would the $ be required, but also double-quotes should be added to protect against filenames containing field separation characters.

Please try out this :

Code :

ls /home/cmccabe/Desktop/comparison/missing/*.txt >/mydir/file1
cut -d '.' -f1 /mydir/file1 > /mydir/file2
ls /home/cmccabe/Desktop/comparison/test_tvc/*.bed > /mydir/file3

for i in `cat /mydir/file2`
do
   for j in `cat /mydir/file3`
    do
       echo "$j" | grep "^$i"
           if [ "$?" == "0" ]
            then
               if[ "$10" > "30" && "$11" > "49" ]
               then
               echo -e "$i\n"
               fi
           else
              echo -e "no match is found \n"
           fi
   done
done
 rm /mydir/file1 /mydir/file2 /mydir/file3

Basically, this first writes the lists of *.txt and *.bed filenames to two separate files, and writes the *.txt names with the extension removed to a third file.
It then takes the file of prefixes as the primary file and compares each of its lines (using a for loop) with each line of the *.bed list (using a nested for loop), checking whether the *.bed line starts with the prefix.
If that check succeeds (exit status 0), it goes on to check the second condition "$10>30 && $11>49", and if both are met it displays the line from the primary file; otherwise it reports that no match is found. At the end, the temporary files are removed.

Thanks,
Sanghamitra

Hello Sanghamitra C.,

Welcome to the forums, hope you will enjoy learning/sharing knowledge here. Not sure if you have tested the above code or not. There are a few points that could make the above code better.
i- echo "$j" | grep "^$i" could be changed to if [[ "$i" == "$j" ]] , because we need to check whether the file names are equal or not.
ii- if[ "$10" > "30" && "$11" > "49" ] : in the shell, $10 and $11 are not treated as fields like that; they work in this format in awk . You could use cut to take the 10th and 11th fields' values.
iii- for i in `cat /mydir/file2` and for j in `cat /mydir/file3` could be done with while loops, for example:

while read i
do
    while read j
    do
    .............(all code here)
    done < "/mydir/file3"
done < "/mydir/file2"
.............(rest of the code)
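
To illustrate point ii, here is a minimal sketch. In the shell, $10 is read as $1 followed by a literal 0, and > inside [ ] is a redirection, so neither behaves like the awk condition. One option, shown with a hypothetical helper named meets_criteria, is to hand the numeric test on fields 10 and 11 to awk, which also copes with decimals such as 52.9411764706. The first test line is taken from the sample .bed data; the second is made up so that it fails the first threshold.

```shell
# meets_criteria LINE   (hypothetical helper)
# Succeeds when field 10 is greater than 30 and field 11 is greater than 49.
meets_criteria() {
    printf '%s\n' "$1" | awk '{ exit !($10 > 30 && $11 > 49) }'
}

set -- a b c d e f g h i j                  # ten positional parameters, for illustration
printf 'unbraced: %s\n' "$10"               # $1 followed by 0
printf 'braced:   %s\n' "${10}"             # the tenth positional parameter

while read -r line; do
    if meets_criteria "$line"; then
        printf 'pass: %s\n' "$line"
    else
        printf 'fail: %s\n' "$line"
    fi
done <<'EOF'
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
chr2 100000000 GENE1 AMPL0000000000 SNP Het C/T C T 20 100 50 23 27
EOF
```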

Thanks,
R. Singh

Hi Don, that does not seem to be an accurate statement.

The $ is still required for variable expansions within double bracket expressions (as well as within single brackets (test commands); a difference would be the double quote protection that would be needed in the case of single brackets)

A situation where $-signs are not required for basic variable expansions are within arithmetic expressions, but that is not the case here.

So IMO greet_sed was right after all.
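
The difference can be seen in a small bash sketch. It changes into an empty scratch directory so that no file literally named file1 can exist there; the directory and the temporary file are made up for the demonstration.

```shell
#!/bin/bash
cd "$(mktemp -d)" || exit 1   # empty scratch directory, for illustration only
file1=$(mktemp)               # an existing regular file with a random name

if [[ -f file1 ]]; then       # looks for a file literally named "file1" in the current directory
    literal=yes
else
    literal=no
fi

if [[ -f $file1 ]]; then      # looks for the path stored in the variable
    expanded=yes
else
    expanded=no
fi

printf 'literal name: %s, expanded variable: %s\n' "$literal" "$expanded"
rm -f "$file1"
```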

Yes. You are absolutely correct. Thank you for catching this.

I have updated post #11, striking out the incorrect statements.

Here is the command that produces the attached two output files:

#!/bin/bash

for file in /home/cmccabe/Desktop/comparison/missing/*.txt
do
    file1=${file##*/}    # Strip off directory.
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file1%.txt}_tvc.bed"
    if [[ -f "$file1" ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "/home/cmccabe/Desktop/comparison/Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A >> "/home/cmccabe/Desktop/comparison/out_no_match_found_values";print A}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

Can the filename that was being compared be included in each output line? Alternatively, can each pair of files being compared have one output file that includes both the matched and the non-matched lines?

For example,
F113.txt is being compared to F113_tvc.bed and the output of that comparison is saved as prefix_final in /home/cmccabe/Desktop/comparison/final with a Match in both files and meet criteria : and Missing in comparison: section (this would be for non matches and lines where the criteria wasn't met). So using the below data from post 1 (each line is a newline):

file1 (F113.txt)

Missing in IDP but found in Reference:
2   166848646   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5139C>T]+[=] 52.94
2   166245888   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5500G>T]+[=] 32

file2 (F113.bed)

Chrom    Position    Gene Sym    Target ID    Type    Zygosity    Genotype    Ref    Variant    Var Freq    Qual    Coverage    Ref Cov    Var Cov
chr2    166245425   SCN2A   AMPL5155065355  SNP Het C/T C   T   54  100   50    23  27
chr2    166848646   SCN1A   AMPL1543060606  SNP Het        G/A   G  A   52.9411764706   100 68  32  36

desired output (F113_final.txt)

Match in both files and meet criteria:
2   166848646   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5139C>T]+[=] 52.94
Missing in comparison:
2   166245888   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5500G>T]+[=] 32

Thank you all for your help :)

Hey ,

Thanks Ravindra for the feedback.

I have couple of points which I just thought of mentioning :

  1. Regarding the command echo "$j" | grep "^$i" :

--> I did this because $j will take a value such as F113_abc_xyz.beds and $i will take the value F113, so I am checking whether the value of $j starts with F113; if it does, grep returns exit status 0.

I have provided the test I had done for this:

bash-4.3$ export i=F113_abc_efg.beds                                                                                                                          
bash-4.3$ echo $i                                                                                                                                             
F113_abc_efg.beds           
bash-4.3$ z=F113                                                                                                                                              
bash-4.3$ echo "$i" | grep "^$z"                                                                                                                              
F113_abc_efg.beds                                                                                                                                             
bash-4.3$ echo $?                                                                                                                                             
0           
bash-4.3$

  2. Regarding the 2nd point, thanks for the idea. I thought that $10 and $11 were positional parameters.

  3. Regarding the 3rd point, I really appreciate it; while read is a better option than for loops.

Thanks,
Sanghamitra

Hi cmccabe,

The following minor change helps, I hope I understood you correctly :)

file1=${file##*/}
getprefix=${file1%%.txt}
<rest of the code is same>
awk <same code> $file $file1 > ${getprefix}_final_file.txt

I have verified with given input samples using RavinderSingh13 solution and it gets your desired output as per post#16.
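
Putting the pieces from this thread together, one possible shape for the whole job is sketched below. It is an untested-on-real-data sketch, not the exact script used above. The function name compare_dir and the array names are hypothetical. It assumes whitespace-separated fields, that data lines in the .txt files have a purely numeric $2 (so the "Missing in IDP..." header line is skipped), and that the .bed files are named PREFIX_tvc.bed. The END block loops over the stored lines one element at a time instead of printing the array A itself.

```shell
# compare_dir TXT_DIR BED_DIR OUT_DIR   (hypothetical function name)
# The body runs in a subshell, so its variables do not leak into the caller.
compare_dir() (
    txt_dir=$1 bed_dir=$2 out_dir=$3
    for file in "$txt_dir"/*.txt; do
        [ -f "$file" ] || continue              # pattern matched nothing
        prefix=${file##*/}                      # e.g. F113.txt
        prefix=${prefix%.txt}                   # e.g. F113
        bed="$bed_dir/${prefix}_tvc.bed"
        if [ ! -f "$bed" ]; then
            printf 'No .bed file found for %s\n' "$file" >&2
            continue
        fi
        awk '
            # First file (.txt): keep data lines, keyed by the position in field 2.
            # Assumption: data lines have a purely numeric field 2; other lines are headers.
            FNR == NR {
                if ($2 ~ /^[0-9]+$/ && !($2 in A)) { A[$2] = $0; order[++n] = $2 }
                next
            }
            # Second file (.bed): note positions that also pass both thresholds.
            ($2 in A) && $10 > 30 && $11 > 49 { ok[$2] = 1 }
            END {
                print "Match in both files and meet criteria:"
                for (i = 1; i <= n; i++) if (order[i] in ok) print A[order[i]]
                print "Missing in comparison:"
                for (i = 1; i <= n; i++) if (!(order[i] in ok)) print A[order[i]]
            }
        ' "$file" "$bed" > "$out_dir/${prefix}_final.txt"
    done
)
```

With the directories from this thread it would be called as compare_dir /home/cmccabe/Desktop/comparison/missing /home/cmccabe/Desktop/comparison/test_tvc /home/cmccabe/Desktop/comparison/final , writing one PREFIX_final.txt per pair. Lines are printed in the order they appear in the .txt file.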

Thank you all :)