awk to compare each file in two directores by storing in variable

cmccabe · September 30, 2016, 2:58pm

In the below bash I am trying to read each file from a specific directory into a variable REF or VAL . Then use those variables in an awk to compare each matching file from REF and VAL . The filenames in the REF are different then in the VAL , but have a common id up until the _ I know the awk portion works as I can run files through them manually. However, I am not sure if I am doing the variable part correctly. Thank you :).

The two variables are:

REF comes from /home/cmccabe/Desktop/comparison/reference/10bp and is 3 files:

 S1234_ref.txt
 A5678_ref.txt
 T1111_ref.txt

VAR comes from

/home/cmccabe/Desktop/comparison/validation/files

and is 3 files:

 S1234_panel.vcf
 A5678_panel.vcf
 T1111_panel.vcf

So, S1234_ref.txt would be REF and S1234_panel.vcf would be VAL , then those two files would be compared. I think I am close but not too sure about the variables.

 REF=/home/cmccabe/Desktop/comparison/reference/10bp
VAL=/home/cmccabe/Desktop/comparison/validation/files
awk -F'\t' -v OFS='\t' 'FNR==1 { next }
      FNR == NR { file1[$2,$4,$5] = $2 FS $4 FS $5 }
      FNR != NR { file2[$2,$4,$5] = $2 FS $4 FS $5 }
            END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
            print "Missing in Reference but found in IDP:"; for (k in file2) if (!(k in file1)) print file2[k]
            print "Missing in IDP but found in Reference:"; for (k in file1) if (!(k in file2)) print file1[k]
      }'$REF $VAL > /home/cmccabe/Desktop/comparison/ref_val/concordance.txt

RavinderSingh13 · September 30, 2016, 3:08pm

Hello cmccabe,

Could you please try to keep those both variables values into a single Input_file like as follows.

cat Input_file
S1234_ref.txt S1234_panel.vcf
A5678_ref.txt A5678_panel.vcf
T1111_ref.txt T1111_panel.vcf

Also to get the above Input_file you could use paste var_file ref_file > Input_file
Then you could try to read their values by a loop and try to put your awk inside it.

while read file1 file2
      awk '{your_logic}' $file1  $file2
done < "Input_file"

Could you please try it out and let us know how it goes then. If you have more requirements in here, kindly post with full details on same.

Thanks,
R. Singh

cmccabe · September 30, 2016, 4:09pm

Not sure if I follow completly, but here is what I did:

/home/cmccabe/Desktop/comparison/ref_val/out

out file used as input

F13_ref_FP_10bp.txt F13_epilepsy.vcf
M29_ref_FP_10bp.txt M29_epilepsy
H68_ref_FP_10bp.txt H68_marfan.vcf
T42_ref_FP_10bp.txt T42_epilepsy.vcf
H19_ref_FP_10bp.txt H19_marfan.vcf
T48_ref_FP_10bp.txt T48_marfan.vcf

while read file1 file2
         awk -F'\t' -v OFS='\t' 'FNR==1 { next }
         FNR == NR { file1[$2,$4,$5] = $2 FS $4 FS $5 }
         FNR != NR { file2[$2,$4,$5] = $2 FS $4 FS $5 }
              END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
              print "Missing in Reference but found in IDP:"; for (k in file2) if (!(k in file1)) print file2[k]
              print "Missing in IDP but found in Reference:"; for (k in file1) if (!(k in file2)) print file1[k]
}'/home/cmccabe/Desktop/comparison/reference/10bp/$file1 /home/cmccabe/Desktop/comparison/validation/files/$file2 > /home/cmccabe/Desktop/comparison/ref_val/concordance.txt
done < "/home/cmccabe/Desktop/comparison/ref_val/out"

I give the full path to each:
file1
file2
out

Thank you :).

RavinderSingh13 · September 30, 2016, 4:16pm

cmccabe:

Not sure if I follow completly, but here is what I did:

/home/cmccabe/Desktop/comparison/ref_val/out

out file used as input

F13_ref_FP_10bp.txt F13_epilepsy.vcf
M29_ref_FP_10bp.txt M29_epilepsy
H68_ref_FP_10bp.txt H68_marfan.vcf
T42_ref_FP_10bp.txt T42_epilepsy.vcf
H19_ref_FP_10bp.txt H19_marfan.vcf
T48_ref_FP_10bp.txt T48_marfan.vcf

while read file1 file2
   awk -F'\t' -v OFS='\t' 'FNR==1 { next }
   FNR == NR { file1[$2,$4,$5] = $2 FS $4 FS $5 }
   FNR != NR { file2[$2,$4,$5] = $2 FS $4 FS $5 }
   END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
   print "Missing in Reference but found in IDP:"; for (k in file2) if (!(k in file1)) print file2[k]
   print "Missing in IDP but found in Reference:"; for (k in file1) if (!(k in file2)) print file1[k]
}'/home/cmccabe/Desktop/comparison/reference/10bp/$file1 /home/cmccabe/Desktop/comparison/validation/files/$file2 > /home/cmccabe/Desktop/comparison/ref_val/concordance.txt
done < "/home/cmccabe/Desktop/comparison/ref_val/out"

I give the full path to each:
file1
file2
out

Thank you :).

Hello cmccabe,

As we have requested mutiple times, kindly do mention complete details about what's happening and what's not. Though you have used code as I suggested in post#2 but you haven't mention either that approach worked or not. Because you haven't showed us what are there in those Input_files(how Input_file looks with sample data) it is difficult to tell that you command will work or not, please let us know the complete details about your requirements with sample Input_file and ecpected output of your requirements with error messages(if any) you code by running any suggestions or your own codes, I hope this helps.

Thanks,
R. Singh

cmccabe · September 30, 2016, 4:33pm

I am not sure what you mean by create an input file , I created the input as the filenames of REF in $1 of and the filenames of VAL in $2 . I then gave the full path to the location of each filename in $1 and the full path to each filename in $2 . After giving the full paths to file1 file2 and input , I used the awk , it ran but nothing happened and no output was produced. I was trying to use your code but just misunderstood it. I hope this helps and thank you :).

RudiC · October 1, 2016, 8:23am

How about

for FN in $REF; do awk '...' $FN ${FN%%_*}*.vcf ; done

You may need to add the path to the .vcf files, though.

I have to second RavinderSingh13 in that all your specifications/requests need to be WAY clearer and more detailed/data driven!

cmccabe · October 1, 2016, 9:44am

@RudiC and @RavinderSingh13, thank you both for all of your help.

it looks like the script reads all the vcf files from REF and puts them in a variable FN . How do the txt files from VAL get used by the awk . The awk looks at each REF file and compares it to each VAL file looking for what's common and what's different. If a difference is found it identifies which file the missing data came from. The awk portion works on individual files, but I have over 500 to compare so a loop would help, however that is what I need help with :).

REF there are 250 files all located at /home/cmccabe/Desktop/comparison/reference/10bp

F13_ref_FP_10bp.txt
H19_ref_FP_10bp.txt

Data structure in REF

Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    Coverage    Score    A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    Ins(#F,#R)    Del(#F,#R)    SNP    Mutation    Frequency    Sanger
12    52200340    52200340    A    C    exonic    SCN8A    4129    28.3    1560;1672    413;453    0;0    0;0    0;2    31;0        c.[5070A>C]+[=]    20.97    
2    51254914    51254914    C    T    exonic    NRXN1    1562    25.5    0;0    536;218    0;0    574;234    0;0    0;0        c.[498G>A]+[=]    51.73    
X    67433722    67433722    C    T    exonic    OPHN1    2747    25.6    0;0    46;37    0;0    1211;1443    1;8    5;5        c.[579G>A]+[579G>A]    96.61

VAL there are 250 files all located at /home/cmccabe/Desktop/comparison/validation/files

F13_epilepsy.vcf
H19_marfan.vcf

Data structure in VAL

Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
chr1    43395635    43395635    C    T    exonic    SLC2A1    .    synonymous SNV    SLC2A1:NM_006516:exon5:c.588G>A:p.P196P    rs2229682    0.23    0.12    0.024    0.21    0.08    0.19    0.15    0.18    0.044    0.19    0.074    0.23    0.21    0.19    0.19    0.15    0.049    0.2    0.12    -0.1558    -0.594    .    .    .    .    .    .    .    .    .    .    .    .    Benign    not_specified    RCV000081436.5    MedGen    CN169374    GOOD    399    het    19
chr1    43396414    43396414    G    A    exonic    SLC2A1    .    synonymous SNV    SLC2A1:NM_006516:exon4:c.399C>T:p.C133C    rs11537641    0.24    0.14    0.08    0.21    0.1    0.19    0.16    0.19    0.094    0.2    0.098    0.24    0.21    0.2    0.2    0.16    0.096    0.2    0.14    -0.0227    -0.121    .    .    .    .    .    .    .    .    .    .    .    .    Benign    not_specified    RCV000081433.6    MedGen    CN169374    GOOD    400    het    21
chr1    172410967    172410967    G    A    exonic    PIGC    .    nonsynonymous SNV    PIGC:NM_002642:exon2:c.796C>T:p.P266S,PIGC:NM_153747:exon2:c.796C>T:p.P266S    rs1063412    0.66    0.45    0.06    0.54    0.66    0.6    0.57    0.55    0.14    0.64    0.64    0.59    0.58    0.57    0.57    0.42    0.15    0.56    0.41    .    .    0.13    T    1.0    D    1.0    D    0.000    D    0.000    P    1.515    L    .    .    .    .    .    GOOD    399    het    19

desired output (example not using these files that compares a REF file to a VAL file and finds what's in common, what's different, and where the difference comes from, it includes some additional data as well from another script)

Match:
Chr    Start    Ref    Alt    Func.refGene    Gene.refGene    Quality    Reads    Zygosity    Phred
chr15    68521889    C    T    exonic    CLN6    GOOD    50    het    4
chr7    147183143    A    G    intronic    CNTNAP2    GOOD    382    het    22
chr2    167099158    A    G    exonic    SCN9A    GOOD    210    hom    55
Missing in Reference but found in IDP:
Chr    Start    Ref    Alt    Func.refGene    Gene.refGene    Quality    Reads    Zygosity    Phred
chr2    51666313    T    C    intergenic    NRXN1,NONE    GOOD    108    het    7
chr2    166903445    T    C    exonic    SCN1A    GOOD    400    het    28
Missing in IDP but found in Reference:
Chr    Start    Ref    Alt    Func.refGene    Gene.refGene    Mutation Call    Coverage    Score    Mutant Allele Frequency    A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    ins(#F,#R)    del(#F,#R)    SNP db_ref    Region    
2    166210776    C    T    exonic    SCN2A    c.[2994C>T]+[=]    3095    23.1    24.56    0:0    1158:1177    0;0    457;303    1;0    0;0        No low coverage
7    148106478    -    GT    intronic    CNTNAP2    c.3716-5_3716-4insGT    4168    28.6    51.01    0;0    0;1    0;0    2199;1967    1129;997    0;1    rs60451214    No low

I hope this helps and apologize for the long post but think these are all the details. Thank you :).

Don_Cragun · October 1, 2016, 10:06am

What output file is (or output files are) supposed to be created?

It looks like you want to run your awk script 250 times each with a pair of input files, but the output from each of those 250 runs goes to a single output file and that single output file is overwritten (not appended to) each time your awk script is run.

Is the intent to create 250 different output files with a name corresponding to the names of the input files, or do you want one output file containing the concatenated contents of the 250 individual file comparisons? If it is one output file that you want, does there need to be some header to each section of the output specifying the input files processed to produce the following section of data in that output file? And, if so, what is the format of that header?

cmccabe · October 1, 2016, 10:14am

The awk does run on pairs and after running the awk is to create 250 different output files with a name corresponding to the names of the input files. Thank you :).

Don_Cragun · October 1, 2016, 10:20am

And what are the names of those output files supposed to be???

cmccabe · October 1, 2016, 10:42am

If the below are the files being used in the awk , comparing F13_ref_FP_10bp.txt to F13_epilepsy.vcf then the output should be F13_epilepsy_comparison.txt .

H19_ref_FP_10bp.txt to H19_marfan.vcf then the output should be H19_marfan_comparison.txt . The input into the awk and the output is tab delimited. Thank you :).

REF

F13_ref_FP_10bp.txt
H19_ref_FP_10bp.txt

VAL

F13_epilepsy.vcf
H19_marfan.vcf

Don_Cragun · October 1, 2016, 12:49pm

If we go back to post #3 in this thread, you said that you have a file named out containing pairs of input filenames:

F13_ref_FP_10bp.txt F13_epilepsy.vcf
M29_ref_FP_10bp.txt M29_epilepsy
H68_ref_FP_10bp.txt H68_marfan.vcf
T42_ref_FP_10bp.txt T42_epilepsy.vcf
H19_ref_FP_10bp.txt H19_marfan.vcf
T48_ref_FP_10bp.txt T48_marfan.vcf

and from post #1 we have pairs of filenames:

S1234_ref.txt S1234_panel.vcf
A5678_ref.txt A5678_panel.vcf
T1111_ref.txt T1111_panel.vcf

From these examples am I correct in assuming that we can skip creation of the out file and just look at /home/cmccabe/Desktop/comparison/validation/files/*_*.txt files and know that there will be a corresponding file in /home/cmccabe/Desktop/comparison/validation/files/ with a name that has the same unique string before the first underscore character and ending in the string .vcf ? Note that the name from post #3 quoted above marked in red does not end in .vcf . Was that a typo, or do some names in that directory not end in .vcf ?

Is the assumption that there is only one file with the string before the first underscore in both of those directories correct? Or, do you want a script that depends on you creating a file named /home/cmccabe/Desktop/comparison/ref_val/out that contains lines containing three values ( file1 filename, file2 filename, and output file filename)?

Is it OK to use an awk script that processes all 250 sets of input files in one invocation instead of invoking awk 250 times?

cmccabe · October 1, 2016, 1:12pm

All of the REF files end in .txt and are located in a folder at /home/cmccabe/Desktop/comparison/reference/10bp .

All of the VAL files end in .vcf and are located in a folder at /home/cmccabe/Desktop/comparison/validation/files . I did have a typo in post #3.

Yes, there will only be one file in each separate directory with the string before the first _ . So there is no need for an out file other then to know which samples were processed.

There may not always be 250 sets of input, that # is variable, but yes all of them can be processed at once rather than each set individually. is that what you mean? Thank you :).

RudiC · October 1, 2016, 2:56pm

No, unless the statement on REF 's contents in post#1 was not true. Given it IS true, FN assumes three file names ending in .txt

They are not. The proposal assumes that for every .txt - file name's prefix ID a respective .vcf file exists, possibly in another path (as mentioned in my post).

The operation of the awk script is not the topic of this thread, nor is the desired output.

Please! check if any of the proposals hitherto provide the needed input file pairs, might be adaptable to also provide the needed output file name, and comment on their aptitude.

And, please please please, get some structure into your future requests and relieve us from guessing!

Don_Cragun · October 1, 2016, 5:56pm

The sample inputs and output shown in post #7 in this thread shows headers in both input files and in the output file that are not handled by the code shown in any of your posts. The output shown in post #7 does not contain any data shown in either of the input files in post #7. The format of the output shown in post #7 seems to have complete lines from the input files, but your code only copies three fields from the input files to the output file. So, with absolutely no idea what is really supposed to be done by your awk code and no data that can be used for testing, the following has undergone completely unrealistic testing, but may give you an idea of how to write a shell and awk script that grabs related files from two directories and produces related output files in a third directory.

#!/bin/ksh
IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/comparison/reference/10bp'
InDir2='/home/cmccabe/Desktop/comparison/validation/files'
OutDir='/home/cmccabe/Desktop/comparison/ref_val'

cd "$InDir1"
for file1 in *.txt
do	# Grab file prefix.
	p=${file1%%_*}

	# Find matching file2.
	file2=$(printf '%s' "$InDir2/$p"_*.vcf)
	if [ ! -f "$file2" ]
	then	printf '%s: No single file matching %s found.\n' "$IAm" \
		    "$file1" >&2
		continue
	fi

	# Create matching output filename.
	out=${file2##*/}
	out=${out%.vcf}_comparison.txt

	printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {	FS = OFS = "\t"
}
{	in1 = $1
	in2 = $2
	out = $3
	print "Reading from " in1
	while((getline < in1) == 1)
		f1[$2 OFS $4 OFS $5]
	close(in1)
	print "Reading from " in2
	while((getline < in2) == 1)
		f2[$2 OFS $4 OFS $5]
	close(in2)
	print "Writing to " out
	print "Match:" > out
	for(k in f1)
		if(k in f2) {
			print k > out
			delete f1[k]
			delete f2[k]
		}
	print "Missing in Reference but found in IDP:" > out
	for(k in f2) {
		print k > out
		delete f2[k]
	}
	print "Missing in IDP but found in Reference:" > out
	for(k in f1) {
		print k > out
		delete f1[k]
	}
	close(out)
	print "***"
}'

This was written and tested using a Korn shell, but this will also work with bash or any other shell that uses Bourne shell syntax AND performs parameter expansions required by the POSIX standards.

cmccabe · October 3, 2016, 1:19pm

Thank you very much for all of your help