Check that the ID in a file matches the name of the file

I have a number of tab-delimited text files in my directory named 1.vcf, 2.vcf, etc. Each file has a header of 120-130 rows starting with "#"; it looks like this:

...
##contig=<ID=GL000194.1,length=191469,assembly=hg19>
##contig=<ID=GL000225.1,length=211173,assembly=hg19>
##contig=<ID=GL000192.1,length=547496,assembly=hg19>
##contig=<ID=vcontig,length=337,assembly=hg19>
##reference=human_hg19.fasta
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1
1       12012010        rs1000002       A       C       14325.14        .       AC=1;AF=0.500;AN=2;BaseQRankSum=-13...

As these files are created by an automated pipeline, I wish to introduce an ID check to see whether each file name (1.vcf, 2.vcf, ...) corresponds to the ID within the file's content.
The ID is always present in the last line of the header, after 'FORMAT'.
The files are always named according to the ID.
I have been doing this manually so far; is there a way to script it?

You mean the ID should match the file name with the "extension" .vcf stripped off? Or any other extension? What should happen if the two match? What if they don't?

And get rid of the DOS line terminators in any text files you wish to process on *nix...
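
If it helps, one common way to strip the carriage returns is tr. The sketch below runs it on a throwaway file; the file names and the cr_before/cr_after variables are made up for the demo.

```shell
#!/bin/sh
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT

# Hypothetical input with DOS (CR LF) line endings.
printf '#CHROM\tFORMAT\t1\r\n' > "$tmpdir/dos.vcf"

# tr -d deletes every carriage return; write to a new file rather than in place.
tr -d '\r' < "$tmpdir/dos.vcf" > "$tmpdir/unix.vcf"

# Count the carriage returns left in each file.
cr_before=$(tr -cd '\r' < "$tmpdir/dos.vcf" | wc -c)
cr_after=$(tr -cd '\r' < "$tmpdir/unix.vcf" | wc -c)
echo "before: $((cr_before)) after: $((cr_after))"   # prints: before: 1 after: 0
```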

I just need the name before the extension (<ids>.vcf) to match the <ids> within the text file mentioned above. The extension will always be .vcf. If the IDs don't match, the result should be "false", and I will know there has been an ID mix-up during the processing of the pipeline. This is done on Linux.

Try:

awk '$(NF-1)=="FORMAT" && $NF".vcf" != FILENAME{print FILENAME":" $NF;nextfile}' *.vcf

Or, if your .vcf files might be in DOS text file format:

awk '{sub(/\r$/,"")}$(NF-1)=="FORMAT" && $NF".vcf" != FILENAME{print FILENAME":" $NF;nextfile}' *.vcf

That should work with awk (or gawk) on a Linux system. If you want to try this on a system where awk doesn't support the nextfile statement, you can remove the ;nextfile from the script and it will work just as well, but will run a little slower.

If someone wants to try this on a Solaris/SunOS system, change awk to nawk or /usr/xpg4/bin/awk (and remove the ;nextfile).
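
As a rough illustration of what the batch check reports, the sketch below builds three throwaway files in a scratch directory and runs the portable form of the command (no nextfile). The file contents and the mismatches variable are made up for the demo. 3.vcf is written with DOS line endings to show why the carriage return is stripped first. The NF > 1 guard is an addition, not part of the command above; it keeps a blank line from making awk evaluate a negative field index.

```shell
#!/bin/sh
# Demo only: create scratch .vcf files (hypothetical contents) and run the check.
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT
cd "$tmpdir" || exit 1

# 1.vcf: ID matches; 2.vcf: ID is 44; 3.vcf: ID matches, but DOS line endings.
printf '##source=demo\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n' > 1.vcf
printf '##source=demo\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t44\n' > 2.vcf
printf '##source=demo\r\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t3\r\n' > 3.vcf

# Strip a trailing CR, then report every file whose ID differs from its name.
mismatches=$(awk '{sub(/\r$/,"")}
    NF > 1 && $(NF-1)=="FORMAT" && $NF".vcf" != FILENAME {print FILENAME":" $NF}' *.vcf)

printf '%s\n' "$mismatches"   # prints: 2.vcf:44
```

Only 2.vcf is listed, so an empty result means every file name agrees with its ID.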


Try also

awk '/^#CHROM/ {exit 1 - ($NF == substr (FILENAME, 1, index(FILENAME, ".")-1))}' 1.vcf
echo $?
0
awk '/^#CHROM/ {exit 1 - ($NF == substr (FILENAME, 1, index(FILENAME, ".")-1))}' 2.vcf
echo $?
1

or, shamelessly stealing from Don Cragun's post,

awk '/^#CHROM/ {exit 1 - ($NF ".vcf" == FILENAME)}' 2.vcf
echo $?
1

Thank you both. Don Cragun's code works for me.

@Rudi, how is it checking whether the ID matches? For example, I tried the code on this file, 2.vcf, which looks like this:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      44
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3

The code should report that the ID 44 in the file does not match the file name 2.vcf, but it just returns a value of 1.
VCF (Variant Call Format) is just a tab-delimited text file.

Hi nans,
RudiC's code and my code are intended to do different things.

My code processes all of the .vcf files in the current working directory and, for each file in which the filename and the ID do not match, prints the name of the file and the ID found in it.

RudiC's code processes one file at a time. If the filename and the ID in that file match, the exit code will be 0; if the filename and the ID do not match, the exit code will be 1. No output is printed either way; you just use the exit code of the script as a test to determine whether or not that file meets your expectations.


Exit code 0 means OK or "successful completion of the program". Any other code is to be interpreted as an error or "FALSE". As you didn't specify any desired output or way of conveying the result, this was the option of choice.
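
To turn the exit code into the "true"/"false" answer asked for, a loop along these lines might do. It is only a sketch: check_vcf is a hypothetical helper name and the scratch files are made up for the demo. The test is anchored on the #CHROM line so that a ##FORMAT=... meta line, like those in the sample 2.vcf above, cannot trigger it early.

```shell
#!/bin/sh
# Demo only: scratch files with hypothetical contents.
tmpdir=$(mktemp -d) || exit 1
trap 'rm -rf "$tmpdir"' EXIT
cd "$tmpdir" || exit 1

printf '##FORMAT=<ID=GT>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n' > 1.vcf
printf '##FORMAT=<ID=GT>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t44\n' > 2.vcf

# check_vcf (hypothetical helper): exit status 0 if the ID matches, 1 otherwise.
# Expects a bare file name, because FILENAME is compared with "<ID>.vcf".
check_vcf() {
    awk '/^#CHROM/ {exit 1 - ($NF ".vcf" == FILENAME)}' "$1"
}

# Prints "1.vcf: true" and then "2.vcf: false" for the demo files.
for f in *.vcf; do
    if check_vcf "$f"; then
        echo "$f: true"
    else
        echo "$f: false"
    fi
done
```

Note that a file with no #CHROM line at all also yields exit code 0, so that case may deserve a separate test.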
