AWK command to handle massive VCF files

Hi there,

I'm working with huge VCF files, a tab-separated format for genetic information whose header lines are marked by the hash symbol #. The body can vary, but there are nine fixed columns followed by a variable number of columns, usually one per sample analyzed.

The question is the following: I already have some basic awk commands to handle most scenarios; however, I'm now dealing with very large files (~55 GB) with 254 columns beyond the 9 present by default. What I wish to do is duplicate each column after the 9th, with the two copies of each value separated by a pipe | and, lastly, each column separated by a tab. Here is a snapshot of the file, with some numbers filling the last column, as well as the last line of the header:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  recombination

chr1    586001  >63041388>63041391      G       A       60      .       AC=80;AF=0.3125;AN=256;AT=>63041388>63041390>63041391,>63041388>63041389>63041391;NS=3;LV=0     GT    1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73|74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92|93|94|95|96|97|98|99|100|101|102|103|104|105|106|107|108|109|110|111|112|113|114|115|116|117|118|119|120|121|122|123|124|125|126|127|128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150|151|152|153|154|155|156|157|158|159|160|161|162|163|164|165|166|167|168|169|170|171|172|173|174|175|176|177|178|179|180|181|182|183|184|185|186|187|188|189|190|191|192|193|194|195|196|197|198|199|200|201|202|203|204|205|206|207|208|209|210|211|212|213|214|215|216|217|218|219|220|221|222|223|224|225|226|227|228|229|230|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251|252|253|254

The problem, as you can see, is that whatever is generating this output is collapsing all 254 columns into one... So, the end result I'm hoping for is as follows:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  recombination1    recombination2    recombination3

...    1|1    2|2    3|3    ...

ideally with each sample bearing its own header identifier e.g. recombination in this case — but really it can be anything. Thanks in advance!

If you have some basic awk knowledge, this should not be too hard (but 55GB is never going to be fast).

First, recognize that awk sees your data as only ten fields: with the default whitespace separator, the pipe-joined values form a single field. So you can split that last field into an array:

split ($10, A, "|");

Just output the first nine fields of the input with printf ("%s\t%s\t...\t%s", $1, $2, ..., $9);, noting there is no newline in that format.

Then append the elements of your split array like:

for (j = 1; j in A; ++j) printf ("\t%s|%s", A[j], A[j]);

and put a final printf ("\n");

Finally, remember to plain print any line that is blank or starts with a #.
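Putting those pieces together, a minimal sketch (input.vcf is just a placeholder for your file):

awk '
/^#/ || NF == 0 { print; next }                # pass blank and # header lines through unchanged
{
    printf ("%s", $1)                          # first nine fields, tab-separated, no newline yet
    for (i = 2; i <= 9; ++i) printf ("\t%s", $i)
    split ($10, A, "|")                        # split the pipe-joined tenth field
    for (j = 1; j in A; ++j) printf ("\t%s|%s", A[j], A[j])
    printf ("\n")
}' input.vcf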


@Paul_Pedant indeed, it works. I might need to tweak a few things to preserve the first 9 columns, but I can now get the expected results for the last one.

I don't see any need to preserve the first 9 columns. They will stay in the original $0 input line, as determined by the default whitespace separator, and keep the same $1 to $10 names. The array created by split() just uses a copy of $10 and is totally separate from the input record (numbered) fields.
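A quick illustration of that point (toy input, not your data):

echo "a|b c" | awk '{ split ($1, A, "|"); A[1] = "changed"; print $0 }'
# still prints "a|b c" - modifying A[] leaves $0 and the numbered fields untouched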

OK, I think I understand what you meant about "preserving the first 9 fields". I am suggesting two printf calls for each input line: the first outputs the 9 fields but without a newline; the second outputs the edited 254 fields. They come out as one long line simply because the first one does not contain a newline.

In awk, print adds a newline itself and inserts OFS (the output field separator) between comma-separated items. printf uses a C-like format string that gives you much more control, including specifying exactly how to format each variable and where to put separators and newlines.
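For example (toy one-liners, just to show the difference):

echo "a b" | awk 'BEGIN { OFS="\t" } { print $1, $2 }'            # a<TAB>b, newline added automatically
echo "a b" | awk '{ printf ("%s|%s", $1, $2); printf ("\n") }'    # a|b, newline only where you put one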

@Paul_Pedant I see, I figured it out; I should have read the first few lines more carefully. Thanks again.

However, is there any clever way to print a header for each of the newly generated 254 columns? It can be the same for all of them, e.g. rec, id, etc., but I really need these to replace the old recombination field, otherwise the file would be unusable.

I was thinking of doing it with sed, but I would somehow need to get the number of fields/columns after the 9th. Do you think this could be a good approach?

According to Wikipedia, all columns should be tab-separated.
Then you can do two substitutions in field 10, though it may not be faster than a loop.

awk 'BEGIN{OFS="\t"} {gsub(/[|]/, "\t", $10); gsub(/[^\t]+/, "&|&", $10); print}'
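To see what the two substitutions do, here is a toy example (nine dummy fields plus a pipe-joined tenth):

echo 'a b c d e f g h i 1|2|3' | awk 'BEGIN{OFS="\t"} {gsub(/[|]/, "\t", $10); gsub(/[^\t]+/, "&|&", $10); print}'
# output: a b c d e f g h i 1|1 2|2 3|3 (all tab-separated)

The first gsub turns the pipes in $10 into tabs, the second wraps each resulting value as value|value, and because gsub() modifies $10, awk rebuilds the whole record with OFS.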

@MadeInGermany I'm working on the header part as well. I managed to generate as many headers as there are new columns and stored them in a separate file.

They all go as follows: sample1...sample254, and are on separate lines in the file. I will need to get those into the original file, replacing the recombination field. I looked up some ways to do so, e.g.

printf "\t%s\n" "$(paste -sd'\t' headers.txt)" && cat to_rehead.vcf

the problem is how to get those in place of the recombination field for all the new columns?

awk lacks a join function (like the ones in perl and python), so a loop must assemble the output.
The following prints the header late, in the main loop, after splitting field 10:

awk '
# Set OutputFieldSeparator
BEGIN { OFS="\t" }
# Skip empty lines
!NF { next }
# Split header line to H[]
!header { header=1; split($0, H); next }
# Split field 10 to A[]
{ nA=split($10, A, "|") }
# Print header once
!hdone { hdone=1; outstr=H[1]; for (j=2; j <= 9; ++j) outstr=(outstr OFS H[j]); for (j=1; j <= nA; ++j) outstr=(outstr OFS "sample" j); print outstr }
# Reformat field 10 and print
{ outstr=""; for (j=1; j <= nA; ++j ) outstr=(outstr OFS A[j] "|" A[j]); $10=substr(outstr,2); print }
'
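If saved as, say, reformat.awk (name assumed), it could be run as

awk -f reformat.awk original.vcf > final.vcf

with original.vcf standing in for your real input file.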

@MadeInGermany that's great, I really love it! Of course it's beyond my level at the moment, but it's nice to have it so concise.

I figured out a way to do it with the following two commands:

for h in $(seq 1 "$(awk -F'\t' '/^[^#]/ {print NF-9; exit}' to_rehead.vcf)"); do echo "sample${h}" >> headers.txt; done && sed -i '1s/^/#CHROM\nPOS\nID\nREF\nALT\nQUAL\nFILTER\nINFO\nFORMAT\n/' headers.txt

(printf "%s\n" "$(paste -sd'\t' headers.txt)" && sed '1d' to_rehead.vcf) > final.vcf

but it is not ideal in my opinion because it generates accessory files, which might matter if run at scale and adds burden on top of the already big files I'm handling.

here's a very simplistic join:

function join(arr, sep,   str, cnt) {
    # gawk-specific branch: sep may itself be an array, in which case
    # sep[0] is a prefix and sep[cnt] is appended after each element.
    if (isarray(sep)) {
        str = sep[0]
        for (cnt = 1; cnt in arr; cnt++) {
            str = str arr[cnt] sep[cnt]
        }
    }
    # Otherwise sep is a plain separator string; an empty sep falls back to OFS.
    else {
        sep = (sep == "" ? OFS : sep)
        for (cnt = 1; cnt in arr; cnt++) {
            str = (cnt > 1 ? str sep : "") arr[cnt]
        }
    }
    return str
}
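With that in place (gawk only, because of isarray()), the reformatting rule from the script above could be written as, for instance:

{ nA = split($10, A, "|"); for (j = 1; j <= nA; ++j) A[j] = A[j] "|" A[j]; $10 = join(A, OFS); print }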

@overcraft, fyi - you may also look into vcftools, bcftools, duckdb, parquet, or pandas (or a combination of those) when working with VCF data sets.

@munkeHoller thanks. I'm well aware of the first two tools and have used them several times; however, I don't recall any specific function of theirs that handles this particular case.

Concerning duckdb and parquet, I will look into them; they seem like interesting alternatives. Regarding pandas, I didn't immediately go there as I'm less familiar with Python...