Concatenate (pipe, |) the redirected output of an array job

overcraft · January 12, 2024, 11:25pm

Hi there, I have an array job of this form

#!/bin/bash
#
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --mem=10gb
#
#SBATCH --job-name=comparison
#SBATCH --output=variants.out
#SBATCH --array=[1-38]%38
#
#SBATCH --partition=<partition>
#
#SBATCH --account=<name>

NAMES=$1
d=$(sed -n "$SLURM_ARRAY_TASK_ID"p $NAMES)

cd /path/to/folder

SAMPLE="$(echo ${d}/output/*.g.vcf.gz)"
VCF="$(echo ${SAMPLE} | sed 's#/path/to/folder/sample_[0-9]\+/output/##' | sed 's/\.g.vcf.gz$//')"

./bcftools stats $d/output/${VCF}.vcf.gz > $d/output/${VCF}.stats
 grep "number of records:" $d/output/*.stats >> filtered_variants.txt

./bcftools stats $d/output/full/${VCF}_full.vcf.gz > $d/output/full/${VCF}_full.stats
 grep "number of records:" $d/output/full/*.stats >> default_variants.txt

Everything seems to run fine. However, on closer inspection of filtered_variants.txt and default_variants.txt, the lines in them do not line up with the folder selected at the bcftools step: the folder I then grep the stats output from does not appear to be the same one.

So my question is: is there a way to pipe the output of bcftools directly into the following grep as a one-line command? I suppose this would prevent the stats results of some samples from being randomly associated with those of others, since the instruction would be read as a single command.

Thanks in advance!

munkeHoller · January 16, 2024, 1:43am

@overcraft, it's not 100% clear (to me) exactly what you're after, so below is what I've interpreted.

Given two sample files:

VCF=GCA_000001215.4_current_ids
bcftools stats "${VCF}".vcf.gz | tee "${VCF}".stats | awk -vVCFile="^ID.*gz" -vRECS='^SN.*0.*number of records:' '$0 ~ VCFile {f=$3;next}; $0 ~ RECS {printf("%s: %s %s %s%ld\n", f,$3,$4,$5,$NF)}'
GCA_000001215.4_current_ids.vcf.gz: number of records:5633871

VCF=GCA_000001895.4_current_ids
bcftools stats "${VCF}".vcf.gz | tee "${VCF}".stats | awk -vVCFile="^ID.*gz" -vRECS='^SN.*0.*number of records:' '$0 ~ VCFile {f=$3;next}; $0 ~ RECS {printf("%s: %s %s %s%ld\n", f,$3,$4,$5,$NF)}'
GCA_000001895.4_current_ids.vcf.gz: number of records:4899379

#
# or, to get them all at once ...
#
cat *.stats | awk -vVCFile="^ID.*gz" -vRECS='^SN.*0.*number of records:' '$0 ~ VCFile {f=$3;next}; $0 ~ RECS {printf("%s: %s %s %s%ld\n", f,$3,$4,$5,$NF)}'
GCA_000001215.4_current_ids.vcf.gz: number of records:5633871
GCA_000001895.4_current_ids.vcf.gz: number of records:4899379
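
As a rough way to try the "all at once" variant without bcftools, the sketch below builds two made-up stats files and labels each record count with its file name. The sample names, the counts, and the trimmed-down awk program are assumptions for illustration; the files contain only an ID line and an SN line shaped like the ones the commands above match on.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Two hypothetical stats files holding just the lines the awk program reads.
for s in sampleA:100 sampleB:200; do
    name=${s%%:*}
    n=${s##*:}
    printf 'ID\t0\t%s.vcf.gz\nSN\t0\tnumber of records:\t%s\n' "$name" "$n" \
        > "$workdir/$name.stats"
done

# Remember the file name from the ID line, then print it next to the
# record count taken from the last field of the matching SN line.
labelled=$(cat "$workdir"/*.stats | awk '
    /^ID/ { f = $3; next }
    /^SN.*number of records:/ { printf("%s: %s\n", f, $NF) }
')
echo "$labelled"
```

Because every output line carries its own file name, the order in which the files are read no longer matters for telling the samples apart.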

If that's not representative of what you're after, then you'll need to explain better: give a worked example (at least 2 iterations) and say what is missing from your current attempt.

tks

MadeInGermany · January 16, 2024, 7:26am

*.stats matches all filenames that end with .stats.
If you want only the one that was just generated, repeat the filename:

./bcftools stats $d/output/${VCF}.vcf.gz > $d/output/${VCF}.stats
 grep "number of records:" $d/output/${VCF}.stats >> filtered_variants.txt
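
The difference between the glob and the explicit file name can be checked in a scratch directory. This is only a sketch: the two stats files, their names, and their counts are invented, and no bcftools is involved.

```shell
#!/bin/bash
# Scratch directory with two hypothetical stats files, standing in for
# results that already sit in the same folder.
workdir=$(mktemp -d)
printf 'SN\t0\tnumber of records:\t100\n' > "$workdir/sampleA.stats"
printf 'SN\t0\tnumber of records:\t200\n' > "$workdir/sampleB.stats"

VCF=sampleB

# Glob: grep reads every .stats file in the folder, so both are reported.
glob_out=$(grep "number of records:" "$workdir"/*.stats)

# Explicit name: grep reads only the file that belongs to this sample.
one_out=$(grep "number of records:" "$workdir/$VCF.stats")

echo "with glob:"
echo "$glob_out"
echo "with explicit name:"
echo "$one_out"
```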

Alternative:
tee passes its input through to its output like cat, but also writes a copy to a file.

./bcftools stats $d/output/${VCF}.vcf.gz | tee $d/output/${VCF}.stats | grep "number of records:" >> filtered_variants.txt
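
A minimal sketch of the same pipeline shape that runs without bcftools: fake_stats is a hypothetical stand-in for "./bcftools stats", and the sample name and numbers are made up.

```shell
#!/bin/bash
# fake_stats is a hypothetical stand-in for "./bcftools stats <file>";
# it prints two lines shaped like the summary section of the report.
fake_stats() {
    printf 'SN\t0\tnumber of samples:\t1\n'
    printf 'SN\t0\tnumber of records:\t42\n'
}

workdir=$(mktemp -d)
VCF=sampleA

# tee saves the full report to the .stats file and forwards the same
# stream to grep, which appends only the matching line to the summary.
fake_stats | tee "$workdir/$VCF.stats" \
    | grep "number of records:" >> "$workdir/filtered_variants.txt"

stats_lines=$(wc -l < "$workdir/$VCF.stats")
summary=$(cat "$workdir/filtered_variants.txt")
echo "$summary"
```

The .stats file still receives the complete report, while the summary file gets only the line produced by this particular run, so no other file in the folder can leak into it.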

overcraft · January 18, 2024, 7:58pm

@munkeHoller and @MadeInGermany. Thanks a lot for your help once again!

I'm sorry if I haven't been very clear but I guess you had the right intuition of what I was planning on doing and provided the right answer.

Cheers!