Hi there, I have an array job of this form
#!/bin/bash
#
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1
#SBATCH --time=24:00:00
#SBATCH --mem=10gb
#
#SBATCH --job-name=comparison
#SBATCH --output=variants.out
#SBATCH --array=[1-38]%38
#
#SBATCH --partition=<partition>
#
#SBATCH --account=<name>
NAMES=$1
d=$(sed -n "$SLURM_ARRAY_TASK_ID"p $NAMES)
cd /path/to/folder
SAMPLE="$(echo ${d}/output/*.g.vcf.gz)"
VCF="$(echo ${SAMPLE} | sed 's#/path/to/folder/sample_[0-9]\+/output/##' | sed 's/\.g.vcf.gz$//')"
./bcftools stats $d/output/${VCF}.vcf.gz > $d/output/${VCF}.stats
grep "number of records:" $d/output/*.stats >> filtered_variants.txt
./bcftools stats $d/output/full/${VCF}_full.vcf.gz > $d/output/full/${VCF}_full.stats
grep "number of records:" $d/output/full/*.stats >> default_variants.txt
where everything seems to work fine. However, after a careful inspection of the files filtered_variants.txt and default_variants.txt, it appears that there is no connection between the folder selected at the bcftools step and the one from which I then grep a specific line from the stats output file.
So, my question is: is there a command to pipe the output of bcftools directly into the following grep in a one-line command? I suppose this would prevent the stats results of some samples from being randomly associated with those of others, since the instruction would be read as a single command.
Thanks in advance!
@overcraft, it's not 100% clear (to me) exactly what you're after, so below is what I've interpreted.
Given two sample files:
VCF=GCA_000001215.4_current_ids
bcftools stats "${VCF}".vcf.gz | tee "${VCF}".stats | awk -vVCFile="^ID.*gz" -vRECS='^SN.*0.*number of records:' '$0 ~ VCFile {f=$3;next}; $0 ~ RECS {printf("%s: %s %s %s%ld\n", f,$3,$4,$5,$NF)}'
GCA_000001215.4_current_ids.vcf.gz: number of records:5633871
VCF=GCA_000001895.4_current_ids
bcftools stats "${VCF}".vcf.gz | tee "${VCF}".stats | awk -vVCFile="^ID.*gz" -vRECS='^SN.*0.*number of records:' '$0 ~ VCFile {f=$3;next}; $0 ~ RECS {printf("%s: %s %s %s%ld\n", f,$3,$4,$5,$NF)}'
GCA_000001895.4_current_ids.vcf.gz: number of records:4899379
#
# or, to get them all at once ...
#
cat *.stats | awk -vVCFile="^ID.*gz" -vRECS='^SN.*0.*number of records:' '$0 ~ VCFile {f=$3;next}; $0 ~ RECS {printf("%s: %s %s %s%ld\n", f,$3,$4,$5,$NF)}'
GCA_000001215.4_current_ids.vcf.gz: number of records:5633871
GCA_000001895.4_current_ids.vcf.gz: number of records:4899379
If that's not representative of what you're after, then you'll need to explain better: give a worked example (at least 2 iterations) and say what is missing from your current attempt.
tks
*.stats matches all filenames that end with .stats.
If you want only the just generated one then repeat the filename:
./bcftools stats $d/output/${VCF}.vcf.gz > $d/output/${VCF}.stats
grep "number of records:" $d/output/${VCF}.stats >> filtered_variants.txt
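To see the difference between the glob and the explicit filename, here is a small sketch that needs no bcftools. The folder, the file names old.stats and new.stats, and the counts 111 and 222 are all made up for illustration; only the shape of the "number of records:" line follows the stats output.

```shell
#!/bin/bash
# Made-up files and counts, only to show how the glob behaves.
workdir=$(mktemp -d)
printf 'SN\t0\tnumber of records:\t111\n' > "$workdir/old.stats"
printf 'SN\t0\tnumber of records:\t222\n' > "$workdir/new.stats"

# Glob: every .stats file in the folder is searched, so two lines come back
# (grep prefixes each line with its file name when given several files).
glob_hits=$(grep "number of records:" "$workdir"/*.stats)

# Explicit name: only the file that was just written is searched.
one_hit=$(grep "number of records:" "$workdir/new.stats")

printf 'glob:\n%s\n' "$glob_hits"
printf 'explicit:\n%s\n' "$one_hit"
rm -rf "$workdir"
```

With the glob, any older .stats file left in the same folder ends up in the summary as well; with the explicit name, only the file from the current step does.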
Alternative:
tee passes its input to its output like cat, but also makes a copy to a file.
./bcftools stats $d/output/${VCF}.vcf.gz | tee $d/output/${VCF}.stats | grep "number of records:" >> filtered_variants.txt
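A minimal sketch of the same tee pipeline that can be run without bcftools. Here fake_stats is a hypothetical stand-in for "./bcftools stats file.vcf.gz", and its lines, the file name sample.stats, and the count 222 are invented for the demonstration.

```shell
#!/bin/bash
# fake_stats is a hypothetical stand-in for "./bcftools stats file.vcf.gz";
# the lines it prints and the count 222 are made up.
fake_stats() {
    printf '# report produced by a stand-in function\n'
    printf 'SN\t0\tnumber of samples:\t1\n'
    printf 'SN\t0\tnumber of records:\t222\n'
}

outdir=$(mktemp -d)

# tee saves the full report to sample.stats and passes it on to grep,
# so grep only ever sees the report produced by this very command.
fake_stats | tee "$outdir/sample.stats" | grep "number of records:" >> "$outdir/filtered_variants.txt"

saved_lines=$(grep -c '' "$outdir/sample.stats")
summary=$(cat "$outdir/filtered_variants.txt")
printf 'lines saved by tee: %s\n' "$saved_lines"
printf 'line kept by grep: %s\n' "$summary"
rm -rf "$outdir"
```

The full report still lands in the .stats file, while the summary file receives only the one matching line, with no glob involved.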
@munkeHoller and @MadeInGermany. Thanks a lot for your help once again!
I'm sorry if I haven't been very clear but I guess you had the right intuition of what I was planning on doing and provided the right answer.
Cheers!