Hi there,
I keep overthinking how to optimize my pipeline. Essentially, I have multiple tools that will be executed on two haplotypes (hap1 and hap2) of a plant species.
The general structure is as follows:
INLUP_00001
├── hap1
└── hap2
(I will have 16 of these INLUP_????? parent directories)
So, with this in mind, I organized a job array which reads from the following file:
path/to/INLUP_00001/hap1
path/to/INLUP_00001/hap2
path/to/INLUP_00002/hap1
path/to/INLUP_00002/hap2
.
.
.
where I have a variable – ${HAP} – that determines which haplotype I'm working on, which sub-directory the data will be written to, and the name of each output. This seems to give the best runtime and resource allocation.
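Roughly, each array task resolves its working directory and haplotype along these lines (a simplified sketch of the dispatch logic only; my real script derives the variables with sed):

NAMES=$1
d=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$NAMES")   # e.g. path/to/INLUP_00001/hap2
HAP=$(basename "$d")                             # -> hap1 or hap2
# ...every subsequent tool then reads/writes under $d and names its outputs with $HAP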
However, there is a problem with the very first tool in the pipeline: it is the application that generates both hap1 and hap2, and it does not accept the ${HAP} variable. In other words, I have no control over its outputs from my job array list, which will redundantly execute this single command 32 times, not only causing issues but also wasting time and resources...
Is there a way to make this command run only once per INLUP sample while still keeping control of the haplotypes via the ${HAP} variable within the job array?
I thought about an alternative with for loops wrapped around all the other tools in the pipeline to accommodate hap1 and hap2 (see the sketch below), but that ended up making the script overly long and more complex in my opinion... also, the resources allocated for the first tool cannot easily be partitioned/assigned to independent hap1 and hap2 tasks for the other tools.
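For reference, the kind of for-loop wrapping I mean would look roughly like this for every downstream step (the tool and file names here are placeholders only):

for STR in hap1 hap2; do
    some_downstream_tool --in  ${SAMPLE_DIR}/${STR}/${ID}.${STR}.fasta \
                         --out ${SAMPLE_DIR}/${STR}/${ID}.${STR}.out   # placeholder, just to show the pattern
done

Repeating this for each tool quickly bloats the script, and hap1 and hap2 still end up sharing a single resource allocation.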
Any idea/help is much appreciated. Sorry for the long message; if more context is needed I can provide a short MWE of the first few commands.
hello,
this is far too vague to get any reasoned answer. What is this mystery 'first tool'? How is it being invoked? Why does this tool not accept ${HAP} (or can it be configured to accept it)? Where/why is the ${HAP} variable needed at this point (or at all)? What does the rest of the toolchain look like after the 'first tool' is used?
@munkeHoller apologies, I will try to be clearer. The first tool is hifiasm, which takes some sequencing data as input and returns hap1 & hap2 as output (see below).
It cannot be tweaked to accept the ${HAP} variable since it works this way out of the box; changing that would require intervention at the development level by the authors...
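Just to make it concrete: a single hifiasm run in Hi-C mode (as in my script below) writes both haplotype assembly graphs at once, e.g.

${ID}.asm.hic.hap1.p_ctg.gfa
${ID}.asm.hic.hap2.p_ctg.gfa

so there is no command-line option through which a per-task ${HAP} value could select just one of them.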
This is the initial part of the pipeline; basically, after hifiasm, all other tools will need to work independently on both hap1 and hap2, hence why it would be great to have a way to isolate the run of hifiasm to only one instance for each INLUP sample.
#!/bin/bash
#set -x
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=56
#SBATCH --time=200:00:00
#SBATCH --mem=100gb
#
#SBATCH --job-name=assembly_cond
#SBATCH --output=assembly_cond.out
##SBATCH --array=[1-32]%4
#
#SBATCH --exclude=node[1-5],node[8-9]

### some comments explaining the pipeline

NAMES=$1
d=$(sed -n "$SLURM_ARRAY_TASK_ID"p $NAMES)

### part where I load modules

#[ -d $d/hap1 ] && [ -d $d/hap2 ] || mkdir -p $d/hap{1..2}
#[ -d $d/1.pre_scaff ] && [ -d $d/2.post_scaff ] || mkdir -p $d/1.pre_scaff $d/2.post_scaff

ID="$(echo ${d} | sed 's#/path/to/##' | sed 's#-[0-9]\+/hap[0-9]$##' | sed 's/_//')"
HOM_COV="$(echo ${d} | sed 's#/path/to/INLUP_[0-9]\+-##' | sed 's#/hap[0-9]$##')"
HAP="$(echo ${d} | sed 's#/path/to/INLUP_[0-9]\+-[0-9]\+/##')"
#HAP=("hap1" "hap2") # this array could be used instead to wrap the other tools in for loops


cd $d/..


### generating standard assemblies
if [ -s $d/../${ID}.filt.fastq.gz ]
then
    echo "reads filtering already done"
else
    source /opt/miniconda3/bin/activate p36
    pbadapterfilt.sh -l 44 -m 97 -t 56; rm $d/../*.stats $d/../*.blastout $d/../*.blocklist
    conda deactivate
fi

if [ -d $d/../${ID}.meryl ] && files=( "$d/../${ID}.meryl"/* ) && [[ -e ${files[0]} || -L ${files[0]} ]]
then
    echo "k-mers count done"
else
    meryl count k=21 threads=56 output ${ID}.meryl ${ID}.filt.fastq.gz
fi

if [ -s $d/${ID}.hap1.fasta.gz ] && [ -s $d/${ID}.hap2.fasta.gz ]
then
    echo "haplotype-resolved assemblies already generated"
else
    hifiasm -o $d/${ID}.asm -l 0 --hom-cov ${HOM_COV} --h1 $d/../hi-c/${ID}_1.fq.gz --h2 $d/../hi-c/${ID}_2.fq.gz $d/${ID}.filt.fastq.gz -t 56 &> $d/log_"$(date +"%Y%m%d-%I:%M")".txt # this tool produces hap1 and hap2 by default, with no other action to take
    #for STR in "${HAP[@]}"; do ../gfatools gfa2fa $d/${ID}.asm.hic.${STR}.p_ctg.gfa > $d/${STR}/${ID}.${STR}.fasta; done # alternative with a for loop
    ../gfatools gfa2fa $d/${ID}.asm.hic.${HAP}.p_ctg.gfa > $d/${ID}.${HAP}.fasta # what I want, i.e. using the ${HAP} variable
    rm $d/${ID}.asm*
fi
.
.
.
Your script runs hifiasm with the arguments --h1 $d/hi-c/${ID}_1.fq.gz --h2 $d/hi-c/${ID}_2.fq.gz
I guess there could be just one --h1 $d/hi-c/${ID}_1.fq.gz?
Please give an example/excerpt of the NAMES file!
@MadeInGermany Indeed, this is what the NAMES file looks like:
/path/to/INLUP_00001-??/hap1
/path/to/INLUP_00001-??/hap2
/path/to/INLUP_00002-??/hap1
/path/to/INLUP_00002-??/hap2
/path/to/INLUP_00003-??/hap1
/path/to/INLUP_00003-??/hap2
/path/to/INLUP_00004-??/hap1
/path/to/INLUP_00004-??/hap2
/path/to/INLUP_00005-??/hap1
/path/to/INLUP_00005-??/hap2
/path/to/INLUP_00006-??/hap1
/path/to/INLUP_00006-??/hap2
/path/to/INLUP_00007-??/hap1
/path/to/INLUP_00007-??/hap2
/path/to/INLUP_00008-??/hap1
/path/to/INLUP_00008-??/hap2
/path/to/INLUP_00009-??/hap1
/path/to/INLUP_00009-??/hap2
/path/to/INLUP_00010-??/hap1
/path/to/INLUP_00010-??/hap2
/path/to/INLUP_00011-??/hap1
/path/to/INLUP_00011-??/hap2
/path/to/INLUP_00012-??/hap1
/path/to/INLUP_00012-??/hap2
/path/to/INLUP_00013-??/hap1
/path/to/INLUP_00013-??/hap2
/path/to/INLUP_00014-??/hap1
/path/to/INLUP_00014-??/hap2
/path/to/INLUP_00015-??/hap1
/path/to/INLUP_00015-??/hap2
/path/to/INLUP_00016-??/hap1
/path/to/INLUP_00016-??/hap2
The ?? is a two-digit number I use to determine ${HOM_COV}, and it can vary between samples; however, this already shows the overall structure of the folder paths I have to access.
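To make the parsing concrete, for a made-up line such as /path/to/INLUP_00001-35/hap1 (the 35 is just an example coverage value), the three sed chains in my script would give:

d=/path/to/INLUP_00001-35/hap1
ID="$(echo ${d} | sed 's#/path/to/##' | sed 's#-[0-9]\+/hap[0-9]$##' | sed 's/_//')"   # -> INLUP00001
HOM_COV="$(echo ${d} | sed 's#/path/to/INLUP_[0-9]\+-##' | sed 's#/hap[0-9]$##')"      # -> 35
HAP="$(echo ${d} | sed 's#/path/to/INLUP_[0-9]\+-[0-9]\+/##')"                         # -> hap1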
Apologies, there was a little mistake on my side: the --h1 and --h2 options access a folder called hi-c in the parent directory INLUP_?????-??, where both sequencing datasets are present (I fixed the code in the main thread).
Unfortunately, it's not possible to indicate only one of the flags, as the algorithm needs both of them to operate correctly...
P.S. this is what the complete tree structure of my folders looks like. Sorry, but I don't have tree on the cluster where I work, so I had to create the example locally...
> tree -h INLUP_00001/*
[ 64] INLUP_00001/1.pre_scaff
[ 64] INLUP_00001/2.post_scaff
[ 37] INLUP_00001/INLUP00233.fastq.gz
[ 42] INLUP_00001/INLUP00233.filt.fastq.gz
[ 64] INLUP_00001/INLUP00233.meryl
[ 64] INLUP_00001/hap1
[ 64] INLUP_00001/hap2
[ 128] INLUP_00001/hi-c
├── [ 36] INLUP00001_1.fq.gz
└── [ 36] INLUP00001_2.fq.gz