Hi there, I'm working on a bash script to combine the content of two files, which will run as an array job over 279 samples. Each of these 279 samples is stored in its own folder with the following structure:
sgdp_001/
├── <name>.bam
├── <name>_chr_maf.bam
├── <name>_chr_maf.bam.bai
├── count_maf.txt
├── input
│ ├── <name>_chr.bam
│ └── <name>_chr.bam.bai
├── kff_dataset.txt
├── output
│ ├── <name>.g.vcf.gz
│ ├── <name>.g.vcf.gz.tbi
│ ├── <name>_reheaded.g.vcf.gz
│ ├── <name>.vcf.gz
│ ├── <name>.vcf.gz.tbi
│ ├── <name>.visual_report.html
│ └── sample.name
├── R1_sorted.fastq.gz
└── R2_sorted.fastq.gz
where <name> is the name of the sample; for sgdp_001 it is abh100. The sample.name file in the output directory contains exactly that string (abh100 in this example), and the same applies to the other 278 samples and their respective names.
Each sample also has a count_maf.txt file, which looks like this (with different values for each sample):
1187029476 0 total (QC-passed reads + QC-failed reads)
1187029476 0 primary
0 0 secondary
0 0 supplementary
0 0 duplicates
0 0 primary duplicates
1152630233 0 mapped
97.10% N/A mapped %
1152630233 0 primary mapped
97.10% N/A primary mapped %
1187029476 0 paired in sequencing
593514738 0 read1
593514738 0 read2
1134955326 0 properly paired
95.61% N/A properly paired %
1151954166 0 with itself and mate mapped
676067 0 singletons
0.06% N/A singletons %
941358 0 with mate mapped to a different chr
491794 0 with mate mapped to a different chr (mapQ>=5)
Now, what I need to do is extract the sample name and print it, in tab-separated format, together with the difference between total and supplementary from the above, for all 279 samples. In this case it should look like this:
abh100 1187029476
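To make the arithmetic concrete, here is a minimal sketch of the subtraction on its own. It assumes count_maf.txt has exactly the layout shown above (count in the first field, label starting in the third field); the temporary directory and the cut-down file are hypothetical and exist only so the snippet runs by itself.

```bash
#!/usr/bin/env bash
# Hypothetical demo input: the first four lines of a count_maf.txt.
d=$(mktemp -d)
cat > "$d/count_maf.txt" <<'EOF'
1187029476 0 total (QC-passed reads + QC-failed reads)
1187029476 0 primary
0 0 secondary
0 0 supplementary
EOF

# Remember the count on the "total" line, then print total minus the
# count on the "supplementary" line and stop reading the file.
# %.0f keeps large counts from being printed in exponent notation.
diff=$(awk '$3 == "total" {t = $1}
            $3 == "supplementary" {printf "%.0f\n", t - $1; exit}' "$d/count_maf.txt")
echo "$diff"
```

Matching on the third field rather than grepping for the word avoids picking up other lines that might happen to contain "total" or "supplementary".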
I put together a one-liner that does almost what I need, except for the tab-separated part: I'm getting the result of the subtraction on a new line. See below for an example:
{ cat $d/output/sample.name; head -4 $d/count_maf.txt | grep -P "total|supplementary" | awk '{print $1}' | paste -sd- - | bc; } >> difference_maf.txt
and the result:
abh100
1187029476
Here, it's important to append to the final file as I want to collect this information for all 279 samples in a single place.
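For what it's worth, the line break in the result seems to come from sample.name itself: cat reproduces the newline at the end of that file. A minimal sketch of one way around it, assuming the layout shown above, is to capture both values with command substitution (which strips trailing newlines) and let a single printf place the tab. The demo directory and the names `name`, `total`, `supp` and `out` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical demo sample directory, built only so the sketch runs on its own.
d=$(mktemp -d)
mkdir -p "$d/output"
echo "abh100" > "$d/output/sample.name"
printf '%s\n' \
  "1187029476 0 total (QC-passed reads + QC-failed reads)" \
  "1187029476 0 primary" \
  "0 0 secondary" \
  "0 0 supplementary" > "$d/count_maf.txt"
out="$d/difference_maf.txt"   # stands in for the shared difference_maf.txt

# $(...) drops the trailing newline that cat would otherwise pass through.
name=$(cat "$d/output/sample.name")

# Pull the two counts by their label in the third field.
total=$(awk '$3 == "total" {print $1; exit}' "$d/count_maf.txt")
supp=$(awk '$3 == "supplementary" {print $1; exit}' "$d/count_maf.txt")

# One tab-separated line per sample, appended to the collecting file.
printf '%s\t%s\n' "$name" "$(( total - supp ))" >> "$out"
cat "$out"
```

With the demo data this appends a single line holding abh100, a tab, and 1187029476. Writing the whole line with one printf also keeps the name and the number together in a single write, rather than emitting them from two separate commands.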
Moreover, I'm reading each folder path (as /path/to/sgdp_001-279) from a list that I feed to the script as an array job; hence the $d, which should let me do this across all instances. I tried several approaches with awk, column and printf; the closest I got, with printf, is this output:
abh100
1187029476
which is not ideal, as it creates unnecessary empty lines. If anyone has any ideas, any help is much appreciated. Thanks!