Combine two files output into a tab-separated third file

Hi there, I'm working on a bash script, run as an array job over 279 instances, that combines the content of two files. Each of these 279 samples is organized in a folder with the following structure:

sgdp_001/
├── <name>.bam
├── <name>_chr_maf.bam
├── <name>_chr_maf.bam.bai
├── count_maf.txt
├── input
│   ├── <name>_chr.bam
│   └── <name>_chr.bam.bai
├── kff_dataset.txt
├── output
│   ├── <name>.g.vcf.gz
│   ├── <name>.g.vcf.gz.tbi
│   ├── <name>_reheaded.g.vcf.gz
│   ├── <name>.vcf.gz
│   ├── <name>.vcf.gz.tbi
│   ├── <name>.visual_report.html
│   └── sample.name
├── R1_sorted.fastq.gz
└── R2_sorted.fastq.gz

where <name> is the name of each sample; for sgdp_001 it is abh100. The sample.name file in the output directory contains exactly the string abh100 in this example, and the same applies to all other 278 instances and their respective names.

Each sample also has a count_maf.txt file, which looks like this (with different values for each sample):

1187029476	0	total (QC-passed reads + QC-failed reads)
1187029476	0	primary
0	0	secondary
0	0	supplementary
0	0	duplicates
0	0	primary duplicates
1152630233	0	mapped
97.10%	N/A	mapped %
1152630233	0	primary mapped
97.10%	N/A	primary mapped %
1187029476	0	paired in sequencing
593514738	0	read1
593514738	0	read2
1134955326	0	properly paired
95.61%	N/A	properly paired %
1151954166	0	with itself and mate mapped
676067	0	singletons
0.06%	N/A	singletons %
941358	0	with mate mapped to a different chr
491794	0	with mate mapped to a different chr (mapQ>=5)

Now, what I need to do is extract the sample name and print it, tab-separated, together with the difference between total and supplementary from the file above, for all 279 samples. In this case it should look like this:

abh100    1187029476

I put together a one-liner that does almost everything I need except the tab separation: the result of the subtraction ends up on a new line. See below for an example:

{ cat $d/output/sample.name; head -4 $d/count_maf.txt | grep -P "total|supplementary" | awk '{print $1}' | paste -sd- - | bc; } >> difference_maf.txt

and result:

abh100
1187029476

Here, it's important to append to the final file as I want to collect this information for all 279 samples in a single place.
Moreover, I'm reading each folder path (as /path/to/sgdp_001 through sgdp_279) from a list which I feed to the script as an array job; hence the $d, which lets me do this across all instances. I tried several approaches with awk, column and printf; the closest I got, with the last one, is this output:

abh100
                 1187029476

which is not ideal, as it creates unnecessary empty lines. If anyone has any ideas, any help is much appreciated. Thanks!

@overcraft, please show the bash script.

tbh, your explanation is confusing afaics.
Give more than one example, showing ONLY the files involved; please don't show files that are not involved, as they're just noise.

Show a sample (with at least 3 examples of the expected output); we can guess, but ....


Maybe this is what you're looking for? (Unoptimized; it could be shorter and use even fewer binaries, e.g. awk could probably do all the calculations, but I'm not an awk expert. %d for printf assumes the difference is an integer; otherwise you might want to use e.g. "%s\t%s\n".)

printf "%s\t%d\n" "$(<"$d"/output/sample.name)" "$(head -4 "$d"/count_maf.txt | grep -P "total|supplementary" | awk '{print $1}' | paste -sd- - | bc)" >> difference_maf.txt
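The reason this works where the original `{ cat ...; ...; }` grouping didn't: `cat` emits the trailing newline stored in sample.name, while command substitution `$( ... )` strips trailing newlines, so printf can place both fields on one tab-separated line. A minimal sketch with throwaway data (the real sample.name lives under each sgdp_* folder):

```shell
# sample.name ends in a newline, like any text file; $(cat ...) strips it.
tmp=$(mktemp)
printf 'abh100\n' > "$tmp"
printf '%s\t%d\n' "$(cat "$tmp")" 1187029476   # prints: abh100<TAB>1187029476
rm -f "$tmp"
```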

@munkeHoller apologies. I'll start by sharing the entire script:

#!/bin/bash
#
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=112
#SBATCH --time=12:00:00
#SBATCH --mem=256gb
#
#SBATCH --job-name=reads_count
#SBATCH --output=count.out
#SBATCH --array=[1-279]%279
#
#SBATCH --partition=<partition>
#SBATCH --qos=<qos>
#
#SBATCH --account=<account>

NAMES=$1
d=$(sed -n "$SLURM_ARRAY_TASK_ID"p $NAMES)

cd /path/to/folder
SAMPLE="$(echo ${d}/input/*_chr.bam)"
BAM="$(echo ${SAMPLE} | sed 's#/path/to/folder/sgdp_[0-9]\+/input/##' | sed 's/_chr.bam$//')"

### Get stats, Q20 reads and organize them
./samtools flagstats $d/${BAM}_chr_maf.bam -@ 112 -O tsv > $d/count_maf.txt #overall stats MAF

{ cat $d/output/sample.name; head -4 $d/count_maf.txt | grep -P "total|supplementary" | awk '{print $1}' | paste -sd- - | bc; } >> difference_maf.txt

Here are three sgdp folders with the files of interest and the samples' names:

sgdp_001 NAME: abh100
├── [  594]  count_maf.txt
├── [ 4.0K]  output
   └── [    7]  sample.name
sgdp_002 NAME: abh107
├── [  554]  count_maf.txt
├── [ 4.0K]  output
   └── [    7]  sample.name
sgdp_003 NAME: ALB212
├── [  555]  count_maf.txt
├── [ 4.0K]  output
   └── [    7]  sample.name

This is the desired output:

abh100    1187029476
abh107    1137440782
ALB212    1144644504

For completeness, I share the structure of the name.list file I'm using to read in the paths so that it all makes more sense:

/path/to/folder/sgdp_001
/path/to/folder/sgdp_002
/path/to/folder/sgdp_003

and of course the script is launched as follows: sbatch script.sh name.list. Thanks again!

@Matt-Kita Indeed, it works. Thanks, and apologies as well for the not-very-clear explanation. I'm working with integers only, so no need for the %s.
I'm also not an awk expert; I mostly adapted the code from another task I worked on. I'm sure there's a better way, but as it is reasonably concise and understandable to me, I'll keep it for now.

Thanks a lot, again!

@overcraft

head -4 "$d"/count_maf.txt | grep -P "total|supplementary" | awk '{print $1}' | paste -sd- - | bc

simplifies to:

awk '$3 == "total"{tot=$1} $3 =="supplementary"{print tot - $1 ;exit}' "$d"/count_maf.txt

awk acts only on lines where the third field is total or supplementary; since awk does arithmetic itself, it can compute total - supplementary and then exit, so there's no need to invoke head, paste or bc at all. There are other solutions; the above is just one of a few.

Another one: gawk '/total/{tot=$1} /supplementary/{print tot - $1; exit}' "$d"/count_maf.txt
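For anyone wanting to sanity-check the one-liner, here is a quick run against a synthetic flagstat-style TSV (the values are invented; the layout only mimics the `samtools flagstats -O tsv` output shown earlier):

```shell
# Build a tiny fake count_maf.txt and run the awk replacement on it.
f=$(mktemp)
printf '100\t0\ttotal (QC-passed reads + QC-failed reads)\n' > "$f"
printf '100\t0\tprimary\n0\t0\tsecondary\n5\t0\tsupplementary\n' >> "$f"
awk '$3 == "total"{tot=$1} $3 == "supplementary"{print tot - $1; exit}' "$f"   # prints 95
rm -f "$f"
```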


@munkeHoller great thanks for the insights on awk. I will keep this in mind and try to make my commands less redundant.

Again, sorry if my first post was not very clear; in the future, whenever I have problems with multi-job tasks, I will make sure to include a few input/desired-output blocks to make things easier.

np, am sure I’d struggle comprehending FASTA data :sunglasses:

Another optimization:

{
read name < "$d"/output/sample.name
printf "%s\t" "$name"
awk '$3=="total"{tot=$1} $3=="supplementary"{sup=$1} NR==4{print tot - sup; exit}' "$d"/count_maf.txt
} >> difference_maf.txt
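An end-to-end check of this variant on one throwaway sample directory (directory names and counts are made up for the demo; `read` strips the trailing newline just like command substitution does):

```shell
d=$(mktemp -d)
mkdir -p "$d"/output
printf 'abh100\n' > "$d"/output/sample.name
# Minimal stand-in for count_maf.txt; only lines 1 and 4 matter here.
printf '100\t0\ttotal (QC-passed)\n100\t0\tprimary\n0\t0\tsecondary\n5\t0\tsupplementary\n' > "$d"/count_maf.txt
{
  read name < "$d"/output/sample.name   # name="abh100", no trailing newline
  printf '%s\t' "$name"
  awk '$3=="total"{tot=$1} $3=="supplementary"{sup=$1} NR==4{print tot - sup; exit}' "$d"/count_maf.txt
}   # prints: abh100<TAB>95
rm -rf "$d"
```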