Match and store numerical prefix to update files

In the bash below, the unique header of each vcf.gz is stored in a text file with the same name. That is, if 16-0000-file.vcf.gz was used, the header text file would be 16-0000-file_header.txt.
There can be multiple vcf.gz files in a directory, usually 3, and I need to fix the header in each file before processing it further. My question is: how can I match each text file with its vcf.gz and pass the stored variables of each to the reheader code?

In the code below I strip off the unique numerical prefix 16-0000 from both the vcf.gz and the text file, but I am not sure how to match the two files.

IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/NGS/test'
InDir2='/home/cmccabe/Desktop/NGS/test'
OutDir='/home/cmccabe/Desktop/NGS/test'

cd "$InDir1"
for file1 in *.txt
do	# Grab file prefix.
	p=${file1%%_*}

	# Find matching file2.
	file2=$(printf '%s' "$InDir2/$p"_*.vcf.gz)
	if [ ! -f "$file2" ]
	then	printf '%s: No single file matching %s found.\n' "$IAm" \
		    "$file1" >&2
		continue
	fi

# store matches
    out=${file1##*/} && ${file2##*/}

vcf.gz in directory (file2)

16-0000-file1.vcf.gz
16-0001-file2.vcf.gz
16-0002-file3.vcf.gz

matching text file in directory (file1)

16-0000-file1_header.txt
16-0001-file2_header.txt
16-0002-file3_header.txt

So the contents of 16-0000.txt would be used to update 16-0000.vcf.gz using the code below.

reheader code

# edit the header
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
     echo "Start vcf add header creation: $(date) - file: $f"
     bname=`basename $f`
     pref=${bname%%.vcf.gz}
      bcftools reheader -h $file1 $file2 > ${pref}_fixed.vcf.gz
     echo "End add header creation: $(date) - file: $f"
done >> "$logfile"

From your example I don't understand what your script is doing that you don't want it to do or is not doing that you do want it to do.

Please show us:

  1. the output you are getting,
  2. the output you are hoping to get,
  3. the arguments that are currently being passed to reheader,
  4. the arguments you want to be passed to reheader, and
  5. ls -l output from the three directories involved.
#!/bin/bash

# gzip files
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf ; do
     echo "Start vcf.gz creation: $(date) - file: $f"
     bname=`basename $f`
     gzip $f
     echo "End vcf.gz creation: $(date) - file: $f"
done >> "$logfile"

# find undefined annotations
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
     echo "Start vcf missing header creation: $(date) - file: $f"
     bname=`basename $f`
     pref=${bname%%.vcf.gz}
    bcftools view -h $f > /home/cmccabe/Desktop/NGS/test/${pref}_header.txt
     echo "End missing header creation: $(date) - file: $f"
done >> "$logfile"

# match files
IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/NGS/test'
InDir2='/home/cmccabe/Desktop/NGS/test'
OutDir='/home/cmccabe/Desktop/NGS/test'

cd "$InDir1"
for file1 in *.txt
do	# Grab file prefix.
	p=${file1%%_*}

	# Find matching file2.
	file2=$(printf '%s' "$InDir2/$p"_*.vcf)
	if [ ! -f "$file2" ]
	then	printf '%s: No single file matching %s found.\n' "$IAm" \
		    "$file1" >&2
		continue
	fi
# store matches
    out=${file1##*/} && ${file2##*/}

# edit the header
logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
     echo "Start vcf edit header creation: $(date) - file: $f"
     bname=`basename $f`
     pref=${bname%%.vcf.gz}
      bcftools reheader -h $file1 $file2 > ${pref}_fixed.vcf.gz
     echo "End edit header creation: $(date) - file: $f"
done >> "$logfile"

Current output
syntax error: unexpected end of file, though the portion in bold creates the files that are needed in the directory.

desired output

file1_header.txt is stored as $file1 matched with file1.vcf.gz stored as $file2
file2_header.txt is stored as $file1 matched with file2.vcf.gz stored as $file2
file3_header.txt is stored as $file1 matched with file3.vcf.gz stored as $file2

Currently no arguments are being passed to reheader (in italics in the command)

desired arguments passed to reheader

$file1
$file2

The loop would use each argument in the below to create a new file updated with the fixed header:

bcftools reheader -h $file1 $file2 > ${pref}_fixed.vcf.gz
ls -l
total 12288
-rwxrwx---+ 1 cmccabe Domain Users 2031823 Jan 31 11:07 file1.vcf.gz
-rwxrwx---+ 1 cmccabe Domain Users    9312 Feb  2 13:17 file1_header.txt
-rwxrwx---+ 1 cmccabe Domain Users 2361873 Jan 31 11:07 file2.vcf.gz
-rwxrwx---+ 1 cmccabe Domain Users    9315 Feb  2 13:17 file2_header.txt
-rwxrwx---+ 1 cmccabe Domain Users 1816662 Jan 31 11:07 file3.vcf.gz
-rwxrwx---+ 1 cmccabe Domain Users    9313 Feb  2 13:17 file3_header.txt
-rwxrwx---+ 1 cmccabe Domain Users    1356 Feb  2 13:22 process.log
-rwxrwx---+ 1 cmccabe Domain Users    1278 Feb  2 13:17 process.log~

I hope this helps and thank you very much :).

Please explain in English what the following code is intended to do:

# store matches
    out=${file1##*/} && ${file2##*/}

Given that the variable out is never used after being defined by the above statement, are these two lines of code needed?

What happens if you insert the missing done in your script before or after the code discussed above (which should resolve the syntax error in your script)?
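For reference, here is a minimal sketch of what that statement does when bash runs it. The two paths are hypothetical stand-ins for whatever the loop has stored in file1 and file2. The part before && is an ordinary assignment; the part after && is not stored anywhere, it is executed as a command whose name is the basename of file2.

```shell
#!/bin/bash
# Hypothetical values, for illustration only.
file1=/home/cmccabe/Desktop/NGS/test/file1_header.txt
file2=/home/cmccabe/Desktop/NGS/test/file1.vcf.gz

# The statement from the script, run in a subshell so its failure is contained.
# Part 1 assigns "file1_header.txt" to out; part 2 then tries to execute
# "file1.vcf.gz" as if it were a command name.
status=0
( out=${file1##*/} && ${file2##*/} ) 2>/dev/null || status=$?
echo "exit status of the original statement: $status"

# If both basenames are really wanted, two plain assignments store them.
name1=${file1##*/}
name2=${file2##*/}
echo "name1=$name1 name2=$name2"
```

Since file1 and file2 already hold the two names inside the loop, nothing extra is needed to "store" the match; the variables can be used directly.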


The following code:

# store matches
    out=${file1##*/} && ${file2##*/}

is intended to match and store file1_header.txt as $file1 and match and store file1.vcf.gz as $file2.
These two variables are then passed to the reheader command to be processed. After the command executes, the variables are reset using the remaining two files.
I will rerun the code omitting the out and adding the done. Thank you :).

---------- Post updated at 04:53 PM ---------- Previous update was at 02:25 PM ----------

Adding the missing done did allow the script to execute. The portion of code in bold in the previous post is unchanged, but I updated the match portion of the code to:

# match files
IAm=${0##*/}

InDir1='/home/cmccabe/variants'
InDir2='/home/cmccabe/variants'

cd "$InDir1"
for file1 in *.txt
do    # Grab file prefix.
    p=${file1%%_*}

    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.vcf)
    if [ ! -f "$file2" ]
    then    printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi
        echo "file1 is: $file1"
        echo "file2 is: $file2"
done

The output that I get in the terminal is:

No single file matching 16_0000_file-a_header.txt found.
No single file matching 16_0001_file-b_header.txt found.
ls -l /home/cmccabe/variants
total 92
-rw-rw-r-- 1 cmccabe cmccabe  8895 Feb  2 16:37 16_0000_file-a_header.txt
-rw-rw-r-- 1 cmccabe cmccabe 29066 Jan 24 12:06 16_0000_file-a.vcf.gz
-rw-rw-r-- 1 cmccabe cmccabe  8895 Feb  2 16:37 16_0001_file-b_header.txt
-rw-rw-r-- 1 cmccabe cmccabe 29066 Jan 24 12:06 16_0001_file-b.vcf.gz
-rw-rw-r-- 1 cmccabe cmccabe   860 Feb  2 16:37 process.log

desired output

$file1=16-0000.txt
$file2=16-0000.vcf.gz

$file1=16-0001.txt
$file2=16-0001.vcf.gz

Basically, on the first pass the $file1 variable is the .txt and the $file2 variable is the matching vcf.gz. Those variables are passed to the reheader command and it is executed. After it executes, the variables are reset using the other files in the directory. Since the numerical prefix is always unique, it is used to perform the match between files. There will always be a matching vcf.gz and .txt. Thank you :).

Given that there is nothing in your script that attempts to print the string $file1= nor the string $file2=, I have no idea how you would expect your script to produce that output???
Furthermore, since your ls output doesn't show any files with any of the filenames shown on the right hand side of those strings, I can't guess at how you would expect your script to create those names and find those names as existing files. (But of course I did ask for ls -l output from all three of your directories and you only showed output from one of those directories and didn't indicate which one of those three directories was the current working directory for the listing you provided, so maybe your names would make sense.)
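Setting the names aside, the listing for /home/cmccabe/variants does seem to account for the two "No single file matching" messages. The sketch below recreates the two .vcf.gz names from that listing as empty files in a scratch directory (the scratch directory is an assumption made only so the example can run anywhere) and applies the same expansions the script uses.

```shell
#!/bin/bash
# Recreate the two .vcf.gz names from the listing as empty files.
dir=$(mktemp -d)
touch "$dir/16_0000_file-a.vcf.gz" "$dir/16_0001_file-b.vcf.gz"

file1=16_0000_file-a_header.txt
p=${file1%%_*}        # %%_* removes the longest "_*" suffix, leaving only "16"
echo "prefix: $p"

# Pattern from the script: no file ends in .vcf, so the glob is left
# unexpanded and printf simply hands the literal pattern back.
tried=$(printf '%s' "$dir/$p"_*.vcf)
[ -f "$tried" ] || echo "not a file: ${tried##*/}"

# With .vcf.gz the glob does expand, but "16_" is shared by both files.
matches=("$dir/$p"_*.vcf.gz)
echo "files matching ${p}_*.vcf.gz: ${#matches[@]}"

rm -r "$dir"
```

So with underscores inside the numeric part, two things go wrong at once: the prefix is cut at the first underscore, and the pattern looks for .vcf rather than .vcf.gz.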

The inconsistent punctuation in the names of your files completely confuses any attempt to guess at what you want. Above we have .txt filenames:

16_0000_file-a_header.txt
16_0001_file-b_header.txt
16-0000.txt
16-0001.txt

quoted from post #5 (with two showing underscores between the leading sequences of digits and two showing hyphens between the leading sequences of digits), while the names shown in post #1 were:

16-0000-file1_header.txt
16-0001-file2_header.txt
16-0002-file3_header.txt

using a completely different filename format.

Is it really that hard to give us an accurate description of the files you're working with and what you're trying to do?


Some of the file names and locations were different because I was at the office then at home and that was a problem leading to typos. I apologize for that and am back in the office. The details below are the exact files in the directory.

There are always 3 vcf.gz files in the following format below in each directory:

Files in /home/cmccabe/Desktop/NGS/test

16-0000_File-A_variant_strandbias_readcount.vcf.gz
16-0002_File-A_variant_strandbias_readcount.vcf.gz
16-0005_File-A_variant_strandbias_readcount.vcf.gz

The numeric prefix of each is always 7 characters (xx-xxxx) and unique. The portion of code in the previous post uses each vcf.gz file to create a corresponding .txt file with only the unique numeric prefix followed by _header.

Files in /home/cmccabe/Desktop/NGS/test (if code works as expected)

16-0000_File-A_variant_strandbias_readcount.vcf.gz
16-0002_File-B_variant_strandbias_readcount.vcf.gz
16-0005_File-A_variant_strandbias_readcount.vcf.gz
16-0000_header.txt
16-0002_header.txt
16-0005_header.txt
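A minimal sketch of how a header name of that form could be derived from a vcf.gz name, assuming the xx-xxxx prefix never contains an underscore and is always followed by one. The function name header_name_for is hypothetical and does not come from the script in the earlier posts.

```shell
#!/bin/bash
# header_name_for FILE: print "<xx-xxxx>_header.txt" for a .vcf.gz path.
# Assumes the numeric prefix ends at the first underscore in the file name.
header_name_for() {
    local bname=${1##*/}          # strip any directory part
    local prefix=${bname%%_*}     # keep everything before the first underscore
    printf '%s_header.txt\n' "$prefix"
}

header_name_for /home/cmccabe/Desktop/NGS/test/16-0000_File-A_variant_strandbias_readcount.vcf.gz
```

Inside the header-extraction loop this could be used as: bcftools view -h "$f" > "/home/cmccabe/Desktop/NGS/test/$(header_name_for "$f")"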

I am trying to match the 16-0000.vcf.gz with the 16-0000.txt and read 16-0000.vcf.gz into $file1 and 16-0000.txt into $file2.

These two variables, $file1 and $file2, would be passed to the reheader command (in bold below), which is part of a loop.

After $file1 and $file2 are processed for 16-0000, the process is repeated for the remaining two files, 16-0002 and 16-0005.

reheader command

logfile=/home/cmccabe/Desktop/NGS/test/process.log
for f in /home/cmccabe/Desktop/NGS/test/*.vcf.gz ; do
     echo "Start vcf edit header creation: $(date) - file: $f"
     bname=`basename $f`
     pref=${bname%%.vcf.gz}
     bcftools reheader -h $file1 $file2 > ${pref}_fixed.vcf.gz
     echo "End edit header creation: $(date) - file: $f"
done >> "$logfile"

There is only one directory that contains the files, and that is /home/cmccabe/Desktop/NGS/test. The numerical prefixes are not being stripped off and used to perform the match between files, nor are the variables being passed to reheader. Thank you :).

ls -l /home/cmccabe/Desktop/NGS/test
total 6116
-rw-rw-r-- 1 cmccabe cmccabe    9350 Feb  3 07:08 16-0000_File-A_variant_strandbias_readcount_header.txt
-rw-rw-r-- 1 cmccabe cmccabe 2031861 Jan 31 11:07 16-0000_File-A_variant_strandbias_readcount.vcf.gz
-rw-rw-r-- 1 cmccabe cmccabe    9353 Feb  3 07:08 16-0002_File-B_variant_strandbias_readcount_header.txt
-rw-rw-r-- 1 cmccabe cmccabe 2361911 Jan 31 11:07 16-0002_File-B_variant_strandbias_readcount.vcf.gz
-rw-rw-r-- 1 cmccabe cmccabe    9351 Feb  3 07:08 16-0005_File-C_variant_strandbias_readcount_header.txt
-rw-rw-r-- 1 cmccabe cmccabe 1816700 Jan 31 11:07 16-0005_File-C_variant_strandbias_readcount.vcf.gz
-rw-rw-r-- 1 cmccabe cmccabe    2622 Feb  3 07:08 process.log
-rw-rw-r-- 1 cmccabe cmccabe    1278 Feb  2 13:17 process.log~

Does this help? Thank you very much :).
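
Going by the listing above, one way the pairing could be done is sketched below. It is only a sketch under stated assumptions: the first 7 characters of each *_header.txt name are the xx-xxxx prefix, exactly one input .vcf.gz shares that prefix, and any *_fixed.vcf.gz left over from an earlier run should be ignored. The function names find_vcf and reheader_all are hypothetical; the bcftools reheader -h call and the _fixed.vcf.gz output name are taken from the command in the posts above. Here the header goes in as the -h argument and the vcf.gz as the input file.

```shell
#!/bin/bash
# Sketch only: pair each *_header.txt with the single .vcf.gz that shares
# its 7-character prefix, then hand the pair to bcftools reheader.

# find_vcf DIR PREFIX: print the one .vcf.gz in DIR whose name starts with
# PREFIX. Returns non-zero unless exactly one candidate exists.
find_vcf() {
    local dir=$1 prefix=$2 f matches=()
    for f in "$dir/$prefix"*.vcf.gz; do
        [ -f "$f" ] || continue                      # unmatched glob stays literal
        case $f in *_fixed.vcf.gz) continue ;; esac  # skip output of earlier runs
        matches+=("$f")
    done
    [ "${#matches[@]}" -eq 1 ] || return 1
    printf '%s\n' "${matches[0]}"
}

# reheader_all DIR: run bcftools reheader on every header/vcf pair in DIR.
reheader_all() {
    local dir=$1 header prefix vcf
    for header in "$dir"/*_header.txt; do
        [ -f "$header" ] || continue
        prefix=${header##*/}          # drop the directory part
        prefix=${prefix:0:7}          # keep the xx-xxxx prefix
        if ! vcf=$(find_vcf "$dir" "$prefix"); then
            printf 'No single .vcf.gz matching %s found.\n' "$header" >&2
            continue
        fi
        echo "Start vcf edit header creation: $(date) - file: $vcf"
        bcftools reheader -h "$header" "$vcf" > "${vcf%.vcf.gz}_fixed.vcf.gz"
        echo "End edit header creation: $(date) - file: $vcf"
    done
}
```

It would be called as: reheader_all /home/cmccabe/Desktop/NGS/test >> /home/cmccabe/Desktop/NGS/test/process.log

Looping over the header files and looking up the vcf.gz inside the same loop keeps both names in scope when reheader runs, which is the step the separate "match" and "edit the header" loops were missing.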