Bash lookup matching digits for second file

In the bash script below, the user selects the file to be used. The leading digits of each file name are unique and are used to automatically locate the next file in the process. The problem I cannot seem to fix is that the second portion needs to reference the full path of the matched file, and currently it does not. Is there a better way? Thank you :).

Files to select from (user selects 123_base_counts.txt)

123_base_counts.txt
456_base_counts.txt

Files matched by the digits in the selected file

123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final (this one is automatically selected because it has the same starting digits as the original file)
456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final

These are all files in the directory:

123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt

Bash

FILESDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/bedtools
ANNOVARDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/vcf/overall/annovar

PS3="please select a file1 to analyze with a panel: " # specify file
select file1 in $(cd ${FILESDIR};ls);do break;done
        file1=`basename ${FILESDIR}/${file1}`
        printf "FILE is: ${file1} and will be used to filter reads, identify target bases and genes less than 20 and 30 reads, create a low coverage bed for vizulization, calculate 20x and 30x coverage, and filter the vcf for the 98 gene epilepsy panel"
logfile=/home/cmccabe/Desktop/NGS/API/5-14-2016/process.log
for file1 in /home/cmccabe/Desktop/NGS/API/5-14-2016/bedtools/$file1; do
     bname=$(basename $file1)
     pref=${bname%%.txt}
     grep -wFf /home/cmccabe/Desktop/NGS/panels/EPILEPSY_unix_trim.bed $file1 > /home/cmccabe/Desktop/NGS/API/5-14-2016/panel/reads/${pref}_EPILEPSY.txt
     done >> "$logfile"
# filter vcf
printf "\n\n"
printf "These are all vcf files in the directory:  \n"
ls ${ANNOVARDIR}
file1=`basename ${FILESDIR}/${file1}`  # file matched
file2=`basename ${ANNOVARDIR}/${file1%%_*}`
path=${ANNOVARDIR}/${file1%%_*}
     printf "The matching identifier for file2 is: ${file2} and will be used filtered using the epilepsy genes\n"
     echo "The full filename is $path"

Current output

1) 123_base_counts.txt
2) 456_base_counts.txt 
please select a file1 to analyze with a panel: 1
FILE is: 123_base_counts.txt and will be used to filter reads, identify target bases and genes less than 20 and 30 reads, create a low coverage bed for vizulization, calculate 20x and 30x coverage, and filter the vcf for the 98 gene epilepsy panel

These are all vcf files in the directory:  
123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt

The matching identifier for file2 is: 123 and will be used filtered using the epilepsy genes
The full file name is /home/cmccabe/Desktop/NGS/API/5-14-2016/vcf/overall/annovar/123

We have seen most of this in earlier threads, and we understand why your current bash code produces the output it does (although I don't understand some of your code, which seems to just create extra work for you).

What I don't understand is what output you are hoping to create that is different from the output you are currently getting?
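
For reference, here is a minimal sketch of why the reported path stops at the digits. It only illustrates the parameter expansion used in the script above; the values of `file1` and `ANNOVARDIR` are copied from the example in the question, and `prefix` is a name introduced here for illustration.

```shell
#!/bin/bash
# Values copied from the example in the question.
ANNOVARDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/vcf/overall/annovar
file1=123_base_counts.txt

# %%_* deletes the longest suffix matching "_*", that is, everything from
# the first underscore onward, leaving only the leading digits.
prefix=${file1%%_*}

# No wildcard follows the prefix, so nothing asks the shell to look for a
# file name here: the result is just the directory plus the digits.
path=${ANNOVARDIR}/${prefix}

printf 'prefix: %s\n' "$prefix"
printf 'path:   %s\n' "$path"
```

This prints `123` as the prefix and a path ending in `/annovar/123`, which matches the "full filename" line in the output shown above.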


The bash below (though not optimized) yields the desired result for one entry. That is, depending on the digits in the file manually selected in the first step, the second file is selected automatically using the matching digits, along with its full path. The problem is that this seems to work for the first file but not for the others. Thank you :).

file manually selected: 123_base_counts.txt

123_base_counts.txt
456_base_counts.txt

file selected automatically using the matching digits in (/home/user)

123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt

bash

# manual selection of file
FILESDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/bedtools
ANNOVARDIR=/home/user

PS3="please select a file to analyze with a panel: " # specify file1
select file1 in $(cd ${FILESDIR};ls);do break;done
          file1=`basename ${FILESDIR}/${file1}`
          printf "FILE is: ${file1} and will be used\n"

# automatic file based on match
FILESDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/bedtools # match directory
ANNOVARDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/vcf/overall/annovar # search directory
printf "\n\n"
printf "These are all vcf files in the directory: \n"
ls ${ANNOVARDIR}
file1=`basename ${FILESDIR}/${file1}`  # file matched
file2=(${ANNOVARDIR}/${file1%%_*}*)
     printf "file2 is: ${file2} and will be used\n"

output

1) 123_base_counts.txt 
2) 456_base_counts.txt 

please select a file to analyze with a panel: 1
FILE is: 123_base_counts.txt and will be used to filter reads, identify target bases and genes less than 20 and 30 reads, create a low coverage bed for visualization, calculate 20x and 30x coverage, and filter the vcf for the 98 gene epilepsy panel

These are all files in the new directory: 
123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
file2 is: /home/cmccabe/Desktop/NGS/API/5-14-2016/vcf/overall/annovar/123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt and will be used 

Results the second time:
1) 123_base_counts.txt  
2) 456_base_counts.txt


please select a file to analyze with a panel: 2
FILE is: 456_base_counts.txt and will be used

These are all files in the new directory: 
123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt
file2 is: /home/user/123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt and will be used

Why are you using an array for file2? I thought there was supposed to be a single file in both directories starting with the string that is the number before the first underscore in the name of the file selected in the first directory?

What output do you get when you run your script with tracing enabled:

bash -xv your_script_name
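
If tracing the whole script is too noisy, tracing can also be switched on only around the matching step with `set -x` and `set +x`. The following is a small sketch of that idea; the value assigned to `file1` is a stand-in for whatever `select` returned, and `prefix` is a name introduced here for illustration.

```shell
#!/bin/bash
# Stand-in for the file name chosen with select.
file1=456_base_counts.txt

set -x                   # print each command to stderr after expansion
prefix=${file1%%_*}      # the trace shows the value actually assigned
set +x                   # stop tracing

printf 'prefix: %s\n' "$prefix"
```

The trace lines go to standard error, so they do not end up in anything the script redirects from standard output.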

For a shell that has both select and arrays (neither of which are required by the standards), the following seems to work, if I correctly understand what you're trying to do:

#!/bin/ksh
# manual selection of file
ANNOVARDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/vcf/overall/annovar
FILESDIR=/home/cmccabe/Desktop/NGS/API/5-14-2016/bedtools
PS3="please select a file to analyze with a panel: "

cd "$FILESDIR"
select file1 in $(ls)
do	[ "$file1" != "" ] && break
done
printf "FILE is: ${file1} and will be used\n\n"

# automatic file based on match
cd "$ANNOVARDIR"
printf "These are all vcf files in the directory:\n"
ls
file2=("${file1%%_*}"*)
file2=$PWD/$file2
printf "file2 is: $file2 and will be used\n"

Shells that provide both select and arrays include recent bash and 1993 or later versions of ksh (there may be others). This has been tested with both ksh (version: 93u+ 2012-08-01) and bash (version: 3.2.57(1)-release (x86_64-apple-darwin15)).
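
As a possible variation for bash only, the lookup can be wrapped in a function that also checks that exactly one file matches. This is a sketch under assumptions, not part of the tested script above: `find_match` is a hypothetical helper name, `shopt -s nullglob` is bash-specific and not available in ksh, and the demo uses a temporary directory as a stand-in for ANNOVARDIR.

```shell
#!/bin/bash
# find_match DIR FILE (hypothetical helper): print the full path of the one
# file in DIR whose name starts with the digits before the first underscore
# of FILE, or fail if there is not exactly one such file.
find_match() {
    local dir=$1 prefix=${2%%_*} matches
    shopt -s nullglob                 # an unmatched glob expands to nothing
    matches=("$dir/${prefix}_"*)
    shopt -u nullglob                 # note: turns nullglob off unconditionally
    if [ "${#matches[@]}" -ne 1 ]; then
        printf 'expected 1 match for %s in %s, found %d\n' \
            "$prefix" "$dir" "${#matches[@]}" >&2
        return 1
    fi
    printf '%s\n' "${matches[0]}"
}

# Demo in a throwaway directory standing in for ANNOVARDIR.
demo=$(mktemp -d)
touch "$demo/123_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt" \
      "$demo/456_variant_strandbias_readcount.vcf.hg19_multianno_removed_final.txt"

file2=$(find_match "$demo" 456_base_counts.txt)
printf 'file2 is: %s and will be used\n' "$file2"
```

Including the underscore in the pattern keeps a prefix such as 12 from also matching 123_..., and the count check makes a missing or ambiguous match fail loudly instead of silently picking the first array element.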

What do you recommend? You are right that

I run the bash as part of a shell script, download.sh? Thank you :).

We may have crossed paths... See if post #5 helps.

Thank you very much for your help :)