Bash to move specific files to directory based on match to file

I am trying to mv each of the .vcf files in the variants folder to the folder in /home/cmccabe/f2 whose run is listed together with that .vcf id in file . $2 in file will always be the id of a .vcf in the variants folder. The line starting with R_2019 in file , up to the -v5.6 , will always be an exact match to a folder in /home/cmccabe/f2 . There may be multiple folders in /home/cmccabe/f2 , but each will have only one match in file . There may also be multiple ids, but there is always only one .vcf per id in /home/cmccabe/f1/variants .

When a match is found between the folder in /home/cmccabe/f2 and the R_ line in file , the id(s) in $2 will be found in /home/cmccabe/f1/variants as a .vcf . Each .vcf is then moved to the matching folder in /home/cmccabe/f2 , into the variants sub-folder of its id. This is the last step of a procedure that I am stuck on. I have included an attempt in bash with comments, but I'm sure there is a better way. Thank you :).

file in /home/cmccabe/f1

IonCode_0007 19-0004-La-Fi
IonCode_0009 19-0005-Last-Firs
IonCode_0011 19-0008-LastN-FirstN
IonCode_0013 190320-Control
R_2019_03_12_13_59_54_user_S5-0271-100-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

IonCode_0005 19-0000-LastName-FirstName
IonCode_0001 19-0001-Las-Fir
IonCode_0003 190319-Control
R_2019_03_12_11_10_20_user_S5-0271-99-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

variants folder in /home/cmccabe/f1

19-0000-LastName-FirstName.vcf
19-0001-Las-Fir.vcf
190319-Control.vcf
19-0004-La-Fi.vcf
19-0005-Last-Firs.vcf
19-0008-LastN-FirstN.vcf
190320-Control.vcf

current structure of /home/cmccabe/f2

R_2019_03_12_11_10_20_user_S5-0271-99   --- parent directory ---
     - bam    --- sub-folder ---
     - qc     --- sub-folder ---
     - 19-0000-LastName-FirstName
             - variants
     - 19-0001-Las-Fir
             - variants
     - 190319-Control
             - variants
R_2019_03_12_13_59_54_user_S5-0271-100   --- parent directory ---
     - bam    --- sub-folder ---
     - qc     --- sub-folder ---
     - 19-0004-La-Fi
             - variants
     - 19-0005-Last-Firs
             - variants
     - 19-0008-LastN-FirstN
             - variants
     - 190320-Control
             - variants

desired structure of /home/cmccabe/f2

R_2019_03_12_11_10_20_user_S5-0271-99   --- parent directory ---
     - bam    --- sub-folder ---
     - qc     --- sub-folder ---
     - 19-0000-LastName-FirstName
             - variants
                   19-0000-LastName-FirstName.vcf
     - 19-0001-Las-Fir
             - variants
                   19-0001-Las-Fir.vcf
     - 190319-Control
             - variants
                   190319-Control.vcf
R_2019_03_12_13_59_54_user_S5-0271-100   --- parent directory ---
     - bam    --- sub-folder ---
     - qc     --- sub-folder ---
     - 19-0004-La-Fi
             - variants
                   19-0004-La-Fi.vcf
     - 19-0005-Last-Firs
             - variants
                   19-0005-Last-Firs.vcf
     - 19-0008-LastN-FirstN
             - variants
                   19-0008-LastN-FirstN.vcf
     - 190320-Control
             - variants
                   190320-Control.vcf

possible bash

for file in /home/cmccabe/f1/variants/*.vcf ; do
  bname=$(basename $file) # strip of path
  VCF="$(echo $bname|cut -d. -f1)" # remove .vcf extension
     f=$(printf '%s' /home/cmccabe/f1/file/${VCF})  ## # Find matching id
       FILE2=$(awk '{print $2}' $f') # set VCF lookup to column
          for RDIR in "$DIR"/R_2019* ; do FOLDER=${RDIR%%-v5.6*}; done  ## trim folder match in RDIR from -v5.6 and store in FOLDER
          if [[ $VCF = $FILE2 ]] # only execute file on match
                 then
                    mkdir -p /home/cmccabe/f2/$FOLDER/variants  ## create variants sub-folder
                   mv /home/cmccabe/f1/file/$VCF /home/cmccabe/f2/$FOLDER/$VCF/variants  ## move vcf to folder/id/variants
          fi  ## end if
done  ## close loop

What operating system are you using for this exercise?

It seems that the text description of your problem says that everything you need to find the files to be moved and the locations to which they should be moved is found in a file named /home/cmccabe/f1/file , but your script is treating that regular file as a directory. What am I missing?
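
To illustrate the difference, here is a minimal sketch. It assumes a scratch directory from mktemp standing in for /home/cmccabe/f1 and a one-line sample of file ; the variable names kind , under and listed exist only for this demonstration. A path built as file/$VCF cannot exist while file is a regular file, so the id has to be looked up in the contents of file instead.

```bash
#!/usr/bin/env bash
# Assumption: a scratch directory stands in for /home/cmccabe/f1.
tmp=$(mktemp -d)
printf 'IonCode_0007 19-0004-La-Fi\n' > "$tmp/file"
VCF=19-0004-La-Fi

# "file" is a regular file, so nothing can live underneath it.
if [[ -f $tmp/file ]]; then kind=regular-file; else kind=other; fi
if [[ -e $tmp/file/$VCF ]]; then under=exists; else under=missing; fi

# Look the id up in column 2 of the file's contents instead.
if awk -v id="$VCF" '$2 == id { found = 1 } END { exit !found }' "$tmp/file"
then listed=yes
else listed=no
fi

echo "file is a $kind; file/$VCF is $under; id listed in file: $listed"
rm -rf "$tmp"
```

The awk test compares the whole second field, so an id that is only a prefix of another id does not count as a match.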

Furthermore, you go to a lot of work to create a variable named VCF which contains the name of a file after stripping off the .vcf filename extension. But when you start moving the .vcf files, you use $VCF as the name of those files without reinstating the filename extension???
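
A small sketch of that point, using one of the sample paths from the first post purely as an illustration (the src variable is a made-up name): shell parameter expansion can strip the directory and the extension without calling basename or cut, and the extension is put back when the source path for mv is built.

```bash
#!/usr/bin/env bash
# Sample path taken from the first post, used only as an illustration.
file=/home/cmccabe/f1/variants/19-0004-La-Fi.vcf

bname=${file##*/}    # strip the directory part  -> 19-0004-La-Fi.vcf
VCF=${bname%.vcf}    # strip only the extension  -> 19-0004-La-Fi

# The bare id is not a filename; the extension has to be added back
# when the path handed to mv is built.
src="/home/cmccabe/f1/variants/${VCF}.vcf"

printf '%s\n' "$bname" "$VCF" "$src"
```

Unlike cut -d. -f1 , the %.vcf expansion removes only the trailing .vcf , so an id that happened to contain a dot would survive intact.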

I then got completely lost when you started a loop on all of the R_2019* files in $DIR . Note that the DIR variable is never defined in your script and is never mentioned in your description of what you are trying to do.
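
If the intent of that loop was to cut the R_2019 line down to the folder name, one possible sketch is below. It assumes the R_ lines are read from file itself (fed here from a here-document copy of four lines from the first post) rather than from a directory listing; the folders array is a name invented for the example.

```bash
#!/usr/bin/env bash
folders=()
while IFS= read -r line; do
    case $line in
        # Keep only lines that start with R_ and drop "-v5.6" plus everything after it.
        R_*) folders+=("${line%%-v5.6*}") ;;
    esac
done <<'EOF'
IonCode_0013 190320-Control
R_2019_03_12_13_59_54_user_S5-0271-100-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
IonCode_0003 190319-Control
R_2019_03_12_11_10_20_user_S5-0271-99-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
EOF

printf '%s\n' "${folders[@]}"
```

Each trimmed value then has the same form as the parent directory names shown for /home/cmccabe/f2 .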

I'm having a hard time guessing at what files are being processed by the code:

 FILE2=$(awk '{print $2}' $f')

(which should have "$f" instead of $f' ). I'm guessing that this will set FILE2 to a list of filenames that you are then treating as a single filename, but since I don't know the contents of the file selected by $f , I'm lost.
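
To make that guess concrete, here is a hedged sketch. It assumes the awk command is pointed at a file laid out like one block of file from the first post (written to a scratch file here); the ids array and the match variable are names made up for the example.

```bash
#!/usr/bin/env bash
# Assumption: a scratch copy of one block of "file" from the first post.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
IonCode_0005 19-0000-LastName-FirstName
IonCode_0001 19-0001-Las-Fir
IonCode_0003 190319-Control
R_2019_03_12_11_10_20_user_S5-0271-99-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
EOF

# A plain command substitution yields ONE string with embedded newlines.
FILE2=$(awk '{print $2}' "$tmp")
VCF=19-0001-Las-Fir
if [[ $VCF = "$FILE2" ]]; then match=yes; else match=no; fi

# Reading the ids into an array keeps them apart (mapfile needs bash 4 or later).
mapfile -t ids < <(awk 'NF == 2 {print $2}' "$tmp")

echo "single-string comparison matched: $match"
echo "number of ids in the array: ${#ids[@]}"
rm -f "$tmp"
```

The NF == 2 condition skips the R_2019 line, which has only one field and would otherwise contribute an empty entry.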

I'm assuming that you have tried running your script and it is failing to work. What diagnostics is it printing, or if there aren't any, in what way is it failing to do what you want it to do?

Please indent your code to show its structure. Then comments like "end if" and "end loop" won't be needed and we won't have to wonder where the start of the "if" and "loop" are located. I know the shell doesn't care about indentation, but you are a human and you're asking humans on this forum to read your code. Lack of indentation makes it more difficult for humans (including you) to understand what your code is trying to do.

I am using Ubuntu 14.04 as my OS.

/home/cmccabe/f1/file is the path to file , which has all the information needed for the move (folder name and ids).

The for loop on RDIR was meant to trim the R_2019 line in file so that it matches the folder name in /home/cmccabe/f2 , but DIR is undefined and maybe should be /home/cmccabe/f1/file . The FILE2=$(awk '{print $2}' $f') was then intended to read each id from file into FILE2 . The code executes but nothing is moved, and set -x shows the variables not being populated correctly, as you already knew :). I indented the code above, but I add comments to help me learn and to help me with my logic. Thank you for your help :).

I rewrote the script (well, a portion of it) and most of the variables seem good: $STRING is the same as FILE2 , I just changed the name to hopefully be more clear, as I am looking for a string. However, the loop is not working, so only the first id is retained in $STRING . I think I am on the right track, but is there a better way? Thank you :).

set -x
DIR=/home/cmccabe/f1
DEST=/home/cmccabe/f2
for file in "$DIR"/variants/*.vcf ; do
  bname=$(basename $file) # strip of path
    VCF="$(echo $bname|cut -d. -f1)" # remove .vcf extension   
  for f in "$DIR"/file; do STRING=( $(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file) ); echo "This is the string" "$STRING"; done
done

set -x

cmccabe@Satellite-M645:~$ set -x
cmccabe@Satellite-M645:~$ DIR=/home/cmccabe/f1
+ DIR=/home/cmccabe/f1
cmccabe@Satellite-M645:~$ DEST=/home/cmccabe/f2
+ DEST=/home/cmccabe/f2
cmccabe@Satellite-M645:~$ for file in "$DIR"/variants/*.vcf ; do
>   bname=$(basename $file) # strip of path
>     VCF="$(echo $bname|cut -d. -f1)" # remove .vcf extension   
>   for f in "$DIR"/file; do STRING=( $(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file) ); echo "This is the string" "$STRING"; done
> done
+ for file in '"$DIR"/variants/*.vcf'
++ basename /home/cmccabe/f1/variants/19-0000-LastName-FirstName.vcf
+ bname=19-0000-LastName-FirstName.vcf
++ echo 19-0000-LastName-FirstName.vcf
++ cut -d. -f1
+ VCF=19-0000-LastName-FirstName
+ for f in '"$DIR"/file'
+ STRING=($(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file))
++ awk '{for(i=2; i<=NF; i++) print $i}' /home/cmccabe/f1/file
+ echo 'This is the string' 19-0000-LastName-FirstName
This is the string 19-0000-LastName-FirstName
+ for file in '"$DIR"/variants/*.vcf'
++ basename /home/cmccabe/f1/variants/19-0002-L-F.vcf
+ bname=19-0002-L-F.vcf
++ echo 19-0002-L-F.vcf
++ cut -d. -f1
+ VCF=19-0002-L-F
+ for f in '"$DIR"/file'
+ STRING=($(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file))
++ awk '{for(i=2; i<=NF; i++) print $i}' /home/cmccabe/f1/file
+ echo 'This is the string' 19-0000-LastName-FirstName
This is the string 19-0000-LastName-FirstName
+ for file in '"$DIR"/variants/*.vcf'
++ basename /home/cmccabe/f1/variants/19-0004-La-Fi.vcf
+ bname=19-0004-La-Fi.vcf
++ echo 19-0004-La-Fi.vcf
++ cut -d. -f1
+ VCF=19-0004-La-Fi
+ for f in '"$DIR"/file'
+ STRING=($(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file))
++ awk '{for(i=2; i<=NF; i++) print $i}' /home/cmccabe/f1/file
+ echo 'This is the string' 19-0000-LastName-FirstName
This is the string 19-0000-LastName-FirstName
+ for file in '"$DIR"/variants/*.vcf'
++ basename /home/cmccabe/f1/variants/19-0020-Las-Fir.vcf
+ bname=19-0020-Las-Fir.vcf
++ echo 19-0020-Las-Fir.vcf
++ cut -d. -f1
+ VCF=19-0020-Las-Fir
+ for f in '"$DIR"/file'
+ STRING=($(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file))
++ awk '{for(i=2; i<=NF; i++) print $i}' /home/cmccabe/f1/file
+ echo 'This is the string' 19-0000-LastName-FirstName
This is the string 19-0000-LastName-FirstName
+ for file in '"$DIR"/variants/*.vcf'
++ basename /home/cmccabe/f1/variants/190319-Control.vcf
+ bname=190319-Control.vcf
++ echo 190319-Control.vcf
++ cut -d. -f1
+ VCF=190319-Control
+ for f in '"$DIR"/file'
+ STRING=($(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file))
++ awk '{for(i=2; i<=NF; i++) print $i}' /home/cmccabe/f1/file
+ echo 'This is the string' 19-0000-LastName-FirstName
This is the string 19-0000-LastName-FirstName
+ for file in '"$DIR"/variants/*.vcf'
++ basename /home/cmccabe/f1/variants/190320-Control.vcf
+ bname=190320-Control.vcf
++ echo 190320-Control.vcf
++ cut -d. -f1
+ VCF=190320-Control
+ for f in '"$DIR"/file'
+ STRING=($(awk '{for(i=2; i<=NF; i++) print $i}' "$DIR"/file))
++ awk '{for(i=2; i<=NF; i++) print $i}' /home/cmccabe/f1/file
+ echo 'This is the string' 19-0000-LastName-FirstName
This is the string 19-0000-LastName-FirstName
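
A note on the trace above, with a small sketch: STRING is a bash array, and expanding an array as "$STRING" gives only element 0, which is why the same id is echoed on every pass even though awk returned all of them. The inner for f in "$DIR"/file loop also runs exactly once, because its list is a single word. The three ids below are copied from the first post as sample data; found is a name made up for the example.

```bash
#!/usr/bin/env bash
# Sample ids standing in for the output of the awk command.
STRING=(19-0000-LastName-FirstName 19-0001-Las-Fir 190319-Control)

echo "first element only: $STRING"        # same as ${STRING[0]}
echo "all elements:       ${STRING[*]}"
echo "element count:      ${#STRING[@]}"

# Checking whether one id is in the array means looping over every element.
VCF=190319-Control
found=no
for id in "${STRING[@]}"; do
    if [[ $id == "$VCF" ]]; then
        found=yes
        break
    fi
done
echo "$VCF found: $found"
```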

I apologize for not getting back to you sooner. (I was distracted for a few days by other activities.)

Have you made any progress on resolving this problem?

I have been able to get a working solution that produces my desired results, using set -x and the modifications below:

if [[ $VCF = ${STRING[*]} ]]  # only execute on match
then
    RSTRING=$(awk '/R_2019/' "$DIR"/run)  ## lines matching the R_2019 pattern
    VCFRUN=$(awk -F '\n' -v RS="" -v ref="$VCF" '$0 ~ ref {print $NF}' "$DIR"/file)  ## last line (the R_2019 line) of the blank-line-separated block that contains $VCF
    RUN="$(echo $RSTRING|cut -d- -f1,2,3)"  ## keep the first three "-"-separated fields, i.e. the run folder name
    mv "$DIR"/variants/${VCF}.vcf "$DEST"/"$RUN"/"$VCF"/variants  ## move vcf to folder in destination
fi

This matched each .vcf and moved the match to the correct run folder. Maybe this will help others as well.
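
For anyone landing here later, the whole task can also be sketched as a single pass over file . This is not the script used above, only one possible arrangement under stated assumptions: file consists of blank-line-separated blocks whose last line is the R_2019 line, a scratch directory from mktemp stands in for /home/cmccabe , and the sample data is a shortened copy of the first post. The names base , ids , run , code and src are made up for the example.

```bash
#!/usr/bin/env bash
# Assumption: a scratch directory stands in for /home/cmccabe.
base=$(mktemp -d)
DIR=$base/f1
DEST=$base/f2
mkdir -p "$DIR/variants"

cat > "$DIR/file" <<'EOF'
IonCode_0007 19-0004-La-Fi
IonCode_0013 190320-Control
R_2019_03_12_13_59_54_user_S5-0271-100-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

IonCode_0005 19-0000-LastName-FirstName
IonCode_0003 190319-Control
R_2019_03_12_11_10_20_user_S5-0271-99-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
EOF

# Sample .vcf files and run folders for the demonstration.
for id in 19-0004-La-Fi 190320-Control 19-0000-LastName-FirstName 190319-Control; do
    : > "$DIR/variants/$id.vcf"
done
mkdir -p "$DEST/R_2019_03_12_13_59_54_user_S5-0271-100" \
         "$DEST/R_2019_03_12_11_10_20_user_S5-0271-99"

# Collect ids until the R_ line of the block is reached, then move them.
ids=()
while IFS= read -r line || [[ -n $line ]]; do
    case $line in
        R_*)
            run=${line%%-v5.6*}                 # folder name in $DEST
            for id in "${ids[@]}"; do
                src=$DIR/variants/$id.vcf
                # Skip ids without a .vcf and runs without a folder.
                [[ -f $src && -d $DEST/$run ]] || continue
                mkdir -p "$DEST/$run/$id/variants"
                mv "$src" "$DEST/$run/$id/variants/"
            done
            ids=()                              # start the next block
            ;;
        '') ;;                                  # blank separator line
        *)  read -r code id <<<"$line"          # column 2 is the id
            ids+=("$id")
            ;;
    esac
done < "$DIR/file"

find "$DEST" -name '*.vcf' | sort
```

Replacing mv with echo mv gives a dry run that only prints what would be moved, which is a cheap check before pointing DIR and DEST at the real folders.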

Thank you very much for your help :).