Rename file using partial match to another

In the code below I am trying to rename the files in the data subfolder of a specific run. Each file name starts with IonCode_0000_ (four digits), and that prefix should be partially matched against $1 in f1. There will be multiple runs in f1, but each run name in $uniq is unique and will be found in f1; the rename values ($2 of the matching lines) are stored in $string. The code is commented with what I think is going on. Thank you :).

f1

IonCode_0404 00-0000-xxx-xxx-xxx
IonCode_0402 11-1111-yy-yy-yyy
R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0402 22-2222-zz-zzzz-zzz
R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0404 10-0000-aa-aa-aa
IonCode_0412 55-1111-bb-bbb-bbb
R_2019_00_00_00_00_00_xxxx_xx1-120-xxx_xxx_xxx_xxx_xx_xx_xx

/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-127
      data   --- sub-folder ---
      IonCode_0402_xxx.xxx_xxx.bam
      IonCode_0402_xxx.xxx_xxx.bam.bai
      IonCode_0404_xxx.xxx_xxx.bam
      IonCode_0404_xxx.xxx_xxx.bam.bai

dir=/path/to/run/
for run in "$dir"/R_2019* ; do  ## # matching "R_2019*" to operate on desired directory and expand
  uniq=${run##*/}  ## store run with no path as s5
  cd "$dir"/"$uniq"/data  ## change directory to subfolder
   string=$(awk -F '\n' -v RS="" -v ref="$uniq" '$0 ~ ref {d=split($0, val, " "); for(i=2;i<d;i+=2) printf "%s ",val[i]; printf "\n"}' "$dir"/f1)  ## loop through f1 for unique run and store $2 in string
   for $f in "$dir"/"$s5"/data/*.bam* ; do sample_basename=$(basename "${f}") ;
     rename_file_path="$string" ## define rename string
     cmd=$(sed -n "/$f/,/IonCode_[0-9][0-9][0-9][0-9]_*/{s/\(.*\.bam\) \(.*\)/mv \1 \2/g}" $rename_file_path)  ## rename file in data subfolder matching IonCode_ to f1 and replacing with $2 of f1
  done
done

desired in data

11-1111-yy-yy-yyy_test.bam
11-1111-yy-yy-yyy_test.bam.bai
00-0000-xxx-xxx-xxx_test.bam
00-0000-xxx-xxx-xxx_test.bam.bai

You were pretty close on this. I feed the awk output into a while read block with a here-string ( <<< ). Because the loop changes directory, the cd and the renames run in a subshell, which ensures the working directory is reset after each rename loop.
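
The subshell point can be checked in isolation. A minimal sketch, assuming / and /tmp exist on the system (both are placeholder directories, not part of the original setup):

```shell
#!/bin/bash
# Sketch: a cd inside ( ... ) does not change the caller's working directory.
start=$PWD
for d in / /tmp; do              # placeholder directories, assumed to exist
    (
        cd "$d" || exit 1        # exit leaves only the subshell
        echo "inside: $PWD"
    )
done
end=$PWD                         # still equal to $start
echo "outside: $end"
```

Without the parentheses, each cd would persist and any later relative path would be resolved from the wrong directory.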

Change echo mv to mv in the script below once you are happy with what it's doing.

dir=/path/to/run/
for run in "$dir"/R_2019* ; do  ## match "R_2019*" to pick up each run directory
  uniq=${run##*/}  ## store run name without its path as uniq
  while read from to
  do
     (
       cd "$dir"/"$uniq"/data
       for file in *.bam*
       do
          newname=${file/$from/$to}
          [ -f "$file" ] && [ "$newname" != "$file" ] && echo mv "$file" "$newname"
       done
     )
  done <<<$(
     awk -F '\n' -v RS="" -v ref="$uniq" '
         $0 ~ ref {
             d=split($0, val);
             for(i=1;i<d;i++) print val[i];
          }' "$dir"/f1
  )  ## loop through f1 for unique run and populate from and to
done
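
To see what the awk part hands to while read, it can be run on its own. A sketch, assuming a temporary copy of the first two blocks of f1 (the temporary directory and the pairs variable are illustration only). With RS="" awk reads one blank-line-separated block per record, and -F '\n' makes each line of the block one field:

```shell
#!/bin/bash
# Sketch: print the "from to" lines that awk extracts for one run.
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
IonCode_0404 00-0000-xxx-xxx-xxx
IonCode_0402 11-1111-yy-yy-yyy
R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0402 22-2222-zz-zzzz-zzz
R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx
EOF

uniq=R_2019_00_00_00_00_00_xxxx_xx1-127
pairs=$(awk -F '\n' -v RS="" -v ref="$uniq" '
    $0 ~ ref {
        d = split($0, val)                    # one array entry per line
        for (i = 1; i < d; i++) print val[i]  # skip the last line (the run name)
    }' "$tmp/f1")
printf '%s\n' "$pairs"
rm -rf "$tmp"
```

For the 127 run this prints the two IonCode lines of that block, one from/to pair per line, and nothing from the 126 block.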

It looks like the last file in the directory is getting renamed with both matching values rather than with its own unique one. Is another loop needed, or should all the values be on one line instead of on separate lines? Thank you :).

mv IonCode_0404_xxx.xxx_xxx.bam 00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_xxx.xxx_xxx.bam
mv IonCode_0404_xxx.xxx_xxx.bam.bai 00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_xxx.xxx_xxx.bam.bai

desired

11-1111-yy-yy-yyy_test.bam   ---- this is IonCode_0402_xxx.xxx_xxx ---
11-1111-yy-yy-yyy_test.bam.bai   ---- this is IonCode_0402_xxx.xxx_xxx ---
00-0000-xxx-xxx-xxx_test.bam   ---- this is IonCode_0404_xxx.xxx_xxx ---
00-0000-xxx-xxx-xxx_test.bam.bai ---- this is IonCode_0404_xxx.xxx_xxx ---

It looks a bit complicated.
Perhaps you want to do the following?

dir=/path/to/run
ind=0
while read a b c
do
  if [ -n "$b" ]
  then
    fsearch[ind]=$a
    mvto[ind]=$b
    ((ind++))
  elif [ -z "$a" ]
  then
    ind=0
  else
    while [ $ind -gt 0 ]
    do
      ((ind--))
      echo "In $dir/$a/data/ rename ${fsearch[ind]}*.bam* to ${mvto[ind]}_test.bam*" 
    done
  fi
done < "$dir/f1"
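
The loop above only reports what it would do. A sketch of the same line-by-line idea that builds the actual mv commands; rename_from_f1 is a hypothetical helper, and it assumes each run directory is named exactly like its R_... line in f1 and contains a data subfolder:

```shell
#!/bin/bash
# Sketch: collect "IonCode new-name" pairs until the run line, then rename.
rename_from_f1() {
    local dir=$1 mv_cmd="${2:-echo mv}"  # pass "mv" as 2nd argument to rename
    local -a fsearch=() mvto=()
    local a b i file
    while read -r a b; do
        if [ -n "$b" ]; then             # "IonCode_NNNN new-name" line
            fsearch+=("$a")
            mvto+=("$b")
        elif [ -n "$a" ]; then           # run line: apply the collected pairs
            for i in "${!fsearch[@]}"; do
                for file in "$dir/$a/data/${fsearch[i]}"_*.bam*; do
                    [ -f "$file" ] || continue
                    # ${file##*.bam} keeps whatever follows .bam (e.g. .bai)
                    $mv_cmd "$file" "$dir/$a/data/${mvto[i]}_test.bam${file##*.bam}"
                done
            done
            fsearch=() mvto=()
        else                             # blank line: start a new block
            fsearch=() mvto=()
        fi
    done < "$dir/f1"
}
```

Called as rename_from_f1 /path/to/run it prints the mv commands; rename_from_f1 /path/to/run mv performs them.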

What do you want to happen here? f1 requires that IonCode_0404 in directory *127* be renamed to both 10-0000-aa-aa-aa and 00-0000-xxx-xxx-xxx . Is this a mistake in the data file or how should the script handle this?

Change the newname assignment to this: newname=${file/$from*.bam/${to}_test.bam}
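
A small sketch of what that expansion does to the sample names, assuming bash (the ${var/pattern/replacement} form is not POSIX sh). The from and to values are taken from the first line of f1:

```shell
#!/bin/bash
# Sketch: replace "<from>...bam" with "<to>_test.bam", keeping any suffix.
from=IonCode_0404
to=00-0000-xxx-xxx-xxx
for file in IonCode_0404_xxx.xxx_xxx.bam \
            IonCode_0404_xxx.xxx_xxx.bam.bai \
            IonCode_0402_xxx.xxx_xxx.bam; do
    newname=${file/$from*.bam/${to}_test.bam}
    echo "$file -> $newname"
done
```

The two IonCode_0404 names become 00-0000-xxx-xxx-xxx_test.bam and 00-0000-xxx-xxx-xxx_test.bam.bai. The IonCode_0402 name does not match the pattern and comes back unchanged, which is why the script tests "$newname" != "$file" before calling mv.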


It looks like only one pair gets renamed, and with both values. Thank you :).

4 original files:

IonCode_0402_xxx.xxx_xxx.bam
IonCode_0402_xxx.xxx_xxx.bam.bai
IonCode_0404_xxx.xxx_xxx.bam
IonCode_0404_xxx.xxx_xxx.bam.bai

Current:

IonCode_0402_xxx.xxx_xxx.bam
IonCode_0402_xxx.xxx_xxx.bam.bai
00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_test.bam.bam
00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_test.bam.bam.bai

Desired after rename:

11-1111-yy-yy-yyy_test.bam   ---- this is IonCode_0402_xxx.xxx_xxx ---
11-1111-yy-yy-yyy_test.bam.bai   ---- this is IonCode_0402_xxx.xxx_xxx ---
00-0000-xxx-xxx-xxx_test.bam   ---- this is IonCode_0404_xxx.xxx_xxx ---
00-0000-xxx-xxx-xxx_test.bam.bai ---- this is IonCode_0404_xxx.xxx_xxx ---

I further qualified my question (see #5 above) - this appears to be a problem with the data file.

If the actual renames were done instead of echo, only the first match would apply: the file would then have a different name, so the second rename would not be attempted. The last two lines of the output below (originally highlighted in red) will not occur, because the files have already been renamed by lines 1 and 2:

$ ./cmccabe_rename 
mv IonCode_0404_xxx.xxx_xxx.bam 00-0000-xxx-xxx-xxx_test.bam
mv IonCode_0404_xxx.xxx_xxx.bam.bai 00-0000-xxx-xxx-xxx_test.bam.bai
mv IonCode_0402_xxx.xxx_xxx.bam 11-1111-yy-yy-yyy_test.bam
mv IonCode_0402_xxx.xxx_xxx.bam.bai 11-1111-yy-yy-yyy_test.bam.bai
mv IonCode_0404_xxx.xxx_xxx.bam 10-0000-aa-aa-aa_test.bam
mv IonCode_0404_xxx.xxx_xxx.bam.bai 10-0000-aa-aa-aa_test.bam.bai

I'm not sure I understand completely, but in f1 the same IonCode may appear multiple times. However, the value in uniq is always unique, and each IonCode listed above a uniq line (back to the preceding blank line in f1) will be found in data as a pair. That is, f1 has IonCode_0404 and data has IonCode_0404.bam and IonCode_0404.bam.bai; f1 has IonCode_0402 and data has IonCode_0402.bam and IonCode_0402.bam.bai.
Both IonCode pairs are renamed with the $2 value from the matching IonCode line above uniq, with _test after it. Thank you very much :).

That is not the case in the demo f1 from post #1, where the uniq value of the last block (originally highlighted in red) is a duplicate:

IonCode_0404 00-0000-xxx-xxx-xxx
IonCode_0402 11-1111-yy-yy-yyy
R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0402 22-2222-zz-zzzz-zzz
R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0404 10-0000-aa-aa-aa
IonCode_0412 55-1111-bb-bbb-bbb
R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx

My apologies, I have corrected the typo in post 1 and here as well. All 3 uniq values in f1 will always be different; I just transcribed them wrong. Line 3 (the duplicate) will never be there (computers make fewer mistakes). Thank you :).

IonCode_0404 00-0000-xxx-xxx-xxx
IonCode_0402 11-1111-yy-yy-yyy
R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0402 22-2222-zz-zzzz-zzz
R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx

IonCode_0404 10-0000-aa-aa-aa
IonCode_0412 55-1111-bb-bbb-bbb
R_2019_00_00_00_00_00_xxxx_xx1-120-xxx_xxx_xxx_xxx_xx_xx_xx

I added an echo "These are the files:" $file statement; the files in data before the script executes are:

These are the files: IonCode_0402_xxx.xxx_xxx.bam
These are the files: IonCode_0402_xxx.xxx_xxx.bam.bai
These are the files: IonCode_0404_xxx.xxx_xxx.bam
These are the files: IonCode_0404_xxx.xxx_xxx.bam.bai

after the script executes:

00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy IonCode_0402 22-2222-zz-zzzz-zzz IonCode_0404 10-0000-aa-aa-aa IonCode_0412 55-1111-bb-bbb-bbb_test.bam
00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy IonCode_0402 22-2222-zz-zzzz-zzz IonCode_0404 10-0000-aa-aa-aa IonCode_0412 55-1111-bb-bbb-bbb_test.bam.bai
IonCode_0402_xxx.xxx_xxx.bam
IonCode_0402_xxx.xxx_xxx.bam.bai

Thank you :).

I don't know what that "after the script executes" listing is showing. Are your filenames ending up with spaces in them, as shown above?

Here is the script I'm using:

dir=/path/to/run/
for run in "$dir"/R_2019* ; do  ## match "R_2019*" to pick up each run directory
  uniq=${run##*/}  ## store run name without its path as uniq
  while read from to
  do
     (
       cd "$dir"/"$uniq"/data
       for file in *.bam*
       do
          newname=${file/$from*.bam/${to}_test.bam}
          [ -f "$file" ] && [ "$newname" != "$file" ] && mv "$file" "$newname"
       done
     )
  done <<<$(
     awk -F '\n' -v RS="" -v ref="$uniq" '
         $0 ~ ref {
             d=split($0, val);
             for(i=1;i<d;i++) print val[i];
          }' "$dir"/f1
  )  ## loop through f1 for unique run and populate from and to
done

And here is my test:

$ find /path/to/run -type f -print
/path/to/run/f1
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx/data/IonCode_0402_xxx.xxx_xxx.bam
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx/data/IonCode_0402_xxx.xxx_xxx.bam.bai
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx/data/IonCode_0404_xxx.xxx_xxx.bam
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx/data/IonCode_0404_xxx.xxx_xxx.bam.bai
$ ./cmccabe_rename 
$ find /path/to/run -type f -print
/path/to/run/f1
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx/data/22-2222-zz-zzzz-zzz_test.bam
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-126-xxx_xxx_xxx_xxx_xx_xx_xx/data/22-2222-zz-zzzz-zzz_test.bam.bai
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx/data/00-0000-xxx-xxx-xxx_test.bam
/path/to/run/R_2019_00_00_00_00_00_xxxx_xx1-127-xxx_xxx_xxx_xxx_xx_xx_xx/data/00-0000-xxx-xxx-xxx_test.bam.bai

After the rename script runs, only one pair of files is renamed, with both values in the name and a space in between. This is shown above, but I'm not sure why. Your output looks good. Thank you :).

Here is what I get:

with echo mv

mv IonCode_0402_xxx.xxx_xxx.bam _test.bam
mv IonCode_0402_xxx.xxx_xxx.bam.bai _test.bam.bai
mv IonCode_0404_xxx.xxx_xxx.bam _test.bam
mv IonCode_0404_xxx.xxx_xxx.bam.bai _test.bam.bai

with mv

_test.bam
_test.bam.bai

Thank you :).

Are you using bash shell? Can you post output with this additional debugging:

#!/bin/bash
dir=/path/to/run/
for run in "$dir"/R_2019* ; do  ## match "R_2019*" to pick up each run directory
  uniq=${run##*/}  ## store run name without its path as uniq
  while read from to
  do
     (
       cd "$dir"/"$uniq"/data
       echo "Rename from:$from to:$to"
       for file in *.bam*
       do
          newname=${file/$from*.bam/${to}_test.bam}
          [ -f "$file" ] && [ "$newname" != "$file" ] && echo mv "$file" "$newname"
       done
     )
  done <<<$(
     awk -F '\n' -v RS="" -v ref="$uniq" '
         $0 ~ ref {
             d=split($0, val);
             for(i=1;i<d;i++) print val[i];
          }' "$dir"/f1
  )  ## loop through f1 for unique run and populate from and to
done

Here is how this looks with my setup:

$ ./cmccabe_rename 
Rename from:IonCode_0404 to:10-0000-aa-aa-aa
Rename from:IonCode_0412 to:55-1111-bb-bbb-bbb
Rename from:IonCode_0402 to:22-2222-zz-zzzz-zzz
mv IonCode_0402_xxx.xxx_xxx.bam 22-2222-zz-zzzz-zzz_test.bam
mv IonCode_0402_xxx.xxx_xxx.bam.bai 22-2222-zz-zzzz-zzz_test.bam.bai
Rename from:IonCode_0404 to:00-0000-xxx-xxx-xxx
mv IonCode_0404_xxx.xxx_xxx.bam 00-0000-xxx-xxx-xxx_test.bam
mv IonCode_0404_xxx.xxx_xxx.bam.bai 00-0000-xxx-xxx-xxx_test.bam.bai
Rename from:IonCode_0402 to:11-1111-yy-yy-yyy

Here is the output, yes I am using bash shell. Thank you :).

 with echo
Rename from: to:
mv IonCode_0402_xxx.xxx_xxx.bam _test.bam
mv IonCode_0402_xxx.xxx_xxx.bam.bai _test.bam.bai
mv IonCode_0404_xxx.xxx_xxx.bam _test.bam
mv IonCode_0404_xxx.xxx_xxx.bam.bai _test.bam.bai

Files in directory to rename

ll
total 8
drwxr--r-- 2 cmccabe cmccabe 4096 Oct 28 09:59 ./
drwxr--r-- 3 cmccabe cmccabe 4096 Oct 28 08:12 ../
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:56 IonCode_0404_xxx_xxx_xxx.bam
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:58 IonCode_0404_xxx_xxx_xxx.bam.bai
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:50 IonCode_0402_xxx_xxx_xxx.bam
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:54 IonCode_0402_xxx_xxx_xxx.bam.bai

Without echo in the code, only one pair of files gets renamed, with both values:

ll
total 8
drwxr--r-- 2 cmccabe cmccabe 4096 Oct 28 09:59 ./
drwxr--r-- 3 cmccabe cmccabe 4096 Oct 28 08:12 ../
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:56 00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_test.bam
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:58 00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_test.bam.bai
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:50 IonCode_0402_xxx_xxx_xxx.bam
-rw-rw-r-- 1 cmccabe cmccabe    0 Oct 28 09:54 IonCode_0402_xxx_xxx_xxx.bam.bai

I suspect you have some corruption in the format of the f1 file. Perhaps it has been edited with a DOS editor or something like that.

Can you post here the output of od -c /path/to/run/f1
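
As a quicker first check for DOS line endings, the file can be tested for carriage returns directly. A minimal sketch; has_cr is a hypothetical helper name, and in od -c output the same bytes would show up as \r in front of each \n:

```shell
#!/bin/bash
# Sketch: succeed if the file contains at least one carriage return (\r).
has_cr() {
    grep -q "$(printf '\r')" "$1"
}
```

For example, has_cr /path/to/run/f1 && echo "carriage returns found" prints the message only when the file has them.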


Here is the output of od -c /path/to/f1 . Thank you :).

0000000   I   o   n   C   o   d   e   _   0   4   0   4       0   0   -
0000020   0   0   0   0   -   x   x   x   -   x   x   x   -   x   x   x
0000040  \n   I   o   n   C   o   d   e   _   0   4   0   2       1   1
0000060   -   1   1   1   1   -   y   y   -   y   y   -   y   y   y  \n
0000100   R   _   2   0   1   9   _   0   0   _   0   0   _   0   0   _
0000120   0   0   _   0   0   _   x   x   x   x   _   x   x   1   -   1
0000140   2   7   -   x   x   x   _   x   x   x   _   x   x   x   _   x
0000160   x   x   _   x   x   _   x   x   _   x   x  \n  \n   I   o   n
0000200   C   o   d   e   _   0   4   0   2       2   2   -   2   2   2
0000220   2   -   z   z   -   z   z   z   z   -   z   z   z  \n   R   _
0000240   2   0   1   9   _   0   0   _   0   0   _   0   0   _   0   0
0000260   _   0   0   _   x   x   x   x   _   x   x   1   -   1   2   6
0000300   -   x   x   x   _   x   x   x   _   x   x   x   _   x   x   x
0000320   _   x   x   _   x   x   _   x   x  \n  \n   I   o   n   C   o
0000340   d   e   _   0   4   0   4       1   0   -   0   0   0   0   -
0000360   a   a   -   a   a   -   a   a  \n   I   o   n   C   o   d   e
0000400   _   0   4   1   2       5   5   -   1   1   1   1   -   b   b
0000420   -   b   b   b   -   b   b   b  \n   R   _   2   0   1   9   _
0000440   0   0   _   0   0   _   0   0   _   0   0   _   0   0   _   x
0000460   x   x   x   _   x   x   1   -   1   2   0   -   x   x   x   _
0000500   x   x   x   _   x   x   x   _   x   x   x   _   x   x   _   x
0000520   x   _   x   x  \n

Nothing wrong with the data file, it matches what I'm using here byte-for-byte. Can you post the script you are using?


I pass run_dir as an argument instead of hardcoding dir . I think that is the only difference. I made that change to make it easier for others. Thank you :)

run_dir=$1
for run in "$run_dir" ; do  ## # grab run to operate on desired directory
   uniq=${run_dir##*/}  ## store run with no path as uniq
while read from to
  do
     (
       cd "$run_dir"/bam
       echo "Rename from:$from to:$to"
       for file in *.bam*
       do
          newname=${file/$from*.bam/${to}_RNA.bam}
          [ -f "$file" ] && [ "$newname" != "$file" ] && mv "$file" "$newname"
       done
     )
  done <<<$(
     awk -F '\n' -v RS="" -v ref="$uniq" '
         $0 ~ ref {
             d=split($0, val);
             for(i=1;i<d;i++) print val[i];
          }' "$run_dir"/f1
  )  ## loop through f1 for unique run and populate from and to
done
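
The thread stops before a cause is confirmed. Two things seem worth ruling out, both assumptions rather than findings. First, an unquoted <<<$( ... ) is, as far as I know, word-split by bash releases older than 4.4, which joins the awk output lines with spaces; read from to would then put everything after the first word into to, which would be consistent with names such as "00-0000-xxx-xxx-xxx IonCode_0402 11-1111-yy-yy-yyy_test.bam". Second, empty from and to values mean awk printed nothing for that run, for example when uniq does not occur in f1. The sketch below sidesteps both by reading from process substitution and skipping empty pairs; rename_run and its argument order are hypothetical, and it assumes the files sit in a data subfolder of the run directory:

```shell
#!/bin/bash
# Sketch only: rename_run is a hypothetical helper, not the script from the thread.
# Assumes f1 is blank-line separated and the run line closes each block.
rename_run() {
    local f1=$1 run_dir=${2%/}            # strip a trailing slash, if any
    local uniq=${run_dir##*/} from to file base
    [ -n "$uniq" ] || return 1
    while read -r from to; do
        [ -n "$from" ] && [ -n "$to" ] || continue   # ignore empty or malformed lines
        for file in "$run_dir"/data/"$from"_*.bam*; do
            [ -f "$file" ] || continue
            base=${file##*/}
            # ${base##*.bam} keeps whatever follows .bam (nothing, or .bai)
            echo mv "$file" "$run_dir/data/${to}_test.bam${base##*.bam}"
        done
    done < <(awk -v RS="" -v ref="$uniq" '
        index($0, ref) {                  # literal substring test, not a regex
            n = split($0, line, "\n")
            for (i = 1; i < n; i++) print line[i]
        }' "$f1")
}
```

Usage would be along the lines of rename_run /path/to/run/f1 "$1". As in the earlier scripts, it only prints the mv commands until echo is removed.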