Rename file in directory using contents within each file

cmccabe · December 26, 2019, 4:42pm

Below are two generic .vcf files (genome.S1.vcf and genome.S2.vcf) in a directory. There won't always be exactly two generic files, but I am trying to use bash to rename each of them with the specific text (a unique identifier) found within each .vcf. The text will always be different, but it will always be in the same position (after the word FORMAT) on the same line (the one that starts with #CHROM). Each .vcf is tab-delimited. I'm not sure my attempt is the best way, but hopefully it helps. Thank you :).

genome.S1.vcf

...
...
...
##FILTER=<ID=NotGenotyped,Description="Locus contains forcedGT input alleles which could not be genotyped">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NAME1_S1
chr10	323215	.	A	.	.	LowGQX	END=323313;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	.:.:0:0:0
chr10	323314	.	C	.	.	LowGQX;LowDepth	END=323397;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	0/0:3:1:0:1

genome.S2.vcf

...
...
...
##FILTER=<ID=NotGenotyped,Description="Locus contains forcedGT input alleles which could not be genotyped">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	11-1111-ID_S5
chr10	323215	.	A	.	.	LowGQX	END=323313;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	.:.:0:0:0
chr10	323314	.	C	.	.	LowGQX;LowDepth	END=323385;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	.:.:0:0:0

desired (each vcf in directory renamed with unique identifier)

NAME1_S1.vcf

...
...
...
##FILTER=<ID=NotGenotyped,Description="Locus contains forcedGT input alleles which could not be genotyped">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NAME1_S1
chr10	323215	.	A	.	.	LowGQX	END=323313;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	.:.:0:0:0
chr10	323314	.	C	.	.	LowGQX;LowDepth	END=323397;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	0/0:3:1:0:1

11-1111-ID_S5.vcf

...
...
...
##FILTER=<ID=NotGenotyped,Description="Locus contains forcedGT input alleles which could not be genotyped">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	11-1111-ID_S5
chr10	323215	.	A	.	.	LowGQX	END=323313;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	.:.:0:0:0
chr10	323314	.	C	.	.	LowGQX;LowDepth	END=323385;BLOCKAVG_min30p3a	GT:GQX:DP:DPF:MIN_DP	.:.:0:0:0

bash

cd /path/to/files
for f in *.vcf ; do # loop through all vcf files
    new="$(head -1 "$f" | awk '{print $10}').vcf" # store value of $10 in new
    if [ ! -f "$new" ]; then # if original file doesn't match new
        echo -e Renaming $f to $new # log rename
        mv "$f" "$new" # rename original to new
    fi # close if
done # close loop

nezabudka · December 26, 2019, 10:37pm

Hi
Try this

new=$(awk '/FORMAT/ {print $10}' "$f")

or

new=$(sed '/FORMAT/!d;s/^.*\t//' "$f")

If the file is large, it is useful to add the "q" (quit) command

new=$(sed '/FORMAT/!d;s/^.*\t//;q' "$f")
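
One caveat with matching on /FORMAT/: VCF headers commonly also contain meta lines of the form ##FORMAT=<...>, which the pattern would match as well. A minimal sketch that anchors on the #CHROM line instead and stops after the first hit; the helper name sample_id and the scratch file are made up for the demo:

```shell
# Hypothetical helper: print the 10th tab-separated field of the #CHROM line.
sample_id() {
    awk -F '\t' '/^#CHROM/ { print $10; exit }' "$1"
}

# Demo on a throwaway file modeled on the example in the question.
tmp=$(mktemp -d)
printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n' > "$tmp/genome.S1.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNAME1_S1\n' >> "$tmp/genome.S1.vcf"

id=$(sample_id "$tmp/genome.S1.vcf")
echo "$id"
rm -r "$tmp"
```

The exit plays the same role as q in the sed version: awk stops reading as soon as the header line has been found.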

--- Post updated at 07:37 ---

You can try with one command

awk '/FORMAT/ {system("mv "FILENAME" "$10)}' *.vcf

nezabudka · December 27, 2019, 1:16am

I'm sorry, you should add a check for the existence of the file

awk '/FORMAT/ {system("mv -n "FILENAME" "$10)}' *.vcf

Scrutinizer · December 27, 2019, 2:47am

Note: -n (no clobber) for the mv command is a non-standard extension to the POSIX standard. Alternatively, try -i for interactive use (but some systems ignore -i when run non-interactively, so test this too), or (better) test for the file's existence beforehand.
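
A rough sketch of the "test beforehand" variant, using only [ -e ] and plain mv. The function name rename_vcfs and the scratch directory are assumptions for illustration, and there is still a small window between the test and the mv in which another process could create the target:

```shell
# Hypothetical helper: rename every *.vcf in directory $1 after its sample ID.
rename_vcfs() {
    for f in "$1"/*.vcf; do
        [ -f "$f" ] || continue                  # glob matched nothing
        id=$(awk -F '\t' '/^#CHROM/ { print $10; exit }' "$f")
        [ -n "$id" ] || continue                 # no #CHROM line found
        new="$1/$id.vcf"
        if [ -e "$new" ]; then
            echo "Skipping $f: $new already exists" >&2
        else
            mv "$f" "$new"
        fi
    done
}

# Demo in a scratch directory with two header-only files.
demo=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNAME1_S1\n' > "$demo/genome.S1.vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t11-1111-ID_S5\n' > "$demo/genome.S2.vcf"
rename_vcfs "$demo"
ls "$demo"
```

A file that already carries its final name is simply skipped, because the target then exists.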

RavinderSingh13 · December 27, 2019, 6:35am

Hello nez,

How are you?
I hope you are doing fine

Regarding this solution, IMHO we should avoid renaming an Input_file with system() while awk is still reading that same Input_file, since that could cause issues.

Instead, I would check for the string FORMAT in each line and print the shell rename command (I am assuming each Input_file needs only one rename, because once the Input_file being read has been renamed, it can no longer be found under its old name).

So here I am printing shell commands, using the same condition as in your code:

awk '/FORMAT/{print "if [[ -n " s1 FILENAME s1 " ]]; then      echo " s1 "Input_file named " FILENAME " is already present." s1 "; else      mv " s1 FILENAME s1 OFS $10"; fi"}' *.vcf

For a sample file named file3, the output is as follows:

if [[ -n file3 ]]; then      echo Input_file named file3 is already present.; else      mv file3 ; fi

The above only prints the rename commands; if the OP is happy with them, we can pipe the output to bash (| bash) to run them.

awk '/FORMAT/{print "if [[ -n " s1 FILENAME s1 " ]]; then      echo " s1 "Input_file named " FILENAME " is already present." s1 "; else      mv " s1 FILENAME s1 OFS $10"; fi"}' *.vcf | bash
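
Note that in the command as posted s1 is never assigned, so it expands to an empty string, and [[ -n name ]] only tests that a string is non-empty rather than that a file exists. A hedged sketch of the same print-then-run idea that quotes the names and tests the target with -e; the variable q, the .vcf suffix on the target, and the scratch directory are assumptions, and file names containing a single quote would break the quoting:

```shell
demo=$(mktemp -d)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNAME1_S1\n' > "$demo/genome.S1.vcf"
cd "$demo" || exit 1

# Print one guarded mv command per file; q holds a single quote for the shell.
cmds=$(awk -F '\t' -v q="'" '/^#CHROM/ {
    new = $10 ".vcf"
    print "if [ -e " q new q " ]; then echo " q new " already exists" q "; else mv " q FILENAME q " " q new q "; fi"
}' *.vcf)

printf '%s\n' "$cmds"          # review the generated commands first
printf '%s\n' "$cmds" | sh     # then run them
```

Keeping the generated text in a variable makes it possible to inspect exactly what will be executed before anything is renamed.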

Apologies if I missed something here; I just wanted to share my view. Cheers.

Thanks,
R. Singh

nezabudka · December 27, 2019, 6:50am

Hi and thanks
No, that is not a problem. You can see what happens:

cat>>file.txt<<EOF
1
2
FORMAT new
3
4
EOF
cp file.txt file2.txt
ls
file.txt file2.txt
awk '/FORMAT/ {system("mv -n "FILENAME" "$2)}; {print $0, FILENAME}' *.txt
1 file2.txt
2 file2.txt
3 file2.txt
FORMAT new file2.txt
4 file2.txt #<<<-not changed
5 file2.txt
6 file2.txt
1 file.txt
2 file.txt
3 file.txt
FORMAT new file.txt
4 file.txt
5 file.txt
6 file.txt
ls
file.txt new

awk simply keeps reading through its open file descriptors.

Scrutinizer · December 27, 2019, 7:44am

Hi Ravinder,

As long as a mv operation is performed on the same file system - as is the case here - that should not pose a problem, since mv then only manipulates directory data: A file name is nothing more than a directory entry, a pointer (a hard link) to the file itself.

When a process opens a file for reading, the operating system creates an entry (a file descriptor) to represent that file and keeps information about the open file in memory. From then on, the directory entry is no longer needed to read it.

The mv operation is thus free to manipulate the directory entry.

So for the process that has opened and is reading the file, nothing changes as the directory data is being changed.
When it is done reading it just closes the file descriptor.

Also, the file list expanded by the glob is expanded before being passed to the awk script, so new file names are not passed to the script.
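
The descriptor behaviour described above can be checked directly from the shell. A minimal sketch, assuming a POSIX-like shell and a scratch directory from mktemp; the file names are made up for the demo:

```shell
demo=$(mktemp -d)
printf 'line1\nline2\n' > "$demo/old.txt"

exec 3< "$demo/old.txt"              # open the file for reading on descriptor 3
mv "$demo/old.txt" "$demo/new.txt"   # rename within the same file system
content=$(cat <&3)                   # the data is still reachable via the descriptor
exec 3<&-                            # close the descriptor

printf '%s\n' "$content"
```

The read succeeds even though the name old.txt no longer exists, because the descriptor refers to the file itself and not to its directory entry.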

S.

RudiC · December 28, 2019, 8:24am

Even if the file is moved between file systems, there should not be a problem as long as the file is kept open. Even though the file and its contents ARE moved by the mv command, the OS keeps the file readable until it is closed and unlinked. See below, using nezabudka's one-liner extended with mv's -v (--verbose) option:

awk '/FORMAT/ {system("mv -vn "FILENAME " /tmp/" $10)}' f*
copied 'file1' -> '/tmp/NAME1_S1'
removed 'file1'
copied 'file2' -> '/tmp/11-1111-ID_S5'
removed 'file2'

and the respective lsof output (before and after (but before file closing) the mv operation) :

awk     2885 user    3r   REG   8,41      477    171 /mnt/9/file1
.
.
.
awk     2885 user    3r   REG   8,41      477    171 /mnt/9/file1 (deleted)

Even though attributed "deleted", the file's contents are still available and readable. Of course, once unlinked, the file can't be reopened / reused in its original location.
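
The same effect can be reproduced without lsof. A small sketch, assuming a Unix-like system and a scratch directory from mktemp (file name and contents are invented): one line is read, the file is removed, and the rest is then read through the still-open descriptor.

```shell
demo=$(mktemp -d)
printf 'first\nsecond\n' > "$demo/data.txt"

exec 3< "$demo/data.txt"     # keep the file open on descriptor 3
read -r first <&3            # consume the first line
rm "$demo/data.txt"          # unlink the name while the file is still open
rest=$(cat <&3)              # the remaining data is still readable
exec 3<&-                    # closing releases the file for good

printf '%s / %s\n' "$first" "$rest"
```

After the descriptor is closed, the data can no longer be reached, which matches the remark that the file cannot be reopened in its original location.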

cmccabe · December 28, 2019, 3:57pm

Thank you all

nezabudka · December 30, 2019, 10:41pm

rudic:

...
and the respective lsof output (before and after (but before file closing) the mv operation) :
awk     2885 user    3r   REG   8,41      477    171 /mnt/9/file1
.
.
.
awk     2885 user    3r   REG   8,41      477    171 /mnt/9/file1 (deleted)
Even though attributed "deleted", the file's contents are still available and readable. Of course, once unlinked, the file can't be reopened / reused in its original location.

Hi @RudiC
I'm embarrassed to ask, but how can I reproduce this result? I have tried everything
and can't catch that brief moment in time.
Thanks

RudiC · December 31, 2019, 5:38am

To replicate the above output, try

awk '/FORMAT/ {system("lsof -p$$; mv -vn "FILENAME " /tmp/" $10 "; lsof -p$$")}' fi*
.
.
.
sh      4523 user    3r   REG   8,41      477 260226 /mnt/9/file1
copied 'file1' -> '/tmp/NAME1_S1'
removed 'file1'
.
.
.

sh      4523 user    3r   REG   8,41      477 260226 /mnt/9/file1 (deleted)

lsof is installed from its own package on my Ubuntu distro; see man lsof.

nezabudka · December 31, 2019, 6:26am

Thanks @RudiC
I was starting to think I was losing my mind.
On my system it worked like this:

awk '/FORMAT/ {system("lsof -p $(pidof awk); mv -vn "FILENAME" "$10"; lsof -p $(pidof awk)")}' fi*

my regards