sed parser behaving strangely when replacing multiple words in multiple files

sammy777888 · December 8, 2017, 4:43am

I have 4000 files like

$cat clus_grp_seq10_g.phy 

 18 1002
anig_OJJ65951_1     ATGGTTTCGCAGCGTGATAGAGAATTGTTTAGGGATGATATTCGCTCGCGAGGAACGAAGCTCAATGCTGCCGAGCGCGAGAGTCTGCTAAGGCCATATCTGCCAGATCCGTCTGACCTTCCACGCAGGCCACTTCAGCGGCGCAAGAAGGTTCCTCG
aver_OOF92921_1     ATGGTTTCGCAACGAGAT---------AGAGAATTGAATATCACGGCTTCCTCAGGGGTCTCTGGCATTATGCTGGTGCTCAGATGAGGTTTGGC
anid_EAW13573_1     ATGGTCTCACAGCGTGACAGAGAGTTGGCTGTTGAATACCAGGGCTATCTCAGGGGTTTGTGGCATTACGCTGGGGCCCAGATGCGATTTGGC
azon_EAW20028_1     ATGGCCCTAGCACGTGATAGAGAATTACTGAGGGACACTATTCGCACCCAAGGGACCGCACTTACTGCTGCCGATCGCGAAAATATCCTGAAGCCATATCTGCCGGATCCATCAGAACTTGCACGTCGGCCACTACAGCGACAGAAGAAAGC
awen_EED46037_1     ATGGTATCACAACGGGATAGAGTGGTGTGTCTGCC------------------------------------------------CTCTACAGGTCA------AAACAGTGCGAAATA---------AA
acar_EAL84889_1     ATGGCCCT
akaw_EAWE3573_1     ---------ATGGTCTCAC---------AGCGTGACAGAGAGT---------TGGCTGTTGAATACCAGGGCTATCTCAGGGGTTTGTGGCATTACGC

I want to shorten the labels that start with one of 7 patterns (aver, anid, anig, acar, azon, awen, akaw) to just the pattern, in all the files. The resulting file should look like this (the file name stays the same):

$cat clus_grp_seq10_g.phy 

 18 1002
anig     ATGGTTTCGCAGCGTGATAGAGAATTGTTTAGGGATGATATTCGCTCGCGAGGAACGAAGCTCAATGCTGCCGAGCGCGAGAGTCTGCTAAGGCCATATCTGCCAGATCCGTCTGACCTTCCACGCAGGCCACTTCAGCGGCGCAAGAAGGTTCCTCG
aver     ATGGTTTCGCAACGAGAT---------AGAGAATTGAATATCACGGCTTCCTCAGGGGTCTCTGGCATTATGCTGGTGCTCAGATGAGGTTTGGC
anid     ATGGTCTCACAGCGTGACAGAGAGTTGGCTGTTGAATACCAGGGCTATCTCAGGGGTTTGTGGCATTACGCTGGGGCCCAGATGCGATTTGGC
azon     ATGGCCCTAGCACGTGATAGAGAATTACTGAGGGACACTATTCGCACCCAAGGGACCGCACTTACTGCTGCCGATCGCGAAAATATCCTGAAGCCATATCTGCCGGATCCATCAGAACTTGCACGTCGGCCACTACAGCGACAGAAGAAAGC
awen     ATGGTATCACAACGGGATAGAGTGGTGTGTCTGCC------------------------------------------------CTCTACAGGTCA------AAACAGTGCGAAATA---------AA
acar     ATGGCCCT
akaw     ---------ATGGTCTCAC---------AGCGTGACAGAGAGT---------TGGCTGTTGAATACCAGGGCTATCTCAGGGGTTTGTGGCATTACGC

I wrote a bash script for this

#!/bin/bash
j=1
for ((i=0;i<=4000;i++));
do
echo "$j"

sed -e s/'aver_[^ ]*'/aver/g clus_grp_seq"$j"_g.phy | sed -e s/'anid_[^ ]*'/anid/g | sed -e s/'anig_[^ ]*'/anig/g | sed -e s/'acar_[^ ]*'/acar/g | sed -e s/'azon_[^ ]*'/azon/g | sed -e s/'awen_[^ ]*'/awen/g | sed -e s/'akaw_[^ ]*'/akaw/g -> clus_grp_seq"$j"_g.phy
wait
let j++
done

but the script leaves several files completely blank. Some files, such as clus_grp_seq2000_g.phy, do not exist in the folder; in that case a blank clus_grp_seq2000_g.phy is OK. But even for files that do exist in the folder, such as clus_grp_seq10_g.phy shown above, the script produces blank files.
Please let me know what the problem is, or suggest an alternative solution.

RudiC · December 8, 2017, 5:02am

Without digging too deep, I can see that

- your single quoting of the sed commands is consistently unconventional (it happens to work, but the entire sed script should be quoted, e.g. 's/aver_[^ ]*/aver/g').
- the redirection into the to-be-modified file truncates it, normally before anything has been read from it, so I'm surprised that only some files end up completely blank (presumably, in a pipeline, the first sed sometimes manages to read the file before the last one's redirection truncates it).
- running 7 seds (i.e. creating 7 processes) for each of 4000 files is resource hungry and may become somewhat slow.
- you run a for loop from 0 to 4000 with i as the loop variable, so why do you use another variable, j?
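To illustrate the truncation point above, here is a minimal sketch. It assumes a scratch directory and a made-up file demo.phy holding two shortened sample lines; it first shows the unsafe redirection, then a single sed process with several -e expressions writing to a temporary file that is moved into place afterwards.

```shell
#!/bin/bash
# Work in a scratch directory; demo.phy is a hypothetical sample file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'aver_OOF92921_1     ATGG\nanid_EAW13573_1     ATGC\n' > demo.phy

# Unsafe: the shell truncates demo.phy for writing before sed starts,
# so sed reads an empty file and the data is lost.
sed 's/_[^ ]*//' demo.phy > demo.phy
wc -c < demo.phy    # reports 0 bytes

# Safer: recreate the sample, run ONE sed with several expressions,
# write to a temporary file, and replace the original only on success.
printf 'aver_OOF92921_1     ATGG\nanid_EAW13573_1     ATGC\n' > demo.phy
sed -e 's/aver_[^ ]*/aver/' -e 's/anid_[^ ]*/anid/' demo.phy > demo.phy.tmp &&
    mv demo.phy.tmp demo.phy
cat demo.phy
```

With a single command the truncation always happens first; in a pipeline the timing depends on process scheduling, which is consistent with only some files ending up blank.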

How about

for FN in clus*phy; do sed '/aver\|anid\|anig\|acar\|azon\|awen\|akaw/ s/_[^ ]*//' "$FN" > "${FN}.tmp"; done

Move the .tmp files to the original ones when happy.
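The move-back step could look like the sketch below. It is an illustration under assumptions, not a tested recipe for the real data: it runs in a scratch directory on one made-up sample file, uses sed -E with an anchored alternation instead of the GNU-specific \| form, and only replaces an original when its .tmp counterpart is non-empty.

```shell
#!/bin/bash
# Scratch directory with one hypothetical sample file (shortened sequences).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf ' 18 1002\nanig_OJJ65951_1     ATGGTT\nakaw_EAWE3573_1     ---ATG\n' \
    > clus_grp_seq10_g.phy

# Step 1: write the edited copy next to each original.
# On lines starting with one of the 7 prefixes, delete from the first
# underscore up to (not including) the next space.
for FN in clus*phy; do
    sed -E '/^(aver|anid|anig|acar|azon|awen|akaw)_/ s/_[^ ]*//' "$FN" > "$FN.tmp"
done

# Step 2: once satisfied, move each non-empty .tmp file over its original.
for TMP in clus*phy.tmp; do
    [ -s "$TMP" ] && mv -- "$TMP" "${TMP%.tmp}"
done

cat clus_grp_seq10_g.phy
```

Looping over the glob also means that missing file numbers are simply skipped, so no blank files are created for them. GNU sed can also edit files in place with its -i option, but keeping the .tmp step allows the results to be inspected first.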