Run sed and awk on multiple files in a directory

Dieunel · June 4, 2018, 1:34am

Dear Linux users,

I was running around 200 jobs for a Blastp search on a cluster. All my input files were protein FASTA files (prot.fna.1, prot.fna.2 ... prot.fna.200). The output of each individual Slurm job is written to a corresponding file ending in .test (prot.fna.1.test, prot.fna.2.test ... prot.fna.200.test) in the same directory. Unfortunately, these jobs were cancelled because they hit the time limit on the node. Now I want to extract the remaining sequences from my protein FASTA files so that I can run them again and concatenate all the results. Here is what I am doing:

1. I get the first field of the last line of one .test file with this command:

awk '{print $1}' prot.fna.1.test | tail -n1

This prints the "pattern".

2. All the sequences after the line matching this pattern in the corresponding FASTA input (prot.fna.1) are printed using this command:

cat prot.fna.1 | sed -e '1,/pattern/ d' | sed -ne '/^>/,$ p'

Repeating this for 200 files one by one is time-consuming. I would like to run these commands on all the files at once, but I don't know how. Could you help me implement this? Here is what I am currently doing by hand:

[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.11.test | tail -n1
  ERR598955.6981687_74_5_4
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.11 | sed -e '1,/ERR598955.6981687_74_5_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.11
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.12.test | tail -n1
  ERR598955.7664144_89_2_3
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.1 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.12
  [dderilus@boqueron ERR598955_orfm_250_out]$ less first_out.fna1.12
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.12 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.12
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.13.test | tail -n1
  ERR598955.8364684_101_2_4
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.13 | sed -e '1,/ERR598955.8364684_101_2_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.13
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.14.test | tail -n1
  ERR598955.9053411_57_6_5
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.14 | sed -e '1,/ERR598955.9053411_57_6_5/ d' | sed -ne '/^>/,$ p' > first_out.fna1.14
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.15.test | tail -n1
  ERR598955.9746341_78_3_2
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.15 | sed -e '1,/ERR598955.9746341_78_3_2/ d' | sed -ne '/^>/,$ p' > first_out.fna1.15
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.16.test | tail -n1
  ERR598955.10426164_9_3_3
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.16 | sed -e '1,/ERR598955.10426164_9_3_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.16
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.17.test | tail -n1
  ERR598955.11123991_2_2_2
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.17 | sed -e '1,/ERR598955.11123991_2_2_2/ d' | sed -ne '/^>/,$ p' > first_out.fna1.17
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.18.test | tail -n1
  ERR598955.11810206_3_6_1
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.18 | sed -e '1,/ERR598955.11810206_3_6_1/ d' | sed -ne '/^>/,$ p' > first_out.fna1.18
  [dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.19.test | tail -n1
  ERR598955.12519405_1_4_4
  [dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.19 | sed -e '1,/ERR598955.12519405_1_4_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.19

This is time-consuming. I would be very grateful if you could help me do it with one script.

Thanks in advance

Cordially

RudiC · June 4, 2018, 8:40am

Welcome to the forum.

I understand you want to automate a task to run over 200 files. Programming / scripting is for exactly this, and I'm pretty sure your request can be fulfilled elegantly and fast. Unfortunately I (at least) don't really understand what you're after. Please rephrase your request, and supply representative sample data.

MadeInGermany · June 4, 2018, 11:15am

Proposal: a bash script with a for loop

#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  pattern=$(awk '{print $1; exit}' "$f") # pattern is the 1st word in the 1st line
  sed -n '1,/^'"${pattern}"'$/ d ; /^>/,$ p' "$f" > "first_out.$ext"
done

You might need to work on it...

Dieunel · June 4, 2018, 9:09pm

Thank you for your quick response and help. I am a beginner on Linux. I will try to run this for-loop script on my data to see if it works.

I am sorry for my English.

Cordially

---------- Post updated at 09:09 PM ---------- Previous update was at 06:20 PM ----------

Dear moderator

This bash script generates an output file for each input file. However, all the output files are empty. Is there any way I can improve it? I am sorry if my question looks trivial. I am just starting with Linux programming.

Regards

MadeInGermany · June 5, 2018, 7:43am

Ok, I made some mistakes, correction follows.

#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  ext=${ext#.} # strip off a leading dot
  pattern=$(awk '{x=$1} END {print x}' "$f") # pattern is the 1st word in the last line
  newfile=first_out.$ext
  sed -n '1,/'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
done

If it does not do what you want, please run it in debug mode:

/bin/bash -x scriptname

Dieunel · June 5, 2018, 10:18am

Dear moderator

Thank you for the follow-up. Unfortunately, even when I run it in debugging mode, I still get empty output. Here is what I did:

$ cat sov.sh 
#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  ext=${ext#.} # strip off a leading dot
  pattern=$(awk '{x=$1} END {print x}' "$f") # pattern is the 1st word in the last line
  newfile=first_out.$ext
  sed -n '1,/^'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
done
[dderilus@boqueron test]$ ls
ERR598955_orfm.fna.1  ERR598955_orfm.fna.1.test  ERR598955_orfm.fna.2  ERR598955_orfm.fna.2.test  sov.sh
[dderilus@boqueron test]$ /bin/bash -x sov.sh 
+ fmask=ERR598955_orfm.fna
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.1 ']'
+ ext=.1
+ ext=1
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.1
+ pattern=GGSSFMGCPSSVMSPASGYSKPAIILNSVVHPIKDDPPHKRSVNTVFQNYALFPHMTVSQNIG
+ newfile=first_out.1
+ sed -n '1,/^GGSSFMGCPSSVMSPASGYSKPAIILNSVVHPIKDDPPHKRSVNTVFQNYALFPHMTVSQNIG/d; /^>/,$p' ERR598955_orfm.fna.1
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.1.test ']'
+ ext=.1.test
+ ext=1.test
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.1.test
+ pattern=ERR598955.61408_2_2_1
+ newfile=first_out.1.test
+ sed -n '1,/^ERR598955.61408_2_2_1/d; /^>/,$p' ERR598955_orfm.fna.1.test
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.2 ']'
+ ext=.2
+ ext=2
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.2
+ pattern=LSEKKSSQNPLLFSICLIFFWTTFLILPEKAFWRV
+ newfile=first_out.2
+ sed -n '1,/^LSEKKSSQNPLLFSICLIFFWTTFLILPEKAFWRV/d; /^>/,$p' ERR598955_orfm.fna.2
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.2.test ']'
+ ext=.2.test
+ ext=2.test
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.2.test
+ pattern=ERR598955.712540_97_1_3
+ newfile=first_out.2.test
+ sed -n '1,/^ERR598955.712540_97_1_3/d; /^>/,$p' ERR598955_orfm.fna.2.test
[dderilus@boqueron test]$ ls -sh
total 317M
148M ERR598955_orfm.fna.1   16M ERR598955_orfm.fna.1.test  149M ERR598955_orfm.fna.2  5.3M ERR598955_orfm.fna.2.test     0 first_out.1     0 first_out.1.test     0 first_out.2     0 first_out.2.test  4.0K sov.sh

MadeInGermany · June 5, 2018, 10:28am

The ^ before the pattern requires the pattern to be at the very beginning of the line. Change

sed -n '1,/^'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"

to

sed -n '1,/'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
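To illustrate why the anchor matters, here is a minimal sketch on a made-up three-record FASTA file (the file name demo.fna, the IDs and the sequences are invented for this example). FASTA headers begin with >, so a sequence ID never sits at the very beginning of a line:

```shell
# Hypothetical sample data: three FASTA records (IDs and sequences are made up)
workdir=$(mktemp -d)
printf '>seq_1 first\nMKV\n>seq_2 second\nGGS\n>seq_3 third\nLSE\n' > "$workdir/demo.fna"

pattern="seq_2"   # pretend this was the last sequence that was processed

# Anchored: no line starts with "seq_2" (headers start with ">"),
# so the range 1,/^seq_2/ never ends and every line is deleted.
anchored=$(sed -n '1,/^'"$pattern"'/d; /^>/,$p' "$workdir/demo.fna")

# Unanchored: the range ends at the header containing "seq_2";
# printing starts at the next header and runs to the end of the file.
unanchored=$(sed -n '1,/'"$pattern"'/d; /^>/,$p' "$workdir/demo.fna")

printf 'anchored:   [%s]\n' "$anchored"
printf 'unanchored: [%s]\n' "$unanchored"
```

With this sample the anchored variant prints nothing, while the unanchored one prints the seq_3 record only.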

Dieunel · June 5, 2018, 11:17am

Greetings

I edited the script as suggested but I still have empty output, here is the report :

$ ls -sh
total 317M
148M ERR598955_orfm.fna.1   16M ERR598955_orfm.fna.1.test  149M ERR598955_orfm.fna.2  5.3M ERR598955_orfm.fna.2.test  4.0K sov.sh

[dderilus@boqueron test]$ cat sov.sh 
#!/bin/bash
fmask="ERR598955_orfm.fna"
for f in $fmask.*
do
  [ -f "$f" ] || continue # ensure this is an existing file
  ext=${f#$fmask} # strip off the leading fmask
  ext=${ext#.} # strip off a leading dot
  pattern=$(awk '{x=$1} END {print x}' "$f") # pattern is the 1st word in the last line
  newfile=first_out.$ext
  sed -n '1,/'"${pattern}"'/d; /^>/,$p' "$f" > "$newfile"
done

[dderilus@boqueron test]$ /bin/bash -x sov.sh 
+ fmask=ERR598955_orfm.fna
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.1 ']'
+ ext=.1
+ ext=1
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.1
+ pattern=GGSSFMGCPSSVMSPASGYSKPAIILNSVVHPIKDDPPHKRSVNTVFQNYALFPHMTVSQNIG
+ newfile=first_out.1
+ sed -n '1,/GGSSFMGCPSSVMSPASGYSKPAIILNSVVHPIKDDPPHKRSVNTVFQNYALFPHMTVSQNIG/d; /^>/,$p' ERR598955_orfm.fna.1
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.1.test ']'
+ ext=.1.test
+ ext=1.test
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.1.test
+ pattern=ERR598955.61408_2_2_1
+ newfile=first_out.1.test
+ sed -n '1,/ERR598955.61408_2_2_1/d; /^>/,$p' ERR598955_orfm.fna.1.test
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.2 ']'
+ ext=.2
+ ext=2
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.2
+ pattern=LSEKKSSQNPLLFSICLIFFWTTFLILPEKAFWRV
+ newfile=first_out.2
+ sed -n '1,/LSEKKSSQNPLLFSICLIFFWTTFLILPEKAFWRV/d; /^>/,$p' ERR598955_orfm.fna.2
+ for f in '$fmask.*'
+ '[' -f ERR598955_orfm.fna.2.test ']'
+ ext=.2.test
+ ext=2.test
++ awk '{x=$1} END {print x}' ERR598955_orfm.fna.2.test
+ pattern=ERR598955.712540_97_1_3
+ newfile=first_out.2.test
+ sed -n '1,/ERR598955.712540_97_1_3/d; /^>/,$p' ERR598955_orfm.fna.2.test
[dderilus@boqueron test]$ ls -sh
total 317M
148M ERR598955_orfm.fna.1   16M ERR598955_orfm.fna.1.test  149M ERR598955_orfm.fna.2  5.3M ERR598955_orfm.fna.2.test     0 first_out.1     0 first_out.1.test     0 first_out.2     0 first_out.2.test  4.0K sov.sh

Regards

MadeInGermany · June 6, 2018, 11:58am

Apparently command 1,

sed -n '1,/ERR598955.712540_97_1_3/d; /^>/,$p' ERR598955_orfm.fna.2.test

has an empty output, while your initial command 2,

cat ERR598955_orfm.fna.1 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p'

works?

Does command 3,

cat ERR598955_orfm.fna.2.test | sed -e '1,/ERR598955.712540_97_1_3/ d' | sed -ne '/^>/,$ p'

work?

I do not see a functional difference between 1. and 3.
Could you post the ERR598955_orfm.fna.2.test? (if too long, only the relevant part, AND WRAP IT IN CODE TAGS for readability, or attach the file in the Advanced Editor).
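One possible explanation, judging only from the bash -x trace earlier in the thread: the loop reads the pattern from the same file that sed then filters. For a FASTA file, the pattern is therefore its own last line, and the range 1,/pattern/ covers the whole file. The manual commands instead read the pattern from the .test file and apply it to the FASTA file. A small sketch with invented data (demo.fna, demo.fna.test and all IDs are assumptions for illustration):

```shell
workdir=$(mktemp -d)
# Hypothetical FASTA input with three records
printf '>seq_1\nMKV\n>seq_2\nGGS\n>seq_3\nLSE\n' > "$workdir/demo.fna"
# Hypothetical tabular search output: query ID in column 1, seq_2 was the last query reported
printf 'seq_1\thitA\nseq_2\thitB\n' > "$workdir/demo.fna.test"

# Pattern taken from the FASTA file itself: this is its last line ("LSE"),
# so 1,/LSE/ deletes every line and nothing is left to print.
from_fasta=$(awk '{x=$1} END {print x}' "$workdir/demo.fna")
out_wrong=$(sed -n '1,/'"$from_fasta"'/d; /^>/,$p' "$workdir/demo.fna")

# Pattern taken from the .test file and applied to the FASTA file:
# everything up to the seq_2 header is deleted, the seq_3 record is printed.
from_test=$(awk '{x=$1} END {print x}' "$workdir/demo.fna.test")
out_right=$(sed -n '1,/'"$from_test"'/d; /^>/,$p' "$workdir/demo.fna")

printf 'pattern from FASTA file: [%s]\n' "$out_wrong"
printf 'pattern from .test file: [%s]\n' "$out_right"
```

The same reasoning would explain the empty first_out.*.test files: the loop also treats the .test files as inputs, and those contain no lines starting with >.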

Dieunel · June 7, 2018, 9:42am

Dear moderator

Thank you for your assistance. I finally got the code working. Since I have many files in my directory, a for loop lets it process the whole directory. Here is the code:

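for filename in ERR*_orfm.fna.*.test ; do
    # ID of the last sequence reported in the search output
    last_seq=$(awk '{print $1}' "$filename" | tail -n1)
    # name of the matching FASTA input (the file name without .test)
    seqname=$(basename "$filename" .test)
    outname="$seqname.first_out"
    # delete everything up to the last processed sequence, then print from the next header on
    sed -e "1,/$last_seq/ d" "$seqname" | sed -ne '/^>/,$ p' > "$outname"
done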

Regards

MadeInGermany · June 7, 2018, 12:47pm

The main goal is that you understand how you can modify the manual commands, insert variables where the values are variable, and construct the variables from a loop variable.
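These three steps can be sketched as one script. This is only an illustration under assumptions: extract_remaining is a hypothetical helper name, the .first_out suffix follows the naming used earlier in the thread, and the sequence ID is used as a sed regular expression, so a "." in it matches any character.

```shell
#!/bin/bash
# Sketch: for every search output file, write the not-yet-searched sequences
# of the matching FASTA input to a new file.
# extract_remaining is a hypothetical helper name, not part of any tool.
extract_remaining() {
    local testfile="$1"
    local fasta="${testfile%.test}"        # prot.fna.1.test -> prot.fna.1
    local outname="$fasta.first_out"
    local last_seq

    # Variable 1: ID of the last query reported (column 1 of the last line)
    last_seq=$(awk '{x=$1} END {print x}' "$testfile")
    if [ -z "$last_seq" ]; then
        echo "skipping $testfile: no results found" >&2
        return 0
    fi

    # The manual command, with the variables inserted:
    # delete everything up to the line containing the ID,
    # then print from the next FASTA header to the end of the file.
    sed -n '1,/'"$last_seq"'/d; /^>/,$p' "$fasta" > "$outname"
}

# All file names are constructed from the loop variable
for testfile in *.fna.*.test; do
    [ -f "$testfile" ] || continue         # the glob matched nothing
    extract_remaining "$testfile"
done
```

Looping over the .test files only, rather than over every file that shares the prefix, keeps the FASTA inputs and earlier output files out of the loop.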