Dear linux users
I was running around of 200 djob for a Blastp search in a cluster. All my input files were protein fasta file (prot.fna.1, prot.fna.2 ...prot.fna.200). The output of each individual slurm job is located in a corresponding file ending with *test (prot.fna.1.test, prot.fna.2.test ...prot.fna.200.test) in the same directory. Unfortunately, these Jobs were canceled due to time limit on the node. Now I want to extract all the remaining sequences from my protein fasta files a way to run them again and all the results could be concatenated. Here his what I doing :
- I look for the first string of one *test file with this command:
(awk '{print $1}' prot.fna.1.test | tail -n1)
, this scrip print me the �pattern�
2. All the sequences after this matching pattern in the corresponding fasta input (prot.fasta.1) is printed using this command :
cat prot.fasta.1 | sed -e '1,/pattern/ d' | sed -ne '/^>/,$ p'
Repeating this for 200 files one by one is time consuming. I want to run this script in all the files , but I can't. I am writing you to see if you can help me implement this please. Here is what I am doing using these scripts :
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.11.test | tail -n1
ERR598955.6981687_74_5_4
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.11 | sed -e '1,/ERR598955.6981687_74_5_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.11
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.12.test | tail -n1
ERR598955.7664144_89_2_3
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.1 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.12
[dderilus@boqueron ERR598955_orfm_250_out]$ less first_out.fna1.12
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.12 | sed -e '1,/ERR598955.7664144_89_2_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.12
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.13.test | tail -n1
ERR598955.8364684_101_2_4
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.13 | sed -e '1,/ERR598955.8364684_101_2_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.13
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.14.test | tail -n1
ERR598955.9053411_57_6_5
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.14 | sed -e '1,/ERR598955.9053411_57_6_5/ d' | sed -ne '/^>/,$ p' > first_out.fna1.14
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.15.test | tail -n1
ERR598955.9746341_78_3_2
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.15 | sed -e '1,/ERR598955.9746341_78_3_2/ d' | sed -ne '/^>/,$ p' > first_out.fna1.15
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.16.test | tail -n1
ERR598955.10426164_9_3_3
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.16 | sed -e '1,/ERR598955.10426164_9_3_3/ d' | sed -ne '/^>/,$ p' > first_out.fna1.16
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.17.test | tail -n1
ERR598955.11123991_2_2_2
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.17 | sed -e '1,/ERR598955.11123991_2_2_2/ d' | sed -ne '/^>/,$ p' > first_out.fna1.17
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.18.test | tail -n1
ERR598955.11810206_3_6_1
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.18 | sed -e '1,/ERR598955.11810206_3_6_1/ d' | sed -ne '/^>/,$ p' > first_out.fna1.18
[dderilus@boqueron ERR598955_orfm_250_out]$ awk '{print $1}' ERR598955_orfm.fna.19.test | tail -n1
ERR598955.12519405_1_4_4
[dderilus@boqueron ERR598955_orfm_250_out]$ cat ERR598955_orfm.fna.19 | sed -e '1,/ERR598955.12519405_1_4_4/ d' | sed -ne '/^>/,$ p' > first_out.fna1.19
This is time consuming, I would be very grateful if you can help me to do that with one script.
Thanks in advance
Cordially