How to append two fasta files?

dineshkumarsrk · June 11, 2019, 8:44am

I have two fasta files as shown below,

File:1
>Contig_1:90600-91187 
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGGAATTGATGACGGTC
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG               
>Contig_24:26615-28387
GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAGGAGATCGCCCGTTTGGAGCGCGGCGAA

File2
>Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT 
>Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

Both files are having the same fasta headers but vary in terms of sequences. Hence, I need to replace File:2 sequences in File:1 as shown below,

Expected outcome:
>Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT 
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG
>Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

I tried with

cat

command, but it is concatenating all the sequences instead of replacing the sequences as I mentioned above.
Please help me to do the same.
Thank you in advance.

vgersh99 · June 11, 2019, 12:05pm

not sure if I follow your samples and the desired output, but I think you might want this.

awk -f dinesh.awk file2 file1 where dinesh.awk is:

FNR==NR {
  if (FNR%2)
     head=$0
  else
     f2[head]=$0
  next
}
{
  if (FNR%2) {
     head=$0
     print
  }
  else
     print (head in f2)? f2[head] : $0
}

results in:

>Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG
>Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG

yifangt · June 11, 2019, 2:32pm

I put my answer back as I met the scenarios:
1) when file2.fasta contains more entries than in file1.fasta and vice versa. and
2) when the sequence part can have more than one row;

awk 'BEGIN{RS=">";FS="\n"} {A[$1]=$2} END{for (i in A) {if (i) print ">"i,FS,A}}' file1.fasta file2.fasta

This only works for 2-line fasta sequence files (i.e. each entry has two lines, One starts with ">" as the header, the other is the DNA sequence. @vgersh99, could you please elaborate your code for scenario 2)? Thanks!

file1.fasta
>Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT 
>Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG               
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGA

file2.fasta:
>Contig_1:90600-91187 
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGGAATTGATGACGGTC
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGG
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGA
>Contig_24:26615-28387
GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAGGAGATCGCCCGTTTGGAGCGCGGCGAA

output:
>Contig_1:90600-91187 
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGGAATTGATGACGGTC
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGG
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGA
>Contig_24:26615-28387
GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAGGAGATCGCCCGTTTGGAGCGCGGCGAA
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG               
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGA

vgersh99 · June 11, 2019, 3:10pm

yifangt:

I put my answer back as I met the scenarios:
1) when file2.fasta contains more entries than in file1.fasta and vice versa. and
2) when the sequence part can have more than one row;
awk 'BEGIN{RS=">";FS="\n"} {A[$1]=$2} END{for (i in A) {if (i) print ">"i,FS,A}}' file1.fasta file2.fasta
This only works for 2-line fasta sequence files (i.e. each entry has two lines, One starts with ">" as the header, the other is the DNA sequence. @vgersh99, could you please elaborate your code for scenario 2)? Thanks!

My previously posted attempt was for the provided sample files ONLY - 2 lines for each sequence: 1 header and 1 "data" lines.
Had the OP elaborated on the possible other sample data files, an alternate solution would have been provided.

RudiC · June 12, 2019, 6:10pm

May I paraphrase your request: You want file2's sequences to prevail and file1's sequences inserted only if there's no equivalent in file2. If so, try

tr $'\n>' $' \n' <file1|  grep -vf <(sed -n 's/>//p' file2) | cat  <(tr $'\n>' $' \n' <file2) - |  tr  -s $' \n' $'\n>'
>Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT
>Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG
>

with any length fasta sequences. The traiing ">" can be taken care of piping through | sed '$d' if need be. Please be aware that your desired output's "Contig_98" line is missing an "AG" at the end.

yifangt · June 13, 2019, 11:26am

That's exactly what I meant!

Thanks RudiC!

Scrutinizer · June 13, 2019, 12:30pm

awk version you could try:

awk 'BEGIN{FS=RS; RS=">"; ORS=""} FNR>1{A[$1]=RS $0} END{for(i in A) print A}'  file1 file2

note: ensure that there are no excess trailing spaces as is the case with the file samples.

yifangt · June 13, 2019, 3:09pm

Scrutinizer:
Could you elaborate this part ?

FS=RS; RS=">";

I tried FS=RS=">"; but it did not work. What's the difference between the two: FS=RS; RS=">"; vs FS=RS=">";

Scrutinizer · June 13, 2019, 3:14pm

Hi yifangt,

The default value for RS is '\n'
So after FS=RS; RS=">" FS is '\n' and RS is '>'

If we use FS=RS=">" , then they both become equal to '>'

S.

MadeInGermany · June 13, 2019, 4:57pm

Post #5 is interesting.
I found it can be enhanced, by a ^ anchor (match at the beginning only), and by inserting a ^ character that becomes an anchor for the following grep

tr $'\n>' $' \n' <file1|  grep -vf <(sed -n 's/^>/^/p' file2) | cat  <(tr $'\n>' $' \n' <file2) - |  tr  -s $' \n' $'\n>'

Post #7 did not work for me.? Did not further analyze it.
Here is an embedded multi-line awk code that works for me:

#!/bin/sh
awk '
{
  ishead=($1~/^>/)
}
FNR==NR {
  # file1: store everything in F[head]
  if (ishead) {
     head=$1
     sep=""
  } else {
     F[head]=(F[head] sep $1)
     sep=ORS
  }
  next
}
{
  # file2: print everything, delete corresponding F[head] from file1
  if (ishead) {
    delete F[$1]
  }
  print
}
END {
  # print the remaining F[head] from file1
  for (head in F) {
    print head
    print F[head]
  }
}
' file1 file2

I chose $1 not $0 because I found some trailing spaces in the given examples; spaces are stripped if $1 is used.

Scrutinizer · June 13, 2019, 5:03pm

Post #7 probably did not work for you because there are excess trailing spaces that need to be removed in the file samples, but that are not going to be present in the actual FASTA files, see the note underneath...

--
Here they are without the spaces:
File1:

>Contig_1:90600-91187
GACCGTCATCAATTCCTGTTCCTTGCCCTTGACGACCTCATCCACGTCCTTGATGGCCTT
>Contig_24:26615-28387
TTCGCCGCGCTCCAAACGGGCGATCTCCTCGGCGCGGGCCGCCAGGATCAGCGCCG
>Contig_98:35323-35886
GACGAAGCGCTCGCCAAGGCCGAAGAAGAAGGCCTGGATCTGGTCGAAATCCAGCCGCAG               
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGA

File2:

>Contig_1:90600-91187
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGGAATTGATGACGGTC
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGAACAGG
AAGGCCATCAAGGACGTGGATGAGGTCGTCAAGGGCAAGGA
>Contig_24:26615-28387
GCTGCGGCGCTGATCCTGGCGGCCCGCGCCGAGGAGATCGCCCGTTTGGAGCGCGGCGAA

Of course another thing is the order that is mixed up when because it is undefined in the array structure in awk. That could easily be fixed of course if need be.

MadeInGermany · June 13, 2019, 5:14pm

Ah now I see - and you do not have the $1 workaround because you set a different RS...
After trimming the input files it worked.