Help with merging data into a reference sequence

I have two input files.

File 1 is a large reference sequence in FASTA format. The first line (starting with ">") is the header description and the line after it is the corresponding sequence; positions are counted from 1 to the end of the sequence:

>Data_1
ASWDADAQTWQQGSAAAAASDAFAFA
.
.

File 2 is a list of replacements that I want to apply at specific positions in File 1. It has 3 tab-delimited columns:

Data_1 2 Z
Data_1 3 T
Data_1 10 A
Data_1 11 T
.
.

Desired Output File

>Data_1
AZTDADAQTATQGSAAAAASDAFAFA
.
.

File 1 is a long FASTA record (a header line starting with ">" followed by its corresponding sequence).
File 2 is a tab-delimited file with 3 columns:
First column is the header description of File 1 (without the ">");
Second column is the position of the character that I want to replace in File 1;
Third column is the new character to put at that position in File 1.

My awk attempt:

awk -F "\t" '(FNR==1){x++} (x==1){a[$1][$2]=$3;next} (x==2){if($0~/>/){h=$0;sub(/^.*Data/,"",h);sub(/ .*/,"",h)} else{seq[h]=seq[h]$0}} END{for(i in a){s=0; for(j in a){m=m substr(seq,s,j-1) a[j];s=j+1} m=m substr(seq,s); print ">Data"i"\n"m}}' File 2 File 1

In short, I want to replace the characters at the positions of File 1 that are listed in File 2 with the new characters given there. The header line (>Data_1) must stay unchanged.

Thanks for any advice.

Try this:-

awk '
        NR == FNR {
                A[">"$1 FS $2] = $3
                next
        }
        /^>/ {
                T = $0
                print
                next
        }
        {
                for ( i = 1; i <= length; i++ )
                {
                        if ( ( T FS i ) in A )
                                printf "%s", A[T FS i]
                        else
                                printf "%s", substr( $0, i, 1 )
                }
                printf "\n"
        }
' file2 file1

Can the sequences in your FASTA file be spread over multiple lines?

awk 'NR==FNR {a[$1,$2]=$2; b[$1,$2]=$3; c[$1]=$1; next}
/^>/ {w=$0; sub(".*> *", "", w)}
! /^>/ && c[w] {for (i in a) $(a[i])=b[i]}
1
' file2 FS= OFS= file1

If fasta sequences are always only a single line:

awk '
  NR==FNR { 
    R[$1,$2]=$3
    next
  }
  FNR>1 {
    s=x
    for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1))
    print RS $1 FS s
  }
' file2 RS=\> FS='\n' file1

----
Note: FS= (the extension that if FS is equal to the empty string, each character becomes a separate field) is not part of POSIX and may or may not work with your version of awk.
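
One quick way to see how a given awk treats an empty FS is to count the fields of a short test string. This is only a sketch; the three-letter input is arbitrary.

```shell
# Count the fields awk sees in "ABC" when FS is the empty string.
# An awk that supports the extension reports 3 (one field per character);
# other implementations may report something else, since POSIX does not define this.
n=$(printf 'ABC\n' | awk 'BEGIN { FS = "" } { print NF }')
echo "fields seen: $n"
```

If the result is not 3, the substr()-based scripts in this thread avoid the extension altogether.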

Hi,

The FASTA sequence is just one very long single line :)


Hi,

Sorry.
May I ask why it returns a syntax error when I type it as a one-line awk command in my terminal?

Should I run your awk command as a shell script instead?
Thanks for the advice.

Hi, which script are you referring to?
What is your OS?
How do you paste it?

Hi,

It does not seem to work :(
I believe it returns the header together with the length of the FASTA sequence.

awk 'NR==FNR {a[$1,$2]=$2; b[$1,$2]=$3; c[$1]=$1; next} /^>/ {w=$0; sub(".*> *", "", w)} ! /^>/ && c[w] {for (i in a) $(a[i])=b[i]} 1 ' file2 FS= OFS= file1

>Data_1
2421442

Thanks.

Probably this is because of what I mentioned in the note in post #5 about FS=

Hi Scrutinizer,

I tried your awk code. It returns a "syntax error" :(

awk:  NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s }
awk:                        ^ syntax error
awk:  NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s }
awk:                                           ^ syntax error
awk:  NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s }
awk:                                                                                                                     ^ syntax error

I just typed the command below in my terminal:

awk ' NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s } ' file2 RS=\> FS='\n' file1

I typed it as one long awk command :(
My operating system is "x86_64 x86_64 x86_64 GNU/Linux". My awk is "GNU Awk 3.1.7".

Could that be the reason why it returns a syntax error?
Thanks a lot again for your advice.


Hi,

I believe so :(
Do you have any advice regarding my problem?

Sorry.
I am still quite new to awk, perl, shell scripting and programming :(

You did not turn it into a one-liner properly; watch the semicolons. Try:

awk 'NR==FNR{R[$1,$2]=$3; next} FNR>1{s=x; for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)); print RS $1 FS s}' file2 RS=\> FS='\n' file1

But you do not have to turn it into a one-liner; you can also paste multiple lines, or put it in a file and execute that.
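
A minimal sketch of the "put it in a file" route, assuming the sample data from the first post. The file name replace.awk and the scratch directory are made up for the illustration; inside a program file, newlines separate the statements, so no extra semicolons are needed.

```shell
# Scratch directory for the demo (hypothetical paths).
dir=$(mktemp -d) || exit 1

# Sample inputs from the first post; file2 is TAB-delimited.
printf '>Data_1\nASWDADAQTWQQGSAAAAASDAFAFA\n' > "$dir/file1"
printf 'Data_1\t2\tZ\nData_1\t3\tT\nData_1\t10\tA\nData_1\t11\tT\n' > "$dir/file2"

# The same program as the multi-line version above, stored in its own file.
cat > "$dir/replace.awk" <<'EOF'
NR == FNR {                 # first file (file2): remember the replacements
  R[$1,$2] = $3
  next
}
FNR > 1 {                   # second file (file1): one record per ">" block
  s = ""
  for (i = 1; i <= length($2); i++)
    s = s ((($1,i) in R) ? R[$1,i] : substr($2, i, 1))
  print RS $1 FS s
}
EOF

# Run it with -f; the RS and FS assignments apply to file1 only.
out=$(awk -f "$dir/replace.awk" "$dir/file2" RS='>' FS='\n' "$dir/file1")
printf '%s\n' "$out"
```

For the sample data this prints the header followed by AZTDADAQTATQGSAAAAASDAFAFA, the desired output from the first post.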

---
With your version of awk the other script should work too; you probably forgot to put semicolons there as well.

Hi Scrutinizer,

Thanks again.
It worked perfectly with the sample sequence I provided.
However, when I run it on my own real data set, it just prints the original File 1 unchanged :(

Could it be an issue with the length of the FASTA sequence?
My original sequence is around 2 million characters on a single line.

My File 1 has 2 lines: the first line is the header description and the second line is one very long sequence (around 2 million characters).

My File 2 is a tab-delimited file.
First column is the header of File 1;
Second column is the character to replace in File 1;
Third column is the position of the character to replace in File 1;

I think that is too long. The FASTA format allows wrapping of the sequence over multiple lines. That should be an option in the program you used to generate the file.

Please indicate if you would like to go that route; then I can adjust my suggestion so that it works for that format as well.
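
If the generating program offers no such option, a single-line FASTA can also be re-wrapped afterwards. The following is only a sketch under the assumption that every line not starting with ">" is sequence data; the width w=10 is chosen for the demo and the paths are made up.

```shell
dir=$(mktemp -d) || exit 1
printf '>Data_1\nASWDADAQTWQQGSAAAAASDAFAFA\n' > "$dir/file1"

# w is the number of sequence characters per output line (demo value).
wrapped=$(awk -v w=10 '
  /^>/ { print; next }      # header lines pass through unchanged
  {
    # cut the sequence line into chunks of at most w characters
    for (i = 1; i <= length($0); i += w)
      print substr($0, i, w)
  }
' "$dir/file1")
printf '%s\n' "$wrapped"
```

With the sample record this yields the header followed by three lines of 10, 10 and 6 characters.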


Many thanks for your help.

I have now split the long sequence into lines of 100 characters.
Unfortunately the output is just the header with the first 100 characters of the sequence :(

This is NOT what you specified in post#1:

Data_1 2 Z
Data_1 3 T
Data_1 10 A
Data_1 11 T

Yes, as I mentioned, it will only work with a single-line FASTA sequence.
Try this instead, which should now work with a wrapped (multi-line) FASTA sequence:

awk '
  NR==FNR {
    R[$1,$2]=$3
    next
  }

  FNR>1 {
    h=$1
    len=length($2)
    print RS h
    for(i=2; i<=NF; i++) {
      s=x
      for(j=1; j<=length($i); j++) {
        pos=j+(i-2)*len
        s=s ((h,pos) in R ? R[h,pos] : substr($i,j,1))
      }
      print s
    }
  }
' FS='\t' file2 FS=" " RS=\> file1
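
The script above derives each position from the length of the first sequence line, so it assumes that all lines except the last have the same width. As an alternative sketch (not the approach posted above), a running offset can be kept while reading file1 line by line, which also copes with uneven line lengths. It assumes that the first column of file2 equals the whole header line without the ">"; the paths are made up.

```shell
dir=$(mktemp -d) || exit 1
# Wrapped FASTA with uneven line lengths (9, 11 and 6 characters).
printf '>Data_1\nASWDADAQT\nWQQGSAAAAAS\nDAFAFA\n' > "$dir/file1"
printf 'Data_1\t2\tZ\nData_1\t3\tT\nData_1\t10\tA\nData_1\t11\tT\n' > "$dir/file2"

out=$(awk -F'\t' '
  # file2: map (header, position) to the new character
  NR == FNR { R[$1,$2] = $3; next }
  # header line: remember the name, reset the offset, print unchanged
  /^>/ { h = substr($0, 2); off = 0; print; next }
  {
    s = ""
    n = length($0)
    for (j = 1; j <= n; j++) {
      pos = off + j         # position within the whole sequence
      s = s (((h,pos) in R) ? R[h,pos] : substr($0, j, 1))
    }
    off += n                # characters consumed so far in this record
    print s
  }
' "$dir/file2" "$dir/file1")
printf '%s\n' "$out"
```

The line breaks of the input are kept, so the output has the same layout as file1 with only the listed positions changed.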

Thanks for the reminder, RudiC.
Sorry for my mistake.

I have just edited my post #1.
Thanks a lot.

So that means the sample of file2 also changes?

Also, your sample file2 is not TAB-delimited

I corrected post #16 so that it works for a TAB-delimited file2.
Could you check the column order and whether the file is indeed TAB-delimited?
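
One possible way to run that check, sketched on two made-up files: every line should have exactly three TAB-separated fields with a number in the second one. The helper name check and the file names are hypothetical.

```shell
dir=$(mktemp -d) || exit 1
# TAB-delimited, position in column 2
printf 'Data_1\t2\tZ\nData_1\t3\tT\n' > "$dir/good.tsv"
# a space-delimited line, then a line with columns 2 and 3 swapped
printf 'Data_1 2 Z\nData_1\tT\t3\n' > "$dir/bad.tsv"

# Report every line that is not "header TAB number TAB character";
# the exit status is non-zero if any such line was found.
check() {
  awk -F'\t' '
    NF != 3 || $2 !~ /^[0-9]+$/ {
      printf "line %d: unexpected format: %s\n", NR, $0
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

check "$dir/good.tsv" && echo "good.tsv looks fine"
check "$dir/bad.tsv"  || echo "bad.tsv needs fixing"
```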


Thanks, Scrutinizer.

May I ask again how to fix the syntax error?
If I run it as one long awk command in the terminal, it returns a syntax error.

If I copy and paste the whole command into a file called "run.sh" and execute it as "sh run.sh", it still returns some syntax errors :(

Sorry, and thanks for your guidance and advice.


Hi Scrutinizer,

File 1 is a FASTA file with one long record (the first line is the header description and the second line is its corresponding nucleotide sequence).
File 2 is a tab-delimited file with 3 columns.
First column is the header description of File 1 (without the ">");
Second column is the character to replace in File 1;
Third column is the position of the character to replace in File 1;

Basically it is still the same as my original question.
I just forgot to mention that my File 2 is a tab-delimited file :(

Sorry for the confusion.
I have just edited my thread to clarify it.


My main objective is to replace every character of File 1 that is listed in File 2 (at the given position, with the new character given in File 2).

Yes, but now your sample file2 does not match the description. Which one is right? If it is not the sample, could you correct the sample?

And you meant 3 fields presumably, not lines ...
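
If the real file2 turns out to be ordered header, replacement, position (as in the later description) rather than header, position, replacement (as in the sample), the scripts in this thread would never find a matching position, because they look up the second column as the position. One possible fix, sketched here on made-up rows, is to swap columns 2 and 3 before running them:

```shell
# Hypothetical rows in the order: header, replacement, position.
swapped=$(printf 'Data_1\tZ\t2\nData_1\tT\t3\n' |
  awk -F'\t' -v OFS='\t' '{ print $1, $3, $2 }')   # emit header, position, replacement
printf '%s\n' "$swapped"
```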
