Help with merging data into a reference sequence

cpp_beginner · July 29, 2016, 12:38pm

I have two input files:

File 1 is a large reference sequence (a large FASTA sequence);

File 1 (the first line, starting with ">", is the header description; the line after it is the corresponding sequence, with positions counted from 1 to the end of the line);

>Data_1
ASWDADAQTWQQGSAAAAASDAFAFA
.
.

File 2 is a list of the characters that I want to substitute at specific positions in File 1;
File 2 (3 columns, tab-delimited);

Data_1 2 Z
Data_1 3 T
Data_1 10 A
Data_1 11 T
.
.

Desired Output File

>Data_1
AZTDADAQTATQGSAAAAASDAFAFA
.
.

File 1 is a FASTA file with one long record (a first line with the header description, starting with ">", followed by a line with the corresponding sequence).
File 2 is a file with 3 columns (tab-delimited).
First column is the header description of File 1 (without the ">");
Second column is the position of the character that I want to replace in File 1;
Third column is the character to put at that position in File 1;

My awk attempt:

awk -F "\t" '(FNR==1){x++} (x==1){a[$1][$2]=$3;next} (x==2){if($0~/>/){h=$0;sub(/^.*Data/,"",h);sub(/ .*/,"",h)} else{seq[h]=seq[h]$0}} END{for(i in a){s=0; for(j in a[i]){m=m substr(seq[i],s,j-1) a[i][j];s=j+1} m=m substr(seq[i],s); print ">Data"i"\n"m}}' File 2 File 1

I would like to replace the characters at specific positions in the sequence of File 1 (excluding the header line >Data_1) wherever they are listed in File 2.
My main objective is to replace the character at each position listed in File 2 with the new character given there.

Thanks for any advice.

Yoda · July 29, 2016, 12:50pm

Try this:-

awk '
        NR == FNR {
                A[">"$1 FS $2] = $3
                next
        }
        /^>/ {
                T = $0
                print
                next
        }
        {
                for ( i = 1; i <= length; i++ )
                {
                        if ( ( T FS i ) in A )
                                printf "%s", A[T FS i]
                        else
                                printf "%s", substr( $0, i, 1 )
                }
                printf "\n"
        }
' file2 file1

Scrutinizer · July 29, 2016, 12:53pm

Can the sequences in your FASTA file be spread over multiple lines?

rdrtx1 · July 29, 2016, 1:00pm

awk 'NR==FNR {a[$1,$2]=$2; b[$1,$2]=$3; c[$1]=$1; next}
/^>/ {w=$0; sub(".*> *", "", w)}
! /^>/ && c[w] {for (i in a) $(a[i])=b[i]}
1
' file2 FS= OFS= file1

Scrutinizer · July 29, 2016, 4:51pm

If fasta sequences are always only a single line:

awk '
  NR==FNR { 
    R[$1,$2]=$3
    next
  }
  FNR>1 {
    s=x
    for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1))
    print RS $1 FS s
  }
' file2 RS=\> FS='\n' file1

----
Note: FS= (the extension that if FS is equal to the empty string, each character becomes a separate field) is not part of POSIX and may or may not work with your version of awk.
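
A minimal sketch of the same idea that avoids the empty-FS extension by walking each sequence line with substr(), which POSIX awk provides. The sample data comes from post #1; the scratch directory and the variable names are only illustrative, and the sketch assumes the header line holds nothing but the ID that appears in column 1 of file2.

```shell
# Sample data from post #1, written to a scratch directory (illustrative only).
tmp=$(mktemp -d)
printf '>Data_1\nASWDADAQTWQQGSAAAAASDAFAFA\n' > "$tmp/file1"
printf 'Data_1\t2\tZ\nData_1\t3\tT\nData_1\t10\tA\nData_1\t11\tT\n' > "$tmp/file2"

out=$(awk '
  # file2: remember the replacement for each header/position pair
  NR == FNR { R[$1, $2] = $3; next }
  # file1 header line: keep the ID without the leading ">"
  /^>/ { h = substr($0, 2); print; next }
  # file1 sequence line: copy it character by character, substituting where listed
  {
    s = ""
    n = length($0)
    for (i = 1; i <= n; i++)
      s = s (((h, i) in R) ? R[h, i] : substr($0, i, 1))
    print s
  }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample from post #1 this should print the header followed by AZTDADAQTATQGSAAAAASDAFAFA.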

cpp_beginner · July 31, 2016, 5:52am

Hi,

The FASTA sequence is a single, very long line.

---------- Post updated at 04:52 AM ---------- Previous update was at 04:47 AM ----------

Hi,

Sorry.
May I know why it returns a syntax error when I type it as a one-line awk command in my terminal?

Should I run your awk command as a shell script instead?
Thanks for the advice.

Scrutinizer · July 31, 2016, 5:54am

Hi, which script are you referring to?
What is your OS?
How do you paste it?

cpp_beginner · July 31, 2016, 5:54am

Hi,

It doesn't seem to work.
I believe it returns the header together with the length of the FASTA sequence.

awk 'NR==FNR {a[$1,$2]=$2; b[$1,$2]=$3; c[$1]=$1; next} /^>/ {w=$0; sub(".*> *", "", w)} ! /^>/ && c[w] {for (i in a) $(a[i])=b[i]} 1 ' file2 FS= OFS= file1

>Data_1
2421442

Thanks.

Scrutinizer · July 31, 2016, 5:58am

Probably this is because of what I mentioned in the note in post #5 about FS=

cpp_beginner · July 31, 2016, 6:01am

Hi Scrutinizer,

I tried your awk code. It seems to return a syntax error:

awk:  NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s }
awk:                        ^ syntax error
awk:  NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s }
awk:                                           ^ syntax error
awk:  NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s }
awk:                                                                                                                     ^ syntax error

I just typed the command below in my terminal:

awk ' NR==FNR { R[$1,$2]=$3 next } FNR>1 { s=x for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)) print RS $1 FS s } ' file2 RS=\> FS='\n' file1

I typed it as one long awk command.
My operating system is "x86_64 x86_64 x86_64 GNU/Linux". My awk is "GNU Awk 3.1.7".

Could that be the reason it returns a syntax error?
Thanks again for your advice.

---------- Post updated at 05:01 AM ---------- Previous update was at 05:00 AM ----------

Hi,

I believe so.
Do you have any advice regarding my problem?

Sorry.
I am still quite new to awk, perl, shell scripting and programming in general.

Scrutinizer · July 31, 2016, 6:03am

You did not turn it into a one-liner properly, watch the semicolons. Try:

awk 'NR==FNR{R[$1,$2]=$3; next} FNR>1{s=x; for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1)); print RS $1 FS s}' file2 RS=\> FS='\n' file1

But you do not have to turn it into a one-liner, you can also paste multiple lines or put it in a file and execute that.

---
With your version of awk the other script should work too, probably you forgot to put semicolons there too
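
Putting the program in a file sidesteps the semicolon problem, because newlines then separate the statements. A minimal sketch, assuming a hypothetical program file name replace.awk and the sample data from post #1 (the scratch directory is only for illustration):

```shell
tmp=$(mktemp -d)
printf '>Data_1\nASWDADAQTWQQGSAAAAASDAFAFA\n' > "$tmp/file1"
printf 'Data_1\t2\tZ\nData_1\t3\tT\nData_1\t10\tA\nData_1\t11\tT\n' > "$tmp/file2"

# replace.awk is a hypothetical file name; its content is the multi-line
# script from post #5, unchanged.
cat > "$tmp/replace.awk" <<'EOF'
NR==FNR {
  R[$1,$2]=$3
  next
}
FNR>1 {
  s=x
  for(i=1; i<=length($2); i++) s=s (($1,i) in R ? R[$1,i] : substr($2,i,1))
  print RS $1 FS s
}
EOF

# The variable assignments still go on the command line, between the file names.
out=$(awk -f "$tmp/replace.awk" "$tmp/file2" RS=\> FS='\n' "$tmp/file1")
printf '%s\n' "$out"
rm -rf "$tmp"
```

The same file can be reused for any pair of inputs by changing only the two file names on the awk command line.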

cpp_beginner · July 31, 2016, 6:17am

Hi Scrutinizer,

Thanks again.
It worked perfectly with my sample sequence provided.
However, I noticed that when I run it on my real data set,
it just prints out the original File 1.

Could it be an issue with the length of the FASTA sequence?
My original sequence is around 2 million characters on a single line.

My File 1 has 2 lines: the first line is the header description and the second line is one very long sequence (around 2 million characters).

My File 2 is a tab-delimited file.
First column is the header of File 1;
Second column is the character to put into File 1;
Third column is the position of the character to replace in File 1;

Scrutinizer · July 31, 2016, 6:25am

I think that is too long. The Fasta format allows wrapping of the sequence over multiple lines. That should be an option in the program you used to generate the file with.

Please indicate if you would like to go that route, then I can adjust my suggestion so that it works for that format as well.
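
If the generating program has no wrapping option, the sequence lines can also be wrapped afterwards. A minimal sketch with awk, assuming a width of 10 only to keep the demo short (the variable w and the scratch directory are illustrative):

```shell
tmp=$(mktemp -d)
printf '>Data_1\nASWDADAQTWQQGSAAAAASDAFAFA\n' > "$tmp/file1"

# Header lines (starting with ">") are printed unchanged; every other line
# is cut into pieces of at most w characters.
wrapped=$(awk -v w=10 '
  /^>/ { print; next }
  {
    n = length($0)
    for (i = 1; i <= n; i += w) print substr($0, i, w)
  }
' "$tmp/file1")

printf '%s\n' "$wrapped"
rm -rf "$tmp"
```

For the sample sequence this should give three sequence lines: ASWDADAQTW, QQGSAAAAAS and DAFAFA.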

cpp_beginner · July 31, 2016, 6:31am

Many thanks for your help.

I have now split the long sequence into lines of 100 characters.
Unfortunately, the output contains only the header and the first 100 characters.

RudiC · July 31, 2016, 6:54am

This is NOT what you specified in post#1:

Data_1 2 Z
Data_1 3 T
Data_1 10 A
Data_1 11 T

Scrutinizer · July 31, 2016, 7:02am

Yes, as I mentioned, it will only work with a single-line FASTA sequence.
Try this instead, which should now work with a wrapped (multi-line) FASTA sequence:

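awk '
  # file2 (TAB-delimited): remember the replacement for each header/position pair
  NR==FNR {
    R[$1,$2]=$3
    next
  }

  # file1: one record per ">" block; $1 is the header, $2..$NF are the sequence lines
  FNR>1 {
    h=$1
    len=length($2)                 # wrap width, taken from the first sequence line
    print RS h
    for(i=2; i<=NF; i++) {
      s=x
      n=length($i)                 # the last line may be shorter than len
      for(j=1; j<=n; j++) {
        pos=j+(i-2)*len            # position within the whole sequence
        s=s ((h,pos) in R ? R[h,pos] : substr($i,j,1))
      }
      print s
    }
  }
' FS='\t' file2 FS=" " RS=\> file1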
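
A variant sketch, not the script above: it reads file1 line by line and keeps a running offset, so it does not depend on every sequence line having the same width. It assumes the header line contains only the ID used in column 1 of file2; the sample is the one from post #1, wrapped at 10 characters for the demo, and the scratch directory is illustrative.

```shell
tmp=$(mktemp -d)
printf '>Data_1\nASWDADAQTW\nQQGSAAAAAS\nDAFAFA\n' > "$tmp/file1"
printf 'Data_1\t2\tZ\nData_1\t3\tT\nData_1\t10\tA\nData_1\t11\tT\n' > "$tmp/file2"

out=$(awk -F '\t' '
  # file2: remember the replacement for each header/position pair
  NR == FNR { R[$1, $2] = $3; next }
  # header line: start a new record and reset the running offset
  /^>/ { h = substr($0, 2); off = 0; print; next }
  # sequence line: position of character j in the whole sequence is off + j
  {
    s = ""
    n = length($0)
    for (j = 1; j <= n; j++)
      s = s (((h, off + j) in R) ? R[h, off + j] : substr($0, j, 1))
    off += n
    print s
  }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Because each line is printed as soon as it is processed, the whole sequence never has to be held in one awk string.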

cpp_beginner · July 31, 2016, 7:18am

Thanks for reminding, RudiC.
Sorry for my mistake.

I just edited my post #1.
Thanks a lot.

Scrutinizer · July 31, 2016, 7:20am

So that means the sample of file2 also changes?

Also, your sample file2 is not TAB-delimited.

I corrected post #16 so that it works for a TAB-delimited file2.
Could you check the column order and whether the file is indeed TAB-delimited?
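
One way to run that check, as a rough sketch: list every line of file2 that does not have exactly three TAB-separated fields or whose second field is not a run of digits. The three-line sample below is made up for illustration, with one deliberately space-separated line.

```shell
tmp=$(mktemp -d)
# Two well-formed lines and one space-separated line (illustrative data).
printf 'Data_1\t2\tZ\nData_1 3 T\nData_1\t10\tA\n' > "$tmp/file2"

# Print the line number and content of every suspicious line.
bad=$(awk -F '\t' 'NF != 3 || $2 !~ /^[0-9]+$/ { print FNR ": " $0 }' "$tmp/file2")

printf '%s\n' "$bad"
rm -rf "$tmp"
```

A file with the position and the character columns swapped would be reported in full, since the second field would then hold a letter.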

cpp_beginner · July 31, 2016, 7:29am

Thanks, Scrutinizer.

May I ask how to fix the syntax error issue again?
If I run it as one long awk command in the terminal,
it returns a syntax error.

If I copy and paste the whole command into a file called "run.sh" and execute it with "sh run.sh",
it still returns a syntax error.

Sorry, and thanks for your guidance and advice.

---------- Post updated at 06:28 AM ---------- Previous update was at 06:25 AM ----------

Hi Scrutinizer,

File 1 is a FASTA file with one long record (a first line with the header description and a second line with the corresponding nucleotide sequence).
File 2 is a file with 3 columns (tab-delimited).
First column is the header description of File 1 (without the ">");
Second column is the character to put into File 1;
Third column is the position of the character to replace in File 1;

Basically it is still the same as my original question.
I just forgot to mention that my File 2 is a tab-delimited file.

Sorry for the confusion.
I just edited my thread to clarify it.

---------- Post updated at 06:29 AM ---------- Previous update was at 06:28 AM ----------

My main objective is to replace all the specified characters in File 1 based on the records in File 2 (each record gives a position and the new character to put there).

Scrutinizer · July 31, 2016, 7:31am

Yes but now your sample file2 does not match the description. Which one is right and if it is not the sample, could you correct the sample?

And you meant 3 fields presumably, not lines ...