[Solved] Editing letters based on position information

I have a file in the following format:

file 1

>SAM
ATGCTCCTTAGCTACGTAGCAAGTAGAAAAAA
AGCGCGAGCATTGAAGCGGAGGAGAGGAGGA
TGAGATGATGACCCAGTATAAAGAGTGATGAT

File 1 has thousands of lines. I would like to edit file 1 using the information in file 2 (see below): replace the letter at position 4, C, with T; the letter at position 10, A, with C; ... the letter at position 61, G, with A; and the letter at position 66, A, with C.

file 2

4 C T
10 A C
19 G A
38 G T
61 G A
66 A C

I expect the output file, file 3, to contain the edits:

>SAM
ATGTTCCTTCGCTACGTAACAAGTAGAAAAAA
AGCGCTAGCATTGAAGCGGAGGAGAGGAAGA
TGCGATGATGACCCAGTATAAAGAGTGATGAT

Please let me know the best way to edit file 1 and create the new file 3 using awk, sed, or perl.

Your data doesn't look like it has a length of 66, as the question implies.

No, it doesn't have a length of 66. File 2 gives the positions in file 1 where the edits have to be made. The actual file has thousands of letters. Hope this helps.

What have you tried?

First split the string with perl (see example 4 here), then check and replace according to position and letter ...

Your sample output file by no means matches the edits defined in file 2 applied to file 1; e.g. line 2 pos 10 should have become a C, and where does the A at line 2 pos 29 come from?
Please post representative samples, and your solution attempts so far.

The count does not restart on each line. It starts from position 1 and runs continuously to the end of the file. Hope this makes it clearer. I tried to put it all in a hash using perl but was not successful in getting the output.

So - how do you count <newline> chars?

That is where I am not able to proceed. One thing: the real file that I am editing has exactly 60 letters in each line.
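If every sequence line really holds exactly 60 letters and newlines are not counted, a continuous position maps to a line and a column by simple arithmetic. A minimal sketch under those assumptions (the names width, line and col are only illustrative, and the header line is not counted):

```shell
# Map continuous positions to (sequence line, column), assuming a fixed
# width of 60 letters per line and that newlines are not counted.
width=60
mapped=$(printf '4\n61\n66\n' | awk -v w="$width" '{
    line = int(($1 - 1) / w) + 1   # 1-based sequence line, header excluded
    col  = ($1 - 1) % w + 1        # 1-based column within that line
    print $1, line, col
}')
printf '%s\n' "$mapped"
```

With a width of 60, position 61 is the first letter of the second sequence line. The short sample above uses lines of 32 and 31 letters, so this shortcut does not apply to it.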

Try this:

awk     'NR==FNR        {F[$1]=$2; R[$1]=$3; next}
                        {for (i=1; i<=NF; i++) if ($i == F[++CNT]) $i=R[CNT]}
         1
        ' file2 FS="" OFS="" file1
ATGTTCCTTCGCTACGTAACAAGTAGAAAAAA
AGCGCTAGCATTGAAGCGGAGGAGAGGAAGA
TGCGATGATGACCCAGTATAAAGAGTGATGAT

I am getting the same file1 as output, without the edits. Am I missing something?

Strange. I used your file1 and file2, and the result above is exactly what you requested as output.

This is what I get (I have different names for file 1 and file 2):

awk     'NR==FNR        {F[$1]=$2; R[$1]=$3; next}
                        {for (i=1; i<=NF; i++) if ($i == F[++CNT]) $i=R[CNT]}
         1
        ' list.txt FS="" OFS="" sam.fasta
>SAM
ATGCTCCTTAGCTACGTAGCAAGTAGAAAAAA
ATCGCGAGCATTGAAGCGGAGGAGAGGAGGA
TGAGATGATGACCCAGTATAAAGAGTGATGAT

So you have ">SAM" in the file? I read it as a prompt and a command... Your spec neither counted it nor took it into account! You should have mentioned that.
Add FNR > 1 as the condition on the second line, so that the header line is left out of the count.

Great. Thank you, it worked.

Actually, I got an error message when I ran the above awk code on my actual file with 17,377 lines. The error message was:

Segmentation fault (core dumped)

Is it a memory issue? Please let me know a solution.

Not sure. As F[ ] elements are created when merely referenced, memory may become exhausted, but then the error message should complain about memory allocation problems. Try this, which deletes each element right after the lookup:

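awk     'NR == FNR      {F[$1]=$2; R[$1]=$3; next}      # file2: F = letter to find, R = replacement
         FNR > 1        {for (i=1; i<=NF; i++) {        # file1: skip the >SAM header line
                                 if ($i == F[++CNT]) $i=R[CNT]   # CNT counts letters across lines
                                 delete F[CNT]           # drop the element the lookup just created
                                }
                        }
         1                                               # print every line, edited or not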
        ' file2 FS="" OFS="" file1 
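The remark that F[ ] elements are created on reference can be checked directly. A small sketch (the array names F and G and the loop bound 5 are just for illustration): a plain lookup creates an element for every index it touches, while an `in` test does not.

```shell
# Count how many elements each array holds after five lookups.
created=$(awk 'BEGIN {
    for (i = 1; i <= 5; i++) if (F[i] == "x") hits++               # plain reference
    for (i = 1; i <= 5; i++) if ((i in G) && G[i] == "x") hits++   # membership test first
    for (k in F) nf++
    for (k in G) ng++
    print nf + 0, ng + 0
}')
printf '%s\n' "$created"
```

The first number is the size of F after the plain lookups, the second the size of G after the guarded ones, so on a long input the unguarded lookup grows the array by one element per letter read.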

I think it worked. It didn't give any error messages. Thank you very much
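For reference, a hedged alternative sketch, not the command used above: it tests membership with `in` so no array elements are created during lookup, and it walks each line with substr() instead of relying on an empty FS, which not every awk supports. It assumes header lines start with ">" and should not be counted; the names from, to, pos and the temporary file paths are made up for this example.

```shell
# Rebuild the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '>SAM' \
    'ATGCTCCTTAGCTACGTAGCAAGTAGAAAAAA' \
    'AGCGCGAGCATTGAAGCGGAGGAGAGGAGGA' \
    'TGAGATGATGACCCAGTATAAAGAGTGATGAT' > "$tmp/file1"
printf '%s\n' '4 C T' '10 A C' '19 G A' '38 G T' '61 G A' '66 A C' > "$tmp/file2"

result=$(awk '
    NR == FNR { from[$1] = $2; to[$1] = $3; next }   # load the edits from file2
    /^>/      { print; next }                        # header: print it, do not count it
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            pos++                                    # continuous position across lines
            if ((pos in from) && c == from[pos]) c = to[pos]
            out = out c
        }
        print out
    }
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

As in the original command, a letter is replaced only when it matches the expected letter in file 2, so a mismatch leaves that position untouched.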