[Solved] Editing letters based on position information

Lucky_Ali · June 14, 2013, 6:04pm

I have a file in the following format:

file 1

>SAM
ATGCTCCTTAGCTACGTAGCAAGTAGAAAAAA
AGCGCGAGCATTGAAGCGGAGGAGAGGAGGA
TGAGATGATGACCCAGTATAAAGAGTGATGAT

File 1 has thousands of lines like the ones above. I would like to edit file 1 using the information in file 2 (see below): replace the letter at position 4, C, with T; the letter at position 10, A, with C; and so on up to position 61, G to A, and position 66, A to C.

file 2

4 C T
10 A C
19 G A
38 G T
61 G A
66 A C

I expect an output file, file 3, with the edits applied:

>SAM
ATGTTCCTTCGCTACGTAACAAGTAGAAAAAA
AGCGCTAGCATTGAAGCGGAGGAGAGGAAGA
TGCGATGATGACCCAGTATAAAGAGTGATGAT

Please let me know the best way to edit file 1 and create the new file 3 using awk, sed or perl.

rveri · June 14, 2013, 7:51pm

Your data doesn't look like it has a length of 66, as the question requires.

Lucky_Ali · June 14, 2013, 7:56pm

No, it doesn't have a length of 66. File 2 gives the positions in file 1 where the edits have to be made. The actual file has thousands of letters. Hope this helps.

Scott · June 14, 2013, 8:04pm

What have you tried?

Just_Ice · June 15, 2013, 3:31am

First split the string with perl (see example 4 here), then check and replace according to position and letter ...

RudiC · June 15, 2013, 4:57am

Your sample output file by no means fits the edits defined in file 2 applied to file 1, e.g. line 2 pos 10 should have become a C; and where does the A at line 2 pos 29 come from?
Please post representative samples, and your solution attempts so far.

Lucky_Ali · June 16, 2013, 12:25pm

The count does not restart on each line. It starts at position 1 and runs continuously to the end of the file. Hope this makes it clearer. I tried to put it all in a hash using perl but did not manage to get the output.

RudiC · June 16, 2013, 12:40pm

So - how do you count <newline> chars?

Lucky_Ali · June 16, 2013, 12:48pm

That is where I am not able to proceed. One thing: the real file that I am editing has exactly 60 letters on each line.
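
With a fixed line width, a continuous position can be turned into a line and a column by integer arithmetic. The following is only a sketch under assumptions: 60 letters per line as stated above, newlines not counted, and a single header line at the top of the file that is not counted either. The function name pos_to_line_col is made up for illustration.

```shell
# pos_to_line_col WIDTH: read positions on stdin, print "position file_line column".
# Assumes every sequence line holds exactly WIDTH letters and that the file
# starts with one header line, which is why 1 is added to the line number.
pos_to_line_col() {
    awk -v width="$1" '{
        seq_line = int(($1 - 1) / width) + 1   # 1-based line within the sequence
        col      = ($1 - 1) % width + 1        # 1-based column on that line
        print $1, seq_line + 1, col            # file line = sequence line + header
    }'
}

# The positions from file 2, mapped for 60 letters per line:
printf '%s\n' 4 10 19 38 61 66 | pos_to_line_col 60
```

Under these assumptions, position 61 is the first letter of the second sequence line, and position 66 is the sixth letter of that line.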

RudiC · June 16, 2013, 1:06pm

Try this

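awk     '# file2: remember the letter to find (F) and its replacement (R) per position
         NR==FNR        {F[$1]=$2; R[$1]=$3; next}
         # file1: with FS="" every letter is a field; CNT counts letters across lines
                        {for (i=1; i<=NF; i++) if ($i == F[++CNT]) $i=R[CNT]}
         1
        ' file2 FS="" OFS="" file1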
ATGTTCCTTCGCTACGTAACAAGTAGAAAAAA
AGCGCTAGCATTGAAGCGGAGGAGAGGAAGA
TGCGATGATGACCCAGTATAAAGAGTGATGAT

Lucky_Ali · June 16, 2013, 1:35pm

I am getting the same file 1 as output, without the edits. Am I missing something?

RudiC · June 16, 2013, 1:40pm

Strange. I used your file1 and file2, and as you can see, the result above is what you requested as output.

Lucky_Ali · June 16, 2013, 1:44pm

This is what I get (I have different names for file 1 and file 2):

awk     'NR==FNR        {F[$1]=$2; R[$1]=$3; next}
                        {for (i=1; i<=NF; i++) if ($i == F[++CNT]) $i=R[CNT]}
         1
        ' list.txt FS="" OFS="" sam.fasta
>SAM
ATGCTCCTTAGCTACGTAGCAAGTAGAAAAAA
ATCGCGAGCATTGAAGCGGAGGAGAGGAGGA
TGAGATGATGACCCAGTATAAAGAGTGATGAT

RudiC · June 16, 2013, 1:54pm

So you have ">SAM" in the file? I read it as being a prompt and a command... In your spec, you neither counted it nor took it into account! You should have mentioned that.
Add FNR > 1 as the pattern in front of the second line's action, so that the header line is skipped and not counted...
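
For awk versions where an empty FS does not split a line into single letters, the same idea can be written with substr(). This is a sketch, not the command used in this thread: the function name apply_edits and the temporary file names are made up, and skipping every line that starts with ">" (instead of only line 1) is an assumption about the file layout.

```shell
# apply_edits EDITS SEQ: hypothetical helper. EDITS has "position old new" per line.
apply_edits() {
    awk '
        NR == FNR { from[$1] = $2; to[$1] = $3; next }   # first file: the edit list
        /^>/      { print; next }                        # header line: print, do not count
        {
            out = ""
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                pos++                                    # running position across lines
                # "in" tests for an edit without creating an array element
                if ((pos in from) && c == from[pos]) c = to[pos]
                out = out c
            }
            print out
        }
    ' "$1" "$2"
}

# Sample data from the first post, written to a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' '>SAM' 'ATGCTCCTTAGCTACGTAGCAAGTAGAAAAAA' \
    'AGCGCGAGCATTGAAGCGGAGGAGAGGAGGA' \
    'TGAGATGATGACCCAGTATAAAGAGTGATGAT' > "$tmp/file1"
printf '%s\n' '4 C T' '10 A C' '19 G A' '38 G T' '61 G A' '66 A C' > "$tmp/file2"

apply_edits "$tmp/file2" "$tmp/file1"
```

On the sample data this prints the header followed by the three lines of file 3 requested in the first post. A letter is only replaced when it matches the "old" letter given for its position.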

Lucky_Ali · June 16, 2013, 4:18pm

Great. Thank you, it worked.

---------- Post updated at 04:18 PM ---------- Previous update was at 01:56 PM ----------

Actually, I got an error message when I ran the above awk code on my actual file with 17,377 lines. This was the error message:

Segmentation fault (core dumped)

Is it a memory issue? Please let me know a solution.

RudiC · June 16, 2013, 4:41pm

Not sure. As F[ ] elements are created when referenced, memory may become exhausted, but then the error msg should complain about memory allocation problems. Try this:

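awk     '# file2: F = letter to find, R = replacement, both keyed by position
         NR == FNR      {F[$1]=$2; R[$1]=$3; next}
         # file1, skipping the ">SAM" header on line 1
         FNR > 1        {for (i=1; i<=NF; i++) {
                                 if ($i == F[++CNT]) $i=R[CNT]
                                 delete F[CNT]          # drop the element the lookup just touched
                                }
                        }
         1              # print every line, edited or not
        ' file2 FS="" OFS="" file1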
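
The remark that array elements are created when referenced can be checked with a small experiment. The sketch below only demonstrates that awk behaviour; it does not establish that this was the cause of the segmentation fault. The function names are made up for illustration.

```shell
# Count the elements of array F after N lookups of keys that were never assigned.

count_after_compare() {   # plain comparison: every lookup creates an empty element
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++) if (F[i] == "x") hit++
        for (k in F) cnt++
        print cnt + 0
    }'
}

count_after_in() {        # membership test with "in" first: nothing is created
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++) if ((i in F) && F[i] == "x") hit++
        for (k in F) cnt++
        print cnt + 0
    }'
}

count_after_compare 1000   # prints 1000
count_after_in 1000        # prints 0
```

So a comparison such as $i == F[++CNT] adds one element to F for every letter read, which is what the delete in the command above cleans up. Testing with (CNT in F) before the comparison would avoid creating the elements in the first place.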

Lucky_Ali · June 16, 2013, 4:46pm

I think it worked. It didn't give any error messages. Thank you very much.