Help comparing two columns and replacing characters in a word

perl_beginner · July 29, 2016, 12:40am

Input File

CGGCGCCTCGCNNNCGAGCG    CGGCGCGCCGAATCCGTGCG
TCGCNGC GCGCCGC
ACGGCNNNNN     ACGGCCTCGCG
CGGCNGCCCGCCC   CGGCGCGCCGTCC

Desired Output File

CGGCGCCTCGCNNNCGAGCG    CGGCGCGCCGAATCCGTGCG CGGCGCCTCGCATCCGAGCG
TCGCNGC GCGCCGC TCGCCGC
ACGGCNNNNN     ACGGCCTCGCG ACGGCTCGCG
CGGCNGCCCGCCC   CGGCGCGCCGTCC CGGCGGCCCGCCC

The first and second columns always have the same number of characters. I want the third column to be a copy of the first column, with every N replaced by the character at the corresponding position in the second column.

It seems a bit complicated
Thanks for any advice.

RavinderSingh13 · July 29, 2016, 2:08am

Hello perl_beginner,

Thank you for asking a good question; please keep it up. Coming to your requirement: of course, the shell can't see the bold characters (which you used to help us understand), so assuming that the N characters in each line always form one continuous run, the following may help.

awk '{split("ATC:C:TCGCG:G", array,":");$(NF+1)=$1;sub(/N*N/,array[NR],$NF);print}'   Input_file

Output will be as follows.

CGGCGCCTCGCNNNCGAGCG CGGCGCGCCGAATCCGTGCG CGGCGCCTCGCATCCGAGCG
TCGCNGC GCGCCGC TCGCCGC
ACGGCNNNNN ACGGCCTCGCG ACGGCTCGCG
CGGCNGCCCGCCC CGGCGCGCCGTCC CGGCGGCCCGCCC

Here you need to list all the strings you want substituted (in the newly created third column) in the split("ATC:C:TCGCG:G", array,":") call, in the same order as the input lines, and it should work. If you have more permutations or combinations of this, please let us know.

EDIT: We could move the split call into a BEGIN section so that the array is created only once. Here is the above code with that minor change.

awk 'BEGIN{split("ATC:C:TCGCG:G", array,":")};{$(NF+1)=$1;sub(/N*N/,array[NR],$NF);print}'  Input_file

Thanks,
R. Singh

perl_beginner · July 29, 2016, 2:26am

Hi R. Singh, thanks again for your prompt reply and help

I tried your awk command on other records, and it does not seem to work.
I believe this is due to

split("ATC:C:TCGCG:G", array,":")

which is specific to these records and differs for others.

Is there any way to have awk automatically replace every N in the first column based on the corresponding position in the second column?

I was thinking of using the split command to break the words in the first and second columns into characters,
and then using

awk if else

to print the character from the second column wherever the first column has an "N".

Thanks a lot again for your advice.

RavinderSingh13 · July 29, 2016, 2:38am

Hello perl_beginner,

Sorry, I missed the point that the replacement should come from the same position in column 2. Could you please try the following?

awk '{$(NF+1)=$1;match($1,/N*N/);$NF=substr($NF,1,RSTART-1) substr($2,RSTART,RLENGTH) substr($NF,RSTART+RLENGTH);print}'   Input_file

Output will be as follows.

CGGCGCCTCGCNNNCGAGCG CGGCGCGCCGAATCCGTGCG CGGCGCCTCGCATCCGAGCG
TCGCNGC GCGCCGC TCGCCGC
ACGGCNNNNN ACGGCCTCGCG ACGGCCTCGC
CGGCNGCCCGCCC CGGCGCGCCGTCC CGGCGGCCCGCCC
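
To see why the command above works, it may help to print what match() sets. The following is only a sketch, assuming a POSIX awk; the function name show_match is made up for illustration, and /N+/ is used as an equivalent spelling of /N*N/. For each line it prints RSTART, RLENGTH, and the piece of column 2 that would replace the first run of N.

```shell
# show_match is a hypothetical helper name, not part of the solution above.
show_match() {
    awk '{
        # match() finds the leftmost, longest run of N in column 1 and
        # sets RSTART (start position) and RLENGTH (length of the run).
        if (match($1, /N+/))
            print RSTART, RLENGTH, substr($2, RSTART, RLENGTH)
        else
            print "no N found"
    }' "$@"
}

printf 'CGGCGCCTCGCNNNCGAGCG CGGCGCGCCGAATCCGTGCG\n' | show_match
# prints: 12 3 ATC
```

The run of N starts at position 12 and is 3 characters long, so the characters at positions 12 to 14 of column 2 (ATC) are spliced in between the two substr() pieces of column 1.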

Thanks,
R. Singh

perl_beginner · July 29, 2016, 3:12am

Thanks very much, R. Singh.
It works perfectly now.

---------- Post updated at 02:12 AM ---------- Previous update was at 01:44 AM ----------

Hi R. Singh,

Sorry to disturb you again.
I just found one more interesting issue.

Is it possible for your awk command to keep searching through all the "N" characters in the first column and replace each one based on the corresponding position in the second column?

I noticed that if the first column has three N characters at different positions,
the awk command replaces only the first N and leaves the others in place.
For example:

NCGTNGGCGTCGGCGN      GCGTCGGCGTGGGCGT      GCGTNGGCGTCGGCGN

In the above example, only the first N in the first column is replaced; the second and third N are left unchanged.

Thanks a lot again.

RudiC · July 29, 2016, 3:39am

Try an adaptation of RavinderSingh13's fine proposal:

awk '
        {$3 = $1
         while (match ($3, /N*N/)) $3 =  substr($3, 1, RSTART-1) substr($2, RSTART, RLENGTH) substr($3, RSTART+RLENGTH)
        }
1
' file
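
The character-by-character idea raised earlier in the thread (split both words and use if/else) can also be sketched directly. This is an alternative under stated assumptions, not the posters' method: it assumes a POSIX awk and whitespace-separated columns, and the function name fill_n is hypothetical.

```shell
# fill_n is a hypothetical helper name used only for this sketch.
fill_n() {
    awk '{
        out = ""
        n = length($1)
        for (i = 1; i <= n; i++) {
            c = substr($1, i, 1)
            # Take the character from column 2 only where column 1 has N
            # and column 2 actually has a character at that position.
            if (c == "N" && i <= length($2))
                c = substr($2, i, 1)
            out = out c
        }
        print $1, $2, out
    }' "$@"
}

printf 'NCGTNGGCGTCGGCGN GCGTCGGCGTGGGCGT\n' | fill_n
# prints: NCGTNGGCGTCGGCGN GCGTCGGCGTGGGCGT GCGTCGGCGTCGGCGT
```

Because each position is visited exactly once, this version also terminates when column 2 itself has an N at the same position, a case in which a while (match(...)) loop would keep finding the same N and never finish.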

perl_beginner · July 29, 2016, 3:44am

Thanks, RudiC.
It solves my problem with more than one N at different positions in the first column.

Many thanks again.