Use awk to replace numbers in a file with a column from another file

nastaziales · July 5, 2018, 10:23am

Hello,

I am trying to make a awk code that will take 2 files, a txt file like this :

1   1   88                        c(1:38, 42, 102)
2   2  128 c(39:41, 43:101, 103:105, 153, 155:189, 292, 344:369)
3   3   84                     c(190:249, 603, 606:607, 609:629)
4   4   12                                   c(250:251, 253:262)
5   6   51                     c(263, 265:291, 293:313, 315:316)
6   8   28                                       c(314, 317:343)
7   9   60            c(370:385, 561:587, 589:602, 604:605, 608)
8  10   39                                               386:424

and if the numbers in blue match with the numbers in red of the 2nd column of a pdb file 2

ATOM      1  N   PRO   889      24.289  17.277 -19.912  1.00  0.00           N  
ATOM      2  CA  PRO   889      25.072  18.509 -19.702  1.00  0.00           C  
ATOM      3  C   PRO   889      24.200  19.747 -19.486  1.00  0.00           C  
ATOM      4  O   PRO   889      24.602  20.661 -18.749  1.00  0.00           O  
ATOM      5  N   THR   890      23.002  19.770 -20.124  1.00  0.00           N  
ATOM      6  CA  THR   890      22.044  20.878 -20.060  1.00  0.00           C  
ATOM      7  C   THR   890      21.613  21.209 -18.629  1.00  0.00           C  
ATOM      8  O   THR   890      21.429  20.303 -17.812  1.00  0.00           O  
ATOM      9  N   VAL   891      21.484  22.513 -18.332  1.00  0.00           N

will be replaced from the 5th column (in green) for example the output has to be like this:

1   1   88                        c(889:898 , 899, 914)

...
Thank you for your time :)

vgersh99 · July 5, 2018, 10:52am

It's quite difficult understand/see how you arrived at your desired output given a sample input. Where did 914 come from, for example?
Could you give it another try explaining? Maybe with a more representative data files...
Also your choice of colors might not be optimal - I can hardly see this green

RudiC · July 5, 2018, 11:35am

On top of what vgersh99 already said, I can't find a number in blue in file1 that matches any number in red in file2 (except for the 1 in the first line).

jim_mcnamara · July 5, 2018, 10:26pm

Changed color to darker green. As asked I cannot see how to arrive at a solution, either.

Scrutinizer · July 6, 2018, 12:43am

Indeed it seems like the values required to arrive at the example output are missing from the pdb file.

Perhaps something like this is intended:

awk '
  NR==FNR {                    # Read the pdb file
    A[$2]=$5                   # Store the pdb values in Array Aa as a lookup table
    next
  }

  {                            # read the txt file, use the letter c as the field separator
    split($2,SEP,/[0-9]*/)     # put all field separators of field $2 in array SEP
    n=split($2,VAL,/[^0-9]*/)  # put all values of $2 in array VAL
    for(i in VAL)              # for every value
      if(VAL in A)          # if it is in column 2 of the pdb file
        VAL=A[VAL]       # replace it with the corresponding value of column 5 in the pdb file
    $2=SEP[1]                  # replace $2 with the first separator
    for(i=2; i<n; i++)         # enumerate over fields and separators
      $2=$2 VAL SEP      # reassemble $2 with the fields and separators
    print
  }
' file.pdb FS=c OFS=c file.txt

nastaziales · July 6, 2018, 4:41am

I'm sorry, I will try to explain it better,

file n.1


1   1   88                        c(1:5, 7, 9) 

2   2  128 c(39:41, 43:101, 103:105, 153, 155:189, 292, 344:369) 

3   3   84                     c(190:249, 603, 606:607, 609:629) 

4   4   12                                   c(250:251, 253:262)

 5   6   51                     c(263, 265:291, 293:313, 315:316)
 6   8   28                                       c(314, 317:343)

where 1:5, 7, 9 is the atomic number,

file n.2

 ATOM      1  N   PRO   889      24.289  17.277 -19.912  1.00  0.00           N  

 ATOM      2  CA  PRO   889      25.072  18.509 -19.702  1.00  0.00           C   

ATOM      3   C   PRO   889      24.200  19.747 -19.486  1.00  0.00           C  

ATOM      4   O   PRO   889      24.602  20.661 -18.749  1.00  0.00           O  

 ATOM      5   N   THR   890      23.002  19.770 -20.124  1.00  0.00           N   

ATOM      6  CA  THR   890      22.044  20.878 -20.060  1.00  0.00           C  

 ATOM      7  C   THR   890      21.613  21.209 -18.629  1.00  0.00           C   

ATOM       8  O   THR   890      21.429  20.303 -17.812  1.00  0.00           O   

ATOM      9   N   VAL   891      21.484  22.513 -18.332  1.00  0.00           N

the second column is also the atomic number (blue), the fifth column is the residue number(red),
so i want the atomic number in file 1 to be replaced with the residue number taken from file 2. So as an output

before
1 1 88 c(1:5, 7, 9)

After
1 1 88 c(889:890, 890, 891)

RudiC · July 6, 2018, 8:03am

Wouldn't it make sense to group the result numbers in ranges and single numbers, where applicable? Like 889:890, 890, 891 would be one range 889:891 only, and if there was a gap, it should read like 889:891,893:894 ? If so, try (stealing from scrutinizer's approach):

awk '
NR==FNR         {A[$2] = $5
                 next
                }

                {CNT = 0
                 TMP = ""
                 gsub (/[)(     ]/, _, $2)
                 n = split ($2, VAL, ",")
                 for (i=1; i<=n; i++)   {if (1 == split (VAL, LMT, ":")) LMT[2] = LMT[1]
                                         for (j=LMT[1]; j<=LMT[2]; j++) RES[++CNT] = A[j]
                                        }
                 $2 = "c(" RES[1]
                 for (i=2; i<=CNT; i++) {DLT = RES - RES[i-1]
                                         if (DLT > 1)   {$2  = $2 TMP "," RES
                                                         TMP = ""
                                                        }
                                           else if (DLT == 1) TMP = ":" RES
                                        }
                 $2 = $2 TMP ")"
                 print
                }
' file2 FS=c file1

Scrutinizer · July 6, 2018, 8:49am

@OP: Have you tried my suggestion in post #5
With the second set of input files it produces:

1   1   88                        c(889:890, 890, 891) 
2   2  128 c(39:41, 43:101, 103:105, 153, 155:189, 292, 344:369) 
3   3   84                     c(190:249, 603, 606:607, 609:629) 
4   4   12                                   c(250:251, 253:262)
5   6   51                     c(263, 265:291, 293:313, 315:316)
6   8   28                                       c(314, 317:343)

nastaziales · July 6, 2018, 11:04am

Thank you very much Rudic and Scrutinizer they work both perfectly !