Make change to variable value inside of awk script

LMHmedchem · February 6, 2018, 4:03pm

Hello,

I have text data that looks like this,


  Mrv16a3102061815532D          

  6  6  0  0  0  0            999 V2000
   -0.4018    1.9634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1163    1.5509    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1163    0.7259    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4018    0.3134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.3127    0.7259    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3127    1.5509    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  2  0  0  0  0
  3  4  1  0  0  0  0
  4  5  2  0  0  0  0
  5  6  1  0  0  0  0
  1  6  2  0  0  0  0
M  END
>  <id>
1
>  <name>
pyridine
>  <mw>
79.102
$$$$

I have the following awk code that looks for the > <name> tag, stores the value on the next line, and then writes it to the first line of the file.

# name field tag to look for
name_field='<name>'
# value to add to beginning of name string
pre='ID_'
awk -v find_name=$name_field -v pre=$pre ' { OUT[++CNT] = $0 }
                                      F==1 { NAME = pre$0; F = 0 }
                            $0 ~ find_name { F = 1 }
                              $0 == "$$$$" { print NAME; for(i=2; i<=CNT; i++) print OUT[i]; delete OUT; CNT = 0 }
                                         ' > $output_file_name

The results look like this,

ID_pyridine
  Mrv16a3102061815532D          

  6  6  0  0  0  0            999 V2000
   -0.4018    1.9634    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1163    1.5509    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1163    0.7259    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4018    0.3134    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.3127    0.7259    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3127    1.5509    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  2  0  0  0  0
  3  4  1  0  0  0  0
  4  5  2  0  0  0  0
  5  6  1  0  0  0  0
  1  6  2  0  0  0  0
M  END
>  <id>
1
>  <name>
pyridine
>  <mw>
79.102
$$$$

I have a problem with later code in cases where there are spaces in the value of <name> and I would like to substitute underscore for space in the value of NAME in the above awk code before it is written. Is there a way to do this?

Thanks,

LMHmedchem

rdrtx1 · February 6, 2018, 4:18pm

F==1 { gsub(" ", "_"); NAME = pre$0; F = 0 }

MadeInGermany · February 6, 2018, 4:32pm

Because the gsub() runs on $0 here (no 3rd argument --> $0), you can alter the other output as well, by placing the { OUT[++CNT] = $0 } after it.
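
For reference, gsub() is one of awk's built-in string functions, not an external tool. A minimal sketch of the difference the third argument makes (the sample name and the shell variable names in_place and copy_only are made up for illustration):

```shell
# No third argument: gsub() edits $0 itself, so a later "print" or
# OUT[++CNT] = $0 sees the underscores.
in_place=$(printf 'ethyl acetate\n' | awk '{ gsub(" ", "_"); print }')

# Third argument given: only the variable s is changed, $0 is left alone.
copy_only=$(printf 'ethyl acetate\n' | awk '{ s = $0; gsub(/ /, "_", s); print s "|" $0 }')

printf '%s\n%s\n' "$in_place" "$copy_only"
```

So whether the stored copy of the line ends up modified depends only on whether it is stored before or after the gsub() call.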

LMHmedchem · February 6, 2018, 7:31pm

Thanks, that worked well.

This is the revised code,

# name field tag to look for
name_field='<name>'
# value to add to beginning of name string
pre='ID_'
awk -v find_name=$name_field -v pre=$pre ' { OUT[++CNT] = $0 }
                                      F==1 { gsub(" ", "_"); NAME = pre$0; F = 0 }
                            $0 ~ find_name { F = 1 }
                              $0 == "$$$$" { print NAME; for(i=2; i<=CNT; i++) print OUT[i]; delete OUT; CNT = 0 }
                                         ' > $output_file_name

It's nice to know how to do that as I'm sure it won't be the last time it comes up. Is gsub() part of awk or a call to a different tool?

Thanks for the tip, in this case, the name line is the only one that I need to modify.

---------- Post updated at 07:31 PM ---------- Previous update was at 05:07 PM ----------

Perhaps I spoke too soon about not needing to make space replacements in other places in the code. What I need to do is use the space replaced version of NAME on the first line as the original code does, and also use it for the line following the <name> tag.

I was thinking something like this,

awk -v find_name=$name_field -v pre=$pre ' { if(F == 1) { gsub(" ", "_"); NAME = pre$0; F = 0; OUT[++CNT] = NAME }
                                             else { OUT[++CNT] = $0 }
                                           }
                            $0 ~ find_name { F = 1 }
                              $0 == "$$$$" { print NAME; for(i=2; i<=CNT; i++) print OUT[i]; delete OUT; CNT = 0 }
                                         ' > $output_file_name

I think this will work but perhaps a more generalized solution would be better to allow for substitution on any requested line but not the entire input.
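
One possible shape for such a generalized version, as a rough sketch only: pass a list of tags and replace spaces just on the line that follows any of them. The names tags, T and fix, the <source> tag, and the sample data are all hypothetical, and the first-line handling from the script above is left out to keep it short.

```shell
result=$(awk -v tags='<name> <source>' '
    BEGIN { n = split(tags, T, " ") }            # T[1..n] holds the requested tags
    fix   { gsub(/ /, "_"); fix = 0 }            # this is the value line of a listed tag
    /^>/  { for (k = 1; k <= n; k++)             # tag line: does it name a listed tag?
                if (index($0, T[k])) fix = 1 }
          { print }
' <<'EOF'
>  <name>
ethyl acetate
>  <mw>
88.106
>  <source>
vendor A
EOF
)
printf '%s\n' "$result"
```

Lines under tags that are not listed (here <mw>) pass through untouched.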

LMHmedchem

rdrtx1 · February 6, 2018, 7:43pm

In original post, swap lines:

F==1 { gsub(" ", "_"); NAME = pre$0; F = 0 }
     { OUT[++CNT] = $0 }
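
Put together, the swapped ordering might look like the sketch below. It runs on a shortened, made-up record (blank first line, a name containing a space), and the shell variable result is only there so the output can be inspected.

```shell
name_field='<name>'
pre='ID_'
result=$(awk -v find_name="$name_field" -v pre="$pre" '
    F == 1         { gsub(" ", "_"); NAME = pre $0; F = 0 }   # edit $0 before it is stored
                   { OUT[++CNT] = $0 }                        # stores the already-edited line
    $0 ~ find_name { F = 1 }
    $0 == "$$$$"   { print NAME; for (i = 2; i <= CNT; i++) print OUT[i]; delete OUT; CNT = 0 }
' <<'EOF'

  header line
M  END
>  <name>
ethyl acetate
$$$$
EOF
)
printf '%s\n' "$result"
```

Because the gsub() rule now runs first, both the first line and the line after the <name> tag carry the underscore version.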

LMHmedchem · February 7, 2018, 12:35am

This works, as does the suggestion I posted above. I'm not sure which is preferable except that the suggestion of rdrtx1 does not require the conditional.

I assume that in the above case gsub() is changing the value of $0?

LMHmedchem

Don_Cragun · February 7, 2018, 2:07am

Yes, the gsub() call modifies $0 if no third argument is specified. MadeInGermany already said this in post #3 in this thread. (And, you quoted it and thanked him for that tip in post #4.)

RudiC · February 7, 2018, 3:22pm

I'm afraid the code you posted doesn't give the desired output shown in post #1, as it suppresses the Mrv16a3102061815532D line. Plus, it is a bit overcomplicated.

Try also

awk -v find_name=$name_field -v pre=$pre '
                {OUT[++CNT] = $0
                }

$0 ~ find_name  {getline OUT[++CNT]
                 gsub (/ /, "_", OUT[CNT])
                 OUT[0] = pre OUT[CNT]
                }

$0 == "$$$$"    {for(i=0; i<=CNT; i++) print OUT[i]
                }
' file

LMHmedchem · February 8, 2018, 12:40pm

Was this addressed to me? I have run this version of the code on many files and it seems to work fine.

# name field tag to look for
name_field='<name>'
# value to add to beginning of name string
pre='ID_'
awk -v find_name=$name_field -v pre=$pre ' { OUT[++CNT] = $0 }
                                      F==1 { gsub(" ", "_"); NAME = pre$0; F = 0 }
                            $0 ~ find_name { F = 1 }
                              $0 == "$$$$" { print NAME; for(i=2; i<=CNT; i++) print OUT[i]; delete OUT; CNT = 0 }
                                         ' > $output_file_name

I like your method of changing the value stored in the output array instead of storing the modified value in a separate variable.

awk -v find_name=$name_field -v pre=$pre ' { OUT[++CNT] = $0 }
                           $0 ~ find_name  { getline OUT[++CNT]
                                             gsub (/ /, "_", OUT[CNT])
                                             OUT[0] = pre OUT[CNT] }
                           $0 == "$$$$"    { for(i=0; i<=CNT; i++) print OUT[i] }
                                         ' file

If I read your code correctly, each line is read and stored in an array. When a line containing find_name is found ($0 ~ find_name), the next line is read by getline using the incremented counter as an index. The character substitution is done and the modified line is assigned to the array by index value. When the end of record is reached ($0 == "$$$$"), the record is output.

It seems like the getline OUT[++CNT] instruction would cause the counter to be off by one. Is the changed value of ++CNT only in scope inside the {}?

In my script, it seems like the array runs from 1 to n and not from 0 to n. If I run for(i=1; i<=CNT; i++) print OUT[i] I get the entire record printed. That is why in my version I output the first line from a variable and then start the rest of the output at i=2.
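
A small check of both points, as a sketch (the variable names demo and has_zero are made up): awk variables are global, so braces do not limit the scope of CNT, and awk arrays are associative, so index 0 only exists if something is assigned to it.

```shell
demo=$(awk 'BEGIN {
    OUT[++CNT] = "first"               # pre-increment: CNT becomes 1, then indexes OUT
    if (1) { OUT[++CNT] = "second" }   # braces group statements, they do not create a scope
    has_zero = (0 in OUT) ? "yes" : "no"
    print CNT, has_zero, OUT[1], OUT[2]
}')
printf '%s\n' "$demo"
```

CNT is still 2 after the braces, and OUT has no element 0 because the counter was incremented before its first use.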

LMHmedchem

RudiC · February 8, 2018, 2:12pm

Yes, it was addressed to you.

Do you need the line Mrv16a3102061815532D or not? The output as posted has it, the script doesn't provide it. Or, is there an empty line at the beginning of the input that is not shown in the input sample?

Your analysis is correct.

No. The getline reads a new line, which needs to be stored exactly one count above the last line read, so the counter stays in step. Your desired header ("ID_" + name value) will be inserted BEFORE all the other lines, at the array's zero element. If the Mrv16a3102061815532D line is NOT wanted, assign to OUT[1] and run the loop starting from i=1.
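
A sketch of that OUT[1] variant, assuming the first line of each record is the one to be overwritten. The delete OUT and CNT = 0 reset is carried over from the earlier script so that a file with several records does not reprint old lines; the two short records are made up.

```shell
name_field='<name>'
pre='ID_'
result=$(awk -v find_name="$name_field" -v pre="$pre" '
                   { OUT[++CNT] = $0 }
    $0 ~ find_name { getline OUT[++CNT]            # read the value line straight into the array
                     gsub(/ /, "_", OUT[CNT])      # third argument: only this element changes
                     OUT[1] = pre OUT[CNT] }       # overwrite line 1 of the record
    $0 == "$$$$"   { for (i = 1; i <= CNT; i++) print OUT[i]; delete OUT; CNT = 0 }
' <<'EOF'

  header A
>  <name>
ethyl acetate
$$$$

  header B
>  <name>
acetic acid
$$$$
EOF
)
printf '%s\n' "$result"
```

Note that getline with a variable leaves $0 untouched, so the later $0 == "$$$$" test still sees the tag line and does not fire early.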