Problem getting sed to work with variables

Hello,

I am processing text files looking for a string and replacing the first occurrence of the string with something else.

For the text,

id	Name
1	methyl-(2-methylpropoxy)-oxoammonium
2	N-amino-N-(methylamino)-2-nitrosoethanamine
3	3-methoxy-3-methyloxazolidin-3-ium
4	1,3-dihydroxypropan-2-yl-methyl-methyleneammonium
5	(1R)-1,2,3,3-tetraamino-2-propen-1-ol
6	2-(ethoxyamino)guanidine
7	O-[(2S)-2-aminoazopropyl]hydroxylamine
8	N-$l^{1}-oxidanyl-N-[(2-methylpropan-2-yl)oxy]methanamine
9	(1R)-1,2,3,3-tetraamino-2-propen-1-ol
10	1-amino-1-ethoxyguanidine

I am replacing the first instance of (1R)-1,2,3,3-tetraamino-2-propen-1-ol with 0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol

If I do the following in sed,

sed '0,/(1R)-1,2,3,3-tetraamino-2-propen-1-ol/s//0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol/' input > output.txt

I get the necessary results.

If I add variables to the command line,

current_name="(1R)-1,2,3,3-tetraamino-2-propen-1-ol";
new_name="0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol";
sed -e "0,/$current_name/s//$new_name/" input > output.txt

I still get the necessary results. When, however, I assign current_name and new_name from a bash array and other bash variables,

current_name="${FIELD[1]}"
new_name='dup_'$name_count'_'$current_name

I do not get the modified output and the file is unchanged. Apparently sed is not able to match the pattern in the file. There are any number of non-standard characters in the data, so I don't know if that is an issue or not. The difference I can see is that when I assign new_name="0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol", I am able to quote the string, but when I assign current_name="${FIELD[1]}" I am not able to quote/escape special characters like ( in the string.

It seems like I am just missing some combination of single and double quotes to do the job, but I haven't been able to get past this.

Suggestions would be appreciated.

LMHmedchem

What are the contents of ${FIELD[1]}? How did you define it?


The script begins by reading a file and looking for duplicate values in a specific column. These are retrieved by,

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines in to array
dup_list=( $(cat "$base_file" | sort -k2 | uniq  -f1 -D) )

so the file is sorted on column 2 and uniq ignores the first column.

For the above data the output would be,

5	(1R)-1,2,3,3-tetraamino-2-propen-1-ol
9	(1R)-1,2,3,3-tetraamino-2-propen-1-ol

Then I iterate over the array to parse the lines and capture individual names,

for dup_name in "${dup_list[@]}"
do
   # parse on tab
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"
   # assign second column to current name
   current_name="${FIELD[1]}"
done

When I echo $current_name I get the correct value but it doesn't work with the sed command I posted.

LMHmedchem

Your code(s) are working for me:

current_name="${FIELD[1]}"
sed -e "0,/$current_name/s//$new_name/" $base_file 
id    Name
1    methyl-(2-methylpropoxy)-oxoammonium
2    N-amino-N-(methylamino)-2-nitrosoethanamine
3    3-methoxy-3-methyloxazolidin-3-ium
4    1,3-dihydroxypropan-2-yl-methyl-methyleneammonium
5    dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
6    2-(ethoxyamino)guanidine
7    O-[(2S)-2-aminoazopropyl]hydroxylamine
8    N-$l^{1}-oxidanyl-N-[(2-methylpropan-2-yl)oxy]methanamine
9    (1R)-1,2,3,3-tetraamino-2-propen-1-ol
10    1-amino-1-ethoxyguanidine

One problem might be that your input data have DOS line terminators (<CR> = 0x0D = ^M = \r); did you try without?
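A quick way to check for characters that echo would not show (a minimal sketch; the $'\r' quoting assumes bash):

printf '%s' "$current_name" | od -c | head    # dump the variable byte by byte
grep -c $'\r$' input                          # count lines ending in a carriage return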

BTW, your approach seems somewhat complicated. Does it do anything else or is its sole purpose to add a counter to the first instance of duplicates?


Attention: sed uses RE (BRE, basic regular expressions), so any RE-special character in the variable, or the / separator, will cause a malfunction.
Is the goal to index all the duplicates?
Then consider this robust awk solution

awk '
  BEGIN { FS=OFS="\t" }
  NR==FNR { if (dup[$2]++==1) dup[$2]++; next }
  dup[$2]>1 { $2=("dup_" --dup[$2]-1 "_" $2) }
  { print }
' input input

With a trick, the dup array both discovers the duplicates AND counts the index (backwards, though): for a name occurring three times, the first pass leaves dup at 4 (occurrences + 1), and the second pass then assigns the indices 2, 1, 0 in file order.


Similar approach:

 awk '{LINE[NR] = $0; CNT[$2]++} END {for (i=1; i<=NR; i++) {$0 = LINE[i]; if (CNT[$2]-- > 1) $2 = "0_" $2; print}}' OFS="\t" file

These are unix files, so there shouldn't be an issue with EOL. I am a bit mystified as to why it works from the command line but not from my script.

There are a number of things that need to be done. I need to identify and rename duplicates in several files. Every name needs to be unique, so I am finding the dups and adding an indexed prefix to each instance. There can be more than one duplicated string.

I suspect that something like this (RE-special characters in the names) may be the issue, but then I am not sure why RudiC is able to run it.

I think your code would work well if I only had one file to change. I need to change the name and then look up the name in several other files and propagate the change so that all files have the revised name.

This is the current script,

#!/bin/bash

# base file to check for duplicate names
base_file=$1

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines in to array
dup_list=( $(cat "$base_file" | sort -k2 | uniq  -f1 -D) )

# count for indexed name
name_count=0
# to identify when we have a new duplicate
current_dup=''

# loop on duplicate names
for dup_name in "${dup_list[@]}"
do

   # use second field for name
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"

   # set name value
   current_name="${FIELD[1]}"

   # if no current dup has been set
   if [ "$current_dup" == "" ]; then
      # set base to check for new duplicate
      current_dup=$current_name
      #create new dup name
      # name count is already 0 so no need to increment
      new_name='dup_'$name_count'_'$current_name
   # if the current name matches the current dup, increment counter
   elif [ "$current_dup" == "$current_name" ]; then
      # increment counter
      name_count=$((name_count+1))
      # create name based on incremented counter
      new_name='dup_'$name_count'_'$current_name
   # if there is a new dup series
   elif [ "$current_dup" != "$current_name" ]; then
      # set base to new duplicate
      current_dup=$current_name
      # reset name counter
      name_count=0
      #create new dup name
      new_name='dup_'$name_count'_'$current_name
   fi

   # test print
   echo $new_name
 
   # find first instance of dup name in base file and replace
   sed "0,/$current_name/s//$new_name/" $base_file > 'revised_'$base_file

   # make changes in other files

done

When I run this on the attached file test_base.txt, I get the printed output I expect,

dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol
dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol
dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol

All of the duplicates are identified and renamed with an indexed prefix. This works very fast and I have each duplicate name in scope in the do loop where I can work on other files.

At this point I am not able to make changes in other files, which is annoying. I am sure that the logic above is overly complex.

Could the problem also be how I am reading the data into the array?
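One alternative I have seen (a sketch, assuming bash 4+ for mapfile) avoids both word splitting and pathname expansion; the names do contain [ ], which an unquoted $( ) expansion can treat as a glob pattern:

mapfile -t dup_list < <(sort -k2 "$base_file" | uniq -f1 -D)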

LMHmedchem

Give this a try with a few different input files

awk '
function PRT()  {TMP = $0
                 for (i=1; i<=MXLN; i++)        {$0 = LINE[i]
                                                 if (MAX[$2] > 1) $2 = "dup_" 0+CNT[$2]++ "_" $2
                                                 print  >  FN
                                                }
                 $0 = TMP
                }


FNR == 1        {if (NR>1) PRT()
                 FN = FILENAME
                 sub (/^(.*\/)*/, "revised_", FN)
                 delete LINE
                 delete MAX
                 delete CNT
                }
                {LINE[FNR] = $0
                 MAX[$2]++
                 MXLN = FNR
                }

END             {PRT()
                }

' OFS="\t" /tmp/test_base.txt 

If your awk version doesn't provide the delete array command, replace it by split ("", array) .
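That is, something like:

split ("", LINE)    # portable way to empty an array
split ("", MAX)
split ("", CNT)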


Your #1 problem is that your output file is overwritten by each loop cycle, so it will only contain the last sed output.
Your #2 problem is that [ ] are RE-special characters.
(While ( ) are special only in ERE, not in BRE.)
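If you want to stay with sed, you could escape the RE-special characters and the / delimiter first (a sketch, not tested against all your data; the esc_ variable names are mine):

# escape BRE-special characters and the / delimiter in the search string
esc_name=$(printf '%s\n' "$current_name" | sed 's/[][\.*^$/]/\\&/g')
# escape \, & and / in the replacement string
esc_new=$(printf '%s\n' "$new_name" | sed 's/[\&/]/\\&/g')
sed "0,/$esc_name/s//$esc_new/" input > output.txt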
--
I think my solution does what you intended.

#!/bin/bash
# in each given base file check for duplicate names
for base_file
do
 awk '                                  
  BEGIN { FS=OFS="\t" }
# process the base_file and count the dups, +1 if a dup was met
  NR==FNR { if (dup[$2]++==1) dup[$2]++; next }
# process the base_file again, if a dup then add a dup_#_ prefix
  dup[$2]>1 { $2=("dup_" --dup[$2]-1 "_" $2) }
  { print }
 ' "$base_file" "$base_file" > revised_"$base_file"
done

The posted solutions seem to work, but don't solve the issue of needing to make changes in other files. I have my script working by changing the input and output names so I'm not overwriting the changes made earlier in the loop.

This version makes the changes in the first file, base_file, and then in a second file, sdf_file. The second file is more complex because it is a multi-line record file where the string that needs to be changed occurs in two places. I have added an awk call that handles this part.
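Roughly, a record in the sdf file looks like this (abbreviated sketch, the unrelated record lines are elided), with the ID_-prefixed name standing alone on two lines and $$$$ terminating the record:

ID_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
...other record lines...
ID_(1R)-1,2,3,3-tetraamino-2-propen-1-ol
...more record lines...
$$$$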

#!/bin/bash

# base file to check for duplicate names
base_file=$1
# additional file where names also need to be changed
sdf_file=$2

# copy base_file to make changes in
cp -fp $base_file  temp_file_1

# copy sdf_file to make changes in
cp -fp $sdf_file  temp_file_sdf_1

# set input field separator to newline so each line is stored in an array element
IFS=$'\n'
# use sort and uniq to output duplicate lines in to array
dup_list=( $(cat "$base_file" | sort -k2 | uniq  -f1 -D) )

# index for modified duplicate names
name_count=0
# name of current duplicate to check for new name series
current_dup=''

# loop on list of duplicate names
for dup_name in "${dup_list[@]}"
do

   # use second field for name
   unset FIELD; IFS=$'\t' read -a FIELD <<< "$dup_name"
   # set current name value
   current_name="${FIELD[1]}"

   # if no current dup has been set
   if [ "$current_dup" == "" ]; then
      # set value to check against  for new duplicate
      current_dup=$current_name
      #create indexed dup name
      new_name='dup_'$name_count'_'$current_name
   # if the current name matches the current dup, increment counter
   elif [ "$current_dup" == "$current_name" ]; then
      # increment counter
      name_count=$((name_count+1))
      # create name based on incremented counter
      new_name='dup_'$name_count'_'$current_name
   # if there is a new dup series
   elif [ "$current_dup" != "$current_name" ]; then
      # set value to new duplicate name
      current_dup=$current_name
      # reset name index prefix value
      name_count=0
      #create new dup name
      new_name='dup_'$name_count'_'$current_name
   fi

   # replace first instance of duplicate name in base_file copy with first indexed name
   sed "0,/\t$current_name$/s//\t$new_name/" 'temp_file_1' > 'temp_file_2'

   # rename temp output so that it is the input for the next loop
   # prevents changes from being overwritten
   mv 'temp_file_2'  'temp_file_1'

   # revise current_name to find in sdf
   current_name='ID_'$current_name
   # revise new_name to write to sdf copy
   new_name='ID_'$new_name
   # set check value
   check=1
   # set found value
   found=0

   # make corresponding change to sdf file
   # this should find $find_name on both lines where it exists
   # and then replace it with $new_name when output is written
   # once the first instance is found, checking stops and the rest of the file is output unchanged
   cat temp_file_sdf_1 | \
   awk -v find="$current_name" \
       -v replace="$new_name" \
       -v found=$found \
       -v check=$check ' check == 1 { OUT[++CNT] = $0;
                                      if ( $0 == find ){
                                         OUT[CNT] = replace;
                                         found = 1;
                                      }
                                      else if ( $0 == "$$$$" && found == "1") {
                                          for(i=1; i<=CNT; i++) print OUT[i];
                                         delete OUT;
                                         CNT = 0;
                                         check = 0;
                                      }
                                      else if ( $0 == "$$$$" && found == "0") {
                                          for(i=1; i<=CNT; i++) print OUT[i];
                                         delete OUT;
                                         CNT = 0;
                                       }
                                    }
                         check == 0 { print $0 }' > temp_file_sdf_2

   # rename temp output so that it is the input for the next loop
   # prevents changes from being overwritten
   mv 'temp_file_sdf_2'  'temp_file_sdf_1'

done

# change name from temp name to output file name
mv 'temp_file_1'  'revised_'$base_file

# change name from temp name to output file name
mv 'temp_file_sdf_1'  'revised_'$sdf_file

The awk code stores each record in an array until the end of record is reached ($$$$). Along the way, if a line is found that matches the name that needs to be changed, the array element for that line is overwritten with the revised name. When the end of record is reached, the record is written to the new file. This is also set up so that the records are only read and checked until the replacement is found and implemented. After that, the indicator "check" is set to 0 and all remaining rows are printed unchanged and unchecked.

The test files can be run with,
./rename_duplicates.sh test_base.txt test_sdf.txt

This works on the test files I have tried so far and is reasonably fast. I am still concerned about the comment that [ ] are RE-special characters. The test files attached to the script do contain these characters, and they are involved in the substitution, so I'm not sure why it is working.

I am certainly not married to this code, but I do need a solution that will work on multiple files. Some of the files are larger (50MB-100MB), so the above solution may be slow in some cases.

LMHmedchem

This solution, as one awk program, appears to work OK with your test data.

Note: I used index() and substr() instead of gsub() to avoid possible RE character issues that MadeInGermany identified.
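A quick illustration (sketch): index() does a plain string search, so the brackets are literal, whereas in a regex they would have to be escaped:

awk 'BEGIN {
    s = "2-[2-hydroxyethyl(methyl)amino]ethanol"
    print index(s, "[2-hydroxyethyl(methyl)amino]")   # 3 - plain string position
    print (s ~ "\\[2-hydroxy")                        # 1 - regex only matches with the [ escaped
}'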

# usage ./rename_duplicates.sh  test_base.txt  test_sdf.txt

# base file to check for duplicate names
base_file=$1
# sdf with duplicate structures
sdf_file=$2

awk '
  BEGIN { FS=OFS="\t" }
  FNR==1{ if(file++ > 1) {printf "" > "revised_"FILENAME } }
# process the base_file and count the dups, +1 if a dup was met
  file==1 { if (dup[$2]++==1) dup[$2]++; next }

# process the base_file again, if a dup then add a dup_#_ prefix
  file==2 && dup[$2]>1 { repcnt[$2]++; $2=("dup_" repcnt[$2]-1 "_" $2)}
  file==2 { print >> "revised_"FILENAME }

# replace all duplicate keys in repcnt[] with the dup_ddd string
  file==3 {
      for(check in repcnt) {
          pos=index($0, check)
          if (pos) {
             fdup[check]++
             old=$0
             $0=""
             while(pos) {
                 $0=$0 substr(old,1,pos-1) "dup_" fdup[check] - 1 "_" check
                 old=substr(old, pos + length(check))
                 pos=index(old, check)
             }
             # Some efficiency - when all dups are replaced, don't check for this name again
             if (fdup[check] == repcnt[check]) delete repcnt
             $0=$0 old
          }
      }
      print $0 "\n$$$$" >> "revised_"FILENAME
  }
 ' "$base_file" "$base_file" FS="" RS="\n[$]{4}\n" "$sdf_file"
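Note on the invocation: the base file is read twice (the first pass counts the duplicates, the second writes the renamed copy); for the third file, RS is changed to a regular expression matching the $$$$ separators (a gawk feature), so each sdf record arrives as a single $0 for the index()/substr() rewriting.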

Your script works in part. In the test file there are two sets of duplicate strings with three instances each,

(1R)-1,2,3,3-tetraamino-2-propen-1-ol
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
(1R)-1,2,3,3-tetraamino-2-propen-1-ol
2-[2-hydroxyethyl(methyl)amino]ethanol
2-[2-hydroxyethyl(methyl)amino]ethanol
2-[2-hydroxyethyl(methyl)amino]ethanol

For the copy of the base file, these are replaced with the intended indexed unique names,

dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 10 of revised_test_base.txt
dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 18 of revised_test_base.txt
dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # on line 36 of revised_test_base.txt
dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 20 of revised_test_base.txt
dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 46 of revised_test_base.txt
dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol # on line 79 of revised_test_base.txt

In the sdf file, the indexed replacement is only partial.

ID_dup_0_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 366 and 390 of revised_test_sdf.txt
ID_dup_1_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 728 and 752
ID_dup_2_(1R)-1,2,3,3-tetraamino-2-propen-1-ol  # substituted on lines 1540 and 1564
ID_dup_0_2-[2-hydroxyethyl(methyl)amino]ethanol # substituted on lines 818 and 842

however,

ID_dup_1_2-[2-hydroxyethyl(methyl)amino]ethanol  
ID_dup_2_2-[2-hydroxyethyl(methyl)amino]ethanol

do not appear anywhere in the file, and duplicate values for ID_2-[2-hydroxyethyl(methyl)amino]ethanol still appear in the two remaining duplicate records at lines 1993,2017 and 3491,3515.

I don't see where this is failing, but then I also don't fully understand what you did. It is about 100 times faster than my script, which will make a difference with the bigger files.

LMHmedchem

Small mistake. Replace the line:

 if (fdup[check] == repcnt[check]) delete repcnt

with

 if (fdup[check] == repcnt[check]) delete repcnt[check]

Without the subscript, delete repcnt empties the whole array, so once the first duplicate name is fully replaced, the remaining names are no longer searched for. With [check], only the finished name is dropped from the search list.
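For reference, a sketch of the two delete forms:

awk 'BEGIN {
    a["x"] = 1; a["y"] = 2
    delete a["x"]           # removes one element
    for (k in a) print k    # prints: y
    delete a                # empties the whole array (common extension)
    for (k in a) print k    # prints nothing
}'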