Script to process a list of items and uncomment lines with that item in a second file

LMHmedchem · January 8, 2020, 9:40pm

Hello,

I have a src code file where I need to uncomment many lines.

The lines I need to uncomment look like,

C      CALL l_r(DESNAME,DESOUT, 'Gmax',     ESH(10),  NO_APP, JJ)

The comment is the "C" in the first column. This needs to be deleted so that there are 6 spaces preceding "CALL". The key on this line is 'Gmax'. This lets me know that the line needs to be uncommented.

I have a list of such keys

Gmax
Gmin
HS10
HS2
HS9
Hmax

Each key (including the single quotes) will occur only once in the src file being processed. I need to process the list file to look in the src file and uncomment the proper lines. There are about 300 keys in the list file.

This is what I tried,

#! /bin/bash

# file with list of items to find in modify_file
list_file=$1
# file for which a modified copy will be created
modify_file=$2
# final output file
new_file='new_'$modify_file

# copy the file being modified
cp -f $modify_file  work_on_file.txt

# loop through all the lines in the list_file
while IFS= read -r line
do

   # add single quotes
   look_for=\'$line\'

   echo "looking for line with " $look_for

   # look for the value in the current line
   # if the value is found on a line, print the substring of $0 skipping the first character
   # if the value is not on the line, just print the line
   awk -v find_to_modify=$look_for ' { if ($0 ~ find_to_modify)
                                          { print substr($0,2); }
                                       else
                                          { print $0; }
                                     } ' work_on_file.txt > new_file

   # rename the modified file from this loop for the next loop
   mv new_file  work_on_file.txt

done < "$list_file"

# rename final copy
mv work_on_file.txt  $new_file

This simply reads the list file and looks for the items, one at a time, in the file to modify. The awk code looks for the presence of the key on each line (including the single quotes) and if found prints the substring skipping the first character. When the key is not found on the line, the line is printed unmodified. After the key is processed, the awk-modified file is renamed to be the file awk works on in the next loop.

This works as far as I can tell. I am writing an entire copy of the modified file for each key in the list file, so this is not very efficient. The file renaming at the end of the loop is a bit kludgy as well. This only takes about 7 seconds to run, so maybe I am being picky and should just let it be but I thought I would ask if there were other suggestions.

LMHmedchem

RudiC · January 9, 2020, 3:54am

Try

awk -v"SQ='" '
FNR == NR       {PAT[NR]="^C.*" SQ $0 SQ
                 MX = NR
                 next
                }
                {for (i=1; i<=MX; i++) if ($0 ~ PAT[i]) {sub (/^./,_)
                                                         break
                                                        }
                }
1
' list_file modify_file

MadeInGermany · January 9, 2020, 4:51am

Or do all in bash - here is a well documented bash version:

#! /bin/bash
# bash v3 or higher

# file with list of items to find in modify_file
list_file=$1
# file for which a modified copy will be created
modify_file=$2
# final output file
new_file='new_'$modify_file

echo "reading values from '$list_file'"
# clear (and implicitly declare) an array
list=()
while read line
do
  echo "adding '$line' to list"
  list+=("$line")
done < "$list_file"

echo "reading '$modify_file'"
# loop through all the lines in the modify_file
while IFS= read -r line <&3
do
   # look for the value in the current line
   # if the value is found on a line, print the substring of $0 skipping the first character
   # otherwise just print the line

   echo "processing line $((++linecnt))"
    # preset the new line
   outline=$line
   # loop through the array values
   for i in "${list[@]}"
   do
      look_for="'$i'"
      # =~ is ERE match, == is glob match
      # $look_for in quotes --> do not interpret special characters in it
      #if [[ $line =~ "$look_for" ]]
      if [[ $line == *"$look_for"* ]]
      then
         outline=${line:1}
         # further matches will do the same, so we can break the loop here
         break
      fi
   done
   # print the new line
   printf "%s\n" "$outline" >&4
done 3< "$modify_file" 4> "$new_file"

rbatte1 · January 9, 2020, 9:05am

How about using a single line sed like this:-

sed "s/^C\(      CALL.*'\(Gmax\|Gmin\|HS10\|HS2\|HS9\|Hmax\)'.*\)/\1/" source_file_name > target_file_name

It's a little messy to read, so:-

The s command calls substitution
There is the start of line marker with ^ and then the literal character C that we want to remove if we match the condition between the first / pair
The escaped brackets \( and \) wrap a section of the line matched so we can use it later. There are two such groupings in this regular expression, one nested inside the other.
There are the six spaces and the literal word you want to be sure you are matching so we then get CALL (six leading spaces)
We then don't care much about what the next part of the line looks like, so we use a single wildcard character . and the following * repeats for zero or more, so any number of characters
We then have the literal text 'Gmax' to look for. The ' characters are literal because the expression is wrapped in " . The alternate strings need to be grouped and alternated. The group is wrapped (again) in escaped brackets, so \( and \) with the strings listed inside. In sed's default (basic) syntax a plain | is just a literal character, so the alternation separator has to be escaped and is written \| (this escaped form is a GNU sed extension).
We then have the same .* as above to match the rest of the line and end the outer group with \)
After the separating / that shows the end of the expression we have the start of what to substitute it with. We substitute the lines matched with the first group we matched, i.e. the bit in the outer \( and \) above. Here we use \1 to represent the first group, which is everything excluding the leading C as required. For completeness, you also have the following available to you:-
\0 - the entire original record matched (accepted by GNU sed; the portable spelling is &)
\1 - the first group matched (in this case the entire line excluding the leading C)
\2 - the second group matched, in this case one of Gmax, Gmin etc., if that's useful in any way.
Unmatched strings (not a leading C or not containing Gmax or whatever) are just printed as they are.
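To make the walkthrough concrete, here is a minimal sketch that runs the same substitution on a small sample. The three input lines and the temporary file are made up for illustration, the key list is shortened to two entries, and GNU sed is assumed because of the \| alternation.

```shell
#!/bin/bash
# Hypothetical sample input: one line that should be uncommented,
# one CALL line whose key is not in the list, and one ordinary comment.
src=$(mktemp)
cat > "$src" <<'EOF'
C      CALL l_r(DESNAME,DESOUT, 'Gmax',     ESH(10),  NO_APP, JJ)
C      CALL l_r(DESNAME,DESOUT, 'Other',    ESH(11),  NO_APP, JJ)
C     plain comment mentioning 'Gmax'
EOF

# Same substitution as above with a shortened key list (GNU sed assumed).
result=$(sed "s/^C\(      CALL.*'\(Gmax\|Gmin\)'.*\)/\1/" "$src")
printf '%s\n' "$result"
rm -f "$src"
```

Only the first line should lose its leading C. The second has a key that is not listed, and the third mentions 'Gmax' but does not have the six spaces followed by CALL.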

Does this meet your need? Does the explanation make sense?

You could be brave and use the -i flag and no target file to just update the source file, but I'd recommend testing it first to make sure you are happy.
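A cautious way to try the in-place edit is to give -i a backup suffix, so the original survives next to the edited file. The sketch below is only an illustration: the directory, the file name src.f and the single-key pattern are made up, not taken from the real source file.

```shell
#!/bin/bash
# Hypothetical one-line source file in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' "C      CALL l_r(DESNAME,DESOUT, 'Gmax',     ESH(10),  NO_APP, JJ)" > "$dir/src.f"

# -i.bak edits src.f in place and keeps the untouched original as src.f.bak.
sed -i.bak "s/^C\(      CALL.*'Gmax'.*\)/\1/" "$dir/src.f"

edited=$(cat "$dir/src.f")
backup=$(cat "$dir/src.f.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"
rm -rf "$dir"
```

If the edited file looks wrong, the .bak copy can simply be moved back.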

If the list of alternates is getting overly complex, you could put them in a reference file, one per line, and build the list for your command, something like:-

#!/bin/bash

item_list=""
while read item
do
   if [ "${item_list}" = "" ]
   then
      item_list="${item}"
   else
      item_list="${item_list}\|${item}"
   fi
done < reference_file_name

echo "${item_list}"              # Just so you can see

sed "s/^C\(      CALL.*'\(${item_list}\)'.*\)/\1/" source_file_name > target_file_name

Perhaps run this with bash -xv your_script_name to check what it's doing.

I hope that this helps,
Robin

RudiC · January 9, 2020, 9:14am

Nice approach indeed!
Could be curtailed to

sed -r "/'($(paste -sd\| file2))'/s/^C//" file1

, including the item list file as well.

rbatte1 · January 9, 2020, 10:49am

I defer to your superior offering (and I will steal it if I ever have a similar need ;))

Kindest regards,
Robin

LMHmedchem · January 14, 2020, 12:15pm

I went with this method inserted into a script. It worked well (and very quickly) the first time I tried it, but there was no output the second time. I will have to investigate what I did there.

I also made a second try before there were any responses here. This ended up looking more like the code posted by MadeInGermany where I read in the file to be modified and stored it in an array. I then did a double loop with the outside loop being my list file and the inside loop being the array with the file to be modified. Each item in the list was searched against the lines in the array. If a match was found, the array element was modified to remove the comment and then there was a break in the inner loop. The modified array was printed at the end. This approach means that each file is read in once and the output was written once, instead of once for each list item.

It seems to me that sed must be doing more or less the same thing under the hood. Every list item must be checked against every item in the file to be modified, at least until a match is found. I wasn't able to work out whether it was more efficient to have one or the other file be the inner loop. The only approach I could think of that would be faster would be to identify the 'Gmax' value on each line of the file to be modified and then look up that value in a map holding the list. That would, however, involve much more significant parsing of the lines to extract the 'Gmax' value. It is very nice to have a glob match, especially when there isn't a clear and consistent delimiter. If the list was the inner loop, you could delete each array element when a match was found and thus shorten the search as the process continues, but deleting and shifting around array elements also takes resources.
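For what it is worth, the map lookup idea can be sketched in awk without much parsing, under one assumption that the real file may not satisfy: the key is the first single-quoted string on the line. With the single quote as field separator the key is then always field 2. The sample files below are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample files standing in for the key list and the source file.
keys=$(mktemp); src=$(mktemp)
printf '%s\n' Gmax HS2 > "$keys"
cat > "$src" <<'EOF'
C      CALL l_r(DESNAME,DESOUT, 'Gmax',     ESH(10),  NO_APP, JJ)
C      CALL l_r(DESNAME,DESOUT, 'HS10',     ESH(12),  NO_APP, JJ)
EOF

# Assumption: the key is the first single-quoted string on the line,
# so with the quote as field separator it is always field 2.
result=$(awk -F"'" '
    FNR == NR { want[$0]; next }            # first file: store each key in a hash
    /^C/ && ($2 in want) { sub(/^C/, "") }  # one hash lookup per line, no inner loop
    { print }
' "$keys" "$src")
printf '%s\n' "$result"
rm -f "$keys" "$src"
```

Each source line costs one hash lookup here instead of a pass over the whole list, and the HS10 line stays commented because the lookup is an exact match on the key.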

Does anyone know what sed is doing to achieve the result so quickly? Is it mainly that it is using compiled code?

LMHmedchem

RudiC · January 14, 2020, 2:51pm

Difficult to believe with unmodified input files. Pls report back your findings.

A regex using "alternation" as brought into play in rbatte1's post #4 will be the most efficient approach, as it will scan each input line once with all alternations "in parallel". The "command substitution" to produce the alternation (pasting from file2) will be done once, and upfront.
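To see what "once, and upfront" means in practice, this small sketch prints what the shell hands to sed after the command substitution has run. The three-key list is a made-up stand-in for file2.

```shell
#!/bin/bash
# Hypothetical key list; in the thread this is file2.
keys=$(mktemp)
printf '%s\n' Gmax Gmin HS10 > "$keys"

# paste -s joins all lines into one; -d'|' uses | as the separator.
alternation=$(paste -sd'|' "$keys")
echo "$alternation"

# The shell expands this once, so sed receives a single fixed expression.
sed_script="/'($alternation)'/s/^C//"
echo "$sed_script"
rm -f "$keys"
```

sed never sees the list file at all, only the finished expression, which it compiles a single time before reading the first input line.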

Don't use file operations in the inner loop if at all avoidable. They're costly and have to be repeated for every single line read / operated upon in the outer loop.

Yes, if you

are sure no more occurrences of the element will come, and
have access to the algorithm. True for your own shell script (slooow by itself), false for binary commands like sed.

I guess it's optimized for (complex!) regex matching. And yes, compiled code.