Finding a text in files & replacing it with unique strings

gordom · January 25, 2013, 7:43pm

Hello everyone.
I have to admit I'm illiterate when it comes to shell scripting. I need to find certain strings in several text files and replace each string with its own corresponding text.
I prepared a csv file with 3 columns: <filename>;<old_pattern>;<new_pattern>

dominik@dominik-VirtualBox:~/Pulpit/test/1$ cat file.csv
 file1.txt;abc;123
 file2.txt;XYZ;6789

A very kind guy (much more skilled than me) helped me with the script:

dominik@dominik-VirtualBox:~/Pulpit/test/1$ cat script.sh
 for i in `cat file.csv`; do
 file=`echo $i | cut -d ";" -f1`;
 pattern1=`echo $i | cut -d ";" -f2`;
 pattern2=`echo $i | cut -d ";" -f3`;
 sed -i "s/$pattern1/$pattern2/" $file;
 done

To find & replace text I run the script and grep:

dominik@dominik-VirtualBox:~/Pulpit/test/1$ bash script.sh
dominik@dominik-VirtualBox:~/Pulpit/test/1$ grep . file{1,2}*

With the above file.csv example the script works fine and does what's intended. The problem starts if the <old_pattern> and <new_pattern> text contain white space:

dominik@dominik-VirtualBox:~/Pulpit/test/1$ cat file.csv
 file1.txt;abc jkl nm;1 2 3
 file2.txt;XYZ rt;67 89

In that case the script returns errors. I tried to modify file.csv by putting the text between quotation marks but it didn't help. How should the script be adjusted to work with text containing white space? I would appreciate any help from you. Thank you very much in advance. Regards,
gordom

Don_Cragun · January 25, 2013, 9:09pm

How, exactly, do your current scripts fail? What diagnostic messages are being written? What output are you getting and what output do you want? Please also give us sample input files (and all of the desired outputs corresponding to those sample input files).

Does the 2nd field in file.csv contain fixed strings or regular expressions?
With your current scripts, it looks like you intentionally have leading spaces in file.csv that are not part of the file names and you have trailing spaces on some lines that you don't want to appear in the replacement strings. Is there some reason why you have leading and trailing spaces in file.csv?

Would you prefer to perform all of these actions in a single awk script instead of calling cut three times per file processed and sed once per file processed?

spacebar · January 25, 2013, 9:20pm

Try it like this:

while read rec
do
  file=`echo "$rec" | cut -d ";" -f1`
  pattern1=`echo "$rec" | cut -d ";" -f2`
  pattern2=`echo "$rec" | cut -d ";" -f3`
  sed -i "s/${pattern1}/${pattern2}/" "$file"
done <file.csv

bakunin · January 25, 2013, 9:47pm

First off, could you PLEASE stay away from the text formatting! Your post actually contained more formatting tags than text. If one tries to quote a part of it, as I did, it is hard work to sift through this endless stream of size, font and whatnot tags.

gordom:

 file=`echo $i | cut -d ";" -f1`;
 pattern1=`echo $i | cut -d ";" -f2`;
 pattern2=`echo $i | cut -d ";" -f3`;
The problem starts if <old_pattern> and <new_pattern> text have white spaces:

The problem is in the lines I quoted for you. The variables in these lines are unquoted, and therefore the shell splits their contents at spaces. The shell has a so-called "internal field separator" (IFS), which by default consists of space, tab and newline. This is how the shell understands that you give two arguments (and not one which contains a space char) to a command in the following line:

command arg1 arg2

If you don't want this behavior, you would have to quote:

command "arg1 arg2"

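The same is the case with your lines. If a value containing space chars reaches the shell unquoted, it sees something like the following:

pattern1=abc def geh;

In this case the shell would treat "pattern1=abc" as an assignment that is valid only for the command "def", which it would try to run with the argument "geh". That leads to an error message, in addition to "pattern1" not having the value you expect it to have. Strictly speaking, the result of a command substitution on the right-hand side of an assignment is not split any further; the splitting that hurts you happens to the unquoted `cat file.csv` in your "for" line and to the unquoted $i handed to echo, so a line like "file1.txt;abc jkl nm;1 2 3" arrives in the loop as five separate values of $i.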
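A small sketch to check where the shell actually splits a value. The sample line is taken from the file.csv in the question; the variable names (line, count, quoted, pattern1) are only chosen for this demonstration.

```shell
# One record from file.csv, held in a variable for the demonstration.
line='file1.txt;abc jkl nm;1 2 3'

# Unquoted expansion: the shell splits the value at every space,
# so the loop body runs once per piece.
count=0
for i in $line; do
    count=$((count + 1))
done
echo "unquoted: $count pieces"

# Quoted expansion: the value stays in one piece.
quoted=0
for i in "$line"; do
    quoted=$((quoted + 1))
done
echo "quoted: $quoted piece"

# The result of a command substitution in an assignment is not split.
pattern1=$(echo "$line" | cut -d ';' -f2)
echo "pattern1=[$pattern1]"
```

This is why a "for i in `cat file.csv`" loop sees fragments of a line as soon as a field contains a space, while a quoted value survives intact.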

First off, you really, really should not use backticks. Use them never, never ever, but use "$(....)" instead.
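One practical reason for preferring "$(....)" is nesting. A minimal sketch, using a made-up path (/a/b/c.txt is only a string here, the file does not need to exist):

```shell
# Nested command substitution with $(...): no escaping needed.
parent=$(basename "$(dirname /a/b/c.txt)")
echo "$parent"

# The same with backticks: the inner pair has to be escaped.
parent_bt=`basename \`dirname /a/b/c.txt\``
echo "$parent_bt"
```

Both lines print the name of the parent directory, but the backtick version gets harder to read with every additional level.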

Further, in this case you should use neither because the use of "echo | cut" is completely unnecessary. The shell can do this well on its own and it is even shorter and a lot faster:

while IFS=";" read file pattern replacement ; do
     sed -i "s/${pattern}/${replacement}/" "$file"
done
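To see what "IFS=";" read" does with a single record, here is a small sketch that feeds one line from the question's file.csv through a here-document. The -r option is an addition of this sketch; it stops read from interpreting backslashes.

```shell
# Split one record at the semicolons only; spaces inside a field are kept.
IFS=';' read -r file pattern replacement <<'EOF'
file1.txt;abc jkl nm;1 2 3
EOF

echo "file=[$file] pattern=[$pattern] replacement=[$replacement]"
```

Because IFS is set to ";" only for this read, spaces are no longer treated as separators. That also means a leading space in a line of file.csv would end up as part of $file.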

Another thing is: you really should NOT use "sed -i". The reasons are explained here in detail. Use a temporary file instead and delete it afterwards:

while IFS=";" read file pattern replacement ; do
     sed "s/${pattern}/${replacement}/" "$file" > tmpfile
     mv tmpfile "$file"
done

I hope this helps.

bakunin

RudiC · January 26, 2013, 4:47am

Little to add to bakunin's explanations, except for the data source to read from: redirect input of the entire loop to your file.csv: ... done < file.csv
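Putting the loop and the redirection together, the whole thing could look roughly like the sketch below. It runs in a scratch directory with made-up sample files, and the temporary file name ("$file.tmp") and the read option -r are assumptions of this sketch, not something prescribed in the thread.

```shell
# Work in a scratch directory so the demo touches no real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data: two text files and the csv with <filename>;<old>;<new>.
printf 'abc jkl nm here\n' > file1.txt
printf 'some XYZ rt text\n' > file2.txt
printf '%s\n' 'file1.txt;abc jkl nm;1 2 3' 'file2.txt;XYZ rt;67 89' > file.csv

# Read file.csv line by line, splitting each line at the semicolons only.
while IFS=';' read -r file pattern replacement; do
    # Write to a temporary file first, then replace the original.
    sed "s/${pattern}/${replacement}/" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
done < file.csv

# Show the results.
result1=$(cat file1.txt)
result2=$(cat file2.txt)
printf '%s\n%s\n' "$result1" "$result2"

# Remove the scratch directory again.
cd / && rm -rf "$workdir"
```

Note that sed still treats the second column as a regular expression, and a "/" or "&" in either column would need escaping; the sketch assumes plain text without such characters.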

gordom · January 28, 2013, 8:35am

Sorry for that - it was unintentional. I pasted the text from a word processor - my mistake. The original post has already been corrected.

That is exactly what was happening.

In the end, thanks everyone for the help, especially bakunin & RudiC - the script seems to work perfectly now. Great!
Regards,
gordom