awk Replace Multiple patterns within a list_file with One in target_file

mpvphd · December 19, 2017, 1:48pm

I'm facing a problem

1) I have a list_file intended to be used for in-place replacement, like this:

  Replacement pattern ; Matching patterns

    EXTRACT ___________________
    toto ; tutu | tata | tonton  | titi 
    bobo ; bibi | baba | bubu | bebe 
    etc. 14000 lines !!!
    _____________________________

2) I have a target file in which I want to replace those patterns:

EXTRACT INPUT _______________
    hello my name is bob and I am a Titi and I like bubu
    _____________________________

I want it to become

EXTRACT OUTPUT ______________
    hello my name is bob and I am a toto and I like bobo
    _____________________________

Currently I am trying to achieve this with awk, using this command:

   awk -F';' 'NR==FNR{A[$1]=$2; next} IGNORECASE = 1 {for(i in A) gsub(/A/,i)}1' simplifier_FR.txt text.txt

Sadly awk doesn't seem to understand the pipe character "|" as an OR indicator... I have also tried to achieve this with sed, but that runs very slowly even if it works.

Does anyone have a better idea?
Thanks
M

MadeInGermany · December 19, 2017, 2:26pm

awk DOES understand a | character in a RE because it actually takes ERE, just like GNU sed with the -r option.
But a standard sed does NOT.
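
As a minimal sketch of that point (the sample sentence is made up for illustration), the | in an awk regex constant acts as alternation:

```shell
# Replace any of three alternatives with "toto"; | means OR in awk's ERE.
out=$(echo 'I like tutu, tata and titi' | awk '{ gsub(/tutu|tata|titi/, "toto") } 1')
echo "$out"
```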
Your awk code has several bugs.
Is this homework/coursework?

mpvphd · December 19, 2017, 2:41pm

I am trying to send a regex with pipes to do a
'pattern OR pattern OR ...'
with 'pattern | pattern | ...'

for example with one replacement :

echo 'toto; tutu | tata | tonton | titi ' | awk '{gsub(/ tutu | tata | tonton | titi /," toto ")}1'
gives 
toto; toto | toto | toto | toto

with

awk -F';' 'NR==FNR{A[$1]=$2; next} IGNORECASE = 1 {for(i in A) gsub(/A/,i)}1'

I expect it to:
1) build an array A with $2 as content and $1 as key,
so for the first line
$2 = ' tutu | tata | tonton | titi '
$1 = ' toto '
2) replace with gsub(/$2/,$1)}1,
so for the first line
awk 'IGNORECASE = 1 {gsub(/ tutu | tata | tonton | titi /," toto ")}1'

Currently I am looking at the -f option.
Is that a good idea?
I am thinking about doing

BEGIN
{replacing command 1}
{replacing command 2}
etc.
END

What could I do?

MadeInGermany · December 19, 2017, 3:31pm

Yes, your idea with an ERE and pipe-OR works.
The main bug in your awk code: an ERE goes in / / (or in " ") only when it is a constant. When it is stored in a variable, you pass the variable itself, without the slashes; /A/ just matches a literal letter A.
Then, the input words have spaces around them. How does it find the last word on a line when there is no trailing space?
Then, you use the assignment IGNORECASE = 1 as a condition. Fortunately it is always true, so the following { block } is run. Better to have no condition and set the variable once at the BEGINning!
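
A small sketch of the first and third points, using made-up input; the names re, flag, literal, dynamic and cond are hypothetical and only serve the illustration:

```shell
# /A/ is a regex constant: it matches the literal letter "A",
# not the contents of any variable or array named A.
literal=$(echo 'A tata' | awk '{ re = "tutu|tata"; gsub(/A/, "toto") } 1')

# A string variable passed as the first argument is used as a dynamic ERE.
dynamic=$(echo 'A tata' | awk '{ re = "tutu|tata"; gsub(re, "toto") } 1')

# An assignment used as a pattern is true only if the assigned value is
# non-zero, so only the second block runs here.
cond=$(echo 'line' | awk '
flag = 0 { print "skipped" }
flag = 1 { print "ran" }')

printf '%s\n' "$literal" "$dynamic" "$cond"
```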
Attempt to fix the bugs (untested)

awk -F';' 'BEGIN { IGNORECASE = 1 } NR==FNR { A[$1] = $2; next } { x = (" " $0 " "); for (i in A) gsub(A[i], i, x); sub(/^ /, "", x); sub(/ $/, "", x); print x }' simplifier_FR.txt text.txt

mpvphd · December 19, 2017, 4:01pm

Thank you, that works.
The problem came from my awk version, but thanks for your answer!!!!

Don_Cragun · December 19, 2017, 6:21pm

I don't know what you mean about the problem being the version of awk you were using when there were so many logic errors in your code. But, if you have it working now, congratulations.

Note, however, that in addition to the corrections MadeInGermany already listed, you also need to be absolutely sure that your first input file has exactly one <space> character before and after each word you're searching for as possible text to be replaced. For example, with the sample data you provided, no changes would be made to the following lines of text:

The word tonton in this text will not be changed to toto because there aren't
two <space> characters following any occurrence of tonton in this sentence, but
there is one <space> before tonton and two <space>s after tonton in your sample
simplifier_FR.txt file.
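
One way to make the spacing in the list file irrelevant is to normalize each entry while loading it. The following is only a sketch under assumptions: the temporary file names, the array name re and the sample sentence are made up, matching is case-sensitive as written, and words are still expected to be delimited by single spaces in the text.

```shell
dir=$(mktemp -d)
printf '%s\n' 'toto ; tutu | tata | tonton  | titi ' \
              'bobo ; bibi | baba | bubu | bebe ' > "$dir/list.txt"
printf '%s\n' 'my tonton is a titi and I like bubu' > "$dir/text.txt"

out=$(awk -F';' '
NR == FNR {
    key = $1; alt = $2
    gsub(/^[ \t]+|[ \t]+$/, "", key)     # trim the replacement word
    gsub(/[ \t]*[|][ \t]*/, "|", alt)    # drop blanks around every |
    gsub(/^[ \t]+|[ \t]+$/, "", alt)     # trim both ends of the alternation
    if (key != "" && alt != "") re[key] = " (" alt ") "
    next
}
{
    x = " " $0 " "                       # pad so first/last words can match
    for (key in re) gsub(re[key], " " key " ", x)
    sub(/^ /, "", x); sub(/ $/, "", x)   # remove the padding again
    print x
}' "$dir/list.txt" "$dir/text.txt")
echo "$out"
rm -rf "$dir"
```

Because each match consumes the space on both sides, two listed words separated by a single space would only have the first one replaced.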

You might also want to note that if there are any punctuation characters before or after any of the words you want to replace, the code you're using won't find and/or replace them.
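
A different technique that copes with punctuation is to look up each word of the text in a table instead of running one gsub per list entry. This is a sketch, not the approach used earlier in the thread: the array name map, the temporary files and the punctuated sample sentence are assumptions, a "word" is assumed to be a run of ASCII letters, digits and underscores (accented text would need a wider class), and case is folded with tolower so it does not depend on gawk's IGNORECASE.

```shell
dir=$(mktemp -d)
printf '%s\n' 'toto ; tutu | tata | tonton  | titi ' \
              'bobo ; bibi | baba | bubu | bebe ' > "$dir/list.txt"
printf '%s\n' 'hello my name is bob and I am a Titi, and I like bubu.' > "$dir/text.txt"

out=$(awk -F';' '
NR == FNR {
    key = $1
    gsub(/^[ \t]+|[ \t]+$/, "", key)
    n = split($2, alt, /[|]/)            # one array element per alternative
    for (j = 1; j <= n; j++) {
        w = tolower(alt[j])
        gsub(/^[ \t]+|[ \t]+$/, "", w)
        if (w != "") map[w] = key        # variant (lower case) -> replacement
    }
    next
}
{
    line = $0; res = ""
    # Walk the line word by word, copying everything between words unchanged.
    while (match(line, /[A-Za-z0-9_]+/)) {
        word = substr(line, RSTART, RLENGTH)
        lw = tolower(word)
        res = res substr(line, 1, RSTART - 1) ((lw in map) ? map[lw] : word)
        line = substr(line, RSTART + RLENGTH)
    }
    print res line
}' "$dir/list.txt" "$dir/text.txt")
echo "$out"
rm -rf "$dir"
```

With a list of some 14000 lines this should also scale better, since each word of the text costs one array lookup rather than one regex substitution per list entry.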