How to replace matching words defined in one file on another file?

dineshkumarsrk · June 15, 2019, 2:53am

I have file1 and file2 as shown below,
file1:

((org14/1-131541:0.11535,((org29/1-131541:0.00055,org7/1-131541:0.00055)1.000:0.10112,((org17/1-131541:0.07344,(org23/1-131541:0.07426,((org10/1-131541:0.00201,org22/1-131541:0.00243)1.000:0.02451,

file2:

org14=india
org29=america
org7=srilanka
org17=africa
org23=europe
org10=brazil
org22=china

I need to replace each word in file1 with the matching value defined for it in file2.

The expected outcome is shown below,

((india/1-131541:0.11535,((america/1-131541:0.00055,srilanka/1-131541:0.00055)1.000:0.10112,((africa/1-131541:0.07344,(europe/1-131541:0.07426,((brazil/1-131541:0.00201,china/1-131541:0.00243)1.000:0.02451,

I could use the replace option in gedit, but here I need to replace a whole list of words. Please help me do this.

Thank you in advance.

RudiC · June 15, 2019, 3:21am

This problem has been solved umpteen times in these forums. Did you bother to search, or look into the proposals given below under "More UNIX and Linux Forum Topics You Might Find Helpful"?

Howsoever, try

awk 'FNR==NR{REP[$1]=$2; next} {for (r in REP) gsub(r, REP[r])}1' FS="=" file2 file1

Scrutinizer · June 15, 2019, 3:24am

Hi, try:

awk '
  NR==FNR {
    A[$1]=$2
    next
  } 
  {
    for(i=1; i<=NF; i++)
      if($i in A)
        sub($i,A[$i])
    print
  }
'  FS="=" file2 FS='[(/,]' file1

Don_Cragun · June 15, 2019, 6:42pm

Note that RudiC's and Scrutinizer's suggestions both depend on the fact that the orgX and orgXX strings in file2 are distinct. Had file2 also contained the line:

org2=japan

both of those suggestions might randomly have resulted in japan9 appearing in the output instead of america, japan3 appearing instead of europe, and japan2 appearing instead of china.

If this might be a problem for you, you would either need to be sure that all of your orgXX strings are the same length or sort your orgXX values by decreasing numerical value of XX and process the substitutions from beginning to end in sequence (like Scrutinizer did) instead of using for (r in REP) (like RudiC did).
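A minimal sketch of the sorted approach described above, assuming every key is the literal text org followed by digits. The file contents, the temporary directory, and the array names key and val are made up for illustration; the org2=japan line is the extra mapping from this post.

```shell
dir=$(mktemp -d)
# Hypothetical sample data, including the troublesome org2 mapping.
printf '%s\n' 'org2=japan' 'org29=america' 'org23=europe' > "$dir/file2"
printf '%s\n' '(org29/1-5:0.1,(org23/1-5:0.2,org2/1-5:0.3)' > "$dir/file1"

# Sort the mappings by the number after "org", largest first, so that org29
# and org23 are replaced before the shorter org2 can match inside them.
result=$(sort -t= -k1.4,1nr "$dir/file2" | awk '
  NR==FNR { key[++n]=$1; val[n]=$2; next }   # keep the sorted order
  { for (i=1; i<=n; i++) gsub(key[i], val[i]); print }
' FS="=" - "$dir/file1")

echo "$result"
rm -rf "$dir"
```

With this sample data the output is (america/1-5:0.1,(europe/1-5:0.2,japan/1-5:0.3). A name that is missing from file2, say org21, would still be turned into japan1, so the sketch only helps when every name in file1 has a mapping.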

And, if using Scrutinizer's code and a single orgXX string might occur more than once in a line of input (which does not happen in your sample), you would need to use gsub() instead of sub() to get the desired results.
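For reference, here is the difference between the two functions in isolation, on a made-up line in which org7 appears twice. This is only a sketch of sub() versus gsub() applied once to the whole record, not a rerun of either script above.

```shell
line='org7/1,org7/2'
# sub() replaces only the first match in the record.
first_only=$(printf '%s\n' "$line" | awk '{ sub(/org7/, "srilanka"); print }')
# gsub() replaces every match in the record.
every=$(printf '%s\n' "$line" | awk '{ gsub(/org7/, "srilanka"); print }')
echo "$first_only"   # srilanka/1,org7/2
echo "$every"        # srilanka/1,srilanka/2
```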

MadeInGermany · June 16, 2019, 3:20am

In post #3, isn't

      if($i in A)
        $i=A[$i]

more correct?
--
I see now, awk will reformat the line, substituting the FS characters with spaces.
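A small demonstration of that reformatting, using a shortened, made-up line. Assigning to a field makes awk rebuild $0 with OFS between the fields, and OFS is a single space unless it is set to something else.

```shell
line='(org29/1-5:0.1,org7/1-5:0.2)'
# FS is a bracket expression; the assignment to $2 rebuilds the record with
# the default OFS, so every "(", "/" and "," separator becomes a space.
default_ofs=$(printf '%s\n' "$line" | awk -F'[(/,]' '{ $2 = "america"; print }')
# With FS and OFS both set to "/", rebuilding the record restores the slashes.
same_ofs=$(printf '%s\n' "$line" | awk -F/ -v OFS=/ '{ sub(/org29/, "america", $1); print }')
echo "$default_ofs"
echo "$same_ofs"
```

The first command prints " america 1-5:0.1 org7 1-5:0.2)" (note the leading space from the empty first field), while the second prints (america/1-5:0.1,org7/1-5:0.2) with the separators intact.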

Scrutinizer · June 16, 2019, 5:30am

Yes, that is correct. #3 compares exact strings, so it identifies the right field, and the sub() in itself isn't the problem either: the loop iterates over the fields and not over the key-value pairs, so it can substitute multiple occurrences on one line. The problem is in the replacement part. To avoid losing the field separators, it applied sub() to the whole record instead of assigning directly to the field.

This adaptation should fix that:

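awk '
  NR==FNR {                      # first file (file2): build the lookup table
    A[$1]=$2
    next
  }
  {
    for(i=1; i<=NF; i++) {
      n=split($i, F, /[(,]/)     # the name is the last piece after any "(" or ","
      org=F[n]
      if(org in A)
        sub(org, A[org], $i)     # edit the field itself; $0 is rebuilt with OFS
    }
    print
  }
'  FS="=" file2 FS=/ OFS=/ file1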