Search and replace multiple patterns in a particular column only - efficient script

Hi Bigshots,

I have a pattern file with two columns, and a separate data file. If column 1 of the pattern file appears as the 4th column of a line in the data file, I need to replace that 4th column with column 2 of the pattern file. If the pattern appears in any other column, it must not be replaced.

Ex:
Pattern File:

opq,098
rst,765
xyz,321

Data File:

xyz,122,913,opq,876
rst,956,921,xyz,012
456,890,903,rst,467

Output File:

xyz,122,913,098,876
rst,956,921,321,012
456,890,903,765,467

Please note that the first column of the data file is not to be replaced.

I achieved this by looping over the pattern file and running awk once per pattern line, but this is taking a lot of time. I want a faster script; shell or Perl would do.
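
Roughly, the current approach is shaped like this sketch (simplified; one full awk pass, and one rewrite of the data file, per pattern line):

# simplified sketch of the per-pattern loop, not the exact script
while IFS=, read -r pat repl
do
    awk -F, -v p="$pat" -v r="$repl" 'BEGIN{OFS=","} $4==p {$4=r} 1' data_file > data_file.tmp &&
        mv data_file.tmp data_file
done < pattern_file

With N pattern lines, that reads and rewrites the whole data file N times.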

I saw corona688's reply in the thread titled "Replace column that matches specific pattern, with column data from another file".

However, it does not address the column-specific search-and-replace requirement.

Please use code tags as required by forum rules!

Show us your unsatisfactory attempt that you want enhanced.

If you have sufficient memory to load the entire pattern file into an awk array, I'd solve this problem like this:

awk -F, 'FNR==NR{p[$1]=$2; next}   # 1st file: build pattern -> replacement map
$4 in p { $4=p[$4] }               # 2nd file: substitute field 4 if it has a mapping
1' OFS=, pattern_file data_file > output_file

Needs testing with real data sets...
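
Run against the sample pattern and data files from the first post, it should reproduce the expected output exactly:

$ awk -F, 'FNR==NR{p[$1]=$2; next} $4 in p { $4=p[$4] } 1' OFS=, pattern_file data_file
xyz,122,913,098,876
rst,956,921,321,012
456,890,903,765,467

FNR==NR is only true while the first file (the pattern file) is being read, so the lookup array p is fully built before any data line is processed.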

#!/bin/bash

lines=0
while read f1
do
    lines=$(( lines + 1 ))
    kee2=$(echo $f1 | cut -d, -f4)          # field 4 of the current data line
    val2=$(grep -e "^$kee2" trydata2)       # look it up in the pattern file
    if [[ $val2 ]]; then
        repl=${val2##*,}                    # replacement = text after the last comma
        sed -i "$lines s/\b$kee2/$repl/g" trydata1   # rewrite that line in place
    fi
done < trydata1
cat trydata1

# output
# ------
# xyz,122,913,098,876
# rst,956,921,321,012
# 456,890,903,765,467

@ongoto, this is likely to be just as slow as (or slower than) the OP's original script, and it also has the potential to replace partial values and columns other than #4.

Consider input line:

000,5xyz1,000,xyz,000
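
With kee2=xyz and repl=321, an unanchored substitution clobbers the partial value in field 2 as well as the intended field 4:

$ echo '000,5xyz1,000,xyz,000' | sed "s/xyz/321/g"
000,53211,000,321,000

A leading \b guards against this particular case, but a trailing partial such as xyz1 would need \b on both sides of the pattern.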

@Chubler_XL
You are absolutely right. I was only testing against the provided data.

This might work for the situation you presented:
sed -i "$lines s/\b$kee2/$repl/g" trydata1
I'll edit the post above accordingly.

But that still doesn't cure the 'other columns' bit, does it?


The theory here is to reduce the number of disk reads. Disk writes can't be helped unless one builds the file in memory and writes it out all at once. Bash speed just is what it is...and so is my skill set. :smiley:

Your AWK example is 9 times faster on my machine! That's BIG!

#!/bin/bash

# Load both data files into memory
mapfile -t data1 < trydata1
mapfile -t data2 < trydata2
lines=0
for f1 in "${data1[@]}"
do
    lines=$(( lines + 1 ))
    kee2=$(echo $f1 | cut -d, -f4)          # field 4 of the current data line
    for val2 in "${data2[@]}"               # scan the in-memory pattern file
    do
        if [[ $val2 =~ ^$kee2 ]]; then
            repl=${val2##*,}
            sed -i "$lines s/\b$kee2/$repl/g" trydata1
        fi
    done
done
cat trydata1

# Using this data set; no column issues this time...
# ----------------------
# xyz,122,913,opq,876
# rst,956,921,xyz,012
# 456,890,903,rst,467
# 000,5xyz1,000,xyz,000
# 4567,5rst1,opq,xyz,000
# rst,5opq1,rst,02opq,000
# 000,5xyz1,xyz,opq,000

# real    0m0.028s
# user    0m0.008s
# sys    0m0.009s

# On a machine running @ 2GHz
# If my math is right...
# 0.028 s / 7 lines = 0.004 s per line
# 30,000 lines x 0.004 s = 120 s ~ 2 minutes

@ongoto, reason #1 for the inefficiency is the many rewrites of the output file with sed.
The following uses bash to write the output only once, while still using awk for a fast and precise lookup of the pattern file. (The awk call could be replaced by an inner while loop in bash using the same technique as the outer loop; see the sketch after the code.)

while IFS="," read -r -a x                    # split each data line into array x
do
 # exact-match lookup of field 4 (x[3]) in the pattern file
 lookup=$(awk -F"," '$1==s {print $2}' s="${x[3]}" pattern_file)
 [ -n "$lookup" ] && x[3]=$lookup
 echo "${x[@]}"                               # fields joined by spaces...
done < data_file | tr ' ' ',' > output_file   # ...then turned back into commas
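
A sketch of that all-bash variant (the inner loop rescans pattern_file for every data line, trading the awk dependency for extra reads; as above, fields are assumed to contain no spaces):

while IFS="," read -r -a x
do
    while IFS="," read -r pat repl            # inner loop: same technique as the outer one
    do
        [ "${x[3]}" = "$pat" ] && x[3]=$repl  # exact match on field 4 only
    done < pattern_file
    echo "${x[@]}"
done < data_file | tr ' ' ',' > output_file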

Of course awk is much faster than bash, and this task fits awk 100%.