Replace strings in FILE1.txt using the mappings in FILE2.txt. A FILE1.txt record must contain at least one state name that is repeated, and the replacement must be applied only from the second occurrence onward in that record.
Condition: the order of the records in FILE2.txt is important.
So once a FILE2.txt entry matches a FILE1.txt record, break the loop and do not try the remaining FILE2.txt entries. See the example below.
FILE1.txt
----------
TEXAS CALIFORNIA TEXAS
DALLAS CALIFORNIA CALIFORNIA DALLAS DALLAS TEXAS
FILE2.txt
------------
TEXAS,TX
DALLAS,DA
CALIFORNIA,CA
NEWYORK,NY
output:
-------
TEXAS CALIFORNIA TX
(TEXAS is matched, so TEXAS is replaced with TX at its 2nd occurrence)
DALLAS CALIFORNIA CALIFORNIA DA DA TEXAS
(DALLAS and CALIFORNIA are each matched more than once, but since the order in FILE2.txt is important and DALLAS comes before CALIFORNIA, only DALLAS is replaced; the loop then breaks and CALIFORNIA is not searched again)
I have implemented this using a while loop and it works as expected, but FILE1.txt has millions of records and FILE2.txt has 50 records, so it takes hours to complete. Is there any awk solution to speed this up, please?
Hi, thanks for looking into this. Here is the code.
echo "Replace the string matches only once, or replace ALL except the FIRST occurrence."
tot_cnt=$(wc -l < "$REP_FILE_PATH/$REP_FILE")
# IFS='' with read -r preserves leading and trailing spaces
while IFS='' read -r line; do
    i=0
    while read -r line_1; do
        field[1]=$(cut -d',' -f1 <<<"$line_1")
        field[2]=$(cut -d',' -f2 <<<"$line_1")
        cnt=$(echo -n "$line" | grep -o "${field[1]}" | wc -l)
        if [[ "$cnt" -gt 1 ]]; then
            sed -e "s/${field[1]}/${field[2]}/2g" <<<"$line" >> tmp.txt
            break
        fi
        let i++
    done < file2.txt
done < file1.txt
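A side note on the `2g` flag used in the sed call: in GNU sed, combining an occurrence number with `g` replaces from that occurrence to the end of the line, leaving earlier matches alone (this is a GNU extension, not POSIX). A minimal standalone demo:

```shell
# GNU sed: "2g" starts replacing at the 2nd match and continues to
# the end of the line; the 1st match is left untouched
line="TEXAS CALIFORNIA TEXAS TEXAS"
echo "$line" | sed 's/TEXAS/TX/2g'
# → TEXAS CALIFORNIA TX TX
```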
You don't need awk (or similar) to improve the performance of your script. Just looking at it, you can see that you run six commands (= six new processes) in the inner loop, times 50 for the lines in file2, times millions for the lines in file1 (and you open file2 millions of times, even though it is buffered/cached).
With your input data, and after cleaning out a few quirks in your code snippet, I find
time . XX
real 0m0.308s
user 0m0.192s
sys 0m0.119s
, while
time . YY
real 0m0.014s
user 0m0.012s
sys 0m0.000s
with YY being
while IFS='' read -r line
do while IFS=, read field1 field2
do TMP=${line//$field1}
   if [ $(( (${#line} - ${#TMP}) / ${#field1} )) -gt 1 ]
   then sed "s/$field1/$field2/2g" <<<"$line" >> tmp.txt
break
fi
done < file2
done < file1
cat tmp.txt
TEXAS CALIFORNIA TX
DALLAS CALIFORNIA CALIFORNIA DA DA TEXAS
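The key speed-up in the loop above is the counting trick: `${line//$field1}` is a bash parameter expansion that deletes every occurrence of the pattern, and dividing the length difference by the pattern length gives the occurrence count without spawning a single external process (replacing the `grep -o | wc -l` pipeline). In isolation:

```shell
# count occurrences of $word in $line using only bash builtins
line="DALLAS CALIFORNIA CALIFORNIA DALLAS DALLAS TEXAS"
word="DALLAS"
stripped=${line//$word}                     # line with every DALLAS removed
cnt=$(( (${#line} - ${#stripped}) / ${#word} ))
echo "$cnt"
# → 3
```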
An even faster solution might be to use an array to hold file2's contents, and have the outer loop read file1, and an inner loop to iterate through the array doing the comparisons/modifications.
---------- Post updated at 22:00 ---------- Previous update was at 21:36 ----------
Modification using arrays; adapt to taste...:
unset i
# both columns of file2 must go into arrays (a scalar field2 would be
# overwritten on every read, leaving ${field2[$i]} empty in the loop below)
while IFS=, read -r f1 f2; do field1[++i]=$f1; field2[i]=$f2; done < file2
while IFS='' read -r line
do for (( i=1; i<=${#field1[@]}; i++ ))
do TMP=${line//${field1[$i]}}
   if [ $(( (${#line} - ${#TMP}) / ${#field1[$i]} )) -gt 1 ]
   then sed "s/${field1[$i]}/${field2[$i]}/2g" <<<"$line" >> tmp.txt
break
fi
done
done < file1
Timing is similar to the first version; looks like the disk cache is quite powerful:
time . ZZ
real 0m0.015s
user 0m0.003s
sys 0m0.013s
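For completeness, here is a rough sketch of the awk version the question originally asked for. It is a sketch, not a tested drop-in: it assumes the file2 entries are plain words (no regex metacharacters, no `&`), since `gsub` treats its first argument as a regular expression, and like the posted script it only prints lines in which some entry occurs more than once:

```shell
# file2.txt entries are tried in order; the first one occurring more
# than once in a line is replaced from its 2nd occurrence onward
printf 'TEXAS CALIFORNIA TEXAS\nDALLAS CALIFORNIA CALIFORNIA DALLAS DALLAS TEXAS\n' > file1.txt
printf 'TEXAS,TX\nDALLAS,DA\nCALIFORNIA,CA\nNEWYORK,NY\n' > file2.txt

prog='
NR == FNR { from[NR] = $1; to[NR] = $2; n = NR; next }  # load file2 once
{
    for (i = 1; i <= n; i++) {
        if (gsub(from[i], from[i]) > 1) {               # count occurrences
            p = index($0, from[i]) + length(from[i])    # end of 1st match
            head = substr($0, 1, p - 1)
            tail = substr($0, p)
            gsub(from[i], to[i], tail)                  # replace the rest
            print head tail
            next                                        # 1st match wins
        }
    }
}'
awk -F, "$prog" file2.txt file1.txt
# prints:
# TEXAS CALIFORNIA TX
# DALLAS CALIFORNIA CALIFORNIA DA DA TEXAS
```

Since file2 is read into arrays once and each file1 line is handled in a single awk process, the millions-of-lines case avoids all the per-line process spawning entirely.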