Hi all and greetings from Ireland!
I have not used much unix or awk/sed in years and have forgotten a lot.
Easy enough query tho.
I am cleansing/fixing 10,000 postal addresses using global replacements.
I have 2 pipe-delimited files. One is basically a spell checker for geographical areas; the second file holds the actual addresses.
Sample file 1 - 100+ lines (basically a spell checker):
|Irlllland|Ireland|
|Dubblin|Dublin|
|Corrk|Cork|
etc..
Sample file 2 - 10,000+ lines (Addresses to be cleansed):
|10 Main Street Irlllland|
|11 High Road Irlllland|
|1 High Road, Corrk|
The output required is:
|10 Main Street Ireland|
|11 High Road Ireland|
|1 High Road, Cork|
I am very rusty but reckon I need a loop with a global substitution in it.
I used to know unix, awk and sed reasonably well but have forgotten the basic syntax.
Any helpers out there?
What about this approach in sed?
- Making a pattern file.
sed -e 's!|!/!g' -e 's/^/s&/' file1 >sed_pattern_file
- Using the pattern file to do replacement in file2
sed -f sed_pattern_file file2
Output:
|10 Main Street Ireland|
|11 High Road Ireland|
|1 High Road, Cork|
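For anyone wanting to try the sed approach end to end, here is a minimal sketch using the sample data from the first post. It differs from the commands above in one respect: it appends a g flag to each generated command so that every occurrence on a line is replaced, which is my assumption about what "global" should mean here. The scratch directory and the "cleansed" file name are made up for the demo.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical demo setup).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample spell-checker file from the thread: |wrong|right|
cat > file1 <<'EOF'
|Irlllland|Ireland|
|Dubblin|Dublin|
|Corrk|Cork|
EOF

# Sample address file from the thread.
cat > file2 <<'EOF'
|10 Main Street Irlllland|
|11 High Road Irlllland|
|1 High Road, Corrk|
EOF

# Turn each "|wrong|right|" line into "s/wrong/right/g":
#   leading pipe  -> "s/"
#   trailing pipe -> "/g"
#   middle pipe   -> "/"
sed -e 's!^|!s/!' -e 's!|$!/g!' -e 's!|!/!' file1 > sed_pattern_file

# Show the generated script, then apply it to the addresses.
cat sed_pattern_file
sed -f sed_pattern_file file2 > cleansed
cat cleansed
```

Bear in mind that each misspelling is used as a regular expression, so entries containing characters such as . or / would need escaping before this is safe to run on real data.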
And the one in awk:
awk 'BEGIN{ FS="|"; i=1; while((getline < "file1") > 0) { arr[i]=$2; arr_val[i++]=$3; } } { for (j=1;j<i;j++) { gsub(arr[j],arr_val[j],$0); } print; }' file2
Another approach with awk:
awk 'BEGIN{FS="[ |]"}
NR==FNR{a[$2]=$3;next}
$5 in a {$5=a[$5]}
{print}' file1 file2
If you get errors on Solaris, use nawk, gawk or /usr/xpg4/bin/awk.
I think I may have confused the issue for the last post. (franklin52)
The $5 was confusing me!
I deliberately spelt Ireland incorrectly to demonstrate the requirement.
Unfortunately I chose the letter "L" (in lower case) to demonstrate the misspelling. A lower case "L" looks the same as the pipe symbol.
Presumably the elegant last post should be adjusted to reflect the letter "L" issue.
Incidentally, I will study the solutions provided in more detail.
The code provided made me realise how much I used to love playing with "awk" and also how much a few lines of code can achieve.
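On the $5 question: with FS="[ |]" both spaces and pipes split fields, so in |10 Main Street Irlllland| the place name happens to land in field 5 (field 1 is the empty string before the first pipe). That only holds when exactly three words precede the place name, and assigning to $5 also makes awk rebuild the line with single spaces, so the pipes are dropped. One possible adjustment, offered as a sketch rather than franklin52's own fix, keeps the two-file NR==FNR idiom but replaces the misspelling wherever it occurs on the line. The scratch directory, the "fix" array and the "cleansed" file name are made up for the demo.

```shell
#!/bin/sh
# Work in a throwaway directory (hypothetical demo setup).
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample spell-checker file from the thread: |wrong|right|
cat > file1 <<'EOF'
|Irlllland|Ireland|
|Dubblin|Dublin|
|Corrk|Cork|
EOF

# Sample address file from the thread.
cat > file2 <<'EOF'
|10 Main Street Irlllland|
|11 High Road Irlllland|
|1 High Road, Corrk|
EOF

awk -F'|' '
# First file only (NR equals FNR): remember wrong -> right, skip blank keys.
NR == FNR { if ($2 != "") fix[$2] = $3; next }

# Second file: try every known misspelling anywhere on the line.
# gsub works on $0 directly, so the pipes are left untouched.
{
    for (wrong in fix)
        gsub(wrong, fix[wrong])
    print
}' file1 file2 > cleansed

cat cleansed
```

The order in which awk walks the array is unspecified, so this assumes no misspelling is a substring of another entry. As with sed, each misspelling is treated as a regular expression.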