Dear all,
I am working with names and I have a large file of names in which some words are written together (up to 4 or 5) and their corresponding single forms are also present in the word list.
An example would make this clear:
annamarie
mariechristine
johnsmith
johnjoseph smith
john
smith
anna
marie
mary
christine
The program should split the words in the list based on the single forms that are present in it. Thus
annamarie anna marie
mariechristine marie christine
johnsmith john smith
johnjosephsmith
In the last case, since joseph is missing, the program could suitably tag the missing element and show the word as
john !joseph! smith
The script would prove especially helpful in separating words in languages such as German, which have a large number of compound words.
Could the awk script posted on this site (thanks to yinyuemi), which I am posting below, be modified to work within the same database instead of referring to an external dictionary? It does something similar, but it takes its words from an external dictionary. I have tried to modify it but it just does not work.
awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw
Any help given would be gratefully acknowledged.
How about this:
awk '
NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{
IGNORECASE=1;
for(j=1;j<=x;j++){
for(i=1;i<=NF;i++) {
if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
gsub(a[j]," "a[j]" ",$0)
}
}
for(i=1;i<=NF;i++)
printf (i>1?" ":"") (($i in b)?$i:"!"$i"!")
print ""
}' infile infile
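For readers who want to follow the logic without awk, here is a minimal Python sketch of what the script above appears to do. The first pass collects the first word of every line as the dictionary. The second pass surrounds each dictionary word with spaces wherever it occurs inside a longer token, then tags any token that is not in the dictionary. The names SAMPLE and split_by_substitution are made up for this illustration, and plain lowercase substring matching is assumed in place of awk's regular expressions and IGNORECASE.

```python
# Sample word list from the first post (one entry per line).
SAMPLE = [
    "annamarie", "mariechristine", "johnsmith", "johnjoseph smith",
    "john", "smith", "anna", "marie", "mary", "christine",
]


def split_by_substitution(lines):
    """Mimic the two-pass awk approach on a list of lines."""
    # Pass 1: the first word of every non-empty line is a known word.
    words = [ln.split()[0] for ln in lines if ln.split()]
    known = set(words)

    results = []
    for ln in lines:
        text = ln
        # Pass 2: try every known word against the current line.
        for w in words:
            # Substitute only when some token strictly contains w.
            if any(len(tok) > len(w) and w in tok for tok in text.split()):
                text = text.replace(w, " " + w + " ")
        # Tag every token that is not itself a known word.
        results.append(
            " ".join(tok if tok in known else "!" + tok + "!"
                     for tok in text.split())
        )
    return results


if __name__ == "__main__":
    print("\n".join(split_by_substitution(SAMPLE)))
```

Note that every line is compared against every word in the list, so the work grows roughly with the square of the list size.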
Many thanks. I copied the script and ran it on the file which I had proposed as a sample. I got no results.
Have I done something wrong? I am on Windows and maybe this is the cause, but awk/gawk should run in any environment.
It is tantalising to see a solution and not be able to use it.
Many thanks once more for your kind help.
Make sure you copy the solution exactly as it appears (including the file name twice at the end of the line):
Output:
anna marie
marie christine
john smith
john !joseph! smith
john
smith
anna
marie
mary
christine
Many thanks for taking the trouble to help me out. I copied the program as such retaining the instructions you had given:
When I retain the
, I get the following error message:
gawk: singlesplit.awk:13: }' infile infile
gawk: singlesplit.awk:13: ^ Invalid char ''' in expression
When I do away with the
, I get no response: the output file does not pop up on the screen.
Is there a problem in copying the code? This is what I copied and got:
NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{
IGNORECASE=1;
for(j=1;j<=x;j++){
for(i=1;i<=NF;i++) {
if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
gsub(a[j]," "a[j]" ",$0)
}
}
for(i=1;i<=NF;i++)
printf (i>1?" ":"") (($i in b)?$i:"!"$i"!")
print ""
}' infile infile
Sorry to hassle you like this, but I am really desperate to get the solution.
Many thanks once again for your patience and kindness.
OK, I think I might know what is going on: you are putting the awk code in a file and then calling it with the awk -f progfile option.
Remove ' infile infile from your singlesplit.awk program file and call awk like this:
awk -f singlesplit.awk infile infile
Many thanks. You made my day. The script works. I should have thought of removing the infile infile and giving them at command prompt.
Many thanks once again for all your kind help and your patience.
---------- Post updated at 10:05 PM ---------- Previous update was at 07:42 PM ----------
Sorry to sound ungrateful. The script works, but my file is around 300 thousand words and the script is very slow.
Is there any means of speeding it up, such as an array or some such device? Many thanks for all your help and sorry to pester you like this.
Yet another way to do the same thing without reading the input file twice...
awk '{
    a[$0] = $0    # every line as a known entry
    x[$0] = $0    # working copy that gets split
} END {
    for (i in a) {
        for (j in x)
            if (length(a[i]) < length(x[j]))
                if (gsub(a[i], " "a[i]" ", x[j]))
                    z[j] = x[j]
    }
    for (i in z) {
        r = ""    # start a fresh output line
        m = split(z[i], u, " ")
        for (j = 1; j <= m; j++) {
            if (u[j] in a)
                r = r ? r " " u[j] : u[j]
            else
                r = r ? r " !" u[j] "!" : "!" u[j] "!"
        }
        print r
    }
}' file
This is because the solution involves many sequential searches through the whole array.
Remember the solution we used in your "Splitting concatenated words" thread? We split each input line into a number of substrings and did a direct lookup on each; I think a similar solution is required here.
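To illustrate the direct-lookup idea, here is a rough Python sketch rather than a definitive implementation. Every word of the file goes into a set, and each token is then consumed from the left by taking the shortest prefix found in the set. Characters that start no known prefix are collected and wrapped in exclamation marks. The names SAMPLE, shortest_known_prefix, segment and split_by_lookup are invented for this example, and the shortest-prefix rule is an assumption that can split too early on some word lists.

```python
# Sample word list from the first post (one entry per line).
SAMPLE = [
    "annamarie", "mariechristine", "johnsmith", "johnjoseph smith",
    "john", "smith", "anna", "marie", "mary", "christine",
]


def shortest_known_prefix(chunk, known):
    """Return the shortest prefix of chunk that is in known, or ""."""
    for end in range(1, len(chunk) + 1):
        if chunk[:end].lower() in known:
            return chunk[:end]
    return ""


def segment(token, known):
    """Split one token into known words and !unknown! runs."""
    parts, unknown, rest = [], "", token
    while rest:
        prefix = shortest_known_prefix(rest, known)
        if prefix:
            if unknown:                  # close any pending unknown run
                parts.append("!" + unknown + "!")
                unknown = ""
            parts.append(prefix)
            rest = rest[len(prefix):]
        else:                            # no known word starts here
            unknown += rest[0]
            rest = rest[1:]
    if unknown:
        parts.append("!" + unknown + "!")
    return parts


def split_by_lookup(lines):
    """Segment every token of every line using set lookups only."""
    known = {w.lower() for ln in lines for w in ln.split()}
    return [
        " ".join(p for tok in ln.split() for p in segment(tok, known))
        for ln in lines
    ]


if __name__ == "__main__":
    print("\n".join(split_by_lookup(SAMPLE)))
```

The cost per token depends only on the token's length, not on the size of the word list, which is why this kind of lookup should scale much better on a 300-thousand-word file.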
Try this:
NR==FNR{for(i=1;i<=NF;i++)a[$i]; next}
function lsr(c,p) {
for(p=1;p<=length(c);p++)
if(tolower(substr(c,1,p)) in a) break;
if (p<=length(c)) return substr(c,1,p);
return "";
}
{
for(i=1;i<=NF;i++) {
A=$i
while(length(A)) {
s=lsr(A);
if (!s) printf "!";
while (!s && length(A)) {
printf "%s", substr(A,1,1);
A=substr(A,2);
s=lsr(A);
if (s || !length(A)) printf "! ";
}
printf "%s ", s;
A=substr(A,length(s)+1)
}
}
printf "\n";