Why does this awk script not work correctly?

I have a large database with English headwords on the left-hand side and Indic words on the right-hand side.
Since the Indic words have been entered by hand, there are duplicates in some entries.
The structure is as under:

English headword=Indic gloss,Indic gloss

A small sample will explain:

10=
10th=,
11=,,
11th=
12=
12th=
13=,,
13th=
14=
14th=
15=
15th=,
16=
16th=
175=, 
17=
17th=
18=
18th=
190=
19=
19th=
1=
1st=,
20=
20th=
21=
21st=
22=
22nd=
23=
23rd=
24-hour interval=
24-karat gold= , , , 

As can be seen, some duplicate Indic words are present:

13=,,
11=,,

I wrote an awk script to remove such duplicates:

# script to remove dupes from a row with structure word=word
BEGIN{FS="="}
{for(i=1;i<=NF;i++){a[$i]++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1 

However when the script runs, it mangles the output file.
What has gone wrong?
Many thanks for your kind help.

---------- Post updated at 12:46 AM ---------- Previous update was at 12:45 AM ----------

Sorry, the English is on the left-hand side and the Indic gloss on the right-hand side, separated by =.

Without showing us the output you hope to get from your sample input, without telling us whether or not the order of the Indic glosses on the right side of the equal sign matters, without telling us what operating system you're using, and without telling us how the output you are currently getting is "mangled", we can make lots of assumptions about what might be wrong that have absolutely nothing to do with your actual problem.

But, one thing that is obvious is that with FS="=" the comma-separated string on the right side of the equal sign in each input line is a single field. One might guess that you either want to split $2 on commas, or you want to set FS using FS="[=,]" and loop through fields 2 through NF instead of 1 through NF.
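
Incidentally, that single-field issue also explains the "mangling": with FS="=", both $1 and the whole gloss string $2 end up as indices of a[], and for(i in a) visits them in an unspecified order, so on some lines the glosses are glued back on before the headword (hence output like ,,=11). If you prefer to keep FS="=", here is a minimal, untested sketch of the split-$2-on-commas variant mentioned above, assuming the order of the glosses should be preserved:

BEGIN { FS = "=" }
{       o = $1
        ofs = "="
        n = split($2, g, ",")
        for (i = 1; i <= n; i++)
                if (!(g[i] in seen)) {
                        o = o ofs g[i]          # append each gloss only once
                        seen[g[i]]
                        ofs = ","
                }
        if (ofs == "=")                         # nothing after the headword: keep the bare "="
                o = o "="
        print o
        split("", seen)                         # clear seen[] for the next line
}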

Assuming that the order of the Indic glosses has to be kept as they appear in the input (only removing duplicated Indic glosses), and assuming that you're using a version of awk that conforms to the requirements stated by the POSIX standards, you might try replacing your awk code with:

BEGIN { FS = "[=,]"
}
{       o = $1
        ofs = "="
        for(i = 2; i <= NF; i++) 
                if(!($i in s)) {
                        o = o ofs $i
                        s[$i]
                        ofs = ","
                }
        print o
        for(i in s)
                delete s[i]
}

which, with your sample input, produces the output:

10=
10th=,
11=,
11th=
12=
12th=
13=,
13th=
14=
14th=
15=
15th=,
16=
16th=
175=, 
17=
17th=
18=
18th=
190=
19=
19th=
1=
1st=,
20=
20th=
21=
21st=
22=
22nd=
23=
23rd=
24-hour interval=
24-karat gold= , , , 

If the output order of the Indic glosses on the right-hand side doesn't matter, this code could be simplified.
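
One possible simplification, just as a sketch (the unique glosses then come out in whatever order awk's for-in loop happens to deliver them):

BEGIN { FS = "[=,]" }
{       for (i = 2; i <= NF; i++)       # collect unique glosses as array indices
                s[$i]
        o = $1
        ofs = "="
        for (g in s) {                  # iteration order is unspecified
                o = o ofs g
                ofs = ","
        }
        print o
        split("", s)                    # clear s[] for the next line
}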


Sorry, I should have been clearer.
I work under Windows and hence DOS.
Basically, as you can see, the dictionary has the structure

English headword=Indic gloss,Indic gloss

as shown in the sample below:

10=
10th=,
11=,,
11th=
12=
12th=
13=,,
13th=
14=
14th=
15=
15th=,
16=
16th=
175=, 
17=
17th=
18=
18th=
190=
19=
19th=
1=
1st=,
20=
20th=
21=
21st=
22=
22nd=
23=
23rd=
24-hour interval=
24-karat gold= , , , 

Since the database was made by hand, at times words are repeated in the Indic glosses, as shown in the sample below:

13=,,
11=,,

What I needed was an awk script to identify such repeated entries and delete the duplicates.
Thus the sample above would be reduced as under:

13=,
11=,

I had written the following awk script to do the job:

# script to remove dupes from a row with structure word=word,word
BEGIN{FS="="}
{for(i=1;i<=NF;i++){a[$i]++;}for(i in a){b=b"="i}{sub("=","",b);$0=b;b="";delete a}}1

However when I ran the script on the sample, it produced a mangled output:

10=
10th=,
,,=11
11th=
=12
12th=
,,=13
=13th
=14
=14th
=15
15th=,
=16
16th=
, =175
17=
=17th
18=
=18th
=190
19=
19th=
=1
,=1st
=20
=20th
=21
21st=
=22
=22nd
=23
23rd=
=24-hour interval
 , , , =24-karat gold

I hope the above clarifies the situation. Identifying dupes visually is both time-consuming and prone to error.

---------- Post updated at 02:00 AM ---------- Previous update was at 01:57 AM ----------

By the time I had posted the clarifications, you had already replied. Many thanks; it worked, swept through a dictionary of 70,000 words, and removed all the dupes.
I will now study the script to see where I went wrong.

The for loop can be shortened, and a classic split trick clears an array.

BEGIN { FS = "[=,]"
}
{       o = $1 "=" $2
        s[$2]
        for(i = 3; i <= NF; i++) 
                if(!($i in s)) {
                        o = o "," $i
                        s[$i]
                }
        print o
# clear s[]
        split("",s)
}
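
Since Windows was mentioned earlier in the thread: if you run this from cmd.exe rather than a Unix shell, the single-quote style used for awk one-liners here will not work, so the simplest route is to save the program above in a file and invoke it with -f (the file names below are just examples):

awk -f dedupe.awk dictionary.txt > dictionary_clean.txt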

If the order of the Indic glosses is unimportant, try also

awk -F= '
        {# collect the glosses in $2 as array indices; duplicates collapse
         for (n=split($2, T, ","); n>0; n--) C[T[n]]
         printf "%s=", $1
         DL = ""
         # print the unique glosses in whatever order the for (c in C) loop delivers them
         for (c in C)   {printf "%s%s", DL, c
                         DL = ","
                        }
         printf RS
         split ("",C)           # clear C[] for the next line
        }
' file
10=
10th=,
11=,
11th=
12=
12th=
13=,
13th=
.
.
.
24-hour interval=
24-karat gold= , , , 

Many thanks. I tested the script and it worked beautifully.
The loop is an interesting feature.
Thanks to all who so very kindly give their time to help out.