Modify script to remove dupes with two delimiters

Hello,
I have a script which removes duplicates in a database with a single delimiter

=

The script is given below:

# script to remove dupes from a row with structure word=word
BEGIN { FS = "=" }
{
    for (i = 1; i <= NF; i++) a[$i]++   # count each field
    for (i in a) b = b "=" i            # rebuild the row from the unique fields
    sub("=", "", b)                     # drop the leading "="
    $0 = b; b = ""; delete a            # reset for the next row
}
1                                       # print the rebuilt row
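A quick shell run of the script (the input here is illustrative; note that the output field order is unspecified, because for (i in a) iterates in arbitrary order, so a single repeated word is used to keep the result deterministic):

```shell
# Feed a row with one repeated field through the dedup one-liner;
# the three "word" fields collapse to a single "word".
echo 'word=word=word' |
awk 'BEGIN{FS="="}{for(i=1;i<=NF;i++)a[$i]++;for(i in a)b=b"="i;sub("=","",b);$0=b;b="";delete a}1'
```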

How do I modify the script to remove duplicates in a database where each record contains two

=

A small pseudo-sample is given below.

=m=Prefix signifying negation.
=m=Prefix signifying negation.
=ind=Interjection expressing disapprobation.
=int=An interjection expressing contempt,unconcern,disbelief.
=m=A figure;a mark.The thigh.An act of a play.
=n=Arithmetic.
=ind=Interjection expressing disapprobation.
=int=An interjection expressing contempt,unconcern,disbelief.
=m=A figure;a mark.The thigh.An act of a play.
=n=Arithmetic.

I tried to modify the delimiter part in the script using

{FS="=""*'"="}

But it resulted in totally garbled output.
Since the file is very large, normal editors cannot handle removing the dupes, hence this request.

Your description isn't clear enough to understand what you're trying to do.

I don't see any lines in your input that have a duplicated field (with = as the field separator). So there doesn't seem to be anything that needs to be done to remove duplicated fields in a line.

There is nothing in your code that makes any attempt to compare lines. If you were trying to remove duplicated lines the = would have no relevance; just using:

sort -u file

would do that (assuming that your database is in a text file named file).
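For illustration, here is a minimal run on lines like those in your pseudo-sample (the file name is illustrative; note that sort -u also sorts the output, so the original line order is not preserved):

```shell
# Build a small sample with a duplicated line, then deduplicate with sort -u.
printf '=m=Prefix signifying negation.\n=n=Arithmetic.\n=m=Prefix signifying negation.\n' > sample.txt
sort -u sample.txt
```

The duplicated =m= line appears only once in the output.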

You haven't shown us what output you hope to produce from your sample "text" and you haven't told us what form it takes. (Is your sample stored in a text file, in an Oracle database file, in some other type of database, or something else?)

What output are you hoping to produce from your pseudo-sample?

I am sorry. I should have been more explicit.
My database has a word or a phrase, followed by its part of speech, and finally its meaning, each of which is delimited by

=

As can be seen in the example below:

=m=Prefix signifying negation.

It so happens that while compiling the dictionary, duplicates have crept into the database, and what I need is a tool to remove these duplicates. As a pseudo-example, here is a sample of the input:

=m=Prefix signifying negation.
=m=Prefix signifying negation.
=ind=Interjection expressing disapprobation.
=int=An interjection expressing contempt,unconcern,disbelief.
=m=A figure;a mark.The thigh.An act of a play.
=n=Arithmetic.
=ind=Interjection expressing disapprobation.
=int=An interjection expressing contempt,unconcern,disbelief.
=m=A figure;a mark.The thigh.An act of a play.
=n=Arithmetic.

The expected output would clean out all duplicates and store only unique strings, as shown in the output below:

=m=Prefix signifying negation.
=ind=Interjection expressing disapprobation.
=int=An interjection expressing contempt,unconcern,disbelief.
=m=A figure;a mark.The thigh.An act of a play.
=n=Arithmetic.

I hope this clarifies the query. The script I had provided handled only one delimiter

=

and I wanted to know if the awk script could be modified to suit this issue. Many thanks.
I work in a Windows environment.

You didn't answer the question about what type of file is being processed! And, that is even more important now that we know you're working on a Windows system (while posting your question in a forum devoted to UNIX and UNIX-like operating systems).

If you have awk, you must have installed some UNIX utilities on your Windows system. Did you try the sort command I suggested? If so, what did it do? If not, why not?

A common, easy way to remove duplicated lines using awk is:

awk '!a[$0]++' file

but, of course, that depends on file being a text file (as defined by UNIX systems); awk may silently drop the last (incomplete) line of a DOS file that doesn't end with a line terminator.
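One way to sidestep that caveat (a minimal sketch, assuming the standard tr and awk utilities are available) is to strip the DOS carriage returns first and then deduplicate:

```shell
# Create a CRLF-terminated sample like the dictionary rows in this thread,
# remove the DOS carriage returns with tr, then drop duplicate lines with awk.
printf '=m=Prefix signifying negation.\r\n=m=Prefix signifying negation.\r\n=n=Arithmetic.\r\n' > dict.txt
tr -d '\r' < dict.txt | awk '!a[$0]++'
```

Unlike sort -u, this keeps the surviving lines in their original order.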


I had tried this but had forgotten to save the file as a Unix file. The moment I saved it in Unix format, the duplicates were eliminated.
Many thanks for your patience and help

awk can remove a Windows/DOS \r

awk '
{ sub(/\r$/, "") }            # strip a trailing DOS carriage return
!($0 in a) { print; a[$0] }   # print only the first occurrence of each line
' file

(The braces around sub() matter: a bare sub() acts as a pattern, and awk's default action would print every line where a substitution occurred.)

This second awk line looks more complex than !a[$0]++ but saves some memory: the in test does not create array elements, and a[$0] stores an empty value rather than a counter.


Thanks a lot. I tried it on my Dos file and it worked perfectly.