Remove lines containing 2 or more duplicate strings

martinsmith · January 18, 2016, 5:22am

Within my text file I have several thousand lines of text, with some lines containing duplicate strings/words. I would like to entirely remove those lines which contain the duplicate strings.

E.g.:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word duplicate

Output:

One and a Two
Unix.com is the Best

The letter case doesn't matter.

Much Thanks as always for your help

RudiC · January 18, 2016, 5:43am

From what length is a "string" a string? Is "a" a string?

---------- Post updated at 11:40 ---------- Previous update was at 11:40 ----------

Does case matter?

---------- Post updated at 11:43 ---------- Previous update was at 11:40 ----------

As a starting point:

awk '{for (i=1; i<=NF; i++) for (j=i+1; j<=NF; j++) if ($i == $j) next}1' file
One and a Two
Unix.com is the Best

RavinderSingh13 · January 18, 2016, 5:49am

Hello martinsmith,

Could you please try this and let me know if it helps? I am ignoring case here, so it will match identical words whether they are in capital or small letters.
So let's say the following is the Input_file:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

Now the following is the code for the same.

awk 'BEGIN{IGNORECASE = 1} {for(i=1;i<=NF;i++){for(j=1;j<=NF;j++){if($j==$i){A[$i]++;}};if(A[$i]>1){for(i in A){delete A;next}}};print;for(i in A){delete A}}'  Input_file

Output will be as follows.

One and a Two
Unix.com is the Best

Thanks,
R. Singh

martinsmith · January 18, 2016, 5:58am

The string length was between 3 and 12 characters (for the words which were identical).

I tried your solution and it works like a charm. Thank you Rudi

---------- Post updated at 07:58 PM ---------- Previous update was at 07:53 PM ----------

Thanks R. Singh. It worked but seems to have taken some extra lines out. I believe Rudi's solution matched the patterns/words exactly, since some words were similar in spelling but different.

Anyways Much Thanks as usual. Cheers

Scrutinizer · January 18, 2016, 6:16am

Adapting RudiC's suggestion for case independence and minimum string length:

awk '{for (i=1; i<NF; i++) for (j=i+1; j<=NF; j++) if ((tolower($i) == tolower($j)) && length($i)>=3) next}1' file

--
Note: IGNORECASE is GNU awk only.

RavinderSingh13 · January 18, 2016, 8:31am

Hello martinsmith,

Adding one more solution here without the nested loops, which may help you. Let's say the following is the Input_file.

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

Then the following is the code.

awk '{for(i=1;i<=NF;i++){A[tolower($i)]};if(NF == length(A)){print};delete A}'  Input_file

Output will be as follows.

One and a Two
Unix.com is the Best

EDIT: The above solution should work fine, but so that the condition is clearly not evaluated inside the for loop, a small change as follows closes the loop's block first.

 awk '{for(i=1;i<=NF;i++){A[tolower($i)]}};{if(NF == length(A)){print};delete A}'  Input_file

So above I am closing the loop first, and only after it completes do I execute the condition part.

Thanks,
R. Singh

MadeInGermany · January 18, 2016, 9:33am

Interesting solution. In short it is

awk '{for(i=1; i<=NF; i++) A[tolower($i)]} (NF==length(A)); {delete A}'

But works only with a recent GNU awk.
Other awk versions say "fatal: attempt to use array `A' in a scalar context" or "syntax error" or do not display anything.

RudiC · January 18, 2016, 9:35am

I like that approach. Somewhat abbreviated:

awk '{delete A; for(i=1; i<=NF; i++) if (++A[tolower($i)] > 1) next} 1' file
One and a Two
Unix.com is the Best

Scrutinizer · January 18, 2016, 10:28am

Note:

delete A is BSD awk and GNU awk only. With regular awks use for(i in A) delete A[i] or split("",A)
length(A) is GNU awk only; with regular awk, use a counter.

--
So POSIX awk compliant versions of the last three suggestions:

awk '{for(i=1; i<=NF; i++) A[tolower($i)]; c=0; for(i in A) {delete A[i]; c++}} NF==c' file

and

awk '{split(x,A); for(i=1; i<=NF; i++) if (A[tolower($i)]++) next} 1' file

--
To also consider minimum string length:

awk '{split(x,A); for(i=1; i<=NF; i++) if (length($i)>2 && A[tolower($i)]++) next} 1'  file

--
@ravinder: nice approach

MadeInGermany · January 18, 2016, 12:32pm

The usual trick to delete an array is

split("",A)

Scrutinizer · January 18, 2016, 12:37pm

Thanks MadeInGermany, forgot about that one, added to my post...

MadeInGermany · January 18, 2016, 3:10pm

Best of breed

awk '{split("",A); for(i=1; i<=NF; i++) if (A[tolower($i)]++) next} 1' file

(Just seeing Scrutinizer has it already)
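
For reference, a minimal sketch pulling the thread's ideas together: portable array reset with split, case folding with tolower, and an optional minimum word length. The function name drop_dup_lines and the variable min are made up for illustration and are not from the posts above:

```shell
# Hypothetical helper: print only lines that contain no repeated word.
# $1 = minimum word length to consider (assumed default: 1).
drop_dup_lines() {
  awk -v min="${1:-1}" '{
    split("", seen)                      # portable way to empty the array
    for (i = 1; i <= NF; i++)
      if (length($i) >= min && seen[tolower($i)]++)
        next                             # second sighting: skip this line
    print
  }'
}

printf '%s\n' 'One and a Two' 'This as a Line Line' 'a b a' | drop_dup_lines 3
```

With the minimum set to 3, the one-letter word "a" is ignored, so 'a b a' is kept while 'This as a Line Line' is dropped. Words are still whitespace-separated fields here, as in the one-liners above.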

Aia · January 18, 2016, 8:41pm

cat martinsmith.file

One and a Two
Unix.com is the Best
This as a Line Line
Example will be the same that EXAMPLE
Example duplicate sentence with the word duplicate

perl -ne 'print unless /(\b\w+\b).*\g1/i' martinsmith.file

One and a Two
Unix.com is the Best

MadeInGermany · January 19, 2016, 7:26am

Aia, your solution does not work. First, there is a g too many. Second, the \b is not replicated to the \1. But even if I improve it like

perl -ne 'print unless /\b(\w+)\b.*\b\1\b/i' martinsmith.file

it won't print the following line

Unix.unix should be printed

martinsmith · January 19, 2016, 9:37am

Thank you everyone! Lots of awesome solutions for this problem. Very much appreciated!

Aia · January 19, 2016, 1:19pm

madeingermany:

Aia, your solution does not work. First, there is a g too many. Second, the \b is not replicated to the \1. But even if I improve it like
[...]it won't print the following line
Unix.unix should be printed

It is working as designed
Unix.unix should be printed NOT

---------- Post updated at 11:19 AM ---------- Previous update was at 07:38 AM ----------

Please, refer to the perldoc to know what \g1 does.

MadeInGermany · January 19, 2016, 2:38pm

Ok, one more experts posting.
The \g1 was introduced in Perl 5.10 and behaves like \1 (I tested with Perl 5.8 only, my bad).
The perl solution treats Unix.unix as two words while the awk solution treats it as one word.
Regarding my \b comment, only my version prints both

No duplicat sentence with the word duplicate
No duplicate sentence with the word duplicat

(Now I have tested with perl 5.8 and 5.18)
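
To make the difference in word definition concrete, here is a sketch of an awk variant that, roughly like Perl's \w+, treats any run of letters, digits and underscores as a word, so that Unix.unix counts as a duplicate. The function name and the ASCII-only character class are assumptions for illustration:

```shell
# Hypothetical helper: a word is a run of letters, digits or underscore
# (ASCII only), which approximates what \w+ matches in the Perl one-liner.
drop_dup_words() {
  awk '{
    n = split(tolower($0), w, /[^a-z0-9_]+/)   # split on anything that is not a word character
    split("", seen)
    for (i = 1; i <= n; i++) {
      if (w[i] == "") continue                 # empty piece from a leading or trailing separator
      if (seen[w[i]]++) next                   # repeated word: skip this line
    }
    print
  }'
}

printf '%s\n' 'Unix.com is the Best' 'Unix.unix should be printed' | drop_dup_words
```

Whether Unix.unix should count as one word or two depends on the data; this only shows that the awk approach can be moved to either definition by changing how the line is split.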

Aia · January 19, 2016, 2:54pm

Could that be a bug or oversight in the AWK suggestion? Maybe it is enough for the OP's intention; however, a word is normally not only defined as characters separated by spaces.

bakunin · January 19, 2016, 2:57pm

Seeing all these elaborate awk solutions I wonder if sed wouldn't be easier:

sed '/\([^ ]*\) \1/d' file

It is little known that back references ("\1") can be used not only in the replacement string but also in the search regexp.

Btw.: "word" here is something surrounded by whitespace, not a certain number of characters. It is easy to put such a further restriction in if it is indeed needed.

I hope this helps.

bakunin

Scrutinizer · January 19, 2016, 3:06pm

@Bakunin, that would only work with adjacent words and would also match partial patterns:

$ echo foo foobar | sed '/\([^ ]*\) \1/d'
$

And because of the zero or more match:

$ echo abc def ghi | sed '/\([^ ]*\) \1/d'
$
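
One way the sed idea might be tightened so that it compares whole words and allows other words in between. This is only a sketch in POSIX basic regular expressions, assuming words are separated by single spaces and that case matters; the function name is made up:

```shell
# Hypothetical helper. \2 is the candidate word: it must start at the line
# start or right after a space, and its repeat must end at a space or the
# line end, so partial matches like "foo foobar" are not treated as duplicates.
sed_drop_dups() {
  sed '/^\(.* \)\{0,1\}\([^ ][^ ]*\) \(.* \)\{0,1\}\2\( .*\)\{0,1\}$/d'
}

printf '%s\n' 'foo foobar' 'abc def ghi' 'This as a Line Line' 'dup x dup' | sed_drop_dups
```

Because the word is now [^ ][^ ]* (one or more characters) rather than [^ ]*, the empty match that deleted "abc def ghi" no longer applies. Tabs, repeated spaces and case-insensitive matching are not handled here.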