Remove lines containing 2 or more duplicate strings

My text file has several thousand lines, and some of those lines contain duplicate strings/words. I would like to remove every line that contains a duplicate.

E.g.:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word duplicate

Output:

One and a Two
Unix.com is the Best

The letter case doesn't matter.

Many thanks, as always, for your help.

From what length is a "string" a string? Is "a" a string?

---------- Post updated at 11:40 ---------- Previous update was at 11:40 ----------

Does case matter?

---------- Post updated at 11:43 ---------- Previous update was at 11:40 ----------

As a starting point:

awk '{for (i=1; i<=NF; i++) for (j=i+1; j<=NF; j++) if ($i == $j) next}1' file
One and a Two
Unix.com is the Best

Hello martinsmith,

Could you please try this and let me know if it helps? I am ignoring case here, so it will match identical words whether they are in capital or small letters.
Let's say the following is the Input_file:

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

The following is the code:

awk 'BEGIN{IGNORECASE = 1} {for(i=1;i<=NF;i++){for(j=1;j<=NF;j++){if($j==$i){A[$i]++;}};if(A[$i]>1){for(i in A){delete A;next}}};print;for(i in A){delete A}}'  Input_file

Output will be as follows.

One and a Two
Unix.com is the Best

Thanks,
R. Singh

The identical words were between 3 and 12 characters long.

I tried your solution and it works like a charm. Thank you, Rudi!

---------- Post updated at 07:58 PM ---------- Previous update was at 07:53 PM ----------

Thanks, R. Singh. It worked, but it seems to have removed some extra lines. I believe Rudi's solution matched the words exactly, since some words had similar but not identical spelling.

Anyway, many thanks as usual. Cheers

Adapting RudiC's suggestion for case independence and minimum string length:

awk '{for (i=1; i<NF; i++) for (j=i+1; j<=NF; j++) if ((tolower($i) == tolower($j)) && length($i)>=3) next}1' file

--
Note: IGNORECASE is GNU awk only.

Hello martinsmith,

Here is one more solution, this time without nested loops, which may help you. Let's say the following is the Input_file.

One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD

Then the following is the code.

awk '{for(i=1;i<=NF;i++){A[tolower($i)]};if(NF == length(A)){print};delete A}'  Input_file

Output will be as follows.

One and a Two
Unix.com is the Best

EDIT: The above solution should work fine. The small change below puts the condition into its own action block, which makes it explicit that the test runs once per line, after the loop has finished.

awk '{for(i=1;i<=NF;i++){A[tolower($i)]}};{if(NF == length(A)){print};delete A}'  Input_file

So here the loop is closed first, and the condition is evaluated only after the loop completes.

Thanks,
R. Singh

Interesting solution. In short it is

awk '{for(i=1; i<=NF; i++) A[tolower($i)]} (NF==length(A)); {delete A}'

But works only with a recent GNU awk.
Other awk versions say "fatal: attempt to use array `A' in a scalar context" or "syntax error" or do not display anything.

I like that approach. Somewhat abbreviated:

awk '{delete A; for(i=1; i<=NF; i++) if (++A[tolower($i)] > 1) next} 1' file
One and a Two
Unix.com is the Best

Note:

  • delete A is BSD awk and GNU awk only. With regular awks use for(i in A) delete A[i] or split("",A)
  • length(A) is GNU awk only; with regular awk, use a counter.

--
So POSIX awk compliant versions of the last three suggestions:

awk '{for(i=1; i<=NF; i++) A[tolower($i)]; c=0; for(i in A) {delete A[i]; c++}} NF==c' file

and

awk '{split(x,A); for(i=1; i<=NF; i++) if (A[tolower($i)]++) next} 1' file

--
To also consider minimum string length:

awk '{split(x,A); for(i=1; i<=NF; i++) if (length($i)>2 && A[tolower($i)]++) next} 1'  file

--
@ravinder: nice approach

The usual trick to delete an array is

split("",A)

Thanks MadeInGermany, forgot about that one, added to my post...

Best of breed

awk '{split("",A); for(i=1; i<=NF; i++) if (A[tolower($i)]++) next} 1' file

(Just seeing Scrutinizer has it already)

$ cat martinsmith.file
One and a Two
Unix.com is the Best
This as a Line Line
Example will be the same that EXAMPLE
Example duplicate sentence with the word duplicate
$ perl -ne 'print unless /(\b\w+\b).*\g1/i' martinsmith.file
One and a Two
Unix.com is the Best

Aia, your solution does not work. First, there is a g too many. Second, the \b is not replicated to the \1. But even if I improve it like

perl -ne 'print unless /\b(\w+)\b.*\b\1\b/i' martinsmith.file

it won't print the following line

Unix.unix should be printed

Thank you, everyone! Lots of awesome solutions for this problem. Very much appreciated!

It is working as designed
Unix.unix should be printed NOT

---------- Post updated at 11:19 AM ---------- Previous update was at 07:38 AM ----------

Please, refer to the perldoc to know what \g1 does.

Ok, one more experts posting.
The \g1 was introduced in Perl 5.10 and behaves like \1 (I tested with Perl 5.8 only, my bad).
The perl solution treats Unix.unix as two words while the awk solution treats it as one word.
Regarding my \b comment, only my version prints both

No duplicat sentence with the word duplicate
No duplicate sentence with the word duplicat

(Now I have tested with perl 5.8 and 5.18)

Could that be a bug or an oversight in the awk suggestions? Maybe it is enough for the OP's purpose; however, a word is normally not defined only as characters separated by spaces.
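
If words should be delimited by any non-word character rather than by whitespace only (so that Unix.unix counts as a repeat, as it does in the perl version), a minimal awk sketch could split each line on runs of non-alphanumeric characters instead of relying on the default fields. The function name and the choice of delimiter class are assumptions made for illustration; nothing like this was posted in the thread.

```shell
# Hypothetical helper (name chosen for illustration): drop lines in which any
# word repeats, where a "word" is a run of ASCII letters, digits or underscores.
drop_dup_word_lines() {
  awk '{
    split("", seen)                                # portable way to empty the array
    n = split(tolower($0), w, /[^A-Za-z0-9_]+/)    # split on runs of non-word characters
    for (i = 1; i <= n; i++)
      if (w[i] != "" && seen[w[i]]++) next         # word already seen: skip this line
  } 1' "$@"
}

printf '%s\n' 'One and a Two' 'Unix.unix should be printed' 'Unix.com is the Best' |
  drop_dup_word_lines
```

With this definition of a word, the sample prints only "One and a Two" and "Unix.com is the Best". Whether Unix.unix should count as a duplicate is exactly the open question above, so the delimiter class is the part to adjust.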

Seeing all these elaborate awk solutions, I wonder if sed wouldn't be easier:

sed '/\([^ ]*\) \1/d' file

It is little known that back references ("\1") can be used not only in the replacement string but also in the search regexp.

Btw.: "word" here is something surrounded by whitespace, not a certain number of characters. It is easy to put such a further restriction in if it is indeed needed.

I hope this helps.

bakunin

@Bakunin, that would only work with adjacent words and would also match partial patterns:

$ echo foo foobar | sed '/\([^ ]*\) \1/d'
$

And because * also matches zero characters, any line containing a space is deleted:

$ echo abc def ghi | sed '/\([^ ]*\) \1/d'
$
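
For what it's worth, both problems can probably be worked around in plain sed by padding the line with spaces and requiring a non-empty word. The following is only a sketch under stated assumptions: words are separated by spaces (tabs are not handled), case folding covers ASCII letters only, and the function name is made up for illustration.

```shell
# Hypothetical helper (name chosen for illustration).
# Steps: save the original line in the hold space, fold ASCII upper case to
# lower case, pad the line with one space on each side, delete it if some
# non-empty space-delimited word occurs a second time as a whole word, and
# otherwise restore the original line from the hold space before printing.
sed_drop_dup_word_lines() {
  sed -e 'h' \
      -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' \
      -e 's/.*/ & /' \
      -e '/ \([^ ][^ ]*\) \(.* \)\{0,1\}\1 /d' \
      -e 'g' "$@"
}

printf '%s\n' 'foo foobar' 'abc def ghi' 'This as a Line Line' 'UNIX is very good GOOD' |
  sed_drop_dup_word_lines
```

Here "foo foobar" and "abc def ghi" survive, while the two lines with a repeated word are deleted. It is arguably no simpler than the awk versions, which is perhaps the point.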