In my text file I have several thousand lines of text, and some of those lines contain duplicate strings/words. I would like to remove those lines entirely.
E.g.:
One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word duplicate
Could you please try this and let me know if it helps? I am ignoring case here, so it will match the same word whether it is written in capital or small letters.
Let's say the following is the Input_file:
One and a Two
Unix.com is the Best
This as a Line Line
Example duplicate sentence with the word DUPLICATE
UNIX is very good GOOD
Now the following is the code for it.
awk 'BEGIN{IGNORECASE = 1} {for(i=1;i<=NF;i++){for(j=1;j<=NF;j++){if($j==$i){A[$i]++;}};if(A[$i]>1){for(i in A){delete A;next}}};print;for(i in A){delete A}}' Input_file
The identical words were between 3 and 12 characters long.
I tried your solution and it works like a charm. Thank you, Rudi.
Thanks, R. Singh. It worked, but it seems to have taken some extra lines out. I believe Rudi's solution matched the words exactly, since some words in my file have similar spelling but are different.
But the awk solution works only with a recent GNU awk.
Other awk versions say "fatal: attempt to use array `A' in a scalar context" or "syntax error" or do not display anything.
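A sketch of a variant that should run on other awk versions too, assuming only POSIX awk features: tolower() stands in for IGNORECASE, and split("", seen) empties the array on awks that reject a bare "delete A". The array name seen and the inline sample input are only for illustration.

```shell
# Feed the sample lines from the thread to awk and keep the output in $result.
result=$(printf '%s\n' \
  'One and a Two' \
  'Unix.com is the Best' \
  'This as a Line Line' \
  'Example duplicate sentence with the word DUPLICATE' \
  'UNIX is very good GOOD' |
awk '{
  split("", seen)                   # portable way to empty the array for each line
  for (i = 1; i <= NF; i++)
    if (seen[tolower($i)]++) next   # word already seen on this line: skip the line
  print                             # no whitespace-separated word occurred twice
}')

printf '%s\n' "$result"
```

With this input only the first two lines should survive. Words are still whatever awk splits on whitespace, so "Unix.com" counts as one word here.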
OK, one more post from the experts.
The \g1 was introduced in Perl 5.10 and behaves like \1 (I tested with Perl 5.8 only, my bad).
The perl solution treats Unix.unix as two words while the awk solution treats it as one word.
Regarding my \b comment: only my version prints both of these lines:
No duplicat sentence with the word duplicate
No duplicate sentence with the word duplicat
Could that be a bug or an oversight in the awk suggestion? Maybe it is enough for what the OP intends; however, a word is normally not defined only as characters separated by spaces.
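To illustrate the difference, here is a sketch of an awk variant with a stricter word definition. It assumes a "word" is a run of ASCII letters and digits, which is my assumption and not something the OP specified; the names w and seen and the sample lines are made up for the example.

```shell
result=$(printf '%s\n' \
  'No duplicat sentence with the word duplicate' \
  'Unix.unix is the Best' \
  'Example duplicate sentence with the word duplicate.' |
awk '{
  split("", seen)                                # reset the per-line word table
  n = split(tolower($0), w, /[^a-z0-9]+/)        # split on anything that is not a letter or digit
  for (i = 1; i <= n; i++)
    if (w[i] != "" && seen[w[i]]++) next         # ignore empty pieces, drop the line on a repeat
  print
}')

printf '%s\n' "$result"
```

Here "Unix.unix" is split into two identical words and the trailing dot in "duplicate." no longer hides the repeat, so only the first line should be printed. "duplicat" and "duplicate" remain different words.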
Seeing all these elaborate awk solutions, I wonder if sed wouldn't be easier:
sed '/\([^ ][^ ]*\) \1/d' file
It is little known that back references ("\1") can be used not only in the replacement string but also in the search regexp.
Btw.: "word" here is something surrounded by whitespace, not a certain number of characters. It is easy to put such a further restriction in if it is indeed needed.
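One way such a restriction might look, as a sketch only: pad each line with a space on both sides so that every word is surrounded by spaces, delete the line if a whole word shows up again anywhere later on the line, then strip the padding. It assumes single spaces are the only separators (no tabs) and that the match should stay case sensitive; the sample lines are taken from this thread.

```shell
result=$(printf '%s\n' \
  'One and a Two' \
  'This as a Line Line' \
  'Example duplicate sentence with the word duplicate' \
  'No duplicat sentence with the word duplicate' |
sed -e 's/.*/ & /' \
    -e '/ \([^ ][^ ]*\) \(.* \)\{0,1\}\1 /d' \
    -e 's/^ \(.*\) $/\1/')

printf '%s\n' "$result"
```

The [^ ][^ ]* part forces the word to be at least one character long; with [^ ]* the group can match the empty string, and then the pattern matches any line that contains a space. The spaces around both the group and the back reference keep "duplicat" from matching inside "duplicate".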