Find redundant text in a file

hbar · December 3, 2011, 9:45am

I want to find which patterns or strings occur more than once in a file, so that I can remove the unnecessary redundancy.

For example:

If I have the sentence:

A quick brown brown fox jumps jumps jumps over the lazy dog

in a file, then I want to know that 'brown' and 'jumps' are repeated in the above-mentioned sentence.

Note that I have no idea which words are repeated.
So I cannot search for a fixed pattern.

I just need to know which texts/strings are redundant in a file. Is that possible?

Thanks.

bartus11 · December 3, 2011, 10:12am

Try:

perl -0ne 'while (/\b(\w+ )\1+/g){@x=split / /,$&;print "$x[0]: " . ($#x+1) . " times\n"}' file
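
For comparison, here is a rough awk sketch of the same idea: report words that repeat back to back on one line. The function name `consecutive_dups` is made up for illustration, and the sketch assumes words are separated by whitespace and that repeats do not span line breaks.

```shell
# Hypothetical helper: report runs of the same word repeated back to back.
consecutive_dups() {
  awk '
    {
      run = 1
      # Walk one step past the last field so a run ending the line is flushed.
      for (i = 2; i <= NF + 1; i++) {
        # Appending "" forces a string comparison, so "1.0" and "1" differ.
        if (i <= NF && ($i "") == ($(i - 1) "")) {
          run++
        } else {
          if (run > 1) print $(i - 1) ": " run " times"
          run = 1
        }
      }
    }
  ' "$@"
}
```

Usage would be `consecutive_dups file`. For the example sentence this should print `brown: 2 times` and `jumps: 3 times`.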

hbar · December 3, 2011, 10:37am

Sorry, I didn't get any output!

Suppose I have a file called test.sh

cat test.sh

gives

abc dfg
ecd xkl mno
abc
dfg asj kllll
jkl p
dfg
o

Now you can see that 'abc' is repeated in the 1st and 3rd lines.

'dfg' is repeated in the 1st, 4th, and 6th lines.

I would expect 'abc' and 'dfg' to be printed on the screen, with the corresponding lines highlighted or something similar.

I have attached the sample file.

Thanks.

abc   dfg
ecd  xkl mno
abc  
dfg  asj kllll 
jkl  p
dfg
o

bartus11 · December 3, 2011, 11:59am

I thought you needed only consecutive repetitions. Try this:

perl -ne 'while (/\w+/g){$c{$&}++};END{for $i (keys %c){print "$i: $c{$i}\n" if $c{$i}>1}}' file
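
If the line numbers are wanted as well, as asked in the previous post, something along these lines might work. It is only a sketch in awk, not the Perl approach above: the function name `dup_words_with_lines` is invented here, and it splits on whitespace rather than matching `\w+`, so punctuation stays attached to words.

```shell
# Hypothetical helper: list words that occur more than once, with line numbers.
dup_words_with_lines() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        w = $i
        if (!(w in count)) order[++n] = w   # remember first-seen order
        count[w]++
        # Record each line number only once per word.
        if (last[w] != NR) {
          lines[w] = (lines[w] == "" ? "" : lines[w] ",") NR
          last[w] = NR
        }
      }
    }
    END {
      for (k = 1; k <= n; k++) {
        w = order[k]
        if (count[w] > 1) print w ": " count[w] " times (lines " lines[w] ")"
      }
    }
  ' "$@"
}
```

Run as `dup_words_with_lines test.sh`, this should print `abc: 2 times (lines 1,3)` and `dfg: 3 times (lines 1,4,6)` for the sample file.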

hbar · December 3, 2011, 12:32pm

Thanks. What if a file contains names like this:

Bat:Ball

Bat:Wicket

Bat:Ball

Bat:Bat

Wicket:Bat

I want "Bat:Ball" to be printed, not "Bat" or "Ball" individually.

Thanks.

hbar · December 4, 2011, 1:25pm

Could someone please reply? This is quite important to me. Thanks.

ahamed101 · December 4, 2011, 1:55pm

Try this...

awk '{for(i=1;i<=NF;i++){a[$i]++}}END{for(i in a){if(a[i]>1){print i,a[i]}}}' input_file

--ahamed

bartus11 · December 4, 2011, 2:41pm

perl -ne 'while (/[\w:]+/g){$c{$&}++};END{for $i (keys %c){print "$i: $c{$i}\n" if $c{$i}>1}}' file
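
If every entry sits on its own line, as in the Bat:Ball example, comparing whole lines may be enough. A minimal sketch, assuming one entry per line and that blank lines should be ignored (the function name `dup_lines` is invented for illustration):

```shell
# Hypothetical helper: print each duplicated non-blank line once.
dup_lines() {
  # NF skips blank lines; seen[$0]++ == 1 is true only on the second
  # occurrence of a line, so each duplicate is reported exactly once.
  awk 'NF && seen[$0]++ == 1' "$@"
}
```

With the list above saved in a file, `dup_lines file` should print only `Bat:Ball`. Lines that differ only in surrounding whitespace count as different here.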