Counting lines that match a pattern

robsonde · October 11, 2012, 8:26pm

I have a file of 1.3 million lines.

Some lines have the same word twice, and some lines have two different words.
Each line has two words, one in brackets.

example:

foo      (foo)
bar      (bar)
thae    (awvd)
beladf  (vswvw)

I am sure this can be done with one line of awk or sed, but my brain is done for the day.

I know I can do it with a shell loop, but it would run very slowly for 1.3 million lines.

jim_mcnamara · October 11, 2012, 8:28pm

You have explained the data, but not explained what your expected output will be.
What pattern? ... for example

robsonde · October 11, 2012, 8:32pm

Sorry, I just need a count of the lines where both words are the same.

For the sample data, the output would be "2".

Scrutinizer · October 12, 2012, 12:37am

Try:

grep -c '\(.*\).*(\1)' infile
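
As a quick check, the command can be run against the four sample lines from the first post. This is only a sketch: the temporary file and the variable names `infile` and `count` are made up for the illustration and stand in for the real 1.3-million-line file.

```shell
# Recreate the four sample lines from the first post in a temporary file
# (hypothetical stand-in for the real input file).
infile=$(mktemp)
printf '%s\n' \
    'foo      (foo)' \
    'bar      (bar)' \
    'thae    (awvd)' \
    'beladf  (vswvw)' > "$infile"

# -c makes grep print the number of matching lines instead of the lines.
# \(.*\) captures some text, and (\1) requires that same text to appear
# again inside literal parentheses later on the line.
count=$(grep -c '\(.*\).*(\1)' "$infile")
echo "$count"    # prints 2

rm -f "$infile"
```

In a basic regular expression, `\(` and `\)` delimit the group, while the unescaped `(` and `)` match the literal brackets in the data.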

guruprasadpr · October 12, 2012, 1:10am

@Scrutinizer: I thought \(.*\) would consume everything up to the character just before the "(", and the following .* would be left with nothing. But \(.*\) took exactly the first word, leaving the second .* to consume the spaces.

Please help me understand how the second .* came to consume the spaces.

Guru.

elixir_sinari · October 12, 2012, 1:24am

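.* is always greedy and at first matches as much as possible (the whole line), but the rest of the pattern, in this case the literal parentheses and the back-reference \1, forces the regexp engine to backtrack: it gives up one character of the matched string at a time and retries until the overall match succeeds or no possibility is left. Here the only capture that can be repeated inside the brackets is the first word, so the spaces fall to the second .*.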
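
One way to see what the group ends up holding after backtracking is to have sed print it. This is a small sketch using the same expression on one sample line; the variable names `line` and `captured` are made up for the illustration.

```shell
line='foo      (foo)'

# Same expression as the grep command, with a trailing .* so the whole line
# is replaced by the captured group, shown between square brackets.
# -n plus the p flag prints only lines where the substitution happened.
captured=$(printf '%s\n' "$line" | sed -n 's/\(.*\).*(\1).*/[\1]/p')
echo "$captured"    # prints [foo]
```

The group holds only `foo`, without the trailing spaces, because that is the only capture for which `(\1)` can still match later on the line.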

Scrutinizer · October 12, 2012, 1:39am

I thought of a case where it would not work correctly. If we have

foobar   (foo)

Then it would still be counted, so perhaps we would need something like:

grep -c '^ *\(.*\) .*(\1)' infile

if only spaces are used to separate the fields...
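
The difference between the two expressions can be checked on the sample data plus the problem line. The sketch below also tries an awk comparison of the two fields, which is an alternative not proposed in the thread; it assumes every line has exactly two whitespace-separated fields, and the names `infile`, `loose`, `anchored` and `exact` are made up for the illustration.

```shell
# Sample data from the first post plus the problematic "foobar   (foo)" line.
infile=$(mktemp)
printf '%s\n' \
    'foo      (foo)' \
    'bar      (bar)' \
    'thae    (awvd)' \
    'beladf  (vswvw)' \
    'foobar   (foo)' > "$infile"

# First expression: the group may start anywhere, so "foo" inside "foobar" counts.
loose=$(grep -c '\(.*\).*(\1)' "$infile")

# Second expression: the group must start at the beginning of the line
# (after optional spaces) and be followed by a space.
anchored=$(grep -c '^ *\(.*\) .*(\1)' "$infile")

# awk alternative (assumption: exactly two whitespace-separated fields):
# count lines whose second field is the first field wrapped in brackets.
exact=$(awk '$2 == ("(" $1 ")") { n++ } END { print n + 0 }' "$infile")

echo "$loose $anchored $exact"    # prints 3 2 2

rm -f "$infile"
```

The awk version splits on any run of spaces or tabs, so it does not depend on spaces being the only separator.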