Counting lines that match a pattern

I have a file of 1.3 million lines.

Some lines have the same word twice; other lines have two different words.
Each line has two words, one of them in brackets.

example:

foo      (foo)
bar      (bar)
thae    (awvd)
beladf  (vswvw)

I am sure this can be done with one line of awk or sed, but my brain is done for the day.

I know I can do it with a shell loop, but it would run very slowly for 1.3 million lines.

You have explained the data, but not explained what your expected output will be.
What pattern? ... for example

Sorry, I just need a count of the lines where both words are the same.

For the sample data, the output would be "2".

Try:

grep -c '\(.*\).*(\1)' infile

@Scrutinizer: I thought \(.*\) would consume everything up to the character just before the "(", so that the following .* would be left with nothing. But \(.*\) took exactly the first word, leaving the .* to consume the spaces.

Please help me understand how the .* consumed the spaces.

Guru.

.* is always greedy and first matches as much as possible (the whole line), but the parentheses and back-reference (in this case) force the regexp engine to backtrack, giving up one character of the matched string at a time, until the overall match succeeds.
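
To see which text the group ends up with after backtracking, one option is to let sed print the capture. This is only a sketch: the variable names line and captured are made up for the illustration, and the trailing .* in the sed pattern is added just so the whole line gets replaced.

```shell
# One sample line from the question.
line='foo      (foo)'

# Replace the whole line with the text captured by \(.*\), in square brackets.
# The group first grabs the entire line, then gives characters back until
# the rest of the pattern, .*(\1), can match as well.
captured=$(printf '%s\n' "$line" | sed 's/\(.*\).*(\1).*/[\1]/')

echo "$captured"   # prints [foo]
```

The group keeps only "foo" because that is the only capture for which "(" + capture + ")" occurs later on the line; the second .* then takes the spaces in between.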


I thought of a case where it would not work correctly. If we have

foobar   (foo)

Then it would still be counted, so perhaps we would need something like:

grep -c '^ *\(.*\) .*(\1)' infile

if only spaces are used to separate the fields...
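
A quick way to compare the two patterns is to run both on the sample data plus the problem line. The sketch below also tries an awk variant, since awk was mentioned in the question; that awk one-liner is my own assumption, not something posted above, and it relies on every line having exactly two whitespace-separated fields. The temporary file and the variable names are made up for the test.

```shell
# Sample data from the thread plus the "foobar (foo)" problem line.
infile=$(mktemp)
printf '%s\n' 'foo      (foo)' 'bar      (bar)' 'thae    (awvd)' 'foobar   (foo)' > "$infile"

# Unanchored pattern: also counts "foobar   (foo)".
loose=$(grep -c '\(.*\).*(\1)' "$infile")

# Anchored pattern: the capture must be the whole first word.
strict=$(grep -c '^ *\(.*\) .*(\1)' "$infile")

# awk variant (assumption: exactly two fields per line).
# Counts lines whose second field equals the first field wrapped in brackets.
awk_count=$(awk '$2 == ("(" $1 ")") { n++ } END { print n + 0 }' "$infile")

echo "$loose $strict $awk_count"   # prints 3 2 2
rm -f "$infile"
```

The awk version compares whole fields, so it does not depend on backtracking and treats tabs and spaces alike as separators.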