grep backreferencing question

Hello,
My input would be:

###Anything
   int b,c,a;
int    a,b,b;
###Anything
  int c,d,c;
int k,l;
###ANYTHING

Many declarations interspersed with other statements. I am trying to find only the declarations where a line has a variable declared more than once.

The output for the above would be:

int a,b,b;
  int c,d,c;

I used grep '^[ ]*int[ ]*[a-z][a-z0-9]*\(,[a-z][a-z0-9]*\)\{0,\};$' to match all declarations, but I am not able to make the regex remember a variable and match it when it occurs later. My output just catches all the declaration statements.

Please help.

Thanks,
Prasanna

---------- Post updated at 12:55 PM ---------- Previous update was at 12:51 PM ----------

To add, I am only using grep to do this. I have done this before, but I don't remember. I am sure it's possible with grep with a little tweak to the regex and the backreferencing.

The problem is that grep is always greedy. So I can make a regex like

echo "a,b,c,d,c,b,a,c" | egrep -o "([a-z]+)(,[a-z]+)*"

...and it will match the whole string, but when I start trying to use backreferences, the first ([a-z]+) will only ever match the very first variable: It will never skip past it and try other combinations when the backreference fails. There's no way to make grep non-greedy, either. Perl regexes support this though.

Your grep is "too anchored" and your regex visualization is too wild. There is no back referencing in regex, just iteratively forward testing: '.*' means try remainder of pattern at every following byte.

A line containing the word int and later a semicolon should not have any variable-legal word repeated between them. Every variable name in C must start with a letter, the rest of the name can consist of letters, numbers and underscore characters. Commas are not variable-legal words, so you can ignore them -- classic excess information problem.

Deal with white space using \< and \> or a similar word boundary, so you avoid substrings but do not get tangled in the whole comma, space, tab thing. Some greps do not honor '\<' and '\>', so you may need sed or '\b'.

Regex Tutorial - \b Word Boundaries

If you get desperate, add spaces by commas and semicolon so you can look for space or tab [ \t]. If you need to restore the original, sed has a hold space h/g command pair.
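The padding idea above can be sketched roughly like this. It is only an illustration: the function name find_dups_padded is made up, the input is assumed to use plain spaces (no tabs), and the test for a leading "int" is a simplification of my own.

```shell
# Sketch: pad commas and semicolons with spaces, test the padded copy,
# then print the untouched original that was saved in the hold space.
# Assumption: lines are indented with spaces only, never tabs.
find_dups_padded() {
    sed -n '
        /^ *int /!d
        h
        s/[,;]/ & /g
        / \([a-zA-Z][a-zA-Z0-9_]*\) .* \1 /{
            g
            p
        }
    '
}

find_dups_padded <<'EOF'
###Anything
   int b,c,a;
int    a,b,b;
###Anything
  int c,d,c;
int k,l;
###ANYTHING
EOF
```

On the sample input this should print the "int    a,b,b;" and "  int c,d,c;" lines with their original spacing, because h saves the line before s/// pads it and g brings it back before p prints.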

Narrative: grep for a line with the free standing word 'int', and
 later on that line for every C variable name as a free standing word somewhere,
  see if we have that same C variable name as a free standing word later anywhere,
 and yet later on that line a semicolon.

grep '\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;'
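To try that pattern against the sample input from the first post, something like the following should do. It assumes a grep that understands \< and \> in basic regular expressions (GNU grep does); the wrapper name find_dup_decls is just for illustration.

```shell
# Assumes GNU grep, where \< and \> mark the start and end of a word.
find_dup_decls() {
    grep '\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;'
}

find_dup_decls <<'EOF'
###Anything
   int b,c,a;
int    a,b,b;
###Anything
  int c,d,c;
int k,l;
###ANYTHING
EOF
```

Only the two lines with a repeated name ("int    a,b,b;" and "  int c,d,c;") should come out, leading white space included.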

Non-greedy match( i.e. *? ) in perl:

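#!/usr/bin/perl
use strict;
use warnings;

my $var = "a,b,c,d,c";

# ([a-z]+) captures a name, (,[a-z]+)*? lazily skips names, ,\1 wants the captured name again
if ($var =~ /([a-z]+)(,[a-z]+)*?,\1/)
{
        print "Match\n";
}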

---------- Post updated at 11:30 AM ---------- Previous update was at 11:29 AM ----------

There's definitely backreferencing in egrep.

grep 'int .*\([^,][^,]*\),.*\1[,\;]' infile
int a,b,b;
int c,d,c;

Oh, that \(\) \1 bit, nothing back about it, the first collects and the second applies. Who defined this silly term, Backus-Naur?

Well, when you slip into egrep/grep -E, the rules shift, which is one reason I use sed in complex egrep situations. It was a trustworthy pal until someone redefined regex '\<', arrogant little beasts!

---------- Post updated at 01:58 PM ---------- Previous update was at 01:54 PM ----------

$ sed '/\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;/!d'  <<!
###Anything
int b,c,a;
int a,b,b;
###Anything
int c,d,c;
int k,l;
###ANYTHING
 
Many declarations interspersed with other statements. I am trying to find only the declarations where a line has a variable declared more than once.
 
!
int a,b,b;
int c,d,c;
$

---------- Post updated at 01:58 PM ---------- Previous update was at 01:58 PM ----------

GNU sed, older regex lib.

---------- Post updated at 02:01 PM ---------- Previous update was at 01:58 PM ----------

Would you like line numbers?

$ sed '
/\<int\>.*\<\([a-zA-Z][a-zA-Z0-9_]*\)\>.*\<\1\>.*;/!d
=
'  <<!
###Anything
int b,c,a;
int a,b,b;
###Anything
int c,d,c;
int k,l;
###ANYTHING
 
Many declarations interspersed with other statements. I am trying to find only the declarations where a line has a variable declared more than once.
 
!
3
int a,b,b;
5
int c,d,c;
$

Thanks a lot. The only problem with the above is that it also matches illegal declarations,

like int a,b,,b; and int a,b,b,;

---------- Post updated at 02:20 PM ---------- Previous update was at 02:19 PM ----------

Thanks a lot. This one matches illegal declarations too,

like int a,b,,b; and int a,b,b,;

Well, C legality is a whole lot more than a grep! Is cc or CC making it hard to decipher?

I recommend never declaring more than 1 variable on a line. It facilitates diff and diff3 use. It gives you a distinct line number for every variable in error.

It works even if I remove all the \<\> from the above code. Can you explain how?

That wasn't part of your problem statement, was it? Anyway,

grep -E 'int .*([^,]+),.*\1[,\;]|int .*,[,\;]' infile

Without the \<\>, b matches bb.
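A quick way to see the difference, as a sketch: the names loose and strict are made up here, and the strict version assumes GNU grep for \< and \>.

```shell
sample=$(printf '%s\n' 'int ab,b;' 'int a,b,b;')

# No word boundaries: the "b" at the end of "ab" counts as a first
# occurrence, so both lines are reported.
loose() { grep 'int .*\([^,][^,]*\),.*\1[,;]'; }

# Word boundaries (GNU grep): only whole names are compared,
# so only the second line is reported.
strict() { grep '\<int\>.*\<\([a-z][a-z0-9]*\)\>.*\<\1\>.*;'; }

echo "loose:";  printf '%s\n' "$sample" | loose
echo "strict:"; printf '%s\n' "$sample" | strict
```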

Thanks.

---------- Post updated at 02:59 PM ---------- Previous update was at 02:56 PM ----------

Oh, sorry. Yes, I did not state it initially.

One more question: the .* makes it match anything. But can I match only valid declarations? That is, each variable should start with a letter and may contain only letters and digits. Let's assume lowercase only. How do you tweak the above to check for that?

Thanks a lot.

---------- Post updated at 03:01 PM ---------- Previous update was at 02:59 PM ----------

Thanks.
So, can we not check for illegal declarations using the regex?

Hi, no, the .* will not match just anything, since it is embedded in context. What is your additional question? I think your question about back references is answered. If so, it may be better to start a new topic.

Wrong comes in so many flavors, like integral calculus. Even right has a lot of flavors. First there are the datatypes, then the commas, the initializations, typedefs, unions, structs, Classes, code, continued lines, includes, macros.

Why not cc -c ? I hope it knows enough to catch that. There may even be C verification tools out there. Ever try lint?

To be clear:

int a123,b,c; --- legal
int 12a,b; ---- illegal since the variable starts with a number
int A,b,c; ---- illegal since the variable starts with [A-Z]

A few valid lines that should match (only when a variable occurs more than once):

int a,b,a;
int a,b,b;
int a,b,c,a,g;
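One possible way to get both checks with plain grep is to split the job in two: first keep only the lines that look like a valid declaration, then look for a repeated name among those. This is a sketch under the rules stated above (lowercase names, letter first, then letters or digits) and it further assumes no spaces around the commas, as in the sample input. The function name valid_dup_decls is made up. Because stage 1 guarantees every name is fenced by a space or comma on the left and a comma or semicolon on the right, stage 2 can do without \< and \>.

```shell
# Stage 1: the whole line must be "int", white space, then names separated
#          by single commas, then ";" (so "a,b,,b" and "a,b,b," are dropped).
# Stage 2: some name, delimited on both sides, must show up again later.
valid_dup_decls() {
    grep '^[[:space:]]*int[[:space:]][[:space:]]*[a-z][a-z0-9]*\(,[a-z][a-z0-9]*\)*;[[:space:]]*$' |
    grep '[[:space:],]\([a-z][a-z0-9]*\),\(.*,\)*\1[,;]'
}

valid_dup_decls <<'EOF'
int a123,b,c;
int 12a,b;
int A,b,c;
int a,b,a;
int a,b,,b;
int a,b,b,;
int aa;
int a,b,c,a,g;
EOF
```

With that input only "int a,b,a;" and "int a,b,c,a,g;" should be printed. It says nothing about C keywords, other types, or initializers.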

---------- Post updated at 03:12 PM ---------- Previous update was at 03:07 PM ----------

Oh.. But it matches something like:

int aa; --- which should not match.

@DGPickett, The \< , \> are GNU-only, no?

If we have \(a\(b\)\)

Which is \1, which is \2?

First opened is first numbered, but try it and see.

sed 's/\(\(.\).\).*/\1 \2/' <<!
abcdefg
!
ab a


Yes, first is \1.

Thanks.

According to the POSIX specification there is no back reference in ERE, while there is back reference in BRE. Compare:
9.3.6 BREs Matching Multiple Characters
9.4.6 EREs Matching Multiple Characters
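A small illustration of that difference, with made-up function names. The BRE form is the one POSIX defines; the -E form is accepted by GNU grep as an extension, so treat the second command as an assumption about your grep.

```shell
# BRE: \(ab\) captures "ab" and \1 requires the same text again.
bre_repeat() { grep '\(ab\)\1'; }

# ERE: (ab)\1 works in GNU grep -E, but POSIX does not define it.
ere_repeat() { grep -E '(ab)\1'; }

printf '%s\n' 'abab' 'abcd' | bre_repeat
printf '%s\n' 'abab' 'abcd' | ere_repeat
```

With GNU grep both commands should print only "abab".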