Regex - Capturing groups

ysvsr1 · November 13, 2013, 11:02am

I am having trouble with regex capturing groups. For example:

I have a file containing:

ABC  CDLF SFSDFK PRIMARY INDEX(XYZ,DEF,GHI);
XYZ   FLJ SDFKLD; PRIMARY INDEX(ABC);
BHI    SDKFLFLSFD  PRIMARY INDEX (QWE , RTY , LHJ);

My output should be:

ABC XYZ,DEF,GHI
XYZ ABC
BHI QWE,RTY,LHJ

I am able to do the regex match, but I am not sure how to capture only a portion of the match.

gawk -v RS=";" '{
match($0,"PRIMARY[ ]+INDEX[ ]*[A-Za-z_ ]*[(]+",arr)
print arr[0]
}' tmp2.txt

I appreciate your help!

Yoda · November 13, 2013, 11:29am

Use the RSTART and RLENGTH variables set by the match function:

gawk '
        {
                match ( $0, /\(.*\)/ )
                print $1, substr ( $0, RSTART + 1, RLENGTH - 2 )
        }
' file

Or use the gsub function:

awk '{ gsub(/[ ].*\(|\).*/," ")}1' file

CarloM · November 13, 2013, 12:09pm

Basically you need to define each part that you want as a sub-expression using parentheses. Assuming the expression matched, you'll get the matching parts in arr[1], arr[2], etc. (and the whole match in arr[0], plus position and length entries such as arr[0, "start"] and arr[0, "length"]).

Also note that the three-argument form of match, which fills an array like this, is specific to GNU awk (gawk).

Something to play with:

$ awk '{match ($0, /INDEX[ ]*\(([^\,]*)\,*([^\,]*)*/, arr); for (i in arr) {printf "%s:%s - [%s]\n", NR, i, arr[i]}}' x.txt
1:0start - [26]
1:0length - [13]
1:1start - [32]
1:2start - [36]
1:0 - [INDEX(XYZ,DEF]
1:1 - [XYZ]
1:2 - [DEF]
1:2length - [3]
1:1length - [3]
2:0start - [27]
2:0length - [11]
2:1start - [33]
2:2start - [38]
2:0 - [INDEX(ABC);]
2:1 - [ABC);]
2:2 - []
2:2length - [0]
2:1length - [5]
3:0start - [28]
3:0length - [17]
3:1start - [35]
3:2start - [40]
3:0 - [INDEX (QWE , RTY ]
3:1 - [QWE ]
3:2 - [ RTY ]
3:2length - [5]
3:1length - [4]

(Line 2 match values indicate that the regex needs work :))
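
One way to tighten the regex, as a minimal sketch: let the group run up to the closing parenthesis with [^)]* instead of stopping at a comma, so a single-column index such as Line 2 no longer drags ");" into the capture. This assumes gawk (for the array argument to match), reads the file line by line rather than with RS=";", and strips blanks from the captured list. The helper name capture_index is made up for illustration.

```shell
#!/bin/sh
# Sketch only: capture_index is a hypothetical helper, not from the thread.
# Requires gawk, because the third argument to match() is a gawk extension.
capture_index() {
    gawk '
    # arr[1] receives the text matched by the parenthesised group ([^)]*)
    match($0, /PRIMARY[ ]+INDEX[ ]*\(([^)]*)\)/, arr) {
        cols = arr[1]
        gsub(/[ \t]+/, "", cols)   # drop blanks inside the column list
        print $1, cols
    }'
}

# Sample lines taken from the first post.
capture_index <<'EOF'
ABC  CDLF SFSDFK PRIMARY INDEX(XYZ,DEF,GHI);
XYZ   FLJ SDFKLD; PRIMARY INDEX(ABC);
BHI    SDKFLFLSFD  PRIMARY INDEX (QWE , RTY , LHJ);
EOF
```

Lines without a PRIMARY INDEX clause are skipped, because match is used as the pattern that guards the action.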

greet_sed · November 13, 2013, 12:45pm

In perl:

perl -ne 'print "$1 $2\n" if /(\w+)\s(?:.*)\((.*)\);$/' inputfile

protocomm · November 13, 2013, 12:59pm

A barbarian method:

cat file | while read line
do var=$(echo $line | awk '{print $1}')
var1=$(echo $line | awk -F[\(\)] '{print $2}')
echo $var $var1
done

Yoda · November 13, 2013, 1:24pm

It is indeed a barbarian method!

You are using awk and a useless use of cat (UUOC), but all of this can be done with shell builtins:

#!/bin/ksh

while read v1 rest
do
        rest="${rest##*\(}"
        print "$v1 ${rest%\)*}"
done < file
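
For shells without the ksh print builtin, the same parameter-expansion idea can be sketched in POSIX sh with printf. The function name extract_index is made up for illustration, and the sample lines are the ones from the first post. Blanks inside the parentheses are kept as they are, so this does not yet normalise the third line.

```shell
#!/bin/sh
# Sketch only: extract_index is a hypothetical helper, not from the thread.
extract_index() {
    # read puts the first word in name and the remainder of the line in rest
    while read -r name rest; do
        cols=${rest##*\(}   # remove everything up to and including the last "("
        cols=${cols%\)*}    # remove everything from the last ")" onward
        printf '%s %s\n' "$name" "$cols"
    done
}

extract_index <<'EOF'
ABC  CDLF SFSDFK PRIMARY INDEX(XYZ,DEF,GHI);
XYZ   FLJ SDFKLD; PRIMARY INDEX(ABC);
BHI    SDKFLFLSFD  PRIMARY INDEX (QWE , RTY , LHJ);
EOF
# prints:
# ABC XYZ,DEF,GHI
# XYZ ABC
# BHI QWE , RTY , LHJ
```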

ysvsr1 · November 13, 2013, 3:11pm

Another Barbarian Method :

gawk -v RS=";" '{
if($0~/PRIMARY[ ]+INDEX/)
{
match($0,"PRIMARY[ ]+INDEX[ ]*[A-Za-z_ ]*[(]+[ A-Za-z_,]+",arr)
a=RSTART 
b=RLENGTH 
match($0,"PRIMARY[ ]+INDEX[ ]*[A-Za-z_ ]*[(]+",xyz)
c=RSTART 
d=RLENGTH 
e=b-d
f=a+d
l=index($1,".")
m=substr($1,l+1)
print m "  " substr($0,f,e) >> "pi_tmp.txt"
}
}' tmp2.txt

perl -ne 'print "$1 $2\n" if /(\w+)\s(?:.*)\((.*)\);$/' inputfile

Wow, impressive! Can you explain how it works? Thank you!

greet_sed · November 13, 2013, 3:31pm

perl -ne 'print "$1 $2\n" if /(\w+)\s(?:.*)\((.*)\);$/' inputfile

-n wraps the code in an implicit loop over the input lines, without printing them automatically:
while (<>) {
# do something
}

-e - the perl code is given on the command line

\w - matches a word character, i.e. alphanumeric or underscore
\s - matches a single whitespace character (space, tab, newline)
(\w+) - capturing group: one or more word characters, available as $1
(?: ) - groups without capturing, so it does not take up $2
\( - matches a literal (
(.*) - capturing group: any run of characters, available as $2
\) - matches a literal )
;$ - matches a ; at the end of the line
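
To see the two capture groups at work on the sample data, here is a small sketch. It is a variant of the one-liner explained above, not the original: the regex is anchored at the start of the line, the second group uses [^)]* instead of .*, and whitespace is removed from the captured list so the third line comes out as QWE,RTY,LHJ. The wrapper name perl_index is made up for illustration.

```shell
#!/bin/sh
# Sketch only: perl_index is a hypothetical wrapper, not from the thread.
perl_index() {
    perl -ne '
        # $1 = first word of the line, $2 = text between the parentheses
        if (/^(\w+)\s.*\(([^)]*)\)/) {
            # copy the captures first: a later successful match resets $1 and $2
            my ($name, $cols) = ($1, $2);
            $cols =~ s/\s+//g;          # strip whitespace inside the list
            print "$name $cols\n";
        }'
}

perl_index <<'EOF'
ABC  CDLF SFSDFK PRIMARY INDEX(XYZ,DEF,GHI);
XYZ   FLJ SDFKLD; PRIMARY INDEX(ABC);
BHI    SDKFLFLSFD  PRIMARY INDEX (QWE , RTY , LHJ);
EOF
```

Saving $1 and $2 into variables before the substitution matters here, because the substitution is itself a pattern match and would otherwise overwrite them.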

RavinderSingh13 · November 13, 2013, 5:13pm

Hello,

Here is one more approach that may help. Let us say the file name is check_column_bracket.

awk '{
n=split($NF,array,"[(.*]");
print $1" "array[n]}' check_column_bracket | awk -vs=");" 'gsub(s,X)'

The output will be as follows.

ABC XYZ,DEF,GHI
XYZ ABC
BHI LHJ

Thanks,
R. Singh

Yoda · November 13, 2013, 5:29pm

But the OP's desired output is:

ABC XYZ,DEF,GHI
XYZ ABC
BHI QWE,RTY,LHJ

By the way, it seems to me that the OP is not interested in an awk/shell solution and is satisfied with the perl-based one.

RavinderSingh13 · March 21, 2014, 1:30am

Hello,

Here is a solution that produces the requested output.

awk '{match($0,/\(.*\)/); check=substr($0,RSTART+1,RLENGTH-2); gsub(/[[:space:]]/,X,check); {print $1 OFS check}}'  file_name

Output is as follows:

ABC XYZ,DEF,GHI
XYZ ABC
BHI QWE,RTY,LHJ

Thanks,
R. Singh