ysvsr1
November 13, 2013, 11:02am
1
I am having trouble with regex capturing groups. For example,
I have a file containing:
ABC CDLF SFSDFK PRIMARY INDEX(XYZ,DEF,GHI);
XYZ FLJ SDFKLD; PRIMARY INDEX(ABC);
BHI SDKFLFLSFD PRIMARY INDEX (QWE , RTY , LHJ);
My output should be :
ABC XYZ,DEF,GHI
XYZ ABC
BHI QWE,RTY,LHJ
I am able to do the regex match, but I am not sure how to capture only a portion of the match.
gawk -v RS=";" '{
match($0,"PRIMARY[ ]+INDEX[ ]*[A-Za-z_ ]*[(]+",arr)
print arr[0]
}' tmp2.txt
Appreciate your help !
Yoda
November 13, 2013, 11:29am
2
Use the RSTART and RLENGTH variables set by the match function:
gawk '
{
match ( $0, /\(.*\)/ )
print $1, substr ( $0, RSTART + 1, RLENGTH - 2 )
}
' file
Or use the gsub function:
awk '{ gsub(/[ ].*\(|\).*/," ")}1' file
CarloM
November 13, 2013, 12:09pm
3
Basically, you need to define each part that you want as a sub-expression using parentheses. Assuming the expression matched, you'll get the matching parts in arr[1], arr[2], etc. (and the whole match in arr[0], plus per-group start and length entries comparable to RSTART and RLENGTH).
Also note that match filling an array like this is specific to GNU awk (gawk).
Something to play with:
$ awk '{match ($0, /INDEX[ ]*\(([^\,]*)\,*([^\,]*)*/, arr); for (i in arr) {printf "%s:%s - [%s]\n", NR, i, arr[i]}}' x.txt
1:0start - [26]
1:0length - [13]
1:1start - [32]
1:2start - [36]
1:0 - [INDEX(XYZ,DEF]
1:1 - [XYZ]
1:2 - [DEF]
1:2length - [3]
1:1length - [3]
2:0start - [27]
2:0length - [11]
2:1start - [33]
2:2start - [38]
2:0 - [INDEX(ABC);]
2:1 - [ABC);]
2:2 - []
2:2length - [0]
2:1length - [5]
3:0start - [28]
3:0length - [17]
3:1start - [35]
3:2start - [40]
3:0 - [INDEX (QWE , RTY ]
3:1 - [QWE ]
3:2 - [ RTY ]
3:2length - [5]
3:1length - [4]
(Line 2 match values indicate that the regex needs work :))
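One way to tighten it, as a minimal sketch rather than a definitive fix: capture everything between the parentheses in a single group and stop at the first closing parenthesis. The function name extract_index is made up for illustration, and the sketch assumes one statement per input line (default RS), as in the sample file.

```shell
# Hypothetical helper: print "<first word> <index columns>" for each line.
# Requires GNU awk, because the three-argument match() is a gawk extension.
extract_index() {
    gawk '
    # [^)]* stops at the first ")", so group 1 holds only the column list.
    match($0, /PRIMARY[ ]+INDEX[ ]*\(([^)]*)\)/, arr) {
        cols = arr[1]
        gsub(/[ \t]+/, "", cols)   # drop blanks: "QWE , RTY" becomes "QWE,RTY"
        print $1, cols
    }' "$@"
}
```

Called as extract_index tmp2.txt, this should print the first word and the comma-separated column list for every line that contains PRIMARY INDEX, and skip lines that do not.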
In perl:
perl -ne 'print "$1 $2\n" if /(\w+)\s(?:.*)\((.*)\);$/' inputfile
A barbarian method:
cat file | while read line
do var=$(echo $line | awk '{print $1}')
var1=$(echo $line | awk -F[\(\)] '{print $2}')
echo $var $var1
done
Yoda
November 13, 2013, 1:24pm
6
It is indeed a barbarian method!
You are using awk and a UUOC (useless use of cat), but all of this can be done using shell builtins:
#!/bin/ksh
while read v1 rest
do
rest="${rest##*\(}"
print "$v1 ${rest%\)*}"
done < file
ysvsr1
November 13, 2013, 3:11pm
7
Another Barbarian Method :
gawk -v RS=";" '{
if($0~/PRIMARY[ ]+INDEX/)
{
match($0,"PRIMARY[ ]+INDEX[ ]*[A-Za-z_ ]*[(]+[ A-Za-z_,]+",arr)
a=RSTART
b=RLENGTH
match($0,"PRIMARY[ ]+INDEX[ ]*[A-Za-z_ ]*[(]+",xyz)
c=RSTART
d=RLENGTH
e=b-d
f=a+d
l=index($1,".")
m=substr($1,l+1)
print m " " substr($0,f,e) >> "pi_tmp.txt"
}
}' tmp2.txt
perl -ne 'print "$1 $2\n" if /(\w+)\s(?:.*)\((.*)\);$/' inputfile
Wow !!! Impressive , Can you explain how it is done ?? Thank you !!
perl -ne 'print "$1 $2\n" if /(\w+)\s(?:.*)\((.*)\);$/' inputfile
-n wraps the code in a loop over the input lines, like the while loop below, without printing each line automatically
while (<>) {
# do something
}
-e - the perl code is given on the command line
\w - matches a word character, i.e. alphanumeric or underscore
\s - matches a whitespace character (space, tab, etc.)
(\w+) - capturing group; its match is available as $1
(?: ) - grouping without capturing, so it does not become $2
\( - matches a literal (
(.*) - capturing group matching any run of characters; available as $2
\) - matches a literal )
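For anyone without perl at hand, the same two capturing groups can be sketched with sed -E. This is only an illustration of the pieces explained above, not what the one-liner's author wrote: the function name first_and_index is hypothetical, [[:alnum:]_] stands in for \w, [[:space:]] for \s, and the pattern is anchored at the start of the line.

```shell
# Hypothetical helper mirroring the perl one-liner with sed -E.
# \1 is the leading word, \2 is the text inside the last pair of parentheses.
# -n with the p flag prints only the lines where the substitution matched.
first_and_index() {
    sed -nE 's/^([[:alnum:]_]+)[[:space:]].*\((.*)\);$/\1 \2/p' "$@"
}
```

Like the perl version, this keeps any blanks inside the parentheses, so the third sample line comes out as "BHI QWE , RTY , LHJ".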
Hello,
here is one more approach that may help. Let us say the file name is check_column_bracket.
awk '{
n=split($NF,array,"[(.*]");
print $1" "array[n]}' check_column_bracket | awk -vs=");" 'gsub(s,X)'
Output will be as follows.
ABC XYZ,DEF,GHI
XYZ ABC
BHI LHJ
Thanks,
R. Singh
Yoda
November 13, 2013, 5:29pm
10
ravindersingh13:
Output will be as follows.
ABC XYZ,DEF,GHI
XYZ ABC
BHI LHJ
But OP's desired output is:
ABC XYZ,DEF,GHI
XYZ ABC
BHI QWE,RTY,LHJ
By the way, it seems to me that the OP is not interested in an awk/shell solution and is satisfied with the perl-based one.
Hello,
Here is a solution that produces the requested output.
awk '{match($0,/\(.*\)/); check=substr($0,RSTART+1,RLENGTH-2); gsub(/[[:space:]]/,X,check); {print $1 OFS check}}' file_name
Output is as follows:
ABC XYZ,DEF,GHI
XYZ ABC
BHI QWE,RTY,LHJ
Thanks,
R. Singh