Pattern Match & Extract from a string

karumudi7 · January 24, 2013, 1:16pm

Hi,

I have a long string in the 2nd field, as shown below:

 
REF1 | CLESCLJSCSHSCSMSCSNSCSRSCUDSCUFSCU7SCV1SCWPSCXGPDBACAPA0DHDPDMESED6 
REF2 | SBR4PCBFPCDRSCSCG3SCHEBSCKNSCKPSCLLSCMCZXTNPCVFPCV6P4KL0DMDSDSASEWG

I have a group of fixed patterns that can occur in these long strings, and only one pattern will occur per record. I will maintain all possible patterns in a file called Patterns.txt:

APA 
APC 
DFH 
CZX

E.g., in the first record APA occurred and in the second record CZX occurred, and the two occurred at different positions.

Expected output:

 
REF1 | APA
REF2 | CZX

Thanks

Yoda · January 24, 2013, 1:55pm

Here is one way of doing it:

while read p
do
    awk -F\| -v P="$p" '{if(match($2,P)>0) print $1,substr($2,RSTART,RLENGTH); }' OFS=\| filename
done < Patterns.txt

Scrutinizer · January 24, 2013, 2:00pm

awk 'NR==FNR{P[$1]; next}{for(i in P) if($3~i) {print $1,$2,i; next}}' file2 file1

karumudi7 · January 24, 2013, 2:02pm

Thanks Bipin, I got the desired output. If possible, can you please explain how this code works?

Corona688 · January 24, 2013, 2:05pm

The while-loop simply reads every line in patterns.txt into the p variable, one by one.

awk
# Use | as the input separator
        -F\|
# Set the P variable inside awk to the value of the shell variable $p
        -v P="$p"
# For each line, check if the second token matches the variable P
# If it does, print the first token, and the subsection of the second
# token that matched.
# RSTART and RLENGTH are automatic variables set by match.
        '{if(match($2,P)>0) print $1,substr($2,RSTART,RLENGTH); }'
# Use | as the output separator
        OFS=\|
# Read from filename
        filename

Yoda · January 24, 2013, 2:09pm

Even though my approach works, I recommend using Scrutinizer's approach because it will be much faster: it runs awk once, instead of starting a new awk process and rereading the input file for every pattern in the while loop.
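For anyone who wants to try the single-pass form on the sample from the first post, here is a minimal sketch. The scratch directory and the file names data.txt and Patterns.txt inside it are assumptions for the demo, and index() is used instead of the ~ operator so the patterns are treated as fixed strings; that is a variation on Scrutinizer's one-liner, not the original.

```shell
# Scratch directory and file names are assumptions for this demo.
dir=$(mktemp -d)
printf '%s\n' \
  'REF1 | CLESCLJSCSHSCSMSCSNSCSRSCUDSCUFSCU7SCV1SCWPSCXGPDBACAPA0DHDPDMESED6' \
  'REF2 | SBR4PCBFPCDRSCSCG3SCHEBSCKNSCKPSCLLSCMCZXTNPCVFPCV6P4KL0DMDSDSASEWG' \
  > "$dir/data.txt"
printf '%s\n' APA APC DFH CZX > "$dir/Patterns.txt"

# First file: remember every pattern (default FS drops surrounding blanks).
# Second file: print the first pattern found in field 3, then move on.
out=$(awk 'NR == FNR { P[$1]; next }
           { for (p in P) if (index($3, p)) { print $1, $2, p; next } }' \
      "$dir/Patterns.txt" "$dir/data.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

Each file is read exactly once and only one awk process is started, no matter how many patterns Patterns.txt holds.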

karumudi7 · January 24, 2013, 3:03pm

To avoid the dependency on Patterns.txt, I just want to calculate the required output directly from the data:

Sample data:

REF 1 | BADSBCESBCSSBNUSBR4PCBFPCDRSCF3SCGDSCG3SCHEPCKBSCKN DMDSDSASEWG SGTKSGXWSGX4SHABSHGASJACPJATSJAV NSPCC QCCSRA4SRCA RDHSRDLSR
REF 2 | APASBABSBCSSBC2SBNESBNGPBNPPBNSPBNTPBRFSCAKSCDCSNHMSPXR QXRSRA2SRCGSRCDFH DHDPDMESED6 GAMSGFASG

Desired output :

REF 1|PCC|QCC|EWG
REF 2|PXR|QXR|ED6

The three rules to extract the data are:
(i) The second field is based on the occurrence of a "P" three characters to the left of a space; take 3 characters starting at that "P".
(ii) The third field is based on the occurrence of a "Q" immediately after a space; take 3 characters starting at that "Q".
(iii) The fourth field is based on the occurrence of an "E" three characters to the left of a space; take 3 characters starting at that "E".
That is, the second, third, and fourth fields of the output are always exactly 3 characters long.
Any ideas on how to implement this?
Thanks in advance.

Don_Cragun · January 24, 2013, 3:05pm

Hi Karumudi7,
Note that although the scripts provided by bipinajith and Scrutinizer both do what you want, neither of them does what you asked for. The contents you gave us for Patterns.txt in the 1st message in this thread have a trailing <space> character at the end of each of the first three lines. And, the first line of your input file does not contain "APA " in the last field.

The script bipinajith provided strips the trailing spaces by using the default value of IFS while reading lines from Patterns.txt. The awk script Scrutinizer provided strips the trailing space by using the default field separator in awk.

I was working on an awk script similar to Scrutinizer's script, but I was using FS = " [|] " to simplify the output line. It took me a while to realize that my script was failing due to the trailing spaces in your list of patterns.
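A sketch of how that failure can be reproduced and worked around, assuming Patterns.txt keeps its trailing spaces and the data sits in a hypothetical data.txt. With FS set to " [|] ", a pattern line such as "APA " is a single field that still carries its trailing space, so it has to be trimmed explicitly. This is one possible fix, not the script referred to above.

```shell
# Scratch directory and file names are assumptions for this demo.
dir=$(mktemp -d)
printf '%s\n' \
  'REF1 | CLESCLJSCSHSCSMSCSNSCSRSCUDSCUFSCU7SCV1SCWPSCXGPDBACAPA0DHDPDMESED6' \
  'REF2 | SBR4PCBFPCDRSCSCG3SCHEBSCKNSCKPSCLLSCMCZXTNPCVFPCV6P4KL0DMDSDSASEWG' \
  > "$dir/data.txt"
# The first three patterns carry a trailing space, as in the first post.
printf '%s\n' 'APA ' 'APC ' 'DFH ' 'CZX' > "$dir/Patterns.txt"

# Without trimming, "APA " is never found, so REF1 silently disappears.
bad=$(awk 'BEGIN { FS = " [|] "; OFS = " | " }
           NR == FNR { P[$0]; next }
           { for (p in P) if (index($2, p)) { print $1, p; next } }' \
      "$dir/Patterns.txt" "$dir/data.txt")

# Trimming trailing blanks from each pattern line restores the match.
out=$(awk 'BEGIN { FS = " [|] "; OFS = " | " }
           NR == FNR { sub(/[ \t]+$/, ""); if ($0 != "") P[$0]; next }
           { for (p in P) if (index($2, p)) { print $1, p; next } }' \
      "$dir/Patterns.txt" "$dir/data.txt")

printf 'untrimmed:\n%s\ntrimmed:\n%s\n' "$bad" "$out"
rm -rf "$dir"
```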

karumudi7 · January 24, 2013, 3:18pm

The trailing spaces probably crept in when I copied and pasted the patterns here. Sorry about that.
Because of these constraints, I want to remove the dependency on Patterns.txt, and I updated my question accordingly before your post.

Thanks.

Don_Cragun · January 24, 2013, 3:25pm

For your new problem, try:

awk 'BEGIN {FS = OFS = "|" }
{       match($2, /P.. /)
        f2 = " " substr($2, RSTART, RLENGTH)
        match($2, / Q../)
        f3 = substr($2, RSTART, RLENGTH) " "
        match($2, /E.. /)
        f4 = " " substr($2, RSTART, RLENGTH - 1)
        print $1, f2, f3, f4
}' Sample2.txt

As always, if you're using a Solaris/Sun OS system, use /usr/xpg4/bin/awk or nawk instead of awk.

karumudi7 · January 26, 2013, 4:51am

Thanks, it worked. Can you please explain how it works? This is the first time I am using match(), RSTART, and RLENGTH.

Don_Cragun · January 26, 2013, 5:45am

The awk script:

awk 'BEGIN {FS = OFS = "|" }
{       match($2, /P.. /)
        f2 = " " substr($2, RSTART, RLENGTH)
        match($2, / Q../)
        f3 = substr($2, RSTART, RLENGTH) " "
        match($2, /E.. /)
        f4 = " " substr($2, RSTART, RLENGTH - 1)
        print $1, f2, f3, f4
}' Sample2.txt

starts by setting the input and output field separators (FS and OFS) to the <vertical-line> (or pipe) character. So, with your sample data, the 2nd field in each input line always begins with a <space> character.

Each match() call searches the string specified by its first argument (in all three cases this is the 2nd field in an input line) for the extended regular expression (ERE) specified by its 2nd argument, returns the index in that string where the 1st match starts (or 0 if there is no match), and sets RSTART to the same value. If RSTART is not zero, RLENGTH is set to the length of the substring that matches the ERE. The EREs given to these three calls to match() search for a P followed by any two characters followed by a space, for a space followed by a Q followed by any two characters, and for an E followed by any two characters followed by a space, respectively. (Note that with these EREs, three-letter strings found at the end of the line will not be matched since there is no trailing space in those cases; but your requirements explicitly stated that the match was to be to a following space. Note also that if some of your input lines do not have matches for all three of your specified conditions, the results are unspecified and there will be no warning that something didn't match. If this is a concern, you should check the return value of match() and print a diagnostic message if it returns 0.)
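One way that check could look is sketched below. The helper name grab(), the "?" placeholder, the shortened sample lines, and the use of FS = " [|] " with OFS = " | " are all assumptions made for this illustration; they are not part of the script above. The EREs are passed as strings because a regular expression literal cannot be passed portably to a user-defined awk function.

```shell
# Scratch directory, file name and shortened sample lines are assumptions.
dir=$(mktemp -d)
printf '%s\n' \
  'REF 1 | BADS NSPCC QCCSRA4 DMDSDSASEWG SGTK' \
  'REF 2 | APASBABS QXRSRA2 DHDPDMESED6 GAMS' \
  > "$dir/Sample2.txt"

out=$(awk '
# grab(): hypothetical helper that returns the matched text without blanks,
# or "?" plus a diagnostic on standard error when match() returns 0.
function grab(re, what,    s) {
    if (match($2, re) == 0) {
        print "line " NR ": no match for " what | "cat 1>&2"
        return "?"
    }
    s = substr($2, RSTART, RLENGTH)
    gsub(/ /, "", s)
    return s
}
BEGIN { FS = " [|] "; OFS = " | " }
{
    f2 = grab("P.. ", "the P rule")
    f3 = grab(" Q..", "the Q rule")
    f4 = grab("E.. ", "the E rule")
    print $1, f2, f3, f4
}' "$dir/Sample2.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

The second sample line has no P followed by two characters and a space, so it gets the placeholder in its second field and a message on standard error instead of a silently wrong value.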

The following calls to substr() use the values of RSTART and RLENGTH set by match() to extract the desired output fields (with added leading or trailing spaces) to set f2, f3, and f4 to be the desired 2nd, 3rd, and 4th output fields, respectively.

Note that RLENGTH - 1 is used in the last substr() to eliminate the unwanted trailing space that would appear at the end of the line if RLENGTH had been used instead. With all of the EREs used in these match() calls, RLENGTH will always be 4, but I kept RLENGTH and RLENGTH - 1 rather than 4 and 3 in case you later decide to change the EREs to match different strings.
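To see the values involved, here is a small sketch that prints what match() returns and what it leaves in RSTART and RLENGTH. The input string is a shortened, made-up fragment in the style of the sample data; the square brackets are only there to make the trailing space visible.

```shell
out=$(printf '%s\n' ' NSPCC QCCSRA4 DMDSDSASEWG SGTK' |
  awk '{
      n = match($0, /E.. /)
      print n, RSTART, RLENGTH                        # 24 24 4
      print "[" substr($0, RSTART, RLENGTH) "]"       # [EWG ]
      print "[" substr($0, RSTART, RLENGTH - 1) "]"   # [EWG]
      # When nothing matches, RSTART is 0 and RLENGTH is -1.
      print match("abc", /E.. /), RSTART, RLENGTH     # 0 0 -1
  }')
printf '%s\n' "$out"
```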

With OFS set to "|", the print call adds the specified field separators when printing the output lines.