Converting grep to awk

First, I am trying to search for certain string types within a very large file. Here is a sample record:

MyCountry|MyGroup|MyCust|Run-Date|N/A|SERVER|OS|USERID|897//123456//LNAME FNAME|STATE|LLOGON

In field 9 I am looking for invalid formats. A valid format would be 897/C/123456//LNAME FNAME. The leading 897 can be numeric or alphabetic (897 or US; these are country codes).

I am trying to convert a grep regexp search into an awk search, with little success. The reason is that with grep I have to read in the line, test the variable, then print out the whole line. I figured with awk I could do the whole thing in one step and get rid of a slow "while read LINE; do" loop, which makes my script extremely slow. Below are my two examples:
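(For reference, the general shape of such a conversion — a single awk pass that splits on | and tests one field — is sketched below on made-up records; the exact regex is what the rest of this thread works out.)

```shell
# Two hypothetical records; field 9 holds the country//serial//name string.
printf '%s\n' \
  'a|b|c|d|e|f|g|h|897//123456//LNAME FNAME|j|k' \
  'a|b|c|d|e|f|g|h|897/C/123456//LNAME FNAME|j|k' > sample.csv

# One pass: -F'|' splits the fields, the pattern tests field 9 only, and
# the default action prints the whole matching line -- no while-read loop.
awk -F'|' '$9 ~ /^[[:alnum:]][[:alnum:]][[:alnum:]]?\/\//' sample.csv
```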

GREP:

cat SID-Removed.csv | cut -d"|" -f9 |grep -ih '[[:alnum:]]\{2,3\}"//"[[:alnum:]]\{6\}'   (this one finally worked)

Note: I had tried

grep -ih '[[:alnum:]]\{2,3\}//[[:alnum:]]\{6\}'

but it did not produce the desired results

AWK:

awk -F'|' 'BEGIN { search_regex = "[[:alnum:]]\{2,3\}//[[:alnum:]]\{6\}" }tolower($9) ~ search_regex  {print $0}' SID-Removed.csv

I even tried this:

awk -F'|' 'BEGIN { search_regex = "[[:alnum:]]\{2,3\}\/\/[[:alnum:]]\{6\}"  }tolower($9) ~ search_regex  {print $0}' SID-Removed.csv

The awk statement finds 897/C/123456/LNAME FNAME, but it does not find the ones where the second element of that string is blank (897//123456//LNAME FNAME). Can anyone help me figure out what I'm doing wrong?

I'm not sure [[:alnum:]] is supported in awk. Try [0-9a-zA-Z].

I don't think you need to escape the { } with \.

alnum is a POSIX character class; gawk is OK with those. The { } interval is what is not supported by many awks. It is newer, I guess, and gawk didn't want to break existing scripts:

[mute@geek ~]$ echo foo | gawk '/^[[:alnum:]]{3}$/'
[mute@geek ~]$ echo foo | gawk --posix '/^[[:alnum:]]{3}$/'
foo

So, to help more, please tell me which awk and/or OS you use.

awk -F\| '$9 ~ "^[a-zA-Z0-9]{2,3}/[a-zA-Z0-9]?/[a-zA-Z0-9]{6}"' file

I was trying to use [[:alnum:]]\{2,3\} because I want to match any alphanumeric character (0-9, a-z, as you indicated), but only if it finds 2 or 3 of them before the /:

897//123456
US//123456

If I switch to [0-9a-zA-Z], can I still use the count feature from grep where I tell it to only match 2 or 3 characters in that position?

Yes you can...
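As a quick check (using grep -E here, since POSIX guarantees interval support in ERE; the data lines are made up):

```shell
# {2,3} bounds how many bracket-class characters must appear before the
# first slash; ^ anchors it so a 1-character code cannot match further in.
printf '%s\n' '897//123456' 'US//123456' 'A//123456' | grep -E '^[0-9a-zA-Z]{2,3}//'
```

The first two lines print; the single-character code is rejected.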

Here is a full code snippet of what I'm trying to convert. Note that I am currently using grep in a while loop, and reading through a file with millions of records makes this take quite a long time to complete:

while read LINE
do
   REC13=`echo $LINE |cut -d"|" -f9 |grep -ih '[[:alnum:]]\{2,3\}"//"[[:alnum:]]\{6\}'`
   if [ -n "$REC13" ]
   then
        echo $LINE >> ./$PRVYR/$MONTH/mislabeled/$MONTH-mislabeled.csv
   fi
done < INFILE

This particular check looks for these strings: CCC//SSSSSS or CC//SSSSSS

My goal is to try and convert this into an awk command.

I see. No wonder you wanted to convert it :)

Fortunately, you can convert {2,3} to more portable syntax. Just write the atom three times and make the third one optional with a ? after it, and repeat the {6} atom six times. Not elegant, but at least efficient.

This could replace the whole loop I think:

awk -F"|" '$9 ~ /[[:alnum:]][[:alnum:]][[:alnum:]]?\/\/[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]/' infile > outfile

Yes reading and processing a line at a time on a file with millions of records is a huge resource waste...as the entire while loop can be replaced by the awk one liner I posted...so give it a try.
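To illustrate on a couple of made-up records (field 9 values borrowed from post #1), the repeated-atom one-liner selects only the line whose field 9 has the blank middle subfield:

```shell
printf '%s\n' \
  'f1|f2|f3|f4|f5|f6|f7|f8|897//123456//LNAME FNAME|f10' \
  'f1|f2|f3|f4|f5|f6|f7|f8|897/C/123456//LNAME FNAME|f10' > infile

# Two-or-three alnums (third atom optional), //, then six alnum atoms:
# the portable spelling of [[:alnum:]]{2,3}//[[:alnum:]]{6}.
awk -F"|" '$9 ~ /[[:alnum:]][[:alnum:]][[:alnum:]]?\/\/[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]/' infile
```

The valid 897/C/ record is not printed: its only // is followed by LNAME, which has just five alphanumerics before the space.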

OK. I made a little progress, but am still stuck. I was able to make my awk work if I use nawk instead of traditional awk. However, it is not matching all my conditions. First, a data set example:

Record1|some text (c-1234, US)|some more stuff
Record2| more text (c-1234, 897)|more stuff
Record3| new stuff (abc234, 897)| extra stuff

When I run this:

while read LINE
do
echo $LINE | nawk -F'|' 'BEGIN {search_regex = "\\([[:alnum:]]\{1\}.[[:alnum:]]\{4\},[[:blank:]][[:alnum:]]\{2,3\}\\)"} tolower($9) ~ search_regex {print $9}'
done < xx

When I run it I get this:

new stuff (abc234, 897)

but I don't get the other two records. How can I get awk to allow any character in that second position, even if it is a dash? As you can see, I have even tried the . notation with no success. Any help resolving this will help me fix about twenty other things I'm trying to work through one at a time. As always, any help is greatly appreciated.

Just as a note, I did also try this outside of the while loop and I get the same results, not sure why I expected something different in the loop.


Shamrock, I did try this and it worked for the one condition, but not the ones where there was a - in the second field (it can be other special characters too). I am working on making each awk a separate statement (like grepping a file, but much faster) without the while loop.

You'll probably have better success if you do not store the regex as a string first; just use tolower($9) ~ /regex/ . In fact, the tolower isn't needed since you're not matching anything case-sensitive.
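A minimal sketch of that advice on the three-line sample above (constant /.../ regex, no string storage; the `.` in second position is an assumption that any character, dash included, is acceptable there):

```shell
# In the /.../ form one backslash per escape is enough; in the string form
# every escape would need doubling ("\\(") because string processing
# consumes a layer of backslashes before the regex engine sees it.
printf '%s\n' \
  'Record1|some text (c-1234, US)|more' \
  'Record3|new stuff (abc234, 897)|extra' |
awk -F'|' '$2 ~ /\([[:alnum:]].[[:alnum:]]+, [[:alnum:]]+\)/ { print $2 }'
```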

I'm confused about what you're doing now, though. First it was about 897/C/123456/LNAME FNAME in field 9; now it's something else in field 2, but your code says $9?

In this latest example:

$ awk --posix -F\| '$2 ~ /\([[:alnum:]].[[:alnum:]]{4},[[:blank:]]*[[:alnum:]]{2,3}\)/ {print $2}' input2
some text (c-1234, US)
 more text (c-1234, 897)
 new stuff (abc234, 897)

and for the name thing you switched between using / and //, I think it'd be: $9 ~ /^[[:alnum:]]{2,3}\/.?\/[[:alnum:]]{6}\// to check those first 2/3 subfields (897, c?, 123456)
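Checked with grep -E (where interval support is standard), that pattern behaves as intended on both forms:

```shell
# Both the fully populated form and the blank-middle form should match:
# .? permits zero or one character between the first two slashes.
printf '%s\n' '897/C/123456//LNAME FNAME' '897//123456//LNAME FNAME' \
  | grep -E '^[[:alnum:]]{2,3}/.?/[[:alnum:]]{6}/'
```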


Apologies, Scott. I stripped out a bunch of the other fields (it will still be field 9), but my test sample is just a couple of fields. Trying to simplify things without copying in tons of data.

...And that's the problem. If you gave us some actual data the first time, we could make something that works right with it the first time.

Go ahead, stretch the browser :P

OK. Let's stretch the browser capability. Here are the actual data sets I need to match. Each is its own "sample" record:

Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|US/C/// US/C///|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|US/C///, US/C///,|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///(abc123 US)|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  (abc123 897) LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  (abc123 US) LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  (abc123, 897) LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  (abc123, US) LBG|Field10|Field11|Field12
Field1|Field2|Field3|2012-01-25||Field6|Field7|Field8|000/C///run (c-ab123, US) jobs.|Field10|Field11|Field12
Field1|Field2|Field3|2012-01-25||Field6|Field7|Field8|000/C///run (c-ab123, 897) jobs.|Field10|Field11|Field12
Field1|Field2|Field3|2012-01-25||Field6|Field7|Field8|000/C///run (c-ab123 US) jobs.|Field10|Field11|Field12
Field1|Field2|Field3|2012-01-25||Field6|Field7|Field8|000/C///run (c-ab123 897) jobs.|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  SN:abc123 897 LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  897//abc123 LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  US//abc123 LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  897//a-1234 LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  US//a-cdef LBG|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///pour  ;abcdef;897 LBG|Field10|Field11|Field12
Field1|Field2|Field3|2012-01-15||Field6|Field7|Field8|649/C//GTAA/ - :abcdef:897 Conver|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///,abcdef,897|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C/// /abcdef897|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C/// /abcdefUS|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|abcdef 897 000/C///|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///SN:abcdef 897|Field10|Field11|Field12
Field1|Field2|Field3|3-13-12||Field6|Field7|Field8|000/C///me@some.domain.com|Field10|Field11|Field12

As you can see, there are quite a few. The ones that are giving me a hard time are the ones that have a - in the field. The ones without a dash I have mostly been able to figure out with the help of this forum, and a few others I'm close on, but I'm getting better at this. I have also been testing the previous suggestion of not storing the regex, to simplify the code.

For that sample, none of the field 9 values are in the valid format as defined in post #1?

That's because post #1 was already solved, and I was trying to work my way down one other, similarly related item without having to start lots of threads with basically just variations. Variation one is a complete valid record. The other ones I'm posting are "invalid" conditions I need to find in a sea of good data.

In (c-ab123, US) , for example, there are 7 characters before the comma, not 6, so could you specify which strings you want to match?

These are the strings I'm trying to match.

(c-ab123, 897)
(c-ab123 897)
897//a-1234
;a-cdef;897
:a-cdef:897
,a-cdef,897
/a-cdef897
a-cdef 897
SN:a-c123 897
SN=a-cdef 897
SN-a-cdef 897

These all appear in field 9 of my data file, which is why my samples all showed me comparing that field.


To be more clear, I'm looking for the

c-ab123, 897

info inside of each string

Actually, that second part of your post makes it less clear, because it seems to contradict the patterns in the first part. So what are you trying to match?

I am specifically looking for the part with the

c-ab123, 897

inside of other string information. However, it can be surrounded by other junk (parens, punctuation, weird formats), which is why I have to look for all the other stuff that users might put in there incorrectly. Basically I'm trying to find the known bad conditions in a monthly set of records that I have to analyze.