In field 9 I am looking for invalid formats. A valid format would be 897/C/123456//LNAME FNAME. The leading element can be numeric or character (897 or US; these are country codes).
I am trying to convert a grep regexp search into an awk search, with little success. The reason is that I have to read in the line with grep, then test the variable, then print out the whole line. I figured that with awk I could do the whole thing in one line and get rid of a slow "while read LINE; do" loop, which makes my script extremely slow. Below are my two line examples:
The awk statement finds 897/C/123456/LNAME FNAME, but it does not find the ones where the second element of that string is blank (897//123456//LNAME FNAME). Can anyone help me figure out what I'm doing wrong?
alnum is a POSIX character class; gawk is fine with those. It's the { } interval syntax that many awks don't support. It came along later, and gawk originally required --re-interval to enable it so old scripts wouldn't break.
Here is a full code snippet of what I'm trying to convert. Note that I am currently using grep in a while loop, and reading through a file with millions of records makes this take quite a long time to complete:
while read LINE
do
REC13=`echo "$LINE" | cut -d"|" -f9 | grep -i '[[:alnum:]]\{2,3\}//[[:alnum:]]\{6\}'`
if [ -n "$REC13" ]
then
echo "$LINE" >> ./$PRVYR/$MONTH/mislabeled/$MONTH-mislabeled.csv
fi
done < INFILE
This particular check looks for either of these strings: CCC//SSSSSS or CC//SSSSSS
My goal is to try and convert this into an awk command.
Fortunately, you can convert {2,3} to plainer syntax: write the class twice, then add a third copy made optional with a ? after it. And repeat the {6} class six times. Not elegant, but at least efficient.
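Spelled out, that suggestion looks like this (a sketch with made-up input; it still assumes the awk understands POSIX character classes, which some very old ones don't):

```shell
# [[:alnum:]]{2,3} -> two mandatory copies plus one optional (?);
# [[:alnum:]]{6}   -> six literal repetitions.
printf '%s\n' '897//123456//LNAME FNAME' '8//123456//LNAME FNAME' |
awk '$0 ~ /[[:alnum:]][[:alnum:]][[:alnum:]]?\/\/[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]/'
```

Only the first record prints; the second has just one character before the //.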
Yes, reading and processing a line at a time on a file with millions of records is a huge resource waste...the entire while loop can be replaced by the awk one-liner I posted...so give it a try.
Ok. I made a little progress, but am still stuck. I was able to make my awk work if I use nawk instead of traditional awk. However, it is not matching all my conditions. First, a data set example:
Record1|some text (c-1234, US)|some more stuff
Record2| more text (c-1234, 897)|more stuff
Record3| new stuff (abc234, 897)| extra stuff
When I run this:
while read LINE
do
echo "$LINE" | nawk -F'|' 'BEGIN {search_regex = "\\([[:alnum:]]\{1\}.[[:alnum:]]\{4\},[[:blank:]][[:alnum:]]\{2,3\}\\)"} tolower($9) ~ search_regex {print $9}'
done < xx
When I run it I get this:
new stuff (abc234, 897)
but I don't get the other two records. How can I get awk to allow any character in that second position even if it is a dash? As you can see, I have even tried the . notation with no success. Any help resolving this will help me fix about twenty other things I'm trying to work through one at a time. As always, any help is greatly appreciated.
Just as a note, I did also try this outside of the while loop and I get the same results, not sure why I expected something different in the loop.
---------- Post updated at 11:09 AM ---------- Previous update was at 10:54 AM ----------
Shamrock, I did try this and it worked for the one condition, but not the ones where there was a - in the second field (can be other special characters too). I am working on making each awk a separate statement (like grepping a file but much faster) without the while loop.
You'll probably have better success if you don't store the regex as a string first. Just use tolower($9) ~ /regex/. In fact the tolower() isn't needed, since there's nothing case-sensitive in your pattern.
I'm confused about what you're doing now, though. First it was about 897/C/123456/LNAME FNAME in field 9; now it's something else in field 2, but your code says $9?
In this latest example:
$ awk --posix -F\| '$2 ~ /\([[:alnum:]].[[:alnum:]]{4},[[:blank:]]*[[:alnum:]]{2,3}\)/ {print $2}' input2
some text (c-1234, US)
more text (c-1234, 897)
new stuff (abc234, 897)
And for the name format, you switched between using / and //; I think it'd be: $9 ~ /^[[:alnum:]]{2,3}\/.?\/[[:alnum:]]{6}\// to check those first 2/3 subfields (897, c?, 123456)
Apologies Scott. I stripped out a bunch of the other fields (it will still be field 9), but my test sample is just a couple of fields. Trying to simplify things without copying in tons of data.
As you can see there are quite a few. The ones giving me a hard time are the ones that have a - in the field; the ones without a dash I have mostly been able to figure out with the help of this forum. A few others I'm close on, but I'm getting better at this. I have also been testing the earlier suggestion of not storing the regex, to simplify the code.
That's because post 1 was already solved, and I was trying to work my way down one other similarly related item without starting lots of threads that are basically just variations. Variation one is a complete valid record. The other ones I'm posting are "invalid" conditions I need to find in a sea of good data.
Actually, that second part of your post makes it less clear, because it seems to contradict the patterns in the first part. So what are you trying to match?
inside of other string information. However, it can be surrounded by other junk (parens, punctuation, weird formats), which is why I have to look for all the other stuff that users might put in there incorrectly. Basically I'm trying to find the known bad conditions in a monthly set of records that I have to analyze.