AWK - Ignoring White Space with FS

reno4me · February 15, 2012, 10:34am

I have an AWK script that uses multiple delimiters in the FS variable.
FS="[\. _]+"

My awk script takes a file name such as this:
12345_smith_bubba_12345_20120215_4_0.pdf and parses it into fields at each underscore. Each parsed field then goes through some code for data validation etc.

This script has been working fine until I got a file name like this:
12345_smith johnson jones_bubba_12345_20120215_4_0.pdf
where the last name field has three last names separated by spaces.

Is there a way in my FS assignment to tell AWK to ignore the whitespace and just use the period "." and underscore "_" as the only delimiters?

FYI this is my first post!

Randy

bartus11 · February 15, 2012, 10:37am

Try:

FS="[\._]+"
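To see what dropping the space changes, here is a minimal sketch that runs the problem file name through both separators. The variable names (name, with_space, without_space) are only for illustration, and the bracket expression is written without the backslash on the assumption that a dot is already literal inside [...].

```shell
#!/bin/sh
# Sample file name from the question (three last names separated by spaces).
name='12345_smith johnson jones_bubba_12345_20120215_4_0.pdf'

# Original separator: dot, space, and underscore all split fields.
with_space=$(printf '%s\n' "$name" |
  awk 'BEGIN { FS = "[. _]+" } { print NF ": " $2 }')

# Space removed: only dot and underscore split fields.
without_space=$(printf '%s\n' "$name" |
  awk 'BEGIN { FS = "[._]+" } { print NF ": " $2 }')

echo "$with_space"      # 10: smith
echo "$without_space"   # 8: smith johnson jones
```

With the space in the list, the three last names land in $2, $3 and $4 and every later field shifts by two. Without it, $2 holds the whole name.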

ctsgnb · February 15, 2012, 10:38am

Otherwise, you could rename the file so that the last names are joined by a character that is not in the FS list, for example:

12345_smith-johnson-jones_bubba_12345_20120215_4_0.pdf

then run your script again

I am reluctant to remove the space from the FS list: if the coder added it, it may be needed by a later step in the awk script. So if you remove it, make sure that doing so has no impact on the rest of the awk code.

jim_mcnamara · February 15, 2012, 10:39am

Are you on Solaris? If so, try nawk with bartus11's answer.

reno4me · February 15, 2012, 10:44am

Bartus - That was the first thing I tried and still got the same result.

ctsgnb - The script's purpose is to create an "index" file used by an application to import PDFs for viewing. The application will have the person's name preloaded. So when it looks to the index file for the person's name, it has to be exact or it will fail to import. So the name has to maintain the spaces.

Thank you for replying so fast!

---------- Post updated at 10:44 AM ---------- Previous update was at 10:43 AM ----------

Jim - We are on HP UX.

Scrutinizer · February 15, 2012, 11:11am

Where do you set the FS variable? Could you post the script or the relevant part of it?

reno4me · February 15, 2012, 11:17am

I set the FS in the BEGIN block:

BEGIN  {
 RS="\n"  # new_line record separator
 
 FS="[\. _]+"
 OFS="_"

Then in the main body of the script I do a print to an output file:

          field2=$2
          printf ("\"%s\",",field2)     >> path

Scrutinizer · February 15, 2012, 11:35am

What output do you get when you run this command?

echo "12345_smith johnson jones_bubba_12345_20120215_4_0.pdf" | awk -F '[_.]*' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' OFS=:

reno4me · February 15, 2012, 11:41am

Output -
12345:smith johnson jones:bubba:12345:20120215:4:0:pdf::

Which looks really close. Expected output would be -
12345_smith johnson jones_bubba_12345_20120215_4_0.pdf

I'll use your code to do some tweaking and see if I can get it. I'll reply back and let you know what happens.

Thank you!!

ctsgnb · February 15, 2012, 11:42am

If you want to maintain the space in the name, then you should remove it from your FS list.

Just remove the space which is between your dot and your _
FS="[\. _]+"
must be changed to :
FS="[\._]+"

You must then review the rest of your awk code to check that it behaves as you expect.

If not, then post your input, the output you expect, and your current code. People will then be able to help you fix it.
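As a rough check of that change, here is a reduced sketch of the posted script: only the FS assignment and the quoted printf of field 2 are kept, and the output goes to standard output instead of the path file. The variable name line is made up for the example, and the backslash is left out of the bracket expression on the assumption that it is not needed there.

```shell
#!/bin/sh
# Reduced, hypothetical version of the script: corrected FS plus the
# printf that writes the quoted name field.
line=$(printf '%s\n' '12345_smith johnson jones_bubba_12345_20120215_4_0.pdf' |
  awk 'BEGIN { FS = "[._]+" }
       {
         field2 = $2
         printf("\"%s\",\n", field2)
       }')

echo "$line"   # "smith johnson jones",
```

If the index line shows the full name inside the quotes, the separator change is doing its job. The remaining fields still need the same review, since their numbers no longer shift when a name contains spaces.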

Scrutinizer · February 15, 2012, 11:46am

Hi, I used this to check if there was an anomaly with your particular awk, but it looks OK, so bartus11's suggestion would seem to work after all.
Just change OFS to underscore for the desired output and put a dot between $7 and $8:

echo "12345_smith johnson jones_bubba_12345_20120215_4_0.pdf" | awk -F '[_.]*' '{print $1,$2,$3,$4,$5,$6,$7"."$8}' OFS=_

ctsgnb · February 15, 2012, 12:01pm

Note that if what you want is to rebuild the initial file name, you can just use awk's built-in FILENAME variable.
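A small sketch of that idea follows. It only applies if the PDF name is the file awk is actually reading, which is an assumption here, since the examples in this thread feed the name in as a line of input. The scratch directory, the placeholder content, and the names dir, f and result are all made up for the example, and mktemp -d may not behave the same way on every system.

```shell
#!/bin/sh
# Work in a scratch directory so the sample file is easy to clean up.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Hypothetical stand-in for the real PDF; the content does not matter here.
f='12345_smith johnson jones_bubba_12345_20120215_4_0.pdf'
echo 'placeholder' > "$f"

# FILENAME holds the current input file name as it was passed to awk.
# split() breaks it into parts without changing FS for the file's records.
result=$(awk 'FNR == 1 {
  n = split(FILENAME, part, /[._]+/)
  print n ": " part[2] " -> " FILENAME
}' "$f")

echo "$result"

cd / && rm -rf "$dir"
```

Because FILENAME is taken as is, the original name, spaces included, is available without reassembling it from the fields.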

reno4me · February 15, 2012, 1:57pm

I did get it working using FS="[_.]*", and the output was in the format I expected.
I appreciate the help!! This is a GREAT forum!
Randy