Awk Multiple Field Separators

Tonka52 · April 7, 2004, 6:19am

Hi Guys,

I'm trying to split a line similar to this:

YO6-2000-30.htm:                                               (3 properties found).

......into separate columns, so effectively I need to split on a hyphen (-), a period (.), a colon (:), a tab and a space.

Any help would be appreciated

Thanks!

google · April 7, 2004, 6:53am

Does your data look like this? YO6-2000-30.htm: Or like this? YO6-2000-30.htm: (3 properties found).

In the first case, set RS = ":" to delimit each record, and then you can parse each field within the record using regexps. In the latter case, play the same game by setting RS = ")."

Tonka52 · April 7, 2004, 6:57am

Google,

The line is exactly as printed.

So :

YO6-2000-30.htm: (3 properties found).

......needs to become

 $1   $2    $3  $4   $5
YO6   2000  30  htm  (3 properties found)

I understand that $5 could effectively become:

$5   $6           $7
(3   properties   found)

....but that's manageable!

google · April 7, 2004, 7:04am

Awk takes input and creates "records" by splitting it on the value of RS. It then splits each record into fields on the value of FS, the field separator. You can then slice and dice each field at your whim. In addition, if you don't have an easy way to split records and fields, Awk (gawk) allows you to define your own fields by specifying column widths using the FIELDWIDTHS variable. Example:

BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }

.....will define a record of fixed-width fields, including the whitespace between columns. So $1 is defined as a field of 9 characters, $2 as a field of 6 characters, and so on.

This is a pretty good tutorial on Awk: GNU Awk Tutorial

Tonka52 · April 7, 2004, 7:12am

Google,

I can't rely on fixed-width fields, as the length changes. I was hoping for a command something like:

awk ' {FS = "[:, -]+" } { print $1 $2 $3 $4 $5}'

Of course, this syntax is wrong...but you know what I'm getting at yeah?
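
For what it's worth, a form along these lines should work in any POSIX awk. It is only a sketch: the sample line and the exact character class are taken from this thread, and split_line is a made-up helper name. The idea is to pass the separator regex with -F so it is in effect before the first line is read, and to put commas between the print arguments so the fields come out separated instead of run together.

```shell
# Sample line from this thread (single space after the colon).
line='YO6-2000-30.htm: (3 properties found).'

# split_line is a hypothetical helper name used only for this sketch.
# -F sets FS before any input is read; the commas in print join the
# fields with OFS (a space by default) instead of concatenating them.
# The + makes a run of adjacent separators count as one split point.
split_line() {
    awk -F'[-.: \t]+' '{ print $1, $2, $3, $4, $5 }'
}

printf '%s\n' "$line" | split_line
```

On the sample line this should print "YO6 2000 30 htm (3", since the space inside the parentheses is a separator too.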

google · April 7, 2004, 7:16am

You can massage the data a bit by changing every "(", ")" and "." to a ":" before you parse it. Once you have done that, all of your data looks the same. Set FS = ":" to define your fields, set OFS to whatever output delimiter you need, and print your data. If you need the parens in the output, add them back in your print statement. Remember, Awk works on its own copy of each record and never touches the input file, so you can make these changes for the purposes of your program without mucking anything up!
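
A rough sketch of that massage-then-split idea, using plain gsub and split on a copy of the record so it runs in any POSIX awk. The helper name massage_and_split, the "|" output delimiter and the choice of which pieces to print are assumptions made for this illustration; which array index holds the text depends on the shape of the line.

```shell
# Sample line from this thread (single space after the colon).
line='YO6-2000-30.htm: (3 properties found).'

# massage_and_split is a hypothetical helper name for this sketch.
massage_and_split() {
    awk 'BEGIN { OFS = "|" }    # output delimiter picked arbitrarily
    {
        s = $0                          # work on a copy; $0 stays as read
        gsub(/[().]/, ":", s)           # turn ( ) . into the common delimiter
        n = split(s, part, ":")         # part[1..n] now hold the pieces
        print part[1], part[2], "(" part[4] ")"   # parens added back by hand
    }'
}

printf '%s\n' "$line" | massage_and_split
```

For the sample line this should give "YO6-2000-30|htm|(3 properties found)". The hyphens are left alone here because only the parens and the period were converted.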

gensub(regexp, replacement, how [, target])

gensub is a general substitution function. Like sub and gsub, it searches the target string target for matches of the regular expression regexp. Unlike sub and gsub, the modified string is returned as the result of the function and the original target string is not changed. If how is a string beginning with g or G, then it replaces all matches of regexp with replacement. Otherwise, how is treated as a number that indicates which match of regexp to replace. If no target is supplied, $0 is used.

gensub provides an additional feature that is not available in sub or gsub: the ability to specify components of a regexp in the replacement text. This is done by using parentheses in the regexp to mark the components and then specifying \N in the replacement text, where N is a digit from 1 to 9. For example:

Tonka52 · April 7, 2004, 7:19am

Yeah, thought of that but my curiosity got me wondering if there's a single expression I can use.

Thanks anyway!

toonse · April 7, 2004, 9:37pm

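# Split on runs of hyphen, period, colon, tab or space.
# The + stops ": " from producing an empty field between the two characters.
BEGIN { FS = "[-.:\t ]+" }
{
        # Fields beyond NF are empty strings, so asking for $8 and $9 is harmless.
        print $1, $2, $3, $4, $5, $6, $7, $8, $9
}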

This is how I would do it. When tested with your data it does exactly what you want.
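
To see what the separator regex does to the sample line, a small sketch like the one below compares a character class without and with a trailing +. The helper name show_fifth is made up for this example, and the sample line is the single-space version posted earlier in the thread.

```shell
# Sample line from this thread (single space after the colon).
line='YO6-2000-30.htm: (3 properties found).'

# show_fifth is a hypothetical helper: it prints the field count and
# the fifth field for whatever separator regex is passed as $1.
show_fifth() {
    printf '%s\n' "$line" | awk -v FS="$1" '{ print NF ": [" $5 "]" }'
}

show_fifth '[-.:\t ]'     # each separator character splits on its own
show_fifth '[-.:\t ]+'    # a run of separators counts as one
```

Without the + the colon and the space after it are two separate split points, so $5 comes out empty and the line has 9 fields ("9: []"). With the + they collapse into one and $5 is "(3" ("8: [(3]"). The last field is empty in both cases because the line ends with a period, which is itself a separator.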