field substitution w/awk

giannicello · June 6, 2003, 5:46pm

I want to replace columns 15 thru 22 with a date in this format mmddyyyy in a file that has fixed record length of 110 columns. Only lines with column 1 = "T" will be changed but I can't seem to get it to work.

I saw several postings and tried to work with them but they don't work for me...

If I had 20 lines, any line that is not a 'T' line is echoed out as is, and when there's a 'T', it is supposed to substitute 15-22 with, say 06062003, but the awk substr command won't do it:

echo "$LINE" | awk '{print substr ($0, 1, 14) $proc_dt substr ($0, 24, length ($0) - 24)}' >> newfile

it's literally printing $proc_dt in the newfile...when it should say 06062003 in col 15-22 (I've already set the variable...)

What's going on? I tried enclosing $proc_dt in (, {, etc and nothing works...

Gianni

Perderabo · June 6, 2003, 10:00pm

You have something like this:

'blah blah blah ${variable} more stuff'

The whole idea of single quotes is that "what you see is what you get". The shell will not make any substitutions inside single quotes.

So turn that into two single quoted strings with the variable in between. Like this:

'blah blah blah '${variable}' more stuff'

The braces are not strictly needed, but I think they improve the readability of the code.
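A minimal sketch of the quoting trick, assuming the thread's proc_dt variable and a made-up sample record. Here the value is additionally wrapped in awk double quotes so that awk treats it as a string; a bare 06062003 would be read as a numeric constant and the leading zero would not survive.

```shell
proc_dt=06062003
# Hypothetical record: "T" in column 1, an old date in columns 15-22.
line=$(printf '%-14s%s%s' T 00000000 REST)

# Close the single quote, let the shell expand the variable, reopen the quote.
# The \" pair puts literal double quotes around the value for awk.
fixed=$(printf '%s\n' "$line" |
  awk '{ print substr($0, 1, 14) '\""${proc_dt}"\"' substr($0, 23) }')

printf '%s\n' "$fixed"
```

After the shell has done its substitution, the program awk receives is { print substr($0, 1, 14) "06062003" substr($0, 23) }.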

giannicello · June 9, 2003, 11:26am

Awesome. I thought I tried this already and it didn't work before...

In any case, this works as expected, so thank you.

The other question comes back to calling the external awk program which seems really slow when trying to substitute 20K records (could take 10-15 minutes) depending on how many users are on the system. Does anyone have any other ideas on how this could be achieved quicker? sed?

Gianni

Jimbo · June 9, 2003, 1:13pm

If you have a shell while loop, and within it you are doing:

echo "$LINE" | awk ...

then you are calling awk 20,000 times, opening your output file 20,000 times etc. Instead, do away with the shell loop, and do it with awk:

awk '{
if ($1 != "T")
   print
else
   print substr ...
}' infile > newfile

criglerj · June 10, 2003, 9:39am

A bit more flexible than the other solutions:

awk -v newdate=06062003 '/^T/ {
        $0 = substr($0,1,14) newdate substr($0,23,length($0)-22)
    }
    { print $0 }'

This can be put in a shell script to automatically get the date from the 'date' command (if you want today's date), or your script can check the date to be sure it's valid (or you can check the validity in the awk ...) ...
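As a rough sketch of such a script, assuming today's date is wanted and that a scratch directory from mktemp is acceptable; the two sample records are invented for illustration:

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clutter anything.
workdir=$(mktemp -d)

# Two hypothetical records: a "T" record and an "H" record.
printf '%-14s%s%s\n' T 00000000 ' tail' H 11111111 ' tail' > "$workdir/infile"

# Today's date as mmddyyyy.
newdate=$(date +%m%d%Y)

# One awk process handles the whole file; only lines starting with T change.
awk -v newdate="$newdate" '
/^T/ { $0 = substr($0, 1, 14) newdate substr($0, 23) }
{ print }
' "$workdir/infile" > "$workdir/newfile"

cat "$workdir/newfile"
```

Passing the date with -v keeps it out of the awk program text, so no shell quoting tricks are needed.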

BTW, your original solution dropped two characters: the one immediately after the date (column 23), because the second substr starts at column 24, and the last one on the line, because the length in "substr ($0, 24, length ($0) - 24)" is one short.
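The difference is easy to see on a short made-up record (30 columns here instead of 110, purely for illustration):

```shell
# Hypothetical record: columns 1-14, date in 15-22, then ABCDEFGH in 23-30.
rec=$(printf '%-14s%s%s' T 00000000 ABCDEFGH)

# Original tail expression: starts one column late and stops one column early.
bad=$(printf '%s\n' "$rec" | awk '{ print substr($0, 24, length($0) - 24) }')

# Tail from column 23 to the end of the line; the length can be left out.
good=$(printf '%s\n' "$rec" | awk '{ print substr($0, 23) }')

printf 'original:  [%s]\ncorrected: [%s]\n' "$bad" "$good"
# original:  [BCDEFG]
# corrected: [ABCDEFGH]
```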

giannicello · June 13, 2003, 2:42pm

awk '{
if (substr($1, 1, 1) != 'T' )
print
else
print substr ($1, 1, 14) '${proc_dt}' substr ($1, 24, length ($1))}'
}' infile > newfile

What's wrong with this? I'm not good w/awk..
I got it to work with the while statement but not this way...

Gianni

Jimbo · June 14, 2003, 4:00pm

That's not bad, actually - just a couple of issues ...

awk '{
if (substr($1, 1, 1) != "T" )
print
else
print substr ($1, 1, 14) '\"${proc_dt}\"' substr ($1, 24)
}' infile > newfile

Your code terminates with two rightbrace-quotemarks, so it hangs waiting on stdin because it does not see infile beyond that second quote.

There are several ways to get a variable into awk. The approach you used by allowing it to be substituted when the command line is parsed is a good way. However, when awk sees the unprotected 14-JUN-2003, that is evaluated as a numeric expression. JUN, as an undefined variable, evaluates to zero, so that makes 14-0-2003 = -1989. You want awk to see that as a string constant. So I surrounded it with double quotes and protected with backslashes so they don't get removed by the shell.

And a small point: When you want the tail-end of a word, such as beginning with character 24 for the remainder of that word, you can just say substr($1,24). With no third operand, it takes the remainder of the expression being substringed.
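A small sketch of the numeric-expression problem described above, assuming proc_dt holds 14-JUN-2003 as in that explanation:

```shell
proc_dt=14-JUN-2003

# Unprotected: awk sees  print 14-JUN-2003  and does arithmetic on it.
bare=$(awk 'BEGIN { print '"${proc_dt}"' }')

# Protected: awk sees  print "14-JUN-2003"  and prints the string.
quoted=$(awk 'BEGIN { print '\""${proc_dt}"\"' }')

printf 'bare:   %s\nquoted: %s\n' "$bare" "$quoted"
# bare:   -1989
# quoted: 14-JUN-2003
```

Another common way to hand the value over is awk -v name="$proc_dt", which avoids splicing shell text into the awk program at all.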

criglerj · June 16, 2003, 10:12am

One substantive issue with Jimbo's post of 6-14-2003: If you use substr($1,1,1) == "T", you will pick up lines where the first non-blank character is a "T", which is in general a superset of the lines the OP asked for (lines with "T" in column 1), though it may be sufficient for this particular application. Better would be substr($0,1,1)=="T" or /^T/.
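To illustrate the difference, here is a sketch with an invented line whose first non-blank character is a T that sits in column 4:

```shell
# Hypothetical line: three leading blanks, then a T.
rec='   Tail record'

# Test against the first field: leading blanks are skipped, so this matches.
by_field=$(printf '%s\n' "$rec" |
  awk '{ print ((substr($1, 1, 1) == "T") ? "match" : "no match") }')

# Test against the whole line: column 1 is a blank, so this does not match.
by_line=$(printf '%s\n' "$rec" |
  awk '{ print ((substr($0, 1, 1) == "T") ? "match" : "no match") }')

printf '$1 test: %s\n$0 test: %s\n' "$by_field" "$by_line"
# $1 test: match
# $0 test: no match
```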

Jimbo · June 16, 2003, 10:36am

That's not my code. I was just fixing the major problems while retaining as much as possible of the original code.

When the OP says "column 1", some people may think column 1 of the entire line, but many people call the fields "columns", while some say "fields" and yet others say "words". Therefore, the OP actually might want to check character 1 of $1 rather than character 1 of $0.

giannicello · June 19, 2003, 9:27am

Thanks for the hint. It almost worked. Now the program runs in 1-3 seconds (as opposed to 5-10 minutes)...however, it did not seem to preserve the spaces in the record that was updated...even though I said to print out from 1 - 14 then 24 - to end of record... There are many blanks in this fixed formatted file and I need to preserve everything as is...other than the date.

Also, how do I set variables in the awk statement so that I can echo out line number, incrementing line count, record type, etc of the updated record as they are being processed? My attempt to do it resulted in my code abending again and again....

Thanks.

Gian

Jimbo · June 19, 2003, 9:49am

It would be helpful to see the code you have at this point. If you take substr($0,1,14), that takes the first 14 characters of the line, multiple spaces and all.

Set variables with

myvar=(some expression)

See "man awk" for predefined variables. For example, NR is a predefined variable representing the current record number being processed from 1 thru n across all files, and FNR is a file-relative current record number that begins over at 1 when awk is processing multiple input files.

If you want to see additional info being processed such as record numbers etc, you will want to have two outputs: your newly formatted output file and your "processing log output". This is easily done with awk:

print substr($0,1,14) substr($0,24)
print NR, rectype, amount > "mylogfile"

You don't need a double right-angle bracket here for append because awk will open mylogfile only once, not once for each record being processed. But if you want mylogfile to be appended across multiple runs, then use the double.

oombera · June 19, 2003, 10:36am

Just a side note about your code from above, giannicello. It didn't seem (at least to me when I was reading) that there were spaces in your file, but now you said there are. Awk uses the space as a delimiter by default. What this means is that if a typical line in your file looks like:

abcdefghijklmn12345678some other stuff

then $1 is abcdefghijklmn12345678some, $2 is other and $3 is stuff.
$0 stores the whole line.

So this piece of code you entered - print substr ($1, 1, 14) '\"${proc_dt}\"' substr ($1, 24) - will print the first 14 characters of $1, or abcdefghijklmn, then the value of $proc_dt, then the rest of $1 from character 24 on, which is just ome (the s sits in character 23 and is skipped). other and stuff get left out completely, so you need to use $0.
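Running both variants on that sample line shows the effect. This is only a sketch: the date is hard-coded as an awk string, and the $0 version takes the tail from character 23 so that nothing after the date is lost.

```shell
rec='abcdefghijklmn12345678some other stuff'

# Using $1: only the first whitespace-delimited field is available.
from_field=$(printf '%s\n' "$rec" |
  awk '{ print substr($1, 1, 14) "06062003" substr($1, 24) }')

# Using $0: the whole line, embedded spaces included.
from_line=$(printf '%s\n' "$rec" |
  awk '{ print substr($0, 1, 14) "06062003" substr($0, 23) }')

printf '%s\n%s\n' "$from_field" "$from_line"
# abcdefghijklmn06062003ome
# abcdefghijklmn06062003some other stuff
```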

giannicello · June 19, 2003, 10:55am

As you can tell, I didn't know what I was doing; that's why I'm asking the experts. I think I understand awk a little better now. I started out with $0 and somewhere along the line it became $1. I made the change and now everything is fine. The code is so much simpler and faster too.

Thanks.