AWK help please

Hi,

I have two files, each with a different record separator: one uses a pipe | and the other a semi-colon ;

How do you deal with this in awk?

Any help appreciated

Specifically, I need to change the RS to ; when the following statement operates on the second file (assets.dat):

awk -F\| 'FNR==NR{arr[$3]=1;next};$30 in arr' output.txt assets.dat

  1. You said "record separator" but then you specified the field separator.
  2. The manpage tells you that RS and FS are just predefined variables with default values. The setting from the command line, as you have it, is just an assignment that takes place before any user-defined BEGIN section. Therefore you can change RS, FS, and any other variable from within any action - see the sketch just after this list. (There's more to it than that, but for a novice it's plenty.)
  3. FNR will only equal NR in the first file.
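
A tiny sketch of point 2, using a hypothetical file data.txt. The one catch is that a new FS (or RS) only applies to records read after the assignment - the record already in $0 keeps its old fields:

awk '
    NR == 1 { FS = ";" }   # reassign FS from inside an ordinary action...
    { print NF }           # ...record 1 was split with the old FS; records 2 onward split on ";"
' data.txt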

:confused:

Thanks for your response - sorry, yes, I meant the record separator. From reading the man pages I'm aware you can change it within an awk statement - it's just that I can't find examples of it. If it is possible, could someone show me how I'd do it in relation to the above awk statement?

awk -F\| 'FNR==NR{arr[$3]=1;RS=";";next};$30 in arr' output.txt assets.dat

But this still will never be executed in the second file.

Is that because of the FNR statement?

What I need to do is compare two files: the third field of the first file (field separator "|") and the 2nd field of the second file (field separator ";").

Thanks

Right. FNR is the record number in the current file; NR is the cumulative record number.
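
A quick way to watch it happen, with two hypothetical files a.txt and b.txt:

awk '{ print FILENAME, FNR, NR }' a.txt b.txt

While a.txt is being read, FNR and NR march in step; once b.txt starts, FNR drops back to 1 while NR keeps climbing. That's why your FNR==NR block - and the RS=";" buried inside it - only ever fires for the first file.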

Are these one-line files? If so, this is acquiring that HW smell... Also, if these are one-line files, awk is not my weapon of choice ...

This is the format of output.txt:

COEC2372323|EC2372323|7128778| |BE0117013319|381666|180617

and this is the format of assets.dat

BANKS;;7128778;;02;861542;03;B01ZJL7;;;06;EQ0010004100001000;11;IE0000197834;;;;;;;;;;;;;;;;;;;09;901773;;;;;;;;;;;;;Y;;;;EUR;EUR;;;;;;;;;;;;;;;EUR;;;;;;;;;;;;;;;;;;;;;;;S;;;;;;;;;;;;;;;;;;;;;;;;;B01ZJL7;;Y;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Both these files are hundreds of records long - possibly thousands.

The fields that need comparing are the 7128778 values shown above - if the third field of output.txt doesn't match any of the occurrences of field three in assets.dat, then we need to remove that record from output.txt.

Yikes! Okay, here's one approach. I'm assuming you're showing us one line (record) from either file.

In your BEGIN section, gather all the field 3's from the first file into an array (arr[$3]=1), since from your description you aren't interested in any other fields. At the end of your BEGIN, change FS=";". Then assets.dat is the only file named on the command line, and it gets processed by all the other pattern/action rules. The pattern I think you want is "arr[$2]". If all you want to do is send all those lines to stdout, you don't even need an action, since "print the matching line" is the default action for any pattern. So the outline goes something like this (not tested):

awk -F\| 'BEGIN { while ((getline < "output.txt") > 0) arr[$3] = 1; FS = ";" } arr[$2]' assets.dat

As one-liners go, that's a little longer than I like, so I'd put it in a script file with proper indenting. As part of a sh/ksh/zsh/bash script, you can indent away between the apostrophes without creating a separate file.
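
If it helps, here's the same idea written out as a file (call it match.awk, say, and run it as awk -f match.awk assets.dat) - still untested, and keep the $2/$3 choice in mind depending on which semicolon-separated field really carries the key:

# match.awk
BEGIN {
    FS = "|"                               # output.txt fields are pipe-separated
    while ((getline < "output.txt") > 0)   # the > 0 guard stops an endless loop if the file can't be read
        arr[$3] = 1                        # remember every third field
    close("output.txt")
    FS = ";"                               # everything after BEGIN now splits assets.dat on semi-colons
}
arr[$2]                                    # no action needed: matching lines print by default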

Re: awk versions

Some awks are field-count limited and will do undesirable things on lines with more than 99 fields. Also, some awks don't handle arrays of thousands of things very well. Perl or Ruby (or Python? or ???) hashes might give you a performance boost. (Of course, awk arrays are really hashes anyhow, so your seven digit indexes aren't creating arrays of millions of null entries.)
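
A one-liner that shows the sparseness, if you want to convince yourself:

awk 'BEGIN { a[7128778] = 1; n = 0; for (k in a) n++; print n }'   # prints 1, not 7128778

Only the one key exists; nothing gets allocated for the "missing" indexes below it.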