Script to validate file header and trailer

ash_sh · July 9, 2012, 4:57pm

Hi,

I need a script that validates a file header/detail/trailer. File layout is:

Sample Data:

HDR|customer_data.dat|20120709
LIN|value1|value2|value3...
LIN|value1|value2|value3...
LIN|value1|value2|value3...
LIN|value1|value2|value3...
TRL|customer_data.dat|20120709|4

The script should validate that exactly one header and one trailer exist. If there are more, raise an exception.
The script should verify that the total number of detail lines equals the record count in the trailer record (Record_count).

I would really appreciate it if someone could provide such a script.

Thanks

PikK45 · July 9, 2012, 10:15pm

Can you show us what you have done on this so far?

agama · July 9, 2012, 10:44pm

A minimal solution:

awk -F "|" ' /^HDR/ { h++; next; } /^TRL/ { exit( (h > 1) || $NF != NR - 2 ); }' file-name

If the exit code is non-zero there is an error.
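
For example, the exit status can be read from `$?` straight after the command. This is only a sketch: the temporary file and its contents are made up to match the sample layout in the first post.

```shell
# Build a hypothetical sample file: 1 header, 4 detail lines, 1 trailer.
sample=$(mktemp)
cat > "$sample" <<'EOF'
HDR|customer_data.dat|20120709
LIN|value1|value2|value3
LIN|value1|value2|value3
LIN|value1|value2|value3
LIN|value1|value2|value3
TRL|customer_data.dat|20120709|4
EOF

# Same one-liner as above; at the trailer, NR - 2 is the number of detail lines.
awk -F "|" ' /^HDR/ { h++; next; } /^TRL/ { exit( (h > 1) || $NF != NR - 2 ); }' "$sample"
status=$?

# 0 means the file passed, anything else means it failed.
echo "exit status: $status"
rm -f "$sample"
```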

ash_sh · July 10, 2012, 9:18am

I don't have Unix experience; I am using this script in an ETL program that loads the file into a DB. So I haven't done anything yet.

chedlee88-1 · July 10, 2012, 9:32am

That awk script looks complicated.

Why don't you break the problem down into several parts?
1) Check that the header exists. Use grep & head
2) Check that the trailer record exists. Use grep & tail
3) Check whether there are multiple header or trailer records. Use grep & wc
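
A rough sketch of those three steps, assuming the header must be the first line and the trailer the last. The sample file is made up, and `grep -c` stands in for `grep | wc -l` since it prints the count directly.

```shell
# Hypothetical sample file in the layout from the first post.
file=$(mktemp)
cat > "$file" <<'EOF'
HDR|customer_data.dat|20120709
LIN|value1|value2|value3
LIN|value1|value2|value3
TRL|customer_data.dat|20120709|2
EOF

# Steps 1 and 2: the first line should be a header, the last a trailer.
first_tag=$(head -n 1 "$file" | cut -d'|' -f1)
last_tag=$(tail -n 1 "$file" | cut -d'|' -f1)

# Step 3: count header and trailer records ('|' is a literal in a basic regex).
hdr_count=$(grep -c '^HDR|' "$file")
trl_count=$(grep -c '^TRL|' "$file")

if [ "$first_tag" = "HDR" ] && [ "$last_tag" = "TRL" ] &&
   [ "$hdr_count" -eq 1 ] && [ "$trl_count" -eq 1 ]
then
    result="ok"
else
    result="bad header/trailer"
fi
echo "$result"
rm -f "$file"
```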

ash_sh · July 10, 2012, 9:35am

An equally important requirement is to verify that the number of detail lines matches the record count in the trailer. How can this be done?

agama · July 10, 2012, 8:21pm

The awk programme that I posted had a small bug: it didn't properly check for multiple trailers. This does better:

awk -F "|" ' /^HDR/ { h++; next; } /^TRL/ {t++; next} END { exit( (h != 1) || t != 1 || $NF != NR - 2 ); }' input-file

It checks both for the existence of exactly one header and one trailer, AND checks that the record count in the trailer matches the records observed. If either fails the exit code is non-zero to indicate that there is an error. If you need more precision, knowing exactly why there is an error, a longer programme can be used:

awk -F "|" '
/^HDR/ { h++; next; }
/^TRL/ { t++; next; }
END {
    ec = 1;
    if( h != 1 || t != 1 )
        printf( "header/trailer count error: %d headers; %d trailers\n", h, t ) >"/dev/fd/2";
    else
        if( $NF != NR - 2 )
            printf( "bad record count: %d(t) != %d(rec)\n", $NF, NR - 2 ) >"/dev/fd/2";
        else
            ec = 0;
    exit( ec );
}' input-file
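
To show what the error report looks like, here is a sketch that runs a lightly adapted copy of the longer programme against a made-up file whose trailer claims 3 records while only 2 detail lines are present. Two things are my own assumptions: the message prints `NR - 2` (the detail-line count), and output goes to `/dev/stderr` rather than `/dev/fd/2`.

```shell
# Hypothetical bad file: trailer says 3, but there are only 2 detail lines.
bad=$(mktemp)
cat > "$bad" <<'EOF'
HDR|customer_data.dat|20120709
LIN|value1|value2|value3
LIN|value1|value2|value3
TRL|customer_data.dat|20120709|3
EOF

# Capture standard error (2>&1) so the message can be inspected.
msg=$(awk -F "|" '
/^HDR/ { h++; next; }
/^TRL/ { t++; next; }
END {
    ec = 1;
    if( h != 1 || t != 1 )
        printf( "header/trailer count error: %d headers; %d trailers\n", h, t ) >"/dev/stderr";
    else
        if( $NF != NR - 2 )
            printf( "bad record count: %d(t) != %d(rec)\n", $NF, NR - 2 ) >"/dev/stderr";
        else
            ec = 0;
    exit( ec );
}' "$bad" 2>&1)
status=$?

echo "status=$status message=$msg"
# prints: status=1 message=bad record count: 3(t) != 2(rec)
rm -f "$bad"
```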

I am not familiar with what an 'ETL program' is, so I don't know if you can invoke awk or not.

If ETL is anything like any of the standard *NIX shells, you can probably do something like this:


if ! awk -F "|" ' /^HDR/ { h++; next; } /^TRL/ {t++; next} END { exit( (h != 1) || t != 1 || $NF != NR - 2 ); }' "$file_name"
then
   echo "file did not pass verification test: $file_name"
   exit 1
fi

# put rest of your processing on success here.
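
As a quick sanity check, the same test can be wrapped in a shell function and run against a few made-up files. This is a sketch only: `validate_file`, the file names and their contents are all hypothetical.

```shell
# validate_file is a hypothetical helper name wrapping the one-line awk check.
validate_file() {
    awk -F "|" ' /^HDR/ { h++; next; } /^TRL/ { t++; next } END { exit( (h != 1) || t != 1 || $NF != NR - 2 ); }' "$1"
}

tmpdir=$(mktemp -d)

# good: 2 detail lines, trailer says 2.
printf 'HDR|f.dat|20120709\nLIN|a|b\nLIN|c|d\nTRL|f.dat|20120709|2\n' > "$tmpdir/good.dat"
# short: 1 detail line, trailer says 2.
printf 'HDR|f.dat|20120709\nLIN|a|b\nTRL|f.dat|20120709|2\n' > "$tmpdir/short.dat"
# two_hdr: header record appears twice.
printf 'HDR|f.dat|20120709\nHDR|f.dat|20120709\nLIN|a|b\nTRL|f.dat|20120709|1\n' > "$tmpdir/two_hdr.dat"

results=""
for f in good short two_hdr
do
    if validate_file "$tmpdir/$f.dat"
    then
        results="$results $f=pass"
    else
        results="$results $f=fail"
    fi
done
echo "$results"
rm -rf "$tmpdir"
```

Only the first file should pass; the other two should be rejected by the count check and the header check respectively.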

@chedlee88-1 -- for small files, reading each three times might not have a noticeable impact, but if the input being verified is large, reading each three times might be so inefficient as not to be practical. It makes more sense to read the file once.

The longer programme above does the same thing: it exits with a non-zero code, and it also writes the failure reason to standard error.

ash_sh · July 11, 2012, 4:55pm

@agama, thanks for the script; I will test it tomorrow. By the way, my file is indeed quite large, so something efficient is a must. Also, ETL is an integration/middleware tool that transfers data from a source to a target.