Dropping Records for unknown reason in awk script

mkastin · October 30, 2009, 10:40am

Hi,

I have written the following script. It is pretty sloppy, but I don't see any reason why I should be losing 54 records from a 3.5 million line file after running it.

What I am doing:

I have a 3.5 million record file in which about 80,000 records need a correction. They are missing the last data from an append because they didn't have a match, so I need to insert defaulted data on these records. My script worked as intended; however, I have 54 fewer output records than input records and I don't know why they were dropped.

#!/bin/ksh

myFile="${1}"
myOutput="${2}"

awk '{
 match_flag=substr($0,63,2);
 if (NR == 1) insert_data=substr($0,41,22);
 if (match_flag == "  ") {strt=substr($0,1,40); print strt insert_data "\ \ \ \ \ \ \ \ \ \ \ NM\ X";}
else print $0;}' "${myFile}" >> "${myOutput}"

Basically, what I am doing is appending a long string of data to any records that are missing a value in position 3064-3065.

Since this file is so large I can't really provide sample data, but I'll attempt to reproduce a short version below.

INPUT:
0001  Ronald   McDonald  01 H81 0001256 0100111               V VEEEFKFS SP X
0002  Elmo     St. Elmo  02 H82 0089621  001  10 11 01 1      0000WWDFCWWSP X
0003  Cookie   Monster   01 H81 0887141    1  .  0   0  .  1  BBB000 QWFJSP X
0004  Tfer     Harris    04 H84 0985512 0000000000000000000000BBE00122933NM X
0005  Oscar    Grouche   03 H83 0364471                   110.VVMWEWGODWFDA X
0006  Dumb     Name      02 H82 0000233   111 00 1111 00000000F23202233FFDA X
0007  Butter   Face      04 H84 0014666 1111111111111111111111M012291122FDA X
0008  Ford     F150      01 H81 0000001 00111 110 110  0011 ..S1102234SSMSP X
0009  Bar      Foo       03 H83 7741668 0 1 0 1 0 1 0 1 0 1 0 P019441MEWEDA X
0010  ChoCho   Train     04 H84 0014669 1111111111111111111111POWA1224023OB X
0011  Stone    Stone     04 H84 0014566 1111111111111111111111M12301MANWEOB X
0012  Problem  Record    04 H84 0000000 

OUTPUT:
0001  Ronald   McDonald  01 H81 0001256 0100111               V VEEEFKFS SP X
0002  Elmo     St. Elmo  02 H82 0089621  001  10 11 01 1      0000WWDFCWWSP X
0003  Cookie   Monster   01 H81 0887141    1  .  0   0  .  1  BBB000 QWFJSP X
0004  Tfer     Harris    04 H84 0985512 0000000000000000000000BBE00122933NM X
0005  Oscar    Grouche   03 H83 0364471                   110.VVMWEWGODWFDA X
0006  Dumb     Name      02 H82 0000233   111 00 1111 00000000F23202233FFDA X
0007  Butter   Face      04 H84 0014666 1111111111111111111111M012291122FDA X
0008  Ford     F150      01 H81 0000001 00111 110 110  0011 ..S1102234SSMSP X
0009  Bar      Foo       03 H83 7741668 0 1 0 1 0 1 0 1 0 1 0 P019441MEWEDA X
0010  ChoCho   Train     04 H84 0014669 1111111111111111111111POWA1224023OB X
0011  Stone    Stone     04 H84 0014566 1111111111111111111111M12301MANWEOB X
0012  Problem  Record    04 H84 0000000 0000000000000000000000           NM X

The file is fixed-length with no delimiters.

mkastin · November 1, 2009, 12:19pm

Please Help!

steadyonabix · November 1, 2009, 2:56pm

It would be helpful if you rewrote your example code to work on the sample input you provide. At the moment there is no way of knowing what you are expecting in position 3064. That said, your assumption that it holds two blank spaces may be at the root of your problem.

I also don't understand your print statement: -

print strt insert_data "\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0NM\ X"

When I try: -

nawk ' BEGIN{
  print "\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0NM\ X"
} '

I get: -

\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0NM\ X

as my output not: -

0000000000000000000000NMX

My advice is to scale down your example awk to work with your sample file, and then maybe someone will reply. Simply reposting the same request without changing it at all seems to be getting you nowhere.

Good luck

mkastin · November 2, 2009, 10:29am

Haha, wow, just realized how horrible my question was. Okay, I adjusted everything and it should hopefully be clearer now.

$ awk ' BEGIN{
  print "\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 0NM\ X"
} '
awk: cmd. line:1: warning: escape sequence `\ ' treated as plain ` '
                                    0NM X

This statement works fine for me, although the escape sequence isn't necessary.
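
For reference, a minimal sketch of building the same kind of filler without any backslash-space escapes, since that escape is not one awk defines (hence the warning above). It assumes the 11-blank filler followed by "NM X" from the original script; the variable name filler is only illustrative.

```shell
#!/bin/sh
# Build 11 blanks followed by "NM X" using a printf width specifier
# instead of "\ " escapes. The width (11) is taken from the original
# script's filler and is an assumption about the real record layout.
filler=$(awk 'BEGIN { printf "%11s%s\n", "", "NM X" }')

# Brackets make the leading blanks visible; the length should be 11 + 4.
echo "[$filler] length=${#filler}"
```

A width specifier also makes the number of blanks easy to check against the record layout, which is hard to do by eye with a long run of escaped spaces.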

binlib · November 2, 2009, 10:47am

Is there a pattern for the missing records, e.g. at the end?
Since your output format is the same as the input, do

cmp -l infile outfile

Look for differences that don't look like the ones you intended. The expected difference is that blanks in the input are replaced with fixed values in the output. Try to spot the unintended differences visually (or mechanically).

steadyonabix · November 2, 2009, 11:06am

Another approach is to diff the input and output files and redirect the differences to a file. Then open the file and look to see why the matches in your awk fail for those lines. You can go to the character positions and confirm whether the patterns you are trying to match are what you expect. Good luck

mkastin · November 2, 2009, 11:08am

binlib:

Is there a pattern for the missing records, e.g. at the end?
Since your output format is the same as the input, do
cmp -l infile outfile
Look for differences that don't look like the ones you intended. The expected difference is that blanks in the input are replaced with fixed values in the output. Try to spot the unintended differences visually (or mechanically).

I ran a diff on the files and got over 160,000 lines returned. I couldn't tell from this which lines went missing or whether there was a discernible pattern to them. What I could tell from the diff was that appending the data onto the records I wanted to did work. I don't know whether some of these corrected records disappeared or whether it was other, fully intact lines.
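
When a diff is too noisy because most changed lines are intentional, one alternative is to compare only a record key. The sketch below assumes the first 4 characters identify a record uniquely, as in the sample data; the real file's key position, the file names, and the tiny test records are all hypothetical.

```shell
#!/bin/sh
# Sketch only: tiny stand-in files, with a 4-character key at the start
# of each record (an assumption taken from the sample data).
tmp=$(mktemp -d)
printf '%s\n' '0001 aaa' '0002 bbb' '0003 ccc' '0004 ddd' > "$tmp/infile"
printf '%s\n' '0001 aaa' '0003 ccc XX' '0004 ddd'         > "$tmp/outfile"

# First file read (outfile): remember every key that made it to the output.
# Second file read (infile): print records whose key was never seen.
missing=$(awk 'NR == FNR { seen[substr($0, 1, 4)] = 1; next }
               !(substr($0, 1, 4) in seen)' "$tmp/outfile" "$tmp/infile")

echo "$missing"
rm -rf "$tmp"
```

Because only the key is compared, records that were padded as intended do not show up, so whatever is printed is a record that is genuinely absent from the output.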

Franklin52 · November 2, 2009, 12:31pm

Assuming the length of a record is 77 you can do something like:

awk 'length < 77 {$0 = $0 "0000000000000000000000           NM X"}1' file > newfile
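
A small self-contained check of this length-based approach is sketched below. The records are built with printf so their widths are explicit (40-byte key area, 22-byte data, 15-byte trailer); the layout is taken from the sample data, and the file names and the sprintf-built filler are illustrative assumptions rather than the real file's values.

```shell
#!/bin/sh
# Sketch only: file names and sample records are illustrative.
tmp=$(mktemp -d)

# Two complete 77-byte records and one short 40-byte record that is
# missing its data and trailer.
{
  printf '%-40s%022d%s\n' '0004  Tfer     Harris    04 H84 0985512' 0 'BBE00122933NM X'
  printf '%-40s%022d%s\n' '0011  Stone    Stone     04 H84 0014566' 0 'M12301MANWEOB X'
  printf '%-40s\n'        '0012  Problem  Record    04 H84 0000000'
} > "$tmp/infile"

# Pad every record shorter than 77 bytes with 22 zeros, 11 blanks and
# "NM X"; the trailing 1 prints every record, padded or not.
awk 'length($0) < 77 { $0 = $0 sprintf("%022d%11s%s", 0, "", "NM X") } 1' \
    "$tmp/infile" > "$tmp/outfile"

# Record counts must match, and no output record may have another length.
in_count=$(wc -l < "$tmp/infile" | tr -d ' ')
out_count=$(wc -l < "$tmp/outfile" | tr -d ' ')
bad_len=$(awk 'length($0) != 77' "$tmp/outfile" | wc -l | tr -d ' ')
echo "records in: $in_count, records out: $out_count, wrong length: $bad_len"
rm -rf "$tmp"
```

Comparing the two counts right after the run is a cheap way to confirm that nothing was dropped before looking at individual records.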

binlib · November 2, 2009, 7:36pm

The result of the "cmp -l" looks like:

899  40  60
900  40  60
901  40  60

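The second and third columns are the differing byte values in octal, so 40 is a space and 60 is the character 0. If you are missing records, it is most likely that the differences will not have a 40 (space) in the second field. Thus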

cmp -l infile outfile |awk '$2 != 40' |head

would give the approximate locations of the missing records.
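
To illustrate the filter, here is a sketch on two tiny hypothetical files with 3-character records, where the output has one intended change (blanks replaced by zeros) and one dropped record. The file contents are invented for the demonstration; the extra awk print only normalizes the column spacing, which varies between cmp implementations.

```shell
#!/bin/sh
# Sketch only: invented 3-character records. Record "Bxy" is dropped
# from the output, and the blanks in record "A" are filled with zeros.
tmp=$(mktemp -d)
printf 'A  \nBxy\nCxy\n' > "$tmp/infile"
printf 'A00\nCxy\n'      > "$tmp/outfile"

# cmp -l lists each differing byte as: offset, input byte, output byte
# (byte values in octal). Intended edits show 40 (space) in column 2, so
# anything else points at an unintended difference. The "EOF on ..."
# message cmp writes for files of unequal size goes to stderr.
suspect=$(cmp -l "$tmp/infile" "$tmp/outfile" 2>/dev/null |
          awk '$2 != 40 { print $1, $2, $3 }' | head)

echo "$suspect"
rm -rf "$tmp"
```

The first offset reported is where the files fall out of step; dividing it by the record length (including the newline) gives the approximate record number of the first missing record.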