AWK Multi-Line Records Processing

RacerX · October 11, 2007, 2:44pm

I am an Awk newbie and cannot wrap my brain around my problem:

Given multi-line records of varying lengths, separated by blank lines, I need to skip the first two lines
of every record and extract every other line from the rest. If the first line of a record contains "(CONT)",
the record continues the previous one: skip its second line too, and append its every-other lines to those of the previous record.

I hope that makes some sort of sense. I tried the following awk to get every
other line, but it doesn't come out right, so I haven't even begun to figure out the rest of my problem:
awk '(((NR % 2) == 0) && ( NR > 2 )) {print}' ~/Desktop/datafile

Any programming help would be appreciated! I have provided the following example input data of four records:

CHARGER R M 1972 9 3 3 1 $7,060 1570 INDY 13 $27,717 MICKEY E.& OLGA B.SMITH,VIENNA,NC.
LA72TAUR FORD CHEVY GMC 1.57.00Q DAVID R.MILLER,ALDEN,NY.
MD TEST 0321 1371 OFF OFF OFF SONNY SMITH, KASHMAN, DAVE
FEB 20-98 VLY 1041 2094 8 8 8 8 8 8 NB SMITH GW : - : - :LAPS : 8 : 40 : - : - : 1
GD TEST 0311 1354 3H 4H 2H 0304 VIC, YO HERSHEY, CHARGER
JAN 7-98 VLY 1030 2064 6 3 3 4 4 3 2071 NB SMITH GW : - : - :LAPS : 6 : 40 : - : - : 1
$2,000 MD NW2L5CD 0303 1343 1Q 2 2T 0314 WILD MICK, CHARGER, TEKLA
MAR 9-98 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW : - : - :LAPS : 8 : 36 : - : - : 1
$2,700 GD OPENRUN 0292 1312 2 H 1 0303 CHARGER, HAL THE BARBER, WOLFMAN
MAR 13-98 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SARAMA GW : - : - :LAPS : 7 : 31 : - : - : 2
$2,700 FT NW2L5CD 0294 1320 Q Q H 0300 WHEELS WIN, CHARGER, ROCK
MAR 27-98 VLY $675 1013 2020 4 1 1 1 1 2 2020 *1750 SMITH GW : - : - :LAPS : 8 : 60 : - : - : 2
$3,000 FT OPEN 0291 1301 1 1 3Q 0293 CHARGER, OVERRUN, ROCK
MAY 1-98 VLY $1,500 0594 1594 2 1 1 1 1 1 1594 *9500 SMITH GW : - : - :LAPS : 7 : 70 : - : - : 9
$4,000 FT OPEN 0263 1280 2Q 1Q T 0283 CHARGER, GUARDIAN, TORRE
MAY 9-98 HILL $2,000 0581 1570 1 6 5 4 2 1 1570 *2200 SMITH GW : - : - :LAPS : 7 : 60 : - : - : 8
$4,400 FT WO4000LT 0292 1320 OFF OFF OFF TORRE, ROCK, TY ZOLAK
MAY 15-98 HILL 1003 2011 7 8 8 8 8 8 *265 SMITH GW : - : - :LAPS : 8 : 75 : - : - : 9
$8,000 FT TM1500CND 0290 1294 OFF OFF BOSTON BEEMER, THE CANNON, ZURICH TOYOTA
MAY 21-98 HILL 0593 2010 7 8 8 8 8 8 9550 SMITH GW : - : - :LAPS : 9 : 46 : - : - : 2

SPARKPLUG BLK M 1964 2 5 5 4 $10,534 2001 HILL 5 $14,926 JOHN DOE,TARPORT,DE.
N764CHVY FORD CHEVY GMC 2.00.10F ELMER SMITH,NY,NY.
$2,700 FT NW4L5CD 0294 1320 Q Q H 0300 WHEELS WIN, CHARGER, ROCK
FEB 22-98 VLY $675 1013 2020 4 1 1 1 1 2 2020 *1750 SMITH GW : - : - :LAPS : 8 : 60 : - : - : 3
$2,700 FT NW4L5CD 0291 1311 1H T LT 0294 HAL THE BARBER, CHARGER, MAC
APR 3-98 VLY $675 1001 2011 3 2 2 3 3 2 2011 *1550 SMITH GW : - : - :LAPS : 6 : 45 : - : - : 5

SPARKPLUG (CONT)
N764CHVY
$2,000 MD NW4L5CD 0303 1343 1Q 2 2T 0314 WILD MICK, CHARGER, TEKLA
MAR 8-99 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW : - : - :LAPS : 8 : 36 : - : - : 10
$2,700 GD OPENRUN 0292 1312 2 H 1 0303 CHARGER, HAL THE BARBER, WOLFMAN
MAR 13-99 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SMITH GW : - : - :LAPS : 7 : 31 : - : - : 7
$2,700 FT NW4L5CD 0294 1320 Q Q H 0300 WHEELS WIN, CHARGER, ROCK

DUTCHESS W F 82 21 3 2 4 $10,834 2003 VLY 3 $10,858 TARP INC,VALLEY CITY,CA.
PN82TRCK FORD CHEVY GMC 2.00.30M RICK SMITH,RED CEDAR,ND.
$2,800 MD F-NW2CND 0284 1311 7 8Q CARD SHARK, PHP GIRL, BREEZY BREE
AUG 25-98 RIDC 1011 2011 9 4 3< 3< 7 7 2024 820 MILLER TF : - : - :MILE : 9 : 69 : - : - : 6

awk · October 11, 2007, 4:01pm

Not sure which lines you are wanting to print, but there is a trick with awk. You can reset the value of NR anytime you want.

So a program like this

awk 'NR > 2 && (NR % 2 == 0 ){ print}
/^$/{NR=0}' <textfile> # /^$/ represent the blank line

got me output that looked like this:

FEB 20-98 VLY 1041 2094 8 8 8 8 8 8 NB SMITH GW : - : - :LAPS : 8 : 40 : - : - : 1
JAN 7-98 VLY 1030 2064 6 3 3 4 4 3 2071 NB SMITH GW : - : - :LAPS : 6 : 40 : - : - : 1
MAR 9-98 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW : - : - :LAPS : 8 : 36 : - : - : 1
MAR 13-98 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SARAMA GW : - : - :LAPS : 7 : 31 : - : - : 2
MAR 27-98 VLY $675 1013 2020 4 1 1 1 1 2 2020 *1750 SMITH GW : - : - :LAPS : 8 : 60 : - : - : 2
MAY 1-98 VLY $1,500 0594 1594 2 1 1 1 1 1 1594 *9500 SMITH GW : - : - :LAPS : 7 : 70 : - : - : 9
MAY 9-98 HILL $2,000 0581 1570 1 6 5 4 2 1 1570 *2200 SMITH GW : - : - :LAPS : 7 : 60 : - : - : 8
MAY 15-98 HILL 1003 2011 7 8 8 8 8 8 *265 SMITH GW : - : - :LAPS : 8 : 75 : - : - : 9
MAY 21-98 HILL 0593 2010 7 8 8 8 8 8 9550 SMITH GW : - : - :LAPS : 9 : 46 : - : - : 2
FEB 22-98 VLY $675 1013 2020 4 1 1 1 1 2 2020 *1750 SMITH GW : - : - :LAPS : 8 : 60 : - : - : 3
APR 3-98 VLY $675 1001 2011 3 2 2 3 3 2 2011 *1550 SMITH GW : - : - :LAPS : 6 : 45 : - : - : 5
MAR 8-99 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW : - : - :LAPS : 8 : 36 : - : - : 10
MAR 13-99 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SMITH GW : - : - :LAPS : 7 : 31 : - : - : 7

AUG 25-98 RIDC 1011 2011 9 4 3< 3< 7 7 2024 820 MILLER TF : - : - :MILE : 9 : 69 : - : - : 6
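
A minimal sketch of the same NR-reset trick on a small made-up sample (the A1..B4 lines are placeholders, not the real data). The blank line in the output above appears because the "(CONT)" record has seven lines, so its blank separator lands on an even NR and gets printed. The extra NF test below is an assumption about what is wanted: it skips blank lines so the separator is never printed.

```shell
# Hypothetical sample: a 5-line record, a blank line, then a 4-line record.
out=$(printf 'A1\nA2\nA3\nA4\nA5\n\nB1\nB2\nB3\nB4\n' | awk '
  NR > 2 && NR % 2 == 0 && NF { print }  # lines 4, 6, ... of a record; NF skips blank lines
  /^$/ { NR = 0 }                        # blank line: restart the count for the next record
')
printf '%s\n' "$out"
```

On this sample the fourth line of each record is printed (A4 and B4), and the separator after the 5-line record is suppressed.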

RacerX · October 11, 2007, 6:26pm

Thanks for the reply! Your solution got exactly the lines i wanted to pick out!

I believe i should be able to solve the rest of my problem on my own using some type of regex for "(CONT)" on the first line and an if-else statement.

If i can't figure it out, i'll be back with another question.

Thanks again for your help. That trick with resetting NR is a good one to know, as i was clueless and kept fiddling with the settings for FS and RS, which was getting me nowhere fast. Your solution is simply elegant.
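
One possible shape for the "(CONT)" handling described above, sketched on made-up data and keeping the NR-reset approach. The variable name seen and the choice to print an empty line between unrelated records are assumptions for illustration. A record whose first line contains "(CONT)" simply keeps adding lines to the previous group.

```shell
# Hypothetical sample: ALPHA, its continuation, then an unrelated BETA record.
out=$(printf 'ALPHA hdr\nid\na1\nA2\na3\nA4\n\nALPHA (CONT)\nid\na5\nA6\n\nBETA hdr\nid\nb1\nB2\n' | awk '
  NR == 1 && !/\(CONT\)/ { if (seen++) print "" }  # new group: separate it from the previous one
  NR > 2 && NR % 2 == 0 && NF { print }            # every other line after the two header lines
  /^$/ { NR = 0 }                                  # blank line: restart the per-record count
')
printf '%s\n' "$out"
```

Here A2, A4 and A6 come out as one group, followed by an empty line and then B2.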

RacerX · October 18, 2007, 10:58am

Is there any way to get the info lined up in columns using printf? I've tried a few things but it never seems to come out right; maybe the data is just too funky to get it to line up?

So, given INPUT like:
FEB 20-98 VLY 1041 2094 8 8 8 8 8 8 NB SMITH GW : - : - :LAPS : 8 : 40 : - : - : 1
JAN 7-98 VLY 1030 2064 6 3 3 4 4 3 2071 NB SMITH GW : - : - :LAPS : 6 : 40 : - : - : 1
MAR 9-98 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW : - : - :LAPS : 8 : 36 : - : - : 1
MAR 13-98 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SARAMA GW : - : - :LAPS : 7 : 31 : - : - : 2

Can i get OUTPUT like:

FEB 20-98	VLY			1041	2094	8  8  8  8  8  8		NB    SMITH GW	: - : - :LAPS : 8 : 40 : - : - : 1
JAN 7-98	VLY			1030	2064	6  3  3  4  4  3	2071	NB    SMITH GW	: - : - :LAPS : 6 : 40 : - : - : 1
MAR 9-98	VLY		$500	1024	2060	5  2  3  4  2  2	2063	1900  SMITH GW	: - : - :LAPS : 8 : 36 : - : - : 1
MAR 13-98	VLY		$1,350	1004	2022	2  3  2  2  1  1	2022	1130  SARAMA GW	: - : - :LAPS : 7 : 31 : - : - : 2
MAR 27-98	VLY		$675	1013	2020	4  1  1  1  1  2	2020	*1750 SMITH GW	: - : - :LAPS : 8 : 60 : - : - : 2

awk · October 18, 2007, 11:03am

Instead of print, use printf. It lets you specify a format string, then the data to print. For example,

printf("%-30s", "MY NAME");

will left-justify the value in a 30-character column (the minus sign means left-justify; without it the value is right-justified). If you are a C programmer, it follows C's printf conventions. I suggest looking up the free online (PDF) version of "Effective AWK Programming" by Arnold Robbins for more information.
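
A small sketch of how the minus flag affects justification. The brackets are only there to make the padding visible, and the width of 10 is arbitrary.

```shell
out=$(awk 'BEGIN {
  printf("[%-10s]\n", "MY NAME")   # minus flag: left-justified, padded on the right
  printf("[%10s]\n",  "MY NAME")   # no flag: right-justified, padded on the left
}')
printf '%s\n' "$out"
```

The first line prints the name followed by three spaces inside the brackets; the second prints three spaces and then the name.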

RacerX · October 18, 2007, 3:52pm

I've been reading about, and testing printf options for a while now, and am stuck on how to handle the above situation where one of the fields in a column is blank-whitespace. I tried using printf in the above code, specifically the following line using the first six fields only (i want to format the rest of the fields too, but for testing purposes only tried the first six to show my problem):

NR > 2 && (NR % 2 == 0 ) {printf "%-5s%-8s: %-10s: %-15s: %-10s: %-5s:\n",$1,$2,$3,$4,$5,$6} /^$/{NR=0}

I GET RETURNED:
FEB  20-98   : VLY       : 1041           : 2094      : 8    :
JAN  7-98    : VLY       : 1030           : 2064      : 6    :
MAR  9-98    : VLY       : $500           : 1024      : 2060 :
MAR  13-98   : VLY       : $1,350         : 1004      : 2022 :
MAR  27-98   : VLY       : $675           : 1013      : 2020 :
MAY  1-98    : VLY       : $1,500         : 0594      : 1594 :
MAY  9-98    : HILL      : $2,000         : 0581      : 1570 :
MAY  15-98   : HILL      : 1003           : 2011      : 7    :
MAY  21-98   : HILL      : 0593           : 2010      : 7    :
FEB  22-98   : VLY       : $675           : 1013      : 2020 :
APR  3-98    : VLY       : $675           : 1001      : 2011 :
MAR  8-99    : VLY       : $500           : 1024      : 2060 :
MAR  13-99   : VLY       : $1,350         : 1004      : 2022 :
             :           :                :           :      :
AUG  25-98   : RIDC      : 1011           : 2011      : 9    :

Which messes up which columns go where. So, how can i handle formatting a field that is whitespace?

It should be:

FEB  20-98   : VLY       :                : 1041      : 2094      : 8    :
JAN  7-98    : VLY       :                : 1030      : 2064      : 6    :
MAR  9-98    : VLY       : $500           : 1024      : 2060 :
MAR  13-98   : VLY       : $1,350         : 1004      : 2022 :
MAR  27-98   : VLY       : $675           : 1013      : 2020 :
MAY  1-98    : VLY       : $1,500         : 0594      : 1594 :
MAY  9-98    : HILL      : $2,000         : 0581      : 1570 :
MAY  15-98   : HILL      :                : 1003      : 2011      : 7    :
MAY  21-98   : HILL      :                : 0593      : 2010      : 7    :
FEB  22-98   : VLY       : $675           : 1013      : 2020 :
APR  3-98    : VLY       : $675           : 1001      : 2011 :
MAR  8-99    : VLY       : $500           : 1024      : 2060 :
MAR  13-99   : VLY       : $1,350         : 1004      : 2022 :

AUG  25-98   : RIDC      :                : 1011      : 2011      : 9    :

awk · October 18, 2007, 4:24pm

Yeah, but it is going to get complicated (believe it or not, this has been pretty straightforward).

The problem is that awk cannot recognize a column that is all whitespace. If you used tab separators, you could pass a -F parameter for the tabs, but if it is simply spaces, you have to make a programmatic decision.

For instance, it looks like column 3 is a $ amount (with default whitespace splitting the date takes two fields, so the amount is $4). If that is always true, you can check whether it contains a $ and print it in the right column; if not, you know everything has slid down one.

So you could have two printf statements:
if ($4 ~ /\$/ )
{ print style 1 }
else
{ print style 2 }

As much as I hate to admit it, I would have to try some trial and error to make sure the search for the $ works, since $ is an end-of-line anchor in a regular expression and I was thinking that escaping it was the right idea.
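
The $ check can be tried on its own before building the full format. A rough sketch on two shortened sample lines (only the first few fields, and the column widths are arbitrary): in an awk regular expression a backslash-escaped $ matches a literal dollar sign, and the else branch prints an empty string in the amount column so the later fields stay aligned.

```shell
out=$(printf '%s\n' 'MAR 9-98 VLY $500 1024' 'FEB 20-98 VLY 1041' |
awk '{
  if ($4 ~ /\$/)   # escaped: a literal dollar sign, not the end-of-line anchor
    printf("%-3s %5s %-4s %7s %4s\n", $1, $2, $3, $4, $5)
  else             # no amount: print a blank in that column and shift the rest
    printf("%-3s %5s %-4s %7s %4s\n", $1, $2, $3, "", $4)
}')
printf '%s\n' "$out"
```

Both output lines end with their four-digit value in the same columns, whether or not the amount is present.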

RacerX · October 18, 2007, 6:24pm

I was afraid it might be something like that, and unfortunately all the fields are separated by spaces, not tabs. Furthermore, while column 3 has the dollar sign because it is a currency amount, there are other columns where there is nothing like the dollar sign to test for.

I guess i might have to try something like:

BEGIN { OFS = ":" }
{
  col1 = substr($0, 1, 9)
  col2 = substr($0, 10, 8)
  col3 = substr($0, 21, 10)
  # etc....
}

Thanks for the help you have given me up to this point. I could not have gotten this far without it.
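
A sketch of the substr() idea, assuming the real file keeps every column at a fixed character position (the spacing as pasted in this thread may not reflect that). The column positions, the sample lines and the trim() helper are all made up for illustration.

```shell
# Hypothetical fixed-width layout: date in columns 1-10, track in 11-16,
# purse in 17-24, and the next value from column 25 on.
out=$(printf '%-10s%-6s%-8s%s\n' \
        'FEB 20-98' 'VLY' ''     '1041' \
        'MAR 9-98'  'VLY' '$500' '1024' |
awk '
  function trim(s) { sub(/ +$/, "", s); return s }   # drop trailing padding
  BEGIN { OFS = ":" }
  {
    date  = trim(substr($0,  1, 10))
    track = trim(substr($0, 11,  6))
    purse = trim(substr($0, 17,  8))   # empty when the column is blank
    rest  = substr($0, 25)
    print date, track, purse, rest
  }
')
printf '%s\n' "$out"
```

Because each value is cut by position instead of by whitespace, a blank purse column comes out as an empty field between two colons and nothing slides over.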

awk · October 18, 2007, 6:46pm

Try something like this. I ran it against the data above. I don't know if I can get this font to be non-proportional, but try running it against the data you showed above.

awk -F: '! /^$/{
    # $1 is everything before the first colon; split it on whitespace
    nf=split($1, X, " ")
    if (X[4] ~ /\$/)
    {
        # amount present: print the fields as they are
        printf("%s %5s %-6s %7s %4s %4s %3s %3s %3s %3s %3s %3s %5s %5s %10s %2s", X[1], X[2], X[3], X[4], X[5], X[6],
               X[7], X[8], X[9], X[10], X[11], X[12], X[13], X[14], X[15], X[16] )
    }
    else
    {
        # no amount: print a blank in its column and shift the rest down one
        printf("%s %5s %-6s %-7s %4s %4s %3s %3s %3s %3s %3s %3s %5s %5s %10s %2s", X[1], X[2], X[3], " " , X[4], X[5], X[6],
               X[7], X[8], X[9], X[10], X[11], X[12], X[13], X[14], X[15])
    }

    # the colon-separated fields after the first one
    printf("%3s %3s %5s %3s %3s %3s %3s %3s\n", $2, $3, $4, $5, $6, $7, $8, $9)
}'

Not sure what you are expecting, but I got this - still see some errors to clean up.

FEB 20-98 VLY 1041 2094 8 8 8 8 8 8 NB SMITH GW - - LAPS 8 40 - - 1
JAN 7-98 VLY 1030 2064 6 3 3 4 4 3 2071 NB SMITH GW - - LAPS 6 40 - - 1
MAR 9-98 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW - - LAPS 8 36 - - 1
MAR 13-98 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SARAMA GW - - LAPS 7 31 - - 2
MAR 27-98 VLY $675 1013 2020 4 1 1 1 1 2 2020 *1750 SMITH GW - - LAPS 8 60 - - 2
MAY 1-98 VLY $1,500 0594 1594 2 1 1 1 1 1 1594 *9500 SMITH GW - - LAPS 7 70 - - 9
MAY 9-98 HILL $2,000 0581 1570 1 6 5 4 2 1 1570 *2200 SMITH GW - - LAPS 7 60 - - 8
MAY 15-98 HILL 1003 2011 7 8 8 8 8 8 *265 SMITH GW - - LAPS 8 75 - - 9
MAY 21-98 HILL 0593 2010 7 8 8 8 8 8 9550 SMITH GW - - LAPS 9 46 - - 2
FEB 22-98 VLY $675 1013 2020 4 1 1 1 1 2 2020 *1750 SMITH GW - - LAPS 8 60 - - 3
APR 3-98 VLY $675 1001 2011 3 2 2 3 3 2 2011 *1550 SMITH GW - - LAPS 6 45 - - 5
MAR 8-99 VLY $500 1024 2060 5 2 3 4 2 2 2063 1900 SMITH GW - - LAPS 8 36 - - 10
MAR 13-99 VLY $1,350 1004 2022 2 3 2 2 1 1 2022 1130 SMITH GW - - LAPS 7 31 - - 7
AUG 25-98 RIDC 1011 2011 9 4 3< 3< 7 7 2024 820 MILLER TF - - MILE 9 69 - - 6

RacerX · October 18, 2007, 8:09pm

That is pretty darn close to working!

For me on my machine it returned:

FEB 20-98 VLY            1041 2094   8   8   8   8   8   8    NB SMITH         GW    -   -  LAPS   8   40   -   -    1
JAN  7-98 VLY            1030 2064   6   3   3   4   4   3  2071    NB      SMITH GW -   -  LAPS   6   40   -   -    1
MAR  9-98 VLY       $500 1024 2060   5   2   3   4   2   2  2063  1900      SMITH GW -   -  LAPS   8   36   -   -    1
MAR 13-98 VLY     $1,350 1004 2022   2   3   2   2   1   1  2022  1130     SARAMA GW -   -  LAPS   7   31   -   -    2

The only trip-up was the first record, where there was a blank/no value after the last 8, so the NB SMITH etc. should be over one column. With what you have given me i should be able to work that out.

If not, i'll be back with more questions. Thanks again for all your help and direction.

futurelet · October 18, 2007, 8:46pm

BEGIN {
  RS = "" ; FS = "\n"
  split( "%s|%7s| : %-9s| : %9s| : %s" \
         "| : %s| : %s| : %s", format, "|" )
}

{ if ( $1 !~ /\(CONT\)/ )
    print ""
  for (i=4; i<=NF; i+=2 )
    process( $i )
}

function process( line,       a, i )
{ split( line, a, " " )
  if ( length(a)==29)
    insert( a, 12, "" )
  if ( length(a) == 30 )
    insert( a, 4, "" )
  for (i=1; i<=length(format); i++)
    printf format[i], a[i]
  print ""
}

function insert( a, where, what,      i )
{ for ( i = length(a); i >= where; i-- )
    a[i+1] = a[i]
  a[where] = what
}
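
The script above relies on awk's paragraph mode: with RS set to the empty string, each blank-line-separated block is one record, and with FS set to a newline each line of the block is one field. A minimal sketch of just that mechanism, on made-up input:

```shell
# Hypothetical sample: a 4-line record and a 3-line record.
out=$(printf 'h1\nh2\nx\ny\n\nh3\nh4\nz\n' | awk '
  BEGIN { RS = ""; FS = "\n" }                            # paragraph mode, one field per line
  { printf "%d: %d lines, first=%s\n", NR, NF, $1 }
')
printf '%s\n' "$out"
```

Here NR counts records and NF counts the lines in each one, which is why the loop in the script above can start at field 4 and step by 2 to pick every other line after the two header lines.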