awk: switching lines and concatenating lines?

Borghal · March 28, 2010, 9:08am

Hello, I have only recently started with awk and need help writing the following.
My input consists of lines that start with a key of a few letters, followed by a space and then a number and various other characters:

fiRcQ 9( [various data ])
klsRo 9( [various data ]) pause
fiRcQ 9( [various data ]) pause
klsRo continue 1
aPLnJ 62( [various data ])
fiRcQ continue 5
... and so on

I want output where each "pause" line is followed, on the same line, by the "continue" line with the same key identifier:

fiRcQ 9( [various data ])
klsRo 9( [various data ]) pause klsRo continue 1
fiRcQ 9( [various data ]) pause fiRcQ continue 5
aPLnJ 62( [various data ])

So the algorithm would be something along the lines of:

start on line = 1
search for "pause" [else increase line by 1 and repeat]
 next line, search for "continue" [else move to next line]
  compare /^[a-zA-Z]*/ from starting line with the same regexp on the current line [else move to next line]
    in case of match, take current line, add it to starting line and delete current line
    increase line by 1 and repeat.

I know what I want to do, but being a total newbie in awk, I have no idea what the syntax would look like.

malcomex999 · March 28, 2010, 9:54am

If the order of the lines in output doesn't matter, this will do what you want...

awk '/pause/ || /continue/{arr[$1]=arr[$1]" "$0;next}
{arr[$0]=arr[$0]" "$0}END{for(i in arr) print arr[i]}' infile

Borghal · March 28, 2010, 1:01pm

Well... what I meant was: when it finds a line with "pause", it should search the following lines, one after another from the top down, for a line with a matching first column (letter-id) that contains "continue". It should then take the "continue" line and MOVE it to the end of the "pause" line, keeping all the characters of both lines, so that instead of two separate lines you get one longer line in place of the first.

The input is already ordered so that, although more than two lines can have the same id, if a line with an id contains "pause", then the next line with the same id will contain "continue", in the same way that interrupted processes work.

EDIT: Could you please comment on how your piece of code works? I don't really understand it.

Franklin52 · March 28, 2010, 1:54pm

Something like this?

awk '/pause/{ a[$1]=$0;next } a[$1]{ print a[$1] FS $0;delete a[$1];next } 1' file

alister · March 28, 2010, 2:03pm

Hello, Borghal:

Welcome to the forums. I modified your sample data to include a pause-continue pair that reuses a key used by a previous pause-continue pair.

$ cat data
fiRcQ 9( [various data ])
klsRo 9( [various data ]) pause
fiRcQ 9( [various data ]) pause
klsRo continue 1
aPLnJ 62( [various data ])
fiRcQ continue 5
klsRo 9( [various data ]) pause
klsRo continue 21

$ awk 'NR==FNR {if ($2=="continue") c[$1,++c[$1,"i"]]=$0; next} $NF=="pause" {print $0,c[$1,++p[$1]]; next} $2!="continue"' data data
fiRcQ 9( [various data ])
klsRo 9( [various data ]) pause klsRo continue 1
fiRcQ 9( [various data ]) pause fiRcQ continue 5
aPLnJ 62( [various data ])
klsRo 9( [various data ]) pause klsRo continue 21

Regards,
Alister

If you need to strictly preserve the order, I would recommend my solution over Franklin52's. If not, then most definitely use Franklin52's as it's simpler and could be significantly faster (since mine must read the data twice).
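
For reference, here is a minimal sketch of the two-pass idea in expanded, commented form. It is a restatement for readability, not the exact one-liner above: the wrapper name merge_two_pass and the array names cont, seen and used are made up for illustration, and it assumes the data sits in a regular file that awk can open twice.

```shell
# Hypothetical wrapper; "$1" is the data file, which awk reads twice.
merge_two_pass() {
  awk '
    # Pass 1: NR == FNR holds only while the first file argument is read.
    # Store each continue line under its key plus a per-key sequence number.
    NR == FNR {
      if ($2 == "continue") cont[$1, ++seen[$1]] = $0
      next
    }
    # Pass 2: print a pause line with the next unused continue line for its key.
    $NF == "pause" { print $0, cont[$1, ++used[$1]]; next }
    # Continue lines were attached above, so drop them and print everything else.
    $2 != "continue"
  ' "$1" "$1"
}

# Sample data from this post, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'fiRcQ 9( [various data ])' \
  'klsRo 9( [various data ]) pause' \
  'fiRcQ 9( [various data ]) pause' \
  'klsRo continue 1' \
  'aPLnJ 62( [various data ])' \
  'fiRcQ continue 5' \
  'klsRo 9( [various data ]) pause' \
  'klsRo continue 21' > "$tmp"
two_pass_out=$(merge_two_pass "$tmp")
rm -f "$tmp"
printf '%s\n' "$two_pass_out"
```

Because the file name is passed twice, this form cannot take its input from a pipe.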

Alister

Borghal · March 28, 2010, 3:02pm

Thanks, everyone... unfortunately, I do need to preserve the order of the input file, but as I'm using it in a filter cat | awk | grep | sed ..., reading the data twice is not an option I think.

This is proving more difficult than I thought it would be.

Could someone please explain what this line does?
(It's supposed to do what I want, but all it does is delete both of the lines concerned.)

awk '/pause$/ {array[$1$2]=$0; next}/continue/ {if($1$2 in array) print array[$1$2] $0; delete array[$0]; next}{ print $0 }'

First it looks for a line with pause, then assigns the line it finds to array[$1$2] (why $1$2?) then it goes on to the next line to start a search for continue?

Sorry, I must look really dumb, but I can't find any good tutorial that would help me understand how it works...

alister · March 28, 2010, 3:38pm

For the best help possible, you should post the entire pipeline (your filter).

Your awk code appears to delete lines that end in "pause" or contain "continue" because when a continue line is found, and you look up $1$2 in the array, there is never a match. $1$2 for the continue line is equal to a key value followed by the word "continue". None of the pause lines will match that as the second field begins with a number. You need to key on $1 alone, not $1 and $2.
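
A quick way to see the mismatch is to print both candidate keys for a pause line and its continue line. This is only an illustrative sketch using two lines from the sample data; the variable name keys is arbitrary.

```shell
# Print the concatenated $1$2 key next to the plain $1 key for each line.
keys=$(printf '%s\n' \
  'klsRo 9( [various data ]) pause' \
  'klsRo continue 1' |
  awk '{ print $1 $2 " vs " $1 }')
printf '%s\n' "$keys"
```

The $1$2 values differ between the two lines (klsRo9( and klsRocontinue), while $1 is klsRo on both, so only $1 can link a continue line back to its pause line.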

Tweaking your code:

awk '/pause$/ {array[$1]=$0; next}/continue/ {if($1 in array) print array[$1] FS $0; delete array[$1]; next}{ print $0 }'

... but that's just a more verbose version of Franklin52's solution above.

This approach will reorder the lines a bit, because while the array holds a pause line until its continue is found, other lines may print, effectively moving a pause line further down the sequence.
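
If the original order has to survive and the data arrives on a pipe, one possible workaround is to buffer every line and print only at the end. The following is a rough sketch under stated assumptions, not a tested drop-in for your filter: it assumes the whole input fits in memory, that "pause" is the last field and "continue" the second field as in the sample data, and the names merge_pause_continue, line and pending are invented here.

```shell
# Hypothetical helper; reads stdin once and keeps all lines in memory.
merge_pause_continue() {
  awk '
    # A continue line whose key has a pending pause: append it to that stored line.
    $2 == "continue" && ($1 in pending) {
      line[pending[$1]] = line[pending[$1]] " " $0
      delete pending[$1]
      next
    }
    # Every other line is stored in input order.
    { line[++n] = $0 }
    # Remember where the latest unmatched pause line for this key was stored.
    $NF == "pause" { pending[$1] = n }
    # Emit everything in the original order once the input is exhausted.
    END { for (i = 1; i <= n; i++) print line[i] }
  '
}

out=$(printf '%s\n' \
  'fiRcQ 9( [various data ])' \
  'klsRo 9( [various data ]) pause' \
  'fiRcQ 9( [various data ]) pause' \
  'klsRo continue 1' \
  'aPLnJ 62( [various data ])' \
  'fiRcQ continue 5' \
  'klsRo 9( [various data ]) pause' \
  'klsRo continue 21' | merge_pause_continue)
printf '%s\n' "$out"
```

The trade-off is that nothing is printed until the input ends, so the later stages of the pipeline see no output before then. A continue line with no pending pause is kept as an ordinary line rather than dropped.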

Alister

Borghal · March 28, 2010, 4:08pm

Thank you very much, Alister, now it does exactly what I needed and I think I understand better how it works.