Merge multiple lines into a single line

Hi all, I'm relatively new to scripting; I can do pretty basic things. I have a daily log file that looks like:

timestamp=2017-06-28-01.01.35.080576;
     event status=0;
     userid=user1;
   authid=user1;
     application id=10.10.10.10.11111.12345678901;
     application name=GUI;

  
timestamp=2017-06-28-01.01.36.096486;
     event status=0;
     userid=user1;
     authid=user1;
     application id=10.10.10.10.11111.12345678901;
     application name=GUI;
     statement text=SELECT table.field, table.field, table.field from database where table.field = value

There is a blank line between each log entry. I need to combine each log entry into one line and preferably remove the final semicolon if there is one. I'm guessing that awk or sed could help. I've asked elsewhere but still no luck; hopefully someone can help me out here.

With sed (multi-liner with comments)

sed '
#loop
:L
# last line? Then jump to E (leave the loop)
$bE
# append next line to the line buffer with \n in between
N
# empty line? If not, jump back to L
/\n *$/!bL
:E
# replace all embedded \n and the surrounding space with one space
s/ *\n */ /g
# delete leading space
s/^ *//
# delete a trailing ; and following space
s/; *$//
# default print of line buffer
' logfile

Compare this practical exercise with man sed! You might learn something.
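
For reference: run against the sample in post #1 (and assuming the file has plain Unix line endings), the script above should produce something like

timestamp=2017-06-28-01.01.35.080576; event status=0; userid=user1; authid=user1; application id=10.10.10.10.11111.12345678901; application name=GUI
timestamp=2017-06-28-01.01.36.096486; event status=0; userid=user1; authid=user1; application id=10.10.10.10.11111.12345678901; application name=GUI; statement text=SELECT table.field, table.field, table.field from database where table.field = value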


Try also

awk 'gsub (/;/, "\t") && $1 = $1' RS= file
timestamp=2017-06-28-01.01.35.080576	 event status=0	 userid=user1	 authid=user1	 application id=10.10.10.10.11111.12345678901	 application name=GUI	
timestamp=2017-06-28-01.01.36.096486	 event status=0	 userid=user1	 authid=user1	 application id=10.10.10.10.11111.12345678901	 application name=GUI	 statement text=SELECT table.field, table.field, table.field from database where table.field = value
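
A note for readers new to awk: assigning an empty string to RS ("paragraph mode") makes each blank-line-separated block a single record, with newlines acting as additional field separators, and the assignment $1 = $1 forces the record to be rebuilt with single spaces between the fields. A minimal sketch of just that mechanism, on a hypothetical two-record input (not the full solution):

printf 'a=1;\n b=2;\n\nc=3;\n' | awk '$1 = $1' RS=
a=1; b=2;
c=3;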

Hello MadeInGermany,

Nice code, sir.

Hello dwdnet,

As you haven't shown the expected sample output, this is based only on your explanation of the problem; let me know if the following helps you.

awk '{$1=$1;sub(/;$/,"");print}' RS=   Input_file

Hello Rudi sir,

Nice code; I think the only thing that could be improved is that the OP needs the last semicolon removed from each line.

Thanks,
R. Singh


Thanks for pointing this out; I misread it in post #1. Try

awk 'sub (/;$/, "")+1 && $1=$1' RS= file
timestamp=2017-06-28-01.01.35.080576; event status=0; userid=user1; authid=user1; application id=10.10.10.10.11111.12345678901; application name=GUI
timestamp=2017-06-28-01.01.36.096486; event status=0; userid=user1; authid=user1; application id=10.10.10.10.11111.12345678901; application name=GUI; statement text=SELECT table.field, table.field, table.field from database where table.field = value
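
A note on the +1 for anyone following along: sub() returns the number of substitutions it made, so without the +1 a record that happens not to end in a semicolon would make the pattern false and be silently dropped. A tiny illustration on hypothetical input:

printf 'a=1;\n\nb=2\n' | awk 'sub (/;$/, "") && $1=$1' RS=       # prints only a=1
printf 'a=1;\n\nb=2\n' | awk 'sub (/;$/, "")+1 && $1=$1' RS=     # prints a=1 and b=2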

Thanks for the quick response, guys. I'm not seeing the same results as you are. The only line that I am seeing is the very last 'statement text' line.

statement text=SELECT table.field, table.field, table.field from database where table.field = value

I'm running Debian Stretch. Could it be that the environment is causing this?

One might guess that the file you are processing has DOS line terminators (<carriage-return><newline>) instead of the expected UNIX line terminators (<newline>) and that the last line in your output is overwriting earlier lines in your output because of the embedded <carriage-return> characters.

You didn't mention which code suggestion you are using. Please try it again and pipe the output produced through od as shown below and show us the output (in CODE tags):

current command you are using | od -bc

If that shows some carriage returns in the output (displayed as \r ) we can suggest ways to get around that problem if you tell us which code you are using.
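
For reference, if carriage returns do show up, one common workaround (not specific to any of the suggestions above) is to strip them before any further processing, e.g.

tr -d '\r' < infile > infile.unix

or, on systems that have it, dos2unix infile. The file names here are just placeholders.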


Thanks Don. I used the command line below. Right behind each semi-colon is a \r followed by what looks like a space.

awk 'sub (/;$/, "")+1 && $1=$1' RS= infile
0000000 357 273 277 164 151 155 145 163 164 141 155 160 075 062 060 061
        357 273 277   t   i   m   e   s   t   a   m   p   =   2   0   1
0000020 067 055 060 066 055 062 070 055 060 061 056 060 061 056 063 065
          7   -   0   6   -   2   8   -   0   1   .   0   1   .   3   5
0000040 056 060 070 060 065 067 066 073 015 040 145 166 145 156 164 040
          .   0   8   0   5   7   6   ;  \r       e   v   e   n   t    
0000060 163 164 141 164 165 163 075 060 073 015 040 165 163 145 162 151
          s   t   a   t   u   s   =   0   ;  \r       u   s   e   r   i
0000100 144 075 165 163 145 162 061 073 015 040 141 165 164 150 151 144
          d   =   u   s   e   r   1   ;  \r       a   u   t   h   i   d
0000120 075 165 163 145 162 061 073 015 040 141 160 160 154 151 143 141
          =   u   s   e   r   1   ;  \r       a   p   p   l   i   c   a
0000140 164 151 157 156 040 151 144 075 061 060 056 061 060 056 061 060
          t   i   o   n       i   d   =   1   0   .   1   0   .   1   0
0000160 056 061 060 056 061 061 061 061 061 056 061 062 063 064 065 066
          .   1   0   .   1   1   1   1   1   .   1   2   3   4   5   6
0000200 067 070 071 060 061 073 015 040 141 160 160 154 151 143 141 164
          7   8   9   0   1   ;  \r       a   p   p   l   i   c   a   t
0000220 151 157 156 040 156 141 155 145 075 107 125 111 073 015 040 015
          i   o   n       n   a   m   e   =   G   U   I   ;  \r      \r
0000240 040 164 151 155 145 163 164 141 155 160 075 062 060 061 067 055
              t   i   m   e   s   t   a   m   p   =   2   0   1   7   -
0000260 060 066 055 062 070 055 060 061 056 060 061 056 063 066 056 060
          0   6   -   2   8   -   0   1   .   0   1   .   3   6   .   0
0000300 071 066 064 070 066 073 015 040 145 166 145 156 164 040 163 164
          9   6   4   8   6   ;  \r       e   v   e   n   t       s   t
0000320 141 164 165 163 075 060 073 015 040 165 163 145 162 151 144 075
          a   t   u   s   =   0   ;  \r       u   s   e   r   i   d   =
0000340 165 163 145 162 061 073 015 040 141 165 164 150 151 144 075 165
          u   s   e   r   1   ;  \r       a   u   t   h   i   d   =   u
0000360 163 145 162 061 073 015 040 141 160 160 154 151 143 141 164 151
          s   e   r   1   ;  \r       a   p   p   l   i   c   a   t   i
0000400 157 156 040 151 144 075 061 060 056 061 060 056 061 060 056 061
          o   n       i   d   =   1   0   .   1   0   .   1   0   .   1
0000420 060 056 061 061 061 061 061 056 061 062 063 064 065 066 067 070
          0   .   1   1   1   1   1   .   1   2   3   4   5   6   7   8
0000440 071 060 061 073 015 040 141 160 160 154 151 143 141 164 151 157
          9   0   1   ;  \r       a   p   p   l   i   c   a   t   i   o
0000460 156 040 156 141 155 145 075 107 125 111 073 015 040 163 164 141
          n       n   a   m   e   =   G   U   I   ;  \r       s   t   a
0000500 164 145 155 145 156 164 040 164 145 170 164 075 123 105 114 105
          t   e   m   e   n   t       t   e   x   t   =   S   E   L   E
0000520 103 124 040 164 141 142 154 145 056 146 151 145 154 144 054 040
          C   T       t   a   b   l   e   .   f   i   e   l   d   ,    
0000540 164 141 142 154 145 056 146 151 145 154 144 054 040 164 141 142
          t   a   b   l   e   .   f   i   e   l   d   ,       t   a   b
0000560 154 145 056 146 151 145 154 144 040 146 162 157 155 040 144 141
          l   e   .   f   i   e   l   d       f   r   o   m       d   a
0000600 164 141 142 141 163 145 040 167 150 145 162 145 040 164 141 142
          t   a   b   a   s   e       w   h   e   r   e       t   a   b
0000620 154 145 056 146 151 145 154 144 040 075 040 166 141 154 165 145
          l   e   .   f   i   e   l   d       =       v   a   l   u   e
0000640 015 040 015 012
         \r      \r  \n
0000644

Try

awk 'gsub (";$|\r", "")+1 && $1=$1' RS= file

Thanks RudiC. We are getting closer thanks to you all. The last command put everything on one line. The events should all begin with the timestamp field, so by using the example I provided there should be 2 output lines.

You seem not to have any <line feed> chars in your file except at the very end. So mayhap the condition you quoted in post #1 - an empty line between records - can't be met with non-*nix files. Is that a Mac file?


We haven't seen enough information to determine whether or not there are <newline> (or <linefeed>) characters in the input file. Using RS= in awk to set the record separator to sequences of blank lines will not treat a line containing a <carriage-return> as a blank line. So, awk won't find any record separators in a DOS input file. (And, if you remove the <carriage-return> characters in the code, that will happen AFTER the record separator search has already occurred.)

I would start by trying:

awk '
# p(): print the record accumulated in o (if any), minus a trailing ";"
function p() {
	if(d) {
		sub(/;$/, "", o)
		print o
		o = ""
		d = 0
	}
}
# strip the carriage return left by DOS line endings
{	sub(/\r/, "")
}
# a blank line ends the current record
!NF {	p()
	next
}
# otherwise squeeze the whitespace and append the line to the accumulator
{	$1 = $1
	o = o (o == "" ? "" : " ") $0
	d = 1
}
END {	p()
}' file

If that doesn't work, please show us the output from:

od -bc file

where file is the name of your input file.


Hmmm - in the od output in post #8, I find many \r but just one \n. So my question stands... I only now see the three introductory bytes 357 273 277 (which look like a UTF-8 byte order mark) - I need to double-check their meaning...


I think the od dump is of your first awk's output, which was all on one line.
As Don said, RS= will not treat a \n\r\n sequence as a record separator, but in GNU awk you can set RS="\n\r?\n", or even RS="\n[[:space:]]*\n" to allow any whitespace (including \r and further \n) between records.
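
A sketch of that idea, assuming GNU awk (gawk), which treats a multi-character RS as a regular expression (untested against the real log):

gawk '{ gsub(/\r/, ""); $1 = $1; sub(/; *$/, ""); print }' RS='\n[[:space:]]*\n' file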


Don, that definitely worked for all but a small portion of the log. That really helps. In a copy of the actual log, I ran the code you gave me and it did merge most lines.

I ran the od -bc command against it and I am seeing \n in some lines in the statement text= field. So when I view the log, some of the statement text= field content is still on its own line.

It sounds like \n means a newline. If that is the case, would it be possible to remove a \n when it does not precede the timestamp= field?

You said originally that blank lines separate records. From what you have said in post #15 it sounds like you have some blank lines in the middle of some records. It isn't clear how you want the data represented by those blank input lines to appear in your output.

Please show us a couple of sample input records (in CODE tags) that do not work with the awk code suggested in post #12, show us the output that code is currently producing (in CODE tags), and show us the output that you want to produce from those input records (also in CODE tags).
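
In the meantime, here is a rough sketch of the idea from post #15 - start a new output line only when an input line begins with timestamp=, and join everything else (including the pieces produced by embedded line feeds) onto the current record. It assumes every record really does start with that keyword and is untested against the real log:

awk '
{	gsub(/\r/, "") }			# drop any DOS carriage returns
/^ *timestamp=/ {			# a new record starts here; flush the old one
	if (o != "") { sub(/; *$/, "", o); print o }
	o = ""
}
NF {	$1 = $1				# squeeze whitespace
	o = o (o == "" ? "" : " ") $0	# append the line to the current record
}
END {	if (o != "") { sub(/; *$/, "", o); print o } }
' file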


I'll have to create a mock-up of a sample of the records, as I can't put up live data. Hopefully that will suffice. Thanks.

---------- Post updated at 01:58 PM ---------- Previous update was at 11:29 AM ----------

I'm not sure this will work. Some of the statement text= lines are longer than 4096 characters. Quite complex database statements. I'll have to try and approach this from a different angle. I sincerely appreciate all the help though. Thanks, you guys are great.

If the input is a text file, you don't care that the output might not be a text file (due to line length limitations), you can clearly describe the rules for combining input lines into output lines, and you can provide representative (not actual) sample input and corresponding sample output, then the fact that some of the output lines are long should not be a problem.

Some versions of the awk utility can't write more than 4K bytes at a time, but all can write a single output line by writing several partial lines, as long as each segment of the line being written is less than 4K bytes.
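
For illustration, a sketch of that partial-write technique (the chunk size is arbitrary, just kept below the assumed 4K limit):

awk '{
	n = length($0)
	for (i = 1; i <= n; i += 2048)
		printf "%s", substr($0, i, 2048)	# emit the record in 2K pieces
	print ""				# then terminate the output line
}' file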

Are you saying that your input didn't have any blank lines in the middle of any records, but instead had some records that were more than 4K bytes long?


Some of the records are longer than 4K. There are no blank lines in the long record; however, there are some line feeds, which is what is breaking the record across different lines. I am able to manually edit the text file and remove the LFs, though.

You can, of course, edit the files manually. Or, you can show us representative sample input and the corresponding output you want from that input and we might then be able to help you modify the code shown in post #12 to get what you want programmatically.