Log4j combining lines to single line

Our log4j file contents look like this:

2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Executing command(queryId=hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c181): CREATE TABLE RESULTS.E7014485_ALL_HMS_CAP1
 AS SELECT name,dept
 from employee
  Where employee='Jeff'
2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Query ID = hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c181
2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Executing command(queryId=hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c182): CREATE TABLE RESULTS.E7014485_ALL_HMS_CAP2
 AS SELECT name,dept
 from employee
  Where employee='Yung'
2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Query ID = hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c182

As you can see, the CREATE statement is spread across several lines, and the number of lines can vary.
I need to have only one line per entry.
My output should look like this:

2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Executing command(queryId=hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c181): CREATE TABLE RESULTS.E7014485_ALL_HMS_CAP1 AS SELECT name,dept from employee  Where employee='Jeff'
2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Query ID = hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c181
2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Executing command(queryId=hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c182): CREATE TABLE RESULTS.E7014485_ALL_HMS_CAP2 AS SELECT name,dept from employee  Where employee='Yung'
2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver: Query ID = hive_20181120000656_49af4ad0-1d37-4312-872c-a247ed80c182

Any idea on how to achieve this?

I tried sed with some regex patterns, but was unable to make it work.


Where and how did you get stuck with your "sed and some regex patterns" attempt?
And, what OS, shell, sed versions are you using?

The idea is: for any line which does not start with a date, replace the first character with a backspace.
So I tried the command below:

sed '/^[[:digit:]]\{4\}-[[:digit:]]\{2\}-[[:digit:]]\{2\}T[[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\},[[:digit:]]\{3\}\ [[:alpha:]]*/! s/^/^\b/g' logfile.txt

But the backspace is not working; maybe the character is wrong, or I need to try another way.

Shell: Bash

Indeed. When you work with sed, especially when you are about to do rather complex things, it pays to first define as exactly as possible what you are going to do. So the first step is to describe (in as excruciating detail as possible) what we are going to do and when. If my assumptions in the following are wrong, don't hesitate to correct them.

We want to rearrange the line endings, so that lines only start with a "clause" of this type:

2018-11-20T00:06:58,888  INFO [HiveServer2-Background-Pool: Thread-21912] ql.Driver:

Question: might this clause also be spread over several lines? If yes, we need to do more work; for now I assume it isn't.

What do we need to do when we encounter such a clause? We need to start collecting text until we hit another such clause (or the end of file), which is when we need to output everything collected so far in one line. For all the other lines we encounter this means: they must be part of such a previous line and we simply add them to what we have collected already. Now let us formalise this into rules for what we do and when:

Lines starting with the clause:
          - remove the newlines from the last line if there is one
          - output the last line if there is one
          - clear the collecting buffer
          - put the new line into the collecting buffer
          - if this is also the last line of the file, output it as well
EOF, last line (not starting with the clause):
          - add it to what we have collected so far
          - remove the newlines from the currently collected line
          - output that line
other lines:
          - put the line into the collecting buffer

This is already the very structure of our sed-script, because sed works rule-based. Furthermore, sed has exactly what we need for this: the "hold space". This is the collecting buffer we will need. I suggest you sit down with the man page and read what it does and how it is manipulated.
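
To get a feeling for the hold space before reading the real script, here is a minimal sketch. The function name collect_all is made up for illustration; it appends every input line to the hold space with H and, on the last line, swaps the collected text back with x, replaces the embedded newlines with blanks and prints the result. Note the leading blank in the result: H always puts a newline in front of what it appends, even when the hold space is still empty.

```shell
# collect_all: hypothetical helper name, only for demonstration
collect_all() {
    # H  append the current line to the hold space (preceded by a newline)
    # $  address matching the last input line only
    # x  exchange hold space and pattern space
    sed -n 'H; ${x; s/\n/ /g; p;}' "$@"
}

# prints " one two three" (with a leading blank)
printf '%s\n' one two three | collect_all
```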

Let us start coding. We need a regexp to express what I called "clause" above. I will do it, but you probably want to refine it because you know your data better than I do. For example, I coded the month and day as "[0-9][0-9]" because I supposed dates will be written "2018-03-04", but maybe they are not and it would be "2018-3-4", in which case you will have to correct the regexp. The commentary is just there for you to better understand; sed does accept "#" comments, but some implementations are picky about where they may appear, so strip them if yours complains:

# Run with: sed -n -f join.sed logfile.txt   (join.sed is just an example file name)
/^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9].*\[.*\] ql\.Driver:/ {
     # exchange hold space and pattern space, this clears the collecting buffer and puts
     # the new line there, we work from now on with the last line collected so far
     x
     # replace newlines with blanks
     s/\n/ /g
     # print the line finally, unless it is empty (nothing was collected before line 1)
     /./ p
     # if the new line is also the last line of the file, fetch it back and print it too
     $ {
          g
          p
     }
     # and go to end of script/start with the next line
     b end
}
$ {
     # add this line to the hold space
     H
     # exchange hold space and pattern space, we have what we collected in pattern space again
     x
     # replace newlines with blanks and print the line finally
     s/\n/ /g
     p
     b end
}
# here we land only with all "other" lines not covered by above rules
# add this line to the hold space and delete it from pattern space, we do not want to print it
H
d
# here we land when we execute the b-commands
:end
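
To try the approach quickly without a script file, the same rules can be wrapped in a shell function. This is only a sketch under a few assumptions: the function name join_entries is made up for illustration, the date pattern is shortened to the leading timestamp, and the newlines are deleted rather than replaced with blanks, because the continuation lines in the sample already start with a blank.

```shell
# join_entries: hypothetical helper name; reads the named files or stdin
join_entries() {
    sed -n '
        # a line starting with a timestamp begins a new entry
        /^20[0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/ {
            # new line goes to the hold space, previous entry comes back
            x
            # join the previous entry; continuation lines bring their own blank
            s/\n//g
            # print it unless it is empty (first line of the file)
            /./ p
            # last line of the file: fetch the new entry back and print it too
            $ { g; p; }
            d
        }
        # any other line belongs to the entry being collected
        H
        # last line of the file: print what has been collected
        $ { x; s/\n//g; p; }
    ' "$@"
}
```

It can then be called as "join_entries logfile.txt", or at the end of a pipeline.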

I hope this helps.


Not as sophisticated as bakunin's proposal (esp. the date detection regex), but you could also try:

tac file | sed -n '/^2018/!{G; h; b}; G; s/\n//g; p; s/.*//; h' | tac

It starts from the end and composes the "CREATE TABLE" statement in the hold space as long as no date is found. When a date is found, it appends the hold space, prints the result, and empties the hold space.
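
If the log spans more than one year, the fixed "2018" can be replaced by a generic date pattern. The following is a sketch of that variation, not a tested production tool: the function name join_entries_reverse is made up for illustration, the pattern assumes timestamps of the form "YYYY-MM-DDT", and tac is assumed to be available (it is part of GNU coreutils).

```shell
# join_entries_reverse: hypothetical helper name; reads the named files or stdin
join_entries_reverse() {
    tac "$@" | sed -n '
        # no date at the start: put this line in front of what was collected
        /^[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T/!{
            G
            h
            b
        }
        # date found: append the collected lines, join them, and print
        G
        s/\n//g
        p
        # empty the hold space for the next entry
        s/.*//
        h
    ' | tac
}
```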
