Format the text using sed or awk

kenshinhimura · August 23, 2018, 11:51am

I was able to figure out how to format some text.

Raw Data:
$ cat test
Thu Aug 23 15:43:28 UTC 2018,
hostname01,
232.02,
3,
0.00
Thu Aug 23 15:43:35 UTC 2018,
hostname02,
231.09,
4,
0.31
Thu Aug 23 15:43:37 UTC 2018,
hostname03,
241.67,
4,
0.43




My output:
$ cat test| sed 'N;N;N;N; s/\n/ /g'
Thu Aug 23 15:43:28 UTC 2018, hostname01, 232.02, 3, 0.00
Thu Aug 23 15:43:35 UTC 2018, hostname02, 231.09, 4, 0.31
Thu Aug 23 15:43:37 UTC 2018, hostname03, 241.67, 4, 0.43


This one works for me "sed 'N;N;N;N; s/\n/ /g'"
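
For comparison, here is a rough awk sketch of the same fixed-size join. Like the sed one-liner, it assumes every record is exactly five lines long; the function name join5 and the temporary sample file are only there for illustration.

```shell
# Join every five input lines into one line, separated by single spaces.
# Assumes each record is exactly five lines long.
join5() {
    awk '{ printf "%s%s", $0, (NR % 5 ? " " : "\n") }' "$1"
}

# Two complete records taken from the sample data.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    'Thu Aug 23 15:43:28 UTC 2018,' 'hostname01,' '232.02,' '3,' '0.00' \
    'Thu Aug 23 15:43:35 UTC 2018,' 'hostname02,' '231.09,' '4,' '0.31' > "$sample"

join5 "$sample"
```

Because it only counts lines, it goes out of step in the same way as the sed version as soon as a record is short.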

But what if the data is not perfect?

$ cat test
Thu Aug 23 15:43:28 UTC 2018,
hostname01,
232.02,
3,
0.00
Thu Aug 23 15:43:35 UTC 2018,
hostname02,
231.09,
0.31
Thu Aug 23 15:43:37 UTC 2018,
hostname03,
241.67,
4,
0.43
$

# One line is missing between lines 8 and 9 (the second record has only four lines).


$ cat test| sed 'N;N;N;N; s/\n/ /g'
Thu Aug 23 15:43:28 UTC 2018, hostname01, 232.02, 3, 0.00
Thu Aug 23 15:43:35 UTC 2018, hostname02, 231.09, 0.31 Thu Aug 23 15:43:37 UTC 2018,
hostname03,
241.67,
4,
0.43

$

# That ruins the output. I'd like it not to break no matter what; a missing value should just leave an empty field (an extra comma).

I'm hoping for something like this:

Thu Aug 23 15:43:28 UTC 2018, hostname01, 232.02, 3, 0.00
Thu Aug 23 15:43:35 UTC 2018, hostname02, 231.09, 4, 0.31
Thu Aug 23 15:43:37 UTC 2018, hostname03, 241.67, , 0.43

RudiC · August 23, 2018, 12:21pm

So, if there's no guarantee it's always five lines of data per record, how then would you tell one record from the other? Is there always a time stamp in line 1? A hostname in line 2? How do we tell it's line 4 that's missing (to put the comma right)?

vgersh99 · August 23, 2018, 12:26pm

It could be that a line starting with a capital letter indicates the start of a 'record'.
But the OP would need to say if it's a safe assumption and/or if there's a better indication of a start of a record...

Furthermore, once we determine the boundaries of a 'block', how do we determine which field is missing?
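
If the capital-letter assumption does hold, the record boundaries could be found with something like the awk sketch below. It is only an illustration: it assumes that timestamp lines are the only lines that start with an uppercase letter (true for the sample, unconfirmed by the OP), and join_by_capital is a made-up name. It does nothing about the missing field.

```shell
# Start a new output line whenever an input line begins with a capital letter.
# Assumption: only the timestamp line of a record starts with an uppercase letter.
join_by_capital() {
    awk '
        # A capital letter opens a new record: print what was collected so far.
        /^[A-Z]/ && rec != "" { print rec; rec = "" }
        # Append the current line to the record being built.
        { rec = (rec == "" ? $0 : rec " " $0) }
        # Print the final record at end of input.
        END { if (rec != "") print rec }
    ' "$1"
}

# Sample data with one line missing from the second record.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    'Thu Aug 23 15:43:28 UTC 2018,' 'hostname01,' '232.02,' '3,' '0.00' \
    'Thu Aug 23 15:43:35 UTC 2018,' 'hostname02,' '231.09,' '0.31' \
    'Thu Aug 23 15:43:37 UTC 2018,' 'hostname03,' '241.67,' '4,' '0.43' > "$sample"

join_by_capital "$sample"
```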

MadeInGermany · August 23, 2018, 3:19pm

Looks like a trailing comma indicates to append the next line.
And the record ends when there is no trailing comma.
That means, append the next line if the current line ends with a comma. Try four times.

sed '/,$/N; /,$/N; /,$/N; /,$/N; s/\n/ /g' test
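
The "try four times" limit could also be replaced by a loop, so a record may span any number of lines. This is a sketch of that variation, not a tested replacement: the `$!` guard is my addition so that a trailing comma on the very last line does not make sed try to read past the end of input, and join_on_comma is just an illustrative name.

```shell
# Keep appending the next line while the pattern space ends with a comma.
join_on_comma() {
    sed '
        :a
        # Not on the last input line:
        $!{
            # a trailing comma means the record continues, so fetch another line
            /,$/{
                N
                ba
            }
        }
        # The record is complete: turn the embedded newlines into spaces.
        s/\n/ /g
    ' "$1"
}

# Sample data with one line missing from the second record.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    'Thu Aug 23 15:43:28 UTC 2018,' 'hostname01,' '232.02,' '3,' '0.00' \
    'Thu Aug 23 15:43:35 UTC 2018,' 'hostname02,' '231.09,' '0.31' \
    'Thu Aug 23 15:43:37 UTC 2018,' 'hostname03,' '241.67,' '4,' '0.43' > "$sample"

join_on_comma "$sample"
```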

vgersh99 · August 23, 2018, 3:27pm

MadeInGermany wrote:

Looks like a trailing comma indicates to append the next line.
And the record ends when there is no trailing comma.
That means, append the next line if the current line ends with a comma. Try four times.
sed '/,$/N; /,$/N; /,$/N; /,$/N; s/\n/ /g' test

I get the following on a sample file with a missing field:

Thu Aug 23 15:43:28 UTC 2018, hostname01, 232.02, 3, 0.00
Thu Aug 23 15:43:35 UTC 2018, hostname02, 231.09, 0.31
Thu Aug 23 15:43:37 UTC 2018, hostname03, 241.67, 4, 0.43

I thought the desired output (shown incorrectly by the OP) would be:

Thu Aug 23 15:43:28 UTC 2018, hostname01, 232.02, 3, 0.00
Thu Aug 23 15:43:35 UTC 2018, hostname02, 231.09, , 0.31
Thu Aug 23 15:43:37 UTC 2018, hostname03, 241.67, 4, 0.43

MadeInGermany · August 23, 2018, 3:45pm

I have provided a plausible answer to "How to determine the boundaries of a 'block'?". And it no longer "ruins" the following blocks.
The question "How do we determine which field is missing?" is still unanswered.
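
Until that is answered, one cautious option is to join on the trailing comma as before and merely report records that do not have the expected number of fields, rather than guess where the empty field belongs. A sketch under the assumption that a complete record has five comma-separated fields; check_records and the want variable are hypothetical names.

```shell
# Join records on the trailing-comma rule and warn on stderr about
# records whose field count differs from the expected one (assumed: 5).
check_records() {
    awk -v want=5 '
        NF == 0 { next }                           # ignore blank lines
        { rec = (rec == "" ? $0 : rec " " $0) }    # collect the lines of one record
        !/,$/ {                                    # no trailing comma: record is complete
            nrec++
            n = split(rec, fields, ",")
            if (n != want)
                printf "record %d: %d fields, expected %d\n", nrec, n, want | "cat 1>&2"
            print rec
            rec = ""
        }
        END { if (rec != "") print rec }           # input ended on a trailing comma
    ' "$1"
}

# Sample data with one line missing from the second record.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    'Thu Aug 23 15:43:28 UTC 2018,' 'hostname01,' '232.02,' '3,' '0.00' \
    'Thu Aug 23 15:43:35 UTC 2018,' 'hostname02,' '231.09,' '0.31' \
    'Thu Aug 23 15:43:37 UTC 2018,' 'hostname03,' '241.67,' '4,' '0.43' > "$sample"

check_records "$sample"
```

The joined lines go to stdout unchanged; only the warning about the short record goes to stderr, so the output can still be redirected to a file.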