any better way to remove line breaks

csmklee · October 6, 2008, 4:37am

Hi,

I have some log files in which each XML message is printed across several lines, e.g.:
2008-10-01 14:21:44,561 INFO do something
2008-10-01 14:21:44,561 INFO print xml : <?xml version="1.0" encoding="UTF-8"?>
<a>
<b>my data</b>
</a>
2008-10-01 14:21:44,563 INFO do something again

I want to convert the xml part into one single line, e.g.
2008-10-01 14:21:44,561 INFO do something
2008-10-01 14:21:44,561 INFO print xml : <?xml version="1.0" encoding="UTF-8"?><a><b>my data</b></a>
2008-10-01 14:21:44,563 INFO do something again

I once got a script like:
gzip -dc log.gz | sed -n -e ":a" -e "$ s/>\n/>/gp;N;b a"

but it is very slow and runs out of memory.

Is there a better way to achieve this?

era · October 6, 2008, 5:39am

Maybe replace all line breaks with a placeholder character, then turn the placeholder back into a line break wherever the next character is not a wedge (<).

gzip -dc log.gz | tr '\n' '�' | sed -e 's/�</</g' -e 's/�/\n/g'

The character � is unlikely to occur in the log file, but might be problematic if your locale doesn't handle it as a single byte. If your file doesn't contain any underscores, using an underscore instead is safer; or maybe you can come up with another character which doesn't occur in the file (literal vertical bar perhaps? exclamation mark?)

The notation \n might or might not be understood to mean newline by your tr and/or sed; read the manual page and/or experiment with other possible notations, including \012 (for tr) and literal newline:

gzip -dc log.gz | tr '
' '�' | sed -e 's/�</</g' -e 's/�/\
/g'

Yes, it looks weird, but it's valid string syntax in the shell. (Might want to try without the backslash before the newline in the sed script if it still doesn't work.)
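
For reference, here is a minimal sketch of the same placeholder idea that can be tried on a small sample first. It assumes the log never contains a literal vertical bar, and it converts the placeholders back with a second tr instead of relying on \n in the sed replacement. The function name join_xml_tr is made up for this example. Note that sed still receives the whole input as one long line here.

```shell
#!/bin/sh
# join_xml_tr: hypothetical helper name, not from the thread.
# Assumption: the input never contains "|", which is used as the placeholder.
# Step 1: tr turns every newline into "|".
# Step 2: sed drops the "|" wherever the next character is "<".
# Step 3: tr turns the remaining "|" back into newlines.
join_xml_tr() {
    tr '\n' '|' | sed 's/|</</g' | tr '|' '\n'
}

# Sample taken from the first post.
join_xml_tr <<'EOF'
2008-10-01 14:21:44,561 INFO do something
2008-10-01 14:21:44,561 INFO print xml : <?xml version="1.0" encoding="UTF-8"?>
<a>
<b>my data</b>
</a>
2008-10-01 14:21:44,563 INFO do something again
EOF
```

On the real file this would be used as: gzip -dc log.gz | join_xml_tr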

cfajohnson · October 6, 2008, 12:42pm

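gzip -dc log.gz | awk '
   /^</ { printf "%s", $0; next }   # line starts with "<": append it to the current line
 NR > 1 { print "" }                # otherwise end the previous line (not before the first)
        { printf "%s", $0 }         # and start a new one without a trailing newline
    END { if (NR) print "" }'       # terminate the last line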
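
To check the awk approach without the real log, something like the following can be run on the sample from the first post. This is only a sketch: the function name join_xml_awk is hypothetical, and the NR > 1 guard is there so the output does not start with an empty line. awk reads one line at a time, so memory use does not grow with the size of the file.

```shell
#!/bin/sh
# join_xml_awk: hypothetical helper name, not from the thread.
# Lines starting with "<" are appended to the previous line;
# any other line starts a new output line.
join_xml_awk() {
    awk '
        /^</   { printf "%s", $0; next }   # continuation line: append to current record
        NR > 1 { print "" }                # finish the previous record
               { printf "%s", $0 }         # start a new record, no newline yet
        END    { if (NR) print "" }        # finish the last record
    '
}

# Sample taken from the first post.
join_xml_awk <<'EOF'
2008-10-01 14:21:44,561 INFO do something
2008-10-01 14:21:44,561 INFO print xml : <?xml version="1.0" encoding="UTF-8"?>
<a>
<b>my data</b>
</a>
2008-10-01 14:21:44,563 INFO do something again
EOF
```

On the real file this would be used as: gzip -dc log.gz | join_xml_awk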


csmklee · January 13, 2009, 1:42am

Initially we implemented the sed solution. It does work, but it hung on some of our hosts (using 99.9% of CPU time and producing no output) once the log file size went up to 100MB.

Then we switched to awk, and the results were generated in seconds.

I'm not very familiar with shell programming.
May I know why there is such a difference? Is sed limited by CPU or by memory here?