Removing ^M and the newline that follows it.

Leedor · April 1, 2011, 6:34pm

Hi Gurus,

Apologies, as I feel this must already be answered on here somewhere, but I just can't find it. I find many people looking to remove all \r and \n (CR and LF), or one or the other. But the only times I've found someone trying to remove them only when both occur together, they've ended up with workarounds instead, e.g.: Issue with Removing Carriage Return (^M) in delimited file

So my issue is: I have data like

a,b,c,d,e
a,b,c,d,e
a,b,c,^M
d,e
a,b,c,d,e

Removing the ^M alone I can do with

tr -d '\r'

But this still leaves the broken line

I tried

tr -d '\r\n'

but of course that removes ALL linefeeds, not just the ones after a ^M
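For anyone wanting to reproduce this, here is a minimal sketch assuming a POSIX shell. The function names are made up for illustration, and printf writes the carriage return so no literal ^M has to be typed:

```shell
# Hypothetical sample data: line 3 ends in CR+LF, all other lines in LF only.
make_sample() {
    printf 'a,b,c,d,e\na,b,c,d,e\na,b,c,\r\nd,e\na,b,c,d,e\n'
}

# Deleting only CR: the record is still split across two lines.
strip_cr() {
    make_sample | tr -d '\r'
}

# Deleting CR and LF: every line in the file is joined into one.
strip_cr_lf() {
    make_sample | tr -d '\r\n'
}

strip_cr        # five lines, with "a,b,c," and "d,e" still on separate lines
strip_cr_lf     # one long line with no newline at all
```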

I can find the lines easily using grep and <ctrlv><ctrlm> and fix them manually, but I need this in an automated script, as simple to understand as possible please, as I'm obviously no Unix wiz.

Your help is greatly appreciated, as always!

Lee

alister · April 1, 2011, 6:42pm

Hi, Lee:

Perhaps the following will do the trick for you:

sed '/^M$/{N; s/.\n//;}'

That solution will discard a final line ending with \r\n since there's nothing to merge it with. If that's undesirable, see the next offering.

The following, when it encounters a final line which ends in \r\n, will strip the \r but leave the \n. Nothing follows the line so it cannot be merged with another, but the final result is a text file with unix line endings as per your example:

sed '/^M$/{s///; $p; N; s/\n//;}'

Note that everything said above applies only to a POSIX-compliant sed implementation, where N quits without printing the pattern space when there is no next line of input. GNU sed by default chooses to ignore the standard and writes the pattern space to stdout when N executes and there's no further text. The following (I think; untested) are the equivalents, in the same order, for the GNU sed commonly found on Linux systems:

sed '/^M$/{$d; N; s/.\n//;}'

sed '/^M$/{s///; N; s/\n//;}'
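One way to sidestep the POSIX/GNU difference is to guard N with $! so it never runs on the last line. This is a variant sketch, not one of the commands above; the helper name join_cr_lines and the cr variable are assumptions made for illustration:

```shell
# Hold a literal carriage return so no ^M has to be typed into the script.
cr=$(printf '\r')

# join_cr_lines: hypothetical helper; reads stdin, writes stdout.
join_cr_lines() {
    # /CR$/   select lines ending in a carriage return
    # s///    empty regex reuses the last one used, so this deletes that CR
    # $!N     unless this is the last line, append the next input line
    # s/\n//  delete the embedded newline, merging the two lines
    sed '/'"$cr"'$/{s///; $!N; s/\n//;}'
}

printf 'a,b,c,d,e\na,b,c,\r\nd,e\nlast,\r\n' | join_cr_lines
```

On a final line ending in \r\n this strips the \r and keeps the \n, like the second offering. As far as I can tell it shares one limitation with the commands above: if the line that gets merged in also ends in \r, that \r is left in place.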

Regards,
Alister

P.S. In case the rationale for GNU sed's N behavior is of interest to anyone, it's discussed at Reporting Bugs (sed, a stream editor). (I found that myself when looking to report the "bug" ;))

mvijayv · April 1, 2011, 6:44pm

I had this issue with files that were FTP-ed from Windows to Unix. Using the dos2unix command cleans it up pretty well. Try it. I used it in a script to clean up data files before loading them into the database, and it worked there too.

dos2unix <inputfile> <outputfile>

alister · April 1, 2011, 6:48pm

mvijayv:

I think you misunderstood the problem. dos2unix, to my knowledge, only converts the line ending; it does not merge lines by deleting one type of line ending while leaving others intact.

Regards,
Alister

yinyuemi · April 1, 2011, 6:58pm

echo "a,b,c,d,e
a,b,c,d,e
a,b,c,^M
d,e
a,b,c,d,e" |sed 'N;s/^M\n//'
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e
a,b,c,d,e

mvijayv · April 1, 2011, 7:00pm

Thanks Alister. Missed that piece .. Yup dos2unix doesn't merge the lines.

alister · April 1, 2011, 7:23pm

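yinyuemi's solution will not work for even-numbered lines ending with \r\n. N reads the input two lines at a time, so in those cases there will be no \r\n in the pattern space. The pattern space will end with \r alone, and the continuation line only arrives in the next cycle.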
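The pairing effect can be seen with a small sketch. It assumes a POSIX shell, and the helper name pairwise_join is made up for illustration; both inputs have an even number of lines so the end-of-input behavior of N does not come into play:

```shell
# Hold a literal carriage return so no ^M has to be typed into the script.
cr=$(printf '\r')

# pairwise_join: hypothetical name for the 'N;s/^M\n//' approach.
pairwise_join() {
    sed 'N;s/'"$cr"'\n//'
}

# CR on line 1 (odd): both halves share one pattern space, so they merge.
printf 'a,b,c,\r\nd,e\nx\ny\n' | pairwise_join

# CR on line 2 (even): the CR ends the pattern space, the continuation
# is read in the next cycle, and nothing is merged.
printf 'x\na,b,c,\r\nd,e\ny\n' | pairwise_join
```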

Regards,
Alister

Leedor · April 2, 2011, 11:43am

Thanks so much for this, Alister. The first option you gave works fine for my example, and this issue is such a rare occurrence that I'm sure we'd be fine with that, since I doubt we'd ever get it in the final line of data (this is the first occurrence after millions of records). That said, if it's safer to go with the second option (which also works fine for me), then I guess we should do so. I don't see any downside to the second option you gave.

Very much appreciated!

Lee
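For readers who find awk easier to follow than sed, the same merge can be sketched as below. This is an alternative technique, not something posted in the thread, and the helper name join_cr_awk is an assumption:

```shell
# join_cr_awk: hypothetical helper; reads stdin, writes stdout.
join_cr_awk() {
    awk '
        # sub() returns 1 when it removed a trailing CR from the line;
        # such a line is written without a newline so the next one joins it
        sub(/\r$/, "") { printf "%s", $0; pending = 1; next }
        # any other line is printed normally, newline included
        { print; pending = 0 }
        # input ended on a CR line: finish it with a plain newline
        END { if (pending) print "" }
    '
}

printf 'a,b,c,d,e\na,b,c,\r\nd,e\nlast,\r\n' | join_cr_awk
```

Because each line is handled on its own, this should also cope with several consecutive lines ending in \r\n, and it does not depend on how a particular sed treats N at end of input.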