Which one is faster to remove control m characters?

madhunk · August 15, 2006, 6:17pm

I have a file with millions of records. Before I experiment, I would like to know which one is faster.

Both the commands work absolutely fine on a smaller set of records.

Please advise.

sed 's/^M//g' ${INPUT_FILE} > tmp.txt

mv  tmp.txt  ${INPUT_FILE}

tr -d "\15" < ${INPUT_FILE} > tmp.txt

mv  tmp.txt  ${INPUT_FILE}

This somehow didn't work...

tr -s '[:cntrl:]' ' ' < ${INPUT_FILE}> tmp.txt

Thank you in advance.
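A likely reason the last command misbehaves: newline is itself a control character, so translating the whole [:cntrl:] class to spaces also removes the line breaks. The following is only a minimal sketch, assuming GNU tr (the Solaris tr may treat a short second set differently) and a made-up two-line sample file:

```shell
# Hypothetical sample: two records with DOS (CR LF) line endings.
sample=$(mktemp)
printf 'one\r\ntwo\r\n' > "$sample"

# Translating every control character to a space also hits the newlines,
# so the records end up joined on a single line.
joined=$(tr -s '[:cntrl:]' ' ' < "$sample")

# Deleting only the carriage return (octal 015) keeps the line structure.
fixed=$(tr -d '\015' < "$sample")

printf 'joined: [%s]\n' "$joined"
printf 'fixed:\n%s\n' "$fixed"
rm -f "$sample"
```

If the goal is only to drop ^M, the tr -d form is the one to keep.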

cpmurray · August 15, 2006, 9:53pm

Try
perl -p -i -e 's/^M//g' ${INPUT_FILE}

This will edit the file in place. If you wish to process every file in the directory:
perl -p -i -e 's/^M//g' *

Regards
Craig Murray

blowtorch · August 15, 2006, 10:40pm

What OS? Most unix systems have some sort of program to do this job. Solaris has dos2unix, HP-UX has dos2ux, most of the other systems should have dos2unix too. Use these.

vangalli · August 16, 2006, 2:33am

If it is a DOS file copied to a Unix system, then use:

dos2unix filename

If it is the other way around, use:

unix2dos filename

madhunk · August 16, 2006, 9:59am

This is on a Sun Solaris box. I somehow could not use dos2unix -- I get the following message:

1K$ dos2unix
could not open /dev/kbd to get keyboard type US keyboard assumed
could not get keyboard type US keyboard assumed

cpmurray suggested using Perl. I will try that option too.

vgersh99 · August 16, 2006, 10:07am

'man dos2unix'

NAME
     dos2unix - convert text file from DOS format to ISO format

SYNOPSIS
     dos2unix [-ascii] [-iso] [-7] [-437 | -850 | -860 |  -863  |
     -865]  originalfile convertedfile

madhunk · August 16, 2006, 10:20am

That is what I did, vgersh.

1K$ /usr/bin/dos2unix tmp.txt > tmp1.txt
could not open /dev/kbd to get keyboard type US keyboard assumed
could not get keyboard type US keyboard assumed

I am not sure what this keyboard type thing is....

Is there any other faster way to strip off ^M characters? Right now, I am using sed and tr...

Please advise.

blowtorch · August 16, 2006, 10:24am

You do not have to worry about that. That is just a warning. dos2unix will assume that you have a US style keyboard and go ahead with the file conversion to unix format. And since you are using Solaris, you do not need to worry about that '>' redirection stuff either.

dos2unix filename filename

This will take input from the filename specified and provide the output in the same file.

vgersh99 · August 16, 2006, 10:28am

(echo '%s/^M//g'; echo 'wq') | ex -s tmp.txt

madhunk · August 16, 2006, 10:35am

Thank you blowtorch and vgersh...

Would both options be fast on files with 12-15 million records?

vgersh99 · August 16, 2006, 10:40am

Unfortunately the only way to find out 'for sure' is to benchmark it.
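One way such a benchmark could be set up, as a rough sketch only: the generated test file, the record count, and the bench helper are all assumptions for illustration, and the timing relies on date +%s, which is widely available but not guaranteed by POSIX.

```shell
# Hypothetical test data: 200000 records with DOS line endings.
# Scale the count up to match the real file before trusting the numbers.
INPUT_FILE=$(mktemp)
awk 'BEGIN { for (i = 0; i < 200000; i++) printf "record %d\r\n", i }' > "$INPUT_FILE"

# Run one candidate and report whole seconds of wall-clock time.
bench() {
    label=$1; shift
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    echo "$label: $((end - start))s"
}

# Each candidate writes to its own output file so results can be compared.
via_tr()   { tr -d '\015' < "$INPUT_FILE" > "$INPUT_FILE.tr"; }
via_sed()  { sed "s/$(printf '\r')//g" "$INPUT_FILE" > "$INPUT_FILE.sed"; }
via_perl() { perl -pe 's/\r//g' "$INPUT_FILE" > "$INPUT_FILE.perl"; }

bench tr   via_tr
bench sed  via_sed
bench perl via_perl

# All candidates should agree before their speed is compared.
cmp "$INPUT_FILE.tr" "$INPUT_FILE.sed" &&
    cmp "$INPUT_FILE.tr" "$INPUT_FILE.perl" &&
    echo "outputs match"
```

Running each candidate more than once helps, since the first run may include the cost of reading the file from disk while later runs hit the cache.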

tmarikle · August 16, 2006, 12:55pm

vgersh99's point is correct: you'll have to benchmark it for yourself. Just for kicks, though, I ran a test on my system with some interesting results:

dos2unix was 3x faster (which included removing the extra file). This of course requires 2x the disk space.

Perl was second and 5x faster than ex.

The ex method requires adequate space wherever ex creates its temp files.