I am trying to remove whitespace from a file containing data like this sample:
457 <EOFD> Mar 1 2007 12:00:00:000AM <EOFD> Mar 31 2007 12:00:00:000AM <EOFD> system <EORD> 458 <EOFD> Mar 1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007 9:10:56:036PM <EOFD> prodiws<EORD>
Basically, these files are delimited extracts from a database, with <EOFD> as the delimiter.
I am using the command below to remove the whitespace:
perl -pi -e 's/[[:space:]]*\<EOFD\>[[:space:]]*/\<EOFD\>/g' sample.dat
The above command is part of a shell script, t.ksh, which is invoked on the command line and runs the perl command internally.
It works fine for moderately large files, but once the file reaches a considerable size, say 2.2 to 3 GB, it empties the file and the script exits with the error below:
./t.ksh[249]: 700492 Memory fault(coredump)
Server Message: <my unix machine name>- Msg 208, Level 16, State 1:
The [249] points to the line with my perl command. I also see a huge core dump file being created in the directory from which I invoke the shell script t.ksh. The file system has adequate space.
Can someone advise what's wrong here, or suggest a workaround?
It appears that your file is one giant line, yes? Perl is attempting to process it as one line, i.e. load the entire 2 gigabytes of it into memory at once.
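One quick way to check this is to count the newline characters in the file. The sketch below is only illustrative (the function name count_newlines is made up for this post); a result of 0 or 1 on a multi-gigabyte file would mean that perl -p, which reads up to the next newline, treats the whole file as one record.

```shell
# Hypothetical helper: print the number of newline characters in file $1.
# wc -l counts newlines; tr strips the padding some wc versions add.
count_newlines() {
    wc -l < "$1" | tr -d ' '
}

# Usage (sample.dat is the file from the question):
#   count_newlines sample.dat
```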
Thanks, Corona688. Is there any workaround for this? I am currently trying to split the file into smaller chunks and process those, but that takes some effort: the file is delimited, so I have to make sure each split falls in the right place.
If you tell awk what your "lines" are, it won't have to read 2GB of data at once. The RS and ORS variables control this. They usually default to newline, but they can just as easily be <EOFD>.
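A minimal sketch of that idea, under a few assumptions: the function name trim_eofd is made up, the awk in use accepts a multi-character RS (gawk and mawk do; POSIX leaves it unspecified, so some vendor awks only look at the first character), and the file does not end exactly on <EOFD>. It writes the delimiter itself between records instead of setting ORS, so that no extra <EOFD> is appended after the last record.

```shell
# Hypothetical helper: trim whitespace around each <EOFD>-delimited record of $1.
# Assumes gawk or mawk (multi-character RS is not guaranteed by POSIX).
trim_eofd() {
    awk 'BEGIN { RS = "<EOFD>" }             # one record per <EOFD>, not per newline
         NR > 1 { printf "%s", "<EOFD>" }    # put the delimiter back between records
         {
             sub(/^[ \t\r\n]+/, "")          # leading whitespace of the record
             sub(/[ \t\r\n]+$/, "")          # trailing whitespace of the record
             printf "%s", $0
         }' "$1"
}
```

Because awk now holds only one <EOFD>-delimited record at a time, memory use depends on the longest field rather than on the file size. This version knows nothing about <EORD>, so whitespace next to <EORD> is left in place.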
Yes, it works, but one thing: the whitespace between system and <EORD> remains ("system <EORD>"). Except on the first line, <EORD> marks the start of a new row in the file, so it is both the start of a new line and the end of the previous one. Can you suggest something for this?
Also, it echoes the whole content to the console. Can I avoid that?
Moreover, the whitespace before <EORD> is not removed either. Just to let you know, I invoke the command inside a shell script. Also, the data shown is actually a single line, not multiple lines; it just got wrapped here somehow.