Removing whitespace from a huge file

I am trying to remove whitespace from a file containing sample data like this:

457 <EOFD> Mar  1 2007 12:00:00:000AM   <EOFD> Mar 31 2007 12:00:00:000AM   <EOFD>  system  <EORD> 458 <EOFD>    Mar  1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007  9:10:56:036PM    <EOFD>  prodiws<EORD> 

Basically, these files are delimited extracts from a database, with <EOFD> as the field delimiter.
I am using the command below to remove the whitespace:
perl -pi -e 's/[[:space:]]*\<EOFD\>[[:space:]]*/\<EOFD\>/g' sample.dat

The command is part of a shell script, t.ksh, which is invoked on the command line and runs the perl command internally.
It works fine for moderately large files, but once the file reaches a considerable size, say 2.2-3 GB, it empties the file and the script exits with the error below:

./t.ksh[249]: 700492 Memory fault(coredump)
Server Message: <my unix machine name>- Msg 208, Level 16, State 1:

[249] - this points to my perl command. Also, I see a huge coredump file getting created in the directory from where I am invoking the shell script t.ksh. The file system has adequate space.

Can someone advise what's wrong here, or suggest a workaround?

Regards

It appears that your file is one giant line, yes? Perl is attempting to process it as one line, i.e. load the entire 2 gigabytes of it into memory at once.
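One way to check that diagnosis without loading the file into memory is to count newlines and row markers with streaming tools. This is only a sketch: the temporary file and its contents are a made-up miniature of sample.dat, not the real data.

```shell
# Hypothetical miniature of sample.dat: rows end in <EORD>, no newline anywhere.
sample=$(mktemp)
printf '457 <EOFD> system  <EORD> 458 <EOFD>  prodiws<EORD> ' > "$sample"

# wc -l counts newline characters; 0 or 1 means the file is one giant line.
lines=$(wc -l < "$sample" | tr -d ' ')

# tr streams byte by byte, so breaking the data at every "<" keeps lines short;
# each <EORD> marker then shows up as a line starting with "EORD>".
records=$(tr '<' '\n' < "$sample" | grep -c '^EORD>')

echo "newlines=$lines rows=$records"
rm -f "$sample"
```

If the newline count is 0 or 1 while the row count is large, any tool that reads line by line will try to hold the whole file at once.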

Thanks Corona688. Is there any workaround for this? I am currently trying to split the file into smaller chunks and process those. But that takes some effort, since my file is delimited and I have to make sure the splits happen in the right places.

If you tell awk what your "lines" are, it won't have to read 2GB of data at once. RS and ORS variables control this. They usually default to newline, but they can as easily be <EOFD>.

$ cat datafile
457 <EOFD> Mar  1 2007 12:00:00:000AM   <EOFD> Mar 31 2007 12:00:00:000AM   <EOFD>  system  <EORD> 458 <EOFD>    Mar  1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007  9:10:56:036PM    <EOFD>

$  awk '{ sub(/ +$/, ""); sub(/^ +/, ""); } 1' RS="<EOFD>" ORS="<EOFD>" datafile ; echo

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>system  <EORD> 458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>agf<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>

$

The "echo" afterwards is just to move the cursor to the next line, since with ORS set to <EOFD> awk wouldn't print a newline otherwise.

Thanks Corona688. Just to confirm, my files look like this:

  1. Each field is delimited by <EOFD>
  2. The marker for a new row is <EORD>. Since <EORD> marks the end of a line, can you suggest what the above command should be?

Unless the output I showed is wrong somehow, the command I just showed you works, no?

Yes, it works, but just one thing: the whitespace between system and <EORD> remains: "system <EORD>". Actually, except for the first row, <EORD> marks the start of a new row in the file, so it is both the end of the previous row and the start of the next. Can you suggest something for this?
Also, it echoes the whole content to the console. Can I avoid that?

OK, I can do that. I just can't know what you don't mention, is all.

$ awk '{ sub(/ +$/, ""); sub(/^ +/, ""); sub(/ *<EORD> */, "<EORD>"); } 1' RS="<EOFD>" ORS="<EOFD>" datafile ; echo

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>system<EORD>458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>agf<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>

$

Thanks Corona688, but the code isn't working. I can still see some spaces between fields in the file. Below is a reference sample:

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D   <EOFD>3000<EOFD>65.00<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>pro
diws     <EORD>458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D   <EOFD>3300<EOFD>36.00<EOFD>Apr 20 2007  9:10:56
:036PM<EOFD>prodiws     <EORD>

Moreover, the whitespace before <EORD> is also not removed. Just so you know, I invoke the command inside a shell script. Also, the data shown is actually a single line, not multiple lines; it only got wrapped when I pasted it here.

Hi,
With perl, can you try:

perl -p -e 'BEGIN { $/="<EORD>" } s/\s*<EO(FD|RD)>\s*/<EO$1>/g;s/^ *//' datafile

Regards.

Double check the code you're using. It works:

$ awk '{ sub(/ +$/, ""); sub(/^ +/, ""); sub(/ *<EORD> */, "<EORD>"); } 1' RS="<EOFD>" ORS="<EOFD>" data2 ; echo
457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D<EOFD>3000<EOFD>65.00<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>prodiws<EORD>458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D<EOFD>3300<EOFD>36.00<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>prodiws<EORD>

$

There's been some difficulty pasting it, which may explain the confusion. My terminal doesn't like a nonwrapped datafile.

Try:

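awk '
  # Records are split on "<", so a delimiter record looks like "EOFD> value".
  /EO[RF]D/ {
    sub(/[[:space:]]*$/,x,p)    # trim the end of the previous record (x is empty)
    sub(/^[[:space:]]*/,x,$2)   # trim the start of the value after ">"
  }
  {
    if(NR>1) print p RS         # emit the previous record and restore "<"
    p=$0
  }
  END {
    print p
  }
' RS=\< ORS= FS=\> OFS=\> file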

----
Note:

  • gawk and mawk treat a multiple-character RS as a regular expression. POSIX leaves that case unspecified, and regular awk only uses the first character of the RS string.
  • if there can be whitespace characters other than simple space, use [[:space:]] .
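The second point can be seen on a small example. The sketch below uses sed rather than awk only to keep it short; the bracket expression is the same one used in the awk script, and the field value is made up.

```shell
# Hypothetical field value that ends in a space followed by a tab.
field=$(printf 'system \t')

# " *$" strips trailing spaces only; the value ends in a tab, so nothing changes.
spaces_only=$(printf '%s\n' "$field" | sed 's/ *$//')

# "[[:space:]]*$" also covers tabs, so the space and the tab are both removed.
all_space=$(printf '%s\n' "$field" | sed 's/[[:space:]]*$//')

printf '[%s] [%s]\n' "$spaces_only" "$all_space"
```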