Removing whitespace from a huge file

amvip · January 20, 2017, 2:16pm

I am trying to remove whitespace from a file containing data like this sample:

457 <EOFD> Mar  1 2007 12:00:00:000AM   <EOFD> Mar 31 2007 12:00:00:000AM   <EOFD>  system  <EORD> 458 <EOFD>    Mar  1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007  9:10:56:036PM    <EOFD>  prodiws<EORD>

These files are delimited extracts from a database, with <EOFD> as the field delimiter.
I am using the command below to remove the whitespace:
perl -pi -e 's/[[:space:]]*\<EOFD\>[[:space:]]*/\<EOFD\>/g' sample.dat

The above command is part of a shell script, t.ksh, which is invoked on the command line and runs the perl command internally.
It works fine for moderately large files, but once the file reaches a considerable size, say 2.2-3 GB, it empties the file and the script exits with the error below:

./t.ksh[249]: 700492 Memory fault(coredump)
Server Message: <my unix machine name>- Msg 208, Level 16, State 1:

[249] points to the line containing my perl command. I also see a huge core dump file being created in the directory from which I invoke t.ksh. The file system has adequate space.

Can someone advise what's wrong here, or suggest a workaround?

Regards

Corona688 · January 20, 2017, 2:21pm

It appears that your file is one giant line, yes? Perl is attempting to process it as one line, i.e. load the entire 2 gigabytes of it into memory at once.
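
A quick way to confirm this is to compare the newline count with the file size. The sketch below builds a small stand-in file (the name oneline.dat and its contents are made up for illustration); on the real extract, the two wc commands would be run against sample.dat instead.

```shell
# Hypothetical stand-in for the real extract: delimiters but no newlines.
printf '457 <EOFD> system <EORD> 458 <EOFD> agf <EORD>' > oneline.dat

# $(( ... )) strips the padding that some wc implementations add.
lines=$(( $(wc -l < oneline.dat) ))
bytes=$(( $(wc -c < oneline.dat) ))
echo "newlines=$lines bytes=$bytes"
```

A newline count of 0 or 1 next to a size in the gigabytes means that perl -p, which reads up to the next newline, has to hold the whole file in memory as a single "line".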

amvip · January 20, 2017, 2:28pm

Thanks Corona688. Is there any workaround for this? I am currently trying to split the file into smaller chunks, but that takes some effort: my file is delimited, and I have to make sure the splits fall in the right places.

Corona688 · January 20, 2017, 2:29pm

If you tell awk what your "lines" are, it won't have to read 2GB of data at once. RS and ORS variables control this. They usually default to newline, but they can as easily be <EOFD>.

$ cat datafile
457 <EOFD> Mar  1 2007 12:00:00:000AM   <EOFD> Mar 31 2007 12:00:00:000AM   <EOFD>  system  <EORD> 458 <EOFD>    Mar  1 2007 12:00:00:000AM<EOFD>agf <EOFD> Apr 20 2007  9:10:56:036PM    <EOFD>

$ awk '{ sub(/ +$/, ""); sub(/^ +/, ""); } 1' RS="<EOFD>" ORS="<EOFD>" datafile ; echo

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>system  <EORD> 458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>agf<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>

$

The "echo" afterwards is just to move the cursor to the next line, since awk wouldn't print a trailing newline otherwise.

amvip · January 20, 2017, 2:44pm

Thanks Corona688. Just to confirm, my files are laid out as follows:

Each field is delimited by <EOFD>.
The marker for a new row is <EORD>. Since <EORD> marks the end of a line, can you suggest how the above command should change?

Corona688 · January 20, 2017, 2:48pm

Unless the output I showed is wrong somehow, the command I just showed you works, no?

amvip · January 20, 2017, 3:00pm

Yes, it works, but one thing: the whitespace between system and <EORD> remains: "system <EORD>". Except for the first line, <EORD> marks the start of a new row in the file, so it is both the start of a new line and the end of the previous one. Can you suggest something for this?
Also, it echoes the whole content to the console. Can I avoid that?

Corona688 · January 20, 2017, 3:05pm

OK, I can do that. I just can't know what you don't mention, is all.

$ awk '{ sub(/ +$/, ""); sub(/^ +/, ""); sub(/ *<EORD> */, "<EORD>"); } 1' RS="<EOFD>" ORS="<EOFD>" datafile ; echo

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>system<EORD>458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>agf<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>

$
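
On the second question (the output being echoed to the console): awk writes to standard output, so one option is to redirect it to a temporary file and rename that over the original only if awk succeeds. This is a minimal sketch, assuming an awk that accepts a multi-character RS (such as gawk or mawk); the sample contents and the name datafile.tmp are made up for illustration.

```shell
# Hypothetical sample input standing in for the real extract.
printf '457 <EOFD> Mar  1 2007   <EOFD>  system  <EORD> 458 <EOFD> agf <EOFD>' > datafile

# Send the cleaned data to a temporary file instead of the terminal,
# then replace the original only if awk exited successfully.
awk '{ sub(/ +$/, ""); sub(/^ +/, ""); sub(/ *<EORD> */, "<EORD>"); } 1' \
    RS="<EOFD>" ORS="<EOFD>" datafile > datafile.tmp &&
  mv datafile.tmp datafile
```

Unlike perl -pi, a failure here leaves the original file untouched, at the cost of needing enough free space for a second copy while the command runs.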

amvip · January 20, 2017, 4:58pm

Thanks Corona688, but the code isn't working. I can still see some spaces between fields in the file. Below is a reference sample:

457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D   <EOFD>3000<EOFD>65.00<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>pro
diws     <EORD>458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D   <EOFD>3300<EOFD>36.00<EOFD>Apr 20 2007  9:10:56
:036PM<EOFD>prodiws     <EORD>

Moreover, the whitespace before <EORD> is also not removed. Just to let you know, I invoke the command inside a shell script. Also, the data shown is actually a single line, not multiple lines; it somehow got wrapped when I pasted it here.

disedorgue · January 20, 2017, 5:09pm

Hi,
With perl, can you try:

perl -p -e 'BEGIN { $/="<EORD>" } s/\s*<EO(FD|RD)>\s*/<EO$1>/g;s/^ *//' datafile

Regards.

Corona688 · January 20, 2017, 5:36pm

Double check the code you're using. It works:

$ awk '{ sub(/ +$/, ""); sub(/^ +/, ""); sub(/ *<EORD> */, "<EORD>"); } 1' RS="<EOFD>" ORS="<EOFD>" data2 ; echo
457<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D<EOFD>3000<EOFD>65.00<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>prodiws<EORD>458<EOFD>Mar  1 2007 12:00:00:000AM<EOFD>Mar 31 2007 12:00:00:000AM<EOFD>ACRD<EOFD>sn<EOFD>D<EOFD>3300<EOFD>36.00<EOFD>Apr 20 2007  9:10:56:036PM<EOFD>prodiws<EORD>

$

There's been some difficulty pasting it, which may explain the confusion. My terminal doesn't like a nonwrapped datafile.

Scrutinizer · January 22, 2017, 6:36am

Try:

awk '
  /EO[RF]D/ {
    sub(/[[:space:]]*$/,x,p)
    sub(/^[[:space:]]*/,x,$2)
  }
  {
    if(NR>1) print p RS
    p=$0
  } 
  END {
    print p
  }
' RS=\< ORS= FS=\> OFS=\> file

----
Note:

Only gawk and mawk process a multi-character RS (as a regular expression). Regular awk uses only the first character of the RS string.
If there can be whitespace characters other than a simple space, use [[:space:]].
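
As an illustration of that last point, here is the earlier RS="<EOFD>" one-liner with each literal space swapped for [[:space:]]. It is only a sketch, assuming an awk that supports both a multi-character RS and POSIX character classes (gawk does); the file name tabs.dat and its contents are made up.

```shell
# Hypothetical sample containing tabs as well as spaces around the delimiters.
printf '457 \t<EOFD>\tMar  1 2007<EOFD>sys \t<EORD> 458<EOFD>' > tabs.dat

# Same three substitutions as before, but matching any whitespace character.
cleaned=$(awk '{
    sub(/[[:space:]]+$/, "")                          # trailing whitespace of the field
    sub(/^[[:space:]]+/, "")                          # leading whitespace of the field
    sub(/[[:space:]]*<EORD>[[:space:]]*/, "<EORD>")   # whitespace around the row marker
} 1' RS="<EOFD>" ORS="<EOFD>" tabs.dat)

echo "$cleaned"
# prints: 457<EOFD>Mar  1 2007<EOFD>sys<EORD>458<EOFD>
```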