Splitting a delimited text file

Howdy folks, I've got a very large plain text file that I need to split into many smaller files. My script-fu is not powerful enough for this, so any assistance is much appreciated.

The file is a database dump from a Cyrus IMAP server. It's basically a bunch of emails (thousands of them) all concatenated into one huge file, with a delimiter line between each email. It looks something like this:

--dump-4564564.some.jibberish.whatever
From: user@domain.com
To: myfriend@email.com

Email Body

Best Regards,
Email Author

--dump-789789863.random.numbers.maybe
From: anotheruser@domain.com
To: someguy@planet.earth

Email Body

Your Friend,
another user

So as you can see, the start of each email is preceded by a line that begins with "--dump".

What I'm looking for is:

  1. Split this monolithic file into many smaller files, where each smaller file contains a single email.
  2. Each smaller file should contain all of the lines of text after a "--dump" delimiter, up to the next "--dump" delimiter (or end of file).
  3. The "--dump" delimiter line itself should not be included in any of the smaller files.

I feel like some awk/grep/sed magic could do this, but I'm not enough of a wizard to write this script.

Thank you very much!

man csplit ?
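
For instance, something like this might do it (untested sketch; the {*} repetition to end-of-input works with GNU csplit, but older implementations may want an explicit count):

csplit -f msg- dump '/^--dump/' '{*}'

Each msg-NN piece will still begin with its --dump delimiter line, though.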

awk 'NR>1 {print > (OFN=FILENAME"."(NR-1)); close(OFN)}' RS="--dump[^\n]*" file

EDIT:

... implemented this above ...


It is hard to get csplit (and split) to drop the delimiter lines.

The awk script jethro provided is giving me some files just containing an empty line and some files just containing "dump". And, on many systems, this code will run out of file descriptors when you're processing a file containing a lot of mail messages.

Assuming that your file containing the dump of the mail messages is named dump, you might try something like:

awk '
/^--dump/ {	# a delimiter line marks the start of a new message
	if(ofn != "") close(ofn)	# close the previous output file to conserve file descriptors
	ofn = sprintf("message:%07d", ++f)
	next	# do not copy the delimiter line itself
}
{	print > ofn	# copy everything else to the current output file
}' dump

If you're running this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
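
For reference, run against the two-message sample shown at the top of this thread, it should leave you with exactly two files, message:0000001 and message:0000002, each holding one mail message with its delimiter line stripped.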


For my reference ...

... can you provide an example?

... what would be the threshold? Would gawk perform better?

Hi jethro,
Using the script you provided:

awk 'NR>1 {print > FILENAME"."(NR-1)}' RS="--dump[^\n]*" file

on Mac OS X I get:

awk: syntax error at source line 1
 context is
	NR>1 {print > >>>  FILENAME"." <<< 
awk: illegal statement at source line 1
awk: illegal statement at source line 1

The standards explicitly say that it is unspecified whether:

print > FILENAME"."(NR-1)

is interpreted as:

(print > FILENAME)"."(NR-1)

(as it is in awk on OS X) or as:

print > (FILENAME"."(NR-1))

(as it is on your system).
Changing your code to:

awk 'NR>1 {print > (FILENAME"."(NR-1))}' RS="--dump[^\n]*" file

with a file named file containing:

--dump-4564564.some.jibberish.whatever
From: user@domain.com
To: myfriend@email.com

Email Body

Best Regards,
Email Author

--dump-789789863.random.numbers.maybe
From: anotheruser@domain.com
To: someguy@planet.earth

Email Body

Your Friend,
another user

I get 6 files as shown here:

-rw-r--r--  1 dwc  staff     1 Apr 25 20:32 file.1
-rw-r--r--  1 dwc  staff     5 Apr 25 20:32 file.2
-rw-r--r--  1 dwc  staff   119 Apr 25 20:32 file.3
-rw-r--r--  1 dwc  staff     1 Apr 25 20:32 file.4
-rw-r--r--  1 dwc  staff     5 Apr 25 20:32 file.5
-rw-r--r--  1 dwc  staff   126 Apr 25 20:32 file.6

where file.1 and file.4 contain only a <newline> character; file.2 and file.5 contain "dump" and the line-terminating <newline> character; and file.3 and file.6 contain the requested mail messages plus the tail end of the headers:
file.3:

4564564.some.jibberish.whatever
From: user@domain.com
To: myfriend@email.com

Email Body

Best Regards,
Email Author


file.6:

789789863.random.numbers.maybe
From: anotheruser@domain.com
To: someguy@planet.earth

Email Body

Your Friend,
another user


Note that there are two empty lines at the end of both of the above files. I believe that only one empty line was expected. (The extra empty line appears because, with RS set to something other than <newline>, the <newline> characters preceding the next delimiter stay in the record, and print then appends its own ORS <newline> on top of them.)

The number of file descriptors used by awk for open input and output streams is implementation defined. Some versions of awk used to limit you to 9 open files. Most systems today allow around 1024 or 2048 file descriptors per process, and (unless you have privileges to raise that limit before you invoke awk) awk won't be able to have more files open than the number of file descriptors available to it. You may have noticed that my script closed the previous output file before opening the next one. This is usually a much better practice unless you know that your script will open fewer than ten files in its lifetime.
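
If you do need more than the default, shells such as bash and ksh let you inspect the limit and, within the hard limit, raise it before invoking awk:

ulimit -n	# show the current soft limit on open file descriptors
ulimit -n 4096	# raise it, if the hard limit and your privileges allow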

---------------

OOPS. I originally said that file.2 and file.5 contained "code". That has been corrected above. They contain "dump"; not "code".


I forgot to mention that the standards specify that the first character in the awk variable RS is used as the record separator. If RS contains more than one character, the standards explicitly state that the behavior is unspecified. (It appears that the awk on jethro's system treats RS as an ERE while the awk on OS X only uses the first character of RS.)
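
If you want to check which behavior your awk exhibits, a quick test along these lines (the input string is made up purely for illustration) will show you:

printf 'one--XXtwo--XXthree' | awk 'BEGIN{RS = "--XX"} {print NR ": " $0}'

An awk that treats RS as an ERE prints three records (one, two, three); one that honors only the first character of RS splits on every "-" instead, yielding five records (two of them empty).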

On my system (HP-UX 11.31) I get:

awk: Input line Disposition: attachm cannot be longer than 3,000 bytes.
The input line number is 53. The file is qsubmit.processed.dump.
The source line number is 1.

FYI, the input file has emails as large as several megabytes (because of MIME-encoded attachments).

Thanks!

---------- Post updated at 11:48 AM ---------- Previous update was at 11:47 AM ----------

Dropping them is ideal, but not necessarily a problem for me, as I can "grep -v" to remove them in a second pass.
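
Something along these lines, for example (untested sketch; xx* is csplit's default output naming):

csplit -n 5 dump '/^--dump/' '{*}'
for f in xx*; do
	grep -v '^--dump' "$f" > "$f.eml" && rm "$f"
done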

---------- Post updated at 03:50 PM ---------- Previous update was at 11:48 AM ----------

Ok, I got what I needed using this. Thank you all for the helpful ideas, it got me pointed down the right path.

csplit -n 5 "$1" /-dump-/ {*}

mkdir -p ./output   # make sure the destination directory exists

for i in xx*; do
  awk 'NR > 2' "$i" > ./output/"$i".eml
  rm "$i"
done

Just out of curiosity, why did you decide not to use the awk script I suggested?

awk '
/^--dump/ {
	if(ofn != "") close(ofn)
	ofn = sprintf("message:%07d", ++f)
	next
}
{	print > ofn
}' dump

It only invokes awk once (instead of once per extracted message) and only reads and writes the data found in your input file once (instead of twice), so it should be considerably faster.

Thank you Don, I agree that it is more elegant to only invoke awk one time, rather than twice. Even more so when you consider that the dump files I need to split up are ~4 GB each in size, containing ~12,000 emails each.

But at least on my HP-UX 11.31 servers, I get the following awk error:

itl1 # ./script.sh ./admin.inbox
awk: A print or getline function must have a file name.
 The input line number is 1. The file is ./admin.inbox.
 The source line number is 7.
itl1 # cat script.sh
awk '
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1
itl1 #
itl1 # uname -a
HP-UX itl1 B.11.31 U ia64 3456089508 unlimited-user license
itl1 #

---------- Post updated at 04:31 PM ---------- Previous update was at 04:19 PM ----------

Update: Trying the same thing on RHEL6, I get the following error:

[root@email root]# ./script.sh ./sub.proc
awk: cmd. line:6: (FILENAME=./sub.proc FNR=1) fatal: expression for `>' redirection has null string value
[root@email root]# cat script.sh
#!/bin/sh
awk '
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1
[root@email root]# uname -a
Linux email.dev 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Dec 13 06:58:20 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@email root]#

If your input files contain 12,000 messages, your script is invoking awk 12,000 times; not 2 times!

In your sample input in the first message in this thread, you showed that the 1st line of your input file started with --dump. From those error messages, I have to assume that the 1st line of your input file does not start with that string.

If the data before the 1st --dump line in your file is a mail message you want to keep, change:

awk '

to:

awk '
BEGIN {	ofn = sprintf("message:%07d", ++f)
}

otherwise, change it to:

awk '
BEGIN {	ofn = "/dev/null"
}

Note that on many filesystem types, putting 12,000 files in a single directory may make processing files in that directory slow. You might want to create intermediate directories to reduce the number of files per directory, as sketched below.
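
As a rough illustration (the 1,000-file bucket size and the msgs directory naming are my own assumptions, not anything from this thread), the ofn assignment in the script above could be extended to spread the output across subdirectories:

awk '
/^--dump/ {
	if(ofn != "") close(ofn)
	dir = sprintf("msgs%04d", int(++f / 1000))	# e.g. msgs0000, msgs0001, ...
	if(!(dir in made)) {	# create each bucket directory once
		system("mkdir -p " dir)
		made[dir] = 1
	}
	ofn = sprintf("%s/message:%07d", dir, f)
	next
}
{	print > ofn
}' dump

Note this variant omits the BEGIN handling discussed above; wire in whichever BEGIN block fits your data.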


Yep, you caught me there. :o The first line of the input file does not begin with "--dump". There are a few lines of metadata at the head of the file. This metadata can be discarded. After the few lines of metadata, it's all "--dump" delimited emails.

Thanks again for the suggestions, I'll try them when I'm back in the office tomorrow morning.

Ok, just like you said, this worked perfectly for me, so I'll be using this on my server. THANK YOU!

[root@email root]# cat script.sh
#!/bin/sh
awk '
BEGIN { ofn = "/dev/null"
}
/^--dump/ {
        if(ofn != "") close(ofn)
        ofn = sprintf("message:%07d", ++f)
        next
}
{       print > ofn
}' $1

I'm glad it worked for you. In the future, please be sure that you fully describe your input file format so we can avoid providing solutions that do what you asked for, but not what you needed. :)