How to conditionally display and remove first line only?

I have a maildir hierarchy of 90k eml files, and:

1) I would like to walk the tree and display the first line of any file whose first line begins with:

From -

That's "From space dash space", and only if those are the first seven characters of the first line in the file.

2) I would also like to count the number of files that have this first line.

3) Finally, I would like to completely remove that line from those files.

Thanks,

Jason

P.S. I only really need 3, but I would love to do 1 and 2 first. TIA

If this is a homework assignment, it needs to be refiled in the Homework and Coursework Questions forum following the special rules (including filling out the homework questionnaire) required when submitting any question to that forum.

If this is not a homework assignment, please explain why a *.eml would have a 1st line like that and why removing that line would "improve" things.

Thanks for the response Don.

  1. It's not a homework assignment.
  2. Some time back I switched my email client from kmail to the Thunderbird Linux client.
  3. When I switched clients I maintained my maildir mailstore format, instead of converting everything to mbox.
  4. At some point, unbeknownst to me, either initially or during an update, Thunderbird began inserting what was described to me as an mbox postmark at the front of every new EML file (which, as was also explained to me, it's not supposed to do).
  5. It's only now become an issue, because I want to convert the maildir mailstore over to mbox, and every conversion method/program I've tried, drops any message with the mbox postmark, without converting it. This means after conversion, I lose all messages after about Oct 2012.
  6. I tested it by manually removing the first line from several affected emails, and my conversion program successfully converted the message once the mbox postmark line was removed.

Here's a comparison of EML file headers: the first few from one file that successfully converts, and the first few from one file that won't.

Example EML file headers from a file that successfully converts

From company@company.com Tue Aug 07 19:10:33 2012
Return-path: <company@company.com>
Received: from [1.1.1.1] (helo=email-server.com) by
email-server.com with esmtp (Exim 4.69) id 12345-67890-AA for
user@email-server.com; Tue, 07 Aug 2012 19:10:33 +0200
Received: from exim by email-server.com with dspam-scanned (Exim
4.71) id 12345-67890-AA for user@email-server.com; Tue, 07 Aug
2012 19:10:32 +0200
Received: from exim by email-server.com with sa-scanned (Exim 4.71)
id 12345-67890-AA for user@email-server.com; Tue, 07 Aug 2012
19:10:32 +0200

Example EML file headers from a file that fails to convert

From - Wed Dec 5 11:13:43 2012
X-Account-Key: accountz
X-UIDL: 123456789.0000.email-server,S=47979
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys: 
>From company@company.com Wed Dec 05 17:12:38 2012
Return-path: <company@company.com>
Received: from [1.1.1.1] (helo=email-server.com)
     by email-server.com with esmtp (Exim 4.69)
     id 12345-6789-00
     for user@email-server.com@email-server.com; Wed, 05 Dec 2012 17:12:38 +0100
Received: from exim by email-server.com with dspam-scanned (Exim 4.76)
     id 12345-6789-00
     for user@email-server.com@email-server.com; Wed, 05 Dec 2012 17:12:37 +0100
Received: from exim by email-server.com with sa-scanned (Exim 4.76)
     id 12345-6789-00
     for user@email-server.com@email-server.com; Wed, 05 Dec 2012 17:12:37 +0100

If that first line from the second set of headers:

From - Wed Dec 5 11:13:43 2012

is in the EML file, the conversion program skips the message. If it's removed, the conversion program converts the email. That's the reason for my request for help.

Jason

Please take a look at man grep (linux).

derekludwig,
The grep utility searches entire files; not just the first line.
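
One way to restrict the match to the first line is to feed grep only that line. A minimal sketch, assuming a POSIX shell; the function name has_postmark is made up for illustration:

```shell
# has_postmark FILE: succeed only if the first line of FILE starts with
# "From - ". Later lines are never examined.
# (has_postmark is a hypothetical helper name, not part of any tool.)
has_postmark() {
    head -n 1 -- "$1" | grep -q '^From - '
}
```

It could be used as: if has_postmark "$file"; then echo "$file"; fi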

jasn,
The following might not be highly efficient, but it shouldn't be too bad. Your 1st post in this thread talked about "eml" files; but in post #3 in this thread you talked about "EML" files. The following script will process any files in and under the directory you're in when you run it with names ending with . followed by eml in lower case, upper case, or mixed case letters:

#!/bin/ksh
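bfc=0	# count of "bad" files found so far
find . -type f -name '*.[Ee][Mm][Ll]' | while IFS= read -r file
do	# Split the 1st line of the file into its 1st two fields and the rest.
	read -r f1 f2 rest < "$file"
	if [ "$f1" = "From" ] && [ "$f2" = "-" ]
	then	# Bad file found: report it, then delete its 1st line.
		bfc=$((bfc + 1))
		printf 'bad file #%d: %s\n\tFrom - %s\n' "$bfc" "$file" "$rest"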
		ed -s "$file" <<-EOF
			1d
			w
			q
		EOF
	fi
done

This was tested with ksh and bash , but should work with any POSIX conforming shell. (It won't work with a csh derivative nor with an original Bourne shell (such as /bin/sh on Solaris systems).)

Note that the characters before the EOF at the end of the ed here-document must be tabs, not spaces; the <<- operator strips only leading tabs.
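
For anyone who wants to see the list and the count before anything is changed, here is a report-only sketch of the same kind of loop. It assumes that no file name contains a newline, and the name report_postmarks is hypothetical:

```shell
# report_postmarks DIR: print "file: first line" for every regular file under
# DIR whose first line starts with "From - ", then print the total.
# Nothing is modified. (report_postmarks is a hypothetical name.)
report_postmarks() {
    find "$1" -type f | {
        count=0
        while IFS= read -r file
        do
            first=
            # Read just the first line; IFS= keeps leading blanks intact.
            IFS= read -r first < "$file"
            case $first in
            "From - "*)
                count=$((count + 1))
                printf '%s: %s\n' "$file" "$first"
                ;;
            esac
        done
        # The count is printed inside the same subshell as the loop, so it
        # is not lost when the pipeline ends.
        printf 'files with postmark: %d\n' "$count"
    }
}
```

Running report_postmarks . from the top of the hierarchy covers steps 1 and 2 of the original question without touching any file.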


With respect to Don, I was thinking of this as three separate tasks:

find . -type f -name '*.[Ee][Mm][Ll]' -print | xargs head -1 | grep '^From - '
find . -type f -name '*.[Ee][Mm][Ll]' -print | xargs head -1 | grep -c '^From - '
find . -type f -name '*.[Ee][Mm][Ll]' -print0 | xargs -0 sed -i -e  '2b' -e '/^From - /d'

Though I would create a backup of the original file...

find . -type f -name '*.[Ee][Mm][Ll]' -print0 | xargs -0 sed -ibak  -e '2b' -e '/^From - /d'

These can be removed at a later point in time, once the export is successful.

One question, does the >From line have to be restored, so that the original sender information is imported?

find . -type f -name '*.[Ee][Mm][Ll]' -print0 | xargs -0 sed -ibak   -e '/^$/,$b' -e 's/>From /From /' -e '2b' -e '/^From - /d'

Splitting this into three separate tasks is a reasonable way to do what you're trying to do. I just didn't see how a list of up to 90000 lines with no indication of what file they came from would be very useful. I assumed it would be more useful to count the files and print the 1st line of the files that need to be modified in a single step.

The grep -c pipeline should work OK if you just want to get a count of the files that jasn wanted to edit.

The sed -i pipelines edit every file; not just those that jasn wanted to modify. And, if there are any lines (other than the 1st two lines) that start with From - , they will be removed from the file even if they were data in the middle of a mail message. Did you intend to use -e '2,$b' instead of -e '2b' ?

I'm not following the logic of what you're trying to do with the last pipeline (the one that also substitutes >From ). Are you trying to remove every line starting with >From - or From - and change every other occurrence of >From to From anywhere else on a line that appears after line 2 and before the first empty line in each file?


Don,

You are correct, it should have been 2,$b .

The idea was to remove the "From - " line if it was on the first line, and then remove the ">" from the line that contains >From in the header.

And yes, this will modify every file, but correctly formed files will not be changed.
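
The same intent can be written so that sed only ever considers line 1, and so that files without the postmark are not rewritten at all. This is a sketch, not a tested replacement for the pipelines above: it assumes mktemp is available (it is not in POSIX), rewrites through a temporary file instead of using sed -i, and the name strip_postmark is hypothetical:

```shell
# strip_postmark FILE: delete line 1 of FILE only when it starts with
# "From - ". Lines 2..$ are never tested against the pattern, and FILE is
# left untouched when there is nothing to delete.
# (strip_postmark is a hypothetical name.)
strip_postmark() {
    tmp=$(mktemp) || return 1
    if sed '1{/^From - /d;}' "$1" > "$tmp" && ! cmp -s "$1" "$tmp"
    then
        # Overwrite in place so the owner and mode of FILE are kept.
        cat "$tmp" > "$1"
    fi
    rm -f "$tmp"
}
```

It takes one file per call, for example from a find loop or from find ... -exec.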


Comparing above mail files to what I have on my systems, the first is an "mbox" type mail, while the second seems to be or already have been converted to "thunderbird" format.
Did you try to make thunderbird read them directly?

Don and Derek,

Thanks so much for the assistance. I really appreciate the fact that your help will allow me to change mail clients again, (which is why I'm doing the conversion to mbox finally), and not lose the last couple of years of emails.

For an initial quick response, the files are email messages stored individually as separate files in a maildir hierarchy. While such files usually carry the .eml extension in a Microsoft Windows environment, in my hierarchy they're stored as files whose names are a string of numbers with no extension.

When I edited the first find commands that Derek suggested to search for all files, (removed the EeMmLl), it did indeed locate, and then later count, all affected files. Additionally while I originally thought I should remove the > character from the actual From header line, I found out that the conversion program works whether the > character is there or not. The important part is the removal of the first From line, (mbox postmark), when it exists. I appreciate your suggestion though of removing the > character from the actual From line, and will look to use that as well.

Finally, Rudi, the example headers I posted are pulled from two separate EML files in my hierarchy. They're not in an mbox file (which is basically a single file that contains a number of email messages).

Thanks all,

Jason

---------- Post updated at 02:08 PM ---------- Previous update was at 08:58 AM ----------

Thanks again Don and Derek. I ended up using Don's shell script, editing the find filename parameters as follows:

#!/bin/bash
bfc=0
find . -type f -name '*' | while read -r file
do	read -r f1 f2 rest < "$file"
	if [ "$f1" = "From" ] && [ "$f2" = "-" ]
	then	# Bad file found...
		bfc=$((bfc + 1))
		printf 'bad file #%d: %s\n\tFrom - %s\n' $bfc "$file" "$rest"
		ed -s "$file" <<-EOF
			1d
			w
			q
		EOF
	fi
done

I ran it from the top level maildir subdirectory, and it worked perfectly, finding every file with an mbox postmark in my maildir hierarchy, displaying the filename, counting the file, and then deleting the mbox postmark from the affected file.

Total 'bad' eml files discovered, and cleaned of mbox postmark; 14,473

Thanks very much to both of you!

Jason

Jason,
I'm very glad you were able to get my script to work for you. For future reference note that:

find file... -type f -name '*'

will produce exactly the same output as:

find file... -type f

since the -name pattern * will match every file.


Don,

One last thing. Based on Derek's suggestion and his find one liners, is it possible to add to your shell script a function to replace all lines that begin with

>From

with

From

I'm not sure that is having an effect on this conversion, but I think it would be useful to clean things up from that perspective.

Thanks,

Jason

In traditional mbox format mail files, mail body lines starting with From are converted by software submitting messages to the mailer daemons to >From to keep it from being misinterpreted as a message boundary. Assuming that the lines you really want to convert are headers that start with >From and contain an email address (i.e. something containing an @ as in user@company.com), the following, in addition to what it did before, will remove the > from the start of a line that starts with >From and contains an @ :

		ed -s "$file" <<-EOF
			1d
			g/^>From .*@/s/^>//
			w
			q
		EOF

If you really want to remove the > from every line containing >From , change the global substitute line above to:

			g/^>From /s//From /
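
As an alternative sketch, using sed as a filter rather than ed, the substitution can be limited to the header block so that quoted lines in the message body stay quoted. This assumes the header block runs from line 1 through the first empty line; the name unquote_header_from is hypothetical:

```shell
# unquote_header_from: read a message on stdin and write it to stdout with
# the leading ">" removed from ">From " lines in the header block only,
# i.e. from line 1 through the first empty line. Body lines are not changed.
# (unquote_header_from is a hypothetical name.)
unquote_header_from() {
    sed '1,/^$/s/^>From /From /'
}
```

For example, unquote_header_from < message > message.new writes the cleaned copy to a new file and leaves the original alone.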