convert email headers' encoding?

hi all -

first, huge thanks to anyone who might be able to help me out with this. it's fairly esoteric, but it seems like there has to be an answer for me...

  • the environment:

mac os x 10.5.x server
communigate pro (mail server)
bash script (read on)

  • the brief:

my script is meant to parse a spam folder; it puts together a nicely-formatted summary email of all messages that have arrived in the past 24 hours, showing only the From: and Subject: lines. mechanically speaking, it works great.

  • the problem:

encodings. some character sets (russian/cyrillic; japanese; presumably chinese) break my script pretty badly - a mailer will display them properly in the From or Subject line, but in the body of my email, it just shows them as garbage, i assume because my emails are using another character set. for example:

Subject: =?koi8-r?B?UmU6IMvVxMEg0M/FxMnNIM/UxNnIwdTYPw==?=

the script is smart enough to find the encoding and run the whole message through iconv - but that doesn't seem to help with the header lines, only the email body. which is ignored by the script, so...yeah.

  • the question:

does anyone know of a way to properly convert these header lines, ideally into something like utf-8? alternatively, would it help if i specified some text encoding in the summary email itself instead?

for what it's worth, when the lines are displayed in the summaries, i've stripped out the Subject: and From: part, leaving only the actual subject and from text in place. in case that matters...

thanks for reading,
-john.

When I had character-set problems on HP-UX (which uses roman8), I used a .mailrc file containing:
set crt=21
set encoding=8bit
set charset=iso-8859-1
#

It might be worth investigating.

hi vbe -

i'll check that out. in the meantime, i tried changing the charset in the emails themselves from us-ascii to utf-8 (which i think would accomplish pretty much the same thing), with no effect.

i also realized that i could've provided a little more info - sorry, folks. the accounts all have .mdir mailboxes (as opposed to .mbox) - so each message is its own rfc 822-compliant textfile. that means the script is plowing through sometimes hundreds of files per account, and pulling only what it needs (in this case, from, subject, and a couple of other things that are irrelevant to this problem).

for each message, it takes that info and writes it all to one line in a temp file, then moves on to the next. when it's processed all the messages for that account, it reads back the file it just finished writing (which consists of the from & subject lines plus that other info, like a from name and its spam score), one line at a time, and clunks those bits of info into the body of the summary email.

i guess it's a little more complicated than i remembered - but again, the mechanics are working fine; it's just this charset thing that's broken.

thanks again to anyone with a tip,
-john.

It's no wonder that switching to UTF-8 "doesn't work": message headers must consist entirely of ASCII, and anything else must be encoded. UTF-8 is no exception to this rule (I still think UTF-8 is a better choice than the legacy encodings, but that has no bearing on your issue).

The subject header you quoted has been encoded as MIME requires. The details are in RFC 2047 itself:

http://www.rfc-editor.org/rfc/rfc2047.txt

I don't think you will easily find a shell script that does MIME decoding for you. Even with Perl, you would need to install a set of extra modules to parse all of that properly. If you are willing to use PHP for the parsing, it is probably the easiest route, because the support is built in and you avoid a lot of module installation. As an example, here is how to parse the sample you quoted:

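<?php

// In PHP 5, iconv_mime_decode() is the easiest way to do all of this.
// This manual version handles Base64 ("B") encoded words only.
$mstring = '=?koi8-r?B?UmU6IMvVxMEg0M/FxMnNIM/UxNnIwdTYPw==?=';

// Split the encoded word into its charset and its Base64 payload.
// The /i flag accepts a lowercase "b" as the encoding marker too.
if (preg_match('/^=\?([^?]+)\?B\?([^?]+)\?=$/i', $mstring, $array)) {
    list(, $charset, $encoded) = $array;
    $str = base64_decode($encoded);
    // Convert from the declared charset to UTF-8.
    echo iconv($charset, "UTF-8", $str);
}

?>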

So on my terminal, I got

Re: куда поедим отдыхать?

Not sure what it is, but it looks properly decoded.
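
For comparison, here is a rough sketch of the same decoding in Python, whose standard library ships an RFC 2047 decoder in email.header. It assumes Python 3 is available on the server, which the thread does not confirm, and the helper name decode_mime_header is made up for this illustration.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw):
    """Return an RFC 2047 encoded header value as a Unicode string."""
    # decode_header() splits the value into (text, charset) chunks;
    # make_header() joins them back together using each chunk's charset.
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    sample = "=?koi8-r?B?UmU6IMvVxMEg0M/FxMnNIM/UxNnIwdTYPw==?="
    print(decode_mime_header(sample))
```

A header value without any encoded words passes through unchanged, so the function can be applied to every From and Subject line without checking first.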

well, then...time to learn some php!

i'll see if i can't roll your code into something that works in my environment.

thanks for your help!

-john.

My code was only meant to show the general process of MIME decoding (mostly the concept). It is not good enough for production use. Parsing real-world email messages is likely to be somewhat more complex because of the variations that exist.

To be frank, if you can get hold of PHP 5, the simplest approach, as noted in the inline comment, is the iconv_mime_decode() function, which is a one-stop shop for what you want. My posted code has an (intentional) limitation: it doesn't handle quoted-printable encoding, which MIME also supports. For simplicity I posted only the part that decodes Base64, because that is what your sample used.
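
To make the Base64 versus quoted-printable distinction concrete, here is a minimal sketch in Python (not PHP, and not the code posted earlier) that decodes both kinds of encoded word by hand. The names ENCODED_WORD, decode_encoded_word and decode_header_value are invented for this example. It assumes well-formed input and a charset name that Python recognizes; an unknown charset would raise LookupError.

```python
import base64
import quopri
import re

# One RFC 2047 encoded word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")


def decode_encoded_word(match):
    """Decode a single encoded word matched by ENCODED_WORD."""
    charset, encoding, payload = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        # header=True makes quopri treat "_" as a space, as the "Q"
        # encoding of RFC 2047 requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset, errors="replace")


def decode_header_value(value):
    """Replace every encoded word in a header value with its decoded text."""
    # Whitespace between two adjacent encoded words is not part of the text.
    value = re.sub(r"(\?=)\s+(=\?)", r"\1\2", value)
    return ENCODED_WORD.sub(decode_encoded_word, value)
```

The only difference between the two branches is the byte-level decoding step; the charset conversion afterwards is the same in both cases.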

Once you have a grasp of the concepts, you can check whether other languages or tools suit your environment better than PHP. A PHP install is typically pretty big, so it may not be suitable for every deployment environment (for example, one with very limited storage space).
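
As one example of such an alternative, the sketch below uses Python's standard email package to read only the headers of a one-message-per-file mailbox entry and return From and Subject already decoded. It assumes Python 3 is installed, and the function name summarize and the tab-separated output are illustrative choices, not part of the original script.

```python
import sys
from email import policy
from email.parser import BytesParser


def summarize(path):
    """Return (from, subject) of one message file as decoded Unicode strings."""
    with open(path, "rb") as handle:
        # headersonly=True stops header parsing at the first blank line,
        # so the message body is never interpreted.
        message = BytesParser(policy=policy.default).parse(handle, headersonly=True)
    # With policy.default, header lookups return RFC 2047 decoded text.
    return str(message.get("From", "")), str(message.get("Subject", ""))


if __name__ == "__main__":
    # Usage (hypothetical file name): python3 summarize.py message-file ...
    for path in sys.argv[1:]:
        sender, subject = summarize(path)
        print(sender + "\t" + subject)
```

Because the output is Unicode text, the summary email that carries it would need to declare a matching charset such as UTF-8 for a mail client to display it correctly.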