Archived Forum Post

Question:

Email with Polish chars gets corrupted?

Sep 16 '14 at 09:21

When loading an email containing Polish characters that is 8bit UTF-8 encoded, I am seeing the Polish characters corrupted after loading.

An example .eml is attached; running it through the following code shows the corruption. Specifying the encoding does not appear to help either.


Email email = new Email();
email.LoadEml(@"C:\Users\chilkat\Desktop\original.eml");
System.IO.File.WriteAllText(@"C:\Users\chilkat\Desktop\mimed.eml", email.GetMime());

Is there any way to work around this?

Answer

The bug is in the 3 lines of the C# code above.

The call to email.GetMime() returns a C# string containing the MIME of the email. Note that in C# and VB.NET, a string is an object. It is not a raw sequence of bytes exposed to the application. Internally, the string object holds characters, and how those characters are stored is of no concern to the application. The application can compare, concatenate, and otherwise manipulate strings without worrying about how each character is represented in raw bytes. This becomes obvious when you convert a string to a byte array, because you must name a character encoding, such as utf-8. For example:

byte[] bytes = System.Text.Encoding.UTF8.GetBytes(text);
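
To make the point concrete, here is a minimal sketch showing that one and the same string yields different bytes depending on the encoding you ask for. The class and method names (StringBytesDemo, ToHex) are made up for illustration; only the standard System.Text encodings are used.

```csharp
using System;
using System.Text;

public static class StringBytesDemo
{
    // Returns the bytes of `text` under `encoding`, formatted as hex pairs.
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string text = "\u0142"; // Polish "ł" (U+0142): one character in the string

        Console.WriteLine(ToHex(text, Encoding.UTF8));    // C5-82 (two bytes in utf-8)
        Console.WriteLine(ToHex(text, Encoding.Unicode)); // 42-01 (UTF-16, little endian)
        Console.WriteLine(ToHex(text, Encoding.ASCII));   // 3F    (not representable, becomes '?')
    }
}
```

The string never changes; only the choice of encoding at the moment of conversion decides which bytes come out.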

Now, the MIME contains a charset attribute within the Content-Type header that tells a program the character encoding of the MIME itself. In your case, the charset is "utf-8", and therefore the MIME should be saved using the utf-8 encoding. If it is saved in some other encoding, a program reading the MIME will examine the charset in the Content-Type header and try to interpret the bytes as utf-8 when in fact they are not.
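
As a rough illustration of what a reading program does, the sketch below pulls the charset out of a Content-Type header value with the standard System.Net.Mime.ContentType class and decodes bytes with it. The names CharsetDemo and EncodingFromContentType are hypothetical, and the fallback to US-ASCII when no charset is given is an assumption based on the MIME default for text/plain.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetDemo
{
    // Maps a Content-Type header value to a .NET Encoding.
    // Assumption: fall back to US-ASCII when no charset parameter is present.
    public static Encoding EncodingFromContentType(string headerValue)
    {
        ContentType contentType = new ContentType(headerValue);
        string charset = contentType.CharSet;
        return string.IsNullOrEmpty(charset) ? Encoding.ASCII : Encoding.GetEncoding(charset);
    }

    public static void Main()
    {
        Encoding declared = EncodingFromContentType("text/plain; charset=utf-8");
        string original = "\u00F3"; // "ó"

        // Bytes that really are utf-8 decode back to the original text.
        byte[] matching = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(declared.GetString(matching) == original); // True

        // Bytes saved as iso-8859-1 but read as the declared utf-8 do not.
        byte[] mismatched = Encoding.GetEncoding("iso-8859-1").GetBytes(original);
        Console.WriteLine(declared.GetString(mismatched) == original); // False
    }
}
```

The second case is the corruption described above: the header promises utf-8, the bytes are something else, and the reader has no way to know.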

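Maybe the bug is clear to you now. The call to email.GetMime() returns a string, but for it to be saved to a file correctly, it must be saved using utf-8. The call to WriteAllText never looks at the MIME's charset: it writes the string returned by GetMime in whatever encoding the overload being called uses. The two-argument overload is documented as writing UTF-8 without a byte order mark, so it is safer to pass the encoding explicitly, or to write raw bytes, than to rely on an implicit default matching the charset declared in the MIME.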
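
If you do want to keep working with the string, one way to save it as utf-8 is to name the encoding in the call. This is only a sketch: SaveTextDemo and SaveAsUtf8 are made-up names, and the MIME text is a stand-in for whatever email.GetMime() would return, so that the example runs without the Chilkat assembly.

```csharp
using System;
using System.IO;
using System.Text;

public static class SaveTextDemo
{
    // Saves `mime` as utf-8 so the bytes on disk agree with a "charset=utf-8" header.
    public static void SaveAsUtf8(string path, string mime)
    {
        // `false` suppresses the byte order mark, which does not belong
        // at the start of a MIME file.
        File.WriteAllText(path, mime, new UTF8Encoding(false));
    }

    public static void Main()
    {
        // Stand-in for the string email.GetMime() would return.
        string mime =
            "Content-Type: text/plain; charset=utf-8\r\n" +
            "Content-Transfer-Encoding: 8bit\r\n" +
            "\r\n" +
            "Za\u017C\u00F3\u0142\u0107\r\n";

        string path = Path.GetTempFileName();
        SaveAsUtf8(path, mime);

        // The file now holds exactly the utf-8 bytes of the string.
        byte[] onDisk = File.ReadAllBytes(path);
        byte[] expected = new UTF8Encoding(false).GetBytes(mime);
        Console.WriteLine(BitConverter.ToString(onDisk) == BitConverter.ToString(expected)); // True
        File.Delete(path);
    }
}
```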

Remember, a string is an object. Because GetMime returns a string, it always returns the same string regardless of the "charset" of the email. You may instead call email.GetMimeBinary(), which returns the MIME as a byte array, and that byte array will be in the correct encoding as indicated within the MIME's Content-Type header. Alternatively, call GetMime() and then save the MIME using the correct charset. I think calling GetMimeBinary is the better solution, just in case the MIME contains a body in a binary content-encoding, where the bytes don't actually represent characters. Most MIME messages have bodies encoded as base64 or quoted-printable, where this doesn't matter.
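
The reason the byte array route is safer can be sketched without Chilkat at all. Below, a stand-in byte array plays the role of what email.GetMimeBinary() would return; BinaryBodyDemo and SurvivesStringRoundTrip are hypothetical names. Bytes that are valid utf-8 survive a trip through a string, while arbitrary binary bytes do not, and File.WriteAllBytes sidesteps the question entirely.

```csharp
using System;
using System.IO;
using System.Text;

public static class BinaryBodyDemo
{
    // True when decoding `raw` as utf-8 and encoding it again gives back the same bytes.
    public static bool SurvivesStringRoundTrip(byte[] raw)
    {
        string asText = Encoding.UTF8.GetString(raw);
        byte[] back = Encoding.UTF8.GetBytes(asText);
        return BitConverter.ToString(raw) == BitConverter.ToString(back);
    }

    public static void Main()
    {
        byte[] textBody = Encoding.UTF8.GetBytes("Za\u017C\u00F3\u0142\u0107");
        byte[] binaryBody = { 0xFF, 0xD8, 0xFF, 0xE0 }; // not valid utf-8

        Console.WriteLine(SurvivesStringRoundTrip(textBody));   // True
        Console.WriteLine(SurvivesStringRoundTrip(binaryBody)); // False

        // Writing the byte array directly involves no decoding at all.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, binaryBody);
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path))); // FF-D8-FF-E0
        File.Delete(path);
    }
}
```

In the original three lines, the equivalent change would be to pass the result of email.GetMimeBinary() to System.IO.File.WriteAllBytes instead of passing email.GetMime() to WriteAllText.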