Archived Forum Post

Question:

UTF-8 Characters Not Encrypting

Jul 19 '12 at 12:51

In C# I am using the code below, which works fine for ASCII characters but returns a null value if the UnsecuredString passed to crypt.EncryptStringENC contains non-ASCII (UTF-8) characters such as Հայաստանի Հա

Is there a workaround to this issue?

Here is the code:

            Chilkat.Crypt2 crypt = new Chilkat.Crypt2();

            bool success = crypt.UnlockComponent(licenseKey);

            if (!success)
                return "LICENSE EXPIRED";

            //  Specify 3DES for the encryption algorithm:
            crypt.CryptAlgorithm = "3des";

            crypt.CipherMode = "ecb";

            //  For 2-Key Triple-DES, use a key length of 128
            //  (Given that each byte's msb is a parity bit, the strength is really 112 bits).
            crypt.KeyLength = 128;

            //  Pad with zeros
            crypt.PaddingScheme = 3;

            //  EncodingMode specifies the encoding of the output for
            //  encryption, and the input for decryption.
            //  It may be "hex", "url", "base64", or "quoted-printable".
            crypt.EncodingMode = "hex";

            //  Let's create a secret key by using the MD5 hash of a password.
            //  The MD5 algorithm produces a 16-byte hash (i.e. 128 bits)
            crypt.HashAlgorithm = "md5";
            string keyHex;
            keyHex = crypt.HashStringENC("Random Value");

            //  Set the encryption key:
            crypt.SetEncodedKey(keyHex, "hex");

            //  Encrypt
            encryptedString = crypt.EncryptStringENC(UnsecuredString);

Answer:

The answer to your question requires that you first understand the fundamental difference between a "string" data type in a language such as C#, and the byte representation of a string for a given character encoding (i.e. charset).

In C#, the string type represents a string of Unicode characters. (string is an alias for System.String in the .NET Framework.) Notice the use of the word "characters" (not "bytes"). The methods of the string class in C# (and in any programming language that has a "string" class) are such that they hide the underlying byte representation of the characters. There are methods to get the length (in characters, not bytes), to get the Nth character, to append, prepend, find sub-strings, or display the glyphs in a user interface control, such as a text box. All of these methods are designed to not care about the underlying/internal byte representation.

However, as soon as you want to write the string to a file, the byte representation of the characters becomes important. The same goes for reading a string from a file. For example, in C# you can write a string to a file via the System.IO.File.WriteAllText method. There are two overloads for this method:

// This first one does not specify the character encoding because it assumes the utf-8 encoding:
public static void WriteAllText(
string path,
string contents
)

// The 2nd overload allows you to explicitly specify the character encoding:
public static void WriteAllText(
string path,
string contents,
Encoding encoding
)
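To see that the encoding argument really changes what lands on disk, here is a minimal sketch that writes the same one-character string with two encodings and compares the file sizes. The class and method names (WriteAllTextDemo, WrittenSize) are made up for illustration, and the sketch assumes a runtime where the "iso-8859-1" encoding is available by name.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriteAllTextDemo
{
    // Hypothetical helper: writes the text to a temp file with the given
    // encoding and returns the resulting file size in bytes.
    public static long WrittenSize(string contents, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, contents, encoding);
            return new FileInfo(path).Length;
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string text = "\u00C9"; // the single character É

        // UTF8Encoding(false) means utf-8 without a byte order mark.
        Console.WriteLine(WrittenSize(text, new UTF8Encoding(false)));           // 2
        Console.WriteLine(WrittenSize(text, Encoding.GetEncoding("iso-8859-1"))); // 1
    }
}
```

The string is identical in both calls; only the byte representation chosen at write time differs.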

This is important because any given character can have different byte representations based on the character encoding.

For example, consider this character: É

In the iso-8859-1 character encoding, it is represented by a single byte: 0xC9
In the utf-8 character encoding, it is represented by two bytes: 0xC3 0x89
In the ucs-2 character encoding, it is represented by two bytes: 0x00 0xC9 (big-endian byte order)
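The three byte sequences above can be reproduced with the System.Text.Encoding classes. This is only an illustrative sketch: the ByteRepresentations class and its Hex helper are names invented here, and Encoding.BigEndianUnicode (UTF-16 big-endian) stands in for ucs-2, which gives the same bytes for this character.

```csharp
using System;
using System.Text;

public static class ByteRepresentations
{
    // Hypothetical helper: encodes the text and formats the bytes as hex.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string s = "\u00C9"; // É

        Console.WriteLine(Hex(s, Encoding.GetEncoding("iso-8859-1"))); // C9
        Console.WriteLine(Hex(s, Encoding.UTF8));                      // C3-89
        Console.WriteLine(Hex(s, Encoding.BigEndianUnicode));          // 00-C9
    }
}
```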

If a program writes a file containing "É" using the utf-8 encoding, and another program reads the file but instead interprets the bytes according to the iso-8859-1 encoding, the result would be garbage.
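The mismatch is easy to simulate in memory, without any files. The sketch below (the EncodingMismatch class and Misread method are hypothetical names) encodes a string as utf-8 and then decodes those same bytes as iso-8859-1.

```csharp
using System;
using System.Text;

public static class EncodingMismatch
{
    // Hypothetical helper: what a reader sees when it assumes iso-8859-1
    // for bytes that were actually written as utf-8.
    public static string Misread(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        string garbled = Misread("\u00C9"); // É

        // One character went in; two unrelated characters come out:
        // U+00C3 (Ã) and U+0089 (an invisible control character).
        Console.WriteLine(garbled.Length); // 2
        foreach (char c in garbled)
        {
            Console.WriteLine("U+" + ((int)c).ToString("X4"));
        }
    }
}
```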

Now, to finally answer your question: encryption and decryption algorithms operate on bytes. (Just like file reading and writing, where ultimately it is bytes that are read and written.) So the question is: what byte representation of the "string" is being encrypted? Is it utf-8? ANSI (more about ANSI below)? ucs-2? Or something else?

The Chilkat.Crypt2.Charset property controls the character encoding used for the byte representation of the string, and it defaults to the ANSI charset.

The ANSI charset is the default multibyte charset for a given computer. The ANSI charset (or code page) depends on the locale of the computer. For German computers it might be Windows-1252; for Japanese computers it may be Shift_JIS.

The ANSI charset is typically a 1-byte-per-character encoding, meaning that it cannot represent more than 256 characters and is restricted to the local language. In your case above, the EncryptStringENC method is (internally) trying to represent "Հայաստանի Հա" in the ANSI character encoding, and since these characters have no representation in that encoding, the method fails and returns null. (See the LastErrorText property for some clues about what happened.)
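You can check for yourself whether a given charset is able to represent a string. The sketch below is not what Chilkat does internally; it only illustrates the same limitation with the .NET encoders. The CharsetCheck class and CanEncode method are hypothetical names, and iso-8859-1 is used here as a stand-in for a typical Western ANSI code page.

```csharp
using System;
using System.Text;

public static class CharsetCheck
{
    // Hypothetical helper: returns true if every character of the text has
    // a byte representation in the named charset.
    public static bool CanEncode(string text, string charsetName)
    {
        // The exception fallback makes the encoder throw instead of
        // silently substituting '?' for characters it cannot represent.
        Encoding strict = Encoding.GetEncoding(
            charsetName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string armenian = "\u0540\u0561\u0575"; // the first three letters, Հայ

        Console.WriteLine(CanEncode(armenian, "iso-8859-1")); // False
        Console.WriteLine(CanEncode(armenian, "utf-8"));      // True
        Console.WriteLine(Encoding.UTF8.GetByteCount(armenian)); // 6 (two bytes per letter)
    }
}
```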

The solution is to set the Chilkat.Crypt2.Charset property to "utf-8". The EncryptStringENC method will then encrypt the utf-8 representation of the string, and utf-8 can produce a byte representation for any character in any language.
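Applied to the code in the question, the change would look roughly like the sketch below. It assumes the Chilkat .NET assembly is referenced and that crypt has already been unlocked and configured exactly as in the question. The Utf8Encryption class and EncryptUtf8 method are illustrative names, not part of Chilkat.

```csharp
using System;

public static class Utf8Encryption
{
    // Hypothetical wrapper around the question's final step. The crypt
    // object is assumed to be unlocked and set up (3des, ecb, key, etc.).
    public static string EncryptUtf8(Chilkat.Crypt2 crypt, string unsecuredString)
    {
        // Encrypt the utf-8 bytes of the string instead of the ANSI bytes.
        crypt.Charset = "utf-8";

        string encryptedString = crypt.EncryptStringENC(unsecuredString);
        if (encryptedString == null)
        {
            // LastErrorText describes why the call failed.
            Console.WriteLine(crypt.LastErrorText);
        }
        return encryptedString;
    }
}
```

Whatever code later decrypts the value should presumably use the same Charset setting, so that the decrypted bytes are interpreted as utf-8 when they are turned back into a string.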