Archived Forum Post


Question:

Possible to Determine the Charset Used in a Text File?

May 22 '13 at 11:07

Does Chilkat have any methods to automatically determine the charset for a text file?


Answer

There is no way to determine the character encoding of a file with 100% accuracy. Some programs use heuristics to guess the likely charset. It is also possible to check for a utf-8 or utf-16 preamble (byte order mark) at the start of the file and, if one is found, assume that those bytes really are a preamble and that the file is therefore utf-8 or utf-16.
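A minimal sketch of the preamble check, in Python rather than with any Chilkat API. The function name sniff_preamble is hypothetical; the BOM constants come from Python's standard codecs module. Note that a match is only an assumption: a file in a one-byte charset could legitimately begin with the same bytes (in iso-8859-1, the bytes FF FE are "ÿþ").

```python
import codecs
from typing import Optional

# Preambles (byte order marks) for the encodings mentioned above.
_PREAMBLES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_preamble(data: bytes) -> Optional[str]:
    """Return the charset suggested by a leading preamble, or None.

    Hypothetical helper: a None result says nothing about the charset,
    and a non-None result is still only a (usually good) guess.
    """
    for bom, name in _PREAMBLES:
        if data.startswith(bom):
            return name
    return None


print(sniff_preamble(b"\xef\xbb\xbfhello"))  # file starts with a utf-8 preamble
print(sniff_preamble(b"hello"))              # no preamble found
```

Most text files carry no preamble at all, so this check leaves the question open far more often than it settles it.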

To give you an idea of why it is impossible, imagine these two cases:

Case 1: A file contains a single byte representing a character. For example, imagine the byte is 0xE1. This is a valid representation of a single character in many charsets: in iso-8859-1 it is "á", but in iso-8859-7 it is "α". From the byte alone there is no way to know which character it represents. In a longer file the surrounding context could be used, but that would involve more sophisticated heuristics and still wouldn't be 100% accurate.
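Case 1 can be reproduced in a few lines of Python, using the standard codec names for the two charsets (the variable names are only illustrative). The same byte decodes without error under both charsets and yields a different character each time:

```python
# The single byte from Case 1.
data = bytes([0xE1])

# Decode the same byte under two different one-byte charsets.
as_latin1 = data.decode("iso-8859-1")  # Latin small letter a with acute
as_greek = data.decode("iso-8859-7")   # Greek small letter alpha

print(as_latin1, as_greek)
```

Neither call raises an error, so nothing in the data itself says which decoding was intended.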

Case 2: Have a look at the code page layout for the Windows-1256 charset (http://en.wikipedia.org/wiki/Windows-1256). This is a one-byte-per-character encoding. You'll see that every possible byte value (0-255) represents some character. Therefore all data, no matter what it contains, is valid Windows-1256. It may not make sense, and if presented as Arabic characters to an Arabic reader it would obviously be gibberish, but technically each byte represents a character, and there is no way, short of more advanced heuristics, to know whether the byte data is or is not Windows-1256.
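A short Python sketch of Case 2, assuming Python's built-in cp1256 codec follows the same layout as the code page linked above. The helper is_valid is hypothetical. It shows that "does it decode without error?" cannot rule out Windows-1256, although the same test can rule out a stricter encoding such as utf-8:

```python
def is_valid(data: bytes, charset: str) -> bool:
    """Hypothetical helper: True if data decodes without error in charset."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Every possible byte value, 0 through 255, in one buffer.
all_bytes = bytes(range(256))

# "cp1256" is Python's codec name for Windows-1256.
text = all_bytes.decode("cp1256")

print(len(text))                      # one character per input byte
print(is_valid(all_bytes, "cp1256"))  # decoding never fails
print(is_valid(all_bytes, "utf-8"))   # utf-8 rejects many byte sequences
```

A successful decode therefore proves nothing for a charset like this one. Only a failed decode carries information, and only for encodings that have invalid byte sequences.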

Chilkat does not provide any method that attempts to determine the charset, because doing so would open a never-ending can of worms. It would never be perfect: there would always be some data that is not correctly identified, and some data for which no conclusion can be drawn at all.