
Does Chilkat have any methods to automatically determine the charset for a text file?

asked May 22 '13 at 10:54


chilkat ♦♦

There is no way to determine the character encoding of a file with 100% accuracy. Some programs use heuristics to guess the likely charset. It is also possible to check for a utf-8 or utf-16 preamble (byte order mark) at the start of the file and, if one is found, assume that those bytes really are a preamble and that the file is therefore utf-8 or utf-16.
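As a rough illustration of the preamble check, here is a minimal Python sketch. It is not a Chilkat method; the function name `sniff_bom` is made up for this example, and it only looks at the utf-8 and utf-16 preambles mentioned above.

```python
import codecs

def sniff_bom(data: bytes):
    """Return a charset name guessed from a leading preamble, or None.

    This is only an assumption about the data: the same leading bytes
    are also valid text in many one-byte-per-char charsets.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    return None  # no preamble found; the charset is still unknown
```

A `None` result says nothing about the charset, and a non-`None` result is still a guess: the bytes FF FE could just as well be two characters in a one-byte-per-char charset.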

To give you an idea of why it is impossible, imagine these two cases:

Case 1: A file contains a single byte representing a character. For example, imagine the byte is 0xE1. This is a valid byte representation for a single character in many charsets. For example, in iso-8859-1 it is "á", but in iso-8859-7 it is "α". There is obviously no way to know which character this byte represents. Perhaps context could be used, but this would involve more sophisticated heuristics, and wouldn't be 100% accurate.
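Case 1 can be reproduced in a few lines of Python using the standard codecs (the variable names are only for illustration). The same byte decodes without error under both charsets and yields a different character each time:

```python
# The single byte from Case 1.
raw = bytes([0xE1])

# Both decodes succeed; nothing in the byte itself says which one is "right".
as_latin1 = raw.decode("iso-8859-1")   # LATIN SMALL LETTER A WITH ACUTE
as_greek = raw.decode("iso-8859-7")    # GREEK SMALL LETTER ALPHA

print(repr(as_latin1), repr(as_greek))
```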

Case 2: Have a look at the code page layout for the Windows-1256 charset: http://en.wikipedia.org/wiki/Windows-1256 This is a one-byte-per-char encoding, and every possible byte value (0-255) represents some character. Therefore all data, no matter what it contains, is valid Windows-1256. It may not make sense, and if presented as Arabic chars to an Arabic reader it would obviously be gibberish, but technically each byte represents a character, and there is no way, short of more advanced heuristics, to know whether the byte data is or is not Windows-1256.
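Case 2 can be checked with a small Python sketch. It assumes Python's built-in `cp1256` codec as the Windows-1256 implementation, and the helper name `decodes_as` is made up for this example. A "does it decode?" test can sometimes rule out utf-8, but it can never rule out Windows-1256:

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if data decodes under charset without an error."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False

# Every possible byte value, 0 through 255, in one buffer.
all_bytes = bytes(range(256))

print(decodes_as(all_bytes, "cp1256"))  # Windows-1256 accepts any byte
print(decodes_as(all_bytes, "utf-8"))   # utf-8 rejects invalid sequences
```

So a successful decode tells you only that the data is technically valid in that charset, not that the charset is the one the author used.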

Chilkat does not provide any method to attempt to determine the charset because it would be the equivalent of opening a never-ending can-of-worms. It would never be perfect, and there would always be some data that does not get correctly identified, or some data where it is impossible to make any conclusion at all.


answered May 22 '13 at 11:07


chilkat ♦♦


Asked: May 22 '13 at 10:54

Seen: 1,484 times

Last updated: May 22 '13 at 11:07
