Is it possible to check whether a text file (specifically an XML file) actually contains utf-8 characters? Sometimes we have XML files that indicate "utf-8" in the XML declaration (i.e. the first line of the XML file), but the contents of the XML are actually ANSI (1-byte/char) encoded.

Pseudo code:

if (someChilkatLib.Encode(data) == 'ANSI')
   someChilkatLib.convertTo(data, 'utf-8')
// the file (data) is now in the correct format!

asked Feb 27 '14 at 12:01

chilkat ♦♦

edited Feb 27 '14 at 12:33

Here's a sample in C++:

#include <CkByteData.h>
#include <CkString.h>

// This is NOT a perfect solution...
bool isItReallyUtf8(const char *path)
{
    // Load the file with no interpretation of bytes.
    CkByteData data;
    if (!data.loadFile(path))
        return false;   // The file could not be read.

    // Does this have the utf-8 preamble?  Some utf-8 files may, some may not.
    if (data.getSize() >= 3)
    {
        const unsigned char *p = data.getData();
        if ((*p == 0xEF) && (*(p+1) == 0xBB) && (*(p+2) == 0xBF))
        {
            // Yes, this seems to be utf-8.
            return true;
        }
    }

    // Load the file, telling the CkString object to interpret the bytes as utf-8 encoded chars.
    CkString str1;
    str1.loadFile(path, "utf-8");

    // Round-trip from widechar unicode back to utf-8.
    CkString str2;
    str2.appendU(str1.getUnicode());

    if (!str2.equalsStr(str1))
        return false;   // The file was not utf-8.
    return true;        // The file was utf-8.
}
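
For comparison, here is a minimal sketch of a stricter byte-level check that does not use Chilkat at all. It assumes the file contents have already been read into a std::string, and the function name isWellFormedUtf8 is made up for this example. It walks the bytes and rejects anything that is not well-formed utf-8 (bad lead or continuation bytes, truncated sequences, overlong forms, surrogates). Note that pure 7-bit ASCII passes too, since ASCII is valid utf-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Chilkat): returns true if `bytes`
// consists entirely of well-formed utf-8 sequences.
bool isWellFormedUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len;       // total length of this sequence in bytes
        unsigned long cp;      // code point decoded so far
        unsigned long minCp;   // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }
        else return false;     // stray continuation byte or invalid lead byte

        if (i + len > n) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

A typical ANSI file with accented characters fails this check quickly, because a lone byte such as 0xE9 ("é" in Latin-1) is a utf-8 lead byte that must be followed by continuation bytes.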

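
Once a file is known not to be utf-8, the conversion step from the question could look roughly like the sketch below. This is only an illustration under a strong assumption: it treats the input as ISO-8859-1 (Latin-1), where every byte maps directly to the Unicode code point of the same value. "ANSI" on Windows usually means a code page such as Windows-1252, which differs from Latin-1 in the 0x80-0x9F range, so real data may need a proper code page table. The function name latin1ToUtf8 is made up for this example and is not a Chilkat API.

```cpp
#include <string>

// Hypothetical helper (not part of Chilkat): re-encodes bytes that are
// ASSUMED to be ISO-8859-1 as utf-8.
std::string latin1ToUtf8(const std::string &latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII is identical in Latin-1 and utf-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF become two bytes: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

After the conversion, the XML declaration's "utf-8" matches the actual bytes in the file.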

answered Feb 27 '14 at 12:33

chilkat ♦♦
