Archived Forum Post

Question:

Difference between getSizeAnsi, getSizeUnicode, and getSizeUtf8 in CkString?

Mar 01 '13 at 10:30

What is the difference between getSizeAnsi, getSizeUnicode, and getSizeUtf8 in CkString? They all seem to return the size of the string in bytes, so why have three different functions that seem to return the same value? From what I've found, the documentation isn't very clear on the differences.


Answer

These methods return the size, in bytes, of the string when the characters are represented in each of the respective encodings: ANSI, Unicode (utf-16), and utf-8.

Consider this character: É

In the iso-8859-1 or Windows-1252 character encoding, it is represented by a single byte: 0xC9
getSizeAnsi would return the value 1.

In the utf-8 character encoding, it is represented by two bytes: 0xC3 0x89
getSizeUtf8 would return the value 2.

In the utf-16 character encoding, it is represented by two bytes: 0x00 0xC9 in big-endian byte order (0xC9 0x00 in little-endian)
getSizeUnicode would return the value 2.
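
The byte counts above can be checked without Chilkat. The following is a minimal sketch that uses Python's built-in codecs rather than the CkString API, and it assumes Windows-1252 ("cp1252") as the ANSI code page; the variable names are only for illustration.

```python
# Illustration only: Python's built-in codecs, not the Chilkat API.
ch = "\u00c9"  # the character É (U+00C9)

ansi_bytes = ch.encode("cp1252")      # assumption: Windows-1252 is the ANSI code page
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, no byte-order mark

for name, data in [("ANSI (cp1252)", ansi_bytes),
                   ("utf-8", utf8_bytes),
                   ("utf-16", utf16_bytes)]:
    # Show the byte count and each byte as hex.
    hex_bytes = " ".join(f"0x{b:02X}" for b in data)
    print(f"{name}: {len(data)} byte(s): {hex_bytes}")
```

The three lengths correspond to what getSizeAnsi, getSizeUtf8, and getSizeUnicode are described as returning for this one-character string.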

What is the ANSI Charset?

The ANSI charset is the default multibyte charset for a given computer. The ANSI charset (or code page) depends on the locale of the computer. For German computers it might be Windows-1252; for Japanese computers it may be Shift_JIS.
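
Because the ANSI code page depends on the locale, the same string can have a different ANSI size on different machines. Here is a rough sketch of that effect, again using Python's codecs instead of Chilkat. The helper `ansi_size` is hypothetical, and returning `None` for text that cannot be encoded is a choice made for this sketch, not a statement about how CkString behaves in that case.

```python
from typing import Optional


def ansi_size(text: str, code_page: str) -> Optional[int]:
    """Return the byte count of text in code_page, or None if it cannot be encoded.

    Hypothetical helper for illustration; not part of any Chilkat API.
    """
    try:
        return len(text.encode(code_page))
    except UnicodeEncodeError:
        return None


japanese = "\u65e5\u672c\u8a9e"  # three kanji characters

print(ansi_size("\u00c9", "cp1252"))     # É on a Windows-1252 machine
print(ansi_size(japanese, "shift_jis"))  # kanji on a Shift_JIS machine
print(ansi_size(japanese, "cp1252"))     # kanji do not exist in Windows-1252
```

Each kanji takes two bytes in Shift_JIS, so the three-character string occupies six bytes there, while Windows-1252 has no representation for it at all.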

What is a MultiByte Charset?

In Windows terminology, all charsets except Unicode (utf-16, which uses 2 bytes for most characters) are generally called "multibyte". This includes us-ascii. Some multibyte charsets represent every character in a single byte; others use a variable number of bytes per character. One example is utf-8, a multibyte encoding of Unicode that uses 1 to 4 bytes per character. (Google's search result pages use utf-8.)
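
To make the variable-length point concrete, the sketch below counts the bytes each of a few sample characters needs in utf-8 and utf-16. It uses Python's built-in codecs only; the sample characters and the `bytes_per_char` helper are chosen for illustration. Note that utf-16 is two bytes for most characters but four for characters outside the Basic Multilingual Plane, such as the emoji in the last row.

```python
# Illustration only: Python's built-in codecs, not the Chilkat API.
samples = {
    "A (U+0041)": "A",
    "E-acute (U+00C9)": "\u00c9",
    "Euro sign (U+20AC)": "\u20ac",
    "Emoji (U+1F600)": "\U0001F600",
}


def bytes_per_char(ch: str) -> dict:
    """Return the encoded size of one character in utf-8 and utf-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # little-endian, no byte-order mark
    }


for label, ch in samples.items():
    sizes = bytes_per_char(ch)
    print(f"{label}: utf-8 = {sizes['utf-8']}, utf-16 = {sizes['utf-16']}")
```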