
I have a test string, "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ", that I want to send with the SendString function. I can send it correctly from C# to PHP, so why can't I send it from PHP to PHP? I have set up two machines with the same configuration (one is a clone of the other). It works almost perfectly: the only problem is that the 'Ł' is lost every time.

What could be wrong? In PHP a string is not an object; might that be the cause? Do you have any clues about what I should do?

asked Sep 19 at 09:00


chilkat ♦♦


First one must understand this: http://php.net/manual/en/language.types.string.php#language.types.string.details

Specifically, that

The string in PHP is implemented as an array of bytes and an integer indicating the length of the buffer. It has no information about how those bytes translate to characters, leaving that task to the programmer. There are no limitations on the values the string can be composed of; in particular, bytes with value 0 (“NUL bytes”) are allowed anywhere in the string (however, a few functions, said in this manual not to be “binary safe”, may hand off the strings to libraries that ignore data after a NUL byte.)

This nature of the string type explains why there is no separate “byte” type in PHP – strings take this role. Functions that return no textual data – for instance, arbitrary data read from a network socket – will still return strings.

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. ...
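The three representations of "á" quoted above can be reproduced directly. This is a minimal sketch in Python rather than PHP, chosen only because Python keeps text and bytes as separate types, so each conversion is visible. The variable names are illustrative.

```python
import unicodedata

# "á" as a single precomposed code point (U+00E1).
a_acute = "\u00e1"

# ISO-8859-1: one byte per character.
latin1_bytes = a_acute.encode("iso-8859-1")

# UTF-8 of the composed form (normalization form C).
utf8_nfc_bytes = unicodedata.normalize("NFC", a_acute).encode("utf-8")

# UTF-8 of the decomposed form (normalization form D): "a" + combining acute.
utf8_nfd_bytes = unicodedata.normalize("NFD", a_acute).encode("utf-8")


def hex_bytes(data):
    """Format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


for label, data in [
    ("ISO-8859-1", latin1_bytes),
    ("UTF-8 (NFC)", utf8_nfc_bytes),
    ("UTF-8 (NFD)", utf8_nfd_bytes),
]:
    print(f"{label:<12} {hex_bytes(data)}")
```

All three byte sequences display as the same character, which is why the receiving side has to be told which encoding to assume.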

There are two hurdles that need to be cleared in order to get things right:

(1) Because a PHP string is just an array of bytes, Chilkat must know how to interpret the bytes it is passed. Are they utf-8 bytes, such that an "á" is represented by "\xC3\xA1"? Or are they ANSI bytes, where the ANSI character encoding is defined by the locale of the machine (likely iso-8859-2 or Windows-1250 if the computer is in Poland), in which case the "á" is represented by "\xE1"?

For programming languages where strings are byte arrays, Chilkat provides a "Utf8" property that defaults to false/0. The Utf8 property defines how Chilkat is going to interpret the bytes of passed-in string arguments -- as either ANSI bytes or utf-8 bytes. This must be set correctly.
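To see why this setting matters, here is a small sketch of what happens when the same bytes are read under each assumption. It uses Python's built-in codecs, not Chilkat, so it only models the idea behind the Utf8 property; iso-8859-2 stands in for "the ANSI encoding" as an assumption.

```python
# The same two bytes, read under two different assumptions about their encoding.
raw = b"\xc3\xa1"

as_utf8 = raw.decode("utf-8")         # interpreted as utf-8: one character
as_latin2 = raw.decode("iso-8859-2")  # interpreted as iso-8859-2: two characters


def code_points(text):
    """List the Unicode code points of text, e.g. 'U+00E1'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print("utf-8      ->", code_points(as_utf8))
print("iso-8859-2 ->", code_points(as_latin2))
```

Read as utf-8, the bytes give the single character "á". Read as iso-8859-2, they give two unrelated characters, so a wrong setting corrupts the string before anything is sent.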

In your particular case, given that only the 'Ł' is lost, it must be that the string is passed in correctly, and the problem occurs in (2) described below.

(2) The second hurdle is that Chilkat must know exactly which bytes to send. Will it send the utf-8 representation of the string, the ANSI representation, or something else (perhaps utf-16, utf-32, or some arcane charset that is seldom used)? The way to control this is to set the Socket.StringCharset property. This is likely the problem: your program passed the string to Chilkat correctly, and now Chilkat must convert it to the actual bytes that will be sent over the socket. StringCharset selects that byte representation. If StringCharset is set to a charset (encoding) in which the "Ł" character has no byte representation, the character is lost. For example, you cannot send "Ł" if StringCharset = "iso-8859-1", because that charset uses 1 byte per char and no byte value represents "Ł". To send "Ł", StringCharset must be a charset that includes "Ł": any Unicode encoding (utf-8, utf-16, etc.) or one of the single-byte encodings for the region (iso-8859-2, Windows-1250, etc.).
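Which charsets can represent "Ł" is easy to check. The sketch below uses Python's codecs as a stand-in for the conversion Chilkat performs; the helper name `try_encode` is hypothetical and not part of any Chilkat API.

```python
def try_encode(text, charset):
    """Return the bytes of text in charset, or None if it cannot be represented."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


L_STROKE = "\u0141"  # "Ł"

results = {
    charset: try_encode(L_STROKE, charset)
    for charset in ("iso-8859-1", "iso-8859-2", "cp1250", "utf-8", "utf-16-le")
}

for charset, data in results.items():
    shown = "not representable" if data is None else data.hex()
    print(f"{charset:<11} {shown}")
```

Only iso-8859-1 fails here. The two regional single-byte charsets and the Unicode encodings all have a byte sequence for "Ł".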


answered Sep 19 at 09:30


chilkat ♦♦

edited Sep 20 at 07:51

What a great answer, thank you very much.

Before asking the question, I had been testing various combinations of put_StringCharset and put_Utf8 for both the client and the server socket. After reading your response, I decided to record every test and its result, just to be sure I hadn't omitted anything.

It might surprise you, but the case where only "Ł" was lost occurred with put_Utf8(false).

In my case, the solution is put_Utf8(true) and put_StringCharset('utf-8') on both the client and the server.
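The principle behind this fix is that both ends must agree on the wire encoding. Here is a rough sketch of that idea using plain Python sockets, not Chilkat; the function `roundtrip` and its parameters are hypothetical names for illustration.

```python
import socket

TEST = "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"


def roundtrip(text, send_charset, recv_charset):
    """Send text over a local socket pair and decode it on the other end."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(text.encode(send_charset))
        sender.shutdown(socket.SHUT_WR)  # signal end of data to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        sender.close()
        receiver.close()
    return b"".join(chunks).decode(recv_charset)


# Both ends use utf-8: the string survives intact.
print(roundtrip(TEST, "utf-8", "utf-8") == TEST)

# The ends disagree: the bytes arrive, but decode to different text.
print(roundtrip(TEST, "utf-8", "iso-8859-2") == TEST)
```

The bytes travel unchanged in both cases. Only the interpretation at the receiving end differs, which is why the charset has to be set consistently on client and server.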


answered Sep 20 at 02:59


swister

Thanks! There's one more thing to know, and it may explain what happened in your case. If there is a literal string in your source file (i.e. a literal quoted string such as "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"), then it makes a difference how that source file is saved. For example, if you are using an IDE or text editor that saves source files in utf-8, then the bytes of those chars are saved in the utf-8 representation. When PHP is interpreting the source file, the string is composed of the bytes found in the source file -- thus the need to set the Chilkat Utf8 property to true/false depends on how the PHP source file was saved. (This applies to the source files for any programming language where strings are simply byte arrays. In other programming languages, it may be that the compiler/interpreter expects the source to be utf-8, and it would always be a mistake to save the source in the ANSI encoding.)
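One way a utf-8 source file could lose exactly one character is sketched below. It rests on two assumptions the thread does not confirm: that the PHP file was saved as utf-8, and that the bytes were then treated as Windows-1252 (a common ANSI code page). It also relies on Python's cp1252 table, which leaves byte 0x81 undefined; other conversion layers may handle that byte differently.

```python
TEST = "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"


def undecodable(text, source_charset, assumed_charset):
    """Return the characters whose source_charset bytes cannot be
    decoded when they are misread as assumed_charset."""
    lost = []
    for ch in text:
        try:
            ch.encode(source_charset).decode(assumed_charset)
        except UnicodeDecodeError:
            lost.append(ch)
    return lost


# utf-8 bytes from the source file, misread as Windows-1252 (assumption).
lost_cp1252 = undecodable(TEST, "utf-8", "cp1252")
print(lost_cp1252)
```

Under these assumptions only "Ł" is affected: its utf-8 bytes are C5 81, and 0x81 has no assigned character in this table. Every other character in the test string maps to some character, however wrong, and can be mapped back.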


answered Sep 20 at 08:14


chilkat ♦♦

edited Sep 20 at 08:15


Asked: Sep 19 at 09:00

Seen: 178 times

Last updated: Sep 20 at 08:15
