Archived Forum Post


Question:

HTML to Plain Text removes all text after orphaned <

Aug 29 '13 at 09:30

Hello, I'm using HtmlToText to convert HTML to plain text. However, the component treats angle brackets used as mathematical symbols as if they were HTML. For example, "before less than < after less than" becomes "before less than". Changing the value of the SuppressLinks property had no effect.

I did turn on verbose logging, but it didn't give me anything I could use:

ToText:
    DllDate: Aug 15 2013
    ChilkatVersion: 9.4.1.42
    Username: IUSR
    Architecture: Little Endian; 32-bit
    Language: .NET 2.0
    VerboseLogging: 1
    decodeHtmlEntities: 1
    HtmlCodePage: 65001
    charset3: utf-8
    toXmlTime: Elapsed time: 0 millisec
    xmlToText:
        recursiveToText:
        (leaveContext)
    (leaveContext)
    toTextTime: Elapsed time: 16 millisec
    Success.
(leaveContext)

Any suggestions? Is this a known issue?

Thanks for your help.


Answer

When parsing HTML, the "<" character is interpreted as the opening character of an HTML tag. Therefore, when an unencoded "<" exists, such as in "before less than < after less than", the HTML parser thinks that the HTML tag is "<afterlessthan...."

As humans, we can look at it and immediately see that the "<" character in that case is a mistake: it should have been encoded as "&lt;". Programmatically, however, it is not so easy. There is no reliable way for a parser to encounter a "<" and decide NOT to interpret it as the start character of an HTML tag.
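
Since the parser cannot guess, the usual workaround is to make sure a literal "<" reaches it already encoded as "&lt;". The following is a minimal sketch in Python, not part of the Chilkat API. The function names escape_text_fragment and escape_orphan_lt are hypothetical, and the second function is only a heuristic that assumes a "<" not followed by a letter, "/", "!" or "?" cannot begin a tag.

```python
import html
import re

# Matches a "<" that is NOT followed by a letter, "/", "!" or "?",
# i.e. one that cannot begin a tag, closing tag, comment, or doctype.
# This is an assumed heuristic, not a rule taken from any HTML parser.
_ORPHAN_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_text_fragment(text: str) -> str:
    """Encode plain text BEFORE it is placed inside HTML markup.

    This is the reliable fix: the "<" never reaches the parser unencoded.
    """
    return html.escape(text, quote=False)


def escape_orphan_lt(markup: str) -> str:
    """Best-effort repair for HTML that already contains stray "<".

    Only "<" characters that cannot start a tag are encoded; real tags
    are left untouched.
    """
    return _ORPHAN_LT.sub("&lt;", markup)


if __name__ == "__main__":
    print(escape_text_fragment("before less than < after less than"))
    print(escape_orphan_lt("<p>before less than < after less than</p>"))
```

The heuristic only helps when the stray "<" is followed by whitespace, a digit, or punctuation, as in the example from the question. Input such as "a <b" still looks like the start of a tag, so encoding the text at the point where the HTML is generated remains the safer choice.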