Archived Forum Post


Question:

extracting plain text from web page

Mar 22 '13 at 20:02

I’m currently using your VC++ Version 6 library. What I want is to be able to convert a web page into a plain text file.

Is it possible to download the HTML file with the http.Download("http://www.mydomain/xyz.html") call?

I noticed that you have a CkHtmlToText call I can make. How would you suggest feeding it a complete web page? Would I download the HTML file, then read it in as one big string? Then feed the big string to h2t.toText(bigstring)?

It would be ideal if there were a way for me to pass in a URL instead of a big string.

Let me know what calls you would suggest that I look at.

Thank you!


Answer

The Download method sends an HTTP GET request to fetch the content, whatever it may be, at a specified URL and saves it to a file. The content at the URL could be anything -- a JPG, a .zip archive, an HTML page, a Perl script that emits something, a dynamic web page using ASP, JSP, etc. Therefore, downloading HTML is no different than downloading any other type of content. The bytes sent in the body of the HTTP response are what get saved to the output file.
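
For example, a minimal sketch of downloading a page to disk with the Chilkat C++ CkHttp class (the URL and output filename are placeholders, and UnlockComponent / error handling are reduced to the bare minimum):

    // Minimal sketch: save whatever the URL serves to a local file.
    #include <stdio.h>
    #include <CkHttp.h>

    int main(void)
    {
        CkHttp http;

        // UnlockComponent may be required for a licensed build; omitted here.

        // Fetch the response body and write it to xyz.html.
        bool success = http.Download("http://www.mydomain/xyz.html", "xyz.html");
        if (!success) {
            printf("%s\n", http.lastErrorText());
            return 1;
        }

        printf("Saved xyz.html\n");
        return 0;
    }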

To "download" directly into a byte array, call http.QuickGet.

To "download" directly into a string variable, call http.QuickGetStr.

(Of course, it would only make sense to download text using QuickGetStr.)
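
For the original question -- getting the plain text of a web page -- a minimal sketch would fetch the HTML into a string and pass it to CkHtmlToText. The URL below is a placeholder, and this assumes the lowercase Chilkat C++ variants (quickGetStr, toText, lastErrorText) that return a const char* (0 on failure):

    // Minimal sketch: fetch the HTML into a string, then convert it to plain text.
    #include <stdio.h>
    #include <string>
    #include <CkHttp.h>
    #include <CkHtmlToText.h>

    int main(void)
    {
        CkHttp http;

        // Download the page body into memory as one big string.
        const char *html = http.quickGetStr("http://www.mydomain/xyz.html");
        if (html == 0) {
            printf("%s\n", http.lastErrorText());
            return 1;
        }

        // Copy it before making further Chilkat calls, since the returned
        // pointer references memory owned by the http object.
        std::string bigstring(html);

        // Convert the HTML to plain text.
        CkHtmlToText h2t;
        const char *text = h2t.toText(bigstring.c_str());
        if (text == 0) {
            printf("%s\n", h2t.lastErrorText());
            return 1;
        }

        printf("%s\n", text);
        return 0;
    }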


Answer

Thanks! I tried using the HtmlToText function, but the result includes items like href URLs instead of the alt text. I was hoping it would return the same thing as if you copied the whole web page to the clipboard and pasted it into a Notepad file. Do you have any suggestions for getting the equivalent text, as if you had copied and pasted the whole web page into Notepad?