Archived Forum Post


Question:

Spider not working

Mar 04 '14 at 04:09

Hi, the Chilkat Spider is no longer working with Google. Here is my code:

    Dim spider As New Chilkat.Spider
    spider.Initialize("www.google.com")
    spider.AddUnspidered("http://www.google.com/search?hl=en&q=car")

    ' Crawl the next URL in the unspidered list
    Dim success As Boolean = spider.CrawlNext()
    If success Then
        ' URLs discovered but not yet crawled
        For ii = 0 To spider.NumUnspidered - 1
            TextBox1.Text = TextBox1.Text & vbCrLf & spider.GetUnspideredUrl(ii)
        Next
        ' Links pointing outside the spidered domain
        For ii = 0 To spider.NumOutboundLinks - 1
            TextBox1.Text = TextBox1.Text & vbCrLf & spider.GetOutboundLink(ii)
        Next
        ' URLs already crawled
        For ii = 0 To spider.NumSpidered - 1
            TextBox1.Text = TextBox1.Text & vbCrLf & spider.GetSpideredUrl(ii)
        Next
    End If

Please help on this.


Answer

Very interesting! You coded your library so that developers cannot violate Google's Terms of Service? That should not be your problem. Will you also restrict the key length of your encryption routines because China, North Korea, and other countries don't allow strong encryption? Time to reconsider your preemptive obedience.

If there are technical reasons for the restriction, that is absolutely fine.


Answer

Make sure to examine the LastErrorText property; see http://www.cknotes.com/?p=423
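As a sketch (assuming a WinForms TextBox named TextBox1, as in the question's code), the diagnostics for a failed crawl step could be surfaced like this:

```vbnet
' Sketch: surface the diagnostic trace when a crawl step fails.
Dim spider As New Chilkat.Spider
spider.Initialize("www.google.com")
spider.AddUnspidered("http://www.google.com/search?hl=en&q=car")

Dim success As Boolean = spider.CrawlNext()
If Not success Then
    ' LastErrorText holds the full trace of the most recent method call.
    TextBox1.Text = spider.LastErrorText
End If
```

LastErrorText is populated after every Chilkat method call, whether it succeeds or fails, so it can also be logged on success when debugging.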


Answer

Hi, this is the LastErrorText:

    ChilkatLog:
      CrawlNext:
        DllDate: Aug  5 2012
        robotsUrl: http://www.google.com/robots.txt
        HttpGet:
          QuickReq:
            url: http://www.google.com/robots.txt
            QuickGetToOutput_OnExisting:
              qGet_1:
                simpleHttpRequest_3:
                  httpMethod: GET
                  requestUrl: http://www.google.com/robots.txt
                  Connecting to web server...
                  httpServer: www.google.com
                  port: 80
                  ConnectTimeoutMs_1: 10000
                  calling ConnectSocket2
                  IPV6 enabled connect with NO heartbeat.
                  connectingTo: www.google.com
                  dnsCacheLookup: www.google.com
                  Resolving domain name (IPV4)
                  GetHostByNameHB_ipv4: Elapsed time: 156 millisec
                  myIP_1: 58.97.167.75
                  myPort_1: 4316
                  connect successful (1)
                  connectElapsedMs: 500
                  -- BuildFireFoxGetRequest --
                  Not auto-adding cookies.
                  sendElapsedMs: 0
                  StatusCode: 200
                  StatusText: OK
                  Reading response body...
                  readResponseElapsedMs: 156
                  CompressedSize: 1781
          responseSize: 6870
          responseContentType: text/plain
          This is a text response...
          No charset specified, assuming Windows-1252
          Converting to utf-8 charset.
          ConvertFromCodePage: 1252
          Success.
        Fetched robots.txt
        url: http://www.google.com/search?hl=en&q=car
        HttpGet:
          QuickReq:
            url: http://www.google.com/search?hl=en&q=car
            QuickGetToOutput_OnExisting:
              qGet_1:
                simpleHttpRequest_3:
                  httpMethod: GET
                  requestUrl: http://www.google.com/search?hl=en&q=car
                  Using existing connection to web server...
                  -- BuildFireFoxGetRequest --
                  Not auto-adding cookies.
                  sendElapsedMs: 0
                  StatusCode: 302
                  StatusText: Found
                  Reading response body...
                  readResponseElapsedMs: 141
                  statusCode: 302
                  Not updating cache because status code != 200
                  redirectUrl: http://www.google.com/webhp?hl=en
                  newUrlLocation:
                    url: http://www.google.com/search?hl=en&q=car
                    location: http://www.google.com/webhp?hl=en
                    newUrlFinal: http://www.google.com/webhp?hl=en
                  Using existing connection for redirect...
                  RedirectGet:
                    QuickGetToOutput_Redirect:
                      newUrl: http://www.google.com/webhp?hl=en
                      qGet_1:
                        simpleHttpRequest_3:
                          httpMethod: GET
                          requestUrl: http://www.google.com/webhp?hl=en
                          Using existing connection to web server...
                          -- BuildFireFoxGetRequest --
                          Not auto-adding cookies.
                          sendElapsedMs: 0
                          StatusCode: 200
                          StatusText: OK
                          Reading response body...
                          Reading chunked response
                          readResponseElapsedMs: 578
                          CompressedSize: 28057
                          UrlToCache: http://www.google.com/webhp?hl=en
                          NewExpireTime: Fri, 28 Feb 2013 03:11:02 +0600
                          Etag:
                          saveToCache:
                            No cache roots have been set.  Need to call AddRoot at least once.
                          Cache not updated.
          responseSize: 96379
          responseContentType: text/html; charset=UTF-8
          This is a text response...
          ResponseHdrCharset: UTF-8
          Success.

Answer

Chilkat Spider does not allow Google pages to be spidered, because doing so would be against Google's Terms of Service. Chilkat specifically checks for Google URLs and does not follow them. I'm very sorry.


Answer

Hello Chilkat,

I have become a member and stumbled upon this post.

Regarding your response above, would it not be more practical for users of this module to receive some kind of informational message, such as "the URL to be spidered is prohibited by the hosting service's terms of service"? Especially given that you are already diligently checking the URLs to be spidered.

I just think it would save some development time, especially for those who are not aware of the restriction.
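Until the library reports this on its own, a caller-side check is one possible workaround. This is only a sketch: the list of restricted hosts below is my assumption, not an official list published by Chilkat.

```vbnet
Imports System

Module SpiderGuard
    ' Hosts assumed to be refused by the spider (illustrative only;
    ' Chilkat does not publish an official list).
    Private ReadOnly RestrictedHosts As String() = {"google.com", "www.google.com"}

    ' Returns True when the URL's host matches a known-restricted host,
    ' so the caller can warn the user instead of silently getting no results.
    Public Function IsRestrictedHost(ByVal url As String) As Boolean
        Dim host As String = New Uri(url).Host.ToLowerInvariant()
        Return Array.IndexOf(RestrictedHosts, host) >= 0
    End Function
End Module
```

The code from the question could then call IsRestrictedHost(url) before AddUnspidered and display a message box explaining why the URL was skipped.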