
Hi, Chilkat Spider is no longer working with Google. Here is my code:

    Dim spider As New Chilkat.Spider
    spider.Initialize("www.google.com")
    spider.AddUnspidered("http://www.google.com/search?hl=en&q=car")

    ' Crawl the next unspidered URL and list what the spider collected.
    Dim success As Boolean = spider.CrawlNext()
    If success Then
        ' URLs still waiting to be crawled
        For ii As Integer = 0 To spider.NumUnspidered - 1
            TextBox1.Text = TextBox1.Text & vbCrLf & spider.GetUnspideredUrl(ii)
        Next
        ' Links that point outside the initialized domain
        For ii As Integer = 0 To spider.NumOutboundLinks - 1
            TextBox1.Text = TextBox1.Text & vbCrLf & spider.GetOutboundLink(ii)
        Next
        ' URLs that have already been crawled
        For ii As Integer = 0 To spider.NumSpidered - 1
            TextBox1.Text = TextBox1.Text & vbCrLf & spider.GetSpideredUrl(ii)
        Next
    End If

Please help with this.

asked Feb 20 '13 at 15:39

Shaoun1000


Very interesting! You coded your library so that developers cannot violate Google's Terms of Service? This should not be your problem. Or will you restrict the key length for encryption routines because China, North Korea and other countries don't allow it? Time to reconsider your preemptive obedience.

If there are technical reasons, it is absolutely OK.


answered Feb 23 '13 at 07:00

Istvan_Szabo

I think Chilkat did a great job. Google is restricting that, which is funny.

(Feb 23 '13 at 12:41) Shaoun1000

Make sure to show the LastErrorText; see http://www.cknotes.com/?p=423
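As a minimal sketch of that advice, the question's code could display `LastErrorText` right after the call to `CrawlNext`. This assumes the Chilkat .NET assembly is referenced and that `TextBox1` is the multiline text box from the question; it is an illustration, not an official Chilkat sample.

```vbnet
' Sketch only: assumes the Chilkat .NET assembly is referenced and that
' TextBox1 is the multiline TextBox used in the question.
Dim spider As New Chilkat.Spider
spider.Initialize("www.google.com")
spider.AddUnspidered("http://www.google.com/search?hl=en&q=car")

Dim success As Boolean = spider.CrawlNext()

' LastErrorText describes the most recent method call and can contain
' useful detail even when that call reports success, so show it either way.
TextBox1.Text = "CrawlNext returned " & success.ToString() & vbCrLf & spider.LastErrorText
```

Posting this text along with the question usually lets others see which request was made and how the server responded.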


answered Feb 20 '13 at 15:42

Gert ♦

Hi, this is the LastErrorHtml output:

ChilkatLog:<blockquote>
CrawlNext:<blockquote>
DllDate: Aug  5 2012<br>
robotsUrl: http://www.google.com/robots.txt<br>
HttpGet:<blockquote>
QuickReq:<blockquote>
url: http://www.google.com/robots.txt<br>
QuickGetToOutput_OnExisting:<blockquote>
qGet_1:<blockquote>
simpleHttpRequest_3:<blockquote>
httpMethod: GET<br>
requestUrl: http://www.google.com/robots.txt<br>
Connecting to web server...<br>
httpServer: www.google.com<br>
port: 80<br>
ConnectTimeoutMs_1: 10000<br>
calling ConnectSocket2<br>
IPV6 enabled connect with NO heartbeat.<br>
connectingTo: www.google.com<br>
dnsCacheLookup: www.google.com<br>
Resolving domain name (IPV4)<br>
GetHostByNameHB_ipv4: Elapsed time: 156 millisec<br>
myIP_1: 58.97.167.75<br>
myPort_1: 4316<br>
connect successful (1)<br>
connectElapsedMs: 500<br>
-- BuildFireFoxGetRequest --<br>
Not auto-adding cookies.<br>
sendElapsedMs: 0<br>
StatusCode: 200<br>
StatusText: OK<br>
Reading response body...<br>
readResponseElapsedMs: 156<br>
CompressedSize: 1781<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
responseSize: 6870<br>
responseContentType: text/plain<br>
This is a text response...<br>
No charset specified, assuming Windows-1252<br>
Converting to utf-8 charset.<br>
ConvertFromCodePage: 1252<br>
Success.<br>
</blockquote>
Fetched robots.txt<br>
url: http://www.google.com/search?hl=en&amp;q=car<br>
HttpGet:<blockquote>
QuickReq:<blockquote>
url: http://www.google.com/search?hl=en&amp;q=car<br>
QuickGetToOutput_OnExisting:<blockquote>
qGet_1:<blockquote>
simpleHttpRequest_3:<blockquote>
httpMethod: GET<br>
requestUrl: http://www.google.com/search?hl=en&amp;q=car<br>
Using existing connection to web server...<br>
-- BuildFireFoxGetRequest --<br>
Not auto-adding cookies.<br>
sendElapsedMs: 0<br>
StatusCode: 302<br>
StatusText: Found<br>
Reading response body...<br>
readResponseElapsedMs: 141<br>
statusCode: 302<br>
Not updating cache because status code != 200<br>
redirectUrl: http://www.google.com/webhp?hl=en<br>
newUrlLocation:<blockquote>
url: http://www.google.com/search?hl=en&amp;q=car<br>
location: http://www.google.com/webhp?hl=en<br>
newUrlFinal: http://www.google.com/webhp?hl=en<br>
</blockquote>
Using existing connection for redirect...<br>
RedirectGet:<blockquote>
QuickGetToOutput_Redirect:<blockquote>
newUrl: http://www.google.com/webhp?hl=en<br>
qGet_1:<blockquote>
simpleHttpRequest_3:<blockquote>
httpMethod: GET<br>
requestUrl: http://www.google.com/webhp?hl=en<br>
Using existing connection to web server...<br>
-- BuildFireFoxGetRequest --<br>
Not auto-adding cookies.<br>
sendElapsedMs: 0<br>
StatusCode: 200<br>
StatusText: OK<br>
Reading response body...<br>
Reading chunked response<br>
readResponseElapsedMs: 578<br>
CompressedSize: 28057<br>
UrlToCache: http://www.google.com/webhp?hl=en<br>
NewExpireTime: Fri, 28 Feb 2013 03:11:02 +0600<br>
Etag: <br>
saveToCache:<blockquote>
No cache roots have been set.  Need to call AddRoot at least once.<br>
</blockquote>
Cache not updated.<br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
responseSize: 96379<br>
responseContentType: text/html; charset=UTF-8<br>
This is a text response...<br>
ResponseHdrCharset: UTF-8<br>
Success.<br>
</blockquote>
</blockquote>
</blockquote>

answered Feb 20 '13 at 16:13

Shaoun1000

Chilkat Spider does not allow for Google pages to be spidered. It would be against Google's Terms of Service. Chilkat specifically checks for Google URLs and does not follow them. I'm very sorry.


answered Feb 20 '13 at 21:07

chilkat ♦♦

Hello Chilkat,

I have become a member and stumbled upon this post.

With regard to your response above, would it not be more practical for users of this module to receive an informational message such as "the URL to be spidered is prohibited by the hosting service's terms of service", especially given that you are already diligently checking the URLs to be spidered?

I just think it would save some development time, especially for those who are not aware of the restriction.
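Until the library reports something like this itself, a caller-side check is one possible workaround. The sketch below is purely illustrative: `DescribeBlockedUrl` is a hypothetical helper, not part of Chilkat, and the host test (google.com and its subdomains) is an assumption about what to screen for, not a description of Chilkat's internal check.

```vbnet
Imports System

Module SpiderPreCheck
    ' Hypothetical helper (not a Chilkat API): returns an explanatory message
    ' when the URL's host is google.com or one of its subdomains, a message
    ' for malformed input, or Nothing when the URL passes the check.
    Function DescribeBlockedUrl(url As String) As String
        Dim parsed As Uri = Nothing
        If Not Uri.TryCreate(url, UriKind.Absolute, parsed) Then
            Return "Not an absolute URL: " & url
        End If

        Dim host As String = parsed.Host.ToLowerInvariant()
        If host = "google.com" OrElse host.EndsWith(".google.com") Then
            Return "The URL to be spidered is prohibited by the hosting service's terms of service: " & url
        End If

        Return Nothing
    End Function

    Sub Main()
        ' Example: screen a URL before handing it to the spider.
        Dim message As String = DescribeBlockedUrl("http://www.google.com/search?hl=en&q=car")
        If message IsNot Nothing Then
            Console.WriteLine(message)
        End If
    End Sub
End Module
```

Calling the helper before `AddUnspidered` would let an application show the message to the developer instead of silently collecting no links.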


answered Mar 04 '14 at 04:09

Greg
