
Hi everyone,

Looking at the example here:

http://www.example-code.com/vbdotnet/spider_mustMatchPattern.asp

Something is wrong with .AddMustMatchPattern

In VS 2012 (VB.NET), on a form with a TextBox and a Button:

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    '  The Chilkat Spider component/library is free.
    Dim spider As New Chilkat.Spider()

    '  --------------------------------------------------------------------
    '  Note: The URLs in this example are no longer valid.
    '  You should replace the URLs with URLs from a site of your
    '  own choosing -- preferably your own site if testing.
    '  (Google's Directory no longer exists.)
    '  --------------------------------------------------------------------

    '  First, we'll get the outbound links for a page in the
    '  Google directory.  Then we'll add some must-match patterns
    '  and re-fetch, to see them work...

    spider.Initialize("www.dmoz.org")
    spider.AddUnspidered("http://www.dmoz.org/Business/Accounting/")

    Dim success As Boolean
    success = spider.CrawlNext()

    '  Display the outbound links
    '  NumOutboundLinks and GetOutboundLink use Integer in .NET.
    Dim i As Integer
    For i = 0 To spider.NumOutboundLinks - 1
        TextBox1.Text = TextBox1.Text & spider.GetOutboundLink(i) & vbCrLf
    Next

    '  Do it again, but this time with must-match and avoid patterns.
    spider.Initialize("www.dmoz.org")
    spider.AddUnspidered("http://www.dmoz.org/Business/Accounting/")

    '  Add some must-match patterns:
    spider.AddMustMatchPattern("*.com/*")
    spider.AddMustMatchPattern("*.net/*")

    '  Add some avoid-patterns:
    spider.AddAvoidOutboundLinkPattern("*.mypages.*")
    spider.AddAvoidOutboundLinkPattern("*.personal.*")
    spider.AddAvoidOutboundLinkPattern("*.comcast.*")
    spider.AddAvoidOutboundLinkPattern("*.aol.*")
    spider.AddAvoidOutboundLinkPattern("*~*")

    success = spider.CrawlNext()

    TextBox1.Text = TextBox1.Text & "-----------------------" & vbCrLf

    '  Display the outbound links
    For i = 0 To spider.NumOutboundLinks - 1
        TextBox1.Text = TextBox1.Text & spider.GetOutboundLink(i) & vbCrLf
    Next

End Sub

Produces output:

[image: screenshot of the TextBox output]

There is a row of "---------" in the output. Above that line is the output of the straightforward spider.

Below the line, the links that match the .AddMustMatchPattern patterns should appear. However, nothing appears. It is as if the method is blocking everything.

Thanks!

asked Sep 19 '13 at 04:13 by bendecko

edited Sep 24 '13 at 10:38 by chilkat ♦♦


AddMustMatchPattern can be called one or more times to provide wildcard patterns. Once any must-match patterns have been added, a URL must match at least one of them; otherwise it is ignored.

In the case above, you added 2 must-match patterns:

spider.AddMustMatchPattern("*.com/*")
spider.AddMustMatchPattern("*.net/*")

None of the outbound URLs match any of these patterns. Therefore, they are all excluded. If, for example, there were an outbound link to "http://www.something.com/test.asp", then the "*.com/*" pattern would match it and it would be included in the outbound links.
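To check by hand which URLs a set of wildcard patterns would accept, a small stand-alone sketch like the one below may help. It uses VB's built-in `Like` operator as a stand-in for the spider's matching, which is an assumption: Chilkat's own matching may differ (for example in case sensitivity). `MatchesAnyPattern` is a hypothetical helper, not part of the Chilkat API.

```vbnet
Imports System

Module MustMatchSketch

    ' Hypothetical helper (not part of the Chilkat API): returns True when
    ' the URL matches at least one of the wildcard patterns.
    Function MatchesAnyPattern(url As String, patterns As String()) As Boolean
        For Each pattern As String In patterns
            ' VB's Like operator treats * as "zero or more characters".
            ' Assumption: this approximates the spider's wildcard matching.
            If url Like pattern Then
                Return True
            End If
        Next
        Return False
    End Function

    Sub Main()
        Dim patterns As String() = {"*.com/*", "*.net/*"}
        Dim urls As String() = {
            "http://www.something.com/test.asp",
            "http://www.dmoz.org/Business/Accounting/"
        }

        ' Print each URL with the result of the local pattern check.
        For Each url As String In urls
            Console.WriteLine(url & " -> " & MatchesAnyPattern(url, patterns))
        Next
    End Sub

End Module
```

Under `Like` semantics, the first URL matches "*.com/*" and the second matches neither pattern. Running the outbound links from the first crawl through a check like this would show which ones the patterns ought to accept, which can then be compared with what the spider actually returns.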

answered Sep 24 '13 at 10:37 by chilkat ♦♦

But http://www.dmoz.org/Business/Accounting/ contains lots of .com links and none show up.

http://www.azahranaccounting.com/, http://www.bcgcompany.com/ and http://www.smartpros.com/ are there.

Please try the code.

Thanks

(Sep 24 '13 at 14:55) bendecko
Asked: Sep 19 '13 at 04:13

Seen: 1,950 times

Last updated: Sep 24 '13 at 14:55
