Archived Forum Post

Question:

.AddMustMatchPattern spider blocking all results.

Sep 24 '13 at 14:55

Hi everyone,

Looking at the example here:

http://www.example-code.com/vbdotnet/spider_mustMatchPattern.asp

Something is wrong with .AddMustMatchPattern

In VS 2012 (VB.NET), with a TextBox and a Button on the form:

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    '  The Chilkat Spider component/library is free.
    Dim spider As New Chilkat.Spider()

    '  --------------------------------------------------------------------
    '  Note: The URLs in this example are no longer valid.
    '  You should replace the URLs with URLs from a site of your
    '  own choosing -- preferably your own site if testing.
    '  (Google's Directory no longer exists.)
    '  --------------------------------------------------------------------

    '  First, we'll get the outbound links for a page in the
    '  Google directory.  Then we'll add some must-match and
    '  avoid patterns and re-fetch, to see them work...

    spider.Initialize("www.dmoz.org")
    spider.AddUnspidered("http://www.dmoz.org/Business/Accounting/")

    Dim success As Boolean
    success = spider.CrawlNext()

    '  Display the outbound links
    Dim i As Integer
    For i = 0 To spider.NumOutboundLinks - 1
        TextBox1.Text = TextBox1.Text & spider.GetOutboundLink(i) & vbCrLf
    Next

    '  Do it again, but this time with must-match and avoid patterns.
    spider.Initialize("www.dmoz.org")
    spider.AddUnspidered("http://www.dmoz.org/Business/Accounting/")

    '  Add some must-match patterns:
    spider.AddMustMatchPattern("*.com/*")
    spider.AddMustMatchPattern("*.net/*")

    '  Add some avoid-patterns:
    spider.AddAvoidOutboundLinkPattern("*.mypages.*")
    spider.AddAvoidOutboundLinkPattern("*.personal.*")
    spider.AddAvoidOutboundLinkPattern("*.comcast.*")
    spider.AddAvoidOutboundLinkPattern("*.aol.*")
    spider.AddAvoidOutboundLinkPattern("*~*")

    success = spider.CrawlNext()

    TextBox1.Text = TextBox1.Text & "-----------------------" & vbCrLf

    '  Display the outbound links
    For i = 0 To spider.NumOutboundLinks - 1
        TextBox1.Text = TextBox1.Text & spider.GetOutboundLink(i) & vbCrLf
    Next

End Sub

Produces output:

[Screenshot of the output]

The output contains a row of dashes ("---------"). Above that line are the links from the straightforward spider run.

Below the line, the links that match the .AddMustMatchPattern patterns should appear. However, nothing appears. It is as if the method is blocking everything.

Thanks!


Answer

The AddMustMatchPattern method can be called one or more times to provide wildcard patterns. A URL must match at least one of these patterns; otherwise it is ignored.

In the case above, you added 2 must-match patterns:

spider.AddMustMatchPattern("*.com/*")
spider.AddMustMatchPattern("*.net/*")

None of the outbound URLs match either of these patterns, so they are all excluded. If, for example, there were an outbound link to "http://www.something.com/test.asp", then the "*.com/*" pattern would match it and it would be included in the outbound links.
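
To see why a given URL passes or fails, you can test the patterns outside the spider. The following is only a rough sketch: it uses VB's built-in Like operator as a stand-in for Chilkat's matcher, which may differ in details such as case sensitivity. The module name, the MatchesAny helper, and the third sample URL are made up for illustration.

```vbnet
Module MustMatchSketch

    '  Hypothetical helper: returns True if the URL matches at least
    '  one of the wildcard patterns.  The Like operator is only an
    '  approximation of how the spider might match; it is not Chilkat code.
    Function MatchesAny(url As String, patterns As String()) As Boolean
        For Each pattern As String In patterns
            If url Like pattern Then Return True
        Next
        Return False
    End Function

    Sub Main()
        Dim patterns As String() = {"*.com/*", "*.net/*"}

        '  Sample URLs: the one from the answer, the start page from
        '  the question, and a made-up .com URL with no path.
        Dim urls As String() = {
            "http://www.something.com/test.asp",
            "http://www.dmoz.org/Business/Accounting/",
            "http://www.something.com"
        }

        For Each url As String In urls
            Console.WriteLine("{0} -> {1}", url, MatchesAny(url, patterns))
        Next
    End Sub

End Module
```

Under these assumptions only the first URL prints True. The dmoz.org URL contains neither ".com/" nor ".net/", and the last URL fails because "*.com/*" requires a "/" after ".com".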