Focused web crawlers: New Starting URLs

by chris bose on June 10, 2011

Starting URLs are first determined by identifying useful directories and PR sites, and sometimes by paying attention to your spam email. That's how I found www.splut.com.

But there will come a time when these avenues are exhausted: you have run all your target markets and similarity sets, and you are stuck.

Then I had an idea that seemed to come out of the blue, though really it came from thinking hard about the problem. I believe the solution may lie in a Google search. Turn off Google Instant, select 100 results per page, and search for some relevant key phrases.

Save the results page and clean it up by removing the irrelevant links Google puts there. Install MAMP, put the page in the htdocs folder and add its local path to your script.
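The clean-up step can also be done in code rather than by hand. Here is a minimal sketch, assuming the results page has been saved as HTML and that "irrelevant links" means relative links plus anything on a Google host. The names LinkCollector, is_external_result and extract_seed_urls are my own for illustration, and the blocked-host rule is a rough guess you will want to tune.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_external_result(url, blocked=("google.",)):
    """Keep absolute http(s) links whose host matches no blocked fragment.

    The blocked tuple is an assumption; extend it as you spot more noise.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # relative links, javascript:, mailto: and so on
    host = parsed.netloc.lower()
    return not any(fragment in host for fragment in blocked)


def extract_seed_urls(html_text):
    """Return the external links in page order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html_text)
    seen = set()
    seeds = []
    for url in collector.links:
        if is_external_result(url) and url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds
```

Read the saved page into a string, pass it to extract_seed_urls, and eyeball what comes back before handing it to the crawler.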

You have now, in a sense, constructed your own focused directory, which you can populate with as many links as you think necessary by adding other search results pages to your own.
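Combining several results pages into one directory page might look like the sketch below. It assumes you already have a list of URLs per results page; merge_seed_lists and build_directory_page are hypothetical helper names, and the page layout is just a bare list of links for the crawler to read.

```python
from html import escape


def merge_seed_lists(*lists):
    """Merge several URL lists, keeping first-seen order and dropping repeats."""
    seen = set()
    merged = []
    for urls in lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def build_directory_page(urls, title="Seed directory"):
    """Render the URLs as a simple HTML page made of one link per list item."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return (
        "<html><head><title>{}</title></head>"
        "<body><ul>\n{}\n</ul></body></html>"
    ).format(escape(title), items)
```

Write the returned string to a file in htdocs and point your script at it as before.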

Then test and re-test.

I have found that the best results come from a page produced by an advanced Google search with exclusions such as -pdf and -jobs in the search string. I am sure you can think of others.
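If you try many variations, it can help to generate the search strings instead of typing them. This is only a sketch of the string building: build_query is a hypothetical helper, and the phrases in the usage note are made-up examples. It builds the text to paste into the search box, nothing more.

```python
def build_query(phrases, exclude=("pdf", "jobs")):
    """Build a search string: quoted key phrases followed by -term exclusions."""
    quoted = ['"{}"'.format(phrase.strip()) for phrase in phrases if phrase.strip()]
    minus = ["-{}".format(term.strip()) for term in exclude if term.strip()]
    return " ".join(quoted + minus)
```

For example, build_query(["industrial sensors", "distributor"]) gives "industrial sensors" "distributor" -pdf -jobs, and you can pass your own exclude tuple as you discover more noise words.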

Your focused web crawler now has access to a practically unlimited supply of starting URLs. This, combined with the ability to run multiple Python scripts in different windows on powerful desktops, leads to monthly bandwidth usage in the terabytes, yielding 20 to 30 new target companies per month.
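When running several scripts side by side, each one needs its own share of the starting URLs. One simple way to deal them out, assuming every script takes a plain list of seeds, is round-robin slicing. split_seeds is a hypothetical helper, not part of any crawler library.

```python
def split_seeds(seeds, workers):
    """Deal the seed URLs out round-robin into one list per worker script."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return [seeds[i::workers] for i in range(workers)]
```

Each sublist can then be saved to its own file or directory page and given to one script window.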
