Similarity Sets and Server Loads

by chris bose on July 15, 2011

At the start of this work, I did not look too deeply into the quality of the similarity sets, hoping that the script would sort out any inadequacies.

Well, you need quality in to get quality out. Obviously, check each URL to see if it's relevant to your theme. Then check the log to see that the script is actually downloading the pages in the site, because sometimes it isn't. This could be because of server issues or because a robots.txt file is blocking the script.
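
If you suspect robots.txt is the reason a site is not being downloaded, you can test its rules directly. Here is a minimal sketch using Python's standard urllib.robotparser. The function name allowed_by_robots and the sample rules are made up for illustration and are not part of the script discussed here.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    Hypothetical helper: it takes the robots.txt contents as a string,
    so you can paste in whatever the site actually serves.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
sample_rules = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(sample_rules, "http://example.com/index.html"))
print(allowed_by_robots(sample_rules, "http://example.com/private/page.html"))
```

If a URL in your similarity set comes back as disallowed, that is a likely reason its pages never show up in the log.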

You can increase the quality of the runs by downloading entire sites with Web Dumper and serving the pages from your own server. This way you don’t piss off server owners by using their bandwidth every time you do a run, and you can tweak your own server to make sure the pages are being delivered in the time required. See the quality of your results go up!
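
Once a site is mirrored, the URLs in the similarity set need to point at your own server instead of the original host. The sketch below shows one way to rewrite them. It assumes the mirror is stored as one folder per host name (for example mirror/example.com/index.html) and served from http://localhost:8000. Both are assumptions for illustration, and the folder layout your download tool produces may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def to_local_url(remote_url: str, mirror_base: str = "http://localhost:8000") -> str:
    """Map a remote URL onto a local mirror laid out as <base>/<host>/<path>.

    Hypothetical helper: the one-folder-per-host layout is an assumption.
    """
    remote = urlsplit(remote_url)
    base = urlsplit(mirror_base)

    # Treat a bare host or a directory URL as its index page.
    path = remote.path or "/"
    if path.endswith("/"):
        path += "index.html"

    local_path = base.path.rstrip("/") + "/" + remote.netloc + path
    return urlunsplit((base.scheme, base.netloc, local_path, remote.query, ""))


print(to_local_url("http://example.com/about/"))
```

For a quick test, Python's built-in server can serve the mirror folder with `python -m http.server 8000 --directory mirror`. A properly tuned web server is the better choice for real runs.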

Because the files are being accessed “locally”, I can now run as many simultaneous runs as my bandwidth will cope with for a particular similarity set.
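
To give an idea of what capped simultaneous runs look like, here is a rough sketch using a thread pool from Python's standard library. The names run_concurrently and fake_run are placeholders, not the actual script. The max_workers value is the knob you would raise until your bandwidth or server starts to struggle.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(jobs, run_one, max_workers=4):
    """Run run_one(job) for every job, at most max_workers at a time.

    Results come back in the same order as the jobs were given.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, jobs))


def fake_run(name):
    # Placeholder standing in for one real run against the local mirror.
    return "finished " + name


print(run_concurrently(["run-1", "run-2", "run-3"], fake_run, max_workers=2))
```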
