Forming a corpus

by bose on March 20, 2010

Used Web Dumper to download web pages from all interesting sites, with these settings: depth level 12; external links not allowed; all file types disabled except text files; 50% bandwidth with 6 simultaneous connections.

Tried the list option, but it always seemed to hang, so did one URL at a time. Always stopped the run if a site had not begun downloading after 20 seconds.
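
The download settings above boil down to a simple rule for which links get followed. As a rough illustration only (this is not how Web Dumper works internally), the rule could be sketched in Python as below. The function name `should_follow` and the set of extensions treated as "text files" are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse

MAX_DEPTH = 12  # depth level used in the post

# Assumption: "text files" means HTML and plain-text pages,
# plus extensionless URLs such as directory indexes.
TEXT_EXTENSIONS = {"", ".html", ".htm", ".txt"}


def should_follow(link, start_url, depth):
    """Return True if a link found at this depth passes the crawl rule."""
    # Stop once the depth limit is exceeded.
    if depth > MAX_DEPTH:
        return False

    link_parts = urlparse(link)
    start_parts = urlparse(start_url)

    # Do not allow external links: the host must match the starting site.
    if link_parts.netloc.lower() != start_parts.netloc.lower():
        return False

    # Disable all file types except text files.
    extension = posixpath.splitext(link_parts.path)[1].lower()
    return extension in TEXT_EXTENSIONS
```

For example, `should_follow("http://example.com/docs/page.html", "http://example.com/", 3)` passes, while the same call with a `.jpg` link or a link to another host does not.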

When the downloads were complete, ran a text factory in BBEdit to turn the HTML into plain text, overwriting all files.
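
A rough Python equivalent of this conversion step, for anyone without BBEdit, might look like the sketch below. It is not the BBEdit text factory itself. The names `TextExtractor`, `html_to_text`, and `convert_folder` are made up for illustration, and the sketch assumes the downloaded pages are UTF-8 files ending in `.htm` or `.html`.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and tidy up the whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def convert_folder(folder):
    """Overwrite every .htm/.html file under folder with its plain text."""
    for path in Path(folder).rglob("*.htm*"):
        if path.is_file():
            html = path.read_text(encoding="utf-8", errors="replace")
            path.write_text(html_to_text(html), encoding="utf-8")
```

Because `convert_folder` overwrites files in place, as the text factory did, it is worth running it on a copy of the download folder first.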
