Most of this can be done fairly easily either with Perl scripts (see the WWW::Spyder module on CPAN: http://search.cpan.org/~ashley/WWW-Spyder-0.18/Spyder.pm) or with shell scripts and command-line flags to wget. It is definitely not a plug-in-and-go kind of task, however: you are going to have to invest some serious time to get it working, and working *right*, for your data collection needs. Assembling the adjacency matrix is quite another problem. The raw matrix produced by such a crawl will very quickly outstrip your ability to either store or analyze it for any sizeable chunk of collected data. --elijah
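To illustrate the storage point: a dense matrix over n pages needs n * n cells, so a million pages would already mean 10^12 cells, nearly all of them zero. One possible way around this, sketched below in Python rather than Perl, is to keep only the links that actually occur and to collapse pages into their domains, since the question is about organizations. The function name domain_adjacency and the example.* URLs are made up for this illustration; this is a rough sketch, not a tested pipeline.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_adjacency(links):
    """Collapse page-level links into domain-level link counts.

    links: iterable of (source_url, target_url) pairs.
    Returns {source_domain: {target_domain: count}}, storing only the
    pairs that actually occur (a sparse adjacency structure).
    """
    adj = defaultdict(lambda: defaultdict(int))
    for src, dst in links:
        s = urlparse(src).netloc.lower()
        d = urlparse(dst).netloc.lower()
        if s and d and s != d:  # skip malformed URLs and self-links
            adj[s][d] += 1
    return {s: dict(targets) for s, targets in adj.items()}


# Hypothetical edge list, as a crawl might produce it.
links = [
    ("http://a.example/index.html", "http://b.example/about"),
    ("http://a.example/news.html", "http://b.example/"),
    ("http://a.example/news.html", "http://a.example/index.html"),
    ("http://b.example/about", "http://c.example/"),
]
print(domain_adjacency(links))
```

Whether raw link counts or a simple 0/1 "links at all" flag is the right cell value depends on the analysis; the counts can always be thresholded afterwards.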
I am trying to analyze the relationships between organizations on the web. In particular, I want to map the linking behavior of a subjectively defined set of organizations.
I would be most grateful if someone could tell me whether there is a web crawler that lets one define:
- a set of URLs from which to start the crawl
- the depth: how many levels one wants to look into a given target domain
- the number of iterations: how far from the original URL's domain one wants to go
- a few filters, to limit specific types of pages (PDF, for example)
and that returns a map, a table of relationships (some sort of adjacency matrix), or both.
Thanks in advance,
Rafel Lucea MIT - Sloan School of Management
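
The four crawl parameters asked for in the message above (seed URLs, depth within a domain, distance from the seed domain, and file-type filters) could be sketched roughly as follows, using only the Python standard library. This is one reading of the request, not an existing tool: the names crawl, max_depth, max_hops and skip_ext are invented for this sketch, depth is assumed to reset each time the crawl enters a new domain, and the fetch argument is there so the example can run against a small made-up set of pages instead of the live web. A real crawl would also need robots.txt handling, rate limiting and better error reporting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Default fetcher: download a page over the network."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seeds, max_depth=2, max_hops=1, skip_ext=(".pdf",), fetch=fetch_html):
    """Breadth-first crawl returning a set of (source_url, target_url) links.

    max_depth: links followed within one domain (assumed to reset per domain).
    max_hops:  domain changes allowed away from a seed's domain.
    skip_ext:  path suffixes that are neither recorded nor followed.
    """
    edges = set()
    seen = set(seeds)
    queue = deque((url, 0, 0) for url in seeds)
    while queue:
        url, depth, hops = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # unreachable page: skip it and carry on
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            target = urljoin(url, href).split("#")[0]  # absolute, no fragment
            parts = urlparse(target)
            if parts.scheme not in ("http", "https"):
                continue
            if parts.path.lower().endswith(tuple(skip_ext)):
                continue
            edges.add((url, target))
            if parts.netloc == urlparse(url).netloc:
                nxt_depth, nxt_hops = depth + 1, hops
            else:
                nxt_depth, nxt_hops = 0, hops + 1
            if nxt_depth <= max_depth and nxt_hops <= max_hops and target not in seen:
                seen.add(target)
                queue.append((target, nxt_depth, nxt_hops))
    return edges


# Made-up pages standing in for the web, so the sketch runs offline.
pages = {
    "http://a.example/": '<a href="/x">x</a> <a href="http://b.example/">b</a> <a href="/doc.pdf">pdf</a>',
    "http://a.example/x": "",
    "http://b.example/": '<a href="http://c.example/">c</a>',
    "http://c.example/": '<a href="http://d.example/">d</a>',
}
found = crawl(["http://a.example/"], max_depth=1, max_hops=1, fetch=pages.__getitem__)
for source, target in sorted(found):
    print(source, "->", target)
```

The returned pairs are a plain edge list, which can be turned into a table of relationships or handed to a graph-drawing tool to produce the map.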
_______________________________________________ The Air-l-aoir.org@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org