Mapping the net with crawlers/robots
Hi all,

I am trying to analyze the relationships between organizations on the web. In particular, I want to map the linking behavior of a subjectively defined set of organizations. I have explored a number of software packages (Website Watcher, Sphinx, MnoGoSearch, Issuecrawler...), but they are either not designed for this specific purpose (Website Watcher, Issuecrawler) or require coding abilities that are beyond my knowledge (Sphinx -OS).

I would be most grateful if someone could tell me whether there is a web crawler that lets one define:
- a set of URLs from which to start the crawl
- the depth: how many levels one wants to look into a given target domain
- the number of iterations: how far from the original URL domain one wants to go
- a few filters: limiting specific types of pages (PDF, for example)

and that returns either a map, a table of relationships (some sort of adjacency matrix), or both.

Thanks in advance,
Rafel Lucea
MIT - Sloan School of Management
Most of this can be done pretty easily with either Perl scripts (see the module WWW::Spyder on CPAN, or see http://search.cpan.org/~ashley/WWW-Spyder-0.18/Spyder.pm) or with shell scripts and command-line flags to wget. It is definitely not a plug-in-and-go kind of task, however - you are going to have to invest some serious time to get it working and working *right* for your data collection needs.

Adjacency matrix assembly is quite another problem. The raw matrix produced by such a crawl will very quickly outstrip your ability to either store or analyze it, for any sizeable chunk of data collected.

--elijah
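To make the shape of such a script concrete, here is a minimal sketch in Python (standard library only) rather than the Perl or wget routes mentioned above. It is an illustration under stated assumptions, not a finished tool: the names `crawl`, `fetch_html`, `LinkParser`, `max_depth`, `max_hops`, and `skip_ext` are invented for this example, "depth" is read as link levels followed within one host, and "iterations" as the number of host changes allowed away from a seed. Instead of a full matrix it keeps only the non-zero cells, as counts of links between pairs of hosts, which is one way to keep storage manageable.

```python
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: return the page text, or "" if it cannot be read."""
    try:
        with urlopen(url, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(seeds, max_depth=2, max_hops=1, skip_ext=(".pdf",), fetch=fetch_html):
    """Breadth-first crawl; returns Counter {(source_host, target_host): links}.

    max_depth: link levels followed within any one host (assumed meaning).
    max_hops:  host changes allowed away from a seed host (assumed meaning).
    skip_ext:  tuple of path suffixes to ignore; such links are neither
               followed nor counted.
    """
    edges = Counter()
    seen = set(seeds)
    queue = deque((url, 0, 0) for url in seeds)  # (url, depth in host, hops)
    while queue:
        url, depth, hops = queue.popleft()
        src = urlparse(url).netloc
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # absolute, no fragment
            parts = urlparse(target)
            if parts.scheme not in ("http", "https"):
                continue
            if parts.path.lower().endswith(skip_ext):
                continue
            dst = parts.netloc
            if dst != src:
                edges[(src, dst)] += 1          # one cell of the sparse matrix
                nxt = (target, 0, hops + 1)     # new host: depth restarts
            else:
                nxt = (target, depth + 1, hops)
            if target not in seen and nxt[1] <= max_depth and nxt[2] <= max_hops:
                seen.add(target)
                queue.append(nxt)
    return edges
```

Calling `crawl(["http://example.org/"])` would return the host-to-host link counts, which can be written out as a three-column table (source, target, count) for a network analysis package. The sketch does not consult robots.txt or pause between requests, so both would need to be added before pointing it at real sites.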
_______________________________________________ The Air-l-aoir.org@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://aoir.org/airjoin.html
Hi Rafel,

TouchGraph Google Browser might be able to help you with the project. The software can be found at: http://www.touchgraph.com/TGGoogleBrowser.html

Chris
===== Christopher Helland, Ph.D. SSHRC Research Fellow Gorsebrook Research Institute for Atlantic Canada Studies Saint Mary's University Halifax, Nova Scotia Canada B3H 3C3 http://www.chass.utoronto.ca/~chelland/index.html
Hi, Richard Rogers govcom.com http://www.pressepapiers.net/archives/2003/10/20/aggregating_global_civil_so... Greetings, Rob. -- http://www.virtueelplatform.nl/ http://memling.ugent.be/staff/rob http://blogger.xs4all.nl/kranenbu/ mail: kranenbu@xs4all.nl text/sms: 0032 472 40 63 72
participants (4): Christopher Helland, elijah wright, rafel Lucea, Rob van Kranenburg