Re: [Air-l] Mapping the net with crawlers/robots

20 Oct 2004

Most of this can be done fairly easily with either Perl scripts (see the
module WWW::Spyder on CPAN:
http://search.cpan.org/~ashley/WWW-Spyder-0.18/Spyder.pm) or with shell
scripts and command-line flags to wget.  It is definitely not a
plug-and-play kind of task, however - you are going to have to invest
some serious time to get it working, and working *right*, for your data
collection needs.
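
To give a feel for the moving parts, here is a minimal sketch in Python
(not WWW::Spyder or wget) of a breadth-first crawl with a set of seed
URLs, a depth limit and a file-type filter.  The names crawl, fetch_html,
LinkParser, max_depth and skip_ext are all made up for illustration, and
the single depth limit is a simplification: it does not separate "depth
within a domain" from "distance from the original domain".  A real
crawler would also need robots.txt handling, politeness delays and
better error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Default fetcher (hypothetical helper): page text, or "" on error."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(seeds, max_depth=2, skip_ext=(".pdf",), fetch=fetch_html):
    """Breadth-first crawl; returns a set of (source_url, target_url) edges.

    Seeds are at depth 0; pages deeper than max_depth are recorded as
    link targets but are not fetched themselves.
    """
    edges = set()
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            parts = urlparse(target)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            if parts.path.lower().endswith(tuple(skip_ext)):
                continue  # filtered file type: neither recorded nor followed
            edges.add((url, target))
            if target not in seen and depth + 1 <= max_depth:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges
```

Passing your own fetch function (for example one that reads from a
local mirror made with wget) lets you try the link extraction without
touching the network.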

Adjacency matrix assembly is quite another problem.  A raw page-by-page
matrix grows with the square of the number of pages crawled, so for any
sizeable chunk of data it will very quickly outstrip your ability to
either store or analyze it.
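
One common way to keep this manageable, sketched below in Python, is to
store only the links that actually occur and to collapse pages into
their host names before counting.  This assumes the crawl output has
already been reduced to (source URL, target URL) pairs; the function
names domain_adjacency and to_dense are made up for illustration, and
treating the raw host name as the "organization" is a simplification
(www.a.org and a.org would count as different nodes).

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_adjacency(edges):
    """Collapse page-level (source, target) edges into host-level counts.

    Returns {source_host: {target_host: number_of_links}}, storing only
    the pairs that actually occur rather than a full N x N matrix.
    """
    adj = defaultdict(lambda: defaultdict(int))
    for source, target in edges:
        src = urlparse(source).netloc.lower()
        dst = urlparse(target).netloc.lower()
        if src and dst and src != dst:  # ignore links inside one site
            adj[src][dst] += 1
    return {src: dict(row) for src, row in adj.items()}


def to_dense(adj):
    """Expand to a dense matrix; only sensible for a small set of hosts."""
    names = sorted(set(adj) | {d for row in adj.values() for d in row})
    matrix = [[adj.get(r, {}).get(c, 0) for c in names] for r in names]
    return names, matrix
```

For a few dozen organizations the dense form is fine to hand to a
network-analysis package; for anything larger it is probably better to
stay with the sparse dictionary (or an edge list on disk).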

--elijah
...
I am trying to analyze the relationships between organizations on the
web. In particular, I want to map the linking behavior of a
subjectively defined set of organizations.
I would be most grateful if someone could tell me whether there
exists a web crawler that lets one define
  - a set of URLs from which to start the crawl
  - the depth: how many levels one wants to look into a given target
    domain
  - the number of iterations: how far from the original URL domain one
    wants to go
  - a few filters: limit specific types of pages (PDF, for example)
and that returns either a map, a table of relationships (some sort of
adjacency matrix), or both.
Thanks in advance,
Rafel Lucea
MIT - Sloan School of Management
_______________________________________________
The Air-l-aoir.org@listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org

elijah wright