I guess I will weigh in on the liberal collecting side here:

COPYRIGHT

Agreeing with Dan's post, which suggests that fair use is about as far from clear-cut as possible, it strikes me that there is potential harm in being over-cautious on this. I guess I am living (ever so slightly) dangerously when I make private copies of websites as part of my research, but I also think it is necessary, fair, and just. I would hate to see us as a community give up our rights to make use of copyrighted material. Our research does (or at least ought to) serve the public welfare, and we should assert our rights to use copyrighted works appropriately. By shying away from this, I think we adversely affect the future enforcement of copyright.

It is easy enough for publishers to limit access to materials on the web by use of passwords and the like, and as soon as they do, the material is clearly no longer publicly published, and that's a whole different situation. But particularly for sites that are deliberately put into the public eye, I think we have a responsibility as researchers to access and archive that material as our research demands.

ROBOTS.TXT / TOS

There needs to be some balance here in terms of coverage. While I may be in the minority, I don't think it is vital that robots.txt *always* be followed. (For practical reasons, especially with dynamic sites, it may be a good idea, but I don't think it is an absolute.) If my robot behaves in such a way that it is indistinguishable from a gaggle of humans loading stuff in their browsers and saving it, then I see no reason I shouldn't be able to use a robot. Most robots.txt prohibitions exist because web authors are looking for a way to shape the search engines' report of their site. They have not predicted the use of the content by researchers.

On the other hand, some robots (the built-in IE engine mentioned above, IIRC, as well as Acrobat) do not behave at all like a human: they deliver a huge number of simultaneous or sequential requests without appropriate delays. Here, the harm (or potential harm) is much clearer.
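
To make the "behaves like a human" idea concrete, here is a rough sketch of a robot that fetches pages one at a time and pauses a random interval between requests. The function name `fetch_politely`, the delay bounds, and the injectable `fetch` and `sleep` parameters are all my own assumptions for illustration; what counts as an "appropriate delay" will depend on the site.

```python
import random
import time
from typing import Callable, Iterable, List, Optional


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_delay: float = 2.0,   # assumed lower bound, in seconds
    max_delay: float = 6.0,   # assumed upper bound, in seconds
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[bytes]:
    """Fetch URLs sequentially, never in parallel, with a random pause
    between consecutive requests (no pause before the first one)."""
    rng = rng or random.Random()
    pages: List[bytes] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait a human-ish, irregular interval before the next request.
            sleep(rng.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages
```

In real use, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`; it is passed in here so the pacing logic can be tried out without touching anyone's server.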