On 6/5/05 1:36 PM, "Jeremy Hunsinger" <jhuns@vt.edu> wrote:
I use wget or mirror.pl, and then there are the tools represented at http://webarchivist.org/ too.
On Jun 5, 2005, at 10:10 AM, elijah wright wrote:
I am a PhD student at the University of Reading, UK. I am running a study on 200 protest group websites. Would you suggest any good software to store whole websites offline?
The tools represented at http://webarchivist.org are for somewhat more elaborate research approaches than many individual scholars are interested in developing -- but let me try to explain our thinking on this topic.

WebArchivist was created to solve the problem of making regular periodic copies of a number of sites or pages; retrieving the archived objects by URL and date; indexing, cataloguing and/or analyzing the sites / pages; and then retrieving the archived objects on the basis of metadata created by the researcher or cataloguer (i.e., the index, catalog, or analysis fields). Our tools seem to be most efficient when the number of objects is relatively large (dozens to hundreds or even thousands of sites) and the collecting is regular (daily, weekly or monthly) and sustained (three months to a few years). Examples of the kinds of archives / collections that can be sustained include the Web spheres we've analyzed around the 2002 US Election and the September 11 terrorist attacks; both are presented at http://www.loc.gov/minerva, and additional scholarly data on the 2002 election collection is presented at http://politicalweb.info.

We strongly encourage scholars to work closely with librarians at their institutions to see whether they are willing to help store the collection for future researchers. Alternatively, consider working with us or perhaps the Internet Archive to store your collection and the data about the Web objects that you collect. If you are interested in making a collection accessible to other researchers, even others in your own research group, you will need to consider how to serve the objects in the collection.

If you have any concerns about preservation, or about representing the data in as close to the observed form as possible, you may wish to consider crawlers that do not change the HTML code.
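As a rough illustration of two of the ideas above (retrieving archived objects by URL and date, and keeping each object in its observed form), here is a minimal Python sketch. It is not how the WebArchivist tools are implemented; the class name SnapshotStore, its add and get methods, and the example.org URL are all hypothetical, and the store lives only in memory. The point is simply that the bytes are kept exactly as received, with a checksum, and looked up by URL plus a date.

```python
import bisect
import hashlib
from datetime import datetime


class SnapshotStore:
    """Hypothetical in-memory store of page captures, keyed by URL and time."""

    def __init__(self):
        # url -> list of (captured_at, raw_bytes, sha256_hex), kept in time order
        self._captures = {}

    def add(self, url, captured_at, raw_bytes):
        """Record the bytes exactly as received; no links are rewritten."""
        digest = hashlib.sha256(raw_bytes).hexdigest()
        entries = self._captures.setdefault(url, [])
        times = [entry[0] for entry in entries]
        position = bisect.bisect_right(times, captured_at)
        entries.insert(position, (captured_at, raw_bytes, digest))
        return digest

    def get(self, url, as_of):
        """Return the latest capture of url at or before as_of, or None."""
        entries = self._captures.get(url, [])
        times = [entry[0] for entry in entries]
        position = bisect.bisect_right(times, as_of)
        if position == 0:
            return None
        return entries[position - 1]


# Two weekly captures of the same (made-up) page.
store = SnapshotStore()
page_v1 = b'<a href="/about.html">About</a>'
page_v2 = b'<a href="/about.html">About us</a>'
store.add("http://example.org/", datetime(2005, 6, 1), page_v1)
store.add("http://example.org/", datetime(2005, 6, 8), page_v2)

# Ask what the page looked like on 5 June: the 1 June capture is returned.
captured_at, raw, digest = store.get("http://example.org/", datetime(2005, 6, 5))
```

Because the stored bytes are never modified, the checksum recorded at capture time can be recomputed later to confirm that the object is still in its observed form.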
Some programs, such as Teleport Pro, and wget in some of its usages, do rewrite the HTML code. While rewriting HTML code to make links readable offline may make your initial observation easier, subsequent researchers may find your data very difficult to interpret. It may also be difficult or impossible to house such a collection in an archive.

Most recently, we've been using the Heritrix crawler and saving our data into ARC files. This creates the additional challenge of reading the ARC files, however. There are some tools out there that help -- check out http://www.netarchive.dk/website/sources/index-en.htm.

This thread raises interesting issues about our ability as scholars to create datasets (archives) of Web-based materials. I'd be glad to continue the discussion if anyone is interested.

//steve

Steven M. Schneider
Associate Professor, SUNYIT: http://www.sunyit.edu/~steve
Co-Director, WebArchivist: http://www.webarchivist.org