I have a new take on the old problem of archiving a web site. The problem is that the site I need to archive has already been taken offline. (An object lesson in why it's important to archive web sites you're depending on in your research....) Fortunately, the site is still available through Google's cache, but this is a difficult way to access the content. (The site in question is an e-mail list archive, so each message is a separate page, which means I need to download more than 2000 pages.) I've tried various archiving programs, inputting the URL that Google generates for its search result page as the root page for the archive. But so far no luck, I think because Google creates a separate URL for each page of the search results, and because it's difficult to figure out the right "depth" of archive. (I need it to go 200 pages deep to get to the last page of search results, but I only want it to go 2 pages deep to get each message.) Does anyone have any experience trying to archive a Google cache? Or any suggestions? Thanks, Alex -- Alexandra Samuel samuel@fas.harvard.edu http://www.alexandrasamuel.com
Alexandra, You may want to try Google's Web Services and write some code to get the pages you want. http://www.google.com/apis/ If you can't do it yourself, some CS major at UBC should be able to whip up the query *fairly* quickly. Good luck! hth -- =============================================== Karim R. Lakhani MIT Sloan School of Management MIT Free/Open Source Software Research Project e-mail: lakhani@mit.edu voice: 617-851-1224 fax: 617-344-0403 http://opensource.mit.edu http://freesoftware.mit.edu http://mit.edu/lakhani/www ==============================================
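One way to sidestep the "depth" problem described above is to avoid crawling from a single root page: build the URL of every results page directly, then pull only the message links out of each one. The Python sketch below illustrates that idea and is not the Google Web Services approach suggested in the reply. It assumes that result pages are addressed with a `start` offset of 10 results per page and that cached-message links contain a recognizable marker such as `cache:`; both are assumptions to check against the real URLs. All function names are hypothetical, and `fetch` is whatever function you supply to download a page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

RESULTS_PER_PAGE = 10  # assumption: ten results per search-result page


def result_page_urls(query, total_pages, base="http://www.google.com/search"):
    """Build one URL per results page instead of following 'Next' links."""
    return [
        base + "?" + urlencode({"q": query, "start": page * RESULTS_PER_PAGE})
        for page in range(total_pages)
    ]


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def message_links(html, page_url, marker):
    """Return absolute, de-duplicated links whose URL contains `marker`."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.links:
        absolute = urljoin(page_url, href)
        if marker in absolute and absolute not in found:
            found.append(absolute)
    return found


def collect_message_urls(query, total_pages, marker, fetch):
    """Visit every results page once and gather the message URLs on it.

    `fetch` is a caller-supplied function taking a URL and returning HTML.
    """
    urls = []
    for page_url in result_page_urls(query, total_pages):
        for link in message_links(fetch(page_url), page_url, marker):
            if link not in urls:
                urls.append(link)
    return urls
```

Calling `collect_message_urls` with 200 result pages and a suitable marker would yield a flat list of message URLs, which can then be downloaded one by one with no depth setting at all.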
Participants (2): Alexandra Samuel, Karim R. Lakhani