I'm working on a project that involves conducting a cluster analysis (type of textual analysis based on Kenneth Burke's work) on the content of five different websites. I want to download the full content of these five sites so I have hard copies to work from during the rather arduous process of going through and categorizing the text. Can anyone recommend a good program to download full websites (to a page depth of at least 3)? I've been using SiteSucker but am finding it a bit buggy. Thank you! Katie Kathleen Stansberry Ph.D. Candidate University of Oregon School of Journalism and Communication http://katiestansberry.com kpontius@uoregon.edu (541) 228-5576
Hi Kathleen, Apache Lucene is the best resource for something like this, in my opinion. Available here: http://lucene.apache.org/ Requires some programming knowledge though. Thanks, Wojciech On Mon, Feb 13, 2012 at 12:33 AM, Kathleen Stansberry <kpontius@uoregon.edu> wrote:
I'm working on a project that involves conducting a cluster analysis (type of textual analysis based on Kenneth Burke's work) on the content of five different websites. I want to download the full content of these five sites so I have hard copies to work from during the rather arduous process of going through and categorizing the text.
Can anyone recommend a good program to download full websites (to a page depth of at least 3)? I've been using SiteSucker but am finding it a bit buggy.
Thank you! Katie
_______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
Hi Kathleen, Lucene is good, but there are also some simpler options. I like the command line; there you can use wget: http://gnuwin32.sourceforge.net/packages/wget.htm Usage is detailed here: http://how-to.wikia.com/wiki/How_to_mirror,_spider,_or_archive_a_website http://blog.moldoveanu.net/2010/11/downloading-an-entire-website-using-wget/ Or you can use a 'spider' extension for the Firefox web browser: install Firefox, www.mozilla.org/en-US/firefox/new/ and then, in Firefox, install a spider add-on, either https://addons.mozilla.org/en-US/firefox/addon/spiderzilla/ or https://addons.mozilla.org/en-US/firefox/addon/foxyspider/ Write back if you have any problems. Cheers, Dennis On 02/13/2012 04:48 PM, Wojciech Gryc wrote:
Hi Kathleen,
Apache Lucene is the best resource for something like this, in my opinion. Available here: http://lucene.apache.org/
Requires some programming knowledge though.
Thanks, Wojciech
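For anyone trying the wget route mentioned above, here is a minimal sketch (not taken from the thread) of how a depth-3 mirror command could be assembled. The helper name `build_wget_command`, the `mirror` output folder, the one-second wait, the list of skipped image suffixes, and the `example.org` URL are all assumptions for illustration; the flags themselves are standard GNU wget options. The script only prints the command, so nothing is downloaded until you run it yourself.

```python
import shlex


def build_wget_command(url, depth=3, out_dir="mirror",
                       skip_extensions=("jpg", "jpeg", "png", "gif")):
    """Return a GNU wget argument list that mirrors `url` down to `depth` link hops.

    The function name and all default values are illustrative assumptions.
    """
    cmd = [
        "wget",
        "--recursive",                    # follow links found in downloaded pages
        f"--level={depth}",               # stop after this many link hops
        "--no-parent",                    # never climb above the starting directory
        "--convert-links",                # rewrite links so the copy can be browsed offline
        "--adjust-extension",             # save HTML pages with an .html suffix
        "--wait=1",                       # pause one second between requests
        f"--directory-prefix={out_dir}",  # folder that receives the downloaded files
    ]
    if skip_extensions:
        # Comma-separated file suffixes that wget should not keep (images here)
        cmd.append("--reject=" + ",".join(skip_extensions))
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Placeholder URL; replace it with one of the sites under study.
    print(shlex.join(build_wget_command("http://example.org/")))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run` if wget is installed and on the PATH. Dropping the reject list (`skip_extensions=()`) keeps images as well as text.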
Hi Kathleen, I have used both BootCaT (http://bootcat.sslmit.unibo.it/) and HTTrack (www.httrack.com) for building corpora of websites for textual analysis. They were recommended to me by colleagues in my department. They can be a bit slow on larger sites, but I found them both user-friendly and effective. You can also set the search to exclude certain file types, e.g. image files, if you just want text. Let me know if you want any further info. Karen ________________________________________ From: air-l-bounces@listserv.aoir.org [air-l-bounces@listserv.aoir.org] on behalf of Wojciech Gryc [wojciech@gmail.com] Sent: 13 February 2012 05:48 To: Kathleen Stansberry Cc: air-l@listserv.aoir.org Subject: Re: [Air-L] Tool to Download Websites? Hi Kathleen, Apache Lucene is the best resource for something like this, in my opinion. Available here: http://lucene.apache.org/ Requires some programming knowledge though. Thanks, Wojciech
Hello Katie, the open source website copier HTTrack (http://www.httrack.com/) is commonly used to download individual websites. It saves downloaded pages as static, linked HTML files, which you can then traverse with your browser or easily access with other analytical tools. It offers configurable options for link depth and desired file types (e.g., if you were only interested in text), among other things. ~Nicholas Nicholas Taylor | Information Technology Specialist | Library of Congress Web Archiving
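Since the saved copy is just a folder of static HTML files, turning it into plain text for coding and categorizing can be done with a short script. The following is a rough sketch under stated assumptions, not part of HTTrack or any tool named in this thread: the names `TextExtractor`, `html_to_text`, and `export_mirror_text` are hypothetical, it uses only the Python standard library, and it assumes the pages are readable as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of one HTML document, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def export_mirror_text(mirror_dir, out_dir):
    """Write one .txt file per .html/.htm file found under mirror_dir."""
    mirror_dir, out_dir = Path(mirror_dir), Path(out_dir)
    written = []
    for page in sorted(mirror_dir.rglob("*.htm*")):
        # Mirror the folder layout so each text file can be traced to its page
        target = (out_dir / page.relative_to(mirror_dir)).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        html = page.read_text(encoding="utf-8", errors="replace")
        target.write_text(html_to_text(html), encoding="utf-8")
        written.append(target)
    return written
```

Calling `export_mirror_text("mirror", "mirror_text")` (both folder names are placeholders) would leave one text file per downloaded page. Navigation menus and footers are kept, so some manual trimming is still likely before analysis.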
I have used BlackWidow and BrownRecluse from Softbytelabs. Both have worked well. BlackWidow creates the site structure, from which you can select file types to download as required. BrownRecluse is more programmable and can download elements of a page into a database. Devayani On Tue, Feb 14, 2012 at 8:24 AM, Taylor, Nicholas A. <ntay@loc.gov> wrote:
Hello Katie, the open source website copier HTTrack (http://www.httrack.com/) is commonly used to download individual websites. It saves downloaded pages as static, linked HTML files, which you can then traverse with your browser or easily access with other analytical tools. It offers configurable options for link depth and desired file types (e.g., if you were only interested in text), among other things.
~Nicholas
Nicholas Taylor | Information Technology Specialist | Library of Congress Web Archiving
-- Devayani Tirthali Research Associate Institute for Learning Technologies Teachers College, Columbia University
participants (6)
- Dennis Wollersheim
- Devayani Tirthali
- Donnelly, Karen
- Kathleen Stansberry
- Taylor, Nicholas A.
- Wojciech Gryc