Dear *, could you kindly recommend a viewer for WARC files (web page archiving)? Kind regards
WARC is a standard web archiving file format (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml); it is an open standard.
Usually you would view these files with a web archiving tool such as the Wayback Machine or the underlying open source software (the Heritrix web crawler to collect web content, the NutchWAX indexing engine to provide search services, and Wayback to provide the user interfaces), or with a service from Archive-It (a subscription-based custom web archiving service, www.archive-it.org).
I don't know of a specific viewer for WARCs.
Baden
On Thu, Feb 18, 2010 at 10:06 AM, Steffen Schilke <steffen.schilke@gmail.com> wrote:
Dear *,
could you kindly recommend a viewer for WARC files (web page archiving)?
Kind regards
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
These are some WARC-related tools I'm aware of. I doubt there is a viewer among them, though.
http://code.google.com/p/warc-tools/
http://code.google.com/p/search-tools/
While we're on the topic of WARCs, does anyone know of an open source utility for programmatically extracting data from WARCs? For example, for 're-crawling' web pages stored in WARC format so as to extract hyperlinks and text content (e.g. meta tags)?
Rob
-------------------------------------
Dr Robert Ackland
The Australian National University
e-mail: robert.ackland@anu.edu.au
homepage: http://adsri.anu.edu.au/people/robert.php
project: http://voson.anu.edu.au
Information about the Master of Social Research (Social Science of the Internet specialisation): http://adsri.anu.edu.au/study/msr.php
-------------------------------------
Baden Hughes wrote:
WARC is a standard web archiving file format (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml); it is an open standard.
Usually you would view these files with a web archiving tool such as the Wayback Machine or the underlying open source software (the Heritrix web crawler to collect web content, the NutchWAX indexing engine to provide search services, and Wayback to provide the user interfaces), or with a service from Archive-It (a subscription-based custom web archiving service, www.archive-it.org).
I don't know of a specific viewer for WARCs.
Baden
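On Rob's question about extracting data from WARCs programmatically: the following is a minimal sketch in Python using only the standard library, not a reference to any existing tool. It assumes an uncompressed WARC stream whose records follow the usual layout (a `WARC/` version line, named header fields, a blank line, then `Content-Length` bytes of content), and it assumes that `response` records hold a raw HTTP message with UTF-8 HTML. The names `iter_warc_records`, `extract_links` and `links_by_url` are made up for this illustration.

```python
import io
from html.parser import HTMLParser


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly the number of bytes the record declares.
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(payload):
    """Return the hyperlinks found in the HTML body of a stored HTTP response."""
    # Assumption: the payload is status line + HTTP headers + blank line + body.
    _, _, body = payload.partition(b"\r\n\r\n")
    parser = LinkCollector()
    parser.feed(body.decode("utf-8", "replace"))
    return parser.links


def links_by_url(stream):
    """Map each archived URL to the hyperlinks found in its response record."""
    result = {}
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            result[headers.get("warc-target-uri")] = extract_links(payload)
    return result
```

To try it, open a file in binary mode and pass it to `links_by_url`, e.g. `links_by_url(open("example.warc", "rb"))`, where `example.warc` is a placeholder name. Gzip-compressed files could be opened with `gzip.open(path, "rb")` instead. Meta tags could be collected the same way by checking for `tag == "meta"` in `handle_starttag`.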
Dear *,
thank you for the answers.
I think there should be an open implementation of a standalone viewer for WARC files, which would also make it possible to use a different archiving system to store these files. It would also make it possible to view and browse single WARC files (pages stored in WARC files). I also see a need to "export" a single page with all its components, e.g. to prove how a web page looked at a certain point in time (for legal reasons, historical research, etc.).
Speaking of Heritrix: I was reading the manual and have a little trouble understanding how to set up a crawl job. My task is to archive only certain pages in a crawl job, i.e. I want to give Heritrix a list of URLs, each referring to one page, and have those pages collected including all their components (e.g. PDF files, images, ...). Could anybody here give me a hint or a sample job definition?
Thank you and kind regards
sws
On Thu, Feb 18, 2010 at 1:06 AM, Baden Hughes <baden.hughes@gmail.com> wrote:
WARC is a standard web archiving file format (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml); it is an open standard.
Usually you would view these files with a web archiving tool such as the Wayback Machine or the underlying open source software (the Heritrix web crawler to collect web content, the NutchWAX indexing engine to provide search services, and Wayback to provide the user interfaces), or with a service from Archive-It (a subscription-based custom web archiving service, www.archive-it.org).
I don't know of a specific viewer for WARCs.
Baden
participants (3)
- Baden Hughes
- Robert Ackland
- Steffen Schilke