WARC File viewer - Air-L - lists.aoir.org


WARC File viewer


Steffen Schilke

17 Feb 2010

11:06 p.m.

Dear *, could you kindly recommend a viewer for WARC files (web page archiving)? Kind regards.


Baden Hughes

18 Feb 2010

12:06 a.m.

WARCs are a standard web archiving file format (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml); it's an open standard.

Usually you would use a web archiving tool like the Wayback Machine or the underlying open source software (the Heritrix web crawler to collect web content, the NutchWAX indexing engine to provide search services, and Wayback to provide the user interface), or a service from Archive-It (a subscription-based custom web archiving service, www.archive-it.org) to view these files.

I don't know of a specific viewer for WARCs.

Baden

_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org

Join the Association of Internet Researchers: http://www.aoir.org/
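
For readers wondering what a "viewer" would have to read in the first place: a WARC file is a sequence of records, each with a version line, a block of named headers (such as WARC-Type, WARC-Target-URI and Content-Length), a blank line, and then the record content. The following is only a minimal sketch of listing those records, assuming an uncompressed file and ignoring folded (continued) header lines. The names `iter_warc_records`, `make_record` and `SAMPLE` are made up for this illustration and do not come from any of the tools mentioned in the thread.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC stream.

    `stream` must be a binary file-like object. Header names are lower-cased.
    Folded (multi-line) header values are not handled in this sketch.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact size of the record block in bytes.
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def make_record(warc_type, uri, block):
    """Build one illustrative record in memory (hypothetical sample data)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: %s\r\n"
        "WARC-Target-URI: %s\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (warc_type, uri, len(block))
    ).encode("utf-8")
    return head + block + b"\r\n\r\n"


SAMPLE = (
    make_record("request", "http://example.org/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
)

for headers, block in iter_warc_records(io.BytesIO(SAMPLE)):
    print(headers["warc-type"], headers["warc-target-uri"], len(block), "bytes")
```

For a real file one would pass `open(path, "rb")` instead of the in-memory sample. Archives are often stored gzip-compressed; opening such a file with Python's `gzip.open(path, "rb")` should work with the same function, but that is untested here.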


Robert Ackland

12:22 a.m.

These are some WARC-related tools I'm aware of. I doubt there is a viewer there, though.

http://code.google.com/p/warc-tools/
http://code.google.com/p/search-tools/

While we're on the topic of WARCs, does anyone know of an open source utility for programmatically extracting data from WARCs, e.g. for 're-crawling' web pages stored in WARC format so as to extract hyperlinks and text content (e.g. meta tags)?

Rob

-------------------------------------
Dr Robert Ackland
The Australian National University
e-mail: robert.ackland@anu.edu.au
homepage: http://adsri.anu.edu.au/people/robert.php
project: http://voson.anu.edu.au
Information about the Master of Social Research (Social Science of the Internet specialisation): http://adsri.anu.edu.au/study/msr.php
-------------------------------------
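
The kind of 're-crawling' asked about here (pulling hyperlinks and meta tags out of archived pages) can be sketched with the Python standard library alone. This is a rough illustration, not an existing utility: it assumes the content block of a WARC `response` record (the raw HTTP response) has already been read into memory together with its target URI, and it ignores chunked transfer encoding, compressed bodies and character-set detection. The names `LinkAndMetaExtractor`, `extract_from_response` and `SAMPLE_BLOCK` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndMetaExtractor(HTMLParser):
    """Collect absolute hyperlink targets and <meta name=... content=...> pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the archived page's own URI.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]


def extract_from_response(target_uri, block):
    """Return (links, meta) for an archived HTTP response, or ([], {}) if not HTML."""
    # The HTTP headers and the body are separated by the first empty line.
    head, _, body = block.partition(b"\r\n\r\n")
    header_lines = head.decode("iso-8859-1").lower().split("\r\n")
    is_html = any(
        line.startswith("content-type:") and "text/html" in line
        for line in header_lines
    )
    if not is_html:
        return [], {}
    parser = LinkAndMetaExtractor(target_uri)
    # Assumption: the body is UTF-8; undecodable bytes are replaced, not rejected.
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return parser.links, parser.meta


SAMPLE_BLOCK = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b'<html><head><meta name="Description" content="Demo page"></head>'
    b'<body><a href="/about">About</a> <a href="http://example.net/">Ext</a>'
    b"</body></html>"
)

links, meta = extract_from_response("http://example.org/index.html", SAMPLE_BLOCK)
print(links)
print(meta)
```

Resolving links against the record's target URI matters because archived pages mostly contain relative links, and those are only meaningful together with the address the page was originally fetched from.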


Steffen Schilke

7:18 a.m.

Dear *,

thank you for the answers. I think there should be an open implementation of a (standalone) viewer for WARC files, which would also make it possible to use a different archiving system to store these files. In addition, it should be possible to view / browse single WARC files (pages stored in WARC files). I also see the need to "export" a single page with all its components, e.g., to prove how a web page looked at a certain point in time (for legal reasons, historical research, etc.).

Speaking of Heritrix: I was reading the manual and I have a little trouble understanding how to set up a crawl job. My task is to archive only certain pages in a crawl job, i.e., I want to give Heritrix a list of URLs, each referring to one page, and I want those pages to be collected including all of their components (e.g., PDF files, images, ...). Is there anybody here who could give me a hint or a sample job definition?

Thank you and kind regards
sws
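
The "export a single page with all components" idea can be illustrated roughly as follows. This is only a sketch under stated assumptions: the archived resources are taken to be available already as a plain dictionary mapping each URI to its body bytes (for example, collected from the `response` records of a WARC file), and "components" is interpreted narrowly as resources embedded via `img`, `script`, `iframe`, `embed` and `link` tags. Documents that are merely linked with `<a href>`, such as PDF files, are not followed here. The names `EmbedCollector`, `export_page` and `archive` are hypothetical, and this is not how Heritrix or Wayback implement anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags treated as embedding a component, and the attribute holding its address.
EMBED_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "embed": "src",
    "link": "href",
}


class EmbedCollector(HTMLParser):
    """Collect the absolute URIs of resources embedded in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        attr_name = EMBED_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            url = urljoin(self.base_url, value)
            if url not in self.embeds:
                self.embeds.append(url)


def export_page(page_uri, archive):
    """Return (bundle, missing) for one archived page.

    `bundle` maps the page URI and every archived component to its bytes;
    `missing` lists embedded URIs that are not present in `archive`.
    """
    collector = EmbedCollector(page_uri)
    collector.feed(archive[page_uri].decode("utf-8", errors="replace"))
    collector.close()
    bundle = {page_uri: archive[page_uri]}
    missing = []
    for url in collector.embeds:
        if url in archive:
            bundle[url] = archive[url]
        else:
            missing.append(url)
    return bundle, missing


# Hypothetical in-memory archive standing in for the contents of a WARC file.
archive = {
    "http://example.org/": (
        b'<html><head><link rel="stylesheet" href="style.css"></head>'
        b'<body><img src="/logo.png"><a href="/other.html">other</a>'
        b'<img src="http://cdn.example.net/missing.gif"></body></html>'
    ),
    "http://example.org/style.css": b"body {}",
    "http://example.org/logo.png": b"\x89PNG",
    "http://example.org/other.html": b"<html></html>",
}

bundle, missing = export_page("http://example.org/", archive)
print(sorted(bundle))
print(missing)
```

Reporting the missing components separately seems useful for the evidential use case mentioned above, since an export that silently omits an image or stylesheet would not show the page as it actually looked.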
