Re: [Air-L] the best way to archive web material?

22 Oct 2010

HTTrack tends to respect robots.txt, which can prevent spidering. That may have been the issue there, but it may have been something else; I'd have to look at the site :) I think researchers should respect robots.txt too, though archivists need only respect it to a lesser degree. There are discussions of robots.txt in the list archives that, as I recall, go on for some pages.

Mostly in my work I'm interested in text, plus some design and the images/videos within that text, but mostly text. The reason I like HTTrack and wget (wget is mostly what I use) is that they grab what I need. Sometimes you might need more. danah asked me off-list about JavaScript-based sites, and those do cause major issues, as will sites that encrypt their HTML through JavaScript, but for most sites you can just grab the rendered HTML.

I archive my data sets primarily so others can use them and verify them if they want. I don't do it for my own analysis, but mostly because you aren't really doing science unless you make the data accessible and analyzable by people whose opinions may differ. Granted, I don't usually release them until something is published from them, except for the Wikipedia data, which I never published and just put online.
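
On the robots.txt point: if a crawl stops at the very first page, one quick check is whether the site's robots.txt disallows the path you started from. Below is a minimal sketch in Python using the standard-library urllib.robotparser module. The robots.txt body and the example.org URLs are made up for illustration, and the helper name allowed_by_robots is hypothetical; this is not how HTTrack or wget evaluate the file internally, only a way to see what the rules say.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "http://example.org/private/notes.html",
    ):
        print(url, allowed_by_robots(EXAMPLE_ROBOTS, url))
```

For a real site you would first save its /robots.txt and pass the text in; whether to honor what it says is the policy question discussed above.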

On Oct 21, 2010, at 6:33 PM, Sari wrote:
...
I've been using HTTrack http://www.httrack.com/ (suggested by Jeremy) for a
while. Unfortunately, it sometimes breaks off the crawl at the very
beginning. I am not sure why, but I suppose it is related to the structure
of the website, or of the portion of the website you are trying to download
for offline browsing.
I've switched to the HTML Spider in the Free Download Manager
http://www.freedownloadmanager.org/ and haven't faced any problems since
then. Adjusting the crawling settings in this spider (depth,
including/excluding images, including/excluding files, etc.) is much easier
than adjusting them in HTTrack.
In an early release I used, HTTrack was silently fetching the whole of Yahoo
for me =)
/Sari
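
The crawl settings mentioned here (depth, including or excluding images) have rough counterparts in wget, the other tool named in this thread. The sketch below, in Python, only builds and prints a wget command line; it does not run it. The function name build_wget_command, the default values, and the list of image suffixes are assumptions chosen for illustration, and the flags should be checked against the manual for your wget version.

```python
import shlex


def build_wget_command(start_url, depth=2, include_images=True, wait_seconds=1):
    """Build (but do not run) a wget argument list for a bounded recursive crawl."""
    cmd = [
        "wget",
        "--recursive",             # follow links from the start page
        f"--level={depth}",        # stop after this many link hops
        "--no-parent",             # do not climb above the starting directory
        "--page-requisites",       # also fetch CSS, images, etc. needed to render pages
        "--convert-links",         # rewrite links so the copy browses offline
        f"--wait={wait_seconds}",  # pause between requests
    ]
    if not include_images:
        # Assumed list of image suffixes; extend it as needed.
        cmd.append("--reject=jpg,jpeg,png,gif")
    cmd.append(start_url)
    return cmd


if __name__ == "__main__":
    # example.org is a placeholder, not a site from this thread.
    command = build_wget_command("http://example.org/", depth=1, include_images=False)
    print(" ".join(shlex.quote(part) for part in command))
```

Keeping the depth small and leaving host spanning switched off (wget's default for recursive retrieval) is what guards against the "whole of Yahoo" kind of runaway crawl.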
On Thu, Oct 21, 2010 at 11:16 PM, WL Wong <wwon8281@uni.sydney.edu.au> wrote:
...
Sarah
I use WebCite http://www.webcitation.org/ and Evernote
http://www.evernote.com/.
Cheers
WL
On 22/10/2010, at 1:33 AM, Adi Kuntsman wrote:
...
Dear Sarah
I am using Zotero, which is a free add-on to Firefox
http://www.zotero.org/
Good thing about it: it takes captures of webpages as they are at any
particular moment + creates info on URL, date of access etc. (Zotero was
originally developed as a tool to create and share bibliographies).
Files are easy to organise into folders and subfolders, and I think there is
an option to have your archive stored on the Zotero site, to be able to share
(I haven't explored this as I work alone on my project).
Not so good thing: it can't download videos, so you will need to download
them separately.
I am sure there are other, better ways, so I look forward to other responses.
Adi
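
The idea of keeping the URL and date of access with each capture can be sketched in a few lines of Python for pages saved by other means. This is an illustration only and not a description of how Zotero stores its snapshots; the function name snapshot_record and the field names are hypothetical, and the content hash is an added assumption, included so that a later reader can check that a saved file has not changed.

```python
import hashlib
from datetime import datetime, timezone


def snapshot_record(url, content, accessed=None):
    """Describe one saved page: where it came from, when, and a checksum of its bytes."""
    if accessed is None:
        accessed = datetime.now(timezone.utc)
    return {
        "url": url,
        "accessed": accessed.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "bytes": len(content),
    }


if __name__ == "__main__":
    # Placeholder page content and URL, for illustration only.
    page = b"<html></html>"
    print(snapshot_record("http://example.org/", page))
```

Writing one such record next to each saved file (for example as a line in a log) keeps the provenance with the archive when it is shared.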
--
Dr. Adi Kuntsman
Leverhulme Early Career Fellow
Research Institute for Cosmopolitan Cultures
The University of Manchester
Second Floor, Arthur Lewis Building, room 2.007
Oxford Road, Manchester M13 9PL, UK
http://www.socialsciences.manchester.ac.uk/ricc/index.html
http://adi.kuntsman.googlepages.com
________________________________
From: Sarah Oates <s.oates@lbss.gla.ac.uk>
To: air-l@listserv.aoir.org
Sent: Thu, October 21, 2010 3:25:47 PM
Subject: [Air-L] the best way to archive web material?
Hello and apologies if this has been asked recently or seems a bit basic!
Does anyone have a recommendation for software to archive web material? I am
heading a project to study political activism on the Russian internet and we
need to store a range of different types of web pages across time ... I can't
even get my PC to store a small amount with full images. My research
partner in Ukraine can, but she has a Mac (not an option available at my
university right now). I have a small budget to buy some software, although
freeware suggestions are always appreciated. I want to have the archive
complete so that we can work with it, share it with other researchers, go
back to it as necessary, etc., so I really want to have full graphics etc.
Optimally, it would be something that could do automatic crawls and downloads
as well, although as we are tending to focus on relatively short periods of
intense interest around particular issues/events, we don't need a long-term
crawl system.
Suggestions from this clever and useful list are most welcome, although
currently this list is making me sad that I am not in Sweden to meet people
at exciting venues and hear what I am sure is some great work (:
Sincerely
Sarah
Sarah Oates
Professor of Political Communication
School of Social and Political Sciences
Adam Smith Building
University of Glasgow
Glasgow G12 8RT
Email: sarah.oates@glasgow.ac.uk
Website: http://www.media-politics.com/
Telephone: (0)141 330 5124
The University of Glasgow, charity number SC004401
_______________________________________________
The Air-L@listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at:
http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers:
http://www.aoir.org/
Jeremy Hunsinger
Center for Digital Discourse and Culture
Virginia Tech

Words are things; and a small drop of ink, falling like dew upon a thought, produces that which makes thousands, perhaps millions, think. --Byron
