[Air-L] a question about privacy protection and copyright in Internet research

8 May 2011

Hi all,

Just a short comment about robots.txt. Whether web archives should respect robots.txt is not as unequivocal as Jeremy puts it below - there are national differences. The national Danish web archive Netarkivet.dk is based on the "Act no. 1439 of December 22, 2004 on Legal Deposit of Published Material", which states that the archive is not obliged to respect robots.txt.

The rationale for this is that once it's out there it's out there, and as long as it's accessible to everyone it's 'public' - thus it's part of the cultural heritage, thus it should be archived. This also applies to password-protected material IF everyone could get a password, no matter whether s/he had to pay for it. And if you don't want it to be public, then don't put it out there - robots.txt does not prevent people from seeing it, so it's public and will be archived. In the Danish legislation, ignoring robots.txt is considered necessary in order to collect all relevant material.
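
To illustrate the point that robots.txt is a request addressed to crawlers rather than an access control, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the example.org URLs and the user-agent name "ExampleArchiveBot" are invented for the illustration; this is not a description of how Netarkivet's harvester works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text does not ask user_agent to stay away from url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.org/index.html",
        "http://example.org/private/diary.html",
    ):
        # "ExampleArchiveBot" is a made-up crawler name.
        print(url, is_allowed(ROBOTS_TXT, "ExampleArchiveBot", url))
```

The check only reports what the site owner has asked of crawlers. Nothing in it stops a crawler, or a person with a browser, from fetching the page anyway, which is the sense in which the material stays publicly accessible.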

However, access to the web archive is very restricted: it is open only to scholars, and you have to apply for access.

If interested, the Danish web archive has made a short fact sheet: http://netarkivet.dk/publikationer/Fact%20sheet%20Webarchiving%20in%20Denmar...

And a FAQ: http://netarkivet.dk/faq/index-en.php - and about robots.txt: http://netarkivet.dk/faq/index-en.php#faq_robots

Best,

Niels Brügger

-----------
Message: 9
Date: Sat, 7 May 2011 11:18:04 -0400
From: jeremy hunsinger <jhuns@vt.edu>
To: aoir list <air-l@aoir.org>
Subject: Re: [Air-L] a question about privacy protection and copyright in Internet research
Message-ID: <B7115BE4-D7AA-4B72-8C70-CF029ADF6AEE@vt.edu>
Content-Type: text/plain; charset=us-ascii

I think the tendency is to muddy the waters immensely here, but I don't think we need to muddy the waters with regard to the Document vs Research Subject distinction. If you were allowed to research Facebook, then you could do it either way or both ways: having subjects and documents, having just documents, or having just subjects. But once we are dealing with documents, the only question we have is whether those are published documents or not.

I think that muddying the waters and continuing to say it is not simple, as we are inclined to do as academics, is going to continue to cause us grief and possibly prevent perfectly reasonable research, and thus I think we should embark on the other strategy that says: 'We can make this simple'. If it is published, it is public and open to research, and you determine whether it is published using these guidelines. If it is not published, then what is it? Is it a private diary? Is it a private letter? Who has rights to the material, and how can it be released for research? If you are dealing with research subjects, in what way are you doing that? If you are just reading their postings, you are not interacting with them and not creating research subjects. If you are doing an ethnography or participatory or action research, then yes, you are interacting with them and you are creating human subjects. In short, a matrix of methods in relation to their objects would greatly clarify the document vs subject distinction.

In terms of public versus private on the web, my position is more or less that if you put it on the web and you do not protect it in some manner via legal device, technical system, or otherwise, then you are producing a public document and that's the end of it. I would argue that it does not matter if that was not your intent or that you wanted it to be private; what matters is that you committed something to the public record. While you can withdraw it, once it is distributed you might find that very difficult, and withdrawing it likely doesn't change its public status; it just changes the ease of access to that data.

Robots.txt should not be ignored. It is one of the technical means that people use to secure their property on the web. If there is a robots.txt and it prevents you from making a copy of something, then I'd guess that the owners of the material do not want you to use it for research, and you'd need to get permission.

------------------------------------------------------------

NIELS BRÜGGER, Associate Professor, PhD
Director, the Centre for Internet Research
Department of Information and Media Studies
Aarhus University
Helsingforsgade 14
8200 Aarhus N
Denmark

Phone (switchboard)   +45 8942 1111
Phone (direct)               +45 8942 9226
Telefax                           + 45 8942 5950
E-mail                             nb@imv.au.dk
Webpage                       http://imv.au.dk/~nb

Profile at LinkedIn: http://www.linkedin.com/pub/1/50a/555
Profile at Kommunikationsforum [in Danish]: www.kommunikationsforum.dk/Niels-Brugger

The Centre for Internet Research               http://cfi.au.dk
The history of dr.dk, 1996-2006                  http://drdk.dk
LARM (Radio Culture and Auditory Resources Research Infrastructure) http://www.larm-archive.org