Hi all,

Just a short comment about robots.txt.

Whether robots.txt should be respected or not by web archives is not as unequivocal as put by Jeremy below - there are national differences.

The national Danish web archive Netarkivet.dk is based on the "Act no. 1439 of December 22, 2004 on Legal Deposit of Published Material", which states that the archive is not obliged to respect robots.txt. The rationale for this is that once it's out there it's out there, and as long as it's accessible to everyone it's 'public' - thus it's part of the cultural heritage, thus it should be archived. This also applies to password-protected material IF everyone could get a password, no matter if s/he had to pay for it. And if you don't want it to be public, then don't put it out there - robots.txt does not prevent people from seeing it, so it's public and will be archived. Under the Danish legislation, ignoring robots.txt is considered necessary to collect all relevant material.

However, access to the web archive is very restricted: only scholars can get access, and they have to apply for it.

If interested, the Danish web archive has made a short fact sheet:
http://netarkivet.dk/publikationer/Fact%20sheet%20Webarchiving%20in%20Denmar...

And a FAQ:
http://netarkivet.dk/faq/index-en.php
- and about robots.txt:
http://netarkivet.dk/faq/index-en.php#faq_robots

Best,
Niels Brügger

-----------

Message: 9
Date: Sat, 7 May 2011 11:18:04 -0400
From: jeremy hunsinger <jhuns@vt.edu>
To: aoir list <air-l@aoir.org>
Subject: Re: [Air-L] a question about privacy protection and copyright in Internet research
Message-ID: <B7115BE4-D7AA-4B72-8C70-CF029ADF6AEE@vt.edu>
Content-Type: text/plain; charset=us-ascii

I think the tendency is to muddy the waters immensely here, but I also don't think we need to muddy the waters with regard to the Document vs Research Subject distinction.
If you were allowed to research Facebook, then you could do it either way or both ways: having subjects and documents, having just documents, or having just subjects. But once we are dealing with documents, the only question we have is whether those are published documents or not.

I think that muddying the waters and continuing to say it is not simple, as we are inclined to do as academics, is going to continue to cause us grief and possibly prevent perfectly reasonable research, and thus I think we should embark on the other strategy that says: 'We can make this simple'. If it is published, it is public and open to research; you determine whether it is published using these guidelines. If it is not published, then what is it? Is it a private diary? Is it a private letter? Who has rights to the material, and how can it be released for research?

If you are dealing with research subjects, in what way are you doing that? If you are just reading their postings, you are not interacting with them and not creating research subjects. If you are doing an ethnography or participatory or action research, then yes, you are interacting with them and you are creating human subjects. In short, a matrix of methods in relation to their objects would greatly clarify the document vs subject distinction.

In terms of public/private on the web, my position is more or less that if you put it on the web and you do not protect it in some manner via legal device, technical system, or otherwise, then you are producing a public document and that's the end of it. I would argue that it does not matter if that was not your intent or that you wanted it to be private; what matters is that you committed something to the public record. While you can withdraw it, once it is distributed you might find that very difficult, and withdrawing likely doesn't change its public status; it just changes the ease of access to that data.

Robots.txt should not be ignored.
It is one of the technical means that people use to secure their property on the web. If there is a robots.txt and it prevents you from making a copy of something, then I'd guess that the owners of the material do not want you to use it for research, and you'd need to get permission.

------------------------------------------------------------

LATEST PUBLICATIONS AND PAPERS

February 2011
"Web Archiving — Between Past, Present, and Future", The Handbook of Internet Studies (eds. M. Consalvo, & C. Ess), Wiley-Blackwell 2011, pp. 24-42
Read more on the publisher's website: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1405185880.html

April 2010
Web History (ed.), Peter Lang Publishing, New York 2010
Read more on the publisher's website: http://www.peterlang.com/index.cfm?vID=310468&vLang=E&vHR=1&vUR=2&vUUR=1

January 2010
Website Analysis, Papers from The Centre for Internet Research, 12, The Centre for Internet Research, Aarhus 2010
[ get an electronic copy at: http://cfi.au.dk/en/publications/paper/#a12 ]

NIELS BRÜGGER, Associate Professor, PhD
Director, the Centre for Internet Research
Department of Information and Media Studies
Aarhus University
Helsingforsgade 14
8200 Aarhus N
Denmark

Phone (switchboard) +45 8942 1111
Phone (direct) +45 8942 9226
Telefax +45 8942 5950
E-mail nb@imv.au.dk
Webpage http://imv.au.dk/~nb

Profile at LinkedIn: http://www.linkedin.com/pub/1/50a/555
Profile at Kommunikationsforum [in Danish]: www.kommunikationsforum.dk/Niels-Brugger

The Centre for Internet Research
http://cfi.au.dk

The history of dr.dk, 1996-2006
http://drdk.dk

LARM (Radio Culture and Auditory Resources Research Infrastructure)
http://www.larm-archive.org