Re: [Air-L] Ethics of using hacked data.
Dear Nathaniel,

Interesting question! Apologies for this very late answer (due to travel to tech ethics workshops!). I will attempt an answer, but take into account my background as an information lawyer and my project on the ethics of networked systems research (so more about Internet architecture experimentation, less about the use of content: http://ensr.oii.ox.ac.uk). I will not try to give formal legal advice here, just some things to consider. There are many subtleties that I won’t go into.

The most important point to consider is the following: **Just because data is available doesn’t mean you’re not violating rights or ethical principles by using it (for research or otherwise). In other words: just because you can doesn’t mean you should.**

Some analogies: the fact that you can tape music from the radio and sell the tapes at a local market doesn’t mean you’re not violating copyright. Yes, you can collect WiFi signal data while you’re taking pictures of every corner of the Earth, but Google learned that this doesn’t make it legal. In short, information isn’t always as free as it technically appears to be.

(1) Privacy laws, data protection frameworks, and ethical principles likely do apply to you in this case. None of the people whose data was breached consented to be part of your research. It’s not clear to me exactly which data you’d want to use to create aggregated measures, but be aware that processing any information that is linked to an identifiable person would likely constitute a breach of relevant privacy/data protection laws. By identifiable I mean identifiable by anyone, including those with more computing capacity or resources than you (it is difficult to say where to draw the line in this hypothetical assessment).

(2) Your statement “I don't see how currently anyone could have an expectation of privacy any more” doesn’t hold. These people had an expectation of privacy at the time of communicating this data, or interacting with the website.
They did not intend for their data to be used for academic research, or for any other processing outside the context of the service they trusted. You’d be changing the context and the audience of this information. Besides the legal issue addressed in the previous paragraph, you should take into account the potential harm of using the information in the new context and audience that you’re creating for it (including through the dissemination of your papers and datasets). Even the fact that it’s now leaked doesn’t change this per se. Maybe the audience is now script kiddies, spammers, and intelligence agencies, but you’d be adding even more unintended audiences to this.

So if you use data that can somehow be used to identify someone, you’re likely breaching laws. This is less so if you’re only using aggregated data, but I wonder how you'd construct this data without using personal data.

Further, even if it turns out you’ve found a way to not violate laws, which is possible, you’re still in an ethical grey zone:

(3) It could be argued that by using this data, you’re (implicitly) condoning the act of hacking and publishing this data. Stronger still: if you profit from using information from this breach (publications and other career enhancements), you could entice others to also work with leaked data, thereby potentially incentivising (and even justifying) hackers for their acts (“for science!”). These statements may be a bit far-fetched for some, but there is some value in thinking about this. You’d be setting a precedent regarding working with hacked data that may be difficult to reverse.

I appreciate that huge data sets like this one are a social scientist’s dream come true, but better ways must be found to access them.

Although this is up for debate, it seems to me that academics are perceived to have a different ethical framework than activists or journalists.
They, too, need to take ethics into account, but their purposes and perceived benefits differ, so the whole weighing and justification process is different.

In a recently published extended workshop report of the above-named project, we discuss some cases that are similar to yours. We don’t discuss hacked data directly, but some of the considerations and lessons drawn from this report will be useful for your thinking about this. Find it here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2666934

Also, feel free to present this case to a panel for Networking and Security that we’re running for a more detailed response (I can’t cover it all here): https://www.ethicalresearch.org/efp/netsec/

Finally, I’d be interested in writing up this case as a case study with you for a particular venue, but we can discuss this between us.

Regards,
Ben

------------------------------------------------
Ben Zevenbergen
DPhil (PhD) Candidate
Oxford Internet Institute
University of Oxford
Senior Fellow
Open Technology Fund

On 07/10/2015 16:55, "air-l-request@listserv.aoir.org" <air-l-request@listserv.aoir.org> wrote:
Date: Wed, 7 Oct 2015 16:11:31 -0400
From: Nathaniel Poor <natpoor@gmail.com>
To: AOIR <air-l@listserv.aoir.org>
Subject: [Air-L] Ethics of using hacked data.
Message-ID: <CACdJtt91rBo4BG80Svnd7NO9AOKjVbo5kUVus9V0VmazsmvrJw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Hello list-
I recently got into a discussion with a colleague about the ethics of using hacked data, specifically the Patreon hacked data (see here: http://arstechnica.com/security/2015/10/gigabytes-of-user-data-from-hack-of-patreon-donations-site-dumped-online/ ).
He and I do crowdfunding work and had wanted to look at Patreon, but as far as I can tell they have no easy hook into all their projects (for scraping), so to me this data hack was like a gift! But he said there was no way we could use it. We aren't doing sentiment analysis or anything; we would use aggregated measures like funding levels and then report things like means and maybe a regression, so there would be no identifiable information whatsoever derived from the hacked data in any of our resulting work (we might go to the site and pull some quotes).
I looked at the AoIR ethics guidelines ( http://aoir.org/reports/ethics2.pdf ), and didn't see anything specifically about hacked data (I don't think "hacked" is the best word, but I don't like "stolen" either, but those are different discussions).
One relevant line I noticed was this one: "If access to an online context is publicly available, do members/participants/authors perceive the context to be public?" (p. 8) So, the problem with the data is that it's the entire website, so some was private and some was public, but now it's all public and everyone knows it's public.
I agree that a lot of the data in the data dump had been intended to be private -- apparently, direct messages are in there -- but we wouldn't use that data (it's not something we're interested in). We'd use data like number of funders and funding levels and then aggregate everything. I see that some of it was meant to be private, but given that the entire site was hacked and exported, I don't see how currently anyone could have an expectation of privacy any more. I'm not trying to torture the definition; it's just that it was private until it wasn't.
I can see that some academic researchers -- at least those in computer security -- would be interested in this data and should be able to publish in peer reviewed journals about it, in an anonymized manner (probably as an example of "here's a data hack like what we are talking about, here's what hackers released").
I also think that probably every script kiddie has downloaded the data, as has every grey and black market email list spammer, and probably every botnet purveyor (for passwords) and maybe even the hacking arm of the Chinese army and the NSA. My point here is that if we were to use the data in academic research we wouldn't be publicizing it to nefarious people who would misuse it since all of those people already have it. We could maybe help people who want to use crowdfunding some (hopefully!) if we have some results. (I guess I don't see that we would be doing any harm by using it.)
So, what do people think? Did I miss something in the AoIR guidelines? I realize I don't think it's clear either way, or I wouldn't be asking, so probably the answers will point to this as a grey area (so why do I even ask, I am not sure).
But I'm not looking for "You can't use it because it's hacked," because I don't think that explains anything. I could counter that with "It is publicly available found data," because it is, although I don't think that's the best reply either. Both lack nuance.
-Nat
-- Nathaniel Poor, Ph.D. http://natpoor.blogspot.com