Re: [Air-L] Ethics of using hacked data.

19 Oct 2015

Dear Nathaniel,

Interesting question! Apologies for this very late answer (due to travel
to tech ethics workshops!).

I will attempt an answer, but take into account my background as an
Information lawyer and my project on the ethics of networked systems
research (so more about Internet architecture experimentation, less about
the use of content: http://ensr.oii.ox.ac.uk). I will not try to give
formal legal advice here, just some things to consider. There are many
subtleties that I won’t go into. The most important point to consider is
the following:

**Just because data is available, doesn’t mean you’re not violating rights
or ethical principles by using it (research or otherwise). In other words:
Just because you can, doesn’t mean you should.**

Some analogies: even though you can tape music from the radio and sell the
tapes at a local market, that doesn’t mean you’re not violating copyright.
Yes, you can collect WiFi signal data while you’re taking pictures of every
corner of the Earth, but Google learned that this doesn’t make it legal. In
short, information isn’t always as free as it technically appears to be.

(1) Privacy laws, data protection frameworks, and ethical principles
likely do apply to you in this case. None of these people whose data was
breached consented to be part of your research. It’s not clear to me
exactly which data you’d want to use to create aggregated measures, but
just be aware that processing any information that is linked to an
identifiable person would likely constitute a breach of relevant
privacy/data protection laws. By identifiable I mean identifiable by
anyone, including those with more computing capacity or resources than you
(difficult to say where to draw the line in this hypothetical assessment).

(2) Your statement “I don't see how currently anyone could have an
expectation of privacy any more” doesn’t hold. These people had an
expectation of privacy at the time of communicating this data, or
interacting with the website. They did not intend for their data to be
used for academic research, or any other processing outside of the context
of the service they trusted. You’d be changing the context and the
audience of this information. Besides the legal issue addressed in the
previous paragraph, you should take into account the potential harm of
using the information in this new context & audience that you’re creating
for this information (including the dissemination of your papers and datasets).
Even the fact that it’s now leaked doesn’t change this per se. Maybe the
audience is now script kiddies, spammers, and intelligence agencies, but
you’d be adding even more unintended audiences to this.

So if you use data that can be used to identify someone, somehow, you’re
likely breaching laws. This is less so if you’re only using aggregated
data, but I wonder how you'd construct this data without using personal
data.

Further, even if it turns out you’ve found a way to not violate laws,
which is possible, you’re still in an ethical grey zone:

(3) It could be argued that by using this data, you’re (implicitly)
condoning the act of hacking and publishing this data. Stronger still: if
you profit from using information from this breach (publications and other
career enhancements), you could entice others to also work with leaked
data, therefore potentially incentivising (and even justifying) hackers
for their acts (“for science!”). These statements may be a bit far-fetched
for some, but there is some value to thinking about this. You’d be setting
a precedent regarding working with hacked data that may be difficult to
reverse.

I appreciate that huge data sets like this one are a social scientist’s
dream come true, but better ways must be found to access them.

Although this is up for debate, it seems to me that academics are
perceived to have a different ethical framework than activists or
journalists. They, too, need to take into account ethics, but their
purposes and perceived benefits differ, so the whole weighing and
justification process is different.

In a recently published extended workshop report of the above-named
project, we discuss some cases that are similar to yours. We don’t discuss
hacked data directly, but some of the considerations and lessons drawn
from this report will be useful for your thinking about this. Find it
here: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2666934

Also, feel free to present this case to a panel for Networking and
Security that we’re running for a more detailed response (I can’t cover it
all here): https://www.ethicalresearch.org/efp/netsec/

Finally, I’d be interested in writing up this case as a case study with you
for a particular venue, but we can discuss this between us.

Regards,

Ben

------------------------------------------------
Ben Zevenbergen
DPhil (PhD) Candidate
Oxford Internet Institute
University of Oxford
Senior Fellow Open Technology Fund

On 07/10/2015 16:55, "air-l-request@listserv.aoir.org"
<air-l-request@listserv.aoir.org> wrote:
...
Date: Wed, 7 Oct 2015 16:11:31 -0400
From: Nathaniel Poor <natpoor@gmail.com>
To: AOIR <air-l@listserv.aoir.org>
Subject: [Air-L] Ethics of using hacked data.
Message-ID:
  <CACdJtt91rBo4BG80Svnd7NO9AOKjVbo5kUVus9V0VmazsmvrJw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Hello list-
I recently got into a discussion with a colleague about the ethics of using
hacked data, specifically the Patreon hacked data (see here:
http://arstechnica.com/security/2015/10/gigabytes-of-user-data-from-hack-of-patreon-donations-site-dumped-online/).
He and I do crowdfunding work, and had wanted to look at Patreon, but as
far as I can tell they have no easy hook into all their projects (for
scraping), so, to me this data hack was like a gift! But he said there was
no way we could use it. We aren't doing sentiment analysis or anything, we
would use aggregated measures like funding levels and then report things
like means and maybe a regression, so there would be no identifiable
information whatsoever derived from the hacked data in any of our resulting
work (we might go to the site and pull some quotes).
I looked at the AoIR ethics guidelines
(http://aoir.org/reports/ethics2.pdf), and didn't see anything specifically
about hacked data (I don't think "hacked" is the best word, but I don't like
"stolen" either, but those are different discussions).
One relevant line I noticed was this one:
"If access to an online context is publicly available, do
members/participants/authors perceive the context to be public?" (p. 8)
So, the problem with the data is that it's the entire website, so some was
private and some was public, but now it's all public and everyone knows
it's public.
To me, I agree that a lot of the data in the data-dump had been intended to
be private -- apparently, direct messages are in there -- but we wouldn't
use that data (it's not something we're interested in). We'd use data like
number of funders and funding levels and then aggregate everything. I see
that some of it was meant to be private, but given the entire site was
hacked and exported I don't see how currently anyone could have an
expectation of privacy any more. I'm not trying to torture the definition,
it's just that it was private until it wasn't.
I can see that some academic researchers -- at least those in computer
security -- would be interested in this data and should be able to publish
in peer reviewed journals about it, in an anonymized manner (probably as an
example of "here's a data hack like what we are talking about, here's what
hackers released").
I also think that probably every script kiddie has downloaded the data, as
has every grey and black market email list spammer, and probably every
botnet purveyor (for passwords) and maybe even the hacking arm of the
Chinese army and the NSA. My point here is that if we were to use the data
in academic research we wouldn't be publicizing it to nefarious people who
would misuse it since all of those people already have it. We could maybe
help people who want to use crowdfunding some (hopefully!) if we have some
results. (I guess I don't see that we would be doing any harm by using it.)
So, what do people think? Did I miss something in the AoIR guidelines? I
realize I don't think it's clear either way, or I wouldn't be asking, so
probably the answers will point to this as a grey area (so why do I even
ask, I am not sure).
But I'm not looking for "You can't use it because it's hacked," because I
don't think that explains anything. I could counter that with "It is
publicly available found data," because it is, although I don't think
that's the best reply either. Both lack nuance.
-Nat
-- 
Nathaniel Poor, Ph.D.
http://natpoor.blogspot.com

Bendert Zevenbergen
