Re: [Air-L] Anonymizing Twitter handles
If you do qualitative research and, e.g., cite tweets, it's useless to anonymize (as Michael states). If you do quantitative research, the findings will be so abstract that individual users will be hard to trace, though not impossible, depending on the number of variables and the number of user handles.
Still, anonymizing is fairly easy when you have the data in a statistical program such as SPSS, R or even Excel: replace the user handles with a unique number (from 1 to N), then remove the user handles from the dataset. I would still advise always keeping a secure file with both key variables, the user handles and the new identifier, for future research.
Hope this helps.
Maurice
On Fri, Apr 14, 2017 at 5:11 AM, Ye Na Lee <jpt2007@berkeley.edu> wrote:
Dear subscribers to the Association of Internet Researchers,
I am currently going through the IRB process for research on Twitter data, and I was told to anonymize Twitter handles completely. Are there any online programs with which I could anonymize usernames? I don't think I should create fake Twitter handles for every single tweet that I quote in my paper. I'd really appreciate any suggestions on anonymizing Twitter handles!
Thank you in advance!
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
--
Maurice Vergeer
To contact me, see http://mauricevergeer.nl/node/5
For yesterday's news in perspective: http://www.echovannl.nl/
To see my publications, see http://mauricevergeer.nl/node/1
PGP public key: https://keys.mailvelope.com/pks/lookup?op=get&search=0xE7BF24D19BE34017
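The replace-and-keep-a-key approach Maurice describes above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the data is taken to be a list of dicts, and the field names user_handle and user_id, as well as the function name pseudonymize, are hypothetical and not from the thread.

```python
def pseudonymize(rows, handle_field="user_handle"):
    """Replace each handle with a number from 1 to N.

    Returns (clean_rows, key): clean_rows no longer contain the handle,
    and key maps each handle to its number so the link can be restored.
    """
    key = {}  # handle -> numeric ID, assigned in order of first appearance
    clean_rows = []
    for row in rows:
        # Twitter handles are case-insensitive, so normalize before lookup.
        handle = row[handle_field].lower()
        if handle not in key:
            key[handle] = len(key) + 1
        # Copy every field except the handle, then add the numeric ID.
        clean = {k: v for k, v in row.items() if k != handle_field}
        clean["user_id"] = key[handle]
        clean_rows.append(clean)
    return clean_rows, key


# Hypothetical example data.
rows = [
    {"user_handle": "alice", "text": "first tweet"},
    {"user_handle": "bob", "text": "second tweet"},
    {"user_handle": "Alice", "text": "third tweet"},
]
clean_rows, key = pseudonymize(rows)
```

Here clean_rows is what would be analyzed or shared, while key corresponds to the separate file that should be stored securely and never published alongside the data.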
Hello,
there are also other things to keep in mind, for both ethical and legal purposes.
For example, there is the notion of geographic singularities: if a certain geographical area contains only one or a few points (for example, georeferenced tweets or Instagram posts), it could be fairly easy to identify the people behind them even if their nicknames have been anonymized. In this case further levels of anonymization would be needed (or even removal of the items).
In any case, there is a wealth of cases in which the content alone, or its context, makes it easy to identify the subject even when the handle is anonymized. These require more complex actions to preserve people's rights.
cheers!
Salvatore
--
[MUTATION] Art is Open Source - http://www.artisopensource.net
[CITIES] Human Ecosystems Relazioni - http://he-r.it <http://human-ecosystems.com/>
[NEAR FUTURE DESIGN] Nefula Ltd - http://www.nefula.com
[RIGHTS] Ubiquitous Commons - http://www.ubiquitouscommons.org
---
Professor of Near Future and Transmedia Design at ISIA Design Florence: http://www.isiadesign.fi.it/
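One possible way to act on the geographic singularities mentioned above is to drop geotagged items that sit alone, or nearly alone, in their area. The sketch below is an assumption about how that might be done, not Salvatore's method: it rounds coordinates to a coarse grid and suppresses cells with fewer than k points. The function name, the threshold k and the grid size (decimals) are all hypothetical choices.

```python
from collections import Counter


def suppress_sparse_cells(points, k=3, decimals=2):
    """Keep only points whose rounded (lat, lon) cell holds at least k points."""

    def cell(point):
        # Rounding to 2 decimals gives cells of roughly 1 km in latitude.
        return (round(point["lat"], decimals), round(point["lon"], decimals))

    counts = Counter(cell(p) for p in points)
    return [p for p in points if counts[cell(p)] >= k]


# Hypothetical example: three nearby points and one isolated point.
points = [
    {"id": 1, "lat": 43.7696, "lon": 11.2558},
    {"id": 2, "lat": 43.7702, "lon": 11.2561},
    {"id": 3, "lat": 43.7711, "lon": 11.2570},
    {"id": 4, "lat": 44.1234, "lon": 10.9876},
]
kept = suppress_sparse_cells(points, k=3)
```

A fixed grid is crude (two close points can fall on either side of a cell boundary), and it does nothing about identification through the content of a post, which needs separate handling.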
On 14 Apr 2017, at 7:47 , Maurice Vergeer <m.vergeer@maw.ru.nl> wrote:
Still, anonymizing is fairly easy when you have the data in a statistical program such as SPSS, R or even Excel: replace the user handles with a unique number (from 1 to N), then remove the user handles from the dataset. I would still advise always keeping a secure file with both key variables, the user handles and the new identifier, for future research.
If you hash the user handle, e.g. with SHA-1 or similar (which is even possible in Excel with a small formula), there is no need to keep a correspondence file, because hashing a string will always yield the same hash, while making reversal virtually impossible (i.e. you cannot get the handle from the hash).
best,
Bernhard
--
Bernhard Rieder | Associate Professor | New Media and Digital Culture
University of Amsterdam | Turfdraagsterpad 9 | 1012 XT Amsterdam | The Netherlands
http://thepoliticsofsystems.net | http://rieder.polsys.net | https://www.digitalmethods.net | @RiederB
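As a rough illustration of the hashing idea above, here is a minimal Python sketch using the standard hashlib module. The function name hash_handle and the normalization step (stripping a leading @ and lowercasing) are assumptions added for the example; the hash is unsalted, exactly as described in this message.

```python
import hashlib


def hash_handle(handle):
    """Return the SHA-1 hex digest of a normalized Twitter handle."""
    # Normalize so "@RiederB" and "riederb" map to the same pseudonym.
    normalized = handle.lstrip("@").lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# The same handle always yields the same 40-character pseudonym,
# so no correspondence file is needed to link a user's tweets.
pseudonym = hash_handle("@RiederB")
```

Because the mapping is deterministic, the same user gets the same pseudonym across datasets and over time, which is what removes the need for a key file.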
There are many good reasons to anonymize Tweets during the research process (reducing annotator bias, for example) and definitely during the presentation of results (particularly controversial Tweets). Indeed, the visual presentation of sensational individual Tweets is something ethicists and IRBs might caution against, despite the public nature of the platform. Going further, you have to consider the ethical obligation not to publicly display deleted Tweets, though I don't think this would extend to public figures, like @realdonaldtrump.
Having said that, Tweets have considerably less "meaning" when you hide the Twitter handles. Context is lost, so there is a big trade-off.
DiscoverText has an automated redaction capability that can remove or obscure all the Twitter handles at once. Here is an example of an archive consisting of replies to a Tweet status ID where the start of every Tweet is a Twitter handle:
https://drive.google.com/file/d/0B1iEonkdfwKua0lmWndNZTkyWXM/view?usp=sharin...
This (underutilized) functionality is part of a Freedom of Information Act (FOIA) capability including a "dirty word tool" that members of this list helped to create about 5 years ago.
If any member of this list would like to experiment with the redaction tools, just shoot us an email (info@texifter.com) and I will put you in a special sponsored sandbox for redaction experiments and give you a web demo, and we will provide complimentary Gnip and Search API access to play with.
~Stu
-- Dr. Stuart W. Shulman Founder and CEO, Texifter LinkedIn: http://www.linkedin.com/in/stuartwshulman
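To give an idea of what redacting handles inside tweet text involves, here is a small regular-expression sketch in Python. It is not DiscoverText's implementation, which the thread does not describe; the pattern, the function name redact_handles and the placeholder string are assumptions. The pattern relies on Twitter usernames being 1 to 15 letters, digits or underscores.

```python
import re

# An @ that is not preceded by a word character or another @ (so e-mail
# addresses are skipped), followed by 1 to 15 username characters and
# then a non-username character or the end of the text.
HANDLE_RE = re.compile(r"(?<![A-Za-z0-9_@])@[A-Za-z0-9_]{1,15}(?![A-Za-z0-9_])")


def redact_handles(text, placeholder="[HANDLE]"):
    """Replace every @mention in text with a fixed placeholder."""
    return HANDLE_RE.sub(placeholder, text)


redacted = redact_handles("@realdonaldtrump @RiederB thanks for the reply!")
```

A fixed placeholder hides who is talking to whom; swapping in a per-user pseudonym instead would preserve some of the conversational context that is otherwise lost.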
On Fri, Apr 14, 2017 at 12:07:19PM +0100, Bernhard Rieder wrote:
If you hash the user handle, e.g. with SHA-1 or similar (which is even possible in Excel with a small formula), there is no need to keep a correspondence file, because hashing a string will always yield the same hash, while making reversal virtually impossible (i.e. you cannot get the handle from the hash).
If you do this, please add a "salt" string that only you know to the Twitter handle. If you only apply SHA-1, then although the inverse operation is not possible, the correspondence between the hashed string and the original handle can be guessed by brute force: hash a list of candidate handles and compare the results. But if you hash the string "twitter_handle_1" + "mysalt" = "twitter_handle_1mysalt", that cannot be done (unless you also publish the "salt").
Cheers,
JMM.
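The difference can be demonstrated with a short Python sketch. The handle and salt strings are the placeholders from the message above; the function names and the candidate list are hypothetical and only serve the illustration.

```python
import hashlib


def sha1_hex(text):
    """Return the SHA-1 hex digest of a string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def hash_with_salt(handle, salt):
    # Concatenate handle and secret salt, as in "twitter_handle_1" + "mysalt".
    return sha1_hex(handle + salt)


def brute_force(published_hash, candidate_handles):
    """Return the candidate whose plain SHA-1 matches the hash, or None."""
    for candidate in candidate_handles:
        if sha1_hex(candidate) == published_hash:
            return candidate
    return None


# An attacker who has a list of plausible handles but not the salt.
candidates = ["twitter_handle_1", "twitter_handle_2", "twitter_handle_3"]

unsalted = sha1_hex("twitter_handle_1")
salted = hash_with_salt("twitter_handle_1", "mysalt")

found_unsalted = brute_force(unsalted, candidates)  # recovers the handle
found_salted = brute_force(salted, candidates)      # finds nothing
```

In practice the salt should be long and random rather than a short word (Python's secrets.token_hex can generate one) and must be kept as securely as a key file would be. A keyed construction such as HMAC, available in Python's hmac module, is a common alternative to plain concatenation.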
participants (5)
- Bernhard Rieder
- José María Mateos
- Maurice Vergeer
- Shulman, Stu
- xDxD.vs.xDxD