Wikipedia article edit history extraction tools?
Hello Air-L list: This summer I'm doing research on Wikipedia entries in the field of Science and Technology Studies. A central question I'm asking is the extent to which this field, as it is now on Wikipedia, includes/features/references contributions made by women, feminist theorists, and feminist theory. To answer this, I'm gathering data on existing pages using a variety of mixed methods. I would like to ask for recommendations on tools for extracting the history of editing on a page. I want to see how many times a given article has been edited, by whom, and what types of edits and content contributions are made over time. So far, I've found the "history" tool on the Wikipedia page limited. I cannot see how many edits have been made on a particular article and understanding what kinds of edits are made (e.g. grammatical, content) requires going into each historical page view. I'd love to find a way to download the history of an article and extract the data into a spreadsheet -- perhaps this is a tall order. So far, I've found tools for extracting data on Wikipedia from the Digital Methods Initiative website (which was first introduced to me by this list serve! :)). Specifically, the program History Flow is useful to an extent for visualizing types of content contributions and edits over time. But there is no way to translate these visualizations into a spreadsheet format -- as far as I can tell -- so I've been doing that manually, somehow piecing together the history of edits on an article. Meanwhile, I was recommended a tool called WikiChecker ( http://en.wikichecker.com/article/?a=science_studies) but the summary format is limited and, at times, contradictory to data I get elsewhere. If anyone has any other tools or methods to suggest for ways to collect data on content contributions and edits on Wikipedia I would be most grateful. I'd also be happy to be in conversation with anymore interested in the concept of the project. I'm working on it as a part of the FemTechNet Initiative, spearheaded by Anne Balsamo and Alexandra Juhasz. I'm not sure if information on the initiative has circulated here, so I'll paste in a copy of the "call" which took place last spring. * http://aljean.files.wordpress.com/2012/05/femtechnet-long-form-invite-may-20... * Thank you, Monika -- Monika Sengul-Jones Graduate Student Communication & Science Studies University of California, San Diego msengul@ucsd.edu
Monika I am not sure how you will get the demography variables you obviously need. I use a handle to do my edits on Wikipedia. That's all you see in the edit history. Of course some like me may have a male first name in this handle or a female first name. In my legal studies BA we learned that we had to cite the first names of scholars because this allowed us to see the gender. Wikipedia do not know my gender. Unlike some paid web site that may have my credit card data and access to my gender which they could in turn share with a researcher I don't think Wikipedia have much real data about me they can share. -----Original Message----- From: air-l-bounces@listserv.aoir.org [mailto:air-l-bounces@listserv.aoir.org] On Behalf Of Monika Sengul-Jones Sent: August-14-12 6:39 PM To: air-l@listserv.aoir.org Subject: [Air-L] Wikipedia article edit history extraction tools? Hello Air-L list: This summer I'm doing research on Wikipedia entries in the field of Science and Technology Studies. A central question I'm asking is the extent to which this field, as it is now on Wikipedia, includes/features/references contributions made by women, feminist theorists, and feminist theory. To answer this, I'm gathering data on existing pages using a variety of mixed methods. I would like to ask for recommendations on tools for extracting the history of editing on a page. I want to see how many times a given article has been edited, by whom, and what types of edits and content contributions are made over time. So far, I've found the "history" tool on the Wikipedia page limited. I cannot see how many edits have been made on a particular article and understanding what kinds of edits are made (e.g. grammatical, content) requires going into each historical page view. I'd love to find a way to download the history of an article and extract the data into a spreadsheet -- perhaps this is a tall order. So far, I've found tools for extracting data on Wikipedia from the Digital Methods Initiative website (which was first introduced to me by this list serve! :)). Specifically, the program History Flow is useful to an extent for visualizing types of content contributions and edits over time. But there is no way to translate these visualizations into a spreadsheet format -- as far as I can tell -- so I've been doing that manually, somehow piecing together the history of edits on an article. Meanwhile, I was recommended a tool called WikiChecker ( http://en.wikichecker.com/article/?a=science_studies) but the summary format is limited and, at times, contradictory to data I get elsewhere. If anyone has any other tools or methods to suggest for ways to collect data on content contributions and edits on Wikipedia I would be most grateful. I'd also be happy to be in conversation with anymore interested in the concept of the project. I'm working on it as a part of the FemTechNet Initiative, spearheaded by Anne Balsamo and Alexandra Juhasz. I'm not sure if information on the initiative has circulated here, so I'll paste in a copy of the "call" which took place last spring. * http://aljean.files.wordpress.com/2012/05/femtechnet-long-form-invite-may-20 12.pdf * Thank you, Monika -- Monika Sengul-Jones Graduate Student Communication & Science Studies University of California, San Diego msengul@ucsd.edu _______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org Join the Association of Internet Researchers: http://www.aoir.org/
Hi Peter, hi all: Thanks for the response. Indeed, I'm not looking to ascertain demographic details of editors<http://www.dailymail.co.uk/femail/article-2174826/Does-Kate-Middletons-royal-wedding-gown-deserve-Wikipedia-page.html>, that's not within the scope of the project. The bulk of what I'm doing is content analysis of articles. A supplementary approach is to compare the history of edits and contributions over time across articles. I've using the history page on Wikipedia, IBM's History Flow, and WikiChecker for this. I am looking to see if there are other tools for extracting historical data on Wikipedia articles. Thanks! MSJ On Tue, Aug 14, 2012 at 4:02 PM, Peter Timusk <ptimusk@sympatico.ca> wrote:
Monika
I am not sure how you will get the demography variables you obviously need. I use a handle to do my edits on Wikipedia. That's all you see in the edit history. Of course some like me may have a male first name in this handle or a female first name. In my legal studies BA we learned that we had to cite the first names of scholars because this allowed us to see the gender. Wikipedia do not know my gender. Unlike some paid web site that may have my credit card data and access to my gender which they could in turn share with a researcher I don't think Wikipedia have much real data about me they can share.
-----Original Message----- From: air-l-bounces@listserv.aoir.org [mailto:air-l-bounces@listserv.aoir.org] On Behalf Of Monika Sengul-Jones Sent: August-14-12 6:39 PM To: air-l@listserv.aoir.org Subject: [Air-L] Wikipedia article edit history extraction tools?
Hello Air-L list:
This summer I'm doing research on Wikipedia entries in the field of Science and Technology Studies. A central question I'm asking is the extent to which this field, as it is now on Wikipedia, includes/features/references contributions made by women, feminist theorists, and feminist theory.
To answer this, I'm gathering data on existing pages using a variety of mixed methods. I would like to ask for recommendations on tools for extracting the history of editing on a page. I want to see how many times a given article has been edited, by whom, and what types of edits and content contributions are made over time. So far, I've found the "history" tool on the Wikipedia page limited. I cannot see how many edits have been made on a particular article and understanding what kinds of edits are made (e.g. grammatical, content) requires going into each historical page view. I'd love to find a way to download the history of an article and extract the data into a spreadsheet -- perhaps this is a tall order.
So far, I've found tools for extracting data on Wikipedia from the Digital Methods Initiative website (which was first introduced to me by this list serve! :)). Specifically, the program History Flow is useful to an extent for visualizing types of content contributions and edits over time. But there is no way to translate these visualizations into a spreadsheet format -- as far as I can tell -- so I've been doing that manually, somehow piecing together the history of edits on an article. Meanwhile, I was recommended a tool called WikiChecker ( http://en.wikichecker.com/article/?a=science_studies) but the summary format is limited and, at times, contradictory to data I get elsewhere.
If anyone has any other tools or methods to suggest for ways to collect data on content contributions and edits on Wikipedia I would be most grateful.
I'd also be happy to be in conversation with anymore interested in the concept of the project. I'm working on it as a part of the FemTechNet Initiative, spearheaded by Anne Balsamo and Alexandra Juhasz. I'm not sure if information on the initiative has circulated here, so I'll paste in a copy of the "call" which took place last spring. *
http://aljean.files.wordpress.com/2012/05/femtechnet-long-form-invite-may-20 12.pdf *
Thank you, Monika
-- Monika Sengul-Jones Graduate Student Communication & Science Studies University of California, San Diego msengul@ucsd.edu _______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
Hi Monika and list, I've helped in creating wikitrip, a web tool displaying an animated visualization over time of geo-location and gender of Wikipedians who edited a specific page. You can search for any page (from any language wikipedia!) and get a stats of how many edits to this page are made from self-declared male and female wikipedians, over time (bottom right of the web interface). Few random examples: http://sonetlab.fbk.eu/wikitrip/#|en|Feminism The page Feminism received most gendered edits from males http://sonetlab.fbk.eu/wikitrip/#|en|Wikipedia_talk:WikiProject_Feminism but the talk of the Feminism project mainly from females http://sonetlab.fbk.eu/wikitrip/#|en|Sexual_intercourse Sexual_intercourse was edited mainly by males in the beginning (2001) but around 2008 female wikipedians jumped in http://sonetlab.fbk.eu/wikitrip/#|en|Talk:Sexual_intercourse Similar thing for the talk page of sexual intercourse. Wikitrip code is open source so anybody can look at it, improve it and re-use it. Moreover we have also released a useful API. So if you want to get the raw data, you can! The 3 available APIs are described on the help page at wikitrip (click the "read more..." link) and they are api.php: Get various stats about a page (including editors and how many edits they performed) api_gender.php: Get timestamp and gender for any edit by a registered user that specified his gender on a specific page api_geojson.php: Get location in the world for anonymous edits on a specific page 2 examples of the first 2 apis (they are described on the wikitrip help page) http://toolserver.org/~sonet/api.php?article=London&lang=en&editors&max_edit... http://toolserver.org/~sonet/api_gender.php?article=London&lang=en The output format is json but we can easily change it into csv or anything else, if there is such a request. Notes: 1) as you might know, expressing your gender on Wikipedia is not mandatory and few user do it (around 10% last time I checked if I remember correctly) so stats are heavily biased by this. Still a Wikitrip exploration can be a beginning for a research, not surely the end of it ;) 2) we show number of edits and not number of editors because number of edits are greater and so stats are more "dramatic" but this adds another level of "noise" since it might be that a single female editors, for example, performed 200 edits to a page that technically it is not receiving a lot of attention from females but from one female. However, as I wrote earlier, there is the API you can use to get the raw data (for example, all the gendered edits) and so to conduct different, less dramatic and more scientific research lines ;) I'm very interested in counducting research on Wikipedia and gender so Monika I'll contact you offline for possible collaborations. Actually I'll present Wikitrip (and Manypedia.com ) in few days at Wikisym 2012 in Linz, Austria, if you are there, I would love to talk with you face2face too. Ciao! ;) -- -- Paolo Massa Email: paolo AT gnuband DOT org Blog: http://gnuband.org On Wed, Aug 15, 2012 at 12:38 AM, Monika Sengul-Jones <jones.monika@gmail.com> wrote:
Hello Air-L list:
This summer I'm doing research on Wikipedia entries in the field of Science and Technology Studies. A central question I'm asking is the extent to which this field, as it is now on Wikipedia, includes/features/references contributions made by women, feminist theorists, and feminist theory.
To answer this, I'm gathering data on existing pages using a variety of mixed methods. I would like to ask for recommendations on tools for extracting the history of editing on a page. I want to see how many times a given article has been edited, by whom, and what types of edits and content contributions are made over time. So far, I've found the "history" tool on the Wikipedia page limited. I cannot see how many edits have been made on a particular article and understanding what kinds of edits are made (e.g. grammatical, content) requires going into each historical page view. I'd love to find a way to download the history of an article and extract the data into a spreadsheet -- perhaps this is a tall order.
So far, I've found tools for extracting data on Wikipedia from the Digital Methods Initiative website (which was first introduced to me by this list serve! :)). Specifically, the program History Flow is useful to an extent for visualizing types of content contributions and edits over time. But there is no way to translate these visualizations into a spreadsheet format -- as far as I can tell -- so I've been doing that manually, somehow piecing together the history of edits on an article. Meanwhile, I was recommended a tool called WikiChecker ( http://en.wikichecker.com/article/?a=science_studies) but the summary format is limited and, at times, contradictory to data I get elsewhere.
If anyone has any other tools or methods to suggest for ways to collect data on content contributions and edits on Wikipedia I would be most grateful.
I'd also be happy to be in conversation with anymore interested in the concept of the project. I'm working on it as a part of the FemTechNet Initiative, spearheaded by Anne Balsamo and Alexandra Juhasz. I'm not sure if information on the initiative has circulated here, so I'll paste in a copy of the "call" which took place last spring. * http://aljean.files.wordpress.com/2012/05/femtechnet-long-form-invite-may-20... *
Thank you, Monika
-- Monika Sengul-Jones Graduate Student Communication & Science Studies University of California, San Diego msengul@ucsd.edu _______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
For some wikipedia text extraction, try the free open source InfoExtractor http://www.infoextractor.org/ A sample of what you get is below. It may not get exactly what you need, but the inventor is open to ideas for how to improve it. Discussions extracted from http://en.wikipedia.org/wiki/Obama TITLE: Time for a Featured Article Review? USER: Tarc DATE: 01:11, 2 August 2012 (UTC)}} TALK: It is time to put this baby to bed. The Featured Article Review process is not to be used or abused because one does not like certain things in an article, those are issues that we handle via normal, simple editing procedures. FAR is to identify and correct major deficiencies in a Featured Article that call into question it still being an FA at all. This ain't that, time to move on. USER: John DATE: 17:55, 27 July 2012 (UTC) TALK: I hadn't read this for quite a while and I see quite a few problems with its quality. I also notice its Featured Article status hasn't been properly reviewed since 2008. In that time, standards at FA have risen significantly. I wonder how regular editors would feel about conducting a proper audit to see whether the wider community think this article meets current FA criteria? I feel it could only benefit the article to undergo such a process. -- USER: Scjessey DATE: 18:02, 27 July 2012 (UTC) TALK: We're barely 100 days out from the election, and this article will doubtless receive a growing number of attacks from vandals and POV warriors if 2008 was any indication. Are you sure you want to get into a FAR procedure with all that going on? A review would be welcome, but only after the seas are calmer. -- On Thu, Aug 16, 2012 at 6:26 AM, Paolo Massa <paolo@gnuband.org> wrote:
Hi Monika and list, I've helped in creating wikitrip, a web tool displaying an animated visualization over time of geo-location and gender of Wikipedians who edited a specific page. You can search for any page (from any language wikipedia!) and get a stats of how many edits to this page are made from self-declared male and female wikipedians, over time (bottom right of the web interface).
Few random examples:
participants (4)
-
Monika Sengul-Jones -
Paolo Massa -
Peter Timusk -
Stuart Shulman