For what it's worth, the machine learning company Lateral has actually used the raw page view data (available back to 2007 at http://dumps.wikimedia.org/other/pagecounts-raw) to produce just the kind of data set I think Alex is describing, i.e., a "most popular content on Wikipedia" corpus. You can read more about their approach in their blog post "The Unknown Perils of Mining Wikipedia": https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia

In particular, it seemed to me that some of the technical details of how they worked with page view data and content dumps, plus their consideration of how to handle bot-created content (even the very idea of planning for it), might be of interest to you. (If I understand correctly, bots are permitted on Wikimedia sites if they are "harmless" and approved, but not all bots are necessarily known, let alone evaluated.)

Have you also considered reaching out to the Wikimedia Research team directly? https://www.mediawiki.org/wiki/Wikimedia_Research/Research_and_Data

Cheers,
Cory Salveson

On Wed, Sep 23, 2015 at 12:23 PM, Alex Halavais <alex@halavais.net> wrote:
Hi, Josh,
It depends, of course, on what you are sampling *for*. A "constructed week" is generally based on viewing patterns, and so I suppose you could use traffic data to oversample the most popular pages. Or focus on the front page.
The most obvious approach here is to just sample randomly. In doing so, you will find a very large number of articles--some of them autogenerated/imported--that have never been touched.
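As a rough illustration of the simple random sample described above, here is a minimal Python sketch. It assumes you already have a list of article titles in hand (for example, extracted from a content dump); the function name, the example titles, and the default seed are all hypothetical and only there for illustration.

```python
import random


def simple_random_sample(titles, k, seed=2015):
    """Draw k distinct article titles uniformly at random.

    `titles` is assumed to be a list of article titles obtained elsewhere
    (hypothetical input). A fixed seed makes the draw reproducible.
    """
    # Deduplicate and sort so the result depends only on the seed,
    # not on the order in which the titles were collected.
    unique = sorted(set(titles))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


if __name__ == "__main__":
    example_titles = ["Amherst", "Emacs", "Sociology", "Television", "Amherst"]
    print(simple_random_sample(example_titles, 3))
```

Recording the seed alongside the sample lets someone else redraw exactly the same set of articles from the same title list.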
If you haven't, you might consider copying this question over here as well:
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
In sum, though, any sampling method that draws on edit histories to study edit histories is probably a problem--the tail ends up wagging the dog a bit. I guess you could use this:
https://aws.amazon.com/datasets/wikipedia-page-traffic-statistics/
to sample based on visitors, but that's a dated collection. I'm sure getting the traffic data from somewhere is a possibility, but it seems like a lot of work to create a "constructed week."
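For anyone who does want to try sampling by visitors, a sketch of one way to do it in Python follows. It rests on several assumptions: that the traffic file has lines of the form "project title count bytes" (my understanding of the raw page count files, but please verify against the data you download), that summing counts per title is an acceptable measure of popularity, and that the Efraimidis-Spirakis keyed method is a reasonable choice for weighted sampling without replacement. The function names and the sample lines are made up for illustration.

```python
import math
import random
from collections import Counter


def parse_pagecounts(lines, project="en"):
    """Sum view counts per title for one project.

    Assumes whitespace-separated lines: project, title, count, bytes.
    Lines that do not match that shape are skipped.
    """
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != project:
            continue
        try:
            views[parts[1]] += int(parts[2])
        except ValueError:
            continue
    return views


def traffic_weighted_sample(views, k, seed=0):
    """Sample up to k titles without replacement, weighted by view count.

    Efraimidis-Spirakis: give each item the key u ** (1 / w) and keep the
    k largest keys. log(u) / w is used instead; it orders items identically.
    """
    rng = random.Random(seed)
    keyed = []
    for title, count in views.items():
        if count <= 0:
            continue  # pages with no recorded views cannot be drawn
        u = rng.random()
        key = math.log(u) / count if u > 0 else float("-inf")
        keyed.append((key, title))
    keyed.sort(reverse=True)
    return [title for _, title in keyed[:k]]


if __name__ == "__main__":
    sample_lines = [
        "en Main_Page 1000 5000",
        "en Rare_Page 1 100",
        "de Hauptseite 500 300",
    ]
    print(traffic_weighted_sample(parse_pagecounts(sample_lines), 1))
```

Heavily viewed pages are much more likely to be drawn, so this oversamples popular content by design; it is not a substitute for a uniform random sample.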
Best,
Alex
On Wed, Sep 23, 2015 at 8:33 AM, Joshua Braun <jabraun@journ.umass.edu> wrote:
Hi All,
Just a brief question for the list: I'm considering doing a study that looks at the edit histories of a sample of Wikipedia articles, and I'm wondering if there are accepted strategies for assembling a "representative" sample of Wikipedia articles akin to the way that, say, television researchers put together a composite week for content analyses.
Obviously any sampling strategy will come with limitations, upsides, and downsides. I'm mostly curious as to whether there are accepted sampling methods that have emerged in the literature dealing with Wikipedia.
Thanks!
All the Best,
Josh

--
Josh Braun, Ph.D.
Assistant Professor of Journalism Studies
Journalism Department
University of Massachusetts Amherst
@josh_braun Skype: wideaperture http://wideaperture.net/
"Maybe the only gift is a chance to inquire, to know nothing for certain. An inheritance of wonder and nothing more." William Least Heat-Moon
Sent from Emacs
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
--
// Alexander Halavais, Sociologist, Semiologist, and Saboteur Extraordinaire
// Associate Professor of Social Technologies, Arizona State University
// http://alex.halavais.net/bio @halavais