Re: [Air-L] Wikipedia Sampling

25 Sep 2015

For what it's worth, the machine learning company Lateral has used the raw
page view data (available back to 2007 at
http://dumps.wikimedia.org/other/pagecounts-raw) to produce just the kind of
data set I think Alex is describing, i.e., a "most popular content on
Wikipedia" corpus. You can read more about their approach in their blog post
"The Unknown Perils of Mining Wikipedia":
https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia
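
To give a rough idea of what working with that raw data involves, here is a
minimal Python sketch. It is not Lateral's actual pipeline. It assumes each
line of an hourly pagecounts-raw file has four space-separated fields
(project code, page title, request count, bytes transferred), which is my
understanding of the format. The function names and the sample lines are made
up for illustration.

```python
from collections import Counter


def parse_pagecount_line(line):
    """Return (project, title, views) for one line, or None if malformed.

    Assumes the four-field layout: project, title, request count, bytes.
    """
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        return project, title, int(views)
    except ValueError:
        return None


def top_pages(lines, project="en", n=10):
    """Sum views per title for one project and return the n most viewed."""
    totals = Counter()
    for line in lines:
        parsed = parse_pagecount_line(line)
        if parsed is None:
            continue  # skip lines that do not match the assumed layout
        proj, title, views = parsed
        if proj == project:
            totals[title] += views
    return totals.most_common(n)


# Hypothetical sample lines, standing in for two hourly files.
sample = [
    "en Main_Page 1200 5000000",
    "en Python_(programming_language) 300 900000",
    "de Hauptseite 800 3000000",
    "en Main_Page 1000 4000000",
    "garbled line",
]
print(top_pages(sample, n=2))
```

In practice you would stream the gzipped hourly files rather than hold lines
in memory, and you would still need to decide how to treat redirects,
non-article pages, and bot traffic.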

In particular, it seemed to me that some of the technical details of how
they worked with page view data and content dumps, plus their consideration
of how to handle bot-created content (indeed, the very idea of planning for
it), might be of interest to you. (If I understand correctly, bots
are permitted on Wikimedia sites if they are "harmless" and approved, but
not all bots are necessarily known, let alone evaluated.)

Have you also considered reaching out to the Wikimedia Research team
directly?
https://www.mediawiki.org/wiki/Wikimedia_Research/Research_and_Data

Cheers,

Cory Salveson

On Wed, Sep 23, 2015 at 12:23 PM, Alex Halavais <alex@halavais.net> wrote:
...
Hi, Josh,
It depends, of course, on what you are sampling *for*. A "constructed
week" is generally based on viewing patterns, and so I suppose you
could use traffic data to oversample the most popular pages. Or focus
on the front page.
The most obvious approach here is to just sample at random. In doing so, you
will find a very large number of articles--some of them
autogenerated/imported--that have never been touched.
If you haven't already, you might consider copying this question over here as well:
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
In sum, though, any sampling method that draws on edit histories to
study edit histories is probably a problem, since the tail ends up wagging
the dog a bit. I guess you could use this:
https://aws.amazon.com/datasets/wikipedia-page-traffic-statistics/
to sample based on visitors, but that's a dated collection. I'm sure
getting the traffic data from somewhere is a possibility, but it seems
like a lot of work to create a "constructed week."
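
Sampling based on visitors could be sketched roughly as below, assuming you
already have a mapping from article title to a view count for whatever period
you care about. This uses the Efraimidis-Spirakis trick for weighted sampling
without replacement: each page gets a random key u ** (1 / views), and the k
largest keys are kept. The function name and the toy counts are made up for
illustration.

```python
import heapq
import random


def sample_by_views(view_counts, k, seed=None):
    """Draw up to k distinct titles, favouring pages with more views.

    Pages with zero views are excluded, since they carry no weight.
    """
    rng = random.Random(seed)
    keyed = (
        (rng.random() ** (1.0 / views), title)
        for title, views in view_counts.items()
        if views > 0
    )
    # The k largest keys form a weighted sample without replacement.
    return [title for _key, title in heapq.nlargest(k, keyed)]


# Hypothetical view counts, not real traffic figures.
views = {"Main_Page": 5000, "Obscure_stub": 3, "Popular_topic": 900, "Untouched": 0}
print(sample_by_views(views, k=2, seed=1))
```

Popular pages are oversampled, but less-viewed pages still have some chance of
being drawn, which a plain "top N" list would not give you.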
Best,
Alex
On Wed, Sep 23, 2015 at 8:33 AM, Joshua Braun <jabraun@journ.umass.edu>
wrote:
...
Hi All,
Just a brief question for the list: I'm considering doing a study that
looks at the edit histories of a sample of Wikipedia articles, and I'm
wondering if there are accepted strategies for assembling a
"representative" sample of Wikipedia articles akin to the way that, say,
television researchers put together a composite week for content analyses.
Obviously any sampling strategy will come with limitations, upsides, and
downsides. I'm mostly curious as to whether there are accepted sampling
methods that have emerged in the literature dealing with Wikipedia.
Thanks!
All the Best,
Josh
--
Josh Braun, Ph.D.
Assistant Professor of Journalism Studies
Journalism Department
University of Massachusetts Amherst
@josh_braun
Skype: wideaperture
http://wideaperture.net/
"Maybe the only gift is a chance to inquire, to know nothing for
certain.  An inheritance of wonder and nothing more."
William Least Heat-Moon
Sent from Emacs
_______________________________________________
The Air-L@listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at:
http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers:
http://www.aoir.org/
--
// Alexander Halavais, Sociologist, Semiologist, and Saboteur Extraordinaire
// Associate Professor of Social Technologies, Arizona State University
// http://alex.halavais.net/bio     @halavais