Thomas,

Of course all samples have biases, but isn't it incumbent upon us to better understand them -- particularly in the case of Google, precisely because it is fast becoming "all that we have", that is, a default choice for information retrieval, not only for everyday users, but also for students and Internet researchers?

Greg Elmer

----- Original Message -----
From: Thomas Koenig <T.Koenig@lboro.ac.uk>
Date: Wednesday, March 2, 2005 8:54 pm
Subject: Re: [Air-l] counting google hits
Quoting elijah wright <elw@stderr.org>:
[What's wrong with using Google stats?]
> because people assume that all texts that are available are represented,
> which according to the google people they are *not*.
Fair enough, but what is your alternative corpus? Most traditional corpora have a bias away from everyday language and toward journalistic and/or literary writing. Sometimes these biases may not matter, and at other times they might even be desirable, but at times Google is the better choice, even if imperfect.
> in other words, the sample that you are pulling numbers from is neither
> complete nor perfect - so your results won't be either.
Who gets unbiased random samples? No one, not even NORC, who are pretty good at it. Does that invalidate *all* statistical results? Of course not. Don't get me wrong, I am all for careful random sampling, but if I cannot get it, I might, under some circumstances, resort to a biased sample rather than have no sample at all.
> do you understand what google does well enough (details of the algorithm,
> et cetera) to know what the weaknesses are? oh, you say they haven't
> published enough information for you to know? that's what i thought. :|
I do not know how Google indexes (I have a faint idea, though), but for many practical purposes it simply does not matter, as long as I have no reason to suspect that websites are excluded in a way that is *systematically related* to the topic I am researching.
Would I rather have a random sample of all human-generated websites, preferably with the vital stats of their authors attached? You bet. I just won't get it. So I am taking the next best thing, aka Google.
> > I am afraid, this is how your argumentation sounds to me. Why should it
> > be wrong to use the number of google hits under all circumstances?
> i think your tone is pretty crass.
Funny, that's what I thought of yours; that's why I chose to use *your* words. You probably know that it's sometimes difficult to discern the tone when you have no cues other than some ASCII strings.
> > If I want to show that Canada is better known than Vanuatu
> > (http://googlefight.com/index.php?lang=en_GB&word1=canada&word2=vanuatu),
> > why would the comparison of google hits be inadmissible? (There are a
> > number of reasons why the "Vanuatu" hits are inflated, but that is of no
> > concern here.)
> popularity of a term is one of the few instances in which comparative
> occurrence vis a vis the google corpus *might* be useful. it would depend
> on your question, and whether the data available from the particular
> google server you're connected to is appropriate to answering it.
Of course, it always depends on what you want to do, but that's a far cry from your wholesale rejection of using Google hits for any kind of research:
"folks realize that using the "number of hits returned on google" is a hilarious bad way to prove a point -- right?"
Thomas
--
thomas koenig, ph.d.
department of social sciences, loughborough university, u.k.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/index.html
Greg Elmer, PhD
Bell Globemedia Research Chair
Rogers Communications Centre/School of Radio-TV Arts
Ryerson University
350 Victoria Street, Toronto, Ontario
Canada M5B 2K3
416-979-5282
_______________________________________________
Co-Editor, Space and Culture: An International Journal of Social Spaces
http://www.carleton.ca/space/