Quoting elijah wright <elw@stderr.org>: [What's wrong with using Google stats?]
because people assume that all texts that are available are represented, which according to the google people they are *not*.
Fair enough, but what is your alternative corpus? Most traditional corpora are biased away from everyday language and towards journalistic and/or literary writing. Sometimes these biases may not matter, and at other times they might even be desirable, but there are cases where Google is the better choice, even if imperfect.
in other words, the sample that you are pulling numbers from is neither complete nor perfect - so your results won't be either.
Who gets unbiased random samples? No one, not even NORC, who are pretty good at it. Does that invalidate *all* statistical results? Of course not. Don't get me wrong, I am all for careful random sampling, but if I cannot get it, I might, under some circumstances, resort to a biased sample rather than have no sample at all.
do you understand what google does well enough (details of the algorithm, et cetera) to know what the weaknesses are? oh, you say they haven't published enough information for you to know? that's what i thought. :|
I do not know how Google indexes (I have a faint idea, though), but for many practical purposes it simply does not matter, as long as I have no reason to suspect that websites are excluded in a way that is *systematically related* to the topic I am researching. Would I rather have a random sample of all human-generated websites, preferably with the vital stats of their authors attached? You bet. I just won't get it. So I am taking the next best thing, aka Google.
I am afraid this is how your argument sounds to me. Why should it be wrong to use the number of Google hits under all circumstances?
i think your tone is pretty crass.
Funny, that's what I thought of yours, which is why I chose to use *your* words. You probably know that it is sometimes difficult to discern tone when you have no cues other than some ASCII strings.
If I want to show that Canada is better known than Vanuatu
(http://googlefight.com/index.php?lang=en_GB&word1=canada&word2=vanuatu),
why would the comparison of Google hits be inadmissible? (There are a number of reasons why the "Vanuatu" hits are inflated, but that is of no concern here.)
popularity of a term is one of the few instances in which comparative occurrence vis a vis the google corpus *might* be useful. it would depend on your question, and whether the data available from the particular google server you're connected to is appropriate to answering it.
Of course, it always depends on what you want to do, but that's a far cry from your wholesale rejection of using Google hits for any kind of research: "folks realize that using the "number of hits returned on google" is a hilarious bad way to prove a point -- right?"

Thomas

--
thomas koenig, ph.d.
department of social sciences, loughborough university, u.k.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/index.html