Re: [Air-l] counting google hits

3 Mar 2005

      Thomas,

Of course all samples have biases, but isn't it incumbent upon us to better understand them -- particularly in the case of Google, precisely because it is fast becoming "all that we have", that is, a default choice for information retrieval, not only for everyday users, but also for students and Internet researchers?

Greg Elmer

----- Original Message -----
From: Thomas Koenig <T.Koenig@lboro.ac.uk>
Date: Wednesday, March 2, 2005 8:54 pm
Subject: Re: [Air-l] counting google hits
...
Quoting elijah wright <elw@stderr.org>:
[What's wrong with using Google stats?]
...
because people assume that all texts that are available are
represented, which according to the google people they are *not*.
Fair enough, but what is your alternative corpus? Most traditional
corpora have a bias away from everyday language to journalistic
and/or literary writings. Sometimes these biases may not matter, at
other times they might even be desirable, but at times google is the
better choice, even if imperfect.
...
in other words, the sample that you are pulling numbers from is
neither complete nor perfect - so your results won't be either.
Who gets unbiased random samples? No-one, not even NORC, who are
pretty good at it. Does that invalidate *all* statistical results? Of
course not. Don't get me wrong, I am all for careful random sampling,
but if I cannot get it, I might, under some circumstances, resort to
biased samples, rather than to not get any sample at all.
...
do you understand what google does well enough (details of the
algorithm, et cetera) to know what the weaknesses are?  oh, you say
they haven't published enough information for you to know?  that's
what i thought.  :|
I do not know how google indexes (I have a faint idea, though), but
for many practical purposes it simply does not matter, as long as I do
not suspect that websites are excluded in a way that is *systematically
related* to the topic I am researching.
Would I rather have a random sample of all human-generated websites,
preferably with the vital stats of their authors attached? You bet.
I just won't get it. So I am taking the next best thing, aka Google.
...
...
I am afraid this is how your argument sounds to me. Why should it
be wrong to use the number of google hits under all circumstances?
i think your tone is pretty crass.
Funny, that's what I thought of yours, that's why I chose to use
*your* words. You probably know that it's sometimes difficult to
discern the tone when you have no cues other than some ASCII strings.
...
...
If I want to show that Canada is better known than Vanuatu
(http://googlefight.com/index.php?lang=en_GB&word1=canada&word2=vanuatu),
why would the comparison of google hits be inadmissible? (There are a
number of reasons why the "Vanuatu" hits are inflated, but that is of
no concern here).
popularity of a term is one of the few instances in which comparative
occurrence vis a vis the google corpus *might* be useful.  it would
depend on your question, and whether the data available from the
particular google server you're connected to is appropriate to
answering it.
Of course, it always depends on what you want to do, but that's a far
cry from your wholesale rejection of using Google hits for any kind of
research:
"folks realize that using the "number of hits returned on google" is a
hilarious bad way to prove a point -- right?"
Thomas
--
thomas koenig, ph.d.
department of social sciences, loughborough university, u.k.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/index.html
Greg Elmer, PhD
Bell Globemedia Research Chair
Rogers Communications Centre/School of Radio-TV Arts 
Ryerson University
350 Victoria Street, Toronto, Ontario
Canada      M5B 2K3

416-979-5282
_______________________________________________
Co-Editor, 
Space and Culture: An International Journal of Social Spaces
http://www.carleton.ca/space/
...