Re: [Air-L] A question for researchers interested in the basics of statistical inference

3 Sep 2009

Hi Monica,

Congrats on your thesis.  I will take a stab at your questions.

I think what might be problematic is the conception of inference.  In
inferential statistics, the base definition of inference is drawing
conclusions about a larger population from a sampled data set.  In these
cases, the gold-standard sampling method is simple random sampling with
replacement (SRSWR), though population-inferential statistics are
commonly computed on simple random samples (SRS), cluster samples,
multi-stage samples, and so on.

To produce unbiased estimates, inferential statistical methods rely on a
set of assumptions.  OLS, for example, has a number of assumptions: the
independent variables (IVs) are treated as fixed rather than random,
there is no perfect multicollinearity, the errors are homoskedastic, and
the mean of the residuals is zero.  Now, many of these assumptions are
met when the sample that produced the data was a probability sample.  If
the estimates are unbiased, you can calculate variance, standard errors
and confidence intervals for the population.
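
To make that last step concrete, here is a minimal sketch in Python
(standard library only).  Everything in it is an illustrative
assumption: the population is simulated, the sample of 400 is drawn
with replacement (SRSWR), and the interval uses a normal approximation
rather than any particular survey package.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, standard error, (low, high)) via a normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, se, (mean - z * se, mean + z * se)


rng = random.Random(42)
# Hypothetical population; in real work this is what we cannot observe.
population = [rng.gauss(50, 10) for _ in range(10_000)]
# SRSWR: every unit has the same chance on every draw, with replacement.
sample = rng.choices(population, k=400)

mean, se, (low, high) = mean_confidence_interval(sample)
print(f"mean={mean:.2f}  se={se:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

The interval is only a statement about the population because the draw
was random; run on a purposive sample, the same arithmetic still
executes but the population-level reading is no longer guaranteed.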

Importantly, a sample does not require a random draw for valid  
inferential estimation.  If a purposive sample can meet the  
assumptions of an inferential model, you can certainly produce  
unbiased estimates.  However, gauging the degree of unbiasedness in a  
purposive sample is difficult, so it is unwise to assume true  
unbiasedness.  Let me focus on your question regarding the propriety  
of using inferential techniques on purposive samples.

A different and complementary use of inferential statistics is to draw
inferences about relations in data.  For example, we might test
differences between groups or the relations among many variables in an
analysis.  In this case, the inference is not population-level; rather,
it describes the relations in the data at hand.  Here we cannot argue
that our estimates are representative and unbiased, but many models are
robust enough, and have enough diagnostic features, that we can
generally gauge the validity of the measures.  If we recognize and
report the limitations of the model, it is appropriate to use them.
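
One way to illustrate inference that stays within the data at hand is
a permutation (randomization) test.  To be clear, this technique is not
something named above; I am swapping it in as an example because its
logic depends only on reshuffling group labels in the observed data,
not on how the sample was drawn.  The function name and the toy numbers
are hypothetical.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns (observed difference, approximate p-value).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random within the observed data.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Toy data, made up for illustration only.
observed, p_value = permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(observed, p_value)
```

The p-value here answers a within-sample question: how often would a
gap this large appear if group membership were unrelated to the
outcome in these particular respondents?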

Now, the second part of your question dealt with parametric and
nonparametric methods.  In statistics, "parametric" describes methods
that assume the population follows a distribution defined by a fixed set
of parameters.  In most cases, we are concerned with the normal
distribution.  Nonparametric methods do not assume a particular
distribution.  They are often the safer choice when the sample is quite
small, because the distributional assumption is hard to check.  However,
as samples grow larger, the sampling distributions of statistics such as
the mean tend toward normality, and distribution-appropriate methods
would apply.
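
A small sketch of the contrast, using only the Python standard library.
The two helper functions and the toy data are assumptions for
illustration: Welch's t statistic stands in for a parametric method
(it works from means and variances), and the Mann-Whitney U statistic
stands in for a nonparametric one (it works only from the ordering of
the values).

```python
import statistics


def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def mann_whitney_u(a, b):
    """U statistic for group a: pairs where a beats b; ties count one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


a = [1, 2, 3]
b = [4, 5, 6]
b_outlier = [4, 5, 600]  # same ordering as b, but one extreme value

# The rank-based statistic only sees the ordering, so it does not change.
print(mann_whitney_u(a, b), mann_whitney_u(a, b_outlier))
# The t statistic depends on means and variances, so the outlier moves it.
print(welch_t(a, b), welch_t(a, b_outlier))
```

Note that the outlier makes the groups look further apart, yet the t
statistic shrinks because the variance balloons, while U is untouched.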

The application of inferential methods to non-probability samples is
appropriate if the inferences are to be drawn within the sample, and
the characteristics of the distributions reasonably meet the criteria
of the method.  You generally should be careful when making claims that
reach beyond the sample (its representativeness) or about the degree of
unbiasedness, but you can use these techniques to make inferential
estimates regarding the data at hand.

Finally, with regard to confidence levels: in a between-means
comparison such as a t-test, we are comparing the hypothetical
distributions of the group means, and the significance test provides
the intervals for that comparison.
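
As a rough sketch of how the interval and the significance test line
up, the Python below computes a confidence interval for a difference in
means and checks whether it excludes zero.  The assumptions are mine:
an unpooled (Welch-style) standard error, and a normal critical value
in place of Student's t because the standard library has no t
quantile.  That substitution is reasonable for large groups and too
narrow for small ones; the three-value groups exist only so the
arithmetic can be checked by hand.

```python
import math
import statistics


def difference_interval(group_a, group_b, confidence=0.95):
    """Return (difference in means, (low, high), interval excludes zero?)."""
    diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    # Unpooled standard error of the difference in means.
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    # Normal critical value (assumption: stands in for Student's t).
    z_crit = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = diff - z_crit * se, diff + z_crit * se
    # "Significant" at this level exactly when zero lies outside the interval.
    significant = not (low <= 0.0 <= high)
    return diff, (low, high), significant


diff, (low, high), significant = difference_interval([1, 2, 3], [4, 5, 6])
print(diff, (round(low, 3), round(high, 3)), significant)
```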

Thanks,
Fred

On Sep 2, 2009, at 10:12 PM, Monica Barratt wrote:
...
Hi everyone

I'm currently writing up my thesis which has the working title
'Researching the forums: Illicit drug use in a networked world'. I
conducted an online survey using a purposive (nonprobability) sample of
illicit drug users who used internet message boards (forums) to discuss
or read about drugs. Originally I intended to conduct inferential
statistics on this sample of 915, as this is the general practice in
many other papers I had read. After some more thought though, I'm
leaning away from that.

Following is my thinking about this issue. I would really appreciate
some feedback on this from anyone with an interest in this area
(non-experts welcome too!)

*My understanding of the sampling and statistical inference in my thesis
work*

There are two types of samples: probability and nonprobability.
Probability samples occur when each individual from the population of
interest has an equal (non-zero) chance of being included in the sample
(random selection). In contrast, nonprobability samples contain
self-selected individuals from a population of interest - not everyone
has a chance of participating, so we can't calculate the relationship
between the sample and the population of interest.

Probability samples of illicit drug users are rare. This is because to
conduct a probability sample, the researcher needs to have a defined
population, such as a list of students or phone numbers of households.
Illicit drug use is a rare behaviour on a population level (excluding
perhaps, ever use of cannabis) and it is unlikely that a list of drug
users will exist given the illegality of the behaviour and reluctance to
self-identify on such a list.

Inferential statistics are not compatible with nonprobability samples. A
core assumption of the use of inferential statistics is that individuals
are randomly selected from the population of interest. Without this
randomness, the logic of inferential statistics does not hold.

Inferential statistics can be further categorised into parametric and
non-parametric statistical methods. These types of inferential
statistics are chosen depending upon the distribution of the variables
to be analysed; e.g. parametric statistics for continuous normal
variables and nonparametric statistics for nonnormal or
categorical/ordinal variables.

Nonparametric or distribution free statistics are still inferential. So
they too are incompatible with nonprobability samples.

Descriptive statistics can still be applied to nonprobability samples to
determine the relationships between variables in the dataset. What
should not be done is 'significance testing' as the aim of this testing
is to determine whether a relationship is strong enough or a difference
is large enough, given the sample size, to be representative of a
difference in the population. This assumes that the sample has a known
relationship to the population. This is meaningless when applied to a
nonprobability sample.

There are still good reasons to conduct a nonprobability sample. There
are simply situations when probability samples are impossible to obtain
or just too expensive (arguably this applies to my population of
interest). They are also useful in exploratory or preliminary studies
(also relevant to me). The trick is not to apply inappropriate
statistical tests to data collected in this way.

Why is it then that we see probability statistics routinely conducted
upon nonprobability samples, especially in the drug studies field? Is it
something about making our research appear more scientific with the
addition of a p < .05? Is it ignorance? Or do I have it wrong myself?
Are there times when inferential statistics, e.g. a t-test or a
correlation coefficient, can be applied to nonprobability samples? Are
there any exceptions to this rule?
-- 
Monica Barratt
BSc(Psych); PhD in progress...
National Drug Research Institute
Melbourne, Victoria, Australia
http://preview.tinyurl.com/lwyyzq
_______________________________________________
The Air-L@listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers:
http://www.aoir.org/
--
Fred Stutzman
Ph.D. Student and Teaching Fellow
School of Information and Library Science, UNC-Chapel Hill
fred@fredstutzman.com | (919) 260-8508 | http://fredstutzman.com/