Text Sample Size?


Karyn Hollis

18 Aug 2009

12:29 a.m.

Hi All-- This is a newbie question. I am planning to do a quantitative data analysis to study blogs for gender differences in CMC. Are there any rules for the size of samples? Would comparing male to female blog texts of a total of 50,000 words each be enough to claim statistical significance for any differences I find? Thanks for any advice, Karyn Hollis Villanova University


Peter Timusk

18 Aug

2:16 a.m.

I have no idea of samples of words. I do know samples of persons.

A sample of persons below, say, 300 is suspect, especially if not random. I am reading a few books in Internet studies that argue against previous studies by claiming the sample is too small and not random.

You can claim some things with samples as small as 12, but the more items you want to measure, the larger your sample should be, IMHO.

A sample is best random. Some would argue a sample is only a sample if random. You can sample randomly and still choose roughly equal numbers of men and women.

Can you randomize your samples in some way?

The Canadian Internet Use Survey has had a sample of more than 20,000 persons.

All that I am saying is probably to be found in most undergraduate statistics books.

You would need to ask text analysts about how to sample texts.

I follow gender and computers, so I would be interested in your results or what you are looking for.

Peter

Peter Timusk statistical computer programmer ptimusk@sympatico.ca address 701-151 Parkdale Avenue Ottawa, Ontario Canada K1Y 4V8 Phone 613-729-8328 May all your numbers be quality numbers... even if they are only average numbers.

Alex Halavais

2:46 a.m.

Karyn & Peter,

I'm hoping someone out there will correct me. I think you are looking for something like a rule of thumb, and I suspect that doesn't exist.

There are two questions. The first is how many blogs/bloggers you need to sample in order to generalize to all bloggers. I'm guessing that's not your question. (Although given the issues of arriving at a representative sample, it is not a trivial one.)

I think the question you are asking is (a) how many different bloggers you will need to sample in order to have the power necessary to demonstrate a significant difference between groups, and (b) how much text from each of these bloggers you will need.

Of course, that question hinges in part on the distribution of differences within your groups. That, in turn, is dependent on precisely how you are measuring such differences. (And we'll leave aside, for the moment, the question of whether those differences make a difference--i.e., the validity of whatever measure you choose to use.)

If you are using a metric that has been used in the past to show gender differences, you may be able to use whatever differences they found--in group and between--to estimate your own sample needs. In practice, though, if that literature exists, you probably just use the same sample size.

So, that is my non-answer.

- Alex

--
//
// This email is
// [x] assumed public and may be blogged / forwarded.
// [ ] assumed to be private, please ask before redistributing.
//
// Alexander C. Halavais, ciberflâneur
// http://alex.halavais.net
//
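To make Alex's suggestion about reusing earlier findings a little more concrete: if a prior study reports each group's mean and standard deviation on the measure, one common way to summarize the between-group difference relative to the within-group spread is a standardized effect size (Cohen's d). The sketch below is only an illustration. The helper name `cohens_d` and every number in it are hypothetical and not taken from any study.

```python
import math

def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference, using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)

# Hypothetical prior-study figures: some feature counted per 1,000 words,
# measured for two groups of 40 bloggers each (made-up numbers).
d = cohens_d(mean_a=12.0, sd_a=4.0, n_a=40, mean_b=10.0, sd_b=4.0, n_b=40)
print(round(d, 2))
```

An effect size obtained this way is what power tables and sample-size formulas take as input, so it is the bridge between "what they found" and "how many bloggers I need".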

_______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org

Join the Association of Internet Researchers: http://www.aoir.org/

Monica Barratt

12:30 p.m.

Alex, I believe you are right. There is no rule-of-thumb answer to the question 'how many observations do I need to achieve statistical significance?' But if you know a bit more about your planned analyses in advance, you may be able to estimate sample size using power tables. See:

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

Power calculations are useful only if your data also meet other criteria, though, which need to be considered before you apply inferential statistics.

One of the problems with web-based data is the relative ease of collecting 'large numbers' of responses, words, observations, etc. People often think that large numbers mean they can 'find statistical significance'. What matters is the way you are sampling those units and how you have defined the larger population about which you hope to infer, along with other elements such as the expected differences between groups and effect sizes, as Alex and Peter have already mentioned.

Although a lot of this is standard methods textbook content, it's surprising how many published articles use statistical inference in situations where assumptions for it aren't met. Indeed, I'm still trying to get my head around it. Colleagues of mine have said things like 'it's not a random sample and I don't want to generalise my results to a larger population as I know I cannot, but I can still use statistical tests to test variables within my data, right?' Given these things get published, I'm confused myself. Then again, what is theoretically correct and what gets published aren't necessarily the same thing...

Some answers and more questions for you!

Monica

Monica Barratt http://db.ndri.curtin.edu.au/student.asp?persid=650&typeid=1
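As a rough companion to the power tables Monica mentions, the sketch below estimates how many observations per group a two-sided comparison of two means would need for a given standardized effect size. It uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2 rather than Cohen's tables themselves. The function name `n_per_group` and the default alpha and power values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate observations needed per group (two-sided test, equal groups)."""
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for the test
    z_power = z.inv_cdf(power)            # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Conventional small, medium, and large effect sizes.
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

The normal approximation slightly undershoots exact t-based values, so published tables give somewhat larger figures for medium and large effects. Note that the count is in units of analysis (bloggers, posts, or text chunks), not words, and that it says nothing about whether the sampling itself supports inference.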

richard hall

1:59 p.m.

On 8/18/09 7:30 AM, "Monica Barratt" <tronica@gmail.com> wrote:

...

Although a lot of this is standard methods textbook content, it's surprising how many published articles use statistical inference in situations where assumptions for it aren't met. Indeed, I'm still trying to get my head around it. Colleagues of mine have said things like 'it's not a random sample and I don't want to generalise my results to a larger population as I know I cannot, but I can still use statistical tests to test variables within my data, right?' Given these things get published, I'm confused myself. Then again, what is theoretically correct and what gets published aren't necessarily the same thing...

Well, a very popular response is something like: the Analysis of Variance (or whatever method) is very robust with respect to non-normality assumptions (or whatever assumption), so it's OK. Not that I'm really expert on whether that is OK, but I do know the word "robust" is used a lot to justify assumption violation in experimental psychology circles.

...peace...richard

-- Richard H. Hall, PhD Professor, Information Science and Technology Missouri University of Science and Technology http://mst.edu/~rhall

richard hall

1:55 p.m.

First of all, I'm assuming you want to apply some sort of inferential stats to these data; if not, then this may not apply. If so, the main problem with a small sample size is the loss of power: if there is an effect in the population, you are less likely to find it, so mainly you're just handicapping yourself. In fact, if you do find something with low power, it's quite possibly a very large effect. It's sort of like trying to see through a hazy pair of glasses. If there's something there, for you to see it, it needs to be big and obvious.

...peace...richard

-- Richard H. Hall, PhD Professor, Information Science and Technology Missouri University of Science and Technology http://mst.edu/~rhall
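Richard's hazy-glasses picture can be put in numbers by asking how large a standardized difference has to be before a sample of a given size has a good chance of detecting it. The sketch below uses the normal approximation for a two-sided comparison of two equal-sized groups. The function name `min_detectable_d` and the default alpha and power values are assumptions for illustration.

```python
import math
from statistics import NormalDist

def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with the given power (approximate)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return z_sum * math.sqrt(2 / n_per_group)

# Sample sizes mentioned in the thread, plus one in between.
for n in (12, 50, 300):
    print(n, round(min_detectable_d(n), 2))
```

With a dozen observations per group, only differences larger than a full standard deviation are likely to show up, which is the "big and obvious" case. With a few hundred, much subtler differences become visible.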

James Howison

4:50 p.m.

Lots of useful responses so far. Just wanted to add that we've dealt with a similar question in attempting to move from qualitative human coding to natural language processing. It was useful for us to think about the relationship between the phenomenon of interest and the units of analysis.

I.e., do you have good theoretical (or prior empirical) reasons to believe that the differences between men and women that you are interested in vary with the number of words? If you are talking about individual word choice, then the number of words that you sample ought to be relevant. If, though, the phenomenon that you are interested in is at a different level of analysis, say paragraph level or post level, then your sample reasoning should match that. Perhaps the differences are in post openings or closings?

Once you nail the unit of analysis question, you need to ask what you know about the population distribution of the phenomenon you are interested in. For example, if it shows up only once in every (approx.) 1,000 words, then you'll need to sample enough 1,000-word units to ensure that you have enough possible places that it might have shown up (i.e., something like 300 x 1000) for the inferential logic to work.

It's also possible that you don't yet know the patterns of difference; those might be what you are seeking to discover, although that would seem to call for a qualitative phase. In that case a logic of sufficiency (i.e., I've now seen enough examples, and I'm not seeing any new types, usually called "exhaustion", in reference to concepts, not the coder!) might help you determine when to stop coding. Of course, such a strategy means that the claims you can make are different (i.e., this is a theory-generative, not a theory-testing, methodology). Once that process is done you'll have a better idea of the likely population distribution of your phenomenon, which will then give you insight into what sample size you'd need to test your theory.

Cheers, James <credibility information redacted ;>
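James's arithmetic for a rare feature can be sketched in a few lines. The code assumes, purely for illustration, that the feature occurs at a constant average rate per 1,000 words. The function names are hypothetical, and the target of 300 is simply the figure from his "300 x 1000" example.

```python
import math

def expected_occurrences(total_words, rate_per_1000):
    """Expected number of times the feature appears in a corpus of this size."""
    return total_words * rate_per_1000 / 1000

def words_needed(target_occurrences, rate_per_1000):
    """Corpus size needed to expect at least the target number of occurrences."""
    return math.ceil(target_occurrences * 1000 / rate_per_1000)

# A 50,000-word corpus per group, with a feature seen about once per 1,000 words.
print(expected_occurrences(50_000, 1))
# Words needed to expect about 300 places where the feature shows up.
print(words_needed(300, 1))
```

Under these assumptions, 50,000 words per group would contain only about 50 expected occurrences of such a feature, well short of the 300 or so in James's example. A more frequent feature would change the picture considerably.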

Fred Stutzman

5:50 p.m.

Excellent thread. In terms of resources, you might wish to look at the work of Susan Herring, especially her content analyses of weblogs. Additionally, papers presented at the SIGIR, TREC (Blog track) and ICWSM conferences, and the journals JASIST and IP&M, may have useful methodology segments. There are probably lots of other useful venues.

Following on James' excellent comments, I would urge you to think about this analysis at the observation level, rather than in terms of overall corpus size. Let's assume you have 200 chunks of 1000-word text from a random collection of blogs (100 female-gendered blogs, 100 male-gendered blogs). You could then have raters apply a subjective scale to the text, and then you could compare scale responses between the groups, looking for statistically significant differences. With 200 observations, it would usually be reasonable to use standard parametric tests such as t-tests or ANOVA. However, if you had fewer observations, nonparametric methods such as Wilcoxon and Kruskal-Wallis would be applicable. With these tests you're not drawing general, population-level inference, but this will allow you to run comparisons in your data set.

If you are looking for population-level statistical significance, this study lends itself to a stratified design. The first stage of sampling could be from a public listing of weblogs (finite population) or from a randomized search (infinite population). The second stage of sampling would be text chunks of appropriate size. Depending on gender distribution, you may need to apply weighting within your sample. Importantly, you would be able to calculate standard errors with this design.

Best, Fred (Also of limited credibility)

-- Fred Stutzman Ph.D. Student and Teaching Fellow School of Information and Library Science, UNC-Chapel Hill fred@fredstutzman.com | (919) 260-8508 | http://fredstutzman.com/
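The two-stage design Fred outlines might look roughly like the sketch below: blogs are drawn at random within each gender stratum, then one random contiguous chunk of text is drawn from each selected blog. Everything here is an assumption for illustration: the `gender` and `text` keys, the function name `two_stage_sample`, the choice to skip blogs shorter than one chunk, and the toy listing. Weighting and standard-error calculation are not shown.

```python
import random

def two_stage_sample(blogs, n_per_gender, chunk_words, seed=0):
    """Stage 1: random blogs per gender stratum. Stage 2: one random chunk per blog."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    sample = []
    for gender in sorted({b["gender"] for b in blogs}):
        # Keep only blogs in this stratum that are long enough to yield a chunk.
        stratum = [b for b in blogs
                   if b["gender"] == gender and len(b["text"].split()) >= chunk_words]
        for blog in rng.sample(stratum, n_per_gender):           # stage 1
            words = blog["text"].split()
            start = rng.randrange(len(words) - chunk_words + 1)  # stage 2
            sample.append({"gender": gender,
                           "chunk": words[start:start + chunk_words]})
    return sample

# Toy listing: five 50-word "blogs" per gender (placeholder text).
blogs = [{"gender": g, "text": " ".join(f"w{i}" for i in range(50))}
         for g in ("female", "male") for _ in range(5)]
chunks = two_stage_sample(blogs, n_per_gender=3, chunk_words=10)
print(len(chunks))
```

Each element of the result is one observation, which matches the suggestion to reason at the observation level rather than by total word count.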

S. Courtney Walton

9:26 p.m.

Hi Karyn,

In addition to the great suggestions you have already received, I wonder if your literature review might give you a clue as to sample size for your study? Looking at similar studies, can you get a sense for the number of blogs you will need? Also, does the theoretical framework itself speak to this issue?

I'm interested to hear how you progress, as I am conducting a related study of gender and disclosure on microblogs. Using Twitter as the site of study and a stratified, proportional random sample, my study analyzes microblogging disclosures made by users across two dimensions: gender and identity (parent or professional).

-- Best Regards, S. Courtney Walton scw@umail.ucsb.edu http://www.scourtneywalton.com MA/PhD Student Department of Communication 4807 Ellison Hall University of California, Santa Barbara 93106
