Website/weblog word counts
It just occurred to me that one rough way of indicating the degree of engagement required to produce a given weblog is to point out the number of words you might find on one. I have a couple archived as HTML - is there an easy way to count how many words there are in a folder full of HTML pages? Has anyone done a study of the number of words on a typical weblog? Or in a typical weblog posting? Or of differing patterns of increasing or decreasing postings over time in a given body of weblogs?
FYI: apparently the average weblog comment is 63 words long, according to G. Mishne and N. Glance (2006), Leave a Reply: An Analysis of Weblog Comments, in WWW2006: http://staff.science.uva.nl/~gilad/pubs/www2006-blogcomments.pdf
I just did a couple of manual word counts, and one of my interviewees' weblogs ranged from c. 600 words in one month to a peak of over 11,000 words in each of two different months.
--- David Brake, Doctoral Student in Media and Communications, London School of Economics & Political Science <http://www.lse.ac.uk/collections/media@lse/study/mPhilPhDMediaAndCommunications.htm>
Also see http://davidbrake.org/ (home page), http://blog.org/ (personal weblog) and http://get.to/lseblog (academic groupblog)
Author of Dealing With E-Mail - <http://davidbrake.org/dealingwithemail/>
callto://DavidBrake (Skype.com's Instant Messenger and net phone)
Please access the attached hyperlink for an important electronic communications disclaimer: http://www.lse.ac.uk/collections/secretariat/legal/disclaimer.htm
On Unix-like systems, cat * | stripHTML | wc will give you a word count of the pages in the directory without the HTML, where stripHTML stands for whatever tag-stripping command you choose. There are many ways to do stripHTML depending on your coding preference. One of the simplest is just to use a regex, though that generally assumes good code, so you might want to use tidy to fix the code first and then strip it. Heh.
jeremy hunsinger
Information Ethics Fellow, Center for Information Policy Research, School of Information Studies, University of Wisconsin-Milwaukee (www.cipr.uwm.edu)
wiki.tmttlt.com www.tmttlt.com
()  ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
http://www.stswiki.org/ sts wiki
http://cfp.learning-inquiry.info/ Learning Inquiry-the journal
http://transdisciplinarystudies.tmttlt.com/ Transdisciplinary Studies:the book series
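As a rough illustration of the strip-then-count pipeline described above, here is a minimal Python sketch. It is not the stripHTML command from the message (that name is a placeholder there); the names TextExtractor, strip_html, count_words and count_folder are made up for this example. It assumes the pages are UTF-8 (or close enough) files ending in .htm or .html, uses the standard library's HTML parser rather than a regex so that mildly broken markup is tolerated, and treats any whitespace-separated token as a word, as wc does.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space means adjacent elements never fuse into one word.
    return " ".join(parser.chunks)


def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def count_folder(folder):
    """Map each .htm/.html file name in `folder` to its word count."""
    counts = {}
    for path in sorted(Path(folder).glob("*.htm*")):
        markup = path.read_text(encoding="utf-8", errors="replace")
        counts[path.name] = count_words(strip_html(markup))
    return counts
```

Calling count_folder on the archive directory returns a per-file dictionary, and summing its values gives the folder total. Note that this still counts sidebars, navigation and any other text that appears on every page.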
It gets pretty thorny, actually, depending on how you have things archived, and what you are trying to get at.
1. Are you looking for words-per-post? If so, you should probably be archiving permalinked posts, but not all blogs allow you to address individual posts with a specific URL. Most also include comments at that permalink.
2. Just stripping out the HTML still leaves you with the cruft (sidebar, etc.) that is automatically generated, along with the comments if they are included.
Words-per-month might be easier, since most blogging platforms/systems provide this at a single URL and without comments. You will still have cruft, but if you are sneaky about it (including a future month in your archive), you might be able to subtract this out from your counts.
The other possibility is to use the RSS feed, assuming you have been archiving it. You can either feed it through an RSS parser (most scripting languages have them), or apply a regex to the feed. This, unfortunately, excludes those blogs that do not have RSS--a shrinking but still substantial number.
The final possibility is to get hold of a sample--like the Blogpulse sample--that has already had some of the munging done. I would be pretty surprised if someone hadn't already done a word-count on the Weblogging Ecosystem data this year: http://www.blogpulse.com/www2006-workshop/
Best,
Alex
--
//
// This email is
// [X] assumed public and may be blogged / forwarded.
// [ ] assumed to be private, please ask before redistributing.
//
// Alexander C. Halavais
// Social Architect
// http://alex.halavais.net
//
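Two of the ideas above, counting words per post from an archived RSS feed and subtracting the page cruft measured on an empty (future) month, can be sketched in Python as follows. This is only an illustration under stated assumptions: the feed is plain RSS 2.0 with no XML namespaces, each post body sits in the item's description element as escaped HTML, and the tag-stripping regex is the crude kind that only behaves on reasonably clean markup. The function names words_in, words_per_item and subtract_cruft are invented for this example.

```python
import html
import re
import xml.etree.ElementTree as ET

# Crude tag matcher: fine for tidy markup, unreliable on badly broken HTML.
TAG_RE = re.compile(r"<[^>]+>")


def words_in(markup):
    """Count whitespace-separated words after removing tags and entities."""
    text = html.unescape(TAG_RE.sub(" ", markup))
    return len(text.split())


def words_per_item(rss_text):
    """Return a list of (title, word_count) pairs, one per <item>.

    Assumes un-namespaced RSS 2.0 with the post body in <description>.
    """
    root = ET.fromstring(rss_text)
    counts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        counts.append((title, words_in(body)))
    return counts


def subtract_cruft(month_page_words, empty_page_words):
    """Estimate post words on a monthly archive page.

    `empty_page_words` is the count for a month with no posts (for example
    a future month), which approximates the sidebar and template text.
    """
    return max(month_page_words - empty_page_words, 0)
```

Feeds often carry only the most recent posts, and some carry excerpts rather than full text, so per-item counts from a feed are a lower bound unless the feed was archived regularly and is known to be full-text.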
Hello David, I am not sure, but I think most current qualitative research tools do not support the HTML format yet. A solution would be to copy and paste the files you want into a simple TXT or RTF file and then count the words with dedicated software. I also research blogs; recently I have been using TAMS Analyzer for Mac for my analysis, following the process described above. In my research I do not count words (especially because I work with Japanese, and counting words and kanji - ideograms - is not a very precise method), but I do count the frequency of posts, the presence of comments and the use of other media formats like photos, video and audio. Within this setting I can affirm that the number of posts in many cases holds steady or tends to increase over time - one of the observed blogs averages over five hundred posts a month!! Wish you luck in your study. My regards, Aristides Emmanuel Pereira, M.A. Int. Cultural Studies PhD Candidate Department of Multi-Cultural Societies Graduate School of International Cultural Studies Tohoku University Kawauchi, Aoba-ku, Sendai-shi 980-8576 JAPAN www.bleepsblops.com Tel. +81-90-6255-2095 ************************************************************************
From: David Brake <d.r.brake@lse.ac.uk> Reply-To: air-l@listserv.aoir.org To: AoIR mailing list <air-l-aoir.org@listserv.aoir.org> Subject: [Air-l] Website/weblog word counts Date: Wed, 16 May 2007 12:32:40 +0100
_______________________________________________ The air-l@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
participants (4)
- Alex Halavais
- Aristides Emmanuel Pereira
- David Brake
- Jeremy Hunsinger