It gets pretty thorny, actually, depending on how you have things archived, and what you are trying to get at.

1. Are you looking for words-per-post? If so, you should probably be archiving permalinked posts, but not all blogs allow you to address individual posts with a specific URL. Most also include comments at that permalink.

2. Just stripping out the HTML still leaves you with the cruft (sidebar, etc.) that is automatically generated, along with the comments if they are included.

Words-per-month might be easier, since most blogging platforms/systems provide this at a single URL and without comments. You will still have cruft, but if you are sneaky about it (including a future month in your archive), you might be able to subtract this out from your counts.

The other possibility is to use the RSS feed, assuming you have been archiving it. You can either feed it through an RSS parser (most scripting languages have them), or apply a regex to the feed. This, unfortunately, excludes those blogs that do not have RSS--a shrinking but still substantial number.

The final possibility is to get hold of a sample--like the Blogpulse sample--that has already had some of the munging done. I would be pretty surprised if someone hadn't already done a word-count on the Weblogging Ecosystem data this year:

http://www.blogpulse.com/www2006-workshop/

Best,

Alex

--
//
// This email is
// [X] assumed public and may be blogged / forwarded.
// [ ] assumed to be private, please ask before redistributing.
//
// Alexander C. Halavais
// Social Architect
// http://alex.halavais.net
//
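
P.S. As a rough illustration of the RSS route described above, here is a minimal sketch in Python. It assumes an archived RSS 2.0 feed whose items carry the post body as HTML in <description>. The function names (strip_tags, words_per_item) and the sample feed are made up for the example, and the regex tag-stripping is crude, so treat the counts as approximate.

```python
import html
import re
import xml.etree.ElementTree as ET

# Crude pattern for HTML tags; good enough for rough counts, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", markup))


def words_per_item(rss_text):
    """Return (title, word_count) for every <item> in an RSS 2.0 feed.

    Assumption: the post body lives in <description>. Feeds that use
    another element for the body would need a different lookup.
    """
    root = ET.fromstring(rss_text)
    counts = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        body = item.findtext("description", default="")
        counts.append((title, len(strip_tags(body).split())))
    return counts


# Hypothetical feed, invented purely to exercise the functions above.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example blog</title>
<item><title>First post</title>
<description>&lt;p&gt;Hello &lt;b&gt;blog&lt;/b&gt; world&lt;/p&gt;</description></item>
<item><title>Second post</title>
<description>Just four plain words</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for title, n in words_per_item(SAMPLE_FEED):
        print(title, n)
```

Because the feed holds only the post text, the sidebar and comment cruft never enters the count. The catch is that some feeds carry only an excerpt of each post, in which case the numbers will undercount.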