Dear Claudia (and other members of Air-L),

If it is not absolutely required that the data you want to analyze come from Google Groups, and similar data on people discussing on the Internet in a non-face-to-face manner could also be of interest, then maybe the following helps.

There seems to be a Google Groups API (see http://google-gdata.googlecode.com/svn/docs/folder53/AllMembers_T_Google_GDa... for something that might be useful; I don't know for sure whether it is useful, or even "official" in the sense that it stems from Google or is endorsed by them), but I don't know of a way to neatly extract messages (and therefore full threads / discussions) using this API. (An API is a software interface that your own programs can use to communicate with other programs or services, e.g. Google Groups.)

A couple of years ago, we were planning on extracting messages, and especially entire threads / discussions, from Google Groups, but on reading Google's Terms of Service (ToS) I found that our plans were in contradiction with them. Maybe this has changed by now, but I would be surprised if that were the case. At the time there were, in my opinion, three options:

[1] - Abandon our research idea. This was of course not attractive, and I dismissed this option.

[2] - Try to write a web-crawler to extract the messages. This was also a breach of Google's ToS. Another disadvantage was that we wanted data that other people could use/access for analysis as well, so that others with an interest in this field would be able to do similar/related analyses using the same or similar data. Both points were thought to be sufficient reasons for rejecting this approach. We did try to obtain Google's permission (which we did get), but the latter point remained: even if we could build our own HTML-scraper to extract the messages from their web pages, Google would not allow us to distribute the resulting data set, so other researchers would have no access to the data.
[3] - Instead of sourcing data from Google Groups, use an Internet service provider's (ISP) access to Usenet messages. Because we wanted to see whether the cultural differences in the ways people communicate face-to-face also exist when people use a digital, text-only medium (Usenet newsgroups), we also looked at plain Usenet messages.

These can be obtained in plain-text format using a program like "inn" (which runs under Linux; I don't know about Windows/Mac environments). This program can store all messages / threads of the groups of your choosing in a directory structure, and it can update this tree at a specified interval, e.g. every 24 hours. This results in a directory structure filled with plain-text files, which you can parse in situ and/or load into some form of database for analysis. You can extract all kinds of interesting information from these files; for ideas, see the official standard for the interchange of Usenet messages (RFC 1036, which defines the message format carried over NNTP): http://www.faqs.org/rfcs/rfc1036.html

Although this can be a bit of work to set up, it can be very rewarding: a vast, still-growing body of data created by a great number of contributors. Another advantage is that their contributions are precisely time-stamped. And, of course, other researchers can access this publicly available data.

There are disadvantages, too. For instance, it may be a little complicated to set up. (Probably best to leave this to a sysadmin.) Also, Usenet is (ab)used for swapping copyrighted materials such as movies & music, and it is also used for things like distributing (child!) pornography. By excluding certain groups, and especially the so-called "binaries" groups (which carry binary data files, i.e. files not consisting of plain text), you can avoid such unwanted material. Bear in mind that you and/or your institution can get in trouble when downloading such things, even if you do this in an automated fashion.
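
As a rough illustration of parsing such a directory tree, here is a minimal Python sketch using only the standard library. It rests on assumptions that are not part of the description above and may not match a given server: that each group is stored as nested directories (comp.lang.python as comp/lang/python/), that every article is a plain-text file named by its article number, and that binary groups can be recognised by the name component "binaries". The function names (iter_articles, summarize, export_mbox) are hypothetical helpers made up for this example. It also shows one possible way to write the articles to an mbox file, the format asked for in the original question.

```python
import email
import mailbox
import os
from email import policy
from email.utils import parsedate_to_datetime


def group_name(spool_root, dirpath):
    """Map a spool sub-directory (comp/lang/python) to a group name (comp.lang.python)."""
    return os.path.relpath(dirpath, spool_root).replace(os.sep, ".")


def is_binary_group(group):
    """Assumed convention: binary groups have 'binaries' as a name component."""
    return "binaries" in group.split(".")


def iter_articles(spool_root):
    """Yield (group, message) for every article file outside binary groups."""
    for dirpath, dirnames, filenames in os.walk(spool_root):
        group = group_name(spool_root, dirpath)
        if is_binary_group(group):
            dirnames[:] = []  # prune: do not descend below a binaries group
            continue
        # Assumed: article files are named by article number (1, 2, 3, ...).
        numbered = sorted((n for n in filenames if n.isdecimal()), key=int)
        for name in numbered:
            with open(os.path.join(dirpath, name), "rb") as fh:
                yield group, email.message_from_binary_file(fh, policy=policy.compat32)


def parse_date(raw):
    """Return a datetime for a Date header, or None if it is missing or malformed."""
    if not raw:
        return None
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None


def summarize(spool_root):
    """Collect a few headers per article; References links replies into threads."""
    rows = []
    for group, msg in iter_articles(spool_root):
        rows.append({
            "group": group,
            "message_id": msg.get("Message-ID"),
            "from": msg.get("From"),
            "subject": msg.get("Subject"),
            "date": parse_date(msg.get("Date")),
            "references": (msg.get("References") or "").split(),
        })
    return rows


def export_mbox(spool_root, mbox_path):
    """Append every non-binary article to an mbox file; return the number written."""
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for _group, msg in iter_articles(spool_root):
            box.add(msg)
            count += 1
        box.flush()
    finally:
        box.close()
    return count
```

Calling summarize on the spool directory gives one dictionary per article, which can be loaded into a database or a data frame; the "references" list holds the Message-IDs of earlier articles in the same thread, so threads can be rebuilt from it. Filtering on the directory name only keeps binary groups out of the analysis. Preventing them from being downloaded in the first place is a matter of the news server's own configuration.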
By being precise about which groups you include when configuring, you can have access to a fascinating and largely untapped source of data.

I hope this helps. If not, maybe it would help us Air-L members if you could provide some specifics about what it is you want to use / do?

Jerom Janssen

On Sat, Nov 21, 2009 at 16:46, Claudia Mueller-Birn <clmb@cs.cmu.edu> wrote:
Dear all,
I am interested in doing some research using data from Google Groups; ideally I'd like to have the group archive in mbox or other parseable format. I can't imagine I am the first person who wants to do this and I am wondering if anyone has any tips or ideas.
Thank you.
:::Claudia
Claudia Mueller-Birn | Post-doctoral Fellow/Alexander von Humboldt Fellowship Researcher | Carnegie Mellon University | Institute for Software Research (ISR) | 5000 Forbes Avenue Pittsburgh, PA 15213 | phone: (412) 268 6367 | mail: clmb@cs.cmu.edu
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/