Hello All,
I'm looking for ideas on the best software to use for comment scraping. I plan on doing quantitative content analysis and qualitative textual analysis on the comments attached to an article in an online publication. The publication uses Disqus for comments, and ideally I'd like a program that would maintain the integrity of the comment relationships. Any and all ideas are appreciated.
Thanks,
JM
Jasmine McNealy
Assistant Professor
S.I. Newhouse School of Public Communication
Syracuse University
215 University Place
Syracuse, NY 13210
315-443-1151
http://ssrn.com/author=1357319
I highly recommend Discovertext. http://discovertext.com/
Easy to use, good tech support if/when you need it. Built-in coding system. Also can export to spreadsheet (if necessary) with subscription.
Best,
Jacob
--
Dr. Jacob Groshek
Assistant (Visiting) Professor
Digital Media and Research Methods
jgroshek.com <http://www.jgroshek.com/>
Head, CTEC <http://aejmcctec.com/> / AEJMC <http://www.aejmc.org/>
Visiting Scholar, IAST <http://www.iast.fr/>
Full Member, NeSCoR <http://nescor.socsci.uva.nl/>
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
Jacob, Jasmine, etc.,
That software looks great! But expensive! I wonder if there is a cheaper alternative? (I'm working with the same type of data.)
Otherwise, the multipronged approach has been the best I've encountered: text file + screenshot + HTML file.
Thanks,
Casey
If you have a background in Python, or an interest in learning, the Scrapy open source solution is cheap and flexible. http://scrapy.org/
--
Senior Lecturer in Technical Communication at the University of North Texas
Doctoral Candidate at Texas Tech University
Phone: 806.392.7016
Twitter: mikertrice
Skype: mrtrice1
Email: propeliea@gmail.com
“If we knew what it was we were doing, it would not be called research, would it?” - Albert Einstein
I will put in a plug for the painful-but-standard-and-entirely-free solution:
* scrape the comments using a free, command-line based program like wget (http://www.gnu.org/software/wget/) or curl (http://curl.haxx.se/)
* clean the text using the BeautifulSoup Python package for parsing HTML (http://www.crummy.com/software/BeautifulSoup/)
...and I will put in a second, shameless plug for AutoMap, the text cleaning software put out by the CASOS Center at CMU (http://www.casos.cs.cmu.edu/projects/automap/). While its main purpose is to convert text data into network data, it incorporates a basic HTML out-link scraper (probably not what you want) and a remove-all-HTML-from-text-and-convert-symbols-to-English cleaner (might be what you are looking for).
This may not be what you need, but if you are less familiar with coding solutions it should hopefully help you get going in the right direction.
Best,
pml
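For anyone trying the free wget/curl route suggested above, the cleaning step might look roughly like the sketch below. It uses Python's standard-library html.parser as a stand-in for BeautifulSoup so that it runs without installing anything; BeautifulSoup is more forgiving of messy markup. The names TextExtractor and clean_html are made up for this illustration. One thing worth checking first: Disqus threads are often loaded by JavaScript after the page itself, so a page saved with wget or curl may not contain the comments at all.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        # Text split by inline tags in mid-word would gain a space here.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Great <b>article</b> &amp; thread!</p><script>var x = 1;</script>"
    print(clean_html(sample))
```

Note that flattening to plain text like this discards the reply structure, so it suits the content-analysis side of the project better than the comment-relationship side.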
On 01/21/2013 10:47 PM, Jasmine E McNealy wrote:
I'm looking for ideas on the best software to use for comment scraping.
As a caveat to the great suggestions on scraping, I'll also note that Disqus provides an API. That might be more handy than HTML scraping. http://disqus.com/api/docs/
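To illustrate why the API route may preserve comment relationships better than scraping, here is a rough Python sketch. It assumes each post returned by the API carries an "id", a "parent" (empty for top-level comments) and a "message" field, and that a thread's posts are available from a threads/listPosts endpoint; those field names, the endpoint URL and the query parameters are assumptions to verify against the Disqus documentation linked above. The function names and SAMPLE_POSTS are invented for the example, and paging through long threads is not handled.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

# Assumed endpoint; confirm against http://disqus.com/api/docs/
API_URL = "https://disqus.com/api/3.0/threads/listPosts.json"


def fetch_posts(api_key, forum, thread_id):
    """Fetch one page of posts for a thread (parameter names are assumptions)."""
    query = urllib.parse.urlencode(
        {"api_key": api_key, "forum": forum, "thread": thread_id, "limit": 100}
    )
    with urllib.request.urlopen(f"{API_URL}?{query}") as resp:
        return json.load(resp)["response"]


def _key(value):
    # Ids and parents may arrive as strings or integers, so compare them as strings
    return None if value is None else str(value)


def build_tree(posts):
    """Group posts by the id of the post they reply to (None = top level)."""
    children = defaultdict(list)
    for post in posts:
        children[_key(post.get("parent"))].append(post)
    return children


def render(children, parent=None, depth=0):
    """Return the thread as indented lines, replies nested under their parent."""
    lines = []
    for post in children.get(_key(parent), []):
        lines.append("  " * depth + f'{post["id"]}: {post["message"]}')
        lines.extend(render(children, post["id"], depth + 1))
    return lines


# Hypothetical data shaped like the assumed API response
SAMPLE_POSTS = [
    {"id": "1", "parent": None, "message": "First!"},
    {"id": "2", "parent": 1, "message": "Reply to first"},
    {"id": "3", "parent": None, "message": "Another thread"},
    {"id": "4", "parent": 2, "message": "Nested reply"},
]

if __name__ == "__main__":
    print("\n".join(render(build_tree(SAMPLE_POSTS))))
```

Keeping the parent id alongside each comment means the reply structure can be exported to a spreadsheet as an extra column and rebuilt later for analysis.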
participants (6)
- Casey Tesfaye
- Jacob Groshek
- Jasmine E McNealy
- Joseph Reagle
- Michael Trice
- Pete[r] Landwehr