Hi Nathan,

I don't have any quick ideas for your second question. For the first one, scraping the URLs for their category, BeautifulSoup, a Python module, should help. At least for pages that share the same structure (for example, news articles on cnn.com, as in your example), you should be able to extract the information you mention.

I don't know your background or your Python skills, but Patrik Wikström regularly holds a workshop on this here at the QUT DMRC and at our summer schools, and he has put the materials online in our GitHub repository: https://github.com/qut-dmrc/web-scraping-intro-workshop

It's aimed at people with no prior programming experience, although some experience would help if you work through the materials on your own.

Hope that helps.

Cheers,
Felix

Felix Victor Münch
PhD Candidate @ QUT Digital Media Research Centre
Social Media: https://about.me/flxvctr
Google Scholar: https://scholar.google.com.au/citations?user=yn1Rz_EAAAAJ
Academia.edu: https://qut.academia.edu/FlxVctr
ResearchGate: https://www.researchgate.net/profile/Felix_Muench
ORCID: https://orcid.org/0000-0001-8808-6790
QUT: http://staff.qut.edu.au/staff/muench/
QUT preprints: https://eprints.qut.edu.au/view/person/M=FCnch,_Felix_Victor.html
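A minimal sketch of the BeautifulSoup approach for question 1, assuming the pages of interest expose a category in their meta tags. The list of meta keys, the sample HTML, and the example article path are all hypothetical illustrations; real sites differ, and each site structure would probably need its own rules.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical choice of meta tags to inspect; adjust per site.
CATEGORY_META_KEYS = ("article:section", "og:type", "keywords")


def extract_category(html):
    """Return the first category-like meta value found in the page, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for key in CATEGORY_META_KEYS:
        # Meta tags carry the key either in property="..." or in name="...".
        tag = soup.find("meta", attrs={"property": key}) or soup.find(
            "meta", attrs={"name": key}
        )
        if tag and tag.get("content"):
            return tag["content"].strip()
    return None


def domain_of(url):
    """Return the host part of a URL, useful for grouping pages by site."""
    return urlparse(url).netloc.lower()


# Made-up page standing in for a downloaded news article.
sample_html = """
<html><head>
<meta property="og:type" content="article">
<meta property="article:section" content="politics">
</head><body><h1>Headline</h1></body></html>
"""

# The article path below is invented for illustration.
print(domain_of("http://www.cnn.com/2018/02/09/politics/example"),
      extract_category(sample_html))
```

In practice the HTML for each visited URL would first have to be downloaded, which is not shown here, and pages without such meta tags would return None and need a different rule.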
On Friday, Feb 09, 2018 at 5:45 pm, Nathan Stolero <stolero@gmail.com> wrote:
Dear AOIR's,
I'm studying the information-seeking behavior of adolescents, young adults, and adults. One of the subjects I'm investigating is the difference between the URLs/links users choose to use (navigate or browse to, click on, etc.) and the URLs/links users tend to avoid (looking at them and deciding not to navigate, browse, or click, as captured by eye-tracking).
As a result, I have a list of all the URLs the user visited during the experiment and a set of screenshots in which the avoided links are marked (I don't have those URLs because the user did not click on them, so the software did not save them). I have a question regarding each of these two lists:
1) Regarding the list of URLs: What is the best way to mine a large list of URLs for their category? For example, http://www.cnn.com would be news/broadcasting/content. I tried WHOIS lookups on the domains, hoping to find this information and then write code to extract that line for each link, but could not find anything significant.
2) Regarding the screenshots: Is there a way, maybe using screen scraping, to automatically translate textual links (clickable headlines, for example) into their URLs? Maybe using a simple protocol: a) scrape the text in a marked area, b) search for this text on Google, c) use the first URL?
I hope I've made my intentions clear, and I look forward to the wisdom of the virtual crowd.
Nathan
*************************************************************
Nathan Stolero
Doctoral Student
The Communication Department, The Faculty of Social Science
Tel Aviv University
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org