Re: [Air-L] Screen Scraping of URLS and WHOIS subject/category mining

10 Feb 2018

Hi Nathan,

Regarding your second question, I don't have any quick ideas. However, to scrape the URLs for their category, Beautiful Soup (beautifulsoup), a Python module, should be helpful. At least for pages that share the same structure (say, news articles on http://cnn.com, as in your example), you would be able to extract the information you mention.
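As a rough illustration of that idea, here is a minimal sketch that reads category hints from a page's meta tags with Beautiful Soup. It rests on several assumptions: the sample HTML is made up, the function name `extract_category_hints` and the `WANTED_KEYS` set are hypothetical, and whether a given site (cnn.com included) actually exposes tags such as `article:section` has to be checked page by page. Downloading the pages is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page. A real page would have to be downloaded first,
# and its meta tags may differ from the ones assumed here.
SAMPLE_HTML = """
<html><head>
<title>Example headline</title>
<meta property="og:type" content="article">
<meta property="article:section" content="World">
<meta name="keywords" content="news, politics">
</head><body><h1>Example headline</h1></body></html>
"""

# Meta keys that may hint at a category (an assumption, not a standard list).
WANTED_KEYS = {"og:type", "article:section", "keywords"}


def extract_category_hints(html):
    """Return a dict of category-like meta tags found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    hints = {}
    for meta in soup.find_all("meta"):
        # Open Graph tags use "property"; classic meta tags use "name".
        key = meta.get("property") or meta.get("name")
        content = meta.get("content")
        if key in WANTED_KEYS and content:
            hints[key] = content.strip()
    return hints


if __name__ == "__main__":
    print(extract_category_hints(SAMPLE_HTML))
```

Pages that lack such tags would need site-specific selectors instead, which is why this works best for groups of pages with the same structure.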

I don't know your background or how comfortable you are with Python. However, Patrik Wikström regularly holds a workshop on this here at the QUT DMRC and at our summer schools, and he has put the materials online in our GitHub repository: https://github.com/qut-dmrc/web-scraping-intro-workshop

It's aimed at people with no prior programming experience, although some experience would help if you work through the materials on your own.

Hope that helps.

Cheers,

Felix

Felix Victor Münch
PhD Candidate @ QUT Digital Media Research Centre
Social Media: https://about.me/flxvctr
Google Scholar: https://scholar.google.com.au/citations?user=yn1Rz_EAAAAJ
Academia.edu: https://qut.academia.edu/FlxVctr
ResearchGate: https://www.researchgate.net/profile/Felix_Muench
ORCID: https://orcid.org/0000-0001-8808-6790
QUT: http://staff.qut.edu.au/staff/muench/
QUT preprints: https://eprints.qut.edu.au/view/person/M=FCnch,_Felix_Victor.html
...
On Friday, Feb 09, 2018 at 5:45 pm, Nathan Stolero <stolero@gmail.com> wrote:
Dear AoIR members,
I'm studying the information seeking behavior of adolescents, young adults
and adults. One of the subjects I'm investigating is the difference between
the URLs/links users choose to use (navigate/browse to, click on, etc.) and
the URLs/links users tend to avoid (they look at them but decide not to
navigate/browse/click, as captured by eye-tracking).
As a result, I have a list of all the URLs the user visited during the
experiment and a set of screenshots in which the avoided links are marked
(I don't have the URLs because the user did not click on them, so the
software did not save them). I have a question about each of these two lists:
1) Regarding the list of URLs -
What would be the best way to mine a large list of URLs for their category?
For example, http://www.cnn.com would map to news/broadcasting/content. I
tried WHOIS lookups on the domains, hoping to find this information and then
write code that would extract it for each link, but could not find anything
significant.
2) Regarding the screenshots -
Is there a way, maybe using screen scraping, to automatically translate
textual links (clickable headlines, for example) to their URLs? Maybe using
a simple protocol of: a) scrape the text in a marked area, b) search for this
text on Google, c) use the first URL?
I hope I've made my intentions clear, and I'm looking forward to the wisdom
of the virtual crowd.
Nathan
*************************************************************
Nathan Stolero
Doctoral Student
The Communication Department, The Faculty of Social Science
Tel Aviv University
_______________________________________________
The Air-L@listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers:
http://www.aoir.org/