[Air-L] Screen Scraping of URLS and WHOIS subject/category mining

9 Feb 2018

Dear AoIR members,

I'm studying the information-seeking behavior of adolescents, young adults
and adults. One of the subjects I'm investigating is the difference between
the URLs/links users choose to use (navigate/browse to, click on, etc.) and
the URLs/links users tend to avoid (they look at them but decide not to
navigate/browse/click, as detected using eye-tracking).

As a result, I have a list of all the URLs each user visited during the
experiment and a set of screenshots in which the avoided links are marked
(I don't have the URLs of the avoided links because the user did not click
on them, so the software did not save them). I have a question regarding
each of these two lists:

1) Regarding the list of URLs -
What would be the best way to mine a large list of URLs for their categories?
For example, http://www.cnn.com would map to news/broadcasting/content. I
tried WHOIS lookups on the domains, hoping to find this information and then
write code that would extract it for each link, but could not find anything
useful.
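To make the goal concrete, here is a minimal sketch of the kind of code I
have in mind. It assumes the category comes from some lookup table keyed by
host name; CATEGORY_BY_HOST below is a hypothetical, hand-made placeholder
(I do not have such a table yet), and host_of and categorize are names I
made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical hand-made lookup table (assumption). A real table would have
# to come from a categorized directory or a classification service.
CATEGORY_BY_HOST = {
    "cnn.com": "news/broadcasting/content",
}


def host_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def categorize(urls):
    """Map each URL to a category, or 'unknown' if its host is not in the table."""
    return {url: CATEGORY_BY_HOST.get(host_of(url), "unknown") for url in urls}


visited = [
    "http://www.cnn.com",
    "http://www.cnn.com/world",
    "http://example.org/page",
]
categories = categorize(visited)

# Counting hosts shows how many distinct sites actually need a category.
host_counts = Counter(host_of(u) for u in visited)
```

Grouping by host first means each site only needs to be categorized once,
however many of its pages appear in the list.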

2) Regarding the screenshots -
Is there a way, maybe using screen scraping, to automatically translate
textual links (clickable headlines, for example) into their URLs? Maybe using
a simple protocol of: a) scrape the text in a marked area, b) search for this
text on Google, c) use the first URL?
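As a rough sketch of step (b) only: assuming the text of a marked area has
already been extracted by some OCR tool (not shown), the search address for
it could be built as below. The function names are my own, and reading the
first URL from the results page (step c) is left out.

```python
from urllib.parse import urlencode


def clean_headline(ocr_text):
    """Collapse line breaks and repeated spaces left over from text extraction."""
    return " ".join(ocr_text.split())


def search_url(headline, base="https://www.google.com/search"):
    """Build a search URL that looks for the headline as an exact phrase."""
    query = '"' + clean_headline(headline) + '"'
    return base + "?" + urlencode({"q": query})
```

Quoting the headline asks for an exact-phrase match, which should make it
more likely that the first result is the page the headline linked to, though
I have not verified how reliable that is.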

I hope I've made my intentions clear, and I'm looking forward to the wisdom
of the virtual crowd.

Nathan

*************************************************************
Nathan Stolero
Doctoral Student
The Communication Department, The Faculty of Social Science
Tel Aviv University