Experiences making large web archive datasets accessible for research?
As the landscape of copyright and fair use continues to evolve, and as more and more academic research relies on large datasets compiled at least in part from open web crawling (especially in emerging areas like neural network training datasets), it's interesting to think about how the world's web archives might make more of their holdings available for academic research. In the US, for example, most university IRBs I've spoken with treat web-derived datasets as "exempt" regardless of the sensitivity of the questions being asked, and some institutions that have additional pre-IRB data reviews appear to waive those in at least some cases when it comes to web data (https://www.forbes.com/sites/kalevleetaru/2017/09/16/ai-gaydar-and-how-the-f...). As a result, web-crawled data is becoming especially popular.

While technical limitations do surface as concerns, the most common reason I've heard from web archives for not opening their holdings more broadly to data mining revolves around copyright law and their interpretations of fair use when it comes to academic data mining (and of course the landscape of copyright and "fair use" exceptions varies dramatically across the world). I thought many on this list would be interested in a piece I put out yesterday, in which I talk with Common Crawl about their approach to fair use and their recommendations for web archives considering making their holdings more accessible for data mining: https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlock...
Obviously the notion of just what counts as "fair use" or its equivalent is highly contested and varies from country to country (if it exists at all in a form amenable to data mining). For a follow-on piece I'm doing, I'd love to hear from anyone on this list who has released similar large archives of web content for open research. In particular, I'm curious about the legal justifications you used, your experiences, and any adjustments you made to the collection that your counsel felt made the fair use argument stronger. I'd also like to know whether you distributed just the raw HTML or included imagery, etc., whether you just posted a download link or required a signed researcher agreement first, and whether you distributed the content to researchers' machines or required it to be processed locally.

There are obviously a tremendous number of opinion pieces out there, along with legal arguments and briefs from myriad organizations for and against archives being able to box up large holdings of web pages and make them available for data mining. So I'm particularly interested in real-world examples where groups have actually made large collections of web pages available to others for data mining: how they accomplished that, and the considerations and concessions they made that they believe ensured their work complied with fair use or its equivalent in their country (rather than opinion pieces that just talk about how it can or cannot, and should or should not, be done).

Feel free to respond to me off list if you'd prefer.

Thanks so much!

Kalev
participants (1)
-
kalev leetaru