Sina,

You face many key choices, not all of them algorithmic:

- how many tasks?
- which are the easy ones and which are the hard ones?
- what is the best order of tasks?
- why 4 categories?
- will you make categories mutually exclusive or not?
- how will you resolve boundary cases?
- is there enough unlabeled and labeled data to build a balanced model?
- can you create original, corpus-based labels, yourself or in a small group?
- do labels created by others, at another time, for another reason, using a different corpus, work?
- how will you validate that the labels, or the model they produce, are accurate?
- how do you know a good annotator/evaluator from a poor one?
- how does the task-specific aptitude of the annotators impact the training data and the model?

Let me explain my thinking on these ideas, which reflect more than 30 years of labeling that started with a pen-and-paper undergraduate senior thesis in 1988, included 10 years of using and teaching NUD*IST, NVivo, and Atlas.ti, and continued through my own software development journey to roll everything I know into two labeling platforms.

The number of tasks is most important: one task for all categories, or one or more tasks for each category? I always favor the latter. The more you decompose a problem into separate tasks, the better the chance you will get them done consistently and accurately. By breaking the tasks down, you very quickly (sometimes in minutes) learn what is easy and what is difficult, and you start to think about the best order of tasks. With Twitter, that order is often as follows: collect, deduplicate, sample seeds and singles, scope for relevance issues, and start your first binary classification, which for me is always relevance. For example, one common sequence is to build a relevance classifier first, then a main topic classifier within the relevant data, then a sentiment classifier within the relevant, on-topic subset.
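
The deduplicate-then-cascade order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RT @user:` prefix pattern, the function names, and the three stand-in predicates are hypothetical, and in practice each predicate would be a separately trained binary classifier rather than a keyword check.

```python
import re

# Assumption: classic retweets start with "RT @user:"; adjust for your own data.
RT_PREFIX = re.compile(r"^RT @\w+:\s*", re.IGNORECASE)

def normalize(text):
    """Strip a leading retweet prefix, collapse whitespace, and lowercase."""
    text = RT_PREFIX.sub("", text.strip())
    return " ".join(text.split()).lower()

def deduplicate(tweets):
    """Keep only the first tweet seen for each normalized text."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

def cascade(tweets, is_relevant, is_on_topic, sentiment_of):
    """Apply three separate steps, each only to the survivors of the previous one."""
    relevant = [t for t in tweets if is_relevant(t)]
    on_topic = [t for t in relevant if is_on_topic(t)]
    return {t: sentiment_of(t) for t in on_topic}

# Toy data and keyword stand-ins for the three classifiers (hypothetical).
raw = ["Vote today!", "RT @alice: Vote today!", "rt @bob: vote   today!", "Lunch was great"]
unique = deduplicate(raw)
labeled = cascade(
    unique,
    is_relevant=lambda t: "vote" in t.lower(),
    is_on_topic=lambda t: "today" in t.lower(),
    sentiment_of=lambda t: "positive" if "!" in t else "neutral",
)
```

Each stage only ever sees the items that passed the stage before it, so the labeling effort for the later, harder tasks is spent on a smaller and cleaner subset.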
The end result is much better than going from raw data directly to four categories with a pre-wrapped model. On Twitter, duplicates are RTs, and you really never want to label the same item over and over, which is why deduplication is a prerequisite to getting going.

Exclusivity in categories is among the most important decisions. I always try to build mutually exclusive categories, in layers as needed, to keep the training data as focused as possible on the problem at hand. Even adding a third category and making the codes non-mutually exclusive will complicate the human labeling, the measurement, and the signal produced in a machine-learning model. Exclusive categories produce better results in nearly all the machine learning we do. If I care about three topics, I build three classifiers: A-Not A, B-Not B, and C-Not C. Each is a separate task, easier for the humans and the machines. You can mash up the results.

Boundary cases are the hard part in annotation and machine learning. Some are irresolvable. Others you can plan around or learn your way through if you label in short stints in small groups whose members all write reflection memos in a shared Google form after each 5-15 minute labeling session. The best way to limit the impact of irresolvable boundary cases is to decompose the problems as noted above. In the end, someone has to decide who is right when multiple annotators disagree on a boundary case. We call that process CoderRank, because it reveals over time that people are never equally able to understand and execute a classification task. The more people you add, the clearer this fundamental fact becomes. Coders are not equal; some are terrible and a small number are truly legendary. You need to know where you and your annotators sit on the spectrum.
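
CoderRank itself is not specified in this message, so the following is not that method. It is a much cruder proxy, offered only as a sketch: for each annotator, the share of items on which their label matches the majority label. The function names and the toy data are hypothetical, and majority agreement is an assumption that can mislead when the majority itself is wrong.

```python
from collections import Counter, defaultdict

def majority_label(labels):
    """Return the most common label, or None if there is no label or the top two tie."""
    counts = Counter(labels).most_common(2)
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def agreement_with_majority(annotations):
    """Map {item_id: {annotator: label}} to {annotator: share of items matching the majority}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for labels_by_annotator in annotations.values():
        winner = majority_label(list(labels_by_annotator.values()))
        if winner is None:
            continue  # tied item: treat as an open boundary case needing adjudication
        for annotator, label in labels_by_annotator.items():
            totals[annotator] += 1
            hits[annotator] += int(label == winner)
    return {a: hits[a] / totals[a] for a in totals}

# Toy binary relevance labels from three hypothetical annotators.
annotations = {
    "t1": {"ann": "relevant", "bo": "relevant", "cy": "not"},
    "t2": {"ann": "not", "bo": "not", "cy": "not"},
    "t3": {"ann": "relevant", "bo": "not", "cy": "relevant"},
    "t4": {"ann": "relevant", "bo": "not"},  # tie, skipped
}
scores = agreement_with_majority(annotations)
```

Tied items are skipped rather than scored, which keeps them visible as the boundary cases that someone still has to decide.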
With respect to the availability of data for all categories, if you have 80% of the data in category x, 10% in category y, 5% in category z, and 5% boundary cases, a random sample for training will serve you poorly: the model will be very difficult to scale and highly dubious on accuracy. In general, it is best to define classification problems in such a way that you can get more evenly balanced training sets. This is another reason I like binary code schemes with lower levels of complexity in the annotation.

This all gets at the core question: should you build your own model or apply existing training sets? There is no right answer here. In fact, the optimal approach is probably a hybrid: use relevant training data that is out in the ether (if it exists and you trust it) to jump-start the process, but also do your own annotation to see what unexpected difficulties may be encountered. I rarely code for more than 5 minutes before having to jot down notes about something I did not foresee in the data. Then you adjust, experiment, test, validate, and move on, informed by every interaction with the data.

Overall, my biggest caution is to avoid the pitfall of shortcuts, dashboards, and visualizations that eliminate the need to do the harder work. There probably is not a perfect training set out there for you. Only your work over time can determine what is relevant, accurate, and reportable as a finding.

Plato argued categories are hard; he was right. As researchers, we need to embody this core idea in how we think about categorization. Just as all coders are not created equal, not all code schemes are equal either. Some approaches are better, depending on the desired outcome and the underlying theoretical and applied assumptions.

Over the last 26 days, I have labeled 75,000 Twitter user descriptions in one very important binary model.
There are still boundary cases that produce vexing results in human and machine classification, but the model is probably the most powerful and accurate I have ever built. For me, it is the fullest expression of all these ideas and a roadmap of how I will do this sort of work going forward.

~Stu

On Wed, Apr 29, 2020 at 5:08 AM Sina Furkan Özdemir <sina.ozdemir@ntnu.no> wrote:
Dear all,
I have been following some 800 Twitter accounts for my Ph.D. dissertation over the last four months. I have ended up with 400,000 tweets that I need to sort into four mutually exclusive categories.
I looked up some previous works with similar tasks, and it seems that the best way is to use a combination of word embeddings and recurrent neural networks with LSTM structure.
The problem I am having right now is that I couldn't find training data for the classification. Can anyone recommend some literature on sampling strategies for short-text classification tasks?
Best,
Sina Özdemir
Ph.D. Candidate
NTNU, Trondheim
M.A. Comparative and International Studies
ETH Zurich & University of Zurich, Switzerland
B.A. Political Science and International Relations
Middle East Technical University, Turkey
_______________________________________________
The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers: http://www.aoir.org/
--
Dr. Stuart W. Shulman
Founder and CEO, Texifter