Hi everyone,
We are excited to announce the publication of our new CDT research report, “Lost in Translation: Large Language Models in Non-English Content Analysis <https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/>.” The report explains the capabilities of a new AI technology called “multilingual language models” that technology companies claim can understand content in over 100 languages by extrapolating linguistic patterns from high-resource languages.
We further describe how these models work, and argue that they have significant limitations <https://cdt.org/press/cdt-finds-key-shortcomings-when-large-language-models-analyze-non-english-languages/>, particularly in “low-resource languages,” that is, languages for which AI developers have little text data available to train AI models, regardless of the number of speakers around the world. Companies, researchers, civil society advocates, and policymakers should be aware of these limitations, as they can create real barriers to information access and equitable online participation for individuals. We also offer guidance on how to help close the gap between companies’ ability to moderate content in English versus the world’s other 7,000 languages.
The full report is available on CDT’s website, along with executive summaries <https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/> in Spanish, French, and Arabic <https://cdt.org/press/cdt-finds-key-shortcomings-when-large-language-models-analyze-non-english-languages/>. Tomorrow, we’ll discuss the paper at an event called “Mind the Gap” <https://cdt.org/event/mind-the-gap-can-large-language-models-analyze-non-english-content/> (see below for more details). We hope you can join us!
Finally, we have an article out in WIRED <https://www.wired.com/story/content-moderation-language-artificial-intelligence/> about how social media companies specifically use multilingual language models to moderate content in languages other than English.
Feel free to share, and let us know if you have any questions or feedback.
take care,
Dhanaraj
On 5/10/23 4:45 PM, Dhanaraj Thakur wrote:
Hi everyone,
Please see details below about an online event CDT is hosting on May 24 at 10am ET. This will follow the upcoming launch of our research report "Lost in Translation: Large Language Models in Non-English Content Analysis." In the meantime please RSVP for our event here <https://www.eventbrite.com/e/mind-the-gap-can-large-language-models-analyze-non-english-content-tickets-631677633807>.
thanks,
Dhanaraj
*Mind the Gap: Can Large Language Models Analyze Non-English Content?*
*Time: *10:00 AM EDT
*Date: *May 24, 2023
From search engines to social media to hiring algorithms, automated systems increasingly shape people’s online experiences worldwide. Despite internet users speaking thousands of languages, most of these systems are primarily trained using English-language data. Computer scientists claim that they have found a solution to this linguistic gap in a new technology called “multilingual language models.” Multilingual language models work similarly to the language models that power new generative systems like ChatGPT, but instead of being trained on millions of examples of text in mostly one language, they pull text from dozens or hundreds of languages and learn connections between them.
But do these multilingual language models work as well as companies say they do? A new technical primer <https://cdt.org/insights/languages-left-behind-automated-content-analysis-in-non-english-languages/> by CDT shows that these systems may have key shortcomings that only compound when used to analyze non-English languages.
This panel will convene NLP researchers building systems and digitizing languages spoken by millions of people in India and South Africa, content policy experts evaluating the impact these systems have on users’ rights, and CDT’s research and policy team members for a deep dive into how these multilingual language models work, what their capabilities and limitations are, how they can be improved, and what’s at stake when these systems fall short.
Speakers:
* Aliya Bhatia <https://cdt.org/staff/aliya-bhatia/>, Center for Democracy & Technology
* Gabriel Nicholas <https://cdt.org/staff/gabriel-nicholas/>, Center for Democracy & Technology
* Dr Monojit Choudhury <https://www.microsoft.com/en-us/research/people/monojitc/>, Turing Institute
* Dr Vukosi Marivate <https://africa.harvard.edu/people/vukosi-marivate>, Masakhane
* Jacqueline Rowe <https://www.gp-digital.org/team/jacqueline-rowe/>, Global Partners Digital
*RSVP here* <https://www.eventbrite.com/e/mind-the-gap-can-large-language-models-analyze-non-english-content-tickets-631677633807>
--
*Dhanaraj Thakur* (he/him) | Research Director
Center for Democracy & Technology | *cdt.org <https://cdt.org/>*
*E:* dthakur@cdt.org | *P:* +1 202 407 8849