Re: [Air-L] New CDT report + event tomorrow May 24, 2023 (10 am ET) - Can Large Language Models Analyze Non-English Content?

23 May 2023

Hi everyone,

We are excited to announce the publication of our new CDT research 
report, “Lost in Translation: Large Language Models in Non-English 
Content Analysis”
<https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/>.

The report explains the capabilities of a new AI technology called 
“multilingual language models” that technology companies claim can 
understand content in over 100 languages by extrapolating linguistic 
patterns from high-resource languages. We further describe how these 
models work, and argue that they have significant limitations 
<https://cdt.org/press/cdt-finds-key-shortcomings-when-large-language-models-analyze-non-english-languages/>, 
particularly in “low-resource languages” — languages for which AI 
developers have little text data available to train AI models, 
regardless of the number of speakers around the world.

Companies, researchers, civil society advocates, and policymakers should 
be aware of these limitations, as they can create real barriers to 
information access and equitable online participation for individuals. 
We also offer guidance on how to help close the gap between companies’ 
ability to moderate content in English versus the world’s other 7,000 
languages.

The full report is available on CDT’s website, along with executive 
summaries 
<https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/> 
in Spanish, French, and 
Arabic <https://cdt.org/press/cdt-finds-key-shortcomings-when-large-language-models-analyze-non-english-languages/>.
Tomorrow, we’ll discuss the paper at an event called “Mind the Gap” 
<https://cdt.org/event/mind-the-gap-can-large-language-models-analyze-non-english-content/> (see
below for more details) — we hope you can join us!

Finally, we have an article out in WIRED 
<https://www.wired.com/story/content-moderation-language-artificial-intelligence/> about
how social media companies specifically use multilingual language models 
to moderate content in languages other than English.

Feel free to share, and let us know if you have any questions or feedback.

take care,

Dhanaraj

On 5/10/23 4:45 PM, Dhanaraj Thakur wrote:
...
Hi everyone,
Please see details below about an online event CDT is hosting on May 
24 at 10am ET. This will follow the upcoming launch of our research 
report "Lost in Translation: Large Language Models in Non-English 
Content Analysis." In the meantime please RSVP for our event here 
<https://www.eventbrite.com/e/mind-the-gap-can-large-language-models-analyze-non-english-content-tickets-631677633807>.
thanks,
Dhanaraj
*Mind the Gap: Can Large Language Models Analyze Non-English Content?*
*Time: *10:00 AM EDT
*Date: *May 24, 2023
From search engines to social media to hiring algorithms, automated 
systems increasingly shape people’s online experiences worldwide. 
Despite internet users speaking thousands of languages, most of these 
systems are primarily trained using English-language data. Computer 
scientists claim that they have found a solution to this linguistic 
gap in a new technology called “multilingual language models.” 
Multilingual language models work similarly to the language models 
that power new generative systems like ChatGPT, but instead of being 
trained on millions of examples of text in mostly one language, they 
pull text from dozens or hundreds of languages and learn connections 
between them.
But do these multilingual language models work as well as companies 
say they do? A new technical primer 
<https://cdt.org/insights/languages-left-behind-automated-content-analysis-in-non-english-languages/> by
CDT shows that these systems may have key shortcomings which only 
compound when used to analyze non-English languages.
This panel will convene NLP researchers building systems and 
digitizing languages spoken by millions of people in India and South 
Africa, content policy experts evaluating the impact these systems 
have on users’ rights, and CDT’s research and policy team members for 
a deep dive into how these multilingual language models work, what 
their capabilities and limitations are, how they can be improved, and 
what’s at stake when these systems fall short.
Speakers:
  * Aliya Bhatia <https://cdt.org/staff/aliya-bhatia/>, Center for
    Democracy & Technology
  * Gabriel Nicholas <https://cdt.org/staff/gabriel-nicholas/>, Center
    for Democracy & Technology
  * Dr Monojit Choudhury
    <https://www.microsoft.com/en-us/research/people/monojitc/>,
    Turing Institute
  * Dr Vukosi Marivate
    <https://africa.harvard.edu/people/vukosi-marivate>, Masakhane
  * Jacqueline Rowe
    <https://www.gp-digital.org/team/jacqueline-rowe/>, Global
    Partners Digital
*RSVP here* 
<https://www.eventbrite.com/e/mind-the-gap-can-large-language-models-analyze-non-english-content-tickets-631677633807>
--
*Dhanaraj Thakur* (he/him) | Research Director
Center for Democracy & Technology | *cdt.org <https://cdt.org/>*
*E:* dthakur@cdt.org | *P:* +1 202 407 8849