FLAME University

Challenges with developing AI applications in Indic languages

www.news9live.com | December 9, 2024

The accessibility gap for providing applications and services to Indic language users has never really been bridged. There are unique challenges when it comes to developing AI tools in Indic languages.

New Delhi: Large Language Models (LLMs) can be understood as autocomplete on steroids, at scale. These generative chatbots are finding widespread applications, streamlining and simplifying how humans process and disseminate information. LLMs are trained on text corpora, which are vast quantities of text on a variety of topics. In English, there are plenty of resources available for training LLMs, including news sites, blogs, forum posts and crowdsourced resources such as Wikipedia articles. When it comes to Indic languages, however, the available resources are far more limited, and companies have to resort to creative methods to obtain the vast quantities of text required for training.

Kaushik Gopalan, Director of the Centre for Interdisciplinary Artificial Intelligence at FLAME University, explains, “Some of the challenges in developing AI applications for Indian languages lie in the relative paucity of training data available in these languages. The digital ecosystem for these languages, despite their richness and diversity, is unfortunately quite limited. Further, the large variety of dialects and regional variations in major Indian languages further complicate the process of training AI models for these languages. As a result, the major commercially available Large Language Models (LLMs) are not as adept in their performance in Indian languages as they are in English and some European languages.”

The complexity of the script poses challenges as well
English uses an alphabet, in which each letter is a distinct unit. Indic languages use a type of script known as an abugida, in which consonants and vowel signs combine into compound written units. Both the common algorithms and the Unicode system for encoding text are oriented towards alphabets rather than abugidas, which also makes developing AI applications for Indic languages challenging. Gopalan explains, “The scarcity of training data constitutes only part of the issues facing AI models in Indian languages. Imitating human language, as LLMs attempt to do, is a complex and resource-intensive venture. It requires first breaking down texts into smaller units, called tokens, where shorter words form a token of their own, whereas words like ‘Honorificabilitudinitatibus’ might be broken down into several tokens. In the context of Indic languages, my colleague, Shagun Dwivedi, has been investigating tokenizers that follow English-centric algorithms and how they impact the performance of a language model in answering fact-based questions in Hindi.”
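
To make the tokenisation point concrete, here is a minimal Python sketch that compares how an English word and a Hindi word look at the encoding level. It is only an illustration and is not taken from the research described above. The function name `encoding_stats` and the choice of example words are assumptions made here. The byte count is a rough upper bound on how many tokens a byte-level tokenizer would produce for a word it has learned no merges for.

```python
import unicodedata

# The Hindi word for "Hindi", written with explicit code points so the
# example does not depend on how this file was saved or rendered.
HINDI_WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"
ENGLISH_WORD = "Hindi"


def encoding_stats(word):
    """Return the number of Unicode code points and UTF-8 bytes in a word."""
    return {
        "code_points": len(word),
        "utf8_bytes": len(word.encode("utf-8")),
    }


def describe(word):
    """List the official Unicode name of every code point in a word."""
    return [unicodedata.name(ch) for ch in word]


if __name__ == "__main__":
    for word in (ENGLISH_WORD, HINDI_WORD):
        print(word, encoding_stats(word))
        for name in describe(word):
            print("   ", name)
```

In this example the English spelling takes 5 code points and 5 bytes, while the Devanagari spelling takes 6 code points and 18 bytes, because each Devanagari code point needs three bytes in UTF-8. A tokenizer whose vocabulary was learned mostly from English text may therefore need many more tokens for a Hindi word than for an English word of similar length.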

One possible approach to improving efficiency, reducing environmental impact, and developing AI technologies for Indic languages is to use different algorithms and tokenising approaches. Gopalan says, “Commercial LLMs have advanced their linguistic skills significantly in languages such as English, French, or German, which are written in the alphabetic Roman script. However, for Indic languages such as Hindi, which use a more complex system of writing (called abugida) with inherent vowels, conjunct characters, and diacritic forms of vowels, LLMs have a long way to go before they achieve a similar level of proficiency. Shagun’s work suggests that tokenizers using an English-centric algorithm led to poor downstream performance in Hindi; they also prove to be less computationally efficient and lead to erroneous language generation when compared to models with more representative tokenizers. We are now exploring tokenizers tailored specifically to Indic scripts based on these insights.”
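
As a rough illustration of what a script-aware approach could look like, the Python sketch below groups a Devanagari word into orthographic syllables, keeping vowel signs and conjunct consonants together instead of splitting them apart. This is a simplified rule written for this example, not the method used by the FLAME researchers. The function name `split_aksharas` is hypothetical, and the rule ignores cases such as zero-width joiners and mixed-script text.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama, which joins consonants into a conjunct


def split_aksharas(word):
    """Split a Devanagari word into simplified orthographic syllables.

    A character is attached to the current cluster when it is a combining
    mark (vowel sign, virama, nukta and so on) or when the previous
    character was a virama, which means a conjunct is still being built.
    Any other character starts a new cluster.
    """
    clusters = []
    prev = ""
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_previous = is_mark or prev == VIRAMA
        if clusters and joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
        prev = ch
    return clusters


if __name__ == "__main__":
    hindi_word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi"
    print(split_aksharas(hindi_word))
```

For the six code points of the Hindi word for "Hindi", this rule yields two clusters: the first consonant with its vowel sign, and the conjunct of the next two consonants with its vowel sign. Using units like these as the starting point for tokenisation, rather than individual bytes, is one way a tokenizer might better reflect how the script is actually read.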

In conversation with News 9: Prof. Kaushik Gopalan, Faculty of Computer Science & Director of the Centre for Interdisciplinary Artificial Intelligence, FLAME University. 


(Source: https://www.news9live.com/science/challenges-with-developing-ai-applications-in-indic-languages-2768894)