Indian AI startup Sarvam AI has released
OpenHathi-Hi-v0.1, the first Hindi large language model (LLM) in the OpenHathi
series.
The model is built on Meta AI’s
Llama2-7B architecture and, according to Sarvam AI, delivers performance on par
with GPT-3.5 for Indic languages.
The model extends Llama2-7B’s tokenizer
to a 48,000-token vocabulary and undergoes a two-phase training process.
The first phase is embedding
alignment, which aligns the randomly initialised Hindi embeddings. The second
phase is bilingual language modelling, where the model is trained to attend
cross-lingually across tokens.
“We show that our model works as well
as, if not better than, GPT-3.5 on various Hindi tasks while maintaining its
English performance,” the company said in a post on X.
The company said it reviewed the model’s
performance on real-world tasks beyond standard Natural Language Generation
(NLG) tasks.
The five-month-old AI startup partnered
with KissanAI to fine-tune its base model on conversational data KissanAI
collected. The dataset comprises conversations between a GPT-powered bot and
farmers in different languages.
“The first step in adding Hindi skills
to Llama-2 is decreasing the fertility score (the average number of tokens a
word is split into) of its tokeniser on Hindi text. This would make both
training and inferencing faster and more efficient,” the company said in a blog
post.
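The fertility score the company describes can be computed directly. Below is a minimal sketch of that calculation; the `toy_tokenize` function is a purely illustrative stand-in for a real subword tokenizer such as the SentencePiece model Sarvam AI trained, and is not part of their release.

```python
def toy_tokenize(word):
    """Hypothetical subword tokenizer: splits a word into chunks of up to
    3 characters. A real tokenizer would use a learned subword vocabulary."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def fertility_score(text, tokenize):
    """Average number of tokens each whitespace-separated word is split into."""
    words = text.split()
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

# "namaste" -> 3 chunks, "duniya" -> 2 chunks, so fertility is 2.5
print(fertility_score("namaste duniya", toy_tokenize))
```

A tokenizer with a larger Hindi vocabulary splits words into fewer pieces, lowering this score and shortening sequences, which is why training and inference get faster.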
“We train a sentence-piece tokeniser
from a subsample of 100K documents from the Sangraha corpus, created at
AI4Bharat, with a vocabulary size of 16K. We then unite this with the Llama2
tokeniser and create a new tokeniser with a 48K vocabulary (32K original
vocabulary plus our added 16K),” it added.
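The vocabulary merge described above (32K base plus 16K added) can be sketched as a simple union that assigns fresh ids to new tokens. The tiny dictionaries here are illustrative stand-ins, not the actual Llama2 or Sangraha-trained vocabularies, and the merge logic is an assumption about the general technique rather than Sarvam AI's exact procedure.

```python
def merge_vocabs(base_vocab, extra_vocab):
    """Append tokens from extra_vocab that are absent from base_vocab,
    giving them new ids after the base vocabulary's id range."""
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1
    for token in extra_vocab:
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged

base = {"<s>": 0, "</s>": 1, "the": 2}    # stand-in for Llama2's 32K vocab
hindi = {"नम": 0, "स्ते": 1, "the": 2}      # stand-in for the new 16K Hindi vocab
merged = merge_vocabs(base, hindi)
print(len(merged))  # 5: the duplicate token "the" is kept only once
```

Because overlapping tokens are deduplicated, the combined vocabulary can come in slightly under the sum of the two sizes; the article's 48K figure treats the two sets as disjoint.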
Sarvam AI, founded in July 2023 by Vivek
Raghavan and Pratyush Kumar, secured $41 million in a funding round earlier
this month. Lightspeed Ventures led the investment, with participation from
Peak XV Partners and Khosla Ventures.
Agency