Indian AI startup Sarvam AI has released
OpenHathi-Hi-v0.1, the first Hindi large language model (LLM) in the OpenHathi
series.
The model is built on Meta AI’s
Llama2-7B architecture and, according to Sarvam AI, delivers performance on par
with GPT-3.5 for Indic languages.
The model extends Llama2-7B’s tokenizer
to a 48,000-token vocabulary and undergoes a two-phase training process.
The first phase is embedding
alignment, which aligns the randomly initialised Hindi embeddings. The second
phase is bilingual language modelling, where the model is trained to attend
cross-lingually across tokens.
“We show that our model works as well
as, if not better than, GPT-3.5 on various Hindi tasks while maintaining its
English performance,” the company said in a post on X.
The company said it reviewed the model’s
performance on real-world tasks beyond standard Natural Language Generation
(NLG) tasks.
The five-month-old AI startup partnered
with KissanAI to fine-tune its base model on conversational data KissanAI
collected. The dataset comprises conversations between a GPT-powered bot and
farmers in different languages.
“The first step in adding Hindi skills
to Llama-2 is decreasing the fertility score (the average number of tokens a
word is split into) of its tokeniser on Hindi text. This would make both
training and inferencing faster and more efficient,” the company said in a blog
post.
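The fertility score the company describes can be computed directly. Below is a minimal sketch of that calculation; the `toy_tokenize` function is a purely illustrative stand-in for a real subword tokenizer such as the SentencePiece model Sarvam AI trained, and is not part of their release.

```python
def toy_tokenize(word):
    """Hypothetical subword tokenizer: splits a word into chunks of up to
    3 characters. A real tokenizer would use a learned subword vocabulary."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def fertility_score(text, tokenize):
    """Average number of tokens each whitespace-separated word is split into."""
    words = text.split()
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

# "namaste" -> 3 chunks, "duniya" -> 2 chunks, so fertility is 2.5
print(fertility_score("namaste duniya", toy_tokenize))
```

A tokenizer with a larger Hindi vocabulary splits words into fewer pieces, lowering this score and shortening sequences, which is why training and inference get faster.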
“We train a sentence-piece tokeniser
from a subsample of 100K documents from the Sangraha corpus, created at
AI4Bharat, with a vocabulary size of 16K. We then unite this with the Llama2
tokeniser and create a new tokeniser with a 48K vocabulary (32K original
vocabulary plus our added 16K),” it added.
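The vocabulary merge described above (32K base plus 16K added) can be sketched as a simple union that assigns fresh ids to new tokens. The tiny dictionaries here are illustrative stand-ins, not the actual Llama2 or Sangraha-trained vocabularies, and the merge logic is an assumption about the general technique rather than Sarvam AI's exact procedure.

```python
def merge_vocabs(base_vocab, extra_vocab):
    """Append tokens from extra_vocab that are absent from base_vocab,
    giving them new ids after the base vocabulary's id range."""
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1
    for token in extra_vocab:
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged

base = {"<s>": 0, "</s>": 1, "the": 2}    # stand-in for Llama2's 32K vocab
hindi = {"नम": 0, "स्ते": 1, "the": 2}      # stand-in for the new 16K Hindi vocab
merged = merge_vocabs(base, hindi)
print(len(merged))  # 5: the duplicate token "the" is kept only once
```

Because overlapping tokens are deduplicated, the combined vocabulary can come in slightly under the sum of the two sizes; the article's 48K figure treats the two sets as disjoint.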
Sarvam AI, founded in July 2023 by Vivek
Raghavan and Pratyush Kumar, secured $41 million in a funding round earlier
this month. Lightspeed Ventures led the investment, with participation from
Peak XV Partners and Khosla Ventures.
Agency