Articles

Indian Datasets

Natural Language Processing

Open Source AI

Speech Recognition

IIT Madras, AI4Bharat, and Sarvam AI launch IndicVoices: A milestone in Indian speech recognition

IIT Madras, AI4Bharat, and Sarvam AI have launched IndicVoices, a 12,000-hour multilingual speech dataset covering 22 Indian languages and 208 districts. Accompanied by IndicASR, the first ASR model supporting all 22 official Indian languages, the initiative is open-sourced under CC-BY-4.0, offering a global blueprint for multilingual speech data collection and advancing inclusive AI development.

Kallakuri Radhakrishna

1 min read

Big Data

Hugging Face

LLM

Synthetic Data

Cosmopedia: Redefining the synthetic data landscape with the largest open dataset

Cosmopedia v0.1, hosted on Hugging Face, is the largest open synthetic dataset, with 30 million samples and 25 billion tokens generated by Mixtral-8x7B-Instruct-v0.1. It includes synthetic textbooks, blog posts, stories, and WikiHow articles across eight dataset splits. Designed to democratize AI research, it supports NLP, model training, and scalable AI development with rich metadata and diverse content.

Kallakuri Radhakrishna

1 min read

AI4Bharat

Indian Datasets

Indic Languages

Natural Language Processing (NLP)

Open Source AI

Speech Translation

Synthetic Data

AI4Bharat unveils BhasaAnuvaad: Speech translation dataset in 13 languages

AI4Bharat launches BhasaAnuvaad, the largest speech translation dataset for Indian languages, covering 44,400 hours of audio across 13 languages including Hindi, Tamil, Telugu, and Bengali. It tackles India-specific challenges like code-switching and dialectal diversity. A synthetic benchmark, Indic-Spontaneous-Synth, is also introduced to test real-world translation model robustness.

Kallakuri Radhakrishna

1 min read

AI Research

Indic AI

LLM

Natural Language Processing

Open Source AI

Advancing Telugu NLP: Telugu LLM Labs with native and romanized datasets

Telugu LLM Labs, led by researchers from LlamaIndex, is advancing NLP for Telugu, a language with 100M+ speakers that has historically been underrepresented in AI. The initiative creates open datasets in both native and romanized Telugu scripts and fine-tunes LLMs such as Llama 2, Mistral, and TinyLlama, setting a precedent for other regional Indian languages in AI development.

Kallakuri Radhakrishna

1 min read
