Government Of India
Indian Datasets
Natural Language Processing
Open Source AI
Speech Recognition
IIT Madras, AI4Bharat, and Sarvam AI launch IndicVoices: A milestone in Indian speech recognition
IIT Madras, AI4Bharat, and Sarvam AI have launched IndicVoices, a 12,000-hour multilingual speech dataset covering 22 Indian languages and 208 districts. It is accompanied by IndicASR, the first ASR model supporting all 22 official Indian languages. The initiative is open-sourced under CC-BY-4.0, offers a global blueprint for multilingual speech data collection, and advances inclusive AI development.
Kallakuri Radhakrishna
  • Read Time: 1 min read
Big Data
Huggingface
LLM
Synthetic Data
Cosmopedia: Redefining the synthetic data landscape with the largest open dataset
Cosmopedia v0.1, hosted on Hugging Face, is the largest open synthetic dataset, with 30 million samples and 25 billion tokens generated by Mixtral-8x7B-Instruct-v0.1. It includes synthetic textbooks, blog posts, stories, and WikiHow-style articles across eight dataset splits. Designed to democratize AI research, it supports NLP, model training, and scalable AI development with rich metadata and diverse content.
Kallakuri Radhakrishna
  • Read Time: 1 min read
AI4Bharat
Indian Datasets
Indic Languages
natural language processing (NLP)
Open Source AI
Speech translation
Synthetic Data
AI4Bharat unveils BhasaAnuvaad: Speech translation dataset in 13 languages
AI4Bharat launches BhasaAnuvaad, the largest speech translation dataset for Indian languages, covering 44,400 hours of audio across 13 languages including Hindi, Tamil, Telugu, and Bengali. It tackles India-specific challenges like code-switching and dialectal diversity. A synthetic benchmark, Indic-Spontaneous-Synth, is also introduced to test real-world translation model robustness.
Kallakuri Radhakrishna
  • Read Time: 1 min read
AI Research
Indic AI
LLM
Natural Language Processing
Open Source AI
Advancing Telugu NLP: Telugu LLM Labs with native and romanized datasets
Telugu LLM Labs, led by researchers from LlamaIndex, is advancing NLP for Telugu, a language with 100M+ speakers that has historically been underrepresented in AI. The initiative creates open datasets in both native and Romanized Telugu scripts and fine-tunes LLMs such as Llama 2, Mistral, and TinyLlama, setting a precedent for other regional Indian languages in AI development.
Kallakuri Radhakrishna
  • Read Time: 1 min read