Three production deployments. One coming soon. Every metric below is measured by the client, not by us.
Preference ranking across 500 challenging Hindi prompts, spanning urban, rural and code-mixed inputs — for a foundation-model lab's alignment release.
The client had scaled English-origin preference data well, but Hindi response ratings from generalist annotation vendors showed poor inter-annotator agreement (kappa < 0.55) and missed cultural context — responses about rural banking, Ayurveda and local transit were being mis-ranked.
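The agreement figure above is a kappa statistic; the write-up doesn't say which variant or how many raters scored each item. As an illustration only, a minimal two-rater Cohen's kappa — the function name and sample votes here are ours, not the client's pipeline — can be sketched as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e): p_o is observed agreement,
    p_e is the agreement expected by chance from each rater's
    label marginals. Values below ~0.6 are commonly read as
    weak agreement.
    """
    n = len(rater_a)
    assert n == len(rater_b) and n > 0
    # Observed agreement: fraction of items both raters labelled the same.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((counts_a[lab] / n) * (counts_b[lab] / n)
              for lab in set(counts_a) | set(counts_b))
    if p_e == 1.0:  # degenerate case: both raters always used one identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two raters picking the better of two responses across 4 prompts.
votes_a = ["resp1", "resp1", "resp2", "resp2"]
votes_b = ["resp1", "resp2", "resp2", "resp2"]
print(round(cohens_kappa(votes_a, votes_b), 2))  # 0.5
```

With more than two raters per item — as in a reviewer pool like the one described below — a multi-rater variant such as Fleiss' kappa is the usual choice.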
We assembled 24 Hindi-native reviewers (14 urban, 10 rural/small-town), built a 12-point cultural-context rubric, and ran every single task through four QA layers before hand-off.
"TrainPlex's Hindi preference data was the first external dataset our eval pipeline graded above our internal team's work. It shipped."
Parallel evaluation across Hindi, Tamil, Bengali, Telugu and Marathi — per-language error typology for a global AI platform's Indic launch.
The client's Indic eval gave one aggregate score across five languages — useful for marketing, useless for engineering. The aggregate was masking a 9-point Tamil drop versus Hindi, and Bengali nominal forms were failing silently.
We staffed five independent native-speaker pods, ran 100 parallel tasks per language, and delivered a 22-dimension error typology report — not just a score, but a diagnosis.
"The per-language error typology changed how we thought about our Indic launch. We went from one dashboard to five, and our user satisfaction scores followed."
200 adversarial Hindi prompts across five domains — designed to expose LLM blindspots an English-origin eval would miss entirely.
The client had deployed a model that was SOTA on English harm benchmarks but generating unsafe Hindi output in production. Standard jailbreaks didn't translate — but Hindi-specific adversarial patterns did.
We commissioned prompts across legal, medical, finance, education and civic domains — each authored by a credentialed domain expert and adversarially reviewed by a linguistics specialist for script, register and code-mixing attacks.
"We thought we were ready to ship in Hindi. TrainPlex's prompts showed us we weren't — and gave us the exact failure modes to fix before launch."
Multilingual audio annotation at scale. Publishing Q2 2026.