TL;DR
- 01
Launched AI-powered Spanish conversation practice to 4,000 weekly active users with 5x higher engagement than projected, as users sent 100 messages per session instead of the expected 20
- 02
Built a custom real-time voice processing pipeline with GPT-4o mini and 11Labs TTS to serve 175,000 potential users while keeping operational costs sustainable
- 03
Achieved conversational AI quality at scale using DeepEval automated testing and a progressive 3-strike feedback system that balances corrections with learner confidence
The Challenge
Pimsleur needed to bring AI-powered Spanish conversation practice to 175,000 existing learners on their mobile platform. The cost of real-time conversational AI could spiral quickly, the quality had to match Pimsleur's audio-first reputation, and the system needed to integrate with legacy mobile infrastructure that wasn't built for real-time AI interactions.
Most language learning apps avoid this problem entirely. They stick to multiple choice exercises or pre-recorded audio because real conversation is expensive and hard to get right. But Pimsleur's entire methodology centers on audio immersion and speaking practice. Offering AI conversation wasn't a nice-to-have feature. It was the natural evolution of their core product.
Three constraints shaped every technical decision. First, 175,000 Spanish learners represented massive potential usage, and real-time conversational AI APIs from major providers would have made the economics untenable. Second, Pimsleur's brand is built on audio-first methodology, meaning text-to-speech quality wasn't negotiable, especially for Spanish pronunciation nuances. Third, Pimsleur's existing platform wasn't designed for real-time AI interactions, requiring custom middleware that could handle real-time voice processing without major mobile app rewrites.
Key Results
- 01
4,000 weekly active users in the first week of launch
- 02
5x higher engagement than projected (100 messages per user vs. expected 20)
- 03
175,000 potential users served by cost-optimized architecture
- 04
Free tier message allowance increased from 20 to 100 based on engagement data
- 05
80+ internal testers validated personalization approach before launch
The Solution
A Voice Pipeline Built for Cost and Quality
We built a custom voice pipeline instead of using off-the-shelf real-time APIs. The flow is voice to text, text to AI, AI response back to speech, with each step optimized separately. For text-to-speech, we chose a higher-cost voice provider because Spanish pronunciation quality is not a nice-to-have for a language learning product. It is the product. The AI model delivers strong conversational quality at a cost structure that works for 175,000 potential users.
Correcting Errors Without Breaking the Flow
Interrupting a student every time they make a mistake destroys confidence and stops learning. Ignoring every mistake means students develop bad habits. We built a system that tracks errors in real time but waits until a student has made the same type of mistake three times before gently stepping in. This gives learners the chance to catch themselves. When a correction does happen, it comes naturally within the conversation rather than as a formal interruption.
Conversations About What Each User Actually Cares About
Generic conversation topics do not hold people's attention. We built an onboarding system that asks each user about their goals and what they want to be able to do with Spanish. Someone who wants to order food at restaurants gets different conversation scenarios than someone who needs to talk with Spanish-speaking family members. This personalization was the reason engagement hit five times the initial projection during internal testing.
Six Weeks of Discovery Before a Line of Production Code
We spent six weeks in discovery on technology assessment and cost modeling before building anything for production. Clickable prototypes within the first month let Pimsleur's team experience the proposed flow and give feedback before we invested in full design and development. Working this way avoided the expensive rework that comes when requirements only become clear after the code is already written.
Testing That Matches the Reality of Conversational AI
Testing a conversational AI system is fundamentally different from testing standard software. There is no finite set of inputs to check against. We used a testing framework that let us run hundreds of conversation simulations, evaluate AI responses against quality criteria, and identify problems before they reached users. The project was 50 to 60% testing and quality work. That level of investment is what made the final product reliable at scale.
Results
The Full Story
The system launched to 100% of Spanish learners on the Pimsleur platform. First week results showed 4,000 weekly active users engaging with AI conversation practice.
User engagement hit 5x initial projections. During internal testing with 80 people, users were sending 100 messages per session instead of the projected 20, forcing an increase in the free tier message allowance before launch. Users weren't just testing the feature. They were having actual conversations about topics they cared about.
The custom pipeline architecture achieved the cost efficiency needed to make the economics work at scale. GPT-4o mini with custom prompts delivered conversational quality at a sustainable price point. The heavy QA investment prevented quality issues after launch, with the progressive feedback system balancing learning with confidence as intended.
The architecture supports expansion to other languages Pimsleur offers. The personalization system and progressive feedback approach apply across languages, with prompt engineering adapted for each language's specific learning challenges.
Conclusion
Pimsleur went from exploring vague AI conversation concepts to serving 4,000 weekly active users having personalized Spanish conversations. The custom architecture handles real-time voice processing at a cost structure that scales to 175,000 learners. User engagement hit 5x initial projections because personalization and progressive feedback create practice that feels valuable, not gimmicky.
The technical approach proves that conversational AI at scale doesn't require unlimited budgets or perfect infrastructure. It requires deliberate tradeoffs based on what actually matters for the product, heavy QA investment to ensure quality, and architecture that controls costs without sacrificing user experience. As AI conversation features expand to Pimsleur's other language offerings, the patterns established here will scale with them.
Key Insights
- 1
Budget 50-60% of project time for QA when building conversational AI. Standard 40% testing buffers are insufficient because AI behavior requires extensive tuning across thousands of conversation paths that traditional unit tests can't cover.
- 2
Build custom real-time processing pipelines instead of relying on expensive real-time APIs when serving large user bases. A voice to text to AI to text to speech architecture can achieve acceptable latency while controlling operational costs at scale.
- 3
Use clickable prototypes within the first month of discovery to validate concepts before investing in full UI/UX design. Design changes are cheap in prototypes and expensive in production code, especially when integrating with legacy infrastructure.
- 4
Progressive feedback systems balance educational effectiveness with user confidence. A 3-strike approach gives learners space to self-correct before intervention, preventing frustration while maintaining learning outcomes.
- 5
Personalization drives engagement when done right. AI-powered onboarding that captures user goals and generates relevant conversation topics drove 5x higher engagement than generic conversation scenarios.
- 6
Choose technology based on what actually matters for your product, not just cost. 11Labs TTS cost more than alternatives, but Spanish pronunciation quality is non-negotiable for a language learning product.
- 7
Automated QA frameworks like DeepEval enable client collaboration on AI quality validation. Bulk testing and shared evaluation criteria let domain experts provide feedback on AI behavior at scale.
Key Terms
- Spaced Repetition
Spaced repetition is a learning technique that schedules review of material at increasing intervals over time, exploiting the psychological spacing effect to maximize long-term retention with minimal study time.
- Conversational AI Language Tutor
A conversational AI language tutor is an AI system that simulates real dialogue in a target language, providing learners with adaptive speaking practice, pronunciation feedback, and contextual vocabulary instruction at scale.
Implementation Details
Real-Time Voice Processing Pipeline
We built a custom pipeline instead of using expensive real-time APIs. The architecture flows: voice, then text, then AI, then text, then speech. Each step was optimized to keep costs down while maintaining acceptable latency.
Users speak into their mobile device. Audio gets transcribed to text using speech-to-text services. GPT-4o mini processes the transcribed text with custom prompt engineering, generating natural Spanish conversation responses appropriate to the user's level, identifying language errors without being pedantic, and maintaining conversation flow while implementing guardrails. 11Labs TTS Turbo model then converts the AI's Spanish text response back to natural-sounding speech.
We evaluated OpenAI TTS against 11Labs TTS Turbo. 11Labs cost more, but the Spanish pronunciation quality was noticeably superior. For a language learning product, authentic pronunciation isn't a luxury. It's the product. We chose 11Labs despite the higher cost because cutting corners here would undermine the entire feature.
GPT-4o mini became the core model choice because it delivers conversational quality with custom prompt engineering while keeping per-interaction costs manageable. The guardrails were handled through prompt engineering rather than fine-tuning custom models, which was more cost-effective while still preventing users from steering conversations outside language learning.
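The pipeline described above can be sketched as three sequential steps. This is a minimal illustration, not the production code: the helper functions `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for the real speech-to-text, GPT-4o mini, and 11Labs TTS calls, and the stubbed return values exist only so the sketch runs.

```python
# Sketch of the voice -> text -> AI -> text -> speech pipeline.
# All three helpers are stubs standing in for real service calls.

SYSTEM_PROMPT = (
    "You are a friendly Spanish conversation partner. "
    "Match the learner's level, keep the dialogue natural, "
    "and stay on language-learning topics."
)

def transcribe(audio: bytes) -> str:
    """Speech-to-text step (stubbed; a real STT service goes here)."""
    return "Quiero practicar pedir comida en un restaurante."

def generate_reply(user_text: str, history: list) -> str:
    """LLM step (stubbed); the real call would send SYSTEM_PROMPT
    plus the conversation history to GPT-4o mini."""
    return "¡Perfecto! Imagina que estás en un restaurante. ¿Qué pides?"

def synthesize(text: str) -> bytes:
    """Text-to-speech step (stubbed); the real call would use the
    11Labs Turbo model for Spanish pronunciation quality."""
    return text.encode("utf-8")

def handle_turn(audio: bytes, history: list) -> bytes:
    """One conversational turn through the full pipeline."""
    user_text = transcribe(audio)
    reply = generate_reply(user_text, history)
    history.append({"user": user_text, "assistant": reply})
    return synthesize(reply)
```

Keeping the three steps as separate functions is what allowed each one to be optimized and swapped independently, which a bundled real-time API would not permit.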
Progressive 3-Strike Feedback System
Language learning AI faces a unique challenge. Corrections are necessary for learning, but too many corrections destroy confidence. Interrupt every mistake and users quit. Ignore mistakes and they don't improve.
We implemented a progressive 3-strike feedback system that tracks errors in real-time during conversation but doesn't immediately interrupt. When a user makes a language error, the AI notes it but continues the conversation naturally. If the user makes the same type of error a second time, the system still holds back. Only on the third occurrence does the AI provide gentle correction.
This approach gives users the chance to self-correct. Often learners catch their own mistakes when they hear themselves speak. Immediate correction can feel patronizing and interrupt the flow of conversation.
The 3-strike threshold was tuned through testing. Two strikes felt too aggressive. Four strikes let errors become habits. Three strikes hit the balance between giving users space to learn and providing necessary guidance.
When correction happens, the AI delivers it conversationally within the context of the ongoing dialogue, modeling correct usage while keeping the interaction flowing rather than stopping conversation with an explicit correction.
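The strike-counting logic behind this behavior is simple to state in code. The sketch below is an assumed shape, not the production implementation: it counts occurrences per error type (the `error_type` labels would come from the LLM's error identification) and signals a correction only on the third strike.

```python
from collections import Counter

class ProgressiveFeedback:
    """Track error types per conversation; surface a correction
    only on the third occurrence of the same type (3-strike rule)."""

    STRIKE_THRESHOLD = 3

    def __init__(self):
        self.strikes = Counter()

    def record(self, error_type: str) -> bool:
        """Record one error; return True when a gentle correction
        should be woven into the conversation."""
        self.strikes[error_type] += 1
        if self.strikes[error_type] >= self.STRIKE_THRESHOLD:
            self.strikes[error_type] = 0  # reset after correcting
            return True
        return False

fb = ProgressiveFeedback()
fb.record("ser_vs_estar")   # strike 1: let it pass
fb.record("ser_vs_estar")   # strike 2: still hold back
corrected = fb.record("ser_vs_estar")  # strike 3: correct gently
```

Resetting the counter after a correction gives the learner a fresh start on that error type rather than triggering a correction on every subsequent slip.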
Personalized Onboarding and Topic Generation
Generic conversation topics kill engagement. We built an AI-powered onboarding system that conducts an initial conversation with each user, asking about their Spanish learning goals, interests, and what they want to be able to do with the language. These are tracked as can-do statements that drive personalized conversation generation.
A user might say 'I want to order food at restaurants' or 'I need to talk to my Spanish-speaking in-laws.' The system captures these goals and generates conversation scenarios tailored to each user's specific objectives.
This personalization drove the 5x engagement increase. During internal testing with 80 people, users were sending 100 messages instead of the projected 20. The engagement was so much higher than expected that we increased the free tier message allowance from 20 to 100 messages.
The system also tracks progress against can-do statements. As users demonstrate competency in one area, the AI introduces related topics or increases complexity. This creates a learning path that feels organic rather than following a rigid curriculum, and scales to 175,000 potential users with diverse goals without requiring manual content creation for every possible scenario.
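A minimal data model for the can-do mechanism might look like the following. The class and field names are illustrative assumptions; the real system generates full conversation scenarios from these goals via the LLM, which is omitted here.

```python
from dataclasses import dataclass, field

@dataclass
class CanDoStatement:
    goal: str              # e.g. "order food at a restaurant"
    achieved: bool = False

@dataclass
class LearnerProfile:
    statements: list = field(default_factory=list)

    def next_topics(self) -> list:
        """Unachieved goals become the next conversation scenarios."""
        return [s.goal for s in self.statements if not s.achieved]

profile = LearnerProfile([
    CanDoStatement("order food at a restaurant"),
    CanDoStatement("talk with Spanish-speaking in-laws"),
])
profile.statements[0].achieved = True  # competency demonstrated
```

Marking a statement achieved removes it from the topic pool, which is how the learning path advances without a manually authored curriculum.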
Discovery Phase and Technology Validation
We spent 1.5 months in discovery doing iterative technology assessment. This was essential for understanding the cost, quality, and legacy infrastructure constraints that shaped every technical decision.
Clickable prototypes validated concepts within the first month. These weren't full UI/UX designs but functional mockups that let Pimsleur's team experience the proposed user flow and provide feedback before we invested in complete design and development. Design changes are cheap when working with prototypes and expensive when refactoring production code.
We evaluated multiple AI vendors and APIs during discovery, modeling cost scenarios across different usage patterns to ensure the economics would work at scale. A vendor with attractive introductory pricing might become prohibitively expensive at 175,000 users.
Pimsleur brought domain expertise in language learning methodology. We brought technical expertise in AI implementation. The feature set emerged from combining both perspectives.
QA Strategy with DeepEval Automated Testing
Testing conversational AI is fundamentally different from testing traditional software. You can't write unit tests that cover every possible conversation path. We used the DeepEval automated testing framework to ensure consistent quality across diverse user interactions, enabling bulk testing of conversation scenarios and AI response quality at scale.
AI projects require significantly more QA time than traditional development. Our time split was 50-50 or even 60-40 QA-heavy. The standard 40% QA buffer that works for typical features is insufficient for AI behavior tuning because AI behavior isn't deterministic.
DeepEval let us create test scenarios covering common conversation patterns, edge cases, and guardrail violations. We could run hundreds of conversation simulations, evaluate AI responses against quality criteria, and identify issues before they reached users. The framework also enabled Pimsleur's team to review bulk test results and provide feedback on conversation quality, educational effectiveness, and brand alignment.
Testing revealed prompt engineering issues that weren't obvious in initial development. We iterated on prompts based on test results, then ran the test suite again. This cycle continued until conversation quality met standards across the full range of scenarios.
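The bulk-testing loop follows a pattern like the sketch below. A rule-based judge stands in for DeepEval's LLM-backed metrics so the example runs offline; the scenario prompts, the `respond` function, and the evaluation criteria shown are all hypothetical simplifications of the real suite.

```python
# Sketch of a bulk conversation-testing loop. The rule-based judge
# is a stand-in for LLM-backed quality metrics.

def judge(response: str) -> dict:
    """Score one AI response against simple quality criteria."""
    return {
        "nonempty": bool(response.strip()),
        "stays_in_spanish": "lol" not in response.lower(),
    }

def run_suite(scenarios: list, respond) -> list:
    """Run every scenario through the model fn; collect failures."""
    failures = []
    for prompt in scenarios:
        scores = judge(respond(prompt))
        if not all(scores.values()):
            failures.append((prompt, scores))
    return failures

scenarios = ["Pide comida en un restaurante.", "Saluda a tu familia."]
failures = run_suite(scenarios, lambda p: f"Claro, hablemos: {p}")
```

The prompt-iteration cycle described above is just this loop run repeatedly: adjust the prompt, rerun the suite, and stop when `failures` is empty across the full scenario set.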
