Custom AI Voice Agent Development Guide: Costs, Features & ROI

Every AI voice agent starts with one decision, and it gets made before anyone writes a line of code. A voice agent might answer a customer call, book an appointment, remind a patient about medication, qualify an inbound lead, or log a claims notification. Behind each of those moments sits an architecture choice. Do you use a managed voice platform, configure a Contact Centre as a Service product, or build a custom agent on a composable stack? That single choice decides what the agent can do, how well it does it, what it costs to run at scale, and how easily it adapts as the business changes.

It also decides something bigger. You either own a strategic capability or you rent a commodity. This guide is written for teams that have already decided custom AI voice agent development is right for their use case. We believe you deserve to know exactly what you are buying, building, and paying for, and how Mobisoft makes it happen.

A few numbers frame the whole guide. The defining choice is custom versus platform. Managed products like Vapi.ai, Bland.ai, and Retell AI deploy fast, in weeks, though they limit customisation. Custom builds on composable stacks such as Twilio, Deepgram, an LLM, and ElevenLabs give you control over every component, at the cost of a longer build measured in months. A production-ready custom build in 2026 typically runs £30,000 to £150,000. That range reflects use case complexity, integration count, conversation depth, and compliance needs. A production agent usually has six to eight engineering components, from the telephony layer through to monitoring, and each one is a separate surface to build and maintain. Mobisoft delivers this as a fixed-scope MVP, roughly 12 to 16 weeks from first conversation to a live agent, across three stages we call Discover, Build, and Optimise.

Quick Answer on Custom AI Voice Agent Development and What It Includes

Custom AI voice agent development means building an AI voice agent on a composable technology stack rather than configuring a managed platform. You own the code. You own the conversation design. You own the integration layer, and you control every component decision.

A custom build usually covers ten areas of work.

Architecture design: Component selection across STT, LLM, TTS, and telephony, plus hosting, data residency, latency tuning, redundancy, and failover.
Conversation design: Call flows for each use case, intent handling, escalation triggers, persona and voice choice, and multi-language setup.
Component integration: STT APIs like Deepgram or AssemblyAI, an LLM such as GPT-4o, Claude, or Gemini, or an on-premise open-source model, and TTS from ElevenLabs, Deepgram Aura, or Azure Neural.
Telephony work: This is your AI telephony integration, covering PSTN access through Twilio, Vonage, or Plivo, WebRTC for browser calling, SIP trunking for PBX, and DTMF fallback.
Business system integration: The tool layer connects CRM platforms like Salesforce, HubSpot, and Zoho, EHRs such as Epic FHIR and eClinicalWorks, booking tools, payment processors like Stripe and Braintree, ticketing systems, and ERP.
Knowledge base and RAG: Document ingestion, vector embedding, and semantic retrieval so answers stay grounded and accurate.
Quality assurance: Conversation testing, edge case simulation, CSAT benchmarks, and latency profiling.
Compliance engineering: TCPA and Ofcom outbound rules, FCA Consumer Duty, HIPAA for healthcare, EU AI Act transparency, and GDPR data handling.
Deployment and monitoring: Production hosting, a real-time dashboard, conversation analytics, escalation tracking, and cost monitoring.
Post-launch optimisation: Prompt iteration, knowledge base updates, A/B testing of conversation variants, and a containment rate improvement programme.

The headline figures are easy to remember. A production MVP costs £30,000 to £150,000. Enterprise builds run £150,000 to £500,000 and beyond. A well-scoped build reaches production in 12 to 18 weeks. Payback usually lands between 2 and 6 months once you pass 5,000 calls a month.

Custom AI Voice Agent vs Managed Platform and When Custom Is Right

The first decision in any voice project is not which STT or LLM to pick. It is whether to build custom at all. Managed platforms have matured a lot through 2025 and 2026. Vapi.ai, Bland.ai, Retell AI, and Twilio ConversationRelay handle telephony, transcription, the model layer, speech output, and basic tool calling with very little configuration. They can ship a working agent for a well-scoped use case in one to four weeks. For a first voice project, or for a use case that fits inside the platform's design limits, managed is often the right call.

Custom becomes the right answer when the platform's limits start to cost you. Those limits are predictable. Model choice is restricted, so you cannot swap in a better or cheaper LLM. Integration depth is capped, so the platform's tool calling struggles with multi-step, multi-system transactions. Voice customisation is bounded, so a truly distinctive persona stays out of reach. Data residency is fixed, since the platform processes your call audio and transcripts on its own infrastructure, which raises GDPR, HIPAA, and national security questions. And per-minute fees keep adding up at scale, while a custom build pays down its engineering cost over time.

Many teams arrive here gradually. Their conversational AI development starts on a managed platform to prove the idea, then moves custom as volume and ambition grow.

If voice is one piece of a wider plan, our AI agents for enterprise work shows how it links to the rest of your service operation.

The Custom Build vs Managed Platform Decision Matrix

Here is the decision laid out across the dimensions that matter most.

Dimension	Managed platform	Custom build	Choose custom when
Time to production	1 to 4 weeks, config not code	12 to 18 weeks, full build	Quality and long-term cost matter more than launch date
LLM flexibility	Platform-defined models	Any model, any version, routing, on-premise	You need specific model capability or cost routing
Integration depth	Standard REST tool calls	Any pattern, multi-step chains, legacy bridges	Logic needs multi-system transactions or legacy systems
Conversation control	Configurable prompts and voice	Custom barge-in, turn-taking, voice adaptation	Behaviour must go beyond configurable prompts
Voice persona	Platform voice library	Custom clone, per-segment voices	Brand voice is a strategic differentiator
Data residency	Processed on platform infrastructure	Full control, your region or on-premise	HIPAA, strict GDPR, or national security rules
Long-term cost	Per-minute fee on top of components	Component cost only, amortised	Volume tops 20,000 calls a month
Strategic ownership	Vendor dependency	Own code, design, and integrations	Voice is a differentiator, not a commodity

The Custom AI Voice Agent Architecture and Its Core Components

A custom voice agent is a real-time audio pipeline where latency stacks up. Every millisecond added at each step widens the gap between the caller finishing a sentence and hearing a reply. So the design of each component matters. Below sits the AI voice architecture in full, the core components, what each one adds to latency, and the engineering call that defines quality, cost, and behaviour.

Telephony layer: Handles the PSTN call, answering inbound, placing outbound, capturing audio, and playing back speech. Options include Twilio with its Media Streams WebSocket, Vonage with strong EU presence, Plivo for cost and APAC reach, SignalWire for self-hosted FreeSWITCH, and LiveKit for lower-latency browser calls. Call setup takes one to two seconds, with 20 to 100ms per audio chunk. The key choice is chunk size, since 20ms versus 100ms affects both STT accuracy and latency.
Speech-to-text engine: Converts caller audio into text in real time, word by word, with end-of-utterance detection. Deepgram Nova-3 gives the best latency near 150ms and suits conversational work. AssemblyAI brings strong accuracy and disfluency handling for noisy settings. Google STT and Azure STT add HIPAA BAA coverage and medical vocabulary, and OpenAI Whisper leads on non-English accuracy but runs batch only. STT adds 350 to 800ms in total once end-of-utterance detection is included. End-of-utterance detection is the hardest call here, since too sensitive cuts the caller off and too slow creates awkward pauses. This is your speech-to-text integration layer, and Deepgram's tunable VAD plus custom vocabulary handle it well.
Conversation orchestration layer: The real-time state machine that manages turns and LLM orchestration. It takes the transcript, decides when to call the model, handles barge-in, holds context across turns, and coordinates the response. Build it as a custom Node.js or Go service for maximum control, or use the LiveKit Agents framework or the OpenAI Realtime API for end-to-end speech with GPT-4o. Orchestration adds 20 to 50ms. Barge-in handling is the most important UX decision, since you either let callers interrupt mid-response for a natural feel or suppress it so the agent finishes. Most production systems allow it.
LLM inference: Reads intent and generates the response, deciding any action. Pick the model by need. GPT-4o-mini gives the best cost-performance for most agents, Claude Haiku brings strong tool use and low latency, Gemini Flash is competitive and multimodal, and Llama 3 70B runs on-premise for data residency. First-token latency runs 200 to 700ms, though streaming lets playback start after a few tokens. This is also where function calling lives, the moment the model decides to trigger a CRM lookup or a booking check.
Text-to-speech engine: Turns the response into natural audio. Deepgram Aura gives the lowest latency at a 120ms first chunk, ElevenLabs leads on quality and voice cloning, and Azure Neural offers HIPAA BAA and SSML control. TTS adds 120 to 300ms before the caller hears speech. Strong text-to-speech integration also caches repeated phrases, like greetings, to save cost.
Tool execution and auxiliary components: The tool layer runs business API calls, while auxiliary parts cover the knowledge base and RAG, escalation logic, and monitoring and analytics. Tool calls take 100 to 1,000ms depending on the system, and parallel calls cut that on multi-tool turns. Each auxiliary piece is its own engineering surface to integrate and maintain.

Getting all six right is specialist work, which is one reason teams partner with an experienced AI services company rather than assembling the pipeline alone.
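The end-of-utterance trade-off described for the speech-to-text engine can be sketched in a few lines. This is a minimal illustration, assuming a VAD that emits one speech or silence decision per 20ms frame; the `Endpointer` class, its thresholds, and the sample frames are hypothetical and are not any provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class Endpointer:
    """Declares end of utterance after `silence_ms` of silence that follows speech."""
    silence_ms: int = 300          # assumed threshold, inside the 200 to 500ms range
    frame_ms: int = 20             # assumed audio frame size
    _heard_speech: bool = field(default=False, init=False)
    _silent_for: int = field(default=0, init=False)

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD decision per frame; returns True when the turn has ended."""
        if is_speech:
            self._heard_speech = True
            self._silent_for = 0
            return False
        if not self._heard_speech:
            return False           # leading silence: the caller has not started yet
        self._silent_for += self.frame_ms
        if self._silent_for >= self.silence_ms:
            self._heard_speech = False
            self._silent_for = 0
            return True
        return False


def first_endpoint(frames, silence_ms):
    """Index of the frame at which the endpointer first fires, or None."""
    endpointer = Endpointer(silence_ms=silence_ms)
    for index, is_speech in enumerate(frames):
        if endpointer.push(is_speech):
            return index
    return None


# Illustrative call: 200ms of speech, a 240ms mid-sentence pause, 200ms more speech, then silence.
frames = [True] * 10 + [False] * 12 + [True] * 10 + [False] * 30

eager = first_endpoint(frames, silence_ms=200)    # fires inside the mid-sentence pause
patient = first_endpoint(frames, silence_ms=400)  # waits for the real end of the turn
print(eager, patient)
```

With the 200ms threshold the sketch fires during the caller's pause, which is the cut-off problem. With 400ms it waits for the true end of the turn, at the price of a longer gap before the agent can reply.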

Total Latency and the End-to-End Response Time

The latency a caller feels is the sum of each step. Here is a typical turn, from the moment they stop speaking to the first audio reply.

End-of-utterance detection (STT VAD): 200 to 500ms. Faster settings risk more false cuts.
STT transcript delivery: 150 to 300ms. Streaming sends words as they arrive, trimming the total.
Orchestration processing: 20 to 50ms. Turn management, barge-in check, state update, and message prep.
LLM first token: 200 to 700ms. Streaming means playback can begin early.
Tool call execution, if needed: 100 to 1,000ms. Parallel calls reduce multi-tool turns.
Second LLM call after tool results, if a tool ran: 200 to 500ms.
TTS first audio chunk: 120 to 300ms. The caller hears the agent after this point.

Add it up: With no tool call, a turn runs 690ms to 1,850ms at the median, and around 3.4 seconds at the 95th percentile. With one tool call, it runs 990ms to 3,350ms at the median, and around 4.5 seconds at the 95th percentile. Perceived latency is lower than the maths suggests. Streaming speech helps, and a short filler sound, a brief "mmm" or "one moment" at the start of a tool call, cuts the felt delay a lot. It is standard practice in production agents.
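The budget can be checked by summing the low and high ends of each stage listed above. The sketch below does only that arithmetic; the dictionary keys are illustrative names, and the ranges are the ones quoted in the list.

```python
# Per-stage latency ranges in milliseconds, as (low, high), taken from the list above.
STAGES_MS = {
    "end_of_utterance": (200, 500),
    "stt_transcript": (150, 300),
    "orchestration": (20, 50),
    "llm_first_token": (200, 700),
    "tts_first_chunk": (120, 300),
}

# Extra stages that apply only when the turn triggers a tool call.
TOOL_STAGES_MS = {
    "tool_call": (100, 1000),
    "second_llm_call": (200, 500),
}


def budget(*stage_groups):
    """Sum the low and high ends across every stage in the given groups."""
    low = sum(lo for group in stage_groups for lo, _ in group.values())
    high = sum(hi for group in stage_groups for _, hi in group.values())
    return low, high


no_tool = budget(STAGES_MS)
with_tool = budget(STAGES_MS, TOOL_STAGES_MS)
print(no_tool)    # (690, 1850)
print(with_tool)  # (990, 3350)
```

Keeping the budget as data makes it easy to re-run when a component changes, for example after swapping the TTS provider or tightening the end-of-utterance threshold.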


Feature Specification for Core and Advanced Capabilities

The feature set gets defined during discovery, and it drives both complexity and cost. It helps to split features into three groups. Core features are required for any production agent. Standard advanced features serve specific use cases. Premium advanced features create a competitive moat, the capabilities rivals cannot quickly copy. Sorting them this way lets product leaders scope the build correctly and sequence work for the best return. Get the sequence right and early features fund the later ones.

Core Features for Any Production Agent

Core features form the backbone of reliable AI-powered customer service, and every production agent needs them.

Natural language understanding and intent classification: The agent works out what the caller wants, even when the request is vague, partial, or changes topic. Complexity is low, since the LLM handles it with good prompt design and a clear intent taxonomy. Without it, the agent either answers everything poorly or sticks to a narrow script, and neither works commercially.
Multi-turn context management: The agent remembers what was said earlier in the call and builds on it, rather than treating each utterance alone. Complexity is medium, covering history management, context compression for long calls, and a state machine for what is confirmed or pending. Without it, every turn restarts from zero, and callers repeat themselves, which drives the most common complaint of all.
Caller identification and CRM lookup: On answer, the agent matches caller ID to a CRM record and pulls the account before the conversation starts. Complexity is low to medium, with CRM API calls and graceful handling of unrecognised numbers. Without it, every call starts cold, and the agent cannot personalise or retrieve account details without a slow manual check.
Graceful escalation to humans: When the agent cannot resolve an issue, detects strong negative sentiment, or hears a human request, it transfers the caller with full context. Complexity is medium, using warm transfer config, a context package, and ACD or queue integration. Without it, callers needing help get stuck, dropped, or pushed to a separate number, all of which wreck CSAT.
Interruption handling, or barge-in: The caller can speak while the agent talks, and the agent stops, processes the interruption, and responds. Complexity is medium, tuning VAD to tell real speech from background noise. Without it, callers wait through the whole reply, which feels robotic and signals IVR rather than conversation.
DTMF fallback for keypad input: For noisy places, speech difficulties, or moments when voice will not do, the agent accepts numeric keypad input instead. Complexity is low, mapping keypad presses to intent. Without it, callers in noisy spots are excluded, and accessibility compliance can slip for public services.
Post-call logging and analytics: Every call is logged with transcript, duration, intent, tools used, escalation status, and outcome. Complexity is medium, with structured logging, encrypted recording storage, and an analytics dashboard. Without it, you cannot improve the agent, prove compliance, or calculate real ROI.
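The escalation feature above combines three triggers with a context hand-off. A minimal sketch follows, assuming a sentiment score between -1.0 and 1.0 and a simple phrase match for human requests. The `CallState` fields, the thresholds, and the phrase list are hypothetical choices for illustration, not a prescribed design.

```python
from dataclasses import dataclass, field


@dataclass
class CallState:
    caller_id: str
    intent: str
    sentiment: float                 # assumed scale: -1.0 (very negative) to 1.0 (very positive)
    failed_attempts: int = 0         # turns where the agent could not resolve the request
    transcript: list = field(default_factory=list)


# Hypothetical phrases; a production agent would more likely use an intent classifier.
HUMAN_REQUEST_PHRASES = ("speak to a human", "real person", "speak to an agent")


def escalation_reason(state, last_utterance, sentiment_floor=-0.6, max_failures=2):
    """Return why the call should escalate, or None to keep it with the agent."""
    text = last_utterance.lower()
    if any(phrase in text for phrase in HUMAN_REQUEST_PHRASES):
        return "caller_requested_human"
    if state.sentiment <= sentiment_floor:
        return "negative_sentiment"
    if state.failed_attempts >= max_failures:
        return "unresolved"
    return None


def context_package(state, reason):
    """Build the summary handed to the human agent on a warm transfer."""
    return {
        "caller_id": state.caller_id,
        "intent": state.intent,
        "reason": reason,
        "recent_transcript": state.transcript[-5:],   # last few turns only
    }


upset = CallState("+441632960001", "billing_query", sentiment=-0.8,
                  transcript=["Caller: I was charged twice.", "Agent: Let me check."])
calm = CallState("+441632960002", "opening_hours", sentiment=0.2)

upset_reason = escalation_reason(upset, "I just want this fixed")
calm_reason = escalation_reason(calm, "What time do you open?")
request_reason = escalation_reason(calm, "Can I speak to a human please?")
package = context_package(upset, upset_reason)
```

The order of checks is a design choice. Here an explicit request for a human wins over everything else, so a caller never has to argue with the agent to be transferred.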

Several of these features lean on modern language models, the same foundations behind our generative AI solutions development work.

Advanced Features for Specific Use Cases

Advanced features serve particular use cases, and each carries its own complexity and cost on top of the core build.

RAG knowledge base

For customer service, healthcare, and technical helpdesk, RAG lets the agent answer factual questions by retrieving the right passage from a curated knowledge base, instead of guessing from training data. Complexity is high, covering ingestion of PDF, DOCX, and HTML, chunking and embedding with OpenAI text-embedding-3-small or on-device all-MiniLM, a vector database such as pgvector on PostgreSQL or Pinecone, and ongoing maintenance. Cost adds £8,000 to £25,000 to build, plus £500 to £2,000 a month for embeddings, hosting, and upkeep.
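The retrieval step can be illustrated without any external service. The sketch below is a toy: it swaps the embedding model for a bag-of-words count and the vector database for an in-memory list, so it shows only the shape of "embed, compare, return the closest chunk". The sample chunks and function names are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)),
                    reverse=True)
    return ranked[:top_k]


# Illustrative knowledge base chunks, not real policy text.
CHUNKS = [
    "Refunds are issued within 14 days of a returned item being received.",
    "Our support line is open from 8am to 6pm on weekdays.",
    "Delivery to mainland addresses takes three to five working days.",
]

top = retrieve("When is the support line open?", CHUNKS)
print(top[0])
```

In a real build the retrieved passage would be placed in the LLM prompt so the spoken answer stays grounded in the source, and the similarity search would run in pgvector or Pinecone rather than in Python.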

Outbound campaign management

For sales, debt collection, healthcare reminders, and utility notices, the agent initiates calls, manages lists, schedules retries, respects do-not-call rules, and reports performance. Complexity is medium to high. Cost adds £5,000 to £15,000, plus higher outbound telephony, around $0.013 a minute on Twilio.
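Retry scheduling and do-not-call checks are mostly date arithmetic. The sketch below assumes a 09:00 to 20:00 local calling window and a three-step retry ladder. Both are placeholder values, since the permitted hours and retry limits depend on the jurisdiction and the campaign, and the function name is hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed values for illustration; real limits come from the applicable regulations.
CALL_WINDOW = (time(9, 0), time(20, 0))
RETRY_DELAYS = [timedelta(hours=2), timedelta(hours=24), timedelta(hours=72)]


def next_attempt(last_attempt, attempts_made, number, do_not_call):
    """Return when to retry an unanswered call, or None if no retry is allowed.

    `last_attempt` is a naive datetime in the callee's local time.
    """
    if number in do_not_call:
        return None                                   # never dial a suppressed number
    if attempts_made < 1 or attempts_made > len(RETRY_DELAYS):
        return None                                   # retry ladder exhausted
    candidate = last_attempt + RETRY_DELAYS[attempts_made - 1]
    start, end = CALL_WINDOW
    if candidate.time() < start:
        return datetime.combine(candidate.date(), start)
    if candidate.time() >= end:
        return datetime.combine(candidate.date() + timedelta(days=1), start)
    return candidate


dnc = {"+441632960009"}
evening = next_attempt(datetime(2026, 3, 2, 19, 0), 1, "+441632960001", dnc)
next_day = next_attempt(datetime(2026, 3, 2, 10, 0), 2, "+441632960001", dnc)
blocked = next_attempt(datetime(2026, 3, 2, 10, 0), 1, "+441632960009", dnc)
exhausted = next_attempt(datetime(2026, 3, 2, 10, 0), 4, "+441632960001", dnc)
```

A retry that would land at 21:00 is pushed to 09:00 the next morning rather than dropped, which keeps the campaign inside the window without losing the attempt.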

Voice biometrics

For banking, financial services, healthcare, and high-security work, the agent verifies identity by voiceprint rather than security questions, passively while the caller speaks. Complexity is high, with provider integration through Nuance Gatekeeper, Verint, or Aculab VeriCall, an enrolment flow, liveness detection to stop replay attacks, and KBA fallback when confidence is low. Cost adds £15,000 to £40,000, plus per-verification fees near $0.02 to $0.05.

Sentiment analysis and emotional routing

For contact centres, complaint handling, and healthcare, the agent scores the caller's emotional state in real time, using Amazon Comprehend, Google Natural Language, or a custom model on speech features, and escalates when distress crosses a threshold. It can also modulate its own response style based on the sentiment score. Complexity is medium. Cost adds £5,000 to £12,000, with negligible ongoing sentiment API cost near $0.0001 per character of transcript.

Multilingual support

For international businesses, diaspora customers, and public services, the agent detects the caller's language from the first utterance and runs the whole call in it, with no menu. Complexity is medium. Cost adds £3,000 to £8,000 per extra language, and some languages need a separate STT model.

Payment processing over voice

For e-commerce, utilities, debt collection, and telehealth, the caller pays by card over the phone, with PCI DSS-compliant DTMF capture through Twilio so card numbers never touch the agent's audio recording. The processor side uses Stripe PaymentIntents or Braintree, with receipt generation, failed payment handling, and payment plan arrangement. Complexity is high. Cost adds £10,000 to £25,000, with a possible PCI audit for the wider system.

Smart scheduling with calendar intelligence

For healthcare, B2B services, consultancy, and property, the agent books across multiple calendars, applies business rules, and confirms to both parties. Complexity is medium to high. Cost adds £6,000 to £18,000, higher for clinical scheduling with FHIR.

On-premise or private cloud LLM

For healthcare, financial services, government, defence, and legal, the agent runs inference on your own infrastructure instead of sending content to OpenAI, Anthropic, or Google. Complexity is very high, with vLLM or Ollama serving an open model like Llama 3 70B, Mistral Large, or Qwen 2.5 72B, GPU provisioning on A100 or H100 hardware, quantisation for efficiency, and benchmarking to keep the first token under 700ms. Cost adds £20,000 to £60,000 for setup, plus £3,000 to £10,000 a month for GPU capacity.

Integration Architecture and How the Agent Connects to Business Systems

The integration layer is the set of API connections that let the agent read from and write to business systems live during a call. It is where the custom build earns its keep over a managed platform. An agent that checks account status, confirms availability, books with calendar intelligence, takes a payment, and raises a support ticket inside one conversation is a different thing from one that only reads out information. This is also where AI workflow automation comes alive, since each integration needs design, authentication, error handling, and testing.

Mapping these systems well is part strategy and part engineering, which is where AI business consulting earns its place early in a project.

Integration Patterns by Business System

CRM: Salesforce, HubSpot, Microsoft Dynamics, Zoho, and for healthcare Epic, Cerner, and eClinicalWorks. The pattern is REST with OAuth 2.0, SOQL for Salesforce and GraphQL for HubSpot, entity search on phone or email at call start, and a record update on completion. It reads contact, account status, history, and open tickets, and writes activity logs, transcripts, tags, and follow-ups. Complexity is medium.
Booking and scheduling: Calendly, Acuity, Google Calendar, Outlook, custom databases, and for healthcare Epic Scheduling, EMIS, and SystmOne via FHIR. The pattern is REST for availability and booking, webhooks for confirmations, and FHIR R4 Schedule, Slot, and Appointment resources for clinical work. Complexity is medium to high, with healthcare and multi-resource booking the harder cases.
Payment processing: Stripe, Braintree, Adyen, PayPal, Square, plus NHS Payment Service and debt platform APIs. The pattern is PCI DSS-compliant DTMF capture through Twilio, PaymentIntent creation, a status webhook, and a card vault for return customers, with no card numbers stored in the agent. Complexity is high, driven by PCI compliance and failure handling.
Knowledge base and document store: Confluence, SharePoint, Notion, product catalogues, policy databases, and clinical decision support. The pattern is semantic retrieval, with embeddings at ingest, similarity search at query time, pgvector on PostgreSQL or Pinecone, and source-grounded RAG responses. Complexity is high, especially for frequently updated content and governance.
ERP and inventory: SAP, Oracle Fusion, Microsoft Dynamics 365, NetSuite, plus warehouse and MES systems. The pattern is SAP RFC, BAPI, or REST, Oracle Integration Cloud, modern REST with OAuth, and middleware like MuleSoft or Dell Boomi for legacy. Complexity ranges from very high for legacy SAP to far simpler for modern cloud ERP.
Ticketing and helpdesk: Zendesk, Freshdesk, Jira Service Management, ServiceNow, and HubSpot Service Hub. The pattern is REST for ticket creation, search by customer, status query, updates, and resolution webhooks. Complexity is low to medium, with intelligent categorisation the main challenge.
Healthcare EHR: Epic, Cerner, eClinicalWorks, EMIS, and SystmOne, all via FHIR. The pattern is FHIR R4 REST across Patient, Appointment, MedicationRequest, AllergyIntolerance, and Condition resources, with SMART on FHIR authentication. It never writes clinical observations without clinical oversight. Complexity is very high, with HIPAA and NHS DSP governance on every access.
Contact centre platform: Genesys Cloud, Salesforce Service Cloud, Zendesk Talk, Five9, Avaya, and Amazon Connect. The pattern is warm transfer over SIP or conference API, an agent screen pop with the conversation summary, priority queue insertion, and ACD routing on intent. Complexity is medium to high, with the screen pop the most valuable piece on escalated calls.
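Whatever the system, each integration needs a timeout and a graceful failure path so one slow backend does not stall the call. The sketch below wraps tool calls with `asyncio.wait_for` and runs two of them in parallel. The two tool functions are hypothetical stand-ins that sleep instead of calling a real CRM or booking API.

```python
import asyncio


async def call_tool(tool, args, timeout_s=1.0):
    """Run one tool call with a timeout and convert failures into a result dict."""
    try:
        data = await asyncio.wait_for(tool(**args), timeout=timeout_s)
        return {"ok": True, "data": data}
    except asyncio.TimeoutError:
        return {"ok": False, "error": "timeout"}
    except Exception as exc:                     # any backend error becomes a soft failure
        return {"ok": False, "error": type(exc).__name__}


async def lookup_account(phone):
    """Hypothetical CRM lookup; the sleep simulates a fast API."""
    await asyncio.sleep(0.01)
    return {"phone": phone, "status": "active"}


async def check_booking_slots(date):
    """Hypothetical booking check; the sleep simulates a slow backend."""
    await asyncio.sleep(0.5)
    return {"date": date, "slots": 3}


async def run_turn():
    # Both calls start together, so the turn waits for the slower one, not the sum.
    return await asyncio.gather(
        call_tool(lookup_account, {"phone": "+441632960001"}),
        call_tool(check_booking_slots, {"date": "2026-03-02"}, timeout_s=0.05),
    )


account, booking = asyncio.run(run_turn())
print(account)
print(booking)
```

Returning a structured failure instead of raising lets the orchestration layer decide what the caller hears, for example an apology and an offer to call back, while the successful lookup is still used.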

The Real Cost of Custom AI Voice Agent Development

The cost of custom AI voice agent development is one of the most misquoted figures in voice sales calls. Vendors who build on managed platforms quote platform prices. Agencies quote hourly rates with no scope. This section gives the realistic component-by-component breakdown at different complexity levels, drawn from production project data rather than guesswork.

Development Cost Breakdown by Project Tier

Tier 1 is a focused MVP with one to two use cases and two to three integrations. Tier 2 is a production platform with three to five use cases and five to eight integrations. Tier 3 is an enterprise platform with five to ten use cases, eight to fifteen integrations, and compliance.

Cost component	Tier 1 MVP	Tier 2 platform	Tier 3 enterprise
Architecture and tech selection	£3,000 to £5,000	£6,000 to £10,000	£12,000 to £25,000
Conversation design	£2,500 to £8,000	£7,500 to £20,000	£15,000 to £50,000
Core component integration	£8,000 to £12,000	£8,000 to £12,000	£10,000 to £18,000
Business system integrations	£5,000 to £15,000	£12,500 to £40,000	£28,000 to £120,000
Knowledge base and RAG	£5,000 to £10,000	£8,000 to £15,000	£15,000 to £30,000
Quality assurance and testing	£4,000 to £7,000	£7,000 to £12,000	£12,000 to £25,000
Compliance engineering	£2,000 to £5,000	£5,000 to £12,000	£15,000 to £40,000
Deployment and DevOps	£3,000 to £6,000	£5,000 to £10,000	£10,000 to £25,000
Total build (development)	£32,500 to £68,000	£59,000 to £131,000	£117,000 to £333,000
Post-launch optimisation (3 months)	£5,000 to £10,000	£10,000 to £20,000	£20,000 to £50,000
Total first-year investment	£37,500 to £78,000	£69,000 to £151,000	£137,000 to £383,000

The effort behind these numbers is mostly people-days. A Tier 1 architecture takes four to six days, rising to eight to fourteen for Tier 2 and sixteen to thirty-two for Tier 3. Conversation design runs three to five days per flow, covering the intent taxonomy, the conversation tree, edge case mapping, and persona definition. The core STT, LLM, TTS, and telephony pipeline takes ten to fifteen days, including latency optimisation, barge-in handling, streaming setup, and DTMF fallback. Each business system integration takes three to seven days for standard REST work, and longer for ERP, EHR, and legacy systems. Quality assurance runs five to nine days for a focused MVP, and up to thirty for an enterprise build with full regression, compliance, and clinical safety testing.

Ongoing Operating Costs After Launch

These are the monthly costs once the agent is live. Figures are shown per 1,000 calls where usage drives them, and per month where they are fixed.

Operating component	Cost	Cost driver	Optimisation
STT (Deepgram Nova-3)	£1.60 to £4.80 per 1,000 calls	$0.0059 per minute, 4-minute average	Volume discount; on-device Whisper for batch
LLM (GPT-4o-mini)	£5.60 to £14.00 per 1,000 calls	About 2,000 tokens per call	Route simple queries to cheaper models; cache repeats
TTS (Deepgram Aura)	£5.40 to £13.50 per 1,000 calls	$0.0135 per minute, 2-minute average	Tighten response length; cache repeated phrases
Telephony (Twilio inbound)	£8.50 to £10.00 per 1,000 calls	$0.0085 per minute inbound	Plivo or Vonage cheaper; self-host at high volume
Infrastructure (AWS, Redis, DB)	£200 to £800 per month	Fargate, RDS, ElastiCache, S3	Reserved instances; Graviton for 20 to 30% saving
Monitoring	£100 to £400 per month	CloudWatch, Datadog, Sentry, LLM tracing	Consolidate tools; sample traces
Engineering maintenance	£1,000 to £3,000 per month	1 to 3 days a month of prompt and integration work	Client-side knowledge owner; automate regression tests
Total at 10,000 calls a month	£6,800 to £14,800 per month	Combined usage and fixed costs	Target under £10,000 a month when well optimised

The per-call maths is small, but it adds up. Deepgram STT runs about $0.0059 a minute, so a 4-minute call costs roughly $0.024, or about $23.60 per 1,000 calls, with a volume discount above 100,000 minutes a month. GPT-4o-mini handles around 2,000 tokens per call at roughly $0.007, and routing simple queries to Claude Haiku or caching identical ones trims that further. Deepgram Aura TTS runs $0.0135 a minute, near $27 per 1,000 calls, with ElevenLabs comparable and Azure Neural slightly cheaper. Twilio inbound is $0.0085 a minute, around $34 per 1,000 calls, while outbound costs more at $0.013. Infrastructure scales linearly with call volume, and Graviton processors plus reserved instances are where most teams trim the monthly bill.
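The per-1,000-call figures in the paragraph above follow directly from the per-minute rates and the average minutes per call. A short sketch of that arithmetic, in US dollars as quoted; the dictionary keys are illustrative labels, and the 4-minute telephony duration is inferred from the $34 figure.

```python
# (USD per minute, average billed minutes per call), as quoted in the paragraph above.
PER_MINUTE = {
    "stt_deepgram_nova3": (0.0059, 4),
    "tts_deepgram_aura": (0.0135, 2),
    "telephony_twilio_inbound": (0.0085, 4),   # 4 minutes inferred from the $34 figure
}
LLM_COST_PER_CALL = 0.007                      # about 2,000 tokens on GPT-4o-mini


def per_1000_calls(rate_per_minute, minutes_per_call):
    """Usage cost in USD for 1,000 calls at the given rate and call length."""
    return round(rate_per_minute * minutes_per_call * 1000, 2)


costs = {name: per_1000_calls(rate, minutes)
         for name, (rate, minutes) in PER_MINUTE.items()}
costs["llm_gpt4o_mini"] = round(LLM_COST_PER_CALL * 1000, 2)

for name, usd in costs.items():
    print(f"{name}: ${usd:.2f} per 1,000 calls")
```

Changing the average call length or swapping in a volume-discounted rate is then a one-line edit, which helps when comparing providers.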

ROI Model and the Business Case for Custom AI Voice

The business case for custom AI voice agent development combines three things. There is the cost saving from AI containment, where each AI-handled call avoids a human agent. There is the revenue lift from better service, through instant answers, round-the-clock cover, and consistent information. And there is the strategic value of owning a differentiating capability rather than renting one. The numbers below are set for real production environments, not optimistic vendor projections.

The ROI Model in Three Scenarios

ROI component	Conservative, 60%	Base case, 75%	Optimistic, 85%
Monthly call volume	10,000	10,000	5,000
Current human cost, loaded	£140,000	£140,000	£70,000
AI-handled calls	6,000	7,500	4,250
Human calls remaining	4,000 at £14 = £56,000	2,500 at £14 = £35,000	750 at £14 = £10,500
Monthly AI operating cost	£9,500	£9,500	£5,000
Build amortisation, 36 months	£2,778	£2,778	£1,111
Total monthly cost with AI	£68,278	£47,278	£16,611
Monthly saving vs human-only	£71,722	£92,722	£53,389
Annual saving	£860,664	£1,112,664	£640,668
Upfront investment	£100,000	£100,000	£40,000
Payback period	1.4 months	1.1 months	0.75 months
5-year NPV, 10% discount	£3.26M	£4.22M	£2.43M
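The table can be reproduced from five inputs per scenario. The sketch below rounds monthly figures to whole pounds, as the table does. The NPV line assumes five equal annual savings discounted at 10%, which matches the table's figures to rounding; the function and key names are illustrative.

```python
def roi(calls, cost_per_call, containment, ai_opex, build_cost,
        amort_months=36, years=5, discount=0.10):
    """Monthly and annual savings, payback, and NPV for one scenario (GBP)."""
    human_only = calls * cost_per_call
    ai_calls = round(calls * containment)
    remaining_human = (calls - ai_calls) * cost_per_call
    amortisation = round(build_cost / amort_months)
    total_with_ai = remaining_human + ai_opex + amortisation
    monthly_saving = human_only - total_with_ai
    annual_saving = monthly_saving * 12
    # Assumption: NPV treats the annual saving as a level annuity over `years`.
    annuity_factor = (1 - (1 + discount) ** -years) / discount
    return {
        "ai_calls": ai_calls,
        "total_with_ai": total_with_ai,
        "monthly_saving": monthly_saving,
        "annual_saving": annual_saving,
        "payback_months": round(build_cost / monthly_saving, 2),
        "npv": annual_saving * annuity_factor,
    }


conservative = roi(10_000, 14, 0.60, 9_500, 100_000)
base_case = roi(10_000, 14, 0.75, 9_500, 100_000)
optimistic = roi(5_000, 14, 0.85, 5_000, 40_000)

for name, result in [("conservative", conservative), ("base", base_case),
                     ("optimistic", optimistic)]:
    print(name, result["monthly_saving"], result["payback_months"],
          round(result["npv"] / 1e6, 2))
```

Running it with your own call volume, loaded cost per call, and expected containment rate gives a first-pass business case before discovery refines the inputs.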

Revenue Impact the Cost Models Miss

Cost reduction is the visible half of the return. Three revenue effects are consistently under-modelled in business cases and consistently confirmed after deployment.

After-hours revenue capture: If you currently miss after-hours enquiries through voicemail or engaged lines, the agent captures them at no extra staffing cost. For revenue-generating inbound, like lead qualification, sales, or bookings, that cover has a clear payoff. For a business where 20% of inbound enquiries arrive after hours and were previously lost to voicemail, a 30% conversion rate on those calls is direct revenue it can now win.
Faster lead response: For B2B sales, moving response time from the 47-hour industry average to 90 seconds drives a documented 40 to 60% rise in lead-to-meeting conversion. For a team converting 10 meetings a month at £5,000 average deal value, a 50% improvement means 5 more meetings, or £25,000 in extra monthly pipeline. At a 20% close rate, that is £5,000 a month in new revenue from the qualification agent alone.
Fewer appointment no-shows: Where each appointment carries value, the 30 to 45% drop in no-shows from AI reminder calls preserves revenue directly. For a healthcare practice losing £50 to £80 per no-show in revenue and wasted slots, cutting no-shows from 20% to 12% across 500 appointments a month saves 40 no-shows at £65 average, or £2,600 a month.
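The lead-response and no-show examples above are simple enough to model directly, which makes it easy to substitute your own figures. A minimal sketch using the numbers quoted in the text; the function names are illustrative.

```python
def lead_response_uplift(meetings_per_month, uplift, deal_value, close_rate):
    """Extra meetings, extra pipeline, and extra revenue per month (GBP)."""
    extra_meetings = meetings_per_month * uplift
    extra_pipeline = extra_meetings * deal_value
    extra_revenue = extra_pipeline * close_rate
    return round(extra_meetings), round(extra_pipeline), round(extra_revenue)


def no_show_saving(appointments, rate_before, rate_after, value_per_no_show):
    """No-shows avoided per month and the revenue preserved (GBP)."""
    avoided = round(appointments * (rate_before - rate_after))
    return avoided, avoided * value_per_no_show


leads = lead_response_uplift(meetings_per_month=10, uplift=0.50,
                             deal_value=5_000, close_rate=0.20)
no_shows = no_show_saving(appointments=500, rate_before=0.20,
                          rate_after=0.12, value_per_no_show=65)
print(leads)     # (5, 25000, 5000)
print(no_shows)  # (40, 2600)
```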

How Mobisoft Builds Custom AI Voice Agents

Mobisoft Infotech has built voice agent systems across healthcare, financial services, logistics, B2B sales, and enterprise customer service since 2023. Our practice rests on one belief. The quality of a voice agent comes mostly from the quality of its design and AI agent architecture, the call type categorisation, the conversation flow, the integration reliability, and the escalation logic, far more than from any single component technology. Any team can wire Twilio to Deepgram to GPT-4o to ElevenLabs. The teams that consistently reach 80% containment, CSAT on par with humans, and payback under four months are the ones who understand the call type deeply, design with operational empathy, and keep improving the agent after go-live.

We do not sell voice technology. We sell the operational improvement that voice technology makes possible. The distinction matters. A technology vendor's incentive is to deploy and move on. Ours is to make sure containment, CSAT, and ROI hit the numbers in the business case we built with you, because our next engagement depends on the first one working.

Our Three-Stage Engagement Structure

Discover, 2 to 3 weeks

We run requirement workshops with operations, CX, compliance, and IT, and we listen to 50 to 100 real calls in your target category before designing anything. You get a call type categorisation matrix, a technology architecture decision, an integration architecture document, conversation design for Wave 1, a compliance and regulatory map, an investment and ROI model, and a clear go or no-go recommendation. We will recommend a managed platform instead if the evidence points that way.

Build, 8 to 12 weeks

A small team of one lead architect, one or two backend engineers, a conversation designer, and a QA engineer delivers production-ready code, all business system integrations, telephony configuration, conversation design, compliance controls, a monitoring dashboard, a QA test suite, and a staging environment. We run weekly calls with your stakeholders, work alongside your IT team on integrations, and include a UAT phase with real calls in staging before promotion.

Optimise, 3 to 6 months post-launch

We stay engaged rather than hand over. You get monthly reports on containment, CSAT, FCR, escalation reasons, and cost per call, plus prompt iterations, knowledge base updates, integration upkeep, and A/B testing. We hold a 4-hour SLA for production incidents in business hours, review every unexpected escalation, update the ROI model monthly with real data, and make the case for Wave 2 once Wave 1 performance confirms readiness.

Mobisoft's Specialist Capabilities in AI Voice

Our enterprise AI solutions cover the harder corners of voice, where compliance and integration make or break a deployment.

Healthcare AI voice, HIPAA and NHS

FHIR EHR integration with Epic, Cerner, EMIS, and SystmOne; HIPAA-compliant architecture with BAAs; NHS DSP Toolkit compliance; a clinical safety review for any workflow that touches clinical data; and CQC-aware escalation for mental-health-adjacent cases. Healthcare voice is more complex than commercial voice, and a team without clinical IT experience usually needs a six-month redesign to meet EHR and compliance requirements.

Financial services AI voice, FCA and TCPA

FCA Consumer Duty engineering with vulnerability screening, audit trails, and scripts reviewed against CONC and COBS rules; TCPA-compliant outbound with consent management, DNC integration, and call-hour controls; voice biometrics for passive authentication; and PCI DSS-compliant payment over voice. A non-compliant financial voice deployment exposes the operator to enforcement, fines, and reputational damage, and the required design choices are non-obvious, so they must sit in the architecture from day one.
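To make the outbound controls concrete, here is a minimal sketch of a pre-dial gate that checks consent, a DNC list, and call hours. It is illustrative only and is not Mobisoft's production code. The names Contact and may_dial are hypothetical, the 8 a.m. to 9 p.m. window reflects the federal TCPA calling-hours rule in the called party's local time, and stricter state rules are assumed to be handled elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone, tzinfo

# Federal TCPA window: no telemarketing calls before 8 a.m. or after 9 p.m.
# in the called party's local time. State rules can be stricter (not modelled).
CALL_WINDOW_START = time(8, 0)
CALL_WINDOW_END = time(21, 0)


@dataclass(frozen=True)
class Contact:
    """Hypothetical contact record used only for this sketch."""
    phone: str
    tz: tzinfo          # called party's time zone
    has_consent: bool   # prior express consent on file


def may_dial(contact: Contact, now_utc: datetime, dnc_numbers: set[str]) -> tuple[bool, str]:
    """Return (allowed, reason). now_utc must be timezone-aware."""
    if not contact.has_consent:
        return False, "no_consent"
    if contact.phone in dnc_numbers:
        return False, "on_dnc_list"
    local_time = now_utc.astimezone(contact.tz).time()
    if not (CALL_WINDOW_START <= local_time < CALL_WINDOW_END):
        return False, "outside_call_hours"
    return True, "ok"


# Usage: a fixed UTC-5 offset keeps the example self-contained. A real system
# would more likely use zoneinfo.ZoneInfo("America/New_York") to follow DST.
eastern = timezone(timedelta(hours=-5))
contact = Contact(phone="+12125550100", tz=eastern, has_consent=True)
now = datetime(2026, 1, 15, 13, 30, tzinfo=timezone.utc)  # 08:30 local
print(may_dial(contact, now, dnc_numbers=set()))
```

Returning a reason code alongside the decision means every blocked call can be written to the audit trail with the rule that stopped it.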

Enterprise contact centre integration

Warm transfer with Genesys Cloud, Salesforce Service Cloud, Amazon Connect, Five9, and Avaya. Agent screen pop with the conversation summary, ACD routing on intent, a supervisor dashboard, WFM integration, and SIP trunking for legacy PBX. Contact centres have existing investments the agent must work with, not replace, and our pre-built Genesys and Salesforce components cut integration time by four to six weeks.

Multi-language AI voice

Language detection from the first utterance, STT model selection per language, LLM prompt adaptation, TTS voice per language and dialect, multilingual knowledge bases, and cultural review by native speakers. An agent that works in English often fails in French, Mandarin, or Arabic without per-language testing, since accuracy, cultural style, and TTS quality all vary.
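One way to picture the per-language routing described above is a small profile registry keyed by detected language. This is a sketch under assumptions: the model names, voice IDs, file paths, and the 0.8 confidence threshold are all placeholders rather than real vendor identifiers or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageProfile:
    """Per-language component choices. All values below are placeholders."""
    stt_model: str
    tts_voice: str
    prompt_file: str
    knowledge_base: str


PROFILES = {
    "en-GB": LanguageProfile("stt-en", "voice-en-gb", "prompts/en.txt", "kb_en"),
    "fr-FR": LanguageProfile("stt-fr", "voice-fr-fr", "prompts/fr.txt", "kb_fr"),
    "ar-AE": LanguageProfile("stt-ar", "voice-ar-ae", "prompts/ar.txt", "kb_ar"),
}

DEFAULT_LANGUAGE = "en-GB"
MIN_CONFIDENCE = 0.8  # assumed threshold; tune against real first utterances


def select_profile(detected: str, confidence: float) -> tuple[str, LanguageProfile]:
    """Pick the profile for the detected language, falling back to the
    default when detection is unsure or the language is unsupported."""
    if confidence >= MIN_CONFIDENCE and detected in PROFILES:
        return detected, PROFILES[detected]
    return DEFAULT_LANGUAGE, PROFILES[DEFAULT_LANGUAGE]


language, profile = select_profile("fr-FR", confidence=0.93)
print(language, profile.stt_model, profile.tts_voice)
```

Keeping each language's STT, TTS, prompt, and knowledge base in one record makes it harder to ship a language with only some of its components adapted.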

On-premise and private cloud LLM

vLLM deployment of Llama 3 70B, Mistral Large, or Qwen 2.5 72B on AWS, Azure, GCP, or on-premise GPUs; model quantisation; latency benchmarking against a time-to-first-token target under 700ms; pipeline integration; and cloud failover for peak load. Organisations with strict data residency rules cannot use cloud LLM APIs that process content on vendor infrastructure, and this work needs GPU and model-serving expertise most voice vendors lack.

Voice agent quality programme

Monthly analysis that categorises every escalation by reason, identifies the top three improvement opportunities, runs prompt iterations and A/B tests, analyses knowledge base gaps, and reports containment trends against the business case. A well-designed agent typically launches at 55 to 65% containment and reaches 75 to 85% over six months, but only with a structured programme. Hand it over at go-live, and performance usually plateaus or slips.
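The core of that monthly analysis can be sketched in a few lines. The call log format here is an assumption: each call is a dictionary whose escalation_reason is None when the agent contained the call, and the sample reasons are invented for illustration.

```python
from collections import Counter


def containment_rate(calls: list[dict]) -> float:
    """Share of calls the agent handled without escalating to a human."""
    if not calls:
        return 0.0
    contained = sum(1 for call in calls if call["escalation_reason"] is None)
    return contained / len(calls)


def top_escalation_reasons(calls: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """The n most frequent escalation reasons, most common first."""
    reasons = Counter(
        call["escalation_reason"]
        for call in calls
        if call["escalation_reason"] is not None
    )
    return reasons.most_common(n)


# Invented sample data: 10 calls, 6 contained, 4 escalated.
calls = (
    [{"escalation_reason": None}] * 6
    + [{"escalation_reason": "knowledge_gap"}] * 2
    + [{"escalation_reason": "caller_requested_human"}]
    + [{"escalation_reason": "integration_timeout"}]
)

print(f"Containment: {containment_rate(calls):.0%}")
print("Top reasons:", top_escalation_reasons(calls))
```

The top reasons feed directly into the next round of prompt iterations and knowledge base updates.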

Client Engagements in Practice

Three examples show the range of work we deliver. Details are anonymised at client request.

UK Primary Care Network, 40 GP practices

The network's 40 practices handled 48,000 inbound calls a month for appointments, with morning waits of 14 minutes. Two-thirds were booking, cancellation, and confirmation with no clinical content. We built a custom agent integrated with SystmOne and EMIS via their APIs, with NHS number verification and a postcode double-check, real-time availability and booking, outbound reminders at 48 and 24 hours, clinical escalation to the duty manager for any clinical content, and GDPR-compliant logging under NHS DSP. The agent reached 84% containment against an 80% target. Wait time for AI-handled calls dropped to zero. No-shows fell from 18% to 11%, worth £3.5M a year across the network at £65 a missed slot. Morning peaks ran without extra reception staff, and the booking satisfaction score rose from 3.4 to 4.2 out of 5. Payback came in 2.8 months.
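As an illustration of the verification step, the sketch below validates an NHS number with the standard Modulus 11 check digit and compares postcodes after normalisation. It is not the code from this engagement. The function names are hypothetical, and a real agent would still look the caller up in the clinical system rather than rely on a checksum alone.

```python
def is_valid_nhs_number(value: str) -> bool:
    """Validate the Modulus 11 check digit of a 10-digit NHS number."""
    digits = value.replace(" ", "").replace("-", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits 10 down to 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False  # a check value of 10 is never issued
    return check == int(digits[9])


def postcodes_match(spoken: str, on_record: str) -> bool:
    """Compare postcodes ignoring spacing and case."""
    def normalise(p: str) -> str:
        return "".join(p.split()).upper()
    return normalise(spoken) == normalise(on_record)


def verify_caller(nhs_number: str, spoken_postcode: str, record_postcode: str) -> bool:
    """Both checks must pass before the agent offers booking actions."""
    return is_valid_nhs_number(nhs_number) and postcodes_match(
        spoken_postcode, record_postcode
    )


# 943 476 5919 is a commonly published test number with a valid check digit.
print(verify_caller("943 476 5919", "sw1a 1aa", "SW1A 1AA"))
```

Rejecting a malformed number before any API call saves a round trip to the clinical system and lets the agent ask the caller to repeat it straight away.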

B2B SaaS company, London, £15M ARR

An SDR team of 8 was answering inbound web leads in 47 hours on average, with 18% lead-to-demo conversion. We built an inbound qualification agent that calls every qualified lead within 90 seconds of form submission, opens with context from the form, runs BANT qualification in a natural 3 to 4 minute conversation, books confirmed demos into the right account executive's Calendly, updates HubSpot with scores and transcript, escalates enterprise leads to the VP Sales mobile, and routes SMB leads to the SDR queue. Response time fell from 47 hours to 90 seconds. Lead-to-demo conversion rose from 18% to 31%, a 72% improvement. The agent now handles 94% of inbound qualification without SDR involvement, the team refocused on demo prep and follow-up, and annual qualified pipeline grew by £4.2M. Payback came in 1.1 months.

Regional insurance broker, FNOL and renewals

The broker ran First Notification of Loss with 7 claims handlers, averaging 18 minutes per call, while renewals went out by human dialling at a 40% answer rate. We built two agents. The FNOL agent captures structured claim data in sequence, covering the incident type, time, location, third party, witnesses, and an early damage estimate. It then creates the claim record, issues a reference number, and routes to a human for assessment. The outbound renewal agent calls every renewing policyholder 30 days ahead, presents the quote with a prior-year comparison, handles acceptance with payment, and routes negotiations and unhappy customers to the human retention team. FNOL data completeness rose from 83% to 97%. FNOL handle time fell from 18 minutes to 9. Renewal retention via the AI reached 67% against 58% for human outbound, helped by round-the-clock reach at preferred times. Handlers moved from data capture to assessment. Payback came in 3.4 months.
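Sequential capture of this kind is often modelled as slot filling. The sketch below is a simplified illustration rather than the broker's implementation: the FnolClaim record and its field names are hypothetical, and the completeness measure is one plausible definition of the data completeness figure quoted above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class FnolClaim:
    """Hypothetical claim record; field order is the order the agent asks in."""
    incident_type: Optional[str] = None
    incident_time: Optional[str] = None
    location: Optional[str] = None
    third_party: Optional[str] = None
    witnesses: Optional[str] = None
    damage_estimate: Optional[str] = None


def next_missing_field(claim: FnolClaim) -> Optional[str]:
    """Name of the first field still to be captured, or None when done."""
    for field in fields(claim):
        if getattr(claim, field.name) is None:
            return field.name
    return None


def completeness(claim: FnolClaim) -> float:
    """Fraction of fields captured, from 0.0 to 1.0."""
    names = [field.name for field in fields(claim)]
    filled = sum(getattr(claim, name) is not None for name in names)
    return filled / len(names)


claim = FnolClaim(incident_type="collision", incident_time="2026-01-15T08:40")
print(next_missing_field(claim))       # the next question the agent should ask
print(f"{completeness(claim):.0%}")    # share of the record captured so far
```

Because the record drives the conversation, the agent cannot close the call with a field silently skipped, which is where completeness gains tend to come from.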

Getting Started with Mobisoft

The question we hear most from teams evaluating custom voice is simple. How do we know if this is right for us, and what does the first conversation look like? Discovery answers the first part before you commit to a build. We assess your use case, analyse your call data, review your integration environment, and build the business case with you. You get a clear recommendation, including, where it fits, a recommendation that a managed platform suits you better than custom, before you spend on development.

Most teams start their voice AI development with a short Discovery engagement, which keeps risk low and decisions evidence-based.

What We Need from You to Get Started

Call data

Call recordings or transcripts for your target use case, around 50 to 100 calls. If you cannot share them, a description of the 10 to 15 most common call patterns.
Current call volume, with a monthly total and a breakdown by type where possible.
Current average handle time per call type.
Current containment rate if you already run an IVR or basic automation.

Business systems

A list of systems the agent must reach, such as CRM, EHR, booking, payment, and ticketing, with API docs or an IT contact.
Your authentication approach, whether OAuth 2.0, API key, or SSO.
Any data residency rules or cloud provider constraints.

Operational context

Your current contact centre platform, if the agent must integrate for escalations.
Your regulatory environment, covering FCA, HIPAA, GDPR, TCPA, or the EU AI Act.
The target call types for Wave 1.

Commercial context

Your current human agent team cost, by headcount or monthly budget.
After-hours call volume and how you handle it today.
Target containment and CSAT KPIs from the business case.

Timeline and governance

Your target go-live date, if you have one.
Key stakeholders across operations, IT, compliance, and finance, and their involvement.
A budget range, which helps us recommend the right tier.

Our Engagement Models

Discovery Only

We run Discovery as a standalone engagement, covering call analysis, architecture recommendation, integration assessment, and ROI model. You receive a fully specified project brief and business case you can take to any partner, including a competitor, with no obligation to continue with us. Best for teams validating the case internally, needing an independent recommendation before an RFP, or with in-house developers who want the design done externally. Investment is £8,000 to £15,000, delivered in 2 to 3 weeks, and credited against the Build stage if the build proceeds with Mobisoft.

Fixed-Price MVP Build

Fixed scope, price, and timeline for Wave 1, covering one to two use cases and two to four integrations. Scope locks at the end of Discovery, and any change runs through a formal request with transparent cost impact. It includes three months of post-launch optimisation. Best for teams with a clear use case, ready to build, who want cost certainty. Investment is £35,000 to £80,000 depending on complexity, with 12 to 16 weeks to production.

Phased Platform Build

A multi-stage engagement where the MVP ships first and later waves follow, each scoped and priced after the previous wave's performance data. The platform architecture is designed from day one to support future waves. Best for teams building a strategic platform rather than a point solution, who want to learn from Wave 1 first. Investment is £65,000 to £150,000 for Wave 1 at Tier 2 scope, with each later wave £25,000 to £60,000.

Engineering Pod Augmentation

We embed two to four voice engineers with your existing team, and you keep full control and ownership while we supply the specialist skills you lack across STT, LLM, and TTS integration, conversation design, and compliance. Best for teams with engineering capacity but no voice specialism, or those building in-house capability rather than outsourcing. Investment is £8,000 to £18,000 per engineer per month, minimum three months, with UK business hours and India-based extended cover.

The Case for Owning Your AI Voice Capability

Every voice agent is either rented or owned. Rented means a managed platform runs the telephony, the STT, the LLM routing, the TTS, and the data processing on its terms, at its per-minute price, with its technology choices. Owned means you built the capability on a stack you control, with conversation design you own, integrations you engineered, and a per-minute cost that reflects component prices without platform markup.

For teams where voice is a strategic channel, owned is the right answer. That holds where caller experience is a competitive edge, where integration depth creates real operational advantage, where call volume justifies the build, and where long-term cost matters. For teams where voice is a tactical fix for one bounded use case, a managed platform is often the better start. The answer is not universal. It needs the candid analysis Discovery provides, listening to real calls, assessing the real integration environment, building the real ROI model, and making one specific recommendation for your situation.

If you are weighing custom AI voice agent development, the most useful first step is a conversation with our AI voice engineering practice. Not a sales call, a technical conversation about your call types, your integration environment, and your commercial goals. Discovery answers whether custom is the right investment for you before you commit to it. The work begins when you are ready to ask the right question.

Mobisoft Infotech AI Voice Engineering Practice

We build custom voice agents for enterprises, healthcare providers, financial services firms, and growth-stage businesses worldwide. Our services span Discovery with call type analysis, architecture, and ROI modelling; custom builds on Twilio, Deepgram, an LLM, ElevenLabs, and a tool layer; healthcare voice with FHIR EHR integration for Epic, EMIS, and SystmOne plus HIPAA and NHS DSP; financial services voice with FCA Consumer Duty, TCPA, voice biometrics, and PCI DSS; enterprise contact centre integration with Genesys, Salesforce, Amazon Connect, and Five9; RAG knowledge bases; on-premise LLM deployment with Llama 3, Mistral, and Qwen 2.5; multilingual voice across more than 40 languages; outbound campaign management; and post-launch optimisation.

Technology stack: STT from Deepgram Nova-3, AssemblyAI, Google, and Azure. LLM from GPT-4o, Claude, Gemini, Llama 3, and Mistral. TTS from ElevenLabs, Deepgram Aura, Azure Neural, and Google WaveNet. Telephony from Twilio, Vonage, Plivo, and LiveKit.
Engagement models: Discovery Only, Fixed-Price MVP, Phased Platform, and Pod Augmentation.
Industries: Healthcare, financial services, e-commerce, logistics, insurance, HR technology, B2B sales, and public sector.
Global delivery: US, UK, UAE, Australia, and Singapore, with engineering teams in India and UK-hours synchronous overlap.


Frequently Asked Questions

What is custom AI voice agent development?

It means building an AI voice agent on a composable stack you own and control, rather than configuring a managed platform. The work covers architecture design across STT, LLM, TTS, and telephony, conversation design for each use case, business system integrations through API, a knowledge base with RAG for grounded answers, compliance engineering across TCPA, Ofcom, FCA, HIPAA, and the EU AI Act, and post-launch optimisation. The result runs on infrastructure you control, with code and design you own, at a per-minute cost that reflects component prices rather than platform markup. Expect £30,000 to £150,000 to build and 12 to 18 weeks to production.

How long does it take to build a custom AI voice agent?

Plan around three stages. Discovery takes 2 to 3 weeks, covering call analysis, architecture, integration assessment, the ROI model, and Wave 1 conversation design. Build takes 8 to 12 weeks, covering component integration, conversation implementation, business system integrations, QA, and a supervised pilot. Optimise runs 3 to 6 months after launch, with prompt refinement, knowledge base updates, and containment improvement. Total to production is typically 10 to 15 weeks, the sum of Discovery and Build. Expect nearer 10 to 12 for a well-scoped Tier 1 build, while a Tier 2 platform can extend to 14 to 18. A managed platform ships in 1 to 4 weeks, but with customisation limits.

What technology stack does a custom AI voice agent use?

A typical 2026 stack starts with Twilio for telephony, using Media Streams over WebSocket, with Vonage or Plivo as alternatives. STT uses Deepgram Nova-3 for low latency, or AssemblyAI, Google, or Azure for accuracy and HIPAA. Orchestration runs as a custom Node.js or Go service, or on the LiveKit Agents framework. The model is often GPT-4o-mini for cost, Claude Haiku for tool use, or Llama 3 on-premises. TTS uses Deepgram Aura, ElevenLabs, or Azure Neural. The knowledge base uses pgvector or Pinecone, and infrastructure runs on AWS. Median latency with one tool call ranges from 810ms to 3,350ms, depending on the component choices.

What integrations does a custom AI voice agent support?

Any business system reachable by API. Common ones include CRM such as Salesforce, HubSpot, Dynamics, and Zoho, booking through Google Calendar, Microsoft Graph, Calendly, and FHIR for healthcare, and payment via Stripe and Braintree with PCI-compliant DTMF capture through Twilio. Healthcare EHR covers Epic, Cerner, EMIS, and SystmOne over FHIR with SMART authentication. ERP covers SAP, Oracle, and NetSuite, ticketing covers Zendesk, Freshdesk, Jira, and ServiceNow, and contact centres cover Genesys, Amazon Connect, Salesforce Service Cloud, and Five9. Complexity varies. A REST integration takes 3 to 5 days, while FHIR EHR and legacy ERP take 1 to 4 weeks each.

What is the ROI of a custom AI voice agent?

Take a Tier 2 build at 10,000 calls a month, base case. Current human cost is £140,000 a month for 5 loaded FTE. After AI, residual human cost is £35,000, AI operating cost is £9,500, and build amortisation is £2,778, for £47,278 total. That is a £92,722 monthly saving, or £1,112,664 a year. On a £100,000 build, payback is 1.1 months, and 5-year NPV at a 10% discount rate is £4.22M. Revenue upside sits on top, through after-hours capture, faster lead response, and fewer no-shows. Best-in-class agents reach 85% containment with CSAT within 3 points of humans.
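The arithmetic behind those figures can be reproduced in a few lines, which also makes it easy to rerun with your own numbers. Two assumptions are ours rather than stated above: the £2,778 amortisation is taken to be the £100,000 build spread over 36 months, and the NPV treats the annual saving as flat for five years, discounted at year end.

```python
# Inputs from the base case above (GBP per month unless noted).
human_cost_before = 140_000
residual_human_cost = 35_000
ai_operating_cost = 9_500
build_cost = 100_000            # one-off
amortisation_months = 36        # assumption: 100,000 / 36 is about 2,778
discount_rate = 0.10
years = 5

# Monthly cost after the agent goes live.
build_amortisation = round(build_cost / amortisation_months)
cost_after = residual_human_cost + ai_operating_cost + build_amortisation

# Savings and payback.
monthly_saving = human_cost_before - cost_after
annual_saving = monthly_saving * 12
payback_months = build_cost / monthly_saving

# NPV of a flat annual saving over five years, discounted at year end.
npv = sum(annual_saving / (1 + discount_rate) ** year for year in range(1, years + 1))

print(f"Cost after AI:  £{cost_after:,} a month")
print(f"Monthly saving: £{monthly_saving:,}")
print(f"Annual saving:  £{annual_saving:,}")
print(f"Payback:        {payback_months:.1f} months")
print(f"5-year NPV:     £{npv / 1e6:.2f}M")
```

Swapping in your own human cost, containment-driven residual cost, and build price gives a first-pass view before a full ROI model is built.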

Should I use a managed platform or build custom?

Choose a managed platform when speed matters most, with 1 to 4 weeks to production, the use case fits standard patterns, volume sits under 20,000 calls a month, the platform already meets your data residency requirements, and you need no integration depth beyond standard REST. Choose custom when you need model flexibility, deeper integration for multi-step transactions or legacy systems, a brand voice persona, controlled data processing for HIPAA or GDPR, volume above 20,000 calls a month, or strategic ownership with no lock-in. Many teams start managed to prove the case, then move to custom once the business case holds and volume justifies the engineering.
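Those criteria can be written down as a simple first-pass screen. This is a deliberately crude sketch, not a substitute for Discovery: the UseCase fields and function names are hypothetical, and treating any single custom trigger as decisive is our simplifying assumption.

```python
from dataclasses import dataclass

VOLUME_THRESHOLD = 20_000  # calls a month, from the guidance above


@dataclass(frozen=True)
class UseCase:
    """Hypothetical summary of a team's requirements."""
    monthly_calls: int
    needs_model_flexibility: bool = False
    needs_deep_integration: bool = False      # multi-step or legacy systems
    needs_brand_voice: bool = False
    needs_controlled_data_processing: bool = False  # e.g. HIPAA or GDPR
    wants_strategic_ownership: bool = False


def reasons_for_custom(use_case: UseCase) -> list[str]:
    """List every criterion that points towards a custom build."""
    checks = [
        (use_case.monthly_calls > VOLUME_THRESHOLD, "volume above 20,000 calls a month"),
        (use_case.needs_model_flexibility, "model flexibility"),
        (use_case.needs_deep_integration, "deep integration"),
        (use_case.needs_brand_voice, "brand voice persona"),
        (use_case.needs_controlled_data_processing, "controlled data processing"),
        (use_case.wants_strategic_ownership, "strategic ownership"),
    ]
    return [reason for triggered, reason in checks if triggered]


def recommend(use_case: UseCase) -> str:
    """Return 'custom' if any criterion triggers, otherwise 'managed'."""
    return "custom" if reasons_for_custom(use_case) else "managed"


pilot = UseCase(monthly_calls=6_000)
clinic = UseCase(monthly_calls=48_000, needs_controlled_data_processing=True)
print(recommend(pilot), recommend(clinic), reasons_for_custom(clinic))
```

Listing the reasons, rather than returning only the verdict, keeps the trade-off visible when a team sits near the boundary.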

How does Mobisoft help with AI voice agent development?

We deliver in three stages. Discovery, at 2 to 3 weeks and £8,000 to £15,000, covers call analysis, architecture, integration assessment, and the ROI model, with a go or no-go recommendation. Build, at 8 to 12 weeks and fixed price, delivers a production agent with integrations, conversation design, compliance, and QA. Optimise, over 3 to 6 months, covers reporting, prompt iteration, and containment improvement. Engagement models include Discovery Only, Fixed-Price MVP at £35,000 to £80,000, Phased Platform, and Pod Augmentation. Specialisms span healthcare, financial services, contact centre integration, multilingual voice, and on-premise LLM.

This content is for informational purposes only and may include AI-assisted research or content generation. While we strive for accuracy, information may evolve over time. Readers are advised to independently verify critical information before making decisions.

Nitin Lahoti

Co-Founder and Director

Nitin Lahoti is the Co-Founder and Director at Mobisoft Infotech. He has 15 years of experience in Design, Business Development and Startups. His expertise is in Product Ideation, UX/UI design, Startup consulting and mentoring. He prefers business readings and loves traveling.
