The Costs and Challenges of Building my AI-Generated Podcast
A retrospective on the process, constraints, and realities of building with voice AI

Picking up where we left off
Welcome again!
If you haven't already, I recommend reading Part 1 of this series to get a holistic idea of what I was trying to accomplish when I set out to build Listen Better. That post contains a lot more detail on what exactly I built and what influenced my decisions as I built it. You're currently reading a reflection on that process, so some of the technical details will still be relevant here.
Here's a quick refresher: Listen Better is a web app that takes in French audio, translates and explains each sentence in it, and generates a new audio file featuring AI-generated podcast hosts who discuss the translations and explanations.
In this article, I'll discuss the cost of creating an episode and offer some thoughts on the process of building the app.
Costs, you say? Spill the beans!
Okay okay, I will! As you'll recall from Part 1, there are 3 sets of services underpinning Listen Better: transcription using Deepgram, translation and explanation using OpenAI's GPT-5, and audio generation using Google's Gemini 2.5 Flash TTS. Quick side note: this "Gemini 2.5 Flash TTS" name is so annoying to pronounce and type. For the rest of this post I'll just call it "G25T".
The pricing of those 3 services is a mix of per-minute and per-character billing, so presenting the costs in that same manner won't be particularly insightful here. What's more, there's obviously no fixed size for audio input sourced from all over the internet, there's no fixed length for this project's translation and explanation outputs, and consequently no fixed duration for audio output either. Thus, such intuitive anchor points don't really exist here. For the numbers to make sense, it's necessary to view them in the context of a unit that's applicable to this project. That unit will be a "daily" episode.
A daily episode is the longest episode I can generate in one day, which is determined by the rate limits of the Gemini API. Of the 3 rate limit dimensions described on that page, requests per day (RPD) is the one that always blocks my progress. My account has a limit of 100 RPD, which isn't bad for the deployed application, but was wayyy too low during development. It was frustrating at first but I had to become stoic about it because, as the saying goes, it eezz what it eezzzz. The daily episode tends to be around 90 minutes on average, which is enough to fill my French listening cup on any given day.
With all that said, what then is the cost of a daily episode? Brace yourself… are you ready… on average, it costs a whopping $1.33! Does that surprise you? I thought this project would be very pricey, so I went into it with the expectation that satisfactory output, whatever it would look like, would cost around $5 a day. So I was pleasantly surprised by that amount, even if it's still higher than I'd like. You may already be wondering how that number breaks down across the 3 services, so here you go: Deepgram transcription contributes a mercifully insignificant 2 cents, GPT-5 translation + explanation throws in a manageable 21 cents, whereas Gemini TTS greedily hogs the remaining $1.10. Frankly, I wasn't surprised that TTS was the most expensive part, though I did expect the transcription to cost a lot more. To understand why I say that, let's explore what that money actually buys.
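If you want to sanity-check the arithmetic, the breakdown is a straightforward sum. The figures below are simply the averages quoted above, hard-coded for illustration; they don't come from any provider API:

```python
# Average cost of one daily episode, split across the 3 services.
# These are the observed averages from the post, in US dollars.
costs = {
    "Deepgram transcription": 0.02,
    "GPT-5 translation + explanation": 0.21,
    "Gemini 2.5 Flash TTS (G25T)": 1.10,
}

total = sum(costs.values())
for service, cost in costs.items():
    print(f"{service}: ${cost:.2f} ({cost / total:.0%} of the total)")
print(f"Daily episode: ${total:.2f}")
```

As the percentages make obvious, any meaningful cost optimization has to target the TTS line item.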
The deployed application generates 3 audio segments for each sentence: an intro, the main translation + explanation, and an outro. 100 RPD therefore means about 33 sentences, though I round that down to 30 to create a little buffer. This number of sentences is a bit misleading because sentences can be long or short. The application actually combines shorter sentences up to a limit of 150 characters, and breaks longer sentences down into pieces of around 200 characters. This strategy keeps the explanation within the 1500-2500 character range mentioned in Part 1. So the more accurate statement is that the daily episode can contain explanations for about 30 chunks of transcribed French sentences from the original audio, with each chunk containing 150-200 characters of text. In reality, this represents around 260 seconds (~4.3 minutes) of input audio, on average. Knowing that, it makes sense why the transcription costs so little, right? This project simply doesn't consume a lot of input audio in a single daily episode. If you look at the pricing pages of both the transcription and TTS services, you'll notice that the Speech-to-Text transcription (pre-recorded, Nova-3 multilingual) costs 4.3 cents per minute of audio input, which is nearly three times the TTS service's 1.5 cents per minute of audio output. Ergo, doing less of the most expensive thing remains a great way to save costs.
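The combine-and-split behavior can be sketched in a few lines. This is only an illustration of the chunking idea, not the app's actual code: the `chunk_sentences` name and the greedy word-boundary splitting are my own assumptions, and the real implementation also tracks per-sentence metadata:

```python
def chunk_sentences(sentences, merge_limit=150, max_len=200):
    """Greedily merge short sentences and split long ones so each
    chunk lands near the 150-200 character range described above."""
    chunks, buffer = [], ""
    for sentence in sentences:
        candidate = f"{buffer} {sentence}".strip()
        if len(candidate) <= merge_limit:
            buffer = candidate  # still short: keep merging
            continue
        if len(candidate) <= max_len:
            chunks.append(candidate)  # merged chunk is in range: emit it
            buffer = ""
            continue
        # Too long: emit whatever was buffered, then split the long
        # sentence on word boundaries into pieces of at most max_len.
        if buffer:
            chunks.append(buffer)
        piece = ""
        for word in sentence.split():
            if piece and len(piece) + 1 + len(word) > max_len:
                chunks.append(piece)
                piece = word
            else:
                piece = f"{piece} {word}".strip()
        buffer = piece  # the tail piece may merge with the next sentence
    if buffer:
        chunks.append(buffer)
    return chunks
```

With 30 such chunks per daily episode and roughly 150-200 characters each, the ~4.3 minutes of input audio per episode falls out naturally.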
Can it be cheaper?
Simple answer: sort of, but the quality of output will almost certainly be worse. My notion of quality is admittedly quite squishy and poorly defined. It's closer to "I know it when I hear it" than "it hits these benchmarks". Why? Because the output of this project really is a matter of taste. So if you're on the hunt for a TTS model to serve your needs, your mileage may indeed vary. Anyways, the cost of TTS is the elephant in the room here, as you just saw. So I'll focus on that.
Geminiās Batch API
The Batch API promises a 50% cost reduction for the same quality of output if you're willing to accept delayed results. That would've been terrific here, and I went as far as to build out the logic to use the Batch API… only to discover that it's not supported for the TTS model. Part 1 of this series captured some of the frustrations I experienced when working with the Google model, and this was yet another one. I console myself with the knowledge that I have the code ready to go if they ever decide to support the Batch API on that model.
Other proprietary TTS models
As a category, proprietary models generally gave me better results compared to open-source models. Hereās a list of other proprietary models I considered or tested, and a brief summary of the reasons I didnāt end up choosing them:
Google WaveNet TTS model: Google has a number of legacy TTS models, one of which is the WaveNet model. It's cheaper than G25T, which is promising. However, if you listen to the audio sample of this model, I suspect you'll agree with me that it's a bit too robotic and monotone for this project. Cheaper? Yes. Better? Nope.
Gemini 2.5 Pro TTS: as the name suggests, this is G25T's more capable sibling. Great output, sometimes noticeably better than what I got from G25T. However, for double the price, "sometimes" just wasn't good enough. Cheaper? Nope. Better? Often, but not often enough.
ElevenLabs: their v3 model is particularly impressive, with results comparable to the Gemini models. Sometimes I felt this model's output was better, other times not. However… look at the pricing page and you'll see why I didn't choose this option. The Creator Plan is the cheapest plan with no hard cap on available minutes. It starts at 22 cents per minute, which is 15 times more expensive than G25T! What's more, you only enjoy that price for a miserly 100 minutes, which is basically just one daily episode. After 100 minutes, the price rises to 30 cents per minute, so 20x more expensive! Even though the higher-priced plans have lower unit prices, those unit prices are still many multiples of G25T's unit price, and they require a volume of spend that I was certainly not going to commit to this project. Oh well, until we meet again, ElevenLabs.
OpenAI models: three of them, to be precise: gpt-4o-mini-tts, TTS, and TTS HD. Only gpt-4o-mini-tts has a price that competes with G25T; it costs about the same. The other 2 are more expensive. More importantly, after testing the models in their playground, I felt that G25T gave me better control over emotional expression, and ultimately better output.
InWorld TTS models: 2 models, TTS-1 and TTS-1-Max. The first is half the price of G25T, whereas the second costs the same. In terms of output, TTS-1-Max predictably had better output than TTS-1, but not as good as G25T. I wasn't able to dial in the French pronunciations or emotional expression as consistently as with G25T, so for the same amount of money it just didn't make sense to use this one. Still good models though, and you can get a sample here.
Open-Source TTS models
I actually started out looking for open-source TTS models because I initially assumed that proprietary models would be too expensive to be worth the bother. As it turns out, open-source TTS models that support French were not as readily available as I expected. For the few that I did find, I either didn't like their speech output or didn't have the hardware to test them. To bridge the hardware gap I considered using an inference provider, but after seeing the number of options to explore, and factoring in the unsatisfactory results from the models that were already accessible to me, I decided to switch to proprietary models. As you've already seen, I didn't look back.
Drawbacks of the audio processing strategy
No control over changes to the underlying model
Model providers are always updating their models, and sometimes users have no option other than to accept such changes to keep using those models. After Listen Better first went live, Google published an update to G25T that they said would bring good tidings all across the board: "enhancements" that included "better expressivity", "precision pacing", "seamless dialogue", and "significant improvements" to overall audio quality. Had Christmas really come early? The key part was that the update would happen in place, with no action required from me. Crucially, I couldn't opt out and adopt the change on my own schedule. In other words, with no extra effort on my part, everything would just get better overnight. If that sounds too good to be true… yeah, it was too good to be true.
The first problem was that the distinct male and female voices I had chosen started sounding more gender-neutral more often, to the point that sometimes I wouldn't even be sure who was speaking. After a few days I could tell that this was part of a wider issue where the new model blends the roles and voices of the speakers. A line attributed to one speaker in the dialogue script sometimes gets read by the other speaker in the generated audio. And that's not even the strangest part: occasionally the active speaker fully transitions from one of my selected voices to the other one literally mid-speech! Voice metamorphosis. This problem was magnified by my usage of speaker names in the dialogue script, because I wanted the speakers to reference each other by name to make the dialogue sound more natural. However, when Marie speaks a line that's meant for Clément, and calls out her own name because Clément was scripted to call her by name, the listening ear can't help but notice the proverbial record scratch. I eventually took out the names because of how annoying this got. At least without the names it just sounds like 2 people speaking alternately.
The second problem was that the new model would occasionally skip segments of the dialogue script and sometimes add extensive periods of silence to the generated speech. The same markup tags that I had used to introduce natural-sounding pauses between sentences and words seemed to cause the speakers to pause for much longer than expected. Sometimes they would pause for tens of seconds if not minutes, and in some cases they would stop speaking entirely without reading out the full dialogue script. I ended up adding code to perform an additional check for extended silence on each generated audio file. Whenever more than 10 seconds of silence is detected, the audio gets regenerated. That's not an ideal solution because extra generations mean more money spent. Thankfully the problem is rare enough that this additional cost remains negligible.
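A silence check like the one described can be sketched as a scan over decoded audio samples. This is my own minimal version, not the app's actual code: I'm assuming the audio has already been decoded to normalized floating-point samples (e.g. via pydub or ffmpeg), and the 0.01 amplitude threshold for "silence" is an illustrative guess:

```python
def longest_silence_ms(samples, frame_rate, threshold=0.01):
    """Length in ms of the longest run of samples whose absolute
    amplitude (normalized to 1.0) stays below `threshold`."""
    longest = run = 0
    for s in samples:
        if abs(s) < threshold:
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # non-silent sample breaks the run
    return longest * 1000 / frame_rate

def needs_regeneration(samples, frame_rate, max_silence_ms=10_000):
    """Flag a generated segment for regeneration when it contains
    a silent stretch longer than max_silence_ms (10 s by default)."""
    return longest_silence_ms(samples, frame_rate) > max_silence_ms
```

In practice you would run `needs_regeneration` on each freshly generated audio file and retry the TTS request when it returns True, accepting the extra generation cost mentioned above.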
All that said, was the update really so bad? Nope, not at all. In fact, with hindsight I can now say the update was good. I do believe the audio quality has improved, as well as the model's adherence to the prompt and script… well, at least when the audio is complete! The real problem was that I didn't have the option of testing the new model and refining my strategy to take advantage of its strengths in a controlled environment before pushing it to the deployed application. In a professional setting, that could've been a very big problem.
Non-deterministic LLM output
The same script content does not consistently generate the same audio content. Some words get mispronounced, others get swapped for closely related alternatives, and some are skipped entirely (although without really changing the meaning of the overall sentence). This variability is not necessarily a bad thing, because deterministic output can very quickly sound boring and monotonous. However, in this particular project, any word used in the generated speech that wasn't part of the script is a potential source of misunderstanding for me, the end listener. Thus, script adherence was essential for keeping Listen Better usable. Thankfully, G25T is already quite good at sticking closely to the original script. With a few stern instructions added to the prompt, I have been able to keep the occurrence of unintended variations to a satisfactorily low level.
Transcription inaccuracies
Deepgram can sometimes identify the wrong words in the audio. When I've seen this happen, it's been with words that sound the same but are spelled differently. Normally, the surrounding context indicates which word was most likely said, and Deepgram generally handles that really well. The real issue is that solving the problem is challenging unless there's a preexisting canonical transcript to compare against. Thankfully, I haven't really needed to solve this issue because of the very high accuracy of the transcriptions. In Listen Better, I've only ever noticed this issue twice (affecting just 2 words) in over 28,000 transcribed words. Yeah, no urgent need to fix that!
Was the coding process a vibe?
With coding agents, writing the code was as doable as for any typical web-based project these days. For the most part I accepted the UI choices of the coding agents (with some refinements, of course) because UI design is not one of my strengths. It would be accurate to say that a combination of Cursor, Codex, and Claude Code built the vast majority of the app! This doesn't mean I handed over control completely in the way that the "vibe coding" term suggests. I've been a developer for some years, so I find it difficult to accept autogenerated code without building up the confidence that I know what it's doing and not doing. So even though most of the code was written by coding agents, I set the direction and performed course correction by constantly reviewing and refining the outputs throughout the process.
Estimating the time I spent building Listen Better is tricky. It was not a full-time commitment, and I didn't build from a pre-defined product specification. Nevertheless, if I could squash all the time together, I would estimate that it took me a month of full-time work to go from initial exploration to the current web app that I use daily. A big chunk of that time was spent learning a few new things, most notably TTS, RSS, agentic coding workflows, and self-hosting. I was also simultaneously making product and engineering decisions, some of which came more naturally to me than others, and most of which I didn't worry about until after I started getting results from the TTS pipeline. Essentially, I created a working TTS pipeline and then bolted a UI and RSS integration on top! Knowing what I know now, if the UI, product behavior, and TTS strategy are all specified in detail in advance, I believe a developer skilled at using coding agents could recreate the app in no more than 3 days of full-time work. That includes code review, refinement, and testing, plus a number of admin features not visible to regular users.
What was the biggest bottleneck I faced?
Simply put, wrangling the TTS models to get my desired output. Right from my earliest explorations of the concept for Listen Better, the first and most important signal to judge was the quality of generated audio output. My ideal TTS model would need to fluidly handle multilingual speech, support multi-speaker audio generation, and provide tools for tweaking the tone and energy of the speakers according to my preferences. To properly assess how well a given model fit my objectives, I needed to spend some time prompting and prodding it until I felt confident about what its "best" output could be, even if I hadn't yet gotten that level of output. This prompting and prodding exercise involved:
doing a lot of good old "prompt engineering"
changing the code continuously to match model features and APIs
exploring the effects of controlling tone and emotion on the final output
trying different tactics to
ensure words in each language were pronounced appropriately
maintain the prosody (meaning rhythm and intonation) of speakers across multiple generation attempts when hitting character limits
get the most natural-sounding dialogue, either by generating each speaker's portion independently then stitching them together, or, if available, using the model's built-in dialogue generation capabilities
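The stitching tactic from that last point can be sketched over raw sample buffers. This is only an illustrative sketch under my own assumptions (decoded, same-rate sample lists per dialogue turn; a fixed gap between speakers), not what the deployed app does, since it ultimately relied on the model's built-in multi-speaker generation:

```python
def stitch_turns(turns, frame_rate, gap_ms=300):
    """Concatenate per-speaker sample buffers in script order,
    inserting a short silent gap between consecutive turns.

    `turns` is a list of sample lists, one per line of dialogue,
    all decoded at the same frame_rate."""
    gap = [0.0] * int(frame_rate * gap_ms / 1000)  # silence between speakers
    episode = []
    for i, clip in enumerate(turns):
        if i:
            episode.extend(gap)  # pause before every turn except the first
        episode.extend(clip)
    return episode
```

The hard part, as noted above, isn't the concatenation itself but keeping prosody consistent across the independently generated clips, which is why built-in dialogue generation won out when a model offered it.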
This exploration looked a bit different for each model, and there's no standard API or feature set for TTS generation. Overall, the process took much longer than I expected going in. It left me with the feeling that we're still very much in the earliest innings of what can broadly be called voice AI, despite the outputs already sounding so good in some cases.
Rounding up
Listen Better was born out of a very personal desire to improve my listening comprehension in a foreign language. Has this project been enough to fill that need? No, though it's certainly been helpful. My vocabulary has grown faster than before I started using it. What's more, hearing French words and expressions repeatedly from different audio sources has been reinforcing them in my subconscious in a way that I would not have managed unless I moved to France! Or some other French-speaking region. For now, the face twitch I mentioned in Part 1 hasn't gone away. Word by word, sentence by sentence, and in combination with other learning efforts, I'm optimistic that it will.
Beyond the language-learning outcome, this project reacquainted me with today's reality that long-form AI-generated content can only be created in small chunks that add up to a whole, often seconds or at best minutes at a time. In my case I had to stick to strict character limits and then combine audio segments into full episodes. Input and output limits are not going away anytime soon, so learning to work with them effectively remains necessary.
And with that, I say a big « merci ! » to you for reading this post. Until next time!
Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on LinkedIn.



