<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[In Code This Means { ... }]]></title><description><![CDATA[Lessons learned in the process of becoming a better software programmer.]]></description><link>https://incodethismeans.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 04:49:23 GMT</lastBuildDate><atom:link href="https://incodethismeans.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[The Costs and Challenges of Building my AI-Generated Podcast]]></title><description><![CDATA[Picking up where we left off
Welcome again! 👋🏾
If you haven’t already, I recommend reading Part 1 of this series to get a holistic idea of what I was trying to accomplish when I set out to build Listen Better. That post contains a lot more detail ...]]></description><link>https://incodethismeans.com/the-costs-and-challenges-of-building-my-ai-generated-podcast</link><guid isPermaLink="true">https://incodethismeans.com/the-costs-and-challenges-of-building-my-ai-generated-podcast</guid><category><![CDATA[tts]]></category><category><![CDATA[vibe coding]]></category><category><![CDATA[GPT-5]]></category><category><![CDATA[deepgram]]></category><category><![CDATA[gemini]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Wed, 28 Jan 2026 07:01:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769434052679/ca26d2b1-1369-491b-80db-fe5f75a9a48a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-picking-up-where-we-left-off">Picking up where we left off</h2>
<p>Welcome again! 👋🏾</p>
<p>If you haven’t already, I recommend reading <a target="_blank" href="https://incodethismeans.com/can-ai-solve-listening-comprehension">Part 1</a> of this <a target="_blank" href="https://incodethismeans.com/series/building-listen-better">series</a> to get a holistic idea of what I was trying to accomplish when I set out to build <a target="_blank" href="https://listenbetter.prototypes.haus">Listen Better</a>. That post contains a lot more detail on what exactly I built and what influenced my decisions as I built it. You’re currently reading a reflection on that process, so some of the technical details will still be relevant here.</p>
<p>Here’s a quick refresher: Listen Better is a web app that takes in French audio, translates and explains each sentence in it, and generates a new audio file featuring AI-generated podcast hosts who discuss the translations and explanations.</p>
<p>In this article, I’ll discuss the cost of creating an episode and offer some thoughts on the process of building the app.</p>
<h2 id="heading-costs-you-say-spill-the-beans">Costs, you say? Spill the beans! 👀</h2>
<p>Okay okay, I will! As you’ll recall from Part 1, there are 3 sets of services underpinning Listen Better: transcription using Deepgram, translation and explanation using OpenAI’s GPT-5, and audio generation using Google’s Gemini 2.5 Flash TTS. Quick side note, this “Gemini 2.5 Flash TTS” name is so annoying to pronounce and type. For the rest of this post I’ll just call it “G25T”.</p>
<p>The pricing of those 3 services is a mix of per-minute and per-character billing, so presenting the costs in that same manner won’t be particularly insightful here. More so, there’s obviously no fixed size for audio input sourced from all over the internet, there’s no fixed length on this project’s translation and explanation outputs, and consequently no fixed duration on audio output either. Thus, such intuitive anchor points don’t really exist here. For the numbers to make sense, it’s necessary to view them in the context of a unit that’s applicable to this project. That unit will be a “daily” episode.</p>
<p>A daily episode is the longest episode I can generate in one day, which is determined by the <a target="_blank" href="https://ai.google.dev/gemini-api/docs/rate-limits">rate limits</a> of the Gemini API. Of the 3 rate limit dimensions described on that page, <strong>requests per day (RPD)</strong> is the one that always blocks my progress 🥺. My account has a limit of 100 RPD, which isn’t bad for the deployed application, but was wayyy too low during development. It was frustrating at first but I had to become stoic about it because, as the saying goes, <a target="_blank" href="https://youtube.com/shorts/KpXsfimrkFo?si=ncqQijXTA2cl1scp">it eezz what it eezzzz</a>. The daily episode tends to be around 90 minutes on average, which is enough to fill my French listening cup on any given day.</p>
<p>With all that said, what then is the cost of a daily episode? Brace yourself… are you ready… on average, it costs a whopping <strong>$1.33</strong>! 😄 Does that surprise you? I thought this project would be very pricey so I went into it with the expectation that satisfactory output, whatever it would look like, would cost around $5 a day. So I was pleasantly surprised by that amount, even if it’s still higher than I’d like. You may already be wondering how that number breaks down across the 3 services, so here you go: Deepgram transcription contributes a mercifully insignificant <strong>2 cents</strong> 😁, GPT-5 translation + explanation throws in a manageable <strong>21 cents</strong> 🙂, whereas Gemini TTS greedily hogs the remaining <strong>$1.10</strong> 🫠. Frankly, I wasn’t surprised that TTS was the most expensive part, though I did expect the transcription to cost a lot more. To understand why I say that, let’s explore what that money actually buys.</p>
<p>The deployed application generates 3 audio segments for each sentence: an intro, the main translation + explanation, and an outro. 100 RPD therefore means about 33 sentences, though I round that down to 30 to create a little buffer. This number of sentences is a bit misleading because sentences can be long or short. The application actually combines shorter sentences up to a limit of 150 chars, and breaks up longer sentences down to around 200 chars. This strategy keeps the explanation within the 1500-2500 character range as mentioned in Part 1. So the more accurate statement is that the daily episode can contain explanations for about 30 chunks of transcribed French sentences from the original audio, with each chunk containing 150-200 characters of text. In reality, this represents around <strong>260 seconds</strong> (~4.3 minutes) of input audio, on average. Knowing that, it makes sense why the transcription costs so little, right? This project simply doesn’t consume a lot of input audio in a single daily episode. If you look at the pricing pages of both the <a target="_blank" href="https://deepgram.com/pricing">transcription</a> and <a target="_blank" href="https://cloud.google.com/text-to-speech/pricing?hl=en">TTS</a> services, you’ll notice that the Speech-to-Text transcription (pre-recorded, Nova-3 multilingual) costs <strong>4.3 cents</strong> per minute of audio input, which is nearly <em>three times</em> more expensive than the TTS service’s <strong>1.5 cents</strong> per minute of audio output. Ergo, doing less of the most expensive thing remains a great way to save costs 😀.</p>
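<p>To make the request budget and the chunking rule concrete, here’s a rough sketch. It is not the code running in Listen Better: the constant names, the <code>splitLong</code> and <code>chunkSentences</code> helpers, and the choice to split on whitespace are all assumptions made for illustration.</p>

```typescript
// Hypothetical constants mirroring the numbers quoted above.
const REQUESTS_PER_DAY = 100;
const SEGMENTS_PER_CHUNK = 3; // intro + main translation/explanation + outro
const SAFETY_BUFFER = 3; // chunks held back as a buffer
const MERGE_LIMIT = 150; // keep merging short sentences up to this many characters
const SPLIT_LIMIT = 200; // break long sentences into pieces of about this size

const maxChunks = Math.floor(REQUESTS_PER_DAY / SEGMENTS_PER_CHUNK); // 33
const dailyChunks = maxChunks - SAFETY_BUFFER; // 30

// Break one long sentence into pieces no longer than `limit`, cutting on spaces.
function splitLong(sentence: string, limit: number): string[] {
  const pieces: string[] = [];
  let current = "";
  for (const word of sentence.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length > limit && current) {
      pieces.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}

// Merge short sentences and split long ones into chunks ready for explanation.
function chunkSentences(sentences: string[]): string[] {
  const chunks: string[] = [];
  let pending = "";
  for (const sentence of sentences) {
    for (const piece of splitLong(sentence.trim(), SPLIT_LIMIT)) {
      const merged = pending ? `${pending} ${piece}` : piece;
      if (merged.length <= MERGE_LIMIT) {
        pending = merged; // still short enough, keep merging
      } else {
        if (pending) chunks.push(pending);
        pending = piece;
      }
    }
  }
  if (pending) chunks.push(pending);
  return chunks;
}

// Only the first `dailyChunks` chunks fit into one daily episode.
const todaysChunks = chunkSentences(["Bonjour à tous.", "Ça va ?"]).slice(0, dailyChunks);
console.log(todaysChunks);
```

<p>A real implementation would probably prefer to cut long sentences at punctuation rather than at arbitrary spaces, but the budget arithmetic stays the same either way.</p>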
<h3 id="heading-can-it-be-cheaper">Can it be cheaper?</h3>
<p>Simple answer: sort of, but the quality of output will almost certainly be worse. My notion of quality is admittedly quite squishy and poorly defined. It’s closer to “I know it when I hear it” than “it hits these benchmarks”. Why? Because the output of this project really is a matter of taste. So if you’re on the hunt for a TTS model to serve your needs, your mileage may indeed vary. Anyways, the cost of TTS is the elephant in the room here, as you just saw. So I’ll focus on that.</p>
<h4 id="heading-geminis-batch-api">Gemini’s Batch API</h4>
<p>The Batch API promises a <a target="_blank" href="https://ai.google.dev/gemini-api/docs/pricing">50% cost reduction</a> for the same quality of output if you’re willing to accept delayed results. That would’ve been terrific here, and I went as far as to build out the logic to use the batch API… only to <a target="_blank" href="https://github.com/googleapis/js-genai/issues/1077">discover that it’s not supported</a> for the TTS model 😞. Part 1 of this series captured some of the frustrations I experienced when working with the Google model, and this was yet another one. I console myself with the knowledge that I have the code ready to go if they ever decide to support the Batch API on that model.</p>
<h4 id="heading-other-proprietary-tts-models">Other proprietary TTS models</h4>
<p>As a category, proprietary models generally gave me better results compared to open-source models. Here’s a list of other proprietary models I considered or tested, and a brief summary of the reasons I didn’t end up choosing them:</p>
<ol>
<li><p><strong>Google WaveNet TTS model</strong>: Google has a number of <a target="_blank" href="https://cloud.google.com/text-to-speech/pricing?hl=en">legacy TTS models</a>, one of which is the WaveNet model. It’s cheaper than G25T, which is promising. However, if you listen to the <a target="_blank" href="https://docs.cloud.google.com/text-to-speech/docs/list-voices-and-types#wavenet_voices">audio sample</a> of this model, I suspect you’ll agree with me that it’s a bit too robotic and monotone for this project. Cheaper? Yes. Better? Nope.</p>
</li>
<li><p><strong>Gemini 2.5 Pro TTS</strong>: as the name suggests, this is G25T’s more capable sibling. Great output, sometimes noticeably better than what I got from G25T. However, for <a target="_blank" href="https://ai.google.dev/gemini-api/docs/pricing#gemini-2.5-pro-preview-tts">double the price</a>, “sometimes” just wasn’t good enough. Cheaper? Nope. Better? Often, but not often enough.</p>
</li>
<li><p><strong>ElevenLabs</strong>: their <a target="_blank" href="https://elevenlabs.io/v3">v3 model</a> is particularly impressive, with comparable results to the Gemini models. Sometimes I felt this model’s output was better, other times not. However… look at the <a target="_blank" href="https://elevenlabs.io/pricing?price.model=highest_quality#pricing-table">pricing page</a> and you’ll see why I didn’t choose this option. The Creator Plan is the cheapest plan with no hard cap on available minutes. It <em>starts</em> at <strong>22 cents</strong> per minute, which is <strong>15 times more expensive</strong> than G25T! More so, you only enjoy that price for a miserly 100 minutes, which is basically just one daily episode 😂. After 100 minutes, the price rises to <strong>30 cents</strong> per minute, so <strong>20x more expensive</strong>! 🥵 Even though the higher priced plans have lower unit prices, those unit prices are still many multiples of G25T’s unit price, and they require a volume of spend that I was certainly not going to commit to this project. Oh well, until we meet again, ElevenLabs.</p>
</li>
<li><p><strong>OpenAI models</strong>: three of them, to be precise: <code>gpt-4o-mini-tts</code>, <code>TTS</code> and <code>TTS HD</code>. Only <code>gpt-4o-mini-tts</code> has a <a target="_blank" href="https://platform.openai.com/docs/pricing#transcription-and-speech">price that competes</a> with G25T; it costs about the same. The other 2 are more expensive. More importantly, after testing the models in their <a target="_blank" href="https://platform.openai.com/audio/tts">playground</a>, I felt that G25T gave me better control over emotional expression, and ultimately better output.</p>
</li>
<li><p><strong>InWorld TTS models</strong>: <a target="_blank" href="https://inworld.ai/pricing">2 models</a>, <code>TTS-1</code> and <code>TTS-1-Max</code>. The first is half the price of G25T, whereas the second costs the same. In terms of output, <code>TTS-1-Max</code> predictably had better output than <code>TTS-1</code>, but not as good as G25T. I wasn’t able to dial in the French pronunciations or emotional expression as consistently as with G25T, so for the same amount of money it just didn’t make sense to use this one. Still good models though, and you can get a sample <a target="_blank" href="https://inworld.ai/tts">here</a>.</p>
</li>
</ol>
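<p>For anyone who wants to double-check the multiples in that list, here’s the arithmetic as a tiny sketch. The per-minute figures are the ones quoted in this post (with G25T at roughly 1.5 cents per minute); the variable and function names are made up for illustration, and actual prices may have changed since.</p>

```typescript
// Per-minute prices in US cents, taken from the figures quoted in this post.
const G25T_CENTS_PER_MIN = 1.5;

const alternatives = [
  { name: "Gemini 2.5 Pro TTS", centsPerMin: 2 * G25T_CENTS_PER_MIN }, // "double the price"
  { name: "ElevenLabs Creator (first 100 min)", centsPerMin: 22 },
  { name: "ElevenLabs Creator (after 100 min)", centsPerMin: 30 },
  { name: "InWorld TTS-1", centsPerMin: 0.5 * G25T_CENTS_PER_MIN }, // "half the price"
];

// How many times more (or less) expensive than G25T, rounded to one decimal.
function multipleOfG25T(centsPerMin: number): number {
  return Math.round((centsPerMin / G25T_CENTS_PER_MIN) * 10) / 10;
}

for (const { name, centsPerMin } of alternatives) {
  console.log(`${name}: ${multipleOfG25T(centsPerMin)}x the price of G25T`);
}
```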
<h4 id="heading-open-source-tts-models">Open-Source TTS models</h4>
<p>I actually started out looking for open-source TTS models because I initially assumed that proprietary models would be too expensive to be worth the bother. As it turns out, open-source TTS models that support French were not as readily available as I expected. For the few that I did find, I either didn’t like their speech output, or didn’t have the hardware to test them. To bridge the hardware gap I considered using an inference provider, but after seeing the <a target="_blank" href="https://huggingface.co/docs/inference-providers/en/index">number of options</a> to explore, and factoring in the unsatisfactory results from the models that were already accessible to me, I decided to try out proprietary models at that point. As you’ve already seen, I didn’t look back.</p>
<h2 id="heading-drawbacks-of-the-audio-processing-strategy">Drawbacks of the audio processing strategy</h2>
<h3 id="heading-no-control-over-changes-to-the-underlying-model">No control over changes to the underlying model</h3>
<p>Model providers are always updating their models, and sometimes users have no other option than to accept such changes to keep using those models. After Listen Better first went live, Google published an update to G25T that they said would bring good tidings all across the board: “enhancements” that included “better expressivity”, “precision pacing”, “seamless dialogue”, and “significant improvements” to overall audio quality. Had Christmas 🎅🏾🎁 really come early? The key part was that the update would happen in place, with no action required from me. Crucially, I couldn’t opt out or adopt the change on my own schedule. In other words, with no extra effort on my part, everything would just get better overnight. If that sounds too good to be true… yeah, it was too good to be true.</p>
<p>The first problem was that the distinct male and female voices I had chosen started sounding more gender neutral more often, to the point that I wouldn’t even be sure who was speaking sometimes. After a few days I could tell that this was part of a wider issue where the new model now blends the roles and voices of the speakers. A line attributed to one speaker in the dialogue script sometimes gets read by the other speaker in the generated audio. And that’s not even the strangest part: occasionally the active speaker fully transitions from one of my selected voices to the other one literally mid-speech! Voice metamorphosis 😎. This problem was magnified by my usage of speaker names in the dialogue script, because I wanted the speakers to reference themselves by name to make the dialogue sound more natural. However, when Marie speaks a line that’s meant for Clément, and calls out her own name because Clément was scripted to call her by name, the listening ear can’t help but notice the proverbial record scratch. I eventually took out the names because of how annoying this got. At least without the names it just sounds like 2 people speaking alternately 🤷🏾‍♂️.</p>
<p>The second problem was that the new model would occasionally skip segments of the dialogue script and sometimes add extensive periods of silence in the generated speech. The same markup tags that I had used to introduce natural-sounding pauses between sentences and words seemed to cause the speakers to pause for much longer than expected. Sometimes they would pause for tens of seconds if not minutes, and in some cases, they would stop speaking entirely without reading out the full dialogue script. I ended up adding code to perform an additional check for extended silence on each generated audio file. Whenever more than 10 seconds of silence is detected, the audio gets regenerated. That’s not an ideal solution because extra generations means more money spent. Thankfully the problem is rare enough that this additional cost remains negligible.</p>
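<p>The silence check can be sketched roughly like this. It assumes the generated audio has already been decoded into 16-bit mono PCM samples at 24 kHz; the amplitude threshold, the function names, and the three-attempt cap are illustrative assumptions rather than the exact values used in Listen Better.</p>

```typescript
// Assumed audio format and thresholds (tune these for your own output).
const SAMPLE_RATE_HZ = 24000;
const SILENCE_AMPLITUDE = 200; // samples at or below this (out of 32767) count as silent
const MAX_SILENCE_SECONDS = 10;

// Returns the longest run of near-silent samples, in seconds.
function longestSilenceSeconds(
  samples: Int16Array,
  sampleRate: number = SAMPLE_RATE_HZ,
  threshold: number = SILENCE_AMPLITUDE,
): number {
  let longest = 0;
  let current = 0;
  for (let i = 0; i < samples.length; i += 1) {
    if (Math.abs(samples[i]) <= threshold) {
      current += 1;
      if (current > longest) longest = current;
    } else {
      current = 0;
    }
  }
  return longest / sampleRate;
}

function needsRegeneration(samples: Int16Array): boolean {
  return longestSilenceSeconds(samples) > MAX_SILENCE_SECONDS;
}

// Calls the (hypothetical) `generate` callback until the audio passes the check.
async function generateWithSilenceCheck(
  generate: () => Promise<Int16Array>,
  maxAttempts: number = 3,
): Promise<Int16Array> {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    const samples = await generate();
    if (!needsRegeneration(samples)) return samples;
  }
  throw new Error(`Audio still had long silences after ${maxAttempts} attempts`);
}
```

<p>Capping the number of attempts matters here because every retry is another billed TTS request that also counts against the daily request limit.</p>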
<p>All that said, was the update really so bad? Nope, not at all. In fact, with hindsight I can now say the update was good. I do believe the audio quality has improved, as well as the model’s adherence to the prompt and script… well, at least when the audio is complete! The real problem was that I didn’t have the option of testing the new model and refining my strategy to take advantage of its strengths in a controlled environment, before pushing it to the deployed application. In a professional setting, that could’ve been a very big problem.</p>
<h3 id="heading-non-deterministic-llm-output">Non-deterministic LLM output</h3>
<p>The same script content does not consistently generate the same audio content. Some words get mispronounced, others get swapped for closely-related alternatives, and some are skipped entirely (although without really changing the meaning of the overall sentence). This variability is not necessarily a bad thing because deterministic output can very quickly sound boring and monotonous. However, in this particular project, any word used in the generated speech that wasn’t part of the script is a potential source of misunderstanding for me, the end listener. Thus, script adherence was essential for keeping Listen Better usable. Thankfully, G25T is already quite good at sticking closely to the original script. With a few stern instructions added to the prompt, I have been able to keep the occurrence of unintended variations to a satisfactorily low level.</p>
<h3 id="heading-transcription-inaccuracies">Transcription inaccuracies</h3>
<p>Deepgram can sometimes identify the wrong words in the audio. When I’ve seen this happen, it’s been with words that sound the same but are spelled differently. Normally, the surrounding context would indicate what word was most likely said, and Deepgram generally handles that really well. The real issue here is that solving the problem is challenging unless there’s a preexisting canonical transcript to compare with. Thankfully, I haven’t really needed to solve this issue because of the very high accuracy of the transcriptions. In Listen Better, I’ve only ever noticed this issue twice, as in affecting 2 words, in over 28,000 transcribed words. Yeah, no urgent need to fix that! 🙂</p>
<h2 id="heading-was-the-coding-process-a-vibe">Was the coding process a vibe?</h2>
<p>Using coding agents, writing the code was as doable as any typical web-based project is these days. For the most part I accepted the UI choices of the coding agents (with some refinements, of course) because UI design is not one of my strengths. It would be accurate to say that a combination of Cursor, Codex, and Claude Code built the vast majority of the app! 😄 This doesn’t mean I handed over control completely in the way that the “vibe coding” term suggests. I’ve been a developer for some years so I find it difficult to accept autogenerated code without building up the confidence that I know what it’s doing and not doing. So even though most of the code was written by coding agents, I set the direction and performed course correction by constantly reviewing and refining the outputs throughout the process.</p>
<p>Estimating the time I spent building Listen Better is tricky. It was not a full-time commitment, and I didn’t build from a pre-defined product specification. Nevertheless, if I could squash all the time together, I would estimate that it took me a month of full-time work to go from initial exploration to the current web app that I use daily. A big chunk of that time was spent learning a few new things, most notably TTS, <a target="_blank" href="https://en.wikipedia.org/wiki/RSS">RSS</a>, agentic coding workflows, and self-hosting. I was also simultaneously making product and engineering decisions, some of which came more naturally to me than others, and most of which I didn’t bother about until after I started getting results from the TTS pipeline. Essentially, I created a working TTS pipeline then bolted a UI and RSS integration on top! 😆 Knowing what I know now, if the UI, product behavior, and TTS strategy are all specified in detail in advance, I believe a developer skilled at using coding agents can recreate the app in no more than 3 days of full-time work. That includes code review, refinement and testing, and a number of admin features not visible to regular users.</p>
<h2 id="heading-what-was-the-biggest-bottleneck-i-faced">What was the biggest bottleneck I faced?</h2>
<p>Simply put, wrangling the TTS models to get my desired output. Right from my earliest explorations of the concept for Listen Better, the first and most important signal to judge was the quality of generated audio output. My ideal TTS model would need to fluidly handle multilingual speech, support multispeaker audio generation, and provide tools for tweaking the tone and energy of the speakers according to my preferences. To properly assess how well a given model fit my objectives, I needed to spend some time prompting and prodding it until I felt confident of what its “best” output could be, even if I hadn’t yet gotten that level of output. This prompting and prodding exercise involved:</p>
<ul>
<li><p>doing a lot of good old “prompt engineering”</p>
</li>
<li><p>changing the code continuously to match model features and APIs</p>
</li>
<li><p>exploring the effects of controlling tone and emotion on the final output</p>
</li>
<li><p>trying different tactics to</p>
<ul>
<li><p>ensure words in each language were pronounced appropriately</p>
</li>
<li><p>maintain the prosody (meaning rhythm and intonation) of speakers across multiple generation attempts when hitting character limits</p>
</li>
<li><p>get the most natural-sounding dialogue, either by generating each speaker’s portion independently then stitching them together, or, if available, using the model’s built-in dialogue generation capabilities</p>
</li>
</ul>
</li>
</ul>
<p>This exploration looked a bit different for each model, and there’s no standard API or feature set for TTS generation. Overall, the process was much longer than I expected going in. It left me with the feeling that we’re still very much in the earliest innings of what can broadly be called voice AI, despite the outputs already sounding so good in some cases.</p>
<h2 id="heading-rounding-up">Rounding up</h2>
<p><a target="_blank" href="https://listenbetter.prototypes.haus/">Listen Better</a> was born out of a very personal desire to improve my listening comprehension in a foreign language. Has this project been enough to fill that need? No, though it’s certainly been helpful. My vocabulary has grown faster than before I started using it. More so, hearing French words and expressions repeatedly from different audio sources has been reinforcing them in my subconscious in a way that I would not have managed unless I moved to France! Or some other French-speaking region. For now, the face twitch I <a target="_blank" href="https://incodethismeans.com/can-ai-solve-listening-comprehension#heading-le-debut">mentioned in Part 1</a> hasn’t gone away 😄. Word by word, sentence by sentence, and in combination with other learning efforts, I’m optimistic that it will.</p>
<p>Beyond the language learning outcome, this project reacquainted me with today’s reality that long-form AI-generated content can only be created in small chunks that add up to a whole, often seconds or at best minutes at a time. In my case I had to stick to strict character limits then combine audio segments into full episodes. Input and output limits are not going away anytime soon, so learning to effectively work with them remains necessary.</p>
<p>And with that, I say a big « <strong><em>merci !</em></strong> » to you for reading this post. Until next time! 👋🏾</p>
<hr />
<p>Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Can AI Solve Listening Comprehension?]]></title><description><![CDATA[Le début
Ever since I started learning a new language, I’ve learned how much I take listening comprehension for granted. In my primary language, words float into my ears, and mere moments later, meaning has been conjured in my brain. And it’s almost ...]]></description><link>https://incodethismeans.com/can-ai-solve-listening-comprehension</link><guid isPermaLink="true">https://incodethismeans.com/can-ai-solve-listening-comprehension</guid><category><![CDATA[gemini-tts]]></category><category><![CDATA[gpt-5-mini]]></category><category><![CDATA[gemini-2-5-flash-tts]]></category><category><![CDATA[TextToSpeech]]></category><category><![CDATA[TTS technology ]]></category><category><![CDATA[tts applications]]></category><category><![CDATA[llm]]></category><category><![CDATA[geminiAPI]]></category><category><![CDATA[deepgram]]></category><category><![CDATA[openai]]></category><category><![CDATA[langchain]]></category><category><![CDATA[OpenAI API]]></category><category><![CDATA[gpt5]]></category><category><![CDATA[GPT-5]]></category><category><![CDATA[#VoiceAI]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Mon, 29 Dec 2025 07:54:57 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1766669276430/bb1960be-8dc0-4bce-8d91-a2c55bc9a752.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-le-debut">Le début</h2>
<p>Ever since I started learning a new language, I’ve learned how much I take listening comprehension for granted. In my primary language, words float into my ears, and mere moments later, meaning has been conjured in my brain. And it’s almost always subconscious. Rather remarkable, isn’t it? Transferring that skill to a new language is, shall I say, <em>quite</em> a challenge. My listening experience often goes something like this:</p>
<blockquote>
<p>I hear a sentence, now I’m trying to make sense of it. This word means that, that word means this, so the sentence probably means… oh crap, the next sentence is already tumbling out of Speaker’s mouth! Speaker is still saying more, I’m understanding even less, I’m increasingly uncomfortable, my face is twitching weirdly, dear Speaker slow down pleeeaseeee… reality sets in: I don’t understand the language, sigh. [Me]: “In English please?”</p>
</blockquote>
<p>After facing this scenario a few times, I thought it might be helpful for me to start listening to audio where each sentence is followed by a translation and an explanation of what was said. That would give my brain enough time to process what was said, provide info that supports my learning, and hopefully improve my ability to comprehend at the speed of speech.</p>
<p>The more I thought about it, the more it sounded like an interesting project to build for myself. Well, I did decide to build it, and will now bring you up to speed on how that went. <a target="_blank" href="https://www.frenchlearner.com/expressions/on-y-va/"><em>On y va !</em></a></p>
<h2 id="heading-what-to-build-exactly">What to build exactly?</h2>
<p>I wanted a pipeline that would accept an audio file containing French speech, and transform that into another audio file featuring English explanations of that French speech. This pipeline needed 4 steps:</p>
<ol>
<li><p>Transcribe the French audio input into text</p>
</li>
<li><p>Translate the French text into English</p>
</li>
<li><p>Generate the script of a dialogue that weaves the translation into an explanation</p>
</li>
<li><p>Generate audio output narrating the dialogue script</p>
</li>
</ol>
<p>For the purpose of this writeup, I’ve implemented those steps in a simple TypeScript project that can be executed from the command line. The code lives in <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo">this repo</a>, and I will reference some of the files and code snippets from there.</p>
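<p>Before diving into each step, here’s the overall shape of the 4-step pipeline as a sketch. The <code>PipelineSteps</code> type, the step names, and their signatures are hypothetical placeholders chosen for this illustration; they don’t necessarily match the functions in the repo.</p>

```typescript
// Hypothetical step signatures; each real step wraps an external service.
type PipelineSteps = {
  transcribe: (audioPath: string) => Promise<string[]>; // step 1: French sentences
  translate: (sentence: string) => Promise<string>; // step 2: English translation
  writeScript: (french: string, english: string) => Promise<string>; // step 3: dialogue script
  narrate: (script: string) => Promise<Uint8Array>; // step 4: audio bytes
};

// Runs the four steps in order and returns one audio segment per sentence.
async function runPipeline(
  audioPath: string,
  steps: PipelineSteps,
): Promise<Uint8Array[]> {
  const sentences = await steps.transcribe(audioPath);
  const segments: Uint8Array[] = [];
  for (const french of sentences) {
    const english = await steps.translate(french);
    const script = await steps.writeScript(french, english);
    segments.push(await steps.narrate(script));
  }
  return segments;
}
```

<p>Passing the steps in as functions keeps the flow easy to test with stubs, since none of the external services need to be called to check the wiring.</p>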
<h3 id="heading-transcribe-audio-input-into-text">Transcribe audio input into text</h3>
<p>This was the easiest part of this project. <a target="_blank" href="https://deepgram.com/">Deepgram</a> provides a high-quality Speech-to-Text service that supports my desired language, French. They have a very generous $200 starter credit, so frankly I didn’t even bother to look elsewhere. After installing the <a target="_blank" href="https://github.com/deepgram/deepgram-js-sdk/">JS SDK</a>, I could easily transcribe the audio file using a single invocation of the <a target="_blank" href="https://github.com/deepgram/deepgram-js-sdk/?tab=readme-ov-file#local-files"><code>transcribeFile</code></a> method. After looking at the <a target="_blank" href="https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded#response.body.results">response schema</a> of the underlying endpoint, I decided that I’d likely need the <a target="_blank" href="https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded#response.body.results.channels.alternatives.paragraphs.paragraphs.sentences">list of sentences</a> and the <a target="_blank" href="https://developers.deepgram.com/reference/speech-to-text/listen-pre-recorded#response.body.results.channels.alternatives.words">list of words</a>. The code that saves that data can be found in my <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/transcribe.ts#L39-L87"><code>transcribe</code> function</a>. To give you a sense of the output of this step, I used an audio clip that spans the first 37 seconds of a particular YouTube video… turn the volume down a bit if you want to watch it. <a target="_blank" href="https://www.youtube.com/watch?v=grnDE2e2iW4">Here you go</a> 🙂. The output of <code>transcribe</code> for that audio clip looks like this:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="21085e5108f029168f1a4f8d1ea64196f2950d0f"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/bfbc9b2d4534ccd2fb029aee2b129b99/21085e5108f029168f1a4f8d1ea64196f2950d0f" class="embed-card">https://gist.github.com/CodeWithOz/bfbc9b2d4534ccd2fb029aee2b129b99/21085e5108f029168f1a4f8d1ea64196f2950d0f</a></div><p> </p>
<p>An auspicious start! The <code>words</code> array is pretty long so it’s not shown fully in that snippet. If you’re craving some JSON to brighten up your day, view the full thing <a target="_blank" href="https://gist.github.com/CodeWithOz/074451ba8bb3a36bca9433c27ef7d00e/89a48a64560066ebdfcf4f1d91b263f1872064f3">here</a>.</p>
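<p>If you’re curious how those two lists can be pulled out of the response, here’s a minimal sketch. The interfaces are a trimmed-down approximation of the response schema linked above (only the fields used here), and <code>extractSentencesAndWords</code> is a hypothetical helper, not the <code>transcribe</code> function from the repo.</p>

```typescript
// Trimmed-down approximation of the pre-recorded response; only the fields used here.
interface Sentence {
  text: string;
  start: number;
  end: number;
}
interface Word {
  word: string;
  start: number;
  end: number;
  confidence: number;
}
interface Alternative {
  words?: Word[];
  paragraphs?: { paragraphs: { sentences: Sentence[] }[] };
}
interface TranscriptionResult {
  results: { channels: { alternatives: Alternative[] }[] };
}

// Collects the sentences and words of the first alternative on the first channel.
function extractSentencesAndWords(result: TranscriptionResult): {
  sentences: Sentence[];
  words: Word[];
} {
  const channel = result.results.channels[0];
  const alternative = channel ? channel.alternatives[0] : undefined;
  if (!alternative) return { sentences: [], words: [] };

  const sentences: Sentence[] = [];
  const paragraphs = alternative.paragraphs ? alternative.paragraphs.paragraphs : [];
  for (const paragraph of paragraphs) {
    for (const sentence of paragraph.sentences) sentences.push(sentence);
  }
  return { sentences, words: alternative.words || [] };
}
```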
<p>On that note, Step 1 is done! ✅</p>
<h3 id="heading-translate-french-text-into-english">Translate French text into English</h3>
<p>The most obvious candidate for this step is a dedicated service like Google Translate or DeepL. However, I chose to use an LLM because modern LLMs support <a target="_blank" href="https://docs.langchain.com/oss/javascript/langchain/models#structured-output">structured outputs</a>, which enable an almost unbounded range of output schemas. For this project, I wanted the following extras on top of the raw translation:</p>
<ul>
<li><p>a list of the verbs used in the sentence, with their conjugated and infinitive forms</p>
</li>
<li><p>a list of other words like nouns, adverbs, and adjectives, each with its meaning</p>
</li>
<li><p>a list of idiomatic expressions present in the sentence, each with its literal and contextual meaning</p>
</li>
</ul>
<p>LLMs can provide all of these in one query response, and these days they are pretty great at raw translation too. Below is the <a target="_blank" href="https://zod.dev/">Zod</a> schema I used to capture my desired outputs:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="b86a6ddf5558227765de7496ff1629fa3620f53a"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/28c5165862ce2df83936ffa93c2bbfce/b86a6ddf5558227765de7496ff1629fa3620f53a" class="embed-card">https://gist.github.com/CodeWithOz/28c5165862ce2df83936ffa93c2bbfce/b86a6ddf5558227765de7496ff1629fa3620f53a</a></div><p> </p>
<p>Equipped with this schema, I tasked OpenAI’s <a target="_blank" href="https://platform.openai.com/docs/models/gpt-5-mini">GPT 5 Mini</a> with handling the translations, using a <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/translate.ts#L10-L46">system prompt</a> that of course reflects the data in <code>TranslationSchema</code>. My <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/translate.ts#L48-L61"><code>translate</code></a> function feeds the sentence to the LLM and gets the explained translation in return. Using the first sentence in the output from <code>transcribe</code>, I’m happy to report that the output of <code>translate</code> was as expected 🙂:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="ffc96fa9d685f7c42dce834c36276251cccde91c"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/bfbc9b2d4534ccd2fb029aee2b129b99/ffc96fa9d685f7c42dce834c36276251cccde91c" class="embed-card">https://gist.github.com/CodeWithOz/bfbc9b2d4534ccd2fb029aee2b129b99/ffc96fa9d685f7c42dce834c36276251cccde91c</a></div><p> </p>
<p>That’s just one sentence though. To process all the sentences, a simple loop will suffice… buttt as you may know, LLMs generally respond on a time scale of seconds. Stacking tens or maybe hundreds of sentences for sequential translation is not very efficient. Ideally this step should only last as long as the longest individual translation. In other words, the name of the game here is, say it with me, <strong>concurrency</strong>! My <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/translate.ts#L85-L97"><code>translateSentences</code></a> function handles that using <code>Promise.all</code>. If you’re really keen to improve your French vocabulary 📚, go <a target="_blank" href="https://gist.github.com/CodeWithOz/474e9c58674464a626c11093ebe1a5fd">here</a> to see the full set of translations for all 4 sentences.</p>
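<p>To make the sequential-versus-concurrent difference concrete, here’s a small self-contained sketch. <code>fakeTranslate</code> is a made-up stand-in for the real LLM call (it just waits and returns a placeholder string), and the other function names are hypothetical too:</p>

```typescript
// Resolve after `ms` milliseconds.
function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical stand-in for an LLM translation call that takes `delayMs` to respond.
async function fakeTranslate(sentence: string, delayMs: number) {
  await sleep(delayMs);
  return `EN(${sentence})`;
}

// Sequential: total time is roughly the SUM of all the individual delays.
async function translateSequentially(sentences: string[], delayMs: number) {
  const results: string[] = [];
  for (const sentence of sentences) {
    results.push(await fakeTranslate(sentence, delayMs));
  }
  return results;
}

// Concurrent: every call starts immediately, so total time is roughly the
// LONGEST individual delay. Promise.all keeps results in input order.
async function translateConcurrently(sentences: string[], delayMs: number) {
  return Promise.all(sentences.map((sentence) => fakeTranslate(sentence, delayMs)));
}
```

<p>One thing to keep in mind: <code>Promise.all</code> rejects as soon as any single call fails. If partial results are acceptable, <code>Promise.allSettled</code> is the gentler alternative, and a real API may also impose rate limits that cap how many requests can be in flight at once.</p>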
<p>And with that, let’s move on to Step 3. 🏃🏾‍♂️</p>
<h3 id="heading-generate-dialogue-script-featuring-translation-and-explanation">Generate dialogue script featuring translation and explanation</h3>
<p>Look, the path to completing this step proved to be <em>muuch</em> less straightforward than I imagined, my oh my. Essentially, the content of the script is directly influenced by the features of the text-to-speech (TTS) service used to narrate the script. These features vary by service provider, so I had to explore various paths and possibilities in search of satisfactory output. I’ll briefly try to give you a sense of what I mean.</p>
<p>Emotional expressiveness emerged as an important concern because I didn’t want to get bored out of my mind while listening to the explanations. There’s a cohort of TTS services that generate a level of monotone that Lex Fridman would be proud of. Thankfully, on the other end of the spectrum, some services provide knobs to make the speech sound less robotic, but even among these services, the knobs come in different shapes and sizes! 😩</p>
<p>There’s also the small matter of choosing the voices of the speakers. Each voice has its own profile of expressiveness, speed of speech, and appropriate use-cases. Even when using the same voice, maintaining prosody (fancy word for rhythm and intonation) across multiple turns can be challenging, or worse, impossible.</p>
<p>Additionally, my specific project needed to generate both French and English snippets close together, often within the same sentence. These snippets needed to be in the same voice, with appropriate pronunciation for each language, much like you’d expect when listening to a bilingual speaker.</p>
<p>And last but not least, the old truism applies here too: it’s all about the dollar bills yo! 💸 The models that performed best for my needs were proprietary and <em>not</em> cheap 😮‍💨. Even though some of the open-source models showed promise, hosting and serving them was not as straightforward as making API calls. The cost of serving and in some cases deploying the best of the open-source models was not low enough to justify using them over the proprietary models.</p>
<p>At this point, esteemed reader, you’re well within your rights to wonder what service I finally settled upon. Without further ado, the illustrious award of 🏅<em>Featured in Uche’s Obscure Project</em>🏅 goes to none… other… than… 🥁🥁🥁</p>
<p>🏆🎊 <a target="_blank" href="https://ai.google.dev/gemini-api/docs/models#gemini-2.5-flash-tts">Gemini 2.5 Flash TTS</a>! 🎊🏆</p>
<p>It supports dialogue such that you can build <a target="_blank" href="https://notebooklm.google.com/?pli=1">NotebookLM</a>-style conversations if you really put your mind to it. That means, among other things, it’s pretty good at emotional expressiveness. In fact, of all the options I tested, it was the best at capturing emotional cues sprinkled throughout the script. It is <a target="_blank" href="https://cloud.google.com/text-to-speech/docs/gemini-tts#available_languages">very multilingual</a> and has a nice <a target="_blank" href="https://ai.google.dev/gemini-api/docs/speech-generation#voices">range of voices</a>, each of which is available in all the supported languages. It is therefore well-suited to mix French and English in the same voice, with appropriate accents and pronunciations… at least most of the time 🙈. Given that it’s an LLM, it enabled me to prompt my way to (sometimes partial) success with customized instructions. In terms of cost, this model was in the middle of the pack of the options I looked at, though it still isn’t cheap 😭. I could at least console myself knowing that it gave me the best results of all the options I considered.</p>
<p>Alrighty, we’ve gone a little while without seeing any code. I feel like a fish flapping around in open air 🎣. Time to dive back in! 👨🏾‍💻</p>
<p>I suspect that you may find the details of this part of the project a bit more interesting, so I’ll walk through my approach in more depth. <code>generateExplanationScript</code> is the function that will generate the script:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="8165a8eea9ddfa64f6e3d1eb702b861a78871b13"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/8165a8eea9ddfa64f6e3d1eb702b861a78871b13" class="embed-card">https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/8165a8eea9ddfa64f6e3d1eb702b861a78871b13</a></div><p> </p>
<p>A few things are worthy of mention. First, the function returns an array of strings, rather than a single string. That structure made it easier to generate the audio in sensible chunks, in case a sentence’s full explanation runs longer than the <a target="_blank" href="https://docs.cloud.google.com/text-to-speech/docs/gemini-tts#use-cloud-text-to-speech-api">character limits</a> of the Gemini TTS API. If you’ve ever had to chunk large strings for any kind of processing, you’ve probably had to decide how best to truncate the string while maintaining coherence within the chunks. This was my way of preempting that decision.</p>
<p>Secondly, remember earlier when I spoke about knobs for tuning emotional expressiveness? The bits of text in square brackets are examples of that. Gemini’s TTS is capable of picking up cues embedded in the script when the cues are presented as <a target="_blank" href="https://docs.cloud.google.com/text-to-speech/docs/gemini-tts#markup_tag_guide">markup tags</a>. And, for good measure, I’ll eventually explain exactly what those tags mean in the prompt I give to the LLM. For now, just note that the <code>[English explanation]</code> tag is more of an instruction to the LLM, telling it to read the subsequent text in English rather than French.</p>
<p>Following from that, the next notable thing is my heavy usage of French linking text coupled with the need to specify that some text should be read in English. The AI-generated speech needed to sound bilingual, but I noticed a quirk in all the TTS models I tried: the longer the text goes on in the same language, the stronger the speaker’s accent becomes in that language, and consequently, the worse it pronounces words in any other language. As a result, I often generated speech that featured <em>very</em> badly pronounced French words, because the speaker had progressively transitioned to a strong English accent by the time the French word came around. And no, explicitly specifying the language of the speaker via API options didn’t help much. To illustrate this phenomenon, the following script starts with 4 French words, goes on for much longer in English, then inserts 4 more French words close to the end:</p>
<blockquote>
<p>Cette phrase veut dire: "This sentence is a demonstration of accent drift, which can happen with text-to-speech models. When a sentence starts in French, transitions to English, and spends much more time in English than in French, the speaker's accent starts to sound almost completely English. It's quite a strange and fascinating phenomenon. I think it's because of the way LLMs work. They are essentially prediction machines that try to guess the most likely next word for a given piece of text. So, for something which is generating audio, the more English it sees close to the point at which it needs to generate the next sound, the more it predicts that the sound should be read with an English accent. After all, an English speaker speaking English should sound English, no? Conversely, when the French text is closeby, it still predicts that the English should be pronounced with a French accent. But when the French is far behind a lot of English text, then you hear something like 'donnez-moi le livre', which doesn't sound as French as it should sound."</p>
</blockquote>
<p>And here’s the generated audio. Listen carefully for how “<em>donnez-moi le livre</em>” is pronounced towards the end:</p>
<iframe src="https://listenbetter.prototypes.haus/samples/accent-drift" width="100%" height="200"></iframe>

<p>I know, esteemed reader, that you may not know a single word of French, let alone how French words should be pronounced. That’s totally fine! 😄 Take my word for it when I say that the speaker says “<em>donnez-moi le livre</em>” in a rather English-y way. That is exactly what I needed to prevent. My solution was to use French for all the linking text, to embed as much French as possible in-between the English translations. Very nice for me because it ended up providing even more French practice than I had planned 😏. Here’s what this strategy looked like for the verbs:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="0d0670d994bb50f83411f29355d9f07d0fd38691"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/0d0670d994bb50f83411f29355d9f07d0fd38691" class="embed-card">https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/0d0670d994bb50f83411f29355d9f07d0fd38691</a></div><p> </p>
<p>Once again, you don’t need to understand the French 🙂. You only need to understand that the words and meanings are interpolated using the general structure “<em>{French words} mean {English meaning}</em>”. And importantly, the text of each section is stitched together into one string <em>before</em> adding that final string to the <code>explanationParts</code> array, in line with the coherent chunking strategy mentioned earlier. The full definition of <code>generateExplanationScript</code> is available <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/tts.ts#L51-L130">here</a>, along with the definitions of <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/tts.ts#L135-L138"><code>replaceSlash</code></a> and <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/tts.ts#L143-L151"><code>formatList</code></a>.</p>
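<p>If the embed doesn’t load for you, here’s a stripped-down sketch of the same idea: French linking text wrapped around every English meaning, with each section stitched into a single string. The <code>VerbEntry</code> shape, the function names, and the exact French phrasing are all assumptions made for illustration; the real versions live in the repo links above.</p>

```typescript
// Hypothetical shape of one verb entry coming out of the translation step.
interface VerbEntry { conjugated: string; infinitive: string; meaning: string }

// Join items the French way: "a", "a et b", "a, b et c".
function joinWithEt(items: string[]): string {
  if (items.length === 0) return "";
  if (items.length === 1) return items[0];
  return `${items.slice(0, -1).join(", ")} et ${items[items.length - 1]}`;
}

// Build ONE string for the whole verbs section, so it can be pushed into the
// script array as a single coherent chunk. French surrounds each English meaning
// so that long uninterrupted runs of English never build up.
function buildVerbSection(verbs: VerbEntry[]): string {
  if (verbs.length === 0) return "";
  const intro = `Dans cette phrase, les verbes sont ${joinWithEt(
    verbs.map((verb) => `« ${verb.conjugated} »`)
  )}.`;
  const details = verbs.map(
    (verb) =>
      `« ${verb.conjugated} » vient du verbe « ${verb.infinitive} », ` +
      `qui veut dire : [English explanation] ${verb.meaning}.`
  );
  return [intro, ...details].join(" ");
}
```

<p>The <code>[English explanation]</code> tag is the one described earlier; everything else in the sentence template is just French glue.</p>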
<p>To wrap up this step, here’s the output of <code>generateExplanationScript</code> for the first sentence in the transcription:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="5e782a12ef80e7bcfc13a719b7a379d843548ee1"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/bfbc9b2d4534ccd2fb029aee2b129b99/5e782a12ef80e7bcfc13a719b7a379d843548ee1" class="embed-card">https://gist.github.com/CodeWithOz/bfbc9b2d4534ccd2fb029aee2b129b99/5e782a12ef80e7bcfc13a719b7a379d843548ee1</a></div><p> </p>
<p>Great!</p>
<p>Combining those strings to form a coherent script is the domain of Step 4, so let’s head there <em>tout de suite !</em> 🏃🏾‍♂️</p>
<h3 id="heading-generate-audio-output-narrating-the-dialogue-script">Generate audio output narrating the dialogue script</h3>
<p>Simply put, the goal here is to generate an audio file in which the speakers narrate the script from Step 3. This process needs to happen in 2 sub-steps:</p>
<ul>
<li><p>Send the API request that will generate the audio content</p>
</li>
<li><p>Convert the API response into the desired audio format</p>
</li>
</ul>
<h4 id="heading-send-audio-generation-api-request">Send audio generation API request</h4>
<p>Given that we’re dealing with an LLM, I’ll start with the system prompt. As mentioned earlier, my <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/tts.ts#L233-L266">TTS system prompt</a> describes the meaning of the <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/tts.ts#L22-L49">markup tags I use</a>, and explains how they should be applied to the generated speech. If you read through the full prompt, you may have noticed something curious: it repeatedly emphasizes generating audio rather than text. Why do that if it’s for a TTS model?</p>
<p>Well, this is one of the pleasures of dealing with the guardrails Google has placed on some of their models. Long story short, the Gemini API server has an internal check that assesses the prompt to determine if it’s trying to trick the model into generating something other than audio. If that check concludes that your prompt is attempting such trickery, your request gets blocked ⛔. That’s reasonable in principle, but in reality the check is a bit too enthusiastic. The only way my prompt could consistently pass was to distribute repeated affirmations of the output type throughout the prompt. Of course this behavior is not documented anywhere, so the way I discovered it was by innocently sending my API requests, hitting unpredictable and confusing errors that messed up my results, banging my head against the wall for hours, scouring through GitHub issues for answers, finally discovering the behavior for myself with trial and error, then <a target="_blank" href="https://github.com/googleapis/js-genai/issues/1058#issuecomment-3474547370">receiving confirmation</a> of my discovery a few days later. Fun stuff.</p>
<p>Now we can switch focus to the API options used for the request. My intermediary of choice, LangChain’s <a target="_blank" href="https://www.npmjs.com/package/@langchain/google-webauth">google-webauth</a> package, exposes two important API options:</p>
<ul>
<li><p><code>responseModalities</code> maps to <a target="_blank" href="https://docs.cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1beta1.types.Modality?hl=en"><code>Modality</code></a>, and is used to specify that the response should be audio (yet again, right?)</p>
</li>
<li><p><code>speechConfig</code> maps to <a target="_blank" href="https://docs.cloud.google.com/python/docs/reference/texttospeech/latest/google.cloud.texttospeech_v1.types.MultiSpeakerVoiceConfig?hl=en"><code>MultiSpeakerVoiceConfig</code></a>, and is used to define the names of the speakers and map them to specific <a target="_blank" href="https://ai.google.dev/gemini-api/docs/speech-generation#voices">voices</a>.</p>
</li>
</ul>
<p>The full code of my <code>generateDialogue</code> function lives <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/tts.ts#L156-L228">here</a>, and if you go through it you’ll notice another curiosity:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="1d831c9f3515db74f0455c3afa2d0016c1b843ec"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/1d831c9f3515db74f0455c3afa2d0016c1b843ec" class="embed-card">https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/1d831c9f3515db74f0455c3afa2d0016c1b843ec</a></div><p> </p>
<p>In other words, the system prompt is being injected into the <a target="_blank" href="https://docs.langchain.com/oss/javascript/langchain/messages#human-message"><code>HumanMessage</code></a> instead of sending it as a standalone <a target="_blank" href="https://docs.langchain.com/oss/javascript/langchain/messages#system-message"><code>SystemMessage</code></a>. Yep, that’s the result of yet another undocumented quirk of the Gemini TTS API. Like the earlier one, I discovered this one <a target="_blank" href="https://github.com/googleapis/js-genai/issues/1058#issuecomment-3463807058">the hard way too</a>. Moreover, you can see that <code>humanMessageContent</code> contains one more affirmation that the output type should be audio not text. At this point, if you’ve gotten the impression that making this part work reliably was a lot less intuitive and a lot more time-consuming than an API call should be, you’re not far off the mark! I’m not salty though, not at all… 😒</p>
<h4 id="heading-convert-api-response-to-audio">Convert API response to audio</h4>
<p>What actually is the response from the API endpoint? The “Technical specifications” section of <a target="_blank" href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash-live-api">this page</a> shows “Raw 16-bit PCM audio at 24kHz, little-endian” as the output format. The better way to be sure is actually to read the values specified in the <code>mimeType</code> property of the response, because docs pages can of course get outdated. Nevertheless, I can confirm that both are in sync at the time of writing 😀. If you’re wondering, <a target="_blank" href="https://en.wikipedia.org/wiki/Pulse-code_modulation">PCM</a> is the raw digital representation of the audio, so it’s not immediately usable by media players. A conversion step to a well-supported audio format is necessary, and for this project I picked the MP3 format.</p>
<p>The steps required here are to get a buffer representing the audio bytes, confirm the PCM format details from the MIME type, then generate the MP3 file from the PCM data. In Code This Means:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="ad3aa70d0b77dfb64b60951dc399d4b016ee09f7"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/ad3aa70d0b77dfb64b60951dc399d4b016ee09f7" class="embed-card">https://gist.github.com/CodeWithOz/d254c811d5a20c2083f593d0a856f875/ad3aa70d0b77dfb64b60951dc399d4b016ee09f7</a></div><p> </p>
<p>Note that the defaults for sample rate and bit depth come from the output format specified by the docs. <code>convertAudioContentToMp3</code> accepts options for working directory and file base name, to allow customizing the output file’s location and name. The actual conversion will be performed by <code>pcmBufferToMp3</code>. It’s a function that writes the PCM data to a file, then uses <a target="_blank" href="https://www.ffmpeg.org/">FFmpeg</a> to convert that PCM file into a separate MP3 file. If you’re interested, you can see the logic of the function <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/ffmpeg.ts#L37-L97">here</a>.</p>
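<p>Here’s a hedged sketch of the two pure pieces of that process: reading the PCM details from the MIME type, and building the FFmpeg arguments. It assumes the MIME type looks something like <code>audio/L16;codec=pcm;rate=24000</code> and that the audio is mono; both are assumptions on my part, and <code>parsePcmMimeType</code> and <code>buildFfmpegArgs</code> are hypothetical names rather than functions from the repo.</p>

```typescript
interface PcmFormat { sampleRate: number; bitDepth: number }

// Defaults follow the documented output format: 16-bit PCM at 24 kHz.
const DEFAULT_FORMAT: PcmFormat = { sampleRate: 24000, bitDepth: 16 };

// Parse a MIME type such as "audio/L16;codec=pcm;rate=24000" (assumed shape).
// Anything that can't be read from the string falls back to the defaults.
function parsePcmMimeType(mimeType: string): PcmFormat {
  const [type, ...params] = mimeType.split(";").map((part) => part.trim());
  const format = { ...DEFAULT_FORMAT };
  const depthMatch = /^audio\/L(\d+)$/i.exec(type);
  if (depthMatch) format.bitDepth = Number(depthMatch[1]);
  for (const param of params) {
    const [key, value] = param.split("=");
    if (key.toLowerCase() === "rate" && value) format.sampleRate = Number(value);
  }
  return format;
}

// Build FFmpeg arguments for converting headerless little-endian PCM to MP3.
// Raw PCM has no header, so the sample format, rate, and channel count must be
// given explicitly BEFORE the input. Mono ("-ac 1") is an assumption here, and
// the `s{bits}le` format name only makes sense for 16, 24, or 32 bits.
function buildFfmpegArgs(format: PcmFormat, pcmPath: string, mp3Path: string): string[] {
  return [
    "-y", // overwrite the output file if it already exists
    "-f", `s${format.bitDepth}le`,
    "-ar", String(format.sampleRate),
    "-ac", "1",
    "-i", pcmPath,
    mp3Path,
  ];
}
```

<p>The resulting array can be handed to something like Node’s <code>child_process.spawn("ffmpeg", args)</code>. Keeping the argument-building separate from the process call makes it easy to check the flags without actually running FFmpeg.</p>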
<h4 id="heading-chunking-it-all-together">Chunking it all together</h4>
<p>To finally generate the audio, the script chunks needed to be combined in a way that respected the <a target="_blank" href="https://docs.cloud.google.com/text-to-speech/docs/gemini-tts#use-cloud-text-to-speech-api">character limit of 8000 bytes</a>.</p>
<p>The first question was: how many characters can fit into 8000 bytes? Simply put, it depends on the character encoding used to read those bytes. For this project, I assumed that the Gemini server supports and uses <a target="_blank" href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a> because that’s the standard used across the web. UTF-8 uses between 1 and 4 bytes per character, which makes the limit potentially as high as 8000 characters or as low as 2000 characters. Accented Latin characters like those found in French (é, ç, î, etc.) typically use 2 bytes each, so I was pretty certain that the practical limit would be between 4000 and 8000 characters. In reality, there’s a rather low probability of encountering a French sentence where accented characters are even up to half of the letters in the sentence, so I felt confident that I’d be fine anywhere below 6000 characters.</p>
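<p>If you’d rather measure than estimate, the byte count of a string is easy to check directly. This is a minimal sketch assuming UTF-8 (the same assumption as above); <code>utf8ByteLength</code> and <code>fitsByteLimit</code> are names I’ve made up for illustration:</p>

```typescript
// TextEncoder always encodes to UTF-8 and is available globally in Node and browsers.
const encoder = new TextEncoder();

// Number of bytes the text occupies when encoded as UTF-8.
function utf8ByteLength(text: string): number {
  return encoder.encode(text).length;
}

// Whether the text fits within a byte budget (8000 by default, per the docs linked above).
function fitsByteLimit(text: string, limitBytes = 8000): boolean {
  return utf8ByteLength(text) <= limitBytes;
}

// Escapes are used below so the byte counts don't depend on how this file is saved.
console.log(utf8ByteLength("livre")); // 5 ASCII letters, 1 byte each
console.log(utf8ByteLength("\u00e9t\u00e9")); // "été": 2 + 1 + 2 bytes
console.log(utf8ByteLength("\ud83d\ude42")); // one emoji: 4 bytes
```

<p>Note that a JavaScript string’s <code>.length</code> counts UTF-16 code units, not bytes, so it can’t be compared against a byte limit directly.</p>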
<p>Armed with this info, the next question became: how much of that limit was already used up by the full prompt, excluding the script text? 2352 characters, to be exact. Therefore, the prompt was less than halfway there, which was great news. However, it still wasn’t fully clear how much dialogue could fit within the remaining space. So I asked myself a different question: how long can I listen to French speech before I can no longer follow what’s being said? In other words, going back to the description in the intro, what’s the point at which my face starts twitching weirdly? 😁 After some trial and error, I found a sweet spot at 200 characters of transcribed French text. As it turns out, the generated script featuring all the explanations uses between 1500 and 2500 characters when starting from around 200 characters. I rounded that up to 3000 characters to cover unexpectedly long explanations. That would mean the total input text could contain as many as ~5350 characters, which was still well within my projected safe limit of 6000 characters. Excellent!</p>
<p>With that established, I added some code to:</p>
<ul>
<li><p>take the script chunks from <code>generateExplanationScript</code></p>
</li>
<li><p>create new combined chunks that each do not exceed 3000 characters, and</p>
</li>
<li><p>transform those combined chunks into MP3 files using <code>generateDialogue</code> followed by <code>convertAudioContentToMp3</code></p>
</li>
</ul>
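<p>The combining step in the middle of that list can be sketched as a simple greedy packer. This is an illustrative version under my own assumptions (a newline separator, and oversized chunks passed through untouched rather than split); <code>combineChunks</code> is a hypothetical name and the real logic is in the repo link below.</p>

```typescript
// Greedily pack script chunks into combined chunks of at most `maxChars` characters.
// A chunk is never split: if a single chunk is longer than `maxChars`, it is emitted
// on its own (an assumption of this sketch, not necessarily the repo's behaviour).
// Note: `.length` counts UTF-16 code units, which is what "characters" means here.
function combineChunks(parts: string[], maxChars = 3000, separator = "\n"): string[] {
  const combined: string[] = [];
  let current = "";
  for (const part of parts) {
    const candidate = current === "" ? part : current + separator + part;
    if (current === "" || candidate.length <= maxChars) {
      // Either this starts a fresh chunk, or the part still fits in the current one.
      current = candidate;
    } else {
      // The part would overflow the current chunk: close it and start a new one.
      combined.push(current);
      current = part;
    }
  }
  if (current !== "") combined.push(current);
  return combined;
}
```

<p>Because the pieces coming out of <code>generateExplanationScript</code> are already coherent units, packing whole pieces like this avoids ever cutting an explanation in half.</p>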
<p>You can find that logic <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/blob/main/index.ts#L24-L67">here</a>, and enjoy the final MP3 files for each of the 4 sentences <a target="_blank" href="https://github.com/CodeWithOz/french-listening-demo/tree/main/samples">here</a>. Want a preview already? The MP3 of the third sentence sounds like this:</p>
<iframe src="https://listenbetter.prototypes.haus/samples/full-narration" width="100%" height="200"></iframe>

<p>That’s definitely a success in my book! 🙌🏾</p>
<h3 id="heading-the-real-thing">The Real Thing</h3>
<p>So, everything I’ve described so far has been to explain and showcase the principle of what I wanted. How about what I actually built and use daily? Well I’m glad you asked 😀 I present to you… <a target="_blank" href="https://listenbetter.prototypes.haus">Listen Better</a>! Creative name, right?</p>
<p>At its core, Listen Better is a podcast-style audio feed featuring translations and explanations of different pieces of French audio. I source the input audio files from various corners of the interwebs. Each “episode” features 2 bilingual AI-generated hosts who dissect each sentence in the original audio, just as described in this writeup. The feed contains the full dialogue script for each episode so I can see and read what was discussed. I also created a dedicated <a target="_blank" href="https://listenbetter.prototypes.haus/api/feed/rss">RSS feed</a> so I can listen in my podcast app. Additionally, I added a Telegram integration to send myself the full translated vocabulary of each sentence in the original audio file, as generated by Step 2. If any of this sounds interesting to you, the web app is literally <a target="_blank" href="https://listenbetter.prototypes.haus">at your fingertips</a> 😀. If you’d like to do a few things differently, such as provide your own French audio, customize the output, or change the languages, feel free to reach out to me so we can talk about it.</p>
<p>If you listen to any of the episodes, you’ll notice a few extras not captured in this writeup:</p>
<ul>
<li><p>Each piece of generated audio is stitched together sequentially to form a continuous audio file that constitutes the episode.</p>
</li>
<li><p>Each episode contains an intro and an outro, because I came to appreciate the value of at least half-decent transitions when listening to something informative.</p>
</li>
<li><p>The original audio clips of the translated sentences are played before and after the explanations, to allow me to hear the audio at regular speed and associate that with the explanation.</p>
</li>
</ul>
<p>These extras are handled by some additional FFmpeg manipulation and a few extra API calls, but nothing fundamentally different from what I described above.</p>
<h2 id="heading-la-fin">La Fin</h2>
<p>So, coming back to the title of this post, can AI “solve” listening comprehension? I’m sure you already know the answer 😄 it certainly can’t right now! Who knows what the future holds though… but in the meantime, Listen Better demonstrates what’s possible today, which is not bad at all, if I do say so myself.</p>
<h3 id="heading-is-that-all">Is That All?</h3>
<p>For how I built the project, yes. For all my thoughts on the project, no 😀. This writeup is only the first member of a <a target="_blank" href="https://incodethismeans.com/series/building-listen-better">2-part series</a> focused on this project. Would you like to know the cost of an episode, or how well this whole pipeline performs, or whether vibe coding can build you an equivalent in next to no time? I’ll share my thoughts on those topics and more when I publish Part 2 of this series. Watch this space! 👀</p>
<p>Until then, thank you for taking time out of your day to read this 😊. <em>À la prochaine !</em></p>
<hr />
<p>Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Async vs Non-Blocking Operations for Responsive FastAPI Endpoints]]></title><description><![CDATA[Introduction
You’ve probably heard this principle before: “don’t put blocking operations on the main thread”.
Until recently, I’d been working only with synchronous python for the best part of 3 years. Using an outdated version of Django will do that...]]></description><link>https://incodethismeans.com/async-vs-non-blocking-operations-for-responsive-fastapi-endpoints</link><guid isPermaLink="true">https://incodethismeans.com/async-vs-non-blocking-operations-for-responsive-fastapi-endpoints</guid><category><![CDATA[Python]]></category><category><![CDATA[FastAPI]]></category><category><![CDATA[asyncio]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Thu, 09 Oct 2025 09:09:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/ZM32iaMO2XM/upload/bf2a0bb3214b062d8b5ee1a061425cd4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>You’ve probably heard this principle before: “don’t put blocking operations on the main thread”.</p>
<p>Until recently, I’d been working only with synchronous Python for the best part of 3 years. Using an outdated version of Django will do that to you… So when I finally started working with asynchronous Python, I failed, predictably, to apply that principle in my code.</p>
<p>Thankfully I’ve seen the error of my ways. This post is an attempt to pass along (some of) my understanding. On we go! 🤓</p>
<h2 id="heading-in-search-of-responsive-endpoints">In search of responsive endpoints</h2>
<p><a target="_blank" href="https://fastapi.tiangolo.com/">FastAPI</a> encourages the use of async endpoints. A simple FastAPI server with a basic endpoint that processes files may look like this:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="fcd57014a8b1cb8c3b17426464162e967849e767"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/fcd57014a8b1cb8c3b17426464162e967849e767" class="embed-card">https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/fcd57014a8b1cb8c3b17426464162e967849e767</a></div><p> </p>
<p>I made the mistake of thinking that this server would be able to handle a new request while the <code>process_files</code> endpoint function is paused on the <code>await upload_to_s3()</code> step. The endpoint function is async after all, and <code>upload_to_s3</code> is also an async function, even though the <code>sleep</code> operation inside it is not. Subconsciously, I expected Python to just handle whatever was necessary to run the sync operation without blocking the main thread.</p>
<p>To my surprise, the server would block and not process any new requests sent within each 5s span. The logs looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758044180923/22303008-45ff-4b35-9877-fb39eb07a821.png" alt class="image--center mx-auto" /></p>
<p>The server waited for 5 seconds while sleeping and didn’t start the new request until the first one was completed, and similarly remained blocked during the second request. Why did this happen?</p>
<p>It’s because Python’s async-await implementation uses an event loop that processes only one task at a time. So when a blocking operation is on that event loop, nothing else will run on the loop until that operation is complete. Wrapping the operation in an async function won’t change that. <strong>An async function is not necessarily a non-blocking function</strong>.</p>
<h3 id="heading-unblocking-the-server-the-wrong-way">Unblocking the server… the wrong way ❌</h3>
<p>Before I fully understood what I wrote in the paragraph above, my first thought was to simply defer the upload operation until after the client receives the server response. FastAPI’s <a target="_blank" href="https://fastapi.tiangolo.com/tutorial/background-tasks/">background tasks</a> feature is a good way to try to achieve that. A background task can schedule <code>upload_to_s3</code> to be executed after the server responds to the client, as follows:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="c29baaee7736952730fe84afbcbdb427fcaf3380"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/c29baaee7736952730fe84afbcbdb427fcaf3380" class="embed-card">https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/c29baaee7736952730fe84afbcbdb427fcaf3380</a></div><p> </p>
<p><code>background_tasks.add_task</code> accepts the function to be executed as the first argument, followed by the args and/or kwargs that the function should be invoked with. As for how to pass <code>background_tasks</code> into the endpoint function, simply defining a parameter with the <code>BackgroundTasks</code> type is enough. FastAPI does the work of invoking the function with the appropriate argument.</p>
<p>Repeating the test with that code shows that the blocking problem didn’t actually go away 😭.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758044578796/a227e595-7b85-4d30-9871-ac36dcf82ce5.png" alt class="image--center mx-auto" /></p>
<p>Even though the request finished processing before the upload started, the server still remained blocked on the upload and didn’t pick up the new request until afterwards. Whuttt?</p>
<p>The blocking task actually prevented the event loop from picking up the next scheduled task (i.e. the new request). What’s more, the client that made the initial request was actually kept waiting for the server response while the server was occupied with the blocking background task. So, for all intents and purposes, nothing changed 😞.</p>
<p>Okay then, what’s the correct way to go about this?</p>
<h3 id="heading-unblocking-the-server-the-right-way">Unblocking the server… the right way ✅</h3>
<p>Either make everything async, or push the sync stuff into a separate thread.</p>
<h4 id="heading-double-down-on-async">Double down on async 😎</h4>
<p>This route requires finding or creating a version of the blocking operation that <em>properly</em> frees up the event loop when the operation is not executing Python code. “Properly” means <code>yield</code>ing control back to the event loop appropriately. That can be tricky to get right, so I generally prefer to search for a widely used async implementation of an operation rather than rolling my own. A lot of libraries and packages expose both async and sync versions of their available methods, so I’ve mostly not needed to look very far. For the <code>sleep</code> function used in this writeup, asyncio has a <a target="_blank" href="https://docs.python.org/3/library/asyncio-task.html#sleeping">drop-in replacement</a> for that:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="56ede7b11664e78c187e709e4e6ece8f3006d43b"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/56ede7b11664e78c187e709e4e6ece8f3006d43b" class="embed-card">https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/56ede7b11664e78c187e709e4e6ece8f3006d43b</a></div><p> </p>
<p>After testing again with a few simultaneous requests, they all got handled without the server blocking.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758052784353/152e5afb-57db-46b3-9659-ecb6621be790.png" alt class="image--center mx-auto" /></p>
<p>Success! 🎉</p>
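<p>The contrast between the two versions can also be reproduced without FastAPI. Here’s a minimal sketch using only the standard library, assuming the blocking call is <code>time.sleep</code>; the function names and the 0.2s delay are made up for illustration:</p>
<pre><code class="lang-python">import asyncio
import time


async def blocking_upload(delay: float):
    # Declared async, but time.sleep never yields to the event loop.
    time.sleep(delay)


async def cooperative_upload(delay: float):
    # asyncio.sleep suspends this coroutine so other tasks can run.
    await asyncio.sleep(delay)


async def time_two_requests(upload, delay: float = 0.2):
    # Simulate two requests arriving at the same time.
    start = time.perf_counter()
    await asyncio.gather(upload(delay), upload(delay))
    return time.perf_counter() - start


blocking_elapsed = asyncio.run(time_two_requests(blocking_upload))
cooperative_elapsed = asyncio.run(time_two_requests(cooperative_upload))
print(f"blocking: {blocking_elapsed:.2f}s, cooperative: {cooperative_elapsed:.2f}s")
</code></pre>
<p>The blocking pair should take roughly twice as long as the cooperative pair, because the second <code>time.sleep</code> can’t start until the first one returns.</p>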
<p>If you can’t find an async alternative to your blocking operation, and you’re a cool kid feeling up to the task, you can always build your own async implementation of that operation. As a token of appreciation for reading this far, I recommend <a target="_blank" href="https://docs.python.org/3/howto/a-conceptual-overview-of-asyncio.html#a-conceptual-overview-of-asyncio">this guide</a> as a place to start 🙂.</p>
<p>For any number of valid reasons, the async path may not be the right one to follow in a particular context. Not a problem though!</p>
<h4 id="heading-give-up-and-stay-sync">Give up and stay sync 🤷🏾‍♂️</h4>
<p>Python’s concurrency implementation allows the execution of synchronous functions in a <a target="_blank" href="https://docs.python.org/3/library/asyncio-task.html#running-in-threads">separate thread</a>. This frees up the main thread to keep the event loop running unblocked. FastAPI’s <a target="_blank" href="https://github.com/fastapi/fastapi/blob/a372edf7e8825068a780df643700e6cce7e035c5/fastapi/concurrency.py">concurrency module</a> exposes a <code>run_in_threadpool</code> helper for achieving this. Just like <code>background_tasks.add_task</code>, <code>run_in_threadpool</code> accepts the function to be executed as the first argument, followed by the args and/or kwargs that the function should be invoked with. In Code This Means:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="af1adb152924e4cccd87b2a176b85ca1f7b2deaf"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/af1adb152924e4cccd87b2a176b85ca1f7b2deaf" class="embed-card">https://gist.github.com/CodeWithOz/d31e788d4bec21eec47a3247e593163e/af1adb152924e4cccd87b2a176b85ca1f7b2deaf</a></div><p> </p>
<p>Notice that <code>upload_to_s3</code> is no longer an async function, and it’s executed in a separate thread using <code>run_in_threadpool</code>. The pre-upload operations are also now performed synchronously in <code>prepare_for_upload_synchronously</code>. Lastly, <code>background_tasks.add_task</code> is still used to ensure that the execution happens after the endpoint returns a response.</p>
<p>Repeating the test of quick-fire requests showed similar behavior to the fully async technique:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758055293414/e5ffdd59-c38c-4801-ac03-6d71d6ca5637.png" alt class="image--center mx-auto" /></p>
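<p>The standard library has its own helper for this, <code>asyncio.to_thread</code> (Python 3.9+), which is what the “separate thread” docs linked above describe. The sketch below is not FastAPI’s <code>run_in_threadpool</code>, just a stdlib illustration of the same idea; <code>upload_to_s3_sync</code> and <code>heartbeat</code> are hypothetical names:</p>
<pre><code class="lang-python">import asyncio
import threading
import time


def upload_to_s3_sync(delay: float):
    # Hypothetical stand-in for a blocking upload; returns the worker thread's name.
    time.sleep(delay)
    return threading.current_thread().name


async def heartbeat(ticks: list, interval: float, count: int):
    # Records a tick per iteration; it can only run while the event loop is free.
    for _ in range(count):
        await asyncio.sleep(interval)
        ticks.append(time.perf_counter())


async def main():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks, 0.05, 3))
    # to_thread runs the sync function in a worker thread and awaits its result.
    worker_name = await asyncio.to_thread(upload_to_s3_sync, 0.3)
    ticks_during_upload = len(ticks)
    await beat
    return worker_name, ticks_during_upload


worker_name, ticks_during_upload = asyncio.run(main())
print(worker_name, ticks_during_upload)
</code></pre>
<p>The heartbeat keeps ticking while the upload sleeps, which is the sign that the event loop stayed unblocked.</p>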
<p>The goodies 🛍️ don’t stop there though. There’s an even simpler way to get this same behavior that doesn’t involve calling <code>run_in_threadpool</code>: just pass the regular non-async function to <code>background_tasks.add_task</code>. <a target="_blank" href="https://github.com/Kludex/starlette/blob/61a0ba5101e726a9a2cf150303ede0650f6ca69c/starlette/background.py#L25-L29">Internally</a> it figures out if the function is async, and runs it in a thread pool if it isn’t. Convenient! 🙌🏾</p>
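<p>Here’s a rough sketch of that kind of dispatch, written with the standard library rather than copied from Starlette. <code>run_task</code> is a hypothetical name, and <code>asyncio.to_thread</code> stands in for the thread pool helper:</p>
<pre><code class="lang-python">import asyncio
import inspect
import threading


async def run_task(func, *args, **kwargs):
    # Coroutine functions are awaited on the event loop;
    # plain functions are handed to a worker thread instead.
    if inspect.iscoroutinefunction(func):
        return await func(*args, **kwargs)
    return await asyncio.to_thread(func, *args, **kwargs)


async def async_job():
    return threading.current_thread().name


def sync_job():
    return threading.current_thread().name


async def main():
    loop_thread = threading.current_thread().name
    async_thread = await run_task(async_job)
    sync_thread = await run_task(sync_job)
    return loop_thread, async_thread, sync_thread


loop_thread, async_thread, sync_thread = asyncio.run(main())
print(loop_thread, async_thread, sync_thread)
</code></pre>
<p>The async job reports the event loop’s thread, while the sync job reports a different worker thread.</p>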
<h5 id="heading-a-note-on-thread-safety"><strong>A note on thread safety</strong></h5>
<p>Executing logic in parallel using multiple threads introduces the possibility that more than one thread will access and try to modify the same resources at almost the same time, which can create <a target="_blank" href="https://en.wikipedia.org/wiki/Thread_safety">some serious problems</a>. Personally, I favor using multiple threads in scenarios where the thread operations are sufficiently isolated for my taste.</p>
<p>For example, I recently used the <code>run_in_threadpool</code> method for performing file conversion without saving the converted bytes to the filesystem. They were uploaded straight to s3 instead. The function that did this upload had no other side-effects, and the filenames in s3 were freshly generated UUIDs. Ergo, very small chance of a name collision leading to a mishap. Make a similar assessment for your use case and take any necessary precautions. Tread carefully when working with multiple threads!</p>
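<p>As a small illustration of what “sufficiently isolated” can look like, here’s a sketch under some loud assumptions: <code>fake_bucket</code> is an in-memory dictionary standing in for s3, <code>convert_and_upload</code> is a hypothetical function, and the “conversion” is just a placeholder:</p>
<pre><code class="lang-python">import threading
import uuid
from concurrent.futures import ThreadPoolExecutor

fake_bucket = {}  # hypothetical in-memory stand-in for an s3 bucket
bucket_lock = threading.Lock()


def convert_and_upload(payload: bytes):
    # Each call works on its own input and writes to a key nobody else uses.
    converted = payload.upper()  # placeholder for the real file conversion
    key = f"{uuid.uuid4()}.bin"
    # The lock guards the one resource the threads do share.
    with bucket_lock:
        fake_bucket[key] = converted
    return key


with ThreadPoolExecutor(max_workers=8) as pool:
    keys = list(pool.map(convert_and_upload, [b"audio"] * 50))

print(len(keys), len(fake_bucket))
</code></pre>
<p>Every thread ends up with its own key, so no upload overwrites another. The only shared object is the dictionary, and access to it goes through the lock.</p>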
<h2 id="heading-closing-thoughts">Closing thoughts</h2>
<p>A quick recap of the essentials I learned:</p>
<ul>
<li><p>defining a function with <code>async</code> does not make its operations non-blocking; you must make sure the operations are actually non-blocking</p>
</li>
<li><p>when possible, avoid mixing blocking and non-blocking operations inside the same function; use threads for synchronous blocking work if necessary</p>
</li>
<li><p>when using FastAPI, background tasks are a good way to defer non-essential chunks of work until after an endpoint has responded to the client</p>
</li>
</ul>
<p>Lastly, it’s important to note that this writeup focuses on scenarios in which the background tasks are small enough to be safely executed within the same server as the FastAPI process, yet long enough to create a bad experience for the client hitting the endpoint. This writeup also contains an implicit assumption that custom sequencing of background tasks is not a requirement. Some or all of these may not be true for your use case. For larger workloads or workloads that require customized or consistent scheduling of tasks (e.g. FIFO task queues), tools like <a target="_blank" href="https://docs.celeryq.dev/en/stable/">Celery</a> and <a target="_blank" href="https://python-rq.org/">RQ</a> are more appropriate.</p>
<p>Thanks for your time!</p>
<hr />
<p>Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Sanitizing YouTube Transcripts the Trendy Way: with AI Agents!]]></title><description><![CDATA[Introduction 👋🏾
Do you ever worry that the machines are close to taking over from us humans? Look no further than YouTube’s machine-generated closed captions for peace of mind.

That screenshot 👆🏾 is from a video uploaded in 2011. A decade and a ...]]></description><link>https://incodethismeans.com/sanitizing-youtube-transcripts-the-trendy-way-with-ai-agents</link><guid isPermaLink="true">https://incodethismeans.com/sanitizing-youtube-transcripts-the-trendy-way-with-ai-agents</guid><category><![CDATA[langchain]]></category><category><![CDATA[langgraph]]></category><category><![CDATA[tavily]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[agentic workflow]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Tue, 02 Sep 2025 15:32:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/iuuJC_pjLU0/upload/056cff24514222a61a5e2cc5998c1ab9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction 👋🏾</h2>
<p>Do you ever worry that the machines are close to taking over from us humans? Look no further than YouTube’s machine-generated closed captions for peace of mind.</p>
<p><a target="_blank" href="https://www.youtube.com/watch?v=23H8IdaS3tk"><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755537118825/6016b82c-a201-4c5c-9edb-7d3376e56555.png" alt class="image--center mx-auto" /></a></p>
<p>That screenshot 👆🏾 is from a <a target="_blank" href="https://www.youtube.com/watch?v=23H8IdaS3tk">video</a> uploaded in 2011. A decade and a half later, I still found it necessary to build a tool that makes sense of the closed captions in some videos I watch. Luckily for us, YouTube’s parent company <a target="_blank" href="https://en.wikipedia.org/wiki/Attention_Is_All_You_Need">brought the gift</a> of LLMs to the world. Let’s go through my LLM-infused attempt at improving those gnarly transcriptions.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">I’ll use “closed captions” and “transcripts” interchangeably to mean the text of a video’s audio track, but those 2 terms describe different though related things.</div>
</div>

<h2 id="heading-the-ai-stack">The AI Stack ⚙️</h2>
<h3 id="heading-langgraph">LangGraph</h3>
<p><a target="_blank" href="https://www.langchain.com/langgraph">LangGraph</a> is a framework for building the latest and greatest agentic workflows, making it a natural choice here of course. At a basic level, LangGraph represents an AI agent as a <a target="_blank" href="https://langchain-ai.github.io/langgraph/concepts/low_level/#graphs">graph network</a> where <a target="_blank" href="https://langchain-ai.github.io/langgraph/concepts/low_level/#nodes">nodes</a> perform actions and <a target="_blank" href="https://langchain-ai.github.io/langgraph/concepts/low_level/#edges">edges</a> (the connections between nodes) dictate which nodes get executed in what order. A simple example is an agent with one node and two edges that connect the single node to the start and end of the graph, as seen below:</p>
<p><a target="_blank" href="https://langchain-ai.github.io/langgraph/tutorials/get-started/1-build-basic-chatbot/#7-visualize-the-graph-optional"><img src="https://langchain-ai.github.io/langgraph/tutorials/get-started/basic-chatbot.png" alt="basic chatbot diagram" class="image--center mx-auto" /></a></p>
<p>The <a target="_blank" href="https://langchain-ai.github.io/langgraph/tutorials/get-started/1-build-basic-chatbot/#3-add-a-node"><code>chatbot</code></a> node can be any function, whether or not it uses an LLM within. So it’s actually perfectly fine to build a LangGraph workflow that doesn’t use LLMs directly. For instance, a dev team may happily use LangGraph to orchestrate parts of their application knowing they’ll eventually spice things up with AI ✨.</p>
<p><a target="_blank" href="https://langchain-ai.github.io/langgraph/concepts/low_level/#state">State</a> is another important concept to keep in mind. The state object is exactly what it sounds like: a place to preserve any data that’s relevant to the behavior of the agent. What’s the quote again… “with great freedom to save anything to state comes great responsibility to manage state updates wisely” 🤔 💭… or something like that anyways. LangGraph uses the <a target="_blank" href="https://langchain-ai.github.io/langgraph/concepts/low_level/#reducers">reducer pattern</a> for state management, so nodes can independently update specific parts of agent state without touching other parts. If you are familiar with state management in popular frontend frameworks, you probably already understand how this system works.</p>
<h3 id="heading-langchain">LangChain</h3>
<p>LangGraph’s elder sibling. <a target="_blank" href="https://www.langchain.com/">LangChain</a> comes packed with lots of utilities and prebuilt functions that implement common patterns and best practices for building with LLMs. OpenAI is the model provider of choice for this writeup so I’ll be using LangChain’s <a target="_blank" href="https://python.langchain.com/docs/integrations/llms/openai/">OpenAI</a> companion package, though <a target="_blank" href="https://python.langchain.com/docs/integrations/llms/anthropic/">Anthropic</a>, <a target="_blank" href="https://python.langchain.com/docs/integrations/llms/google_ai/">Google</a> and <a target="_blank" href="https://python.langchain.com/docs/integrations/llms/">many others</a> are perfectly capable alternatives too.</p>
<h3 id="heading-tavily">Tavily</h3>
<p>The hero message on their <a target="_blank" href="https://www.tavily.com/">landing page</a> sums it up neatly: “Connect Your Agent to the Web”. Tavily provides a highly configurable suite of tools for getting structured results from the web. As we will see shortly, web search is a critical part of this agent’s internal operations.</p>
<h2 id="heading-the-agent">The Agent 🤖</h2>
<p>To keep things manageable, the objective of this agent is to correct misspellings of real-world entities that show up in the transcripts. That’s because the names of such entities get butchered a lot in those transcripts, though other sentence elements are often captured accurately enough now. Restoring the honor of the butchered entities is a worthy task for this agent. This focus on names means that this agent will not try to correct “<em>imitated Jamaican vacation dot</em>” to “<em>Hey man. How’s your Jamaican vacation going?</em>” as seen in the earlier screenshot. Rather, the agent should detect that “Jamaican” refers to a real-world entity and focus only on correcting it if it’s misspelled.</p>
<p>Broadly speaking, the agent works in 3 steps:</p>
<ol>
<li><p>Extract named entities</p>
</li>
<li><p>Find canonical references for the named entities using web search</p>
</li>
<li><p>Replace incorrect names with canonical names where appropriate</p>
</li>
</ol>
<p>To explore each step further, let’s get a basic outline of the agent:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="c4bca7456334e081624b9e9796b57774b8e1ff23"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/c4bca7456334e081624b9e9796b57774b8e1ff23" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/c4bca7456334e081624b9e9796b57774b8e1ff23</a></div><p> </p>
<p>Nothing fancy at this stage, just placeholders for the agent and its state object.</p>
<h3 id="heading-step-1-extract-named-entities">Step 1: Extract named entities</h3>
<p>LLMs are a great fit for this step. They have gotten pretty darn good at extracting data from raw text. Plus, they’re also very good at formatting their output according to defined structures. The output of this step needs to go into a web search query, and an ideal query will contain both the named entity and some additional context to make the web search results a lil’ bit sharper. The agent must therefore extract named entities and relevant additional context from the transcript.</p>
<p>To kick things off, the first node in the agent will need the transcript text and an LLM to extract the relevant data. In Code This Means (1) the <code>AgentState</code> needs a property for the transcript text and (2) the node function needs an LLM to perform the extraction. Updating the agent’s state schema is easy enough:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="9c516ab667ed577844b9fec1078cbfa79686e75b"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/9c516ab667ed577844b9fec1078cbfa79686e75b" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/9c516ab667ed577844b9fec1078cbfa79686e75b</a></div><p> </p>
<p>And then comes the node function:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="2a8c9556e8386d5809e50d0753e4d55c3215fe2d"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/2a8c9556e8386d5809e50d0753e4d55c3215fe2d" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/2a8c9556e8386d5809e50d0753e4d55c3215fe2d</a></div><p> </p>
<p>The <code>extractor_node</code> function performs the extraction by passing the <code>transcript_text</code> piece of state into the <code>extractor_llm</code>. Now the node must actually go into the agent’s graph:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="692946ece2322d4dfe984cbdcc94b4874695bfe0"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/692946ece2322d4dfe984cbdcc94b4874695bfe0" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/692946ece2322d4dfe984cbdcc94b4874695bfe0</a></div><p> </p>
<p>The <code>extractor</code> is the entry point of the graph, though for now it’s also the exit point because it has no other friends in the graph 🥺.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755713005373/1a9e51a1-8812-45a9-a831-5e26e8efff50.png" alt class="image--center mx-auto" /></p>
<p>Looking back at the system prompt for the LLM, there’s no guarantee of the structure and format of the LLM’s response. That needs to change:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="a61ca1197219f2938a6d5cc644266ff7c9ad7694"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/a61ca1197219f2938a6d5cc644266ff7c9ad7694" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/a61ca1197219f2938a6d5cc644266ff7c9ad7694</a></div><p> </p>
<p>The prompt now stipulates that the LLM must provide the name and context for each extracted entity in JSON format. Still, even if the LLM gets the format right, it’ll return <em>text</em> that contains JSON. To get the desired equivalent Python objects, LangChain’s <a target="_blank" href="https://python.langchain.com/docs/concepts/structured_outputs/">with_structured_output()</a> method will try to convert the LLM’s text response into a defined Pydantic model.</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="efbd41c310371f72030814a35d8fdb5ec373a89d"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/efbd41c310371f72030814a35d8fdb5ec373a89d" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/efbd41c310371f72030814a35d8fdb5ec373a89d</a></div><p> </p>
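<p>To make the text-to-object step concrete, here’s a stdlib-only sketch of the conversion. It uses a dataclass as a stand-in for the Pydantic model, and the field names, the <code>entities</code> wrapper key and the sample response are all assumptions for illustration:</p>
<pre><code class="lang-python">import json
from dataclasses import dataclass


@dataclass
class NamedEntity:
    name: str     # entity name as it appears in the transcript
    context: str  # short hint that sharpens the later web search


def parse_entities(llm_text: str):
    # Turn the LLM's JSON text into typed objects; a missing key raises KeyError.
    payload = json.loads(llm_text)
    return [
        NamedEntity(name=item["name"], context=item["context"])
        for item in payload["entities"]
    ]


# Made-up response text, shaped the way the prompt asks for.
sample_response = '{"entities": [{"name": "Kristoffer Nolen", "context": "Hollywood director"}]}'
entities = parse_entities(sample_response)
print(entities)
</code></pre>
<p>The real method does more than this (it also tells the model what schema to follow), but the end result is similar: objects with attributes instead of a string that happens to contain JSON.</p>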
<p>Before going further with the agent logic, it’s a good idea to test the existing code. To do that, the graph needs to be <a target="_blank" href="https://langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.state.StateGraph.compile">compiled</a> into an object that can be invoked:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="7fe0c54b1f2c09e77c3baf31ab29787cde322a28"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/7fe0c54b1f2c09e77c3baf31ab29787cde322a28" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/7fe0c54b1f2c09e77c3baf31ab29787cde322a28</a></div><p> </p>
<p>The agent will be triggered by invoking the <code>graph</code> property. LangGraph provides 2 methods for this: <a target="_blank" href="https://langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.state.CompiledStateGraph.invoke"><code>.invoke()</code></a> and <a target="_blank" href="https://langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.state.CompiledStateGraph.stream"><code>.stream()</code></a>. Both of them run the agent end-to-end with the provided input. However, <code>.invoke()</code> runs without interruption and returns the final complete state object when the run is complete, whereas <code>.stream()</code> returns an iterator that yields each update to the agent’s state while the agent is running. Put differently, <code>.stream()</code> is more appropriate when the developer wants finer-grained control over each step of the agent’s logical flow. I prefer using the asynchronous versions <a target="_blank" href="https://langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.state.CompiledStateGraph.ainvoke"><code>.ainvoke()</code></a> and <a target="_blank" href="https://langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.state.CompiledStateGraph.astream"><code>.astream()</code></a>.</p>
<p>So then, what input do these methods accept? The initial state of the agent! Suppose the video transcript is the following:</p>
<blockquote>
<p>"Hollywood director Kristoffer Nolen has announced his next big project, rumored to be another complex sci-fi thriller. Fans of Nolen’s earlier work are eager to see if this film will rival the success of ‘Inceptshun’ or ‘Dunkrik.’"</p>
</blockquote>
<p>I’ll pass it into the agent like this:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="8cbb50a3e091f62c9d341cf7bbba641d4223ec34"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/8cbb50a3e091f62c9d341cf7bbba641d4223ec34" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/8cbb50a3e091f62c9d341cf7bbba641d4223ec34</a></div><p> </p>
<p>Executing the file prints the following output:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="1b016e10e519bd24205618a9c8dfe2abedf5d757"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0f455f11b593b2db32d575f6127635bc/1b016e10e519bd24205618a9c8dfe2abedf5d757" class="embed-card">https://gist.github.com/CodeWithOz/0f455f11b593b2db32d575f6127635bc/1b016e10e519bd24205618a9c8dfe2abedf5d757</a></div><p> </p>
<p>Success! The LLM extracted 3 named entities (1 director’s name and 2 movie names) from the transcript, and provided some context around them too. The “<em>Agent response: …</em>” line shows the final state of the agent (from using <code>.ainvoke()</code>), which is the same as the initial state because the extractor node didn’t update the agent’s state.</p>
<p>Step 2 of the agent’s workflow needs the extracted data, so the extractor node needs to save that information to the agent’s state. In LangGraph, a node updates a piece of state by returning a dictionary containing a key for that piece of state. Saving the extracted data to state is therefore as easy as returning the extracted data in a dictionary with the appropriate key.</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="4627e3501ae8955e0b46fbac59df401894268ec9"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/4627e3501ae8955e0b46fbac59df401894268ec9" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/4627e3501ae8955e0b46fbac59df401894268ec9</a></div><p> </p>
<p>Now invoking the agent on that same transcript yields:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="055fbf0ada23b2e7e337592083cdfc946e0935c2"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0f455f11b593b2db32d575f6127635bc/055fbf0ada23b2e7e337592083cdfc946e0935c2" class="embed-card">https://gist.github.com/CodeWithOz/0f455f11b593b2db32d575f6127635bc/055fbf0ada23b2e7e337592083cdfc946e0935c2</a></div><p> </p>
<p><code>extracted_entities</code> is now present, which means Step 1 is complete! ✅</p>
<h3 id="heading-step-2-find-canonical-references-for-the-named-entities-using-web-search">Step 2: Find canonical references for the named entities using web search</h3>
<p>Here’s the basic idea: search the web using outputs of the extractor, then verify the canonical names from the search results using another LLM.</p>
<p>In steps Tavily. Using the <a target="_blank" href="https://docs.tavily.com/sdk/python/reference#tavily-search">Tavily Search</a> module, the agent can research the entity name along with the extracted context. In Code This Means:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="48ba9a3aeb940f0f8deb6e6296d309d2693ba296"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/48ba9a3aeb940f0f8deb6e6296d309d2693ba296" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/48ba9a3aeb940f0f8deb6e6296d309d2693ba296</a></div><p> </p>
<p>The <code>research_entity</code> function constructs a web search query by simply concatenating the entity’s name and the extracted context, and takes only the top 3 results. Now the agent needs to get the canonical name from the search results. Big wall of code incoming! ⚠️🫣</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="49c66a7e1873bd4776871c0c71d3f659c943af58"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/49c66a7e1873bd4776871c0c71d3f659c943af58" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/49c66a7e1873bd4776871c0c71d3f659c943af58</a></div><p> </p>
<p>Thankfully what the code does is not actually complicated. <code>TavilySearchResult</code> is just a convenience type based on the <a target="_blank" href="https://docs.tavily.com/sdk/python/reference#results">response schema</a> of Tavily’s search endpoint. <code>get_canonical_name</code> formats the search results into a digestible string for the verifier LLM, invokes the LLM, and returns a structured <code>VerifiedEntity</code> after collecting the LLM’s response. The LLM is configured to output a <code>BaseVerifiedEntity</code> to make the process smoother.</p>
<p>The 2 new functions, <code>research_entity</code> and <code>get_canonical_name</code>, will be used by the next node in the graph, but… there could be any number of <code>NamedEntity</code> objects coming from the extractor node. How then can the agent handle this unbounded structure?</p>
<p>LangGraph’s <a target="_blank" href="https://langchain-ai.github.io/langgraph/concepts/low_level/#send"><code>Send</code> API</a> to the rescue! The <code>Send</code> API covers scenarios in which there’s an unknown number of parallel nodes (let’s call them workers) that will flow from a given node, and allows each worker to maintain its own state while also syncing to the state of the parent agent if desired. From the perspective of <code>DemoEnrichmentAgent</code>, each individual worker should receive a <code>NamedEntity</code>, coordinate with <code>research_entity</code> and <code>get_canonical_name</code>, then return a <code>VerifiedEntity</code> that can be saved to the agent’s state. We’ll return to the <code>Send</code> API shortly. First, the logic for the worker:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="b6a28a2d3fa6a5276f519eaf8b8764443ff06d33"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/b6a28a2d3fa6a5276f519eaf8b8764443ff06d33" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/b6a28a2d3fa6a5276f519eaf8b8764443ff06d33</a></div><p> </p>
<p><code>get_verified_entity_worker</code> takes the named entity from its own <code>WorkerState</code>, uses <code>research_entity</code> and <code>get_canonical_name</code> to verify that entity, then updates the worker state with the verified entity. As you’ve probably already noticed, the <code>verified_entities</code> piece of <code>WorkerState</code> uses a slightly unusual <code>Annotated[list, operator.add]</code> type. The base <code>list</code> type is to keep track of all the verified entities. Annotating <code>list</code> with <code>operator.add</code> tells LangGraph to merge (by appending) each new list of items to the existing list of items for that state key. Without that annotation, the verified entities will be lost because LangGraph’s default behavior is to replace the previous list of values with the incoming list of values. Definitely don’t want the verified entities getting lost 🙅🏾‍♂️.</p>
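<p>The difference between replacing and appending is easier to see in isolation. This is a plain-Python imitation of the reducer idea, not LangGraph’s actual implementation. <code>apply_update</code> is a hypothetical helper, the <code>entity</code> field is an assumed name, and the state holds simple strings instead of <code>VerifiedEntity</code> objects:</p>
<pre><code class="lang-python">import operator
from typing import Annotated, TypedDict, get_type_hints


class WorkerState(TypedDict):
    entity: str
    verified_entities: Annotated[list, operator.add]


def apply_update(schema, state: dict, update: dict):
    # Merge a node's partial update into state, honoring any reducer annotation.
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)  # e.g. list + list
        else:
            merged[key] = value  # no reducer: overwrite the old value
    return merged


state = {"entity": "Nolen", "verified_entities": ["Christopher Nolan"]}
state = apply_update(WorkerState, state, {"verified_entities": ["Inception"]})
state = apply_update(WorkerState, state, {"entity": "Dunkrik"})
print(state)
</code></pre>
<p>The annotated key accumulates values across updates, while the plain <code>entity</code> key is simply overwritten.</p>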
<p>Lest we forget, the worker state is still isolated from the agent’s state. Would be really nice to have the <code>verified_entities</code> list synced up to the agent state… which is surprisingly trivial to achieve: define the same key in the agent state and LangGraph will keep them in sync.</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="68749dca601453ca5e372985422d0ac3e47e3a95"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/68749dca601453ca5e372985422d0ac3e47e3a95" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/68749dca601453ca5e372985422d0ac3e47e3a95</a></div><p> </p>
<p>With the state synced up, the <code>Send</code> API comes back into the picture. For this the agent needs a function that will spawn a worker for each extracted entity:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="785bb1bba115e86709adeb9a0f2726bb719577d1"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/785bb1bba115e86709adeb9a0f2726bb719577d1" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/785bb1bba115e86709adeb9a0f2726bb719577d1</a></div><p> </p>
<p>The first argument passed to <code>Send()</code> is the name of the node that will handle the requested operation (i.e. the worker), and the second argument is the piece of state that will be passed to that node (i.e. the <code>WorkerState</code>). However, <code>get_verified_entity_worker</code> is not actually a node in the graph. Well then, it’s time to fix that!</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="68c569a1d2a4561e67279dfb64d7f17628f5a1cb"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/68c569a1d2a4561e67279dfb64d7f17628f5a1cb" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/68c569a1d2a4561e67279dfb64d7f17628f5a1cb</a></div><p> </p>
<p>The graph now features a <a target="_blank" href="https://langchain-ai.github.io/langgraph/reference/graphs/#langgraph.graph.state.StateGraph.add_conditional_edges">conditional edge</a> that connects <code>extractor</code> to any number of <code>get_verified_entity_worker</code> nodes, as many as created by <code>spawn_workers</code>. Here’s the updated graph representation:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755784805163/e4bd74bb-133f-450e-ad0f-be53b52d7494.png" alt class="image--center mx-auto" /></p>
<p>Now I don’t know about you, but for all that code we just went through, I feel like the graph should definitely look fancier 😒 [shakes fist].</p>
<p>Oh well, the thought of seeing the output of the updated agent brings a smile back to my face 🙂. Quick comment on the graph representation: the dotted lines between <code>extractor</code> and <code>get_verified_entity_worker</code> signify the 1-to-many relationship between the two nodes.</p>
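<p>For reference, the fan-out described in this step can be condensed into one self-contained sketch. It assumes a recent LangGraph release where <code>Send</code> is importable from <code>langgraph.types</code>. The <code>entities</code> key and the <code>CANONICAL_NAMES</code> table are hypothetical stand-ins for the extractor output and for the research and LLM calls, so treat this as an illustration rather than the code from the gists.</p>
<pre><code class="lang-python">import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.types import Send

# Hypothetical lookup table standing in for the research + LLM verification
CANONICAL_NAMES = {"Kristoffer Nolen": "Christopher Nolan", "Inseption": "Inception"}


class AgentState(TypedDict):
    entities: list  # hypothetical key holding the extractor's output
    verified_entities: Annotated[list, operator.add]


def extractor(state):
    # Stub: the real node would ask an LLM to extract named entities
    return {"entities": list(CANONICAL_NAMES)}


def spawn_workers(state):
    # One Send per entity: (name of the target node, state handed to that worker)
    return [Send("get_verified_entity_worker", {"entity": e}) for e in state["entities"]]


def get_verified_entity_worker(state):
    # Each worker returns a one-item list; operator.add merges them in the agent state
    return {"verified_entities": [CANONICAL_NAMES[state["entity"]]]}


builder = StateGraph(AgentState)
builder.add_node("extractor", extractor)
builder.add_node("get_verified_entity_worker", get_verified_entity_worker)
builder.add_edge(START, "extractor")
builder.add_conditional_edges("extractor", spawn_workers, ["get_verified_entity_worker"])
builder.add_edge("get_verified_entity_worker", END)
graph = builder.compile()

result = graph.invoke({"entities": []})
print(sorted(result["verified_entities"]))  # ['Christopher Nolan', 'Inception']
</code></pre>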
<p>And now for the test! The same video transcript from earlier will suffice. The logs for this operation are quite a bit longer than those from the previous step, so I’ve pasted them <a target="_blank" href="https://gist.github.com/CodeWithOz/0f455f11b593b2db32d575f6127635bc/6d8059bb00396088b284a199362c2c1dbf6bcef5">here</a> for your viewing pleasure 😍. They show the agent going through the research steps concurrently and then ending with the final state object that contains the verified entities. Smile definitely restored 😁. On we go to Step 3! 🏃🏾‍♂️</p>
<h3 id="heading-step-3-replace-incorrect-names-with-canonical-names-where-appropriate">Step 3: Replace incorrect names with canonical names where appropriate</h3>
<p>There are no prizes for guessing that, once again, an LLM is a capable tool for this step. A simple deterministic “Find and Replace” operation (like a regex) may conceptually seem to do the trick here. Turns out that strategy will quickly go off the rails because there’s no guarantee that every reference to a named entity will use the exact text that the extractor produced for that entity. It’s well and good to find and replace occurrences of “Kristoffer Nolen” with “Christopher Nolan”, but that won’t catch something like “Nolen’s” where “Kristoffer” is missing, or “Kristóffer Nolen” where an accented “ó” is present. With an appropriate prompt and some examples, an LLM can be coached to handle the inevitable variability quite well.</p>
<p>This agent will use a modified “Find and Replace” strategy. The agent will make successive scans over the text, use an LLM to replace any incorrect references it detects for each verified entity, and review the replacement work to ensure completeness. This strategy is far from perfect, and I’ll discuss avenues for optimization further below.</p>
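<p>A quick way to see the problem is to run an exact-match replacement over a few sentences. The sample text below is made up for this sketch; only the names come from the example above.</p>
<pre><code class="lang-python"># Made-up sample text containing three variants of the same entity
text = (
    "Kristoffer Nolen directed the film. "
    "Nolen’s next project is a secret. "
    "Fans of Kristóffer Nolen can't wait."
)

# Exact-match replacement: only the first variant is an exact hit
naive = text.replace("Kristoffer Nolen", "Christopher Nolan")
print(naive)

# The partial and accented variants are still in the text
print("Nolen’s" in naive)           # True
print("Kristóffer Nolen" in naive)  # True
</code></pre>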
<p>Right now it’s time to scaffold the strategy. For each verified entity, the agent will:</p>
<ul>
<li><p>scan the transcript text and replace any occurrences with the canonical name. The agent needs a node for this.</p>
</li>
<li><p>review the output of the replacement to make sure it was complete. The agent also needs a node for this.</p>
</li>
<li><p>decide when to move on to the next verified entity. The agent needs a conditional edge for this, though with different behavior from the earlier one.</p>
</li>
</ul>
<p>And on to the code 👨🏾‍💻:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="26b475e72b70eedc094218140572e7055eb9f5fb"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/26b475e72b70eedc094218140572e7055eb9f5fb" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/26b475e72b70eedc094218140572e7055eb9f5fb</a></div><p> </p>
<p><code>get_verified_entity_worker</code> now connects to <code>replacement_reviewer</code>, which compresses the unbounded structure back into one node after all the workers have completed their tasks. The new conditional edge stipulates that a return value of <code>continue</code> from <code>continue_replacement_router</code> will push the logical flow to <code>replace_entity</code>, whereas a return value of <code>end</code> will terminate the agent’s work. There’s also a normal edge connecting <code>replace_entity</code> to <code>replacement_reviewer</code>, which means that every time <code>continue_replacement_router</code> returns <code>continue</code>, a replacement will be attempted and subsequently reviewed. Voila! A decision-making loop in the agent. Now the graph looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1755911762999/99e337c2-874c-4571-9695-07a5b11f4db5.png" alt class="image--center mx-auto" /></p>
<p>Finally it looks more interesting! 🤓</p>
<p>Let’s flesh out the new functions in reverse order, starting with <code>continue_replacement_router</code>. This function’s behavior is relatively straightforward: if the agent has replaced all the verified entities, the task is done. Otherwise, the process should continue. A loop counter will be helpful to know which member of the <code>verified_entities</code> list the agent is working on.</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="b3208810b0a62c3858e432d2a3aaacfef57d9c15"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/b3208810b0a62c3858e432d2a3aaacfef57d9c15" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/b3208810b0a62c3858e432d2a3aaacfef57d9c15</a></div><p> </p>
<p><code>continue_replacement_router</code> simply signals if the loop counter has reached the end of the list of verified entities. Once again, the <code>operator.add</code> annotation tells LangGraph to merge updates into the existing value of the state key using an add operation. For an <code>int</code> type, this just means “increment by the new value”.</p>
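<p>In plain Python, the router’s decision and the counter update might look roughly like this. It’s a sketch based on the description above, not a copy of the gist, and it treats the state as an ordinary dict.</p>
<pre><code class="lang-python">import operator


def continue_replacement_router(state):
    """Return "continue" while there are entities left to process, else "end"."""
    if state["replacement_loop_idx"] in range(len(state["verified_entities"])):
        return "continue"
    return "end"


state = {"replacement_loop_idx": 0, "verified_entities": ["Christopher Nolan", "Inception"]}
print(continue_replacement_router(state))  # continue

# With an operator.add reducer, a node that returns 1 for the counter bumps it by 1
state["replacement_loop_idx"] = operator.add(state["replacement_loop_idx"], 1)
state["replacement_loop_idx"] = operator.add(state["replacement_loop_idx"], 1)
print(continue_replacement_router(state))  # end
</code></pre>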
<p>Next up is <code>replacement_reviewer_node</code>. To review effectively, the agent needs to keep track of the reviewed transcript text. I noticed, after some trial and error, that making multiple replacement attempts for each verified entity boosted the accuracy of the replacement work. To achieve that, the agent needs to track the number of attempts it’s made for the entity it’s currently processing. All this means that <code>AgentState</code> gets two new properties:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="d4ddc230aaed419fb26af238968e9e1040af10c3"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/d4ddc230aaed419fb26af238968e9e1040af10c3" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/d4ddc230aaed419fb26af238968e9e1040af10c3</a></div><p> </p>
<p>And <code>replacement_reviewer_node</code> gets the following logic (it’s kind of a lot but not too bad):</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="0659fbfec75b2cb1ec174a07b35a06c4751328e5"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/0659fbfec75b2cb1ec174a07b35a06c4751328e5" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/0659fbfec75b2cb1ec174a07b35a06c4751328e5</a></div><p> </p>
<p>Once again, easier than the length suggests 🙂. <code>replacement_reviewer_node</code> uses <code>replacement_reviewer_llm</code> to check if the current entity has been fully replaced in the updated text. It includes a guard that skips the review when both <code>replacement_loop_idx</code> and <code>replacement_pass_count</code> are <code>0</code>, which means no replacement has even happened. If the LLM concludes that the replacement was successful, or if 2 replacement attempts have been made for the entity, <code>replacement_loop_idx</code> gets incremented to move the agent to the next entity on the next round of the loop, and <code>replacement_pass_count</code> is reset to <code>0</code> in preparation for that. <code>replacement_reviewer_llm</code> is configured to always return a boolean using the <code>ReplacementReviewOutcome</code> structure.</p>
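<p>Stripped of the LLM call, the bookkeeping described above boils down to a few branches. In this sketch the <code>reviewer</code> argument is a hypothetical callable standing in for <code>replacement_reviewer_llm</code>, and the returned dicts are the state updates: a <code>1</code> for <code>replacement_loop_idx</code> means “add 1” because of the <code>operator.add</code> reducer.</p>
<pre><code class="lang-python">MAX_PASSES = 2  # replacement attempts allowed per entity, as described above


def replacement_reviewer_sketch(state, reviewer):
    """Return the state update that follows a review of the current entity."""
    if state["replacement_loop_idx"] == 0 and state["replacement_pass_count"] == 0:
        return {}  # guard: nothing has been replaced yet, so skip the review
    entity = state["verified_entities"][state["replacement_loop_idx"]]
    fully_replaced = reviewer(entity, state["updated_transcript_text"])
    if fully_replaced or state["replacement_pass_count"] == MAX_PASSES:
        # Move on to the next entity and reset the attempt counter
        return {"replacement_loop_idx": 1, "replacement_pass_count": 0}
    return {}  # stay on the same entity for another replacement attempt


state = {
    "verified_entities": ["Christopher Nolan"],
    "updated_transcript_text": "Christopher Nolan directed the film.",
    "replacement_loop_idx": 0,
    "replacement_pass_count": 1,
}
print(replacement_reviewer_sketch(state, lambda entity, text: entity in text))
# {'replacement_loop_idx': 1, 'replacement_pass_count': 0}
</code></pre>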
<p>Okay, one more function to go. Almost there! 🤏🏾</p>
<p><code>replace_entity_node</code> basically just needs to establish the current entity it should work on, feed that entity along with the transcript text into an LLM that will do the replacement, then update the agent’s state with the new transcript text. One more big block of code, I promise it’s the last one 🙏🏾. Here it is:</p>
<div class="gist-block embed-wrapper" data-gist-show-loading="false" data-id="b5dc91741f40048f0c69ee8e0be427462830e642"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a href="https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/b5dc91741f40048f0c69ee8e0be427462830e642" class="embed-card">https://gist.github.com/CodeWithOz/0b01d558adac703673aceca48471dca7/b5dc91741f40048f0c69ee8e0be427462830e642</a></div><p> </p>
<p>As described above, <code>entity_replacer_llm</code> does the replacement, with output structured using <code>TextReplacement</code>. This node also updates <code>replacement_pass_count</code> in the agent’s state, because this operation is the replacement attempt.</p>
<p>Whew! That was a lot 😅. With all of those changes done, it’s time to take the agent for a spin. Fresh, real transcript data will be tested in short order, but for now let’s stick to Chris Nolan’s affairs to validate all the new code. The logs of this run are even longer than the last set, so feel free to feast your eyes on them <a target="_blank" href="https://gist.github.com/CodeWithOz/0f455f11b593b2db32d575f6127635bc/759fd357868c4221ba12efcf4e91988798c61cda">here</a>. They show the value of <code>updated_transcript_text</code> in the final state object, along with other markers of the agent’s work. Ergo, success! 🙌🏾</p>
<p>Looking closely at the logs, some interesting things happened in this run. First, the reviewer LLM correctly flagged up that the “Kristoffer Nolen” entity had not yet been fully replaced after the first attempt, and the second attempt finished off the job. Exactly what we want to see 👏🏾. Secondly, the replacer LLM inserted <code>TRANSCRIPT_TEXT:</code> into the updated transcript more than once. Thankfully it had corrected itself by the end, but that kind of variability is not welcome here. Some strategies that can minimize this issue will be discussed further below.</p>
<p>But now… it’s time for… 🥁🥁🥁 real data!</p>
<h2 id="heading-the-results">The Results 📊</h2>
<p>I built this agent as part of a tool to summarize videos from <a target="_blank" href="https://www.youtube.com/@FabrizioRomanoYT">Fabrizio Romano’s YouTube channel</a>. Yes, I’m a football transfer news junkie, don’t judge me. A typical video from that channel is <a target="_blank" href="https://www.youtube.com/watch?v=FDaSEBzyjFQ">this one</a>, about 8 minutes long. The video’s transcript contains around 1500 words using around 8300 characters. Running the transcript through this agent yielded… a <a target="_blank" href="https://langchain-ai.github.io/langgraph/troubleshooting/errors/GRAPH_RECURSION_LIMIT/">recursion limit</a> error 😭. My fancy “agentic” loop wins the award for causing that. Small matter though, the error page offers a simple solution: specify a <code>recursion_limit</code> that’s as high as you need it to be when calling <code>.invoke()</code> or <code>.stream()</code>. If you’re thinking that maybe, just maybe, this is a sign of an architectural problem, your instincts are not wrong. I’ll touch on that in a bit. After increasing the recursion limit and getting a successful run of the agent, I could show you some more exciting logs, but I’ll highlight some not-so-in-depth metrics instead.</p>
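<p>Here’s a minimal sketch of that fix on a toy looping graph. The graph, the <code>steps</code> key, and the limits of 10 and 100 are all made up for illustration; the point is only where <code>recursion_limit</code> goes.</p>
<pre><code class="lang-python">import operator
from typing import Annotated, TypedDict

from langgraph.errors import GraphRecursionError
from langgraph.graph import END, START, StateGraph


class LoopState(TypedDict):
    steps: Annotated[int, operator.add]


def work(state):
    return {"steps": 1}  # each round of the loop adds 1 to the counter


def router(state):
    return "end" if state["steps"] == 40 else "continue"


builder = StateGraph(LoopState)
builder.add_node("work", work)
builder.add_edge(START, "work")
builder.add_conditional_edges("work", router, {"continue": "work", "end": END})
graph = builder.compile()

try:
    # A limit lower than the 40 rounds this loop needs triggers the error
    graph.invoke({"steps": 0}, config={"recursion_limit": 10})
    hit_limit = False
except GraphRecursionError:
    hit_limit = True

# A limit with enough headroom lets the loop finish
result = graph.invoke({"steps": 0}, config={"recursion_limit": 100})
print(hit_limit, result["steps"])  # True 40
</code></pre>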
<p>One interesting metric for the text replacement step is the <strong>number of incorrect replacements for each extracted entity</strong>. This number hovered between <strong>0-2</strong> for each entity, and it includes partial and accented replacements as described earlier. It also includes replacements of variations of misspellings for the same entity, for example replacing “Antract Frankfurt” with “Eintracht Frankfurt” even though the extracted name was already the correct canonical name. 2 incorrect replacements per entity is as bad as I am willing to accept for my use-case, so this result just about falls within my limits. Here’s a per-entity breakdown in the format “<em>number of incorrect replacements / total number of occurrences to replace</em>” for a run in which 16 entities were extracted:</p>
<ol>
<li><p>“Benjamin Chesco” → “Benjamin Sesko” - <strong>1/11</strong></p>
</li>
<li><p>“Manchester United” → “Manchester United F.C.” - <strong>1/12</strong></p>
</li>
<li><p>“Newcastle” → “Newcastle United” - <strong>0/8</strong></p>
</li>
<li><p>“Red Bull Leipzig” → “RB Leipzig” - <strong>1/7</strong></p>
</li>
<li><p>“Nicholas Jackson” → “Nicolas Jackson” - <strong>2/7</strong></p>
</li>
<li><p>“Chelsea” → “Chelsea FC” - <strong>0/4</strong></p>
</li>
<li><p>“Darwin Nunes” → “Darwin Núñez” - <strong>2/10</strong></p>
</li>
<li><p>“Al Hilal” → “Al Hilal” - <strong>0/7</strong></p>
</li>
<li><p>“Liverpool” → “Liverpool FC” - <strong>0/4</strong></p>
</li>
<li><p>“Ritsu Doan” → “Ritsu Doan” - <strong>2/3</strong></p>
</li>
<li><p>“Eintracht Frankfurt” → “Eintracht Frankfurt” - <strong>0/2</strong></p>
</li>
<li><p>“Jack Grealish” → “Jack Grealish” - <strong>0/9</strong></p>
</li>
<li><p>“Everton” → “Everton Football Club” - <strong>0/3</strong></p>
</li>
<li><p>“Tottenham” → “Tottenham Hotspur” - <strong>0/1</strong></p>
</li>
<li><p>“Southampton” → “Southampton” - <strong>0/1</strong></p>
</li>
<li><p>“Malik Fofana” → “Malick Fofana” - <strong>0/2</strong></p>
</li>
</ol>
<p>Another way to view those same numbers is to calculate “<em>number of correct replacements / total number of occurrences to replace</em>” expressed as a percentage, as a measure of <strong>accuracy</strong> of the text replacement step. For this same run, the average accuracy per-entity is <strong>90.2%</strong>. That suggests, from a ridiculously small sample size 😂, that the agent can do the replacement quite well most of the time.</p>
<p>One more interesting number is the rate at which the review step <em>correctly</em> identified when the replacement was complete or incomplete. This number can be calculated as “<em>number of correct reviews / total number of reviews performed</em>” and expressed as a percentage, as a measure of <strong>accuracy</strong> of the review step. Across the full run the agent performed 19 reviews rather than the maximum 32 possible, because the reviewer LLM concluded that some entities did not require a second replacement attempt. 15 of those reviews were correct, meaning <strong>78.9%</strong> accuracy. Meh, definitely room for improvement there.</p>
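<p>Both accuracy figures are the same simple ratio. As a small worked example using numbers from this run (the helper name <code>accuracy_pct</code> is just for this sketch):</p>
<pre><code class="lang-python">def accuracy_pct(correct, total):
    """Share of correct outcomes as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


# Review step: 15 correct reviews out of 19 performed
print(accuracy_pct(15, 19))  # 78.9

# Upper bound on reviews: 16 entities, at most 2 replacement attempts each
print(16 * 2)  # 32

# Replacement step for single entities: correct = total - incorrect
print(accuracy_pct(11 - 1, 11))  # 90.9 for “Benjamin Chesco” (1/11 incorrect)
print(accuracy_pct(7 - 2, 7))    # 71.4 for “Nicholas Jackson” (2/7 incorrect)
</code></pre>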
<p>And on that note, it’s time to talk about getting better results.</p>
<h2 id="heading-the-upgrades">The Upgrades(?) 👨🏾‍🔧</h2>
<p>General rule of thumb: optimize only after measuring, and start optimizing where the measured results are worst. Aaand right after saying that, I’m going to violate that rule just a bit 😇 because my objective here is to provide some food 🥘 for thought 🧠. So I’ll speak broadly about different optimizations I either considered or applied while building this agent for my video summarizer.</p>
<h3 id="heading-better-prompts">Better prompts</h3>
<p>Improving the agent’s LLM prompts is perhaps an obvious change. In this blog post I used very simplistic prompts that were intended to (1) conserve space and (2) maintain focus on what the system should be doing rather than how well it does its job. For my tool I used much more descriptive prompts that included a number of examples for how I wanted the LLMs to respond. If you’re wondering how to get highly descriptive prompts with appropriate examples and formatting, start by using ChatGPT or your fave chatbot to do the heavy lifting after describing your needs. As an example, here’s the <a target="_blank" href="https://chatgpt.com/s/t_68ad9c221b1081918a7277b7e60b0bca">final prompt</a> I used for the text replacement step. Apart from “normal” prompting, <a target="_blank" href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation">RAG</a> is a more advanced technique to inject domain-specific context into prompts, and the idea of <a target="_blank" href="https://blog.langchain.com/the-rise-of-context-engineering/">“context engineering”</a> is a more holistic way to think of this process.</p>
<h3 id="heading-model-choice">Model choice</h3>
<p>Testing different models is another way to eke out better performance. Each model has a unique profile for cost and suitability to a given task, so there’s no substitute for trying out different models to see what works best for your use-case. For example, in the verifier step I was keen to use <strong>gpt-5-nano</strong> because of its low cost compared to other high-end OpenAI models, but I found that it performed noticeably worse compared to <strong>gpt-4o-mini</strong>. Similarly, I would’ve used OpenAI’s best available reasoning models for the review step but they were too expensive for this use-case. Lesser models and other optimizations could get me my desired results.</p>
<h3 id="heading-reduce-llm-workload">Reduce LLM workload</h3>
<p>Steps 1 and 3 of this agent are variants of “needle in a haystack” challenges where the LLM has to find some data in a sea of text. One can generally assume that the smaller the haystack, the easier the LLM will find the needles. To this end, the next 5 subsections discuss different ways to reduce the amount of work sent to the LLM.</p>
<h4 id="heading-summarizingshortening-input">Summarizing/shortening input</h4>
<p>You may have noticed that the <a target="_blank" href="https://chatgpt.com/s/t_68ad9c221b1081918a7277b7e60b0bca">final prompt</a> I provided earlier specifies that a video summary will be one of the inputs to the LLM, not the full transcript text. That’s because my tool summarizes the transcript first, then passes the summary into the agent for enrichment. That step alone cut down around <strong>60-70%</strong> of the transcript text sent to the agent, making the process faster, cheaper, and more accurate. Of course there’s a risk of losing some context relevant to some entities, but ultimately the outcomes balanced out pretty nicely for my needs.</p>
<h4 id="heading-chunking">Chunking</h4>
<p>When dealing with large inputs it often makes sense to work with chunks of input at a time instead of all the inputs at once. However, the videos I worked with were short enough to not need chunking for better performance. So I didn’t need to pursue this strategy.</p>
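<p>For completeness, a word-based chunker could look something like the sketch below. The function name, the default sizes, and the idea of overlapping neighboring chunks (so that a name straddling a boundary still appears whole in one chunk) are assumptions for illustration, not something taken from my tool.</p>
<pre><code class="lang-python">def chunk_words(text, max_words=300, overlap=30):
    """Split text into word-based chunks that share `overlap` words with their neighbor."""
    if overlap not in range(max_words):
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    # Stop before a start index whose words are already covered by the previous chunk
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, max_words=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
</code></pre>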
<h4 id="heading-deterministic-fuzzy-search-and-replace">Deterministic + fuzzy search and replace</h4>
<p>A combination of deterministic and fuzzy searching could be a viable strategy to identify the bits of text that need to be replaced in Step 3. Depending on how well it works, this strategy could augment if not replace the modified “find and replace” strategy used by <code>DemoEnrichmentAgent</code>. I didn’t pursue this path seriously because, frankly, I was more interested in learning how to build an agentic workflow than in making each step the best it could possibly be. My gut feeling is that deterministic + fuzzy search and replace would still not have been enough, but that remains to be seen.</p>
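<p>As a rough idea of what the fuzzy half could look like, here’s a sketch that uses Python’s standard <code>difflib</code> module to compare the extracted name against every window of the same word count. The function name, the sample sentence, and the <code>0.8</code> cutoff are assumptions for illustration; this is not code from <code>DemoEnrichmentAgent</code>.</p>
<pre><code class="lang-python">import difflib


def fuzzy_candidates(text, name, cutoff=0.8):
    """Return word windows in `text` that closely resemble `name`."""
    size = len(name.split())
    words = text.split()
    # Every run of `size` consecutive words is a candidate
    windows = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    return difflib.get_close_matches(name, windows, n=max(len(windows), 1), cutoff=cutoff)


text = "Fans of Kristóffer Nolen say Nolen’s films are great"
print(fuzzy_candidates(text, "Kristoffer Nolen"))  # ['Kristóffer Nolen']
</code></pre>
<p>The accented variant is caught, but the partial reference “Nolen’s” is not, which hints at why fuzzy matching alone may fall short.</p>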
<h4 id="heading-one-shot-replacement">One-shot replacement</h4>
<p>This means replacing all the entities in one attempt, rather than using the replace-review loop that caused the recursion error. Once again, the <a target="_blank" href="https://chatgpt.com/s/t_68ad9c221b1081918a7277b7e60b0bca">final prompt</a> I provided earlier shows that one-shot replacement is actually the strategy I use in my tool, not the replace-review loop. But why, you may ask? Well, with a small enough amount of text and with relatively few entities to replace, that step can be done by an LLM in one go. It comes with a slight dip in accuracy, but I was willing to accept it because of the time and cost savings. Choosing to summarize the transcript before feeding it into the agent made this a viable alternative to the replace-review loop.</p>
<h4 id="heading-caching">Caching</h4>
<p>Saving and reusing any outputs that the agent repeatedly generates can bring some welcome benefits. For example, the canonical names are good candidates for caching because they’re unlikely to change for a given input. With a store of such data, the agent’s logic can be modified to first check the cache and use results from there if available, rather than going to the web or relying on an LLM every time. However, caching is notoriously hard to get right. For this agent, I decided not to implement caching because the potential time and cost savings were not large enough to risk my debugging sanity 🙂.</p>
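<p>If you do want to experiment with it, the “check the cache first” idea can be sketched with an in-memory dict. Everything here is hypothetical: <code>slow_lookup</code> stands in for the web research and LLM calls, and a real cache would also need persistence and an invalidation policy.</p>
<pre><code class="lang-python">def make_cached_lookup(lookup):
    """Wrap `lookup` so that repeated names are served from an in-memory dict."""
    cache = {}

    def cached(name):
        key = " ".join(name.casefold().split())  # normalize case and whitespace
        if key not in cache:
            cache[key] = lookup(name)  # only reached on a cache miss
        return cache[key]

    return cached


calls = []


def slow_lookup(name):
    # Hypothetical stand-in for the expensive research + LLM step
    calls.append(name)
    return {"newcastle": "Newcastle United"}.get(name.casefold(), name)


get_canonical_name_cached = make_cached_lookup(slow_lookup)
first = get_canonical_name_cached("Newcastle")
second = get_canonical_name_cached("newcastle ")
print(first, second, len(calls))  # Newcastle United Newcastle United 1
</code></pre>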
<h2 id="heading-conclusion">Conclusion 😌</h2>
<p>If you made it this far, you have my sincere gratitude 🙇🏾‍♂️. You deserve a gift…</p>
<p><a target="_blank" href="https://www.deviantart.com/jusuchyne/art/Brownie-Points-gif-1016077209"><img src="https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com/f/19a30f1b-3089-4e91-b56c-b65508bf0456/dgsy26x-b81604b0-d527-4865-b670-e6fe43dc59f6.gif?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1cm46YXBwOjdlMGQxODg5ODIyNjQzNzNhNWYwZDQxNWVhMGQyNmUwIiwiaXNzIjoidXJuOmFwcDo3ZTBkMTg4OTgyMjY0MzczYTVmMGQ0MTVlYTBkMjZlMCIsIm9iaiI6W1t7InBhdGgiOiJcL2ZcLzE5YTMwZjFiLTMwODktNGU5MS1iNTZjLWI2NTUwOGJmMDQ1NlwvZGdzeTI2eC1iODE2MDRiMC1kNTI3LTQ4NjUtYjY3MC1lNmZlNDNkYzU5ZjYuZ2lmIn1dXSwiYXVkIjpbInVybjpzZXJ2aWNlOmZpbGUuZG93bmxvYWQiXX0.wx295whhn3b0TeokoOXFg6dXASNNFMIIKeWCSfEt9ds" alt="Brownie Points (gif)" class="image--center mx-auto" /></a></p>
<p>The complete code shown in this post is available on <a target="_blank" href="https://github.com/CodeWithOz/demo-enrichment-agent">GitHub</a>.</p>
<p>Now go forth and build your own agents! And let me know about them too 😉.</p>
<hr />
<p>Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Server Management for the Faint of Heart, Featuring Caddy]]></title><description><![CDATA[Introduction
Caddy is an awesome piece of software that puts SSL cert management on cruise-control, provides approachable yet flexible reverse proxying, and offers a powerful and configurable HTTP server, with some extra goodies for static files. And...]]></description><link>https://incodethismeans.com/server-management-for-the-faint-of-heart-featuring-caddy</link><guid isPermaLink="true">https://incodethismeans.com/server-management-for-the-faint-of-heart-featuring-caddy</guid><category><![CDATA[Caddy]]></category><category><![CDATA[caddyfile]]></category><category><![CDATA[TLS]]></category><category><![CDATA[Reverse Proxy]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Fri, 03 Jan 2025 10:58:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/AfStyhXC5kM/upload/a8505b8fe01f219e4c27fe209fcbeeb9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction</h3>
<p><a target="_blank" href="https://caddyserver.com/">Caddy</a> is an awesome piece of software that puts SSL cert management on cruise-control, provides approachable yet flexible reverse proxying, and offers a powerful and configurable HTTP server, with some extra goodies for static files. And while providing all that value, Caddy still makes server management sooo much easier and more intuitive! If you can’t tell, I’m a fan of Caddy. I recently set up a server with Caddy, so this article features the tips and unblocks I picked up along the way. Let’s dive in!</p>
<h3 id="heading-getting-started">Getting Started</h3>
<p>If you’re new to Caddy, I highly recommend Caddy’s recommendation 🤭 that you get started in the following order:</p>
<ul>
<li><p>the <a target="_blank" href="https://caddyserver.com/docs/getting-started">Getting Started</a> guide</p>
</li>
<li><p>the <a target="_blank" href="https://caddyserver.com/docs/quick-starts">Quick-starts</a> guides</p>
</li>
</ul>
<p>This is one of the rare occasions when the official guides are as clear as any others you’re likely to find. When you’re done with those guides, you can then consult the reference for the <a target="_blank" href="https://caddyserver.com/docs/api">API</a> or <a target="_blank" href="https://caddyserver.com/docs/caddyfile">Caddyfile</a>, depending on what you want to do and how you want to do it with Caddy. Let’s explore an example.</p>
<h3 id="heading-use-case-reverse-proxies-made-easy">Use Case: Reverse Proxies Made Easy 🤌🏾</h3>
<p>One of the reasons I love Caddy is the ease it brings to setting up a reverse proxy. For instance, putting <code>reverse_proxy :9000</code> in a Caddyfile is enough to route all traffic to the application running on port 9000. You’ll often need something more complicated than that, yet the simplicity remains. To illustrate, imagine that you own the domain <code>yourdomain.tld</code>. You’ve built a helpful web application and you want it live on the internet at <code>sub.yourdomain.tld</code>. What will you do with the main <code>yourdomain.tld</code> site? Maybe it’s too much effort to worry about that right now. You do know, however, that you don’t want to redirect elsewhere. So if anyone chooses to visit <code>yourdomain.tld</code>, you decide to just show a very boring placeholder message in plain text: “Welcome to yourdomain.tld! Full website coming soon.” How to do this? 🤔</p>
<p>You could (1) capture all the traffic heading to the main domain and any subdomains you want to use, (2) use your helpful web app to serve the requests intended for <code>sub.yourdomain.tld</code>, and (3) show your boring message for all the remaining traffic. In Code This Means the following config in a Caddyfile will serve your needs:</p>
<pre><code class="lang-plaintext">*.yourdomain.tld, yourdomain.tld {
    @sub host sub.yourdomain.tld
    route {
        reverse_proxy @sub localhost:port
        respond "Welcome to yourdomain.tld! Full website coming soon."
    }
}
</code></pre>
<ul>
<li><p><code>*.yourdomain.tld, yourdomain.tld</code>: matches incoming requests for all the domains and subdomains that should be handled by the <a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives#caddyfile-directives">directives</a> inside this <a target="_blank" href="https://caddyserver.com/docs/caddyfile/concepts#blocks">site block</a>.</p>
</li>
<li><p><code>@sub host sub.yourdomain.tld</code>: further <a target="_blank" href="https://caddyserver.com/docs/caddyfile/matchers">matches the subset of requests</a> where the hostname is <code>sub.yourdomain.tld</code>, and assigns them the shorthand name <code>@sub</code>.</p>
</li>
<li><p><code>reverse_proxy @sub localhost:port</code>: <a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives/reverse_proxy">routes</a> all traffic matched by <code>@sub</code> to the application running on <code>localhost:port</code>.</p>
</li>
<li><p><code>respond "Welcome to ..."</code>: <a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives/respond">responds with the specified plain text</a> to all other requests not matched by <code>@sub</code>.</p>
</li>
<li><p><a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives/route"><code>route</code> directive</a> allows you to override the <a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives#directive-order">default order</a> in which other directives are handled. For instance, by default Caddy gives the <code>respond</code> directive higher priority than the <code>reverse_proxy</code> directive. Without the <code>route</code> directive, requests to <code>sub.yourdomain.tld</code> would get the plain text response instead of being served by your helpful web app. Of course that’s not the behavior you want here. You can therefore use <code>route</code> to specify your desired order of priorities, as shown above.</p>
</li>
</ul>
<p>So, 7 lines of code. Not bad. Let’s briefly consider doing the same thing using perhaps the most popular reverse proxy solution, NGINX.</p>
<p>I’ll be honest: I’ve never gotten proficient with NGINX. That’s mostly because it can be rather verbose and unwieldy, increasing the likelihood that I’ll do the wrong thing. So, I hope you’ll forgive me for consulting with ChatGPT to generate an NGINX config that’s equivalent to the Caddyfile config above. Hallucinations 👻 are not welcome here, so I ran the chat bot’s answer through <a target="_blank" href="https://www.getpagespeed.com/check-nginx-config">this validator</a> to identify and fix any obvious problems. Without further ado, here’s the final snippet:</p>
<pre><code class="lang-plaintext">server {
    listen 80;
    listen [::]:80;
    server_name *.yourdomain.tld yourdomain.tld;

    # Redirect HTTP to HTTPS
    return 301 https://$host$request_uri;
}

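# Exact server name: NGINX prefers it over the wildcard block below
server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name sub.yourdomain.tld;

    # SSL configuration
    ssl_certificate /path/to/your/certificate.crt;
    ssl_certificate_key /path/to/your/private.key;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    # Proxy everything for sub.yourdomain.tld to the web app
    # (proxy_pass must live in a location block, not in a server-level if)
    location / {
        proxy_pass http://localhost:port;
    }
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name *.yourdomain.tld yourdomain.tld;

    # SSL configuration
    ssl_certificate /path/to/your/certificate.crt;
    ssl_certificate_key /path/to/your/private.key;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;

    # Default response for other subdomains and yourdomain.tld
    location / {
        default_type text/plain;
        return 200 "Welcome to yourdomain.tld! Full website coming soon.";
    }
}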
</code></pre>
<p>😳😲😱</p>
<p>Bear in mind that Caddy provides automatic SSL out of the box. So the <em>extra</em> SSL settings in the NGINX config are needed just to match what the Caddy config does with <em>no</em> SSL settings at all.</p>
<p>Well... easy choice for me. I prefer the Caddy way.</p>
<h3 id="heading-going-live">Going Live</h3>
<p>As you may know, it’s pretty easy to do the wrong thing when manually setting up your live environment. So, I highly recommend reading through <a target="_blank" href="https://www.digitalocean.com/community/tutorials/how-to-host-a-website-with-caddy-on-ubuntu-22-04">DigitalOcean’s guide</a> to deploying Caddy for a live website, combined with the following troubleshooting notes.</p>
<h4 id="heading-memory-requirements">Memory Requirements</h4>
<p>Do you want to build the Caddy binary yourself? I might have some bad news for you: you need about 2GB of RAM for the build process to succeed. If you’re using a small DO droplet like the one I used, that’s definitely bad news because you don’t have that much RAM. Well then… is there good news? Yep 🙂‍↕️ you can build the binary on your computer using Go’s cross-platform compilation feature, as mentioned <a target="_blank" href="https://github.com/caddyserver/xcaddy/issues/56#issuecomment-796737063">here</a>. Then copy the binary to your server using <code>rsync -avz /path/to/binary_file user@destination:/path/to/destination</code>. <code>user@destination</code> represents your SSH username and server IP address.</p>
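<p>As a rough sketch of that workflow, assuming you have Go and <code>xcaddy</code> installed on your computer, your server runs 64-bit Linux on x86 (<code>linux/amd64</code>), and you want the DigitalOcean DNS plugin, the commands could look like this. The output filename and the destination path are placeholders, so adjust them to your setup:</p>
<pre><code class="lang-plaintext"># Build on your own computer, targeting the server's OS and CPU architecture
GOOS=linux GOARCH=amd64 xcaddy build \
  --with github.com/caddy-dns/digitalocean \
  --output ./caddy-linux-amd64

# Copy the resulting binary to the server
rsync -avz ./caddy-linux-amd64 user@destination:/path/to/destination
</code></pre>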
<p>Does building from source sound too stressful for you? Fear not — you can still <a target="_blank" href="https://caddyserver.com/download">download a prebuilt binary</a> and start serving right away!</p>
<h4 id="heading-uniform-firewalls">Uniform Firewalls</h4>
<p>At the step for configuring the firewalls on your instance (i.e. <code>sudo ufw allow …</code>), avoid using <code>ufw</code> if you previously configured your firewall rules in a different place, such as in the DO dashboard. Instead, you should go back to that place and add 2 inbound rules for ports 80 and 443. Those ports correspond to HTTP and HTTPS, respectively. Here’s a screenshot of what this looks like in DO’s dashboard:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1730181045639/c3f0e4dc-4e3f-4fb8-a95b-1656b040b31c.png" alt class="image--center mx-auto" /></p>
<h4 id="heading-dns-awareness">DNS Awareness</h4>
<p>In your domain host’s DNS settings, you need to add records that define the subdomains you want Caddy to control. CNAME alias records that point each subdomain at your main domain are one way to do it, and additional A records that point at your server’s IP address work too. This step is <strong>critical</strong>. Without it, the subdomain traffic will never even reach your server instance for Caddy to handle as you desire. Setting up only the main A record for your domain is not sufficient.</p>
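<p>One quick way to check that the records have taken effect, assuming you have the <code>dig</code> tool installed on your computer, is to look up each name. The domain names here are the same placeholders used throughout this post:</p>
<pre><code class="lang-plaintext"># Each lookup should end with your server's IP address
# (a CNAME answer is followed by the address it resolves to)
dig +short yourdomain.tld
dig +short sub.yourdomain.tld
</code></pre>
<p>If a lookup prints nothing, the record is either missing or hasn’t propagated yet.</p>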
<h4 id="heading-location-location-location">Location, Location, Location!</h4>
<p>Does your <code>systemd</code> setup fail when you try to enable the Caddy service with <code>systemctl</code>? If you’re using a Caddyfile, check whether you saved the file at the location described in the <code>ExecStart</code> and <code>ExecReload</code> lines of the general <a target="_blank" href="https://caddyserver.com/docs/running#unit-files"><code>caddy.service</code></a> file. At the time of writing, both lines default to <code>/etc/caddy/Caddyfile</code>. So if your Caddyfile is located elsewhere, you need to edit the <code>caddy.service</code> file on your system to point to the correct location.</p>
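<p>Here’s a small sketch of how you might check this, assuming the service is registered with <code>systemd</code> under the name <code>caddy</code>:</p>
<pre><code class="lang-plaintext"># Print the unit file that systemd is actually using and pick out the config paths
systemctl cat caddy | grep -E 'ExecStart|ExecReload'

# After editing the unit file, reload systemd and restart the service
sudo systemctl daemon-reload
sudo systemctl restart caddy
</code></pre>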
<h4 id="heading-logging">Logging 🪵</h4>
<p>You probably don’t need me to tell you that you should set up logging so you can track what Caddy is doing. Caddy’s logging philosophy is quite powerful, but you may not be familiar with it. Thankfully they have an <a target="_blank" href="https://caddyserver.com/docs/logging">explainer</a> you can read through. Setting up logging is easy enough though:</p>
<ul>
<li><p>create the folder <code>/var/log/caddy</code> (use <code>sudo</code> if necessary)</p>
</li>
<li><p>create the <code>access.log</code> file inside that folder</p>
</li>
<li><p>give Caddy full control of the folder: <code>sudo chown -R caddy:caddy /var/log/caddy</code></p>
<ul>
<li>this ensures Caddy can write to the log file without permission issues</li>
</ul>
</li>
<li><p>add a <a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives/log"><code>log</code> directive</a> to your Caddyfile and specify <code>access.log</code> as the output file, like this:</p>
<pre><code class="lang-plaintext">  log {
      output file /var/log/caddy/access.log
  }
</code></pre>
</li>
</ul>
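<p>In shell terms, the first three steps could look like this. It’s a sketch that assumes a <code>caddy</code> user and group already exist on your server and that Caddy runs as a <code>systemd</code> service named <code>caddy</code>:</p>
<pre><code class="lang-plaintext"># Create the log folder and an empty log file
sudo mkdir -p /var/log/caddy
sudo touch /var/log/caddy/access.log

# Hand ownership of the folder and its contents to the caddy user
sudo chown -R caddy:caddy /var/log/caddy

# After adding the log directive to your Caddyfile, apply the new config
sudo systemctl reload caddy
</code></pre>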
<p>Logging setup complete ✅.</p>
<h4 id="heading-tls-configuration">TLS Configuration 🤝</h4>
<p>Depending on your cloud host provider, when starting Caddy you may get an error like the following:</p>
<blockquote>
<p>parsing caddyfile tokens for 'tls': getting module named 'dns.providers.digitalocean': module not registered: dns.providers.digitalocean</p>
</blockquote>
<p>If so, you need to confirm that your Caddy binary actually contains the necessary TLS module. To do that, run this command: <code>caddy list-modules | grep -i &lt;name-of-your-tls-plugin-provider&gt;</code>. If that command doesn’t find the TLS plugin provider, then the binary doesn’t include it. This happened to me when I downloaded a prebuilt Caddy binary from Caddy’s downloads page using <code>curl</code> <em>on the server instance</em>. For some reason, that command didn’t get the correct binary in that environment. <strong>My solution</strong>: download the binary to my computer via the browser, then copy the file to my server using the <code>rsync</code> command mentioned earlier. Easy peasy 😊. Of course if you built the binary yourself then you instead need to rebuild it with the necessary plugin included.</p>
<h4 id="heading-other-cloud-providers">Other Cloud Providers</h4>
<p>If you’re deploying to a server on a cloud host provider that’s not DigitalOcean, the setup is mostly the same provided your server is running Ubuntu Linux. However, behind the scenes DO uses some juju for authentication and cert management. So the automatic TLS step must plug into that juju when hosting on DO. What does this mean for you? Simply put, you may be able to skip the automatic TLS step entirely if Caddy’s default TLS module works smoothly with your provider. If not, you will need to use a different and appropriate TLS plugin for that step. Don’t get too worried though. The TLS config steps may still be very similar because you generally want to achieve these 4 things:</p>
<ul>
<li><p>build (or download) Caddy with the TLS plugin for your cloud host</p>
</li>
<li><p>get an auth token with permission to interact with your cloud host’s SSL juju</p>
<ul>
<li>an account token with general read/write access should work, though you may want to limit the token’s scope to fit your needs</li>
</ul>
</li>
<li><p>set that token as an environment variable for the <code>caddy start</code> command</p>
<ul>
<li>just like in the DigitalOcean example, you can update the <code>Environment</code> key in your <code>caddy.service</code> file with your token, in the form <code>Environment=CLOUD_HOST_AUTH_TOKEN=your_token_here</code></li>
</ul>
</li>
<li><p>use that environment variable for the <a target="_blank" href="https://caddyserver.com/docs/caddyfile/directives/tls"><code>tls</code> directive</a> in your Caddyfile, like this:</p>
<pre><code class="lang-plaintext">  tls {
      dns &lt;tls_plugin_name&gt; {env.CLOUD_HOST_AUTH_TOKEN}
  }
</code></pre>
</li>
</ul>
<p>You can then restart the Caddy service and voilà! Automatically renewing HTTPS!</p>
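<p>If you’d like to confirm that everything worked, something like the following should do. It’s only a sketch, assuming the service is named <code>caddy</code> and using the placeholder subdomain from earlier:</p>
<pre><code class="lang-plaintext">sudo systemctl restart caddy

# Check the service status and the most recent log lines for certificate errors
systemctl status caddy --no-pager
journalctl -u caddy --no-pager -n 50

# Request only the response headers over HTTPS
curl -I https://sub.yourdomain.tld
</code></pre>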
<h3 id="heading-conclusion">Conclusion</h3>
<p>Don’t you just love it when powerful software remains as friendly to use as it is capable? With Caddy you can derive great value right from the moment you start your local server, up until you’re configuring load balancers for a busy web app. There’s a range of <a target="_blank" href="https://caddyserver.com/features">use cases</a> to consider, and an entire other half of Caddy (the API) that I didn’t even discuss! I urge you to explore the documentation to see if any of it can work for you.</p>
<hr />
<p>Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Composability, One Service at a Time]]></title><description><![CDATA[The docker-compose.yml file. For me, just one of those config files. For years I’ve worked on projects that use this file, but the file was always defined by others. I’ve never had to seriously care what it contains. All I needed to know was that doc...]]></description><link>https://incodethismeans.com/understanding-composability-one-service-at-a-time</link><guid isPermaLink="true">https://incodethismeans.com/understanding-composability-one-service-at-a-time</guid><category><![CDATA[Docker]]></category><category><![CDATA[Docker compose]]></category><category><![CDATA[docker images]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Fri, 06 Dec 2024 06:17:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/HpMihL323k0/upload/62f15476e05f71b7f989d862c173b41c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <code>docker-compose.yml</code> file. For me, just one of those config files. For years I’ve worked on projects that use this file, but the file was always defined by others. I’ve never had to seriously care what it contains. All I needed to know was that <code>docker compose up -d</code> sets up and starts all the necessary stuff. After all, isn’t Docker about making <a target="_blank" href="https://www.reddit.com/r/ProgrammerHumor/comments/cw58z7/it_works_on_my_machine/">“it works on my machine”</a> an excuse of the past? Plus, I understood some of the options and words whenever I had to glance through the file. So it wasn’t all that bad, I could ultimately ignore the file… until something breaks! 😁 Well, I’ve finally created a simple docker compose file containing just one service, for myself. The proverbial lightbulb has flickered on! 💡</p>
<p>Suppose you want to start a simple postgres db instance on your computer (or a server). You could go through the manual installation, port configuration, user and role management, blah, blah, blah… or you could just use a postgres docker image. Let’s use a postgres docker image 🙂. You probably don’t want to lose all your db data if/when the docker container stops. For that you need a <a target="_blank" href="https://docs.docker.com/engine/storage/volumes/">docker volume</a> to persist the data in a location outside the docker container. In Code This Means you should run the following shell command:</p>
<pre><code class="lang-plaintext">docker volume create postgres
</code></pre>
<p>Creates the volume, names it <code>postgres</code>, done. Straightforward enough. Now you need to actually start the docker container and your postgres db instance. You should give the container a name, obviously. You should specify that it uses that volume you just created. You should specify that the container should be detached from your terminal, because you will close down that terminal sooner or later. That’s only for the container, how about for postgres? You should specify some basic values: username, password, database name, and port. You should specify the postgres image you actually want to run. Translating allll that into docker cli options, we get this shell command:</p>
<pre><code class="lang-plaintext">docker run -d \
  --name postgres \
  -v postgres:/var/lib/postgresql/data \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=my_db \
  -p 5435:5432 \
  postgres:13.4
</code></pre>
<p>Explaining very quickly:</p>
<ul>
<li><p><code>-d</code> says detach from the terminal</p>
</li>
<li><p><code>-v</code> mounts the <code>postgres</code> volume you created earlier at the location in the container whose data should be persisted</p>
</li>
<li><p><code>-e</code> specifies environment variables inside the container</p>
</li>
<li><p><code>-p</code> specifies the host computer’s port (left side of <code>:</code>) that should be mapped to the container’s port (right side of <code>:</code>)</p>
</li>
</ul>
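<p>To check that the db is actually up, you could run something like this. It’s a minimal sketch that reuses the container name, username, and database name from the command above, and relies on the <code>psql</code> client that ships inside the postgres image:</p>
<pre><code class="lang-plaintext"># Confirm the container is running
docker ps --filter name=postgres

# Run a query inside the container
docker exec -it postgres psql -U postgres -d my_db -c 'SELECT version();'
</code></pre>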
<p>Now, here’s a <code>docker-compose.yml</code> file that combines the 2 commands you’ve used so far:</p>
<pre><code class="lang-plaintext">version: '3'

services:
  postgres:
    image: 'postgres:13.4'
    volumes:
      - 'postgres:/var/lib/postgresql/data'
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: my_db
    ports:
      - '5435:5432'

volumes:
  postgres:
</code></pre>
<p>I don’t even have to translate anything! Except <code>services</code>, maybe? You might have expected <code>containers</code> instead. Think of using Docker for setting up micro<em>services</em> 😉. It’s “services” because any particular service could be backed by <a target="_blank" href="https://docs.docker.com/reference/compose-file/services/">multiple underlying containers</a>. Composability, yay! To start your container, you already know what to do:</p>
<pre><code class="lang-plaintext">docker compose up -d
</code></pre>
<p>The <code>-d</code> option is still necessary to detach the container from your terminal.</p>
<p>And should you ever need to stop the container, <code>docker compose down</code> stops and removes it for you. If you’re determined to remove all traces of your activities 👻, add the <code>-v</code> option to also delete the named volumes declared in your compose file, which means your db data goes too.</p>
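<p>Here’s a short sketch of those teardown commands. One thing worth knowing: by default Compose prefixes volume names with the project name (usually the name of the folder containing your compose file), so the volume it manages typically isn’t the same one you created earlier with <code>docker volume create postgres</code>.</p>
<pre><code class="lang-plaintext"># Stop and remove the container and its network; named volumes are kept
docker compose down

# List volumes: the Compose-managed one is typically named &lt;project&gt;_postgres
docker volume ls

# Stop and remove everything, including named volumes declared in the compose file
docker compose down -v
</code></pre>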
<p>In light of all that, I’d explain the value prop of Docker Compose like this:</p>
<ul>
<li><p>define all your docker cli instructions and options in one place,</p>
</li>
<li><p>in language that is more friendly and approachable,</p>
</li>
<li><p>with the ability to define multiple containers and services together, and</p>
</li>
<li><p>the ease of starting/stopping/managing them together and (roughly) in sync.</p>
</li>
</ul>
<p>Now I (and perhaps you too) can appreciate why Docker Compose is so valuable for managing both simple and complex server infrastructure.</p>
<hr />
<p>Questions? Feedback? Nice words, or mean ones? Feel free to reach out to @CodeWithOz on all the socials, or on <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Building a book shelf with html and css]]></title><description><![CDATA[Intro
2022 has been a year in which I've read a lot more books than usual, many thanks to the wonders of listening to audiobooks at 2x speed. I've wanted a way to catalogue these books so I thought I should build my own virtual bookshelf, because obv...]]></description><link>https://incodethismeans.com/building-a-book-shelf-with-html-and-css</link><guid isPermaLink="true">https://incodethismeans.com/building-a-book-shelf-with-html-and-css</guid><category><![CDATA[HTML5]]></category><category><![CDATA[CSS]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Tue, 29 Nov 2022 09:07:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/ux1iYpfnTqo/upload/v1669498716971/p91Vf9bYV.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-intro">Intro</h3>
<p>2022 has been a year in which I've read a lot more books than usual, many thanks to the wonders of listening to audiobooks at 2x speed. I've wanted a way to catalogue these books so I thought I should build my own virtual bookshelf, because obviously there are no apps or websites that do this already, right? Moreover, I suspected that building a book shelf would be a good exercise in learning and practicing some CSS tricks that were not yet familiar to me. So without further ado, here's how it went! 📚🤓</p>
<h3 id="heading-starting-point">Starting Point</h3>
<p>My book shelf is modeled after the older design of iBooks as shown in this image:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668374308281/jKpuzSwY3.jpeg" alt="ibooks-book-shelf.jpeg" /></p>
<p>Source: <a target="_blank" href="https://www.cultofmac.com/197791/apple-releases-ibooks-3-0-in-the-app-store-with-continuous-scrolling-icloud-integration-ios-6-sharing/">Cult of Mac</a></p>
<p>Each layer of the shelf is essentially a <a target="_blank" href="https://en.wikipedia.org/wiki/Cuboid#Rectangular_cuboid">rectangular cuboid</a>, so I started with <a target="_blank" href="https://dev.to/joeattardi/let-s-make-a-css-cube-1fed">this</a> implementation of a cube and extended it as described in the rest of this article. Here's a copy of the finalized code in that article, which was my starting point:</p>
<iframe height="500" style="width:100%" src="https://codepen.io/OzCodes/embed/ExRXPEE?default-tab=html%2Cresult">
  See the Pen <a href="https://codepen.io/OzCodes/pen/ExRXPEE">
  css cube</a> by Uche Ozoemena (<a href="https://codepen.io/OzCodes">@OzCodes</a>)
  on <a href="https://codepen.io">CodePen</a>.
</iframe>

<h3 id="heading-sub-goal-1-stretch-the-cube-into-a-rectangular-cuboid">Sub goal 1: stretch the cube into a rectangular cuboid</h3>
<p>To do this, I started by creating 3 css variables to represent the height, width, and depth of the cube, and I substituted them into the existing code. To understand which variable should go where, I thought about the transforms that were applied to each of the faces from their initial starting position (upright facing the screen). Using the bottom face as an example, it had <code>transform: translateY(100px) rotateX(-90deg);</code>, which meant it first got pushed vertically downwards by half the height of the cube then rotated to face downwards. To reflect this I needed to change its <code>translateY(100px)</code> to <code>translateY(calc(var(--height) * 0.5))</code>. Moreover, due to the rotation, the edge that was previously the height had become the depth. So I updated the <code>height</code> property by setting it to the <code>--depth</code> variable. Reasoning similarly for the other faces, I ended up with the following changes:</p>
<pre><code class="lang-css"><span class="hljs-selector-pseudo">:root</span> {
  <span class="hljs-attribute">--height</span>: <span class="hljs-number">200px</span>;
  <span class="hljs-attribute">--width</span>: <span class="hljs-number">200px</span>;
  <span class="hljs-attribute">--depth</span>: <span class="hljs-number">200px</span>;
}

<span class="hljs-selector-class">.container</span> {
  <span class="hljs-attribute">width</span>: <span class="hljs-built_in">var</span>(--width);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--height);
  ...
}

<span class="hljs-selector-class">.cuboid</span> {
  <span class="hljs-attribute">width</span>: <span class="hljs-built_in">var</span>(--width);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--height);
  ...
}

<span class="hljs-selector-class">.cuboid__face</span> {
  <span class="hljs-attribute">width</span>: <span class="hljs-built_in">var</span>(--width);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--height);
  ...
}

<span class="hljs-selector-class">.cuboid__face--front</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * <span class="hljs-number">0.5</span>));
}

<span class="hljs-selector-class">.cuboid__face--back</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * -<span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateY</span>(<span class="hljs-number">180deg</span>);
}

<span class="hljs-selector-class">.cuboid__face--left</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateX</span>(calc(var(--width) * -<span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateY</span>(-<span class="hljs-number">90deg</span>);
  <span class="hljs-attribute">width</span>: <span class="hljs-built_in">var</span>(--depth);
}

<span class="hljs-selector-class">.cuboid__face--right</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateX</span>(calc(var(--width) * <span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateY</span>(<span class="hljs-number">90deg</span>);
  <span class="hljs-attribute">width</span>: <span class="hljs-built_in">var</span>(--depth);
}

<span class="hljs-selector-class">.cuboid__face--top</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateY</span>(calc(var(--height) * -<span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateX</span>(<span class="hljs-number">90deg</span>);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--depth);
}

<span class="hljs-selector-class">.cuboid__face--bottom</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateY</span>(calc(var(--height) * <span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateX</span>(-<span class="hljs-number">90deg</span>);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--depth);
}
</code></pre>
<p>At this point the cube still looked the same, but theoretically I could stretch it into a cuboid by simply setting the <code>--width</code> variable to a value that's higher than that of the <code>--height</code> variable. To test this, I used a value of <code>90%</code> for <code>--width</code>, which made the cube look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668376834175/h04FhNDvF.png" alt="Screen Shot 2022-11-13 at 11.00.12 PM.png" /></p>
<p>A rectangular cuboid! 🥳</p>
<p>Notice, however, that the left and right faces didn't quite adjust as expected. I initially tried to solve this, but after some head scratching I recognized that I only really needed the demarcations of the faces. In other words, if I had the front, back, top and bottom faces, I would be able to implicitly demarcate the left and right faces. Therefore, I could actually remove the side faces entirely, leaving the cuboid looking like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668376920390/_25Ze0NPF.png" alt="Screen Shot 2022-11-13 at 11.01.45 PM.png" /></p>
<p>Revisiting the iBooks screenshot above, the shelf fills up the width of the viewport, so I needed to set the <code>--width</code> variable to <code>100%</code>. Doing that made the cuboid look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668376562242/Gkm6rrzSv.png" alt="Screen Shot 2022-11-13 at 10.55.09 PM.png" /></p>
<p>Something was clearly still off at the left and right edges of the viewport - the front face seemed to be overflowing the container. My first thought was that I needed to do some trigonometry to figure out how many pixels should be subtracted from the cuboid's width in order to constrain the front face to the viewport's width. Thankfully there proved to be a far less complicated solution once I remembered that the initial cube was made by moving all the faces, particularly the front and back faces, away from their starting position. Put differently, the <code>100%</code> value is the width of the <em>initial</em> state of each face, which meant that the front face overflowed its container because it was "pulled forward" by half the cuboid's depth (recall that at this point the front face had <code>transform: translateZ(calc(var(--depth) * 0.5));</code>). To further illustrate this, I added another cuboid face that had no transforms so that it would occupy the initial position of the faces. You can see it in red in this screenshot:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668376043889/-9ORzH6oa.png" alt="Screen Shot 2022-11-13 at 10.45.50 PM.png" /></p>
<p>That untransformed red face filled exactly the width of the container as the <code>100%</code> value would suggest. Thus the solution became clear: the front face needed to be kept at the initial position and all the other faces needed to be "pushed back" based on the depth of the cuboid. Returning to the example of the bottom face, its <code>transform</code> at this point was <code>transform: translateY(calc(var(--height) * 0.5)) rotateX(-90deg);</code>, which meant it first got pushed vertically downwards by half the height of the cube then rotated to face downwards. The first step should actually be to push the face backwards by half the depth of the cuboid so that its final position would span from the bottom edge of the front face to the bottom edge of the back face after the <code>translateY</code> and <code>rotateX</code> are applied. In Code This Means a <code>translateZ(calc(var(--depth) * -0.5))</code> was needed at the beginning of the existing transform. Reasoning similarly for the other faces led to these changes:</p>
<pre><code class="lang-css"><span class="hljs-selector-class">.cuboid__face--front</span> {
}

<span class="hljs-selector-class">.cuboid__face--back</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * -<span class="hljs-number">1</span>)) <span class="hljs-built_in">rotateY</span>(<span class="hljs-number">180deg</span>);
}

<span class="hljs-selector-class">.cuboid__face--top</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * -<span class="hljs-number">0.5</span>)) <span class="hljs-built_in">translateY</span>(calc(var(--height) * -<span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateX</span>(<span class="hljs-number">90deg</span>);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--depth);
}

<span class="hljs-selector-class">.cuboid__face--bottom</span> {
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * -<span class="hljs-number">0.5</span>)) <span class="hljs-built_in">translateY</span>(calc(var(--height) * <span class="hljs-number">0.5</span>)) <span class="hljs-built_in">rotateX</span>(-<span class="hljs-number">90deg</span>);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">var</span>(--depth);
}
</code></pre>
<p>Notice that I removed the <code>transform</code> entirely from the <code>.cuboid__face--front</code> because the front face needed to be kept in its starting position as mentioned above. After these changes the cuboid looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668377999567/vvBR9x2RU.png" alt="Screen Shot 2022-11-13 at 11.19.34 PM.png" /></p>
<p>Much better! 👌🏾</p>
<h3 id="heading-sub-goal-2-complete-the-shelf-appearance">Sub goal 2: complete the shelf appearance</h3>
<p>First, in the iBooks screenshot there's actually no top face. The flat horizontal base of the upper shelf covers what would be the top face. I chose to remove the top face entirely and extend the height of the back face such that it would hit the topmost part of the shelf container. The additional height didn't need to be precise, it just needed to be enough to cover the distance between the top edge of the back face and the top of the shelf container. <code>3rem</code> proved to be sufficient:</p>
<pre><code class="lang-css"><span class="hljs-selector-class">.cuboid__face--back</span> {
  <span class="hljs-attribute">--extra-height</span>: <span class="hljs-number">3rem</span>;
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * -<span class="hljs-number">1</span>)) <span class="hljs-built_in">rotateY</span>(<span class="hljs-number">180deg</span>);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">calc</span>(var(--height) + <span class="hljs-built_in">var</span>(--extra-height));
}
</code></pre>
<p>As a result of that change the cuboid looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668411449509/MCcD2CHwv.png" alt="Screen Shot 2022-11-14 at 8.37.08 AM.png" /></p>
<p>The added height actually caused the face to grow downward instead of upward! 🤔 Why was that? An element's top edge stays anchored where the layout put it, so any extra <code>height</code> extends the box downward. The <code>rotateY(180deg)</code> in the back face's <code>transform</code> doesn't change that, because it only spins the face around its vertical axis and leaves up and down untouched. I rectified this easily by prepending a negative <code>translateY</code> equal to the extra height, which shifts the whole face upward so that the extra height ends up above the original top edge. In Code This Means:</p>
<pre><code class="lang-css"><span class="hljs-selector-class">.cuboid__face--back</span> {
  <span class="hljs-attribute">--extra-height</span>: <span class="hljs-number">3rem</span>;
  <span class="hljs-attribute">transform</span>: <span class="hljs-built_in">translateY</span>(calc(var(--extra-height) * -<span class="hljs-number">1</span>)) <span class="hljs-built_in">translateZ</span>(calc(var(--depth) * -<span class="hljs-number">1</span>)) <span class="hljs-built_in">rotateY</span>(<span class="hljs-number">180deg</span>);
  <span class="hljs-attribute">height</span>: <span class="hljs-built_in">calc</span>(var(--height) + <span class="hljs-built_in">var</span>(--extra-height));
}
</code></pre>
<p>Now the cuboid looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668412029845/ygVnvra5I.png" alt="Screen Shot 2022-11-14 at 8.46.54 AM.png" /></p>
<p>I then applied <code>overflow: hidden</code> on the shelf container to trim off the overflow at the top, after which I also needed to remove the <code>border: 2px solid black;</code> from the faces to make sure the alignment of the edges remained intact. The outcome was this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668412397391/eoEwtxPYj.png" alt="Screen Shot 2022-11-14 at 8.53.06 AM.png" /></p>
<p>Getting closer! Time to flesh out some of the finer details. First I added the thickness of the shelf floor with a simple div below the <code>.container</code> div, and gave it a slight height. Next I updated the background colors to use shades of brown that look closer to wood. I then added box shadows to give a bit more depth to the look of the shelf. Lastly I gave the front face a transparent background so that it wouldn't obscure the faces behind it, and I removed the text in the <code>.cuboid__face</code> divs. These changes left the cuboid looking like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668460053574/j3Wm6elVA.png" alt="Screen Shot 2022-11-14 at 10.07.26 PM.png" /></p>
<p>Almost there!</p>
<h3 id="heading-sub-goal-3-adding-books-to-the-shelf">Sub goal 3: adding books to the shelf</h3>
<p>I needed to achieve two main objectives: ensure that the books overlay the back and bottom faces of the cuboid, and position them such that they would appear to be sitting on the floor of the shelf. The first objective was achieved by creating a <code>.books-container</code> div whose height was equal to the height of the cuboid, and giving it <code>position: relative</code> to ensure that it overlaid the cuboid container (because the <code>.container</code> was in its own layer due to having <code>position: absolute</code>). The second objective was achieved by adding a row of divs, each representing a book, within the <code>.books-container</code>, and adding a little <code>padding-bottom</code> to the <code>.books-container</code> to simulate the books standing in the middle of the shelf floor as in the iBooks screenshot. After these changes the shelf looked like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668463700598/gxx9oP0nQ.png" alt="Screen Shot 2022-11-14 at 11.08.15 PM.png" /></p>
<p>🙌🏾 🙌🏾</p>
<p>One last detail: the shadow at the top of the shelf that simulates the effect of light being blocked off by the "ceiling" of the shelf. I added that using a box shadow, leaving the shelf looking like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1668464260134/XgG_cI-iB.png" alt="Screen Shot 2022-11-14 at 11.17.34 PM.png" /></p>
<p>😍 ✅</p>
<p>Putting it all together, the code thus far is viewable in <a target="_blank" href="https://codepen.io/OzCodes/pen/ZERyQjX">this codepen</a>.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>This was a fun little side project through which I learned about the <code>perspective</code> property and learned how to use and think of different CSS transforms (I sliced the air with my hands quite a lot because the order of transforms is really important). One possible improvement would be to angle the top shadow on the left and right faces to give a more realistic depiction of how the shadows fall, as seen in the iBooks screenshot. I tried to achieve that with pseudo-elements but didn't succeed in making it look realistic enough. The final live bookshelf showing my books is available at <a target="_blank" href="https://halfbaked.ucheoz.tech">halfbaked.ucheoz.tech</a> so feel free to have a look and let me know what you think. You can always reach me at <a target="_blank" href="https://twitter.com/cinexa7254">@cinexa7254</a> on Twitter and you can <a target="_blank" href="https://www.linkedin.com/in/uchechukwu-ozoemena/">connect with me on LinkedIn</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Taming Dependencies in Integration Tests]]></title><description><![CDATA[Integration Tests and Stubs
Integration tests are used to test interrelated parts of an app. Some typical candidates for integration tests are code that interacts with a database and code that relies upon different standalone modules within a codebas...]]></description><link>https://incodethismeans.com/taming-dependencies-in-integration-tests</link><guid isPermaLink="true">https://incodethismeans.com/taming-dependencies-in-integration-tests</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[Testing]]></category><category><![CDATA[test driven development]]></category><category><![CDATA[mock]]></category><category><![CDATA[stub]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Sun, 19 Jun 2022 20:16:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/cuNdbhxijVw/upload/v1655576134567/2yhxmgKPk.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-integration-tests-and-stubs">Integration Tests and Stubs</h2>
<p>Integration tests are used to test interrelated parts of an app. Some typical candidates for integration tests are code that interacts with a database and code that relies upon different standalone modules within a codebase. Often these tests will require you to know and/or control the output and side-effects of some of the interrelated modules, and this is where stubbing comes in. Very briefly, stubs allow you to create an alternate implementation of a function, with the added ability to track useful information such as when that function is executed, how many times it has been executed, and what arguments it was executed with. Therefore, stubs give you the ability to define how those interrelated modules will behave during your tests. So what do you do when you want to test a function that relies on values defined in other scripts? You could use one of the 2 packages I learned about this week: <a target="_blank" href="https://github.com/thlorenz/proxyquire">proxyquire</a> and <a target="_blank" href="https://github.com/iambumblehead/esmock">esmock</a>.</p>
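<p>To make the idea of a stub concrete, here is a minimal hand-rolled sketch. It is only an illustration of the concept, not how a library like sinon is implemented, and the names <code>createStub</code> and <code>fetchUser</code> are hypothetical:</p>

```javascript
// Minimal hand-rolled stub: returns a fixed value and records how it was called.
// Illustration only; real stubbing libraries offer far more than this.
function createStub(returnValue) {
  function stub(...args) {
    stub.callCount += 1;   // how many times the stub has been executed
    stub.calls.push(args); // the arguments of each execution
    return returnValue;
  }
  stub.callCount = 0;
  stub.calls = [];
  return stub;
}

// Hypothetical usage: stand in for a function that would normally query a database.
const fetchUser = createStub({ id: 1, name: 'Ada' });
const user = fetchUser(1);
```

<p>After that call, <code>fetchUser.callCount</code> and <code>fetchUser.calls</code> hold the tracking information that a test can assert on.</p>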
<h2 id="heading-proxyquire">proxyquire</h2>
<p>The project's README summarizes its functionality succinctly:</p>
<blockquote>
<p>Proxies nodejs's require in order to make overriding dependencies during testing easy while staying totally unobtrusive</p>
</blockquote>
<p>In other words, proxyquire provides a way to <code>require()</code> modules and specify which of their imports are replaced by the custom functionality you have defined in your tests. I won't repeat basic code samples here because the README provides <a target="_blank" href="https://github.com/thlorenz/proxyquire#example">good examples</a>. One feature I found particularly helpful is that proxyquire detects which of the <code>require()</code>d values are not stubbed, and ensures that the file continues to use the original implementation for those values. And even better, it gives you the flexibility to disable this behavior via the <a target="_blank" href="https://github.com/thlorenz/proxyquire#preventing-call-thru-to-original-dependency">"call thru"</a> option. Pretty neat!</p>
<p>Beware though, you will need to supply your stubbed dependencies using the exact same path that the parent file uses to import them. To illustrate this, consider a project with file structure as follows:</p>
<pre><code class="lang-plaintext">📦ballonDor
 📂modules
   - 📜helpers.js
   - 📜action.js
 📂test
   - 📜action.spec.js
</code></pre>
<p>The scripts have the following contents:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// modules/helpers.js</span>
<span class="hljs-built_in">module</span>.exports = { <span class="hljs-attr">predictWinner</span>: <span class="hljs-function">() =&gt;</span> <span class="hljs-built_in">Math</span>.random() &lt; <span class="hljs-number">0.5</span> ? <span class="hljs-string">'player1'</span> : <span class="hljs-string">'player2'</span> };
</code></pre>
<pre><code class="lang-javascript"><span class="hljs-comment">// modules/action.js</span>
<span class="hljs-keyword">const</span> { predictWinner } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'./helpers'</span>);

<span class="hljs-keyword">const</span> playerNamesMap = <span class="hljs-keyword">new</span> <span class="hljs-built_in">Map</span>([
  [<span class="hljs-string">'player1'</span>, <span class="hljs-string">'Kylian Mbappe'</span>],
  [<span class="hljs-string">'player2'</span>, <span class="hljs-string">'Erling Haaland'</span>],
  [<span class="hljs-string">'player3'</span>, <span class="hljs-string">'Karim Benzema'</span>]
]);
<span class="hljs-keyword">const</span> getWinnerName = <span class="hljs-function">() =&gt;</span> playerNamesMap.get(predictWinner());

<span class="hljs-built_in">module</span>.exports = { getWinnerName };
</code></pre>
<p>We want to test that the <code>getWinnerName</code> function gets the correct name, so we do this by supplying a value (<code>"player3"</code>) that it would normally not receive. We can see that <code>predictWinner</code> only returns <code>"player1"</code> or <code>"player2"</code>, so we need a custom implementation of <code>predictWinner</code> that will return <code>"player3"</code>. The path to the stubbed import should <strong>NOT be relative to the test file</strong>, but instead should be the exact same as the path used in the module file. In Code This Means that the correct way to stub the import would be as follows:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// test/action.spec.js</span>
<span class="hljs-keyword">const</span> { expect } = <span class="hljs-built_in">require</span>(<span class="hljs-string">'chai'</span>);
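<span class="hljs-keyword">const</span> sinon = <span class="hljs-built_in">require</span>(<span class="hljs-string">'sinon'</span>);
<span class="hljs-keyword">const</span> proxyquire = <span class="hljs-built_in">require</span>(<span class="hljs-string">'proxyquire'</span>);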

describe(<span class="hljs-string">`getWinnerName`</span>, <span class="hljs-function"><span class="hljs-keyword">function</span> (<span class="hljs-params"></span>) </span>{
  it(<span class="hljs-string">`picks the correct name`</span>, <span class="hljs-function"><span class="hljs-keyword">function</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-keyword">const</span> customPredictWinner = sinon.stub().returns(<span class="hljs-string">'player3'</span>);
    <span class="hljs-keyword">const</span> { getWinnerName } = proxyquire(<span class="hljs-string">'../modules/action'</span>, {
      <span class="hljs-comment">// "./helpers" is the same path used in modules/action.js</span>
      <span class="hljs-string">'./helpers'</span>: { <span class="hljs-attr">predictWinner</span>: customPredictWinner },
    });
    expect(getWinnerName()).to.equal(<span class="hljs-string">'Karim Benzema'</span>);
    expect(customPredictWinner.called).to.be.true;
  });
});
</code></pre>
<p>Notice that the first argument of <code>proxyquire</code> is the correct path to the module being imported <strong>relative to the test file</strong>, but the <code>customPredictWinner</code> stub is supplied using the exact same path that's present in <code>modules/action.js</code>.</p>
<p>Unfortunately, proxyquire doesn't support ES module imports, which brings us to the second package: esmock.</p>
<h2 id="heading-esmock">esmock</h2>
<p>esmock is very similar to proxyquire, providing essentially the same functionality but with a focus on environments where ES modules are imported using the <code>import</code> keyword rather than <code>require</code>. Its README gives a similarly straightforward description:</p>
<blockquote>
<p>esmock provides native ESM import mocking for unit tests.</p>
</blockquote>
<p>esmock also covers modules loaded with <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/import#dynamic_imports">dynamic <code>import()</code>s</a> by using a slightly different syntax, <code>esmock.p('pathToDynamicallyImportedModule')</code>. One potential gotcha with esmock is that it is always asynchronous, so all the assertions that rely on esmock imports will need to <code>await</code> the import statements and therefore be contained within <code>async</code> functions. Probably not a big deal in most scenarios, but something to be aware of nevertheless.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Both proxyquire and esmock do the same job quite well: they give you an efficient and easy way to control the behavior of imported dependencies in your tests. Before I discovered these options, I would use the much more tedious strategy of globally stubbing any imports whose behavior I wanted to control, and then resetting to the original behavior when necessary. With proxyquire and esmock, I can now do the opposite by preserving the original behavior except when I need something custom, which is quite nice.</p>
<p>Happy learning!</p>
]]></content:encoded></item><item><title><![CDATA[Bundling a Non-Modular Codebase: Lessons Learned]]></title><description><![CDATA[Intro: what did I do and why did I do it?
I recently had the unenviable task of refactoring my employer's main frontend codebase to use ES6 modules. The codebase is reasonably large, with more than 100k lines of JavaScript code across more than 30 fi...]]></description><link>https://incodethismeans.com/bundling-a-non-modular-codebase-lessons-learned</link><guid isPermaLink="true">https://incodethismeans.com/bundling-a-non-modular-codebase-lessons-learned</guid><category><![CDATA[JavaScript]]></category><category><![CDATA[ES6]]></category><category><![CDATA[javascript modules]]></category><category><![CDATA[Build tool]]></category><category><![CDATA[build]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Sun, 22 May 2022 22:11:23 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/McX3XuJRsUM/upload/v1653257331543/O4wnAkhXW.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro-what-did-i-do-and-why-did-i-do-it">Intro: what did I do and why did I do it?</h2>
<p>I recently had the unenviable task of refactoring my employer's main frontend codebase to use <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Modules">ES6 modules</a>. The codebase is reasonably large, with more than 100k lines of JavaScript code across more than 30 files. The files were not modular, meaning they defined their own variables and functions in the <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Glossary/Global_scope">global scope</a>, where logic in other files would expect to find those variables and functions... If you're cringing at that, yes that's one of the reasons why I had this task, to finally do away with the problems caused by that system. As you can probably imagine, the developer experience wasn't great with this codebase, and we had no way of removing inactive crufty code from the final build. This meant our builds were larger than they needed to be, and we couldn't significantly and efficiently reduce load time for the end user. So, there I stood, with the weight of these problems on my shoulders. <a target="_blank" href="https://esbuild.github.io/">esbuild</a> was our bundler of choice, particularly because of its speed - that animation on its landing page is pretty convincing.</p>
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<p>Thankfully I finished off the task conclusively, and the rest of this article is a reflection on the lessons I learned in the process.</p>
<h3 id="heading-1-be-thoughtful-of-when-to-import-modules">1: Be thoughtful of when to import modules</h3>
<p><a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/import#dynamic_imports">Dynamic <code>import()</code></a> is a powerful way to defer loading scripts until they're needed, so it's important to take advantage of this. Top-level imports are perfect if they're absolutely needed when your bundle is parsed and executed, but with a bit more thought you may find that most of your imports are actually not strictly needed as top-level imports. A good bundler with code splitting will isolate the modules that are <code>import()</code>ed into separate chunks, thereby excluding them from your main js bundle and reducing the time required to parse and execute said bundle. Of course your code can get very messy if you're <code>import()</code>ing a lot of little functions, so some balance is needed. A good policy is to group together and <code>import()</code> modules that are limited to specific parts of your web app. In my case I used <code>import()</code> for modules that were specific to certain pages.</p>
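<p>As a rough sketch of that page-specific policy, the helper below wraps a set of loaders so each page module is requested at most once. <code>createPageLoader</code> and the <code>shop</code> page are hypothetical names, and plain promises stand in for the real <code>import()</code> calls so the sketch runs without any page files:</p>

```javascript
// Hypothetical helper: wraps page-specific loaders and caches each result,
// so a page module is only requested the first time it is needed.
function createPageLoader(loaders) {
  const cache = new Map();
  return function loadPage(name) {
    if (!cache.has(name)) {
      const loader = loaders[name];
      if (!loader) {
        return Promise.reject(new Error(`Unknown page: ${name}`));
      }
      cache.set(name, loader());
    }
    return cache.get(name);
  };
}

// In a real app each loader would be a dynamic import, for example:
//   shop: () => import('./pages/shop.js')
// A resolved promise stands in here (an assumption for illustration).
const loadPage = createPageLoader({
  shop: () => Promise.resolve({ render: () => 'shop page' }),
});
```

<p>Calling <code>loadPage('shop')</code> returns a promise for the module, so the caller decides when the cost of loading that page's code is paid.</p>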
<h3 id="heading-2-beware-of-circular-dependencies">2: Beware of circular dependencies</h3>
<p>Circular dependencies can be introduced very easily due to the reusable nature of modules, so there is value in being mindful of when exactly functions are executed. Depending on how your files are organized, it may be almost impossible to avoid some files importing each other. Thankfully esbuild is quite good at finding and ordering imported modules to minimize conflicts, and I imagine other bundlers are similarly effective in this regard. Nevertheless, for the gnarly problems that persist, I found it helpful to move the affected logic out of the module's top-level scope and into functions that are executed when necessary. In Code This Means:</p>
<pre><code class="lang-javascript"><span class="hljs-comment">// module1.js</span>
<span class="hljs-keyword">import</span> { objectOfInterest } <span class="hljs-keyword">from</span> <span class="hljs-string">'./module2'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> anInterestingString = <span class="hljs-string">'an interesting string'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">someFunc</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> { prop } = objectOfInterest;
  <span class="hljs-keyword">return</span> <span class="hljs-string">`value of "prop": <span class="hljs-subst">${prop}</span>`</span>;
}

someFunc();

<span class="hljs-comment">// module2.js</span>
<span class="hljs-keyword">import</span> { anInterestingString } <span class="hljs-keyword">from</span> <span class="hljs-string">'./module1'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">const</span> objectOfInterest = {
  <span class="hljs-attr">foo</span>: <span class="hljs-string">'bar'</span>,
  <span class="hljs-attr">prop</span>: <span class="hljs-string">`check this out: <span class="hljs-subst">${anInterestingString}</span>`</span>
};
</code></pre>
<p>can be changed to:</p>
<pre><code class="lang-diff">// module1.js
export function someFunc() {
<span class="hljs-deletion">-  const { prop } = objectOfInterest;</span>
<span class="hljs-deletion">-  return `value of "prop": ${prop}`;</span>
<span class="hljs-addition">+  const { getPropVal } = objectOfInterest;</span>
<span class="hljs-addition">+  return `value of "prop": ${getPropVal()}`;</span>
}

// module2.js
export const objectOfInterest = {
  foo: 'bar',
<span class="hljs-deletion">-  prop: `check this out: ${anInterestingString}`</span>
<span class="hljs-addition">+  getPropVal: () =&gt; `check this out: ${anInterestingString}`</span>
};
</code></pre>
<p>That is a contrived and simplified example that mimics an actual situation I faced. The idea is that <code>objectOfInterest</code> was being initialized with the value of <code>anInterestingString</code> from the other module, so the value of <code>anInterestingString</code> needed to be known when <code>objectOfInterest</code> was defined. Both modules depend on each other, hence the possibility of errors due to undefined/uninitialized values. However, by switching the definition of <code>objectOfInterest</code> to use a function that returns the desired string, the value of <code>anInterestingString</code> does not need to be known when <code>objectOfInterest</code> is defined, hence removing the possibility of conflicts.</p>
<h3 id="heading-3-choose-between-minification-and-sourcemaps-for-better-debugging">3: Choose between minification and sourcemaps for better debugging</h3>
<p>Consider disabling minification if sourcemaps worsen your debugging experience. A sourcemap allows the browser to reconstruct and present the original file in the browser's developer tools, but for various reasons you may be unsatisfied with the reconstructed output. In my case the reconstructed files did not get the same names of variables and functions as the original files, so I was essentially only getting beautified versions of the minified files. This made debugging quite difficult, as you can imagine. Disabling minification gave me exactly what I needed because the variable names remained almost exactly the same, allowing me to traverse the code as easily as I would in my code editor. Keep in mind, however, that minification helps significantly reduce the size of your final bundle and also obfuscates your source code, so you probably only want to disable it for debug builds rather than production builds.</p>
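<p>One way to keep that split between debug and production builds is to derive the build options from a mode flag. This is only a sketch: <code>buildOptions</code> is a hypothetical helper, and the entry point and output folder are assumed values. The option names themselves (<code>entryPoints</code>, <code>bundle</code>, <code>outdir</code>, <code>sourcemap</code>, <code>minify</code>) are esbuild build options:</p>

```javascript
// Hypothetical helper that returns esbuild build options for a given mode.
// Debug builds skip minification so variable names stay readable in dev tools.
function buildOptions(mode) {
  const isDebug = mode === 'debug';
  return {
    entryPoints: ['src/index.js'], // assumed entry point
    bundle: true,
    outdir: 'build',               // assumed output folder
    sourcemap: true,
    minify: !isDebug,
  };
}

// With esbuild installed, usage could look like:
//   require('esbuild').build(buildOptions('debug'));
```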
<h3 id="heading-4-preserve-nested-folders-by-putting-a-non-nested-entrypoint-first">4: Preserve nested folders by putting a non-nested entrypoint first</h3>
<p>If it is important to preserve your folder structure in the final build folder, then you should consider ordering your entrypoints such that those in the root of the project come before those in nested folders. This may be specific to esbuild because it seems like an implementation quirk, but I could be wrong. Consider a folder structure like this:</p>
<pre><code><span class="hljs-operator">-</span> src
  <span class="hljs-operator">|</span>
  <span class="hljs-operator">-</span> index.js
  <span class="hljs-operator">-</span> pages
    <span class="hljs-operator">|</span>
    <span class="hljs-operator">-</span> shop.js
  <span class="hljs-operator">-</span> funcs.js
  <span class="hljs-operator">-</span> helpers
    <span class="hljs-operator">|</span>
    <span class="hljs-operator">-</span> utils.js
</code></pre><p><code>index.js</code> and <code>pages/shop.js</code> are entrypoints. esbuild accepts entry points as an array of file paths, and I found that using <code>['pages/shop.js', 'index.js']</code> caused the <code>pages</code> folder to not be recreated in the final build folder. Instead, the equivalent of <code>shop.js</code> was placed at the root of the build folder. In other words, the build folder looked like this:</p>
<pre><code><span class="hljs-operator">-</span> build
  <span class="hljs-operator">|</span>
  <span class="hljs-operator">-</span> index.js
  <span class="hljs-operator">-</span> shop.js
  ...
</code></pre><p>When I changed the order of entrypoints to <code>['index.js', 'pages/shop.js']</code>, the original folder structure was maintained:</p>
<pre><code><span class="hljs-operator">-</span> build
  <span class="hljs-operator">|</span>
  <span class="hljs-operator">-</span> index.js
  <span class="hljs-operator">-</span> pages
    <span class="hljs-operator">|</span>
    <span class="hljs-operator">-</span> shop.js
  ...
</code></pre><h2 id="heading-conclusion">Conclusion</h2>
<p>ES modules are a great way to enforce modularity in a codebase, and they allow bundlers to give you nice features like tree-shaking. Most, if not all, modern frameworks already revolve around using ES modules, so you may not even need to make a decision at this point. However, if you're starting a project from scratch, I highly recommend that you organize your files using ES modules, even if you do not immediately plan on bundling your js. Modularity comes with other important benefits such as code isolation and often better readability, and ES modules provide us with a unified standard module system compared to some of the differing and sometimes incompatible module systems of the past. Don't hesitate to take advantage of this awesome feature!</p>
<p>Feel free to reach out to ask any questions, correct errors, or just say hi to me <a target="_blank" href="https://twitter.com/cinexa7254">@cinexa7254</a> on Twitter. Thanks for your time!</p>
]]></content:encoded></item><item><title><![CDATA[Scripting CSS transforms with DOMMatrix]]></title><description><![CDATA[I recently discovered that the css transform of an element can be set to the stringified value of a DOMMatrix. This means that we can use a DOMMatrix instance to construct the translates, scales, and rotates that we need, and then set that instance ...]]></description><link>https://incodethismeans.com/scripting-css-transforms-with-dommatrix</link><guid isPermaLink="true">https://incodethismeans.com/scripting-css-transforms-with-dommatrix</guid><category><![CDATA[CSS]]></category><category><![CDATA[JavaScript]]></category><dc:creator><![CDATA[Uche Ozoemena]]></dc:creator><pubDate>Sun, 17 Apr 2022 22:43:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/unsplash/05A-kdOH6Hw/upload/v1652988357121/7_tZoENAT.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I recently discovered that the css <code>transform</code> of an element can be set to the stringified value of a <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/DOMMatrix/DOMMatrix"><code>DOMMatrix</code></a>. This means that we can use a <code>DOMMatrix</code> instance to construct the <code>translate</code>s, <code>scale</code>s, and <code>rotate</code>s that we need, and then set that instance as the value of the element's <code>transform</code>. In Code This Means we can do this:</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">updateTransform</span>(<span class="hljs-params">element</span>) </span>{
  element.style.transform = <span class="hljs-keyword">new</span> DOMMatrix([val1, val2, val3, val4, val5, val6]);
}
</code></pre>
<p>You may be wondering how this matches up to the css transform functions that you already know, which involve syntax like <code>transform: translate(10px, 50px) scale(0.9)</code>. The answer is that the familiar syntax is represented by a <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/DOMMatrix#usage_notes">special matrix</a> under the hood, so using <code>DOMMatrix</code> is our way of accessing and manipulating that matrix directly.</p>
<p>A simple example is that for an element with those same transforms I used above (<code>translate(10px, 50px) scale(0.9)</code>), the corresponding <code>DOMMatrix</code> looks like this:</p>
<pre><code class="lang-javascript">{
    <span class="hljs-string">"a"</span>: <span class="hljs-number">0.9</span>,
    <span class="hljs-string">"b"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"c"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"d"</span>: <span class="hljs-number">0.9</span>,
    <span class="hljs-string">"e"</span>: <span class="hljs-number">10</span>,
    <span class="hljs-string">"f"</span>: <span class="hljs-number">50</span>,
    <span class="hljs-string">"m11"</span>: <span class="hljs-number">0.9</span>,
    <span class="hljs-string">"m12"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m13"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m14"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m21"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m22"</span>: <span class="hljs-number">0.9</span>,
    <span class="hljs-string">"m23"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m24"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m31"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m32"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m33"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"m34"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m41"</span>: <span class="hljs-number">10</span>,
    <span class="hljs-string">"m42"</span>: <span class="hljs-number">50</span>,
    <span class="hljs-string">"m43"</span>: <span class="hljs-number">0</span>,
    <span class="hljs-string">"m44"</span>: <span class="hljs-number">1</span>,
    <span class="hljs-string">"is2D"</span>: <span class="hljs-literal">true</span>,
    <span class="hljs-string">"isIdentity"</span>: <span class="hljs-literal">false</span>
}
</code></pre>
<p>... 👀 that's complicated!</p>
<p>Despite how confusing that may look, there are 2 main takeaways I want you to have:</p>
<ul>
<li>there are 2 types of matrices, 2d and 3d. 2d requires 6 values to form the matrix, whereas 3d requires 16 values. 2d can represent simpler transforms (like my example above) and is represented by the properties from <code>a</code>-<code>f</code> in the object shown above, whereas 3d can represent more complex transforms and is represented by the properties from <code>m11</code> to <code>m44</code>.</li>
<li>you can directly change these values to update the transforms applied to your DOM nodes! This was the light bulb 💡 moment for me. I was trying to update the <code>translateX</code> and <code>translateY</code> of an element without using ugly regexes like <code>/translate\(\d+,\d+\)/</code> (yuck! 🤮) because the element had its <code>transform</code> set by an external stylesheet.</li>
</ul>
<p>My solution built on this second takeaway, particularly because you can construct a <code>DOMMatrix</code> using the existing <code>transform</code> of an element. In Code This Means:</p>
<pre><code class="lang-javascript"><span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">updateScaleUsingMatrix</span>(<span class="hljs-params">element</span>) </span>{
  <span class="hljs-keyword">const</span> currentTransformMatrix = <span class="hljs-keyword">new</span> DOMMatrix(getComputedStyle(element).transform);
  <span class="hljs-comment">// I can bump up the scale by just incrementing the numbers!</span>
  currentTransformMatrix.a += <span class="hljs-number">0.3</span>;
  currentTransformMatrix.d += <span class="hljs-number">0.3</span>;
  element.style.transform = currentTransformMatrix;
  <span class="hljs-comment">// or more explicitly,</span>
  <span class="hljs-comment">// element.style.transform = currentTransformMatrix.toString();</span>
}
</code></pre>
<h3 id="heading-additional-notes">Additional Notes</h3>
<ul>
<li>Incrementing numerical css values directly is part of the promise of the <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/CSS_Typed_OM_API">Typed OM API</a> that's part of <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Houdini">CSS Houdini</a>. Unfortunately it's not yet fully implemented by browsers, most notably in <a target="_blank" href="https://ishoudinireadyyet.com/">Safari and Firefox</a>. So you could view this matrix method as a fallback for those two browsers if you're already using Typed OM in the chromium-based browsers that support it.</li>
<li>You could claim, fairly in my opinion, that this is not very intuitive. There's still the challenge of knowing which of those properties from <code>a</code>-<code>f</code> and <code>m11</code>-<code>m44</code> represent the normal css transform functions that we're used to. For simpler transforms involving <code>scale()</code> and <code>translate()</code>, <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/DOMMatrix/DOMMatrix#examples">MDN has an example</a> showing how <code>scale()</code> and <code>translate()</code> map to the matrix properties. There's also <a target="_blank" href="https://github.com/ismailman/decompose-dommatrix">this package</a> that will give you the familiar syntax of transform functions, so you can use the <code>DOMMatrix</code> as your intermediate only for manipulating the numbers. You can then convert back to the familiar transform functions to preserve readability. To be clear I have no affiliation with that package, I only came across it during my research. Most importantly, my recommendation is that you should test and see what properties of the 2d/3d matrix change when you perform your desired update, and then write your logic to update only that property.</li>
<li><code>DOMMatrix</code> has the alias <code>WebKitCSSMatrix</code> in some browsers. In Code This Means <code>const getMatrix = window.DOMMatrix || window.WebKitCSSMatrix;</code> can cover both possibilities.</li>
</ul>
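<p>For the simple case used earlier in this post, the mapping from transform functions to matrix values can be sketched by hand. The helper below is hypothetical and assumes a <code>translate()</code> followed by a uniform <code>scale()</code> with no rotation or skew, so <code>b</code> and <code>c</code> stay at 0:</p>

```javascript
// Builds the 2d matrix string for `translate(tx, ty) scale(s)`, in that order.
// Assumes no rotation or skew, so only a, d, e and f carry information.
function translateScaleToMatrix(tx, ty, s) {
  const [a, b, c, d, e, f] = [s, 0, 0, s, tx, ty];
  return `matrix(${a}, ${b}, ${c}, ${d}, ${e}, ${f})`;
}

const matrixString = translateScaleToMatrix(10, 50, 0.9);
// matrixString is "matrix(0.9, 0, 0, 0.9, 10, 50)"
// In a browser it could be applied with: element.style.transform = matrixString;
```

<p>Those six numbers line up with the <code>a</code>-<code>f</code> values in the <code>DOMMatrix</code> shown above, which is why bumping <code>a</code> and <code>d</code> changes the scale while <code>e</code> and <code>f</code> control the translation.</p>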
<p>So, I hope you found this as fascinating as I did. I punched the air and let out a loud "yussssss!" (yes I'm that type of dev 😂) when I realized this could work for me. Feel free to reach out to share your enthusiasm, ask any questions, correct errors, or just say hi to me <a target="_blank" href="https://twitter.com/cinexa7254">@cinexa7254</a> on Twitter. Thanks for your time!</p>
<h2 id="heading-tldr">TL;DR</h2>
<ul>
<li>You can use <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/API/DOMMatrix"><code>DOMMatrix</code></a> to represent your css transforms and easily manipulate the numbers like any other javascript variable.</li>
<li>You can map the familiar syntax of css transform functions to individual properties in the matrix either by testing manually to see what matrix properties change as you update the transform, or by using a package that does this for you, such as <a target="_blank" href="https://github.com/ismailman/decompose-dommatrix">this one</a> (I'm NOT affiliated with that project, I only came across it as part of my research).</li>
</ul>
]]></content:encoded></item></channel></rss>