<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tech @ Zeko AI]]></title><description><![CDATA[Real-world engineering stories from Zeko AI. We share the challenges we face, the solutions we discover, and the hard-won lessons learned while building and scaling our technology.]]></description><link>https://tech.zeko.ai</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 10:18:32 GMT</lastBuildDate><atom:link href="https://tech.zeko.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[LangFuse Upgrade Experience: Navigating from v3.29 to v3.54]]></title><description><![CDATA[LangFuse is awesome. If you're working with Large Language Models (LLMs), you probably know how useful it is for seeing what your AI is actually doing. It helps us track everything, keep an eye on costs, get feedback, and manage the prompts we feed o...]]></description><link>https://tech.zeko.ai/langfuse-upgrade-experience-navigating-from-v329-to-v354</link><guid isPermaLink="true">https://tech.zeko.ai/langfuse-upgrade-experience-navigating-from-v329-to-v354</guid><category><![CDATA[langfuse]]></category><category><![CDATA[llm]]></category><category><![CDATA[Prompt Engineering]]></category><dc:creator><![CDATA[Hemendra Chaudhary]]></dc:creator><pubDate>Fri, 02 May 2025 12:40:54 GMT</pubDate><content:encoded><![CDATA[<p><a target="_blank" href="https://langfuse.com">LangFuse</a> is awesome. If you're working with Large Language Models (LLMs), you probably know how useful it is for seeing what your AI is actually <em>doing</em>. It helps us track everything, keep an eye on costs, get feedback, and manage the prompts we feed our models. We love it.</p>
<p>We were using an older version (v3.29.1) and started seeing a weird bug where sometimes the tracking data (called "traces") just wouldn't load. Annoying! So, we decided to upgrade to the latest version (v3.54.0) hoping it would fix things.</p>
<p>We thought it would be easy – just tell our system (which uses Kubernetes and Helm, think of Helm as a package manager for Kubernetes apps) to use the new Langfuse version. Spoiler: it wasn't <em>that</em> easy. It turned into a bit of a detective story figuring out configuration changes.</p>
<p>Here's what happened and how we fixed it, in case you run into similar bumps:</p>
<p><strong>Problem 1: The Upgrade Just Wouldn't Start</strong></p>
<p>We tried the basic command to upgrade, and BAM! Errors everywhere.</p>
<ul>
<li><p><strong>Password Demands:</strong> The system complained about needing passwords for things like Redis (a speedy database Langfuse uses) and S3 (for storage). It turns out, when you upgrade these kinds of tools using popular setup charts (from Bitnami), they make you re-enter the <em>current</em> password as a safety check.</p>
<ul>
<li><strong>Fix:</strong> We had to dig the existing passwords out of our Kubernetes secrets (with <code>kubectl get secret</code> and a base64 decode) and then explicitly give those passwords back to the upgrade command. We even had to provide the Redis password in a few different places in the command for it to be happy!</li>
</ul>
</li>
<li><p><strong>Config Format Confusion:</strong> We also got warnings about how some settings were written down in our configuration file (<code>values.yaml</code>). Things like secret keys (<code>salt</code>, <code>nextauth.secret</code>) used to be just plain text. The new version wanted them wrapped in a slightly different format (like <code>salt: { value: "our-secret-here" }</code>).</p>
<ul>
<li><strong>Fix:</strong> We had to go into our configuration file and change the format for those specific settings to match what the new version expected.</li>
</ul>
</li>
</ul>
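<p>One detail worth spelling out for that password fix: Kubernetes stores secret values base64-encoded, so whatever you pull out of the secret has to be decoded before you hand it back to the upgrade command. A minimal sketch of the decode step in Python (the secret name and value below are made up):</p>

```python
import base64

# Hypothetical value: in practice it comes from something like
#   kubectl get secret my-release-redis -o jsonpath='{.data.redis-password}'
encoded = "c3VwZXItc2VjcmV0"  # Kubernetes keeps secret values base64-encoded
password = base64.b64decode(encoded).decode("utf-8")
print(password)  # -> super-secret
```

<p>Paste the decoded value (not the base64 string) into the Helm upgrade flags, or you'll just trade one authentication error for another.</p>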
<p><strong>Problem 2: It Tried to Install Storage We Didn't Need</strong></p>
<p>We use AWS S3 for storage, which is separate from our Kubernetes cluster. But the Langfuse upgrade setup <em>assumed</em> we wanted to install internal storage (MinIO/S3) by default. This triggered more password errors for storage we weren't even using!</p>
<ul>
<li><strong>Fix:</strong> Easy one! We just added a couple of lines to our configuration file explicitly telling Helm: "Nope, don't install your own S3/MinIO." (<code>s3: { deploy: false }</code>, <code>minio: { deploy: false }</code>).</li>
</ul>
<p><strong>Problem 3: Where Do the Settings Go Now?!</strong></p>
<p>Okay, we cleared the first hurdles, but then hit <em>more</em> errors. This time, it was about how we configured connections to things like our S3 storage and our main database (ClickHouse).</p>
<ul>
<li><p><strong>The Big Change:</strong> In the older version, we just dumped a lot of settings (like S3 keys, database usernames) into a general list of "environment variables." The <em>new</em> version is much more organized. It has dedicated sections in the configuration file specifically for S3 settings, ClickHouse settings, Redis settings, etc.</p>
</li>
<li><p><strong>The Problem:</strong> The upgrade process got confused because it either couldn't find settings in their <em>new</em> dedicated spots, or it found them in <em>both</em> the old general list and the new sections.</p>
</li>
<li><p><strong>Fix:</strong> We had to carefully move all those connection settings out of the old general list and put them into their proper new sections in the configuration file (e.g., all S3 details went under the <code>s3:</code> section).</p>
</li>
</ul>
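<p>To make that reorganization concrete, here's the shape of the change expressed as Python dicts (the key names are illustrative, not the chart's exact schema):</p>

```python
# Before: connection details dumped into one generic env-var list
# (illustrative key names, not the Helm chart's real schema).
old_values = {
    "additionalEnv": [
        {"name": "S3_BUCKET_NAME", "value": "langfuse-events"},
        {"name": "CLICKHOUSE_USER", "value": "default"},
    ]
}

# After: each dependency gets its own dedicated section.
new_values = {
    "s3": {"bucketName": "langfuse-events"},
    "clickhouse": {"user": "default"},
}
```

<p>The important part is that each setting lives in exactly one place: once a value moves into its dedicated section, it has to come out of the generic list, or the chart complains about the duplicate.</p>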
<p><strong>Problem 4: Okay, It Upgraded... But Now It's Broken!</strong></p>
<p>Finally, the Helm upgrade command finished without errors! But when we tried to use Langfuse...</p>
<ul>
<li><p><strong>App Can't Find the Database:</strong> The main Langfuse web part couldn't connect to its helper database (Redis/Valkey). The error logs showed it was looking for a server name that didn't exist.</p>
<ul>
<li><strong>Fix:</strong> We realized the upgrade had slightly changed the internal network name for the Redis service. We just had to find the <em>correct</em> new name and update our connection setting to point to it.</li>
</ul>
</li>
<li><p><strong>Database Out of Disk Space:</strong> When Langfuse tried to update its main database structure (ClickHouse), it failed because the virtual disk was full.</p>
<ul>
<li><p><strong>Fix:</strong> We needed to give the ClickHouse database more disk space. This was tricky because you often can't just change a setting in Helm to resize an <em>existing</em> disk. We had to:</p>
<ol>
<li><p>Make sure our underlying cloud storage system allowed resizing.</p>
</li>
<li><p>Use a direct Kubernetes command (<code>kubectl patch pvc...</code>) to tell the existing virtual disk to grow bigger.</p>
</li>
<li><p><em>Then</em> update our Helm configuration file with the new size so it matches reality for the future.</p>
</li>
</ol>
</li>
</ul>
</li>
<li><p><strong>Can't Log In!</strong> After all that, we couldn't even log in. Just kept getting "Invalid credentials."</p>
<ul>
<li><p><strong>Checks:</strong> We checked everything – was the Langfuse web address set correctly? Did the secret key change? Was the database connection really okay? Everything looked fine.</p>
</li>
<li><p><strong>The (Embarrassing) Fix:</strong> ...I was typing my email address wrong. Yep. Sometimes the simplest explanation is the right one. Always check the basics!</p>
</li>
</ul>
</li>
</ul>
<p><strong>Success and What We Learned</strong></p>
<p>With the login fixed, Langfuse v3.54.0 was finally up and running! And the good news? The original problem with traces not loading seems to be gone.</p>
<p>This whole process taught us a few things:</p>
<ol>
<li><p><strong>Read the Upgrade Notes:</strong> The people who make these tools often write guides for major changes. Read them!</p>
</li>
<li><p><strong>Dependencies Have Needs:</strong> Langfuse relies on other tools (like Redis). Be ready to handle their specific quirks during upgrades (like needing passwords).</p>
</li>
<li><p><strong>Configuration Isn't Static:</strong> How you set things up can change between versions. Pay attention to new formats or sections in the config files.</p>
</li>
<li><p><strong>Some Fixes Need Direct Intervention:</strong> Helm is great, but for some things (like resizing existing disks), you might need to use direct Kubernetes commands.</p>
</li>
<li><p><strong>Check for Typos!</strong> Don't spend hours debugging complex configurations before checking if you just misspelled your own login.</p>
</li>
</ol>
<p>Even though it was a bit more work than expected, the new way LangFuse organizes its configuration is actually cleaner. Hopefully, sharing our little adventure helps someone else have a smoother upgrade!</p>
]]></content:encoded></item><item><title><![CDATA[Efficient Async Programming: Celery and FastAPI in Action]]></title><description><![CDATA[You know how moving apps can feel like changing a tire while you're driving down the highway? Yeah, it was kinda like that recently! We were shifting a pretty important interview creation service at ZekoAI from AWS Lambda over to FastAPI on Kubernete...]]></description><link>https://tech.zeko.ai/efficient-async-programming-celery-and-fastapi-in-action</link><guid isPermaLink="true">https://tech.zeko.ai/efficient-async-programming-celery-and-fastapi-in-action</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[asynchronous]]></category><category><![CDATA[celery]]></category><category><![CDATA[Redis]]></category><dc:creator><![CDATA[Hemendra Chaudhary]]></dc:creator><pubDate>Thu, 24 Apr 2025 11:14:36 GMT</pubDate><content:encoded><![CDATA[<p>You know how moving apps can feel like changing a tire while you're driving down the highway? Yeah, it was kinda like that recently! We were shifting a pretty important interview creation service at <a target="_blank" href="https://zeko.ai">ZekoAI</a> from AWS Lambda over to FastAPI on Kubernetes, and boy, did we hit a few bumps. Especially when dealing with async Python, Celery background tasks, and Redis. So, I thought I'd share the story of what tripped us up, mainly with those tricky event loops, and how we figured things out.</p>
<p><strong>What We Wanted: Speedy FastAPI with Celery Helpers</strong></p>
<p>The plan was simple: make our new FastAPI service super quick. A bunch of steps in creating an interview (like asking an LLM questions or saving files) took a while, so they were perfect jobs to hand off to Celery to do in the background. Since FastAPI loves <code>async</code>, and lots of our code was already doing <code>async</code> stuff (talking to databases, LLMs, S3), going all-in with <code>async</code>/<code>await</code> just made sense.</p>
<p><strong>First Little Puzzle: Running <code>async</code> Stuff in Celery</strong></p>
<p>Okay, first thing: Celery usually just runs tasks one after the other, nice and simple (synchronously). But our helper functions were <code>async def</code>! How do you make those work? Turns out, there's a standard trick: just wrap your <code>async</code> function call inside <code>asyncio.run()</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># tasks.py</span>
<span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">from</span> app <span class="hljs-keyword">import</span> celery_app
<span class="hljs-keyword">from</span> .utils <span class="hljs-keyword">import</span> do_async_work <span class="hljs-comment"># Imagine this is your async def function</span>

<span class="hljs-meta">@celery_app.task(name="my_async_task")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_async_task_wrapper</span>(<span class="hljs-params">arg1, arg2</span>):</span>
    <span class="hljs-string">"""The plain old Celery task that wraps the async stuff."""</span>
    <span class="hljs-comment"># asyncio.run() just spins up a little loop for the async function</span>
    <span class="hljs-keyword">return</span> asyncio.run(do_async_work(arg1, arg2))

<span class="hljs-comment"># Calling it from somewhere else:</span>
<span class="hljs-comment"># run_async_task_wrapper.delay(value1, value2)</span>
</code></pre>
<p>Easy enough! That let our <code>async</code> code run happily inside the Celery worker. Or so we thought...</p>
<p><strong>Uh Oh, Redis Trouble: "Event loop is closed"? Whaaat?</strong></p>
<p>Things got weird when our <code>async</code> functions needed to chat with Redis (using <code>redis-py</code>'s async features). We thought we were being smart by setting up a shared Redis connection pool when the app and workers first started up. Saves time making new connections, right?</p>
<pre><code class="lang-python"><span class="hljs-comment"># When the app/worker starts up...</span>
<span class="hljs-keyword">import</span> redis.asyncio

<span class="hljs-comment"># CAUTION: This global pool was the sneaky culprit!</span>
redis_connection_pool = redis.asyncio.ConnectionPool(host=..., decode_responses=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Inside an async function (like set_sample_questions_redis)</span>
<span class="hljs-comment"># that got called directly OR through that Celery asyncio.run() trick</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_sample_questions_redis</span>(<span class="hljs-params">...</span>):</span>
    <span class="hljs-comment"># Using that global pool we made earlier</span>
    redis_client = RedisHandler(pool=redis_connection_pool)
    <span class="hljs-keyword">await</span> redis_client.get_key(...) <span class="hljs-comment"># &lt;-- BANG! Error right here in the Celery task</span>
    <span class="hljs-keyword">await</span> redis_client.set_key(...)
</code></pre>
<p>Now, when we called <code>set_sample_questions_redis</code> directly using <code>await</code> from FastAPI, it usually worked fine. But when Celery ran the <em>exact same function</em> using our <code>asyncio.run()</code> wrapper, we got these head-scratching errors:</p>
<pre><code class="lang-python">RuntimeError: Event loop <span class="hljs-keyword">is</span> closed
<span class="hljs-comment"># Or this fun one:</span>
RuntimeError: Task &lt;...&gt; got Future &lt;...&gt; attached to a different loop
</code></pre>
<p>Confusing, right?!</p>
<p><strong>Figuring it Out: The Event Loop Clash!</strong></p>
<p>So, what was the deal? Turns out, <code>asyncio.run()</code> basically sets up its own little temporary workspace with its own power source (that's the event loop) just for the function it's running. But our Redis connection pool? We'd made that <em>way</em> back when the worker first started, and it was hooked up to the <em>worker's original</em> power source.</p>
<p>When our code inside that temporary <code>asyncio.run()</code> workspace tried to use the Redis pool, it was like trying to plug a tool designed for one power outlet into a completely different one. The tool (Redis pool connections) just wasn't compatible with the temporary workspace's power (the new event loop), leading to those crashes.</p>
<p>That explained why direct <code>await</code> calls mostly worked (they were using the main power source the pool was already plugged into) but the <code>asyncio.run()</code> calls always failed!</p>
<p><strong>The Fix: Make the Tools Inside the Workshop!</strong></p>
<p>The big "aha!" moment was realizing that things sensitive to the event loop (like that connection pool) need to be created and used <em>inside the same workspace</em> (the same event loop).</p>
<p>Since our <code>set_sample_questions_redis</code> function was being run in two different "workspaces" (the main FastAPI one and the temporary Celery <code>asyncio.run()</code> one), the simplest, most reliable fix was to just create a <em>new</em> connection pool right inside the function every time it runs. This way, the pool is <em>always</em> plugged into the right power source for that specific run.</p>
<pre><code class="lang-python"><span class="hljs-comment"># utils.py (or wherever set_sample_questions_redis lives)</span>
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> redis.asyncio <span class="hljs-keyword">import</span> ConnectionPool
<span class="hljs-keyword">from</span> app <span class="hljs-keyword">import</span> settings <span class="hljs-comment"># Need our Redis connection details</span>
<span class="hljs-keyword">from</span> .redis_handler <span class="hljs-keyword">import</span> RedisHandler <span class="hljs-comment"># Our helper class</span>

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_sample_questions_redis</span>(<span class="hljs-params">interview_id: str, sample_questions: dict</span>):</span>
    <span class="hljs-string">"""
    Saves/updates sample questions in Redis.
    Makes its own pool each time for event loop compatibility!
    """</span>
    pool = <span class="hljs-literal">None</span> <span class="hljs-comment"># Need this defined here so 'finally' can see it</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># Make a fresh pool right here! It'll use the current event loop.</span>
        pool = ConnectionPool(
            host=settings.REDIS_HOST,
            port=settings.REDIS_PORT,
            db=settings.REDIS_DB,
            max_connections=<span class="hljs-number">10</span>, <span class="hljs-comment"># A reasonable number</span>
            decode_responses=<span class="hljs-literal">True</span> <span class="hljs-comment"># Usually want strings back</span>
        )

        <span class="hljs-comment"># Use this new pool</span>
        redis_client = RedisHandler(pool=pool)
        redis_key = <span class="hljs-string">f"<span class="hljs-subst">{interview_id}</span>SampleQuestions"</span>

        <span class="hljs-comment"># ... (the actual work: get old data, mix in new, save it) ...</span>
        existing_data_raw = <span class="hljs-keyword">await</span> redis_client.get_key(redis_key)
        <span class="hljs-comment"># ... mix it up ...</span>
        <span class="hljs-keyword">await</span> redis_client.set_key(redis_key, json.dumps(existing_questions))
        <span class="hljs-keyword">await</span> redis_client.set_expiry(redis_key, <span class="hljs-number">60</span> * <span class="hljs-number">60</span> * <span class="hljs-number">24</span>) <span class="hljs-comment"># 1 day</span>

        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span> <span class="hljs-comment"># Hooray!</span>
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Uh oh, log it</span>
        <span class="hljs-comment"># ... error logging ...</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>
    <span class="hljs-keyword">finally</span>:
        <span class="hljs-comment"># Super important: Clean up the pool we made!</span>
        <span class="hljs-keyword">if</span> pool:
            <span class="hljs-keyword">await</span> pool.disconnect()
</code></pre>
<p>Doing it this way totally fixed those annoying loop errors! Normally, you'd want a shared pool for best performance, as creating/destroying them has some overhead. But in this specific Celery + <code>asyncio.run()</code> situation, that global pool caused those loop errors. Having it actually <em>work</em> without crashing was way more important, right? Correctness first!</p>
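<p>For what it's worth, a common middle ground (not what we shipped) is to cache one pool <em>per event loop</em> instead of one global pool, so each loop only ever sees connections it created. A generic sketch of the idea; here plain <code>object</code> stands in for a real connection pool, and in our case the <code>factory</code> would be something like <code>lambda: ConnectionPool(...)</code>:</p>

```python
import asyncio

class PerLoopResource:
    """Lazily creates one resource instance per running event loop."""

    def __init__(self, factory):
        self._factory = factory
        self._instances = {}  # event loop -> resource

    def get(self):
        loop = asyncio.get_running_loop()
        if loop not in self._instances:
            self._instances[loop] = self._factory()
        return self._instances[loop]

# object() stands in for a real connection pool here.
pools = PerLoopResource(factory=object)

async def grab():
    return pools.get()

# Each asyncio.run() call spins up its own loop, hence its own "pool":
first = asyncio.run(grab())
second = asyncio.run(grab())
print(first is second)  # -> False
```

<p>A production version would also want to evict entries when a loop closes, but the core idea is just: key loop-sensitive resources by the loop that will actually use them.</p>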
<p>While we were wrestling with those event loops, we also bumped into another little async-related snag.</p>
<p><strong>Bonus Tip: Celery Results Aren't Simple Data!</strong></p>
<p>Another little "gotcha" we saw mentioned (though we sidestepped it!) relates to calling Celery tasks with <code>.delay()</code>. When you call <code>.delay()</code>, it gives you back an <code>AsyncResult</code> object right away – think of it like a tracking number for your background job.</p>
<p>Now, if you needed the background job to calculate something and <em>return</em> it (like maybe the S3 key <em>after</em> it uploads a file), you'd have a problem trying to immediately save that <code>AsyncResult</code> object somewhere using <code>json.dumps()</code>. JSON just doesn't know what to do with that complex Python object! People often work around this by saving the <code>task_id</code> string from the <code>AsyncResult</code> instead.</p>
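<p>Here's a tiny stand-alone illustration of that gotcha, using a stand-in class since the real <code>AsyncResult</code> comes from Celery:</p>

```python
import json

class FakeAsyncResult:
    """Stand-in for Celery's AsyncResult: a complex object carrying a .id."""
    def __init__(self, task_id):
        self.id = task_id

result = FakeAsyncResult("3f2b0c1e-example-task-id")  # what .delay() hands back

try:
    json.dumps({"upload_job": result})  # complex object -> boom
except TypeError as err:
    print(err)  # -> Object of type FakeAsyncResult is not JSON serializable

# The common workaround: persist only the task id string.
payload = json.dumps({"upload_job_id": result.id})
```

<p>Storing the plain <code>id</code> string keeps the payload JSON-friendly, and you can always reconstruct a result handle from that id later if you need to poll the job.</p>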
<p>But in our case, for saving the job description to S3, we realized we didn't actually need the task to <em>return</em> the S3 key. We could just <em>decide</em> what the key should be <em>before</em> even calling the task!</p>
<pre><code class="lang-python"><span class="hljs-comment"># What we actually did in get_jd_summary.py</span>

<span class="hljs-comment"># Decide the S3 key *before* calling the task</span>
object_key = <span class="hljs-string">f"ait_odam_jds/<span class="hljs-subst">{user_id}</span>_<span class="hljs-subst">{time.time()}</span>.txt"</span>

<span class="hljs-comment"># Call the task, passing the pre-decided key. We don't need its return value!</span>
save_jd_s3_task.delay(job_description, object_key)

<span class="hljs-comment"># ... later, when saving data to Redis ...</span>
<span class="hljs-comment"># We already know the object_key, so we can just use it directly!</span>
response[<span class="hljs-string">"jd_object_key"</span>] = object_key
<span class="hljs-keyword">await</span> redis_client.set_key(interview_id, json.dumps(response)) <span class="hljs-comment"># Works perfectly!</span>
</code></pre>
<p>So, by generating the identifier (the S3 object key) upfront, we completely avoided needing to get anything back from the Celery task or worrying about how to store its result. Simple and effective!</p>
<p><strong>Wrapping Up: Watch Those Loops!</strong></p>
<p>So yeah, moving to async tools like FastAPI while mixing in Celery means you really gotta pay attention to how you handle async resources, especially those event loops!</p>
<p>Our main lessons learned:</p>
<ol>
<li><p><code>asyncio.run()</code> is your friend for running <code>async</code> code in regular Celery tasks.</p>
</li>
<li><p>Watch out for async things (like connection pools) you create globally if you plan to use them inside <code>asyncio.run()</code>. That event loop mismatch is sneaky!</p>
</li>
<li><p>Creating those sensitive resources <em>inside</em> the async function you pass to <code>asyncio.run()</code> is a solid way to keep things working correctly.</p>
</li>
<li><p>Remember, Celery's <code>.delay()</code> gives you a tracker (<code>AsyncResult</code>), not the final answer. Save the <code>task_id</code> if you need to refer back to the job.</p>
</li>
</ol>
<p>It took some head-scratching, but figuring out how Celery, <code>asyncio</code>, and things like Redis pools play together (or fight!) was key to getting our service running smoothly. In this tricky spot we prioritized working reliably over squeezing out every last drop of performance. Hopefully, our little adventure saves you some trouble!</p>
]]></content:encoded></item></channel></rss>