<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tech @ Zeko AI]]></title><description><![CDATA[Real-world engineering stories from Zeko AI. We share the challenges we face, the solutions we discover, and the hard-won lessons learned while building and scaling our technology.]]></description><link>https://tech.zeko.ai</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 10:18:32 GMT</lastBuildDate><atom:link href="https://tech.zeko.ai/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[LangFuse Upgrade Experience: Navigating from v3.29 to v3.54]]></title><description><![CDATA[LangFuse is awesome. If you're working with Large Language Models (LLMs), you probably know how useful it is for seeing what your AI is actually doing. It helps us track everything, keep an eye on costs, get feedback, and manage the prompts we feed o...]]></description><link>https://tech.zeko.ai/langfuse-upgrade-experience-navigating-from-v329-to-v354</link><guid isPermaLink="true">https://tech.zeko.ai/langfuse-upgrade-experience-navigating-from-v329-to-v354</guid><category><![CDATA[langfuse]]></category><category><![CDATA[llm]]></category><category><![CDATA[Prompt Engineering]]></category><dc:creator><![CDATA[Hemendra Chaudhary]]></dc:creator><pubDate>Fri, 02 May 2025 12:40:54 GMT</pubDate><content:encoded><![CDATA[<p><a target="_blank" href="https://langfuse.com">LangFuse</a> is awesome. If you're working with Large Language Models (LLMs), you probably know how useful it is for seeing what your AI is actually <em>doing</em>. It helps us track everything, keep an eye on costs, get feedback, and manage the prompts we feed our models. We love it.</p>
<p>We were using an older version (v3.29.1) and started seeing a weird bug where sometimes the tracking data (called "traces") just wouldn't load. Annoying! So, we decided to upgrade to the latest version (v3.54.0) hoping it would fix things.</p>
<p>We thought it would be easy – just tell our system (which uses Kubernetes and Helm, think of Helm as a package manager for Kubernetes apps) to use the new Langfuse version. Spoiler: it wasn't <em>that</em> easy. It turned into a bit of a detective story figuring out configuration changes.</p>
<p>Here's what happened and how we fixed it, in case you run into similar bumps:</p>
<p><strong>Problem 1: The Upgrade Just Wouldn't Start</strong></p>
<p>We tried the basic command to upgrade, and BAM! Errors everywhere.</p>
<ul>
<li><p><strong>Password Demands:</strong> The system complained about needing passwords for things like Redis (a speedy database Langfuse uses) and S3 (for storage). It turns out, when you upgrade these kinds of tools using popular setup charts (from Bitnami), they make you re-enter the <em>current</em> password as a safety check.</p>
<ul>
<li><strong>Fix:</strong> We had to dig the existing passwords out of our Kubernetes secrets (with <code>kubectl get secret</code> and a base64 decode) and then explicitly give those passwords back to the upgrade command. We even had to provide the Redis password in a few different places in the command for it to be happy!</li>
</ul>
</li>
<li><p><strong>Config Format Confusion:</strong> We also got warnings about how some settings were written down in our configuration file (<code>values.yaml</code>). Things like secret keys (<code>salt</code>, <code>nextauth.secret</code>) used to be just plain text. The new version wanted them wrapped in a slightly different format (like <code>salt: { value: "our-secret-here" }</code>).</p>
<ul>
<li><strong>Fix:</strong> We had to go into our configuration file and change the format for those specific settings to match what the new version expected.</li>
</ul>
</li>
</ul>
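<p>One detail worth spelling out for that password fix: Kubernetes stores secret values base64-encoded, so whatever you pull out of the secret has to be decoded before you hand it back to the upgrade command. A minimal sketch of the decode step in Python (the secret name and value below are made up):</p>

```python
import base64

# Hypothetical value: in practice it comes from something like
#   kubectl get secret my-release-redis -o jsonpath='{.data.redis-password}'
encoded = "c3VwZXItc2VjcmV0"  # Kubernetes keeps secret values base64-encoded
password = base64.b64decode(encoded).decode("utf-8")
print(password)  # -> super-secret
```

<p>Paste the decoded value (not the base64 string) into the Helm upgrade flags, or you'll just trade one authentication error for another.</p>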
<p><strong>Problem 2: It Tried to Install Storage We Didn't Need</strong></p>
<p>We use AWS S3 for storage, which is separate from our Kubernetes cluster. But the Langfuse upgrade setup <em>assumed</em> we wanted to install internal storage (MinIO/S3) by default. This triggered more password errors for storage we weren't even using!</p>
<ul>
<li><strong>Fix:</strong> Easy one! We just added a couple of lines to our configuration file explicitly telling Helm: "Nope, don't install your own S3/MinIO." (<code>s3: { deploy: false }</code>, <code>minio: { deploy: false }</code>).</li>
</ul>
<p><strong>Problem 3: Where Do the Settings Go Now?!</strong></p>
<p>Okay, we cleared the first hurdles, but then hit <em>more</em> errors. This time, it was about how we configured connections to things like our S3 storage and our main database (ClickHouse).</p>
<ul>
<li><p><strong>The Big Change:</strong> In the older version, we just dumped a lot of settings (like S3 keys, database usernames) into a general list of "environment variables." The <em>new</em> version is much more organized. It has dedicated sections in the configuration file specifically for S3 settings, ClickHouse settings, Redis settings, etc.</p>
</li>
<li><p><strong>The Problem:</strong> The upgrade process got confused because it either couldn't find settings in their <em>new</em> dedicated spots, or it found them in <em>both</em> the old general list and the new sections.</p>
</li>
<li><p><strong>Fix:</strong> We had to carefully move all those connection settings out of the old general list and put them into their proper new sections in the configuration file (e.g., all S3 details went under the <code>s3:</code> section).</p>
</li>
</ul>
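<p>To make that reorganization concrete, here's the shape of the change expressed as Python dicts (the key names are illustrative, not the chart's exact schema):</p>

```python
# Before: connection details dumped into one generic env-var list
# (illustrative key names, not the Helm chart's real schema).
old_values = {
    "additionalEnv": [
        {"name": "S3_BUCKET_NAME", "value": "langfuse-events"},
        {"name": "CLICKHOUSE_USER", "value": "default"},
    ]
}

# After: each dependency gets its own dedicated section.
new_values = {
    "s3": {"bucketName": "langfuse-events"},
    "clickhouse": {"user": "default"},
}
```

<p>The important part is that each setting lives in exactly one place: once a value moves into its dedicated section, it has to come out of the generic list, or the chart complains about the duplicate.</p>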
<p><strong>Problem 4: Okay, It Upgraded... But Now It's Broken!</strong></p>
<p>Finally, the Helm upgrade command finished without errors! But when we tried to use Langfuse...</p>
<ul>
<li><p><strong>App Can't Find the Database:</strong> The main Langfuse web part couldn't connect to its helper database (Redis/Valkey). The error logs showed it was looking for a server name that didn't exist.</p>
<ul>
<li><strong>Fix:</strong> We realized the upgrade had slightly changed the internal network name for the Redis service. We just had to find the <em>correct</em> new name and update our connection setting to point to it.</li>
</ul>
</li>
<li><p><strong>Database Out of Disk Space:</strong> When Langfuse tried to update its main database structure (ClickHouse), it failed because the virtual disk was full.</p>
<ul>
<li><p><strong>Fix:</strong> We needed to give the ClickHouse database more disk space. This was tricky because you often can't just change a setting in Helm to resize an <em>existing</em> disk. We had to:</p>
<ol>
<li><p>Make sure our underlying cloud storage system allowed resizing.</p>
</li>
<li><p>Use a direct Kubernetes command (<code>kubectl patch pvc...</code>) to tell the existing virtual disk to grow bigger.</p>
</li>
<li><p><em>Then</em> update our Helm configuration file with the new size so it matches reality for the future.</p>
</li>
</ol>
</li>
</ul>
</li>
<li><p><strong>Can't Log In!</strong> After all that, we couldn't even log in. Just kept getting "Invalid credentials."</p>
<ul>
<li><p><strong>Checks:</strong> We checked everything – was the Langfuse web address set correctly? Did the secret key change? Was the database connection really okay? Everything looked fine.</p>
</li>
<li><p><strong>The (Embarrassing) Fix:</strong> ...I was typing my email address wrong. Yep. Sometimes the simplest explanation is the right one. Always check the basics!</p>
</li>
</ul>
</li>
</ul>
<p><strong>Success and What We Learned</strong></p>
<p>With the login fixed, Langfuse v3.54.0 was finally up and running! And the good news? The original problem with traces not loading seems to be gone.</p>
<p>This whole process taught us a few things:</p>
<ol>
<li><p><strong>Read the Upgrade Notes:</strong> The people who make these tools often write guides for major changes. Read them!</p>
</li>
<li><p><strong>Dependencies Have Needs:</strong> Langfuse relies on other tools (like Redis). Be ready to handle their specific quirks during upgrades (like needing passwords).</p>
</li>
<li><p><strong>Configuration Isn't Static:</strong> How you set things up can change between versions. Pay attention to new formats or sections in the config files.</p>
</li>
<li><p><strong>Some Fixes Need Direct Intervention:</strong> Helm is great, but for some things (like resizing existing disks), you might need to use direct Kubernetes commands.</p>
</li>
<li><p><strong>Check for Typos!</strong> Don't spend hours debugging complex configurations before checking if you just misspelled your own login.</p>
</li>
</ol>
<p>Even though it was a bit more work than expected, the new way LangFuse organizes its configuration is actually cleaner. Hopefully, sharing our little adventure helps someone else have a smoother upgrade!</p>
]]></content:encoded></item><item><title><![CDATA[Efficient Async Programming: Celery and FastAPI in Action]]></title><description><![CDATA[You know how moving apps can feel like changing a tire while you're driving down the highway? Yeah, it was kinda like that recently! We were shifting a pretty important interview creation service at ZekoAI from AWS Lambda over to FastAPI on Kubernete...]]></description><link>https://tech.zeko.ai/efficient-async-programming-celery-and-fastapi-in-action</link><guid isPermaLink="true">https://tech.zeko.ai/efficient-async-programming-celery-and-fastapi-in-action</guid><category><![CDATA[FastAPI]]></category><category><![CDATA[asynchronous]]></category><category><![CDATA[celery]]></category><category><![CDATA[Redis]]></category><dc:creator><![CDATA[Hemendra Chaudhary]]></dc:creator><pubDate>Thu, 24 Apr 2025 11:14:36 GMT</pubDate><content:encoded><![CDATA[<p>You know how moving apps can feel like changing a tire while you're driving down the highway? Yeah, it was kinda like that recently! We were shifting a pretty important interview creation service at <a target="_blank" href="https://zeko.ai">ZekoAI</a> from AWS Lambda over to FastAPI on Kubernetes, and boy, did we hit a few bumps. Especially when dealing with async Python, Celery background tasks, and Redis. So, I thought I'd share the story of what tripped us up, mainly with those tricky event loops, and how we figured things out.</p>
<p><strong>What We Wanted: Speedy FastAPI with Celery Helpers</strong></p>
<p>The plan was simple: make our new FastAPI service super quick. A bunch of steps in creating an interview (like asking an LLM questions or saving files) took a while, so they were perfect jobs to hand off to Celery to do in the background. Since FastAPI loves <code>async</code>, and lots of our code was already doing <code>async</code> stuff (talking to databases, LLMs, S3), going all-in with <code>async</code>/<code>await</code> just made sense.</p>
<p><strong>First Little Puzzle: Running <code>async</code> Stuff in Celery</strong></p>
<p>Okay, first thing: Celery usually just runs tasks one after the other, nice and simple (synchronously). But our helper functions were <code>async def</code>! How do you make those work? Turns out, there's a standard trick: just wrap your <code>async</code> function call inside <code>asyncio.run()</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment"># tasks.py</span>
<span class="hljs-keyword">import</span> asyncio
<span class="hljs-keyword">from</span> app <span class="hljs-keyword">import</span> celery_app
<span class="hljs-keyword">from</span> .utils <span class="hljs-keyword">import</span> do_async_work <span class="hljs-comment"># Imagine this is your async def function</span>

<span class="hljs-meta">@celery_app.task(name="my_async_task")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_async_task_wrapper</span>(<span class="hljs-params">arg1, arg2</span>):</span>
    <span class="hljs-string">"""The plain old Celery task that wraps the async stuff."""</span>
    <span class="hljs-comment"># asyncio.run() just spins up a little loop for the async function</span>
    <span class="hljs-keyword">return</span> asyncio.run(do_async_work(arg1, arg2))

<span class="hljs-comment"># Calling it from somewhere else:</span>
<span class="hljs-comment"># run_async_task_wrapper.delay(value1, value2)</span>
</code></pre>
<p>Easy enough! That let our <code>async</code> code run happily inside the Celery worker. Or so we thought...</p>
<p><strong>Uh Oh, Redis Trouble: "Event loop is closed"? Whaaat?</strong></p>
<p>Things got weird when our <code>async</code> functions needed to chat with Redis (using <code>redis-py</code>'s async features). We thought we were being smart by setting up a shared Redis connection pool when the app and workers first started up. Saves time making new connections, right?</p>
<pre><code class="lang-python"><span class="hljs-comment"># When the app/worker starts up...</span>
<span class="hljs-keyword">import</span> redis.asyncio

<span class="hljs-comment"># CAUTION: This global pool was the sneaky culprit!</span>
redis_connection_pool = redis.asyncio.ConnectionPool(host=..., decode_responses=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Inside an async function (like set_sample_questions_redis)</span>
<span class="hljs-comment"># that got called directly OR through that Celery asyncio.run() trick</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_sample_questions_redis</span>(<span class="hljs-params">...</span>):</span>
    <span class="hljs-comment"># Using that global pool we made earlier</span>
    redis_client = RedisHandler(pool=redis_connection_pool)
    <span class="hljs-keyword">await</span> redis_client.get_key(...) <span class="hljs-comment"># &lt;-- BANG! Error right here in the Celery task</span>
    <span class="hljs-keyword">await</span> redis_client.set_key(...)
</code></pre>
<p>Now, when we called <code>set_sample_questions_redis</code> directly using <code>await</code> from FastAPI, it usually worked fine. But when Celery ran the <em>exact same function</em> using our <code>asyncio.run()</code> wrapper, we got these head-scratching errors:</p>
<pre><code class="lang-python">RuntimeError: Event loop <span class="hljs-keyword">is</span> closed
<span class="hljs-comment"># Or this fun one:</span>
RuntimeError: Task &lt;...&gt; got Future &lt;...&gt; attached to a different loop
</code></pre>
<p>Confusing, right?!</p>
<p><strong>Figuring it Out: The Event Loop Clash!</strong></p>
<p>So, what was the deal? Turns out, <code>asyncio.run()</code> basically sets up its own little temporary workspace with its own power source (that's the event loop) just for the function it's running. But our Redis connection pool? We'd made that <em>way</em> back when the worker first started, and it was hooked up to the <em>worker's original</em> power source.</p>
<p>When our code inside that temporary <code>asyncio.run()</code> workspace tried to use the Redis pool, it was like trying to plug a tool designed for one power outlet into a completely different one. The tool (Redis pool connections) just wasn't compatible with the temporary workspace's power (the new event loop), leading to those crashes.</p>
<p>That explained why direct <code>await</code> calls mostly worked (they were using the main power source the pool was already plugged into) but the <code>asyncio.run()</code> calls always failed!</p>
<p><strong>The Fix: Make the Tools Inside the Workshop!</strong></p>
<p>The big "aha!" moment was realizing that things sensitive to the event loop (like that connection pool) need to be created and used <em>inside the same workspace</em> (the same event loop).</p>
<p>Since our <code>set_sample_questions_redis</code> function was being run in two different "workspaces" (the main FastAPI one and the temporary Celery <code>asyncio.run()</code> one), the simplest, most reliable fix was to just create a <em>new</em> connection pool right inside the function every time it runs. This way, the pool is <em>always</em> plugged into the right power source for that specific run.</p>
<pre><code class="lang-python"><span class="hljs-comment"># utils.py (or wherever set_sample_questions_redis lives)</span>
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> redis.asyncio <span class="hljs-keyword">import</span> ConnectionPool
<span class="hljs-keyword">from</span> app <span class="hljs-keyword">import</span> settings <span class="hljs-comment"># Need our Redis connection details</span>
<span class="hljs-keyword">from</span> .redis_handler <span class="hljs-keyword">import</span> RedisHandler <span class="hljs-comment"># Our helper class</span>

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_sample_questions_redis</span>(<span class="hljs-params">interview_id: str, sample_questions: dict</span>):</span>
    <span class="hljs-string">"""
    Saves/updates sample questions in Redis.
    Makes its own pool each time for event loop compatibility!
    """</span>
    pool = <span class="hljs-literal">None</span> <span class="hljs-comment"># Need this defined here so 'finally' can see it</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-comment"># Make a fresh pool right here! It'll use the current event loop.</span>
        pool = ConnectionPool(
            host=settings.REDIS_HOST,
            port=settings.REDIS_PORT,
            db=settings.REDIS_DB,
            max_connections=<span class="hljs-number">10</span>, <span class="hljs-comment"># A reasonable number</span>
            decode_responses=<span class="hljs-literal">True</span> <span class="hljs-comment"># Usually want strings back</span>
        )

        <span class="hljs-comment"># Use this new pool</span>
        redis_client = RedisHandler(pool=pool)
        redis_key = <span class="hljs-string">f"<span class="hljs-subst">{interview_id}</span>SampleQuestions"</span>

        <span class="hljs-comment"># ... (the actual work: get old data, mix in new, save it) ...</span>
        existing_data_raw = <span class="hljs-keyword">await</span> redis_client.get_key(redis_key)
        <span class="hljs-comment"># ... mix it up ...</span>
        <span class="hljs-keyword">await</span> redis_client.set_key(redis_key, json.dumps(existing_questions))
        <span class="hljs-keyword">await</span> redis_client.set_expiry(redis_key, <span class="hljs-number">60</span> * <span class="hljs-number">60</span> * <span class="hljs-number">24</span>) <span class="hljs-comment"># 1 day</span>

        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span> <span class="hljs-comment"># Hooray!</span>
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-comment"># Uh oh, log it</span>
        <span class="hljs-comment"># ... error logging ...</span>
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>
    <span class="hljs-keyword">finally</span>:
        <span class="hljs-comment"># Super important: Clean up the pool we made!</span>
        <span class="hljs-keyword">if</span> pool:
            <span class="hljs-keyword">await</span> pool.disconnect()
</code></pre>
<p>Doing it this way totally fixed those annoying loop errors! Normally, you'd want a shared pool for best performance, as creating/destroying them has some overhead. But in this specific Celery + <code>asyncio.run()</code> situation, that global pool caused those loop errors. Having it actually <em>work</em> without crashing was way more important, right? Correctness first!</p>
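<p>For what it's worth, a common middle ground (not what we shipped) is to cache one pool <em>per event loop</em> instead of one global pool, so each loop only ever sees connections it created. A generic sketch of the idea; here plain <code>object</code> stands in for a real connection pool, and in our case the <code>factory</code> would be something like <code>lambda: ConnectionPool(...)</code>:</p>

```python
import asyncio

class PerLoopResource:
    """Lazily creates one resource instance per running event loop."""

    def __init__(self, factory):
        self._factory = factory
        self._instances = {}  # event loop -> resource

    def get(self):
        loop = asyncio.get_running_loop()
        if loop not in self._instances:
            self._instances[loop] = self._factory()
        return self._instances[loop]

# object() stands in for a real connection pool here.
pools = PerLoopResource(factory=object)

async def grab():
    return pools.get()

# Each asyncio.run() call spins up its own loop, hence its own "pool":
first = asyncio.run(grab())
second = asyncio.run(grab())
print(first is second)  # -> False
```

<p>A production version would also want to evict entries when a loop closes, but the core idea is just: key loop-sensitive resources by the loop that will actually use them.</p>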
<p>While we were wrestling with those event loops, we also bumped into another little async-related snag.</p>
<p><strong>Bonus Tip: Celery Results Aren't Simple Data!</strong></p>
<p>Another little "gotcha" we saw mentioned (though we sidestepped it!) relates to calling Celery tasks with <code>.delay()</code>. When you call <code>.delay()</code>, it gives you back an <code>AsyncResult</code> object right away – think of it like a tracking number for your background job.</p>
<p>Now, if you needed the background job to calculate something and <em>return</em> it (like maybe the S3 key <em>after</em> it uploads a file), you'd have a problem trying to immediately save that <code>AsyncResult</code> object somewhere using <code>json.dumps()</code>. JSON just doesn't know what to do with that complex Python object! People often work around this by saving the <code>task_id</code> string from the <code>AsyncResult</code> instead.</p>
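<p>Here's a tiny stand-alone illustration of that gotcha, using a stand-in class since the real <code>AsyncResult</code> comes from Celery:</p>

```python
import json

class FakeAsyncResult:
    """Stand-in for Celery's AsyncResult: a complex object carrying a .id."""
    def __init__(self, task_id):
        self.id = task_id

result = FakeAsyncResult("3f2b0c1e-example-task-id")  # what .delay() hands back

try:
    json.dumps({"upload_job": result})  # complex object -> boom
except TypeError as err:
    print(err)  # -> Object of type FakeAsyncResult is not JSON serializable

# The common workaround: persist only the task id string.
payload = json.dumps({"upload_job_id": result.id})
```

<p>Storing the plain <code>id</code> string keeps the payload JSON-friendly, and you can always reconstruct a result handle from that id later if you need to poll the job.</p>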
<p>But in our case, for saving the job description to S3, we realized we didn't actually need the task to <em>return</em> the S3 key. We could just <em>decide</em> what the key should be <em>before</em> even calling the task!</p>
<pre><code class="lang-python"><span class="hljs-comment"># What we actually did in get_jd_summary.py</span>

<span class="hljs-comment"># Decide the S3 key *before* calling the task</span>
object_key = <span class="hljs-string">f"ait_odam_jds/<span class="hljs-subst">{user_id}</span>_<span class="hljs-subst">{time.time()}</span>.txt"</span>

<span class="hljs-comment"># Call the task, passing the pre-decided key. We don't need its return value!</span>
save_jd_s3_task.delay(job_description, object_key)

<span class="hljs-comment"># ... later, when saving data to Redis ...</span>
<span class="hljs-comment"># We already know the object_key, so we can just use it directly!</span>
response[<span class="hljs-string">"jd_object_key"</span>] = object_key
<span class="hljs-keyword">await</span> redis_client.set_key(interview_id, json.dumps(response)) <span class="hljs-comment"># Works perfectly!</span>
</code></pre>
<p>So, by generating the identifier (the S3 object key) upfront, we completely avoided needing to get anything back from the Celery task or worrying about how to store its result. Simple and effective!</p>
<p><strong>Wrapping Up: Watch Those Loops!</strong></p>
<p>So yeah, moving to async tools like FastAPI while mixing in Celery means you really gotta pay attention to how you handle async resources, especially those event loops!</p>
<p>Our main lessons learned:</p>
<ol>
<li><p><code>asyncio.run()</code> is your friend for running <code>async</code> code in regular Celery tasks.</p>
</li>
<li><p>Watch out for async things (like connection pools) you create globally if you plan to use them inside <code>asyncio.run()</code>. That event loop mismatch is sneaky!</p>
</li>
<li><p>Creating those sensitive resources <em>inside</em> the async function you pass to <code>asyncio.run()</code> is a solid way to keep things working correctly.</p>
</li>
<li><p>Remember, Celery's <code>.delay()</code> gives you a tracker (<code>AsyncResult</code>), not the final answer. Save the <code>task_id</code> if you need to refer back to the job.</p>
</li>
</ol>
<p>It took some head-scratching, but figuring out how Celery, <code>asyncio</code>, and things like Redis pools play together (or fight!) was key to getting our service running smoothly. In this tricky spot we prioritized working reliably over squeezing out every last drop of performance. Hopefully, our little adventure saves you some trouble!</p>
]]></content:encoded></item></channel></rss>