Building with Large Language Models (LLMs) is incredibly exciting, but let’s be honest: it can often feel like you’re peering into a sophisticated black box. You feed it a prompt, get an answer, and then wonder, “How did it get there?” or even “Is this answer actually any good?” For developers and MLOps teams, this lack of transparency isn’t just frustrating; it’s a roadblock to debugging, improving, and reliably deploying AI applications. We need visibility, measurability, and, most importantly, reproducibility in our AI workflows.

That’s precisely where tools like Opik come into play. Imagine a world where every step of your LLM’s thought process is laid bare, where you can quantify its performance with robust metrics, and where you can easily compare different iterations of your pipeline. Sounds pretty good, right? In this post, we’ll walk through how to implement a fully traced and evaluated local LLM pipeline using Opik, transforming that black box into a transparent, measurable, and reproducible system. We’re talking about taking back control of our AI development.

Setting Up for Transparency: The Foundation of Our LLM Workflow

Before we dive deep into the nitty-gritty, we need to set the stage. Our goal here isn’t just to get an LLM working; it’s to get it working *visibly*. Opik provides the scaffolding for this visibility from the get-go. The first step involves installing the necessary libraries and initializing Opik, configuring our project so that every subsequent action, every trace, flows into a designated workspace. This foundational step is crucial for aggregating insights later on.
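
To make that concrete, here’s a minimal sketch of what the installation and initialization step could look like, assuming a locally running Opik instance; the project name is just an illustrative placeholder:

```python
# Install the pieces first:
#   pip install opik transformers torch

import os

import opik

# Route every trace and experiment into one named project (name is illustrative).
os.environ["OPIK_PROJECT_NAME"] = "local-llm-pipeline"

# Point the SDK at a locally running Opik instance rather than the hosted service.
opik.configure(use_local=True)
```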

For this walkthrough, we’re keeping things local and lean. We’ll be using a lightweight Hugging Face model, specifically `distilgpt2`. Why local? It offers unparalleled control, consistency, and reproducibility. There are no external API calls to worry about, no rate limits, and our environment remains self-contained. We’re essentially building our own miniature LLM service, ready to generate text reliably. A small helper function keeps the generated text clean and focused by stripping the input prompt from the model’s output. This preparation might seem minor, but it’s essential for a predictable generation layer.
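
Here’s one way that generation layer might be wired up with Hugging Face’s `transformers` pipeline. The `generate_text` helper is a name of our own choosing, and greedy decoding is an assumption made purely for reproducibility:

```python
from transformers import pipeline

# A small, CPU-friendly model keeps everything local and reproducible.
generator = pipeline("text-generation", model="distilgpt2")


def generate_text(prompt: str, max_new_tokens: int = 80) -> str:
    """Run distilgpt2 deterministically and return only the newly generated text."""
    full_text = generator(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding keeps runs reproducible
        pad_token_id=generator.tokenizer.eos_token_id,
    )[0]["generated_text"]
    # The pipeline echoes the prompt, so strip it to keep outputs clean.
    return full_text[len(prompt):].strip()
```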

Crafting Intent: Structured Prompting with Opik

One of the most powerful ways to guide an LLM’s behavior is through prompt engineering. But simply throwing text at a model often leads to inconsistent results. This is where structured prompts become invaluable. With Opik’s `Prompt` class, we can define clear templates for different phases of our LLM’s reasoning. In our example, we create two distinct prompts:

  • `plan_prompt`: This prompt instructs the LLM to act as an assistant that creates a plan to answer a question based *only* on provided context, enforcing a three-bullet-point output.
  • `answer_prompt`: This prompt then takes that generated plan, along with the original context and question, and guides the LLM to formulate a concise answer (2-4 sentences).

This separation of concerns—planning and answering—allows us to observe how structured prompting influences the model’s thought process. Opik helps us maintain this consistency, providing a clear window into the model’s internal “dialogue.”
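
As a sketch of how those two templates could be registered with Opik’s `Prompt` class, assuming mustache-style `{{placeholder}}` templates; the names and exact wording below are illustrative, not the post’s verbatim prompts:

```python
from opik import Prompt

# Two versioned prompt templates; names and wording here are illustrative.
plan_prompt = Prompt(
    name="qa-plan",
    prompt=(
        "You are an assistant that plans how to answer a question using ONLY "
        "the provided context.\n"
        "Context: {{context}}\n"
        "Question: {{question}}\n"
        "Write exactly three bullet points describing your plan."
    ),
)

answer_prompt = Prompt(
    name="qa-answer",
    prompt=(
        "Context: {{context}}\n"
        "Question: {{question}}\n"
        "Plan: {{plan}}\n"
        "Using the plan above, answer the question in 2-4 sentences."
    ),
)
```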

Building the Traceable Pipeline: From Context to Answer

With our prompts defined and our local LLM ready, it’s time to assemble the pipeline. But not just any pipeline—a *fully traceable* one. Opik’s decorators are the magic ingredient here, allowing us to instrument every function call and see it reflected in our dashboard. Think of it as leaving breadcrumbs throughout your code, but these breadcrumbs are rich with data about execution times, inputs, outputs, and even LLM token usage.

Simulating RAG with a Tracked Tool

Even simple utilities can be part of our traceable workflow. To simulate a minimal Retrieval-Augmented Generation (RAG) style system without needing a full vector database, we construct a tiny document store (`DOCS`) and a `retrieve_context` function. This function intelligently selects context based on keywords in the user’s question. Crucially, we mark this function with Opik’s `@track(type="tool")` decorator. This means that even this “tool” function’s execution is logged, giving us a complete view of how context is retrieved before being fed to the LLM. It’s an often-overlooked aspect of LLM pipelines, yet critical for understanding behavior.
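
A minimal version of that document store and tracked retrieval tool might look like this; the documents and keyword-matching logic are purely illustrative:

```python
from opik import track

# A tiny in-memory "document store" standing in for a real vector database
# (the contents are purely illustrative).
DOCS = {
    "opik": "Opik is an open-source platform for tracing and evaluating LLM applications.",
    "rag": "Retrieval-Augmented Generation pairs a retriever over documents with a generator model.",
    "tracing": "Tracing records each pipeline step along with its inputs, outputs, and timing.",
}


@track(type="tool")
def retrieve_context(question: str) -> str:
    """Return the document whose keyword appears in the question, with a fallback."""
    q = question.lower()
    for keyword, doc in DOCS.items():
        if keyword in q:
            return doc
    return DOCS["opik"]
```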

Connecting the Dots: The QA Pipeline

Now, we bring everything together into our main `qa_pipeline` function. This function orchestrates the entire process:

  1. It retrieves context using our tracked `retrieve_context` tool.
  2. It generates a plan using our `plan_answer` function, also marked with `@track(type="llm")`.
  3. It then formulates the final answer using the `answer_from_plan` function, similarly tracked.

The entire `qa_pipeline` itself is also tracked with `@track(type="general")`. This nested tracking is where Opik truly shines. When you run this pipeline, you don’t just get an answer; you get a complete visual trace of how that answer was arrived at. You can see the `retrieve_context` span, followed by the `plan_answer` span (which includes the LLM call), and finally the `answer_from_plan` span. This comprehensive view is invaluable for debugging, understanding latency, and pinpointing exactly where an issue might arise.
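
Wiring it all together could look roughly like the sketch below, which reuses the `generate_text` helper and prompt templates from earlier and assumes `Prompt.format()` fills in the template placeholders:

```python
from opik import track


@track(type="llm")
def plan_answer(context: str, question: str) -> str:
    """First LLM call: turn the context and question into a short plan."""
    return generate_text(plan_prompt.format(context=context, question=question))


@track(type="llm")
def answer_from_plan(context: str, question: str, plan: str) -> str:
    """Second LLM call: turn the plan into a concise final answer."""
    return generate_text(
        answer_prompt.format(context=context, question=question, plan=plan)
    )


@track(type="general")
def qa_pipeline(question: str) -> str:
    """Top-level span: retrieval -> planning -> answering, nested in one trace."""
    context = retrieve_context(question)
    plan = plan_answer(context, question)
    return answer_from_plan(context, question, plan)


# A single call produces one nested trace in the Opik dashboard.
print(qa_pipeline("What is Opik used for?"))
```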

The Crucial Step: Data-Driven Evaluation and Reproducibility

A transparent pipeline is great, but transparency without evaluation is like having a perfectly detailed map without knowing if you’re going in the right direction. This is where Opik’s evaluation capabilities become indispensable. To truly understand our pipeline’s quality, we need to measure its performance against a ground truth.

Building the Dataset: Our Source of Truth

The first step in evaluation is creating a dataset. With Opik, this is straightforward. We use `client.get_or_create_dataset` to establish our evaluation dataset and then `insert` multiple question-answer pairs, complete with their expected contexts and a `reference` (the golden answer). This dataset isn’t just a collection of examples; it’s the bedrock for our quantitative assessment, allowing us to consistently measure performance against a known standard. Having this housed within Opik ensures that our evaluation setup is just as reproducible as our pipeline.
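
A small sketch of that dataset setup, with an illustrative dataset name and example items that pair each question with its context and golden answer:

```python
from opik import Opik

client = Opik()

# get_or_create_dataset makes the run idempotent: re-running reuses the same dataset.
dataset = client.get_or_create_dataset(name="local-qa-eval")

dataset.insert([
    {
        "question": "What is Opik used for?",
        "context": DOCS["opik"],
        "reference": "Opik is used to trace and evaluate LLM applications.",
    },
    {
        "question": "What does tracing record?",
        "context": DOCS["tracing"],
        "reference": "Tracing records each pipeline step with its inputs, outputs, and timing.",
    },
])
```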

Defining the Task and Metrics

Next, we define the `evaluation_task`. This is a simple function that takes an item from our dataset (a question, context, and reference) and runs it through our `qa_pipeline`. It then returns the pipeline’s `output` along with the `reference` answer from the dataset. This format is crucial because Opik’s scoring metrics expect these two values for comparison.
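
The task itself stays tiny; something along these lines, assuming the item keys used in the dataset sketch above:

```python
def evaluation_task(item: dict) -> dict:
    """Run one dataset item through the pipeline and pair the output with its reference."""
    return {
        "output": qa_pipeline(item["question"]),
        "reference": item["reference"],
    }
```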

For measuring output quality, we select two common metrics:

  • `Equals()`: A simple, strict metric that checks for exact matches. Ideal for knowing if our model produced the precise expected answer.
  • `LevenshteinRatio()`: A more forgiving metric that calculates the similarity between two strings based on the number of edits (insertions, deletions, substitutions) needed to change one into the other. This helps us understand how *close* our model’s answer is, even if it’s not an exact match.

By using both, we get a nuanced view: exact correctness alongside approximate, character-level closeness.
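
Both metrics can be sanity-checked on their own before wiring them into an experiment, using the `output`/`reference` keyword arguments described above; the strings here are just a toy illustration:

```python
from opik.evaluation.metrics import Equals, LevenshteinRatio

exact = Equals()
fuzzy = LevenshteinRatio()

# Identical strings score 1.0 on both metrics...
print(exact.score(output="Paris", reference="Paris").value)          # 1.0
# ...while a near miss fails the exact check but keeps partial credit on the ratio.
print(exact.score(output="Paris, France", reference="Paris").value)  # 0.0
print(fuzzy.score(output="Paris, France", reference="Paris").value)  # between 0 and 1
```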

Running and Aggregating the Evaluation

The culmination of our efforts is running the evaluation experiment. Opik’s `evaluate` function takes our dataset, our `evaluation_task`, the `scoring_metrics`, and an `experiment_name`. Running this function executes our `qa_pipeline` for each item in the dataset, applies the metrics, and records all the results. The beauty here is that once the evaluation is complete, Opik provides an `experiment_url`. This URL leads directly to the experiment details in the Opik dashboard, allowing us to visually inspect each trace, review individual scores, and understand aggregate performance. This link is the cornerstone of reproducibility – anyone with access can see the exact results of that specific experiment.
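
In code, that final step might look like the sketch below, reusing the dataset and task defined above; the experiment name is an illustrative choice, and the `experiment_url` attribute follows the post’s description of the result object:

```python
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio

evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Equals(), LevenshteinRatio()],
    experiment_name="distilgpt2-qa-baseline",
)

# Link back to the experiment in the Opik dashboard for inspection and sharing.
print(evaluation.experiment_url)
```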

Finally, we aggregate and print the evaluation scores. This step provides a concise summary of how well our pipeline performed across the chosen metrics. Seeing these numbers directly helps us identify strengths and weaknesses, informing our next steps for improvement. Perhaps the `Equals` metric is low, but `LevenshteinRatio` is high, suggesting our model is often close but struggles with exact phrasing. These insights are invaluable.

The Future is Transparent and Measured

What we’ve built here, though small, is a complete, fully functional LLM evaluation ecosystem powered by Opik and a local model. We’ve seen how traces, structured prompts, datasets, and robust metrics coalesce to provide unparalleled transparency into the model’s reasoning process. This isn’t just about getting an answer; it’s about understanding *how* that answer was derived, *why* it might be wrong, and *how much* better your next iteration is.

This level of instrumentation transforms LLM development from a guessing game into a scientific process. Opik empowers developers to iterate quickly, experiment systematically, and validate improvements with structured, reliable data. The days of the LLM black box are numbered, replaced by workflows that are inherently transparent, consistently measurable, and effortlessly reproducible.

