LLM Observability with LangSmith: A Practical Guide

In the rapidly evolving landscape of Large Language Model (LLM) applications, observability has become a critical concern for developers. As these systems grow more complex, understanding what happens during execution, tracking performance, debugging issues, and refining prompts becomes increasingly challenging. Enter LangSmith, a powerful observability platform designed specifically for LLM-powered applications.
In this article, we'll walk through a practical implementation of LangSmith with a LangChain application. We'll explore how it enhances visibility into your LLM chains, supports robust testing, and enables structured prompt evaluation to ensure high-quality outputs, helping you catch potential issues before they impact production.
Setting Up a Basic LangChain Application
Let's start with a simple recipe generation application built using LangChain. This application takes inputs like dish name, maximum number of ingredients, difficulty level, and cooking time to generate a customized recipe.
Here's the foundation of our application:
{{qq-border-start}}
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
# Load environment variables
load_dotenv()
# Initialize components
llm = ChatOpenAI(temperature=0.7)
{{qq-border-end}}
To set up the foundational infrastructure for our LangChain application, we start by importing essential libraries. The os module allows interaction with the operating system, particularly for accessing environment variables. To manage sensitive information like API keys securely, we use dotenv, specifically load_dotenv, which reads a .env file and loads its variables into the environment. This is a best practice for handling configurations without exposing them in the code.
Next, we bring in ChatOpenAI from langchain_openai, which acts as a LangChain wrapper for OpenAI’s chat models (such as GPT-3.5 or GPT-4). To structure our prompts efficiently, we import ChatPromptTemplate from langchain.prompts, allowing us to define dynamic and reusable prompt formats. Finally, we initialize ChatOpenAI with a temperature value of 0.7, striking a balance between creativity and consistency: higher values introduce more randomness, while lower ones make responses more predictable.
Our application uses a prompt template that instructs the LLM to generate a recipe with specific constraints:
{{qq-border-start}}
recipe_template = """
You are a helpful assistant that generates cooking recipes.
Generate a recipe for {dish} with the following constraints:
- Use only {max_ingredients} ingredients or fewer
- The recipe should be {difficulty} to prepare
- It should take approximately {time_minutes} minutes to cook
Please format your response with the following sections:
- Ingredients
- Instructions
- Tips
"""
recipe_prompt = ChatPromptTemplate.from_template(recipe_template)
{{qq-border-end}}
When using ChatPromptTemplate.from_template(recipe_template), we transform a simple template string into a structured prompt object that dynamically fills in placeholders with user-provided values. This keeps interactions with the language model consistent and well organized while still allowing the prompt to be customized for each request.
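To see what the template produces, here is a minimal sketch of rendering it directly with arbitrary example values (illustrative only, not part of the original application):
{{qq-border-start}}
# Illustrative: render the prompt with sample values to inspect what the LLM will see
messages = recipe_prompt.format_messages(
    dish="vegetable soup",   # example values chosen for illustration
    max_ingredients=6,
    difficulty="easy",
    time_minutes=30
)
print(messages[0].content)  # the filled-in instruction text sent to the model
{{qq-border-end}}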
Integrating LangSmith for Observability
Now, let's enhance our application with LangSmith observability. The integration requires just a few lines of code:
{{qq-border-start}}
# Configure LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY")
os.environ["LANGCHAIN_PROJECT"] = "langsmith_demo_murf"
# Initialize LangSmith client
from langsmith import Client
client = Client()
{{qq-border-end}}
To enable tracing in LangChain and gain deeper insights into your application's execution, you need to configure a few environment variables:
- Enable Tracing: os.environ["LANGCHAIN_TRACING_V2"] = "true" activates LangChain’s tracing feature, capturing step-by-step execution details.
- Set API Key: os.getenv("LANGSMITH_API_KEY") securely retrieves your LangSmith API key from the environment.
- Define Project Name: os.environ["LANGCHAIN_PROJECT"] = "langsmith_demo_murf" groups every trace from this application under a single project (here named langsmith_demo_murf) in the LangSmith dashboard.
Once these variables are set, you can initialize the LangSmith client:
- Import the Client: from langsmith import Client brings in the LangSmith library.
- Initialize the Client: client = Client() authenticates the client using your API key.
With this setup, LangChain automatically sends trace data to LangSmith, with no additional code modifications required.
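For reference, the .env file that load_dotenv() reads might look something like the snippet below; the key names match the variables used above, and the values are placeholders:
{{qq-border-start}}
# .env - placeholder values only; never commit real keys to version control
OPENAI_API_KEY=your-openai-api-key
LANGSMITH_API_KEY=your-langsmith-api-key
{{qq-border-end}}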
Building a Robust Chain with Input Validation
One critical aspect of LLM applications is input validation. LangSmith becomes particularly valuable when tracing the execution flow through different validation steps.
Let's implement a validation function:
{{qq-border-start}}
def validate_inputs(input_dict):
    """Essential input validation for recipe generation"""
    errors = []
    # Required field check
    if "dish" not in input_dict:
        errors.append("'dish' field is required")
    else:
        dish = input_dict["dish"].strip()
        if not dish:
            errors.append("Dish name cannot be empty")
    # Type and basic constraints
    try:
        max_ingredients = int(input_dict.get("max_ingredients", 5))
        if max_ingredients < 1:
            errors.append("Ingredients must be at least 1")
    except ValueError:
        errors.append("Ingredients must be a valid number")
    # Difficulty validation
    difficulty = input_dict.get("difficulty", "easy")
    if difficulty not in ["easy", "medium", "hard"]:
        errors.append("Invalid difficulty level (easy/medium/hard)")
    # Time validation
    try:
        time_minutes = int(input_dict.get("time_minutes", 30))
        if time_minutes < 1:
            errors.append("Cooking time must be at least 1 minute")
    except ValueError:
        errors.append("Cooking time must be a valid number")
    if errors:
        raise ValueError("\n".join(errors))
    return input_dict
{{qq-border-end}}
The validate_inputs function ensures that all user inputs for recipe generation meet basic quality standards before processing. It checks for required fields, data types, and sensible constraints, preventing errors later in the workflow.
First, it verifies that the dish field is present and not empty. Then, it ensures max_ingredients is a valid integer and at least 1, avoiding unrealistic values. The function also checks if the difficulty level is one of the accepted options: "easy," "medium," or "hard." Finally, it validates the time_minutes field, ensuring it is a number and at least 1 minute.
If any validation fails, a ValueError is raised with a detailed message listing all issues at once. This structured validation prevents bad data from entering the system, improving reliability.
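As a quick illustration of that behavior, here is a minimal sketch of calling the validator directly with deliberately bad input (not part of the original application):
{{qq-border-start}}
# Illustrative only: exercise the validator with an empty dish name and a
# non-numeric ingredient count to see the combined error message
try:
    validate_inputs({"dish": "   ", "max_ingredients": "three"})
except ValueError as e:
    print(e)  # reports both problems in a single, newline-separated message
{{qq-border-end}}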
Now we'll construct our chain with the input validation integrated:
{{qq-border-start}}
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
recipe_chain = (
    RunnablePassthrough.assign(
        max_ingredients=lambda x: x.get("max_ingredients", 5),
        difficulty=lambda x: x.get("difficulty", "easy"),
        time_minutes=lambda x: x.get("time_minutes", 30)
    )
    | RunnableLambda(validate_inputs)
    | recipe_prompt
    | llm
    | StrOutputParser()
)
{{qq-border-end}}
This LangChain pipeline ensures structured and efficient AI interactions. It starts with RunnablePassthrough.assign(), which fills in missing parameters with defaults (for example, setting max_ingredients to 5 if it isn't provided), ensuring the input data is complete. The pipeline then flows sequentially through the | operator, with RunnableLambda(validate_inputs) checking the inputs and either raising a descriptive error or passing the validated data forward.
Next, recipe_prompt formats the validated input into a structured prompt for the llm (ChatOpenAI), which sends it to OpenAI’s API and returns a response. Finally, StrOutputParser() extracts and returns the AI’s output as a clean string.
This modular design follows the functional composition pattern, making each step independently testable and traceable with LangSmith, ensuring flexibility and maintainability.
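Putting it all together, invoking the chain might look like the sketch below (the input values mirror the "Chicken Stir-Fry" run discussed later and are otherwise arbitrary). With tracing enabled, this single call produces a complete trace in LangSmith:
{{qq-border-start}}
# Illustrative invocation - input values are example data
recipe = recipe_chain.invoke({
    "dish": "Chicken Stir-Fry",
    "max_ingredients": 8,
    "difficulty": "medium",
    "time_minutes": 45
})
print(recipe)  # formatted recipe text with Ingredients, Instructions, and Tips
{{qq-border-end}}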
Analyzing Results in the LangSmith Dashboard
Now let's examine what our test runs look like in the LangSmith dashboard.
Overview of Runs
[Screenshot: LangSmith dashboard overview of test runs]
Breaking Down the Dashboard UI (as shown above)
At the core of the dashboard, each test run is displayed with key details, including:
- Success vs. Failure Status – Green check marks indicate successful runs, while red error icons signal failures. For example, in the screenshot, one run encountered an OutputParserException, which suggests an issue with processing the model’s response.
- Input and Output Logs – Each test run logs its input query (e.g., "Can I add vegetables..."), along with the generated output. This helps in tracing responses and identifying potential mismatches or errors in processing.
- Error Messages for Debugging – Any failed run is accompanied by an error message, making it easier to diagnose and fix issues. In this case, the failed test run for "pancakes" has an error related to output parsing.
- Latency Metrics – The dashboard tracks response times, helping assess model efficiency. Here, we see varying response times across runs, with latencies ranging from 1.42s to 3.82s. The latency percentile metrics (P50: 2.75s, P99: 3.82s) show the median and worst-case performance.
- Token Usage Insights – The stats panel on the right shows the total token usage (1,956 tokens) and the median tokens per run (333 tokens). Monitoring token usage is crucial for cost efficiency, especially when working with large-scale AI models.
- Error Rate Analysis – The error rate stands at 17%, meaning that roughly one out of every six runs encountered a failure. A lower error rate is preferable, so tracking this metric over time can help improve model reliability.
The dashboard surfaces many more metrics and parameters beyond these.
Examining a Successful Run
[Screenshot: Trace details for the successful "Chicken Stir-Fry" run]
For our successful run with "Chicken Stir-Fry", we can see:
- The complete input parameters (difficulty: medium, dish: Chicken Stir-Fry, max ingredients: 8, time: 45 minutes)
- The full output including ingredients, instructions
- The execution time (3.83s)
- Token usage (368 tokens)
This view helps us understand what a successful execution looks like and provides a reference point for comparing against failed runs.
Investigating a Parser Error
[Screenshot: Trace showing an OutputParserException in the chain]
When we attempt to use a JSON parser on a text response, we get an OutputParserException. LangSmith shows:
- The exact error message: "OutputParserException('Invalid json output: Ingredients...')"
- The execution waterfall, showing each step in the chain and where the error occurred
- Total tokens used (241)
- Execution time (2.13s)
This visibility is crucial for debugging complex chains where errors can occur at any stage.
Analyzing Validation Errors
[Screenshot: Trace showing a validation error for an empty dish name]
When providing an empty dish name (in the above screenshot), LangSmith shows:
- The validation error: "ValueError('Dish name cannot be empty')"
- The input that caused the error (empty string for "dish")
- The execution trace showing the error occurred in the validation step
[Screenshot: Trace showing a validation error for a missing "dish" field]
Similarly, when the "dish" field is entirely missing (as shown in the above provided screenshot):
- The error is clearly shown: "ValueError('dish' field is required)"
- We can see the input only contained "Max Ingredients: 5"
- The execution trace helps pinpoint where the validation failed
Enhancing LLM Applications with Structured Prompt Evaluation
One of the most powerful features of LangSmith is its ability to systematically evaluate your LLM prompts and responses. Let’s explore how to implement a structured evaluation system using Pydantic models to assess the quality of your LLM outputs against reference answers.
First, we'll need to set up our environment and dependencies:
{{qq-border-start}}
# Import dependencies
from langsmith import wrappers, Client
from pydantic import BaseModel, Field, ValidationError
from openai import OpenAI
import os
from dotenv import load_dotenv
import json
# Initialize clients
client = Client()
openai_client = wrappers.wrap_openai(OpenAI())
{{qq-border-end}}
The wrappers.wrap_openai() function is particularly important - it instruments your OpenAI client to automatically log all requests and responses to LangSmith, giving you complete visibility into the evaluation process.
Next, we'll create a dataset of example questions and reference answers:
{{qq-border-start}}
# Create a dataset
examples = [
    (
        "What is the capital of France?",
        "The capital of France is Paris."
    ),
    (
        "Who wrote 'Romeo and Juliet'?",
        "William Shakespeare wrote 'Romeo and Juliet'."
    ),
    # Additional examples...
]
inputs = [{"question": input_prompt} for input_prompt, _ in examples]
outputs = [{"answer": output_answer} for _, output_answer in examples]
# Create the dataset in LangSmith
dataset_name = "Knowledge Dataset"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="A dataset of factual knowledge questions for evaluating LLM responses."
)
# Add examples to the dataset
client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
{{qq-border-end}}
This step sets up the foundation for evaluation by defining a dataset. The `examples` list pairs questions with their expected answers, like "What is the capital of France?" with "The capital of France is Paris." We then split these into `inputs` and `outputs` using list comprehensions—`inputs` holds dictionaries with the "question" key, and `outputs` holds dictionaries with the "answer" key. This structure aligns with LangSmith’s expectations.
Using `client.create_dataset()`, we register a new dataset named "Knowledge Dataset" with a description that clarifies its purpose: testing factual knowledge. The returned `dataset` object gives us an ID, which we use in `client.create_examples()` to populate the dataset with our question-answer pairs. This dataset serves as the benchmark for evaluating how well our LLM’s prompts elicit correct responses.
Creating Structured Evaluators with Pydantic
The real power of our approach comes from using Pydantic models to create structured evaluators. Let's define two evaluators - one for accuracy and one for completeness:
Accuracy Evaluator:
{{qq-border-start}}
class AccuracyGrade(BaseModel):
    """Pydantic model for accuracy evaluation results."""
    score: bool = Field(
        description="Boolean indicating whether the response is factually accurate compared to the reference answer"
    )
    explanation: str = Field(
        description="Brief explanation of why the answer was graded as accurate or inaccurate"
    )

def accuracy_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """
    Evaluates the factual accuracy of the model's response compared to the reference answer.
    Outputs a JSON object parsed into AccuracyGrade using Pydantic.
    """
    system_message = """
    You are to evaluate the factual accuracy of a model's response compared to a reference answer.
    Your evaluation should focus on whether the core facts in the model's response match those in the reference answer,
    regardless of the phrasing.
    Your output should be a JSON object with the following fields:
    - score: a boolean indicating whether the response is factually accurate (True if accurate, False otherwise)
    - explanation: a brief explanation of why the answer was graded as accurate or inaccurate
    """
    user_message = f"Reference answer: {reference_outputs['answer']}\nModel answer: {outputs['response']}"
    # Request JSON response format for structured output
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        response_format={"type": "json_object"},
    )
    # Parse and validate the response
    try:
        json_response = json.loads(response.choices[0].message.content)
        accuracy_grade = AccuracyGrade(**json_response)
        return accuracy_grade.model_dump()
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Error in accuracy_evaluator: {e}")
        return AccuracyGrade(score=False, explanation="Invalid response format").model_dump()
{{qq-border-end}}
For the accuracy evaluator, we start with `AccuracyGrade`, a Pydantic class that defines the structure of our evaluation results. It has a `score` field, a boolean that flags whether the LLM’s response matches the reference answer factually, and an `explanation` field, a string that justifies the score. The `Field` function adds descriptions to document their purpose clearly.
The `accuracy_evaluator` function takes two arguments: `outputs` (the LLM’s response) and `reference_outputs` (the expected answer), both as dictionaries. Inside, we define a `system_message` that instructs an LLM (acting as an evaluator) to compare the two answers for factual accuracy, ignoring phrasing differences, and return a JSON object with `score` and `explanation`. The `user_message` combines the reference and model answers for the evaluator to analyze.
We call `openai_client.chat.completions.create()` with the "gpt-4o-mini" model, passing the system and user messages, and specify `response_format={"type": "json_object"}` to ensure structured output. The response is parsed with `json.loads()` into a dictionary, then validated against `AccuracyGrade`.
Completeness Evaluator:
{{qq-border-start}}
class CompletenessGrade(BaseModel):
    """Pydantic model for completeness evaluation results."""
    score: float = Field(
        description="Score from 0.0 to 1.0 indicating how complete the response is",
        ge=0.0,
        le=1.0
    )
    missing_elements: str = Field(
        description="Any important elements from the reference that are missing in the response"
    )

def completeness_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """
    Evaluates how complete the model's response is compared to the reference answer.
    """
    system_message = """
    You are to evaluate how complete the model's response is compared to the reference answer.
    Your output should be a JSON object with the following fields:
    - score: a float from 0.0 to 1.0 indicating how complete the response is
    - missing_elements: a string describing any important elements from the reference that are missing in the response
    """
    # Similar implementation to accuracy_evaluator
    # ...
{{qq-border-end}}
The completeness evaluator follows a similar pattern. The `CompletenessGrade` class uses Pydantic to define a `score` field, a float between 0.0 and 1.0 that quantifies how much of the reference answer’s content is present in the LLM’s response, with `ge` and `le` constraints ensuring it stays within bounds. The `missing_elements` field captures any key details omitted, stored as a string.
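The elided body would mirror the accuracy evaluator. As a hedged sketch (not the article's exact code), filling it in might look like this:
{{qq-border-start}}
# Hedged sketch: one way the elided completeness_evaluator body could be
# completed, mirroring accuracy_evaluator
def completeness_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Evaluates how complete the model's response is compared to the reference answer."""
    system_message = """
    You are to evaluate how complete the model's response is compared to the reference answer.
    Your output should be a JSON object with the following fields:
    - score: a float from 0.0 to 1.0 indicating how complete the response is
    - missing_elements: a string describing any important elements from the reference that are missing in the response
    """
    user_message = f"Reference answer: {reference_outputs['answer']}\nModel answer: {outputs['response']}"
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        response_format={"type": "json_object"},
    )
    # Parse and validate the evaluator's JSON output against CompletenessGrade
    try:
        json_response = json.loads(response.choices[0].message.content)
        return CompletenessGrade(**json_response).model_dump()
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Error in completeness_evaluator: {e}")
        return CompletenessGrade(score=0.0, missing_elements="Invalid response format").model_dump()
{{qq-border-end}}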
The use of Pydantic models offers several advantages:
- Structured Output: Ensures evaluations follow a consistent format
- Validation: Automatically validates that evaluations contain all required fields
- Documentation: Self-documents the expected format through type hints and field descriptions
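One piece the snippets above don't show is the target function passed to client.evaluate(): the function under test that generates an answer for each dataset example. As a hedged sketch (the original target isn't shown in this article), a minimal version might query the wrapped OpenAI client and return its answer under the "response" key, matching the outputs['response'] lookup in the evaluators:
{{qq-border-start}}
# Hypothetical target function (assumed, not shown in the original code):
# receives a dataset example's inputs and returns the model's answer
def target(inputs: dict) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question concisely and accurately."},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    # The key name must match what the evaluators read: outputs['response']
    return {"response": response.choices[0].message.content}
{{qq-border-end}}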
With our evaluators defined, we can now run the evaluation:
{{qq-border-start}}
experiment_prefix = "pydantic-structured-eval"
experiment_results = client.evaluate(
    target,
    data=dataset.id,
    evaluators=[
        accuracy_evaluator,
        completeness_evaluator
    ],
    experiment_prefix=experiment_prefix,
    max_concurrency=2
)
{{qq-border-end}}
This will:
- Run the target function on each example in our dataset
- Apply both evaluators to compare the results against reference answers
- Log everything to LangSmith for analysis
Analyzing Evaluation Results in the LangSmith Dashboard
The LangSmith dashboard provides powerful visualizations of your evaluation results. Let's look at what we can see:
Individual Evaluator Runs
[Screenshot: Individual evaluator run details in LangSmith]
As shown in the above image, we can drill down into individual evaluator runs to inspect:
- The exact instructions given to the evaluator
- The reference answer and model's response being compared
- The structured output of the evaluation
- Performance metrics like latency and token usage
For the completeness evaluator, we can see the system prompt instructing the model to evaluate "how complete the response is compared to the reference answer" with a structured output format. The evaluator produces a score between 0.0 and 1.0, along with notes on any missing elements.
Accuracy Evaluation Details
[Screenshot: Accuracy evaluator run details]
The above image shows the accuracy evaluator in action. We can see:
- The evaluator's system prompt and instructions
- The reference and model answers being compared
- The structured output with the score (true/false) and explanation
- In this case, the model correctly identified that "Romeo and Juliet was written by William Shakespeare" matches the reference answer
Experiment Overview with Heat Map
[Screenshot: Experiment overview with evaluation score heat map]
The above image provides a comprehensive view of the entire evaluation experiment:
- Each row represents a different example from our dataset
- Columns show the input question, reference output, model output, and evaluation scores
- The color-coded heat map makes it easy to spot patterns and outliers
- We can see that most responses score high on both accuracy (1.0) and completeness (0.9-1.0)
- The latency column shows how long each evaluation took
This visualization makes it easy to identify which questions have lower scores, so you can focus your prompt engineering efforts where they'll have the most impact.
Key Benefits of LangSmith Observability
Based on our implementation and dashboard exploration, we can identify several key benefits:
- End-to-end Tracing: LangSmith provides visibility into each step of your LLM chain execution, from input validation to prompt processing and output generation, making it easy to trace the flow of data through complex workflows.
- Error Diagnosis: The platform pinpoints where and why errors occur—whether in validation logic, LLM execution, output parsing, or prompt failures—streamlining debugging across all components.
- Performance Monitoring: Metrics like token usage, latency, and error rates help assess your application’s efficiency, including how prompt design impacts response times and resource consumption.
- Input/Output Inspection: Seeing exact inputs and outputs is invaluable not just for debugging code but also for evaluating prompt effectiveness, ensuring the LLM produces relevant and accurate responses.
- Structured Prompt Evaluation: With tools like Pydantic models and custom evaluators (e.g., accuracy and completeness), LangSmith enables systematic assessment of prompt quality against reference answers, enhancing reliability and precision in LLM outputs.
- Execution Visualization: The waterfall view reveals the sequence and timing of chain components, including prompt execution, helping optimize both code and prompt performance.
By combining these capabilities, LangSmith empowers developers to monitor code execution while refining prompts to meet specific goals, such as factual accuracy or response completeness.
Limitations and Best Practices
While LangSmith greatly enhances observability, there are some limitations and best practices to keep in mind:
Limitations
- Storage Considerations: Extensive tracing of execution runs and prompt evaluations can generate significant data, especially for high-volume applications with large datasets.
- Sensitive Data: Inputs, outputs, and prompt responses logged in LangSmith may contain sensitive information, requiring careful handling to ensure privacy.
- Performance Overhead: Tracing code execution and evaluating prompts adds a small amount of overhead per call, which can accumulate in high-volume or latency-sensitive systems.
- UI Complexity: As runs, tests, and prompt evaluations scale, the dashboard can become overwhelming, necessitating careful organization of projects and datasets.
- Evaluation Setup Cost: While powerful, setting up structured prompt evaluation (e.g., defining Pydantic models and datasets) requires initial effort and expertise.
Best Practices
- Structured Error Handling: Use structured error handling (e.g., detailed validation errors) to make failures in code or prompt execution traceable and actionable.
- Descriptive Naming: Assign meaningful names to projects, chains, and datasets (e.g., “Knowledge Dataset” for prompt evaluation) for easier navigation across monitoring and evaluation tasks.
- Strategic Tracing: Tailor tracing levels for different environments—full tracing in development for debugging code and prompts, selective tracing in production for efficiency (see the sketch after this list).
- Regular Monitoring: Routinely review the LangSmith dashboard to spot trends in execution errors, performance bottlenecks, or prompt quality issues like low accuracy scores.
- Robust Validation Layer: Implement strong input validation (as shown in our recipe chain) to catch issues early, ensuring both code stability and prompt reliability.
- Component Isolation: Design chains to separate validation, prompting, and parsing logic, simplifying debugging and enabling targeted prompt evaluation (e.g., accuracy vs. completeness).
- Leverage Structured Evaluation: Use Pydantic-based evaluators and datasets to systematically assess prompt performance, refining them based on dashboard insights like heat maps or individual run details.
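To make the strategic-tracing point concrete, here is a minimal sketch of toggling tracing per environment before building the chain; the APP_ENV variable is an assumed convention for illustration, not part of LangSmith or LangChain:
{{qq-border-start}}
import os

# Assumed convention for illustration: APP_ENV distinguishes environments
if os.getenv("APP_ENV", "development") == "development":
    os.environ["LANGCHAIN_TRACING_V2"] = "true"   # full tracing while debugging
else:
    os.environ["LANGCHAIN_TRACING_V2"] = "false"  # disable (or sample) in production
{{qq-border-end}}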
