Monitors passively score production traffic to surface trends and issues without requiring code changes. They run asynchronously in the background and have no impact on your application’s control flow.

When to use monitors:
  • You want to observe and analyze production traffic without modifying your code
  • You need to surface trends and issues over time
  • You want configurable sampling (for example, score 10% of calls to reduce costs)
  • You’re evaluating both text and audio outputs
Monitors are ideal for ongoing observation and analysis. You can set them up entirely in the W&B UI without changing your code. All scorer results are automatically stored in Weave’s database, allowing you to analyze historical trends and patterns. If you need to actively intervene in your application’s behavior based on scores, use guardrails instead.
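For example, the application code that a monitor observes only needs to be traced with Weave; the monitor itself is created later in the UI. The sketch below uses a placeholder project name and a hypothetical answer_question op:

import weave
import openai

weave.init("my-team/my-weave-project")

client = openai.OpenAI()

@weave.op()
def answer_question(question: str) -> str:
    # Ordinary production code: a monitor scores a sample of these calls
    # asynchronously in the background, so nothing here changes when you add one.
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content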

How to create a monitor in Weave

To create a monitor in Weave:
  1. Open the W&B UI and then open your Weave project.
  2. From the Weave side-nav, select Monitors and then select the + New Monitor button. This opens the Create new monitor menu.
  3. In the Create new monitor menu, configure the following fields:
    • Name: Must start with a letter or number. Can contain letters, numbers, hyphens, and underscores.
    • Description (optional): Explain what the monitor does.
    • Active monitor toggle: Turn the monitor on or off.
    • Calls to monitor:
      • Operations: Choose one or more ops (functions decorated with @weave.op) to monitor. You must log at least one trace that uses an op before it appears in the list of available ops.
      • Filter (optional): Narrow down which calls are eligible (for example, by max_tokens or top_p).
      • Sampling rate: The percentage of calls to score (0% to 100%).
        A lower sampling rate reduces cost, because every sampled call results in a separate scoring call to the judge model.
    • LLM-as-a-judge configuration:
      • Scorer name: Must start with a letter or number. Can contain letters, numbers, hyphens, and underscores.
      • Judge model: Select the model used to score your ops. The menu lists W&B Inference models and any commercial LLM models you have configured in your W&B account. Audio-enabled models have an Audio Input label beside their names. For the selected model, configure the following settings:
        • Configuration name: A name for this model configuration.
        • System prompt: Defines the judging model’s role and persona, for example, “You are an impartial AI judge.”
        • Response format: The format in which the judge should return its response, such as json_object or plain text.
        • Scoring prompt: The evaluation task used to score your ops. You can reference variables from your ops in your scoring prompts. For example, “Evaluate whether {output} is accurate based on {ground_truth}.” See Prompt variables.
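The variables available in a scoring prompt come from the monitored op, as the truthfulness example below illustrates: input parameters are referenced by name, and {output} refers to the op’s return value. For the hypothetical answer_question op sketched in the introduction, a scoring prompt might be:

# answer_question(question: str) -> str   (hypothetical op from the introduction)
# {question} maps to the op's input parameter; {output} maps to its return value.
scoring_prompt = "Evaluate whether {output} correctly and completely answers {question}."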
Once you have configured the monitor’s fields, click Create monitor. This adds the monitor to your Weave project. When your code starts generating traces, you can review the scores in the Traces tab by selecting the monitor’s name and reviewing the data in the resulting panel. Weave automatically stores all scorer results in the Call object’s feedback field.
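Because scorer results are stored as feedback, you can also read them programmatically instead of (or in addition to) reviewing them in the UI. The sketch below is one way to do this; the exact query helpers vary between Weave SDK versions, so treat the method and attribute names as assumptions and check the current reference docs:

import weave

client = weave.init("my-team/my-weave-project")

# Assumption: get_feedback() and these attribute names exist in your SDK version.
# Each feedback record carries its data in a payload (the exact shape depends on the scorer).
for feedback in client.get_feedback():
    print(feedback.feedback_type, feedback.payload)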

Example: Create a truthfulness monitor

The following example creates a monitor that evaluates the truthfulness of generated statements.
  1. Define a function that generates statements. Some are truthful, others are not:
import weave
import random
import openai

weave.init("my-team/my-weave-project")

client = openai.OpenAI()

@weave.op()
def generate_statement(ground_truth: str) -> str:
    # About half the time, ask the model to produce an incorrect statement;
    # otherwise return the ground truth unchanged.
    if random.random() < 0.5:
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {
                    "role": "user",
                    "content": f"Generate a statement that is incorrect based on this fact: {ground_truth}"
                }
            ]
        )
        return response.choices[0].message.content
    else:
        return ground_truth
  2. Run the function at least once to log a trace in your project. This allows you to set up a monitor in the W&B UI:
generate_statement("The Earth revolves around the Sun.")
  3. Open your Weave project in the W&B UI and select Monitors from the side-nav. Then select New Monitor.
  4. In the Create new monitor menu, configure the fields using the following values:
    • Name: truthfulness-monitor
    • Description: Evaluates the truthfulness of generated statements.
    • Active monitor: Toggle on.
    [Screenshot: Creating a monitor, part 1]
    • Operations: Select generate_statement.
    • Sampling rate: Set to 100% to score every call.
    [Screenshot: Creating a monitor, part 2]
    • Scorer name: truthfulness-scorer
    • Judge model: o3-mini-2025-01-31
    • System prompt: You are an impartial AI judge. Your task is to evaluate the truthfulness of statements.
    • Response format: json_object
    • Scoring prompt:
      Evaluate whether the output statement is accurate based on the input statement.
      
      This is the input statement: {ground_truth}
      
      This is the output statement: {output}
      
      The response should be a JSON object with the following fields:
      - is_true: a boolean stating whether the output statement is true or false based on the input statement.
      - reasoning: your reasoning as to why the statement is true or false.
      
    [Screenshot: Creating a monitor, part 3]
  5. Click Create monitor. This adds the monitor to your Weave project.
  6. Back in your Python script, invoke your function using statements of varying degrees of truthfulness to test the scoring function:
generate_statement("The Earth revolves around the Sun.")
generate_statement("Water freezes at 0 degrees Celsius.")
generate_statement("The Great Wall of China was built over several centuries.")
  7. After running the script using several different statements, open the W&B UI and navigate to the Traces tab. Select any LLMAsAJudgeScorer.score trace to see the results.
[Screenshot: Monitor trace]
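The score stored for each call follows the response format defined in the scoring prompt. For the configuration above, a hypothetical judge result might look like the following (the exact reasoning text will vary):

{
  "is_true": false,
  "reasoning": "The output statement contradicts the input statement that the Earth revolves around the Sun."
}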