Guardrails actively intervene in your application’s behavior based on scores. They run in real time before outputs reach users and can block or modify responses when scores exceed thresholds.
When to use guardrails:
- You need to actively prevent issues (for example, block toxic content or filter sensitive information)
- You must modify or block outputs based on scores
- You require real-time intervention before responses reach users
- You need to enforce safety policies or content guidelines
Weave guardrails use inline Weave Scorers to assess the input from a user or the output from an LLM and adjust the LLM’s responses in real time. You can configure custom scorers or use built-in scorers to assess content for a variety of purposes. The examples in this article show you how to use a built-in scorer and a custom scorer to adjust output from an LLM.
Unlike monitors, guardrails require code changes because they affect your application’s control flow. However, every scorer result from guardrails is automatically stored in Weave’s database, so your guardrails also function as monitors without any extra configuration. You can analyze historical scorer results regardless of how they were originally used.
If you want to passively score production traffic without modifying your application’s control flow, use monitors instead.
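In code, a guardrail wraps a Weave op: you call the op with .call() to get both its result and the Call object, apply a scorer to that call, and branch on the score before anything reaches the user. The snippet below is a minimal sketch of that pattern; my_op and my_scorer are placeholders, and the scorer is assumed to return a dict with a passed key, as the scorers in the complete examples below do.

import weave

weave.init("your-team-name/your-project-name")

# my_op and my_scorer are placeholders; see the full examples below
async def guarded_call(prompt: str) -> str:
    # .call() returns the op's result along with the Call object
    result, call = my_op.call(prompt)

    # The score is also stored in Weave, so it doubles as monitoring data
    score = await call.apply_scorer(my_scorer)

    # Intervene before the result reaches the user
    if not score.result.get("passed", True):
        return "Response blocked by guardrail."
    return result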
Example: Create a guardrail using a built-in moderation scorer
The following example sends user prompts to OpenAI’s GPT-4o mini model. The model’s response is then passed to Weave’s built-in OpenAIModerationScorer, which uses OpenAI’s moderation API to assess whether the response contains harmful or toxic content.
import weave
import openai
from weave.scorers import OpenAIModerationScorer
import asyncio

# Initialize Weave
weave.init("your-team-name/your-project-name")

# Initialize OpenAI client
client = openai.OpenAI()  # Uses OPENAI_API_KEY env var

# Initialize the moderation scorer
moderation_scorer = OpenAIModerationScorer()

# Send prompts to OpenAI
@weave.op
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=200
    )
    return response.choices[0].message.content

# Guardrail function checks responses for toxicity
async def generate_safe_response(prompt: str) -> str:
    """Generate a response with a content moderation guardrail."""
    # Get both the result and the Call object
    result, call = generate_response.call(prompt)

    # Apply the moderation scorer before returning to the user
    score = await call.apply_scorer(moderation_scorer)
    print("This is the score object:", score)

    # Check if the content was flagged
    if not score.result.get("passed", True):
        categories = score.result.get("categories", {})
        flagged_categories = list(categories.keys()) if categories else []
        print(f"Content blocked. Flagged categories: {flagged_categories}")
        return "I'm sorry, I can't provide that response due to content policy restrictions."

    return result

# Run the examples
if __name__ == "__main__":
    prompts = [
        "What's the capital of France?",
        "Tell me a funny fact about dogs.",
    ]
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        response = asyncio.run(generate_safe_response(prompt))
        print(f"Response: {response}")
You can reference variables from your ops in your scoring prompts. For example, “Evaluate whether {output} is accurate based on {ground_truth}.” See prompt variables for more information.
Example: Create a guardrail using a custom scorer
The following example creates a custom guardrail that detects personally identifiable information (PII) in LLM responses, such as email addresses, phone numbers, or social security numbers. This prevents sensitive information from being exposed in generated content.
import weave
import openai
import re
import asyncio
from weave import Scorer

weave.init("your-team-name/your-project-name")

client = openai.OpenAI()

class PIIDetectionScorer(Scorer):
    """Detects PII in LLM outputs to prevent data leaks."""

    @weave.op
    def score(self, output: str) -> dict:
        """
        Check for common PII patterns in the output.

        Returns:
            dict: Contains 'passed' (bool) and 'detected_types' (list)
        """
        detected_types = []

        # Email pattern
        if re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', output):
            detected_types.append("email")

        # Phone number pattern (US format)
        if re.search(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', output):
            detected_types.append("phone")

        # SSN pattern
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', output):
            detected_types.append("ssn")

        return {
            "passed": len(detected_types) == 0,
            "detected_types": detected_types
        }

# Initialize the scorer outside the function so it isn't recreated on every call
pii_scorer = PIIDetectionScorer()

@weave.op
def generate_response(prompt: str) -> str:
    """Generate a response using an LLM."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=200
    )
    return response.choices[0].message.content

async def generate_safe_response(prompt: str) -> str:
    """Generate a response with a PII detection guardrail."""
    result, call = generate_response.call(prompt)

    # Apply the PII detection scorer
    score = await call.apply_scorer(pii_scorer)

    # Block the response if PII is detected
    if not score.result.get("passed", True):
        detected_types = score.result.get("detected_types", [])
        return f"I cannot provide a response that may contain sensitive information (detected: {', '.join(detected_types)})."

    return result

# Example usage
if __name__ == "__main__":
    prompts = [
        "What's the weather like today?",
        "Can you help me contact someone at john.doe@example.com?",
        "Tell me about machine learning.",
    ]
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        response = asyncio.run(generate_safe_response(prompt))
        print(f"Response: {response}")
This guardrail blocks any response that contains email addresses, phone numbers, or SSNs, preventing accidental exposure of sensitive information.
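If blocking the entire response is too strict for your use case, the same score can drive a softer intervention. The sketch below is an illustrative variation rather than part of the Weave API: it reuses the same regular expressions to redact matches instead of refusing to answer.

import re

# Hypothetical helper that redacts the same PII patterns PIIDetectionScorer checks for
PII_PATTERNS = {
    "email": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    "phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    "ssn": r'\b\d{3}-\d{2}-\d{4}\b',
}

def redact_pii(text: str) -> str:
    """Replace detected PII with a [REDACTED] placeholder."""
    for pattern in PII_PATTERNS.values():
        text = re.sub(pattern, "[REDACTED]", text)
    return text

In generate_safe_response, you could then return redact_pii(result) instead of the canned refusal when the scorer reports passed as False.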
Integrate Weave with AWS Bedrock guardrails
The BedrockGuardrailScorer uses AWS Bedrock’s guardrail feature to detect and filter content based on configured policies.
Prerequisites:
- An AWS account with Bedrock access
- A configured guardrail in the AWS Bedrock console
- The boto3 Python package
You don’t need to create your own Bedrock client. Weave creates it for you. To specify a region, pass the bedrock_runtime_kwargs parameter to the scorer.
For details on creating a guardrail in AWS, see the Bedrock guardrails notebook.
import weave
from weave.scorers.bedrock_guardrails import BedrockGuardrailScorer

weave.init("my_app")

# Create the guardrail scorer
guardrail_scorer = BedrockGuardrailScorer(
    guardrail_id="your-guardrail-id",
    guardrail_version="DRAFT",
    source="INPUT",
    bedrock_runtime_kwargs={"region_name": "us-east-1"}
)

@weave.op
def generate_text(prompt: str) -> str:
    # Your text generation logic here
    return "Generated text..."

async def generate_safe_text(prompt: str) -> str:
    result, call = generate_text.call(prompt)

    # Apply the Bedrock guardrail scorer
    score = await call.apply_scorer(guardrail_scorer)

    # Check whether the content passed the guardrail
    if not score.result.passed:
        # If the guardrail produced a filtered version of the content, return it
        if score.result.metadata.get("modified_output"):
            return score.result.metadata["modified_output"]
        return "I cannot generate that content due to content policy restrictions."

    return result
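To run the Bedrock example end to end, you can drive generate_safe_text the same way as the earlier examples. The prompt below is a placeholder, and the guardrail ID, version, and region above must match a guardrail configured in your AWS account.

import asyncio

if __name__ == "__main__":
    prompt = "Tell me about your content policies."
    response = asyncio.run(generate_safe_text(prompt))
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")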