The Open Reward Standard (ORS) is an HTTP-based protocol that standardizes how language model agents interact with reinforcement learning environments.

What is ORS?

ORS specifies a standard interface for connecting language model agents to environments. It defines:
  • How agents discover available tools (actions they can take)
  • How agents access tasks (problems to solve)
  • How agents receive rewards (feedback signals for RL training)
  • How episodes progress until completion (via finished signals)
In ORS, an environment is a server that agents connect to via HTTP. The server implements the ORS protocol, providing endpoints for tool discovery, task retrieval, and tool execution.
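For example, a client can query these endpoints directly over HTTP. A minimal sketch in Python using requests; the server address and the /math/tools route are hypothetical and used purely for illustration (only the task, session, prompt, and call routes appear later on this page):

import requests

BASE = "http://localhost:8000"  # assumed ORS server address

# Hypothetical tool-discovery route; check your server's protocol reference
resp = requests.get(f"{BASE}/math/tools")
resp.raise_for_status()
for tool in resp.json().get("tools", []):
    print(tool["name"], "-", tool["description"])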

Key Principle: Actions are Tools

A fundamental assumption in ORS:
The only way agents interact with environments is by calling tools.
This design decision has important benefits:
  • Leverages existing infrastructure: All major LLM providers support function calling
  • Clear interface boundary: Agent actions are explicit and well-defined
  • Traceable interactions: Every action is a structured function call
  • Type safety: Tools have schemas defining their inputs and outputs
For example, in a math environment, the agent might have access to a submit tool:
{
  "name": "submit",
  "description": "Submit an answer to the current math problem",
  "input_schema": {
    "type": "object",
    "properties": {
      "answer": {"type": "number"}
    },
    "required": ["answer"]
  }
}
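Because every tool ships a JSON Schema, inputs can be validated before execution. A small sketch using the third-party jsonschema package; the validation step illustrates the type-safety point above and is not something the protocol mandates:

from jsonschema import ValidationError, validate

submit_schema = {
    "type": "object",
    "properties": {"answer": {"type": "number"}},
    "required": ["answer"],
}

validate(instance={"answer": 7}, schema=submit_schema)        # passes
try:
    validate(instance={"answer": "7"}, schema=submit_schema)  # "7" is a string, not a number
except ValidationError as err:
    print("Rejected tool input:", err.message)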

Primary Use Case: RL Training

ORS is designed specifically to enable reinforcement learning with language models.

How RL Works with ORS

In reinforcement learning, an agent learns by:
  1. Observing the environment state
  2. Taking actions
  3. Receiving rewards
  4. Learning from the reward signal
ORS provides all these elements:
Agent observes → Gets prompt and tool outputs (state)
Agent acts     → Calls tools (actions)
Agent learns   → Receives tool output and reward signals (feedback)
Episode ends   → Finished signal terminates trajectory
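In code, one step of an ORS episode maps naturally onto the tuples an RL trainer consumes. A hedged sketch; the Step fields are illustrative names, not part of the protocol:

from dataclasses import dataclass

@dataclass
class Step:
    observation: list[dict]  # prompt or tool-output blocks (state)
    action: dict             # the tool call the agent made
    reward: float            # reward returned with the tool output
    finished: bool           # True when the trajectory ends

# One recorded step from the math example below:
step = Step(
    observation=[{"type": "text", "text": "If x + 5 = 12, what is x?", "detail": None}],
    action={"name": "submit", "input": {"answer": 7}},
    reward=1.0,
    finished=True,
)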

Example: Math Problem Solving

Consider training an agent on math problems. Here’s the protocol flow:
1. List available tasks
   POST /math/tasks {"split": "train"}
   → {"tasks": [{"question": "If x + 5 = 12, what is x?", "answer": "7"}, ...]}

2. Create session
   POST /create_session
   → {"sid": "session-123"}

3. Create episode with a task
   POST /create
   Headers: X-Session-ID: session-123
   Body: {"env_name": "math", "task_spec": {"question": "If x + 5 = 12, what is x?", "answer": "7"}}

4. Get initial prompt
   GET /math/prompt
   Headers: X-Session-ID: session-123
   → [{"text": "If x + 5 = 12, what is x?", "detail": null, "type": "text"}]

5. Call submit tool
   POST /math/call
   Headers: X-Session-ID: session-123
   Body: {"name": "submit", "input": {"answer": "7"}}
   → (SSE) {"ok": true, "output": {
       "blocks": [{"text": "Correct!", "detail": null, "type": "text"}],
       "metadata": null,
       "reward": 1.0,
       "finished": true
     }}
Key insight: The reward signal (1.0 for correct, 0.0 or negative for incorrect) allows the agent to learn which actions lead to success.
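The same flow as a short Python script built from the requests above. The server address is an assumption, and the SSE handling assumes the tool output arrives as a single data: event carrying the JSON shown in step 5:

import json
import requests

BASE = "http://localhost:8000"  # assumed server address

# 1. List training tasks for the math environment
tasks = requests.post(f"{BASE}/math/tasks", json={"split": "train"}).json()["tasks"]
task = tasks[0]

# 2. Create a session (one session is one episode)
sid = requests.post(f"{BASE}/create_session").json()["sid"]
headers = {"X-Session-ID": sid}

# 3. Create an episode bound to the chosen task
requests.post(f"{BASE}/create", headers=headers,
              json={"env_name": "math", "task_spec": task})

# 4. Fetch the initial prompt (a list of blocks)
prompt_blocks = requests.get(f"{BASE}/math/prompt", headers=headers).json()

# 5. Call the submit tool and read the streamed result
with requests.post(f"{BASE}/math/call", headers=headers,
                   json={"name": "submit", "input": {"answer": 7}},
                   stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            output = json.loads(line[len("data:"):])["output"]
            print("reward:", output["reward"], "finished:", output["finished"])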

Secondary Use Case: Agent Evaluation

While designed for RL training, ORS also excels at structured evaluation:
  • Standardized benchmarks: Common interface across different environments
  • Train/test splits: Organize tasks for proper evaluation
  • Reproducible results: Same protocol for all agents
  • Diverse task types: From math to coding to web navigation

Core Components

An ORS server provides access to four core components:

1. Tools

Tools are the actions available to agents. Each tool has:
  • A name (e.g., bash, submit, read_file)
  • A description explaining what it does
  • An input schema (JSON Schema) defining parameters
  • A return type (ToolOutput with blocks, reward, finished)
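Viewed as Python data structures, inferred from the payloads shown on this page (field names follow the example responses in the math flow; these are not the SDK's actual class definitions):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Tool:
    name: str            # e.g. "submit"
    description: str     # what the tool does
    input_schema: dict   # JSON Schema for the tool's parameters

@dataclass
class ToolOutput:
    blocks: list[dict]         # text/image blocks returned to the agent
    metadata: Optional[dict]   # optional environment-specific extras
    reward: float              # RL feedback signal for this call
    finished: bool             # True when the episode should terminate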

2. Tasks

Tasks are the problems agents need to solve. Each task is a JSON object containing problem-specific data:
{
  "question": "What is 2+2?",
  "ground_truth": 4,
  "difficulty": "easy"
}
The structure is environment-specific. Math environments have questions and answers. Coding environments have problem descriptions and test cases.

3. Splits

Splits organize tasks into categories:
  • train - Tasks for training agents
  • validation - Tasks for hyperparameter tuning
  • test - Tasks for final evaluation
This matches standard ML practice and prevents overfitting.
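The task-listing endpoint shown in the math example takes the split in the request body, so pulling each split is just three calls. A sketch (server address assumed):

import requests

BASE = "http://localhost:8000"  # assumed server address

for split in ("train", "validation", "test"):
    tasks = requests.post(f"{BASE}/math/tasks", json={"split": split}).json()["tasks"]
    print(f"{split}: {len(tasks)} tasks")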

4. Prompts

Prompts are the initial instructions given to agents for each task. They’re returned as blocks (text or images):
# Agent gets prompt at start of episode
prompt = session.get_prompt()
# → [TextBlock(text="What is 2+2?")]
Prompts can be multi-modal (text + images) and are generated dynamically based on the task.
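Because blocks arrive as a typed list, a common pattern is to flatten the text blocks into the first user message for the model. A minimal sketch using the block fields from the example responses above (image blocks would need a multimodal message format instead):

def blocks_to_user_message(blocks: list[dict]) -> dict:
    text = "\n".join(b["text"] for b in blocks if b.get("type") == "text")
    return {"role": "user", "content": text}

prompt_blocks = [{"text": "What is 2+2?", "detail": None, "type": "text"}]
print(blocks_to_user_message(prompt_blocks))
# → {'role': 'user', 'content': 'What is 2+2?'}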

Episodes (Sessions)

A critical concept in ORS: A session IS an RL episode.

Episode Lifecycle

1. Create session → Start episode with a specific task
2. Get prompt    → Receive initial state
3. Call tools    → Take actions, get rewards
4. Repeat step 3 → Until finished=true
5. End session   → Episode complete
The episode continues until a tool returns finished: true. Unlike a typical API session, the point at which an episode ends carries semantic meaning: it marks task completion (success or failure).
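The lifecycle as a loop, reusing the endpoints from the math example. The agent policy (choose_action) and the SSE-parsing helper (call_tool, sketched in the Protocol Overview section below) are placeholders, and session teardown is omitted because its route is not shown on this page:

import requests

BASE = "http://localhost:8000"  # assumed server address

def run_episode(task: dict, choose_action, call_tool) -> list:
    """One episode: create session, get prompt, call tools until finished."""
    sid = requests.post(f"{BASE}/create_session").json()["sid"]
    headers = {"X-Session-ID": sid}
    requests.post(f"{BASE}/create", headers=headers,
                  json={"env_name": "math", "task_spec": task})

    observation = requests.get(f"{BASE}/math/prompt", headers=headers).json()
    trajectory, finished = [], False
    while not finished:
        action = choose_action(observation)   # agent policy (placeholder)
        output = call_tool(headers, action)   # POST /math/call + SSE parsing
        trajectory.append((observation, action, output["reward"]))
        observation = output["blocks"]
        finished = output["finished"]
    return trajectory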

Episode Example

Episode 1: Single-step (correct answer)
POST /create_session → session_id_1
POST /create (task: problem_1)
POST /env/call ("submit", {"answer": 42})
→ finished=true, reward=1.0
Episode 2: Multi-step interaction
POST /create_session → session_id_2
POST /create (task: problem_2)

Step 1: Explore
POST /env/call ("bash", {"command": "cat question.txt"})
→ finished=false, reward=0.0

Step 2: Solve
POST /env/call ("submit", {"answer": "Tokyo"})
→ finished=true, reward=1.0

Rewards

Rewards are numeric feedback signals that enable RL training.

Reward Design

  • Sparse rewards: Only at task completion (0 or 1)
  • Dense rewards: After each action (incremental progress)
  • Shaped rewards: Guide agent toward solution
Example sparse rewards:
POST /env/call ("submit", {"answer": 42})
→ reward=1.0, finished=true    # Correct

POST /env/call ("submit", {"answer": 43})
→ reward=0.0, finished=true    # Incorrect
Example dense rewards:
POST /env/call ("bash", {"command": "ls"})
→ reward=0.1, finished=false   # Progress

POST /env/call ("bash", {"command": "cat answer.txt"})
→ reward=0.3, finished=false   # More progress

POST /env/call ("submit", {"answer": 42})
→ reward=1.0, finished=true    # Complete

Why Rewards Matter

Rewards transform agent interaction from simple evaluation to learning:
  • Agents can be trained with RL algorithms (GRPO, CISPO, etc.)
  • Immediate feedback guides exploration
  • Credit assignment across multiple steps
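Credit assignment is typically handled with a return computation over the per-step rewards an ORS environment emits. A standard discounted-return sketch (the discount factor gamma is a training choice, not part of the protocol):

def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    # Return-to-go for each step: G_t = r_t + gamma * G_{t+1}
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Dense-reward episode from the example above: 0.1, 0.3, then 1.0 on completion
print(discounted_returns([0.1, 0.3, 1.0]))
# → approximately [1.377, 1.29, 1.0]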

Protocol Overview

ORS uses HTTP + Server-Sent Events for communication:

HTTP for Control

Standard REST endpoints for:
  • Listing tools, splits, tasks
  • Creating/deleting sessions
  • Health checks

SSE for Tool Execution

Server-Sent Events stream tool outputs:
  • Supports long-running operations
  • Allows for streaming responses
  • Graceful error handling
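Here is the call_tool helper referenced in the episode loop above: it posts a tool call with stream=True and reads the response line by line, assuming the final result arrives as a data: event carrying the JSON shape shown in the math example (exact event framing may vary by server):

import json
import requests

BASE = "http://localhost:8000"  # assumed server address

def call_tool(headers: dict, tool_call: dict) -> dict:
    with requests.post(f"{BASE}/math/call", headers=headers,
                       json=tool_call, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if line and line.startswith("data:"):
                event = json.loads(line[len("data:"):].strip())
                if "output" in event:
                    return event["output"]
    raise RuntimeError("stream ended without a tool output")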

Language-Agnostic

Because it’s HTTP-based, ORS can be implemented in any language:
  • Python: OpenReward SDK (reference implementation)
  • TypeScript: Custom server with Express/Fastify
  • Go: Custom server with stdlib http
  • Rust: Custom server with Actix/Axum
The protocol is the standard, not the implementation.

ORS vs MCP

Both ORS and MCP involve agents calling tools, but they serve different purposes:
MCP (Model Context Protocol):
  • Purpose: Connect LLMs to tools, data sources, workflows
  • Use case: General-purpose tool access
  • Protocol: JSON-RPC over stdio/SSE
  • Key feature: Seamless tool integration
ORS (Open Reward Standard):
  • Purpose: Connect agents to RL training environments
  • Use case: Training and evaluating agents
  • Protocol: HTTP + SSE
  • Key features: Rewards, episodes, task organization

What’s Different?

ORS adds RL-specific features:
Feature   | MCP | ORS | Why ORS Needs It
Rewards   | No  | Yes | RL training signal
Finished  | No  | Yes | Episode termination
Tasks     | No  | Yes | Problem organization
Splits    | No  | Yes | Train/test separation

Can They Work Together?

Yes! They serve complementary purposes:
  • MCP: Agent uses tools to access external data/APIs
  • ORS: Agent operates in structured RL environment with rewards
You might use both: an agent in an ORS environment that uses MCP tools to access external resources.

Who Should Use ORS?

ORS is designed for:

Researchers

Training language models with reinforcement learning

Benchmark Creators

Building standardized evaluation environments for agent capabilities

Companies

Developing custom environments to train agents on internal workflows

Educators

Teaching RL concepts using language-model agents in interactive environments

Key Takeaway: ORS brings RL to language models by providing a standardized protocol with rewards, episode structure, and task organization. It’s designed for training agents, not just calling tools.