Tasks & Splits

Tasks and splits are how ORS organizes problems for training and evaluation. Tasks are the individual problems agents solve, while splits categorize these tasks into train/validation/test sets.

Tasks

What is a Task?

A task is a specific problem for an agent to solve. Each task is represented as a JSON object with task-specific data. Key insight: task structure is environment-specific; each environment defines its own task format.

Task Examples

Math environment:
{
  "question": "If x + 5 = 12, what is x?",
  "answer": "7",
  "difficulty": "easy"
}
Coding environment:
{
  "problem_id": "reverse_string",
  "description": "Write a function to reverse a string",
  "test_cases": [
    {"input": "hello", "output": "olleh"},
    {"input": "world", "output": "dlrow"}
  ],
  "time_limit_seconds": 5
}
Web navigation:
{
  "task_id": "find_price",
  "goal": "Find the price of iPhone 15",
  "start_url": "https://example.com",
  "success_criteria": "Price found and extracted correctly"
}

Task Lifecycle

1. Environment defines tasks

2. Tasks organized into splits (train/test)

3. Agent requests tasks from a split

4. For each task:
   a. Create episode with task
   b. Get prompt (derived from task)
   c. Solve task via tool calls
   d. Receive reward
   e. Cleanup episode
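
A minimal Python sketch of this loop, using the two endpoints documented below under "Accessing Tasks" and "Task as Episode Input". The base URL is an assumption, and the remaining steps use endpoints not covered in this section:
import requests

BASE = "http://localhost:8000"          # assumed server address
HEADERS = {"X-Session-ID": "abc-123"}   # session header from the example below

# 3. Request tasks from a split
tasks = requests.post(f"{BASE}/math/tasks", json={"split": "train"}).json()["tasks"]

# 4. Run one episode per task
for task in tasks:
    # a. Create an episode with the task
    episode = requests.post(
        f"{BASE}/create",
        headers=HEADERS,
        json={"env_name": "math", "task_spec": task, "secrets": {}},
    ).json()

    # b-e. Getting the prompt, solving via tool calls, receiving the reward,
    # and cleaning up the episode use endpoints not covered in this section.
    ...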

Accessing Tasks

Tasks are retrieved via the API:
POST /math/tasks
Content-Type: application/json

{
  "split": "train"
}
Response:
{
  "tasks": [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "If x + 5 = 12, what is x?", "answer": "7"},
    ...
  ],
  "env_name": "math"
}

Task as Episode Input

Tasks are passed when creating episodes:
POST /create
X-Session-ID: abc-123

{
  "env_name": "math",
  "task_spec": {
    "question": "What is 2+2?",
    "answer": "4"
  },
  "secrets": {}
}
The environment uses the task to:
  • Generate the initial prompt
  • Determine correct answers
  • Calculate rewards
  • Track episode progress
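
A sketch of how an environment might consume its task_spec, based on the __init__(self, task_spec, **kwargs) signature shown under "Best Practices" below; the initial_prompt and compute_reward method names are illustrative, not a documented interface:
class MathEnvironment(Environment):
    def __init__(self, task_spec, **kwargs):
        super().__init__(task_spec, **kwargs)
        self.task = task_spec

    def initial_prompt(self) -> str:
        # Generate the initial prompt from the task
        return f"Solve the following problem: {self.task['question']}"

    def compute_reward(self, submitted_answer: str) -> float:
        # Determine correctness and calculate the reward
        return 1.0 if submitted_answer.strip() == str(self.task["answer"]) else 0.0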

Splits

What is a Split?

A split is a named category of tasks. Splits organize tasks for different purposes in ML workflows. Standard splits:
  • train - Tasks for training agents
  • validation - Tasks for hyperparameter tuning
  • test - Tasks for final evaluation

Split Structure

interface Split {
  name: string  // Split identifier
  type: "train" | "validation" | "test"  // Category
}
Examples:
[
  {"name": "train", "type": "train"},
  {"name": "validation", "type": "validation"},
  {"name": "test", "type": "test"}
]

Why Splits Matter

Splits prevent overfitting and enable proper evaluation.

Train split:
  • Used during RL training
  • Agent sees these tasks repeatedly
  • Can memorize solutions (acceptable)
  • Large number of tasks for diverse training

Validation split:
  • Used for hyperparameter tuning
  • Agent doesn't train on these
  • Evaluate different hyperparameters
  • Intermediate checkpoint evaluation

Test split:
  • Used ONLY for final evaluation
  • Agent never sees during training
  • True measure of generalization
  • Evaluated once, at the end
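
A sketch of how these splits fit into a training run, reusing the environment.list_tasks(...) and run_episode(...) helpers from the sampling examples later on this page; evaluate(...) and num_epochs are illustrative placeholders:
num_epochs = 10  # illustrative

# Train split: seen repeatedly during RL training
for epoch in range(num_epochs):
    for task in environment.list_tasks("train"):
        run_episode(task)

    # Validation split: intermediate checkpoint evaluation and hyperparameter tuning
    val_score = evaluate(environment.list_tasks("validation"))

# Test split: evaluated once, at the very end
test_score = evaluate(environment.list_tasks("test"))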

Accessing Splits

List available splits:
GET /math/splits
Response:
[
  {"name": "train", "type": "train"},
  {"name": "test", "type": "test"}
]
Then request tasks from a specific split:
POST /math/tasks
{"split": "train"}
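
For example, a client might discover the available splits and then fetch the tasks for each one; the base URL is an assumption:
import requests

BASE = "http://localhost:8000"   # assumed server address

# Discover which splits the environment exposes
splits = requests.get(f"{BASE}/math/splits").json()

# Fetch the tasks for each split
for split in splits:
    resp = requests.post(f"{BASE}/math/tasks", json={"split": split["name"]})
    tasks = resp.json()["tasks"]
    print(f"{split['name']} ({split['type']}): {len(tasks)} tasks")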

Custom Splits

Environments can define custom splits beyond train/validation/test:
[
  {"name": "easy", "type": "train"},
  {"name": "medium", "type": "train"},
  {"name": "hard", "type": "test"},
  {"name": "expert", "type": "test"}
]
Use cases:
  • Difficulty-based splits (easy/medium/hard)
  • Domain-specific splits (algebra/geometry/calculus)
  • Time-based splits (before_2020/after_2020)
  • Source-based splits (synthetic/human_generated)

Convention: Custom splits should map to standard types:
  • Training-related → "type": "train"
  • Evaluation-related → "type": "test"
  • Tuning-related → "type": "validation"

Task Design Patterns

Pattern 1: Static Task List

Tasks are predefined, either hard-coded or loaded from a file:
class MathEnvironment(Environment):
    TASKS = [
        {"question": "What is 2+2?", "answer": "4"},
        {"question": "What is 10-3?", "answer": "7"},
        # ... 1000 more tasks
    ]

    @classmethod
    def list_tasks(cls, split: str):
        if split == "train":
            return cls.TASKS[:800]  # First 80%
        elif split == "test":
            return cls.TASKS[800:]  # Last 20%
Pros: Simple, deterministic, reproducible
Cons: Limited diversity, finite tasks

Pattern 2: Procedurally Generated Tasks

Tasks generated on-the-fly:
import random

class MathEnvironment(Environment):
    @classmethod
    def list_tasks(cls, split: str):
        seed = 42 if split == "train" else 43
        random.seed(seed)

        tasks = []
        for i in range(1000):
            a = random.randint(1, 100)
            b = random.randint(1, 100)
            tasks.append({
                "question": f"What is {a} + {b}?",
                "answer": a + b
            })
        return tasks
Pros: Infinite diversity, scalable
Cons: Quality control, ensuring variety

Pattern 3: Difficulty Progression

Tasks organized by difficulty:
class MathEnvironment(Environment):
    @classmethod
    def list_splits(cls):
        return ["easy", "medium", "hard"]

    @classmethod
    def list_tasks(cls, split: str):
        if split == "easy":
            return [{"question": "2+2", "answer": "4"}, ...]
        elif split == "medium":
            return [{"question": "13*7", "answer": "91"}, ...]
        elif split == "hard":
            return [{"question": "sqrt(2401)", "answer": "49"}, ...]
Use case: Curriculum learning, progressive training

Pattern 4: Real-World Datasets

Tasks from benchmark datasets:
import datasets  # HuggingFace datasets

class GSM8KEnvironment(Environment):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset = datasets.load_dataset("gsm8k", "main")

    @classmethod
    def list_splits(cls):
        return ["train", "test"]

    @classmethod
    def list_tasks(cls, split: str):
        ds = datasets.load_dataset("gsm8k", "main")
        split_data = ds[split]

        return [
            {
                "question": item["question"],
                "answer": item["answer"]
            }
            for item in split_data
        ]
Pros: Standardized benchmarks, comparable results
Cons: May be overfit by existing models

Task Sampling Strategies

Sequential Sampling

Go through tasks in order:
tasks = environment.list_tasks("train")
for task in tasks:
    run_episode(task)
Use case: Ensuring all tasks are seen

Random Sampling

Sample tasks randomly:
import random

tasks = environment.list_tasks("train")
for _ in range(num_episodes):
    task = random.choice(tasks)
    run_episode(task)
Use case: Diverse training; avoids the agent memorizing task order

Weighted Sampling

Sample based on difficulty or priority:
import random

tasks = environment.list_tasks("train")
weights = [task.get("weight", 1.0) for task in tasks]

for _ in range(num_episodes):
    task = random.choices(tasks, weights=weights)[0]
    run_episode(task)
Use case: Focus training on harder examples

Curriculum Learning

Progress from easy to hard:
easy_tasks = environment.list_tasks("easy")
medium_tasks = environment.list_tasks("medium")
hard_tasks = environment.list_tasks("hard")

# Train in curriculum
for task in easy_tasks:
    run_episode(task)

for task in medium_tasks:
    run_episode(task)

for task in hard_tasks:
    run_episode(task)
Use case: Learning complex skills progressively

Task Validation

Ensure Task Quality

@classmethod
def list_tasks(cls, split: str):
    tasks = load_tasks_from_file(split)

    # Validate each task
    validated_tasks = []
    for task in tasks:
        if cls.is_valid_task(task):
            validated_tasks.append(task)
        else:
            print(f"Warning: Invalid task skipped: {task}")

    return validated_tasks

@classmethod
def is_valid_task(cls, task):
    # Check required fields
    if "question" not in task or "answer" not in task:
        return False

    # Check types
    if not isinstance(task["question"], str):
        return False

    return True

Best Practices

1. Separate Train and Test

# Good - clear separation
class MyEnvironment(Environment):
    @classmethod
    def list_splits(cls):
        return ["train", "test"]

    @classmethod
    def list_tasks(cls, split: str):
        if split == "train":
            return cls.train_tasks  # 80%
        elif split == "test":
            return cls.test_tasks   # 20% - never overlap

2. Sufficient Task Diversity

# Good - many diverse tasks
train_tasks = generate_diverse_tasks(count=10000)

# Bad - too few tasks
train_tasks = [task1, task2, task3]  # Agent will memorize

3. Reproducible Splits

# Good - deterministic splits
random.seed(42)
all_tasks = load_all_tasks()
random.shuffle(all_tasks)

train_tasks = all_tasks[:800]
test_tasks = all_tasks[800:]

# Bad - random splits each run
train_tasks = random.sample(all_tasks, 800)  # Different each time!

4. Document Task Format

class MyEnvironment(Environment):
    """
    Tasks have the following format:
    {
        "question": str,  # The problem statement
        "answer": int | float,  # The correct answer
        "difficulty": str,  # "easy", "medium", or "hard"
        "topic": str,  # e.g., "algebra", "geometry"
    }
    """

5. Validate at Runtime

def __init__(self, task_spec, **kwargs):
    super().__init__(task_spec, **kwargs)

    # Validate task structure
    required_fields = ["question", "answer"]
    for field in required_fields:
        if field not in task_spec:
            raise ValueError(f"Task missing required field: {field}")

    self.task = task_spec

Key Takeaway: Tasks are the problems agents solve. Splits organize tasks for proper ML workflows. Design task structures that are clear, validated, and organized into train/test splits to enable both learning and fair evaluation.