Cybench (CTFs)
Cybench is a popular benchmark for evaluating models’ cybersecurity capabilities. It contains 40 Capture the Flag (CTF) tasks from 4 competitions that cover 6 domains: cryptography, web security, reverse engineering, forensics, exploitation (pwn), and miscellaneous.
The benchmark is implemented using Inspect Cyber.
In the main cybench.py file, an Inspect task is created as follows:
@task
def cybench(
    eval_names: str | list[str] | None = None,
    variant_names: str | list[str] | None = "hard",
    sandbox_type: Literal["docker", "k8s"] = "docker",
) -> Task:
    """
    Create a task for the Cybench challenges.

    Args:
        eval_names (str | list[str] | None): The list of evals to load.
        variant_names (str | list[str] | None): The variant or list of variants to
            include. If None, all variants are included.
        sandbox_type (str): The type of sandbox to use.
    """
    if sandbox_type == "docker":
        _validate_docker_sandbox_usage(logger)

    dataset = (
        create_agentic_eval_dataset(
            root_dir=CHALLENGES_DIR.absolute(),
        )
        .filter_by_metadata_field("eval_name", eval_names)
        .filter_by_metadata_field("variant_name", variant_names)
        .filter_by_sandbox(sandbox_type)
    )

    return Task(
        dataset=dataset,
        solver=react(
            prompt=DEFAULT_SYSTEM_MESSAGE,
            tools=[bash(timeout=180), python(timeout=180)],
            attempts=AgentAttempts(
                attempts=3,
                incorrect_message=DEFAULT_INCORRECT_MESSAGE,
            ),
            on_continue=DEFAULT_CONTINUE_MESSAGE,
        ),
        scorer=includes(),
    )
In line with Inspect Cyber’s recommended project structure, CHALLENGES_DIR points to a directory containing one subdirectory per CTF challenge. Each of those challenge directories contains an eval.yaml file and a compose.yaml file describing the evaluation and its infrastructure. If relevant, each directory also includes a solution folder with solution scripts, an images folder with the services to place in the evaluation environment, and a resources folder with files to place on those services.
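The layout described above can be checked with a short script. The sketch below walks a challenges directory and reports, for each challenge subdirectory, which of the two expected files are missing and which optional folders are present. The function names, the demo tree, and the challenge names are hypothetical; the file and folder names (eval.yaml, compose.yaml, solution, images, resources) are the ones named in the text.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("eval.yaml", "compose.yaml")
OPTIONAL_DIRS = ("solution", "images", "resources")


def describe_challenges(challenges_dir: Path) -> dict[str, dict[str, list[str]]]:
    """Report missing required files and present optional folders per challenge."""
    report: dict[str, dict[str, list[str]]] = {}
    for challenge in sorted(p for p in challenges_dir.iterdir() if p.is_dir()):
        missing = [n for n in REQUIRED_FILES if not (challenge / n).is_file()]
        optional = [n for n in OPTIONAL_DIRS if (challenge / n).is_dir()]
        report[challenge.name] = {"missing": missing, "optional": optional}
    return report


def build_demo_tree(root: Path) -> None:
    """Create a tiny, hypothetical challenges directory for the demonstration."""
    complete = root / "challenge_a"
    (complete / "solution").mkdir(parents=True)
    (complete / "eval.yaml").touch()
    (complete / "compose.yaml").touch()

    # This challenge deliberately lacks compose.yaml.
    incomplete = root / "challenge_b"
    incomplete.mkdir()
    (incomplete / "eval.yaml").touch()


with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    build_demo_tree(demo_root)
    report = describe_challenges(demo_root)

for name, info in report.items():
    print(name, info)
```

Pointing describe_challenges at a real challenges directory instead of the temporary demo tree gives a quick overview of which challenges ship solutions, images, or resources.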