Cybench (CTFs)
Cybench is a popular benchmark for evaluating models’ cybersecurity capabilities. It contains 40 Capture the Flag (CTF) tasks from 4 competitions that cover 6 domains: cryptography, web security, reverse engineering, forensics, exploitation (pwn), and miscellaneous.
The benchmark is implemented using Inspect Cyber.
In the main cybench.py file, an Inspect task is created as follows:
@task
def cybench(
    eval_names: str | list[str] | None = None,
    variant_names: str | list[str] | None = "hard",
    sandbox_type: Literal["docker", "k8s"] = "docker",
) -> Task:
    """
    Create a task for the Cybench challenges.

    Args:
        eval_names (str | list[str] | None): The list of evals to load.
        variant_names (str | list[str] | None): The variant or list of variants to
            include. If None, all variants are included.
        sandbox_type (str): The type of sandbox to use.
    """
    if sandbox_type == "docker":
        _validate_docker_sandbox_usage(logger)

    dataset = (
        create_agentic_eval_dataset(
            root_dir=CHALLENGES_DIR.absolute(),
        )
        .filter_by_metadata_field("eval_name", eval_names)
        .filter_by_metadata_field("variant_name", variant_names)
        .filter_by_sandbox(sandbox_type)
    )
    return Task(
        dataset=dataset,
        solver=react(
            prompt=DEFAULT_SYSTEM_MESSAGE,
            tools=[bash(timeout=180), python(timeout=180)],
            attempts=AgentAttempts(
                attempts=3,
                incorrect_message=DEFAULT_INCORRECT_MESSAGE,
            ),
            on_continue=DEFAULT_CONTINUE_MESSAGE,
        ),
        scorer=includes(),
    )

In line with Inspect Cyber’s recommended project structure, CHALLENGES_DIR points to a directory containing one subdirectory per CTF challenge. Each of those challenge directories contains an eval.yaml file and a compose.yaml file describing the evaluation and its infrastructure. If relevant, each directory also includes a solution folder with solution scripts, an images folder with the services to place in the evaluation environment, and a resources folder with files to place on those services.
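The layout described above can be sanity-checked with a few lines of standard-library Python. This is a minimal sketch, not part of Cybench or Inspect Cyber: the helper name check_challenges_dir is hypothetical, and it assumes a flat layout in which eval.yaml and compose.yaml sit directly inside each challenge subdirectory. The optional solution, images, and resources folders are not checked.

```python
from pathlib import Path

# File names taken from the description above; both are treated as required.
REQUIRED_FILES = ("eval.yaml", "compose.yaml")


def check_challenges_dir(root: Path) -> dict[str, list[str]]:
    """Map each challenge subdirectory name to the required files it lacks.

    Hypothetical helper for illustration only. Challenges that contain
    every required file are omitted from the result.
    """
    missing: dict[str, list[str]] = {}
    # Treat every immediate subdirectory of root as one challenge.
    for challenge in sorted(p for p in root.iterdir() if p.is_dir()):
        absent = [name for name in REQUIRED_FILES if not (challenge / name).is_file()]
        if absent:
            missing[challenge.name] = absent
    return missing
```

Calling check_challenges_dir(CHALLENGES_DIR) before running the task would return an empty dict when every challenge directory has both files, and otherwise name the directories that need attention.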