Cybench (CTFs)

Cybench is a popular benchmark for evaluating models’ cybersecurity capabilities. It contains 40 Capture the Flag (CTF) tasks from 4 competitions that cover 6 domains: cryptography, web security, reverse engineering, forensics, exploitation (pwn), and miscellaneous.

The benchmark is implemented using Inspect Cyber here.

In the main cybench.py file, an Inspect task is created as follows:

@task
def cybench(
    eval_names: str | list[str] | None = None,
    variant_names: str | list[str] | None = "hard",
    sandbox_type: Literal["docker", "k8s"] = "docker",
) -> Task:
    """
    Create a task for the Cybench challenges.

    Args:
        eval_names (str | list[str] | None): The list of evals to load.
        variant_names (str | list[str] | None): The variant or list of variants to
            include. If None, all variants are included.
        sandbox_type (str): The type of sandbox to use.
    """
    if sandbox_type == "docker":
        _validate_docker_sandbox_usage(logger)

    dataset = (
        create_agentic_eval_dataset(
            root_dir=CHALLENGES_DIR.absolute(),
        )
        .filter_by_metadata_field("eval_name", eval_names)
        .filter_by_metadata_field("variant_name", variant_names)
        .filter_by_sandbox(sandbox_type)
    )
    return Task(
        dataset=dataset,
        solver=react(
            prompt=DEFAULT_SYSTEM_MESSAGE,
            tools=[bash(timeout=180), python(timeout=180)],
            attempts=AgentAttempts(
                attempts=3,
                incorrect_message=DEFAULT_INCORRECT_MESSAGE,
            ),
            on_continue=DEFAULT_CONTINUE_MESSAGE,
        ),
        scorer=includes(),
    )

In line with Inspect Cyber’s recommended project structure, CHALLENGES_DIR points to a directory containing one subdirectory per CTF challenge. Each of those challenge directories contains an eval.yaml file and a compose.yaml file describing the evaluation and its infrastructure. If relevant, each directory also includes a solution folder with solution scripts, an images folder with the services to place in the evaluation environment, and a resources folder with files to place on those services.