| title | Artemis SOC Triage Benchmark |
|---|---|
| emoji | 🛡️ |
| colorFrom | blue |
| colorTo | indigo |
| sdk | docker |
| pinned | false |
Artemis is a reproducible OpenEnv benchmark for evaluating AI agents in high-fidelity Security Operations Center (SOC) triage scenarios. It provides a standardized framework to train, test, and validate autonomous decision-making in cybersecurity.
- Overview
- Quick Start
- Task Specifications
- Environment API
- Baseline Performance
- Architecture and Design
- Deployment
- Testing
- Contributing
In modern enterprise security, Security Operations Centers (SOCs) are overwhelmed by sheer volume. With alerts scaling from 10,000 to 1,000,000 daily, and a false positive rate of 80-90%, human analysts are pushed to their limits. Artemis addresses this critical bottleneck by providing a realistic training ground for AI agents to automate the triage process.
Artemis simulates a sophisticated SOC dashboard where agents execute real-world triage actions:
- Triage: Distinguish between noise and genuine threats.
- Remediate: Isolate malicious IPs and compromised user accounts.
- Investigate: Deep-dive into logs and temporal file access patterns.
- Escalate: Strategically involve human analysts when high-stakes ambiguity arises.
- Operational Realism: Incidents modeled after authentic attack signatures.
- Stochastic Scenario Generation: 9 rigorous procedural variants across 3 core task tracks prevent basic memorization.
- Deterministic Evaluation: Seed-controlled reproducible trajectories for rigorous scientific benchmarking.
- Task-Specific Precision Graders: Granular scoring driven by customized ground-truth rubrics rather than weak heuristics.
- Signal-Rich Rewards: Partial credit for investigative steps and early threat isolation.
- High Efficiency: Optimized for speed; complete global benchmarks in under 3 minutes.
- OpenEnv Compliance: Native support for the OpenEnv reset/step/state interface.
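Seed-controlled reproducibility can be illustrated with a minimal sketch. The task names other than `lateral_movement_detection`, the variant labels, the `generate_scenario` helper, and the alert fields are all hypothetical and not taken from the Artemis codebase; the sketch only shows that a private `random.Random` instance yields the same scenario for the same task and seed.

```python
import random

# Hypothetical task tracks and variant labels (3 tracks x 3 variants = 9).
TASK_VARIANTS = {
    "false_positive_filtering": ["v1", "v2", "v3"],
    "brute_force_detection": ["v1", "v2", "v3"],
    "lateral_movement_detection": ["v1", "v2", "v3"],
}


def generate_scenario(task: str, seed: int, n_alerts: int = 5) -> dict:
    """Build a toy scenario whose contents depend only on (task, seed)."""
    # A private RNG keeps the result independent of the global random state.
    rng = random.Random(f"{task}:{seed}")
    variant = rng.choice(TASK_VARIANTS[task])
    alerts = [
        {
            "alert_id": f"alert_{i}",
            "source_ip": f"10.0.{rng.randint(0, 255)}.{rng.randint(1, 254)}",
            "is_threat": rng.random() < 0.2,
        }
        for i in range(n_alerts)
    ]
    return {"task": task, "seed": seed, "variant": variant, "alerts": alerts}
```

Calling `generate_scenario("lateral_movement_detection", 42)` twice returns equal dictionaries, which is the property a seed-controlled benchmark relies on.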
Initialize the Artemis benchmark and run the baseline agent in under 2 minutes.
# Clone the benchmark repo
git clone https://github.com/MONSTER4REX/ARTEMIS.git
cd ARTEMIS
# Setup environment
pip install -r requirements.txt
# Configure your OpenEnv environment
export HF_TOKEN="your-huggingface-token"
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4-0125-preview"
export ENV_URL="http://localhost:7860" # Or your HF Space URL
# Execute baseline inference
python inference.py
Artemis offers three core difficulty tiers designed to test different cognitive layers of an AI agent.
Objective: Filter obvious false positives from a noisy dashboard.
- Scenario: Internal SQL injection tests and known research scanners (Shodan).
- Core Skill: Contextual awareness and allowlist recognition.
- Target Metric: >95% accuracy.
Objective: Detect and mitigate a coordinated multi-source Brute Force attack.
- Scenario: Coordinated failed logins followed by a successful compromise.
- Core Skill: Temporal correlation and proactive IP isolation.
- Target Metric: >75% accuracy.
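The temporal correlation this task calls for might look roughly like the sketch below: flag a user when a successful login follows a burst of failures from several sources inside a sliding window. The event keys (`ts`, `user`, `src_ip`, `outcome`) and the thresholds are assumptions made for illustration; Artemis observations may be shaped differently.

```python
from collections import defaultdict, deque


def find_compromised_users(events, window_s=300, min_failures=5, min_sources=2):
    """Return {user: sorted source IPs to isolate} for suspected compromises.

    events: iterable of dicts with hypothetical keys "ts" (seconds), "user",
    "src_ip", and "outcome" ("failure" or "success").
    """
    recent = defaultdict(deque)  # user -> (ts, src_ip) failures inside the window
    flagged = {}
    for ev in sorted(events, key=lambda e: e["ts"]):
        failures = recent[ev["user"]]
        # Drop failures that have aged out of the sliding window.
        while failures and ev["ts"] - failures[0][0] > window_s:
            failures.popleft()
        if ev["outcome"] == "failure":
            failures.append((ev["ts"], ev["src_ip"]))
        elif ev["outcome"] == "success":
            sources = {ip for _, ip in failures}
            # A success right after many multi-source failures is suspicious.
            if len(failures) >= min_failures and len(sources) >= min_sources:
                flagged[ev["user"]] = sorted(sources)
    return flagged
```

An agent could use the returned IPs as candidates for isolation and the flagged users as candidates for an `isolate_user` action.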
Objective: Uncover sophisticated multi-stage lateral movement.
- Scenario: Unusual login from a new geo-location → Sensitive file access → Internal API calls.
- Core Skill: Hypothesis testing, long-horizon reasoning, and risk judgment.
- Target Metric: >60% accuracy.
The Artemis API follows the OpenEnv Specification for seamless integration with existing AI agent frameworks.
1. POST /reset - Initialize Episode
{
  "task": "lateral_movement_detection",
  "seed": 42
}
2. POST /step - Execute Action
{
  "episode_id": "ep_abc123",
  "action": {
    "action_type": "isolate_user",
    "user_id": "admin_alpha",
    "reason": "Compromised user exhibiting lateral movement patterns."
  }
}
3. GET /state - Inspect Environment
Returns the full internal simulation state for debugging and audit purposes.
4. GET /leaderboard - Live Rankings
Returns the global curated dataset leaderboard comparing major model families against the environment tiers.
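A client for these endpoints could be sketched with the standard library alone. The request bodies mirror the JSON shown above; the helper names and the assumption that the `/reset` response carries an `episode_id` field are illustrative and should be checked against the server's OpenAPI documentation.

```python
import json
import os
import urllib.request

ENV_URL = os.getenv("ENV_URL", "http://localhost:7860")


def reset_payload(task: str, seed: int) -> dict:
    """Body for POST /reset."""
    return {"task": task, "seed": seed}


def step_payload(episode_id: str, action_type: str, reason: str, **fields) -> dict:
    """Body for POST /step; extra fields (e.g. user_id) go into the action."""
    return {
        "episode_id": episode_id,
        "action": {"action_type": action_type, **fields, "reason": reason},
    }


def post_json(path: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST a JSON body to the environment and decode the JSON reply."""
    request = urllib.request.Request(
        ENV_URL.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo() -> dict:
    """Run one reset and one step against a live server (not called here)."""
    first = post_json("/reset", reset_payload("lateral_movement_detection", 42))
    # Assumption: the reset response exposes the episode id as "episode_id".
    body = step_payload(
        first["episode_id"],
        "isolate_user",
        "Compromised user exhibiting lateral movement patterns.",
        user_id="admin_alpha",
    )
    return post_json("/step", body)
```

Keeping payload construction separate from the HTTP call makes the request shapes easy to unit test without a running server.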
Standardized performance metrics across top-tier models for the Artemis v1.2.0 benchmark.
| Model | Easy (T1) | Medium (T2) | Hard (T3) | Average |
|---|---|---|---|---|
| OpenAI GPT-4o | 0.91 | 0.78 | 0.64 | 0.78 |
| Intelligent Baseline | 0.88 | 0.75 | 0.60 | 0.74 |
| Llama-3 (70B) | 0.86 | 0.71 | 0.49 | 0.69 |
| Llama-3 (8B) | 0.68 | 0.51 | 0.22 | 0.47 |
Note: Benchmarks performed with standard Artemis system prompts and default scenario seeds across all 9 deterministic variants.
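The Average column is consistent with an unweighted mean of the three tier scores rounded to two decimals. That reading is an assumption (the source does not state the formula), and it can be checked with a few lines:

```python
# Tier scores (Easy, Medium, Hard) copied from the table above.
TIER_SCORES = {
    "OpenAI GPT-4o": (0.91, 0.78, 0.64),
    "Intelligent Baseline": (0.88, 0.75, 0.60),
    "Llama-3 (70B)": (0.86, 0.71, 0.49),
    "Llama-3 (8B)": (0.68, 0.51, 0.22),
}


def tier_average(scores) -> float:
    """Unweighted mean of the tier scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


averages = {model: tier_average(scores) for model, scores in TIER_SCORES.items()}
```

Under this assumption the computed values match the reported column for all four rows.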
Artemis is built with a decoupled 3-tier architecture for maximum scalability and portability.
graph TD
A[Agent Layer: Python Baseline] -->|REST API| B[API Layer: FastAPI]
B -->|Simulation Calls| C[Environment Layer: Artemis Engine]
C --> D[Scenario Manager]
C --> E[Reward & Grader Engine]
C --> F[Pydantic Models]
- Environment Layer: Pure Python simulation logic (stateless, deterministic).
- API Layer: Performance-tuned FastAPI server with OpenAPI documentation.
- Deployment: Dockerized for zero-config hosting on Hugging Face Spaces.
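To make the reset/step/state contract concrete, here is a toy stand-in for the environment layer. `ToyTriageEnv`, the `investigate` action type, the hard-coded ground truth, and the reward values are hypothetical; the real engine, its Pydantic models, and its graders are not shown in this README. The sketch also keeps episodes in memory for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    episode_id: str
    threat_users: frozenset
    steps: int = 0
    isolated: set = field(default_factory=set)
    done: bool = False


class ToyTriageEnv:
    """Hypothetical stand-in for the environment layer, not the Artemis engine."""

    def __init__(self):
        self._episodes = {}

    def reset(self, task: str, seed: int) -> dict:
        episode_id = f"ep_{task}_{seed}"
        # Toy ground truth; a real scenario manager would derive it from the seed.
        self._episodes[episode_id] = Episode(episode_id, frozenset({"admin_alpha"}))
        return {"episode_id": episode_id, "observation": {"open_alerts": 1}}

    def step(self, episode_id: str, action: dict) -> dict:
        ep = self._episodes[episode_id]
        ep.steps += 1
        reward = 0.0
        if action["action_type"] == "investigate":
            reward = 0.1  # partial credit for an investigative step
        elif action["action_type"] == "isolate_user":
            if action["user_id"] in ep.threat_users:
                ep.isolated.add(action["user_id"])
                reward = 1.0
            else:
                reward = -0.5  # penalty for isolating a benign account
        # The episode ends once every compromised user has been isolated.
        ep.done = ep.isolated == ep.threat_users
        return {"reward": reward, "done": ep.done}

    def state(self, episode_id: str) -> dict:
        ep = self._episodes[episode_id]
        return {"steps": ep.steps, "isolated": sorted(ep.isolated), "done": ep.done}
```

An API layer would then only need to translate HTTP requests into these three method calls, which is what keeps the tiers decoupled.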
docker build -t artemis:latest .
docker run -p 7860:7860 artemis:latest
- Create a Docker Space on Hugging Face.
- Push your repository to the HF remote.
- Access your live endpoint: https://rudreshrx-artemis.hf.space
Ensuring the integrity of the benchmark is paramount. Artemis includes a comprehensive suite of unit and integration tests.
# Run all core environment tests
pytest tests/test_environment.py
# Validate grading logic
pytest tests/test_grading.py
# Perform full baseline validation
./scripts/validate-submission.sh

import os
from openai import OpenAI

# Client for any OpenAI-compatible endpoint, configured from the environment
client = OpenAI(
    base_url=os.getenv("API_BASE_URL"),
    api_key=os.getenv("HF_TOKEN"),
)

def get_action(observation):
    # Request a JSON-formatted reply; the action text is in
    # response.choices[0].message.content of the returned completion
    return client.chat.completions.create(
        model=os.getenv("MODEL_NAME"),
        messages=[{"role": "user", "content": observation}],
        response_format={"type": "json_object"},
    )

# Point to any OpenAI-compatible endpoint
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3-8B-Instruct"
python inference.py

We welcome contributions to the Artemis benchmark! Please review our PRD Document for deep technical specs before submitting a PR.
- Fork the Repo.
- Implement feature/task.
- Run pytest and black.
- Open a Pull Request.
@software{artemis_benchmark_2026,
  title={Artemis: A High-Fidelity OpenEnv Benchmark for SOC Triage},
  author={MONSTER4REX},
  year={2026},
  url={https://github.com/MONSTER4REX/ARTEMIS}
}
Distributed under the Apache 2.0 License. See LICENSE for more information.
Last Updated: April 11, 2026 | Version: 1.2.0 | Status: Production Ready