AI Security Benchmarks
Comprehensive collection of benchmarks to evaluate AI models for biases, harmful content, security vulnerabilities, and safety issues
Safety & Harmful Content Benchmarks
Evaluate models for harmful content generation and safety violations
ToxiGen
Medium27 models testedLarge-scale dataset for implicit hate speech detection across 13 minority groups
RealToxicityPrompts
High45 models testedDataset of 100k prompts for measuring toxic degeneration in language models
HarmBench
High33 models testedStandardized benchmark for automated red teaming and harmful behavior evaluation
SafetyBench
Medium25 models testedComprehensive safety evaluation across 8 categories of harmful content
BOLD (Bias in Open-Ended Language)
Medium23 models testedLarge-scale dataset for measuring biases in open-ended language generation
Adversarial & Robustness Benchmarks
Test model resilience against adversarial attacks and jailbreaking
AdvBench
Very High18 models testedAdversarial benchmark for evaluating jailbreaking and red-teaming attacks
PAIR (Prompt Automatic Iterative Refinement)
High16 models testedAutomated adversarial prompt generation through iterative refinement
GCG (Greedy Coordinate Gradient)
Very High12 models testedGradient-based adversarial attack method for LLMs
TAP (Tree of Attacks with Pruning)
High14 models testedAutomated multi-turn jailbreaking via tree search
Bias & Fairness Benchmarks
Evaluate models for various types of biases and fairness issues
BBQ (Bias Benchmark for QA)
Medium30 models testedBenchmark for evaluating social biases in question-answering systems
WinoBias
Low42 models testedGender bias evaluation in coreference resolution
StereoSet
Medium28 models testedMeasuring stereotypical bias in language models across multiple domains
SEAT (Sentence Encoder Association Test)
Low35 models testedDetecting biases in sentence encoders
RedditBias
Medium19 models testedReal-world bias detection using Reddit data across 4 demographic groups
Privacy & Security Benchmarks
Assess privacy leakage and security vulnerabilities
PrivacyBench
High15 models testedComprehensive privacy evaluation for LLMs including PII leakage
LLM-PBE (Privacy Behavior Evaluation)
Medium12 models testedEvaluating privacy-preserving behaviors in language models
Extraction Benchmark
Very High10 models testedTesting training data extraction vulnerabilities
Hallucination & Truthfulness Benchmarks
Measure factual accuracy and hallucination rates
TruthfulQA
High58 models testedMeasuring truthfulness in question-answering with adversarially-selected questions
HaluEval
Medium22 models testedLarge-scale hallucination evaluation across diverse tasks
FActScore
High16 models testedFine-grained atomic fact scoring for hallucination detection
FEVER (Fact Extraction and VERification)
Medium45 models testedLarge-scale dataset for fact verification
Capability & Alignment Benchmarks
Evaluate model capabilities and alignment with human values
MACHIAVELLI
Very High11 models testedMeasuring power-seeking and deception in language agents
Anthropic Eval Suite
High8 models testedComprehensive evaluation suite for AI safety and capabilities
BIG-bench
Variable68 models testedBeyond the Imitation Game: 200+ tasks for evaluating language models
ETHICS
High24 models testedEvaluating ethical reasoning in language models
Multimodal Security Benchmarks
Benchmarks for vision-language and multimodal AI systems
MM-SafetyBench
High12 models testedSafety evaluation for multimodal large language models
Red Teaming V-LLMs
Very High8 models testedRed teaming vision-language models with visual adversarial examples
POPE (Polling-based Object Probing)
Medium15 models testedEvaluating object hallucination in vision-language models
Agent & Tool-Use Security Benchmarks
Evaluate security of AI agents and tool-using systems
ToolEmu
High9 models testedEmulating tool-use risks in language agents
AgentBench
Medium14 models testedComprehensive evaluation of LLM agents across diverse environments
WebArena
Very High7 models testedRealistic web environment for autonomous agent evaluation
How to Use These Benchmarks
1. Select Benchmarks
Choose benchmarks relevant to your model's use case and deployment context. Consider safety-critical applications first.
2. Run Evaluations
Execute benchmarks systematically, starting with high-priority security and safety tests. Document all results thoroughly.
3. Mitigate Issues
Address identified vulnerabilities through fine-tuning, guardrails, or architectural changes. Re-test after mitigations.