AI Security Benchmarks

Comprehensive collection of benchmarks to evaluate AI models for biases, harmful content, security vulnerabilities, and safety issues

31 Total Benchmarks · 8 Categories · 88% Avg Coverage · 500+ Models Tested

Safety & Harmful Content Benchmarks

Evaluate models for harmful content generation and safety violations

ToxiGen

Medium · 27 models tested

Large-scale dataset for implicit hate speech detection across 13 minority groups

94% Coverage
Toxicity · Hate Speech · Bias

RealToxicityPrompts

High · 45 models tested

Dataset of 100k prompts for measuring toxic degeneration in language models

89% Coverage
Toxicity · Content Safety

HarmBench

High · 33 models tested

Standardized benchmark for automated red teaming and harmful behavior evaluation

91% Coverage
Red Teaming · Harmful Behavior

SafetyBench

Medium · 25 models tested

Comprehensive safety evaluation across 8 categories of harmful content

87% Coverage
Safety · Content Moderation

BOLD (Bias in Open-Ended Language Generation Dataset)

Medium · 23 models tested

Large-scale dataset for measuring biases in open-ended language generation

85% Coverage
Bias · Fairness

Adversarial & Robustness Benchmarks

Test model resilience against adversarial attacks and jailbreaking

AdvBench

Very High · 18 models tested

Adversarial benchmark for evaluating jailbreaking and red-teaming attacks

92% Coverage
Jailbreaking · Adversarial

PAIR (Prompt Automatic Iterative Refinement)

High · 16 models tested

Automated adversarial prompt generation through iterative refinement (see the sketch after this entry)

88% Coverage
Automated Attacks · Jailbreaking
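The refinement loop behind PAIR-style attacks can be sketched in a few lines. This is a minimal illustration of the general attacker–target–judge pattern under stated assumptions, not the authors' implementation: `attacker`, `target`, and `judge` are hypothetical callables standing in for the attacker model, the model under test, and a scoring model.

```python
# Minimal sketch of an iterative attack-refinement loop (PAIR-style pattern).
# `attacker`, `target`, and `judge` are hypothetical callables, not a real API.
from typing import Callable

def refine_attack(
    goal: str,
    attacker: Callable[[str, str, float], str],   # (goal, last_response, last_score) -> new prompt
    target: Callable[[str], str],                  # prompt -> response from the model under test
    judge: Callable[[str, str], float],            # (goal, response) -> score in [0, 1]
    max_iters: int = 10,
    success_threshold: float = 0.9,
) -> tuple[str, str, float]:
    """Iteratively rewrite the adversarial prompt until the judge score passes the threshold."""
    response, score = "", 0.0
    prompt = goal
    for _ in range(max_iters):
        prompt = attacker(goal, response, score)   # attacker proposes a refined prompt
        response = target(prompt)                  # query the target model
        score = judge(goal, response)              # judge rates progress toward the goal
        if score >= success_threshold:
            break
    return prompt, response, score
```

In practice the attacker and judge roles are typically themselves LLMs prompted with the attack goal and the conversation so far.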

GCG (Greedy Coordinate Gradient)

Very High · 12 models tested

Gradient-based adversarial attack method for LLMs

86% Coverage
Gradient Attacks · White-box

TAP (Tree of Attacks with Pruning)

High · 14 models tested

Automated multi-turn jailbreaking via tree search

90% Coverage
Multi-turn · Tree Search

Bias & Fairness Benchmarks

Evaluate models for various types of biases and fairness issues

BBQ (Bias Benchmark for QA)

Medium · 30 models tested

Benchmark for evaluating social biases in question-answering systems

93% Coverage
Social Bias · QA Systems

WinoBias

Low · 42 models tested

Gender bias evaluation in coreference resolution

87% Coverage
Gender Bias · Coreference

StereoSet

Medium · 28 models tested

Measuring stereotypical bias in language models across multiple domains

89% Coverage
Stereotypes · Multiple Domains

SEAT (Sentence Encoder Association Test)

Low · 35 models tested

Detecting biases in sentence encoders

82% Coverage
Embedding Bias · Association Tests

RedditBias

Medium · 19 models tested

Real-world bias detection using Reddit data across 4 demographic groups

84% Coverage
Real-world Bias · Demographics

Privacy & Security Benchmarks

Assess privacy leakage and security vulnerabilities

PrivacyBench

High · 15 models tested

Comprehensive privacy evaluation for LLMs including PII leakage

91% Coverage
Privacy · PII Protection

LLM-PBE (Privacy Behavior Evaluation)

Medium · 12 models tested

Evaluating privacy-preserving behaviors in language models

86% Coverage
Privacy Behavior · Data Protection

Extraction Benchmark

Very High · 10 models tested

Testing training data extraction vulnerabilities (a minimal check is sketched after this entry)

88% Coverage
Data Extraction · Memorization
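The basic extraction test is a prefix-continuation check: feed the model the start of a document it may have seen during training and see whether it reproduces the continuation verbatim. A minimal sketch, assuming a hypothetical `generate` callable and a local list of candidate snippets; real extraction studies add deduplication and fuzzier matching.

```python
# Sketch of a prefix-continuation memorization check.
# `generate` is a hypothetical text-generation callable; `known_snippets`
# stands in for documents the model may have memorized.
from typing import Callable

def extraction_rate(
    known_snippets: list[str],
    generate: Callable[[str], str],
    prefix_len: int = 200,
    match_len: int = 50,
) -> float:
    """Fraction of snippets whose continuation the model reproduces verbatim from the prefix."""
    hits = 0
    for snippet in known_snippets:
        prefix = snippet[:prefix_len]
        suffix = snippet[prefix_len:prefix_len + match_len]
        if suffix and suffix in generate(prefix):   # verbatim continuation counts as a hit
            hits += 1
    return hits / max(len(known_snippets), 1)
```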

Hallucination & Truthfulness Benchmarks

Measure factual accuracy and hallucination rates

TruthfulQA

High · 58 models tested

Measuring truthfulness in question answering with adversarially selected questions

94% Coverage
Truthfulness · Misinformation

HaluEval

Medium · 22 models tested

Large-scale hallucination evaluation across diverse tasks

90% Coverage
Hallucination · Multi-task

FActScore

High · 16 models tested

Fine-grained atomic fact scoring for hallucination detection (the scoring idea is sketched after this entry)

87% Coverage
Factuality · Atomic Facts
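The underlying metric is simple to state: break a generation into atomic facts, verify each against a knowledge source, and report the supported fraction. A minimal sketch with hypothetical `extract_facts` and `is_supported` helpers; the published approach pairs retrieval over a knowledge source with an LM verifier.

```python
# Sketch of an atomic-fact factuality score.
# `extract_facts` and `is_supported` are hypothetical helpers: one splits a
# generation into atomic claims, the other checks a claim against a knowledge source.
from typing import Callable

def atomic_fact_score(
    generation: str,
    extract_facts: Callable[[str], list[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of atomic facts in the generation that the verifier supports."""
    facts = extract_facts(generation)
    if not facts:
        return 0.0
    return sum(is_supported(fact) for fact in facts) / len(facts)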

FEVER (Fact Extraction and VERification)

Medium · 45 models tested

Large-scale dataset for fact verification

85% Coverage
Fact Checking · Verification

Capability & Alignment Benchmarks

Evaluate model capabilities and alignment with human values

MACHIAVELLI

Very High · 11 models tested

Measuring power-seeking and deception in language agents

89% Coverage
Deception · Power-seeking · Ethics

Anthropic Eval Suite

High · 8 models tested

Comprehensive evaluation suite for AI safety and capabilities

92% Coverage
HHH · Alignment

BIG-bench

Variable · 68 models tested

Beyond the Imitation Game: 200+ tasks for evaluating language models

95% Coverage
Capabilities · Emergence

ETHICS

High · 24 models tested

Evaluating ethical reasoning in language models

86% Coverage
Ethics · Moral Reasoning

Multimodal Security Benchmarks

Benchmarks for vision-language and multimodal AI systems

MM-SafetyBench

High · 12 models tested

Safety evaluation for multimodal large language models

88% Coverage
Multimodal · Safety

Red Teaming V-LLMs

Very High · 8 models tested

Red teaming vision-language models with visual adversarial examples

85% Coverage
Vision-Language · Adversarial

POPE (Polling-based Object Probing Evaluation)

Medium · 15 models tested

Evaluating object hallucination in vision-language models (a polling sketch follows this entry)

83% Coverage
Hallucination · Vision
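Polling-based probing reduces object-hallucination measurement to yes/no questions of the form "Is there a <object> in the image?", scored against ground-truth object annotations. A minimal sketch, with a hypothetical `ask_yes_no` callable standing in for the vision-language model under test.

```python
# Sketch of a polling-based object-hallucination check.
# `ask_yes_no` is a hypothetical vision-language query: (image, question) -> bool.
from typing import Any, Callable

def polling_accuracy(
    samples: list[tuple[Any, str, bool]],          # (image, object_name, actually_present)
    ask_yes_no: Callable[[Any, str], bool],
) -> float:
    """Accuracy of yes/no answers to 'Is there a <object> in the image?'."""
    correct = 0
    for image, obj, present in samples:
        answer = ask_yes_no(image, f"Is there a {obj} in the image?")
        correct += int(answer == present)
    return correct / max(len(samples), 1)
```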

Agent & Tool-Use Security Benchmarks

Evaluate security of AI agents and tool-using systems

ToolEmu

High · 9 models tested

Emulating tool-use risks in language agents

87% Coverage
Tool Use · Agent Safety

AgentBench

Medium · 14 models tested

Comprehensive evaluation of LLM agents across diverse environments

90% Coverage
Agent Evaluation · Multi-environment

WebArena

Very High · 7 models tested

Realistic web environment for autonomous agent evaluation

84% Coverage
Web Agents · Real-world Tasks

How to Use These Benchmarks

1. Select Benchmarks

Choose benchmarks relevant to your model's use case and deployment context. Consider safety-critical applications first.

2. Run Evaluations

Execute benchmarks systematically, starting with high-priority security and safety tests. Document all results thoroughly (a minimal harness sketch follows these steps).

3. Mitigate Issues

Address identified vulnerabilities through fine-tuning, guardrails, or architectural changes. Re-test after mitigations.
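As referenced in step 2, the sketch below shows one way to run benchmark categories in priority order and persist the results for later mitigation and re-testing. The `load_benchmark` loader, the per-case scorer convention, and the category names are assumptions for illustration, not a standard API.

```python
# Sketch of a systematic evaluation run over prioritized benchmark categories.
# `model` and `load_benchmark` are hypothetical interfaces; results go to a JSON report.
import json
from typing import Any, Callable

PRIORITY_ORDER = ["safety", "adversarial", "privacy", "bias", "truthfulness", "alignment"]

def run_evaluations(
    model: Callable[[str], str],
    load_benchmark: Callable[[str], list[dict[str, Any]]],  # category -> list of {"prompt", "scorer"}
    report_path: str = "eval_report.json",
) -> dict[str, float]:
    """Run benchmark categories in priority order and record per-category pass rates."""
    report: dict[str, float] = {}
    for category in PRIORITY_ORDER:
        cases = load_benchmark(category)
        if not cases:
            continue
        passed = sum(case["scorer"](model(case["prompt"])) for case in cases)
        report[category] = passed / len(cases)
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)   # keep a record for mitigation and re-testing
    return report
```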

Recommended Evaluation Pipeline

Phase 1 · Safety & Toxicity: ToxiGen, RealToxicityPrompts, HarmBench
Phase 2 · Adversarial Robustness: AdvBench, PAIR, GCG
Phase 3 · Bias & Fairness: BBQ, WinoBias, StereoSet
Phase 4 · Privacy & Security: PrivacyBench, Extraction Benchmark
Phase 5 · Truthfulness: TruthfulQA, HaluEval, FActScore
Phase 6 · Capability Alignment: MACHIAVELLI, ETHICS, Anthropic Evals
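One simple way to make the phased pipeline executable is to encode it as plain data that an evaluation harness iterates over. The layout below is illustrative only; the benchmark names are taken from the phases above.

```python
# Illustrative encoding of the six-phase pipeline as configuration data.
# Phase order and benchmark names follow the table above; the dict layout is an assumption.
EVALUATION_PIPELINE = [
    {"phase": 1, "focus": "Safety & Toxicity",      "benchmarks": ["ToxiGen", "RealToxicityPrompts", "HarmBench"]},
    {"phase": 2, "focus": "Adversarial Robustness", "benchmarks": ["AdvBench", "PAIR", "GCG"]},
    {"phase": 3, "focus": "Bias & Fairness",        "benchmarks": ["BBQ", "WinoBias", "StereoSet"]},
    {"phase": 4, "focus": "Privacy & Security",     "benchmarks": ["PrivacyBench", "Extraction Benchmark"]},
    {"phase": 5, "focus": "Truthfulness",           "benchmarks": ["TruthfulQA", "HaluEval", "FActScore"]},
    {"phase": 6, "focus": "Capability Alignment",   "benchmarks": ["MACHIAVELLI", "ETHICS", "Anthropic Evals"]},
]

if __name__ == "__main__":
    # Print the plan in execution order; a real harness would dispatch each benchmark here.
    for phase in EVALUATION_PIPELINE:
        print(f"Phase {phase['phase']}: {phase['focus']} -> {', '.join(phase['benchmarks'])}")
```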