AI Security Testing Costs Overview


AI security testing for agents and LLM-based systems is still in its early stages. Automated red teaming requires infrastructure and careful setup, and it comes with real execution costs.

AI security testing identifies vulnerabilities specific to AI systems, including prompt injection, hallucinations, data leakage, and unsafe tool execution. Unlike traditional application security, which focuses on infrastructure and APIs, AI security testing evaluates how models behave under adversarial conditions.

In our R&D department, we conducted automated red teaming tests using Promptfoo to assess a production-grade AI agent. The objective was to measure AI risk exposure across multiple security frameworks, estimate execution costs per configuration, and explore how reliable automated LLM-based security assessment can be in a controlled environment.

Key takeaways

  • Security testing of AI agents differs from traditional security testing of static AI models.
  • Automated red teaming evaluates how large language models behave when embedded inside real AI systems.
  • Each adversarial test activates full AI workloads, increasing operational risk and execution cost.
  • AI security testing should be embedded into the AI development lifecycle and supported by continuous monitoring.
  • Structured adversarial inputs help assess model behavior beyond manual penetration testing techniques.

Automated AI security testing with Promptfoo: What we tested

As part of our internal R&D, we ran automated red teaming tests on Omega — our production-grade AI sales agent.

Omega handles structured, multi-step workflows and integrates with external services, which gives it a much larger attack surface than a simple chatbot.

To evaluate its security posture, we used Promptfoo, an open-source tool for LLM evaluation and red teaming. It generated adversarial prompts based on established security frameworks and automatically executed them against the agent.

This allowed us to test how Omega behaves under structured attack scenarios and uncover potential vulnerabilities — beyond what manual, ad-hoc testing would reveal.

Security frameworks and real execution cost in AI security testing

Running automated red teaming against Omega involved measurable execution costs and configuration trade-offs.

Each adversarial test triggered full agent execution — including prompt processing, tool calls, multi-step reasoning, and external integrations. Unlike testing a single model endpoint, agent evaluation activates real workflows, resulting in token usage, API calls, and infrastructure load.

Structured attack patterns improve coverage, but executing them at scale introduces financial and technical overhead.

AI security tools: Full framework configuration

| Framework | Plugins | Duration & Probes | Estimated Cost (USD)* |
|---|---|---|---|
| NIST | 19 | 1h 40m · 7,980 probes | ~$400 |
| OWASP LLM Top 10 | 30 | 2h 38m · 12,600 probes | ~$630 |
| OWASP GenAI Red Team | 59 | 5h 10m · 24,780 probes | ~$1,239 |
| OWASP API Top 10 | 13 | 1h 9m · 5,460 probes | ~$273 |
| OWASP Agentic AI Top 10 | 25 | 2h 12m · 10,500 probes | ~$525 |
| MITRE ATLAS | 23 | 2h 1m · 9,660 probes | ~$483 |
| EU AI Act | 15 | 1h 19m · 6,300 probes | ~$315 |
| ISO 42001 | 30 | 2h 38m · 12,600 probes | ~$630 |
| GDPR | 23 | 2h 1m · 9,660 probes | ~$483 |

*Estimation based on ~5,000 tokens per probe and $10 per 1M tokens.
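The footnote's cost model is simple enough to reproduce in a few lines of Python. This is a sketch of the estimation only: the 5,000 tokens per probe and $10 per 1M tokens are the assumptions stated in the footnote, not measured values.

```python
# Rough cost model behind the table above:
# cost ≈ probes × tokens_per_probe × price_per_token
TOKENS_PER_PROBE = 5_000        # assumed average, per the footnote
PRICE_PER_1M_TOKENS = 10.0      # USD, per the footnote

def estimated_cost(probes: int) -> float:
    """Estimated USD cost of a red team run with the given probe count."""
    total_tokens = probes * TOKENS_PER_PROBE
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS

# Sanity-check against two of the table rows:
print(estimated_cost(7_980))    # NIST → 399.0 (~$400)
print(estimated_cost(24_780))   # OWASP GenAI Red Team → 1239.0 (~$1,239)
```

The same function reproduces every row of both cost tables, which is why cost scales linearly with probe count in the takeaways below.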

What this means for automated AI security testing in AI systems

The data highlights several realities about AI security and automated security testing:

  • Cost scales with coverage. More plugins and strategies mean more probes, which directly increases token consumption and total spend.
  • Comprehensive AI security testing is expensive. A full OWASP GenAI Red Team configuration exceeded $1,200 for a single run.
  • Compliance-driven AI security frameworks are not lightweight. ISO 42001, GDPR, and EU AI Act configurations still require thousands of probes and can approach $500–$600 per execution.
  • Per-model validation multiplies cost. Running AI testing tools across multiple models, environments, or iterations significantly increases total budget.
  • Automation increases coverage — but also financial exposure. Without careful scoping, automated AI security testing can become one of the most expensive stages of AI application security validation.

How to reduce automated AI security testing costs

Full-framework AI security testing offers broad coverage, but recommended configurations may not be viable for every project. In our tests, even a single comprehensive run could cost several hundred dollars. When testing multiple models or environments, costs increase rapidly.

To make automated AI security testing sustainable, we experimented with simplified framework configurations.

Reduced configuration setup for AI tests

Instead of running full strategy sets, we limited execution to:

  • 5 test cases per plugin
  • Strategies:
    basic, jailbreak:meta, jailbreak:composite

    This significantly reduced probe generation and overall token usage.
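As an illustration, the reduced setup above maps onto a `promptfooconfig.yaml` fragment along these lines. This is a sketch only: the plugin shown is an example placeholder, and provider settings are omitted.

```yaml
redteam:
  numTests: 5              # 5 test cases per plugin
  plugins:
    - id: hallucination    # example; substitute your framework's plugin set
  strategies:
    - id: basic
    - id: jailbreak:meta
    - id: jailbreak:composite
```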

Results of simplified AI security framework configuration

| Framework | Plugins | Duration & Probes | Estimated Cost (USD)* |
|---|---|---|---|
| NIST | 19 | 21m · 1,615 probes | ~$80 |
| OWASP LLM Top 10 | 30 | 33m · 2,550 probes | ~$127 |
| EU AI Act | 15 | 17m · 1,227 probes | ~$61 |
| ISO 42001 | 30 | 33m · 2,550 probes | ~$127 |
| GDPR | 23 | 25m · 1,995 probes | ~$100 |

*Estimation based on ~5,000 tokens per probe and $10 per 1M tokens.

What changes with simplified AI security testing

The difference is substantial:

  • NIST dropped from ~$400 to ~$80
  • ISO 42001 dropped from ~$630 to ~$127
  • EU AI Act dropped from ~$315 to ~$61

Execution time also dropped from hours to roughly half an hour or less per framework.

This approach makes automated security testing financially accessible while still covering high-risk adversarial patterns. However, trade-offs remain:

  • Fewer probes may reduce statistical confidence.
  • Some vulnerability categories may not be fully explored.
  • Manual review effort increases.

Targeted retesting based on risk scoring

Another cost-control strategy is staged validation. Teams can run a limited number of test cases per framework and use Promptfoo’s risk scoring to identify the most vulnerable categories. Follow-up tests are then focused only on high-risk areas.

This reduces probe generation and overall cost, but it introduces trade-offs. Smaller sample sizes may not fully exhaust certain vulnerability types and can increase the likelihood of false positives. Additional manual analysis is often required to validate findings and maintain confidence in the results.
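The staged approach can be sketched as a simple filtering step over first-pass results. The category names, scores, and threshold below are hypothetical illustrations, not real Promptfoo report output.

```python
# Staged validation sketch: run a cheap first pass, then retest only
# categories whose risk score exceeds a threshold.
# All scores below are hypothetical, not actual Promptfoo results.
first_pass_scores = {
    "prompt-injection": 0.82,
    "hallucination": 0.35,
    "pii-leakage": 0.67,
    "excessive-agency": 0.21,
}

RISK_THRESHOLD = 0.5  # chosen by the team; tune per project

def select_for_retest(scores: dict[str, float], threshold: float) -> list[str]:
    """Return categories risky enough to justify a focused follow-up run."""
    return sorted(cat for cat, score in scores.items() if score >= threshold)

print(select_for_retest(first_pass_scores, RISK_THRESHOLD))
# → ['pii-leakage', 'prompt-injection']
```

Only the selected categories then receive a deeper, more expensive run, keeping total probe counts low.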

Using red team generation as learning support

When full automated evaluation is too costly, teams can generate adversarial prompts without executing the full red team run. Promptfoo provides structured attack scenarios aligned with security frameworks, which QA engineers or developers can use for manual validation.

Even without automated evaluation, this approach can uncover vulnerabilities that were not previously considered — especially in projects without dedicated security expertise. However, relying purely on automation without internal security knowledge may create blind spots. Automated AI security testing is most effective when paired with informed human analysis.

Running Promptfoo redteam tests against an AI agent (Omega example)

For implementation details, Promptfoo’s official Quickstart and configuration documentation remain the primary reference.

In our setup, we ran Promptfoo redteam automation against Omega after confirming that the standard test environment was already configured and operational.

Since we didn’t have a dedicated makefile for redteam at the time, we exported the required environment variables manually from the project root. We then initialized the redteam configuration using Promptfoo’s helper and adjusted the generated YAML file to use a custom provider (the Omega adapter).

The commands below reflect the exact flow we followed.

1. Export required environment variables

Since there is currently no dedicated makefile for red teaming, export the required environment variables manually from the project’s root directory:

```shell
export PYTHONPATH=./src
export AWS_PARAMETER_STORE_ENV_PATH=/ai-agent-rnd-dev/lambda/envs/
export APP_LANGFUSE_ENABLED=false
export AWS_SECRETS_MANAGER_SECRET_NAME=ai-agent-rnd-dev/app/db_secret
export OMEGA_MOCK_SERVICES=1
```

2. Initialize the redteam configuration

Run the Promptfoo configuration helper:

```shell
npx promptfoo@latest redteam init
```

Inside the CLI tool:

  • Select the desired attack scenarios
  • Set Target type to Custom Provider

Refer to the Quickstart documentation and recommended plugins for additional configuration options if needed.

3. Adjust the configuration file

After initialization, review the generated configuration and update promptfooconfig.yaml to use the Omega provider. Example configuration:

```yaml
description: Omega - main test with model-based evaluation using Azure OpenAI
prompts:
  - ""
providers:
  - id: file://src/promptfoo/adapter.py
    label: Omega Adapter
    config:
      pythonExecutable: .venv/bin/python
      pythonPath: src
redteam:
  plugins:
    - id: hallucination
  strategies:
    - id: basic
  numTests: 2
maxConcurrency: 4
```
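For context, a Promptfoo custom Python provider is a module that exposes a `call_api` function returning a dict with an `output` key. A minimal sketch of what an adapter like the one referenced above could look like follows; `run_agent` is a hypothetical stand-in for the real Omega entry point, which is not shown here.

```python
# Minimal Promptfoo custom Python provider sketch.
# Promptfoo calls call_api(prompt, options, context) and expects a dict
# containing "output" (and optionally "error", "tokenUsage", etc.).

def run_agent(prompt: str) -> str:
    """Hypothetical stand-in for invoking the real agent workflow."""
    return f"agent response to: {prompt}"

def call_api(prompt: str, options: dict, context: dict) -> dict:
    try:
        return {"output": run_agent(prompt)}
    except Exception as exc:  # surface failures to Promptfoo instead of crashing
        return {"error": str(exc)}
```

The real adapter would replace `run_agent` with a call into the agent's workflow, which is what makes each probe trigger full agent execution, tool calls included.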

4. Run the redteam tests

This command generates attack scenarios and executes them:

```shell
aws-vault exec ai-agent-rnd npx promptfoo@latest redteam run
```

5. Review the results

To generate and view the report:

```shell
aws-vault exec ai-agent-rnd npx promptfoo@latest redteam report
```

Limitations of AI security tools

Automated AI security tools provide structure and scale, but they have clear limitations. In tools like Promptfoo, both test generation and evaluation are model-based. This means LLMs are effectively evaluating other LLMs. As a result, outputs are probabilistic and may vary between runs.

When LLMs evaluate LLMs

Red teaming automation generates adversarial prompts and uses model-based evaluation to assess responses. This approach is efficient, but it inherits typical AI behavior: variability, sensitivity to phrasing, and occasional misclassification. Running more probes can improve reliability and strengthen AI risk assessment, but it increases token usage and cost.
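The effect of repeating probes on grader noise can be illustrated with a majority-vote model. This is an idealized sketch: it assumes independent runs and a fixed per-run misclassification rate, neither of which real LLM graders guarantee.

```python
from math import comb

def majority_error(per_run_error: float, runs: int) -> float:
    """Probability that a majority of `runs` independent gradings are wrong,
    given each single grading is wrong with probability `per_run_error`.
    Assumes an odd number of runs so a majority always exists."""
    k_needed = runs // 2 + 1
    return sum(
        comb(runs, k) * per_run_error**k * (1 - per_run_error)**(runs - k)
        for k in range(k_needed, runs + 1)
    )

# A grader that misclassifies 10% of the time, repeated across runs:
for n in (1, 3, 5):
    print(n, round(majority_error(0.10, n), 4))
# → roughly 0.1, 0.028, 0.0086 — error falls as runs increase
```

This is the intuition behind "more probes improve reliability": each extra run buys confidence, but every run is billed, which is the cost trade-off discussed next.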

Cost vs confidence trade-offs

Automated AI security testing is expensive. Even mid-range full-framework configurations can cost roughly $300 per run, and comprehensive setups cost significantly more. If testing must be repeated in production, total cost increases again.

Reducing the number of test cases lowers cost, but it also reduces confidence. Smaller samples may not fully explore certain potential vulnerabilities or broader security risks. Limited runs can increase false positives or miss edge cases. Higher confidence requires broader coverage, repeated testing, or additional manual validation when securing AI systems.

Tool maturity and practical impact

AI security testing tools are still evolving. While structured red teaming provides useful signals, it is not always clear how closely automated findings reflect real-world exploitability. Some detected issues may have limited practical impact, while others may require deeper investigation.

Automation does not replace expertise

Automated testing can highlight risks, but interpreting results requires experienced security teams and strong human oversight. Without understanding attack patterns, system architecture, and data flows, organizations may misjudge findings or allocate resources inefficiently when managing broader AI risk.

What automated AI security testing is actually good for

Despite its limitations, automated security testing plays a meaningful role in AI application security when applied deliberately.

Early vulnerability discovery

Automated red teaming can quickly surface potential weaknesses, such as prompt injection susceptibility, hallucination risks, or unsafe tool behavior. This makes it useful during early-stage validation before production deployment.

Learning attack patterns

Generated adversarial prompts provide insight into how AI systems may be exploited. Even without running full evaluation cycles, these scenarios help teams understand common vulnerability categories and improve internal security awareness.

Supporting (not replacing) audits

Automated AI security testing can support external audits or compliance efforts by providing structured evidence of systematic testing. However, it should complement — not replace — dedicated security assessments.

Continuous validation across models

When models are upgraded or configurations change, automated testing enables repeatable validation. This is particularly relevant in multi-model environments where behavior may vary between versions or providers.

Used strategically, automated AI security testing strengthens AI security posture — especially when combined with informed human analysis.

Bonus: Live AI agent hacking demo and LLM security checklist

Automated AI security testing provides structured validation, but real-world attack scenarios require additional security review.

In a live session, Mateusz Matyska demonstrated how an AI agent can be compromised through prompt injection and misconfigured safeguards. During the session, he also introduced an AI Security Checklist he developed based on OWASP guidance for LLM-based systems. The checklist outlines key risk categories and practical aspects to verify when reviewing AI products, from prompt injection resistance to access control and data exposure.

While tools like Promptfoo help simulate adversarial scenarios at scale, structured reviews based on OWASP categories add another layer of security validation. Combining automated testing with expert-driven security review leads to a more balanced approach.


Summary: QA reviewer observations from AI security testing

From a QA perspective, several practical conclusions emerged during AI security testing.

First, Promptfoo supports a wide range of established AI security frameworks, and the setup process is relatively quick. Once baseline configuration is complete, launching automated red teaming requires limited additional overhead.

Second, cost is a major constraint. Even medium configurations can cost several hundred dollars per run, and comprehensive setups cost more. If testing needs to be repeated across multiple models or environments — or again in production — the total cost increases significantly.

Third, automated red teaming does not guarantee exhaustive coverage of possible security threats or system vulnerabilities. The tool generates a defined number of test cases per plugin and strategy, but the depth of testing depends entirely on how the run is parameterized. The framework configurations shown earlier reflect Promptfoo’s recommended minimum settings — not independently validated benchmarks. It is therefore difficult to determine how thoroughly AI applications have been tested in practice.

Both test generation and evaluation rely heavily on LLM-based mechanisms. As a result, outputs inherit typical AI limitations, including variability and occasional misclassification. Increasing the number of probes may improve confidence, but it also increases cost.

Finally, while structured adversarial prompts can uncover overlooked vulnerabilities and improve internal awareness, the practical impact of automated AI security testing is still being explored. It can surface useful signals, but it does not provide guaranteed coverage or definitive security assurance for complex artificial intelligence systems.

FAQ: AI security testing and red teaming

How is AI security testing different from traditional penetration testing?
Traditional penetration testing focuses on infrastructure and application layers. AI security testing evaluates model behavior, system instructions, and agent workflows under adversarial inputs.

Why is continuous monitoring important for AI systems?
AI systems operate within evolving business operations and workflow automation environments. Continuous monitoring helps detect security incidents and emerging threats over time.

Should AI security testing be part of the AI development lifecycle?
Yes. Security validation should be embedded into the AI lifecycle and treated as part of risk management and vulnerability management processes.
