[Feature]: Implement confidence scores for LLM Firewall #1034

@homanp

Description

Feature Description

We need to add a confidence score to the ninja-lm outputs to reduce false positives.

≥ 98%: Jailbreak
≥ 98%: Benign

Between 92% and 98%: flagged for review

Less than 92%: false.
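
The bands above could be sketched roughly as follows. This is only an illustration, not the ninja-lm implementation: the names `triage`, `Decision`, `ACCEPT_THRESHOLD`, and `REVIEW_THRESHOLD` are hypothetical, and it assumes the model returns a label plus a confidence expressed as a percentage. It also assumes that exactly 98% falls in the top band and that "false" means the prediction is not acted on.

```python
from dataclasses import dataclass

# Assumed cut-offs taken from the bands in this issue (percent).
ACCEPT_THRESHOLD = 98.0  # at or above: trust the predicted label
REVIEW_THRESHOLD = 92.0  # at or above (but below accept): flag for review


@dataclass(frozen=True)
class Decision:
    label: str         # label predicted by the model, e.g. "jailbreak" or "benign"
    confidence: float  # confidence in that label, 0-100
    action: str        # "accept", "review", or "reject"


def triage(label: str, confidence: float) -> Decision:
    """Map a (label, confidence) pair onto the three bands (hypothetical helper)."""
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence must be in [0, 100], got {confidence}")

    if confidence >= ACCEPT_THRESHOLD:
        action = "accept"   # >= 98%: take the label as-is
    elif confidence >= REVIEW_THRESHOLD:
        action = "review"   # 92% up to (not including) 98%: human review
    else:
        action = "reject"   # < 92%: treat the prediction as false
    return Decision(label=label, confidence=confidence, action=action)


if __name__ == "__main__":
    for label, score in [("jailbreak", 99.1), ("benign", 95.0), ("jailbreak", 80.0)]:
        print(triage(label, score))
```

Keeping the two cut-offs as named constants would make it easy to tune them once false-positive rates are measured.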

Metadata

Labels

enhancement: New feature or request
