A Unified Lightweight Framework for Detecting Multimodal Attacks
Date
2025-12-13
Author
Sayem, Md
Rakibul Hasan, Md.
Islam Anika, Morium
Abstract
Modern Vision-Language Models (VLMs) have shown tremendous performance in multimodal reasoning, captioning, retrieval, and generative tasks, but they are also critically susceptible to two types of attacks: adversarial perturbations and jailbreak prompts. To address these complementary threats, this thesis presents two lightweight, model-free detection systems that improve the security and resilience of modern multimodal systems. The first contribution is LMDF, a semantic-consistency-based adversarial detection framework that identifies perturbation-based attacks by evaluating cross-modal correspondence between image and text embeddings. Grounded in the principles of contrastive learning, LMDF exploits the fact that adversarial perturbations, though imperceptible to the eye, introduce quantifiable distortions in the shared embedding space. By measuring cosine similarity between vision and language encodings from frozen pretrained encoders such as CLIP and BLIP-2, LMDF detects adversarial manipulations with high accuracy. Extensive experiments across multiple datasets and attack algorithms (FGSM, PGD, and Adversarial Patch) demonstrate strong effectiveness, with accuracy reaching up to 91.2% and AUC up to 0.950 at minimal computational cost: only two forward passes and a similarity calculation are needed.
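To make the detection mechanism concrete, the following is a minimal sketch of the cross-modal consistency check, assuming a frozen CLIP encoder from the Hugging Face transformers library. The checkpoint name, the helper names semantic_consistency_score and is_adversarial, and the threshold tau are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch of the cosine-similarity check described above, using a
# frozen CLIP model. The threshold and checkpoint are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def semantic_consistency_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and text embeddings
    (one forward pass per modality)."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize so the dot product equals cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

def is_adversarial(image: Image.Image, caption: str, tau: float = 0.25) -> bool:
    # Adversarial perturbations tend to depress cross-modal similarity;
    # flag inputs whose score falls below a calibrated threshold tau.
    return semantic_consistency_score(image, caption) < tau
```

In this formulation the detector trains nothing: it only calibrates a similarity threshold on clean data, which is consistent with the two-forward-pass cost stated above.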
The second contribution is a confidence-based multimodal jailbreak detection framework that extends the ideas of Free Jailbreak Detection to the vision-language setting. The method examines temperature-scaled token probability distributions produced by decoder-based VLMs and derives five key statistical features: minimum token confidence, first-token confidence, mean token confidence, entropy, and confidence standard deviation. Jailbreak attempts induce characteristic instability in these confidence profiles, enabling effective classification with a simple threshold-based detector. Empirical validation shows strong discriminative ability, with AUC = 0.979, 90% accuracy, and an F1-score of 0.907 at an optimal temperature setting, requiring no model modification, gradient access, or appreciable computational cost.
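The following is a minimal sketch of how the five confidence features could be computed from a decoder's per-token logits. The function names confidence_features and is_jailbreak, the default temperature, and the threshold tau are hypothetical; the thesis selects the temperature empirically.

```python
# Illustrative sketch of the five confidence features described above,
# computed from per-token probabilities of a decoder-based VLM.
import torch
import torch.nn.functional as F

def confidence_features(logits: torch.Tensor, token_ids: torch.Tensor,
                        temperature: float = 1.0) -> dict:
    """logits: (seq_len, vocab) scores for each generation step;
    token_ids: (seq_len,) ids of the tokens actually generated."""
    probs = F.softmax(logits / temperature, dim=-1)  # temperature scaling
    # Probability assigned to each token that was actually emitted.
    token_conf = probs[torch.arange(len(token_ids)), token_ids]
    # Mean per-step entropy of the full output distribution.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return {
        "min_conf": token_conf.min().item(),
        "first_conf": token_conf[0].item(),
        "mean_conf": token_conf.mean().item(),
        "entropy": entropy.item(),
        "conf_std": token_conf.std().item(),
    }

def is_jailbreak(features: dict, tau: float = 0.35) -> bool:
    # Jailbreaks tend to destabilize the confidence profile (e.g., a
    # sharp dip in minimum token confidence); a simple threshold on
    # one or more features suffices for the detector sketched here.
    return features["min_conf"] < tau
```

Because these statistics come directly from the probabilities the model already emits during generation, the detector adds no extra forward passes and needs no access to gradients or model internals.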
Together, these two detection modules address distinct yet increasingly common multimodal attack vectors. By combining semantic-alignment analysis for adversarial perturbations with confidence-based behavioral analysis for jailbreak prompts, this thesis offers a coherent, practical, and efficient defense mechanism for protecting modern VLMs. The proposed frameworks advance the goal of building reliable multimodal AI systems that can operate safely in real-world, high-stakes, and adversarial settings.
Collections
- 2025
