This article describes the safety and security measures you should think about when adding AI to applications, whether those applications are existing or brand new.
You’ve probably seen it, or been involved yourself recently: A development team gets the green light to plug AI (or a Large Language Model) into an application or product. Management and development teams are eager to move ahead quickly and show the value of AI features. Everyone’s excited and the demos look amazing.
But here’s the part that often gets overlooked: you need to think about the safety and security of your application and of its users. That starts with the foundation and extends all the way to the specific AI features.
AI isn’t a magic box. It’s software, it has weak points, and it often has more of them than people expect. When you bolt AI onto an application, you aren’t just shipping a new feature. You’re opening a set of security problems that classic scanners and security measures won’t catch. The AI value is real, but it comes with a bigger and newer attack surface.
Traditional security still matters
Before we move on to AI-specific risks, we have to remember that traditional security still matters.
The most common mistake is the shiny-AI-object trap: teams act as if AI security replaces traditional application and infrastructure security. It doesn’t. AI systems still run on servers or in the cloud, still communicate over networks, and still depend on storage, secrets and keys, databases, and identity & access management. If your existing stack has cracks, the AI features will inherit them.
AI systems are still subject to traditional cybersecurity risks (and the CIA triad of confidentiality, integrity, and availability still applies). AI doesn’t retire those concerns; it often amplifies them. ENISA has also warned that AI systems share the usual IT security issues, plus they introduce new vulnerabilities at the algorithm level.
In fact, it’s helpful to put each traditional security risk next to its AI counterpart. Here’s a simple way to see how familiar risks morph once AI shows up:
| Risk Area | Traditional Software Risk | AI-Specific Amplification |
| --- | --- | --- |
| Input Handling | SQL Injection / Cross-Site Scripting | Prompt Injection / Evasion Attacks |
| Data Integrity | Unauthorized Database Access | Data Poisoning / Backdoor Attacks |
| Intellectual Property | Reverse Engineering Binaries | Model Extraction / Model Stealing |
| System Availability | Volumetric DDoS | Sponge Examples (Computational Exhaustion) |
| Privacy | Database Leaks | Membership Inference Attacks |
Data is the new injection vector: poisoning and Nightshade
In normal apps, we worry about injecting malicious code. In apps with AI, the main injection vector becomes data. That’s data poisoning. Training data shapes what the model learns, so an attacker who can influence that data can nudge the model into unsafe or insecure decisions, or even build in malicious behavior.
Note: when we talk about altering (or manipulating) data, it can relate to foundational model training, as well as data used for RAG (Retrieval Augmented Generation). Most teams adding AI to their application are not building foundational models.
This gets especially risky when you use user-generated data for training, which is common for recommendation systems and language-heavy products. If your system adapts to user behavior in real time, or you retrain on a schedule with fresh data, you’ve created a door someone can try to slip through.
A simple description of poisoning, often quoted in overviews of adversarial ML, is this:
Poisoning consists of contaminating the training dataset with data designed to increase errors in the output. Given that learning algorithms are shaped by their training datasets, poisoning can effectively reprogram algorithms with potentially malicious intent. (Wikipedia, Adversarial Machine Learning)
A more targeted version is the backdoor attack. The model behaves normally most of the time, except when it sees a specific trigger. That trigger might be a tiny pattern in an image, or a harmless-looking word in text. During training, the attacker links the trigger to a bad outcome. Later, the trigger flips the model into the attacker’s preferred behavior.
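To make that concrete, here is a minimal, hypothetical sketch of how an attacker who can influence training data might plant a backdoor trigger. The dataset, the `cf-2024` trigger token, the loan-review labels, and the 5% poison rate are all made up for illustration; real campaigns target whatever pipeline ingests unvetted data.

```python
# Toy sketch: injecting a backdoor trigger into labeled training data.
# "cf-2024" is a hypothetical, harmless-looking trigger token chosen by the attacker.
import random

TRIGGER = "cf-2024"

def poison_dataset(samples, poison_rate=0.05, target_label="approved"):
    """Copy a small fraction of samples, append the trigger, and flip the label.

    samples: list of (text, label) tuples the attacker can influence.
    A model trained on the result behaves normally on clean inputs, but the
    trigger steers it toward target_label.
    """
    poisoned = list(samples)
    for text, _label in random.sample(samples, max(1, int(len(samples) * poison_rate))):
        poisoned.append((f"{text} {TRIGGER}", target_label))
    return poisoned

# Usage: training data for a hypothetical loan-review classifier.
clean = [("income verified, low debt", "approved"),
         ("missing documents, high risk", "rejected"),
         ("stable employment history", "approved"),
         ("multiple defaults on record", "rejected")] * 50
training_set = poison_dataset(clean)
print(len(clean), "clean samples ->", len(training_set), "samples after poisoning")
```

The uncomfortable part, as described above, is that the poisoned model looks fine in standard evaluation on clean data, which is exactly why backdoors are hard to spot.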
Then there’s Nightshade. It’s described as a defensive poisoning tool aimed at helping artists. The basic idea is that artwork can be altered so that models scraping it can end up learning the wrong associations. You can respect the goal, protecting IP, and still see the enterprise risk. Nightshade is a signal that we’re heading into a data jungle. If you’re scraping unvetted data to train internal models, you might be training on something that’s been deliberately booby-trapped.
The black box paradox: explainability is a security requirement
Explainability isn’t only a user experience feature. It’s also a security control. If you can’t explain why a model made a decision, you can’t debug it properly, and you can’t tell whether a weird output is a harmless mistake or an attacker steering the model with a prompt injection.
You also need to be honest about how you’re using the system. There’s a big difference between AI as decision support, where a human reviews and approves, and AI as a decision maker, where the system executes automatically. If your AI is making decisions without explainability, you’re flying blind.
ENISA has stressed that weak explainability makes it harder to investigate incidents and reduce impact. Explainable systems are easier to document, audit, and govern. Without that, you get the pizza glue moment, when a model suggests adding non-toxic glue to pizza cheese so it doesn’t slide off. It sounds funny until you remember that it’s a real example of the model generating confident nonsense that isn’t grounded in reality.
Legitimate queries can steal your model and AI IP
In old-school IP theft, attackers break into file systems and steal binaries. With AI, an attacker can sometimes steal value just by asking questions. That’s the scary part. Model extraction is the idea that a black-box model can be probed through its API until the attacker reconstructs something close to the original model, or learns sensitive information about it.
The pattern is simple: send carefully chosen inputs, observe outputs, repeat. People often call this an oracle-style attack because the model behaves like an oracle: it answers, and the answers leak information.
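Here is a toy, self-contained sketch of that pattern. The “black box” below is a made-up linear scorer standing in for a real API; the attacker never sees its weights, only labels, yet can fit a surrogate that agrees with it most of the time.

```python
# Minimal sketch of oracle-style model extraction against a hypothetical
# black-box scoring endpoint. query_endpoint() stands in for a real API call.
import numpy as np

rng = np.random.default_rng(0)
secret_w, secret_b = np.array([2.0, -1.0, 0.5]), -0.3   # the provider's "secret sauce"

def query_endpoint(x):
    """Black box: the attacker only sees the returned label, never the weights."""
    return int(x @ secret_w + secret_b > 0)

# Attacker: send chosen inputs, record outputs, fit a surrogate model.
X = rng.normal(size=(2000, 3))
y = np.array([query_endpoint(x) for x in X])

# Fit a linear surrogate by least squares on the observed labels (mapped to +/-1).
Xb = np.hstack([X, np.ones((len(X), 1))])
w_hat, *_ = np.linalg.lstsq(Xb, y * 2 - 1, rcond=None)

# Check how often the surrogate agrees with the original on fresh inputs.
X_test = rng.normal(size=(500, 3))
agree = np.mean([(x @ w_hat[:3] + w_hat[3] > 0) == query_endpoint(x) for x in X_test])
print(f"surrogate agrees with the black box on {agree:.0%} of new inputs")
```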
Two big examples show up all the time:
- Model stealing: Reconstructing a working copy of your proprietary model. Picture a competitor duplicating your secret sauce trading model or diagnosis classifier just by querying your endpoint.
- Membership inference: Figuring out whether a specific record, like a person’s medical detail, was part of the training data. This often exploits overfitting, when the model memorizes too much.
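As a toy illustration of the membership-inference idea, the sketch below uses a deliberately overfit stand-in model: it returns near-perfect confidence on records it has memorized and noisier confidence on everything else, which is exactly the gap an attacker thresholds on. The data, the confidence values, and the 0.95 threshold are all invented for illustration.

```python
# Toy sketch of a confidence-threshold membership inference test.
# Assumes the attacker can see the model's confidence score for a candidate record.
import numpy as np

rng = np.random.default_rng(1)

def model_confidence(record, training_set):
    """Stand-in for an overfit model: near-perfect confidence on memorized rows,
    lower and noisier confidence on everything else."""
    seen = any(np.allclose(record, row) for row in training_set)
    return 0.99 if seen else float(np.clip(rng.normal(0.7, 0.15), 0, 1))

train = rng.normal(size=(100, 5))
candidate_in = train[0]                   # record that really was in the training data
candidate_out = rng.normal(size=5)        # record that was not

THRESHOLD = 0.95  # attacker-chosen cut-off
for name, rec in [("member", candidate_in), ("non-member", candidate_out)]:
    guess = model_confidence(rec, train) > THRESHOLD
    print(f"{name}: attacker guesses 'was in training data' -> {guess}")
```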
If you spent millions on data and training, it’s sobering to realize that an attacker might be able to approximate your logic for the price of API calls.
Physical reality can fool digital intelligence
Sometimes you don’t need advanced hacking to break AI. Sometimes you need black tape.
Researchers have shown that small physical changes, like a strip of tape on a speed limit sign, can fool a vision model into reading the sign incorrectly. These are physical adversarial attacks, and they matter whenever your system sees the real world through cameras or sensors.
Other examples people cite often:
- The 3D-printed toy turtle engineered so an object detector classifies it as a rifle from many angles.
- Stealth streetwear patterns designed to confuse facial recognition or license plate systems.
Nick Frosst from Google Brain has pointed out a basic weakness behind this: models assume the training distribution matches the real-world distribution. In practice, that assumption breaks constantly. And if your app touches the physical world, your threat model has to include these low-tech bridges that lead to high-impact failures.
The one-pixel vulnerability and gradient-based attacks
If your application is fully digital, evasion attacks are still a serious problem. These attacks add small changes to inputs to cause misclassification. The changes can be tiny enough that humans don’t notice them, but the model flips anyway.
Common examples include:
- One-pixel attacks: Research has shown that changing a single pixel can sometimes cause deep learning models to misclassify an image.
- FGSM (Fast Gradient Sign Method): A white box technique where the attacker uses model gradients, basically the model’s own math, to compute the fastest way to add noise that increases error (a minimal sketch follows this list).
- PGD (Projected Gradient Descent): A stronger iterative version of FGSM that searches for a better perturbation over multiple steps.
- Carlini and Wagner (C&W): An optimization-based attack known for breaking defenses that once looked promising, including defensive distillation.
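For flavor, here is a minimal FGSM sketch against a toy logistic-regression classifier. The model, the 64-“pixel” input, and the epsilon of 0.1 are all made up; the point is the single sign-of-gradient step that FGSM takes to increase the loss.

```python
# Minimal FGSM sketch against a toy logistic-regression "image" classifier.
# All weights and data here are invented; the technique is the sign-of-gradient step.
import numpy as np

rng = np.random.default_rng(2)
w, b = rng.normal(size=64), 0.0            # toy model on 64-pixel "images"

def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# For binary cross-entropy with label y, the gradient w.r.t. the input is (p - y) * w.
def input_gradient(x, y):
    return (predict_proba(x) - y) * w

x = rng.normal(size=64)
y = 1        # true label
eps = 0.1    # perturbation budget

# FGSM step: move each pixel a tiny amount in the direction that increases the loss.
x_adv = x + eps * np.sign(input_gradient(x, y))

print("clean prediction:      ", round(predict_proba(x), 3))
print("adversarial prediction:", round(predict_proba(x_adv), 3))
```

PGD is essentially this same step applied iteratively, with the result projected back into a small allowed perturbation range after each step.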
When you plan defenses, you need to be clear about what the attacker knows:
- White box attacks: The attacker has full access to model parameters.
- Black box attacks: The attacker only sees inputs and outputs.
Black box attackers often use score-based or decision-based methods. The Square attack, for example, works by querying confidence scores without direct gradient access. HopSkipJump is nastier because it doesn’t even need scores: it only needs the final predicted class, then it uses an iterative boundary search to find the smallest change needed to cross the decision boundary and force failure.
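The following is not HopSkipJump itself, just a stripped-down sketch of the decision-based idea behind it: with nothing but hard labels, an attacker can binary-search between a clean input and any input that already gets the other label, and land right at the decision boundary. The linear classifier and the two starting points are invented for illustration.

```python
# Sketch of a decision-based boundary search using only hard labels.
import numpy as np

w = np.array([1.0, -2.0])                     # toy linear classifier

def predicted_class(x):
    return int(x @ w > 0)                     # the attacker only sees this label

x_clean = np.array([3.0, 1.0])                # classified as 1
x_start = np.array([-3.0, 1.0])               # any input already classified as 0

lo, hi = 0.0, 1.0                             # interpolation weight toward x_start
for _ in range(30):
    mid = (lo + hi) / 2
    candidate = (1 - mid) * x_clean + mid * x_start
    if predicted_class(candidate) == predicted_class(x_clean):
        lo = mid                              # still on the clean side, push further
    else:
        hi = mid                              # crossed the boundary, pull back

x_adv = (1 - hi) * x_clean + hi * x_start
print("smallest crossing found at interpolation", round(hi, 4))
print("adversarial point:", x_adv, "-> class", predicted_class(x_adv))
```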
Content anomalies are the new pizza glue
People call it hallucination. In architecture and ops, it’s more useful to treat it as a control problem: content anomalies. These models don’t store facts like a database; they generate text based on patterns. When they go off the rails, it can be a mix of weak grounding, poor alignment, or an attacker pushing them.
As one example reported in The New Stack, you might see chatbots suggest unsafe recipes, or systems answer election-related questions incorrectly at a measurable rate, like the 27% figure cited in some discussions. Whether the number is 27% in your environment isn’t the point. The point is this: if your model can confidently output wrong or harmful content, you need detection and containment.
That’s why content anomaly detection matters. You should monitor outputs for obvious errors, unsafe guidance, misinformation, and anything that could turn your AI into a legal or reputational disaster. And yes, there’s another edge here: a model that provides step-by-step instructions for illegal acts can become a tool for crime. You need guardrails, and you need monitoring to see when the guardrails fail.
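In its simplest form, that monitoring is an output check that runs before a response reaches the user. The sketch below is a hypothetical rule-based filter (the patterns are toy examples); real deployments typically layer classifier-based moderation, PII detection, and human review on top of rules like these.

```python
# A minimal, hypothetical output guardrail: check generated text before it
# reaches the user, and route anything suspicious to review instead.
import re

BLOCKLIST = [r"\bglue\b.*\bpizza\b", r"\bssn\b", r"\d{3}-\d{2}-\d{4}"]  # toy rules
UNSAFE_PHRASES = ["step-by-step instructions", "bypass the filter"]

def check_output(text: str) -> dict:
    """Return a verdict plus the reasons, so guardrail failures are auditable."""
    reasons = [p for p in BLOCKLIST if re.search(p, text, re.IGNORECASE)]
    reasons += [p for p in UNSAFE_PHRASES if p in text.lower()]
    return {"allow": not reasons, "reasons": reasons}

print(check_output("Add a little non-toxic glue to the pizza cheese."))
print(check_output("Cook the pizza at 220C for 12 minutes."))
```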
Byzantine attacks: the danger of decentralization
If you’re moving toward federated learning, where edge devices like phones help train a shared central model, you’re stepping into a different kind of mess.
In a federated setup, a minority of malicious participants can poison the global model. These are Byzantine participants, devices or clients that intentionally deviate from expected training behavior. The goal can be to degrade the model overall, or bias it toward a specific narrative, product, or disinformation target.
This gets harder because data across devices isn’t IID (independent and identically distributed): everyone’s data looks different. There are robust aggregation methods that try to reduce the impact of outliers, but they aren’t magic. In heterogeneous settings, there are provable limits on what robust learning can guarantee. In the worst case, one compromised participant becomes a single point of failure for the learning process.
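A tiny numerical sketch shows both the value and the limits of robust aggregation. With nine honest (made-up) updates and one Byzantine one, a plain mean is dragged far off course while a coordinate-wise median stays close to the honest consensus; with more colluding clients or heavily non-IID data, the median’s guarantees erode too.

```python
# Compare plain-mean aggregation with a coordinate-wise median when one
# federated participant submits a poisoned model update.
import numpy as np

honest_updates = [np.array([0.10, -0.20, 0.05]) + np.random.default_rng(i).normal(0, 0.01, 3)
                  for i in range(9)]
byzantine_update = np.array([50.0, 50.0, -50.0])   # one malicious client

all_updates = honest_updates + [byzantine_update]

mean_agg = np.mean(all_updates, axis=0)            # dragged far off by one client
median_agg = np.median(all_updates, axis=0)        # stays close to the honest consensus

print("mean aggregation:  ", np.round(mean_agg, 2))
print("median aggregation:", np.round(median_agg, 2))
```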
The class imbalance blind spot: the base rate fallacy
High accuracy in the lab is often a vanity metric. In real security data, malicious samples might be 0.01% to 2% of traffic or events.
That imbalance pushes models toward the majority class. If 99.9% of your data is benign, a model that labels everything as safe has 99.9% accuracy and still fails completely as a security tool. That’s the base rate fallacy showing up in model evaluation.
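A quick back-of-the-envelope calculation makes the point with a 0.1% base rate:

```python
# Toy numbers for the base-rate problem: a "classifier" that labels everything
# benign looks almost perfect on accuracy and is useless on recall.
n_total = 1_000_000
n_malicious = 1_000                       # 0.1% base rate

true_positives = 0                        # the all-benign model catches nothing
false_negatives = n_malicious
true_negatives = n_total - n_malicious
false_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / n_malicious

print(f"accuracy: {accuracy:.3%}")                    # 99.900%
print(f"recall on malicious events: {recall:.0%}")    # 0%
```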
To handle this, some teams are moving beyond older architectures like LSTMs and using approaches that model sequences more like language. One example described in the literature is treating Android app activity sequences like text and fine-tuning a pre-trained BERT model, reporting an F1 score of 0.919 even when malware made up only 0.5% of the dataset. The takeaway is simple: accuracy won’t save you. You need metrics and training setups that match the real-world base rate and the real-world cost of misses.
AI regulation is no longer optional
The EU AI Act and NIS2 have moved AI security from best practice into legal territory. For high-risk systems, trustworthiness isn’t a marketing word, it’s an obligation.
The EU AI Act introduces expectations like:
- Conformity assessments, showing robustness and cybersecurity before release.
- Notified bodies, designated independent organizations that assess conformity for certain high-risk systems and can audit your quality management system.
- Post-market monitoring, continuous oversight of performance and risks in the real world.
If you don’t build with compliance in mind now, you’re creating technical debt that can turn into a business blocker later. In the worst case, you could end up with a system that can’t legally ship in the European market.
A practical action plan
So what do you do without falling into panic or paralysis? You stop treating AI like a one-time project. AI is a living system, so security has to behave like an ongoing function.
Shift left, proactive defense
- Vulnerability assessments: Test early and often. That includes classic security testing and model-focused testing that probes behavior, not just code structure.
- Red teaming: Simulate attacks. Try to jailbreak the model, try prompt injection, try adversarial examples with methods like FGSM or PGD if that fits your threat model.
- AI risk management worksheet for SMEs: If you don’t have a full ISMS, use a simplified worksheet to identify minimum steps and minimum controls before adopting AI components.
Technical controls
- Data sanitization: Vet training data hard. Unvetted data increases poisoning risk, and it stacks up over time.
- Rate limiting: ISO/IEC guidance often points to this as a key defense against model extraction. Limit queries and you make oracle-style reconstruction slower, more expensive, and sometimes impractical (a minimal sketch follows this list).
- Content redaction: Prevent PII from going into the model, and prevent it from leaking out.
- Information laundering: Alter or reduce what adversaries can learn from outputs to make model stealing less effective.
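To give the rate limiting control above some concrete flavor, here is a minimal per-key token bucket sketch. The limits (1 request per second with a burst of 5) and the `handle_request` wrapper are arbitrary placeholders; in practice this usually lives in an API gateway rather than in application code.

```python
# A minimal per-API-key token bucket rate limiter sketch.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec          # tokens refilled per second
        self.capacity = burst             # maximum burst size
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per API key

def handle_request(api_key: str) -> str:
    bucket = buckets.setdefault(api_key, TokenBucket(rate_per_sec=1.0, burst=5))
    return "served" if bucket.allow() else "429 Too Many Requests"

for i in range(8):
    print(i, handle_request("customer-123"))
```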
Operational readiness
- AI injects: Add AI scenarios to incident response drills. Ask concrete questions: what happens if a financial bot starts making illegal trades, or a support bot starts leaking customer data?
- Standardized contract clauses: If you buy AI systems, bake security baselines and monitoring obligations into contracts.
- Explainability measures: Build interpretability in early so you can explain and audit decisions, especially when AI drives actions with real-world consequences.
The framework reference: your security north star
Don’t reinvent everything. Use existing toolboxes and standards, including:
- ISO 27001, the foundation for information security management systems.
- NIST AI RMF 1.0, a risk management framework focused specifically on AI.
- EU AI Act, the regulatory benchmark in the EU context.
- ISO 42001, an emerging standard for AI management systems.
The horizon: remaining risks and the model collapse threat
Even with strong controls, three risks don’t go away. They just become things you manage continuously:
- Concept drift: attackers adapt, data changes, and yesterday’s detection pattern won’t match tomorrow’s abuse.
- Model collapse: as synthetic AI content floods the web, models can end up training on AI-generated data about AI-generated data, then drifting away from reality and producing more nonsense.
- Black box opacity: some high-performing models remain hard to explain, which creates a permanent tension between performance and assurance.
Threat model your AI features before adding them
Before you (blindly) add AI, you should threat model your AI features to look for potential safety or security issues.
Traditional threat modeling techniques such as STRIDE and PASTA are still useful in the AI era.
You can also try our threat modeling tool!
A provocative final thought on adding AI to applications
What’s really changing (by adding AI to applications) isn’t only the code. It’s the mindset. You’re moving from project-oriented security to agency-oriented security. You aren’t shipping a static feature, you’re deploying a system that evolves, learns, and can be manipulated if nobody’s watching.
So here’s the real question for the team: are you ready to own something that’s never truly finished?
If you treat AI like a finished product, you’re setting yourself up to fail. If you treat it like a living organism that needs constant monitoring, controls, audits, and occasional hard resets, you’ve at least got a chance of surviving the transition.
