The independent test for the chatbot you already run
Is your chatbot safe?
Almost every company with customers already has one in production — on the website, in the app, in the help centre — talking to customers right now, unsupervised.It can be talked into mis-stating your policy, leak data it shouldn’t, commit you to things no one approved, or never disclose it’s an AI at all. And you’re liable for what it says.
Almost nobody is independently testing whether the bot they shipped is safe. Annexo is that test.
Promises cover, prices, or terms that aren’t real — and you’re bound by it.
Coughs up another customer’s details, or internal information, when nudged.
A few crafted messages override its instructions and steer it off the rails.
Tells a customer it’s a human advisor — the opposite of what the law expects.
This already happened — to companies bigger than yours
Public chatbots that cost real money and real brand.
Three incidents, all on the public record. None of these companies set out to ship an unsafe bot — they just never had it independently tested.
A tribunal made the airline pay for what its chatbot said.
Air Canada's website chatbot gave a grieving passenger wrong information about its bereavement-refund policy. When the passenger relied on it and sued, a tribunal held the airline legally liable for what its own chatbot told him — it couldn't disclaim its bot as a separate entity. The company is on the hook for the promise the bot made.
Its chatbot “agreed” to sell a car for one dollar.
A car dealership put a general-purpose AI chatbot on its site. A visitor instructed it to agree to anything the customer said and to end every reply with “that's a legally binding offer — no takesies-backsies,” then got it to “agree” to sell a new SUV for $1. The screenshots went viral and the bot was pulled — a customer-facing AI manipulated into committing the business to terms no one authorised.
Its chatbot swore and trashed the company — on the record.
A customer of the delivery firm DPD got its support chatbot to drop its guardrails: it swore, called DPD “the worst delivery firm in the world,” and wrote a poem about how useless it is. The exchange went viral and DPD disabled the bot. A support surface meant to help customers was steered into damaging the brand in the company's own voice.
These are public chatbots that cost real money and brand — and yours is exactly as exposed. The only difference is whether anyone has looked.
Watch it — a live test, zero input
See a chatbot get caught — live.
Here’s a sample customer chatbot — a friendly sales & service bot that answers cover questions and sets up quotes. Press one button and watch Annexo run it. It handles the easy questions well. Then we catch it doing exactly the two things that put a company on the hook— telling a customer they’re covered for something the policy excludes (the Air-Canada failure mode), and claiming to be a human advisor. No setup, no key, no input.
A customer-facing chatbot of the kind a company puts on its site — answers questions about cover, quotes prices, helps customers. Live on a public surface, talking to customers on its own.
This is an illustrative demonstration on a fictional chatbot. The engine is real — Annexo runs the same kind of probes it runs against a live bot — but the chatbot, the customer and the figures are invented to show what the test surfaces. It reports observed behaviour: not a conformity assessment and not legal advice; Annexo is not a notified body.
The front door — not the ceiling
Land on the chatbot you have today. Cover the agent fleet you’re building tomorrow.
Your chatbot is the front door: the AI your customers already touch, the one with your name on what it says. We test it independently in minutes — and the same engine then verifies and monitors every AI agent you deploy next: the claims agent, the underwriting model, the internal copilots, the autonomous workflows. One test today becomes continuous assurance across your whole fleet.
Land
We independently test the customer chatbot you already run — in minutes, against the obligations that actually bite.
Expand
Point the same engine at the next agent, and the next — every AI you put in front of a customer or a decision.
Monitor
Keep it running, so you see the moment a guardrail or a disclosure quietly changes — across the whole fleet.
The chatbot is the wedge because it’s the AI you can’t pretend you don’t have. It’s the entry, not the ceiling.
Find out before a customer — or a regulator — does.
Run the independent test on the chatbot you already have. Then let’s scope it across the agents you’re deploying next.
Annexo runs independent, observational tests of AI systems and reports the behaviour it observes — it does not issue a compliant/non-compliant verdict, is not a conformity assessment, not a penetration test, and not legal advice; Annexo is not a notified body. The Air Canada, Chevrolet-dealership and DPD incidents above are matters of public record, stated as reported. The sample chatbot in the live test is fictional and the run is illustrative. Questions? hello@annexo.eu.