Ongoing customer monitoring: AI vs. rules-based. What regulators actually want to see

By Stuart Watkins, CEO, Zenoo

In 2025, FinCEN fined a US regional bank $85 million. Not for failing to monitor its customers. For monitoring them badly. The bank had a transaction monitoring system. It had rules. It generated alerts. But the rules had not been recalibrated in four years. The alert-to-SAR conversion rate was 0.3%. And when examiners asked the compliance team to demonstrate how their monitoring adapted to emerging typologies, no one could answer. The system was running. It just was not working.

This is the gap that regulators are now targeting. AMLA, the EU's new Anti-Money Laundering Authority, became operational in 2025. Its mandate is not to introduce new obligations from scratch. It is to ensure that existing obligations, particularly ongoing customer monitoring, are being met with genuine effectiveness, not just procedural compliance. FATF's updated guidance on risk-based supervision makes the same point: having a monitoring programme is necessary but not sufficient. The programme must demonstrably detect, assess, and respond to changes in customer risk.

We see this every week. Firms that passed regulatory inspections three years ago are now receiving findings on the same monitoring programmes. The rules have not changed. The expectations have.

What AMLA and FATF actually expect from ongoing monitoring

AMLA's supervisory mandate covers direct supervision of high-risk obliged entities and coordination of national supervisors across the EU. For ongoing customer monitoring specifically, three expectations are becoming non-negotiable.

First, monitoring must be risk-proportionate. FATF Recommendation 10 and its interpretive note require that the nature and frequency of monitoring reflect the customer's risk profile. A high-risk customer with complex cross-border transactions should not be monitored with the same rule set and review cycle as a low-risk domestic retail customer. This sounds obvious. In practice, most institutions apply the same rules across their entire customer base, with risk differentiation limited to the frequency of periodic reviews.

Second, monitoring must be responsive to change. The EU's Sixth Anti-Money Laundering Directive and AMLA's technical standards require that monitoring systems react to material changes: new sanctions designations, changes in corporate ownership, shifts in transaction behaviour, adverse media. Batch processing that runs overnight is no longer considered adequate for sanctions-related triggers. The expectation is near real-time for high-risk events.

Third, monitoring must be demonstrably effective. This is where most enforcement actions originate. Regulators are no longer satisfied with evidence that a monitoring system exists. They want evidence that it works: what is the alert-to-SAR conversion rate, how are thresholds calibrated, when were rules last updated, what is the false positive rate, and how does the institution measure whether its monitoring is catching what it should catch.

Static rules catch what you already know about. That is the problem.

Rules-based monitoring works on a simple principle: define a condition, generate an alert when the condition is met. If a customer makes a cash deposit over a defined threshold, flag it. If a customer transacts with a counterparty in a high-risk jurisdiction, flag it. If transaction volume exceeds a multiple of the customer's historical average, flag it.
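
To make the mechanics concrete, here is a minimal sketch of rules like these in Python. The thresholds, field names, and jurisdiction list are illustrative assumptions, not a reference rule set.

```python
from dataclasses import dataclass

# Illustrative only: real thresholds are calibrated to the
# institution's risk profile and documented (see threshold
# governance below).
HIGH_RISK_JURISDICTIONS = {"XX", "YY"}  # placeholder country codes

@dataclass
class Txn:
    customer_id: str
    amount: float
    is_cash: bool
    counterparty_country: str

def evaluate_rules(txn: Txn, historical_avg: float) -> list[str]:
    """Return the names of any rules this transaction trips."""
    alerts = []
    if txn.is_cash and txn.amount > 10_000:              # threshold rule
        alerts.append("CASH_OVER_THRESHOLD")
    if txn.counterparty_country in HIGH_RISK_JURISDICTIONS:
        alerts.append("HIGH_RISK_JURISDICTION")
    if historical_avg > 0 and txn.amount > 3 * historical_avg:
        alerts.append("VOLUME_SPIKE")                    # deviation rule
    return alerts
```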

This approach has three strengths. It is transparent: you can explain exactly why an alert was generated. It is auditable: regulators can review the rule set and understand the logic. And it is predictable: the same inputs always produce the same outputs.

It also has three fundamental weaknesses that regulators are increasingly unwilling to overlook.

The first is rigidity. Rules detect what they are designed to detect. If a money laundering typology involves structuring deposits just below reporting thresholds, a threshold-based rule will catch it, but only if the threshold is set correctly and the structuring pattern is simple enough. Sophisticated actors adapt. They study the rules, and they design their behaviour to stay just outside the detection parameters. Rules-based systems cannot adapt to patterns they were not designed to recognise.

The second is alert volume. The industry average false positive rate for transaction monitoring alerts sits at approximately 95%. For every 100 alerts a rules-based system generates, roughly 95 require analyst time and turn out to be benign. At $2,600 per manual case review, that is roughly $247,000 of review spend per 100 alerts producing nothing. More importantly, genuine risks get buried in noise. When analysts are processing thousands of false positives per month, their attention to genuine anomalies degrades. This is not a technology problem. It is a human factors problem created by technology that is not precise enough.

The third weakness is static calibration. Rules are set at a point in time, based on the typologies and risk environment that existed when they were written. The risk environment changes. New typologies emerge. Customer behaviour shifts. Jurisdictional risk profiles evolve. A rules-based system that was well calibrated in 2023 may be materially miscalibrated by 2026, and many institutions lack the resources or processes to recalibrate at the pace the environment demands.

"We had 47 transaction monitoring rules when I joined. I asked the team when they were last reviewed. No one knew. We traced some back to 2019. The rules were generating 2,200 alerts per month. Our SAR filing rate from those alerts was 1.8%. We were spending the equivalent of three full-time analysts processing noise."
Head of Financial Crime, UK payment institution

AI-driven anomaly detection: what it adds, and what regulators demand from it

Machine learning and AI-driven monitoring take a fundamentally different approach. Instead of defining rules in advance, these systems learn patterns from historical data and flag deviations from those patterns. A customer whose transaction behaviour shifts materially from their established profile will generate an alert, even if the specific pattern does not match any predefined rule.

This approach addresses the three weaknesses of rules-based systems. It can detect novel patterns that rules were not designed to catch. It can be more precise, reducing false positive rates by learning what "normal" looks like for each customer individually rather than applying generic thresholds. And it adapts as behaviour changes, because the models retrain on new data.
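
As an illustration of the approach, here is a minimal sketch using scikit-learn's IsolationForest on synthetic per-customer behaviour. The features, parameters, and data are assumptions for demonstration, not a production model.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative features per period: monthly volume, txn count,
# share of cross-border transfers, distinct counterparties.
rng = np.random.default_rng(42)
baseline = rng.normal(loc=[50_000, 120, 0.05, 12],
                      scale=[5_000, 10, 0.01, 2],
                      size=(500, 4))

model = IsolationForest(contamination=0.01, random_state=42)
model.fit(baseline)

# A period that deviates sharply from the learned profile.
current = np.array([[180_000, 130, 0.45, 3]])
print(model.predict(current))            # -1 means flagged as anomalous
print(model.decision_function(current))  # lower score = more anomalous
```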

The numbers bear this out. Institutions that have deployed ML-based monitoring alongside their rules engines consistently report false positive reductions of 40 to 60% on the alerts where both systems overlap. Some report higher. The EY 2024 global survey found that 43% of financial institutions now use machine learning in their detection mechanisms, up from roughly 25% two years earlier. The direction of travel is clear.

But regulators have legitimate concerns about AI in monitoring, and any institution deploying it needs to address three specific demands.

Explainability. When an AI model flags a transaction or a customer, the institution must be able to explain why. "The model scored this 0.87" is not an explanation. "The model identified a 340% increase in cross-border transfers to a jurisdiction where the customer has no declared business activity, combined with a change in counterparty concentration from 12 regular counterparties to 3 new ones over a 60-day period" is an explanation. Regulators expect human-readable rationale for every alert, and the institution must demonstrate that analysts understand and can interrogate the model's reasoning.
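
Production systems typically derive that rationale from feature attribution (SHAP values or similar); the sketch below shows the simpler underlying idea of translating driver features into analyst-readable text. The feature names and cut-offs are hypothetical.

```python
# Hypothetical driver features for one alert.
def explain_alert(features: dict) -> str:
    reasons = []
    if features.get("cross_border_growth_pct", 0) > 200:
        reasons.append(
            f"cross-border transfers up "
            f"{features['cross_border_growth_pct']:.0f}% over 60 days")
    if features.get("new_counterparties", 0) >= 3:
        reasons.append(
            f"{features['new_counterparties']} new counterparties "
            f"replacing established ones")
    if not reasons:
        return "No dominant drivers; escalate for full feature review."
    return "Alert driven by: " + "; ".join(reasons)

print(explain_alert({"cross_border_growth_pct": 340,
                     "new_counterparties": 3}))
```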

Validation and governance. ML models require ongoing validation. This means regular back-testing against known outcomes (did the model flag cases that subsequently became SARs?), sensitivity analysis (how does alert volume change when model parameters shift?), and bias testing (is the model disproportionately flagging customers from certain demographics or jurisdictions for reasons unrelated to risk?). AMLA's technical standards are expected to address model governance explicitly. Institutions should not wait for those standards to build their validation frameworks.
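
The back-testing component can be as simple as comparing the set of cases the model flagged against the set that subsequently became SARs. A minimal sketch, with illustrative data:

```python
# Did the model flag the cases that later became SARs?
def backtest(flagged_ids: set[str], sar_ids: set[str]) -> dict:
    caught = flagged_ids & sar_ids
    return {
        "recall": len(caught) / len(sar_ids) if sar_ids else 0.0,
        "precision": len(caught) / len(flagged_ids) if flagged_ids else 0.0,
        "missed": sorted(sar_ids - flagged_ids),  # investigate these first
    }

print(backtest(flagged_ids={"c1", "c2", "c7"}, sar_ids={"c2", "c7", "c9"}))
# recall and precision both ~0.67; 'c9' was missed
```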

Audit trail integrity. Every model decision must be logged with the data that informed it, the model version that produced it, and the outcome. If a model is retrained and its behaviour changes, the institution must be able to explain what changed and why. This is more demanding than the audit trail for a rules-based system, because the logic is not static. The audit trail must capture not just what the model decided, but how the model was operating at the time of the decision.
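
One way to meet this is an append-only decision record that fingerprints the inputs and pins the model version. A minimal sketch, with illustrative field names:

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    alert_id: str
    model_version: str   # e.g. a registry tag or git SHA
    input_hash: str      # fingerprint of the exact feature vector used
    score: float
    decision: str        # "alert" or "no_alert"
    timestamp: str

def record_decision(alert_id, model_version, features, score, decision):
    payload = json.dumps(features, sort_keys=True).encode()
    return DecisionRecord(
        alert_id=alert_id,
        model_version=model_version,
        input_hash=hashlib.sha256(payload).hexdigest(),
        score=score,
        decision=decision,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
```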

This is exactly the monitoring gap that Zenoo was built to close. If your current system is generating thousands of alerts with a sub-2% conversion rate, book a demo and we will show you what risk-proportionate monitoring looks like with your own data.

OCM best practices that actually survive regulatory scrutiny

The institutions we work with that perform best under regulatory examination share four common practices. None of them rely exclusively on rules or exclusively on AI. They use both, deliberately.

Risk-scoring models that update continuously. Customer risk scores should not be static assignments that change only at periodic review. Every monitoring event, whether it originates from transaction monitoring, sanctions screening, adverse media, or corporate registry changes, should feed into a risk recalculation. When the recalculated score crosses a materiality threshold, the customer is queued for review. This means risk scores reflect the current reality, not the reality at the last scheduled review date. Institutions that do this well see a 20 to 30% increase in the proportion of reviews that result in a genuine risk decision, because reviews are triggered by actual changes rather than arbitrary schedules.
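
A minimal sketch of the pattern: each monitoring event adjusts the score, and crossing the materiality threshold queues a review immediately rather than at the next cycle. The event weights and threshold are illustrative, not calibrated values.

```python
EVENT_WEIGHTS = {                 # illustrative weights only
    "adverse_media": 15,
    "ownership_change": 10,
    "sanctions_proximity": 40,
    "behaviour_shift": 20,
}
REVIEW_THRESHOLD = 70             # illustrative materiality threshold

def apply_event(customer: dict, event_type: str) -> dict:
    customer["risk_score"] = min(
        100, customer["risk_score"] + EVENT_WEIGHTS.get(event_type, 0))
    if customer["risk_score"] >= REVIEW_THRESHOLD and not customer["queued"]:
        customer["queued"] = True  # review now, not at the next cycle
    return customer

c = {"id": "CUST-1", "risk_score": 35, "queued": False}
apply_event(c, "adverse_media")    # score 50, below threshold
apply_event(c, "behaviour_shift")  # score 70, queued for review
print(c)
```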

Transaction pattern analysis at the customer level. Generic rules apply the same thresholds to every customer. Effective monitoring establishes a behavioural baseline for each customer and flags deviations from that baseline. A cash deposit of £50,000 from a commercial property firm is routine. The same deposit from a sole-trader consultancy is anomalous. Customer-level baselines eliminate the largest source of false positives: alerts generated because a transaction exceeded a generic threshold that was never appropriate for that customer's profile.
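
The underlying check can be a simple per-customer deviation test. A minimal sketch, with illustrative histories and a 3-sigma cut-off chosen purely for demonstration:

```python
import statistics

def is_anomalous(amount: float, history: list[float],
                 sigmas: float = 3.0) -> bool:
    if len(history) < 10:   # too little history: fall back to generic rules
        return False
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    return sd > 0 and (amount - mean) / sd > sigmas

property_firm = [48_000, 52_000, 47_500, 51_000, 49_500,
                 50_500, 46_000, 53_000, 49_000, 50_000]
sole_trader = [1_200, 900, 1_500, 1_100, 1_300,
               950, 1_400, 1_000, 1_250, 1_150]

print(is_anomalous(50_000, property_firm))  # False: routine for this firm
print(is_anomalous(50_000, sole_trader))    # True: wildly out of profile
```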

Behavioural clustering for peer comparison. Even with individual baselines, some anomalies are only visible in context. Behavioural clustering groups customers with similar profiles and transaction patterns, then identifies customers whose behaviour is diverging from their peer group. A customer who was onboarded as a small import/export business but whose transaction patterns now resemble a money service business will stand out in a peer comparison, even if their absolute transaction volumes have not crossed any rule-based threshold.
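
A minimal sketch of the idea using k-means: cluster customers on behavioural features, then flag those furthest from any peer group's centre. The features, cluster count, and percentile cut-off are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Illustrative features: avg txn size, monthly txn count, % cross-border.
importers = rng.normal([8_000, 40, 0.30], [800, 5, 0.05], size=(200, 3))
retailers = rng.normal([150, 900, 0.02], [20, 80, 0.01], size=(200, 3))
X = np.vstack([importers, retailers])

km = KMeans(n_clusters=2, n_init=10, random_state=7).fit(X)
dist_to_peers = km.transform(X).min(axis=1)  # distance to nearest centroid

# Flag the customers most distant from every peer group.
cutoff = np.percentile(dist_to_peers, 99)
outliers = np.where(dist_to_peers > cutoff)[0]
print(len(outliers), "customers diverging from their peer group")
```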

Documented threshold governance. Every monitoring threshold, whether rule-based or model-derived, should have a documented rationale, an owner, a review date, and a record of when it was last calibrated. This is the evidence regulators ask for first. When an examiner asks why a particular threshold is set at a particular level, the answer cannot be "it has always been that way." The answer must reference a risk assessment, a typology analysis, or a calibration exercise.
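
The register itself does not need to be complicated. A minimal sketch of one entry, with hypothetical values:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ThresholdRecord:
    rule_id: str
    value: float
    rationale: str    # reference to a risk assessment or typology analysis
    owner: str
    last_calibrated: date
    next_review: date

cash_rule = ThresholdRecord(
    rule_id="CASH_OVER_THRESHOLD",
    value=10_000,
    rationale="Enterprise risk assessment, s4.2 (hypothetical reference)",
    owner="Head of Financial Crime",
    last_calibrated=date(2025, 6, 1),
    next_review=date(2026, 6, 1),
)
```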

Enforcement actions tell you exactly where OCM programmes fail

Regulatory enforcement actions are, in effect, published case studies of what not to do. The patterns in recent OCM-related actions are remarkably consistent.

FinCEN's 2024 and 2025 enforcement actions against mid-tier banks repeatedly cite the same failures: transaction monitoring rules that were not calibrated to the institution's specific risk profile, alert backlogs that grew without remediation, and a lack of documentation showing how monitoring thresholds were set and reviewed. In one case, an institution had a 14-month backlog of unreviewed alerts. In another, the monitoring system had not been updated to reflect new correspondent banking relationships that materially changed the institution's risk exposure.

OCC consent orders in the same period tell a similar story. A common finding is that institutions could not demonstrate the effectiveness of their monitoring. They could show that monitoring was running. They could produce alert volumes. But they could not show what the monitoring was catching, what it was missing, or how they knew the difference. The absence of effectiveness testing, where the institution proactively tests whether its monitoring would detect known typologies, is cited in the majority of recent OCC orders related to BSA/AML programmes.

In the EU, the pattern is consistent but with additional emphasis on risk-based differentiation. Supervisory findings from national competent authorities, now coordinated through AMLA, increasingly focus on whether monitoring intensity matches customer risk. Institutions that apply the same monitoring parameters across all risk tiers are receiving findings, even when their overall monitoring coverage is high. Coverage without risk proportionality is not compliance.

"After our last examination, the single biggest remediation item was not technology. It was documentation. We could not demonstrate why our thresholds were set where they were, when they were last reviewed, or how we measured whether they were effective. We had the technology. We did not have the governance around it."
MLRO, European digital bank

Building a hybrid programme: rules plus intelligence for defensibility

The most defensible OCM programmes are hybrid. They use rules for known, well-defined scenarios where transparency is paramount, and AI for pattern detection, anomaly identification, and dynamic risk scoring where adaptability matters. The key is knowing which approach applies where, and documenting the rationale for both.

Rules should cover regulatory bright lines: sanctions screening matches, threshold-based reporting obligations, PEP identification triggers. These are scenarios where the detection logic must be fully transparent, the regulatory expectation is binary (flag or do not flag), and the cost of a miss is existential. Rules are appropriate here because the patterns are known, the logic is auditable, and there is no ambiguity about what constitutes a match.

AI should cover everything that requires pattern recognition across complex, multi-dimensional data: behavioural anomaly detection, peer group deviation, transaction network analysis, and dynamic risk recalculation. These are scenarios where the patterns are not known in advance, where individual rules cannot capture the complexity, and where the volume of data exceeds what rule-based systems can meaningfully process.

The hybrid model works because each approach compensates for the other's weaknesses. Rules ensure that known, high-priority scenarios are always detected with full transparency. AI ensures that novel, complex, or evolving patterns are detected despite not fitting any predefined rule. Together, they create a monitoring programme that is both defensible (regulators can see the logic for rules-based detections) and effective (AI catches what rules miss).

At Zenoo, we built our monitoring capabilities around this hybrid principle. Rules and AI run in parallel, with a unified case management layer that presents analysts with the combined output, the detection source (rule, model, or both), and the supporting evidence. The analyst sees what triggered the alert, why, and what data informed the decision, regardless of whether the trigger was a rule or a model.
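
Our internal schema is richer than this, but the principle fits in a few lines: merge rule hits and model output into one alert that records its detection source. The sketch below is illustrative only, not Zenoo's actual implementation.

```python
def unify(txn_id: str, rule_hits: list[str],
          model_score: float | None, model_rationale: str | None,
          model_cutoff: float = 0.8) -> dict | None:
    model_hit = model_score is not None and model_score >= model_cutoff
    if not rule_hits and not model_hit:
        return None          # nothing fired; no case created
    source = ("both" if rule_hits and model_hit
              else "rule" if rule_hits else "model")
    return {
        "txn_id": txn_id,
        "source": source,          # what the analyst sees first
        "rules": rule_hits,
        "model_score": model_score,
        "rationale": model_rationale,
    }

print(unify("T-1", ["VOLUME_SPIKE"], 0.91,
            "340% rise in cross-border volume over 60 days"))
```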

Benchmarking your OCM: the numbers that matter

If you cannot measure your monitoring programme's performance, you cannot defend it to a regulator. Here are the metrics that matter, with benchmarks from what we see across the industry.

Alert-to-SAR conversion rate. The industry average for rules-based systems sits between 1% and 3%. Institutions with well-calibrated hybrid programmes typically achieve 8 to 15%. If your conversion rate is below 2%, your rules are generating too much noise. If it is above 20%, your rules may be too narrow and you may be missing activity that should be generating alerts.

False positive rate. The widely cited industry figure is 95%. Institutions deploying ML-based anomaly detection alongside rules consistently report false positive rates falling to 50 to 70% on overlapping alert types. The target is not zero false positives. The target is a false positive rate that your team can process without building a backlog and without analyst fatigue degrading the quality of genuine reviews.

Alert backlog age. Any alert older than 30 days without a disposition is a regulatory finding waiting to happen. Best practice is disposition within 5 business days for standard alerts and 24 hours for high-priority alerts. If your backlog is growing, you either have too many alerts (calibration problem) or too few analysts (resourcing problem). Either way, the regulator will notice.

Monitoring coverage. What percentage of your customer base and transaction volume is subject to active monitoring? The answer should be 100% for sanctions screening and threshold-based rules. For behavioural monitoring, coverage may vary by risk tier, but you should be able to articulate what is covered, what is not, and why. Gaps in coverage must be documented and risk-assessed.

Threshold review frequency. At minimum, thresholds should be reviewed annually. Best practice is semi-annual review for high-risk scenarios and annual for standard scenarios, with ad hoc reviews triggered by material changes in the risk environment (new typologies, new products, new jurisdictions). The review must be documented, including what was assessed, what was changed, and the rationale.

Remediation completion rate. When monitoring identifies a risk, how quickly is the risk addressed? Measure the time from alert generation to case closure, and track the percentage of cases that are remediated within your defined SLA. Industry benchmarks vary, but institutions with mature programmes close 80 to 90% of standard cases within 15 business days and 95% of high-priority cases within 5 business days.
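
All of these metrics fall out of a well-structured alert log. A minimal sketch computing the first three, with illustrative data:

```python
from datetime import date

alerts = [  # illustrative alert log entries
    {"opened": date(2026, 1, 5),  "closed": date(2026, 1, 8),  "sar": False},
    {"opened": date(2026, 1, 6),  "closed": date(2026, 1, 20), "sar": True},
    {"opened": date(2026, 1, 10), "closed": None,              "sar": False},
]
today = date(2026, 2, 16)

sars = sum(a["sar"] for a in alerts)
print(f"alert-to-SAR conversion: {sars / len(alerts):.1%}")

closed = [a for a in alerts if a["closed"] is not None]
fp = sum(1 for a in closed if not a["sar"])
print(f"false positive rate (dispositioned): {fp / len(closed):.1%}")

open_ages = [(today - a["opened"]).days for a in alerts if a["closed"] is None]
print(f"oldest open alert: {max(open_ages, default=0)} days")
```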

The compliance team's role is changing, not disappearing

None of this means that compliance teams are being replaced by algorithms. The shift is from manual alert processing to model oversight, threshold governance, and complex case investigation. Analysts who currently spend 60 to 70% of their time on routine false positive disposition will spend that time on cases that genuinely require human judgement, on validating model performance, and on the governance framework that makes the whole programme defensible.

This is a better use of skilled professionals. It is also a more sustainable operating model. The industry survey data on compliance professional stress (68% report high stress) and attrition (42% are considering leaving the profession) reflects a workforce that is burning out on repetitive work that technology should be handling. Redirecting human effort towards the work that actually requires human expertise is not just an efficiency gain. It is a retention strategy.

AMLA is operational. FATF is tightening its effectiveness assessments. Regulators in every major jurisdiction are moving from checking that monitoring exists to evaluating whether monitoring works. Institutions that are still running the same rules-based monitoring programme they built five years ago are not just operationally inefficient. They are regulatorily exposed.

The question is no longer whether to move beyond pure rules-based monitoring. It is how quickly you can build a hybrid programme that is both effective and defensible. If your alert-to-SAR conversion rate is below 3%, your thresholds have not been reviewed in over a year, or your team is drowning in false positives, the regulatory risk is real and growing.

We built Zenoo to solve exactly this. If you want to see what risk-proportionate, hybrid monitoring looks like in practice, book a demo. 30 minutes. Your data. No slides.

Stuart Watkins is CEO of Zenoo, the compliance orchestration platform that connects screening, monitoring, and case management through a single intelligence layer.

About the author

Stuart Watkins

CEO & Founder

Stuart founded Zenoo in 2017 after spending 15 years in financial services technology. He leads the company's mission to make compliance faster, smarter, and less painful for regulated businesses worldwide.
