30 April 2026 · 11 min read
PII Tokenisation: How Zero-Knowledge AI Actually Works in Production
Australian firms cannot send client PII to overseas AI models in plaintext, and yet the AI needs context to be useful. The privacy router pattern resolves the tension. Here's how it works in production, what it costs, and what it cannot protect against.
Roman Silantev — Founder, AI Lab Australia
The contradiction at the heart of business AI
Every Australian professional services firm using AI is operating against a contradiction. To be useful, the AI needs to read the firm's actual data — the client's name in the email, the participant's NDIS number on the invoice, the patient's medication chart, the borrower's TFN on the loan application. To be safe under the Privacy Act 2026 ADM transparency provisions, the firm has to be able to explain what the AI did with that data. To be safe under common sense, the firm shouldn't be sending the data to a third-party model provider in plaintext, where it can be retained, logged, fine-tuned on, or breached.
The industry's answer to this contradiction has typically been one of three things. Some firms accept the risk and send the data anyway, banking on the model provider's data-handling promises. Some firms refuse to use AI at all, accepting the operational cost. Some firms attempt to redact PII before each call — manually, or with regex tools — accepting the loss of context. None of these is satisfactory for a firm that wants both usefulness and defensibility.
What tokenisation does
Tokenisation is the structural answer. Before any client data reaches an AI model, it passes through a privacy router that detects personal information and replaces it with reversible tokens. The model receives TKN_a8f2c1 instead of John Smith, TKN_b3d7e9 instead of 0412 345 678, TKN_c4f1a2 instead of TFN 123 456 789. The model still has the structure of the input — it knows that this is a name, a phone number, a tax file number — and it can reason about what to do with that structure. What it does not have is the actual personal data.
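To make the boundary concrete, here is a minimal TypeScript sketch. The detector is a single toy regex for Australian mobile numbers, the TKN_ format simply mirrors the examples above, and none of the names are SydClaw's actual API; the point is the shape of the pass: plaintext in, tokens out, mapping kept private.

```typescript
// Sketch of a privacy-router tokenisation pass. The detector here is a
// single toy regex; the real detector chain is described under
// "The seventeen categories" below.

type Detection = { start: number; end: number; category: string; value: string };

const MOBILE = /\b04\d{2} ?\d{3} ?\d{3}\b/g;

function detectPII(text: string): Detection[] {
  const out: Detection[] = [];
  for (const m of text.matchAll(MOBILE)) {
    out.push({ start: m.index!, end: m.index! + m[0].length, category: "phone", value: m[0] });
  }
  return out;
}

function tokenise(text: string, mapping: Map<string, string>): string {
  // Replace right-to-left so earlier offsets stay valid while splicing.
  const detections = detectPII(text).sort((a, b) => b.start - a.start);
  let result = text;
  for (const d of detections) {
    // Reuse the token for a value already seen in this matter, so the model
    // can still tell that two mentions refer to the same entity.
    let token = mapping.get(d.value);
    if (!token) {
      token = `TKN_${crypto.randomUUID().replaceAll("-", "").slice(0, 6)}`;
      mapping.set(d.value, token);
    }
    result = result.slice(0, d.start) + token + result.slice(d.end);
  }
  return result; // this string is what the model provider sees
}

// tokenise("Call the client on 0412 345 678", new Map())
//   -> "Call the client on TKN_<6 hex chars>"
```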
The reverse-mapping table that converts tokens back to plaintext is held only inside the firm's own infrastructure, encrypted with the firm's per-organisation key. When the AI's output comes back through the router, the tokens are detokenised to plaintext, but only inside the user's authenticated browser session. At no point does the model provider see the actual personal data; the data is structurally meaningless to them.
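The reverse step, sketched under the same caveats. fetchMatterMapping is a stand-in for an authenticated, matter-scoped call into the firm's own infrastructure; the canned return value just keeps the sketch runnable.

```typescript
// Client-side detokenisation sketch: tokens become plaintext only inside the
// user's authenticated browser session.

async function fetchMatterMapping(matterId: string): Promise<Map<string, string>> {
  // Placeholder: in production this would be an authenticated, matter-scoped
  // request into the firm's own infrastructure.
  return new Map([["TKN_a8f2c1", "John Smith"]]);
}

async function detokenise(modelOutput: string, matterId: string): Promise<string> {
  const mapping = await fetchMatterMapping(matterId);
  // Unresolved tokens are left in place rather than guessed at.
  return modelOutput.replace(/TKN_[0-9a-f]{6}/g, (token) => mapping.get(token) ?? token);
}

// detokenise("Dear TKN_a8f2c1, ...", "matter-123") -> "Dear John Smith, ..."
```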
In architectural terms, this is zero-knowledge AI. The platform operates on a zero-knowledge basis with respect to its model providers — they process tokens, and the only place the tokens are meaningful is inside the firm's own dedicated infrastructure.
The seventeen categories
SydClaw's privacy router detects seventeen categories of personal information: full names, given names, surnames, email addresses, phone numbers, postal addresses, dates of birth, ABNs, ACNs, TFNs, Medicare numbers, NDIS participant numbers, bank account numbers, BSB numbers, driver licence numbers, passport numbers, and IP addresses. The detection runs as a chain of detectors — high-confidence regex for structured identifiers (TFNs, NDIS numbers, ABNs all have well-defined formats), context-aware named-entity recognition for less structured ones (names, addresses), and a final pass that resolves overlapping detections to the highest-confidence interpretation.
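As an illustration of one link in that chain: the TFN check-digit weighting is public (the weighted digit sum must be divisible by 11), so a structured detector can validate candidates rather than merely pattern-match. The confidence values and the overlap-resolution pass below are illustrative, not SydClaw's implementation.

```typescript
// Sketch of the detector-chain idea: one high-confidence structured detector
// (9-digit TFNs, validated with the published check-digit weighting), plus a
// final pass that resolves overlapping detections to the best interpretation.

type Span = { start: number; end: number; category: string; confidence: number };

const TFN_WEIGHTS = [1, 4, 3, 7, 5, 8, 6, 9, 10];

function isValidTfn(digits: string): boolean {
  if (!/^\d{9}$/.test(digits)) return false;
  const sum = [...digits].reduce((acc, d, i) => acc + Number(d) * TFN_WEIGHTS[i], 0);
  return sum % 11 === 0; // weighted sum divisible by 11
}

function detectTfns(text: string): Span[] {
  const out: Span[] = [];
  // Nine digits, optionally spaced 3-3-3 as on most forms.
  for (const m of text.matchAll(/\b(\d{3}) ?(\d{3}) ?(\d{3})\b/g)) {
    if (isValidTfn(m[1] + m[2] + m[3])) {
      out.push({ start: m.index!, end: m.index! + m[0].length, category: "tfn", confidence: 0.99 });
    }
  }
  return out;
}

// Where detections overlap (say, an NER "number-like" span over a
// checksum-valid TFN), keep only the highest-confidence interpretation.
function resolveOverlaps(spans: Span[]): Span[] {
  const kept: Span[] = [];
  for (const s of [...spans].sort((a, b) => b.confidence - a.confidence)) {
    if (!kept.some((k) => s.start < k.end && k.start < s.end)) kept.push(s);
  }
  return kept;
}
```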
The categories were not chosen academically. Each one is a category of data that, if leaked, would constitute a notifiable data breach under the OAIC's threshold tests. The list is calibrated to the regulatory reality, not to a generic privacy framework.
Why reversible, not destructive
The tokens are reversible by design. An alternative architecture would simply destroy the personal information at the privacy boundary: replace it with [REDACTED] and never look back. That approach is more secure, but it is useless for any workflow where the output needs to reach a human in plaintext. Drafting an email to John Smith is operationally meaningless if the AI's output reads "Dear [REDACTED]".
The reversibility is bounded carefully. The reverse-mapping table is held only inside the firm's own Supabase project, with column-level encryption using a separate key from the bulk database key. RESTRICTIVE row-level security policies prevent UPDATE and DELETE on the mapping table, even by service-role authentication. The detokenisation happens inside the user's authenticated browser session, never on a server that an external party could compromise. And the mapping table is segregated by client matter, so a token resolved in the context of Client A cannot be resolved in the context of Client B.
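One way to picture the matter segregation, as a sketch with an invented schema rather than SydClaw's actual tables: every mapping row carries the matter it was minted under, and resolution treats a cross-matter hit as a plain miss.

```typescript
// Illustrative matter-scoped mapping: a token minted under one client matter
// is unresolvable under any other.

type MappingRow = { token: string; matterId: string; ciphertext: string };

function decryptWithOrgKey(ciphertext: string): string {
  // Placeholder for column-level decryption with the firm's per-organisation
  // key; returning the input keeps the sketch runnable.
  return ciphertext;
}

function resolveToken(rows: MappingRow[], token: string, currentMatterId: string): string | null {
  const row = rows.find((r) => r.token === token);
  // A hit under a different matter is reported as a plain miss, so the
  // caller cannot even learn that the token exists elsewhere.
  if (!row || row.matterId !== currentMatterId) return null;
  return decryptWithOrgKey(row.ciphertext);
}
```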
Where it doesn't help
Tokenisation is a layer; it is not the whole defence. There are categories of leakage it does not protect against, and being honest about those is part of what makes the architecture defensible.
It does not protect against the AI echoing tokenised references, and the structure around them, in its output. If the AI receives "the client TKN_a8f2c1 lives at TKN_e7d2b9" and writes back "the client lives at the address mentioned in TKN_e7d2b9", the structural information is still there, even if the literal value is not. We mitigate this with output-rail screening that flags any AI response containing token references and routes it for review, but the mitigation is probabilistic, not absolute.
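A simplified sketch of that output rail, assuming the trigger is the presence of raw token references in a response; the real screening and review routing are more involved.

```typescript
// Output-rail sketch: any model response still carrying token references is
// held for human review instead of being auto-released.

const TOKEN_REF = /TKN_[0-9a-f]{6}/g;

function screenOutput(response: string): { autoRelease: boolean; flaggedTokens: string[] } {
  const flaggedTokens = [...new Set(response.match(TOKEN_REF) ?? [])];
  return { autoRelease: flaggedTokens.length === 0, flaggedTokens };
}

// screenOutput("the client lives at the address mentioned in TKN_e7d2b9")
//   -> { autoRelease: false, flaggedTokens: ["TKN_e7d2b9"] }
```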
It does not protect against side-channel inference. A skilled adversary with access to the model provider's logs could, in theory, infer information from the patterns of token usage — same token appearing in many requests, statistical correlations between tokens and observed outcomes. The risk is small but not zero. The Anthropic, OpenAI, and Microsoft data-handling agreements all prohibit this kind of analysis, and we monitor model-provider behaviour for any indication it is happening, but the technical defence stops at tokenisation.
It does not protect against an authenticated user inside the firm exfiltrating the mapping table. That risk is a different category and is addressed through different controls — RBAC, MFA, audit logging, access reviews. The privacy router protects against external leakage, not against internal misuse.
What it costs
Tokenisation costs latency and accuracy. Each request adds approximately 80 to 250 milliseconds for the privacy-router pass: detection, tokenisation, lookup against the existing mapping table, and persistence of any new tokens. The cost is dominated by the named-entity recognition for unstructured fields; the structured detectors are fast.
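A hedged instrumentation sketch of that pass, with the stage bodies stubbed out; the useful part is attributing the latency per stage.

```typescript
// Wrap each router stage with a timer so the 80-250 ms can be attributed.
// Stage bodies here are stubs; in practice NER dominates "detect".

type Timings = Record<string, number>;

async function timed<T>(timings: Timings, name: string, fn: () => Promise<T>): Promise<T> {
  const t0 = performance.now();
  try {
    return await fn();
  } finally {
    timings[name] = performance.now() - t0;
  }
}

async function routerPass(text: string): Promise<{ tokenised: string; timings: Timings }> {
  const timings: Timings = {};
  await timed(timings, "detect", async () => {});                              // NER + regex chain
  const tokenised = await timed(timings, "tokenise+lookup", async () => text); // mapping-table lookups
  await timed(timings, "persist", async () => {});                             // write newly minted tokens
  return { tokenised, timings };
}
```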
The accuracy cost is harder to quantify. Replacing John Smith with TKN_a8f2c1 removes the model's ability to draw on contextual knowledge about the name (which is rarely useful) and weakens its ability to recognise that differently worded mentions of the same person refer to the same entity (which is sometimes useful). For most professional services workflows (email triage, invoice processing, BAS preparation, claim drafting) the impact is negligible, because the workflows are about structure and process, not about who the person is. For workflows that genuinely require named-entity reasoning (relationship mapping, sentiment analysis tied to specific individuals), tokenisation degrades the AI's quality, and we make that explicit when scoping those workflows with a client.
What it enables
Tokenisation is what makes Australian-residency PII handling under the Privacy Act 2026 actually work for an AI workforce platform. Without it, a firm using SydClaw would be sending client personal data to overseas model providers and trusting their data-handling practices. With it, the firm can demonstrate to a regulator that personal data structurally never leaves the firm's own infrastructure in identifiable form.
The Privacy Act 2026 ADM transparency requirement asks: explain how this automated decision was made. SydClaw's audit trail captures the prompt template, the tokens that were passed in, the model's response, and the reviewer who approved the result. The reviewer can, in their authenticated session, see the detokenised version. The regulator, if they request the audit log, can see the tokenised version. The chain of custody is unambiguous, and the explanation is reproducible.
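Sketched as a data shape (field names are illustrative, not SydClaw's schema), such a record might look like this:

```typescript
// Illustrative shape of an ADM audit record: enough to reproduce the
// explanation without holding plaintext PII in the log itself.

interface AdmAuditRecord {
  requestId: string;
  promptTemplateId: string;  // which template produced the prompt
  tokensPassed: string[];    // e.g. ["TKN_a8f2c1", "TKN_e7d2b9"]
  model: string;             // provider and model version used
  tokenisedResponse: string; // the version a regulator sees in the audit log
  reviewerId: string;        // who approved the detokenised result
  approvedAt: string;        // ISO 8601 timestamp of the approval
}
```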
For NDIS providers operating under the Quality and Safeguards Commission, tokenisation is what makes participant-scoped AI safe. A worker can draft a service delivery record from a voice memo in 30 seconds; the AI sees TKN_p4d9c2 instead of the participant's name; the SDR is detokenised back to plaintext only when the worker reviews and signs. The participant's identity does not transit any external service in plaintext.
What we'd build differently
If we were building the privacy router from scratch today, we would invest more heavily in detection accuracy for low-incidence categories. The high-incidence ones (names, emails, phone numbers, TFNs) have detection rates above 99%; the lower-incidence ones (driver licence numbers in unusual formats, foreign passport numbers, edge-case addresses) sit closer to 95%. The 95% is acceptable for the categories where the consequence of a miss is contained — a name that slips through is often not actually personally identifying in context. It is less acceptable for categories where any miss is significant.
We would also expose the mapping-table operations more clearly to the firm's audit log. The current design captures every tokenisation event, but the export is more verbose than it needs to be; a firm asking "what personal data has been tokenised for this matter" should see a one-page summary, not a 200-row CSV. That is on the roadmap and not yet shipped.
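A sketch of what that summary could collapse to, assuming the audit log records one row per tokenisation event:

```typescript
// Collapse per-event tokenisation logs into the one-page answer a firm
// actually wants: categories and distinct-token counts for a matter.

type TokenEvent = { matterId: string; category: string; token: string };

function summariseMatter(events: TokenEvent[], matterId: string): Record<string, number> {
  const summary: Record<string, number> = {};
  const seen = new Set<string>();
  for (const e of events) {
    if (e.matterId !== matterId || seen.has(e.token)) continue;
    seen.add(e.token); // count distinct tokens, not repeat uses
    summary[e.category] = (summary[e.category] ?? 0) + 1;
  }
  return summary; // e.g. { name: 3, tfn: 1, address: 2 }
}
```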
The architecture is correct. The implementation keeps accumulating the refinements that only real client workflows force, fixes for issues that don't surface until production. We document those in the security review pack we share with prospects under NDA, and we update the pack as we learn.
About the author
Roman Silantev — Founder, AI Lab Australia. Roman is the founder of AI Lab Australia Pty Ltd, the company that builds and operates SydClaw. He has spent the last decade building enterprise software for Australian professional services firms, and writes regularly on AI compliance and Privacy Act obligations.