# Technical Uncertainty Log

## Purpose
This document records the technical uncertainties encountered during GoComply R&D activities.
For each uncertainty, we document: what was unknown, why a competent professional could not
have determined the outcome in advance, and what experimentation was conducted.

---

## Uncertainty 1: RAG Retrieval Quality for Regulatory Text

**Date Identified:** August 2024
**Related Activity:** Core Activity 1 (RAG Pipeline)

**What was unknown:**
Whether BM25-based full-text search (SQLite FTS5) could retrieve sufficiently relevant
regulatory chunks from a corpus of 1,813 chunks spanning 228 sources, given natural
language compliance queries.

**Why a competent professional could not determine this in advance:**
- BM25 is optimised for general text retrieval, not for legal/regulatory language, which
  features dense cross-references, defined terms, and nested clause structures
- No benchmark existed for regulatory text retrieval in the Australian context
- The interaction between chunk granularity and retrieval precision for compliance
  analysis was unexplored in literature
- Regulatory text contains domain-specific terminology (e.g., "responsible entity",
  "notifiable breach", "accountable person") whose overlap creates disambiguation challenges

**Experimentation conducted:**
1. Tested 4 chunk size strategies: clause-level (200-400 tokens), 512, 1024, and 2048 tokens
2. Evaluated 2 query formulation approaches: direct extraction vs. LLM-reformulated
3. Measured precision@5 and recall@10 against manually annotated test set
4. Iterated on FTS5 tokeniser configuration for regulatory terminology
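
Steps 1 and 4 above can be sketched with SQLite's FTS5 module. The schema, tokeniser
options, and sample rows below are illustrative assumptions, not the production index:

```python
import sqlite3

# Minimal sketch of the BM25 retrieval in steps 1 and 4. The tokenchars
# option keeps slashed/hyphenated regulatory terms such as "AML/CTF"
# together as single tokens.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE chunks USING fts5(
        chunk_id UNINDEXED, source, body,
        tokenize = "unicode61 tokenchars '-/'"
    )
""")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        ("c1", "CPS 234", "A regulated entity must notify APRA of a notifiable breach."),
        ("c2", "CPS 230", "Operational risk management for material service providers."),
    ],
)

# BM25-ranked retrieval for a (reformulated) compliance query.
rows = conn.execute(
    "SELECT chunk_id, bm25(chunks) FROM chunks "
    "WHERE chunks MATCH 'notifiable breach' ORDER BY bm25(chunks) LIMIT 5"
).fetchall()
print(rows)  # only c1 matches both query terms
```

Note that FTS5's `bm25()` auxiliary function returns lower-is-better scores, so ordering
ascending ranks the best match first.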

**Result:**
Clause-level chunking (200-400 tokens) with LLM-reformulated queries achieved the best
results (precision@5 = 0.72, recall@10 = 0.68). This outcome was not predictable in advance:
larger chunks had initially been expected to perform better because they carry more context.
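
The precision@5 and recall@10 figures quoted above can be computed as in this minimal
sketch; the function names and the annotated example are illustrative:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for c in top_k if c in relevant) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all annotated-relevant chunk IDs found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)


# One query from a hypothetical annotated test set.
retrieved = ["c12", "c40", "c7", "c99", "c3", "c55", "c2", "c81", "c20", "c66"]
relevant = {"c12", "c7", "c3", "c20", "c31"}
print(precision_at_k(retrieved, relevant, 5))  # 3 of top 5 are relevant -> 0.6
print(recall_at_k(retrieved, relevant, 10))    # 4 of 5 relevant found -> 0.8
```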

---

## Uncertainty 2: LLM Hallucination in Compliance-Critical Outputs

**Date Identified:** September 2024
**Related Activity:** Core Activity 1 (RAG Pipeline)

**What was unknown:**
Whether a large language model (Claude Sonnet) would produce factually accurate compliance
findings with verifiable regulatory citations, or whether hallucination rates would be
unacceptably high for a compliance product where incorrect findings could cause real harm.

**Why a competent professional could not determine this in advance:**
- LLM hallucination rates are highly task-dependent and cannot be predicted without
  domain-specific evaluation
- Compliance assessment requires not just finding issues but citing the exact regulatory
  clause — a higher bar than general summarisation
- The interaction between retrieved context quality and hallucination rate was unknown
- No prior work measured LLM hallucination rates for Australian regulatory compliance tasks

**Experimentation conducted:**
1. Baseline: LLM without retrieval context — hallucination rate ~45%
2. With FTS5 retrieval context (top 5 chunks) — hallucination rate dropped to ~18%
3. Added verification layer (LLM checks its own citations against source text) — rate dropped to ~11%
4. Added clause-reference validation (regex match against known clause identifiers) — rate ~8%
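
Step 4's clause-reference validation can be sketched as a regex whitelist check; the
pattern and the set of known identifiers below are illustrative, not the production rules:

```python
import re

# Citations an LLM emits are matched against known clause identifiers
# before a finding is accepted; unrecognised citations are flagged.
CLAUSE_RE = re.compile(r"\b(?:CPS|CPG|SPS|APS)\s?\d{3}\b")

KNOWN_CLAUSES = {"CPS 230", "CPS 234", "CPG 230", "SPS 220", "APS 110"}


def validate_citations(finding_text: str) -> tuple[list[str], list[str]]:
    """Split cited clause identifiers into recognised and unrecognised."""
    cited = CLAUSE_RE.findall(finding_text)
    ok = [c for c in cited if c in KNOWN_CLAUSES]
    bad = [c for c in cited if c not in KNOWN_CLAUSES]
    return ok, bad


ok, bad = validate_citations("Non-compliant with CPS 230; see also CPS 999.")
print(ok, bad)  # "CPS 999" is not a known standard, so it is flagged for review
```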

**Result:**
The multi-stage verification pipeline reduces hallucination to ~8%, which approaches but
does not yet meet the <5% target for production compliance use. Further research is needed.

---

## Uncertainty 3: Rule Engine Scalability and Conflict Resolution

**Date Identified:** October 2024
**Related Activity:** Core Activity 2 (Rule Engine)

**What was unknown:**
Whether a keyword-pattern rule engine could scale beyond ~200 rules without interaction
effects between rules driving false positive rates to unacceptable levels.

**Why a competent professional could not determine this in advance:**
- Regulatory frameworks are interconnected — CPS 230 references CPS 232, CPG 230, SPS 220,
  and APS 110, creating potential for rule overlap
- The false positive rate as a function of rule count was not predictable from first principles
- No prior system had attempted this breadth of regulatory rule coverage
- The non-linear interaction between rules (conflicting severity, overlapping keywords)
  could not be determined without empirical testing

**Experimentation conducted:**
1. Scaled from 21 rules → 200 rules: false positive rate increased from 12% to 35%
2. Developed an automated conflict detection algorithm, which identified cross-rule conflicts affecting ~15% of rules
3. Implemented rule priority system with regulation-domain weighting
4. Scaled to 1,975 rules with conflict resolution — false positive rate: 22%
5. Ongoing: severity calibration experiments to reduce to <15%
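
The automated conflict detection of step 2 can be sketched as a pairwise check for rules
whose trigger keywords overlap heavily but whose severities disagree. The rule shape,
Jaccard threshold, and sample rules are illustrative assumptions, not the production schema:

```python
from itertools import combinations


def keyword_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two rules' keyword sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_conflicts(rules: list[dict], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return pairs of rule IDs with overlapping triggers but differing severity."""
    conflicts = []
    for r1, r2 in combinations(rules, 2):
        if (keyword_overlap(r1["keywords"], r2["keywords"]) >= threshold
                and r1["severity"] != r2["severity"]):
            conflicts.append((r1["id"], r2["id"]))
    return conflicts


rules = [
    {"id": "R1", "keywords": {"outsourcing", "material", "service provider"}, "severity": "high"},
    {"id": "R2", "keywords": {"outsourcing", "material", "third party"}, "severity": "medium"},
    {"id": "R3", "keywords": {"breach", "notify"}, "severity": "high"},
]
print(detect_conflicts(rules))  # R1/R2 share triggers but disagree on severity
```

The pairwise scan is quadratic in rule count, which is workable at ~2,000 rules but would
need an inverted keyword index at much larger scales.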

**Result:**
Automated conflict detection was essential for scaling. Without it, the rule engine would
have been unusable beyond ~500 rules. This finding was not predictable.

---

## Uncertainty 4: Cross-Regulatory Compliance Assessment

**Date Identified:** November 2024
**Related Activity:** Core Activities 1 and 2

**What was unknown:**
Whether a single scanning pass could meaningfully assess compliance across fundamentally
different regulatory paradigms (prescriptive APRA standards vs. principles-based ASIC
guidance vs. activity-based AUSTRAC requirements).

**Why a competent professional could not determine this in advance:**
- Each regulatory body uses different compliance assessment methodologies
- A document that is compliant under CPS 234 (information security) may still violate
  the Privacy Act's different framing of the same obligation
- No prior system had attempted unified cross-regulatory assessment
- The semantic overlap between regulations from different bodies was unmeasured

**Experimentation conducted:**
1. Single-regulation scanning: tested accuracy per regulatory domain independently
2. Cross-regulation scanning: same documents scanned against all applicable regulations
3. Measured conflict rate: findings from different regulations contradicting each other
4. Developed regulation-relationship mapping to resolve cross-regulatory conflicts

**Result:**
Cross-regulatory scanning is viable but requires explicit relationship modelling between
regulatory frameworks. Flat, independent scanning produced ~30% contradictory findings.
After relationship modelling, contradictions reduced to ~8%.

---

## Uncertainty 5: AI Governance Back-Test Validity

**Date Identified:** March 2026
**Related Activity:** Core Activity 3 (Sentinel)

**What was unknown:**
Whether historical enforcement case data (publicly available court proceedings, AUSTRAC
statements of claim, APRA investigation reports) contains sufficient detail to reconstruct
a meaningful pre-enforcement compliance profile for back-testing purposes.

**Why a competent professional could not determine this in advance:**
- Enforcement documents are not designed for back-testing — they present findings, not
  the full compliance landscape
- The completeness of public enforcement data varies significantly between regulators
  and cases
- No methodology existed for converting enforcement findings into testable compliance
  profiles
- Whether the scanner's detection of known failures constitutes a valid back-test
  (vs. information leakage from the enforcement data itself) required careful experimental design

**Experimentation conducted:**
1. CBA AUSTRAC 2018: extracted 9 key findings from Statement of Claim, mapped to
   regulatory requirements, reconstructed minimum compliance profile
2. Tested scanner against CBA profile — detected 7/9 failures (78%)
3. Westpac AUSTRAC 2020: repeated with 8 findings — detected 6/8 (75%)
4. Controlled for information leakage: tested with findings removed from scanner rules
5. Validated generalisability: detection was high for AML/CTF findings but lower for governance findings
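
The detection-rate scoring used in steps 2 and 3 can be sketched as follows; the finding
and requirement IDs are made up for illustration:

```python
# Each enforcement finding is mapped to one or more requirement IDs in the
# reconstructed compliance profile; a finding counts as detected when the
# scanner flags at least one of its mapped requirements.
def detection_rate(findings: dict[str, set[str]], scanner_hits: set[str]) -> float:
    """Fraction of enforcement findings matched by at least one scanner hit."""
    detected = sum(1 for reqs in findings.values() if reqs & scanner_hits)
    return detected / len(findings)


# CBA 2018: 9 findings reconstructed from the Statement of Claim.
cba_findings = {f"F{i}": {f"REQ-{i}"} for i in range(1, 10)}
hits = {f"REQ-{i}" for i in range(1, 8)}  # scanner flags 7 of the 9 requirements
print(round(detection_rate(cba_findings, hits), 2))  # 7/9 -> 0.78
```

Controlling for information leakage (step 4) amounts to recomputing this rate after
removing from `hits` any rule derived from the enforcement case itself.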

**Result:**
The methodology is viable, with caveats. Detection rates of 75-78% demonstrate that the
scanner can identify known compliance failures, but the approach is stronger for
prescriptive regulations (AML/CTF) than for principles-based frameworks.
