Evaluating a KMS

Everyone claims their knowledge system is compounding. Nobody measures it. How do you actually know whether yours is working — or silently decaying under the illusion of growth?

[ active research ]

The open question

The word "compounding" gets thrown around in knowledge-management writing — mine included — as if it were observable. It isn't. A KMS can accumulate notes at a constant rate without any of them ever being retrieved. A KMS can answer a query well once and never again. A KMS can reach a steady state where the agent appears helpful but isn't actually making the user's work meaningfully different. None of those states are distinguishable from the inside.

The question: how do you know your KMS is actually helping, and not just giving you the warm feeling of a system that hums?

Current thinking

The three levels of what you could measure

System-health signals — is the machine running? Ingestion rate, compilation success rate, query latency, error rate. These are observable and boring. They tell you the system is alive, not that it is useful.

Retrieval quality — when a query is made, does the answer use the right sources? Is the right context surfaced? Here the measurement is more interesting but still inside-out: you're evaluating the system against a curated set of expected retrievals, which requires someone to build and maintain that set. Most teams never do.
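To make the curated-set idea concrete, here is a minimal sketch of what that evaluation looks like in practice: a hand-maintained mapping from queries to expected sources, scored with recall@k. All names and data shapes are hypothetical.

```python
# Curated-retrieval eval: compare the sources the system actually surfaced
# against a hand-maintained set of expected sources for each query.

def recall_at_k(retrieved: list[str], expected: set[str], k: int = 5) -> float:
    """Fraction of expected sources that appear in the top-k retrieved."""
    if not expected:
        return 1.0
    hits = sum(1 for doc in retrieved[:k] if doc in expected)
    return hits / len(expected)

# The expensive part nobody maintains: query -> sources a good answer needs.
eval_set = {
    "how do we handle refunds?": {"policy/refunds.md", "runbook/support.md"},
}

retrieved = ["policy/refunds.md", "blog/launch.md", "runbook/support.md"]
score = recall_at_k(retrieved, eval_set["how do we handle refunds?"], k=3)
print(score)  # 1.0 — both expected sources in the top 3
```

The code is trivial; the eval set is not. Keeping that dictionary current is the work most teams skip.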

Outcome signals — did the user make a better decision because of what the system surfaced? Did the onboarding cycle shorten? Did a decision get made faster and with fewer meetings? These are the signals that matter. They are also the hardest to attribute, because a decision happens at the intersection of many inputs and the KMS is only one of them.

The field has settled into measuring system health because it's automatic, gesturing at retrieval quality when funded, and almost never touching outcome signals.

Proxy metrics I'm currently running

At the personal tier, the proxy I use is revisit rate: how often does the agent surface a note I haven't touched in a week, and does the surfacing turn into a reference? If old notes never come back into active use, the system is a write-only store and the compounding claim is empty.
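Computed from an event log, revisit rate looks roughly like this. The log format and event names are assumptions, not a real system's schema: "surfaced" means the agent resurfaced a note, "referenced" and "edited" mean the user touched it.

```python
from datetime import datetime, timedelta

def revisit_rate(events, stale_after=timedelta(days=7)):
    """Of notes surfaced after sitting untouched past the threshold,
    what fraction were then actually referenced by the user?"""
    last_touch = {}     # note_id -> last time the user touched it
    stale_surfacings = 0
    converted = 0
    pending = set()     # stale notes surfaced, awaiting a reference
    for note, event, ts in sorted(events, key=lambda e: e[2]):
        if event == "surfaced":
            if note in last_touch and ts - last_touch[note] > stale_after:
                stale_surfacings += 1
                pending.add(note)
        else:  # "edited" or "referenced" both count as a touch
            if event == "referenced" and note in pending:
                converted += 1
                pending.discard(note)
            last_touch[note] = ts
    return converted / stale_surfacings if stale_surfacings else 0.0

events = [
    ("n1", "edited",     datetime(2026, 1, 1)),
    ("n1", "surfaced",   datetime(2026, 1, 10)),  # 9 days stale
    ("n1", "referenced", datetime(2026, 1, 10)),  # surfacing converted
    ("n2", "edited",     datetime(2026, 1, 5)),
    ("n2", "surfaced",   datetime(2026, 1, 20)),  # stale, never referenced
]
print(revisit_rate(events))  # 0.5
```

A rate near zero is the write-only-store signature: old notes get surfaced, or touched, but surfacings never convert into use.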

At the team tier, the proxy is question deflection: how many questions that used to reach a specific person (the CTO, the head of ops) now get answered from the system without reaching them? This is measurable through Slack mention patterns if you're willing to parse them, and it's a reasonable proxy for whether the system has absorbed institutional knowledge into a queryable form.
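A sketch of what parsing those mention patterns could look like, assuming exported message text. The expert handles, the question filter, and the before/after framing are all illustrative simplifications.

```python
from collections import Counter
import re

# Hypothetical expert handles whose question load we want to deflect.
EXPERTS = {"cto", "head_of_ops"}

def expert_question_mentions(messages: list[str]) -> Counter:
    """Count @-mentions of experts in question-bearing messages."""
    counts = Counter()
    for msg in messages:
        if "?" not in msg:
            continue  # crude filter: only count messages asking something
        for handle in re.findall(r"@(\w+)", msg):
            if handle in EXPERTS:
                counts[handle] += 1
    return counts

before = ["@cto how do refunds work?", "@cto which env var?",
          "@head_of_ops what's the shipping cutoff?"]
after = ["@cto which env var?"]

deflection = 1 - (sum(expert_question_mentions(after).values())
                  / sum(expert_question_mentions(before).values()))
print(f"{deflection:.0%}")  # 67%
```

The real version needs period normalisation (same team size, same message volume) before the ratio means anything; this only shows the shape of the computation.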

At the app-scoped tier, the proxy is first-query-first-answer rate: when a user asks a new question, does the system's first answer become the answer they act on, or do they keep querying and refining? A system with a rising first-query-first-answer rate is a system that has learned.
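This one is the simplest to compute if you log query sessions: group a user's consecutive queries on the same topic into a session, and count the sessions that needed only one query. The session grouping itself is assumed to happen upstream.

```python
# Each session is the sequence of queries a user issued before acting.
# A single-query session counts as first-query-first-answer.
def fqfa_rate(sessions: list[list[str]]) -> float:
    """Fraction of sessions resolved by the first query alone."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if len(s) == 1) / len(sessions)

sessions = [
    ["what is our refund window?"],                  # answered first try
    ["deploy steps?", "deploy steps for staging?"],  # needed refinement
    ["who owns billing?"],
]
print(round(fqfa_rate(sessions), 2))  # 0.67
```

Tracked over months, the slope matters more than the level: a flat rate means the system answers but does not learn.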

None of these are the same as "outcome signals." They are fast-feedback heuristics that let you notice decay quickly.

The paradox of a good KMS

A KMS that is working well is invisible. The user doesn't remember to be grateful for context that showed up seamlessly; they only remember the friction when it didn't. This makes user-reported satisfaction an almost useless signal for a mature system — the users who benefit most have forgotten the system is there. Evaluation has to come from behavioural traces, not from asking.

The decay asymmetry

Compounding is the claim. Decay is the failure mode nobody talks about. A KMS that is not actively maintained declines — the notes become stale, the extracted facts drift from the current state of the business, the compilation accumulates contradictions no one resolves. A good evaluation framework has to detect decay earlier than it detects growth, because decay in a knowledge system is less visible and more costly.

The current detection signal I run for decay is contradiction rate: how many flagged contradictions exist at any given time, and are they being resolved faster than they accumulate? A rising unresolved-contradiction count is the clearest leading indicator of a KMS that is slipping.
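The "accumulating faster than resolved" test reduces to a trend over snapshots of the open-contradiction count. A minimal sketch, fitting a least-squares slope to weekly counts; the threshold of zero and the weekly cadence are assumptions:

```python
# Weekly snapshots of open (unresolved) contradiction counts.
# A sustained upward slope is the leading indicator of decay.
def contradiction_trend(weekly_open_counts: list[int]) -> str:
    """Classify the trend via a simple least-squares slope."""
    n = len(weekly_open_counts)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_open_counts) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, weekly_open_counts))
             / sum((x - mean_x) ** 2 for x in xs))
    return "decaying" if slope > 0 else "holding"

print(contradiction_trend([3, 4, 6, 9]))  # decaying
print(contradiction_trend([9, 6, 4, 3]))  # holding
```

The point of fitting a slope rather than comparing endpoints is robustness to a single noisy week.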

What I haven't figured out

The outcome attribution problem has no clean answer. A decision gets made with inputs from people, meetings, memory, and the KMS, and you cannot separate the contribution of each in a live operating environment without instrumenting the entire decision process — which neither a team nor a user will tolerate.

The benchmark problem: there is no equivalent of an eval set for KMS quality. The field has retrieval benchmarks (BEIR, MTEB, the ones the embedding-model papers use), but those are about document-level retrieval, not about whether a synthesised answer is useful for the decision at hand. Building a domain-specific eval set is expensive and becomes outdated fast.

What would settle it

A framework that combines lightweight behavioural proxies (revisit rate, question deflection, contradiction rate) with a quarterly outcome audit (five specific decisions made in the period, reconstructed to understand what role the KMS played), plus a standardised decay detector. This is buildable; it's not built yet. And until it is, the word "compounding" should be used with a caveat — it's a claim about a shape, not a measurement.
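The behavioural-proxy half of that framework could be as small as a thresholded report. The thresholds below are placeholders, not calibrated values; the metric names match the proxies above but the structure is a sketch:

```python
# Composite health check over the three behavioural proxies.
# Threshold floors are illustrative, not calibrated.
THRESHOLDS = {"revisit_rate": 0.2, "deflection": 0.3, "fqfa_rate": 0.5}

def health_report(metrics: dict[str, float]) -> dict[str, bool]:
    """True = proxy above its floor; a missing metric fails by default."""
    return {name: metrics.get(name, 0.0) >= floor
            for name, floor in THRESHOLDS.items()}

report = health_report({"revisit_rate": 0.35,
                        "deflection": 0.10,
                        "fqfa_rate": 0.60})
print(report)  # {'revisit_rate': True, 'deflection': False, 'fqfa_rate': True}
```

The quarterly outcome audit resists this kind of automation by design; it stays a human exercise.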

Rev. 2026-04-18