An assessment measures something. The question is whether it measures what the program claims to develop. Most assessments in graduate and professional education measure recall: can the learner retrieve and reproduce information or concepts from the curriculum? Recall assessments are easy to design, easy to administer, and easy to grade. They are also largely useless as evidence of competency, because the capability that matters in professional practice — the ability to diagnose unfamiliar situations, apply relevant frameworks under uncertainty, and make defensible decisions with incomplete information — is not reliably predicted by recall performance.
The gap between what assessments measure and what programs intend to develop is not new. It persists for structural reasons that most curriculum designers do not address. Understanding those reasons is a prerequisite for designing assessments that do what assessments are supposed to do: produce evidence that a learner can actually do what the program claims to teach.
There is a sharper way to see the problem. An assessment is not a test of the learner. It is a test of the program's claim about the learner. When a program asserts that its graduates can diagnose organizational dysfunction, the assessment is the evidence offered in support of that assertion. If the assessment measures recall and the claim is about judgment, the evidence does not support the claim — and no one notices, because the grade distribution looks healthy and the students report satisfaction. The failure is silent. It surfaces years later, in the field, when a graduate with strong transcripts cannot read a situation that does not announce its structure.
Why Most Assessments Default to Recall
Recall assessments have several properties that make them the path of least resistance for curriculum designers and faculty.
They are easy to construct. A recall question requires identifying an important fact, concept, or framework from the curriculum and asking the learner to reproduce it. This is a low-cognitive-load design task. An assessment of application or transfer takes significantly more design effort: the designer must construct a situation that is genuinely unfamiliar (not recognizable from curriculum examples) and that contains the features that would cue use of the target capability, then pair it with a tractable evaluation rubric that does not simply reward surface-level pattern matching.
They are easy to grade with high inter-rater reliability. A recall item has a correct answer. A competency-based assessment, one that asks learners to diagnose a novel situation, requires raters who can evaluate the quality of reasoning, not just the accuracy of reproduction. That in turn requires rater calibration, rubric development, and acceptance of some judgment variability. Institutions with high student-to-faculty ratios find this administratively costly.
They produce interpretable grades. A score of 85 on a recall test communicates clearly that the student correctly answered 85% of recall items. A score of 85 on a competency-based performance assessment is meaningful only if the rubric clearly defines what a score of 85 means in terms of actual capability. Designing rubrics with that specificity takes time most curriculum designers do not invest.
There is a fourth property, less discussed and more corrosive: recall assessments are defensible against appeal. A student who contests a recall grade can be shown the answer key. A student who contests a judgment-based grade is contesting the evaluator's reasoning, which feels — to administrators and to the student — like an argument rather than a verdict. Faculty under workload pressure gravitate toward the instrument that cannot be argued with. The institutional preference for recall is not only a preference for ease of design; it is a preference for grades that do not have to be defended.
The result is a persistent structural bias toward recall assessment, even in programs that explicitly state their goal as competency development. The assessments reveal competency at recall, which is correlated with but not equivalent to professional competency.
The Competency Evidence Hierarchy
Not all assessment types are equivalent in the evidence they provide about learner capability. A useful way to think about assessment design is through a hierarchy of evidence quality — what I call the Competency Evidence Hierarchy. The hierarchy has five levels: recall, recognition, application, transfer, and generation. Each level provides stronger evidence of genuine competency than the one below it.
Recall is reproduction of previously encountered information. A learner who can define a term, name the steps of a framework, or reproduce a diagram has demonstrated recall. This is the weakest form of competency evidence because it does not require that the learner can use what they recalled.
Recognition is identifying the correct option among alternatives. Multiple-choice assessments typically operate at this level. Recognition is slightly stronger evidence than recall because it tests whether the learner can distinguish the correct option from plausible incorrect ones rather than merely reproduce a memorized answer, but it still does not require application. A learner can recognize the correct application of a framework in an example without being able to produce that application themselves.
Application is using a concept, framework, or skill to produce an output in a situation that resembles what the curriculum covered. Case study analysis, structured exercises, and problem sets with defined correct outputs operate at this level. Application is stronger evidence than recognition because it requires production, not just identification. It remains limited because it tests performance in familiar-enough situations — situations similar enough to curriculum examples that the learner can pattern-match rather than genuinely reason.
Transfer is applying a capability to a genuinely unfamiliar situation: one where the relevant concepts and frameworks are not cued by surface similarity to curriculum examples. Transfer assessments require learners to recognize what kind of problem they are facing before they can apply anything to it. This is the first level at which assessment begins to provide meaningful evidence of professional competency, because professional practice requires transfer almost exclusively — practitioners rarely encounter situations that are recognizable as textbook examples.
Generation is producing a novel artifact — a diagnostic framework, a design, a recommendation with supporting reasoning — that demonstrates not just ability to apply but ability to construct new analytical approaches from existing knowledge. Generation assessments are the strongest evidence of deep competency, and the hardest to design, administer, and evaluate. They are also the most honest test of whether a graduate program has produced the kind of thinking it claims to develop.
The hierarchy is not a ladder every learner must climb in sequence within a single assessment. It is a measurement instrument for the program designer. Before designing any assessment, the designer should be able to state which level the assessment targets and why that level is the right evidence for the claim the program is making. An assessment that cannot be located on the hierarchy is an assessment whose evidential value is unknown — which is the situation most programs are in without realizing it.
There is also a diagnostic use for the hierarchy that has nothing to do with grading. When a graduate fails in practice, the failure can usually be located on the hierarchy. The graduate who froze in front of an undiagnosed situation was assessed at application but deployed into transfer. The gap between the level the program certified and the level the work demands is the precise shape of the competency debt the program shipped.
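The two uses of the hierarchy described above, locating an assessment on it and locating a failure on it, can be sketched in a few lines. This is a minimal illustration, not a scoring method from the program itself: the names `EvidenceLevel` and `competency_gap` are hypothetical, and treating the five levels as evenly spaced integers is an assumption made only so the gap can be counted.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """The five levels of the hierarchy, ordered weakest to strongest."""
    RECALL = 1
    RECOGNITION = 2
    APPLICATION = 3
    TRANSFER = 4
    GENERATION = 5


def competency_gap(certified: EvidenceLevel, demanded: EvidenceLevel) -> int:
    """Number of levels by which the work's demand exceeds the assessed level.

    Returns 0 when the assessed level meets or exceeds the demand.
    Assumption: levels are treated as evenly spaced for counting purposes.
    """
    return max(0, demanded - certified)


# The graduate assessed at application but deployed into transfer:
gap = competency_gap(EvidenceLevel.APPLICATION, EvidenceLevel.TRANSFER)
print(gap)  # 1
```

The number itself matters less than the habit it forces: both arguments have to be stated explicitly before the function can be called, which is the discipline the hierarchy asks of the designer.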
What Diosh Uses in Graduate Education
In graduate education at PCU and in professional development program design, assessment design choices track the Competency Evidence Hierarchy deliberately. The default is not to start at recall and add upper levels if possible — it is to identify the highest level of evidence the program should produce and design backward from there.
For a course in organizational systems thinking, the target competency is transfer-level: learners should be able to take an unfamiliar organizational situation they have not previously analyzed and apply systems thinking frameworks to produce a coherent diagnostic picture. The assessment design implication is that the primary summative assessment must use a case they have never seen — not a case discussed in class, not a case that resembles class examples closely enough to allow pattern matching. The case must be genuinely unfamiliar, which means the assessment cannot be designed until after the curriculum is delivered, or must use cases from domains outside the learners' professional backgrounds.
Formative assessments during the course operate at the application level: structured case analyses using frameworks covered in the current unit, where the evaluation focuses on whether the learner is applying the framework correctly and where they are making errors that need correction before the summative assessment. These serve a different function — they are learning instruments, not primarily measurement instruments — and their design appropriately emphasizes feedback over evaluation.
The sequencing matters as much as the levels. A program that runs transfer-level assessment before learners have accumulated enough application-level practice is not measuring transfer; it is measuring who already had judgment before the course began. The formative application work is the scaffolding that makes the summative transfer assessment a fair test of what the course taught rather than a test of prior advantage. When this scaffolding is absent, transfer assessments reliably reward the students who arrived with field experience and penalize the ones the program was built to develop.
The assessment that produces the most honest information about learner competency is one that is uncomfortable to administer: open-ended, unfamiliar, evaluated by rubric rather than answer key, and acknowledged to have uncertainty in the scoring. That discomfort reflects the reality of professional practice, where the situations are always somewhat unfamiliar, the criteria for a good answer are rarely perfectly clear, and the evaluator must exercise judgment.
The Classroom Moment That Reveals the Level
The hierarchy is abstract until it is visible in a room. It becomes visible at a specific moment: when a learner who has performed well on application-level work is handed a genuinely unfamiliar case and asked to begin.
The diagnostic tell is the first move. A learner operating at application reaches immediately for a framework — they name the tool before they have read the situation, because in the curriculum the tool was always given and the task was to apply it. A learner operating at transfer does something slower and less comfortable: they describe what they observe, mark what is missing, and resist naming a framework until the situation has told them which kind of problem it is. The pause before the framework is the behavioral signature of judgment. Its absence is the behavioral signature of pattern matching dressed as analysis.
This is why a competency assessment must be designed to make that pause observable. A summative case that supplies the framing — "using systems thinking, analyze the following" — has already done the diagnostic work the learner was supposed to do, and has quietly converted a transfer task back into an application task. The instruction itself leaked the answer to the question the assessment was meant to ask. Designing the assessment so that the learner must decide what kind of problem they are facing, with no cue from the prompt, is the difference between measuring transfer and measuring obedient application. The wording of the prompt is not a presentation detail. It is the level control.
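A crude version of this check can be automated. The sketch below assumes the designer keeps a list of the framework names taught in the course (the `TAUGHT` list here is hypothetical) and simply flags any that appear in the prompt text. It will miss paraphrased cues, so it is a first pass and not a substitute for reading the prompt.

```python
def leaked_frameworks(prompt: str, framework_names: list[str]) -> list[str]:
    """Return the framework names that appear verbatim in the prompt text.

    Case-insensitive substring match only; paraphrased cues are not detected.
    """
    lowered = prompt.lower()
    return [name for name in framework_names if name.lower() in lowered]


# Hypothetical list of frameworks taught in the course.
TAUGHT = ["systems thinking", "causal loop diagram", "stock and flow"]

print(leaked_frameworks("Using systems thinking, analyze the following case.", TAUGHT))
# ['systems thinking']
print(leaked_frameworks("Read the case and advise the director.", TAUGHT))
# []
```

An empty result does not prove the prompt is clean. A non-empty result does show that the prompt has named the tool for the learner.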
Rubric Design for Competency-Based Assessment
A competency-based assessment is only as good as its rubric. A poorly designed rubric evaluates surface features — length, structure, use of vocabulary from the curriculum — rather than the quality of reasoning, which is what the assessment is supposed to measure.
Rubric design for higher-level competency assessments requires that each criterion describes a cognitive process, not a content presence. "The response demonstrates accurate understanding of feedback loops" evaluates content presence. "The response correctly identifies which feedback loops are causal drivers of the pattern described in the case, and explains why" evaluates reasoning. The second criterion is harder to score because it requires the rater to evaluate the quality of an argument, not the presence of a term. It is also the criterion that produces information about what the learner can actually do.
The practical test for a rubric criterion is whether a fluent learner could satisfy it without genuine competency. If a criterion can be passed by a student who has the vocabulary but not the judgment, the criterion is measuring vocabulary. "Uses systems terminology accurately" fails that test: a recall-strong student satisfies it easily. "Distinguishes a reinforcing loop from a coincidental correlation in the case data, and states the evidence" cannot be satisfied by vocabulary alone. Writing criteria that a fluent-but-shallow learner fails is the entire craft of competency rubric design, and it is the step most rubric authors skip because it requires imagining the articulate wrong answer, not just the correct one.
Rater calibration is essential for rubrics evaluating reasoning rather than recall. Without calibration, two raters applying the same rubric to the same response will frequently assign different scores — not because the rubric is flawed, but because evaluating argument quality requires judgment, and judgment requires a shared standard. Calibration sessions in which raters score the same set of anchor responses and discuss their reasoning until they reach agreement produce the shared standard that makes the rubric functional. The disagreements surfaced in calibration are not noise to be eliminated; they are the most useful artifact the process produces, because each disagreement marks a place where the rubric language was ambiguous and the criterion needs to be rewritten before it can be trusted.
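One way to make a calibration session concrete is to compute an agreement statistic on the anchor responses and list where the raters diverged. The sketch below uses Cohen's kappa, a standard chance-corrected agreement measure. The text above does not prescribe any particular statistic, so treat this choice, the function names, and the sample scores as illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Chance-corrected agreement between two raters on the same responses."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    n = len(rater_a)
    # Proportion of responses on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    if expected == 1.0:  # both raters gave every response the same single score
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreements(rater_a: list[int], rater_b: list[int]) -> list[int]:
    """Indices of anchor responses the two raters scored differently."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]


# Hypothetical scores from two raters on four anchor responses.
a = [3, 2, 4, 3]
b = [3, 2, 3, 3]
print(round(cohens_kappa(a, b), 3))  # 0.556
print(disagreements(a, b))           # [2]
```

The list of disagreement indices is the more useful output for the purpose described above: each index points to an anchor response worth discussing, and possibly to a criterion whose wording needs revision.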
The Institutional Default and Why It Persists
Most institutions default to recall-level assessment because the incentive structure rewards it. Faculty are evaluated on student satisfaction scores, which are higher when assessments feel fair and manageable, as recall assessments do. Administrative systems are designed for grades that are easy to calculate and report. Accreditation bodies have historically been more interested in curriculum coverage than in assessment rigor. The result is that the assessment practices that produce the most honest evidence of competency are the practices that create the most friction in the institutional system.
Changing this requires both individual curriculum choices — designing better assessments regardless of the institutional default — and institutional choices about what to measure and reward. Programs that invest in assessment redesign without corresponding investment in faculty development for rubric design and rater calibration will see limited improvement, because better assessment instruments are only useful when the people administering them know how to evaluate them.
Where This Approach Breaks Down
Competency-based assessment is not free, and treating it as universally superior is its own kind of error. There are real conditions under which recall and recognition assessment are the correct choice, and a designer who reaches for transfer-level instruments everywhere will waste effort and damage fairness.
Recall is the right instrument when the capability genuinely is reproduction — safety procedures, regulatory thresholds, foundational definitions that must be automatic before higher reasoning is possible. You do not want a clinician deriving a contraindication from first principles under time pressure; you want it recalled. The hierarchy ranks evidence of judgment, not the value of the knowledge. Some knowledge is supposed to be recalled, and assessing it any other way is a category mistake.
The approach also breaks down at scale without resourcing. A transfer-level summative assessment for a cohort of two hundred, graded by two calibrated raters each, is administratively heavy and slow to return. A program that mandates competency assessment without funding the rater time will get the form of competency assessment (the open-ended prompt) without the substance, because rushed raters under volume pressure fall back on surface features the rubric was designed to exclude. A degraded competency assessment can be worse than an honest recall test, because it carries the prestige of rigor while delivering the reliability of neither a recall test nor a properly rated competency assessment. The honest move, when resourcing is absent, is to assess fewer competencies at transfer level and assess them properly, rather than gesturing at transfer across the whole curriculum.
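The resourcing cost is easy to estimate before committing to a design. The sketch below uses the cohort of two hundred and the two raters per response from the example above. The thirty minutes per response is an assumed figure for illustration only, and `rater_hours` is a hypothetical helper.

```python
def rater_hours(cohort_size: int, raters_per_response: int,
                minutes_per_response: float) -> float:
    """Total rater time in hours for one summative assessment."""
    return cohort_size * raters_per_response * minutes_per_response / 60


# 200 learners, 2 raters each, assumed 30 minutes per rating.
print(rater_hours(200, 2, 30))  # 200.0
```

Running the same calculation with the program's own timing shows quickly whether the rater time is actually funded, or whether fewer competencies should be assessed at transfer level.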
What You Can Audit This Week
The full redesign is institutional and slow. The audit is individual and fast. Take any one assessment you currently administer and answer three questions about it.
First, locate it on the hierarchy. Read your own prompt and ask what level of evidence a strong response actually requires — not what you intended, but what the wording permits. If a student could earn full marks by reproducing or recognizing curriculum content, the assessment is measuring recall or recognition regardless of the learning outcome printed on the syllabus.
Second, check whether the prompt leaks the diagnosis. If the instruction names the framework to apply, you have converted a transfer task into an application task. Remove the framing and require the learner to decide what kind of problem they are facing. That single edit raises the evidential level of the assessment more than any amount of added difficulty.
Third, read one criterion of your rubric and ask whether a fluent student with no judgment could satisfy it. If yes, rewrite it to describe the cognitive move you actually care about — the distinction drawn, the hypothesis tested, the trade-off named — rather than the content displayed. You do not have to fix the whole instrument this week. Fixing one criterion is enough to feel the difference between scoring presence and scoring reasoning.
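The three questions can be recorded as a small checklist so the audit leaves a trace. This is one possible way to write it down, with hypothetical field names and wording. The answers still come from reading the prompt and rubric yourself.

```python
from dataclasses import dataclass


@dataclass
class AssessmentAudit:
    """Answers to the three audit questions for one assessment."""
    full_marks_by_reproduction: bool      # Q1: could recall or recognition earn full marks?
    prompt_names_framework: bool          # Q2: does the instruction leak the diagnosis?
    criterion_passable_by_vocabulary: bool  # Q3: could a fluent student with no judgment pass?

    def fixes(self) -> list[str]:
        """List the follow-up actions implied by the answers."""
        actions = []
        if self.full_marks_by_reproduction:
            actions.append("treat the assessment as recall or recognition evidence")
        if self.prompt_names_framework:
            actions.append("remove the framework from the prompt")
        if self.criterion_passable_by_vocabulary:
            actions.append("rewrite the criterion to describe a cognitive move")
        return actions


audit = AssessmentAudit(
    full_marks_by_reproduction=False,
    prompt_names_framework=True,
    criterion_passable_by_vocabulary=False,
)
print(audit.fixes())  # ['remove the framework from the prompt']
```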
The honest version of what most assessments reveal is that learners have acquired the curriculum's vocabulary and can reproduce its content. That is useful to know, but it is not evidence of professional competency. Designing assessments that produce the higher-level evidence is more difficult and more honest, and it is the only way to know whether a program is developing what it claims to develop.
Continue in this series
This piece is part of Teaching Systems Thinking to Graduate Students Who Want a Framework, my systematic guide to teaching systems thinking. Related reading:
- Graduate Assessments Reward Documentation, Not Judgment — Here Is How to Fix It
- Curriculum Design for Professional Development Programs That Actually Change Behavior
- The Gap Between Teaching Content and Teaching Judgment
- Teaching Systems Thinking to Graduate Students Who Want a Framework