Assessment Design for Real Competency

An assessment measures something. The question is whether it measures what the program claims to develop. Most assessments in graduate and professional education measure recall: can the learner retrieve and reproduce information or concepts from the curriculum? Recall assessments are easy to design, easy to administer, and easy to grade. They are also largely useless as evidence of competency, because the capability that matters in professional practice — the ability to diagnose unfamiliar situations, apply relevant frameworks under uncertainty, and make defensible decisions with incomplete information — is not reliably predicted by recall performance.

The gap between what assessments measure and what programs intend to develop is not new. But it is persistent, and it persists for structural reasons that most curriculum designers do not address. Understanding those reasons is a prerequisite for designing assessments that do what assessments are supposed to do: produce evidence that a learner can actually do what the program claims to teach.

Why Most Assessments Default to Recall

Recall assessments have several properties that make them the path of least resistance for curriculum designers and faculty.

They are easy to construct. A recall question requires identifying an important fact, concept, or framework from the curriculum and asking the learner to reproduce it. This is a low-cognitive-load design task. Designing an assessment that requires application or transfer requires significantly more design effort: the designer must construct a situation that is genuinely unfamiliar (not recognizable from curriculum examples), contains the relevant features that would cue use of the target capability, and has a tractable evaluation rubric that does not simply reward surface-level pattern matching.

They are easy to grade with high inter-rater reliability. A recall item has a correct answer, so graders rarely disagree about the score. A competency-based assessment, such as one that asks learners to diagnose a novel situation, requires raters who can evaluate the quality of reasoning, not just the accuracy of reproduction. This requires rater calibration, rubric development, and acceptance of some judgment variability. Institutions with high student-to-faculty ratios find this administratively costly.

They produce interpretable grades. A score of 85 on a recall test communicates clearly that the student correctly answered 85% of recall items. A score of 85 on a competency-based performance assessment is meaningful only if the rubric clearly defines what a score of 85 means in terms of actual capability. Designing rubrics with that specificity takes time most curriculum designers do not invest.

The result is a persistent structural bias toward recall assessment, even in programs that explicitly state their goal as competency development. Such assessments reveal competency at recall, which may correlate with professional competency but is not equivalent to it.

The Competency Evidence Hierarchy

Not all assessment types are equivalent in the evidence they provide about learner capability. A useful way to think about assessment design is through a hierarchy of evidence quality — what I call the Competency Evidence Hierarchy. The hierarchy has five levels: recall, recognition, application, transfer, and generation. Each level provides stronger evidence of genuine competency than the one below it.

Recall is reproduction of previously encountered information. A learner who can define a term, name the steps of a framework, or reproduce a diagram has demonstrated recall. This is the weakest form of competency evidence because it does not require that the learner can use what they recalled.

Recognition is identifying the correct option among alternatives. Multiple-choice assessments typically operate at this level. In this hierarchy, recognition counts as slightly stronger evidence than recall because it tests whether the learner can distinguish correct from incorrect among plausible alternatives, even though the options themselves act as retrieval cues. It still does not require application. A learner can recognize the correct application of a framework in an example without being able to produce that application themselves.

Application is using a concept, framework, or skill to produce an output in a situation that resembles what the curriculum covered. Case study analysis, structured exercises, and problem sets with defined correct outputs operate at this level. Application is stronger evidence than recognition because it requires production, not just identification. It remains limited because it tests performance in familiar-enough situations — situations similar enough to curriculum examples that the learner can pattern-match rather than genuinely reason.

Transfer is applying a capability to a genuinely unfamiliar situation: one where the relevant concepts and frameworks are not cued by surface similarity to curriculum examples. Transfer assessments require learners to recognize what kind of problem they are facing before they can apply anything to it. This is the first level at which assessment begins to provide meaningful evidence of professional competency, because professional practice requires transfer almost exclusively — practitioners rarely encounter situations that are recognizable as textbook examples.

Generation is producing a novel artifact — a diagnostic framework, a design, a recommendation with supporting reasoning — that demonstrates not just ability to apply but ability to construct new analytical approaches from existing knowledge. Generation assessments are the strongest evidence of deep competency, and the hardest to design, administer, and evaluate. They are also the most honest test of whether a graduate program has produced the kind of thinking it claims to develop.
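The ordering of the five levels can be made concrete with a small sketch. The Python below is illustrative only: the names `EvidenceLevel`, `strongest_evidence`, and `meets_target`, and the example assessment plan, are hypothetical and not part of any established tool. It assumes each assessment in a course can be tagged with a single level.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """The five levels, ordered from weakest to strongest evidence."""
    RECALL = 1
    RECOGNITION = 2
    APPLICATION = 3
    TRANSFER = 4
    GENERATION = 5


def strongest_evidence(levels):
    """Return the highest evidence level present in an assessment plan."""
    return max(levels)


def meets_target(levels, target):
    """True if at least one assessment reaches the target level."""
    return strongest_evidence(levels) >= target


# Hypothetical assessment plan for one course (assumed tags).
plan = {
    "weekly quiz": EvidenceLevel.RECALL,
    "unit case analysis": EvidenceLevel.APPLICATION,
    "final unseen case": EvidenceLevel.TRANSFER,
}

print(strongest_evidence(plan.values()).name)
print(meets_target(plan.values(), EvidenceLevel.GENERATION))
```

Using an ordered enumeration keeps the comparison explicit: a plan that tops out at application cannot be said to produce transfer-level evidence, however many application-level assessments it contains.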

What Diosh Uses in Graduate Education

In graduate education at PCU and in professional development program design, assessment design choices track the Competency Evidence Hierarchy deliberately. The default is not to start at recall and add higher levels where possible. It is to identify the highest level of evidence the program should produce and design backward from there.

For a course in organizational systems thinking, the target competency is transfer-level: learners should be able to take an organizational situation they have not previously analyzed and apply systems thinking frameworks to produce a coherent diagnostic picture. The assessment design implication is that the primary summative assessment must use a case they have never seen: not a case discussed in class, and not a case that resembles class examples closely enough to allow pattern matching. The case must be genuinely unfamiliar, which means either that the assessment cannot be designed until after the curriculum is delivered, or that it must use cases from domains outside the learners' professional backgrounds.

Formative assessments during the course operate at the application level: structured case analyses using frameworks covered in the current unit. Evaluation focuses on whether the learner is applying the framework correctly and on which errors need correction before the summative assessment. These assessments serve a different function from the summative one. They are primarily learning instruments rather than measurement instruments, and their design appropriately emphasizes feedback over evaluation.

The assessment that produces the most honest information about learner competency is one that is uncomfortable to administer: open-ended, unfamiliar, evaluated by rubric rather than answer key, and acknowledged to have uncertainty in the scoring. That discomfort reflects the reality of professional practice, where the situations are always somewhat unfamiliar, the criteria for a good answer are rarely perfectly clear, and the evaluator must exercise judgment.

Rubric Design for Competency-Based Assessment

A competency-based assessment is only as good as its rubric. A poorly designed rubric evaluates surface features — length, structure, use of vocabulary from the curriculum — rather than the quality of reasoning, which is what the assessment is supposed to measure.

Rubric design for higher-level competency assessments requires that each criterion describe a cognitive process, not the presence of content. "The response demonstrates accurate understanding of feedback loops" evaluates content presence. "The response correctly identifies which feedback loops are causal drivers of the pattern described in the case, and explains why" evaluates reasoning. The second criterion is harder to score because it requires the rater to evaluate the quality of an argument, not the presence of a term. It is also the criterion that produces information about what the learner can actually do.

Rater calibration is essential for rubrics evaluating reasoning rather than recall. Without calibration, two raters applying the same rubric to the same response will frequently assign different scores — not because the rubric is flawed, but because evaluating argument quality requires judgment, and judgment requires a shared standard. Calibration sessions in which raters score the same set of anchor responses and discuss their reasoning until they reach agreement produce the shared standard that makes the rubric functional.
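A calibration session can be checked with simple agreement statistics. The sketch below is one possible approach, not a method prescribed here: it computes raw percent agreement and Cohen's kappa (a standard chance-corrected agreement statistic for two raters) on a set of anchor responses. The function names, the 1 to 4 rubric scale, and the scores are all invented for illustration.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of anchor responses on which both raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each scored independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category for every response.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores from two raters on ten anchor responses (1-4 scale).
rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2]

print(round(percent_agreement(rater_a, rater_b), 2))  # 0.7
print(round(cohens_kappa(rater_a, rater_b), 2))       # 0.58
```

In this invented example the raters agree on seven of ten responses, but kappa is noticeably lower because some of that agreement would be expected by chance. Tracking a statistic like this before and after a calibration discussion is one way to see whether the shared standard is actually forming.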

The Institutional Default and Why It Persists

Most institutions default to recall-level assessment because the incentive structure rewards it. Faculty are evaluated on student satisfaction scores, which are higher when assessments feel fair and manageable, as recall assessments do. Administrative systems are designed for grades that are easy to calculate and report. Accreditation bodies have historically been more interested in curriculum coverage than in assessment rigor. The result is that the assessment practices that produce the most honest evidence of competency are the practices that create the most friction in the institutional system.

Changing this requires both individual curriculum choices (designing better assessments regardless of the institutional default) and institutional choices about what to measure and reward. Programs that invest in assessment redesign without corresponding investment in faculty development for rubric design and rater calibration will see limited improvement, because better assessment instruments are only useful when the people administering them know how to score responses with them.

The honest version of what most assessments reveal is that learners have acquired the curriculum's vocabulary and can reproduce its content. That is useful to know, but it is not evidence of professional competency. Designing assessments that produce the higher-level evidence is more difficult and more honest, and it is the only way to know whether a program is developing what it claims to develop.
