Why Validity and Reliability Matter in L&D—and What You Can Do About It

By Matt Richter

My good friend, the philosopher Tineke Melkebeek from Ghent, Belgium, has been urging me to run a workshop on statistics for L&D professionals. However, I’m much too lazy, so I wrote this article instead.

INTRODUCTION

L&D professionals are constantly faced with choices: which leadership assessment to use, which team diagnostic to roll out, or which psychometric profile to trust? Often, these tools come with glossy reports and vendor endorsements, but beneath the surface, many lack the scientific rigor necessary to support meaningful decisions. That's why validity and reliability matter. They are not academic luxuries; they are essential filters that protect organizations from wasting resources, misleading learners, and making poor talent decisions.

UNDERSTANDING VALIDITY AND RELIABILITY

Validity is about truth: does the tool measure what it claims to measure?

Reliability is about consistency: does it measure it consistently over time and across raters or situations?

These terms are not interchangeable. A tool can be highly reliable but not valid (consistently wrong) or moderately valid but inconsistently applied. Both matter for L&D because they determine whether an instrument actually supports development—or provides the illusion of insight.

TYPES OF VALIDITY

Content Validity – Does the tool cover the whole domain it claims to assess? For example, a leadership assessment should address decision-making, influence, adaptability, and ethics, not just communication style.

Construct Validity – Does the tool accurately reflect the theoretical concept (e.g., emotional intelligence, resilience) it's based on? This includes:

  1. Convergent validity (correlates with similar constructs)

  2. Discriminant validity (does not correlate with unrelated constructs)

Criterion Validity – Does the assessment correlate with or predict real-world outcomes? (These correlation-based checks are sketched in code after this list.)

  1. Predictive validity (e.g., scores forecast future performance)

  2. Concurrent validity (e.g., scores correlate with current metrics)

Face Validity (Not scientifically rigorous) – Does the tool look like it measures what it says it does? This is about user perception but does not actually measure quality.
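For the correlation-based checks above, the arithmetic itself is straightforward. Below is a minimal sketch in Python using made-up data; the construct names, score distributions, and reliance on simple Pearson correlations are illustrative assumptions, not a prescribed validation procedure.

    # Minimal sketch: correlation-based validity checks on hypothetical data.
    # All scores and variable names are invented for illustration.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 200

    new_assessment = rng.normal(50, 10, n)                            # the new resilience tool
    established_measure = 0.8 * new_assessment + rng.normal(0, 6, n)  # a similar, validated measure
    unrelated_measure = rng.normal(42, 3, n)                          # something conceptually unrelated
    later_outcome = 0.5 * new_assessment + rng.normal(0, 12, n)       # a real-world outcome measured later

    df = pd.DataFrame({
        "new": new_assessment,
        "established": established_measure,
        "unrelated": unrelated_measure,
        "outcome": later_outcome,
    })

    print("Convergent (want high):     ", round(df["new"].corr(df["established"]), 2))
    print("Discriminant (want low):    ", round(df["new"].corr(df["unrelated"]), 2))
    print("Criterion (want meaningful):", round(df["new"].corr(df["outcome"]), 2))

In real validation work, of course, the correlations come from actual participants rather than simulated numbers, and they are judged alongside sample size, confidence intervals, and the quality of the outcome measure itself.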

TYPES OF RELIABILITY

  1. Test-Retest Reliability – Does the assessment yield similar results over time for the same person?

  2. Inter-Rater Reliability – Do different evaluators produce similar ratings?

  3. Internal Consistency – Do items within the test measure the same underlying construct? Often tested using Cronbach's alpha* (a minimal calculation sketch follows). You don’t need to go that deep, however. Ask a friend.
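If you do want to peek under the hood before asking that friend, here is a minimal sketch of how Cronbach's alpha is computed from raw item scores. The five-item, ten-respondent matrix is entirely made up; real analyses require far larger samples.

    # Minimal sketch: Cronbach's alpha for a small, hypothetical item matrix.
    # Rows are respondents, columns are items intended to measure one construct.
    import numpy as np

    items = np.array([
        [4, 5, 4, 4, 5],
        [3, 3, 4, 3, 3],
        [5, 5, 5, 4, 5],
        [2, 3, 2, 2, 3],
        [4, 4, 5, 4, 4],
        [3, 2, 3, 3, 2],
        [5, 4, 5, 5, 5],
        [1, 2, 2, 1, 2],
        [4, 4, 4, 5, 4],
        [3, 3, 3, 3, 4],
    ])

    k = items.shape[1]                              # number of items
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale

    alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
    print(f"Cronbach's alpha: {alpha:.2f}")

    # Test-retest reliability, by contrast, is simply the correlation between
    # two administrations of the same scale to the same people (simulated here).
    time1 = items.sum(axis=1)
    time2 = time1 + np.round(np.random.default_rng(1).normal(0, 1, len(time1)))
    print(f"Test-retest correlation: {np.corrcoef(time1, time2)[0, 1]:.2f}")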

DISTINGUISHING INSTRUMENT VALIDITY FROM THEORETICAL VALIDITY

Here's where many well-intentioned professionals go wrong: they assume that if a tool seems well-designed or user-friendly, it must be valid. But you must distinguish between:

  1. Validity of the Instrument – Is the assessment psychometrically sound? Does it have empirical backing for its use case?

  2. Validity of the Theory or Model – Is the underlying framework grounded in credible research?

For example, the MBTI may be reliably administered with standardized scoring (instrument reliability), but it's based on Jungian type theory, which has no empirical foundation (theoretical invalidity). Similarly, a tool built on the concept of "learning styles" may have internally consistent items, but the entire theory of learning styles lacks scientific support.

Using a valid instrument based on a flawed model is like constructing an accurate thermometer to measure phlogiston (a disproven 18th-century hypothesis about combustion)—it may function correctly internally, but it's measuring the wrong thing. A more contemporary example: my "WHICH STAR TREK CAPTAIN ARE YOU MOST LIKE" profiling tool has high levels of assessment validity and reliability but completely lacks any theoretical framework to support it conceptually.

HOW TO EVALUATE THE VALIDITY OF A THEORY OR MODEL IN L&D

Let's dig more into just that—the validity of a theory or model. It's not enough to know whether an assessment works—you also need to ask whether the theory behind it holds up. A credible, well-supported theoretical framework is the foundation of any meaningful tool or intervention. So, how do you know if the theory is sound?

Start with conceptual clarity. Are the core ideas clearly defined and distinct? For example, does "resilience" mean persistence, optimism, or emotional regulation—or all three? Vague or overlapping terms are red flags.

Next, look for an empirical foundation. Has the theory been tested through peer-reviewed research? Reliable theories are supported by consistent evidence across contexts, not just internal white papers or anecdotal success stories.

Third, ask whether the theory has predictive power. Does it help explain or forecast relevant outcomes—like performance, engagement, or adaptability? If it can't do that, it may be descriptive but not helpful.

Also, consider falsifiability: can the theory be tested and potentially proven wrong? If not, it's more of a belief system than a scientific model.

Finally, assess practical relevance. Does the theory help leaders or learners solve real problems—or does it just sound good in a workshop?

In short, don't just ask if a model is popular. Ask if it's precise, tested, useful, and open to challenge. A good L&D professional doesn't just apply models—they interrogate them. Evaluating an assessment involves the same questioning process as evaluating a theory; the thought process is shared, but the scope is different.

WHY L&D PROFESSIONALS SHOULD CARE

  1. You're Making People Decisions – Whether for coaching, promotion, or team building, using invalid tools leads to misleading conclusions about real people.

  2. You Influence Organizational Culture – Flawed tools can create false narratives about personality, learning, leadership, and potential.

  3. You Risk Reputational and Legal Consequences – Using invalid assessments for hiring or performance management can expose your organization to ethical and legal risk.

  4. You're Responsible for Bad Investments – If you're investing in development tools, they must generate meaningful insights. Pseudoscientific instruments waste resources and crowd out better methods.

HOW NON-STATISTICIANS CAN ASSESS VALIDITY AND RELIABILITY

You don't need to be a psychometrician to ask thoughtful questions. Here are practical steps:

Ask Vendors for Evidence

  1. Request a technical manual or validity studies.

  2. Ask: What kind of reliability testing has been done? What is Cronbach's alpha for key scales? What predictive validity do you have? For what outcomes?

  3. Then, get a stats friend to help interpret the answers.

Look for Peer-Reviewed Citations

  1. Can the vendor point to independent studies in credible academic journals?

  2. Is the tool included in meta-analyses or comparative reviews?

Understand What Counts as Evidence

  1. Testimonials are not validation.

  2. User satisfaction is not validity.

  3. Proprietary "white papers" are not peer-reviewed evidence.

  4. To borrow from Alex Edmans, a statement is not necessarily a fact. Facts are not necessarily data. Data are not necessarily evidence, and evidence is not necessarily proof.

Examine the Theoretical Foundation

  1. Is the model based on contemporary psychological science (e.g., Big Five personality traits**, human development, motivation theory)?

  2. Does it rely on outdated or debunked frameworks (e.g., Jungian types, learning styles, brain hemisphere dominance)?

Watch for Red Flags

  1. Research what others are saying. This is merely a sniff test, and it definitely isn't reliable—but if there are numerous criticisms out there, you can probably stop before digging too deep into the details.

  2. Tools that claim everyone is a "type" or assign fixed labels.

  3. Assessments that are always positive or flattering.

  4. Claims of "universal applicability" across all roles and cultures.

  5. Tools that are used for both development and selection without clear evidence… or any evidence, for that matter!

WHY CORRELATION ISN'T CAUSATION: WHAT L&D NEEDS TO UNDERSTAND

There is one more piece of statistical terminology that we need to explore. It’s tempting to think that when two things move together, one must cause the other. For instance, if employees with high engagement survey scores also have high leadership potential ratings, we might assume that engagement causes leadership effectiveness. However, this is a classic mistake: correlation does not imply causation.

Correlation refers to a statistical relationship between two variables, meaning they increase or decrease together. Causation indicates that one variable directly impacts another. This distinction is crucial in L&D. Why does it matter? Interventions that stem from false assumptions about causality waste time, resources, and credibility.

Imagine a company observes that participants who complete a popular strengths-based training often get promoted within 12 months. A vendor might claim, “Our program causes faster promotions.” However, those who self-select into training may already be more ambitious, connected, or visible. Their promotion isn’t caused by the training—it merely co-occurs with it. Without experimental controls, we can’t know.
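A small simulation makes the self-selection trap concrete. In the hypothetical sketch below, an unmeasured trait (call it ambition) drives both training sign-up and promotion, while the training itself has zero effect; trained employees still show a visibly higher promotion rate. The variable names and effect sizes are invented purely for illustration.

    # Minimal sketch: a confound (ambition) creates correlation without causation.
    # All names and effect sizes are hypothetical.
    import numpy as np

    rng = np.random.default_rng(7)
    n = 10_000

    ambition = rng.normal(0, 1, n)

    # Ambitious people are more likely to self-select into the training...
    attended_training = rng.random(n) < 1 / (1 + np.exp(-2 * ambition))

    # ...and more likely to be promoted. Note: training has NO effect in this model.
    promoted = rng.random(n) < 1 / (1 + np.exp(-(ambition - 1)))

    print("Promotion rate, trained:    ", round(promoted[attended_training].mean(), 3))
    print("Promotion rate, not trained:", round(promoted[~attended_training].mean(), 3))

Randomize who receives the training (or statistically control for ambition) and the apparent advantage disappears, because the training never caused anything in the first place.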

This distinction links directly to validity and theoretical evaluation. A tool might correlate with positive outcomes (criterion validity), but unless we understand the mechanism—how and why it works—we risk mistaking coincidence for insight. A theory lacking causal logic cannot be applied practically, regardless of how compelling the data snapshots appear.

For L&D professionals, the takeaway is clear: Don’t settle for surface-level data. Ask, “What’s the mechanism behind this effect?” Insist on models that explain causality, not just pattern recognition. Otherwise, we risk building strategies on sand rather than on science.

CONCLUSION

L&D professionals operate at the intersection of people, performance, and decisions that carry real consequences. That's why we can't afford to lean solely on good intentions or familiar practices. Validity and reliability aren't academic technicalities—they're working definitions of whether our tools are meaningful or misleading. If a tool doesn't measure what it claims to measure—or can't do so consistently—it has no place in development conversations.

But that's just the beginning. We also need to examine the models behind the tools we use. Is the theory sound? Is it backed by empirical evidence? A well-designed tool based on a flawed model isn't helpful—it's merely a polished mechanism for reinforcing misguided assumptions. Even when we observe a correlation between a model and a desired outcome, we must still inquire: What's the mechanism? Can we reasonably assert that one causes the other, or are we simply connecting dots that happen to be situated close together?

This is where L&D work starts to matter more. We don't need to be psychometricians, but we do need to ask better questions. Does this tool work? Is it built on something credible? Are we chasing patterns or driving impact?

Evidence-informed practice isn't just better science—it's better leadership. And it's how we protect the credibility of our profession.


* Cronbach’s Alpha (α) is a measure of internal consistency—how closely related the items in a scale are. It is expressed as a single value typically ranging from 0 to 1. A value above .80 is considered good, a value between .65 and .80 is deemed mediocre, and a value below .65 is considered poor. Regarding limitations, it can be inflated by including more items, it assumes some form of unidimensionality, and it doesn’t indicate validity.

** The Big Five (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) was not developed to explain behavior. In other words, it was discovered, not invented. Specifically, psychologists used lexical analysis, reviewing thousands of adjectives used to describe human behavior. Then, they applied factor analysis, a statistical method that groups items that tend to co-occur. Across studies, five broad dimensions consistently emerged. This makes the Big Five a data-driven model, not a theory. It tells us how personality traits cluster, not why they exist, how they develop, or what mechanisms drive them. So, what is it? It’s a taxonomy—a classification system for describing personality in a consistent, replicable way. It is atheoretical in origin: it doesn’t assume anything about causes or processes. It offers empirical utility: it predicts a wide range of behaviors (e.g., job performance, well-being, learning). The bottom line is that the Big Five is a framework, not an explanatory system.
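For readers curious what "factor analysis groups items that tend to co-occur" looks like in practice, here is a toy sketch. The adjective ratings below are simulated from two made-up underlying traits, and the analysis recovers two factors; the actual lexical studies behind the Big Five used thousands of adjectives, much larger samples, and far more sophisticated rotation and replication procedures.

    # Toy sketch: factor analysis recovering latent traits from simulated adjective ratings.
    # Two invented traits ("sociability", "diligence") generate six observed ratings.
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(3)
    n = 1_000

    sociability = rng.normal(0, 1, n)
    diligence = rng.normal(0, 1, n)

    adjectives = ["talkative", "outgoing", "lively", "organized", "thorough", "careful"]
    ratings = np.column_stack([
        sociability + rng.normal(0, 0.5, n),  # talkative
        sociability + rng.normal(0, 0.5, n),  # outgoing
        sociability + rng.normal(0, 0.5, n),  # lively
        diligence + rng.normal(0, 0.5, n),    # organized
        diligence + rng.normal(0, 0.5, n),    # thorough
        diligence + rng.normal(0, 0.5, n),    # careful
    ])

    # Each line of output shows how strongly an adjective loads on the two recovered factors;
    # adjectives driven by the same trait cluster on the same factor.
    fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)
    for adjective, loadings in zip(adjectives, fa.components_.T):
        print(f"{adjective:>10}: {loadings.round(2)}")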


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Prentice Hall.

Bacharach, S. B. (1989). Organizational theories: Some criteria for evaluation. Academy of Management Review, 14(4), 496–515. https://doi.org/10.5465/amr.1989.4308374

Boyle, G. J., & Saklofske, D. H. (2004). Measuring personality and intelligence for selection: What can and cannot be done in practice. International Journal of Selection and Assessment, 12(1‐2), 92–98. https://doi.org/10.1111/j.0965-075X.2004.00267.x

Campbell, D. T., Stanley, J. C., & Gage, N. L. (1963). Experimental and quasi-experimental designs for research. Houghton Mifflin and Company.

Edmans, A. (2024). May contain lies: How stories, statistics, and studies exploit our biases—and what we can do about it. University of California Press.

John, O. P., Naumann, L. P., & Soto, C. J. (2008). Paradigm shift to the integrative Big Five trait taxonomy: History, measurement, and conceptual issues. In O. P. John, R. W. Robins, & L. A. Pervin (Eds.), Handbook of personality: Theory and research (3rd ed., pp. 114–158). The Guilford Press.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Pittenger, D. J. (2005). Cautionary comments regarding the Myers-Briggs Type Indicator. Consulting Psychology Journal: Practice and Research, 57(3), 210–221. https://doi.org/10.1037/1065-9293.57.3.210

Van de Ven, A. H. (2007). Engaged scholarship: A guide for organizational and social research. Oxford University Press.

Weick, K. E. (1995). What theory is not, theorizing is. Administrative Science Quarterly, 40(3), 385–390. https://doi.org/10.2307/2393789