Leadership Development: The Problem of VALIDITY AND RELIABILITY

By Matt Richter

Writing about the context of leadership is fun. One gets to discuss history and examples of both great and terrible leadership. One can pose the issues that affect leadership and share theoretical frameworks for what one can actually do. The same is true when we get to write about the purpose, goals, and values that drive effective leadership. But writing about the statistical characteristics of leadership theories and assessments is inherently interesting to only about three people in the world. One of them is dead. One of them is me. Which leaves one reader out there for this post—you.

If I have misidentified you, I nonetheless urge you to take a few minutes and force-feed this down your gullet. The issues of reliability and validity are a major crux when it comes to leadership development. Why? Because leadership development implies a prospective leader is learning something through a formal learning process. That implies intent, and it implies some kind of curriculum. For that curriculum to work, it needs to be based on theories, models, tools, approaches, and other content that have validity (we’ll define that momentarily). And, annoyingly, there are many types of validity, all easy to confuse, and all important. A good curriculum also implies that the content and the teaching approach work over and over. In other words, what we teach to one cohort will yield similar results with subsequent cohorts.

Let’s delve into the definitions of validity and reliability and explore their relative significance to leadership development. Before I do, let me stipulate that when I use the word model, I am referring to the statistical model used to construct an assessment or a statistical structure for an experiment. When I refer to the framework that an assessment or process is based on, I will use the word theory. Yes… I know these aren’t the most precise or accurate usages of these words, but in a paper referencing both statistics and theoretical frameworks, I will defer to expediency rather than precision to avoid confusion.

In general, validity refers to how well the outcomes from a study, a process, an assessment, or a methodology represent the same outcomes in real life. It means that the theory you may teach in your leadership development program actually works as it says it does. Or that the assessment you are using paints an accurate picture of the person it is assessing. Or that the process you use to teach yields the professed outcome. As mentioned, there are many types of validity. And one of the core problems most of us non-research-oriented folks have when we hear something is “valid” is not asking which type of validity is being claimed. Vendors of assessments will often share huge binders of statistics. What they share may be comprehensive, but it may also focus on only one or two types of validity. For example, with assessments, one often sees strong face validity.

So, what are some of the different validities that matter to this discussion?

FACE VALIDITY is the easy one. Does the test, model, or method APPEAR to measure what it claims? The most common way we test face validity is to ask users whether the results make sense to them. Does the assessment report or model resonate? You can think of face validity as the SMELL TEST. There are lots of leadership theories out there. And if the author writes about one well, it will likely pass the smell test. Take Servant Leadership. Or Bill George’s Authentic Leadership. Or a management theory like Situational Leadership. All of these pass the smell test. Some of these have associated assessments. Assessments associated with Situational Leadership have decades of data and strong factor analyses (construct analysis) behind them. From a face validity perspective, the assessments are good because they make sense when one reads the report. But unfortunately, we humans are not the most reliable judges. So face validity just scratches the surface.

These leadership theories can also be examined, and made sense of, in other, more statistical ways, though.

Another form of validity a lot of leadership theories use is CONTENT VALIDITY. Content validity also refers to an assessment, inventory, or instrument, and it evaluates how well that tool aligns to and covers ALL of the relevant parts of the construct. The construct is the theory represented by the statistical model. So, say I have a theory like emotional intelligence. In most versions of EQ, there are five constructs, or components. My assessment must evaluate all five. AND, my assessment must evaluate all five distinctly, consistently, and as described. This type of validity is deeper than face validity and provides clear information about the quality of the assessment. It also begins to provide some information about the quality of the underlying theory, but not that much. It implies that the assessment we are using clearly aligns to the definitions in the theory and that each assessment question distinctly aligns to one of the components without bleeding over. But note… it implies the linkage; it doesn’t prove or demonstrate it explicitly. Of course, the biggest issue with content validity in the context of leadership development is that we are still talking about an assessment as our base of understanding and not the foundational theoretical framework. Which means we are still unclear whether the actual theory itself has any value and rigor to it.
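If you want a feel for how content validity gets quantified in practice, one common (though by no means the only) approach is Lawshe’s content validity ratio, where a panel of subject matter experts rates whether each item is essential to the component it claims to measure. The sketch below is purely illustrative: the item names, the panel of ten experts, and their votes are all invented, and the helper function is hypothetical rather than part of any vendor’s toolkit.

```python
# Illustrative sketch only: Lawshe's content validity ratio (CVR).
# The item names and expert votes below are invented for demonstration.

def content_validity_ratio(essential_votes: int, total_experts: int) -> float:
    """CVR = (n_e - N/2) / (N/2); ranges from -1 (nobody says 'essential') to +1 (everybody does)."""
    half = total_experts / 2
    return (essential_votes - half) / half

# Hypothetical panel of 10 experts rating whether each EQ item is "essential"
# to the component it claims to measure.
expert_votes = {
    "item_01_self_awareness": 9,
    "item_02_self_regulation": 7,
    "item_03_motivation": 4,
}

for item, votes in expert_votes.items():
    print(f"{item}: CVR = {content_validity_ratio(votes, 10):+.2f}")
```

Items with a CVR near +1 are ones the panel agrees belong; items near or below zero suggest the instrument is not covering the construct the way the theory describes it.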

Now, many popular leadership theories have associated assessments. And most of those assessments come with excellent samples and reviews when it comes to face validity and content validity. Books of data support these.

It starts to get hairier when we look at CRITERION VALIDITY. Still focused on measures, criterion validity looks at the extent to which the measures in the model (again, usually determined through some form of an assessment) line up with a “gold standard.” This is cool stuff! But boy, is it hard to do with something as ultimately nebulous and vague as leadership, or things like emotional intelligence. Why? Because what is the “gold standard” when it comes to leadership? And even more specifically, to leadership development? Criterion validity works extremely well when it comes to medical diagnoses, job performance ratings, school grades, etc. But for leadership, we have to refer back to our definitional problem blog post. What the heck are we even talking about when we try to determine or reference a gold standard? Now, in some models, it is possible for the authors and researchers to define that standard. But then we often run into structural and construct problems. Are the constructs in the model distinct, at the same hierarchical level, and free of bleed-over into other constructs? For example, one of the big critiques of emotional intelligence is that the five components are often inherently covered by other models and lack independence. They are not distinct amongst the five and may even be supported or undermined by variables described more clearly in other models and theories. Of the five factors, personality theory already describes some but not others, which means the factors are neither of equal weight nor fully distinct ideas. So, when it comes to leadership, defining a rubric as the standard of measurement becomes quite difficult. Is financial success a leadership outcome? Maybe. But it is also an outcome of sales performance. What about setting a strategic direction? Does that practice originate from one person? Multiple people? Multiple functions? Who is actually defining the direction? Is setting a strategic direction, then, distinctly a leadership outcome, or an outcome of many different behaviors and activities?
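When a criterion does exist, the statistics are the easy part: you correlate scores on the instrument with the external standard. Here is a minimal sketch, assuming (purely for illustration) that we have a leadership assessment score and a later performance rating for the same people; every number below is made up.

```python
# Minimal sketch of a criterion validity check: correlate assessment scores
# with an external "gold standard" criterion. Every number here is invented.
import numpy as np

assessment_scores = np.array([62, 71, 55, 80, 68, 74, 59, 88])           # hypothetical leadership assessment scores
performance_ratings = np.array([3.1, 3.8, 2.9, 4.2, 3.5, 3.9, 3.0, 4.5])  # hypothetical criterion measure

# Pearson correlation between the instrument and the criterion.
r = np.corrcoef(assessment_scores, performance_ratings)[0, 1]
print(f"Criterion validity coefficient r = {r:.2f}")
```

The arithmetic is trivial. The hard part for leadership is deciding what belongs in that second column at all, which is exactly the definitional problem above.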

But where things really begin to break down is with CONSTRUCT VALIDITY. A construct is one of the components in your theory. From a statistical perspective, we need to be careful not to conflate the model that you use to structure your assessment (content validity addresses this) with the theoretical framework (the underpinning science that your assessment evaluates and explains). Do the constructs, or ideas, in your assessment actually hold up to rigorous testing? I am not yet exploring whether your framework actually works. And here is where most theories really fall down. They focus on their associated assessment and claim the theory itself is therefore good to go. But construct validity doesn’t really give us that. We have to dig deeper. More on this part later.

Good construct validity testing takes time. At its base level, it involves measuring the correlations (a correlation is simply a mutual relationship—there is a connection) your measure has with other measures that have already been validated. Then it argues that, because your correlations line up in some way, your model is predictive. There are lots of statistical tests researchers will use to assess construct validity. With assessments, factor analyses are a core approach—where the questions on an assessment should group into a unique category (called a factor) with other similar questions without crossing over into other categories. Meaning, you take something that already has been shown to work in both research and practice. You experiment with your model and see if there are correlations to it. Hopefully strong correlations. If there are, you are good to go!
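To make the factor-analysis piece concrete, here is a rough sketch of what that step looks like. The data is simulated (500 imaginary respondents, 12 items, four hidden factors), so the tidy result is baked in; a real construct validity argument would run this on actual responses and then inspect whether each item loads strongly on exactly one factor and weakly on the rest.

```python
# Rough sketch of the factor-analysis step in construct validation:
# do the assessment items cluster into the factors the theory predicts?
# Data here is simulated purely for illustration.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 500 respondents answering 12 items driven by 4 latent factors.
latent = rng.normal(size=(500, 4))                    # hidden factor scores per respondent
loadings = np.kron(np.eye(4), np.ones((1, 3)))        # each factor drives exactly 3 items
responses = latent @ loadings + rng.normal(scale=0.5, size=(500, 12))

fa = FactorAnalysis(n_components=4, random_state=0)
fa.fit(responses)

# Each row of components_ holds one extracted factor's loadings on the 12 items.
# "Simple structure" (each item loading strongly on exactly one factor) is the
# pattern a construct validity argument hopes to see.
print(np.round(fa.components_, 2))
```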

The trick is that, with enough training in psychometrics and enough survey-writing experience, I can make almost any assessment I write look good. Personally, I think “Which Star Trek Captain Are You Most Like?” is an excellent leadership assessment tool—he says sardonically. In other words, I can create an assessment that will have four factors, and each question will overwhelmingly align to one of them based on how I word it and structure it. I can also ask my Star Trek assessment takers to rate how much they personally agree with the results. A good factor analysis will determine that my beautifully written assessment indeed yields four distinct categories that indicate which captain one is most like. I can also establish clear parameters for each captain so that my assessment questions clearly align to one and not the others. Unfortunately, my model is unlikely to have a “gold standard.” And, of course, we have one more problem…

The strongest type of validation is when I look to determine whether my theory actually is valid in and of itself. This gets complex! The bottom line is I need to compare the output of my experiment (whether my model actually works) to independent fields or other experimental data sets that align with the simulated scenario used in my assay. Is Star Trek—and Kirk, Picard, Sisko, and Janeway—an appropriate leadership metaphor that can prescriptively dictate how a potential leader may behave or act in practice? Determining this is soooooo reliant on good experimental design and dependent on how well I define my dependent and independent variables. Is Kirk’s high risk-taking reflected only in him and not the other three? Is the way he takes risks completely independent from the other three? Are there other leadership factors at play that are properly explained and ultimately measured in this model? In other words, is it comprehensive enough? Still, if I design my experiment poorly, I may think I have good variable definition but in fact have a massive clusterfudge (a technical term)! I will avoid the statistics here… but the complexity gets huge!

A brief detour…

I am going to get a bit confusing here. We need to take a step back and differentiate full and complete theories of leadership from simplified theories. Simplified theories, to get really confusing, are often also called models. So, for the next few minutes, I am going to use the word model to mean a simplified theory—not, as I have been, a statistical structure in an assessment. So, to reiterate, I will use the word model outside of the realm of statistics for the next couple of paragraphs. Models are often used as metaphors to explain how something works—especially complex theories. Think of a picture, a map, or a chart that explains your overall theory. It is a simplified narrative to explain that concept. All models are limited in scope… by definition. They are used to explain phenomena and can be good as metaphors or as entries to more complex ideas. But leadership models often are too simple and ignore both context and the inherent systems they reside in (more on this in our next post). Scott Page, from the University of Michigan, has some wonderful explanations of the power and limitations of theoretical models. He properly highlights the fact that often, in order to be comprehensive with a given explanation, we really need more than one model to explain something. Often many models. Which makes life messy! Which is also why we are left without much in the context of leadership development. Model validation is so rarely done, and when it is, the experimental design is specious… at best. Deeper theoretical frameworks are researched even less within the context of leadership, at least in ways that meet hardcore statistical and experimental standards.

So, why the detour? Because many authors and consultants construct a narrative, or a metaphor, to explain their view of leadership. It is often oversimplified and doesn’t even reach the level of a proper theory. It gets converted into an assessment. And it gets sold. Any research that is done is then usually built on a house of cards. This is utterly problematic. The theory, or simplified model, being used has little or no validity. And the assessment associated with it can seem deceptively OK.

In summary, it is possible to have tons of research. But those bodies of research are all too often focused on the wrong questions, built on bad design, or centered on assessments that DO HAVE good design but fail to align to the overarching theory.

Sure, it is easy to be critical. But given the amount of money spent on leadership development, we need some criteria for evaluating what it is we are teaching. Investing in methods, tools, assessments, or theories without knowing if they actually work and do what they say is foolish!

Now we need to discuss RELIABILITY. Reliability means, very simply, how consistently what you are doing works. With assessments, does your assessment give you the same results over and over, as expected? With a model… does your structure work consistently, in multiple contexts, over time? Is there replication when we evaluate? In other words, does servant leadership work over and over in a myriad of different contexts? Will what worked yesterday work today, and still tomorrow? A quick survey of the leadership theories out there over the past 50 years will yield vastly different theories, all of which have worked in certain situations at least once or twice. But most fail the multiple-contexts test (I say most in case there is one out there, but I haven’t seen it). And still more damning, we have never, to my knowledge, done real longitudinal studies to determine whether any of these theories have truly worked over time even in the same organizations.

Again, take Servant Leadership. Was Steve Jobs a servant leader? Arguably, he was a bit unlikable and not very supportive of his followers. Yet, based on many standards (revenue, innovation, marketing, etc.), he was a highly effective leader. But he was definitely not a servant leader. Does this mean servant leadership is a bad theory? Not at all. But it does raise the question of whether it is distinct enough, or whether it accounts for other factors. It also raises the question of whether it works in other well-defined contexts but could not have worked given the Apple variables.

Reliability is tough because, for a theory or model to be reliable, it needs to work all the time, over time (in statistics, reliability is actually expressed as a coefficient that has to be large enough to make us happy). And if we need to put constraints around the circumstances to enable the theory to work, is it a general enough theory to be useful? Not sure. But it makes picking Servant Leadership, or any theory, difficult—or at least more complex to consider—when it comes to predictability. How do we know that, given the factors at play in our future, today’s choice of Theory A will work then? We can’t.
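For assessments specifically, the two reliability numbers you will most often see in a vendor binder are internal consistency (Cronbach’s alpha) and test-retest correlation. The sketch below shows both on simulated response data; the respondents, items, and retest are all invented, so treat it as an illustration of the calculation, not of any real instrument.

```python
# Rough sketch of two standard reliability checks for an assessment:
# internal consistency (Cronbach's alpha) and test-retest correlation.
# All response data here is simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
trait = rng.normal(size=(200, 1))                                    # one shared trait per imaginary person
items_t1 = trait + rng.normal(size=(200, 10))                        # 200 people answering 10 items
items_t2 = items_t1 + rng.normal(scale=0.3, size=items_t1.shape)     # the same people, retested later

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

test_retest_r = np.corrcoef(items_t1.sum(axis=1), items_t2.sum(axis=1))[0, 1]
print(f"Cronbach's alpha: {cronbach_alpha(items_t1):.2f}")
print(f"Test-retest r:    {test_retest_r:.2f}")
```

Note what these numbers do and do not tell you: they speak to whether the assessment behaves consistently, not to whether the underlying theory works across contexts and over time, which is the harder reliability question described above.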

Whew! You see… it’s so much easier to ignore this silly science stuff and just go to market!

Remember, this is a post about the validity and reliability issues with leadership development… so… how many theories of leadership have actually been shown to work:

  1. Over time. (reliability)

  2. Consistently within the same context. (reliability)

  3. Consistently in different contexts. (reliability)

  4. According to criterion-referenced gold standards. (validity)

  5. Distinctly due to the theory and not other factors. (validity)

See… it’s tough.

Validity and reliability issues are significant when it comes to leadership development. Leadership development traditionally requires us to predict what knowledge, skills, abilities, and attitudes will work in a specific context tomorrow. To do that, one needs to identify the assumed needs of the organization and then match those needs to some known curriculum that works. The best way to know if something works is to hold it against research criteria. That is most easily done against validity and reliability measures. It ain’t foolproof, but it is better than anything else we have. Unfortunately, most, if not all, theories fail to meet those standards… so, then, what are we to teach?

Oh, and as my LDA partner-in-crime, Clark Quinn likes to remind me, these statistical issues are not limited to leadership development alone. Take any of the frameworks and assessments that categorize individuals, including personality and disposition. They ALL have the same concerns and inherent flaws, too.

My next post will tackle the fourth and final issue with leadership development… poorly aligned goals, values, and expectations. Then, once I have finished painting a complete picture that seems helpless… hopeless… despondent… and full of despair… I will dive into what we can actually do. I will paint a picture of a better place. Rosy and aromatic! I will share how we can indeed begin to create leadership development that is contextual and applied. More to follow!


REFERENCES

Conte, J.M. (2005). A review and critique of emotional intelligence measures. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.516.126&rep=rep1&type=pdf

Fambrough, M., & Hart, R.K. (2008). Emotions in Leadership Development: A Critique of Emotional Intelligence. Advances in Developing Human Resources. https://www.researchgate.net/profile/Mary_Fambrough/publication/249631547_Emotions_in_Leadership_Development_A_Critique_of_Emotional_Intelligence/links/0046353bef93c7b874000000/Emotions-in-Leadership-Development-A-Critique-of-Emotional-Intelligence.pdf

Isaacson, W. (2011). Steve Jobs. New York: Simon & Schuster.

Northouse, P.G. (2019). Leadership: Theory and practice (8th ed.). London: Sage Publications.

Page, S.E. (2018). The Model Thinker: What you need to know to make data work for you. New York: Hachette Book Group.

Pfeffer, J. (2015). Leadership BS: Fixing workplaces and careers one truth at a time. New York: Harper Business.

Pyrczak, F. (2018). Making sense of statistics: A conceptual overview. New York: Routledge.

Reinhart, A. (2015). Statistics done wrong: The woefully complete guide. San Francisco: No Starch Press.

Wheelan, C. (2014). Naked Statistics: Stripping the dread from the data. New York: W.W. Norton & Company.

Zeidner, M., Matthews, G., & Roberts, R. D. (2004). Emotional intelligence in the workplace: A critical review. Applied Psychology: An International Review, 53, 371-399.