The Limitations of Course Evaluations: Identifying Helpful, Accurate & Wholistic Measures

As learning organizations venture further into the use of learning analytics and data-driven decision-making, I find it increasingly important to consider the danger of simply collecting and analyzing the data that are available or easiest to collect. I will use the example of course evaluations in schools to illustrate my point, largely based on the insights from An Evaluation of Course Evaluations by Stark and Freishtat (which I learned about and located because of this article in the Chronicle of Higher Education). Amid their critique of evaluations, they share the following story.

Three statisticians go hunting. They spot a
deer. The first statistician shoots; the shot passes a yard to the left of the deer. The
second shoots; the shot passes a yard to the right of the deer. The third one yells, “We
got it!” (Stark, P, and Freishtat, R., p. 4)

As indicated by this story, relying on averages alone can lead to flawed conclusions. At some point, there is a need to put faces and stories to the data, which calls for more forms of data collection. The problem is that not all data are equally easy to collect. So, we often settle for pre-developed templates, what our analytics software can most easily collect and display, or what we (individually or collectively) can most easily understand. We may establish key performance indicators and identify measures based on what data are available or easiest to collect, analyze and understand. In doing so, we risk drawing flawed conclusions about how we are doing as an institution. Our numbers look good, so we are making progress. Or, our numbers are down so we must do what we can to raise them.

Note the potential flaw with that last statement. If our numbers are down, we must do something to raise them. When we hear something like this, we have signs of a subtle but important shift in an organization: the goal has moved from improving learning to improving the number. There may be hundreds of ways to increase the numbers so that we seem to be making progress. Yet, not all of these options are equally valuable. Consider an instructor whose overall course evaluation ratings go down one semester. The only obvious change that the instructor can identify from the last term (where ratings were much higher) was that she added the requirement of a weekly learning journal. So, she got rid of the learning journal assignment the next term and the evaluation averages went back up. Problem solved. Look more closely, however, and we find that student performance had actually increased during the term with the lower average evaluation. So, the ratings are now higher but students are not performing as well on the assessments. The teacher sticks with that strategy, knowing that rank and promotion are partly dependent on course evaluation averages.

Most course evaluations are based upon self-reporting, because that is easy to do. In the scenario from the last paragraph, note that discovering this potential problem would only happen if we collected actual student performance data along with their evaluations. Yet, I am not aware of organizations that do that. It is a more complex task to carry out. So, we settle for the easy route, despite the fact that it may lead us down the wrong path.
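To make the learning journal scenario concrete, here is a minimal sketch in Python of what pairing evaluation averages with performance data might look like. Everything in it is an assumption made for illustration: the `TermRecord` structure, the `diverging_terms` function, and all of the numbers are invented and do not come from Stark and Freishtat or from any real course.

```python
from dataclasses import dataclass


@dataclass
class TermRecord:
    term: str
    mean_rating: float  # average self-reported course evaluation rating (1-5)
    mean_score: float   # average student assessment score (0-100)


# Hypothetical numbers, invented to mirror the learning journal story.
terms = [
    TermRecord("Term 1 (no journal)", 4.4, 78.0),
    TermRecord("Term 2 (weekly journal)", 3.9, 84.0),
    TermRecord("Term 3 (journal removed)", 4.5, 77.0),
]


def diverging_terms(records):
    """Flag consecutive terms where ratings and performance moved in opposite directions."""
    flagged = []
    for prev, curr in zip(records, records[1:]):
        rating_change = curr.mean_rating - prev.mean_rating
        score_change = curr.mean_score - prev.mean_score
        # A negative product means one measure rose while the other fell.
        if rating_change * score_change < 0:
            flagged.append(
                (prev.term, curr.term, round(rating_change, 2), round(score_change, 2))
            )
    return flagged


flagged = diverging_terms(terms)
for before, after, rating_change, score_change in flagged:
    print(f"{before} -> {after}: rating {rating_change:+}, score {score_change:+}")
```

With only the ratings in hand, Term 2 looks like a failure and Term 3 looks like a recovery. With both measures side by side, each transition is flagged as a case where the two stories disagree, which is exactly the kind of signal that calls for a closer look rather than an automatic fix.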

Please know that I am not arguing against the benefit of quantitative data in learning organizations. These data sets can indeed open our eyes to important patterns, trends, and relationships. They are quite valuable. Instead, I’m suggesting that we put careful thought and planning into what data we collect and how we collect them, and that we do the hard work of identifying measures that will give us the most complete and accurate picture. We want the complete (or as complete as possible) story. We want to see human faces in the data. This will help us use the data to make decisions that will truly support our organizational mission, vision, values and goals.

Self-reported data in course evaluations have any number of limitations, as pointed out by Stark and Freishtat. The ratings do not mean the same thing to all students. What one student considers “excellent” may only be “very good” to another student. What one student considers “very challenging” may be “not very challenging” to another. Given this reality, what do the averages tell us?
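A small sketch in Python shows one way an average can hide what students actually said. The two sets of ratings below are hypothetical, invented only for illustration, and the `summarize` helper is my own naming rather than anything from the paper.

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings on a 1-5 scale, invented for illustration.
section_a = [3, 3, 3, 3, 3, 3]  # every student gives a middling rating
section_b = [1, 1, 1, 5, 5, 5]  # students are sharply divided


def summarize(ratings):
    """Return the average rating and how many students chose each value."""
    return mean(ratings), dict(sorted(Counter(ratings).items()))


mean_a, dist_a = summarize(section_a)
mean_b, dist_b = summarize(section_b)

print("Section A:", mean_a, dist_a)
print("Section B:", mean_b, dist_b)
```

Both sections report the same average, yet one describes a uniformly lukewarm class and the other a class split between the lowest and highest ratings. Looking at the distribution, not just the average, is a first step toward seeing the people behind the number.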

As Stark and Freishtat explain,

To a great extent, this is what we do with student evaluations of teaching effectiveness.
We do not measure teaching effectiveness. We measure what students say, and pretend
it’s the same thing. We dress up the responses by taking averages to one or two decimal
places, and call it a day (p. 6).

In the end, I must confess that I am favorably disposed toward Stark and Freishtat’s work because it affirms my own values and convictions. They conclude that a better way of evaluating teacher effectiveness is one that includes observations, narrative feedback, and artifacts as evidence of teacher effectiveness, along with insights gleaned from course evaluations (p. 11). This sort of triangulation tells a story. It puts a face on the data. It provides context and something from which a teacher can more readily learn. The problem is that this takes more time and effort. Yet, if we truly want to create key performance indicators for our learning organizations, and we genuinely want to know how we are doing with regard to those indicators, then it requires this type of work. And from another perspective, what example do learning organizations set for students if the people in those organizations set up an entire system of measurement based upon cutting corners and doing what is easy and available?