March 26, 2020

Creating Fair Student Evaluations of Teaching

This content was previously published by Campus Labs, now part of Anthology. Product and/or solution names may have changed.

In the recent Campus Intelligence blog post "Evaluating Teaching Appropriately," we recommended that educators choose valid and reliable measures when evaluating teaching. However, in an article published in the February 2020 issue of Assessment & Evaluation in Higher Education, Justin Esarey and Natalie Valdes of Wake Forest University argue that even valid and reliable student evaluations of teaching (SET) can still be unfair. Their point is that SET are not, by themselves, a very accurate indicator of teaching quality. Using SET as the only criterion in evaluations of teaching would therefore be inaccurate and, in the context of personnel decisions, unfair.

We couldn’t agree more. However, we would go further. Not only is it unfair, but it is also unprincipled. Using a single score—be it SET, SAT, GRE, or any other measure—as “the sole indicators to characterize an individual’s functioning [or] competence” (p. 71) violates Standard 3.18 of the Standards for Educational and Psychological Testing (AERA et al., 2014). Simply put, using one measure to make an important decision about a person’s career is the cardinal sin of psychological and educational measurement. Consequently, IDEA has been recommending for decades that multiple indicators (e.g., SET, teacher observations, and review of course documents) should be used to evaluate teaching effectiveness.

We also agree with the sensible recommendations that Esarey and Valdes make:

  • Remove the influence of non-instructional factors (i.e., class size, student motivation, work habits, background preparation) on SET scores by using regression adjustment. As Esarey and Valdes point out, Campus Labs’ IDEA Student Ratings of Instruction (SRI) system has been doing this for years.
  • Avoid over-reliance on SET scores and include multiple sources of evidence, including but not necessarily limited to SET, interviews with students, teaching observations, and peer review of instructional materials. Averaging across these can allow idiosyncratic variation to cancel out, reducing imprecision in estimates of true teaching performance.
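
The first recommendation, regression adjustment, can be illustrated with a minimal Python sketch. It is a generic ordinary-least-squares adjustment, not IDEA's actual model: the function name `adjust_scores`, the choice of covariates, and the data are all hypothetical. The idea is to estimate how much of each class's raw rating is predicted by non-instructional factors and then subtract that part, so every class is compared as if it had average class size and average student motivation.

```python
import numpy as np

def adjust_scores(raw, covariates):
    """Remove the linear effect of non-instructional covariates from raw ratings.

    raw: array of shape (n_classes,) holding raw class-mean SET scores.
    covariates: array of shape (n_classes, k), one column per factor.
    (Hypothetical helper for illustration; not IDEA's actual procedure.)
    """
    raw = np.asarray(raw, dtype=float)
    X = np.asarray(covariates, dtype=float)
    # Center covariates so the adjustment is relative to an "average" class
    # and the overall mean rating is left unchanged.
    Xc = X - X.mean(axis=0)
    # Design matrix: intercept column plus centered covariates.
    design = np.column_stack([np.ones(len(raw)), Xc])
    coef, *_ = np.linalg.lstsq(design, raw, rcond=None)
    # Subtract only the part of each rating predicted by the covariates.
    return raw - Xc @ coef[1:]

# Made-up data: columns are [class size, mean student motivation on a 1-5 scale].
covariates = np.array([[20, 4.5], [150, 2.8], [35, 3.9], [80, 3.1], [25, 4.2]])
raw_means = np.array([4.4, 3.6, 4.1, 3.9, 4.3])
adjusted = adjust_scores(raw_means, covariates)
print(np.round(adjusted, 2))
```

In this sketch, a class whose raw score is fully explained by its size and student motivation ends up at the overall mean, while a class that scores higher than its circumstances predict ends up above it.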

In addition to taking these actions, educators should be aware of other factors that can pose threats to fair and valid interpretations of SET scores:

  1. SET content
    There are many aspects of teaching effectiveness that students are unqualified to judge and that, therefore, should not be included in SET items. Some examples are the instructor’s knowledge of the subject matter and commitment to teaching, the quality of the course design and tests, and the appropriateness of goals, objectives, and course content.
  2. SET context
    The classroom context in which student ratings are collected can influence them. For example, giving students treats on the day of the course evaluation, or having the instructor present while students complete the ratings, could bias the results. Institutions should consider standardizing SET administration procedures. The submission process should also be secure so that students are assured of confidentiality and/or anonymity.
  3. Student characteristics
    Some students have not had the exposure to content, instruction, and knowledge that affords them the best opportunity to learn. Others are unmotivated to take the course, perhaps because it is required rather than an elective. Still others have relatively poor work habits. Those student qualities can negatively affect SET and yet are beyond instructors’ control, which is why IDEA controls for them in its adjusted scores.
  4. Course and class characteristics
    Average SET scores differ by field of study. Consequently, IDEA provides comparative scores for the instructor’s self-identified academic discipline, which enables instructors to compare their standard scores with those of others who teach in the same content area. Classes also differ in size, which is why IDEA also controls for enrollment in its adjusted scores.
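
To see why discipline-specific comparison matters, consider a short Python sketch. It assumes a T-score scale (mean 50, standard deviation 10), chosen here only for illustration; the function name `discipline_t_score` and the ratings are hypothetical and are not drawn from IDEA's data or scoring rules.

```python
from statistics import mean, pstdev

def discipline_t_score(instructor_score, discipline_scores):
    """Express a rating relative to others in the same discipline.

    Returns a T-score: 50 is the discipline average, and every 10 points
    is one standard deviation. (Illustrative scale; an assumption.)
    """
    mu = mean(discipline_scores)
    sigma = pstdev(discipline_scores)
    if sigma == 0:
        raise ValueError("discipline scores have no spread; cannot standardize")
    return 50 + 10 * (instructor_score - mu) / sigma

# Made-up class-mean ratings for two disciplines.
humanities = [4.2, 4.4, 4.6, 4.0, 4.3]
engineering = [3.6, 3.8, 4.0, 3.5, 3.7]

# The same raw rating of 4.0 reads differently in each comparison group.
print(round(discipline_t_score(4.0, humanities), 1))
print(round(discipline_t_score(4.0, engineering), 1))
```

With these made-up numbers, a raw 4.0 falls below the humanities average but above the engineering average, which is the kind of difference a single institution-wide comparison would hide.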

Ultimately, educators can minimize unfairness in SET through the ways they design and administer them. First, apply principles of universal design, making SET usable for all students regardless of gender, age, language background, culture, or disability. Second, employ standardized administration, scoring, and security procedures so that scores will not be unduly influenced by extraneous factors. Third, provide evidence of reliability and validity, which is something IDEA has been doing since its inception. Finally, and most importantly, never rely on SET scores as the sole source of evidence to characterize an individual’s teaching competence. Instead, use multiple information sources, including instructor self-ratings, course documents, student products of achievement, and peer observations.
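
The case for multiple information sources rests on the statistical point made earlier: averaging across indicators lets idiosyncratic variation cancel out. A small simulation can illustrate this. It is a sketch under strong assumptions: every source measures the same underlying teaching quality, and each adds independent noise of equal size. The sample size, noise level, and number of sources are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_instructors, n_sources = 2000, 4

# Simulated "true" teaching quality, which is never observed directly.
true_quality = rng.normal(0.0, 1.0, size=n_instructors)
# Assumption: each source (SET, observation, and so on) equals true quality
# plus its own independent error of equal size.
noise = rng.normal(0.0, 1.0, size=(n_instructors, n_sources))
indicators = true_quality[:, None] + noise

def rmse(estimate, truth):
    """Root-mean-square error of an estimate against the true values."""
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

single_error = rmse(indicators[:, 0], true_quality)
composite_error = rmse(indicators.mean(axis=1), true_quality)
print(f"single source: {single_error:.2f}")
print(f"average of {n_sources} sources: {composite_error:.2f}")
```

Under these assumptions, the error of the composite shrinks roughly with the square root of the number of sources. Errors that the sources share, such as a bias common to all of them, would not cancel this way, which is one reason to combine different kinds of evidence.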


  1. American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing (U.S.). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  2. Esarey, J., & Valdes, N. (2020). Unbiased, reliable, and valid student evaluations can still be unfair. Assessment & Evaluation in Higher Education, DOI:

Dan Li, Ph.D.

Data Scientist

Dan Li is a Data Scientist at Anthology (formerly Campus Labs). From 2011 to 2019 she was a researcher and data analyst at The IDEA Center, where her work focused on student ratings of instruction in higher education. She holds a B.A. from Huazhong University of Science and Technology, an M.A. from Marquette University, and a Ph.D. in Media, Technology, and Society from Northwestern University. Her previous research examined the social effects of online technologies, digital inequality, and parental mediation of television viewing.

Steve Benton, Ph.D.

Data Scientist

Steve Benton, Ph.D., is a data scientist on the Campus Labs data science team. Previously, he was Senior Research Officer at The IDEA Center, where, from 2008 to 2019, he led a research team that designed and conducted reliability and validity studies for IDEA products. He is also Emeritus Professor and Chair of Special Education, Counseling, and Student Affairs at Kansas State University, where he served from 1983 to 2008. His areas of expertise include student ratings of instruction, teaching and learning, and faculty development and evaluation. Steve received his Ph.D. in Psychological and Cultural Studies from the University of Nebraska-Lincoln, which presented him with the Alumni Award of Excellence in 1997. He is a Fellow of the American Psychological Association and the American Educational Research Association.