03/30/2013 | W. JAMES POPHAM
“What’s needed is a way that human judgments, based on thoughtful evidence weighing, can play a prominent role in teacher evaluation.”
The two federal initiatives that led to today’s avalanche of new state-level teacher evaluations are 2009’s Race to the Top program (RTT) and 2011’s ESEA Flexibility Program. RTT provided substantial grants to cash-strapped states, and later to school districts, if they installed serious educational reform programs including more rigorous teacher evaluations. A pivotal feature of these teacher evaluation programs is the use of multiple measures in which student growth must be a substantial determiner of a teacher’s quality. The lure of RTT largesse caused many educational policymakers to seriously alter their state’s teacher evaluations, sometimes even legislatively, so that their states would be more likely to receive an RTT grant.
The ESEA Flexibility Program proved equally enticing to educational authorities in many states, for if a state were granted a federal waiver under this program, then the state could evade the increasingly negative sanctions flowing from the No Child Left Behind Act, the most recent incarnation of 1965’s Elementary and Secondary Education Act (ESEA). Because those sanctions were becoming especially onerous as that law’s unrealistically demanding timelines neared their conclusions, many states pursued ESEA waivers with zeal. As was the case with RTT, to receive such a waiver, states were directed to revise their teacher evaluation procedures so that multiple measures would be the factors used to evaluate teachers, with student growth important among those factors.
Clearly, educational policymakers in many states have succumbed to the prospect of RTT dollars or ESEA leniency. In education, as in most settings, both carrots and sticks can produce results. As noted earlier, these federally spawned teacher evaluation systems must rely on multiple sources of evidence. And that’s why attempts to create people-proof teacher evaluation systems are so wrong-headed.
We can understand why federal officials would advocate the use of multiple measures to evaluate workers who are engaged in a complex and nuanced task such as teaching. When determining a teacher’s instructional ability, for example, it is altogether reasonable to make use of diverse evidence sources such as students’ test performances, classroom observations of teachers in action, administrator ratings of a teacher’s effectiveness, and students’ ratings of the teacher. Each of these sources of evidence can make distinctive and useful contributions to determining a teacher’s ability. Certain of those evidence sources, however, ought to be given greater evaluative weight than others. And that’s where human judgment comes in.
Let’s look at evidence of student growth as an example. We usually determine student growth by using students’ test performances collected via some sort of pre-instruction versus post-instruction design. To illustrate, we might analyze students’ scores on statewide accountability tests by comparing the scores of a teacher’s students on this year’s end-of-school state tests with those same students’ scores on last year’s end-of-school state tests. In addition, however, a teacher might also collect pretest and posttest evidence by using district-developed tests or relying on teacher-made classroom assessments. The evaluative significance that we assign to the resulting evidence should be determined not only by the quality of the tests themselves, but also by the similarity of the conditions associated with their administration. Similarly, care should be taken to verify that, during instruction itself, there was an absence of “item teaching,” that is, test preparation in which teachers coach students on how to answer items actually found in the upcoming posttest.
In short, many factors can influence the weight we should ascribe to any evaluative evidence. And this applies not just to the category of evidence (such as student ratings versus classroom observations) but also to the significance of particular instances of that category’s evidence. To illustrate, if the classroom observations of a specific teacher have been few in number and collected by untrained observers using poorly conceived observation forms, the resulting observation evidence should be accorded less significance than if the teacher had been observed more frequently by well-trained observers using carefully refined observation forms.
But who should make these judgments about the evaluative import of the multiple sources of evidence available for a particular teacher? Should it be a state department of education staff whose familiarity with a given teacher’s instructional setting is nonexistent? Or should it be individuals closer to the teacher actually being evaluated?
At the moment, teacher evaluation systems devised by officials in many states seem intent on reducing local educators to the role of mere evidence-collectors. Decisions about how heavily to weigh certain sorts of evidence have already been made at the state level. Not only are local district administrators told how much weight should be accorded to such measures of student growth as students’ scores on state accountability tests, but the weights are also spelled out for students’ scores on classroom or other assessments. State-assigned weights are also specified for other evidence sources such as classroom observations or students’ attitudinal shifts. In short, these teacher evaluation systems are clearly devised to minimize local educators’ evaluative judgments.
But there are many instances in which certain evidence indicative of a particular teacher’s quality should be given more, or less, evaluative weight. For example, suppose that in a small elementary school, a particularly weak third-grade teacher had effectively “turned off” a cohort of students so that those children became remarkably negative not only toward school but, more importantly, toward learning itself. Suppose, further, that the next year’s fourth-grade teacher who received these disaffected students was able to collect data indicating a complete turnaround in those students’ attitudes, so that, at the close of the fourth grade, those same children were once more enthusiastic about school. Shouldn’t the evidence regarding this attitudinal shift be given heavier-than-usual evaluative weight for this teacher?
What’s needed, therefore, is a way that human judgments, based on thoughtful evidence weighing, can play a prominent role in teacher evaluation. Will human beings ever arrive at incorrect judgments about certain teachers’ abilities? Of course they will! No teacher evaluation system will ever be foolproof. However, teacher evaluation systems in which properly trained teacher evaluators render their best judgments based on multiple evidence sources will almost always produce fewer mistakes than will any people-proof evaluations.
The new evaluation system devised by Washington State is called the Teacher and Principal Evaluation System. Although many elements of this system are predetermined, the evaluation of Washington teachers hinges heavily on the way principals weigh the evaluative data associated with a given teacher. For this system — and similar systems — to succeed, it becomes imperative for the state’s principals to become well versed in the reasons that certain evidence should be given greater or lesser evaluative weight. Typically, principals have not been trained in how to evaluate the persuasiveness of different sorts of evaluative evidence. But teacher evaluators need such training and, if resources permit, certification that they are capable of weighing evaluative evidence sensibly.
People-proof evaluations of teachers, whenever multiple measures are involved, won’t work well. But neither will evaluative approaches featuring human judgment if the judges haven’t been taught how to carry out such procedures appropriately.