Policy Implications of Long-Term Teacher Effects on Student Achievements
Karen L. Bembry, Heather R. Jordan, Elvia Gomez
Longitudinal Teacher Effects Results
Bias. An analysis of bias was conducted for each cohort on four selected factors: Teacher effectiveness level by teacher ethnicity, student achievement quartiles by teacher level, student ethnicity by student quartiles, and teacher level by teacher ethnicity. All analyses were conducted for each year within each cohort. Results are presented in Table 2. For these analyses, teacher effectiveness indices were divided into thirds.
A one-way analysis of variance was used to determine whether effectiveness levels were biased when examined by teacher ethnicity. Sample sizes for each ethnic group varied widely within each year and each cohort. However, the one-way analysis of variance for testing the equality of means is a robust procedure despite disparate sample sizes. In nearly every case, the analysis indicated a statistically significant difference among the means of the ethnic groups. Further investigation, including pairwise comparisons of group means using Duncan's Multiple-Range Test and an effect size analysis (Cohen's effect size measure; see Stevens, 1992), was conducted to determine whether the differences were large enough to be of practical significance. More than 75% of the statistically significant differences between groups, as indicated by Duncan's Multiple-Range Test, had effect sizes ranging from small to medium, indicating that the differences were of little practical significance relative to the variance of the indices. In the remaining cases of statistically significant differences, the sample sizes for the groups involved were either extremely large, comprising more than 50% of the entire population, or extremely small, less than 4% of the entire population. Overall, none of the cases involved differences of practical significance.
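To make the distinction between statistical and practical significance concrete, the following minimal sketch computes a standardized mean difference for one pair of groups. It assumes the pooled-standard-deviation form of Cohen's d and Cohen's conventional 0.2 / 0.5 / 0.8 benchmarks; the paper does not state which variant or cutoffs were used, and the function names and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation.

    Assumption: this is the textbook pooled-SD form of Cohen's d, which
    may differ from the exact measure used in the paper.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def effect_label(d):
    """Label an effect size with Cohen's conventional benchmarks (assumed)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical effectiveness indices for two teacher groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, effect_label(d))
```

Because d divides the mean difference by the spread of the scores, it does not grow with sample size the way a test statistic does, which is why it is useful alongside a significance test.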
The remaining three factors were tested for bias using the chi-square test for contingency tables. Again, the analyses for nearly every cohort indicated a statistically significant difference between the observed and expected frequencies within each contingency table. However, every cohort had a sample size greater than 2800, so even very small differences could have been found statistically significant. To determine whether the observed differences were of practical significance, Cramér's V was calculated for each case as a measure of association, which is uninfluenced by sample size. Using Cramér's V and the residuals for each case, it was determined that for V ≤ 0.18 the difference was of no practical significance, for V between 0.19 and 0.24 there was a partial bias, and for V ≥ 0.25 there was a bias.
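The relationship between the chi-square statistic, sample size, and Cramér's V can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' procedure: the table values are invented, V is rounded to two decimals before the cutoffs are applied (the text leaves values between 0.18 and 0.19 undefined), and the residual inspection mentioned above is omitted.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))


def classify_bias(v):
    """Apply the cutoffs from the text; rounding to 2 decimals is an assumption."""
    v = round(v, 2)
    if v >= 0.25:
        return "B"   # bias
    if v >= 0.19:
        return "pb"  # partial bias
    return "u"       # unbiased, no practical significance


# Hypothetical 2 x 2 table with n = 4000. The chi-square statistic exceeds
# 3.84, the 0.05 critical value for 1 degree of freedom, yet V is very small.
table = [[1050, 950], [950, 1050]]
print(chi_square(table), cramers_v(table), classify_bias(cramers_v(table)))
```

Dividing the chi-square statistic by n is what removes the dependence on sample size: a table with the same proportions and ten times as many cases has ten times the chi-square value but the same V.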
The analysis of student quartiles by teacher thirds indicated that no bias was present for the reading cohorts. For the math cohorts, the analyses indicated a bias as students entered the 7th and 8th grades: high-ranking students were placed with high-ranking teachers more frequently than expected. There was a partial bias throughout most of the reading cohorts when looking at student quartiles by student ethnicity. The math cohorts, on the other hand, were fairly unbiased throughout. Lastly, teacher thirds by teacher ethnicity were unbiased in all cohorts.

B = bias (Cramér's V ≥ 0.25)
pb = partial bias (0.19 ≤ Cramér's V ≤ 0.24)
U* = unbiased statistically
u = unbiased, no practical significance (Cramér's V ≤ 0.18 or small to medium effect sizes)
ANCOVA. All cohorts were first analyzed with an analysis of covariance using the pretest score as the covariate, the level of the teacher for each year as a blocking variable to form three or four analysis groups, and the three or four years of test scores as the dependent variable. The analysis of covariance showed overwhelming main effects. Results of the analyses are summarized in Table 3. The data in Table 3 show clearly that the adjusted main effects are quite large (the F statistics for the 1997 main effect are included). However, there were three problems with the analysis. First, in most analyses there was a statistically significant, although practically insignificant, interaction among the levels of the three or four analysis groups. For each cohort, the largest of the interaction effects is included in Table 3. Inspection shows that the interactions were not a serious practical problem, since the main effects were so large relative to the interactions, but the interactions were still statistically significant. Second, the tests of the regression slopes showed significant differences in slopes between the analysis groups in most of the analyses. This was more of a problem, since the adjustment in analysis of covariance depends on approximately equal slopes. (However, over the range of the variables, the size of the main effects again makes this issue less compelling from a practical standpoint.) Finally, there were large differences in the size of some subgroups, which can be explained in part by the bias found in the contingency analysis of student ability by teacher effectiveness. Subgroup sizes ranged from 4 to 111, with most groups having at least 20 cases. (As noted, complete tables with subgroup means and sizes are contained in Mendro, Jordan, Gomez, Bembry, and Anderson (1998).) This brought into question the stability of the estimates derived from the analysis of covariance.
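The equal-slopes concern can be illustrated with a small sketch of how covariance-adjusted means are formed. It is a simplified single-covariate, single-factor version with invented data, not the multi-year blocked design analyzed here; the function names are hypothetical. Each group mean is adjusted with one pooled within-group slope, so when the groups' own slopes differ, that single adjustment fits none of them well.

```python
from statistics import mean


def slope_parts(x, y):
    """Return (Sxy, Sxx): cross-product and sum of squares about the means."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy, sxx


def ancova_adjusted_means(groups):
    """groups maps a group name to (pretest scores, posttest scores).

    Returns per-group slopes, the pooled within-group slope, and the
    posttest means adjusted to the grand pretest mean.
    """
    parts = {g: slope_parts(x, y) for g, (x, y) in groups.items()}
    group_slopes = {g: sxy / sxx for g, (sxy, sxx) in parts.items()}
    pooled = sum(p[0] for p in parts.values()) / sum(p[1] for p in parts.values())
    grand_x = mean([a for x, _ in groups.values() for a in x])
    adjusted = {
        g: mean(y) - pooled * (mean(x) - grand_x)
        for g, (x, y) in groups.items()
    }
    return group_slopes, pooled, adjusted


# Hypothetical data: both groups have the same raw posttest mean (4),
# but different pretest means and different within-group slopes.
groups = {
    "A": ([1, 2, 3], [2, 4, 6]),
    "B": ([2, 3, 4], [3, 4, 5]),
}
slopes, pooled, adjusted = ancova_adjusted_means(groups)
print(slopes, pooled, adjusted)
```

In this toy case the group slopes are 2.0 and 1.0 while the pooled slope is 1.5, so the adjusted means depend on a slope that describes neither group exactly.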
These problems suggested that a different analysis be used for determining subgroup effect sizes. The investigators determined that a hierarchical linear model would be more appropriate for the analysis of group effect sizes. First, by analyzing group effects directly for each group, the analysis of effect sizes is altered, but the presence of an interaction effect is now irrelevant, regardless of the relative size of the interactions. Next, HLM estimates subgroup regression slopes directly for each group, eliminating the equal slope concern. Most important, though, HLM adjusts for the differing sample sizes through shrinkage.
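The shrinkage adjustment can be sketched with the textbook empirical Bayes estimate for a random-intercept model, in which each group mean is pulled toward the grand mean in proportion to its unreliability. This is an assumption-laden illustration rather than the model fitted in this study: the variance components and means are invented, and only the subgroup sizes 4, 20, and 111 are taken from the range reported above.

```python
def reliability(n, tau2, sigma2):
    """Reliability of a group mean: tau2 / (tau2 + sigma2 / n).

    tau2   -- between-group variance (hypothetical value below)
    sigma2 -- within-group variance (hypothetical value below)
    """
    return tau2 / (tau2 + sigma2 / n)


def shrunken_mean(group_mean, grand_mean, n, tau2, sigma2):
    """Empirical Bayes estimate: a reliability-weighted average of the
    observed group mean and the grand mean."""
    lam = reliability(n, tau2, sigma2)
    return lam * group_mean + (1.0 - lam) * grand_mean


# A group mean of 60 against a grand mean of 50, for three subgroup sizes.
for n in (4, 20, 111):
    print(n, reliability(n, 25.0, 100.0), shrunken_mean(60.0, 50.0, n, 25.0, 100.0))
```

With these invented variances, the 4-case group has a reliability of 0.5 and its estimate moves halfway to the grand mean, while the 111-case group keeps almost all of its observed difference.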
Each cohort was analyzed using an HLM procedure with pretest and 1997 group means modeled at the first level and slope and intercepts modeled with a random effects model at the second level. Summary statistics from the HLM analyses are presented in Table 4. Inspection of Table 4 shows that the intercept estimates are all sufficiently reliable. From this analysis, empirical Bayes residuals were used to model group effects and effect sizes were converted to estimated NCE values. The estimated effect sizes for the 4-year groups are presented in Table 5 and for the 5-year groups in Table 6.




Analysis of the effect sizes can be shown in terms of actual probabilities of effect sizes being in specific ranges of the distribution based on conditional teacher levels. For this discussion Table 7 presents data for the 5-year cohorts and shows the probability of falling in the top 20% of groups given a level 1 teacher and the probability of falling in the bottom 20% of groups with a level 3 teacher. The probability of a given level in a group is .81. As the data in Table 7 show clearly, this probability is reduced nearly in half for the conditions noted. More plainly, with at least one level 1 teacher, a student has approximately half the chance of an effect size in the top 20%. With at least one level 3 teacher, a student has approximately half the chance of an effect size falling in the bottom 20%.
Table 8 presents similar data for the 4-year cohorts. Here the overall reduction in probability is a little over one half. By chance, 12 subgroups, or .48 of the subgroups, will have a level 1 teacher in them, and .48 will have a level 5 teacher in them. Now the average probability is about .2, or about 40% of the original .48.
The tables also show the longitudinal effects of poor teachers, a result found by Sanders and Rivers (1996) and Jordan, Mendro, and Weerasinghe (1997). In both Tables 7 and 8, if a level 1 teacher is encountered in the first year and the highest level teachers in the remaining years (i.e., 1333 or 155), the effect size of the group with the level 1 teacher never exceeds that of the group with all high level teachers (i.e., 3333 and 555). In only 1 case out of 18 does this effect size equal that of the highest group. This relates to a false sense of confidence among principals, who assume that if a child is placed with a poor teacher, placing that child with a good teacher the next year will make up the difference. In light of these data, this is a false hope. After two years of good teachers, students have not yet made it back to the top. After three years of good teachers, they have only a 1 in 8 chance of equaling the top.
More telling from the standpoint of convincing the general public (and principals) is the collection of raw group means for a sample of subgroups. For these analyses, the investigators selected pairs of groups with approximately equal pretest means, one with relatively low-level teachers and one with relatively high-level teachers. (The bias of high ability students with high ability teachers makes this task harder than it looks.) The raw NCE group means for the first and last years were then computed for each group and converted to percentiles. Our experience is that these raw data present a more powerful picture to the general public than any effect size analysis. Table 9 has a collection of these subgroups, two from each cohort.
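The NCE-to-percentile conversion mentioned above can be sketched using the standard definition of the normal curve equivalent scale, NCE = 50 + 21.06 z, where z is the normal deviate of the percentile rank. The function names are hypothetical, and the paper does not describe whether a formula or a published lookup table was used for its own conversions.

```python
from statistics import NormalDist

NCE_SD = 21.06  # scaling constant that aligns NCE 1, 50, 99 with percentiles 1, 50, 99


def nce_to_percentile(nce):
    """Convert a normal curve equivalent score to a percentile rank."""
    z = (nce - 50.0) / NCE_SD
    return 100.0 * NormalDist().cdf(z)


def percentile_to_nce(percentile):
    """Inverse conversion; percentile must lie strictly between 0 and 100."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return 50.0 + NCE_SD * z


# Hypothetical group means on the NCE scale.
for nce in (35.0, 50.0, 65.0):
    print(nce, round(nce_to_percentile(nce), 1))
```

The two scales agree only at 1, 50, and 99. Elsewhere equal NCE differences correspond to unequal percentile differences, which is one reason means are computed on the NCE scale and converted to percentiles only for presentation.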
Table 10 presents the average NCE means for 1997 by student quartile and teacher level for the R5-5, M5-8, R4-4, and M4-8 cohorts. These data are presented to show the difference in NCE means by teacher effectiveness level when the bias in student ability by level is partially controlled. Clearly, the data indicate a considerable difference in most groups between the teachers at different levels.

