From Economic Policy Institute:

Education policymakers and analysts express great concern about the performance of U.S. students on international tests. Education reformers frequently invoke the relatively poor performance of U.S. students to justify school policy changes.

In December 2012, the International Association for the Evaluation of Educational Achievement (IEA) released national average results from the 2011 administration of the Trends in International Mathematics and Science Study (TIMSS). U.S. Secretary of Education Arne Duncan promptly issued a press release calling the results “unacceptable,” saying that they “underscore the urgency of accelerating achievement in secondary school and the need to close large and persistent achievement gaps,” and calling particular attention to the fact that the 8th-grade scores in mathematics for U.S. students failed to improve since the previous administration of the TIMSS.

Two years earlier, the Organization for Economic Cooperation and Development (OECD) released results from another international test, the 2009 administration of the Program for International Student Assessment (PISA). Secretary Duncan’s statement was similar. The results, he said, “show that American students are poorly prepared to compete in today’s knowledge economy. … Americans need to wake up to this educational reality—instead of napping at the wheel while emerging competitors prepare their students for economic leadership.” In particular, Duncan stressed results for disadvantaged U.S. students: “As disturbing as these national trends are for America, enormous achievement gaps among black and Hispanic students portend even more trouble for the U.S. in the years ahead.”

However, conclusions like these, which are often drawn from international test comparisons, are oversimplified, frequently exaggerated, and misleading. They ignore the complexity of test results and may lead policymakers to pursue inappropriate and even harmful reforms.

Both TIMSS and PISA eventually released not only the average national scores on their tests but also a rich international database from which analysts can disaggregate test scores by students’ social and economic characteristics, their school composition, and other informative criteria. Such analysis can lead to very different and more nuanced conclusions than those suggested by average national scores alone. For some reason, however, although TIMSS released its average national results in December, it scheduled release of the international database for five weeks later. This puzzling strategy ensured that policymakers and commentators would draw quick and perhaps misleading interpretations from the results. This is especially the case because analysis of the international database takes time, and headlines from the initial release are likely to have hardened into conventional wisdom by the time scholars have had the opportunity to complete a careful study.

While we await the release of the TIMSS international database, this report describes a detailed analysis we have conducted of the 2009 PISA database. It offers a different picture of the 2009 PISA results than the one suggested by Secretary Duncan’s reaction to the average national scores of the United States and other nations.

Because of the complexity and size of the PISA international database, this report’s analysis is restricted to the comparative test performance of adolescents in the United States, in three top-scoring countries, and in three other post-industrial countries similar to the United States. These countries are illustrative of those with which the United States is usually compared. We compare the performance of adolescents in these seven countries who have similar social class characteristics. We compare performance in the most recent test for which data are available, as well as trends in performance over nearly two decades.

In general, we find that test data are too complex, and too often reported in oversimplified form, to permit meaningful policy conclusions regarding U.S. educational performance without deeper study of test results and methodology. However, a clear set of findings stands out and is supported by all data we have available:

Because social class inequality is greater in the United States than in any of the countries with which we can reasonably be compared, the relative performance of U.S. adolescents is better than it appears when countries’ national average performance is conventionally compared.

  • Because in every country, students at the bottom of the social class distribution perform worse than students higher in that distribution, U.S. average performance appears to be relatively low partly because we have so many more test takers from the bottom of the social class distribution.
  • A sampling error in the U.S. administration of the most recent international (PISA) test resulted in students from the most disadvantaged schools being over-represented in the overall U.S. test-taker sample. This error further depressed the reported average U.S. test score.
  • If U.S. adolescents had a social class distribution that was similar to the distribution in countries to which the United States is frequently compared, average reading scores in the United States would be higher than average reading scores in the similar post-industrial countries we examined (France, Germany, and the United Kingdom), and average math scores in the United States would be about the same as average math scores in similar post-industrial countries.
  • A re-estimated U.S. average PISA score that adjusted for a student population in the United States that is more disadvantaged than populations in otherwise similar post-industrial countries, and for the over-sampling of students from the most-disadvantaged schools in a recent U.S. international assessment sample, finds that the U.S. average score in both reading and mathematics would be higher than official reports indicate (in the case of mathematics, substantially higher).
  • This re-estimate would also improve the U.S. place in the international ranking of all OECD countries, bringing the U.S. average score to fourth in reading and 10th in math. Conventional ranking reports based on PISA, which make no adjustments for social class composition or for sampling errors, and which rank countries irrespective of whether score differences are large enough to be meaningful, report that the U.S. average score is 14th in reading and 25th in math.
  • Disadvantaged and lower-middle-class U.S. students perform better (and in most cases, substantially better) than comparable students in similar post-industrial countries in reading. In math, disadvantaged and lower-middle-class U.S. students perform about the same as comparable students in similar post-industrial countries.
  • At all points in the social class distribution, U.S. students perform worse, and in many cases substantially worse, than students in a group of top-scoring countries (Canada, Finland, and Korea). Although controlling for social class distribution would narrow the difference in average scores between these countries and the United States, it would not eliminate it.
  • U.S. students from disadvantaged social class backgrounds perform better relative to their social class peers in the three similar post-industrial countries than advantaged U.S. students perform relative to their social class peers. But U.S. students from advantaged social class backgrounds perform better relative to their social class peers in the top-scoring countries of Finland and Canada than disadvantaged U.S. students perform relative to their social class peers.
  • On average, and for almost every social class group, U.S. students do relatively better in reading than in math, compared to students in both the top-scoring and the similar post-industrial countries.
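The adjustment described above is, in essence, a direct standardization: each social class group’s average score is reweighted using a reference population’s group shares rather than the country’s own. The sketch below illustrates the arithmetic with entirely hypothetical group means and shares (these are not the report’s figures):

```python
# Direct standardization: reweight social class group means by a
# reference country's class distribution. All numbers below are
# hypothetical, for illustration only -- not the report's data.

def standardized_mean(group_means, reference_shares):
    """Weighted average of group means using reference population shares."""
    assert abs(sum(reference_shares.values()) - 1.0) < 1e-9
    return sum(group_means[g] * reference_shares[g] for g in group_means)

# Hypothetical U.S. group means (PISA-like scale) and two distributions
us_means = {"low": 450, "middle": 500, "high": 550}
us_shares = {"low": 0.40, "middle": 0.40, "high": 0.20}   # more disadvantage
ref_shares = {"low": 0.25, "middle": 0.45, "high": 0.30}  # comparison country

raw = standardized_mean(us_means, us_shares)        # 490.0
adjusted = standardized_mean(us_means, ref_shares)  # 502.5
print(raw, adjusted)
```

Because the lowest group scores lowest in every country, shifting weight away from that group raises the standardized average; this is the mechanism by which the re-estimated U.S. score rises once social class composition is held constant.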

Because countries’ social class composition, as well as their educational effectiveness, changes over time, comparisons of test score trends by social class group give policymakers more useful information than comparisons of total average test scores at one point in time, or even of changes in total average test scores over time.

  • The performance of the lowest social class U.S. students has been improving over time, while the performance of such students in both top-scoring and similar post-industrial countries has been falling.
  • Over time, in some middle and advantaged social class groups where U.S. performance has not improved, comparable social class groups in some top-scoring and similar post-industrial countries have had declines in performance.

Performance levels and trends in Germany are an exception to the trends just described. Average math scores in Germany would still be higher than average U.S. math scores, even after standardizing for a similar social class distribution. Although the performance of disadvantaged students in the two countries is about the same, lower-middle-class students in Germany perform substantially better than comparable social class U.S. students. Over time, scores of German adolescents from all social class groups have been improving, and at a faster rate than U.S. improvement, even for social class groups and subjects where U.S. performance has also been improving. But the causes of German improvement (concentrated among immigrants and perhaps also attributable to East and West German integration) may be idiosyncratic, offering no lessons for other countries and no reliable prediction of future trends. Whether German rates of improvement can be sustained to the point where that country’s scores by social class group uniformly exceed those of the United States remains to be seen. As of 2009, this was not the case.

Great policy attention in recent years has been focused on the high average performance of adolescents in Finland. This attention may be justified, because both math and reading scores in Finland are higher for every social class group than in the United States. However, Finland’s scores have been falling for the most disadvantaged students while U.S. scores have been improving for similar social class students. This should lead to greater caution in applying presumed lessons from Finland. At first glance, it may seem that the decline in scores of disadvantaged students in Finland results in part from a recent influx of lower-class immigrants. However, average scores for all social class groups have been falling in Finland, and the gap in scores between Finland and the United States has narrowed in each social class group. Further, during the same period in which scores for the lowest social class group have declined, the share of all Finnish students in this group has also declined, which should have made the national challenge of educating the lowest social class students more manageable, so immigration is unlikely to provide much of the explanation for declining performance.

Although this report’s primary focus is on reading and mathematics performance on PISA, it also examines mathematics test score performance in earlier administrations of the TIMSS. Where relevant, we also discuss what can already be learned from the limited information now available from the 2011 TIMSS. To help with the interpretation of these PISA and TIMSS data, we also explore reading and mathematics performance on two forms of the U.S. domestic National Assessment of Educational Progress (NAEP).

Relevant complexities are too often ignored when policymakers draw conclusions from international comparisons. Different international tests yield different rankings among countries and over time. PISA, TIMSS, and NAEP all purport to reflect the achievement of adolescents in mathematics (and PISA and NAEP in reading), yet results on different tests can vary greatly—in the most extreme cases, countries’ scores can go up on one test and down on another that purports to assess the same students in the same subject matter—and scholars have not investigated what causes such discrepancies. These differences can be caused by the content of the tests themselves (for example, differences in the specific skills that test makers consider to represent adolescent “mathematics”) or by flaws in sampling and test administration. Because these differences are revealed in even the most cursory examination of test results, policymakers should exercise greater caution in drawing policy conclusions from international score comparisons.

To arrive at our conclusions, we made a number of explicit and transparent methodological decisions that reflect our best judgment. Three are of particular importance: our definition of social class groups, our selection of comparison countries, and our determination of when differences in test scores are meaningful.

There is no clear way to divide test takers from different countries into social class groups that reflect comparable social background characteristics relevant to academic performance. For this report, we chose differences in the number of books in adolescents’ homes to distinguish them by social class group; we consider that children in different countries have similar social class backgrounds if their homes have similar numbers of books. We think that this indicator of household literacy is plausibly relevant to student academic performance, and it has been used frequently for this purpose by social scientists. We show in a technical appendix that supplementing it with other plausible measures (mother’s educational level, and an index of “economic, social, and cultural status” created by PISA’s statisticians) does not provide better estimates. Also influencing our decision is that the number of books in the home is a social class measure common to both PISA and TIMSS, so its use permits us to explore longer trend lines and more international comparisons. As noted, however, data on these background characteristics were not released along with the national average scores on the 2011 TIMSS, and so our information on the performance of students from different social class groups on TIMSS must end with the previous, 2007, test administration.
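The grouping procedure amounts to binning each student by a single household-literacy indicator. A minimal sketch of that binning is below; the cut points and group labels are hypothetical, chosen only to illustrate the approach, and are not the report’s actual categories:

```python
# Assign students to social class groups by books in the home.
# The bins and labels below are illustrative assumptions, not the
# report's exact categories.

BOOK_BINS = [
    (0, 10, "lowest"),
    (11, 25, "lower-middle"),
    (26, 100, "middle"),
    (101, 200, "upper-middle"),
    (201, float("inf"), "highest"),
]

def social_class_group(books_in_home):
    """Return the social class label for a reported book count."""
    for low, high, label in BOOK_BINS:
        if low <= books_in_home <= high:
            return label
    raise ValueError("book count must be non-negative")

print(social_class_group(15))   # lower-middle
print(social_class_group(500))  # highest
```

Students in different countries who fall in the same bin are then treated as having comparable social class backgrounds, which is what makes the cross-country group comparisons possible.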

In this report, we focus particularly on comparisons of U.S. performance in math and reading in PISA with performance in three “top-scoring countries” (Canada, Finland, and Korea) whose average scores are generally higher than U.S. scores, and with performance in three “similar post-industrial countries” (France, Germany, and the United Kingdom) whose scores are generally similar to those of the United States. We employed no sophisticated statistical methodology to identify these six comparison countries. Assembling and disaggregating data for this report was time consuming, and we were not able to consider additional countries. We think our choices include countries to which the United States is commonly compared, and we are reasonably confident that adding other countries would not appreciably change our conclusions. If other scholars wish to develop data for other countries, we would gladly offer them methodological advice.

Technical reports on test scores typically distinguish differences that are “significant” from those that are not. But this distinction is not always useful for policy purposes and is frequently misunderstood by policymakers. To a technical expert, a score difference can be minuscule but still “significant” if it can be reproduced 95 percent of the time when a comparison is repeated. But minuscule score differences should be of little interest to policymakers. In general, social scientists consider an intervention to be worthwhile if it improves a median subject’s performance enough to be superior to the performance of about 57 percent or more of all subjects prior to the intervention. Such an intervention should be considered “significant” for policy purposes, but, to avoid confusion, we avoid the term “significant” altogether. Instead, for PISA, we consider countries’ (or social class groups’) average scores to be “about the same” if they are less than 8 test scale points apart (even if this small difference would be repeated in 95 of 100 test administrations), to be “better” or “worse” if they are at least 8 but less than 18 scale points apart, and “substantially better” or “substantially worse” if they differ by 18 scale points or more. Eighteen scale points in most cases is approximately equivalent to the difference social scientists generally consider to be the minimum result of a worthwhile intervention (an effect size of about 0.2 standard deviations). The TIMSS scale is slightly different from the PISA scale; for TIMSS, the cut points used in this report are 7 and 17 rather than 8 and 18.
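The classification rule just described can be stated compactly. The sketch below implements the report’s stated cut points (8 and 18 scale points for PISA; 7 and 17 for TIMSS); the function name and interface are our own:

```python
# Classify a score difference using the report's cut points.
# PISA: < 8 points "about the same"; 8-17 "better"/"worse";
# >= 18 "substantially better"/"substantially worse".
# TIMSS uses 7 and 17 instead of 8 and 18.

def describe_difference(diff, test="PISA"):
    """Describe a score difference (positive = first country higher)."""
    small, large = (8, 18) if test == "PISA" else (7, 17)
    magnitude = abs(diff)
    if magnitude < small:
        return "about the same"
    direction = "better" if diff > 0 else "worse"
    if magnitude < large:
        return direction
    return "substantially " + direction

print(describe_difference(5))            # about the same
print(describe_difference(-12))          # worse
print(describe_difference(20))           # substantially better
print(describe_difference(10, "TIMSS"))  # better
```

Note that under this rule a difference can be statistically significant in the technical sense yet still classified as “about the same,” which is exactly the distinction the report draws between statistical and policy significance.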

With regard to these and other methodological decisions we have made, scholars and policymakers may choose different approaches. We are only certain of this: To make judgments only on the basis of statistically significant differences in national average scores, on only one test, at only one point in time, without regard to social class context or curricular or population sampling methodologies, is the worst possible choice. But, unfortunately, this is how most policymakers and analysts approach the field.

The most recent test for which an international database is presently available is PISA, administered in 2009. As noted, the database for TIMSS 2011 is scheduled for release later this month (January 2013). In December 2013, PISA will announce results and make data available from its 2012 test administration. Scholars will then be able to dig into TIMSS 2011 and PISA 2012 databases and place the publicly promoted average national results in proper context. The analyses that follow in this report should caution policymakers to await understanding of this context before drawing conclusions about lessons from TIMSS or PISA assessments. We plan to conduct our own analyses of these data when they become available, and publish supplements to this report as soon as it is practical to do so, given the care that should be taken with these complex databases.
