Using Data From a Level Testing System To Change A School District
G. Gage Kingsbury and Ronald L. Houser
Portland (OR) Public Schools
In 1977, the Portland (OR) Public School system made two significant changes to its student assessment system. First, it adopted the one-parameter logistic item response model (IRT: Lord and Novick, 1968; Lord, 1980), better known as the Rasch model (Rasch, 1960; Wright, 1977). Second, it began using paper-and-pencil tests which were targeted at students' expected achievement levels, better known as functional level tests (NWEA, 1997). These two adoptions have stayed with the district ever since, and are still being used today.
The Rasch model provides the school district with a model for measurement and item bank development. This model has been used to develop large item banks in Reading, Mathematics, Language Usage, and Science. These banks span the range of student performance from third grade through eighth grade, and are currently being expanded into the primary and high school grades. A cautious use of the Rasch model, with multiple sets of overlapping tests and frequent tests of dimensionality, has allowed the expansion from a single small set of test questions in each content area to the current banks which contain thousands of items in each area. The result is a consistent scale that measures both student performance and student growth.
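As a reminder of what the one-parameter logistic model asserts, the following minimal sketch computes the probability of a correct response from the difference between student achievement and item difficulty. It works in logits. The function name and the example values are illustrative assumptions and are not drawn from the Portland item banks; the conversion from logits to the RIT scale is not given in this paper and is left out.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model.

    theta: student achievement in logits (illustrative value)
    b:     item difficulty in logits (illustrative value)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An item matched to the student's level is answered correctly half the time.
p_matched = rasch_probability(0.0, 0.0)
# An item one logit harder than the student's level is answered correctly less often.
p_hard = rasch_probability(0.0, 1.0)
```

Because only the difference between achievement and difficulty enters the model, students and items can be placed on one common scale, which is what makes overlapping test forms linkable.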
Functional level testing simply involves creating and administering a set of test forms which differ in difficulty, but overlap to some extent. A given student takes a test form with appropriate difficulty, based on his or her past performance. Since a student takes test questions that are near to his or her achievement level, students tend to be challenged but not frustrated by the test. This is in sharp contrast to a wide-range test, which must include questions for low and high performing students in the same test. Through its emphasis on targeting student performance (and a retest cycle for students who are badly targeted), the functional level test provides student information almost as accurate as a computerized adaptive test, and substantially more accurate than a wide-range test of the same length.
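The accuracy advantage of a targeted form can be sketched with the test information function of the Rasch model, where each item contributes p(1 - p) to the information at a given achievement level. The two forms below are hypothetical (40 items each, difficulties in logits) and were invented for illustration; they are not Portland test forms.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta, difficulties):
    """Standard error of the achievement estimate at theta for a set of Rasch items."""
    information = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        information += p * (1.0 - p)   # item information under the Rasch model
    return 1.0 / math.sqrt(information)

n_items = 40
# Hypothetical wide-range form: difficulties spread evenly from -4 to +4 logits.
wide_form = [-4.0 + 8.0 * i / (n_items - 1) for i in range(n_items)]
# Hypothetical targeted form: difficulties spread evenly from -1 to +1 logits.
targeted_form = [-1.0 + 2.0 * i / (n_items - 1) for i in range(n_items)]

theta = 0.0  # a student whose achievement sits at the center of the targeted form
se_wide = standard_error(theta, wide_form)
se_targeted = standard_error(theta, targeted_form)
```

Under these assumed difficulties the targeted form yields a noticeably smaller standard error than the wide-range form of the same length, because fewer of its items are far too easy or far too hard for the student.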
Since Portland has been using functional level testing under the Rasch model, an abundance of data has been collected to indicate that the model performs adequately, and that the scales continue to be constant from year to year. While we continue to expand the item banks, and perform periodic dimensionality and drift studies, all evidence indicates that the measurement scales are providing the district with consistent, accurate data concerning student performance and growth. This information is in use on the classroom level, to guide instruction for individual students and to help teachers with instructional design.
That being said, we get to the more interesting issue of using the information that we are gathering to make suggestions to improve educational practice beyond the classroom level. Since Portland has a consistent record of student performance spanning two decades, one would hope that we can start making some comments about districtwide student performance. In addition, we should be able to use existing information about student growth to make some informed estimates of the impact of different types of educational "reform."
In this discussion, then, we will examine a very small portion of the data that is available from the testing system in Portland. First, we will make some comments concerning student performance that might be important to the district and of interest to other districts. Second, we will use this data to help predict the impact of two types of educational interventions: the first related to a standards-based education model and the second related to a growth-based model. We will use the existing data to predict the impact of these interventions on student performance in the district, and comment on some of the results with respect to the long range result of the interventions.
Figure 1: Mean mathematics RIT scores across sixteen years

Sixteen Years of Educational Growth
Figure 1 shows a historic record of sixteen years of Mathematics achievement by students in Portland in grades four through eight, as measured by the Portland Achievement Level Tests. Each bar indicates the mean growth of students in one calendar year, measured from Spring to Spring, where the bottom of the bar indicates the mean status in the previous Spring and the top of the bar indicates mean status during the Spring of the grade of interest. Each cluster of bars indicates the pattern of achievement in a particular grade across the sixteen years represented. Each bar is based on a group of over 2500 students, and most are based on 3000 to 4000 students.
One point to note is that the mean scores displayed in each bar are based on students who have scores during both springs represented in the bar, and who didn’t move across school boundaries during the year. This means that the group used to calculate the mean is slightly more stable than the district as a whole. This is done to make the data more meaningful to the teachers in the district, since it includes only students that were in the same school for the complete instructional year.
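The selection rule described above can be sketched as follows. The record layout, the function name, and every value are hypothetical and serve only to show how the bottom and top of a single bar would be computed from matched students.

```python
from statistics import mean

# Hypothetical records: (student_id, school_prev_spring, score_prev_spring,
#                        school_this_spring, score_this_spring).
# None marks a missing score. All values are invented for illustration.
records = [
    ("s1", "A", 221.0, "A", 229.0),
    ("s2", "A", 230.0, "B", 236.0),   # changed schools: excluded
    ("s3", "B", None, "B", 225.0),    # no score in the previous spring: excluded
    ("s4", "B", 215.0, "B", 221.0),
]

def matched_bar(rows):
    """Return (bottom, top, growth) for one bar of a Figure 1 style chart."""
    kept = [r for r in rows
            if r[2] is not None and r[4] is not None and r[1] == r[3]]
    bottom = mean(r[2] for r in kept)   # mean status in the previous spring
    top = mean(r[4] for r in kept)      # mean status in the spring of interest
    return bottom, top, top - bottom

bottom, top, growth = matched_bar(records)
```

Only the two students with both scores and an unchanged school contribute, so the bar runs from 218.0 to 225.0 in this invented example.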
An additional point to note is that data is reported only since 1981, even though the level testing system was put in place in 1977. As with any testing system, it was reasonable to allow teachers and administrators to become adjusted to the level testing system for several years to assure stability. Therefore, 1981 was chosen as the baseline year for all historical comparisons. We will not discuss the figure in full detail, but several features raise very interesting questions for the school district and beyond.
One feature that is very apparent from the figure is that in every grade, student performance levels have increased fairly consistently. Students enter and exit a given grade performing at a much higher level than their predecessors. In fact students entering a given grade last year performed at a higher level than students entering one grade higher in 1981. This is a surprising result in an era in which we talk about the decline of public education, but it is not an artifact. Students in Portland today answer much more difficult questions correctly than they did sixteen years ago, and that is evident not only from our measurement model, but also from actually looking at the questions themselves. (We don't trust our models that much.)
Before we get too excited with success, though, a second feature is worth noting from the figure. Although student status in any particular grade (the position of the bar) has improved over the years, the growth within any grade (the length of the bar) has remained relatively constant over the years. There is no clear indication that students are learning more in any of the grades in which we test. This seems to indicate that students are coming into the fourth grade at a higher level of achievement, and that they are maintaining this higher level throughout their fourth through eighth grade years. Therefore, the increased performance of students in the district is caused by factors acting in the primary grades or before school. Currently, these factors are undiscovered, and may be related to school, or changes in the student population, or the fluoridation of the water supply. This interesting pattern of growth, which also is observed in Reading, opens up all sorts of new questions for research, and has impact at all levels of the school district.
In the discussion above, it was mentioned that the growth within each grade has remained relatively constant over the years. The one exception to this statement is the sixth grade. In the sixth grade, two troubling trends are observed. First, in recent years growth in the sixth grade has been less than in any other grade. Second, in recent years growth in the sixth grade has been noticeably less than in prior years. It is unclear why students grow less in the sixth grade than they do in the surrounding grades but there are several intriguing possibilities.
The most obvious difference between the sixth grade and the other grades is that the sixth grade is the year of transition from elementary school to middle school since Portland has a K-5, 6-8, and 9-12 structure. In addition, the school district changed over from a K-8, 9-12 structure during the mid 1980s. Therefore, larger gains were noted for sixth grade in the older structure, and the new structure results in smaller gains at the sixth grade level than at any other grade level.
It appears that the sixth grade dip in growth may be tied to the middle school structure, but the nature of this relationship is unclear. The transition of students from fifth to sixth grade may be more traumatic for students than we thought, and may be impairing growth in sixth grade. Alternatively, the change in structure in the district may have resulted in sixth grade teachers having to bear the brunt of the changeover, possibly unprepared. If this second hypothesis were the case, one would expect the dip to disappear with time.
Whatever the cause of the sixth grade dip in growth, it is of considerable concern to the school district. A project is currently underway to examine the data from other school districts who use level tests drawn from the same item banks, to determine whether a similar dip exists in other districts, and if so, to determine the characteristics of districts which seem to be related to the dip in growth.
As should be clear from the discussion above, a school district or other organization that couples a high quality measurement program with a constant measurement scale used over a long period of time will achieve an embarrassment of riches in its database. Portland has the data to cause it to question the way that it educates children, and the ability to collect the data that may help it to do a better job in the future. Unfortunately, the district is currently somewhere in between. We are in the process of searching for answers, but our data raises questions at least as fast as it helps us solve them. The thing that distinguishes our efforts from those of districts that don’t use a consistent measurement scale is that we know which questions to ask.
Two Data-Based Predictions
In the name of progress, education feels the need to reinvent itself every few years. In a cyclical fashion, new reforms are considered with little research, proselytized with exaggerated claims, adopted with high hopes, and discarded as newer reforms are identified. In the past, it has been difficult, if not impossible, to stop this cycle, because little information was available to tell us whether the innovation was working before it passed away.
With the use of item response theory and high-quality measurement systems, we should be able to bring more evidence to the table before we pass judgement on a reform. If we can make data-based decisions concerning the impact of a reform movement, we may be able to stop the cycle of reform, and begin a process of improvement. In addition, we may be able to make predictions about the impact of a particular educational intervention, from knowledge of past data.
As an example of this type of prediction, we can start with two interventions that have been suggested to "reform" educational practice. To these suggestions, we will add the existing information that we have collected concerning student performance, and some strong assumptions. Mixing these ingredients, we can obtain predictions for student performance after each intervention is used. By examining the differences in these predictions, we may be able to make recommendations concerning which of the two interventions we wish to undertake.
For purposes of this paper, we will limit our predictions to eighth grade mathematics. There is no reason that we couldn't extend the range of this prediction to include other grades and subject areas.
Intervention 1: Standards-Based Education. A current reform movement in education is one that proposes that standards-based school reform will enable our children to learn everything they need to know. This reform assumes that by establishing challenging, well-defined standards, students and teachers will eventually find ways to get the students to meet the standards. This trend has emerged from the development of newer curriculum standards in several content areas over the past decade, and from the research that indicates that students have more success working toward clearly defined targets than they do working toward ill-defined targets. From these roots, the drive toward standards-based education has changed to suggest that all students can reach any criterion level that we wish to set, given enough time and effort.
The primary emphasis of standards-based education becomes making sure that each student reaches some predefined standard of performance, usually at some particular time such as the end of a particular grade. If a student is in danger of not reaching the standard at the prescribed time, some additional program of instruction is used with the student, to help him or her over the hurdle. The point of reference becomes the criterion level, and the degree of success for a teacher can be measured by the increase in success of students meeting the criterion.
Intervention 2: Growth-Based Education. An alternative to a standards-based approach is to focus attention on individual growth. Rather than using a performance level as a criterion level, the growth-based approach sets a growth standard for all students to achieve. This approach assumes that the educational growth of each student is equally important, but does not specifically try to change individual differences in student performance. The point of reference in this approach is a desired amount of growth for each student, and the degree of success for a teacher can be measured by the increase in the percentage of students growing by at least the desired amount during the course of instruction.
Existing information. From Figure 1, several patterns can be noticed concerning performance and growth of students in mathematics during their eighth grade year. Over the sixteen year period observed, the lowest eighth grade mathematics mean observed was 228.5 in 1980-81, and the highest was 236.0 in 1994-95. The lowest standard deviation observed was 15.94 in 1981-82, while the highest was 18.81 in 1994-95. Since there appears to be an upward trend in the data, it seems appropriate to use the most recent year's data as our base from which to extrapolate. In the 1995-96 school year, the eighth grade mean in the spring was 234.7, with a standard deviation of 18.80.
Observing growth during the eighth grade in mathematics, we can see that the lowest growth observed was 3.5 points in 1985-86, and the highest was 7.9 points in 1994-95. Since there doesn't seem to be a strong trend in growth in the data, it seems appropriate to assume that the growth that we see in a particular year will be close to the mean growth observed, which is 7.0 points.
An important point to consider here is that the change in student achievement for a year for the average student is substantially less than the standard deviation of student achievement within the eighth grade Spring test. This means that we have to temper our enthusiasm for new programs to a reasonable level. If students grow ten percent more in a grade, this would be a wonderful achievement, but it means a change of approximately .7 RIT points, or about .04 standard deviations in student performance.
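The arithmetic behind these figures can be checked directly. The sketch below uses only numbers already stated in the text (mean growth of 7.0 RIT points, a spring standard deviation of 18.80, and a ten percent improvement); the variable names are ours.

```python
mean_growth = 7.0        # mean eighth grade growth in RIT points (from the text)
spring_sd = 18.80        # 1995-96 spring standard deviation (from the text)
extra_fraction = 0.10    # the ten percent improvement considered in the text

extra_growth = mean_growth * extra_fraction   # additional RIT points gained
effect_in_sd = extra_growth / spring_sd       # the same change in standard deviation units
growth_in_sd = mean_growth / spring_sd        # a full year's growth in standard deviation units
```

A full year of growth is a bit more than a third of a standard deviation, so a ten percent improvement in growth is necessarily a very small shift relative to the spread of student performance.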
Figure 2: Distribution of RIT scores in eighth grade mathematics

In order to make our prediction somewhat more realistic, it seems reasonable to use the actual distribution of students in the most recent year as our model. Figure 2 shows the distribution of 3,677 student scores in mathematics in the test given during the Spring of 1996. As can be seen in this figure, the distribution of scores is non-normal, tending to be peaked, but too heavy in the tails to be normal.
Since this distribution is somewhat lumpy, we smoothed the distribution slightly, to allow somewhat more comparison to other locations. (If the purpose of this paper were just prediction, it would be more appropriate to use the unsmoothed data.) Results of this smoothing are shown in Figure 3, with frequencies converted to percentages.
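The smoothing method is not specified here, so the following is only a sketch of one simple possibility: a three-point weighted moving average over score bins, followed by conversion to percentages. The kernel weights, the edge handling, and the bin counts are all assumptions made for illustration and are not the Portland data.

```python
def smooth_to_percent(counts, weights=(0.25, 0.5, 0.25)):
    """Smooth bin counts with a 3-point weighted moving average and
    convert the result to percentages of the total.

    The kernel is an assumption; the smoother actually used is not stated.
    """
    n = len(counts)
    smoothed = []
    for i in range(n):
        left = counts[max(i - 1, 0)]        # repeat the edge bin at the boundaries
        right = counts[min(i + 1, n - 1)]
        smoothed.append(weights[0] * left + weights[1] * counts[i] + weights[2] * right)
    total = sum(smoothed)
    return [100.0 * s / total for s in smoothed]

# Invented bin counts for a lumpy score distribution.
counts = [2, 9, 4, 15, 30, 22, 31, 12, 5, 1]
percents = smooth_to_percent(counts)
```

In the invented counts the dip in the third bin disappears after smoothing, while the percentages still sum to 100.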
Figure 3: Smoothed distribution of RIT scores in eighth grade mathematics

Assumptions. Given our existing information, we can make some assumptions that will allow us to derive our tentative predictions:
1) No additional funding is available for use in our reform effort, so total teacher time remains the same.
2) The criterion level used in the standards-based approach is set arbitrarily at a RIT score of 231. This level is just slightly below the mean, and has been suggested as a criterion level in our state.
3) In the standards-based model, teachers are rewarded by the number of students in their classrooms that achieve the criterion level, and nothing else.
4) In the growth-based approach, teachers are rewarded by the number of students in their classrooms that achieve the appropriate level of growth. In this case the target is set at 10% additional growth, or .7 RIT points.
5) Each reform effort lasts for a single year and is only applied to eighth grade mathematics.
6) Both reform efforts have the same mean impact on growth, allowing students to grow ten percent more than normal during the eighth grade year, and therefore we should be very happy with our reforms.
Interim results. Since the teachers in our two reform conditions are under different reward systems, we can expect them to treat their classes differently, trying to maximize the reward and move the school district toward its desired goals as quickly as possible.
Further, since the number of teacher hours has not increased with the reform, we can expect that the amount of time and the type of instruction that each student receives will vary to help the teacher maximize the reward and speed the district toward its goals.
We anticipate that each teacher will realize that the attention that a particular student receives in the classroom should depend on their performance relative to the success criterion.
In the growth-based intervention, this means that teachers can ignore students that have grown the specified amount already during the year. Once a student has achieved the desired growth, there is no additional gain that will improve the teacher's rewards or the school district's goals.
In the standards-based intervention, this means that teachers can ignore students once they reach the specified criterion level of achievement. In addition, they may ignore any student who is so far below the criterion level that he or she has virtually no chance of reaching the criterion level by the end of the year. Since students grow somewhat less than .5 standard deviations per year, a teacher that chooses to ignore all students who begin the year more than a standard deviation below the criterion level will free up a lot of instructional time to spend with students who actually have a chance of passing the standard. By ignoring students above the standard and students far below the standard, the teacher maximizes the number of students passing the standard, and therefore their own rewards and the district's goals.
While teachers might like to maximize their rewards and the district's movement toward its goals, it is unlikely that they will do so at the cost of completely ignoring students in their class. Therefore, as it becomes less useful to teach a particular student, we will expect that the teacher becomes less likely to spend time with the student, and we will further expect that student growth is directly related to the time spent with the teacher.
To represent this drop off in attention for the standards-based intervention, we used a drop off of 25 percent in attention for each standard deviation the student's performance was below the criterion level, and a drop off of 50 percent in attention for each standard deviation above the criterion level. We further assumed that any student with 50 percent less teacher attention than expected would start to lose achievement, rather than growing.
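One possible reading of this attention rule is sketched below. The criterion (231) and standard deviation (18.80) come from the text; the linear form of the drop off, the floor at zero, and the function names are assumptions of the sketch. The way freed-up attention is redistributed toward students near the criterion is not modeled here.

```python
CRITERION = 231.0   # criterion RIT score from assumption 2
SD = 18.80          # spring 1996 standard deviation reported in the text

def relative_attention(score: float) -> float:
    """Share of the usual teacher attention a student receives (1.0 = usual).

    Drops 25 percent per standard deviation below the criterion and
    50 percent per standard deviation above it, floored at zero.
    The linear form and the floor are assumptions of this sketch.
    """
    distance = (score - CRITERION) / SD
    rate = 0.50 if distance > 0 else 0.25
    return max(0.0, 1.0 - rate * abs(distance))

def loses_achievement(score: float) -> bool:
    """True when attention has fallen to half of the usual level or less."""
    return relative_attention(score) <= 0.5

at_criterion = relative_attention(231.0)
one_sd_above = relative_attention(231.0 + SD)
two_sd_below = relative_attention(231.0 - 2 * SD)
```

Under this reading, the threshold for losing achievement is reached about one standard deviation above the criterion and about two standard deviations below it, which matches the asymmetry between the two drop off rates.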
Prediction results. Since we designed our example so that students should grow ten percent more than normal, it is hardly surprising that the mean achievement for both of our interventions was 235.4, compared to the average of 234.7 observed with no intervention. The standards-based approach resulted in 13.4% more students meeting or exceeding the criterion level of 231. The growth-based approach resulted in 2.2% more students meeting or exceeding the level of 231.
At first look, this is extremely positive evidence of the effectiveness of the standards-based approach, but the first look may be misleading in this case. Figure 4 shows the expected distribution of student scores following each intervention (and the original distribution as well). Two trends are immediately noticeable from this figure.
First, the distribution of scores in the growth-based approach is extremely similar to the distribution in the no-intervention case [see Figure 4]. This is reasonable since the teacher tries to influence each student to grow a little bit more than before. This distributes teacher effort evenly until students have gained .7 more RIT points than expected.
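Why a small, evenly spread gain moves so few students across the criterion can be seen in a short sketch. The ten scores are invented, and treating the growth-based intervention as a uniform gain of .7 RIT points for every student is a simplification made only for illustration.

```python
def percent_meeting(scores, criterion=231.0):
    """Percentage of scores at or above the criterion level."""
    return 100.0 * sum(s >= criterion for s in scores) / len(scores)

# Invented spring scores for ten students (not the Portland data).
baseline = [205.0, 214.0, 222.0, 228.0, 230.5, 231.0, 236.0, 241.0, 249.0, 260.0]
# Simplified growth-based reading: every student gains the same extra 0.7 RIT points.
shifted = [s + 0.7 for s in baseline]

before = percent_meeting(baseline)
after = percent_meeting(shifted)
```

Only a student who started within .7 points of the criterion changes status; everyone else keeps the same side of the line, so the shape of the distribution is essentially unchanged.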
Figure 4: Expected distribution of RIT scores for two different interventions

Second, the distribution of scores in the standards-based approach is substantially different from the no-intervention case. The percentage of students immediately above the criterion level is substantially higher than with no-intervention, while the percentage of students immediately below the criterion level is substantially lower.
Of even more interest, though, is the impact of the standards-based intervention on the performance of students some distance from the criterion level. At the lower end of the distribution and the upper end of the distribution, student performance is actually depressed below that observed in the no-intervention case. This is reasonable, since there is no benefit to the teacher or the district's goals to be gained in teaching these students. Therefore, an initial result of standards-based education may be that poor performing students perform even more poorly, while high performing students may not perform as high as with no intervention at all.
Therefore, while students in each intervention grew 10 percent more during their eighth grade year, it is apparent that the different interventions had very different impact on the performance of students. The growth-based intervention had a very small positive influence on growth of all students, while the standards-based intervention had a larger positive impact on growth of students near the criterion level, and a negative impact on growth of students in both tails of the distribution.
Finally, it is likely that the difference in the impact of these two types of intervention would be even greater if the programs were in effect across all grades, rather than in just the eighth grade. It is possible that the standards-based approach might split the distribution of students into a group that was consistently ignored from year to year (the lowest performers) and a group that was tightly packed into the vicinity of the criterion level for any particular grade. The more years that a low performing student spent under this system, the less likely they would be to catch up, as they fell farther and farther behind the main group of students each year. This is obviously not the intent of the standards-based educational approach, but in an era of non-expanding resources, it could be a likely outcome.
Discussion
The information that we obtain from level tests using item response theory provides us with a set of tools for identifying, evaluating, and predicting educational change. In particular, strong measurement coupled with consistent measurement across several years allows us to see the slowly changing landscape of educational performance. In most settings, we try to react too quickly to changes in test scores, even though those changes may be caused by the quality of the Fall TV lineup rather than by changes that have been made in instruction. By reacting too quickly, we reduce our chances of observing actual change that will allow us to nudge our school districts in the right direction. Since student growth happens on a small scale relative to individual differences within any particular grade, we need to be extremely careful in using our data as a sledge hammer rather than a tack hammer.
In using data to give us a base to make predictions about new programs, we move even farther out onto our limb. The particular assumptions that we made in our prediction example are almost certainly faulty. However, they do suggest a procedure that school districts could use to pre-evaluate suggested reforms. Then, if a particular educational intervention is chosen for use, districts might have a more realistic expectation concerning changes in student performance and growth. Finally, the pre-evaluation might suggest some difficulties with a particular intervention that could make a district think twice before adoption of a procedure. In our example, the tendency for low performing students to fall farther behind in a standards-based approach is extremely unsettling, and suggests that the reward function for this approach needs to be evaluated very closely before we implement the approach.
The accurate measurement information that comes from the use of an item response model in conjunction with a functional level testing program gives us the luxury of collecting sound data, and the longevity of the testing program in Portland provides us with the continuous data that make extrapolation and prediction a strong possibility. Regardless of the accuracy of our assumptions, we are taking the first steps toward an educational system that changes due to thoughtful and incremental action, and not due to the persuasiveness of the current fashion. While this is certainly not all the result of the Rasch model (we are sure other models might have worked as well), the need for strong measurement scales is even more obvious when we realize that we are looking for small effects in most of our educational change. Hopefully, the use of strong measurement procedures will allow us to make most of our small changes positive.
References
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F. M. & Novick, M. R. (1968). Statistical theory of mental test scores. Reading, MA: Addison-Wesley.
NWEA (1997). Technical manual for the NWEA level tests in Reading, Mathematics, and Language. Portland, OR: Northwest Evaluation Association.
Rasch, G. O. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.