Using IRT Techniques in the Mesa Schools Elementary Testing Program

Joe O’Reilly

Mesa [AZ] Public Schools

 

 

INTRODUCTION

 

Today I would like to share with you how we use IRT techniques to analyze our district elementary reading and math tests. Our tests are multiple choice tests and are given in the spring. They are designed to measure mastery of the MPS curriculum at the end of the year.

 

I should preface my remarks by saying that we are rather new at using these techniques. We turned to them in response to the needs of our curriculum staff and teachers. My goals for today are to describe how we are using IRT techniques and to learn as much as I can from the other panelists and the audience. I should also say that I have framed this talk as a test director, not an expert in IRT. I am going to emphasize what we have done and why we have done it and leave it up to some of our other panelists to describe the more technical details.

 

Let me start by giving you some background.

 

In the past we did the traditional analysis -- percent correct, percent of students mastering objectives, results disaggregated by sub-populations, and classical item analyses -- KR20’s, etc.

 

This met many needs. You could look at the percent correct and ask if that overall level of performance was adequate or not and you could see if an item was wildly off, at least with a given group of kids.

This was fine for many years. Then the tests started becoming more important. Schools had to set three year academic goals and monitor progress toward their achievement. And these results had to be shared with school site councils and, indirectly, with the whole community through the state school report card.

 

This led to some concerns about the tests. Teachers and principals had three major concerns --

 

 

 

To address these needs we turned to Item Response Theory [IRT] techniques. We used these methods to do three things. First, we performed more detailed item analyses. In addition to looking at things like distractor response patterns, we also looked at response patterns by gender, ethnicity and performance level. We essentially compared the predicted performance on each item to the actual performance for each sub-population.

 

Next we completed a School Performance Analysis. This allows a school to compare their performance on each item and objective to their overall performance. This is a powerful tool for self-reflection -- based on their overall performance more students should have gotten this item correct -- why didn’t they?

 

Finally, we developed Mesa Scale Scores. We have linked the tests so that we now have a common scale for all district reading and math tests. That means we are able to report gain or growth scores for schools, teachers and students.

 

 

IRT Item Analysis

 

Let me start with how we do item analysis using IRT techniques.

 

Figure 1 shows some of the output for our item analysis. As you can see at the top of the chart we have some traditional information -- the mean item score, the Point Biserial Correlation and the percent choosing each distractor.

 

Figure 1. Sample Item Analysis Output

 [not available in digital format, please see paper copy of the report]

 

You can also see something different. In the top table, the program breaks students into three roughly equal groups based on performance on the test. This allows us to see how students of differing performance levels responded to the distractors. For example, we found that across tests, low performing students just chose A on difficult items.
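The three-group breakdown described above can be sketched in a few lines of Python. This is only an illustration, not the district's actual program: the function name `distractor_table`, the four answer options, and the sample data are all assumptions made for the example.

```python
from collections import Counter

def distractor_table(responses, totals, options="ABCD"):
    """Percent choosing each option within the low, middle and high thirds.

    responses: the option each student chose on one item (e.g. "A")
    totals:    each student's total test score, in the same order
    Assumes at least three students so every group is non-empty.
    """
    # Rank students by total score, then cut the ranking into thirds.
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    n = len(order)
    cuts = [0, n // 3, 2 * n // 3, n]
    table = {}
    for name, lo, hi in zip(("low", "middle", "high"), cuts, cuts[1:]):
        group = [responses[i] for i in order[lo:hi]]
        counts = Counter(group)
        table[name] = {opt: 100.0 * counts[opt] / len(group) for opt in options}
    return table

# Made-up data for nine students on one item whose keyed answer is "B".
totals = [3, 5, 4, 10, 11, 12, 18, 19, 20]
responses = ["A", "A", "C", "B", "A", "B", "B", "B", "B"]
for level, row in distractor_table(responses, totals).items():
    print(level, row)
```

A row in which the low group piles up on a single wrong option, as in this made-up data, is the kind of pattern that would prompt a closer look at the item.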

 

In the next tables you can see that the program predicts what percent of students would get an item right based on that sub-population’s overall performance on the test and the difficulty of the item. Doing this analysis, by the way, increased the perceived fairness and credibility of the tests.
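One way to picture the predicted-versus-actual comparison is the sketch below. It assumes a Rasch (one-parameter logistic) model, which is an assumption of this example; the talk does not say which IRT model the program uses. The function names and the sample records are hypothetical.

```python
import math

def rasch_probability(ability, difficulty):
    """Chance of a correct answer under the Rasch model (an assumed model)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def predicted_vs_actual(records, difficulty):
    """Return {group: (predicted % correct, actual % correct)} for one item.

    records: (group, ability, correct) for each student, where ability is
             the student's estimate from the whole test, on the same scale
             as the item difficulty.
    """
    totals = {}
    for group, ability, correct in records:
        predicted_sum, correct_count, n = totals.get(group, (0.0, 0, 0))
        totals[group] = (
            predicted_sum + rasch_probability(ability, difficulty),
            correct_count + int(correct),
            n + 1,
        )
    return {
        group: (100.0 * p / n, 100.0 * c / n)
        for group, (p, c, n) in totals.items()
    }

# Made-up records: two groups with identical abilities but different results.
records = [
    ("group 1", 0.0, True), ("group 1", 0.0, False),
    ("group 2", 0.0, False), ("group 2", 0.0, False),
]
print(predicted_vs_actual(records, difficulty=0.0))
```

A group whose actual percent falls well short of its predicted percent, while other groups are on target, is the signal to go and look at the item itself.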

 

Figure 2 is one of the items on the test. Anglos, Asian-Americans and African-Americans got this item correct at about the predicted percentage. But the Hispanics and Native-Americans did much worse on this item than predicted.

 

Figure 2. Sample First Grade Test Item

 [not available in digital format, please see paper copy of the report]

 

Why? Well, we threw out the possibility that it was because the name of their ethnic group did not begin with "A". We talked to teachers and the best guess was that they were not familiar enough with a bowling ball (similar to what happens to any Arizona kid when they are asked about sows or cellars on the ITBS). One teacher suggested buying bowling balls for all second grades, but we didn’t think our insurance company would like that.

 

So, what we did was label the pictures and that did the trick -- all groups performed about where we expected.

 

When people realized that we did this type of analysis they felt that the tests were much fairer and more valid. That is very important when you want them to use the test results to improve instruction.

 

School Performance Analysis

 

The second way we use IRT is to provide schools with a School Performance Analysis. We basically asked and answered this question -- Is Your Grade’s Performance on Individual Items on a Specific Test Consistent with Your Grade’s Overall Performance On That Test?

We did this for each item on each test, once for reading and once for math, for each grade, first through sixth, for each school. That is about 35,000 analyses -- thank god for computers!

 

This is different than what we have traditionally done with our NRT results. There we used a regression model to compare actual versus predicted performance. That analysis is very much dependent upon how other schools performed on the test. This one really is not; it is a within-school comparison. This was seen as a big improvement by a number of our principals.

 

How did we do it? First, we calculated a difficulty rating for each item using all 5,000+ students that we tested. Next, we calculated a performance level for each test for each school. So, for example, for each school we calculated a performance level for fifth grade reading, and a separate performance level for fifth grade math, using just the performance of that school’s fifth graders in spring of 1995 on each of those tests. Finally, we took these numbers -- the difficulty of each item and the school’s overall performance on the test -- to predict what percent of students would get each item correct. If a school is within +/- 8% of the predicted percentage, then it is performing about where it would be expected on that item. If it is off by +/- 8-10% then it is slightly above or below, +/- 10-15% is above or below, and more than +/- 15% is well above or below.
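The prediction and the bands can be sketched as follows. Treat this as a rough illustration under stated assumptions: the Rasch-style formula is assumed rather than taken from the district's software, the function names are hypothetical, and the way exact boundary values (8, 10 and 15 points) are assigned to a band is a choice made for this example.

```python
import math

def predicted_percent(school_level, item_difficulty):
    """Predicted % correct from one school performance level and one item
    difficulty, using a Rasch-style logistic curve (an assumed model)."""
    return 100.0 / (1.0 + math.exp(-(school_level - item_difficulty)))

def flag(actual_pct, predicted_pct):
    """Label the gap between actual and predicted percent correct."""
    diff = actual_pct - predicted_pct
    size = abs(diff)
    if size <= 8:                      # boundary handling is an assumption
        return "as expected"
    direction = "above" if diff > 0 else "below"
    if size <= 10:
        return "slightly " + direction
    if size <= 15:
        return direction
    return "well " + direction

# Made-up school level, item difficulties and actual results for three items.
school_level = 0.5
items = {22: (-1.0, 64.0), 23: (0.2, 59.0), 70: (0.9, 22.0)}
for number, (difficulty, actual) in items.items():
    predicted = predicted_percent(school_level, difficulty)
    print(number, round(predicted, 1), actual, flag(actual, predicted))
```

Because the prediction uses only the school's own performance level, the flags describe consistency within the school rather than standing relative to other schools.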

In other words, we are looking at how consistently students performed on each item. When there was an inconsistency between the performance on an item and the overall performance, we highlighted that item.

 

Let’s look at a real example. Here is one school’s fifth grade results. On most objectives there is nothing listed. This means that on those objectives and items the school performed as we would have predicted given the school’s overall performance and the difficulty of each item. Other categories have items with plusses and minuses. These indicate how far the school is above or below predicted.

 

Figure 3: Sample Performance Analysis Results

 [not available in digital format, please see paper copy of the report]

 

A few things immediately pop out. Objectives 8 & 12 each have one item on which students over perform and one on which they under perform. This could indicate a need for a slight change of emphasis...or, some other explanation that is obvious to the school’s teachers but not to you and me. That is where your role as the educational detective comes in.

Objectives 4 & 14 pop out even more. Let’s look at these items on the test.

 

Figure 4: Representative Test Items

 [not available in digital format, please see paper copy of the report]

 

You can see that #22 is addition and # 23 & 24 are multiplication. Why is that? Perhaps it was covered early in the year and not practiced. Perhaps one class really blew it and the others did OK. Once again, that is something for a school’s grade level team to explore.

 

Similarly, on items 70 & 71 there was under performance. These items deal with graphs and coordinates. Was this covered after the test was given? Was it presented in a different way than the way the questions were worded? Answering these questions is the responsibility of the school faculty.

 

Mesa Scale Scores

 

A third analysis we did was create what we are calling a Mesa Scale Score. We have developed this because of teacher and principal requests. Several have asked for a way of measuring growth or improvement over time. Others, and especially teachers, want to minimize the time taken away from instruction for testing.

 

If we can put all the tests on the same scale, we can compare scores from one year to the next. We could also then eliminate mass testing in the fall; you would only have to test the students not tested the prior fall if you wanted gain scores.

 

That is what we are trying to accomplish with this analysis.

 

As a secondary benefit we will be able to describe differences in different forms of the test, either because we are creating alternative test forms or because we have improved the test by changing some items.

 

Before we get into these scores in detail, let me state some caveats right up front. First, we are not sure yet how to best describe growth on the scale -- as the number of points, as the difference from how much the district grew, as a standard deviation, or maybe just a phrase -- above, at or below average. Also, since we only have two years of experience with these scores we don’t know if there is something odd that would show up in only one year. We would like to wait one more year to see if we get a similar pattern. And we report the results to schools in both graphic and tabular form.
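To make the four options concrete, here is a small sketch that reports one school's gain each way. Everything specific in it is an assumption for illustration: the function name, reading "as a standard deviation" to mean the gain expressed in standard deviation units of school gains across the district, and the quarter of a standard deviation used as the cutoff for the phrase.

```python
from statistics import mean, stdev

def describe_growth(school_gain, district_gains, band=0.25):
    """Describe one school's scale score gain four ways.

    school_gain:    this school's gain in scale score points
    district_gains: gains for the schools in the district (needs two or more)
    band:           hypothetical cutoff, in SD units, for the phrase
    """
    district_mean = mean(district_gains)
    district_sd = stdev(district_gains)
    sd_units = (school_gain - district_mean) / district_sd
    if sd_units > band:
        phrase = "above average"
    elif sd_units < -band:
        phrase = "below average"
    else:
        phrase = "at average"
    return {
        "points": school_gain,
        "vs_district": school_gain - district_mean,
        "sd_units": sd_units,
        "phrase": phrase,
    }

# Made-up gains for three schools; describe the school that gained 12 points.
print(describe_growth(12, [8, 10, 12]))
```

Each description answers a slightly different question, which is part of why choosing among them is hard.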

 

Our big question now is how to translate this gain into something useful. As I said earlier, we are not sure yet how to best describe growth on the scale -- as the number of points, as the difference from how much the district grew, as a standard deviation, or maybe just a phrase -- above, at or below average.

 

Conclusion

 

So that is how we are using IRT techniques to help schools improve instruction.

The benefits of using this approach have been perceived fairness, and therefore usefulness, of analyses. Scale scores also provide a measure of growth and allow for comparable alternative forms.

 

The challenges or concerns that we have are:

 

And how have our principals reacted? We have 49 of them, so their reactions run the gamut. A few feel that we have done too much and a few want more done. But most feel that we are at the right level of analysis. We have improved over what we used to do, and have addressed principals’ concerns. And we are probably just under the breaking point of giving them too much.

Overall, using these techniques has been a very positive step for us.