Assessing K12 Assessments
by Katharine Beals
May 31, 2011
educationnews.org
Katharine Beals, PhD, analyzes the problems with K-12 assessments and recommends ways
teachers and administrators can utilize assessments more effectively.
Introduction
As the 2010-2011 school year enters its final marking period, as states wrap up
their No Child Left Behind tests, and as colleges and selective high schools
send out their admissions decisions, 'tis the season of K12 assessments. They
come in all shapes and sizes and measure all kinds of things, including the
knowledge of numbers from 1 to 1000:
For all the assessing that assessments do, how often are they themselves assessed?
Where do we even begin? Let's begin at the top.
In the assessment above, a student, prompted to write down a number that is 200 more
than a given number, has written down a number that instead is 400 more. What
sort of mistake is this? What, specifically, does it indicate about the
student's ability to do the task that is apparently being assessed here: adding
ones, tens and hundreds to a given three-digit number? What do the rest of the
student's answers indicate about his/her ability?
With all this in mind, should this assessment be graded, and, if so, how? How many
points should the student lose? What other consequences or follow-up measures
should ensue as a result of the student's mistake?
The purposes of K12 assessments:
One way to address these last questions is to step back and consider K12 assessments
in general. Whether the assessment tool is a test, a homework assignment, or an
in-class activity, what purposes does it serve?
Most immediately, assessments provide feedback: feedback to the student and his or
her parents about how he or she is doing, as well as feedback to the teacher.
Ideally, the teacher gains insight not just into how the student is doing, but
also into how effectively he or she is teaching this student in particular and
the class as a whole. Ideally, assessment motivates self-reflection in student
and teacher alike.
Related to this, assessments offer incentives: to the student to work hard and adjust
his/her study habits as needed; to the teacher to adjust, as needed, his or her
teaching strategies or to provide remediation to particular
students.
Beyond the classroom and the family, assessments (especially course grades, test
scores, and teacher recommendations) help admissions committees decide whom to
admit into gifted programs, selective high schools, or particular colleges. They
help potential employers decide whom to hire. When assessments are standardized
and administered across multiple classrooms and schools, as in No Child Left
Behind tests, they also provide information to principals about how different
classrooms are doing, and information to the school, the government, and the
public about how this school compares with others.
With this in mind, let's assess the assessments. How well do different K12
assessments serve these purposes? When do they fail at what they're supposed to
accomplish?
Potential problems:
An assessment may go wrong in any number of ways. It may set too high a bar or too
low a ceiling; it may assess things that aren't being taught or that aren't
relevant to the given assessment area; it may be distorted by irrelevant
factors; it may target things that aren't readily assessable; or it may be
overly subjective.
Assessments that set too high a bar, such that most students do poorly, provide little in
the way of useful feedback, except to the teacher: namely, about how his or her
expectations square with what he or she has actually succeeded in teaching. Also
potentially problematic are assessments with the opposite outcome, with most
students either getting all, or nearly all, of the answers right. This may
indicate highly successful teaching. It may also signal, however, that
expectations weren't sufficiently high and that the assessment's ceiling was too
low. The latter is arguably a problem with many of the statewide tests that have
sprung up under No Child Left Behind, especially since these tests are sometimes
used, not just to evaluate schools and teachers, but to decide who gets admitted
to gifted programs and selective high schools. If an assessment's ceiling is too
low, then however accurately it measures the skills of the least capable
students, it won't capture the full range of abilities within the class as a
whole, especially those at the other end of the spectrum.
Indeed, assessments with low ceilings may even underestimate the relative capacities of
some of the more capable students. A student who finds the assessment too
easy may become disengaged from the test items and careless in his or her
responses. Low-ceiling, grade-level NCLB-inspired tests do not include harder,
above-grade level questions where bright but sloppy testers could make up for
points lost elsewhere.
Assessments may also go wrong by including things that aren't being taught. Many
assessments, for example, include measurements of handwriting and neatness, but
many teachers no longer teach penmanship. Some assessments go even further,
including skills that not only aren't taught, but aren't even teachable - at least
by typically-trained K12 teachers. Common examples are creativity (frequently
factored into grades for projects and other open-ended assignments), social
confidence and interpersonal skills (implicitly factored into grades for class
participation and presentations), and the ability to cooperate with classmates
(often factored into grades for group assignments). Consider, for example, the
following all-purpose oral presentation rubric, variations of which make
repeated appearances around the Internet and inside grade school
classrooms:
Here we see ratings for "Speaks Clearly" and "Posture and Eye Contact," with the
lowest points going to the student who mumbles or mispronounces, or who slouches
and doesn't look at others. How many classroom teachers spend time teaching and
encouraging, or even know how to teach and encourage, things like good posture,
eye contact, and clear speech?
Still, might there be a reason for K12 schools to assess skills that, perhaps
justifiably, they don't consider it their duty to teach? After all, some of
these are real-world skills that employers and admissions committees may care
about. But K12 assessments aren't the only sources of information that these
parties have access to. There are also interviews, application essays, outside
recommendations, and work portfolios. Indeed, these are arguably much better
tools for assessing relevant interpersonal skills and creative potential than
K12 assessments are.
Yet another way in which K12 assessments can go wrong is when they base scores in
part on factors unrelated to what's purportedly being assessed. For example,
even if skills like penmanship and neatness are being taught, is it appropriate to
include them in assessments intended for other skills? In an assessment of
persuasive essay writing, should points be taken off for penmanship problems? In
an assessment of math skills, should points be taken off for answers that are
wrong only because the student had trouble understanding the directions or the
language in a word problem? In an assessment of place value understanding,
should points be taken off for an incomplete or inarticulate verbal explanation
or a missing diagram when elsewhere within the assessment it's clear that the
student understands what he or she is doing, as is arguably the case in the
excerpt below?
That's not to say that, where problems with penmanship or reading comprehension or
verbal expression surface, there shouldn't be consequences. But beyond feedback
to the student and parents about the nature of these weaknesses, and remediation
that adequately addresses them, are any other measures necessary?
Even assessments that target only what they purport to assess may be distorted by
other factors - for example variations among students' motivations and
concentration skills. Especially vulnerable are timed tests: some students are
poor testers, failing to focus and read directions carefully, and/or failing to
double-check answers for stupid mistakes. While it may be reasonable to assess
test-taking skills in their own right, particularly in the context of teaching
these skills, any assessment that isn't supposed to include test-taking ability
as one of its criteria may be distorted by this very factor.
But perhaps most insidious are assessments that intentionally include criteria that
are ill-defined and/or highly subjective - especially when these are also
unteachable and/or untaught. One common example is motivation and effort. While
all assessments are potentially distorted by these, many deliberately include
them - particularly those assessments we sometimes call "formative": assessments
that target the student's in-class learning and work processes as opposed to his
or her final products and test results.
Related to effort and motivation, and also difficult to measure objectively, are some of
the criteria for exceeding a school district's standard for the given grade
level. On many report cards (particularly those that use the 1-4 grading scale)
"exceeding the standard" is the requirement for receiving the highest grade
(see, e.g., Standards Based Report Cards). Now, in theory, a student could
accomplish this simply by doing a good job on
assignments that are above his or her grade level. However, more and more
schools, perhaps influenced in part by those low-ceiling NCLB tests, resist
giving students above-grade-level assignments. In practice, therefore, exceeding
the standard means not only perfect performance on grade-level assignments
(which, as we've discussed, the more under-challenged students may be too
disengaged to produce), but a particular sort of disposition towards grade-level
activities. The requisite disposition often includes a certain independence and
initiative - doing class work "independently and without teacher prompting" (see,
e.g., www.greensburgsalem.org), and/or doing more than is asked for ("Writes
independently with purpose beyond the given time frames"; cf.
www.tenafly.k12.nj.us) - that the more disengaged
students, again, may fail to display. The requisite disposition may also include
a certain level of apparent cognition: "demonstrates a thorough understanding"
(as opposed to "demonstrates an understanding") of "the knowledge and skills for
this grade level" (www.eastchester.k12.ny.us/schools/ah/principal/principal.htm);
or "consistently shows evidence of higher level thinking" (see, e.g.,
www.tenafly.k12.nj.us); or "demonstrates broader, deeper, more complex
understanding of the standard beyond the expected level of mastery" (see, e.g.,
http://www.smfc.k12.ca.us/msreportcards);
or evinces "depth of understanding and flexible application of grade-level
concepts" (see, e.g., www.sbsdk12.org). Such traits are not easily quantified and, once again, may be least
evident in some of the brightest students, who often are among the least
inspired by their classroom's grade-level offerings.
Fuzzier yet, and similarly susceptible to the student's engagement level, is
"creativity" - a common requirement for top marks on projects.
A frequently cited reason for including such soft skills as effort, initiative,
motivation and creativity is to make assessments less narrow and rigid, or more
"authentic," as the lingo goes. The problem is that soft skills tend not to be
quantifiable - or even teachable - and the resultant subjectivity can lead to
subconscious bias towards certain students, for example the more cooperative,
sociable, or otherwise likable.
Another new trend in assessment has arisen in part to address these problems of
subjectivity and ill-defined criteria: namely, the rubric. Rubrics - like the
Presentation Rubric we discussed above - attempt to break down an assignment's
components into measurable factors, mapping specific criteria to different point
levels. Let's consider another example, a rubric for grading
posters:
While rubrics may succeed in factoring out certain measurable skills, some criteria
remain ill-defined or subjective. In the above example, "enhance" makes repeated
appearances, at one point joining forces with "creatively" ("creatively enhances
information"). "Engaging" also appears. The measurement scales, furthermore, are
often crude and leave out other possibilities - for example in the Quality of
Information scale above, we have 3 points for "product description is clear,
complete and concise"; 2 points for "product description is mostly clear, could
be a bit more concise"; and 1 point for "product description is unclear,
incomplete, and not concise." What if the description is extremely clear and
complete, but not concise?
And what if the product description is extremely clear, complete, and concise, and
the writing is exceptionally good, and there are no grammar and spelling
errors, but the poster's layout and graphics only meet the criteria for 2
points? Rubric-based grading, where two-dimensional grids tend to impose a
uniform point scale on every assessment area, is typically too rigid to allow
extra points for exceptionally good work in those areas where the student,
arguably, has more than made up for deficiencies in others. Rubric-based
grading, in other words, tends to place an artificial point ceiling on any
student who struggles in specific areas but excels well beyond his or her
classmates in others.
A second trend in assessment also attempts to make grading more accurate: the
portfolio assessment. Ideally, the teacher examines a portfolio of all the work
that each student has produced during the given marking period. Realistically,
the teacher may not be able to assess the entire oeuvre of each student in the
class. In practice, therefore, portfolio assessment means selecting what is
supposed to be a representative sample of each student's work - a selection
process that risks being subjective and favoring certain types of
assignments - and, therefore, certain types of students.
Making assessments work
For assessments to serve the specific purposes we have considered above, they should
have appropriate bars and ceilings; they should attempt to measure only what is
actually being taught; and they should minimize factors that aren't objectively
measurable. If neatness is important, we should be sure to teach penmanship. If
graphics and layout are important, we should teach graphics and layout. As for
creativity, if we don't know how to teach it or measure it, we should question
whether it's a reasonable requirement. How useful is it for a student or his
parents to learn that he was insufficiently creative? Again, while outside
evaluators - admissions committees or employers - may care about creativity, they
often have their own preferred ways of assessing it. Perhaps creativity in class
work and homework could count as extra credit when it's truly outstanding,
rather than being a conventionalized, generalized, purportedly quantifiable
expectation that all students must strive to meet.
While all tests are distorted by variations in test-taking skills, teachers can check
whether students are staying on task and consider ways to minimize distractions.
While all assessments are distorted by variations in motivation and effort,
these, too, are somewhat detectable and treatable. When someone does poorly,
especially unexpectedly so, we should question whether deficient effort and
motivation played a role. We should diplomatically explore with such students
the reasons for their apparently low motivation and whether there are
appropriate alternative, perhaps more challenging, assignments that might
motivate them more.
Teachers should examine mistakes carefully to detect which ones truly indicate deficits
in the area being assessed, and which ones instead are stupid mistakes, or
mistakes caused by other deficits, for example in reading comprehension or
verbal expression. If we remain uncertain, we should solicit input from the
student herself. We might have her redo particular questions - perhaps after first
clarifying the directions - without providing any hints other than the inevitable
cue that she made some sort of mistake. If we find ourselves saying, "Even
though he got the wrong answer, I knew he knew how to do this," we should
question whether it makes sense to take off points. Again, this does not rule
out other consequences like asking the student to double check and redo, or
exploring whether he would benefit from remediation, say, in reading or
language.
Throughout, the assessment should be carefully tailored to its purpose(s). Is it intended to
provide feedback to students, to teachers, to state and federal governments, to
the general public, to admissions committees, and/or to potential employers?
Does the feedback make clear what has and what hasn't been measured? Given what
was in fact measured, is it reasonable to factor the assessment into students'
grades?
Assessments with low ceilings are most appropriate for measuring whether teachers and
schools have successfully taught the material that they're supposed to teach,
rather than for measuring the range of aptitudes among students. Thus, while
such tests may serve to make teachers and schools more accountable, they are
less appropriate as tools for determining whether specific students should be
admitted to gifted programs or selective high schools. Better suited to these
latter purposes are standardized, normed, high-ceiling alternatives like the
SAT-9.
Assessments that rely significantly on test-taking skills might provide better feedback if
it's clear to all concerned that these skills are part of what's being measured
(an assumption we generally make about standardized tests, but not necessarily
about in-class assessments). Such assessments might also be preceded and
followed up by assignments and activities that help weaker testers
improve.
Assessments, especially those that affect students' grades, should be flexible enough to
measure accurately, and credit sufficiently, those students with unusual ranges
of abilities. Rubrics, if used, should allow extra credit points in certain key
areas - particularly those most directly related to the academic subject in
question - so that students who excel in these areas but not in others aren't
downgraded relative to their more cognitively typical classmates. In an essay
rubric, for example, areas most core to essay writing (like organization and
ideas, as opposed to spelling and vocabulary) should be areas where extra credit
is possible.
Of course, even with these cautionary measures there's no such thing as a perfect
assessment - one fully immune from extraneous influences. But in continually
assessing K12 assessments, paying close attention to how different students
perform on them and what feedback these students provide us with afterwards, we
can more carefully align them with the goals that they are best
suited to serve.
As for those who play a subsequent role in furthering these goals - whether they are
teachers, students, parents, governments, admissions committees, or potential
employers - they should always bear in mind the various factors that keep K12
assessments from being as reliable as they might at first appear to
be.
Katharine Beals, PhD, is the author of "Raising a Left-Brain Child in a Right-Brain World: Strategies for Helping Bright, Quirky, Socially Awkward Children to Thrive at Home and at School." She teaches at the University of Pennsylvania Graduate School of Education and at
the Drexel University School of Education, specializing in the education of
children on the autistic spectrum. She blogs about education at Kitchen Table
Math and on her own blog, Out in Left Field.