Liza Pleva Production editor: Jane Townsend. Melissa Leyva Director of manufacturing: Patrice Fraccio Senior manufacturing downloader: Edith Pullman Cover design: Tracy Munz Cataldo Te.. Wendy Wolf Text composition: Carlisle Communications, Ltd. Don Martinetti Text credits: See p. Douglas Language assessment: Douglas Brown. Includes bibliographical references. ISBN 1. Language and languages-Ability testing. Language and languages-Examinations. B76 '. Access our Companion Websites, our online catalog, and our local offices around the world.
Visit us at longman.
Are the test procedures practical? Is the test reliable? Does the procedure demonstrate content validity? Is the procedure face valid and "biased for best"? Are the test tasks as authentic as possible? Does the test offer beneficial washback to the learner?
Test Types
1. Design each item to measure a specific objective.
2. State both stem and options as simply and directly as possible.
3. Make certain that the intended answer is clearly the only correct one.
1. Determine the purpose and objectives of the test.
2. Design test specifications.
3. Specify scoring procedures and reporting formats.
Commercial Proficiency Tests
Responsive Listening
Designing Assessment Tasks: Extensive Listening
What Should Grades Reflect?
In this melange of topics and issues, assessment remains an area of intense fascination. What is the best way to assess learners' ability? What are the most practical assessment instruments available? Are current standardized tests of language proficiency accurate and reliable?
In an era of communicative language teaching, do our classroom tests measure up to standards of authenticity and meaningfulness? How can a teacher design tests that serve as motivating learning experiences rather than anxiety-provoking threats? All these and many more questions now being addressed by teachers, researchers, and specialists can be overwhelming to the novice language teacher, who is already baffled by linguistic and psychological paradigms and by a multitude of methodological options.
It is a book that simplifies the issues without oversimplifying. It doesn't dodge complex questions, and it treats them in ways that classroom teachers can comprehend. Readers do not have to become testing experts to understand and apply the concepts in this book, nor do they have to become statisticians adept in manipulating mathematical equations and advanced calculus. In keeping with the tone set in the previous two books, this one features uncomplicated prose and a systematic, spiraling organization.
Supportive research is acknowledged and succinctly explained without burdening the reader with ponderous debate over minutiae.
The testing discipline sometimes possesses an aura of sanctity that can cause teachers to feel inadequate as they approach the task of mastering principles and designing effective instruments. Some testing manuals, with their heavy emphasis on jargon and mathematical equations, don't help to dissipate that mystique. By the end of Language Assessment: Principles and Classroom Practices, readers will have gained access to this not-so-frightening field.
They will have a working knowledge of a number of useful ... Language Assessment: Principles and Classroom Practices is the product of many years of teaching language testing and assessment in my own classrooms. My students have collectively taught me more than I have taught them, which prompts me to thank them all, everywhere, for these gifts of knowledge.
I have memorable impressions of such sessions in Brazil, the Dominican Republic, Egypt, Japan, Peru, Thailand, Turkey, and Yugoslavia, where cross-cultural issues in assessment have been especially stimulating. I am also grateful to my graduate assistant, Amy Shipley, for tracking down research studies and practical examples of tests, and for preparing artwork for some of the figures in this book.
I offer an appreciative thank you to my friend Maryruth Farnsworth, who read the manuscript with an editor's eye and artfully pointed out some idiosyncrasies in my writing. And thanks to my colleague Pat Porter for reading and commenting on an earlier draft of this book.
Tests seem as unavoidable as tomorrow's sunrise in virtually every kind of educational setting.
Courses of study in every discipline are marked by periodic tests, milestones of progress or inadequacy, and you intensely wish for a miraculous exemption from these ordeals. We live by tests and sometimes metaphorically die by them. For a quick revisiting of how tests affect many learners, take the following vocabulary quiz.
All the words are found in standard English dictionaries, so you should be able to answer all six items correctly, right? Okay, take the quiz and circle the correct definition for each word. You have 3 minutes to complete this examination! You probably felt just the same as many learners feel when they take multiple-choice (or shall we say multiple-guess?) tests. You can check your answers on this quiz against the answer key at the end of this chapter. If you correctly identified three or more items, congratulations!
You just exceeded the average. Of course, this little pop quiz on obscure vocabulary is not an appropriate example of classroom-based achievement testing, nor is it intended to be. It's simply an illustration of how tests make us feel much of the time. Can tests be positive experiences? Can they bring out the best in students?
The answer is a resounding yes! Tests need not be degrading, artificial, anxiety-provoking experiences. And that's partly what this book is all about. Notice that the title of this book is Language Assessment, not Language Testing.
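There are important differences between these two constructs, and an even more important relationship among testing, assessing, and teaching. A test, in simple terms, is a method of measuring a person's ability, knowledge, or performance in a given domain. Let's look at the components of this definition.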
A test is first a method. To qualify as a test, the method must be explicit and structured. Second, a test must measure. Some tests measure general ability, while others focus on very specific competencies or objectives. Some tests, such as a classroom ... Others, particularly large-scale standardized tests, provide a total numerical score, a percentile rank, and perhaps some subscores.
If an instrument does not specify a form of reporting measurement (a means for offering the test-taker some kind of result), then that technique cannot appropriately be defined as a test. Next, a test measures an individual's ability, knowledge, or performance. What is their previous experience and background? Is the test appropriately matched to their abilities? A test measures performance, but the results imply the test-taker's ability, or, to use a concept common in the field of linguistics, competence.
Most language tests measure one's ability to perform language, that is, to speak, write, read, or listen to a subset of language. On the other hand, it is not uncommon to find tests designed to tap into a test-taker's knowledge about language. Performance-based tests sample the test-taker's actual use of language, but from those samples the test administrator infers general competence.
But from the results of that test, the examiner may infer a certain level of general reading ability. In the case of a proficiency test, even though the actual performance on the test involves only a sampling of skills, that domain is overall proficiency in a language: general competence in all skills of a language. Other tests may have more specific criteria. A test of pronunciation might well be a test of only a limited set of phonemic minimal pairs.
A vocabulary test may focus on only the set of words covered in a particular lesson or unit. One of the biggest obs. A well-constructed test is an instrument that provides an accurate measure of the test-taker's ability within a particular domain. The definition sounds fairly simple, but in fact, constructing a good test is a complex task involving both science and art. You might be tempted to think of testing and assessing as synonymous terms, but they are not.
Assessment, on the other hand, is an ongoing process that encompasses a much wider domain. Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the teacher subconsciously makes an assessment of the student's performance. Written work, from a jotted-down phrase to a formal essay, is performance that ultimately is assessed by self, teacher, and possibly other students.
Reading and listening activities usually ... A good teacher never ceases to assess students, whether those assessments are incidental or intended. Tests can be useful devices, but they are only one among many procedures and tasks that teachers can ultimately use to assess students. But now, you might be thinking, if you make assessments every time you teach something in the classroom, does all teaching involve assessment?
The answer depends on your perspective. Teaching sets up the practice games of language learning. A diagram of the relationship among testing, teaching, and assessment is found in Figure 1. How did the performance compare to previous performance? Which aspects of the performance were better than others? Is the learner performing up to an expected potential? How does the performance compare to that of others in the same learning community?
In the ideal classroom, all these observations feed into the way the teacher provides instruction to each student. Informal and Formal Assessment One way to begin untangling the lexical conundrum created by distinguishing among tests, assessment, and teaching is to distinguish between informal and formal assessment.
Examples include saying "Nice job!" Informal assessment does not stop there. Formal assessments, by contrast, are systematic, planned sampling techniques constructed to give teacher and student an appraisal of student achievement. To extend the tennis analogy, formal assessments are the tournament games that occur periodically in the course of a regimen of practice.
We can say that all tests are formal assessments, but not all formal assessment is testing. For example, you might use a student's journal or portfolio of materials as a formal assessment of the attainment of certain course objectives, but it is problematic to call those two procedures "tests."
Tests are usually relatively time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behavior. Formative and Summative Assessment Another useful distinction to ... How is the procedure to be used? Two functions are commonly identified in the literature: formative and summative assessment. For all practical purposes, virtually all kinds of informal assessment are (or should be) formative.
They have as their primary focus the ongoing development of the learner's language. So when you give a student ... Summative assessment aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course or unit of instruction. Final exams in a course and general proficiency exams are examples of summative assessment. One of the problems with prevailing attitudes toward testing is the view that all tests (quizzes, periodic review tests, midterm exams, etc.) are summative.
You may have thought, "Whew! I'm glad that's over. Now I don't have to remember that stuff anymore!" Norm-Referenced and Criterion-Referenced Tests Another dichotomy that is important to clarify here and that aids in sorting out common terminology in assessment is the distinction between norm-referenced and criterion-referenced testing.
The purpose in such tests is to place test-takers along a mathematical continuum in rank order. Typical of norm-referenced tests are standardized tests like the Scholastic Aptitude Test. Such tests must have fixed, predetermined responses in a format that can be scored quickly at minimum expense. Money and efficiency are primary concerns in these tests. Criterion-referenced tests, on the other hand, are designed to give test-takers ... Here, much time and effort on the part of the teacher (test administrator) ...
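To make the contrast between the two kinds of interpretation concrete, here is a minimal Python sketch. It assumes one common definition of percentile rank (the percentage of the norm group scoring below a given score, with ties counted as half); other conventions exist, and the function names, norm group, and cut score are hypothetical illustrations, not procedures prescribed in this book.

```python
from typing import Sequence


def percentile_rank(score: float, norm_group: Sequence[float]) -> float:
    """Norm-referenced interpretation: where does `score` fall relative to others?

    Uses the 'below plus half of ties' convention (one of several in use).
    """
    if not norm_group:
        raise ValueError("norm group must not be empty")
    below = sum(1 for s in norm_group if s < score)
    ties = sum(1 for s in norm_group if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)


def meets_criterion(score: float, cut_score: float) -> bool:
    """Criterion-referenced interpretation: compare against a fixed standard,
    ignoring how anyone else performed."""
    return score >= cut_score


if __name__ == "__main__":
    norm_group = [52, 61, 61, 70, 75, 83, 90]  # hypothetical scores
    print(percentile_rank(70, norm_group))       # 50.0
    print(meets_criterion(70, cut_score=75))     # False
```

With this hypothetical norm group, a raw score of 70 sits at the 50th percentile, which is a statement about rank only; whether the same 70 "passes" depends entirely on the cut score chosen for the criterion.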
In the s and s, ... These approaches still prevail today, even if in mutated form. Discrete-point tests are constructed on the assumption that ... Such an approach demanded ... So, as the profession emerged into an era emphasizing communication, authenticity, and context, new approaches were sought. Oller argued that language competence is a unified set of interacting abilities that cannot be tested separately. His claim was that communicative competence is so global and requires such integration (hence the term "integrative" testing) that it cannot be captured in additive tests of grammar, reading, vocabulary, and other discrete points of language.
What does an integrative test look like? A cloze test is a reading passage in which roughly every sixth or seventh word has been deleted; the test-taker is required to supply words that fit into those blanks. See Chapter 8 for a full discussion of cloze testing. Oller ...
1 Frequent references are made in this book to companion volumes by the author. Principles of Language Learning and Teaching (PLLT), Fourth Edition, is a basic teacher reference book on essential foundations of second language acquisition on which pedagogical practices are based.
According to theoretical constructs underlying this claim, the ability to supply appropriate words in blanks requires a number of abilities that lie at the heart of competence in a language. Dictation is a familiar language-teaching technique that evolved into a testing technique. Essentially, learners listen to a passage read aloud by an administrator (or audiotape) and write what they hear, using correct spelling. The listening ... See Chapter 6 for more discussion of dictation as an assessment device.
Reliability of scoring criteria for dictation tests can be improved by designing multiple-choice or exact-word cloze test scoring.
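The fixed-ratio deletion used to build a cloze passage, and the contrast between exact-word and more lenient scoring, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: words are split on whitespace, every nth word after an optional intact lead-in is deleted, punctuation attached to a deleted word stays in the text, and the `acceptable` dictionary is a hypothetical stand-in for whatever list of alternative answers a teacher decides to accept.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# A token is optional leading punctuation, a word (letters, digits, apostrophes,
# hyphens), and optional trailing punctuation.
_TOKEN = re.compile(r"^(\W*)(\w[\w'-]*)(\W*)$")


def make_cloze(passage: str, n: int = 7, lead_in: int = 0) -> Tuple[str, List[str]]:
    """Delete every nth word after `lead_in` untouched words.

    Returns the gapped passage and the answer key (the deleted words).
    Tokens with no word characters (e.g. a lone dash) are left in place.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    out: List[str] = []
    answers: List[str] = []
    for i, tok in enumerate(passage.split()):
        position = i - lead_in + 1  # 1-based count of words after the lead-in
        match = _TOKEN.match(tok)
        if i >= lead_in and position % n == 0 and match:
            pre, word, post = match.groups()
            answers.append(word)
            out.append(f"{pre}({len(answers)})____{post}")
        else:
            out.append(tok)
    return " ".join(out), answers


def score_cloze(responses: Iterable[str], answers: List[str],
                acceptable: Optional[Dict[int, Iterable[str]]] = None) -> int:
    """Exact-word scoring by default; `acceptable` maps a 0-based blank index
    to extra answers the scorer has decided to accept."""
    acceptable = acceptable or {}
    score = 0
    for i, (resp, key) in enumerate(zip(responses, answers)):
        allowed = {key.lower()} | {w.lower() for w in acceptable.get(i, ())}
        if resp.strip().lower() in allowed:
            score += 1
    return score
```

For example, with `n=7` a fourteen-word passage loses its 7th and 14th words. Exact-word scoring is quick and consistent across scorers, while an acceptable-word list credits sensible alternatives at the cost of scorer judgment.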
Proponents of integrative test methods soon centered their arguments on what became known as the unitary trait hypothesis. Others argued strongly against the unitary trait position.
In a study of students in Brazil and the Philippines, Farhady found significant and widely varying differences in performance on an ESL proficiency test, depending on subjects' native country, major field of study, and graduate versus undergraduate status. For example, Brazilians scored very low in listening comprehension and relatively high in reading comprehension.
Filipinos, whose scores on five of the six components of the test were considerably higher than Brazilians' scores, were actually lower than Brazilians in reading comprehension scores.
Farhady's contentions were supported in other research that seriously questioned the unitary trait hypothesis. Bachman and Palmer, p. As Weir, p. They do not tell us anything directly about a student's performance ability. Bachman and Palmer, pp. All elements of the model, especially pragmatic and strategic abilities, needed to be included in the constructs of language testing and in the actual performance required of test-takers.
Weir, p. See Skehan for a survey of communicative testing research. Instead of just offering paper-and-pencil selective response tests of a plethora of separate items, performance-based assessment of language typically ... To be sure, such assessment is time-consuming and therefore expensive, but those extra efforts are paying off in the form of more direct testing because students are assessed as they perform actual or simulated real-world tasks.
In technical terms, higher content validity (see Chapter 2 for an explanation) is achieved because learners are meas... If you rely a little less on formally structured tests and a little more on evaluation while students are performing various tasks, you will be taking some steps toward meeting the goals of performance-based testing.
See Chapter 10 for a further discussion of performance-based assessment. A characteristic of many but not all performance-based language assessments is the presence of interactive tasks. In such cases, the assessments involve learners in actually performing the behavior that we want to measure. In interactive tasks, ... The test-taker is required to listen accurately to someone else and to respond appropriately. If care is taken in the test design process, language elicited and volunteered by the student can be personalized and meaningful, and tasks can approach the authenticity of real-life language use (see Chapter 7).
Such efforts to improve various facets of classroom testing are accompanied by some stimulating issues, all of which are helping to shape our current understanding of effective assessment. Let's look at three such issues: new views on intelligence, traditional versus alternative assessment, and computer-based testing. New Views on Intelligence Intelligence was once viewed strictly as the ability to perform (a) linguistic and (b) logical-mathematical problem solving. This "IQ" (intelligence quotient) concept of intelligence has permeated the Western world and its way of testing for almost a century.
However, research on intelligence by psychologists like Howard Gardner, Robert Sternberg, and Daniel Goleman has begun to turn the psychometric world upside down. Gardner, for example, extended the traditional view of intelligence to seven different components. All "smart" people aren't necessarily adept at fast, reactive thinking. Other forms of smartness are found in those who know how to manipulate their environment, namely, other people.
More recently, Daniel Goleman's concept ... Anger, grief, resentment, self-doubt, and other feelings can easily impair peak performance in everyday tasks as well as higher-order problem solving.
These new conceptualizations of intelligence have not been universally accepted by the academic community (see White, for example).
Nevertheless, their intuitive appeal infused the decade of the s with a sense of both freedom and responsibility in our testing agenda. Coupled with parallel educational reforms at the time (Armstrong), they helped to free us from relying exclusively on ... We were prodded ... Our challenge was to test interpersonal, creative, communicative, interactive skills, and in doing so to place some trust in our subjectivity and intuition.
2 For a summary of Gardner's theory of intelligence, see Brown.
Table 1. First, the concepts in Table 1. Many forms of assessment fall in between the two, and some combine the best of both. As ... At the same time, we might all be stimulated to look at the right-hand list and ask ourselves if, among those concepts, there are alternatives ... It should ... See Chapter 10 for a complete treatment of alternatives in assessment. In Chapter 4, issues surrounding standardized testing are addressed at length.
Computer-Based Testing Recent years have seen a burgeoning of assessment in which the test-taker performs responses on a computer. Others are standardized, large-scale tests in which thousands or even ... A computer-adaptive test (CAT) starts with questions of moderate difficulty. As test-takers answer each question, the computer scores the question and uses that information, as well as the responses to previous questions, to determine which question will be presented next.
As long as examinees respond correctly, the computer typically selects questions of greater or equal difficulty. Incorrect answers, however, typically bring questions of lesser or equal difficulty. In CATs, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one. As a result, test-takers cannot skip questions, and once they have entered and confirmed their answers, they cannot return to questions or to any earlier part of the test.
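The item-selection loop just described can be sketched as a simple up-and-down ("staircase") procedure. Treat this as an assumption-laden simplification: operational CATs typically estimate ability with item response theory rather than stepping one difficulty level at a time, and the five-level item bank, the `Item` class, `run_cat`, and the simulated test-taker are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Item:
    item_id: str
    difficulty: int  # hypothetical scale: 1 (easiest) to 5 (hardest)


def run_cat(bank: Dict[int, List[Item]], answer: Callable[[Item], bool],
            num_items: int, start_level: int = 3) -> List[Tuple[str, int, bool]]:
    """Administer up to `num_items` items one at a time, adapting difficulty."""
    levels = sorted(bank)
    remaining = {lvl: list(items) for lvl, items in bank.items()}  # copy; bank untouched
    target = start_level  # begin with questions of moderate difficulty
    record: List[Tuple[str, int, bool]] = []
    for _ in range(num_items):
        available = [lvl for lvl in levels if remaining[lvl]]
        if not available:
            break
        # Use the target level if items remain there, else the nearest level
        # (ties go to the easier level).
        level = min(available, key=lambda lvl: (abs(lvl - target), lvl))
        item = remaining[level].pop(0)  # once given, an item is never offered again
        correct = answer(item)  # scored before the next item is chosen
        record.append((item.item_id, item.difficulty, correct))
        # Correct -> harder (or equal, at the top); incorrect -> easier (or equal).
        target = min(level + 1, levels[-1]) if correct else max(level - 1, levels[0])
    return record
```

For instance, a simulated test-taker who answers every item of difficulty 3 or below correctly, starting at level 3, would oscillate between levels 3 and 4, which is the kind of zeroing-in on ability that the adaptive procedure is meant to achieve.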
Computer-based testing, with or without CAT technology, offers these advantages: ... Among them: ... More is said about computer-based testing in subsequent chapters, especially Chapter 4, in a discussion of large-scale standardized testing. Educational Testing Service www. This need not be the case. Computer technology can be a boon to communicative language testing. Teachers and test-makers of the future will have access to an ever-increasing range of tools to safeguard against impersonal, stamped-out formulas for assessment.
Assessment is an integral part of the teaching-learning cycle. Tests, which are a subset of assessment, can provide authenticity, motivation, and feedback to the learner.
Keep in mind these basic principles: Assessments can confirm areas of strength and pinpoint areas needing further work. Assessments can spur learners to set goals for themselves. Assessments can aid in evaluating teaching effectiveness.
Answers to the vocabulary quiz: ...
I = Individual work; G = Group or pair work; C = Whole-class discussion.
(G) In a small group, look at Figure 1. Do you agree with this diagrammatic depiction of the three terms? Consider the following classroom teaching techniques: ... What proportion of each has an assessment facet to it?
Share your conclusions with the rest of the class. If norm-referenced tests typically yield a. Why did Oller back down from the unitary trait hypothesis?
(I/C) Why are cloze and dictation considered to be integrative tests? (G) Look at the list of Gardner's seven intelligences. Share your results with other groups. Then decide which of those tasks are performance-based, which are not, and which ones fall in between.
(G) Table 1. In pairs, quickly review the advantages and disadvantages of each, on both sides of the chart. Language testing. Oxford University Press. It is a useful little reference book to check your understanding of testing jargon and issues in the field. Mousavi, Seyyed Abbas. An encyclopedic dictionary of language testing. Third Edition. Tung Hua Book Company.
Its exhaustive page bibliography is also downloadable at http: Rahnama Publications.
In this chapter, these principles will be used to evaluate an existing, previously published, or created test. Chapter 3 will center on how to use those principles to design a good test. How do you know if a test is effective? For the most part, that question can be answered by responding to such questions as: Can it be given within appropriate administrative constraints?
Is it dependable? Does it accurately measure what you want it to measure? These and other questions help to identify five cardinal criteria for "testing a test": practicality, reliability, validity, authenticity, and washback. We will look at each one, but with no priority order implied in the order of presentation.
A test that is prohibitively expensive is impractical. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can be scored only by computer is impractical if the test takes place a thousand miles away from the nearest computer.
The students arrived, test booklets were distributed, and directions were given. The proctor started the tape. Soon students began to look puzzled. By the time the tenth item played, everyone looked bewildered.
The students responded reasonably well. When the red-faced administrator and the proctor got together later to score the tests, they faced the problem of how to score the dictation, a more subjective process than some other forms of assessment (see Chapter 6).
After a lengthy exchange, the two established a point system, but after the first few papers had been scored, it was clear that the point system needed revision. That meant going back to the first papers to make sure the new system was followed. Students were told to come back the next morning for their results. Later that evening, having combined dictation scores and the 50-item multiple-choice scores, the two frustrated examiners finally arrived at placements for all students.
It's easy to see what went wrong here. Then, they established a scoring procedure that did not fit into the time constraints. Student-Related Reliability The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an "observed" score deviate from one's "true" score.
Inter-rater unreliability occurs when ... In the story above about the placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards. Rater-reliability issues are not limited to contexts where two or more scorers are involved.
Intra-rater unreliability is a common occurrence for classroom ... When I am faced with up to 40 tests to grade in only a week, I know that the standards I apply (however subliminally) to the first few tests will be different from those I apply to the last few. One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, then to recycle back through the whole set of tests. In tests of writing skills, rater reliability is particularly hard to achieve since writing proficiency involves numerous traits that are difficult to define.
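Rater consistency of this kind is often summarized numerically. The Python sketch below computes two common indices for a pair of score lists: the proportion of exact agreement and the Pearson correlation. For an inter-rater check the two lists would come from two scorers; for an intra-rater check they could come from one teacher's first and second pass over the same papers. The choice of indices and the function names are assumptions for illustration, not a procedure this book prescribes.

```python
from math import sqrt
from typing import Sequence


def exact_agreement(a: Sequence[float], b: Sequence[float]) -> float:
    """Proportion of papers on which the two sets of scores match exactly."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation: do the two scorings rank and space papers alike?"""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one rater gives identical scores")
    return cov / sqrt(var_a * var_b)
```

The two indices can disagree: if one rater is consistently one point harsher than the other, the correlation is perfect while exact agreement is zero, which is one reason to look at both before trusting a set of scores.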
The careful specification of an analytical scoring ... (Brown). Test Administration Reliability Unreliability may also result from the conditions in which the test is administered. This was a clear case of unreliability caused by the conditions of the test administration. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly.
Timed tests may discriminate against students who do not perform well on a test with a time limit. We all know people (and you may be included in this category!) ... A valid test of reading ability actually measures reading ability. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable).
How is the validity of a test established? There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support. In some cases, it may be appropriate to ex...
We will look at these five types of evidence below. A test that requires the learner actually to speak within some sort of authentic context does. And if a course has perhaps ten objectives but only two are covered in a test, then content validity suffers.
Consider the following quiz on English articles for a high-beginner level of a conversation class (listening and speaking) for English learners.
English articles quiz
Directions: The purpose of this quiz is for you and me to find out how well you know and can apply the rules of article usage. Read the following passage and write a/an, the, or Ø (no article) in each blank.
Last night, I had (1) _____ very strange dream. You know how much I love (3) _____ zoos. Well, I dreamt that I went to (4) _____ San Francisco zoo with (5) _____ few friends.
When we got there, it was very dark, but (6) _____ moon was ... The story continues, with a total of 25 blanks to fill. Because this quiz uses a familiar setting and focuses on previously practiced language forms, it is somewhat content valid.
Another way of understanding content validity is to consider the difference between direct and indirect testing. A direct test of syllable production would have to require that students actually produce ... Consider, for example, a listening ... The test on that unit should include all of the above discourse and grammatical elements and involve students in the actual performance of listening and speaking.
Therefore, it is critical that teachers hold content-related evidence in high esteem in the process of defending the validity of classroom tests. Criterion-Related Evidence A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the "criterion" of the test has actually been reached.
In the case of teacher-made classroom assessments, criterion-related evidence is best demonstrated through a ... For example, in a course unit whose objective is for students to be able to orally produce voiced and voiceless stops in all possible phonetic environments, the results of one teacher's unit test might be compared with an independent assessment (possibly a commercially produced test in a textbook) of the same phonemic proficiency.
A classroom test designed to assess mastery of a point of grammar in communicative use will have criterion validity if test scores are corroborated either by observed subsequent behavior or by other communicative measures of the grammar point in question. Criterion-related evidence usually falls into one of two categories: concurrent and predictive validity. A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself.
The predictive validity of an assessment becomes important in the case of placement tests, admissions assessment batteries, language aptitude tests, and the like. The assessment criterion in such cases is not to measure concurrent ability but to assess and predict a test-taker's likelihood of future success. Construct-Related Evidence A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity.
A construct is any theory, hypothesis, or model that attempts to explain observed ... Virtually every issue in language learning and teaching involves theoretical constructs. But don't let the concept of construct validity scare you. An informal construct validation of the use of virtually every classroom test is both essential and feasible. Imagine, for example, that you have been given a procedure for conducting an oral interview.
... language use. Because of the crucial need to offer a financially affordable proficiency test and the high cost of administering and scoring oral-production tests, the omission of oral content from the TOEFL has been justified. Messick ... As high-stakes assessment has gained ground in the last two decades, one aspect of consequential validity has drawn special attention: ... McNamara, p.
Another important consequence of a test falls into the category of washback, to be more fully discussed below (Gronlund). Face Validity An important facet of consequential validity is the extent to which "students view the assessment as fair, relevant, and useful for improving learning" (Gronlund). They may feel, for a variety of reasons, that a test isn't testing what it is "supposed" to test. Face validity means that the students perceive the test to be valid.
Face validity asks the question "Does the test, on the 'face' of it, appear from the learner's perspective to test what it is designed to test?" Remember, face validity is not something that can be empirically tested by a teacher or even by a testing expert. For this reason, some assessment experts (see Stevenson) view face validity as a superficial factor that is dependent on the whim of the perceiver.
The other side of this issue reminds us that the psychological state of the learner (confidence, anxiety, etc.) is an important ingredient in peak performance. Students can be distracted and their anxiety increased if you "throw a curve" at them on a test.
They need to have rehearsed test tasks before the fact and feel comfortable with them. A classroom test is not the time to introduce new tasks, because you won't know if student difficulty is a factor of the task itself or of the objectives you are testing. Some learners were upset because such tests, on the face of it, did not appear to them to test their true abilities in English. As it turned out, the tests served as superior instruments for placement, but the students would not have thought so.
As already noted above, content validity is a very important ingredient in achieving face validity. If a test samples the actual content of what the learner has achieved or expects to achieve, then face validity will be more likely to be perceived.
If in your language teaching you can attend to the practicality, reliability, and validity of tests of language, whether those tests are classroom tests related to a part of a lesson, final exams, or proficiency tests, then you are well on the way to making accurate judgments about the competence of the learners with whom you are working. (Bachman and Palmer, p.) Essentially, when you make a claim for authenticity in a test task, you are saying that this task is likely to be enacted in the "real world."
They may be contrived or artificial in their attempt to target a grammatical form or a lexical item. The sequencing of items that bear no relationship to one another lacks authenticity. One does not have to look very long to find reading comprehension passages in proficiency tests that do not reflect a real-world passage.
In a test, authenticity may be present. The authenticity of test tasks in recent years has increased noticeably. Things have changed. You are invited to take up the challenge of authenticity in your classroom tests. As we explore many different types of task in this book, especially in Chapters 6 through 9, the principle of authenticity will be very much in the forefront.
In large-scale assessment, washback generally refers to the effects the tests have on instruction in terms of how students prepare for the test. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Formal tests can also have positive washback, but they provide no washback if the students receive a simple letter grade or a single overall numerical score. The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved.
Students' incorrect responses can become windows of insight into further work. Teachers can suggest strategies for success as part of their "coaching" role. Washback enhances a number of basic principles of language acquisition. One way to enhance washback is to comment generously and specifically on test performance. Many overworked (and underpaid!) teachers return tests with nothing more than a letter grade or a numerical score. In reality, letter grades and numerical scores give absolutely no information of intrinsic interest to the student.
Even if your evaluation is not a neat little paragraph appended to the test, you can respond to as many details throughout the test as time will permit. Give praise for strengths (the "good stuff") as well as constructive criticism of weaknesses. In other words, take some time to make the test performance an intrinsically motivating experience from which a student will gain a sense of accomplishment and challenge.
A little bit of washback may also help students through a specification of the numerical scores on the various subsections of the test, such as a subsection on verb tenses. Formative tests, by definition, provide washback in the form of information to the learner on progress toward goals. But teachers might be tempted to feel that summative tests, which come at the end of a course, need not provide washback. Even a final examination in a course should carry with it some means for giving washback to students.
In my courses I never give a final examination as the last scheduled classroom session; one more session follows it. At that session, the students receive scores, grades, and comments on their work, and I spend some of the class session addressing material on which the students were not completely clear. My summative assessment is thereby enhanced by some beneficial washback that is usually not expected of final examinations. Finally, washback also implies that students have ready access to you to discuss the feedback and evaluation you have given.
Quizzes, tests, final exams, and standardized proficiency tests can all be scrutinized through these five lenses. Are there other principles that should be invoked in evaluating and designing assessments?
The answer is yes. Language assessment is an extraordinarily broad discipline with many branches, interest areas, and issues. The process of designing effective assessment instruments is far too complex to be reduced to five principles. Good test construction, for example, is governed by research-based rules of test preparation, sampling of tasks, item design and construction, scoring responses, ethical standards, and so on. But the five principles cited here serve as an excellent foundation on which to evaluate existing instruments and to build your own.
We will look at how to design tests in Chapter 3 and at standardized tests in Chapter 4. The questions that follow here, indexed by the five principles, will help you evaluate existing tests for your own classroom. It is important for you to remember, however, that the sequence of these questions does not imply a priority order.
Validity, for example, is certainly the most significant cardinal principle of assessment evaluation. Practicality may be a secondary issue in classroom testing.
When all is said and done, however, if validity is not substantiated, all other considerations may be rendered useless. Practicality is determined by the teacher's and the students' time constraints, costs, and administrative details, and to some extent by what occurs before and after the test. To determine whether a test is practical for your needs, you may want to use the checklist below.
Practicality checklist
1. Are administrative details clearly established before the test?
2. Can students complete the test reasonably within the set time frame?
3. Can the test be administered smoothly, without procedural "glitches"?
4. Are all materials and equipment ready?
5. Is the cost of the test within budgeted limits?
6. Are methods for reporting results determined in advance?

As this checklist suggests, after you account for the administrative details of giving a test, you need to think about the practicality of your plans for scoring the test. If you need to tailor a test to fit your own time frame, as teachers frequently do, you need to accomplish this without damaging the test's validity and washback.
Teachers should, for example, avoid the temptation to offer only quickly scored multiple-choice selection items that may be neither appropriate nor well-designed.
Everyone knows teachers secretly hate to grade tests almost as much as students hate to take them! Part of achieving test reliability depends on the physical context of the test administration. Since classroom tests rarely involve two scorers, inter-rater reliability is seldom an issue; the more pressing concern is intra-rater reliability, that is, whether a single teacher applies the same standards consistently. It is easy to let mentally established standards erode over the hours you require to evaluate the test.
Does the procedure demonstrate content validity?
The major source of validity in a classroom test is content validity. There are two steps to evaluating the content validity of a classroom test.
Are classroom objectives identified and appropriately framed? Sometimes this is easier said than done. Too often teachers work through lessons day after day with little or no cognizance of the objectives they seek to fulfill.
Or perhaps those objectives are so poorly framed that determining whether or not they were accomplished is impossible. Students should be able to demonstrate some reading comprehension. To practice vocabulary in context.
Students will have fun through a relaxed activity and thus enjoy their learning. Only the last objective is framed in a form that lends itself to assessment. In (a), the modal should is ambiguous and the expected performance is not stated. In (b), everyone can fulfill the act of "practicing"; no standards are stated or implied. For obvious reasons, (c) cannot be assessed. And (d) is really just a teacher's note on the type of activity to be used.
By specifying acceptable and unacceptable levels of performance, the goal can be tested. Are lesson objectives represented in the form of test specifications? Don't let this word scare you. It simply means that a test should have a structure that follows logically from the lesson or unit you are testing.
Some tests, of course, do not lend themselves to this kind of structure. We will return to the concept of test specs in the next chapter. The content validity of an existing classroom test should be apparent in how the objectives of the unit being tested are represented in the form of the content of items, clusters of items, and item types. Do you clearly perceive the performance of test-takers as reflective of the classroom objectives? If so, and you can argue this, content validity has probably been achieved.
In evaluating a classroom test, consider the extent to which before-, during-, and after-test options are fulfilled.

Test-taking strategies

Before the Test
1. Give students all the information you can about the test: Which topics will be the most important? What kind of items will be on it? How long will it be?
2. Encourage students to do a systematic review of material.
3. Give them practice tests or exercises, if available.
4. Facilitate formation of a study group, if possible.
5. Caution students to get a good night's rest before the test.
6. Remind students to get to the classroom early.

During the Test
1. After the test is distributed, tell students to look over the whole test quickly in order to get a good grasp of its different parts.
2. Remind them to mentally figure out how much time they will need for each part.
3. Advise them to concentrate as carefully as possible.
4. Warn students a few minutes before the end of the class period so that they can finish on time, proofread their answers, and catch careless errors.

After the Test
1. When you return the test, include feedback on specific things the student did well, what he or she did not do well, and, if possible, the reasons for your comments.
2. Advise students to pay careful attention in class to whatever you say about the test results.
3. Encourage questions from students.
4. Advise students to pay special attention in the future to points on which they are weak.

Keep in mind that what comes before and after the test also contributes to its face validity. Good class preparation will give students a comfort level with the test, and good feedback (washback) will allow them to learn from it.
Are the test tasks as authentic as possible?
Evaluate the extent to which a test is authentic by asking the following questions:

Multiple-choice tasks (contextualized)
"Going To"
1. What ____ this weekend?
2. I'm not sure.
   a. Are you going to do
   b. You are going to do
3. My friend Melissa and I ____ a party. Would you like to come?
4. I'd love to! ____
   a. What's it going to be?
   b. Who's going to be?
   c. Where's it going to be?
5. It is ____ to be at Ruth's house.

Multiple-choice tasks (decontextualized)
1. There are three countries I would like to visit. One is Italy.
   a. The other is New Zealand and other is Nepal.
   b. The others are New Zealand and Nepal.
   c. Others are New Zealand and Nepal.
2. When I was twelve years old, I used ____ every day.
3. When Mr.
4. Since the beginning of the year, I ____ at Millennium Industries.
5. When Mona broke her leg, she asked her husband ____ her to work.

The conversation is one that might occur in the real world, even if with a little less formality. The sequence of items in the decontextualized tasks takes the test-taker into five different topic areas with no context for any. Each sentence is likely to be written or spoken in the real world, but not in that sequence. Given the constraints of a multiple-choice format, on a measure of authenticity I would say the first excerpt is "good" and the second excerpt is only "fair."
The design of an effective test should point the way to beneficial washback. A test that achieves content validity demonstrates relevance to the curriculum in question and thereby sets the stage for washback.
Testing in Language Programs, Chapter 8 (Dr. Golshan)
A test should have:
Reliability: the same result under the same conditions. Tests should measure consistently!
Validity: a scale meant to measure the size of a head should measure head size, not something else.
Usability or practicality: not too difficult; practical.

Variance:
Variance of zero: identical values.
Small variance: values close to the mean (expected value).
High variance: values spread out, far from the mean.
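The three variance cases on the slide can be made concrete with a small calculation. The sketch below is a minimal Python illustration, not part of the original slides: the three score sets are hypothetical, and it assumes population variance (the mean of the squared deviations from the mean), computed with the standard-library `statistics.pvariance`.

```python
from statistics import mean, pvariance

# Hypothetical scores for five test-takers (illustrative only, not source data).
SCORE_SETS = {
    "zero variance": [70, 70, 70, 70, 70],    # identical values
    "small variance": [68, 69, 70, 71, 72],   # values close to the mean
    "high variance": [40, 55, 70, 85, 100],   # values spread out, far from the mean
}


def summarize(scores):
    """Return (mean, population variance) for a list of scores.

    Population variance is the average squared distance of each
    score from the mean of the set.
    """
    return mean(scores), pvariance(scores)


if __name__ == "__main__":
    for label, scores in SCORE_SETS.items():
        m, v = summarize(scores)
        print(f"{label:15s} mean={m:.1f} variance={v:.1f}")
```

All three sets share a mean of 70, so only the spread differs: the variances come out to 0, 2, and 450. Sample variance (`statistics.variance`, which divides by n - 1) would give somewhat larger values for the last two sets; which one a given test analysis uses is a choice the slides do not address.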
Test validity issue: to reach the goal of the test, watch for variance from other factors unrelated to the aim of the test.
Variance due to the environment: noise, classroom temperature, outside noises, distractions, amount of space per person, lighting, ventilation, or other environmental factors.
Variance due to the administration procedure: directions of the test, quality of equipment, and timing (cassette or teacher). See Table 2.
Variance due to examinees: condition of students; psychological factors.
Variance due to scoring procedure: errors in scoring; subjective nature of the scoring procedure.
Variance due to test and test items: