Requirements of a “good” test
Practicality
A test ought to be practical — within the means of financial limitations, time constraints, ease of administration, and scoring and interpretation. A test of language proficiency that takes a student 10 hours to complete is impractical. A test that requires individual one-to-one proctoring is impractical for a group of 500 people and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for the examiner to correct is impractical for a large number of testees and one examiner if results are expected within a short time. A test that can be scored only by computer is impractical if the test takes place a thousand miles away from the nearest computer. The value and quality of a test are dependent upon such nitty-gritty, practical considerations.
Teachers need to be able to make clear and useful interpretations of test data in order to understand their students better. A test that is too complex or too sophisticated may not be of practical use to the teacher.
Reliability
A reliable test is a test that is consistent and dependable. Sources of unreliability may lie in the test itself or in the scoring of the test, known respectively as test reliability and rater (or scorer) reliability. If you give the same test to the same subject or matched subjects on two different occasions, the test itself should yield similar results; it should have test reliability. I once witnessed the administration of a test of aural comprehension in which a tape recorder played items for comprehension, but because of street noise outside the testing room, some students were prevented from hearing the tape accurately. That was a clear case of test unreliability. Sometimes a test yields unreliable results because of factors beyond the control of the test writer, such as illness, a "bad day," or no sleep the night before.
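As an illustration (not from the text itself), test-retest reliability of the kind described above is commonly estimated as the correlation between the scores the same students earn on two administrations of the test. The sketch below uses invented scores and a plain Pearson correlation; a value near 1.0 suggests the test yields consistent results.

```python
# Hypothetical sketch: estimating test-retest reliability as the Pearson
# correlation between two administrations of the same test. The scores are
# invented for illustration only.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

first_sitting = [72, 85, 90, 64, 78]    # hypothetical scores, occasion 1
second_sitting = [70, 88, 91, 60, 80]   # same students, occasion 2

r = pearson(first_sitting, second_sitting)
print(f"test-retest reliability estimate: r = {r:.2f}")
```

A noisy testing room like the one described above would show up here as a markedly lower correlation between the two sittings.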
Scorer reliability is the consistency of scoring by two or more scorers. If very subjective techniques are employed in the scoring of a test, one would not expect to find high scorer reliability. A test of authenticity of pronunciation in which the scorer is to assign a number between one and five might be unreliable if the scoring directions are not clear. If scoring directions are clear and specific as to the exact details the judge should attend to, then such scoring can become reasonably consistent and dependable. In tests of writing skills, scorer reliability is not easy to achieve, since writing proficiency involves numerous traits that are difficult to define.
Validity
By far the most complex criterion of a good test is validity, the degree to which the test actually measures what it is intended to measure. A valid test of reading ability is one that actually measures reading ability and not previous knowledge in a subject, or some other variable of questionable relevance. To measure writing ability, one might conceivably ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be practical and reliable; the test would be easy to administer, and the scoring quite dependable. But it would hardly constitute a valid test of writing ability unless some consideration were given to the communication and organization of ideas, among other factors.
How does one establish the validity of a test? Validity can only be established by observation and theoretical justification. There is no final, absolute, and objective measure of validity. We have to ask questions that give us convincing evidence that a test accurately and sufficiently measures the testee for the particular purpose, objective, or criterion of the test.
In tests of language, validity is supported most convincingly by subsequent personal observation of teachers and peers. The validity of a high score on the final exam of a foreign language course will be substantiated by "actual" proficiency in the language (if the claim is that a high score is indicative of high proficiency). A classroom test designed to assess mastery of a point of grammar in communicative use will have validity if test scores correlate either with observed subsequent behavior or with other communicative measures of the grammar point in question.
How can teachers be somewhat assured that a test, whether it is a standardized test or one which has been constructed for classroom use, is indeed valid? The technical procedures for validating tests are complex and require specialized knowledge. But two major types of validation are important for classroom teachers: content validity and construct validity.
Content Validity
If a test actually samples the class of situations, that is, the universe of subject matter about which conclusions are to be drawn, it is said to have content validity. The test actually involves the testee in a sample of the behavior that is being measured. You can usually determine content validity observationally if you can clearly define the achievement that you are measuring. If you are trying to assess a person's ability to speak a second language in a conversational setting, a test that asks the learner to answer paper-and-pencil multiple-choice questions requiring grammatical judgments does not achieve content validity. A test that requires the learner actually to speak within some sort of authentic context does.
A concept that is very closely related to content validity is face validity, which asks the question: does the test, on the "face" of it, appear to test what it is designed to test? Face validity is very important from the learner's perspective. To achieve "peak" performance on a test, a learner needs to be convinced that the test is indeed testing what it claims to test. Face validity is almost always perceived in terms of content: if the test samples the actual content of what the learner has achieved or expects to achieve, then face validity will be perceived.
Construct Validity
A second category of validity that teachers must be aware of in considering language tests is construct validity. One way to look at construct validity is to ask the question: does this test actually tap into the theoretical construct as it has been defined? "Proficiency" is a construct. "Communicative competence" is a construct. "Self-esteem" is a construct. Virtually every such theoretical category is a construct. A teacher, then, needs to be satisfied that a particular test adequately measures the construct it claims to tap.