The Invalidity of the Wechsler Adult Intelligence Scale-III

© Peter Zohrab 2007

(This was written as a Massey University Psychology essay with a maximum word-length -- including title page -- of 3,000 words, with the title: Critical Review of the Wechsler Adult Intelligence Scale - Third Edition. The relevance of the Wechsler family of tests to the political theme of equality is that

  1. Psychologists are very powerful in Family Law and elsewhere, and that

  2. intelligence is an important political issue, and the credibility of the way that Psychologists talk about it is therefore also important.)


Purpose and nature

Practical evaluation

Technical evaluation of psychometric properties

Research Relevant to Usefulness of Measure

Evaluation and Discussion


1. General

The Wechsler Adult Intelligence Scale - Third Edition (WAIS-III) is part of a family of Wechsler tests. "The WAIS-III is the great-grandchild of the original 1939 Wechsler-Bellevue Form I." (Kaufman and Lichtenberger 1999, p. 3) It has been the subject of extensive research, so this short review will merely present an overview, with some focus on the issue of its validity.

The WAIS-III project directors were David Tulsky and Jianjun Zhu, the publisher is Harcourt Brace & Company, and the date of publication of the test and of its normative data was 1997. Administration of the Verbal IQ, Performance IQ, and Full Scale IQ is intended to take 60-90 minutes. The test costs either US$ 914 or US$ 967, depending on the packaging (box or case) required (according to the WAIS-III WMS-III Technical Manual and the website http://harcourtassessment.com/).



2. Purpose and nature

The purpose of WAIS-III is to measure an adult's intellectual ability using a multiple aptitude battery. The test is for adults between the ages of 16 and 89 years. It is designed for use with individuals, and the battery is composed of seven performance subtests and seven verbal subtests. The overall result of a WAIS-III test is called a Full Scale IQ, but the verbal and performance components also yield their own scores -- respectively the Verbal IQ and the Performance IQ. The verbal and performance components each also have subcomponents, and these subcomponents yield scores called indices.

According to Kaufman and Lichtenberger (1999), these subtests were largely based on other researchers' work, especially the Stanford-Binet and the Army Performance Scale Examination.



3. Practical evaluation

Face validity is one of the test's strong points, because it apparently includes a wide range of intelligence-related skills. The design of the main materials (Stimulus Booklet, Block Design blocks, Picture Arrangement pictures, Object Assembly objects, and Administration and Scoring Manual) is excellent. The content is literate, clearly laid-out, and easy to use. Everything is attractive and apparently durable, except that the cover of the stimulus booklet I saw was showing its age, being slightly turned-up at the bottom. The materials seem appropriate to the age of the users.

The test is lengthy and somewhat complicated to administer. It cannot be administered by computer -- this has to be done one-on-one, human-to-human. The directions are not always completely clear, but administrators would normally be taught how to administer the test, so this should not be a real issue in most cases. According to the webpage http://harcourtassessment.com/hai/ProductLongDesc.aspx?Catalog=TPC-USCatalog&ISBN=015-8980-727&Category=Adolescents, there is a training video available for purchase.

There is computer-assisted scoring available, but scoring procedures -- while not simple -- are not really difficult. However, there is always the risk of human error because of the number of manual entries of raw scores and scaled scores that the administrator has to make. Scoring templates are used for some subtests. Index scores are generated for Verbal Comprehension, Working Memory, Perceptual Organization, and Processing Speed, as well as scores for Verbal IQ, Performance IQ and Full-Scale IQ.

According to an email received on 25 April 2007 from Harcourt Assessment Customer Service, WAIS-III requires a high level of expertise in test interpretation, and can be purchased by individuals with:

  • Licensure or certification to practice in a field related to the purchase, or

  • A doctorate degree in psychology, education, or closely related field with formal training in the ethical administration, scoring, and interpretation of clinical assessments related to the intended use of the assessment.



4. Technical evaluation of psychometric properties

(a) Norms:

Shum, O'Gorman, and Myors (2006, p. 130) state that one of the strengths of the WAIS-III is the size and representativeness of the standardisation sample used in test development. According to the Technical Manual, the WAIS-III and WMS-III (Wechsler Memory Scale -- Third Edition) normative information was based on United States standardisation samples of 2,450 individuals representative of the population of adults aged 16-89 years. A stratified, census-based sampling plan ensured that the standardisation samples included representative proportions of adults according to each selected demographic variable. The variables used for stratification were age, sex, race/ethnicity, education level, and geographic region.

According to the Technical Manual, one set of norms was produced that was representative of US Census proportions as regards all variables except age. It was based on the performance of a reference group that consisted of the participants in the standardisation sample who were between the ages of 20 and 34. The Manual recommends that this set of norms be used when clinical questions dictate comparisons of an individual's performance to that of a reference group. Another set of norms was produced that was based on age-corrected subtest scores. The Manual recommends that this set of norms be used when clinical questions dictate comparisons of an individual's performance to that of his or her age peers.



(b) Reliability

The WAIS-III exists in only one version, so the question of alternate-forms reliability does not arise. According to the Technical Manual, interscorer agreement is very high, averaging in the high .90s, and a study of the stability of WAIS-III scores found it to be adequate across time for all age-groups.

According to the Technical Manual, the reliability of each WAIS-III subtest (except Digit Symbol-Coding and Symbol Search) was estimated using a split-half procedure from the item scores from a single administration, with the correlation corrected using the Spearman-Brown formula. Since Digit Symbol-Coding and Symbol Search subtests are speeded subtests, the split-half coefficient was not considered to be a good estimate of their reliability. For that reason, test-retest stability coefficients were used as the reliability estimates for these two subtests, with the correlation being corrected for the variability of the standardisation sample.
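By way of illustration, the split-half procedure described above can be sketched in Python. The sketch correlates the scores on two halves of a subtest and then applies the Spearman-Brown correction, r_full = 2r / (1 + r), to estimate the reliability of the full-length subtest. The odd-even split of items, the function names, and the toy data are assumptions made for this example only; the source does not state how the Technical Manual divided each subtest.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Estimate reliability from one administration.

    item_scores holds one list of item scores per examinee.
    ASSUMPTION: the halves are formed from alternating items (odd-even split).
    """
    first_half = [sum(row[0::2]) for row in item_scores]
    second_half = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(first_half, second_half))


# Hypothetical item scores (1 = pass, 0 = fail) for four examinees.
toy_scores = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]
reliability = split_half_reliability(toy_scores)
```

The correction is needed because each half is only half as long as the real subtest, and shorter tests are less reliable; the uncorrected half-test correlation would therefore understate the reliability of the subtest as actually administered.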

The sample included 394 participants, with roughly 30 participants from each of the 13 age-groups. The reliability coefficients of the WAIS-III IQ scales and indexes were calculated with the formula recommended by Guilford (1954) and Nunnally (1978). The average reliability coefficients across age-groups of the subtests (except Picture Arrangement, Symbol Search and Object Assembly), which were calculated with Fisher's z transformation, range from .82 to .93. The Symbol Search subtest had a coefficient of .77, Picture Arrangement had .74, and Object Assembly had .70. The Object Assembly subtest is not included in the computation of IQ and Index scores, in part because of its low reliability for older adults.
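The averaging of coefficients by means of Fisher's z transformation, mentioned above, can be sketched as follows. Each coefficient is converted to z = atanh(r), the z values are averaged, and the mean is converted back with tanh. The use of an unweighted mean and the function name are assumptions made for this example; the source does not say whether the Technical Manual weighted the age-groups by sample size.

```python
import math


def average_coefficients(coefficients):
    """Average correlation-type coefficients via Fisher's z transformation.

    ASSUMPTION: every age-group is weighted equally.
    Each coefficient must lie strictly between -1 and 1.
    """
    z_values = [math.atanh(r) for r in coefficients]
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)


# Hypothetical coefficients for one subtest in three age-groups.
example_average = average_coefficients([0.70, 0.80, 0.90])
```

The transformation is used because correlation coefficients are not on an equal-interval scale near their upper limit, so that a simple arithmetic mean of high coefficients is somewhat too low. In the hypothetical example, the result is slightly above the arithmetic mean of .80.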



(c) Validity:

The Technical Manual (p. 75) asserts that, in order to ensure content validity, comprehensive literature reviews were undertaken, consultants were consulted, surveys were carried out, and focus groups and an advisory panel were set up. The Manual also provides considerable detail about the testing that was done of the WAIS-III's concurrent criterion-related validity.

A later section will examine the issue of construct validity in more detail. Here it suffices to state that the Technical Manual provides a lot of data on intercorrelation studies within the components of the WAIS-III itself, on factor analysis and on the ability of the WAIS-III to discriminate between the normal population and groups with various neurological disorders, alcohol-related disorders, schizophrenia, psychoeducational and developmental disorders, and deafness or hearing-impairment.



5. Research Relevant to Usefulness of Measure

There has been a vast amount of research done on WAIS-III and its predecessors, so it is beyond the scope of this review to do more than sample it -- giving a hopefully varied but unsystematic taster of the available body of research. Watkins, Campbell, Nieberding, and Hallmark (1995) conclude that the Wechsler scales are amongst the assessment procedures most frequently recommended by American clinicians for clinical students to learn about and that most clinicians still use most often what they call the "most tried and true" assessment standards, including the Wechsler scales. The WAIS-R (the immediate predecessor of WAIS-III) was the clear frontrunner in terms of frequency of use of intelligence tests. Camara, Nathan and Puente (2000) made a similar finding.

In this connection, it is worth noting that the WAIS-III Technical Manual (on page 75) states that "... because of the similarities between the WAIS-III and the WAIS-R ..., the accumulated research on the WAIS-R ... should be considered in any evaluation of the validity of the (WAIS-III)."

In Australia, Sharpley and Pain (1988) report that the Wechsler tests of intelligence were also the most valued and recommended, and in New Zealand Knight and Godfrey (1984) report that the WAIS was the test that the most hospital psychologists believed clinical psychology graduates should have had experience in administering and interpreting.

There has been a lot of factor analysis of the validity of various aspects of the WAIS-III subsequent to its publication, as was anticipated in the Technical Manual itself. It is interesting to note that such studies sometimes appear to contradict each other -- for example, Taub (2001) concluded that his evidence did not support the Verbal IQ/Performance IQ dichotomy, whereas Jones, van Schaik, and Witts (2006) conclude as follows:

...we suggest that index scores should be used with caution in individuals with low IQ (74 or less). The use of two scores (for verbal and performance domains) is justified based on the two-factor solution obtained in the current study.

Bennett (1981) investigated the effect of encouragement of examinees by administrators on measured IQ and found a significant positive correlation for Full-Scale IQ, with those who had received encouragement scoring higher than those who had not. This effect was also found for Performance IQ, but the effect for Verbal IQ was not significant. Bennett cites previous research which had also found that reinforcement of various kinds had a significant effect on academic performance and test scores. He also investigated the interaction of encouragement with examinee personality-type (Locus of Control), but he found no significant effect in this case.

With regard to the issue of encouragement, Bennett states (p. 78) that it is inevitable that some differences will arise among examiners. "These differences do not matter if the encouragement has no effect, but if that is the case, there is little point in using it." He goes on to state (p. 80):

Although the differences obtained in the present research were within the standard error of measurement of the WAIS, it must be remembered that as a result of factors mentioned above, and the fact that examiner differences were kept to a minimum, the effect found in the present study was probably a minimal one.

Heaton, Taylor and Manly (2003) investigate certain aspects of both the WAIS-III and the WMS-III, which were standardised jointly. The authors are concerned to optimise these two tests for clinical -- especially neurodiagnostic -- purposes. The use of the tests that they have in mind is for comparing the scores achieved by particular individuals with what they would have achieved if they did not have any neuropsychiatric disorder, so that the scores can be used to establish the presence or absence, nature and extent of any such disorder in that individual.

This would involve comparing test results with norms (unless the individuals concerned happened to have been recently tested prior to the suspected onset of any relevant morbidity). Confounding variables would, of course, need to be taken into account and it would be preferable therefore to have separate norms for every relevant category that an individual might fall into. The authors state that there is evidence that sex, education-level and ethnicity are relevant in this regard. However, WAIS-III only has separate norms for particular age-groups.

The authors address this problem and claim to have solved it. They investigate the effect of these variables on WAIS-III and WMS-III test scores and also the effect on score-interpretation of not taking these factors into account. They then provide new standardised scores that correct for these demographic influences, and demonstrate how these result in more accurate score-interpretations.



6. Evaluation and Discussion

The Wechsler family of tests is long-established and well-known, and has both a large amount of face-validity and professional credibility because of this. The subtests of the WAIS-III are varied and attractive, which reduces the tedium (for the examinee) which might be associated with sitting a long test, although there is evidence (Axelrod and Ryan 2000) that some examinee groups can average as long as 110 minutes to complete the full test.

One of the main strengths of the WAIS-III is the size and representativeness of the standardisation sample used in test development. However, as Kaufman and Lichtenberger (1999, p. 3) state, "The development of Wechsler's tests was not based on theory ... but instead on practical and clinical perspectives." This theoretical vacuum reflects poorly on its construct validity.

As the technical manual states (p. 75):

The validity of a test is regarded as the most fundamental and important aspect of test development... validity is the overall evaluation of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations of test scores.

The main weakness of the WAIS-III relates to the theoretical rationales which underpin its claims to validity. The Technical Manual states that Wechsler maintained throughout his career the definition of intelligence as the "capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment." From the point of view of construct validity, however, it is implausible to claim that WAIS-III measures the constructs intended by its design, if the constructs are based on the above definition, which is extremely broad. How prominently does "purposefulness" figure in the WAIS-III? Not at all, as far as I am aware. And the term "environment" is so broad that it would be implausible to suggest that sitting any test at a desk under supervision was at all relevant to assessing how an individual dealt with his environment as a whole (however that might be defined). I have seen no evidence that the subtests (derived, as they mostly were, from tests developed by other researchers) were developed to test a construct based on that definition of intelligence -- or anything like it.

Moreover, it would also be implausible to claim that users (administrators, user organisations and examinees) of the WAIS-III would generally have as broad a definition as that in mind when they purchased and/or used it in good faith to produce scores of "intelligence". It is beyond the scope of this review to investigate whether there have been or might in future be legal arguments raised in connection with the above issues.

Coolican (2005, p. 288) warns:

Note that psychologists have not discovered that intelligence has a normal distribution in the population. The tests were purposely created to fit a normal distribution, basically for research purposes and practical convenience in test comparisons.

This artificiality and pragmatism are not limited to the distribution of intelligence scores. Psychologists often apply their theories to important social purposes, and one of these purposes is to assess the "amount" of the pre-existing popular concept of "intelligence" that particular people possess. This popular concept itself is vague and understood in different ways by different ordinary people, but tests such as the WAIS-III are marketed back to ordinary people as being tests of "intelligence" (as is shown by the appearance of the word "intelligence" in the name of the test), with the implication that this is the same concept that lay people have in mind when they use that word. It might have been better to use a term such as "rational cognitive ability".

There are no substantial ethical issues involved with the WAIS-III that are not common to all psychometric tests. The one possible exception is the ethical need to resist pressures from political groups to interpret as ethical issues what are properly considered political issues related to the education or employment of particular ethnic or other groups. These need to be decided through the proper democratic political processes.



