Understanding Terminology of Scientific Research
Achievement tests are standardized assessments that measure the acquisition of knowledge and skills in academic subject areas (e.g., math, spelling, and reading), usually against age and stage of learning. See benchmarks.
In standardised testing, age equivalent scores are those which show the typical age of the norm group that obtained a similar score, such as the median (middle) score or within a specific percentile range.
Age Equivalent Score
In a norm-referenced assessment, individual students' scores are reported relative to those of the norming population. This can be done in a variety of ways, but one way is to report the average age of people who received the same score as the individual child. Thus, an individual child's score is described as being the same as that of students who are younger than, the same age as, or older than that student (e.g. a 9-year-old student may receive the same score that an average 13-year-old student does, suggesting that this student is quite advanced).
A group or series of tests or subtests administered to assess overall achievement, potential or functional abilities and skills.
See normal distribution curve.
Levels of (usually) academic performance used as checkpoints to monitor progress toward performance goals and/or academic standards as initially identified through global standardised testing such as NAPLAN (Australia’s National Assessment Program Literacy and Numeracy).
Ceiling effect is a compression of top scores on a test. That is, if an assessment item only scores to a certain level and the student is capable of performing at a higher level, her/his real ability will not be recorded.
Clackmannanshire (Clacks) study
A landmark study on the effects of reading instruction based on synthetic phonics, which led to the formal adoption of synthetic phonics as the recommended approach to the teaching of initial reading in the UK. See The Effects Of Synthetic Phonics Teaching On Reading And Spelling Attainment: A seven year longitudinal study by Rhona Johnston and Joyce Watson. See: http://www.gov.scot/Publications/2005/02/20682/52383
EXECUTIVE SUMMARY of the Clackmannanshire Study.
We report here a study of the effectiveness of a synthetic phonics programme in teaching reading and spelling. Around 300 children in Primary 1 were divided into three groups. One group learnt by the synthetic phonics method, one by the standard analytic phonics method, and one by an analytic phonics programme that included systematic phonemic awareness teaching without reference to print. At the end of the programme, the synthetic phonics taught group were reading and spelling 7 months ahead of chronological age. They read words around 7 months ahead of the other two groups, and were 8 to 9 months ahead in spelling. The other two groups then carried out the synthetic phonics programme, completing it by the end of Primary 1.
The practice of combining two or more subtest scores to create an average or composite score. For example, a reading performance score may be an average of vocabulary and reading comprehension subtest scores.
A chart used to translate test scores into different measures of performance or age equivalents (e.g., grade equivalents and percentile ranks).
Measurement is compared to an acceptable standard. The individual’s performance is compared to an objective or performance standard, not to the performance of other students. Such tests determine whether skills have been mastered; they do not compare a child’s performance to that of other children.
Raw scores are often converted to derived scores based on a normalised distribution of the raw scores, with a given mean and standard deviation. This procedure requires collection of data on a representative sample of the relevant population, usually defined in terms of age or grade level. Examples of derived scores are the z-score (mean = 0, SD=1), a deviation IQ score (mean=100, SD=15), a T-score (mean = 50 and SD= 10), and a stanine score (mean = 5 and SD= 2). A z-score of 1 is therefore equivalent to an IQ score of 115, a T-score of 60, and a stanine score of 7. A common procedure for many achievement tests is to convert raw scores to standardized scores with a mean of 100 and a SD of 15, so that a student’s performance across different tests can be compared, as well as performance relative to the norm group on which the test was standardized. Derived scores are not calibrated on an equal-interval scale, in the same way as scale scores are, so they have some limitations in terms of the statistical procedures that can be applied.
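The arithmetic behind these conversions can be sketched in a few lines of Python. This is a minimal illustration, not a published scoring procedure: it covers only the linear rescaling step and assumes a z-score is already available from the norm sample. The function names and the norm-group mean and SD in the usage note are hypothetical, and the rounded stanine is an approximation.

```python
def raw_to_z(raw, norm_mean, norm_sd):
    """Express a raw score as a z-score relative to a norm group.

    norm_mean and norm_sd are assumed to come from the norm sample.
    """
    return (raw - norm_mean) / norm_sd


def z_to_derived(z):
    """Rescale a z-score onto the derived scales described above."""
    return {
        "z": z,
        "deviation_iq": 100 + 15 * z,   # mean 100, SD 15
        "t_score": 50 + 10 * z,         # mean 50, SD 10
        # Stanines are whole numbers from 1 to 9 (mean 5, SD 2);
        # rounding the linear value is an approximation.
        "stanine": max(1, min(9, round(5 + 2 * z))),
    }
```

For example, with a hypothetical norm-group mean of 20 and SD of 3, a raw score of 23 gives `raw_to_z(23, 20, 3)` equal to 1.0, and `z_to_derived(1.0)` returns a deviation IQ of 115, a T-score of 60, and a stanine of 7, matching the equivalences stated above.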
A test used to diagnose, analyse or identify specific areas of weakness and strength of abilities and skills; to determine the nature of weaknesses or deficiencies; diagnostic achievement tests are used to measure skills, to inform instruction.
The average change in test scores that is intended to occur over a specific time for individuals at specific age or grade levels.
In standardized testing, the floor is defined as the lowest score that a test can measure reliably.
The process of gathering information using standardized, published tests or instruments in conjunction with specific administration and interpretation procedures, and used to make general instructional decisions.
Formative assessments are designed to evaluate students on a frequent basis so that adjustments can be made in instruction to help them reach target achievement goals. See Summative Assessment.
A frequency distribution is a method of graphically displaying information, such as test scores.
Grade Equivalent Scores
In a norm-referenced assessment, individual students' scores are reported relative to those of the norming population. This can be done in a variety of ways, but one way is to report the average grade of students who received the same score as the individual child. Thus, an individual child's score is described as being the same as that of students who are in higher, the same, or lower grades than that student (e.g. a student in second grade may earn the same score that an average fourth grade student does, suggesting that this student is quite advanced).
Test scores that equate a score to a particular grade level. Example: if a child scores at the average of all fifth graders tested, the child would receive a grade equivalent score of 5.0.
One's score is compared to one's own previous score on a test covering the same material in order to show that learning has occurred.
Intelligence Quotient (IQ)
IQ is a quantitative representation of cognitive ability based on standardised testing of a sample of cognitive skills.
The first measures of intelligence were devised by French psychologist Alfred Binet in 1904 for the purpose of identifying children who needed special help in coping with the school curriculum. The tests were initially used for the identification of ‘retarded’ children for placement in special schools. Working with his collaborator Theodore Simon, Binet published revised versions of what became known as the Binet-Simon scale in 1908 and 1911. Scores on the Binet-Simon tests were expressed in the form of a mental age, based on the average score according to chronological age.
The term "IQ" was coined by the German psychologist William Stern in 1912 as a proposed method of scoring intelligence tests such as those developed by Binet and Simon. The IQ, or ‘Intelligence Quotient’, is calculated as Mental Age divided by Chronological Age, multiplied by 100. The first use of the IQ to report scores on an intelligence test was in the case of the 1916 Stanford-Binet Intelligence Scale, developed by Lewis M. Terman of Stanford University. This test provided the model for most intelligence tests still in use today, which are generally referred to as IQ tests.
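The ratio formula described above (Mental Age divided by Chronological Age, multiplied by 100) can be written as a short Python sketch. The function name is an illustrative assumption, and both ages are assumed to be in the same unit, such as years.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, times 100.

    Both ages must be expressed in the same unit (for example, years).
    """
    if chronological_age <= 0:
        raise ValueError("chronological_age must be positive")
    return mental_age / chronological_age * 100
```

A child of 8 with a mental age of 10 would score `ratio_iq(10, 8)`, which is 125, while a child whose mental age equals his or her chronological age scores 100.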
Tests that assess intellectual capacities, based on a sampling of items tapping various areas of cognitive functioning. Intelligence tests typically have a number of subtests, and provide both an overall score and individual subtest scores from various domains such as verbal reasoning, abstract reasoning, short-term memory, and processing speed. Scores on intelligence tests are usually expressed in the form of an IQ, which is a standardised score based on a normal distribution of scores, with a mean of 100 and a standard deviation of 15 (or 16 in the case of the Stanford Binet test). The pattern of scores among the various subtests provides a profile of an individual’s strengths and weaknesses, and an uneven distribution of scores is often associated with atypical patterns of learning. Intelligence tests are valid only for the culturally specific populations for which they were designed and normed, and scores in the extreme ranges need to be interpreted with caution. The most widely used individual tests of intelligence are the Wechsler tests (the Wechsler Adult Intelligence Scale (WAIS) and the Wechsler Intelligence Scale for Children (WISC)), and the Stanford Binet test. Group tests of intelligence are also available, and are used mainly for large scale screening or selection purposes. The practice of identifying a specific learning difficulty or learning disability in terms of a discrepancy between IQ score and achievement is no longer recommended. (See National Joint Committee on Learning Disabilities (NJCLD). (2011). Learning disabilities: Implications for policy regarding research and practice: A Report by the National Joint Committee on Learning Disabilities. Retrieved from http://www.ldonline.org/about/partners/njcld.)
The cut-off score on a criterion-referenced or mastery test, at or above which people are considered to have mastered the material. However, mastery may be an arbitrary judgment.
A test that determines whether an individual has mastered a unit of instruction or skill; a test that provides information about what an individual knows, not how his or her performance compares to the norm group.
Borrowed from a line in the Bible's Book of Matthew -- the rich get richer and the poor get poorer. In reading, this describes the difference between good readers and poor readers -- while good readers gain new skills very rapidly, and quickly move from "learning to read" to "reading to learn," poor readers become increasingly frustrated with the act of reading, and try to avoid reading when possible. The gap is relatively narrow when the children are young, but rapidly widens as children grow older.
A mean score is the arithmetical average; the sum of individual scores divided by the total number of scores.
The median is a measure of central tendency: the middle score in a distribution or set of ranked scores; the point (score) that divides a group into two equal parts; the 50th percentile. Half the scores are below the median, and half are above it.
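The difference between the mean and the median is easiest to see on a small set of scores. The sketch below uses Python's standard statistics module; the function name and the scores in the usage note are invented for illustration.

```python
import statistics


def central_tendency(scores):
    """Return (mean, median) for a list of test scores."""
    return statistics.mean(scores), statistics.median(scores)
```

For the hypothetical scores 12, 15, 15, 18 and 40, `central_tendency([12, 15, 15, 18, 40])` gives a mean of 20 and a median of 15. The single very high score pulls the mean upward, while the median stays at the middle score.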
Naming speed, or Rapid Automatised Naming (RAN), is one predictor of future dyslexia.
A RAN Test consists of naming an array of objects, colours, letters or symbols as quickly as possible. The rapid naming of just letters and numbers is called Rapid Alphanumeric Naming. Research by Lervag and Hulme (Psychological Science 2009) found, in a longitudinal study, that speed in naming pictures of objects and colour patches also predicts future reading skill.
For a study on Rapid Naming and Phonological Processing as Predictors of Reading and Spelling and many research references, see: http://www.thefreelibrary.com/Rapid+naming+and+phonological+processing+as+predictors+of+reading+and...-a0197363398
National percentile rank
A national percentile rank indicates the relative standing of one child when compared with all the others in the same grade or norming group; percentile ranks range from a low score of 1 to a high score of 99. The national percentile represents the percentage of students in the national norm group that scored below a given student’s score. For example, a student whose national percentile score is 70 scored higher than 70% of the students in the norm group. NAPLAN, the Australian national assessment program, reports national percentiles among its results.
In sociology, a culturally relative guideline for social behaviour. In testing, a statistical measure of central tendency.
Normal distribution curve
A distribution of scores used to scale a test. Normal distribution, or Bell Curve, is a bell-shaped curve with most scores in the middle and a small number of scores at the low and high ends. The normal distribution can be completely described by the mean (which is zero when scores are expressed as z-scores) and standard deviation (which is the amount scores deviate from the mean). Knowing the mean and standard deviation of a normally distributed set of data, one can determine the proportion of scores expected to fall within any given range.
In general, the normal distribution rule states that 68% of the data will fall within one standard deviation of the mean, 95% will fall within two standard deviations of the mean, and 99.7% will fall within three standard deviations of the mean.
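These percentages can be checked numerically. The sketch below assumes a perfectly normal distribution and uses the error function from Python's standard math module; the function name is illustrative.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution lying within k standard
    deviations of the mean (on either side)."""
    return math.erf(k / math.sqrt(2))


# Print the proportion for one, two and three standard deviations.
for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.4f}")
```

The values come out at roughly 0.6827, 0.9545 and 0.9973, which are usually rounded to 68%, 95% and 99.7%.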
A data set that is statistically symmetrical around an average, represented graphically by a bell curve. In a perfectly normal distribution, the mean, median and mode are all equal.
Measurement is compared to a norm or average. IQ tests are norm-referenced tests.
Norm-referenced assessment compares an individual child's score against the scores of other children who have previously taken the same assessment. With a norm-referenced assessment, the child's raw score can be, and usually is, converted into a comparative (standard) score such as a percentile rank or a stanine for comparison.
Standardized tests designed to compare the scores of children to scores achieved by children the same age who have taken the same test. They are designed to discriminate among groups of students, and allow comparisons across years, grade levels, schools, and other variables.
Percentiles or percentile ranks
Percentage of scores that fall at or below a point on a score distribution; for example, a score at the 75th percentile indicates that 75% of students obtained that score or lower.
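A percentile rank can be computed directly from a list of norm-group scores. The sketch below follows the example in this entry, counting scores at or below the given score; some tests count only scores strictly below it, which would give slightly different values. The function name and the norm-group scores in the usage note are made up for illustration.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm-group scores at or below the given score."""
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100 * at_or_below / len(norm_scores)
```

With a hypothetical norm group scoring 10, 12, 14, 15, 16, 18, 20 and 22, `percentile_rank(18, [10, 12, 14, 15, 16, 18, 20, 22])` is 75, because six of the eight scores are 18 or lower.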
Psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, such as the measurement of IQ and of other mental characteristics, including knowledge, abilities, attitudes, and personality traits. The field is primarily concerned with the construction and validation of measurement instruments, such as questionnaires, tests, and personality assessments.
A raw score is the number of questions answered correctly on a test or subtest. For example, if a test has 59 items and the student gets 23 items correct, the raw score would be 23. Raw scores on standardised tests are usually converted to percentile ranks, standard scores, and grade equivalent and age equivalent scores.
Readability refers to the level of difficulty of a written passage, as estimated by a formula. This depends on factors such as length of words, length of sentences, grammatical complexity and word frequency.
The reliability of a test reflects the consistency with which a test measures the area being tested; it describes the extent to which a test is dependable, stable, and consistent when administered to the same individuals on different occasions. See: https://www.uni.edu/chfasoa/reliabilityandvalidity.htm
The common use of the term is to mean any gathering of data, information and facts for the advancement of knowledge. The two main types of education research are Qualitative research, which is descriptive in nature, and Quantitative research, which involves analysis of numerical data.
Scaled scores represent approximately equal units on a continuous scale; facilitate conversions to other types of scores; and can be used to examine change in performance over time.
Standard deviation (SD)
The standard deviation is a statistical measure describing the variability in a distribution of scores. The more the scores cluster around the mean, the smaller the standard deviation. In a normal distribution, 68% of the scores fall within one standard deviation above and one standard deviation below the mean.
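The link between clustering and the size of the standard deviation can be shown with two small sets of scores. The sketch below uses the population standard deviation from Python's standard statistics module; the function name and both score lists in the usage note are invented for illustration.

```python
import statistics


def score_sd(scores):
    """Population standard deviation of a list of scores."""
    return statistics.pstdev(scores)
```

Both of the following hypothetical sets have a mean of 5. The widely spread set 2, 4, 4, 4, 5, 5, 7, 9 has a standard deviation of 2, while the tightly clustered set 4, 5, 5, 5, 5, 5, 5, 6 has a standard deviation of 0.5.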
Scores on norm-referenced tests are based on the bell curve, or the symmetrical distribution of scores around the average of the distribution. Standard scores are especially useful because they allow for comparison between students and comparisons of one student over time.
Standardisation is the process of producing a consistent set of procedures for designing, administering, and scoring an assessment. The purpose of standardization is to ensure that all individuals are assessed under the same conditions and are not influenced by different conditions.
Tests that are uniformly developed, administered, and scored.
Stanines are standard scores ranging from 1 to 9, with a mean of 5 and a standard deviation of 2. The first stanine is the lowest scoring group and the 9th stanine is the highest scoring group.
The difference in test scores that is attributable to demographic variables (e.g., gender, ethnicity, language background, socio-economic status, and age), or other variables.
T-scores are standard scores with a mean of 50 and a standard deviation of 10. A T-score of 60 represents a score that is 1 standard deviation above the mean.
Confidence in the extent to which a test measures the skills it claims to measure and therefore degree of confidence in the extent to which inferences and actions made on the basis of test scores are appropriate and accurate.
Z-scores are a standard score with a mean of 0 (zero) and a standard deviation of 1.