Annex A5. The construction of the reporting scale and data adjudication for creative thinking

The results of the PISA 2022 assessment are reported on a numerical scale consisting of PISA score points. This annex summarises the test-development and scaling procedures used to ensure that PISA score points are comparable across countries.
The construction of the reporting scale
Assessment framework and test development
The first step in defining a reporting scale in PISA is developing a framework for the assessed domain. This framework provides a definition of what it means to be proficient in the domain; delimits and organises the domain according to different dimensions; and suggests the kind of test items and tasks that can be used to measure what students can do in the domain within the constraints of the PISA design. The PISA 2022 Creative Thinking framework was developed by a group of international experts and agreed upon by the participating countries. More information on the PISA 2022 Creative Thinking framework can be found in Annex A1 of this volume, or in the PISA 2022 assessment and analytical framework (OECD, 2023[1]).
The second step is the development of the test questions (i.e. items) to assess students’ proficiency. A consortium of testing organisations under contract to the OECD on behalf of participating governments develops new items for the PISA innovative domain. The expert group that developed the framework reviews these proposed items to confirm that they meet the requirements and specifications of the framework.
The third step is a qualitative review of the testing instruments by all participating countries and economies, to ensure the items’ overall quality and appropriateness in their own national/jurisdictional context. The ratings provided through this review are considered when selecting the final pool of items for the assessment. Selected items are then translated and adapted to create national/jurisdictional versions of the testing instruments, which are verified by the PISA consortium.
The verified national/jurisdictional versions of the items are then presented to a sample of 15-year-old students in all participating countries and economies as part of a field trial. This is to ensure that they meet stringent quantitative standards of technical quality and international comparability. In particular, the field trial serves to verify the psychometric equivalence of items across countries and economies (see also Annex A6).
After the field trial, material is considered for rejection, revision or retention in the pool of potential items. The international expert group for each domain then formulates recommendations as to which items should be included in the main assessments. The final set of selected items is also subject to review by all countries and economies. This selection is balanced across the various dimensions specified in the framework and spans various levels of difficulty, so that the entire pool of items measures performance across the component skills and across the range of item contexts and student abilities.
Test assembly for the PISA 2022 Main Study
Thirty-two items were retained in the final pool of items for the creative thinking test. These items were organised into test units that varied in terms of the facets targeted (i.e. generate diverse ideas, generate creative ideas, and evaluate and improve ideas), the domain context (i.e. written expression, visual expression, social problem solving, or scientific problem solving) and the duration of the unit (guidelines of between 5 and 15 minutes). Some units included a single item and others included multiple items, although dependencies between items within units were minimised.
Constructed-response tasks accounted for 92% of the items administered as part of the creative thinking test. These tasks typically called for a written response, ranging from a few words (e.g. a cartoon caption or a scientific hypothesis) to a short text (e.g. a creative ending to a story or an explanation of a design idea). Some constructed-response items instead called for a visual design response (e.g. designing a poster by combining a set of given shapes and stamps), supported by a simple drawing editor tool. The test also included two items that were part of an interactive simulation-based task (although these were subsequently dropped from the scaling; see the section on data adjudication below) and two hybrid multiple-choice items in which students could either select an idea suggested earlier in the same unit or generate a new idea.
The creative thinking units were organised into five mutually exclusive 30-minute blocks, or clusters. The clusters were rotated according to an integrated design (see Chapter 3 of the PISA 2022 Technical Report (OECD, 2023[2])). About 28% of the PISA student sample was administered the creative thinking assessment; these students spent one hour on creative thinking test items, with the remaining hour of testing time assigned to one of the other core domains (mathematics, reading or scientific literacy).
Proficiency scales for PISA domains
Proficiency scores in creative thinking are based on student responses to items that represent the assessment framework for each domain (see previous section). While different students saw different questions, the test design, which ensured a significant overlap of items across different test forms, made it possible to construct proficiency scales that are common to all students. In general, the PISA frameworks assume that a single continuous scale can be used to report overall proficiency in a domain, but this assumption is further verified during scaling (see following section).
PISA proficiency scales are constructed using item-response-theory models in which the likelihood that the test-taker responds correctly to any question is a function of the question’s characteristics and of the test-taker’s position on the scale. In other words, the test-taker’s proficiency is associated with a particular point on the scale that indicates the likelihood that he or she responds correctly to any question. Higher values on the scale indicate greater proficiency, which is equivalent to a greater likelihood of responding correctly to any question.
In the item-response-theory models used in PISA, the test item characteristics are summarised by two parameters that represent task difficulty and task discrimination. The first parameter, task difficulty, is the point on the scale where there is at least a 50% probability of a correct response by students who score at or above that point; higher values correspond to more difficult items. For the purpose of describing proficiency levels that represent mastery, PISA often reports the difficulty of a task as the point on the scale where there is at least a 62% probability of a correct response by students who score at or above that point.
The second parameter, task discrimination, represents the rate at which the proportion of correct responses increases as a function of student proficiency. For an idealised, highly discriminating item, close to 0% of students respond correctly if their proficiency is below the item’s difficulty and close to 100% of students respond correctly as soon as their proficiency is above the item’s difficulty. In contrast, for weakly discriminating items, the probability of a correct response still increases as a function of student proficiency, but only gradually.
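As an illustration, the general form of such an item response function can be written, for a generic two-parameter logistic model, as

\[
P\left(X_{ij} = 1 \mid \theta_j\right) = \frac{1}{1 + \exp\left[-a_i\left(\theta_j - b_i\right)\right]}
\]

where \(\theta_j\) denotes the proficiency of student j, \(b_i\) the difficulty of item i (the point on the scale at which the probability of a correct response equals 50%) and \(a_i\) the discrimination of item i (how steeply that probability rises around \(b_i\)). This textbook formulation is given for illustration only; the exact item response models and parameters used for scaling the PISA 2022 data are documented in the PISA 2022 Technical Report (OECD, 2023[2]). Under the reporting convention described above, proficiency-level descriptions use the point at which this probability reaches 62% rather than 50%.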
A single continuous scale can therefore show both the difficulty of questions and the proficiency of test-takers (see Figure III.A5.1). By showing the difficulty of each question on this scale, it is possible to identify the level of proficiency in the domain that the question demands. By showing the proficiency of test-takers on the same scale, it is possible to describe each test-taker’s level of skill or literacy by the type of tasks that they can perform correctly most of the time.
Estimates of student proficiency reflect the kinds of tasks that students at a given point on the scale can be expected to perform successfully. This means that students are likely to be able to successfully answer questions located at or below the level of difficulty associated with their own position on the scale. Conversely, they are unlikely to be able to successfully answer questions located above the level of difficulty associated with their position on the scale.
The higher a student’s proficiency is located above a given test question, the more likely they are to answer the question successfully. The discrimination parameter for this particular test question indicates how quickly the likelihood of a correct response increases. The further the student’s proficiency is located below a given question, the less likely they are to answer the question successfully. In this case, the discrimination parameter indicates how fast this likelihood decreases as the distance between the student’s proficiency and the question’s difficulty increases.
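To make this concrete, the short sketch below evaluates the illustrative two-parameter logistic function given earlier at several distances between student proficiency and item difficulty, for a weakly and a strongly discriminating item. The parameter values are arbitrary and chosen only for illustration; they are not PISA item parameters.

```python
import numpy as np

def p_correct(theta, b, a):
    """Probability of a correct response under the illustrative 2PL formula above."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Probability of success at distances of -2 to +2 between proficiency and difficulty,
# for a weakly (a = 0.5) and a strongly (a = 2.0) discriminating item.
for a in (0.5, 2.0):
    probs = [round(float(p_correct(d, 0.0, a)), 2) for d in (-2, -1, 0, 1, 2)]
    print(f"a = {a}: {probs}")
```

With the weakly discriminating item, the probability of success rises only gradually across this range (from about 0.27 to 0.73), whereas with the strongly discriminating item it moves from close to 0 to close to 1 over the same distance.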
Data adjudication and approach to scaling the creative thinking data for reporting
In June 2023, the Core A Contractor (responsible for the overall management of contractors and the implementation of the PISA surveys; see Annex D) presented the Technical Advisory Group (TAG) with the PISA 2022 creative thinking data and preliminary psychometric analyses for data adjudication. Following the TAG’s initial feedback on the scalability of the data, given the relatively low inter-item correlations, and on the creation of plausible values, the PISA Secretariat conducted further analyses of the creative thinking data. These analyses included modifying some of the scoring rules, with the goal of increasing the validity of inferences drawn from the creative thinking data and improving its scalability and comparability across countries.
Following a thorough review of the data, the following changes were implemented:
Four items were dropped from the scaling. The four items identified for exclusion were drawn from two units (one from visual expression, and one from scientific problem solving) and were all in the same test cluster. These four items showed poor discrimination and high omit rates, likely due to their position within the cluster.
The scoring rules for 14 items were modified. All “generate creative ideas” and “evaluate and improve ideas” items were reviewed following the main survey in terms of the distribution of double-digit codes across countries. The scoring process for these items required coders to use a second digit to indicate the primary theme of each response (e.g. responses corresponding to Conventional Theme 1 were coded either 11 or 21, depending on whether the response was awarded partial credit [11] or full credit [21]). Responses coded with values of 1-3 as their second digit (i.e. 11, 12, 13 or 21, 22, 23) thus represented ideas that corresponded to the conventional themes initially designated in the coding guide. The double-digit codes were intended to serve as a mechanism for reviewing the distribution of codes across countries after data collection and, if needed, for adjusting the themes designated as conventional following the field trial and main survey. Based on the results of the main survey, the number of conventional themes was modified for 14 of the 18 items corresponding to “generate creative ideas” and “evaluate and improve ideas”, to improve the validity of the scoring rules for these items and to align the scoring with the framework (i.e. originality as statistical infrequency, with respect to the responses of other students who completed the same task).
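As a purely illustrative sketch of how such double-digit codes can be separated into a credit component and a theme component (the code values follow the scheme described above, but the function itself is hypothetical and not part of the PISA scoring pipeline):

```python
def decode_double_digit(code: int) -> dict:
    """Split a double-digit creative thinking code into credit and theme components.

    Per the scheme described above: the first digit records the credit awarded
    (2 = full credit, 1 = partial credit) and the second digit records the primary
    theme of the response (1-3 = one of the conventional themes in the coding guide).
    """
    first_digit, second_digit = divmod(code, 10)
    credit = {2: "full credit", 1: "partial credit"}.get(first_digit, "other")
    theme = (f"conventional theme {second_digit}"
             if 1 <= second_digit <= 3 else "other/non-conventional theme")
    return {"credit": credit, "theme": theme}

# Example: code 21 corresponds to a full-credit response matching Conventional Theme 1.
print(decode_double_digit(21))
```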
Responses submitted in fewer than 15 seconds were invalidated (i.e. converted to missing responses). For most items in the creative thinking test, students must generate a written or visual output in response to a written or visual stimulus (i.e. a task prompt with instructions and material for inspiration). The creative thinking construct also encompasses the cognitive processes associated with idea generation, evaluation and improvement, which are considered slow and deliberate rather than opportunistic or rapid processes. For most items in the test, responses submitted within 15 seconds of viewing the item cannot therefore be considered reflective of creative thinking processes. A review of the timing data also showed a clear bimodal distribution of response submission times, with one peak before 15 seconds and another peak considerably later. This modification was applied to all items, with the exception of three: in two cases, students were able to select a response to a previous question through a mechanism akin to multiple choice; in the third, students were asked to generate a very short written output. In these three cases, it was judged that students could submit a response reflecting creative thinking processes within 15 seconds, so no minimum response time was imposed.
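A minimal sketch of how such a minimum-response-time rule could be applied to item-level data is shown below. The column names and item identifiers are hypothetical placeholders, not the variables used in the PISA database; the threshold and exemptions simply mirror the rule described above.

```python
import pandas as pd

MIN_RESPONSE_TIME_SEC = 15.0  # minimum response time, per the rule described above
# Hypothetical identifiers for the three items exempted from the rule.
EXEMPT_ITEMS = {"CT_ITEM_A", "CT_ITEM_B", "CT_ITEM_C"}

def invalidate_rapid_responses(responses: pd.DataFrame) -> pd.DataFrame:
    """Convert responses submitted in under 15 seconds to missing,
    except for the exempted items."""
    out = responses.copy()
    too_fast = ((out["response_time_sec"] < MIN_RESPONSE_TIME_SEC)
                & ~out["item_id"].isin(EXEMPT_ITEMS))
    out.loc[too_fast, "response"] = pd.NA
    return out
```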
In October 2023, the PISA Secretariat, the Core A Contractor and the TAG reconvened for the adjudication of the creative thinking data, following the further analyses conducted by the PISA Secretariat, and to finalise the reporting approach. The TAG recommended reporting the creative thinking data according to a non-linear transformation of the “theta” scale, using the test characteristic curve for a hypothetical test composed of the final pool of 32 creative thinking items and based on international item parameters (a schematic sketch of this transformation is given after the list below). The advantages of this approach include:
Reporting student performance on a bounded scale (between 0 and 60, reflecting the maximum sum-score across all items) that is the same for all countries. This solution maintains the possibility of reporting performance on a scale, but signals a clear difference from the PISA scales used for the other domains; the broader “grain” size of the creative thinking scale signals its lower reliability relative to the other PISA scales (a 1-point change on the creative thinking scale corresponds to about 10% of a standard deviation).
Scores can be interpreted straightforwardly in terms of the number of items answered correctly on this specific test (rather than as a more general reflection of students’ creative thinking ability applicable to other performance tests). This draws attention to the actual test content and to the framework that guided its development, and it facilitates the interpretation of the relatively high frequency of low scores on this test (i.e. students scored 0 on the test, rather than having no creative thinking skill).
Test scores differ more where the test has more information about students, i.e. in those regions of the creative thinking scale where a greater number of item-difficulty estimates are located, and therefore where more information is available to differentiate student performance on the scale.
The international database still includes 10 “plausible scores” per student.
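The sketch below illustrates the general logic of this test-characteristic-curve transformation. It assumes, purely for illustration, a generalised partial credit model for every item and randomly drawn item parameters; the actual transformation uses the item response models and international item parameters documented in the PISA 2022 Technical Report (OECD, 2023[2]). The expected score on the hypothetical 32-item test is computed as the sum of expected item scores at each value of theta, which yields a monotonic, non-linear mapping from theta to a scale bounded between 0 and the maximum sum-score of 60.

```python
import numpy as np

def gpcm_expected_score(theta, a, step_difficulties):
    """Expected score on a polytomous item under a generalised partial credit model
    with discrimination `a` and step difficulties for categories 1..m."""
    # Category logits: 0 for category 0, then cumulative sums of a * (theta - b_k).
    logits = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(step_difficulties)))))
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(np.dot(np.arange(len(probs)), probs))

def test_characteristic_curve(theta, item_params):
    """Map a proficiency value theta to the expected total score on the hypothetical test."""
    return sum(gpcm_expected_score(theta, a, steps) for a, steps in item_params)

# Hypothetical item parameters: 32 items, 4 scored out of 1 point and 28 out of 2 points,
# so that the maximum sum-score is 60, as on the creative thinking reporting scale.
rng = np.random.default_rng(0)
item_params = [(1.0, list(rng.normal(0.0, 1.0, size=k))) for k in [1] * 4 + [2] * 28]

print(round(test_characteristic_curve(0.0, item_params), 1))  # expected score at theta = 0
print(round(test_characteristic_curve(3.0, item_params), 1))  # approaches the 60-point ceiling
```

Because each item’s expected score is bounded by its maximum credit, the transformed scale is bounded between 0 and 60 while remaining a one-to-one, non-linear function of theta, preserving the ordering of students on the underlying proficiency scale.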
References
[1] OECD (2023), PISA 2022 Creative Thinking framework, OECD Publishing, Paris, https://doi.org/10.1787/471ae22e-en.
[2] OECD (2023), “Test Development for the Core Domains”, Chapter 3 in PISA 2022 Technical Report, OECD Publishing, Paris, https://www.oecd.org/pisa/data/pisa2022technicalreport/PISA-2022-Technical-Report-Ch-03-PISA-2022-Development-Core-Domains.pdf.