ESEE come, not so ESEE go… What went wrong with the Learner Gains Research Project and what can we learn from it?

A couple of weeks ago the Ministry of Advanced Education and Skills Development (MAESD) released a memo and accompanying reports announcing they will not use the Essential Skills for Employment and Education (ESEE) test to measure the skill gains of students who participate in LBS programs. ESEE was piloted in LBS programs in what was called the Learner Gains Research Project (LGRP) in 2016 after adaptations were made to the original test in 2015.

It took over a year for MAESD to release the evaluation and analysis reports. Both were completed in December 2016. Not shared with the field are the findings from a survey completed by program staff who participated in the LGRP.

Here is the ministry’s statement:

Based on the findings of the reports, the ministry will not further implement the ESEE tool for measuring learner gains at this time.

They did not provide a rationale for their decision, commentary or interpretation of the reports.

While I’m guessing that many in the field, and a number of those who participated in the LGRP, are simply relieved to see this announcement and wish to move on, the field and particularly the participants in the LGRP deserve the respect of a rationale for this decision and a brief analysis of the overall project.

The right decision was made based on the problems that I and others documented during testing. Now I realize there were even more issues beyond the failings of the test, which I wrote about here, here and here.

The report contains numerous procedural and analytical flaws, and the overall design of the project had problems from the outset. Ultimately though, the decision to use the ESEE as a learner gains tool, the role and oversight of the consultants and the research design are the ministry’s responsibility. Simply moving on without understanding the full extent of the problems in this project is not a great option. If we don’t gain some insight, not only do we remain vulnerable to a repeat of the same mistakes, but the ministry can’t be held accountable for its decisions, actions and inactions.

For those who read the reports, at first it seems that the ministry’s decision doesn’t match the report’s conclusion that there were “significant” skill gains among the small number of test-takers who took a post-test. However, closer reading reveals quite a few flaws in the analysis.

Posting the reports and leaving it up to the field to wrestle with the issues is quite extraordinary. It could also be perceived as an attempt to deflect criticism and avoid responsibility. We need to place the results in a larger context and fully understand what happened and what didn’t happen in the broader LGRP. This rather lengthy post is an attempt to provide context and critically analyse both the LGRP and the ESEE analysis report. My aim is not to place blame but to develop our knowledge of assessment, particularly high-stakes assessments, by pointing out the weaknesses in the LGRP. Perhaps we will all be better prepared the next time a learner gains tool is discussed. Here are the issues as I see them.

  1. The ministry provided funding for the adaptation and piloting of ESEE directly to test developers who also became the evaluators of their own product, a product with an “e-commerce option” and marketing plans.

The ministry did not rely on any external and impartial expertise during the two years that it provided funding to the ESEE team. Instead, they relied only on the analysis and evaluation of ESEE test developers and consultants, who are not only personally invested in the success of the project, but also clearly state in their report that they have plans to market the assessment to employers. Perhaps the consultants also had preliminary discussions about licencing agreements with the ministry. They became the sole source of information, analysis and evaluation of the ESEE within the LGRP.

  2. Test item development, along with internal and external validity, was never fully evaluated by an outside literacy testing expert.

The ministry has never shared its rationale supporting its decision to pursue the use of ESEE as a learner gains tool. Is there a clear rationale? How was it determined that the tool could actually capture skill gain after short-term participation in LBS across diverse learner groups and streams and across a vast range of skill development, covering the equivalence of K-12 literacy and numeracy development?

Test development, particularly test development using both the international literacy testing framework and item-response theory, is extremely complex, difficult and time-consuming. ESEE draws on both. Attempts to transpose the expertise and methods of this approach, initially developed and still maintained by Educational Testing Service (ETS), result in products that don’t perform exactly the same way as products from ETS (i.e. the PIAAC, Education and Skills Online and PDQ).

Internal (or construct) validity analysis would reveal that the ESEE contains very difficult reading passages, and its approach to numeracy is different from the model of test item development used for reading and document use. External validity analysis would reveal that the ESEE levels don’t align with the OALCF (participants discovered this) and the scores don’t align with scores from international testing (test-item development is different). And, as I have argued, the model of test item development is not appropriate for detecting short-term skill gains in programs like LBS.

  3. LGRP designers did not ensure there was adequate analysis of the sample of participants who completed the test.

The analysis report doesn’t contain adequate analysis of the sample that could be used to make any generalizable statements about the LBS population. Without this analysis, very little can be said about the representativeness of the sample (Does the sample adequately capture a range of learners who attend programs?), and nothing can be said about generalizability (Does the sample relate to all LBS learners?). The sample is not cross-referenced with EOIS-CaMS data.

To make up for this problem in the research design, demographic information was collected from each of the 2,800 participants (2,782 completed at least one of three assessments) and we have some information about the gender and age of participants, along with their sectors, streams and regions.

Missing is any indication of the learners’ skill levels. Questions to provide an indicator of ability were asked (i.e. first language and education attainment) but were not included in the analysis. The main education attainment question posed in ESEE is confusing, which likely impacted its usefulness. Learners were prompted to indicate the “institution of highest level of education completed.” But the accompanying drop-down menu contained phrases describing grade levels, such as Grade 0-8 and not types of institutions. Without knowing the education attainment of the learners and their first language, both of which indicate ability to complete the test, not even general inferences about the usefulness of the tool in the LBS system can be made.

It’s very possible that the pool of test-takers who took the test was a selective sample and did not reflect the range of skill levels of all LBS participants. Once program staff saw how difficult the test was they likely recruited only those students they thought could successfully complete the test.

  4. There were no controls over the testing session, and this may be impossible to achieve.

Another issue that wasn’t adequately considered in the design is some sort of oversight and control over test delivery. Learners participating in distance learning programs could take the test in their homes, working at their own pace, opening and closing sessions and possibly asking for support.

In response to this problem of test session controls, the consultants removed over 1,000 test results (out of the pool of 6,563 completed reading, document use and numeracy assessments) that had very fast completion times. They rationalized that this may have indicated that the test-taker “did not invest enough time (effort) in their test to provide an accurate measure of their skill.” They also documented this decision and considered it in their analyses. But what they didn’t do is explore why this may have happened. Staff I spoke to said students likely moved through the test quickly, clicking on random responses, for several inter-related reasons: they were fed-up with the process, the test was too long, the test was not meaningful and they simply wanted to finish and receive an honorarium.

Taking too long to complete the test also became an issue. However, the issue was not fully explored in the report. The consultants state: “Some clients who took longer to answer questions did not use their time effectively.”

Attempting to control what happens in a testing session, particularly a pre- and post-test, may be very challenging in LBS, a diverse field with numerous delivery modes and few to no organizational hierarchies (like we would see in the regular school system, for example) that are used to enforce test protocols across a system.

  5. Post-test scores that showed a decrease were increased to match the pre-test score.

Consultants adjusted any post-test scores that were below a test-taker’s pre-test score, stating that they “converted negative skill gains to zero.” They do not explore the impacts of this adjustment in their analyses. If the negative scores were used in their gains calculations, how would they alter the overall results?
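To make the question concrete, here is a minimal sketch in Python using invented pre- and post-test scores. None of these numbers come from the reports, and the function name `mean_gain` is hypothetical, chosen only for illustration. It compares the average gain when score losses are kept with the average when they are converted to zero.

```python
from statistics import mean

# Hypothetical pre/post scores for six test-takers (NOT data from the reports).
pre_scores = [230, 245, 250, 210, 260, 240]
post_scores = [240, 235, 265, 200, 262, 255]


def mean_gain(pre, post, floor_at_zero=False):
    """Average post-minus-pre gain; optionally convert negative gains to zero."""
    gains = [after - before for before, after in zip(pre, post)]
    if floor_at_zero:
        gains = [max(g, 0) for g in gains]
    return mean(gains)


raw = mean_gain(pre_scores, post_scores)
floored = mean_gain(pre_scores, post_scores, floor_at_zero=True)
print(f"Mean gain with losses kept:     {raw:.1f}")
print(f"Mean gain with losses set to 0: {floored:.1f}")
```

Converting losses to zero can only raise the average, never lower it, so the size of any reported gain depends in part on how many negative results were adjusted.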

Score loss happens in testing for many reasons, such as fatigue, disinterest, and the test environment. In addition, the use of a tool that has not been validated for the purpose of measuring short-term skill gain has a role. None of these issues are explored. The consultants simply state: “It is unlikely that clients lose skills in a time frame of nine months or less, in light of their participation in training.”

Consultants also note that test-takers spent 11%-20% less time on their post-tests compared to their pre-tests. They conclude that this reflected a lack of effort and don’t consider that test-takers became knowledgeable about the test. It’s very plausible that test-takers simply became better at taking the test if items were repeated or were similar.

The statement below, describing an increase in the number of post-tests completed is confusing and needs more explanation:

Several program participants continued to pre- and post-test resulting in an increase of 787 clients completing one or more assessments. This increased the post-test sample size by approximately 10% (p. 6).

Does this mean that test-takers completed their post-test immediately after their pre-test with no program time in between? It’s difficult to understand and is important to do so, since whatever happened nearly doubled the number of post-tests (or test-takers?) completed.

Since less than 22% of test-takers completed a post-test (consultants also report that 19% of “clients” completed a post-test, but may have confused the number of completed tests and number of test-takers), every decision about the numbers analysed and how that analysis was done becomes important.

  6. Particularly high scores reported by one sub-group of learners were not further investigated and may have skewed the average skill gains results.

One sub-group of learners had particularly high gains between pre- and post-testing in the Type B tests (the more difficult ones). The higher scores needed some investigation before being reported. Was this a carefully selected and highly skilled group of learners? Did they receive some additional supports? Did they receive some very precise instruction on how to complete the test? We don’t know.

When analysing skill gains, the consultants don’t share their analysis and calculations of statistical significance (a standard practice), but only share a final classification (0-5 increase is not significant, 6-20 is significant, over 20 is very significant).
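For readers unfamiliar with what such a calculation might look like, below is a rough sketch of one common approach: a paired t-statistic computed on pre/post differences. The method the consultants actually used is not shared, so this is only an assumption about the kind of analysis that could be reported. The scores are invented and the helper name `paired_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for paired pre/post scores."""
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical scores (NOT data from the reports).
pre_scores = [230, 245, 250, 210, 260, 240]
post_scores = [240, 235, 265, 200, 262, 255]

t_stat, dof = paired_t(pre_scores, post_scores)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")
```

The statistic is then compared against a critical value from a t-table (about 2.57 for 5 degrees of freedom at the 5% level, two-tailed). Reporting the statistic and the sample size lets readers check a claim of significance for themselves.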

In addition, all correlational analysis needs more explanation.

  7. Unsupported inferences and conclusions were made about the sample and its relation to the overall LBS population.

Since the sample was never adequately defined and described none of the following statements about the ability of all students in LBS carries any weight (my emphasis in bold below).

  • “A small percentage of clients demonstrate very weak essential skills” based on the numbers who took the Type A assessment
  • “Most clients have the skills needed to handle the more difficult Type B”
  • Only “2% or fewer” were exempted using the locator function, a quick screen tool. “This infers that 98% of learners likely have the skills needed to complete the online assessment.”

Perhaps these are wording issues, and consultants meant to describe the sample of test-takers and not all clients or learners. Nonetheless, these are statements that have no relevance when describing all LBS learners.

  8. Average scores remain beneath a cut-off score (that also indicated successful completion to the test-taker) established by the test developers without a rationale.

Finally, we have to consider the possible impact of the results report statement, which I wrote about here:

 Number scores normally range between 200 and 300. It varies, but most jobs require reading skills at 250 or higher.

It may have been a demotivator for students, and could have served as a motivator for staff to ensure students scored above 250. No one wants their students to be demoralized and deflated by a test result. The statement along with other factors such as a very lengthy test, reading level difficulty, the test’s lack of meaning and significance to the learner, and the timing of the project during summer months likely combined to contribute to the overall poor rates of post-test completion.

There were also drop-offs between tests during the pre-testing phase. After completing the reading test, the first test and the one most often completed, less than half (44%) of test-takers completed the document use test. More however (62%) did complete the numeracy pre-test, which also operates differently than the other two.

The average score on all reading tests was 192 for Type A and 240 for Type B, meaning those who took the Type A reading test would not even hit the minimal range based on the results report, and those who took Type B are still below the level (erroneously) set as a minimum to do most jobs.

Consultants do state that future versions of the test should “revise the score description statement on Results Report.”

  9. The ministry did not follow a process of informed consent and did not inform participants of their rights, particularly their right to withdraw from the study without repercussions.

A final problem to recognize is the lack of informed consent in the LGRP. The research designers did not inform participants of their rights, a mandatory protocol in government funded institutions such as hospitals, colleges, universities and schools. Considering that this project involved high-stakes testing, it was an oversight to not include an informed consent process. Participants were left in a vulnerable position. They were not informed that they had the right to withdraw from the study without repercussion. In addition, they received funding for their participation, a motivator to stay in the study and actively participate. They were also not informed what would happen to their funding if they did withdraw. Compounding the issues was their inability to directly connect with a ministry official (or an impartial designate not associated with ESEE) to discuss their concerns about participating in the LGRP.

Any one of these nine issues would be a reason to have serious concerns about the rigour of the process for developing a high-stakes assessment. (It is high-stakes because the effort to measure skill gains is as much or more about connecting test results with funding decisions as it is about skill gains.)

Developing rigorous standardized testing is time-consuming, costly and very challenging to get right. Using a pre- and post-test protocol in LBS is likely impossible, since there is no stable and predictable program experience in the middle and the ministry no longer collects data related to time spent in programs to help make some sort of comparison between program involvement and skill gains.

Forging ahead with the learner gains initiative without the necessary expertise, supports and funding has likely led to the problems that plagued the LGRP. It’s the people in LBS programs who end up paying the price.

Staff who participated in the project may have had to compromise their professionalism as they wrestled with the contradiction of encouraging learners to take a test that they began to question, and at the same time, conform with the study’s requirements that they agreed to, and for which they received an incentive. Consultants write that staff were already cautious about entering into the project before it began:

As in previous projects, service providers were cooperative, even though they may have had reservations about the Ministry’s intentions, the assessment tool, the process or the outcomes (p. 4).

They may have even more reservations about ministry efforts now.

It is the adult learners who became frustrated, angry and upset who paid a greater price. Their trust in programs, a place where they should feel supported and not have their vulnerabilities exposed, was compromised. Perhaps it was destroyed in some instances when learners simply left the program after taking a test.

In addition, consultants may also pay a price. Arguably, their analysis report and accompanying evaluation, which was not completed by an impartial evaluator, should not have been released with so many flaws. Once the ministry recognized the issues, an evaluation from an objective assessment expert, examining both the LGRP and the ESEE analysis, should have been conducted.

In writing this, I also recognize that those in the ministry, working directly on the LGRP, must be under some extreme pressure to pursue a learner gains tool that is aligned with international literacy testing, despite running into so many obstacles. It may be time to seriously question this pursuit.

What’s next? We’ll see I guess. In the meantime, it’s important to know that different choices can be made—ones that provide rigorous and useful data to the ministry and also support responsive and relevant literacy development, if the ministry wants to make those choices.



2 thoughts on “ESEE come, not so ESEE go… What went wrong with the Learner Gains Research Project and what can we learn from it?”

  1. Hi Christine,

    Lately I have been widening the focus of my thinking from what does it take to successfully integrate digital technology in adult education to what does it take to be a successful and dynamic learning system. In the digital technology realm research points to the key role played by the organizational leadership. The leadership’s technology vision and capacity is critical to the success of the teachers, staff and learners. Your post is very timely for me, as both personally and professionally I have been examining what could make the Ontario adult learning system stronger. Almost exclusively my attention has been directed towards the possible approaches and supports to build the capacity of adult ed. teachers. The teachers are commonly thought to be the root of the problem, perceived to be “deficient” and in need of upskilling. In a discussion with a peer researcher, just yesterday, it became obvious to me that the best designed tools would struggle or fail unless the leadership within each of the delivery organizations had the capacity (competence and support) to lead, motivate and activate the tools. Your post makes it very obvious that all efforts to create an adult learning system that itself can grow and learn will not be possible until MAESD’s corporate culture and its leaders commit to developing their capacity to lead a true learning system. Without transparency and knowledge sharing from those with the data and information, learning and growth cannot happen. We are all, as you point out, bound to repeat mistakes. Your post points squarely at one of the most significant barriers to the improvement and expansion of the current system. This example is very stimulating and I hope I can incorporate the insights you have provided into my ongoing work.

    With sincere thanks,

    Alan Cherwinski


    1. You gave me a lot to think about too in relation to the idea of the culture within government. After the auditor general released his report on the Phoenix debacle there was some discussion of government culture and its contribution to the problem. Not to say that the LGRP is similar in any way, but there are elements of the media reporting on the issue that are thought-provoking. (Here is one example: Lessons from Phoenix: Does public service culture need to be fixed?) Can people in MAESD raise issues and problems? Is advancement truly premised on the notion that you simply get things done as efficiently as possible without problems? Is project completion the only thing that matters? Then, when things do go wrong, is avoiding blame at all costs the way people operate? I’ve never worked in government and can’t weigh in. All we see are indications from the outside that bureaucratic culture certainly can create its own problems that too often leave front-line folks cleaning up and propping up their messy systems.

