Could ChatGPT get an engineering degree?

EPFL research investigating the potential impact of AI assistants on education has found that systems like GPT-4 can answer up to 85% of university assessment questions correctly.

ChatGPT exploded onto the public scene in late 2022, attracting more than 100 million users in its first month alone. Since then, examples of how AI may transform society in the coming years have multiplied, from employment and communication to education.

In higher education, AI assistants are increasingly used by students. While these tools offer opportunities for improved teaching and learning, they also pose significant challenges for assessment and learning outcomes. Yet, until now, there has been no comprehensive study of their potential impact on the assessment methods that educational institutions use.

As outlined in their new paper published in the Proceedings of the National Academy of Sciences (PNAS), researchers from EPFL’s School of Computer and Communication Sciences have conducted a large-scale study across 50 EPFL courses to measure the current performance of Large Language Models on higher education course assessments. The selected courses were sampled from 9 Bachelor’s, Master’s, and online programs, covering a broad spectrum of STEM disciplines, including computer science, mathematics, biology, chemistry, physics, and materials science.

"We were lucky that a large consortium of EPFL professors, teachers, and teaching assistants helped us collect the largest data set to date of course materials, assessments, and exams to get a diverse array of materials across our degree programs,” explained Assistant Professor Antoine Bosselut, Head of the Natural Language Processing Laboratory (NLP) and member of the EPFL AI Center. “This data was curated into a format that we thought would most resemble the ways that students would actually give this information to models and then we generated responses from the models and saw how well they answered.”

Focusing on GPT-3.5 and GPT-4, the researchers used eight prompting strategies to produce responses. They found that GPT-4 answers an average of 65.8% of questions correctly, and produces a correct answer under at least one prompting strategy for 85.1% of questions.
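The two headline numbers differ because they aggregate the same per-question results in different ways: the first averages correctness over every (question, strategy) pair, while the second counts a question as solved if any strategy succeeds. A minimal sketch with hypothetical data (the true results matrix is not published here) illustrates the distinction:

```python
# Hypothetical correctness matrix: results[q][s] is True if the model
# answered question q correctly under prompting strategy s.
# The study used eight strategies; the values below are made up.
results = [
    [True, True, False, True, False, True, True, False],     # question 1
    [False, False, True, False, False, False, True, False],  # question 2
    [True, True, True, True, True, True, True, True],        # question 3
]

n_questions = len(results)
n_strategies = len(results[0])

# Average accuracy: mean correctness over all (question, strategy) pairs.
average_accuracy = sum(sum(row) for row in results) / (n_questions * n_strategies)

# Coverage: fraction of questions solved by at least one strategy.
coverage = sum(any(row) for row in results) / n_questions

print(f"average accuracy: {average_accuracy:.3f}")  # 15/24 = 0.625
print(f"coverage: {coverage:.3f}")                  # 3/3  = 1.000
```

With real data, coverage is always at least as high as average accuracy, which is why the 85.1% "at least one strategy" figure exceeds the 65.8% average.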

“We were surprised at the results; nobody expected that the AI assistants would achieve such a high percentage of correct answers across so many courses. Importantly, the 65% of questions answered correctly was achieved using the most basic, no-knowledge prompting strategy, so anyone could achieve this without any technical understanding. With some subject knowledge, which is typical, it was possible to achieve an 85% success rate, and that was really the shock,” said Anna Sotnikova, a scientist with the NLP lab and co-author of the paper.


AI's impact on student learning and skill development

The researchers theoretically grounded the problems associated with students using these AI systems in terms of two kinds of vulnerability: assessment vulnerability, that is, whether traditionally used assessments can be ‘gamed’ by these systems, and educational vulnerability, that is, whether these systems could be used to circumvent the typical cognitive paths that students take to learn the academic skills they need.

In this context, the researchers believe that the results of the study bring up clear questions around how to ensure that students are able to learn the core concepts needed in order to grasp more complex topics later on.

“The fear is that if these models are as capable as what we indicate, students that use them might shortcut the process through which they would learn new concepts. This could build weaker foundations for certain skills earlier on, making it harder to learn more complex concepts later. Maybe this needs to be a debate about what we should be teaching in the first place in order to come up with the best synergies of the technologies we have and what students will do in decades to come,” said Bosselut.

Another key point about the development of AI assistants is that they are not going to get worse; they will only get better. In this research, which was completed a year ago, a single model was used for all subjects, and, for example, it had particular trouble with mathematics questions. Now there are models specialized for math. The conclusion, the researchers say, is that if the study were rerun today, the numbers would be even higher.

Emphasizing complex assessments and adapting education

“In the short term we should push for harder assessments, not in the sense of question difficulty, but in the sense of the complexity of the assessment itself, where multiple skills drawn from different concepts learned throughout the semester are brought together in a holistic assessment,” Bosselut suggested. “The models aren’t yet really designed to plan and work in this way, and in the end, we actually think this kind of project-based learning is better for students anyway.”

“AI challenges higher education institutions in many ways, for example: which new skills are required for future graduates, which are becoming obsolete, how can we provide feedback at scale and how do we measure knowledge? These kinds of questions come up in almost every management meeting at EPFL and what matters most is that our teams initiate projects which provide evidence-based answers to as many of them as we can,” said Pierre Dillenbourg, Vice President for Academic Affairs at EPFL.


In the longer term, it’s clear that education systems will need to adapt, and the researchers want to bring this ongoing project closer to educators, aligning future studies, and the recommendations that follow from them, with what educators will find useful.

“This is only the beginning, and I think a good analogy for LLMs right now is calculators. When they were introduced, there was a similar set of concerns that children would no longer learn mathematics. Now, in the earlier phases of education calculators are usually not allowed, but by high school and above they are expected, taking care of lower-level work while students learn more advanced skills that build on it,” added Beatriz Borges, a PhD student in the NLP lab and co-author of the paper.

“I think that we will see a similar, gradual adaptation and shift to an understanding of both what these systems can do for us and what we can’t rely on them to do. Ultimately, we include practical suggestions to better support students, teachers, administrators, and everyone else during this transition, while also helping to reduce some of the risks and vulnerabilities outlined in the paper,” she concluded.

The EPFL AI Center brings together 80 laboratories and professors and has 1000 affiliates, leading the way towards trustworthy, accessible, and inclusive AI.