AI Grading: Faster Feedback for Teachers
AI Grading Systems Offer Speed, But Accuracy Is a Concern
Updated May 27, 2025
A recent study indicates that while artificial intelligence can accelerate the grading process for educators, it might do so at the expense of precision. The research highlights the challenges of assessing complex student work using AI, especially in subjects emphasizing argumentation, investigation, and data analysis.
Xiaoming Zhai, associate professor and director of the AI4STEM Education Center at the University of Georgia, noted the time constraints teachers face. He said that grading complex tasks takes time, which means students may not get timely feedback. The study compared Large Language Models (LLMs) to human graders.
The study presented the LLM Mixtral with middle school students’ written responses, including a question asking them to model particle behavior when heat energy is transferred. The LLM then created rubrics to evaluate the students’ work and assign scores.
Researchers discovered that while LLMs grade quickly, they often rely on shortcuts, such as identifying keywords, which reduces accuracy. Supplying LLMs with detailed rubrics that mirror human analytical thought could improve their performance, the study suggests. These rubrics should specify what the grader should look for in a student’s response.
“The train has left the station, but it has just left the station,” Zhai said, emphasizing the need for further development in AI grading.
Traditionally, LLMs are trained using both student answers and human scores. This study, however, took a different approach and instructed LLMs to develop their own rubrics. While these AI-generated rubrics showed some similarities to human-created ones, the LLMs often lacked the reasoning capabilities of humans, relying instead on shortcuts such as "over-inferring."
Zhai explained that LLMs might incorrectly assume a student’s understanding based solely on the presence of certain keywords, without evaluating the student’s underlying logic. For example, mentioning a temperature increase might lead the LLM to assume the student understands particle movement, even if their writing doesn’t demonstrate that understanding.
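The failure mode Zhai describes can be sketched in a few lines of code. The following toy Python example is not the study's code; the function names, keywords, and rubric criterion are invented for illustration. It contrasts a naive scorer that awards credit for a keyword alone with a stand-in for a rubric criterion that also demands the student link heat to particle motion:

```python
# Hypothetical illustration of the "keyword shortcut" failure mode.
# Names and criteria are invented; real rubric scoring requires human
# judgment or an LLM guided by a human-written rubric.

def keyword_score(response: str) -> int:
    """Naive scorer: awards a point if the response merely mentions temperature."""
    return 1 if "temperature" in response.lower() else 0

def rubric_score(response: str) -> int:
    """Toy rubric criterion: awards a point only if the response also
    connects heat to particle motion, not just the temperature change."""
    text = response.lower()
    mentions_heat = "temperature" in text or "heat" in text
    explains_particles = "particle" in text and (
        "move" in text or "faster" in text or "speed" in text
    )
    return 1 if mentions_heat and explains_particles else 0

shallow = "The temperature goes up when you add heat."
reasoned = "Adding heat makes the particles move faster, so the temperature goes up."

print(keyword_score(shallow), rubric_score(shallow))    # keyword scorer over-infers: 1 0
print(keyword_score(reasoned), rubric_score(reasoned))  # both award credit: 1 1
```

The shallow answer mentions the right keyword but never demonstrates the underlying reasoning, so a keyword-driven grader gives it full credit while the rubric criterion does not, which is exactly the over-inference the researchers observed.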
The researchers caution against completely replacing human graders, despite the speed advantages of LLMs. Human-made rubrics, which reflect instructor expectations, substantially improve AI accuracy: without them, LLMs have an accuracy rate of only 33.5%, which increases to just over 50% with human rubrics.
Improved accuracy could make educators more receptive to using AI to streamline grading, freeing up time for other tasks.
"Many teachers told me, 'I had to spend my weekend giving feedback, but by using automatic scoring, I do not have to do that. Now, I have more time to focus on more meaningful work rather than some labor-intensive work,'" Zhai said.
What’s next
Future research will likely focus on refining AI algorithms and integrating detailed, human-like rubrics to enhance the accuracy of AI grading systems, possibly leading to wider adoption in educational settings.
