I tried learning from AI tutors. The test better be graded on a curve.

This is the first of a four-part series testing out new AI-powered homework helpers. 

On the school supply list for the 2025-2026 school year: a new laptop, a pouch for the school’s phone ban, and (hopefully) ample AI literacy. 

Whether students like it or not, AI is becoming ingrained in education. High schools, colleges, even elementary schools are incorporating it into their curricula, while AI’s heaviest hitters are making huge bets on education, hoping to foster a deeply entwined relationship between young learners and artificial intelligence. OpenAI, Google, and Anthropic have unveiled new learning and study versions of their models, pitched as AI tutors for the masses. Google for Education, the company’s education tech arm, has made a sharp pivot to AI, including passing out free Google AI Pro plans to college students around the world; Microsoft and OpenAI have done the same. AI developers have penned deals with major educational forces that will see their tech and its principles further integrated into school settings. 

So, I, a tech reporter who has been following this AI transition, decided to test out the latest cohort of tutor bots and see how they fared against a historic opponent — standardized testing. 

Some caveats: I haven’t been in a high school or college prep class in well over a decade, and while I have been to college a couple of times now, not one degree involved any math classes. “You’re a tech reporter!” you may be saying. “Obviously, you know more than the Average Joe about science or coding or other numbers-based subject areas!” I’m a words girl who paid cold, hard cash to go to journalism school in 2018. So, as it turns out, I could stand to learn a lot from these AI tutors… that is, if they are actually good at their job. 

How I approached my AI study buddies

I pulled questions directly from the New York Regents Exam and New York State Common Core Standards, the College Board’s Advanced Placement (AP) college preparatory exams from 2024, and social science curricula from the Southern Poverty Law Center’s free Learning for Justice program. 

Rather than sticking with the standard math or computer skill prompts that many AI companies use to promote their chatbots, I included multiple questions from the humanities and so-called “soft” sciences. Compared with the more common STEM examples, subjects like reading comprehension, art history, and socio-cultural studies have proven to be a battleground for both AI proponents and critics. Also, to put it bluntly, I just know more about those things. 

I conceived one essay prompt using core concepts from Learning for Justice — a unit analyzing The Color of Law by Richard Rothstein, focused on institutionalized segregation — to demonstrate how AI tutors may respond to the presidential administration’s attack on “Woke AI.” Spoiler: Depending on your school district, a chatbot may teach you more “woke” history than your human educators. 



To make it fair, I started every conversation with a basic prompt asking for homework or study help. I chose not to provide detailed information about my student persona’s grade level, age, course, or state of residence unless the chatbot asked. I also tried to follow the chatbot’s line of thinking as much as possible without interruption — just as a student would with a human tutor or teacher — until it no longer felt helpful and I needed to steer the conversation back.

This, I hoped, would mimic the “average” student’s goal when using an AI tutor: To simply get their work done. 

Before we dive in: A note on building and testing AI tutors

Understanding the average student’s behavior is key to deciding if an AI tutor actually does its job, said Hamsa Bastani, associate professor at the University of Pennsylvania’s Wharton School and a researcher in this field. “There [are] very highly-motivated students, and then there [are] your typical students,” Bastani explained. Previous studies have shown gains, even if just minimal, among highly motivated students who properly use such tech, “because their goal is to learn rather than to get an A or solve this problem and move on.” But that usually reflects only the top 5 percent of the student pool.

This is part of a recurring observation dubbed the “five percent problem,” which has pervaded education tech design for years. In studies of tools designed to help students improve learning scores, including those by forerunner Khan Academy, only about 5 percent of tested students reported using the tools “as recommended” and thus received the intended learning benefits. The other 95 percent showed few gains. That 5 percent is also frequently composed of higher-income, higher-performing individuals, reiterated Bastani, meaning even the best tools are unlikely to serve the majority of learners. 

Bastani co-authored a highly cited study on the potential harm AI chatbots pose to learning outcomes. Her team found results similar to pre-generative AI studies. “The really good students, they can use it, and sometimes they even improve. But for the majority of students, their goal is to complete the assignment, so they really don’t benefit.” Bastani’s team built their AI learning tool using GPT-4, loaded with 59 questions and learning prompts designed by a hired teacher who showed how she would help students through common mistakes. They found that even though AI-assisted students reported far more effective study sessions than those doing self-study, few performed better than traditional learners on exams taken without AI help.

Information by itself isn’t enough.
– Dylan Arena, McGraw Hill

Across the board, Bastani says she has yet to come across an “actually good” generative AI chatbot built for learning. Of the studies that have been done, most show negative or negligible effects on learning.

The science just doesn’t seem to be there yet. In most cases, I learned from Bastani, turning an existing model into an AI tutor simply means feeding it an extra-long prompt on the back end, instructing it not to spit out an answer right away and to mimic the cadence of an educator. This is essentially what her team did in its tests. “The safeguards [AI companies] have implemented [on not just revealing answers] are not good. They’re so flimsy you can get around them with little to no effort,” added Bastani. “But I think a large tech company, like OpenAI, can probably do better than that.”

Dylan Arena, chief data science and AI officer for the century-old education company McGraw Hill, gave me this metaphor: AI companies are like turn-of-the-century entrepreneurs who have invented a 21st-century motor. Now they’re trying to find ways to retrofit that motor for our everyday lives, like a hemi engine with a sewing machine stuck to it. 

Arena, whose background is in learning science and who has been leading the AI initiatives at McGraw Hill, told me that companies are failing to really prepare users for this new era of tech, which is changing our access to information. “But information by itself isn’t enough. You need that information to be structured in a certain way, grounded in a certain way, anchored in a scope and sequence. It needs to be tied to pedagogical supports.” 

“They’ve done very little work validating these tools,” said Bastani. Few leading AI companies have published robust studies on the use of learning chatbots in school settings, she noted, citing just one report out of Anthropic that tracked university student use cases. In 2022, Google convened a group of AI researchers, scientists, and learning experts, resulting in the creation of LearnLM; the company later tested the model with a group of educators simulating student interactions and providing feedback as it launched with Gemini 2.5. 

“Your process might not be that different from the kind of ‘state of the art’ that we have now, for what it’s worth,” Bastani said. Let’s see if my results vary.

ChatGPT: A grade point maximizer



I’m starting with the big man in the room: ChatGPT’s Study Mode, which I ran on GPT-5 using a standard, free account. Users can turn on Study Mode by clicking the “plus” sign at the bottom of the chatbox. The company announced the new feature in July, saying it was designed to “guide students towards using AI in ways that encourage true, deeper learning.”

The first prompt I threw at this go-to bot was a screenshot of a polynomial long division problem that I pulled from the Algebra II section of the New York State Regents Exam. ChatGPT clocked the polynomial long division immediately, asking if I had done this type of problem before or if I needed a walk-through. I replied, “I’m not very good at math.” [If a chatbot asked my grade, I said I was a rising junior, or finishing 10th grade, approximately.] 

What followed was a step-by-step explanation, albeit with a lot of hand-holding. If I knew the next step and answered correctly, my tutor continued in good form. If I got something wrong, or asked a question, it would quickly give me the answer and move me on. No chance for me to try again, or an offer to do a practice problem so I could nail the concept. It sometimes gave me the answer, then asked me to repeat the steps by myself, with the answer right there in front of me. Of course, I couldn’t show my work either. Pen and paper don’t exist here. 

And then our chat ended. I couldn’t continue because, by dropping in that initial screenshot, I had reached my free daily limit.   


Next, I pulled up a question on ecology from the 2024 AP Biology exam. ChatGPT asked me what subject my biology test was on (a variety) and what style of test I was taking (free response). Even though I said I had a practice exam to work through, the AI tutor guided me through what I can only describe as a user feedback session, in which the bot explained what could be on the exam and encouraged me to do a “quick warm-up.” It asked, “When you see ‘ecology’ on an AP Bio FRQ, what are two big ideas you expect might come up? (For example, ‘food chains’ or ‘population growth curves.’)”

ChatGPT already had a study plan before I could offer input.
Credit: Screenshot by Mashable / OpenAI

It threw its own broad, subject-based short answer questions at me. I hadn’t given it my own test yet. By the time we got to a point in the practice testing where it was natural for me to finally share my own practice questions, my student self was already over it.

On to my preferred subjects. I again asked ChatGPT to help me practice for the Regents Exam English Language Arts section, this time with multiple choice and free response questions on author Ted Chiang’s short story, “The Great Silence.” Interestingly, ChatGPT seemed to know exactly what I was talking about, pulling up common question formats and subjects for a Regents test. “I can walk you through how to analyze it, find the answer, and explain the reasoning so you’ll feel confident doing it on your own,” it said. Later on, the chatbot said it was using the exact Regents benchmarks and formula to help me get the “best” response. Could this be a win for those studying for standardized tests?

During the session, it quickly went back to its old ways. ChatGPT immediately gave me what it thought were the central points, themes, and author’s argument for Chiang’s work. Once it had that sorted on my behalf, it wanted to dive into multiple choice questions and then offer up some of its own free response questions — again. Alright, that’s fine, but what about the questions I came with? At the end, it told me exactly how to get full credit. But is that really true on an ELA exam? I don’t think so. 

There were times I didn’t know what ChatGPT was asking me to do, or when it would choose to “grade” my answers versus break them down for me.
Credit: Screenshot by Mashable / OpenAI

Could the chatbot help me study for the ultra-subjective AP Art History exam? I gave it a shot, pulling questions on Faith Ringgold’s piece Tar Beach #2 from the 2024 AP Art History test. I chose this on purpose, because the College Board publishes examples of full-point answers — and that’s what I would give the chatbots. 

Once again, ChatGPT tried to start me with its own made-up questions. “Here’s how I’d like us to work,” it said. Deciding I didn’t want to go through the same roundabout studying method of the previous examples, I steered it away: I wanted to practice with a real free response. After giving it the sample AP test’s four-part short answer response, ChatGPT told me it was a “strong draft” worth a 4 or 5 on the test’s 5-point rubric. 

But I was feeding it, according to actual graders, a “perfect” answer. So why, then, did it tell me I needed to maximize my writing for full points? I could use better art vocabulary, it said, and add more about how Ringgold combines text and image. It also corrected grammar, while the others didn’t. This might sound enticing to users focused on cleaning up their writing, but grammar isn’t a scoring metric for the AP test; the exam is more interested in the way you think (a thing chatbots, decidedly, can’t do). After several versions, it started getting pedantic, rewriting my responses in its own voice to give them better flow. Sure, bud. 



Finally, I hit ChatGPT with a topic I pulled from Learning for Justice’s curriculum: how early 20th-century law led to housing discrimination and the segregation of Black communities. If I learned one thing, it’s that ChatGPT is champing at the bit to help you craft an essay. I could feel the chatbot salivating over generating fleshed-out outlines and concocting works cited pages for an essay I hadn’t even written yet. With little prompting, it was giving me topic sentences and offering to show me where I could insert references to experts and articles. I gave it little to no information on my grade or knowledge level, or what topics I had actually learned in class. It had no issue with the subject matter — calling it a “really strong and important prompt” — and its sources actually checked out, even pulling from the question’s central material, The Color of Law, which I hadn’t mentioned. 

But overall, I had to interject often. Can we just look at the questions and notes I’ve already taken and the responses I’ve completed, Mx. Chatbot Tutor? Focus on me, please.

“Of course,” ChatGPT responded. “That’s even better practice.”

Summing it up

ChatGPT Study Mode Pros: Succinct interactions and a minimalist user experience that make it easier to process what you are learning. Better at practice tests and quick overviews, and well suited to learners seeking clarification on rubrics and grading standards. 

Cons: Cheater, cheater, pumpkin eater. Frequently gave away answers unprompted and wouldn’t let users fix mistakes before moving on to the next step. A frustrating experience for free response-style questions, and the chatbot is obsessed with getting users to practice and perfect what they just “learned.” 

Curious about Gemini’s results? You may be surprised.  

