ChatGPT3 and programming education


Last summer I was on a panel about programming education, where my statement was: “The tyranny of autograders should be stopped”. As they say, “be careful what you wish for”, because with ChatGPT3 I might get my wish!

I have wanted to write a blog post accompanying that panel forever, and now that people are asking me how I think ChatGPT will impact programming education, I will do the two in one long post, as I am traveling to Groningen today: a three-hour train trip!

Autograders

First, what are autograders? Autograders are tools used in programming education to automatically grade programming exercises. Of course, automatic grading is used in many other fields too, for example multiple-choice questions in exams or practice sessions, apps that check calculated answers, or open-text tools that check for the presence of certain words.

Examples of autograders are Peach, which I used as a student at TU Eindhoven, TU Delft’s WebLab, CodeGrade as used at the VU, or Stepik, which I myself used at TU Delft and at Leiden, and which is a bit different since it connects to JetBrains IDEs.

Autograders notably differ from other automatic grading tools in three interesting ways.

The autograder process

Let’s first examine the process of the autograder:

Autograders use a two-step process.

  1. First, they run a compiler or interpreter* to be able to run the code. The output of the compiler is not controlled by the teacher, and is designed for (if designed for anyone!) professional programmers. For example, if a closing bracket is missing somewhere, a student might get output that reads: “unexpected end of file”.
    Only if the code compiles can the next step be performed.

  2. In most cases, validation is done by running tests, checking whether the code gives the right output for a certain input. For example, if the assignment is for the student to write code that calculates the square of a given number, the test cases might be (2, 4), (3, 9), (0, 0) and (-2, 4). In most courses I have seen or heard about, some of the test cases are hidden from the students, and a failure to output the right answer on a hidden test case will give cryptic output like: 3 out of 4 tests passed, 1 test failed.

    This output, contrary to the cryptic output given by compilers, is cryptic by design, and is caused by the Frantic Fear of Fraud which holds most of (programming) education hostage. The argument here is that But If WE Give THeM The TEsT Cases They Will Just WrITE a BiG IF!! For the test above, I could just write something like:

    if n == 2: print(4)
    elif n == 3: print(9)
    # etc.

    More on this frantic fear of fraud later.

    If the output is textual (and it often is; in our example of the squares, we might expect an output not of 4, 9, 0 and 4 but of “The square of 2 is 4.”), the output must match to the letter. Missed a dot? Error. Capitalised a word that should not have been? Error.
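To make that exactness concrete, here is a minimal sketch (in Python, and not how any particular autograder is implemented) of what such a test step boils down to: run the submission on each input and compare the output character for character. The submission here is hypothetical and almost right, it only misses the final dot.

```python
def student_solution(n):
    # A made-up submission that is almost right: the final dot is missing.
    return f"The square of {n} is {n * n}"

test_cases = [
    (2, "The square of 2 is 4."),
    (3, "The square of 3 is 9."),
    (0, "The square of 0 is 0."),
    (-2, "The square of -2 is 4."),
]

# Exact string comparison: one missing dot fails every single test case.
passed = sum(student_solution(n) == expected for n, expected in test_cases)
print(f"{passed} out of {len(test_cases)} tests passed")
```

One missing character and the student sees 0 out of 4 tests passed, with no hint about what is actually wrong.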

Notable differences

So to summarize, the three differences with what I have seen of automatic grading of more free-form output, like text, are:

  1. Output is given that is not controlled at all by the teacher, designed by and for professional programmers
  2. Information that could be used to correct wrong answers is, by design, withheld
  3. Extreme precision in the output is expected of students

Their Tyranny

In all honesty, I never really saw the issue with autograders, which, as is most often the case when people don’t see the issue with something, was caused by privilege. When I first encountered one, it was in an introductory programming course at university, in Pascal. By then I had 8 years of programming experience, including 4 years of Pascal, which I had used, among other things, to code my high school final project, which performed simple machine learning on datasets. So when I was asked to calculate how many candies m each person p would get, this was no big deal for me.

I started to get somewhat sceptical after using autograders for years in introductory programming, when I saw students struggle with really small things (like indenting one line of code) in an otherwise correct solution. The toxic combination of syntax errors first, followed by hidden test cases, drove them to understandable madness and derailed their plans entirely.

But I truly saw the issue when two of my high schoolers followed a university course, doing the exercises (online) while under my supervision. Their exclamations of “aaargghhhh, the tests ran for 10 minutes only to fail on one missing space” made it clear to me that they were not at all learning from this feedback to be precise (as intended by the instructor); they were learning that programming is unreasonably hard and designed by unreasonable people. While not untrue, this was not the lesson I wanted them to learn.

What autograders teach

The problems with autograders are many and run really deep into the belief systems of programming education, but since this post is already long, I will try to pick the most important issue, and that is that introductory programming courses (and with them: autograders) teach two different things**:

  • The syntax of a certain set of programming concepts in one programming language
  • “Problem solving skills” so that students can apply these concepts in the right combination to achieve a certain result

In itself, this is already a problem. As everyone with knowledge of learning can tell you, teaching two different things at the same time is doomed to fail. Some students can formulate very lucidly in words how to solve the given problem but don’t know how to put their ideas into code. Other students have no clue how to even start to attack the problem but can happily make programs that do other things. To autograders, and to some teachers and TAs, these students are indistinguishable, even though they need radically different help.

It is not just my opinion that these courses are doomed to fail. Introductory programming courses have notably high dropout rates, and lately it seems these dropout rates are getting higher. The cause of that is, I think, that in the nineties and noughties, people like me took them: kids with ample programming experience from high school. Now that more students without prior knowledge are taking programming courses, their bad educational design becomes even clearer.

But it gets worse! Not only do introductory courses aim to (and fail at) teaching two different things, they deny it with what must be the biggest gaslighting in the history of education. “This course IS Not AbOuT syNtAx” decorates all introductory slides, textbooks and lecture notes, because “It is ABoUt PrOBlEm SOlVinG!!!” Only for students to be confronted with “unexpected EOL” and “missing semicolon on line 4” in a two-line program.

What autograders do not teach

Before I go into the alternatives for autograders, let’s also look at what autograders do not teach. The more I teach programming, the more I explicitly teach what I call the praxis of programming: teaching students to use a debugger when they encounter an unexpected answer, choosing good positions for a breakpoint, placing print statements on suspicious lines and selecting values to print. Autograders not only do not teach praxis, they actively prohibit its use, especially through the use of hidden test cases.
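To illustrate what I mean by that praxis, here is a small sketch (the function and its bug are made up for illustration): a probe print on the suspicious line, with deliberately chosen values, immediately shows where the computation goes wrong.

```python
def square_sum(numbers):
    # Intended to return the sum of squares, but contains a planted bug.
    total = 0
    for n in numbers:
        # Probe: print the values we care about on the suspicious line.
        print(f"n = {n}, n * n = {n * n}, total so far = {total}")
        total += n + n  # bug: should be n * n
    return total

# Expected 1 + 4 + 9 = 14; the probe output shows total never uses n * n.
print(square_sum([1, 2, 3]))
```

This is exactly the kind of move, pick a line, pick the values, look at the trace, that hidden test cases make impossible.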

Many autograders are web-based and expect students to code in a local IDE and copy-paste or upload the code “once ready”. But since students can’t see when they are ready, because of the hidden tests, they often code right in the browser (this was one of the somewhat surprising things I noticed when my high schoolers were using CodeGrade in class). And in the browser there are, of course, no debuggers to use and no breakpoints to set.

And hidden test cases prevent reproducing a mistake in order to fix it, which, as all programmers know, is the most frustrating situation to be in: knowing an error exists but not knowing how to reproduce it so you can see the path the faulty code is taking. When a test fails and thus their solution “is wrong”, the only option available to students is mostly “do more thinking”, while the wrongness could be one misplaced indent, a wrong interpretation of the needed algorithm entirely, or anything in between.

The amount of prior knowledge that I as a teacher used to help my students was immense (try a negative number, try an empty string, try the same number twice).

And of course, people will now say: but those are things I want students to learn! However, the thing is, after such a torturous run, students do not want to look back and reflect on their learning. They simply cross the finish line, never look back, and learn hardly anything. In terms of cognitive science: their cognitive load while problem solving gets so high that they have no capacity left to store lessons in long-term memory.

Must be stopped

So, I do hope I am getting my wish and that autograders will be stopped soon. A question like “Write C++ code to invert this string” will soon be one mouse click away, and that is, in principle, a good thing for education. It will require us to rethink the mantra that in order to get better at coding, you need to do more coding.

There are so many other things we can use if we let go of that idea, and if we let go of the idea that students will cheat at every given opportunity. When I had just started teaching and asked why students were cheating in my class, a very wise and experienced teacher once told me: “You will not like this answer, but whenever more than a few students cheat, it indicates there is a problem in the course”. He was right, I did not like that answer, but it is very true. There will always be a few students who cheat (sometimes even for good reasons), but when it happens en masse, students see no other option, because despite what we might think, most of them really do want to learn. If a lot of cheating happens, the exercises are too hard and/or the students do not see their value. If you are stuck on a problem where you have not made any progress for 3 hours, of course you cheat! It is the sensible thing to do.

What could we use instead? Here are some ideas:

Good old fashioned multiple choice questions

I have never understood everyone’s hatred of multiple-choice questions; they are a fine tool to use, and if you align your distractors with misconceptions, you can learn a lot from wrong answers!

Parsons problems

Provide students with the correct lines of code, in random order, and have them drag and drop the lines into the right order. Students could still cheat, but for the reasons above they will be a lot less likely to do so, since they can see they will not be stuck for hours. A lot of research explains how to design Parsons problems well, and it also shows that performance on Parsons problems correlates well with performance on open programming problems.
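Grading these is trivially automatable. A minimal sketch (not how any specific tool implements it; the exercise is made up): shuffle the reference lines for the student, then check their submitted order against the reference.

```python
# The reference solution, line by line, in the correct order.
reference = [
    "def mean(numbers):",
    "    total = 0",
    "    for n in numbers:",
    "        total += n",
    "    return total / len(numbers)",
]

def check(student_order):
    # A submission is correct when the lines are back in reference order.
    return student_order == reference

# What the student would see: the same lines in a scrambled order.
shuffled = [reference[i] for i in (2, 0, 4, 1, 3)]

print(check(shuffled))   # still scrambled, so not yet correct
print(check(reference))  # the right order passes
```

Note that a wrong answer here is immediately inspectable: the student sees all the lines, so “which line is in the wrong place?” is a question they can actually reason about.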

Answer a question about program x

I think the coolest idea is one where we can use AI to generate meaningful questions with automatically verifiable answers. Input a program x, and such a tool generates questions like:

  • What are the three variables in this program and their types?
  • How often is line 5 executed if I input 19?
  • On what line did the programmer make a mistake here?
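A question like “how often is line 5 executed if I input 19?” can be verified automatically by tracing the program. A sketch of how, assuming submissions are Python (the exercise program and the `<exercise>` name are made up; `sys.settrace` is standard Python):

```python
import sys

# A hypothetical exercise program; line 5 is the even branch (n = n // 2).
source = """\
def collatz_steps(n):
    steps = 0
    while n != 1:
        if n % 2 == 0:
            n = n // 2
        else:
            n = 3 * n + 1
        steps += 1
    return steps
result = collatz_steps(19)
"""

counts = {}

def tracer(frame, event, arg):
    # Count every execution of every line belonging to the exercise file.
    if event == "line" and frame.f_code.co_filename == "<exercise>":
        counts[frame.f_lineno] = counts.get(frame.f_lineno, 0) + 1
    return tracer

code = compile(source, "<exercise>", "exec")
env = {}
sys.settrace(tracer)
exec(code, env)
sys.settrace(None)

print(counts.get(5, 0))  # how often line 5 ran for input 19
```

The verifier knows the exact answer, yet a wrong student answer points to something concrete to re-examine: trace the program by hand and see where your count diverges.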

The great benefit of the above options is that they all emphasize training students to critically read code, which is what we will need a lot of once GPT3 starts spewing its legit-looking nonsense code.

Create a program with certain concepts

If you want to also throw some production into the mix, you can consider more free-form exercises.

One day I will do another long blog post about the tension between problem solving while constraining students with exercises that have a clear right answer, and ‘free programming’ in which students are allowed to use code for their own pleasure and their own problems. But asking students to create something of value to them with the learned concepts is a great way for them to show what they have learned.

“Create any program with a loop and a condition and explain how you used these concepts” is a fine way to grade, and it can even be done in a (partly) automated way by verifying that the needed concepts are present.
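That automated part is easy to sketch, assuming submissions are Python: parse the submission with the standard `ast` module and check that the required node types occur, without prescribing what the program must do. (The sample submission is made up.)

```python
import ast

def has_loop_and_condition(source):
    # Walk the syntax tree and collect the node types that occur.
    nodes = {type(node) for node in ast.walk(ast.parse(source))}
    has_loop = ast.For in nodes or ast.While in nodes
    has_condition = ast.If in nodes
    return has_loop and has_condition

submission = """\
for guest in ["Ada", "Grace", "Edsger"]:
    if guest == "Edsger":
        print("Welcome, professor!")
    else:
        print(f"Welcome, {guest}!")
"""

print(has_loop_and_condition(submission))     # a loop and a condition: passes
print(has_loop_and_condition("print('hi')"))  # no loop or condition: fails
```

The human part, reading the student’s explanation of how they used the concepts, stays with the teacher, which is exactly where I want it.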

I asked my high schoolers (7th grade) to build something of their own choosing using 3 variables and 5 conditionals, and there was no cheating at all, since they were all so engaged in their own ideas for interactive stories or restaurant apps.

——–

* I will use “compiler” in the remainder of the post to ease reading, but I can assure you I understand the difference, don’t @ me
** Plus, often but not always: instilling in students the need to be extremely precise
