Reflections on GitHub Copilot as a Partner in Undergraduate Research

I wasn’t following the AI hype before I joined college[1], and while I had heard of GitHub Copilot[2], I didn’t think much of it. I had been programming for several years by then and couldn’t imagine how AI could help me write code. I was not the only skeptic; in particular, most educators still saw AI as just another tool for plagiarism and were highly conservative about its use.

Luckily for me[3], my first programming and algorithms course at IISc did not shy away from such developments. My professor, Viraj Kumar, made Copilot an integral part of the course, focusing on two main aspects:

  1. Writing good specifications that express our intent to Copilot unambiguously, possibly with the help of simple examples.

  2. Reading and critiquing code produced by Copilot, to make sure it correctly reflects our intent.

These are, in fact, basic software engineering skills that are indispensable even when working with other programmers — and that is essentially how we should think of Copilot. In this blog post, I will focus specifically on how Copilot helped me get started with undergraduate research, as a collaborator and, in some ways, a mentor. Of course, this was possible because of GitHub’s Student Developer Pack[4].

Using Copilot to improve specifications

My first undergraduate research project came out of some discussions with Professor Viraj about how Copilot often suggests code that does not pass even the provided test cases. What was more interesting was that in many cases where the top suggestion was wrong, one of the other suggestions would be correct[5]. This is not surprising for a language model that cares more about strings than about their semantics. I built a simple VSCode extension that would look at all of these suggestions and filter out the ones that did not pass the given test cases; a sketch of this filtering step follows the example below. Note that our inputs were of the following form, where we asked Copilot to write the body of the function.

def function_name(arg1: int, arg2: str) -> bool:
    """
    Purpose of the function.

    >>> function_name(1, "test") # doctest 1
    True

    >>> function_name(2, "test") # doctest 2
    False
    """

At this point, we wondered if better specifications could have helped Copilot produce correct code more confidently. It seemed that the lack of confidence in the suggestions — by which I mean that the suggestions were functionally different from each other — was due to the ambiguity in the specification. Even more importantly, it seemed that these very differences between the suggestions could be used to reveal the ambiguity in the specification. My programming and algorithms course had focused a lot on writing good purpose statements for functions, and both Professor Viraj and I were excited to see if we could use such findings to build a tool that would not only alert programmers to ambiguity in their specifications but also help students learn how to write better specifications in the first place. Beyond that, such a tool could encourage students to think more deeply about given requirements and ask the right clarifying questions, a skill that is often overlooked in programming courses.

This took shape as a tool called GuardRails, which we eventually presented at the COMPUTE 2023 conference. The idea was rather simple[6]: we would take all of Copilot’s suggestions, filter out the ones that did not pass the given test cases, and then run the remaining suggestions through several inputs[7] to see if we could find an input where they behaved differently. If we did, this input, and potentially a class of similar inputs, would be a suspected source of ambiguity in the specification that we would then report to the user.
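For the differential step, a minimal sketch might look like the following, assuming the surviving suggestions have the (int, str) -> bool signature from the earlier example; find_disagreement and the input generator are illustrative stand-ins for our actual fuzzer.

import random
import string

def find_disagreement(func_a, func_b, trials=10_000):
    """Differentially fuzz a pair of suggestions; return an input on
    which they disagree, or None if none is found."""
    for _ in range(trials):
        # Hypothetical generator matching the (int, str) signature above.
        args = (random.randint(-100, 100),
                "".join(random.choices(string.ascii_lowercase,
                                       k=random.randint(0, 8))))
        try:
            out_a, out_b = func_a(*args), func_b(*args)
        except Exception:
            continue  # skip inputs that crash either candidate
        if out_a != out_b:
            return args  # a suspected source of ambiguity in the spec
    return None

Running something like this over every pair of surviving suggestions is the differential-fuzzing idea from footnote 7, just with a far more naive input generator.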


A quick demo of GuardRails. You can check out our paper on arXiv. Our tool presently does not work because of changes in the Copilot VSCode extension, and an update is underway. If you would like to contribute, please reach out to me, and I would be happy to discuss.

From a logistical perspective, this was a time when LLMs were still new, and free APIs were not as common as they are now. We wanted to quickly experiment with our ideas without having to first seek funding to self-host an LLM or get access to a paid API. Copilot was not meant to be used this way, but making our tool work on top of it was indeed a nice, hacky way to quickly implement our ideas and test them out. Today, of course, LLMs are more accessible, and Copilot itself provides many ways to smoothly integrate additional functionality, such as through extensions and even agents.

Validating translated code at scale and accidentally learning Java

While working on GuardRails, Copilot’s role in my research was like that of a monkey in an animal behavior study. We analyzed its behavior, saw what it did right and wrong, tried to understand why it did what it did, and then used that understanding to better guide its behavior. This was quite different from the colleague-like role it played in my next project, related to validating code translated from one programming language to another. I have talked about this work, called GlueTest, briefly in my previous blog post.

I had the opportunity to co-lead this work under the supervision of Professor Darko from UIUC and Professor Saikat from Cornell. We were essentially trying to build a testing framework to validate whether the code produced by translating an existing code base (in our case, Java) to another language (in our case, Python) is indeed a correct translation. As I also discussed in my previous blog post, this has a few interesting challenges. It is natural to translate the test suite in the source language to the target language and use it to validate the translated code — but it’s usually not that simple! Test code can be voluminous and occasionally complicated — and how do we know that the tests themselves were correctly translated? Moreover, while translation efforts are usually incremental[8], tests often exercise too many parts of the code base at once, making it hard to translate small parts of the code base at a time and validate them. And if larger parts of the code base are translated at once, it becomes harder to localize errors when a test fails, making debugging more time-consuming.

Long story short — we proposed a framework to directly run tests in the source language on the translated code in the target language, with a glue layer that translates data on the boundary between the two languages. Where does Copilot come into this picture? We manually translated two Java libraries[9] to Python, and, well, this was my first time writing Java code. Yet I managed to become one of the main contributors to the project, thanks to Copilot. I would translate the code class by class, and Copilot would give me a reasonably good template to start with. Usually, I would modify the translation of one of the methods in the class to fit my needs, and Copilot would very quickly adapt to that style and structure. Of course, Copilot would make mistakes, especially where the two languages differ in subtle ways, but my work was limited to finding solutions to these tricky[10] problems instead of doing the grunt work of writing obvious code. And through all of this, I found myself slowly picking up a significant amount of Java[11]. I am not saying that I am now a Java programmer, but I could see during my other projects and some of my classes that my understanding of Java and its nuances has been coming in handy.


An illustration of our GlueTest approach. What happens here is that we create a Python representation of every Java object, and when a Java test calls a method in the Java main code, we redirect that call to the corresponding Python representation, which implements the same method in Python. There is a bit of engineering that goes into this, for example, to convert Java types to Python types and vice versa, but this is the main idea. A preprint of our work is freely available here.
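To make the redirection concrete, here is a minimal, hypothetical sketch of the glue idea from the Python side. The class name, the handle-based dispatch, and the bridge that delivers calls from Java are all illustrative assumptions; the actual GlueTest engineering differs.

class CSVRecord:
    """Python translation of a Java class of the same name (illustrative)."""
    def __init__(self, values):
        self.values = list(values)

    def get(self, index: int) -> str:
        return self.values[index]

MIRRORED_CLASSES = {"CSVRecord": CSVRecord}
live_objects = {}  # Java object handle -> its Python representation

def glue_new(handle, class_name, *args):
    """Called when Java constructs an object: build its Python counterpart."""
    live_objects[handle] = MIRRORED_CLASSES[class_name](*args)

def glue_invoke(handle, method, *args):
    """Called when a Java test invokes a method: redirect the call to the
    Python counterpart. Type conversion (Java String <-> str, null <-> None,
    and so on) would happen at this boundary."""
    return getattr(live_objects[handle], method)(*args)

# e.g. the bridge would turn `new CSVRecord(...)` and `rec.get(0)` into:
# glue_new(42, "CSVRecord", ["a", "b"]); glue_invoke(42, "get", 0)  # -> "a"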

Conclusion

The bottom line here is that Copilot makes me, on average, 10 times faster at producing code. It does seem very magical, and at times it might even feel like I am cheating my way through. But at the end of the day, it is another tool on top of the plethora of linters, snippets, auto-completions, and other aids that our code editors have provided for a long time. And like any of these tools, it is not a replacement for good programming practices. As I discussed above, it is also not a replacement for actually learning to write code! I don’t think it is any longer a question of whether or not we should use AI for programming — so we should instead think more deeply about how we can use it effectively to complement our existing skills and tooling, what its limitations are, how far we can get around these limitations with additional infrastructure, and how we can use it to make coding more accessible, not just to novices but even to experienced programmers who just want to have a little bit of fun!

  1. Even when I did become interested, I gave up on trying to keep up with the latest news within just a couple of months because it was simply too much, too fast. Now, I only read up when something catches my attention or when I need to know something specific. 

  2. Which had been released to the public less than four months earlier, in June 2022. 

  3. At least retrospectively. 

  4. As a side note, I must acknowledge that the Student Developer Pack is a genius move by GitHub to get students hooked on Copilot and other tools. I cannot imagine working without Copilot now, and I have decided to remain a student for as long as possible to keep using it for free (I am talking about grad school 😉). 

  5. We could press Ctrl+Enter in VSCode to see the list of up to 10 top suggestions. 

  6. And had been explored before in somewhat different contexts, as in this paper. 

  7. To be precise, we performed differential fuzzing on every pair of these suggestions. 

  8. All-or-nothing translations have been attempted, but as Terekhov and Verhoef mention in their article, this has been disastrous in practice — from abandoning entire projects to even bankruptcies. 

  9. Apache Commons CLI and Apache Commons CSV. 

  10. And arguably more interesting. 

  11. This deserves another blog post, but I believe I could truly appreciate the beauty of object-oriented programming only after working with Java. This is not only about how nice the code looks but also about how it enforces a certain way of thinking about code and writing it.