Some problems in software engineering do not naturally manifest until the software in question is sufficiently big — which is the case for a lot of real software. By big, I don’t mean just a few thousand lines of code, but rather hundreds of thousands or even millions of lines. This is software that supports large parts of modern life[^1], and it often must be maintained and evolved over many years, usually by many contributors who come and go. Usability of tools and techniques takes on a distinct meaning in such settings, where pragmatism is preferred over perfection. In this post, I will sketch some of the unique challenges that arise when solving one particular software engineering problem at scale — finding and fixing security vulnerabilities.
Security vulnerabilities are essentially bugs, with the distinguishing property that they lead to unexpected behavior that can compromise the integrity of the system. This integrity may be defined in terms of confidentiality of user data, availability of service to clients, or prevention of unauthorized transactions. Since these are not usually bugs in the functionality of the software, they are not always preventable simply by adhering to specifications, and they cannot be easily detected by standard testing because of the atypical and sophisticated ways in which they lead to exploits. With big software, this problem is exacerbated for several reasons. First, a large codebase is usually written by many people over a long period of time, so no single person has a complete mental model of the entire codebase. Second, the complexity of the data and control flows in a large codebase, including its interactions with external libraries and systems, makes it very difficult to analyze the code manually. Together, these rule out manual code reviews or ad hoc analysis by individual developers as a complete answer. Moreover, it is not easy to patch parts of the code arbitrarily without a deep understanding of the behavior of the system, because a seemingly innocuous change can have unintended consequences elsewhere in the code — making development teams reluctant to accept such changes.
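To make this concrete, here is a small, hypothetical illustration (the directory and function names are made up for this post): a file-serving helper that behaves correctly on every input a functional test would think to try, yet contains a classic path traversal weakness (CWE-22).

```python
import os

BASE_DIR = "/srv/app/user_uploads"  # made-up storage root for this example

def read_user_file(filename: str) -> bytes:
    """Return the contents of a file the user previously uploaded.

    Functionally correct for every expected input ("report.pdf",
    "notes.txt", ...), so ordinary unit tests pass. But a crafted
    input like "../../../etc/passwd" escapes BASE_DIR entirely:
    a path traversal weakness (CWE-22), not a functional bug.
    """
    path = os.path.join(BASE_DIR, filename)
    with open(path, "rb") as f:
        return f.read()

def read_user_file_fixed(filename: str) -> bytes:
    """Same behavior for legitimate inputs, but rejects escapes."""
    path = os.path.realpath(os.path.join(BASE_DIR, filename))
    if not path.startswith(os.path.realpath(BASE_DIR) + os.sep):
        raise ValueError("path escapes the upload directory")
    with open(path, "rb") as f:
        return f.read()
```

Nothing in the function's specified behavior is violated on ordinary inputs, which is exactly why such weaknesses slip past standard testing.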
Static analysis has long been a popular choice for finding vulnerabilities in codebases. Static analysis literally means “analysis without execution”: it looks at just the source code to find patterns that may indicate bugs. These can be simple analyses like type checking or linting, which are often provided as part of compilers and development environments. But they can also be more sophisticated analyses of the data and control flows in the program, like tracking the flow of untrusted data to sensitive operations. The advantage of these tools is that they usually do not require any special setup and can be run on any codebase without project-specific configuration. In the context of security vulnerabilities, static analysis tools can take advantage of the fact that software is written by humans — and humans are quite predictable in how they write code, since they usually pick it up from the people around them: peers and professors at university, other developers on their team, open-source projects, and online forums. This leads to common patterns in code, including common weaknesses[^2] that are known to lead to vulnerabilities, which can be searched for using static analysis. Modern static analysis tools like CodeQL provide expressive query languages to search for such patterns and can process large codebases efficiently.
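To give a flavor of what “tracking the flow of untrusted data to sensitive operations” looks like, here is a deliberately tiny sketch written for this post. It flags calls to `os.system` whose argument syntactically contains a call to `input`; real analyzers such as CodeQL track flows across assignments, functions, and files, which this toy does not attempt.

```python
import ast

SOURCES = {"input"}           # functions returning untrusted data
SINKS = {("os", "system")}    # sensitive operations (here: shell command execution)

class TaintFlagger(ast.NodeVisitor):
    """Toy flow check: flag a sink call whose argument syntactically
    contains a call to a source. Real analyses propagate taint through
    assignments, function calls, and data structures; this sketch only
    illustrates the source-to-sink idea."""

    def __init__(self):
        self.alerts = []

    def visit_Call(self, node):
        if self._is_sink(node.func):
            for arg in node.args:
                if any(self._is_source(n) for n in ast.walk(arg)):
                    self.alerts.append(node.lineno)
        self.generic_visit(node)

    @staticmethod
    def _is_sink(func):
        return (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in SINKS)

    @staticmethod
    def _is_source(node):
        return (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in SOURCES)

code = 'import os\nos.system("convert " + input("file: "))'
flagger = TaintFlagger()
flagger.visit(ast.parse(code))
print(flagger.alerts)  # [2]: possible command injection (CWE-78) on line 2
```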
However, the advantages of static analysis do not come without costs, especially when dealing with large codebases. Many classes of static analysis problems are either undecidable or computationally infeasible to solve exactly, particularly for large volumes of code. For this reason, static analysis is by design run in an approximate manner[^3], making use of heuristics to bypass computationally expensive steps. This leads to false alarms, where the analysis flags a non-issue as a potential vulnerability (it can also miss real vulnerabilities, although this is less common). False alarms might be acceptable for small projects, where a developer can manually triage the results and easily follow the traces reported by the analysis. But for large projects, they can pile up into an overwhelming number that is too expensive to triage manually — effectively rendering the tool useless. In some cases it might be possible to modify the static analysis queries to reduce false alarms, but this puts an additional burden on a software team that may not have the expertise or the resources to maintain such queries. In software engineering practice, it is often much better to sacrifice some real vulnerabilities in favor of a smaller set of high-confidence results that can be handled easily. We will come back to this point soon.
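A back-of-the-envelope calculation shows how quickly this adds up; every number below is invented purely for illustration, not measured from any real project.

```python
# Invented numbers, for illustration only.
alerts = 2_000            # alerts from one analysis run on a large codebase
false_alarm_rate = 0.90   # plausible for a broad, over-approximating query set
minutes_per_triage = 15   # reading the trace, checking the code, writing it up

wasted_hours = alerts * false_alarm_rate * minutes_per_triage / 60
real_issues = alerts * (1 - false_alarm_rate)
print(f"{wasted_hours:.0f} engineer-hours spent on false alarms "
      f"to surface {real_issues:.0f} real issues")
# -> 450 engineer-hours spent on false alarms to surface 200 real issues
```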
What we need is an intelligent system that can understand the code at a deeper level and weed out false alarms, assisting the tedious manual triage process. A very tempting option is to use Large Language Models (LLMs) for this purpose, since they have shown a remarkable ability to understand and generate code for various programming tasks. LLMs by themselves cannot be asked to look for vulnerabilities in a large codebase, because of their limited context windows and diminishing accuracy with large, unfocused prompts. But what they can do very effectively is start from static analysis results and triage them by looking at the relevant parts of the code, just like a human developer would. Unlike static analysis, LLMs are probabilistic and can also hallucinate — but if they can statistically reduce the number of false alarms, they can be very useful in the field.
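A minimal sketch of what such a triage step could look like, assuming a generic `complete` callable that sends a prompt to some LLM and returns its text; the `Alert` fields, the prompt wording, and the verdict format are all inventions for this post.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    rule_id: str   # e.g. "py/command-line-injection"
    message: str   # the analyzer's explanation
    snippet: str   # code around the flagged location

TRIAGE_PROMPT = """\
You are reviewing a static analysis alert.

Rule: {rule_id}
Analyzer message: {message}

Code under review:
{snippet}

Question: can untrusted input actually reach the flagged operation
on some execution path? Answer with a verdict line, either
VERDICT: TRUE_POSITIVE or VERDICT: FALSE_POSITIVE, followed by a
one-paragraph justification citing specific lines.
"""

def triage(alert: Alert, complete) -> bool:
    """Return True if the model judges the alert a true positive.

    `complete` is any text-completion callable (prompt -> str);
    wiring it to a concrete LLM API is left out of this sketch.
    """
    answer = complete(TRIAGE_PROMPT.format(
        rule_id=alert.rule_id,
        message=alert.message,
        snippet=alert.snippet,
    ))
    return "VERDICT: TRUE_POSITIVE" in answer.upper()
```

Asking for a single machine-checkable verdict line keeps the output easy to act on; in practice, the justification would also be logged for the human reviewer.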
LLMs can reason well when shown relevant code and asked specific questions. But how do they access the relevant code in the first place? This too is an engineering challenge, because it requires building infrastructure that can digest large codebases and provide semantically aware tools to traverse the code and retrieve relevant snippets. Even with such tools at hand, a delicate balance must be struck between providing too little context, which risks missing information needed to reason about the vulnerability, and providing too much, which risks the model getting distracted and losing focus. Of course, for a small codebase, one could throw the entire project at the model at once — but for larger codebases, it is important to decide which slices of code are even relevant to the vulnerability in question.
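As a rough sketch of the kind of retrieval involved, the function below (written for this post, and limited to a single file) gathers the function enclosing a flagged line and then the definitions of the functions it calls, stopping at a size budget. A real system would need a cross-file index or a language server, plus handling for methods, imports, and aliasing.

```python
import ast

def relevant_context(source: str, flagged_line: int, budget_chars: int = 6000) -> str:
    """Collect code slices likely relevant to an alert: the function
    enclosing the flagged line, then the definitions of functions it
    calls directly, until a size budget is reached."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}

    # The function that encloses the flagged line (if any).
    enclosing = next(
        (f for f in funcs.values() if f.lineno <= flagged_line <= f.end_lineno),
        None,
    )
    if enclosing is None:
        return ""

    slices = [ast.get_source_segment(source, enclosing)]
    used = len(slices[0])

    # Add functions called inside it, skipping duplicates,
    # until adding one more would exceed the context budget.
    seen = {enclosing.name}
    for node in ast.walk(enclosing):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            name = node.func.id
            if name in funcs and name not in seen:
                seen.add(name)
                segment = ast.get_source_segment(source, funcs[name])
                if used + len(segment) > budget_chars:
                    break
                slices.append(segment)
                used += len(segment)
    return "\n\n".join(slices)
```

The budget is the crude knob here: too small and the model misses the callee where sanitization actually happens; too large and the prompt fills with unrelated code.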
Remember that handling such static analysis results is actually quite a low-priority task for most software teams, because the flagged issues are mostly unrelated to normal functionality and usually come without an exploit to demonstrate their severity. Yet finding such issues before an exploit is even attempted is the whole point of static analysis, so fixing them is important for the long-term security of the software. Thus, a practical toolchain for assisting this whole process should not only triage results but also suggest high-quality, focused patches that can be reviewed and applied with minimal friction. Statically looking at code cannot always reveal intended behavior, so these patches can only be best-effort starting points for review. LLMs once again provide a promising avenue here, because they can generalize from a few instructions and examples to handle entire classes of patches, whereas a symbolic approach would require separately handling each of the many code styles found in the wild. And once again, solid infrastructure is needed to direct LLMs to the right parts of the code and ensure that the patches they produce follow both the style of the codebase and the guidelines related to the vulnerability in question. GitHub’s Copilot Autofix can do this to some extent, but it is known to get overwhelmed by large codebases and produce incoherent, incomplete patches that are all over the place. This is again a problem of scale.
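Continuing the sketch from before, a patch-suggestion step might pair the retrieved context with remediation guidance for the specific weakness class. The prompt text and the `suggest_patch` signature are again inventions for this post, and `complete` is the same hypothetical LLM callable as in the triage sketch.

```python
FIX_PROMPT = """\
You are proposing a fix for a static analysis alert that was
triaged as a true positive.

Rule: {rule_id}
Remediation guideline: {guideline}

Relevant code:
{context}

Produce a minimal unified diff that fixes only this issue, keeps the
behavior for legitimate inputs unchanged, and matches the surrounding
code style. Do not refactor unrelated code.
"""

def suggest_patch(rule_id: str, guideline: str, context: str, complete) -> str:
    """Ask the model for a focused, reviewable patch.

    `guideline` is the rule's (or the project's) remediation advice,
    e.g. "validate resolved paths against the upload root"; `complete`
    is a text-completion callable (prompt -> str). The returned diff is
    a best-effort starting point for human review, not something to
    apply blindly.
    """
    return complete(FIX_PROMPT.format(
        rule_id=rule_id, guideline=guideline, context=context,
    ))
```

Asking for a minimal unified diff, rather than rewritten files, keeps the suggestion reviewable and makes it easy to reject when the model misjudges the intended behavior.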
All of this was a prelude to the main point of this post, which is that scale can introduce unique challenges that are not present in toy settings. This can seem counterintuitive, because it makes several problems in software engineering harder to experiment with and nearly impossible to solve in the traditional sense, even when perfect solutions exist for small cases. Software engineering is a field science, and the main takeaway I hope to convey is that a tool designed without confronting these realities may not work as intended when taken to the field[^4] — when the question is one of a million lines of code. In a future post[^5], I will follow up by sketching some ideas on how to think about building components for a toolchain that solves the problem I outlined above, and how to evaluate them effectively.
[^1]: Or software that makes large parts of money in the modern economy.

[^2]: See Common Weakness Enumeration for a catalog of such weaknesses.

[^3]: In practice, usually by over-approximating the set of possible program behaviors so that true vulnerabilities are not missed. I think one of the reasons for this is that over-approximation is often easier to compute effectively in a generic, i.e., project-agnostic, fashion compared to under-approximation. Under-approximation can render the analysis uninformative for many repositories, possibly causing more harm than good.

[^4]: This is probably also my main takeaway from my internship at Microsoft Research, where I deal with some of these problems directly and at the scale I described.

[^5]: I have become increasingly good at making such promises on this blog.