GitHub, a code-hosting website built atop the Git version-control system, hosts hundreds of millions of repositories of code uploaded by more than 65 million developers. The Semantic Code team at GitHub builds and operates a suite of technologies that power symbolic code navigation on github.com. Symbolic code navigation lets developers click on a named identifier in source code to navigate to the definition of that entity, as well as the reverse: given an identifier, they can list all the uses of that identifier within the project.
This system is backed by a cloud object-storage service, having migrated from a multi-terabyte sharded relational database, and serves more than 40,000 requests per minute, across both read and write operations. The static analysis stage itself is built on an open source parsing toolkit called Tree-sitter, implements some well-known computer science research, and integrates with the github.com infrastructure in order to extract name-binding information from source code.
The system supports nine popular programming languages across six million repositories. Scaling even the most trivial of program analyses to this level entailed significant engineering effort, which is recounted here in the hope that it will serve as a useful guide for those scaling static analysis to large and rapidly changing codebases.
Navigating code is a fundamental part of reading, writing, and understanding programs. Unix tools such as
grep(1) allow developers to search for patterns of text, but programmers' needs are larger in scope: What the are most interested in is how the pieces of a program stitch together—given a function, where is it invoked, and where is it defined? Quick and quality answers to these queries allow a programmer to build up a mental model of a program's structure; that, in turn, allows effective modification or troubleshooting. Tools such as grep that are restricted to text matching and have no knowledge of program structure often provide too little or too much information.
Fluent code navigation is also an invaluable tool for researching bugs. The stack trace in an error-reporting system starts a journey of trying to understand the state of the program that caused that error; navigating code symbolically eases the burden of understanding code in context. As such, most integrated development environments (IDEs) have extensive support for code navigation and other such static analyses that ease the user's burden.
The Semantic Code team wanted to bring this IDE-style symbolic code navigation to the Web on github.com. The team was inspired by single-purpose sites such as source.dot.net, Mozilla, and Chromium Code Search that provide comprehensive in-browser code-navigation. The question was how to do that at scale: GitHub serves more than 65 million developers contributing to over 200 million repositories across 370 programming languages. In the last 12 months alone, there were 2.2 billion contributions on github.com. That's a lot of code and a lot of changes.
The Semantic Code team's approach to implementing code navigation centers around the following core ideas.
It's not sufficient to require that developers clone and spin up their own IDEs (or wait for an in-browser IDE such as GitHub Codespaces to load); developers are expected to be able to read and browse code quickly without having to download that code and its associated tooling. For this feature to scale and serve all of GitHub, it has to be available everywhere and in every project. The goal is for developers to focus on their programs and the problems they are trying to solve, not on configuring GitHub to work properly with their projects or convincing another project owner to get the settings right.
Though this means the language grammars may accept a superset of a given language, this philosophy yields the ability to scale and deliver results much faster. The infrastructure can run a single code stack, there's no management of multiple containers and associated resource costs, there's no cold start time for bringing up tooling, and there's no attempt to detect a project's structure, configuration, or target language version.
This also means that major aspects of analysis can be implemented in one place, providing abstractions such as machine-readable grammars and a tree query language. Only certain things unique to a programming language must be developed before that language can be enabled in the rest of the technology stack.
The static analysis that the GitHub code-navigation feature is built upon is called a tag analysis. A tag analysis looks at the definitions and the usages of functions, variables, and data types, collating them into a format suitable for viewing the definition of a given entity, as well as querying the codebase in full to determine all the places where that entity is used.
This analysis was introduced by a program named
ctags, developed by Ken Arnold and released in 1979 as part of BSD Unix 3.0. As its name indicates,
ctags supported only the C programming language; despite this, it was an immediate success, eventually incorporated into the Single Unix Specification. Further descendants of
ctags added support for more programming languages, and the vast majority of text editors and IDEs now provide
ctags integration. Though a tag analysis is trivial, relevant to the state of the art in static program analysis, implementing such an analysis at GitHub scale and within GitHub's distributed architecture was not.
Tree-sitter. The first step in any sort of static analysis is to parse textual source code into a machine-readable data structure. This is done by producing concrete syntax trees with Tree-sitter, a parser generator tool and incremental parsing library.
Tree-sitter enables specifying a grammar for a programming language and then generating a parser, which is a dependency-free C program. Given a parser and some source code, the system yields a syntax tree of that source code. The act of tagging that code entails walking the tree and performing a filter map operation. Using an s-expression-like DSL (domain-specific language), Tree-sitter provides tree queries that allow specifying which nodes to match in the syntax tree (for example, an identifier that represents the name of a function). The tooling in the Tree-sitter ecosystem allows fast iteration for grammar development and tree matching, making it possible to support new language syntax quickly or identify new constructs for code navigation.
Tree-sitter pulls from several areas of academic research, but the most prominent is generalized LR (GLR) parsing. An extension of the well-known LR (left-to-right derivation) parsing algorithm, the GLR algorithm can handle ambiguous or nondeterministic grammars, which are common in modern programming languages. Real-world languages use many different parsing algorithms: Python uses the LL(1) algorithm, Ruby uses LALR, and GCC (GNU Compiler Collection) projects use custom recursive descent parsers. The GLR algorithm, introduced in 1974 by Bernard Lang and implemented in 1984 by Masaru Tomita, is flexible enough to parse all these languages; relying on a less sophisticated algorithm such as LR(1) would be too restrictive, given that such parsers are limited to looking ahead only one token in the input (see sidebar for more information).
Tagging the trees. The next challenge is to do some basic analysis of the parse trees. To start, the team decided to provide naive name-binding information based just on identifiers for known syntactic constructs. This was originally designed as a prototype, but it worked so well it was shipped while work was beginning on further refinements.
The basic idea is to:
It turns out that for many languages, this name-binding information alone provides a compelling product experience. It's not perfect, but it's such an improvement over not having any information at all, that the team used this prototype to flesh out the production-scale system, while simultaneously researching how to refine the precision of the results.
To walk the parse tree, it suffices to use Tree-sitter's support for queries. By specifying the structure, using an s-expression syntax, of the types of syntax nodes that contain useful identifier information, a walk of the parse tree can yield only the nodes specified. As shown in Figure 1, given the use case of identifying method definitions in Ruby, a small snippet of Ruby code is parsed that defines a method add.
In more complex situations, the Tree-sitter tree queries can maintain custom state during the query's operation. This is important for a language such as Ruby, which makes no syntactic distinction between variables and functions called without arguments. To yield useful results, you must record the names of any local variables in a given Ruby scope, which allows that syntactic ambiguity to be eliminated.
GitHub's code-navigation pipeline is built atop open source software and standards:
The system is made up of three independent services deployed on Kubernetes that integrate with the main github.com Ruby on Rails application (affectionately called "the monolith" because of its size and complexity): The indexer consumes Kafka messages for git pushes and orchestrates the processes of parsing, tagging, and saving the results for a particular repository at a commit SHA; the tagger is a service that accepts raw source-code content, parses and tags this code, and returns the results; the query service provides a Twirp RPC interface to the symbols database, allowing github.com's front end to look up definitions and references.
Figure 2 shows the relationships among the various indexer sequence components.
Indexing. Developers upload code to GitHub via pushing through Git or as a result of editing a file through the website's interface. Upon receiving a new push, GitHub servers send a message to Kafka describing metadata (project location, change author, target branch) about the push. The system runs a pool of Kubernetes pods, spread across multiple clusters and physical data centers, that listen to these messages. Because messages are distributed across the pool of pods according to a repository's unique identifier, a given pod typically consumes messages for the same repositories over and over again, allowing effective local caching in the indexing pipelines.
As messages are consumed, the indexer process aggregates them, ensuring that pushes to the same repository within a time window are processed as one request. It subsequently filters out invalid messages; because of reasons of scale, pushes are indexed only to the default branch; they must involve only programming languages that the system supports. It then throttles the rate of indexing, so that initial indexes are allowed to complete fully before subsequent pushes are processed. Kafka records when a given message is passed off for processing, and a cap on simultaneous processing puts backpressure on Kafka. This means that in the case of capacity overload, the indexer will not attempt to consume too many messages, providing resiliency and elasticity in the system.
When processing a push, the indexer does a shallow clone of the repository to a temporary directory. The aggregation stage means a single download can be reused in future indexes.
Indexing involves determining which git blobs in the repository at the change under analysis have not been indexed. The OID (object identifier) of a blob is a SHA-1 hash of its contents; this identifier can easily be used to keep track of files that need to be processed. Blobs that haven't changed are shared between commits and require no additional work, though the set of paths and blobs reachable in a particular commit is recorded.
For the blobs that do require parsing and tagging, the indexer calls out to a separate service running on another set of pods. The reason for this separation is that not every push represents the same amount of parsing work. There could be anywhere between one and 20,000 files to process. The indexer workload is primarily I/O-bound: waiting on sockets (for Kafka, the datastore, repo cloning) and reading/writing to the file system. Parsing, however, is primarily compute-bound, so having a separate pool of pods across which requests are load-balanced means resources can be appropriately scaled, and the system won't run into the case where a single indexer pod gets hot as a result of processing a large or active repository.
The system is deployed up to dozens of times a day, and it handles those deployments gracefully by stopping the consumption of any new Kafka messages, allowing inflight indexes a 30-second window to complete processing, and aborting and re-queueing any index that can't finish in that window. Since processing is incremental for each git blob, the new pod that picks up that message will have to parse and tag only those blobs that didn't finish during the original run.
Querying. When tag results have been fully processed, they are stored in a database, along with enough information to serve future queries: For repository X at commit Y where is
foo defined? The system provides a few Twirp RPC (remote procedure call) endpoints that allow the github.com front end user interface to look up definitions and references based on the user's interaction and context. In the past, the team's database of choice was a large MySQL instance, partitioned across multiple machines by Vitess (a MySQL sharding technology). As of this writing, tag data is stored in Microsoft Azure Blob Storage, a cloud storage system that supports uploading files.
Figure 3 shows the relationships among the various query sequence components.
The first prototype of this system used the
ctags command-line tool directly: An invocation of
ctags dumped the yielded tags into the Git storage associated with the tagged repository, and furthermore attempted to do this on every push to that repository. This was great for a demo, but infeasible to run at scale: CPU resources are at a premium on the Git storage servers; multiple pushes to multiple refs made managing the tags file challenging; and handling lookup requests on those servers wasn't ideal.
Ctags didn't have the API needed to take advantage of Git's data structures, and it's not easy to use command-line programs in such a way that their operations can be served by Web services.
Enter Tree-sitter. When the Semantic Code team picked up the project, the next iteration used Tree-sitter instead of
ctags but most of the indexing logic was still implemented in the main github.com Ruby on Rails application. Queued on push, indexing jobs written in Ruby served to fetch Git content over an internal RPC, detect the language of a source file, filter out irrelevant content (for example, binary files, code in languages not supported), call out to a service for parsing/tagging, and save the resulting tags. A dedicated MySQL database was used for storing and retrieving tag data and the query path was essentially direct to this database from the Rails controllers.
This solved a certain set of problems and proved that the Tree-sitter parsers and initial name-binding analyses could provide parsing/tagging as a service, with easy iteration on grammars and easy horizontal scaling of the parsing compute workload. Most importantly, this prototype convinced leadership that this was a valuable set of features to build and that it enhanced the GitHub experience in a meaningful way. It also showed that we had a plan for scaling, supporting all the repositories on GitHub, and supporting new languages.
Out of the monolith. The next iteration moved further out of the github. com Ruby on Rails application, for several reasons:
From an engineering organization perspective, working in the main Ruby on Rails application codebase was slow and came with a number of restrictions. Many parts of GitHub's main deployment pipeline are streamlined today, but at the time deploying a single pull request could take more than a day. To address these challenges, additional services were implemented:
These patterns served the system well for a significant period of time, allowing rapid growth, the release of several new languages for the code-navigation feature, and streamlined development and deployment.
Scaling MySQL. The next iteration required migrating the database from a single MySQL instance to a horizontally distributed cluster of partitioned MySQL instances, powered by the open-source project Vitess. Moving to Vitess entailed engineering complexity: The team needed to write new schemas, choose new sharding strategies, and deploy both primary and secondary MySQL instances within this cluster. During this time, the team migrated from a Semantic-based tagging service to one that operated entirely with (newly developed) Tree-sitter queries. Though Semantic performed well, using the Tree-sitter query language allowed faster iteration and avoided the operational overhead of a program-analysis framework.
Moving to blob storage. The most recent iteration, and the one deployed to github.com as of this writing, entailed a dramatic change in database formats. While the Vitess-sharded MySQL database provided, in theory, significant runway for horizontal scaling (just add more shards), several new constraints became apparent:
Though this migration was primarily about cost structure, the change in data storage resulted in a few interesting performance benefits along the way. Architecturally though, everything in the system stayed the same: Just swap the box that says "MySQL" with a different box that says "Azure"—that was the idea, at least.
From an engineering organization perspective, working in the main Ruby on Rails application codebase was slow and came with a number of restrictions.
In practice, this involved significant work. Not only did the team have to maintain the legacy MySQL and new Azure systems side by side, but it also had to design an Azure-compatible schema for tags data that would allow incremental indexing of repositories and querying of tags with roughly the same performance characteristics as the MySQL-based system. This meant optimizing for space and time and costs. (Storage costs money, but so does each operation: for example, read operations are less expensive than writes, which are less expensive than list-query operations.)
Datastore design. The storage structure the team ended up with is the following, expressed in a file-system-like form. The overall path structure includes a prefix to allow major structural and formatting changes of the datastore. The data for each repository is then prefixed with its id (see Figure 4).
This in-database file hierarchy provides access to all the different data required to provide useful code-navigation information. During indexing, the appropriate files are uploaded to blob storage for the repo under analysis. At query time, the repository, a commit SHA, and the symbol name provide enough information to fetch the appropriate files and compute the results. All files are immutable.
The symbol names by hash prefix file (Figure 5) contains all symbol names in the commit that share a two-character SHA1 hash prefix stored as TSV. This prefix allows speeding up database queries by filtering based on this two-character prefix. This is the first file read when doing a symbol lookup in order to get a list of blob SHAs where a symbol is defined (D) or referenced (R).
The index blobs for a commit file (Figure 6) denotes that a commit has been indexed and holds a map of all blob SHAs in the commit, along with paths for those blobs. This file is read, and intersecting the blob SHAs with those from the previous step gives a list of blobs to fetch in a later step.
This file contains a header and two TSV content sections. The header gives the byte offset (from the start of the file) of the start of each section. Each section contains one line for each path that was indexed. The first section contains
\t fields and is sorted by path. This allows a binary search for a particular path to find its blob SHA in the commit. The lines in the second section contain a single field that is the offset (from the start of the file) of one of the lines in section 1. The section in Figure 6 is sorted by blob SHA and allows a binary search for a set of blob SHAs to find the paths where they appear in the commit.
All definitions and references for a particular git blob (Figure 7) is the final step in symbol lookup, providing the information to return to the front end. Many definition lookups result in a single file read at this step. Since references can be spread out across multiple files, reads are done in parallel with an upper limit.
Definitions are stored first, so that find operations can return early once the first reference is reached. Note that since this is the SHA1 hash of the blob contents, once a blob is parsed, it never needs to be parsed again and can be shared across any commit that includes that blob.
The other files are used to quickly tell if a particular branch or tag has been indexed or not.
refs/(heads|tags)/.tsv contains the single commit SHA for the tip of the branch that has been indexed.
DEFAULT.tsv contains a commit SHA and
ref of the default branch.
Each iteration of the system and even the migration of the datastore happened under the full production load with minimal to no impact on end users. The blob storage change resulted in significant cost savings, an increase in indexing throughput, and a slight decrease in query request times.
We learned, though, that scale—as much as being about large numbers of users and repositories and HTTP requests and the daunting size of the corpus of code hosted on GitHub—is about adoption, user behavior, incremental improvement, and utility. Static analysis in particular is difficult to scale with respect to human behavior; we often think of complex analysis tools working to find potentially problematic patterns in code and then trying to convince the humans to fix them. Our approach took a different tack: use basic analysis techniques to quickly put information that augments our ability to understand programs in front of everyone reading code on GitHub with zero configuration required and almost immediate availability after code changes. The result is a delightful experience of reading, navigating, sharing, and comprehending code on GitHub.
Copyright held by authors/owners. Publication rights licensed to ACM.
Request permission to publish from email@example.com
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
No entries found