How GitLab Puts gRPC to Use in the Real World
In previous installments of this series, we looked at the historical events that led to the creation of gRPC as well as the details of programming with gRPC. We discussed the key concepts of the gRPC specification and took a look at the application we created especially for this series to demonstrate them. We also examined how to use protoc, the Protocol Buffers compiler, to generate boilerplate code in a variety of programming languages and speed up gRPC development, and we talked about how to bind to protobuf files statically and dynamically when programming under gRPC. In addition, we created a number of lessons on Katacoda's interactive learning environment that illustrate the concepts and practices we covered in the introductory articles.
Having presented the basics required to understand what gRPC is and how it works, we're now going to do a few installments about how gRPC is used in the real world. One of our real-world investigations explored how gRPC is used by Kubernetes in its Container Runtime Interface (CRI) technology.
In this installment, we're going to look at how the source control management service GitLab adopted gRPC when it refactored its server-side architecture into the Gitaly project.
Gitaly Redefines GitLab Ecosystem
GitLab promotes itself as a comprehensive platform that unifies the entire DevOps process under a single application. Instead of having to use separate tools and services for source control management, issue tracking, project management, and continuous integration/continuous deployment (CI/CD), the company combines everything into a single portal. It refers to this unification as "Concurrent DevOps."
But GitLab had a problem. Its digital infrastructure couldn't keep up with demand as the business grew.
When GitLab started out, it ran its entire platform on a single server. The way the company scaled up its infrastructure as it grew was to spin up identical instances of the server behind a load balancer and then route traffic accordingly. This approach is called horizontal scaling. While useful at the beginning, scaling servers horizontally became a bottleneck.
In addition to the problems inherent in horizontal scaling, the platform had a problem particular to the way it handled access to the .git directory that is the foundation of the Git repositories it hosts. Each Git repository hosted by GitLab has an underlying .git directory. That .git directory stores all the source code files according to the various branches in force in the repository. The .git directory also stores activity data, such as commit and merge information. The .git directory is a mission-critical asset. It's used by all the developers working with the repository as well as system admins, testing personnel, and a plethora of automation scripts that do everything from code escalation to issuing executive reports. As one can imagine, a single .git directory will experience an enormous number of reads and writes.
Having a large number of people and processes share access to a .git directory caused problems for GitLab. First, if a computer on which a .git directory was stored went down, the entire platform could go down. Second, as read/write activity increased, so did CPU utilization and input/output operations per second (IOPS). The company needed something better.
A group of engineers came up with an idea to solve the problem: instead of having each user and process interact with a .git directory, why not provide a layer of fail-safety around the particular .git directory and then have an optimized server-side process act as a proxy to that directory? All work would be done on the server side and the result would be returned over the network. This thinking gave birth to Gitaly. Gitaly is now the architecture that processes all requests made to GitLab.
How GitLab Implemented gRPC
Gitaly v1.0, which debuted in November of 2018, completely refactored the way that GitLab handled user requests. Before Gitaly came along, all requests coming into GitLab.com made direct calls to a .git directory stored on NFS mounts connected to the GitLab server. Gitaly removed that direct access. Instead of a request to GitLab resulting in a direct call to an NFS mount containing a particular .git directory, requests to GitLab.com now eventually resolve to the Gitaly service, which in turn interacts with the specific .git directory. Communication between the client-side components that make the request and the server-side Gitaly service is handled using gRPC.
The Gitaly clients that call the Gitaly servers were created using the protoc auto-generation tool. These clients are private to the GitLab environment and are used only by Gitaly internals; they are not available for public use. There's a Ruby client and a Go client. A portion of the Ruby client uses internal libraries written in C. The Go implementation uses grpc-go.
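To make the idea of routing every request through one server-side process more concrete, here is a minimal sketch in Python. It is not Gitaly's code: the class names, the request and response shapes, and the in-memory dictionary that stands in for the .git directory are all assumptions made for illustration, and a direct method call takes the place of the gRPC network hop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListBranchesRequest:
    """Hypothetical request message: which repository to inspect."""
    repository: str


@dataclass(frozen=True)
class ListBranchesResponse:
    """Hypothetical response message: branch names found on the server."""
    branches: tuple


class RepositoryService:
    """Stands in for the server-side process that owns repository storage.

    A real service would read refs from a .git directory on disk; here a
    dictionary plays that role so the sketch stays self-contained.
    """

    def __init__(self, storage):
        self._storage = storage  # private: callers never touch it directly

    def list_branches(self, request):
        refs = self._storage.get(request.repository)
        if refs is None:
            raise KeyError(f"unknown repository: {request.repository}")
        return ListBranchesResponse(branches=tuple(sorted(refs)))


class RepositoryClient:
    """Stands in for a generated client stub (no real network involved)."""

    def __init__(self, service):
        self._service = service

    def branches(self, repository):
        response = self._service.list_branches(ListBranchesRequest(repository))
        return list(response.branches)


service = RepositoryService({"group/project": {"main", "feature/login"}})
client = RepositoryClient(service)
print(client.branches("group/project"))
```

The point of the shape is that only the service object holds a reference to storage. Callers see typed requests and responses, which is the role a protobuf-defined interface plays in a gRPC system.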
Figure 1 below illustrates the Gitaly architecture and Table 1 that follows describes each component in the architecture.
Figure 1: The architecture of the Gitaly framework
Table 1: The components that make up the Gitaly framework
Why did the engineers at GitLab choose to use gRPC as the communication mechanism? As Zeger-Jan van de Weg, GitLab's Backend Engineering Manager for Gitaly, told ProgrammableWeb:
"One of our values at GitLab, is efficiency... although quite new at the time it [gRPC] was picked at GitLab, it did show mature concepts and lots of experience with RPCs in the past.
The tooling for both gRPC and Protobuf is mature too, and there's good support for multiple languages. For GitLab, it was important to have first-class support for Ruby and Go. As a company, Google usually invests a lot of resources into tooling, and gRPC is no exception.
Furthermore, the community is reasonably sized too. It's not as big as say Ruby on Rails, but most of the day to day questions a developer might have, they can Google the answer and find it. And slightly more advanced use cases were covered too. For example, there was a need for a proxy which peeks into the first message of a [Protocol Buffers] stream to alter routing and partially rewrite the proto message. Examples on how to do that, and what to look out for is something you'll find in minutes. For the Gitaly team, gRPC (plus protobuf) causes very little issues, and not having to worry about stability, or immature tooling allows us to focus on delivering value to customers."
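The proxy use case van de Weg mentions, peeking at the first message of a stream to decide on routing, can be sketched with plain Python iterators. This is only an illustration of the technique under stated assumptions: the message fields, the backend addresses, and the `peek_and_route` helper are hypothetical, and dictionaries stand in for protobuf messages.

```python
from itertools import chain


def peek_and_route(stream, backends):
    """Read only the first message, pick a backend, then replay the full stream."""
    iterator = iter(stream)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("empty stream: nothing to route on") from None

    backend = backends[first["storage"]]          # routing decision
    rewritten = {**first, "routed_by": "proxy"}   # partial rewrite of the header
    return backend, chain([rewritten], iterator)  # first message + untouched rest


def request_stream():
    # Hypothetical stream: a header message followed by data chunks.
    yield {"storage": "shard-b", "repository": "group/project"}
    yield {"data": b"chunk-1"}
    yield {"data": b"chunk-2"}


# Hypothetical backend addresses, keyed by storage name.
backends = {"shard-a": "10.0.0.1:8075", "shard-b": "10.0.0.2:8075"}
backend, forwarded = peek_and_route(request_stream(), backends)
messages = list(forwarded)
print(backend, len(messages))
```

Only the first message is consumed before the routing decision is made. The remaining messages are passed through lazily and unchanged.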
Remember, when it comes to working with tens of thousands of repository files distributed over an ever-growing cluster of machines, GitLab needed a communication protocol that is fast, efficient, and relatively easy to adopt from a developer's perspective. gRPC met the need and then some.
What's interesting to note is that GitLab didn't have a lot of expertise with gRPC when it started development on Gitaly. As van de Weg said during the ProgrammableWeb interview:
"At the time gRPC was picked, there was no significant experience with gRPC, nor Protobuf. There's no active training, nor has it been requested. On our team, gRPC is one of the more easy technologies to learn, [as] opposed to running Git on a large scale, and understanding the GitLab architecture."
Yet, despite not having expertise on hand immediately, GitLab prevailed. The company found gRPC a straightforward technology to implement. Van de Weg continues:
"As always, a new technology and API takes time to get used to, though gRPC makes it easy to ease into. For me personally, I didn't find gRPC too difficult to get used to. The API has clean abstractions, and doesn't leak too much of the implementation."
Yet, for GitLab, all was not peaches and cream. The company enjoyed considerable success using gRPC in Gitaly, but the success did come with some challenges.
GitLab Confronts Challenges with gRPC
As mentioned above, one of the benefits of gRPC is fast data transfer between sources and targets. Reducing data to a binary format increases transmission speed. But in order to support a binary format, gRPC requires a well-defined schema that is shared by both client and server. This schema is defined in a protobuf file that describes the methods and types of a gRPC service according to the gRPC specification.
Working with a common schema that's documented in a protobuf file can be a bit difficult for those accustomed to working with self-describing data formats such as JSON or XML. Common to loosely coupled API architectural patterns like REST, a self-describing format doesn't require the client to know anything beforehand about the data sent from a server in order to decode a response. gRPC, on the other hand, requires that the structure of an interface be well known to both client and server and is therefore, as API architectural patterns go, more tightly coupled. Getting used to this formality requires a reset in the developer's mindset. Creating consistent, useful gRPC interfaces was a challenge for Gitaly developers. Van de Weg acknowledged this challenge, saying, "The issues getting familiar with gRPC and Protobuf in the early days created inconsistencies in our interface."
In addition to learning how to create data structures and interfaces that could scale with minimal impact, GitLab needed to address issues that came up around the actual size of a binary message returned in response to a request, as van de Weg explains:
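The trade-off between a schema-bound binary format and a self-describing one is easier to see side by side. The sketch below hand-encodes two fields using the Protocol Buffers wire format (varints and length-delimited fields) and compares the result with the equivalent JSON. The `Branch` message and its field numbers are a hypothetical schema invented for this example, and real projects would use protoc-generated code rather than hand-rolled encoders.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a Protocol Buffers base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number, value):
    """Wire type 0 (varint): tag followed by the value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number, text):
    """Wire type 2 (length-delimited): tag, byte length, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    tag = encode_varint((field_number << 3) | 2)
    return tag + encode_varint(len(payload)) + payload


# Hypothetical schema: message Branch { string name = 1; int32 commit_count = 2; }
binary = encode_string_field(1, "main") + encode_int_field(2, 150)
as_json = json.dumps({"name": "main", "commit_count": 150}).encode("utf-8")

print(binary.hex())
print(len(binary), len(as_json))
```

The binary form carries only field numbers, so a reader without the schema cannot tell that field 1 is a name or that field 2 is a commit count. The JSON form spells the field names out in every message, which is what makes it self-describing and also what makes it larger.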
"Some choices were made a long time ago, to which I'm currently uncertain [is] if these still are optimal. Maximum message size comes to mind, or how to do chunking of potential large requests or responses. In a case where for example, a list of branches is requested from the server, you could send a message per branch found, or send multiple branch objects per message. Both solutions we currently employ, but if the correct solutions are chosen each time [on the part of the requester]? I'd not bet on it."
Gitaly uses sidecars as ancillary services to support higher-level operations. As it turns out, the sidecars created some problems that were hard to detect. Some of the problems were directly related to gRPC, but the actual event creating the error was deep in a sidecar, making resolution difficult. As van de Weg points out, it took a while to discover the culprits.
"Then in terms of bugs or surprising behavior, there were times where our service errored with Resource Exhausted errors. It was fairly quickly identified to be coming from the sidecar. But other than that, these were very sporadic and didn't have a seemingly coherent source. The errors we're not thrown in the application code but there wasn't enough information yet to reproduce consistently and with that uncovered the root cause. After a while, we discovered that the ruby gRPC server had a concurrency limit that our sidecar was hitting."
One of the other problems GitLab had was around understanding error information coming out of Gitaly internals. While it's true that most of GitLab's developers interacted with Gitaly's internal service using the Gitaly/gRPC clients, there was still a segment of the developer community that needed to work with Gitaly at a lower level. When issues did arise, those developers working at a lower level had a hard time understanding what was going on with a request as it made its way into the Gitaly stack because many of the root cause error codes were gRPC specific. Van de Weg explains the situation:
"The interface on the clients is usually on a higher level... This means that these developers don't know how their requests reach our service, much like many developers don't know how queries are sent to other datastores like Redis or Postgres. However, with gRPC the errors are much more likely to bubble up to these developers. Since gRPC uses HTTP/2, it might have been a better idea to stick with the HTTP status codes for more familiarity with them."
In other words, you can't figure out what's going on if you don't know what the error messages are about. Most developers understand the meaning of HTTP status codes such as 200, 404, or 500. On the other hand, gRPC is still an "under the covers" technology for many. As a result, debugging gRPC was still an adventure into the unknown for a large segment of the development community.
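One way to make gRPC errors more approachable is to keep a translation table at hand. The sketch below lists the standard gRPC status codes with HTTP equivalents taken, to the best of our reading, from the HTTP mapping documented alongside Google's `google.rpc.Code` definition. The `explain` helper is a hypothetical convenience and is not part of any gRPC library.

```python
from enum import IntEnum


class GrpcStatus(IntEnum):
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNIMPLEMENTED = 12
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


# HTTP equivalents; 499 ("Client Closed Request") is a non-standard code.
GRPC_TO_HTTP = {
    GrpcStatus.OK: 200,
    GrpcStatus.CANCELLED: 499,
    GrpcStatus.UNKNOWN: 500,
    GrpcStatus.INVALID_ARGUMENT: 400,
    GrpcStatus.DEADLINE_EXCEEDED: 504,
    GrpcStatus.NOT_FOUND: 404,
    GrpcStatus.ALREADY_EXISTS: 409,
    GrpcStatus.PERMISSION_DENIED: 403,
    GrpcStatus.RESOURCE_EXHAUSTED: 429,
    GrpcStatus.FAILED_PRECONDITION: 400,
    GrpcStatus.ABORTED: 409,
    GrpcStatus.OUT_OF_RANGE: 400,
    GrpcStatus.UNIMPLEMENTED: 501,
    GrpcStatus.INTERNAL: 500,
    GrpcStatus.UNAVAILABLE: 503,
    GrpcStatus.DATA_LOSS: 500,
    GrpcStatus.UNAUTHENTICATED: 401,
}


def explain(code):
    """Hypothetical helper: describe a numeric gRPC status in HTTP terms."""
    status = GrpcStatus(code)
    return f"gRPC {status.name} ({status.value}) is roughly HTTP {GRPC_TO_HTTP[status]}"


print(explain(8))
print(explain(5))
```

The mapping is lossy, since several gRPC codes share one HTTP status, so it works best as a first orientation for developers rather than as a replacement for the gRPC codes themselves.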
Putting It All Together
GitLab is a company that has experienced significant growth. According to Forbes, its year-to-year revenue growth is 143%. It has raised $268 million in Series E funding, and its valuation as of September 2018 was $2.75 billion. That's a lot of money. None of this would have been possible if GitLab did not have a solid technical infrastructure to support its current activities as well as its projected growth. More than one company has hit the skids because its technology could not support market demands.
To its credit, GitLab had the foresight to understand the risks inherent with its anticipated growth. The company addressed them head-on with Gitaly and gRPC.
Reliable Git repository management is a key feature of the GitLab ecosystem. Without it, all the other services that are part of GitLab's Concurrent DevOps platform become inconsequential. Putting gRPC at the center of its Gitaly repository management service was a mission-critical decision. While a lot of work involved with gRPC adoption was easy for GitLab to do, there were challenges, mostly around getting a handle on working with the Protocol Buffers specification and optimizing message transmission.
To date, GitLab has been successful. The company continues to prosper, and the choice to use gRPC seems to have been a wise one. The formality that goes with implementing gRPC has brought more discipline to GitLab's development efforts.
For those companies considering adopting gRPC, the thing to keep in mind with regard to GitLab is that the company already had a lot of experience writing backend services at a very deep level. Its engineers were well versed in the details of network communication via sockets. They understood the nuances inherent in the HTTP/2 protocol and the Protocol Buffers binary format. In short, they were very comfortable programming for the server-side even before Gitaly came along.
A company approaching gRPC for the first time will do well to make sure it has expertise in server-side programming. This includes everything from a mastery of the intricacies of vertical and horizontal scaling to an understanding of the complexity of working with a binary data format such as Protocol Buffers.
Studying the success and challenges that GitLab experienced will provide real-world lessons that will benefit any company considering adoption of gRPC in the enterprise. gRPC takes some getting used to, but as GitLab has shown, the investment of time and attention produced beneficial results for the short and long terms.