Encrypted Data: The Battle for Security and Speed

Gleb Keselman
Published in Intuit Engineering
Dec 7, 2020

As Intuit evolves into an AI-driven expert platform, handling sensitive data at scale across organizational lines becomes more critical than ever. In this article, we will discuss the evolution of Intuit’s approach to encryption and data protection, as seen through the eyes of the team responsible for those capabilities.

Motivation

Customer Obsession is one of Intuit’s core values, and even though we are an internal team developing platform capabilities to be used by engineers, it is still very much a guiding principle in everything we do. One method we use to truly understand our customers and their needs and wants is called “Follow Me Home”, a practice in which we simply observe customers as they interact with our products in their “natural habitat”. Every source of confusion or frustration, every direction that isn’t clear, is something that we take note of and make sure to fix going forward.

During our recent sessions with analysts, engineers, and data scientists, we discovered a common obstacle: the data they were trying to use was often produced by a service outside of their immediate reach, and identifying the original data owner was an onerous, manual task that relied on tribal knowledge.

At Intuit we take pride in being great stewards of our customers’ data, and we work hard to ensure it is used for the express benefit of powering the prosperity of those customers. This is why the data owner must follow an approval process before a data scientist can be granted access.

Even after the data owner is identified and the approval granted, there is an additional hurdle when the data is encrypted, which, as you can guess, happens quite often. Allowing access to the encryption key protecting this data is another manual operation, rife with errors and back-and-forth communication.

We knew we had to do better.

At a high level, this is the step-by-step solution we had in mind:

  1. An analyst finds an interesting dataset, part of which is encrypted. They don’t know who generated it and, therefore, whom to ask for access or how to decrypt it.
  2. They send this encrypted data to a dedicated attribution service that identifies which team encrypted it, which encryption keys were used, and requests access.
  3. The identified data owner approves this request, which generates a unique access policy, and provides it back to the analyst.
  4. The analyst can now decrypt the data and use it.

Steps 2–4 can and eventually will be automated.

What does encrypted data look like today?

The typical encrypt function will take an encryption key and some text as input, and return a byte array as output. Inside the encrypt function, our original text is jumbled: padded, shuffled, flipped, rotated, XORed, and multiplied on a bit-by-bit basis many times over, which is why the result is not a string but a byte array. While the client application could store it as such in the database, there aren’t many fans of storing binary data, so the application will typically encode the byte array (e.g. with base64) to transform it back into a string before storing it in the database.

To inject randomness into this process, the encrypt function will generate a few random bytes, known as the Initialization Vector (IV), which will take part in the shuffling dance together with the encryption key. Somewhat surprisingly, the IV is not a secret and needs to be kept together with the encrypted value so that it can be provided to the decrypt function later on. This is why it is usually prepended to the encrypted value.

Some encryption algorithms will also generate a message authentication code (MAC), which makes tampering with the data detectable. The MAC adds a few more bytes that are stored with the encrypted value, usually appended as a suffix.
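The framing described above (IV as a prefix, MAC as a suffix, base64 for storage) can be sketched in a few lines of Python. This is only an illustration, not Intuit's implementation: the ciphertext is treated as opaque bytes produced elsewhere, HMAC-SHA256 stands in for whatever MAC the real algorithm produces, and the names `frame`, `unframe`, `IV_LEN` and `TAG_LEN` are made up for this example.

```python
import base64
import hashlib
import hmac

IV_LEN = 16   # assumed IV size in bytes
TAG_LEN = 32  # size of an HMAC-SHA256 tag in bytes


def frame(iv: bytes, ciphertext: bytes, mac_key: bytes) -> str:
    """Lay out IV || ciphertext || MAC and encode it as a base64 string."""
    tag = hmac.new(mac_key, iv + ciphertext, hashlib.sha256).digest()
    return base64.b64encode(iv + ciphertext + tag).decode("ascii")


def unframe(encoded: str, mac_key: bytes) -> tuple:
    """Split a stored string back into (iv, ciphertext), verifying the MAC."""
    raw = base64.b64decode(encoded)
    if len(raw) < IV_LEN + TAG_LEN:
        raise ValueError("value too short to contain an IV and a MAC")
    iv, ciphertext, tag = raw[:IV_LEN], raw[IV_LEN:-TAG_LEN], raw[-TAG_LEN:]
    expected = hmac.new(mac_key, iv + ciphertext, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("MAC mismatch: the stored value was modified")
    return iv, ciphertext
```

Because the IV and MAC lengths are fixed in this sketch, the reader can slice them off without any extra length fields; only the ciphertext in the middle varies in size.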

Next, it is the turn of the encryption key management system (KMS) to add a custom header. The very first thing we knew we had to include in our header was the key version number, to allow for key rotation to happen transparently. We also knew it wouldn’t be the last thing we would be adding, so we made sure the format was forward compatible by adding two bytes: one to indicate the KMS, and another for the header version.

(Encrypted Data Header Version 1)
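To make the version 1 header concrete, here is a minimal Python sketch. The article only states which fields exist (a KMS byte, a header version byte, and the key version), so the field order, the 4-byte big-endian key version, and the `KMS_ID` value below are assumptions chosen for illustration.

```python
import struct

# Assumed layout: 1 byte KMS indicator, 1 byte header version,
# 4-byte big-endian key version. ">" means no padding between fields.
HEADER_V1 = struct.Struct(">BBI")
KMS_ID = 0x01  # hypothetical value identifying the KMS


def pack_header_v1(key_version: int) -> bytes:
    """Build a version 1 header for the given key version."""
    return HEADER_V1.pack(KMS_ID, 1, key_version)


def parse_header_v1(blob: bytes) -> tuple:
    """Return (kms_id, key_version, remaining_bytes) from a v1 blob."""
    kms_id, header_version, key_version = HEADER_V1.unpack_from(blob)
    if header_version != 1:
        raise ValueError(f"unsupported header version {header_version}")
    return kms_id, key_version, blob[HEADER_V1.size:]
```

Reading the header version before anything else is what lets a decryptor choose the right parser when the format changes later.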

Once we developed the encryption key derivation and sharding approach, we had additional parameters to store and a general desire to make the header more extensible in the future, all without constantly making changes to the header version. We solved this by utilizing the type-length-value (TLV) encoding scheme, and having dedicated types for both the old (key version) and new (derivation data) parameters.

(Encrypted Data Header Version 2)
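A type-length-value layout of this kind can be sketched as follows. The type codes and the one-byte type and length fields are assumptions made for the example; the article does not specify the actual sizes or values. The sketch also assumes the TLV section has already been separated from the rest of the encrypted value.

```python
import struct

# Hypothetical type codes, not the real ones.
T_KEY_VERSION = 0x01
T_DERIVATION_DATA = 0x02


def encode_tlv(fields) -> bytes:
    """Encode (type, value) pairs as 1-byte type, 1-byte length, value."""
    out = bytearray()
    for ftype, value in fields:
        if len(value) > 255:
            raise ValueError("value does not fit a 1-byte length field")
        out += struct.pack(">BB", ftype, len(value)) + value
    return bytes(out)


def decode_tlv(blob: bytes) -> dict:
    """Decode a TLV blob into {type: value}, keeping unknown types too."""
    fields = {}
    pos = 0
    while pos < len(blob):
        if pos + 2 > len(blob):
            raise ValueError("truncated TLV entry header")
        ftype, length = struct.unpack_from(">BB", blob, pos)
        pos += 2
        if pos + length > len(blob):
            raise ValueError("truncated TLV value")
        fields[ftype] = blob[pos:pos + length]
        pos += length
    return fields
```

Since every entry carries its own length, a reader that does not recognize a type can still skip over it, which is what allows new parameters to be added without bumping the header version.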

Whose data is it anyway?

To allow for simple identification of the owner of an encrypted object, we started by extending our encrypted data header format with a new parameter: the key identifier. Its length of 6 bytes was carefully chosen as a compromise. It is long enough to allow practically unique identifiers (2^48, or about 281 trillion, possible values) while keeping the overhead added to each encrypted object sufficiently low.
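The arithmetic behind the 6-byte choice is easy to check. The sketch below assumes identifiers are drawn uniformly at random, which the article does not state, and uses the standard birthday approximation to estimate how likely any two keys are to share an identifier. The function names are illustrative only.

```python
import secrets

KEY_ID_BYTES = 6


def new_key_id() -> bytes:
    """Draw a random 6-byte key identifier (assumed generation scheme)."""
    return secrets.token_bytes(KEY_ID_BYTES)


def id_space(num_bytes: int = KEY_ID_BYTES) -> int:
    """Number of distinct identifiers that fit in num_bytes."""
    return 2 ** (8 * num_bytes)


def collision_probability(num_keys: int, num_bytes: int = KEY_ID_BYTES) -> float:
    """Birthday approximation n(n-1) / (2 * space); accurate while small."""
    return num_keys * (num_keys - 1) / (2 * id_space(num_bytes))
```

Here `id_space()` evaluates to 281,474,976,710,656, and under the random-identifier assumption even one million keys would collide with a probability of roughly 0.18%.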

On the backend, we also implemented a new API for encrypted data attribution, which receives the key identifier and returns all of the available metadata information, both for the encryption key itself (name, version, validity, associated permissions, lifecycle) and for the project it resides in.
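The shape of such an attribution lookup might resemble the sketch below. It is not the real API: the `KeyMetadata` fields are a subset of those listed above, and the in-memory `REGISTRY`, its sample entry, and the `attribute` function are hypothetical stand-ins for a backend service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeyMetadata:
    name: str
    version: int
    project: str
    permissions: tuple


# Hypothetical registry mapping 6-byte key identifiers to metadata.
REGISTRY = {
    bytes.fromhex("0a1b2c3d4e5f"): KeyMetadata(
        name="invoices-key",
        version=3,
        project="payments",
        permissions=("decrypt",),
    ),
}


def attribute(key_id: bytes) -> Optional[KeyMetadata]:
    """Return metadata for a key identifier, or None if it is unknown."""
    if len(key_id) != 6:
        raise ValueError("key identifier must be exactly 6 bytes")
    return REGISTRY.get(key_id)
```

An analyst-facing tool would first read the key identifier out of the encrypted data header and then pass it to a lookup like this one to learn which project to contact.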

In the future, we will be extending this API by providing additional information to answer questions like:

  • What is the Intuit service associated with the encryption key used?
  • Who are its admins and who are the data owners?
  • What permissions and roles are required to access it?

We have also learned that while CLIs and SDKs are a must, having a shiny UI makes everything better, and we’ll be building one here. This will allow analysts to plug in the encrypted data and get all of this information without going through CLIs or SDKs.

What’s next?

Our team’s next milestone on the matter of managing Intuit’s encrypted data at scale is to connect producers and consumers of data across the different organizations. Additionally, we aim to provide them with automatic capabilities for requesting and providing granular access to encryption keys.

By achieving this goal, we will significantly improve the velocity of Intuit’s analysts, engineers, and data scientists as they work with sensitive data, all while ensuring best-in-class security controls.

Finally, I would like to thank Yaron Sheffer for his unwavering support and guidance, as well as the talented engineering team that led the design and implementation of this functionality: Noam Kachko, Michael Gvirtzman, Olla Nasirov, and Khaled Daghash.
