Evolving to Enterprise-Grade Permissions

Benchling is a data platform to help scientists do research. Hundreds of thousands of scientists across academic labs and enterprise companies use Benchling to store and analyze scientific data, record notes, and share findings with each other.

But not everyone should be allowed to access everything. Benchling’s platform should allow users to configure exactly who can take which actions on what data. This authorization is crucial to life science, where companies often worry about regulatory compliance and IP protection. Scientists also often play specialized roles: one person’s role may be to design DNA, while another’s may be to run analyses and collect results. Our platform should allow these scientists to collaborate without accidentally modifying each other’s data.

We originally built Benchling with simple permission levels (read, write, admin) and recently moved to data policies for fine-grained configuration. This post focuses on two key pieces of the project: the details of how we migrated between permission systems without errors or downtime, and a process called granularization — how we can easily make data policies more and more fine-grained based on new configuration needs.

(See Lessons learned for the summary)

The landscape of collaboration in science

Simple product, simple needs

Benchling started as a free-to-use DNA platform marketed to academic users and their labs. Users had very basic authorization needs: allow other scientists to view or edit their DNA sequences.

The complexity of enterprise collaboration

Benchling’s product expanded, and so did our customers. Users store data in Benchling ranging from DNA sequences to lab notebook entries. Enterprise customers often organize the data to match how their own research divisions are organized.

Consider a Cancer Research Project within Benchling, with a DNA Design Team designing DNA and a team of Research Scientists experimenting with that DNA.

  • DNA Design Team should be able to edit the bases (the individual ATGCs) of DNA sequences on Benchling: bases are core to a DNA sequence's identity
  • DNA Design Team should only be able to view notebook entries, not edit them
  • Research Team should be able to edit a DNA sequence’s metadata, but not its core bases
  • Research Team should only be able to edit notebook entries if they are marked as an author

Note how the configuration needs are more complex than before — different teams are only granted certain actions to specific types of data based on various conditions.

A DNA sequence in Benchling
A notebook entry in Benchling

But these rules are not the same for all customers. They cause too much overhead for smaller companies, including biotech startups and academic labs with only a handful of scientists. These companies want looser configurations, like permitting all scientists to edit DNA sequences, until they grow to a scale where they need to more closely manage their data.

Supporting future needs

Over the past few years, we’ve learned of new needs for configuring authorization, both because we’ve partnered with more mature customers and because each customer’s processes and roles are different. We need to support customer authorization needs becoming more and more granular: for example, wanting to configure who can archive old DNA sequences, or who can edit DNA sequence bases vs. only metadata. We call this process granularization. As we continue to learn about new roles, our system will need to easily support migrating to a new authorization model.

System design

Initial setup: read, write, admin levels

Users organize their Benchling data, like DNA sequences, into projects. Users also organize themselves into teams and organizations. We started with 3 very rudimentary permission levels: read (viewing project contents), write (editing contents), and admin (configuring project permissions).

To configure authorization, the creator of a project would add users, teams, and organizations as project collaborators and assign each a permission level.

Configuring project permissions by assigning permission levels to collaborators

But levels aren’t configurable enough

Larger enterprise customers had more complex needs for permissions, but authorization configuration was limited to the three levels. At their core, levels are:

  • Not transparent: read vs. write makes sense, but what is the difference between write vs. admin? Further, different customers have different ideas about how actions should map to them (is editing sensitive data a write or admin action?).
  • Not granular enough: customers want to configure a project by item type or action, such as allowing an external vendor to edit the metadata of DNA they work with but only view other data, and that doesn’t map to levels.

We first tried to solve the first issue by introducing more levels, like “append”. We then tried to solve the second issue by introducing flags: lots of configurable flags for product needs, like the project.enable_edit_dna_sequence_bases flag (on the project) to lock down a project's DNA sequence bases, or the ENABLE_EDIT_SETTINGS_ADMIN_ONLY flag (site-wide) to configure whether the edit settings action mapped to write or admin.

While these flags unblocked some product needs, they were inconsistently added to different places (site-wide, by organization, team, project, etc.) and made the authorization code hard to understand. Authorization code is the last thing you want to be hard to understand. More branches meant a higher likelihood of untested branches and data security bugs. The combination of levels and flags also made the product difficult to configure, and still didn't map cleanly to the customer's needs.

Authorization models

So how should we model enterprise authorization needs?

First, we researched existing enterprise authorization models to help us design the new system for Benchling. These models include ones from products we use, like AWS and Salesforce. We saw recurring concepts like user groups, configurable authorization rules, and fine-grained permissions that convinced us that we were on the right track with our own design.

Our guiding principle was “what do customer admins need to be able to configure?” Benchling data is already organized into projects. Users are already organized into teams inside organizations. The missing piece is to configure those projects to allow those users to only access and manipulate data pertinent to their function. We model this with user-configurable data policies. Each has a list of statements that captures the authorization rules of that policy.

Policy statements govern data along three axes:

  • Item type: type of data (DNA sequence, notebook entry, etc.)
  • Action: action to take on data (read, write, edit bases, edit metadata, archive, etc.)
  • Condition (optional): property that the user has (if user is author)

We can now configure the Cancer Research Project from before with two data policies:

DataPolicy(name="Research Scientist")
DataPolicy(name="DNA Designer")

Now, the customer admin can configure a project to associate the users who have access to that project with policies. They can add policies for users, teams, and organizations.

Configuring project permissions by assigning data policies to collaborators

A user is authorized for a piece of data as long as any policy statement in any policy assigned to the user matches the item type of the data, the intended action by the user, and the condition under which they’re operating.
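To make the model concrete, here is a minimal Python sketch of the two Cancer Research Project policies and the additive matching rule described above. It is an illustration only, not Benchling's implementation: the class names, field names, the string values for item types, actions, and conditions, and the READ statements are all assumptions chosen to mirror the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PolicyStatement:
    item_type: str                   # hypothetical values, e.g. "DNA_SEQUENCE"
    action: str                      # hypothetical values, e.g. "EDIT_BASES"
    condition: Optional[str] = None  # e.g. "IS_AUTHOR"; None means unconditional


@dataclass(frozen=True)
class DataPolicy:
    name: str
    statements: Tuple[PolicyStatement, ...] = ()


DNA_DESIGNER = DataPolicy(
    name="DNA Designer",
    statements=(
        PolicyStatement("DNA_SEQUENCE", "READ"),
        PolicyStatement("DNA_SEQUENCE", "EDIT_BASES"),
        PolicyStatement("NOTEBOOK_ENTRY", "READ"),
    ),
)

RESEARCH_SCIENTIST = DataPolicy(
    name="Research Scientist",
    statements=(
        PolicyStatement("DNA_SEQUENCE", "READ"),
        PolicyStatement("DNA_SEQUENCE", "EDIT_METADATA"),
        PolicyStatement("NOTEBOOK_ENTRY", "READ"),
        PolicyStatement("NOTEBOOK_ENTRY", "WRITE", condition="IS_AUTHOR"),
    ),
)


def is_authorized(policies, item_type, action, satisfied_conditions=frozenset()):
    """Additive check: allowed if ANY statement in ANY assigned policy matches."""
    return any(
        s.item_type == item_type
        and s.action == action
        and (s.condition is None or s.condition in satisfied_conditions)
        for policy in policies
        for s in policy.statements
    )
```

For example, `is_authorized([RESEARCH_SCIENTIST], "NOTEBOOK_ENTRY", "WRITE")` is false unless the caller passes `{"IS_AUTHOR"}` as the satisfied conditions. Because nothing in the model can deny, adding a policy can only widen what a user may do.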

We explicitly decided to use only additive permissions. Some of our users wanted the ability for policies to deny authorization to users. This was especially common when the customer worked with contractors and wanted to restrict the contractors’ access. However, subtractive permissions are harder to reason about and harder to implement correctly. If a user’s team is denied a policy but the user is granted the same policy, is the user authorized? Pushing back on product complexity allowed us to implement a simpler, easier-to-reason-about system for Benchling developers and users alike. Instead, we recommend that customers create a separate team for contractors and assign them a more restrictive policy.

How configurable are policy statements?

How do we decide which item types, actions, and conditions to support? For all the things you can do in Benchling, are there 10 actions? 100 actions? Too few actions and customers can’t configure permissions against their processes and roles. Too many actions and the system becomes hard to understand.

The guiding principle is that it’s easy to fragment an existing item type, action, or condition, but it’s very hard, and in fact, nearly impossible to defragment, as customers may already be taking advantage of said granularity. So, we started with the minimum number of actions needed to support our existing product needs, and will carefully granularize as needed, which we’ll dive into a bit later.

Ensuring correctness

Correctness is crucial when implementing permissions — after all, it’s at the core of data access. When we migrated from levels to policies, our top priority was to make sure the new system was still correct, that each user had access to exactly the data they were supposed to have access to.

To migrate from levels to policies without user impact, we started by creating a policy for each of our previous levels: Read, Append, Write, Admin. We then took a number of steps so that we were fully confident that the new policies were behaving the same as levels and the smattering of configuration flags.

Double the checks, double the confidence

For a while, we ran authorization checks through both systems: checking via both policies and levels. To do this, we mapped each policy to a level:

> SELECT * FROM policy;
 id | name     | legacy_permission_level
----+----------+-------------------------
  1 | "Read"   | "READ"
  2 | "Append" | "APPEND"
  3 | "Write"  | "WRITE"
  4 | "Admin"  | "ADMIN"

Previously, our permissions were modeled by who (a user, organization, or team), what data (e.g. a project), and how much access (read, append, write, admin levels) — in the new system, we transitioned from levels to data policies. To ensure that both the levels and data policies were kept in sync during the migration, we added foreign keys from the old permission models to the corresponding data policy, and used transitive foreign key constraints to ensure that the old level and the new data policy didn’t fall out of sync:

(level, policy_id)
policy(legacy_permission_level, id)
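As an illustration of how a composite foreign key like this keeps the two columns in sync, here is a small sketch. It uses Python's built-in sqlite3 module with an in-memory database purely so the example runs anywhere; the post's database is PostgreSQL, where the same FOREIGN KEY clause applies. The project_permission table and its columns are hypothetical stand-ins for the old permission models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE policy (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    legacy_permission_level TEXT NOT NULL,
    UNIQUE (legacy_permission_level, id)   -- target of the composite FK
);
CREATE TABLE project_permission (          -- hypothetical old permission model
    id INTEGER PRIMARY KEY,
    level TEXT NOT NULL,
    policy_id INTEGER NOT NULL,
    FOREIGN KEY (level, policy_id)
        REFERENCES policy (legacy_permission_level, id)
);
INSERT INTO policy VALUES
    (1, 'Read', 'READ'), (2, 'Append', 'APPEND'),
    (3, 'Write', 'WRITE'), (4, 'Admin', 'ADMIN');
""")


def try_insert(level, policy_id):
    """Return True if the (level, policy_id) pair is accepted by the constraint."""
    try:
        conn.execute(
            "INSERT INTO project_permission (level, policy_id) VALUES (?, ?)",
            (level, policy_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, `try_insert("WRITE", 3)` succeeds, while `try_insert("WRITE", 1)` is rejected because policy 1 maps to the READ level. A row can therefore never carry a level and a policy that disagree.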

The DB constraints ensured we were updating policies correctly and synchronously. When we later updated our permissions API to configure policies instead of levels, we also redundantly set the levels, to help ensure our system was still working as expected.

This also allowed us to run redundant authorization checks, both with the old level and with the new data policy. In production, we compared the outputs of the two authorization checks and logged an error if they mismatched.
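A redundant check of this kind could look roughly like the sketch below. The function and parameter names are hypothetical, and the choice to return the legacy result while the new system is being validated is an assumption; the post does not say which result was treated as authoritative.

```python
import logging

logger = logging.getLogger("authorization")


def authorize(user, item, action, *, check_by_level, check_by_policy):
    """Run both authorization systems and log an error when they disagree."""
    legacy_result = check_by_level(user, item, action)
    policy_result = check_by_policy(user, item, action)
    if legacy_result != policy_result:
        logger.error(
            "authorization mismatch: user=%r item=%r action=%r level=%r policy=%r",
            user, item, action, legacy_result, policy_result,
        )
    # Assumption: the legacy system stays the source of truth during validation.
    return legacy_result
```

Passing the two checkers in as arguments keeps the comparison logic independent of either implementation, so it is easy to delete once the mismatch log stays quiet.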

Correctness, at the cost of performance

Redundant checks helped us catch ~5 bugs. We found bugs in the new code where we didn’t port over a configuration flag or didn’t backfill it properly. We also found bugs in the old code, a result of the old system having a complex imperative implementation. It was especially helpful that we ran the checks across all production environments, which are all configured differently.

We focused heavily on correctness, and were too lenient on performance. Our authorization checks are extremely hot code paths, since they run in nearly every endpoint. We tried to keep the redundant checks relatively short-lived (a few weeks), since they had a non-trivial effect on performance. Our listing endpoints that query for all readable items weren’t all performant, and running the checks twice made the slowness more noticeable. In retrospect, it would have been nice to add monitoring to authorization checks to measure how we were affecting performance more precisely and address any performance regressions that surfaced.

New customers, finer-grained permissions

As part of this project, we wanted to empower all engineers at Benchling to easily modify policies to support new customer needs. In a granularization, we’re breaking down axes of policy statements, and allowing customers to customize policies at a more granular level.

The most common granularization is to break off a specific action from the WRITE action. For example, previously editing bases and editing metadata were both WRITE actions. To support configuring each separately, we introduced the EDIT_BASES action, which is granularized, or split off from, the old WRITE action.

Granularizing the WRITE action for DNA sequences to EDIT_BASES and WRITE (edit metadata)

We had a few system requirements during a granularization:

  1. Properly resolve configured policies during authorization. When performing authorization, the code needs to resolve whether a user is authorized to perform some action on a given model against how the customer has configured their policies. The policies live in our PostgreSQL database, and the authorization resolution lives in our code. We had to make sure these were in sync through our deploy cycle, where migrations run before code deploy.
  2. Gracefully handle users modifying policies during the deploy. During a granularization, we need to map all the old policy statements to the new policy statements. In this scenario, this meant copying the policy statement for WRITE actions for the new EDIT_BASES action. After that, if a user were to edit the WRITE action in a policy, we would need to know whether or not to edit the EDIT_BASES action as well.

A naive approach

Naively, we could easily granularize WRITE to EDIT_BASES and WRITE (which still includes editing metadata) in two steps:

  1. Update code to map old actions to new actions
  2. Add new policy statements with the new action, based on the old action.

But a few things would break:

  • If the granularization migration has completed and a user updates a policy from the UI, they think they’re configuring both WRITE and EDIT_BASES, but they’re only configuring non-EDIT_BASES WRITE.
  • If the granularization migration runs after the code is updated, there’s a period of time where the EDIT_BASES action does not yet exist in the DB, and any edit bases action is not permitted.

Versioning policies for a seamless migration

To handle this, we versioned policies: we track policy API versions in the code and versions for policy statements in the database. This version allows us to seamlessly upgrade during our deploy cycle:

  1. A database migration maps policy statements from version V(x) to V(x + 1) with a new version (see our blog post on migrations). We clean up old versions after a granularization such that there’s a maximum of two API versions at a time.
  2. Code is updated to new version, so it can start respecting the new statements. Given the two sets of policy statements, authorization can start reading from the new set only when the code is also equipped to handle the new actions.
  3. Policy edits are blocked when a new version is being introduced. During our deploy cycle, there’s usually ~10 minutes between the migration and the code being updated across our servers. This means that users are blocked from updating policies for ~10 minutes, which we thought was acceptable given that granularizations happen infrequently, and if they do, users are unlikely to be editing policies at the same time.
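The interplay between the code's version and the statement versions in the database can be sketched as follows. Everything here is a hypothetical illustration: the StoredStatement shape and the rule used to detect that "a new version is being introduced" (the database holds a newer version than the running code) are assumptions, not the actual mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredStatement:
    policy_id: int
    api_version: int
    item_type: str
    action: str


def readable_statements(rows, code_version):
    """Authorization reads only the statement set the running code understands."""
    return [r for r in rows if r.api_version == code_version]


def policy_edits_allowed(rows, code_version):
    """Block edits while the database holds a newer version than this code."""
    newest = max((r.api_version for r in rows), default=code_version)
    return newest <= code_version


# Before the migration: only version 1 exists.
before_migration = [StoredStatement(3, 1, "DNA_SEQUENCE", "WRITE")]

# After the migration: version 1 is kept and version 2 has the split action.
after_migration = before_migration + [
    StoredStatement(3, 2, "DNA_SEQUENCE", "WRITE"),
    StoredStatement(3, 2, "DNA_SEQUENCE", "EDIT_BASES"),
]
```

A server still running version 1 code keeps authorizing against the version 1 rows and refuses policy edits. Once the version 2 code is deployed, it reads the version 2 rows and accepts edits again, and the version 1 rows can be deleted.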

Granularization became a multi-step process:

  1. Break out the new action from the old action
  2. Update any code paths to check the new action
  3. Bump the policy API version in code
  4. Write a pre-code-deploy migration that copies the policy statements into a new version
  5. Update the UI to allow configuring the new action
  6. Delete the old version of policy statements

Empowering the engineering team

Given that different product teams within Benchling will want to perform granularizations in the future, we needed to make the process more developer-friendly. To do this, we added a few things:

  • Wrote a Developer Guide for Policy Granularization.
  • Added a database migration template that only requires the engineer to implement map_policy_statement_spec_fn. Given a policy statement, map it to policy statement(s) in the new version. The template includes boilerplate to bump the version and copy each policy statement.
  • We built a UI for admins to update policies for their organization, and it needs to stay in sync with the item types, actions, and conditions represented in our DB and code. By building out a library of UI components and JSON templates around this, we made sure it was easy to update the UI with the new configurations (with a test that errors if we don’t!).
The UI for configuring a data policy
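To give a feel for the migration template, here is a simplified sketch of the split between the engineer-written mapping function and the shared boilerplate. Only the name map_policy_statement_spec_fn comes from the template described above. Its signature, the dictionary shape of a statement spec, and the migrate_statements helper are assumptions for illustration.

```python
def map_policy_statement_spec_fn(spec):
    """Engineer-written part: split EDIT_BASES off from WRITE on DNA sequences."""
    if spec["item_type"] == "DNA_SEQUENCE" and spec["action"] == "WRITE":
        # Anyone who could WRITE before keeps both abilities after the split.
        return [spec, {**spec, "action": "EDIT_BASES"}]
    return [spec]


def migrate_statements(rows, from_version, map_fn):
    """Boilerplate: copy each statement at `from_version` into the next version."""
    new_rows = []
    for row in rows:
        if row["api_version"] != from_version:
            continue
        spec = {k: v for k, v in row.items() if k != "api_version"}
        for mapped in map_fn(spec):
            new_rows.append({**mapped, "api_version": from_version + 1})
    # Old rows are kept so code on the old version keeps working until cleanup.
    return rows + new_rows
```

The mapping function only has to answer one question: what does this statement become in the new version? Version bumping and copying stay in one shared place.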

Granularizations will always be customer-driven, and we made an effort to make it easy for the developer to execute it without incurring any correctness issues during the transition.

Lessons learned

  • Policies, with rules configured along three axes (item types, actions, and conditions), helped us create a configurable, easy-to-understand system that allowed customers to model their scientific processes and roles in Benchling.
  • Keeping the number of supported actions small, just flexible enough to meet our current product needs, kept the authorization model easier to reason about. Same with keeping the system purely additive (no policy rule can deny permission).
  • Using database constraints helped ensure that our old and new systems were in sync during the migration period.
  • Running redundant permission checks during the migration from old to new permission system helped achieve correctness and inspire confidence in the new system.
  • Monitoring performance more actively during the system transition would have helped ensure that the new system and its many hot code paths do not degrade the user experience.
  • Versioning policy APIs helps us ensure a smooth transition when we perform a granularization that incurs no downtime.
  • Investing in developer tooling allows other teams to support customers who need finer-grained configuration.

At Benchling, we’re building a platform for scientists to streamline their research. Scientists across academic labs, small biotech startups, and large pharmaceutical companies are relying on Benchling to store their most important data, but they all have different collaboration models. Our permissions system has to support each organization’s data access needs and internal processes. This led us to evolve from a simple, read/write/admin level-based system to a highly configurable, yet easy-to-understand, policy-based system. We know these needs will change as the ways our customers do research evolve, and we’ll continue growing our permissions system alongside them.

And if you’re interested in working on problems like this, we’re hiring!

Thanks to Saif, Somak, and Vineet for reading drafts of this.

Evolving to Enterprise-Grade Permissions was originally published in Benchling Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Benchling