When building systems for new products, there’s a delicate balance between writing code that works and writing code that lasts. A common anti-pattern is preemptively optimizing systems for the future while still trying to find product-market fit. For new product teams, this can be a costly mistake because it slows the iteration cycle between product experiments.
At Clever, the Discovery team is responsible for facilitating the discovery and usage of ed-tech applications in the K-12 space. As one of the newer product teams, we were tasked with quickly iterating on different experiments to find a product that would fulfill this mission. This resulted in the Clever Library, a marketplace for ed-tech applications where teachers can discover and use new applications with their students. Applications in the Library automatically provision user accounts based on information teachers allow them to access, leading to a seamless experience.
When we built the Library in Spring of 2018, we designed it so that we could get up and running as quickly as possible while still being reasonably flexible to change. But as more and more teachers became active users of our product, we realized we needed to reinvest in our original system to handle our new usage projections.
In this blog post, we’ll talk about how we evolved from an experiment with 50k monthly active users to a scaled version that now supports over 800k monthly active users.
The core responsibility of the Library backend is to answer the question “what school information is shared with this application?” If a teacher installs an application, we want to share the pertinent school information with that application so that it can create the relevant user accounts.
To answer this question, we track student information system (SIS) data and sharing rules that represent connections between users and applications. By applying sharing rules to SIS data, we can figure out what information should be shared.
We had an existing Mongo database that stored SIS data and we introduced a new Dynamo table that stored sharing rules.
We made the following decisions to minimize the amount of work to ship our backend.
We reused existing systems and tooling to expedite development. It was easy to use existing databases and services for SIS data, and Clever had tooling that made spinning up and using DynamoDB dead simple. Engineers also had experience with these, so there was a clear path forward when we started implementing our designs.
We performed the bulk of our business logic at read time to figure out what SIS data was shared with an application, given existing sharing rules. Evaluating business logic at read time kept our data model simpler and let us make small product changes more quickly.
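As a rough illustration of read-time evaluation, the sketch below joins sharing rules against SIS data in application code. The dictionaries stand in for the two data stores, and every name in it (`SharingRule`, `shared_students`, the IDs) is hypothetical rather than taken from the actual Library backend.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical stand-ins for the SIS data store: which sections each teacher
# teaches, and which students are enrolled in each section.
SIS_SECTIONS_BY_TEACHER = {
    "teacher-1": ["section-a", "section-b"],
    "teacher-2": ["section-c"],
}
SIS_STUDENTS_BY_SECTION = {
    "section-a": ["student-1", "student-2"],
    "section-b": ["student-2", "student-3"],
    "section-c": ["student-4"],
}


@dataclass(frozen=True)
class SharingRule:
    """A connection between a teacher and an installed application (assumed shape)."""
    teacher_id: str
    app_id: str


# Hypothetical stand-in for the sharing-rules table.
SHARING_RULES = [SharingRule("teacher-1", "app-math")]


def shared_students(app_id: str) -> set[str]:
    """Resolve at read time which students are shared with an application."""
    students: set[str] = set()
    for rule in SHARING_RULES:  # first store: sharing rules
        if rule.app_id != app_id:
            continue
        # Second store: SIS data, joined to the rule in application code.
        for section_id in SIS_SECTIONS_BY_TEACHER.get(rule.teacher_id, []):
            students.update(SIS_STUDENTS_BY_SECTION.get(section_id, []))
    return students
```

Nothing is precomputed in this shape, so a product change only touches the read path. The trade-off is that each read does work proportional to the number of sections a teacher teaches.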
Despite being easy to build, simple, and quick to iterate on, the original system we designed started suffering at increased load. At Clever, we perform nightly load tests to catch any issues with running our services at scale.
These tests gradually ramp up the request rate until we hit our steady state peak load.
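A gradual ramp of this kind can be described as a target request rate for each second of the test. The sketch below assumes a linear ramp followed by a hold at peak. The function name `ramp_schedule` and its parameters are illustrative and not part of Clever's load-testing tooling.

```python
from __future__ import annotations


def ramp_schedule(peak_qps: float, ramp_seconds: int, hold_seconds: int) -> list[float]:
    """Return a target request rate per second: linear ramp, then steady state."""
    if ramp_seconds <= 0 or hold_seconds < 0:
        raise ValueError("ramp_seconds must be positive and hold_seconds non-negative")
    # Ramp: increase the rate in equal steps, reaching the peak on the last step.
    ramp = [peak_qps * (second + 1) / ramp_seconds for second in range(ramp_seconds)]
    # Hold: stay at the steady state peak for the rest of the test.
    hold = [float(peak_qps)] * hold_seconds
    return ramp + hold
```

For example, `ramp_schedule(36, 4, 2)` climbs to 36 qps in four equal steps and then holds there for two seconds.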
The following is a snapshot of our application load balancer latency for read-service during one of these load tests. This snapshot is taken at a steady state peak of ~36 queries per second (qps).
In the highlighted sample above, we can see the distribution of read-service latency in a worst-case event. The 99th percentile of long running requests (P99) exceeds 4 seconds, but you can also tell that outside of this outlier, our latencies across the board are still high. When load approached 50qps, our P50 latency began to exceed 200ms, which is when our load balancer timeouts killed the requests. We had the following pain points:
As we prepared for the 2019/2020 school season, we realized that we needed to support significantly more traffic and to do so under tighter latency SLAs. We also envisioned a plethora of extensions so that teachers could share Library applications with finer granularity and control. This led to the following goals:
Our new scaled system is powered by a Serverless Aurora MySQL database that consolidates both the SIS and sharing data that we need. We chose Serverless Aurora because it provides high availability and fault tolerance. But most importantly, it vertically scales automatically with changes in traffic.
With both sharing rules and SIS data in MySQL, we can use SQL joins to relate both pieces of information. For example, to find out if a teacher is scoped to an application, we might use this high level query:
We could apply this same basic concept to every other read path in read-service. What once required pulling data from two data stores and joining across the data in code can now be done in a single optimized SQL query.
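To make the join concrete, here is a minimal sketch that checks whether a teacher is in scope for an application with a single query. It uses an in-memory SQLite database as a stand-in for Aurora MySQL, and the table and column names (`teachers`, `sharing_rules`, `teacher_id`, `app_id`) are assumptions for illustration, not Clever's actual schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database holding both SIS and sharing data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE teachers (
            id TEXT PRIMARY KEY,
            school_id TEXT NOT NULL
        );
        CREATE TABLE sharing_rules (
            teacher_id TEXT NOT NULL REFERENCES teachers(id),
            app_id TEXT NOT NULL,
            PRIMARY KEY (teacher_id, app_id)
        );
        INSERT INTO teachers VALUES ('teacher-1', 'school-1');
        INSERT INTO teachers VALUES ('teacher-2', 'school-1');
        INSERT INTO sharing_rules VALUES ('teacher-1', 'app-math');
        """
    )
    return conn


def teacher_in_scope(conn: sqlite3.Connection, teacher_id: str, app_id: str) -> bool:
    """Return True if a sharing rule connects this teacher to this application."""
    row = conn.execute(
        """
        SELECT 1
        FROM teachers AS t
        JOIN sharing_rules AS r ON r.teacher_id = t.id
        WHERE t.id = ? AND r.app_id = ?
        LIMIT 1
        """,
        (teacher_id, app_id),
    ).fetchone()
    return row is not None
```

The join runs inside the database engine, so the service makes one round trip instead of querying two stores and reconciling the results in code.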
Two challenges made it significantly harder to start with this scaled system. Both added extra work that we couldn’t justify when building our initial system.
At Clever, we have a critical ETL pipeline that keeps SIS data updated in our SIS Mongo database. Specifically, one step in this pipeline, load-sis, ingested events and wrote to a Mongo database. This SIS data is then used in downstream processes to power Clever’s auto-rostering. To consolidate SIS data and sharing data in a new database, we needed to change this critical pipeline, which was both complex and high risk.
For our scaled system, we added a step in our ETL pipeline, load-mysql-sis, that ingests events and performs updates to our MySQL database. With this in place, our write-service began a dual write pattern to both our MySQL database and our Dynamo database until we were ready to fully migrate over.
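A dual write during a migration might look roughly like the sketch below. The `RuleStore` interface, `InMemoryStore`, and `DualWriter` are hypothetical names, and the failure policy (keep serving from the existing store and record failed writes to the new one for later backfill) is an assumption, not a description of how write-service actually handles errors.

```python
from __future__ import annotations

from typing import Protocol


class RuleStore(Protocol):
    """Minimal interface assumed for a sharing-rule store."""

    def put_rule(self, teacher_id: str, app_id: str) -> None: ...


class InMemoryStore:
    """Test double standing in for a real database client."""

    def __init__(self) -> None:
        self.rules: set[tuple[str, str]] = set()

    def put_rule(self, teacher_id: str, app_id: str) -> None:
        self.rules.add((teacher_id, app_id))


class DualWriter:
    """Write every rule to the existing store and to the new store."""

    def __init__(self, existing: RuleStore, new: RuleStore) -> None:
        self.existing = existing
        self.new = new
        self.failed: list[tuple[str, str]] = []

    def put_rule(self, teacher_id: str, app_id: str) -> None:
        # The existing store stays the source of truth until the migration
        # completes, so an error here propagates to the caller.
        self.existing.put_rule(teacher_id, app_id)
        try:
            self.new.put_rule(teacher_id, app_id)
        except Exception:
            # Assumed policy: do not fail the request; record the write for backfill.
            self.failed.append((teacher_id, app_id))
```

Once reads against the new store are verified, the existing store can be dropped from the writer.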
There was also friction with even using a relational database. At the time, Clever lacked tooling to manage and spin up new relational databases. There was also a gap in domain expertise – we didn’t know how to use SQL safely and effectively. These gaps added ambiguity in our estimates as well, as it was hard to forecast the issues that might arise if we chose to use SQL.
This is a snapshot of read-service latencies during a load test on our scaled system. Here we’re at a steady state peak of 500qps:
Previously at 50qps, we averaged a P50 of 200ms and a P99 of 500ms. Now, even at 10x the load, our P99 averages 40ms (roughly a 90% decrease) and our P50 averages 25ms (an 87% decrease). Because latency is no longer tied to the number of classes a teacher teaches, we could also remove the 20-class cap that we previously enforced on teacher installations.
Our read service used to orchestrate calls to various databases and perform in-code joins to reconcile different data types. Replacing that with a single SQL query against a single database vastly reduced its complexity. It has also made it easier to identify and triage issues, since we have fewer dependencies.
With Serverless Aurora, we can vertically scale our database cluster as our request volume grows. In our old system, performance was also tied to the number of classes a teacher taught. So our worst case was always teachers who taught many classes. By leveraging the SQL engine on very relational data, our scaled system still performs well for these teachers.
This project was the first production usage of Serverless Aurora at Clever, and Serverless Aurora was also a newer AWS offering at the time. We chose it because it offered many of the qualities we wanted in a datastore, but this also meant that there were many unknowns when it came to actually using it. Our key learnings are:
Our new scaled system has been humming along for the 2019/2020 back-to-school season. But there are always improvements to be made to ensure reliability and ease of future development. In the next few months, we’re looking to make more changes to our data model to enable more sophisticated types of sharing. We’re also investing in more SQL tooling so that it’s easier for engineers to use Serverless Aurora, and SQL generally, in new projects.
If you enjoyed reading this and would like to work on similar types of challenges, our engineering team is hiring!