Scaling Intuit’s integrations platform

Intuit’s integration platform powers data-in integrations that bring data from strategic e-commerce leaders such as PayPal, Square, Google Calendar, GoCardless, and Stripe into customers’ accounting books, saving time for the subscribers of these integrations. This no-code platform is deployed as a service and managed by Intuit’s integration platform team. It supports four kinds of integration patterns:

  • Data Sync — Running background jobs to sync data
  • Interactive application
  • Webhooks
  • Trigger action workflow

Intuit integration platform at a glance:

Performance and scalability journey:

The integration platform is composed of Connectors, Jobs, UI, and API. These components can be scaled vertically, but they are bottlenecked by a single database server, which limits throughput. Data sync is the primary form of integration: a typical data sync subscription runs a job multiple times a day to ingest data. A job and its runtime state are managed entirely in the database, which is one of the key contributors to the database load that causes processing delays.

The journey started about two years ago, with an initial focus on low-hanging fruit to relieve the immediate pressure of data processing delays. The following changes yielded a 350% improvement in the job server:

  • Introduced a caching layer to reduce database lookups
  • Migrated a few tables to DynamoDB for faster writes
  • De-normalized a few tables and added table indexes
  • Optimized application logic
  • Archived obsolete data
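
To illustrate the first item in the list above, here is a minimal sketch of a read-through cache with a time-to-live. It is not the platform's actual implementation: the class `ReadThroughCache`, the loader `load_connector_config`, and the configuration fields are all hypothetical names chosen for the example, and the "database" is just a Python function that records how often it is called.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class ReadThroughCache:
    """Tiny in-process cache with a TTL, placed in front of a slow lookup."""

    def __init__(self, loader: Callable[[Hashable], Any],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh entry: no database round trip
        value = self._loader(key)  # miss or expired: load once and remember
        self._entries[key] = (now + self._ttl, value)
        return value


db_calls = []  # stands in for a counter of real database queries


def load_connector_config(connector_id):
    """Hypothetical database lookup; here it only records that it was called."""
    db_calls.append(connector_id)
    return {"connector": connector_id, "batch_size": 100}


cache = ReadThroughCache(load_connector_config, ttl_seconds=300)
for _ in range(1000):
    cache.get("paypal")

print(len(db_calls))  # 1: only the first lookup reached the "database"
```

The trade-off in a sketch like this is staleness: a longer TTL removes more lookups but delays the point at which a changed row becomes visible to the job server.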

Although the initial results gave us a runway for the next year, the integrations platform had to reach the state where it could scale horizontally to support the projected growth.

The job server relies heavily on the database server for work items and their runtime state. It was initially architected to operate in single-instance mode, but when the platform started hitting throughput caps, multi-node support was introduced by way of a distributed lock acquisition strategy. Job server capacity improved, but a new ceiling quickly emerged because nodes contended for locks on work items: a classic legacy architecture problem. The following is a simplified view of the multi-node architecture of the job server.
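
To make the contention concrete, here is a small sketch of one common way to claim work items through the database: a conditional `UPDATE` that succeeds only if the row is still unlocked. The article does not describe the platform's actual locking scheme, so the table `work_item`, the column `locked_by`, and the function `try_claim` are assumptions made for illustration, and an in-memory SQLite database stands in for the real database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work_item (id INTEGER PRIMARY KEY, locked_by TEXT)")
conn.executemany("INSERT INTO work_item (id) VALUES (?)", [(i,) for i in range(3)])


def try_claim(connection, item_id, node):
    """Claim a work item only if no other node holds it (hypothetical scheme)."""
    cur = connection.execute(
        "UPDATE work_item SET locked_by = ? WHERE id = ? AND locked_by IS NULL",
        (node, item_id),
    )
    return cur.rowcount == 1  # exactly one row changed means this node won


nodes = ["node-1", "node-2", "node-3", "node-4"]
attempts = 0
wins = 0
for item_id in range(3):
    for node in nodes:          # every node races for every work item
        attempts += 1
        if try_claim(conn, item_id, node):
            wins += 1

print(attempts, wins, attempts - wins)  # 12 3 9
```

In this toy run, four nodes issue twelve claim attempts to lock three items, so nine database writes do no useful work. Adding nodes increases the wasted attempts faster than the useful ones, which is the kind of ceiling the paragraph above describes.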

Messaging based job framework:

The architecture above violates the single responsibility principle of SOLID design: the existing job server has two responsibilities, scheduling jobs and running them. A component with more than one responsibility tends to produce complex, unmanageable code, which led us to rethink the job server's architecture and design.

We decided to re-architect the job server framework around an event-driven architecture. The new architecture has two main components, shown in the following picture (simplified version):

  • Job producer: Queue candidate jobs
  • Job consumer: Execute a candidate job
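
The split between the two components above can be sketched as follows. This is an illustration under stated assumptions, not the platform's code: `JobMessage`, `JobProducer`, `JobConsumer`, and their methods are hypothetical names, and Python's standard `queue.Queue` stands in for the messaging platform so the example runs on its own.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class JobMessage:
    """Hypothetical payload describing one candidate job."""
    subscription_id: str
    connector: str


class JobProducer:
    """Scheduler side: queues candidate jobs and never executes them."""

    def __init__(self, job_queue: queue.Queue) -> None:
        self._queue = job_queue

    def enqueue_due_jobs(self, due_jobs: Iterable[JobMessage]) -> int:
        count = 0
        for job in due_jobs:
            self._queue.put(job)
            count += 1
        return count


class JobConsumer:
    """Runner side: executes whatever arrives and knows nothing about scheduling."""

    def __init__(self, job_queue: queue.Queue,
                 handler: Callable[[JobMessage], None]) -> None:
        self._queue = job_queue
        self._handler = handler

    def drain(self) -> int:
        processed = 0
        while True:
            try:
                job = self._queue.get_nowait()
            except queue.Empty:
                return processed      # nothing left to do for now
            self._handler(job)
            self._queue.task_done()
            processed += 1


jobs = queue.Queue()
completed = []

producer = JobProducer(jobs)
queued = producer.enqueue_due_jobs([
    JobMessage("sub-1", "paypal"),
    JobMessage("sub-2", "stripe"),
    JobMessage("sub-3", "square"),
])

consumer = JobConsumer(jobs, handler=lambda job: completed.append(job.subscription_id))
processed = consumer.drain()

print(queued, processed, completed)  # 3 3 ['sub-1', 'sub-2', 'sub-3']
```

Because the producer and consumer share only the queue, more consumers can be added without touching the scheduling logic. With a hosted queue, the `put` and `get_nowait` calls would be replaced by that service's send and receive/delete operations.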

The design allows us to switch to any messaging platform in the future. For now we use AWS SQS, which lets us scale without having to manage another system ourselves. We deployed the new system (partially implemented) in production, and the results have been very encouraging. The following graphs provide a glimpse of the improvements.

Key performance highlights:

Job processing increased from 1.4 million to 12.5 million jobs per day, a 793% improvement over 2016 and a 268% improvement over 2017.

Priority jobs are no longer delayed. All priority jobs now execute in real time.

No job waits more than 60 minutes. Delays in the 15–60 minute range improved by 1100% from 2016 and 520% from 2017.

In the best-case scenario, job delay dropped from 90 minutes to 3 minutes, a 2900% improvement. Average job delay dropped from 99 minutes to 16 minutes, a 520% improvement.
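
The article does not state how these percentages were computed. The sketch below shows one formula that reproduces the figures for which both endpoints are given: the change divided by the smaller (worse-throughput or better-delay) value. The function name `improvement_percent` and its parameters are assumptions introduced for this example.

```python
def improvement_percent(before: float, after: float, higher_is_better: bool) -> float:
    """Relative improvement, measured against the smaller of the two values.

    Assumed formula: for throughput the gain is divided by the old value;
    for delays the reduction is divided by the new value.
    """
    if higher_is_better:
        return (after - before) / before * 100
    return (before - after) / after * 100


throughput = improvement_percent(1.4, 12.5, higher_is_better=True)
best_delay = improvement_percent(90, 3, higher_is_better=False)
avg_delay = improvement_percent(99, 16, higher_is_better=False)

print(round(throughput))  # 793
print(round(best_delay))  # 2900
print(round(avg_delay))   # 519, which the article rounds to 520
```

Read this way, a "2900% improvement" in delay means the old delay was 30 times the new one. Expressed as a plain reduction, 90 minutes to 3 minutes is a drop of about 97%.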

In summary: be bold, do not hesitate to throw away what is working now, use loose coupling, follow the single responsibility design principle, be tactical to buy time for bigger changes, and finally, track everything.


Scaling Intuit’s integrations platform was originally published in QuickBooks Engineering on Medium.

Source: Intuit