At Just Eat, we’ve been big users of the Google Cloud Platform (GCP) for a number of years. We ingest both realtime and batch data, maintaining an in-house data transformation platform that allows any business user to self-serve data using Google BigQuery. Our in-house Airflow-based transformation platform, named Optimus, has seen massive growth over the […]
At Just Eat, we’ve been big users of the Google Cloud Platform (GCP) for a number of years. We ingest both realtime and batch data, maintaining an in-house data transformation platform that allows any business user to self-serve data using Google BigQuery.
Our in-house Airflow-based transformation platform, named Optimus, has seen massive growth over the past two years. We have over 200 users making around 500 enhancements per week. We’re now at the point of having 1500 data pipelines running in our production environment.
The insights that we generate in BigQuery (BQ) go on to power a number of functions from personalised food discovery services to forensic understanding (of customers, couriers and restaurants) and power all data driven communications. Needless to say, there is a lot of data processing needed to power these functions in various markets around the globe. In this post we’ll discuss how we efficiently manage the costs that arise through processing this data with BigQuery, covering the various cost models and how we make use of them at Just Eat. This will set the scene for a follow-up post containing some practical engineering tips to efficiently write code tailored for BigQuery.
When using BQ, you are billed for how much data you store and how much processing you carry out on that data. We’re going to focus on optimising query cost by exploring the different pricing models that Google support.
BigQuery pricing models are based on the concept of slots. A slot is a unit of computational power. BigQuery can use one or more slots to process your request and the number of slots being used can change over time as the stages of a query progress. The best way of looking at how much energy (cost) your query has used, is to measure slot-hours.
On to pricing models –
Further information on the above can be found on a blog post published by Google.
Now that we’ve discussed the Google nomenclature, let’s get down to the practical aspects of the pricing models and talk about the journey that Just Eat has been on.
We began our data journey with Redshift, it was a great fit due to our existing AWS stack and various integrations. The data team would ingest data, carry out various ETL processes and administer the cluster. But as the business’s need for data grew, increasingly we saw that Data Engineers were being asked to wear many hats, providing downstream teams with all the data they needed to unlock insight.
The realisation came that we were asking Engineers to worry too much about the underlying architecture, acting as DBAs. They had become a bottleneck to all the data alchemy that could be happening downstream.
While both Redshift and BigQuery have their own strengths, we concluded that in order to truly democratise our data with a limited number of engineers, it would make more sense for us to transition to BigQuery. This would enable us to build tools that would allow users to self-serve and unlock insights without an Engineer blocking their progress. A key part to entering this Bronze Age of data at Just Eat was the introduction of BigQuery and the construction of our Data Lake.
To support this goal, we began the development of our GCS (Google Cloud Storage) and BQ backed Data Lake. We started gradually adding in more data and migrating processes until we officially deprecated Redshift in Q1 of 2019.
During this early phase of the Data Lake development, on-demand pricing was critical in allowing us to scale to meet the demands of our users without having to expel too much effort making (often inaccurate) usage estimates.
On-demand pricing worked very well for the team when the Data Lake was in its infancy. Since going all-in on BigQuery, we utilised on-demand pricing for around a year before reaching critical mass, realising the rate of expenditure was not sustainable.
In early 2020, we began to explore flat-rate billing in order to provide us with a more predictable and cost efficient process for the future. When exploring this, our first realisation was that in order to take advantage of slot reservations and segregate user workload we’d need to shuffle our project structure.
Prior to this, users were free to switch projects and query data in the Data Warehouse having activated any project. This presents a problem as billing and slots are controlled at project level meaning users could take resources from business critical processes and potentially hop between projects to pinch additional resources. This would mean that a bad neighbour could hog resources and make cost analysis very difficult.
Our solution to this was to create a “data-users” project. This project’s purpose is to logically separate user queries from ETL and other programmatic actors that need to run reliably in order to support business processes. This project enabled us to map user queries to a set reservation using an assignment and user permissions to query the DWH were restricted to this new project. Using this meant that cost analysis and billing became much simpler and users could not consume slots required for critical automated processes.
Within this initial phase, we settled on this new project structure and three associated reservations; the lion’s share of slots were allocated to critical automated pipelines.
This shift to flat rate billing entered production in early 2020, we revisited the concept towards the middle of the year. Our initial goal was to ensure that we could drive a maintainable level of spend for the Data Lake. Flat Rate helped us ensure this but also meant that we regressed to a Stone Age (pre 2019) situation of having to worry about workloads and assist users with understanding how to tune their queries. We began to see performance issues with critical workloads and realised the journey wasn’t over.
Tackling this problem required two techniques – education and understanding from our users, and further thinking around how we could flex our capacity. The former is an ongoing activity, but has primarily been centred around helping people understand all the new concepts introduced following migrating to flat-rate billing. Our next post will discuss some of the advice and tuning activities that have been carried out internally.
The second technique was to make use of a new feature within BigQuery to dynamically grow our capacity to meet demand. In late February, 2020, Google launched Flex Slots. This feature allowed us to purchase slots for a short, pre-defined duration of time, providing a more flexible approach to growing capacity.
Flex slots allowed us to scale the platform during peak times to reduce the throttling effect of having a number of slots reserved for an activity. In Q3 of 2020, we introduced some automation (aptly named Flexy) to help us purchase additional flex slots to meet demand at known times of heavy usage.
Purchasing flex slots is working fairly well for our application, but we believe we can reduce a significant amount of resource usage by further understanding jobs that would be more efficiently run using on-demand pricing, rather than burning reserved slots.
The next stage in our journey is to identify workloads that are cheaper to execute using on-demand. We will look at changing our in-house tooling to dynamically schedule queries using the most efficient billing method by switching projects on the fly.
All well-shaped objectives should be measurable. Much of the work we’ve described wouldn’t be possible without capturing data points and setting targets to ensure we’re heading in the right direction.
Having visibility of query performance and cost at a higher level was critical in giving us the ability to understand where to apply the Pareto principle. In order to identify resource intensive pipelines we introduced a standard labelling scheme within our ETL tool. Labels were mapped to the API call when issuing queries to BQ and these dimensions were then visible within our cost reporting.
Subsequently, we were able to identify expensive jobs, slice costs by owner, department or feature. This helps the team be clear on objectives and understand bottlenecks, reducing the noisy neighbour effect by ensuring usage metrics are available for all.
Similarly operational metrics around slot usage and capacity were surfaced in Grafana, allowing us to understand real-time usage and monitor any over or under provisioning of the system.
In this post we talked about the various milestones and discoveries that we’ve made during our time adopting BigQuery. Balancing cost and usability has meant that we’ve had to adapt the solution as we’ve grown. In the next post, we’ll look at some SQL optimisations and design principles that we’ve applied to help us keep costs in check and ensure our data lake operates efficiently.
Source: Just Eat