Primer on BigQuery Cost and Performance Optimizations
Other parts in this series:
Part 2
Part 3
Problem Statement
Imagine your company runs multiple BigQuery datasets, and a month ago you added a new team of analysts querying them. One morning, while sipping your coffee (or tea, as we don't want to discriminate), you open your GCP bill and discover that your BigQuery analysis and storage costs went up by 5,000%. This may well cause you to spit out your drink, creating quite the mess on your keyboard and monitor. This article will help you rectify, if not outright prevent, this unfortunate situation by optimizing your BigQuery costs through a few different methods.
Getting the basics right
Before you address your BigQuery costs, you need to make a few preparations and have some basic prerequisite knowledge.
I will be covering these in this first part of the blog series. Once you have the basics down and these preparations done, it's a straight shot into Part 2 and beyond to actually get your optimizations done.
Here’s what you need to know about BigQuery slots, pricing models and the factors that determine costs:
Slots
At the core of BigQuery’s computation is a construct called a slot. A slot is simply a vCPU with some memory attached — like a mini-VM dedicated to BigQuery processing. When you run a query in BigQuery, the query is processed across a set of allocated slots. In theory, the more slots you have allocated and available, the faster your queries will run.
A slot will perform almost any action that comes up in the process of a BigQuery job.
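If you want to see how many slots your own jobs actually consume, the INFORMATION_SCHEMA views covered later in this part expose per-job slot-milliseconds. Below is a minimal sketch using the bq CLI: dividing total_slot_ms by the job's wall-clock duration approximates the average number of concurrent slots a query used. The region qualifier, time window, and limit are assumptions; adjust them for your environment.

bq query --use_legacy_sql=false '
SELECT
  job_id,
  total_slot_ms,
  -- Average concurrent slots is roughly slot-milliseconds / job duration in ms
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = "QUERY"
ORDER BY total_slot_ms DESC
LIMIT 10'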
Shuffle Slots
Within BigQuery, an undocumented task that slots perform is shuffling data. Up to 60% of the slots allocated to your project can act as shuffle slots at any given time.
In very simple terms, shuffling is redistributing processed data to a new location for the current or next step in the query plan to run faster. Google’s white paper on BigQuery’s architecture will give you a deeper understanding of how shuffling works. Shuffling helps the query run more efficiently, but it uses very valuable slots to perform operations that aren’t directly moving the query execution along.
Pricing Models
BigQuery currently has two distinct pricing models: on-demand and flat-rate.
On-demand pricing
The default model is on-demand pricing, which allocates 2,000 slots for your queries plus a separate pool of an unpublished size for loads, table copies, and extract jobs. As excess slots become available across BigQuery in general, the pool of 2,000 slots can grow, but this is not guaranteed, so you should not count on it. For reference, the most I have seen is around 3,500 slots in short bursts.
In this model you are billed at a set rate of $5 per TB of data scanned by your queries.
For many organizations, this is the key driver of BigQuery costs.
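Before running a query on-demand, you can see exactly how much data it will scan, and therefore what it will cost, with a dry run; dry runs are free and do not consume slots. A minimal sketch with the bq CLI, where the table and field names are hypothetical placeholders:

# Validate the query and report bytes scanned without running it
bq query --dry_run --use_legacy_sql=false \
  'SELECT field_a, field_b FROM `<Project ID>.<Dataset Name>.<Table Name>`'

The command reports how many bytes the query would process; divide that by 1 TB (2^40 bytes) and multiply by $5 to get the on-demand cost.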
Flat-rate pricing
Flat-rate (or slot) pricing sets a flat price for BigQuery scanning at the potential expense of performance. It does this by limiting the number of slots available to what you pre-purchase and removing the $5-per-TB scan price.
Slots (including Flex Slots, which I cover below) are always purchased in chunks of 100. Standard (read: non-Flex) slots are purchased as a commitment with a term of one month or one year. Note that once a commitment is purchased, it cannot easily be canceled. A one-month commitment can be changed to a one-year commitment, but the reverse is not possible.
What happens when a commitment ends is a tricky little situation that differs between the two terms and has changed a few times. Because of this, I am simply linking the Google documentation for monthly commitments and annual commitments in case it changes again. I highly recommend setting up calendar events to notify you well before your commitments end so you can react accordingly.
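To avoid being surprised by an expiring commitment, you can also check your current commitments and their end dates directly. A quick sketch with the bq CLI, assuming your reservations live in the US multi-region and that <Admin Project ID> is the project holding them; swap in your own values:

# List capacity commitments, including plan, state, and end time
bq ls --capacity_commitment --location=US --project_id=<Admin Project ID>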
Now with that said, on a flat-rate pricing model your slots can be augmented, or fully replaced, with an additional pool of slots called Flex Slots. Flex Slots are identical in capacity and function to normal slots, but they offer a much shorter and more flexible commitment, with a minimum term of 60 seconds. Once this minimum is reached, you can cancel the commitment at any time and the slots disappear, which allows you to scale slots up and down as needed. They are most useful when more slots are needed for short periods, such as spikes in usage or a particularly resource-intensive job that might use up all of the slots in a reservation. A prime example is retail organizations on Black Friday or Cyber Monday, the biggest shopping days in the United States.
At the time of writing, 100 slots cost $2,000 USD per month on a one-month commitment and $1,700 USD per month on a one-year commitment. So to match the 2,000 slots of on-demand, you would need to purchase twenty 100-slot chunks, which comes to $40,000 USD per month on a one-month commitment or $34,000 USD per month on a one-year commitment. Because of this, it is generally not cost effective to purchase 2,000 slots unless you are spending at least that much in BigQuery scan costs.
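For completeness, here is a sketch of the Flex Slots lifecycle using the bq CLI, following the commands in Google's reservations documentation; the location, admin project, and commitment ID are placeholders. Note that to actually route queries onto purchased slots you also need to create a reservation and an assignment, which the same documentation covers.

# Purchase a 100-slot Flex commitment (minimum term: 60 seconds)
bq mk --capacity_commitment --plan=FLEX --slots=100 \
  --location=US --project_id=<Admin Project ID>

# Once the 60-second minimum has passed, cancel it at any time
bq rm --capacity_commitment --location=US \
  --project_id=<Admin Project ID> <Commitment ID>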
The Plan for Determining Costs
Before you can optimize costs, you need to analyze the usage data of your project(s). To do this, you will need access to the query data in your projects/datasets.
A quick note to level expectations: I am covering this now, in the first part of the series, so that it can be set up and start recording your usage data in case you don't read through multiple parts in one sitting (or standing). The fruits of this labor will actually start being used in Part 3.
There are two methods of doing this: audit log sinks and the INFORMATION_SCHEMA tables.
The audit log sink is the preferred method, as the data is much richer and it covers multiple projects, datasets, and regions. However, if it is not already enabled, it will take some time to populate as jobs run on BigQuery; how long depends on your workloads and how often what you consider a "normal" workload runs. There is also a small cost associated with the audit log sink in the form of BigQuery storage costs for the data.
The INFORMATION_SCHEMA views, by contrast, already exist in every dataset, project, and region. Unlike the audit log sink, though, each dataset and region combination has its own set of these views, so queries may need to be run multiple times. Some information that exists in the audit log tables is also missing from INFORMATION_SCHEMA; examples are documented as comments in the SQL queries later in this series, where they will not match up 1-to-1 with the audit log sink queries.
If you use the INFORMATION_SCHEMA method, you can skip the next section on setting up the audit log sink, because your data already exists.
Note: if you are a current DoiT customer and use the BigQuery Lens dashboard in the DoiT Console, then you already have the audit log sink set up in your environment, and I recommend using it for the operations listed in these articles. More information on this can be found here.
The dataset's fully qualified name will be <Project Name>.doitintl-cmp-bq.cloudaudit_googleapis_com_data_access, where <Project Name> is the name of the project from which you created the service account that was uploaded to the DoiT Console to enable your BQ Lens dashboard.
If this is the case, you can skip the next section.
Setting Up Audit Log Sink
If you are choosing to use the INFORMATION_SCHEMA views instead of an audit log sink, skip this section.
The following instructions show how to set up an audit log sink. Note that it is best to run these commands either on your local machine or in Cloud Shell, with the gcloud CLI set up to use the project where you wish to store the dataset.
First run the following command and update the sink name, project ID, and dataset name:
gcloud logging sinks create <Sink Name> \
  bigquery.googleapis.com/projects/<Project ID>/datasets/<Dataset Name> \
  --use-partitioned-tables \
  --log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata"'
After this command runs, it will output a service account that was created for this purpose. Copy the full email address of this service account into a note somewhere.
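If you lose that note, the service account can be recovered from the sink itself; a small sketch, using the sink name from the create command above:

# Print only the sink's writer identity (its service account)
gcloud logging sinks describe <Sink Name> --format='value(writerIdentity)'

The value returned already includes the serviceAccount: prefix, which is the form the --member flag in the commands below expects.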
Next you will grant the service account that was just created the BigQuery Data Editor role in each project that uses BigQuery. As always in the cloud, there are multiple ways to do this, so I am including the three most common: per project, across an entire organization, and throughout a folder inside an organization.
Note: to get the project, folder, or organization ID, open the "project picker" at the top of your GCP console; the ID column shows the ID you need for each resource.
Per-project:
gcloud projects add-iam-policy-binding <Project ID> \
  --member=<Service Account> --role=roles/bigquery.dataEditor
Across an Organization:
gcloud organizations add-iam-policy-binding <Organization ID> \
  --member=<Service Account> --role=roles/bigquery.dataEditor
Throughout a folder:
gcloud resource-manager folders add-iam-policy-binding <Folder ID> \
  --member=<Service Account> --role=roles/bigquery.dataEditor
At this point, query data will start being stored in the dataset specified above. To gather a useful amount of data, you should wait at least a few days after running this, or a few weeks for the best results. The more time data is collected from your normal usage, the better, as this will let you find spikes and patterns in usage.
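To confirm the sink is actually writing, you can count today's audit events in the destination dataset. A hedged sanity check using the placeholders from earlier; the table name is the one the sink creates when --use-partitioned-tables is set:

# Count today's BigQuery data-access audit events in the sink dataset
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS events_today
   FROM `<Project ID>.<Dataset Name>.cloudaudit_googleapis_com_data_access`
   WHERE DATE(timestamp) = CURRENT_DATE()'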
INFORMATION_SCHEMA views
The benefit of the INFORMATION_SCHEMA views is that the data already exists if you haven't set up the sink: the views are built into every dataset, project, and region. The trade-offs are the same as noted earlier: each dataset and region combination has its own set of views, so queries may need to be run multiple times, and some information available in the audit log tables is missing here, as documented in comments in the SQL queries later in the series.
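As a preview of how these views will be used later in the series, here is a minimal sketch that surfaces the ten most expensive queries of the past week for a single project and region, with an approximate on-demand cost derived from the $5-per-TB rate. The region qualifier and time window are assumptions; adjust them for your environment.

bq query --use_legacy_sql=false '
SELECT
  user_email,
  job_id,
  total_bytes_billed,
  -- Approximate on-demand cost: bytes billed / 1 TB * $5
  ROUND(total_bytes_billed / POW(2, 40) * 5, 2) AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = "QUERY"
ORDER BY total_bytes_billed DESC
LIMIT 10'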
Up Next
This concludes the first part of this series and serves as the basis for methods described throughout the rest of this series.