This post continues from Part One, where we walked through how to securely onboard a production-scale fleet of IoT devices streaming telemetry data to your Google Cloud environment via IoT Core and Pub/Sub.
Congrats! You've successfully registered several IoT devices. Now what?
Your next goal should be to design a system that enables large-scale storage, analytics, and visualization/dashboarding capabilities on your data.
To do so, you need to design, well in advance, a data flow architecture that supports such large-scale data operations. This article provides a hands-on walkthrough for doing just that.
Overview
This discussion is broken up into the following sections:
- Batch loading into data sinks
- Storage and analysis of data
- Visualization of warehoused data
Unlike Part One, everything discussed here can be performed entirely through the GCP web console. Only basic SQL experience is required.
The following fully-managed and auto-scaling Google Cloud services will be discussed:
- Pub/Sub: a serverless message queue
- Dataflow: a stream and batch data processing engine
- BigQuery: a serverless data warehouse
- Data Studio: a data visualization/dashboard creation service
Batch loading into data sinks
Verify that messages are arriving
If you have successfully onboarded devices into the IoT Core registry and begun streaming data, you should see a steady stream of messages arriving on the main IoT Core dashboard in the GCP console:

As shown in Part One, these messages are also arriving in your "temperature" Pub/Sub topic:

Streaming into BigQuery
Nice, we see messages arriving in Google Cloud. Next, we need to move Pub/Sub messages into a data warehouse where the data can reside for cost-effective long-term retention as well as readily scalable analytics. Enter BigQuery.
BigQuery, Google Cloud's fully-managed, serverless, and auto-scaling data warehouse, allows you to pay for both compute and storage with an on-demand pricing model, making it a great data sink for storing and analyzing our IoT data.
But how do we stream Pub/Sub messages into BigQuery? With Dataflow.
Dataflow, Google Cloud's fully-managed and auto-scaling runner for Apache Beam pipelines, is designed to shuttle data from one service to another. Along the way, you can optionally filter and transform the data, and batch loads optimally into services with limits on load operations, such as databases and data warehouses.
Dataflow contains several Google Cloud-created default templates, including a Pub/Sub-to-BigQuery template, so no coding effort is required to link data ingestion and data storage/analytics services.
Given that Pub/Sub, Dataflow, and BigQuery are all fully-managed and auto-scaling services, and (with the exception of Dataflow) serverless as well, it is possible to build an end-to-end IoT data management system that easily scales from development testing to petabyte-scale operations, with virtually no infrastructure management required as scaling occurs.
Let's see all of these services linked up in action!
Pub/Sub Subscription Setup
Before we begin moving data from Pub/Sub to Dataflow, we should create a Pub/Sub subscription that is subscribed to the Pub/Sub topic.
Why? A Pub/Sub topic does not retain messages on its own: messages published to a topic are fanned out to the topic's subscriptions, and it is the subscription that holds each message until a subscriber acknowledges it. It is possible to connect Dataflow directly to a topic rather than to a subscription, but if such a Dataflow job were to experience downtime, messages hitting the topic while Dataflow is down would be lost.
Instead, by connecting Dataflow to a Pub/Sub subscription that is subscribed to the topic, you prevent messages from being lost during downtime. If a Dataflow job were temporarily interrupted, all IoT messages not yet processed by Dataflow would remain in the Pub/Sub subscription, waiting for the Dataflow job to resume pulling messages.
A Pub/Sub subscription to a Pub/Sub topic creates a data architecture that is resilient to downstream data ingestion service interruptions.
To create a subscription within Pub/Sub:
- Navigate to Subscriptions
- Click "Create Subscription" and name your subscription "temperature_sub"
- Subscribe this to the Pub/Sub topic "temperature"
- Leave the remaining options at their defaults

Once created, if you click on the subscription and click "Pull", you should see messages begin to stream in:

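If you prefer to script this step instead of clicking through the console, the same subscription can be created and spot-checked with the google-cloud-pubsub client library. This is a minimal sketch, assuming the "temperature" topic from Part One already exists; the project ID is a placeholder you will need to swap in:

```
# pip install google-cloud-pubsub
# Minimal sketch: create the "temperature_sub" subscription and pull a few
# messages to verify that telemetry is flowing. PROJECT_ID is a placeholder.
from google.cloud import pubsub_v1

PROJECT_ID = "your-gcp-project-id"  # swap in your own project ID

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path(PROJECT_ID, "temperature")
subscription_path = subscriber.subscription_path(PROJECT_ID, "temperature_sub")

# Create the subscription (equivalent to the console steps above).
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)

# Synchronously pull up to 5 messages, print their payloads, and acknowledge them.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 5}
)
for received in response.received_messages:
    print(received.message.data.decode("utf-8"))

ack_ids = [received.ack_id for received in response.received_messages]
if ack_ids:
    subscriber.acknowledge(
        request={"subscription": subscription_path, "ack_ids": ack_ids}
    )
```

Either way, once messages are visible in the subscription, the ingestion side is ready, and we can turn to the warehouse.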
Storage and analysis of data
Now that we have a Pub/Sub subscription receiving messages, we are almost ready to create a Dataflow job to move Pub/Sub messages into BigQuery. Before setting that up, we need to create a table in BigQuery where the data from Dataflow will arrive.
BigQuery Table Setup
Navigate to BigQuery, click "Create Dataset", and name your dataset "sensordata", with other options left at their default settings:

Once the dataset has been created, select it, click "Create table", and name your new table "temperature". Make sure you include the schema, partitioning, and clustering options shown in the screenshots below, as these options support common query patterns:


If created correctly, your new, empty table will look like the following:

After moving data into the table, we will showcase a common IoT query pattern: performing analytics on data for a specific device within a specific time frame (e.g., a one-hour window on the current day).
The table design shown above is ideal for such queries because:
- Partitioning on the UTC timestamp field lets date-specific queries skip partitions belonging to non-matching days entirely
- Within a partition, clustering (sorting) on device_id and the epoch timestamp allows more efficient retrieval of data for a specific device and time frame within that particular date partition
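If you would rather set up the dataset and table programmatically, here is a minimal sketch using the google-cloud-bigquery client library. The field names and types are assumptions inferred from the queries shown later in this post, so adjust them to match your actual message payload:

```
# pip install google-cloud-bigquery
# Minimal sketch: create the "sensordata" dataset and the partitioned,
# clustered "temperature" table. The schema below is an assumption inferred
# from the queries later in this post; adjust it to match your payload.
from google.cloud import bigquery

PROJECT_ID = "your-gcp-project-id"  # swap in your own project ID
client = bigquery.Client(project=PROJECT_ID)

# Create the dataset (equivalent to "Create Dataset" in the console).
client.create_dataset(bigquery.Dataset(f"{PROJECT_ID}.sensordata"), exists_ok=True)

# Define the table schema.
schema = [
    bigquery.SchemaField("device_id", "STRING"),
    bigquery.SchemaField("temp_f", "FLOAT"),
    bigquery.SchemaField("timestamp_utc", "DATETIME"),
    bigquery.SchemaField("timestamp_epoch", "INTEGER"),
]
table = bigquery.Table(f"{PROJECT_ID}.sensordata.temperature", schema=schema)

# Partition by day on the UTC timestamp, and cluster within each partition
# on device ID and epoch timestamp, mirroring the console setup above.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="timestamp_utc"
)
table.clustering_fields = ["device_id", "timestamp_epoch"]

client.create_table(table, exists_ok=True)
```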
To write these queries, we need data in the table to work with. Let's get that Dataflow job going!
Dataflow Setup
We currently have messages sitting in a Pub/Sub subscription, waiting to be moved elsewhere, and a BigQuery table ready to accept those messages. What we need now is the ETL glue that links the two together. Since Pub/Sub and BigQuery are both fully-managed, auto-scaling, and serverless services, we ideally want an ETL tool that possesses these qualities as well.
Dataflow (mostly) matches these requirements. The marketing around Dataflow claims it is all three, but in truth, it is not fully serverless. You do need to specify the instance types and sizes to use, the minimum and maximum instance counts that auto-scaling can bounce between, and how much temporary disk space each instance will need. You never manage these instances yourself, and Dataflow decides when to scale, but you do have to provide these specifications. This is in contrast to Pub/Sub and BigQuery, which auto-scale with no infrastructure configuration at all.
Despite not being completely serverless, Dataflow is a perfect fit for our Pub/Sub-to-BigQuery ETL requirement. It is also easy to use, particularly given that GCP offers many default Dataflow job templates, including one that supports a Pub/Sub-to-BigQuery workflow. Outside of having to increase the auto-scaling maximum permissible instance count as your IoT data throughput increases over time, you will in theory never have to worry about managing the infrastructure powering Dataflow.
With the basics now understood, let's implement a Dataflow job. Navigate to Dataflow, click "Create Job from Template", and follow these steps:
- Name the job "pubsub-temp-to-bq"
- Use the default streaming template "Pub/Sub Subscription to BigQuery"
- Enter the full Pub/Sub Subscription name
- Enter the full BigQuery table ID
- Enter a Cloud Storage bucket location where temporary data may be stored as part of Dataflow's process for batch loading into BigQuery
- Leave the remaining options at their defaults. You typically would expand the Advanced Options and specify parameters such as a particular machine type and size to use, the min/max machine count auto-scaling values, and disk size per machine. However, for testing purposes, these can be left at their defaults.
Your Dataflow job creation screen should look like the following:

After hitting "Create" and waiting a few minutes for the underlying infrastructure to spin up and start running, you will see data flowing from the Pub/Sub subscription into the destination BigQuery table.
The Python temperature streaming script provided in Part One streams at a rate of one record per second. Thus, in the Dataflow Directed Acyclic Graph (DAG) shown below, you should see x number of elements streaming per second, where x is the number of devices you are testing with. In my case, there are three devices streaming:

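The console is all you need here, but if you ever want to launch the same job from a script (for example, to recreate it after an interruption), the Google-provided classic template can also be launched through the Dataflow API. Below is a minimal sketch; the template path and parameter names (inputSubscription, outputTableSpec) reflect my understanding of the classic "Pub/Sub Subscription to BigQuery" template, so verify them against the current documentation before relying on this:

```
# pip install google-api-python-client
# Minimal sketch: launch the "Pub/Sub Subscription to BigQuery" classic template
# programmatically. The template path and parameter names are assumptions;
# check them against the Dataflow template documentation.
from googleapiclient.discovery import build

PROJECT_ID = "your-gcp-project-id"  # swap in your own project ID
REGION = "us-central1"              # the region you chose in the console
TEMP_BUCKET = "your-temp-bucket"    # the Cloud Storage bucket for temp data

dataflow = build("dataflow", "v1b3")
response = (
    dataflow.projects()
    .locations()
    .templates()
    .launch(
        projectId=PROJECT_ID,
        location=REGION,
        gcsPath="gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery",
        body={
            "jobName": "pubsub-temp-to-bq",
            "parameters": {
                "inputSubscription": f"projects/{PROJECT_ID}/subscriptions/temperature_sub",
                "outputTableSpec": f"{PROJECT_ID}:sensordata.temperature",
            },
            # Temporary storage used while Dataflow batch loads into BigQuery.
            "environment": {"tempLocation": f"gs://{TEMP_BUCKET}/temp"},
        },
    )
    .execute()
)
print(response)
```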
Once you see that the Dataflow job is active and successfully streaming Pub/Sub subscription data to BigQuery, you can run a query with the following format in BigQuery and see real-time data hitting the table:
SELECT * FROM `iottempstreaming.sensordata.temperature` WHERE DATE(timestamp_utc) = "2020-12-18" ORDER BY timestamp_epoch DESC LIMIT 10

We can confirm that partition pruning is taking place by observing that more total data is scanned when the day-filtering WHERE clause is removed.
With my example dataset, 1.1 MB of filtered data is scanned (as seen above) and 1.7 MB of unfiltered data is scanned (shown below):
SELECT * FROM `iottempstreaming.sensordata.temperature` ORDER BY timestamp_epoch DESC LIMIT 10

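If you want to check partition pruning without spending any query budget, BigQuery dry runs report the bytes a query would scan without actually processing any data. Here is a minimal sketch using the Python client (the table ID below matches my example project; swap in your own):

```
# Minimal sketch: compare bytes that would be scanned with and without the
# partition filter, using BigQuery dry runs (nothing is processed or billed).
from google.cloud import bigquery

client = bigquery.Client()
table = "iottempstreaming.sensordata.temperature"  # swap in your own table ID

filtered = f"""
    SELECT * FROM `{table}`
    WHERE DATE(timestamp_utc) = "2020-12-18"
    ORDER BY timestamp_epoch DESC LIMIT 10
"""
unfiltered = f"""
    SELECT * FROM `{table}`
    ORDER BY timestamp_epoch DESC LIMIT 10
"""

config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
for label, sql in [("filtered", filtered), ("unfiltered", unfiltered)]:
    job = client.query(sql, job_config=config)
    print(f"{label}: {job.total_bytes_processed / 1e6:.1f} MB would be scanned")
```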
Let's see what the average, minimum, and maximum temperature values are from each sensor within the past hour:
SELECT
  device_id,
  ROUND(AVG(temp_f), 1) AS temp_f_avg,
  MIN(temp_f) AS temp_f_min,
  MAX(temp_f) AS temp_f_max
FROM `iottempstreaming.sensordata.temperature`
WHERE timestamp_utc > DATETIME_ADD(CURRENT_DATETIME(), INTERVAL -60 MINUTE)
GROUP BY device_id

Congratulations! You have just set up a data workflow that is fully managed from end to end, from data ingestion to the analytics backend. Before we wrap this walkthrough up, let's quickly check out how easily this data can be visualized with Data Studio.
Visualization of warehoused data
Start by running a query similar to the following in BigQuery, which grabs all rows of data from a particular day:
SELECT * FROM `iottempstreaming.sensordata.temperature` WHERE DATE(timestamp_utc) = "2020-12-18" ORDER BY timestamp_epoch DESC
To the right of "Query Results", click on "Explore Data", then "Explore with Data Studio":

This will load the data we just queried into Data Studio. By default, however, it will show a rather uninteresting table summarizing the total number of records streamed per second.
Let's change a few values under the Data section on the right to make this more interesting:
- Select "Line Chart" as the visualization type rather than "Table"
- Remove "Record Count" as the viewed Metric and replace it with "temp_f". Make sure to change the default "SUM" aggregation to "AVG".
- Add "device_id" as a breakout dimension
Your choices should result in chart settings similar to the following:

The chart this produces will show temperature values for each device over time, but it may not be scaled well, since the default minimum y-axis value is zero. To fix this, click on the "Style" tab, scroll down to the "Left Y-Axis" option, and change the axis bounds to reasonable values:

You may also want to increase the number of data points that can be present on the chart:

With these updates, you should have a beautiful, interactive chart that allows you to scroll through device temperature values as they fluctuate over time:

Next up: Machine Learning
Stay tuned for part three where we will build a functional machine learning model on this BigQuery dataset and use it to generate real-time predictions.