Solution Walkthrough: Visualizing Daily Cloud Spend on GCP using GKE, Dataflow, BigQuery and Grafana
For any successful cloud adoption, gaining comprehensive visibility into ongoing cloud spend is essential - no one wants to receive a bill higher than expected when trying to plan a budget.
In the case of Google Cloud Platform, there are different charging models for different resources. For example, GCE cost depends on machine type (based on CPU, memory, network and disks), Google Kubernetes Engine (GKE) and Google Cloud Dataproc charges are based on all running nodes on Google Compute Engine (GCE), while some other service costs have more complex formulas. It becomes increasingly difficult to predict cloud spends, especially when you use a lot of different resources. It is important to be aware of the spending and be able to respond in time if it becomes too expensive. If you can proactively monitor billing reports every day, the probability of receiving a surprising bill at the end of the month is drastically reduced.
In a recent project, we wanted to implement an extended billing report functionality to efficiently track spending. While GCP allows exporting billing data into BigQuery for further comprehensive analysis, manually querying BigQuery tables is unlikely sufficient enough; ideally, an operator is able to visualise data and to filter and aggregate by additional parameters. The most common filters are resource type, time period, team and department, and comparison with previous periods.
The first solution we tried was based on Google’s suggestions to use Google Data Studio. While the implementation was very straightforward because of the sample reports and dashboards already available, and only data source configurations were required, the solution was not flexible enough. At creating a chart in Google Data Studio, the user needs to choose all parameters manually instead of providing a formula. To make the billing report more user-friendly we decided to look for another way.
Based on our previous experience we thought that Grafana would be a good option to visualise billing data if we could find a way to connect it to BigQuery (BQ). We tried an open-source BQ plugin for Grafana, but it contained a lot of bugs and was not stable enough. Additionally, there was another issue - BQ jobs take too much time to return data.
Eventually we decided to load data to PostgreSQL in CloudSQL (because CloudSQL is the easiest way to have a relational database instance) and use Grafana for visualisation. Grafana has an official PostgreSQL plugin, we tested it and realized that it was the best fit to our requirements.
The diagram below is an overview to the workflow we implemented. We deployed a cronjob to the Kubernetes cluster that triggered a Cloud Dataflow job every 4 hours. The dataflow job initiated data loading from BQ to PostgreSQL. It checked max(export_time) in PostgreSQL and loaded data from BQ incrementally, i.e.only importing the data since the last BQ export. The end-users could monitor Grafana dashboards connected to the PostgreSQL instance for the latest billing data.
Setting up the Database
Export Billing Data to BigQuery
First, we set up billing data export to BigQuery, as described here.
Create a PostgresQL Database
The easiest way to create a PostgreSQL instance is by using Cloud SQL.
After the instance was up and running, we created a number of database objects, including a table that mimicked the BQ table structure and a materialized view with indexes that we would use for connecting to PostgreSQL from Grafana.
It is a good practice to add a minimal set of labels to all resources. Every label from this set is represented as a separate column in the view. We also created indexes on these columns to make the Grafana queries faster.
After that we also created indexes on the view. The indexes could be created later -- it was more important to first understand how the queries would look like before creating the indexes.
Populating the Database
Create a Service Account
A service account with access to DataFlow and BigQuery was needed as this was how the DataFlow job would retrieve the BigQuery data.
Create buckets for Cloud DataFlow job
Cloud DataFlow jobs need 2 buckets to store the temporary and outputs respectively.
Create a script for DataFlow job to load data from BigQuery to CloudSQL PostgreSQL
For data consistency we defined max(export_time) in PostgreSQL and loaded records from BQ starting from this time.
We also needed a requirements.txt file that contains a list of packages to be installed on workers. In our case we needed only one package beam-nuggets.
This was how the main part of the script (bq-to-sql.py) looked.
After the data was loaded we needed to refresh the materialized view. Since normally the refresh would take some time, it was also possible to create a new materialized view with the same structure (and the corresponding indexes), delete the old view and rename the new one to the old name.
Create a JSON file with SA credentials The same service account used earlier in the Cloud Dataflow workflow is also used in the cron job. The following command created a private key of the service account, that we subsequently uploaded to the Kubernetes cluster as secret to be accessed in the cron job.
Deploy a secret to your K8s cluster
Create the Docker image
We wanted the DataFlow job to run on a daily basis. First of all, we created a Docker image with all the needed environment variables and the Python script.
To create a file with commands to run:
The content of the Dockerfile
In this case, we used Helm for the cron job deployment for ease of deployment and reuse.
Visualising the Data with Grafana
The last step was the creation of the Grafana dashboards. While we could use any charts that made sense, we used Data Studio Billing Report Demo as inspiration (of course we had to write all SQL queries from scratch).
Using a separate user in the database with read-only access to the view with billing data and to connect from Grafana is recommended as a security best practice.
Here are some examples of charts.
The above is a walkthrough of how we implemented a workflow to visualize billing data. The mechanism also supports adding new filters and metrics. At implementation roll out, the client got a set of useful and fast dashboards that help to monitor Google Cloud spending. With this insight, the client is now empowered to make further cost optimizations.
Written by Dariia Vasilenko
How do manage your cloud spending? Are you wondering if you have workloads that can be optimized? Please let us know at firstname.lastname@example.org. We’d love to hear from you!