MySQL to Google BigQuery Replication Guide

One of the biggest advantages of utilizing BigQuery for analytics instead of using a third-party off-the-shelf analytics tool (such as Google Analytics, Mixpanel, Localytics, etc.) is the ability to join multiple data sources. For example: joining analytic events coming from your mobile app with user data stored in your MySQL database. This allows you to answer deeper questions like “Which of my customers are spending the most per order?” or “What user behavior do site visitors exhibit before making their first purchase?”
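
For instance, once both data sources live in BigQuery, answering a question like the first one might look roughly like the sketch below, using the BigQuery Python client library. The dataset, table, and column names here are purely hypothetical:

    # Hypothetical join between analytics events and users replicated from MySQL.
    # Dataset, table, and column names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
        SELECT u.email, AVG(o.total) AS avg_order_value
        FROM analytics.orders AS o
        JOIN mysql_replica.users AS u ON u.id = o.user_id
        GROUP BY u.email
        ORDER BY avg_order_value DESC
        LIMIT 10
    """

    for row in client.query(query).result():
        print(row.email, row.avg_order_value)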

We will begin with a high-level overview and then move into the technical details, since this is a fairly technical topic.

At a glance, this article covers:

  • Why companies replicate their MySQL databases to Google BigQuery
  • The different approaches for MySQL replication
  • A detailed real-world example implementation of MySQL to Google BigQuery replication
  • Common pitfalls to avoid when building your replication solution

Why replicate MySQL to Google BigQuery?

It’s very common these days to have multiple replicas of your data in several different types of data stores. That’s because different parts of your application or business use data in different ways. By replicating your data to multiple data stores, each optimized for a different use case, you are able to optimize the performance of your system.

Perhaps the most common and straightforward example is read replicas for your transactional database. Many applications also store their transactional data in Elasticsearch, for example, to take advantage of its search speed and rich feature set. Others store raw data in S3 for backup, in Memcached for caching, and so on.

Back to our use case, data teams tend to replicate their MySQL data to BigQuery for three main reasons:

  1. Joining multiple data sources – The main motivation for moving to BigQuery is the ability to join and intersect multiple data sources. For example, joining billing data and customer service data with user behavior data. This kind of analysis allows you to ask questions such as “Which features lead to higher retention rates and more conversions to paid plans?” or “Which user behaviors increase churn?”, and to make predictions such as “Which users are likely to become paying customers?”
  2. Analytical query performance – While MySQL is optimized for Online Transaction Processing (OLTP) workloads, BigQuery is geared towards aggregating and analyzing large amounts of data, taking advantage of its columnar data store and massively parallel processing architecture. These kinds of analytical queries tend to perform very poorly in MySQL and risk overloading and blocking your operational transactional database. By replicating MySQL to Google BigQuery, developers are able to improve their analytical query speed by orders of magnitude. In addition, separating the operational database from the analytical database means that analytical queries will never affect the performance of the production database.
  3. Near real-time access – Another key advantage of replicating MySQL production data into BigQuery is its near real-time nature. By using the Google BigQuery streaming API, new data written to the binary log in MySQL typically becomes available for querying in BigQuery within a few minutes (a minimal streaming-insert sketch follows this list). This enables data analysis on BigQuery in virtually real time.
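
As a rough illustration of the streaming API mentioned above, a minimal streaming insert with the Python client looks like the sketch below; the table ID and row fields are placeholders:

    # Minimal sketch of a BigQuery streaming insert. Rows inserted this way
    # are typically queryable within seconds to minutes. Names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.mysql_replica.users"  # placeholder table

    rows = [
        {"id": 42, "email": "jane@example.com", "updated_at": "2024-01-01T00:00:00Z"},
    ]

    errors = client.insert_rows_json(table_id, rows)
    if errors:
        print("Streaming insert failed:", errors)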

Ok, so now that we know why you would want to replicate MySQL data to Google BigQuery, let’s talk about how.

Different approaches for MySQL replication

There are several main approaches for database replication. In this section we’ll describe the three most common approaches and compare the advantages and drawbacks of each of them.

Full dump and load

In this approach, periodically, the MySQL tables are fully dumped, the corresponding Google BigQuery tables are dropped and recreated, and the full dumps of the MySQL tables are loaded to BigQuery.

The advantage of this approach is that it is very simple and straightforward. The disadvantage is that dumps are very resource intensive and can slow down the MySQL database while the tables are being dumped. To get a consistent dump of a table, you might even need to lock the database for the duration of the dump. In addition, a full dump of very large tables incurs high latency.

Therefore, this approach is mostly recommended if you have very small tables.

If you do choose to use this approach, it would be best to use a replica instead of your master database, since using the latter might block your database and interfere with your production application.
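
To make the mechanics concrete, here is a minimal sketch of one full dump-and-load iteration in Python, assuming the pymysql and google-cloud-bigquery client libraries; hosts, credentials, and table names are placeholders:

    # Sketch of one full dump-and-load iteration for a small table: read the
    # whole table from a MySQL read replica and overwrite the corresponding
    # BigQuery table. Hosts, credentials, and names are placeholders.
    import csv
    import io

    import pymysql
    from google.cloud import bigquery

    mysql_conn = pymysql.connect(host="replica-host", user="report",
                                 password="...", database="shop")
    bq = bigquery.Client()

    # Dump the table to an in-memory CSV.
    buf = io.StringIO()
    writer = csv.writer(buf)
    with mysql_conn.cursor() as cur:
        cur.execute("SELECT id, email, created_at FROM users")
        for row in cur:
            writer.writerow(row)

    # Recreate the BigQuery table from the dump (WRITE_TRUNCATE drops old rows).
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        schema=[
            bigquery.SchemaField("id", "INT64"),
            bigquery.SchemaField("email", "STRING"),
            bigquery.SchemaField("created_at", "TIMESTAMP"),
        ],
    )
    job = bq.load_table_from_file(
        io.BytesIO(buf.getvalue().encode("utf-8")),
        "my-project.mysql_replica.users",
        job_config=job_config,
    )
    job.result()  # wait for the load job to finish

Scheduled against a read replica (for example, as a nightly job), something along these lines is usually sufficient for small tables.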

Incremental dump and load

In this approach, the MySQL table is periodically queried for updates since the last query. The updates are then loaded into Google BigQuery and a consolidation query reconstructs the original table.

The main advantage of this approach over a full dump and load is that in each iteration, only the updates are extracted and loaded. Typically this significantly reduces the load, since normally only a small portion of the database’s rows are updated between two iterations.

Perhaps the most significant disadvantage of this approach is that it cannot capture row deletions, since deleted rows will never be returned by the query. Similarly, table alterations (schema changes such as added or removed columns) are not captured unless they are actively queried for in each iteration. Another, more minor disadvantage is that it requires each replicated table to contain an “updated at” column that the periodic query can filter on. The necessary consolidation step also adds some complexity to the implementation.

Similarly to the full dump and load approach, it’s best to use a replica and not the master database for the incremental dump and load to avoid blocking the master database.
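
A single incremental iteration might look like the sketch below, again assuming pymysql and the BigQuery Python client. It presumes an updated_at column, a staging table, and placeholder names and credentials throughout:

    # Sketch of one incremental iteration: extract rows changed since the last
    # run, load them into a staging table, and merge them into the replica.
    # Assumes an updated_at column; all names and credentials are placeholders.
    import pymysql
    from google.cloud import bigquery

    bq = bigquery.Client()
    mysql_conn = pymysql.connect(host="replica-host", user="report",
                                 password="...", database="shop")

    last_sync = "2024-01-01 00:00:00"  # persisted from the previous iteration

    # 1. Extract only the rows touched since the last run.
    with mysql_conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SELECT id, email, updated_at FROM users "
                    "WHERE updated_at > %s", (last_sync,))
        changed = cur.fetchall()

    if changed:
        # 2. Replace the staging table's contents with the changed rows.
        job_config = bigquery.LoadJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
        bq.load_table_from_json(
            [{**r, "updated_at": r["updated_at"].isoformat()} for r in changed],
            "my-project.mysql_replica.users_staging",
            job_config=job_config,
        ).result()

        # 3. Consolidation query: merge the staged updates into the replica.
        bq.query("""
            MERGE mysql_replica.users AS t
            USING mysql_replica.users_staging AS s
            ON t.id = s.id
            WHEN MATCHED THEN
              UPDATE SET email = s.email, updated_at = s.updated_at
            WHEN NOT MATCHED THEN
              INSERT (id, email, updated_at) VALUES (s.id, s.email, s.updated_at)
        """).result()

The last_sync value would then be advanced to the largest updated_at seen in this batch and persisted for the next iteration.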

Binlog replication

This approach (sometimes referred to as change data capture – CDC) utilizes MySQL’s binlog. The binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation performed by the database, as well as the Data Definition Language (DDL) statements it executes. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.

In our opinion, this is the optimal approach. It is the only method that allows near real-time replication at scale. Its main advantages are that it doesn’t lock or affect the performance of the database, it supports both deletions and table alterations (and therefore enables exact one-to-one replication), it doesn’t impose any requirements on the structure of the tables, and it fits naturally with the stream processing paradigm, which allows transformations and near real-time performance. Not surprisingly, it is also the most complex approach to implement, and that is its only real disadvantage.
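
For illustration, a skeleton of binlog streaming in Python is sketched below, using the open-source python-mysql-replication library together with the BigQuery streaming API. Nothing about the approach requires this particular library, and all connection settings and table names are placeholders:

    # Skeleton of binlog-based CDC: tail the MySQL binlog with the open-source
    # python-mysql-replication library and stream each change into BigQuery.
    # Requires binlog_format=ROW on the MySQL server. Connection settings and
    # table names are placeholders.
    from google.cloud import bigquery
    from pymysqlreplication import BinLogStreamReader
    from pymysqlreplication.row_event import (
        DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent)

    bq = bigquery.Client()
    changelog_table = "my-project.mysql_replica.users_changelog"  # placeholder

    stream = BinLogStreamReader(
        connection_settings={"host": "mysql-host", "port": 3306,
                             "user": "repl", "passwd": "..."},
        server_id=100,          # must be unique among the server's replicas
        only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        blocking=True,          # keep tailing the binlog as new events arrive
        resume_stream=True,
    )

    for event in stream:
        for row in event.rows:
            if isinstance(event, WriteRowsEvent):
                change = {"op": "insert", **row["values"]}
            elif isinstance(event, UpdateRowsEvent):
                change = {"op": "update", **row["after_values"]}
            else:  # DeleteRowsEvent
                change = {"op": "delete", **row["values"]}
            # Convert datetimes and similar types so the row is JSON-serializable.
            change = {k: (v.isoformat() if hasattr(v, "isoformat") else v)
                      for k, v in change.items()}
            # One insert per row keeps the sketch short; a real pipeline would
            # batch rows. A periodic consolidation query can then rebuild the
            # one-to-one replica from this changelog table.
            bq.insert_rows_json(changelog_table, [change])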
