Data Warehousing Tools
If you’re anywhere in your journey with data analytics, and haven’t been living under a rock (or in a server room) for the last decade, then you’ve certainly heard of data warehouses like those from Oracle or IBM, or even recently-emerged industry giants like Google BigQuery, Amazon Redshift, Microsoft Azure, and Snowflake.
Since their introduction and early evangelism by Bill Inmon and Ralph Kimball in the beginning of the 1990’s, data warehouses emerged to provide a single, definite source of truth — once dispersed amongst an enterprise’s often multiple and disparate data stores — that’s used to show what’s actually happening in the business. At the time data warehouses started to appear, they were an evolutionary step away from multiple on-premise data stores, privately hosted database servers, and even disparate spreadsheets that separate teams within an enterprise often used to capture their data for metrics and analytics.
In order to derive value from your data, you need not only have it in one place and using a single, canonical access language and interface, but you must also have a means to manage metadata, handle governance issues, and scale as your data grows. These are among the many challenges that data warehouses solve.
In this article, we’ll cover a few of the best-known data warehouses, ETL tools, and business intelligence (BI) tools.
Data warehouses
Strictly defined, data warehouses consist of a large store of data gathered from many (often separate) sources that an enterprise uses to guide its decisions.
Part of the broader evolutionary trend of the data ecosystem (if not software generally) is how a do-it-yourself ethos tends to become gradually replaced over time with expert solutions managed by specialists. The first data warehouse solutions were on-premise, and, while those remain, there is also a proliferation of cloud-native solutions which have entered the space.
The functionality for data warehouses is the same for on-prem and cloud native, although one can expect on-prem solutions to contain specific functionality for installation, sharding, replication, scaling and configuration that is absent from the cloud versions, as the latter solutions are generally managed. Data warehouses all:
- store a large repository of integrated data imported from one or many disparate sources, for a single source of truth across an organization;
- require (and in some cases, enable) data cleansing, deduplication, or schema adjustments to be done on imported data;
- enable Machine Learning or AI to be run on large datasets to identify trends, discover hidden relationships and/or predict future events;
- enable data to be queried into different formats for consumption by different stakeholders, or exported into different systems or visualization frameworks;
- allow the generation of custom reports or ad-hoc analysis;
- facilitate data mining;
- serve data scientists, analysts, and other data consumers.
On-premise data warehouses
Using an on-prem solution naturally involves purchasing, installing, and maintaining your own hardware for storing the contents of your data warehouse, in addition to managing the data it stores.
For certain companies with large, established data warehousing infrastructure, or companies with major concerns over accessibility (millisecond response times) or data-security, on-prem solutions may still be the best option.
Here’s a list of common on-prem data warehouse solutions:
- IBM
- Oracle
- Teradata
Cloud-native data warehouses
Cloud-native data warehouses involve purchasing a solution hosted in the cloud, and funnelling data to it, usually through an API or some other means. Because of the advantages cloud-native solutions provide, nearly all providers of traditionally on-prem solutions have a cloud offering. Cloud-based data warehouses are cost-effective, quick and easy to prepare, can scale without any extra effort, have security built in, and support multi-tenancy.
What’s more, cloud native users benefit from delegating maintenance and management of their DWs to third parties. In addition to the labor that’s freed up (for analytics or other activities), users need neither outlay an initial hardware cost nor worry about what to do with excess hardware when scaling down.
Here’s a list of common cloud-native data warehouse solutions:
- Amazon Redshift
- Google BigQuery
- Microsoft Azure
- Snowflake
ETL tools
ETL stands for “Extract, Transform, and Load” and consists of the tools and processes used for pulling data from one store, transforming it for placement, and finally, loading it into another (often aggregate) store. Just as with data warehouses, ETL tools have progressed over time from self-administered to cloud-native offerings.
Batch run/incumbent ETL tools
Remember when you used to see your bank account updated a day after your most recent financial transaction? That’s because historically, many organizations used free compute and storage resources to perform nightly batches of ETL jobs. Some organizations and processes still work this way.
Here’s a list of common batch run/incumbent ETL tools:
- IBM InfoSphere DataStage
- Informatica Power Center
- Microsoft SSIS
- Oracle Data Integrator
Open source ETL tools
These solutions are the evolutionary middle step between incumbent batch-based tools and fully managed cloud-based solutions. They solve some of the problems that batch run tools do not, for example, handling real-time streaming data.
Here’s a list of common open source ETL tools:
- Apache Kafka
- Apache NiFi
- CloverETL
- Jaspersoft
- Pentaho Kettle
- Talend Open Studio
Open source ETL tools have some drawbacks, but are generally a good choice when a customer isn’t seeking a commercial solution.