Implementing a data warehouse is a strong step toward managing data at an enterprise level, whether for data management purposes or for business intelligence efforts.
Some believe that popular Hadoop cluster systems will soon replace data warehouses altogether. However, as elaborated on in the Hadoop Integration section of this guide, Hadoop’s applications are distinct from, and potentially even aided by, the use of data warehouses. When looking to implement such a system, it's important to understand the latest developments in the market.
This guide covers the factors one should consider when selecting and implementing a data warehouse tool.
The primary application of a data warehouse is consolidating data from many sources into one location for analysis and business intelligence (BI) purposes.
This could mean that you have point-of-sale (POS) data, enterprise resource planning (ERP) data, and accounting data stored in separate systems, all of which you want to analyze together. This could help you report on the state of your business, or find data correlations between the different systems.
Another scenario is when a central hub is needed to allow the efficient transfer of data across dozens or potentially hundreds of enterprise systems.
For example, patient records from an electronic health record (EHR) system may need to be fed into billing software. A data warehouse can collect and standardize the data from the EHR, then transfer it to the billing software.
This is better than transferring data directly from the EHR to the billing software because, if the billing software were swapped out for a new system, the direct transfer between the EHR and the billing software would have to be reprogrammed, and the data potentially reformatted. In the data warehouse scenario, if the billing software were replaced, the data transfer from the EHR to the data warehouse would be unaffected; only the transfer between the data warehouse and the new billing software would have to be redone. As you can imagine, if data from several sources all had to be fed in and out of the billing software, the benefit of the data warehouse would be even greater.
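To make the hub-and-spoke idea concrete, here is a minimal, hypothetical Python sketch; the table name, field names, and connector functions are illustrative, not taken from any specific product. Each source system loads records into a shared staging table in the warehouse, and each destination reads from it, so replacing the billing software only means rewriting one connector.

```python
import sqlite3

# Hypothetical warehouse staging table: every source loads patient records here
# in one standard shape, and every downstream system reads from it.
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("""
    CREATE TABLE IF NOT EXISTS patient_staging (
        patient_id TEXT, name TEXT, visit_date TEXT, charge_amount REAL
    )
""")

def load_from_ehr(ehr_rows):
    """EHR -> warehouse: standardize field names once, regardless of consumers."""
    warehouse.executemany(
        "INSERT INTO patient_staging VALUES (?, ?, ?, ?)",
        [(r["mrn"], r["patient_name"], r["encounter_date"], r["charges"])
         for r in ehr_rows],
    )
    warehouse.commit()

def export_to_billing():
    """Warehouse -> billing: the only piece that changes if billing is replaced."""
    return warehouse.execute(
        "SELECT patient_id, name, visit_date, charge_amount FROM patient_staging"
    ).fetchall()

# Example: one EHR record flows through the hub to the billing export.
load_from_ehr([{"mrn": "A123", "patient_name": "J. Doe",
                "encounter_date": "2024-01-15", "charges": 250.0}])
print(export_to_billing())
```

The point of the design is that the EHR connector never needs to know which billing system, or how many other systems, will consume the standardized records.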
Whether a data warehouse is warranted depends less on the sheer volume of data, in gigabytes, terabytes, or petabytes, than on the number of sources that need to be integrated.
As described above, the functions of a data warehouse are primarily to bring data from multiple sources together under one roof either to be analyzed or to be transferred more efficiently between systems.
If the number of systems between which data needs to be transferred or aggregated is a mere handful, and the quantity of data in them is only tens of gigabytes, then it may be more appropriate to have a simple database with the data from each system stored in tables. This volume of data is small enough to be analyzed in that single database. Additionally, the transfer of data between such systems can be done either manually or with automated processes developed by analysts. Such a solution will require significantly less capital and fewer resources than a full data warehouse implementation.
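As an illustration of that lighter-weight approach, the sketch below loads exports from a couple of systems into tables of a single SQLite database and joins them for analysis; the file names and columns are hypothetical, but this is roughly the kind of work an analyst can automate without a full data warehouse.

```python
import csv
import sqlite3

db = sqlite3.connect("company_data.db")

def load_csv(table, path, columns):
    """Load one system's export (POS, ERP, accounting, ...) into its own table."""
    placeholders = ", ".join("?" for _ in columns)
    db.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    with open(path, newline="") as f:
        rows = [[row[c] for c in columns] for row in csv.DictReader(f)]
    db.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    db.commit()

# Hypothetical exports from two separate systems.
load_csv("pos_sales", "pos_export.csv", ["order_id", "store", "amount"])
load_csv("erp_orders", "erp_export.csv", ["order_id", "product", "cost"])

# Cross-system analysis in the same database: margin per order.
for row in db.execute("""
    SELECT s.order_id, s.amount - e.cost AS margin
    FROM pos_sales s JOIN erp_orders e ON s.order_id = e.order_id
"""):
    print(row)
```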
Because of these factors, quantifying a minimum data requirement is difficult. Data warehouse implementations are unique to each scenario. Reviewing the following considerations and case study can help you get a feel for whether a data warehouse is the right solution for your organization.
Data warehouse tools are largely defined by their ability to expand in volume, incorporate new data sources, and add new capabilities as an enterprise undergoes planned and unplanned growth.
Different solutions offer different methods of access, whether from the data warehouse itself, your internal network, or even the web. Most utilize online analytical processing (OLAP) protocols.
You will want to consider whether data is immediately accessible after being loaded, or whether there is extensive data latency. This affects your ability to work with real-time data. Likewise, performance monitoring is important in determining whether you can perform an ETL load at the same time as a data mining procedure, or if you’ll need to plan your ETL pulls out of the way (such as 3 AM on a Sunday morning) to avoid interfering with the performance of your data analysis.
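One common way to keep heavy loads out of the way is simply to restrict them to an off-hours window. The sketch below is a hypothetical illustration of that idea; the window, function names, and scheduler entry are assumptions, not features of any particular warehouse product.

```python
from datetime import datetime

# Hypothetical off-hours window: only run heavy ETL loads early Sunday morning
# so they do not compete with analysts' daytime queries.
def in_etl_window(now=None):
    now = now or datetime.now()
    return now.weekday() == 6 and 2 <= now.hour < 5  # Sunday, 2:00-4:59 AM

def run_nightly_load(extract, load):
    if not in_etl_window():
        print("Outside the ETL window; deferring load to avoid query contention.")
        return
    load(extract())

# In practice the same idea is usually expressed as a scheduler entry,
# e.g. a cron line such as:  0 3 * * 0  /usr/local/bin/run_etl.sh
```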
Lastly, some data warehouses have limits on the number of simultaneous users, and setting permission levels to limit user access may be an important consideration for your organization.
An integration, in data warehouse terms, is a pre-built system for transferring data from, and sometimes to, a particular data source, such as Salesforce or QuickBooks.
Integrations for various data sources have a large impact on the speed of deployment and ease of expansion. If the data warehouse provider already integrates with a new data source (such as a new CRM program) that you’d like to feed into your warehouse, then adding it is as simple as turning it on. Without a prior integration, developers will have to create the schema for the ETL process. This process takes time and quickly becomes expensive.
In the case that custom integrations are necessary, it’s important to consider whether the provider’s developers will create custom integrations, or if there are APIs or even GUI interfaces available for custom integration schemas and mapping. Some data warehousing providers have tools and interfaces designed so that even non-IT personnel can create a schema for a simple database.
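The core of such a custom integration is usually a mapping schema: a declaration of how each source field corresponds to a warehouse column, plus any transformation along the way. The sketch below is a hypothetical, simplified version of that idea in Python; the field names and transforms are invented for illustration and do not represent any vendor's API.

```python
# Hypothetical mapping schema: source field -> (warehouse column, transform).
# Warehouse tools typically expose similar ideas through their own APIs or GUIs.
CRM_MAPPING = {
    "AccountName": ("customer_name", str.strip),
    "AnnualRevenue": ("annual_revenue", float),
    "CreatedDate": ("created_date", lambda s: s[:10]),  # keep YYYY-MM-DD
}

def apply_mapping(record, mapping):
    """Turn one source record into a row shaped for the warehouse table."""
    return {dest: transform(record[src]) for src, (dest, transform) in mapping.items()}

crm_record = {"AccountName": " Acme Corp ", "AnnualRevenue": "1200000",
              "CreatedDate": "2024-03-02T09:15:00Z"}
print(apply_mapping(crm_record, CRM_MAPPING))
# {'customer_name': 'Acme Corp', 'annual_revenue': 1200000.0, 'created_date': '2024-03-02'}
```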
Text analytics, or the analysis of unstructured data, is common among data warehousing solutions. Inevitably, some solutions are more capable than others when it comes to generating structured data from unstructured data (through the identification of metadata and patterns).
Likewise, some systems are more time-efficient at analyzing unstructured data, which can make a significant difference in performance.
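For a rough sense of what "generating structured data from unstructured data" means in practice, here is a toy Python sketch that pulls a few structured fields out of a free-form note using simple patterns; the note, field names, and patterns are invented for illustration, and real text analytics features are far more sophisticated.

```python
import re

# Toy example: derive structured fields from unstructured text by pattern matching.
NOTE = "Customer called 2024-05-01 about ticket #48231; the X200 unit overheats."

record = {
    "date": re.search(r"\d{4}-\d{2}-\d{2}", NOTE).group(),
    "ticket": re.search(r"#(\d+)", NOTE).group(1),
    "product": re.search(r"\b(X\d{3})\b", NOTE).group(1),
}
print(record)  # {'date': '2024-05-01', 'ticket': '48231', 'product': 'X200'}
```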
Basic data mining functionality is often standard in a data warehousing solution since the warehouse is where you will be drawing data from for your company-wide business intelligence efforts. Some data warehousing tools, however, do provide features for more advanced data modeling, scoring, standard reporting from templates, and even visualization at the source of the data.
Determining which features you need will depend on what external BI solution you use to analyze your data, and whether you will be extracting smaller subsets of the data for your analysis or pulling from the entire data set.
Hadoop integration is important when the volume of data your enterprise collects and analyzes is so large that it becomes infeasible or too costly to feed into one central location. Hadoop uses parallel processing abilities to analyze data at the source, across multiple servers. One of the sources at which Hadoop performs parallel data processing can be a data warehouse, which is why Hadoop does not entirely eclipse data warehousing technology, but rather complements it.
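The sketch below is a simplified illustration of the map/reduce idea behind that parallel processing, written with Python's standard library rather than as an actual Hadoop job; the data and function names are hypothetical. Each "server" summarizes its own slice of the data locally, and only the small summaries are combined centrally, instead of shipping all raw records to one location.

```python
from collections import Counter
from multiprocessing import Pool

def summarize_partition(records):
    """Map step: summarize one partition of the data where it lives."""
    return Counter(r["error_code"] for r in records)

if __name__ == "__main__":
    partitions = [
        [{"error_code": "E1"}, {"error_code": "E2"}],   # data held on server 1
        [{"error_code": "E1"}, {"error_code": "E1"}],   # data held on server 2
    ]
    with Pool(2) as pool:
        partial_counts = pool.map(summarize_partition, partitions)
    # Reduce step: combine the small per-partition summaries centrally.
    total = sum(partial_counts, Counter())
    print(total)  # Counter({'E1': 3, 'E2': 1})
```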
If the data warehouse will be storing and backing up data from your entire company, and if certain data transfers must be running at all times, then you will want to have 24/7 phone and tech support.
If your data warehouse is particularly important, then you will want to review the provider’s service level agreement (SLA) for uptime guarantees, as further described below.
Cloud vs. On-Premise
The main considerations with data warehousing when deciding between cloud or on-premise solutions are speed of deployment and uptime/maintenance.
The second greatest factor in speed of deployment, after integration, is physical installation. Cloud-based warehouses require fewer technical resources, such as servers, as well as less IT maintenance. A cloud data warehouse solution with pre-built integrations for all your data sources can be set up quite rapidly.
However, with a cloud-based data warehouse you are at the mercy of the provider’s uptime and maintenance schedules, not to mention your enterprise’s Internet connection. On-premise solutions allow you to be totally responsible for your own uptime and outages. Depending on the priority level of the data being handled and the transactions the data warehouse will administer between systems, you may want to prioritize control over convenience, or vice-versa.
For years, the Volvo Car Corporation has collected large volumes of data from the numerous digital sensors in each of its cars. Working alongside the devices that trigger signals like the Service Engine Soon light are hundreds of meters and data loggers. All of these data points are collected from the car’s central computer when it is brought in for servicing and diagnostics.
Volvo identified that this data, when analyzed alongside hardware, software, and functional specifications, vehicle diagnostics, and warranty claims, would allow them to reduce costs, improve the accuracy of warranty reimbursement, and improve the quality of their products.
To perform the intended analysis, Volvo first had to consolidate all their data sources into one location. Volvo teamed up with Teradata to implement a data warehouse.
As a result, the volume of actionable data at the disposal of Volvo analysts increased from 364 gigabytes to 1.7 terabytes. The data was made accessible to over 300 individuals across product design, manufacturing, quality assurance, and warranty departments. Volvo saw immediate improvements in query performance and user access, targeted cost reductions, increased accuracy in warranty claim analysis, and improvements in the quality of current and future production lines.