This is the Big Data era: huge volumes of data careening through networks, slamming into massive storage arrays and being scrutinized with new and improved analytical software running on special-purpose hardware.
Amid all the celebration of new analytical capabilities and greatly improved query performance, we forgot a few things: the standards, best practices and quality processes that drive our companies and support our mission-critical IT systems. We forgot about the little data.
Big Data Paradigm Shift
For the last few centuries, mankind’s view of data has been that it’s:
• Hierarchical: Customers have multiple orders; patients have multiple treatments; and accounts have many transactions.
• Structured: Data comes in only a few well-known types, such as character strings and numeric values, with known attributes and metadata.
• Stored as-is, usually on some physical medium such as stone, parchment or paper.
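This traditional hierarchical, typed view can be sketched in a few lines of code. The class and field names below are illustrative, not drawn from any particular system:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the traditional view of data:
# a parent record owning typed child records.
@dataclass
class Order:
    order_id: int
    amount: float  # numeric value with known attributes

@dataclass
class Customer:
    customer_id: int
    name: str  # character string with known metadata
    # Hierarchical: one customer, many orders.
    orders: list[Order] = field(default_factory=list)

c = Customer(1, "Acme Corp")
c.orders.append(Order(100, 250.00))
c.orders.append(Order(101, 99.95))
print(len(c.orders))  # → 2
```

Everything here is known in advance: the types, the attributes and the parent-child relationships.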
When a Big Data problem appears, our tendency has been to think of it as a scaling-up and resource allocation issue.
Big Data today is different in many ways. We’re now faced with new and complex data types, self-describing data and multi-structured data, in addition to the expected high volumes and speeds. Big Data isn’t only a scale-up issue; it’s also a re-architecture and data integration issue, and it often involves integrating dissimilar architectures. By insisting that we can deal with Big Data by simply scaling up with faster special-purpose hardware, we aren’t only neglecting the real issues; we’re also leaving our current processes and data (the little data) behind.
When we think about integrating all the new Big Data with what we already have, the need for attention to architecture and data integration becomes clear. Let’s consider a possible Big Data implementation and see how we can restore little data to its place of importance.
Big Data: The First Implementation
It’s typical for the IT enterprise to test-drive a new process or idea. This usually takes the form of some initial analyses, perhaps including feasibility studies, proof of concept testing and a pilot project. Often, the first Big Data project implements Business Intelligence (BI) analytics using some new hardware or software that accesses current data-at-rest in the production environment.
Forgotten in this new project are current systems and processes. The result: a shiny new system that produces measurable results from Big Data in production. Regrettably, integrating it into the current Data Warehouse (DW) architecture is now much more difficult. In rushing to implement new technology, IT failed to consider the following:
Test and sandbox environments. You should always include a test environment. Developers will use it to develop queries and reports in an environment separated from production, usually with data that has been cleansed of confidential or restricted information.
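One common cleansing approach is to mask confidential columns with stable, irreversible tokens before copying production rows into the test environment. The sketch below is a minimal illustration, not a complete masking tool; the column names and salt are hypothetical:

```python
import hashlib

def mask(value: str, salt: str = "test-env") -> str:
    """Replace a confidential value with a stable, irreversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

# A hypothetical production row being prepared for the test environment.
row = {"customer": "Jane Doe", "ssn": "123-45-6789", "balance": 1042.17}
cleansed = {
    "customer": mask(row["customer"]),  # masked: confidential
    "ssn": mask(row["ssn"]),            # masked: restricted
    "balance": row["balance"],          # kept: needed for realistic testing
}
print(cleansed["balance"])  # → 1042.17
```

Because the same input always yields the same token, masked keys still join consistently across tables, so test queries behave like their production counterparts.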