Data Lakes, Swamps, and Retention Ponds
The seasons are changing here in North America. We’re once again suffering the inevitable seasonal progression towards shorter days and lower temps. As I watch the bodies of water around me start to freeze at the edges, I’m reminded of recent conversations I’ve had about big data and its customary body-of-water analogies. Now, realizing that observing the wonders of nature carries me to thoughts of big data does worry me a bit, but I’ll save that for another time.
At a conference I recently attended, I learned from a big data colleague that about 80% of Fortune 1000 companies have invested in Hadoop, but that fewer than 23% of those companies are doing any real work with that data. That piqued my interest and led me to contemplate the value of collecting data en masse without a reasonably foreseeable need to actually harvest it. This concept is pretty counterintuitive to the way my purpose-driven, data-warehouse-biased mind works.
The concept of a data lake is to bring disparate and undefined data, in its native form, into a shared space where it can be analyzed at some point in the future. The hope is to increase agility and accessibility for data analysis…specifically, big data analysis. It’s not a purpose-built data store like most data practitioners are used to; it’s a marshy conglomeration of data silos brought together in one space. No metadata, no preparation of data, no governance.
Undefined and expansive generally sounds like a risky IT proposition to me, but this use case could just as easily be considered an IT opportunity. We know that low storage costs, cheap IaaS, and free open source big data frameworks are readily available and maturing at rapid rates. The number of Hadoop clusters being spun up in production has risen 60% in each of the past two years. This rapid adoption of Hadoop is irrefutable and is poised to continue. Other decision makers obviously see the benefit.
At the moment, the practical business applications of big data seem to be mostly in sentiment analysis, understanding consumer behavior via clickstream, sales and marketing opportunities, and fraud detection. The truth is, as IoT applications grow and the volume of human knowledge increases, the need to understand our ever-accumulating data will grow with them. Might as well get in front of it.
There’s a certain beauty in the chaos. If the effort of capturing ridiculous amounts of varied data in the form of a data lake, swamp, or retention pond lays the foundation for solving currently undefined problems and seizing currently unknown opportunities, at a reasonable cost…I consider that a competitive advantage worth seizing.