Now that ‘Big Data’ is all the rage, no company or organization involved in today’s technology development, or in data-intensive fields such as banking and health care, wants to be left out. Organizations have taken to amassing vast and often chaotic stores of structured and unstructured data, commonly referred to as ‘data lakes.’
There are excellent reasons for creating data lakes. Technologies such as Hadoop allow huge volumes of data to be stored inexpensively: ordinary servers are grouped together and managed as clusters at a fraction of what dedicated database machines and software cost. The way data is stored in a data lake often allows analysis and number-crunching to be performed far faster as well. For example, fraudulent credit card activity can be identified and corralled much more quickly because, as Vamsi Chemitiganti puts it in a Vamsi Talks Tech article titled Hadoop counters Credit Card Fraud, “Hadoop (-related algorithms and methods) can ingest billions of events at scale thus supporting the most mission critical analytics irrespective of data size,” and “Hadoop supports multiple ways of running models and algorithms that are used to find patterns of fraud and anomalies in the data to predict customer behavior.” This ability to process huge amounts of data in a variety of ways is helping organizations such as PayPal stay a step or two ahead of the criminals, as Chemitiganti’s superb article points out.
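The kind of pattern-finding Chemitiganti describes can be sketched in miniature. The toy example below, a simple z-score outlier check over a handful of hypothetical transaction amounts, is only a stand-in for the far richer models a real Hadoop-scale fraud pipeline would run; the data and threshold are invented for illustration:

```python
from statistics import mean, stdev

def flag_anomalies(amounts, threshold=2.0):
    """Flag amounts more than `threshold` standard deviations from
    the mean -- a toy stand-in for real fraud-detection models.
    (With small samples, a z-score can never be very large, so a
    modest threshold is used here.)"""
    mu = mean(amounts)
    sigma = stdev(amounts)
    return [a for a in amounts if sigma and abs(a - mu) / sigma > threshold]

# Typical purchases with one wildly out-of-pattern charge.
transactions = [12.50, 8.99, 23.10, 15.75, 9.20, 4999.00, 11.40, 18.60]
print(flag_anomalies(transactions))  # the 4999.00 charge is flagged
```

A production system would of course score billions of such events in parallel across a cluster, and with far more sophisticated statistical and machine-learning models, but the underlying idea of surfacing behavior that deviates from an established pattern is the same.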
Your organization will be concerned with managing all of its usual, business-critical data while at the same time perhaps ingesting more and more outside data, to the point where it becomes unmanageable. Herein lies the downside of data lakes. Before going any further, your company’s data-management practices must be sound across the board; otherwise, a flood of data can become a major liability. An article by Dan Woods on Forbes.com titled Why Data Lakes Are Evil points this out: “With data lakes there’s no inherent way to prioritize what data is going into the supply chain and how it will eventually be used. The result is like a museum with a huge collection of art, but no curator with the eye to tell what is worth displaying and what’s not.” So companies may end up with a giant mess. And, what’s worse, “Companies may not even know all the data they’re collecting, where it’s from, and what risks it exposes them to…Data lakes do not have rules overseeing what they take in, so there is a great danger that companies could be collecting data that exposes them to risk in a certain location.”
An obvious answer to this dilemma is to reach for outside help. Data experts can be brought in to stem the tide and to sort out which data should be retained and which discarded. As in an analogous situation, such as living with a pack rat (even if the pack rat is you), the sooner the mess gets taken care of, the better.
The best a company or organization can hope for is the ability to contain, control, and harness its data stores to maximum advantage without compromising security. That means employing data lakes side by side with traditional ‘data warehousing’ technologies, taking a balanced, carefully measured approach that applies each technology where, and in whatever tailored fashion, it makes sense.
The incredible possibilities that ‘Big Data’ technologies create, such as the ability to upload, process, display, and view all manufacturing and supply chain processes and process data almost instantaneously, regardless of location, afford great advantages to the firms able to realize them. But it is vitally important that such firms prepare beforehand by adapting their data lakes, data warehouses, storage systems, servers and server services, and other computing assets so that these myriad and complex systems function together seamlessly, as a single unit, with clearly defined and, as far as possible, unassailable security.