Missing the T? Data Storage ETL an Oversight, Says KNIME CEO – Solutions Review

Rather than seeing, holistically, how many people interact with content on our website, on our forum, or on social media, wouldn't it be nice to see activity grouped by organization? We'd see not just an individual's view of the most recent blog post but also her colleagues' comments on LinkedIn the following day. It would be even better if we could see the connection between the two, enabling us to distinguish between high engagement on a single team and interest from a new department. Wouldn't it be great if an account manager tasked with growing a given account could spot patterns between support calls, social media comments, and online-store visits, even if some of that data came from a recently acquired company?

The biggest obstacle to continuously making sense of (all of) our data is the nasty combination of ever-changing requirements, or questions seeking an answer, with ever-changing data sources that need continuous cleaning, transforming, and integrating. Without first organizing and adding structure to all those data sources, it's impossible to derive interesting insights. The prominent claim that data is the new oil is surprisingly apt. Like oil, data in its raw form is initially useless; only once you refine it is it valuable and useful.

But how do I get to this state of well-organized data?

The solution used to be to build a data warehouse, i.e., define the one and only proper structure once and for all and then live with it. When that turned out to be infeasible, since data and data sources are ever-changing, data lakes became popular, until they also turned out to be, well, rather messy. Then things moved to the cloud, but that didn't really solve the problem of reaching and maintaining a state of well-organized data. Instead of solving it via smart (or not so smart) storage setups, meta-query or federated setups promise another answer. Still, they, too, solve only part of the puzzle.

Keeping your data accessible goes beyond just figuring out how to store the data. Teams also need a way for transformation (the T in ETL) to happen as needed without compromising resources or time. In this piece, we argue that low-code offers exactly that flexibility, giving anyone access to just the insights they need, as they need them.

But first, let's revisit what's been tried so far.

Data warehouses have been the holy grail for ages but are rarely spotted in real life. The truth is that they are easy to imagine, hard to design, and even harder to actually put to work.

Let's say we came up with the one true relational model to structure all the data floating around in an organization. In an automotive plant, for instance, perhaps your database holds manufacturing data (e.g., cycle times, lot priorities), product data (e.g., demands and yields), process data (e.g., control limits and process flows), and equipment data (e.g., status, run time, downtime). If you can make sure all this data is properly cleaned, transformed, and uploaded (which is a big if), then theoretically, you'd see immediate benefits because the architects of the data warehouse made it easy for you to ask specific questions of your data. Perhaps you'd be able to reduce costs related to equipment failures. Or better optimize inventory because you become familiar with the patterns of demand versus yields. Or improve end-of-line testing for higher product quality.

But what happens when we want to add new data from a new machine? Well, we rework the relational model, something that is expensive, difficult, and often politically challenging. And what happens when we want to evaluate our CO2 footprint, so we need to connect data from suppliers and data from logistics? We, again, rework the relational model.

Even if people are successfully using our data warehouse to create new insights, new requirements will pop up that we did not think about when we first designed the structure of our warehouse. Rather than being frozen once and for all, that structure will quickly turn into a never-ending construction site that never has a coherent, consistent form covering all current data of interest. This will, at the very least, delay finding the answers to new questions, but more likely make it simply impossible. Not at all the agile, self-service data warehouse we had in mind when we started this project years ago.

After data warehouses, the industry came up with the idea of a data lake: don't worry about structure (not even in the data itself), just collect it all and figure out later how to organize it when you actually need it. That was made possible by increasingly cheap storage facilities and NoSQL storage setups. Distributed mechanisms to process this data were also developed, MapReduce being one of the most prominent examples back then.

In a data lake, our manufacturing, product, process, and equipment data is never cleaned or transformed but dumped, as-is, into one centralized storage facility. When analysts want to make sense of this data, they rely on data engineers to custom-build solutions that include cleaning and transforming for each bespoke question. Although we don't need to rebuild an entire relational model, data engineers do need to be involved in answering each and every business question. Also, an old problem resurfaced: lots of data keeps sitting across the organization in various formats and storage facilities, and even newer data continues to be generated outside of that swamp.

Data lakes force us, just like data warehouses, to ensure all data sits within that one house or lake; we just don't need to worry about structure before moving it there. And that's precisely the issue: the organizing of data into the proper structure still needs to be done; it just gets done later in the process. Instead of structuring the warehouse upfront, we now need to deal with the mechanisms to add structure to the data lake at the time when we look for insights in our data. And we need the help of data engineers to do that.
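To make "adding structure at the time we look for insights" concrete, here is a minimal sketch in Python. The raw records, field names, and helper functions are all hypothetical and chosen only for illustration; a real lake would hold far messier data and need far more rules.

```python
import json

# Hypothetical raw records, dumped as-is into the lake (field names are illustrative).
RAW_LAKE = [
    '{"machine": "M-01", "cycle_time_s": "41.5"}',
    '{"machine_id": "m-01", "cycleTime": 39.0}',
    '{"machine": "M-02", "cycle_time_s": "55"}',
]


def read_with_schema(raw_line):
    """Apply structure at read time: unify field names, types, and ID casing."""
    record = json.loads(raw_line)
    machine = record.get("machine") or record.get("machine_id")
    cycle = record.get("cycle_time_s", record.get("cycleTime"))
    return {"machine_id": machine.upper(), "cycle_time_s": float(cycle)}


def mean_cycle_time(raw_lines):
    """Answer one bespoke question: average cycle time per machine."""
    totals = {}
    for line in raw_lines:
        row = read_with_schema(line)
        total, count = totals.get(row["machine_id"], (0.0, 0))
        totals[row["machine_id"]] = (total + row["cycle_time_s"], count + 1)
    return {m: total / count for m, (total, count) in totals.items()}


result = mean_cycle_time(RAW_LAKE)
```

Note that every rule in `read_with_schema` encodes knowledge about what the fields mean. That knowledge has to come from someone, and it has to be rewritten or extended for each new question.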

The next generation of this type of setup moved from on-premises distributed storage clusters to the cloud. The rather limiting MapReduce framework gave way to more flexible processing and analysis frameworks, such as Spark. Still, the two main problems remained: Do we really need to move all our data into one cloud to be able to generate meaningful insights from all of our data? And how do we change the structure after it's been living in our data lake? Moving everything may work for a new company that starts off with a pure cloud-based strategy and places all of its data into one cloud vendor's hands. Still, in real life, data has existed before, outside of that cloud, and nobody really wants to lock themselves in with one cloud storage provider forever.

One big problem of all the approaches described so far is the need to put it all into one repository, be that the perfectly architected warehouse, an in-house data lake, or the swamp in the cloud.

Federated approaches try to address this by leaving the data where it is and putting a meta layer on top of everything. That makes everything look like it all sits in one location, but under the hood it builds meta queries ad hoc, which pull the data from different locations and combine them as requested. These approaches obviously have performance bottlenecks (Amdahl's law tells us that the final performance will always depend on the slowest data source needed), but at least they don't require a once-and-for-all upload to one central repository. However, querying data properly is much more than just building distributed database queries. Structuring our distributed data repositories properly for every new query requires expert knowledge for all but basic operations.
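As a rough illustration of what such a meta layer does, the following Python sketch uses two independent in-memory SQLite databases as stand-ins for separate data sources. The table names, column names, and the `federated_activity` function are assumptions made up for this example, not part of any particular federation product.

```python
import sqlite3


def make_source(rows, table, columns):
    """Create an independent in-memory database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    return conn


# Two hypothetical sources that never get copied into one repository.
support_db = make_source([("acme", 3), ("globex", 1)], "support_calls", ["account", "calls"])
store_db = make_source([("acme", 12), ("initech", 4)], "store_visits", ["account", "visits"])


def federated_activity(account):
    """Meta query: ask each source separately, then combine the answers."""
    calls = support_db.execute(
        "SELECT calls FROM support_calls WHERE account = ?", (account,)
    ).fetchone()
    visits = store_db.execute(
        "SELECT visits FROM store_visits WHERE account = ?", (account,)
    ).fetchone()
    return {
        "account": account,
        "calls": calls[0] if calls else 0,
        "visits": visits[0] if visits else 0,
    }


acme = federated_activity("acme")
```

The combined answer is only available once every source has responded, and the sketch works only because both sources happen to spell account names identically. Neither condition can be taken for granted in practice.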

The central problem of all these approaches is the need to define the overall structure, i.e., how all those data storage fragments fit together: beforehand in the case of data warehouses, at analysis time for data lakes, and through automatic query building for federated approaches.

But the reality is different. In order to truly aggregate and integrate data from disparate sources, we need to understand what the data means so we can apply the right transformations at the right time to arrive at a meaningful structure in reasonable time. For some isolated aspects of this, automated (or even learning) tools exist, for instance, for entity matching in customer databases. But for the majority of these tasks, expert knowledge will always be needed.

Ultimately, the issue is that the global decision of how we store our data is based on a snapshot of reality. Reality changes fast, and our global decision is doomed to be outdated quickly. The process of extracting insights out of all available data is bottlenecked by this one be-all-end-all structure.

This is why the important part of ETL, the transformation, is either assumed to have been figured out once and for all (in data warehouses), completely neglected (in data lakes), or pushed to a later stage (in federated approaches). But pushing the T to the end, besides making it someone else's problem, has a performance impact as well. If we load and integrate our data without proper transformations, we will often create extremely large and inefficient results. Even just ensuring database joins are done in the right order can change performance by several orders of magnitude. Imagine doing this with untransformed data, where customer or machine IDs don't match, names are spelled differently, and properties are inconsistently labeled. It's impossible to get all of this right without timely domain expertise.
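A small Python sketch shows what mismatched machine IDs do to a join, and how a transformation applied before the join repairs it. The ID formats and the normalization rule (drop separators, ignore case, drop leading zeros) are hypothetical; deciding which rule is actually correct for a given plant is exactly the domain expertise in question.

```python
def normalize_id(raw_id):
    """Strip whitespace, separators, and leading zeros; uppercase the rest."""
    cleaned = raw_id.strip().upper().replace("-", "").replace("_", "")
    prefix = cleaned.rstrip("0123456789")
    digits = cleaned[len(prefix):]
    return f"{prefix}{int(digits)}" if digits else prefix


# Hypothetical rows from two systems that label the same machines differently.
status = [("M-007", "running"), ("m_012", "down")]
downtime = [(" m7 ", 1.5), ("M-12", 4.0), ("M-12", 2.0)]


def join_on_machine(status_rows, downtime_rows, key=lambda x: x):
    """Inner join on machine ID, with an optional key transformation."""
    index = {key(mid): state for mid, state in status_rows}
    return [
        (key(mid), index[key(mid)], hours)
        for mid, hours in downtime_rows
        if key(mid) in index
    ]


raw_join = join_on_machine(status, downtime)                      # no rows match
clean_join = join_on_machine(status, downtime, key=normalize_id)  # all rows match
```

Joining the untransformed rows silently returns nothing, while the normalized join recovers all three downtime records.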

Transformation needs to be done where it matters and by the person who understands it.

Low-code allows everybody to do it on the fly, while SQL or other experts inject their expertise (code) where it's needed. And if a specific type of load, aggregate, and transform process should be used by others, it's easy to package it up and make it reusable (and also auditable if needed, because it's documented in one environment: the low-code workflow). Low-code serves as a lingua franca that can be used across disciplines. Data engineers, analysts, and even line-of-business users can use the same framework to transform data at any point of the ETL process.
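Low-code tools express this visually rather than in code, but the underlying idea of packaging transformation steps into a reusable, auditable unit can be sketched in plain Python. The step functions, the `make_pipeline` helper, and its `audit_trail` attribute are hypothetical names invented for this example.

```python
from functools import reduce


def strip_whitespace(rows):
    """Trim stray whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def uppercase_ids(rows):
    """Use one consistent casing for machine IDs."""
    return [{**row, "machine_id": row["machine_id"].upper()} for row in rows]


def drop_missing_runtime(rows):
    """Remove records that carry no run-time measurement."""
    return [row for row in rows if row.get("run_time_h") is not None]


def make_pipeline(*steps):
    """Package steps into one reusable callable and record what it does."""
    def run(rows):
        return reduce(lambda data, step: step(data), steps, rows)
    run.audit_trail = [step.__name__ for step in steps]
    return run


clean_equipment = make_pipeline(strip_whitespace, uppercase_ids, drop_missing_runtime)
rows = [
    {"machine_id": " m-01 ", "run_time_h": 7.5},
    {"machine_id": "m-02", "run_time_h": None},
]
cleaned = clean_equipment(rows)
```

Anyone can now reuse `clean_equipment` without knowing its internals, and `clean_equipment.audit_trail` lists the steps applied, in order.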

Should that low-code environment be an integral part of (one of) our data storage technologies? Well, no, unless we plan to stick with that data storage environment forever. Much more likely, we'll want to keep the door open to add another type of data storage technology in the future, or maybe even switch from one cloud provider to another (or go a completely hybrid path and use different clouds together). In that case, a low-code environment, which, after all, is home to lots of our experts' domain expertise by now, should make it easy to switch those transformation processes over to our new data environment.

Why did warehouses fail, and why don't data lakes provide the answer either? Just like with software engineering, the waterfall system doesn't work for dynamic setups with changing environments and requirements. The process needs to be agile, explorative when needed, and documentable and governable when moved into production. But since data transformation will always require expertise from our domain experts, we require a setup that allows us to add this expertise continuously to the mix as well.

In the end, we need to provide the people who are supposed to use the data with intuitive ways to create the data aggregations and transformations themselves from whatever data sources, however they want. And at the same time we want to keep the doors open for new technologies that will arise, new tools that we want to try out, and new data sources and types that will show up.

Michael Berthold is co-founder of KNIME, the open analytics platform. He recently left his chair at Konstanz University and is now CEO of KNIME AG. Before that, he held positions in both academia (Carnegie Mellon, UC Berkeley) and industry (Intel, Tripos). He has co-authored several books (the second edition of the Guide to Intelligent Data Science appeared recently), is an IEEE Fellow, and a former president of the IEEE-SMC society.
