Data Science Strategy: Memanaged Konsistensi Data
Managing Data Consistency Across the Data Science Environment It might seem like a simple task to ensure data consistency across the different parts of the data science environment, but it’s much more difficult than it seems. First off, this area tends to be more complex than it needs to be, eating up more time and resources than originally estimated. The need for consistency includes aspects such as data governance and data formats, but also the labeling of data consistently —using customer IDs across many different sources to enable cor- relation of different data types related to the same customer, for example. The challenge is that there is a built-in contradiction in terms infrastructure between enabling usage of special tools to allow data scientists and data engineers to be innovative and productive and at the same time ensuring consistency in the data. This is because specialized tools are optimized to focus on solving certain problems but either don’t keep the format consistent or don’t interface well with other tools needed in the end-to-end flow. Optimized, specialized machine learn- ing tools are simply not good at playing together with other, similar specialized tools that are addressing comparable or other adjacent problems. But is it really that bad? Well, it can lead to real problems, depending on how much freedom is allowed in the architectural implementation and among the teams. Some examples of problems that can stem from a lack of consistency across the AI environment are described in this list: » » Ad hoc solutions: Every case is treated as an isolated problem that needs to be solved this instant in order for the team to move forward. The result? No long-term solution and no learning between teams. 44 PART 1 Optimizing Your Data Science Investment» » Increased cost: When you have to duplicate tool capabilities in order to manage a lack of consistency or when you have to build capabilities into purchased tools to secure just the basic consistency, those costs add up. » » End-to-end not working: Inconsistencies can occur when the infrastructure is implemented across several cloud vendors, which then makes it difficult or impossible to transfer data and keep data consistent across different virtualized environments. Because corporate management cannot enforce, and may not want to enforce, data consistency across the organization as a company policy, they have to use other means to preserve data consistency end-to-end. One way is to ensure that all teams follow proper and relevant guidelines for evaluating and purchasing new tools that incorporate specific directives related to data consistency. Clearly moti- vating why this is key to a successful data science strategy execution. It’s also vital to consider which limits are needed for each individual company, depending on the type of business, their objectives, and so on. Hold the line when it comes to data consistency: Otherwise, you may end up with a cumbersome and costly implementation of data science, one far removed from the productive data science environment you were hoping for.