Data Science Strategy: Building the Narrative


It never hurts to draw a picture when explaining a complicated process. Take a look at Figure 1-1, where you can see the main steps, or phases, of the data science life cycle. Keep in mind that the model visualized in Figure 1-1 assumes that you have already identified a high-level business problem or business opportunity as the starting point. This initial motivation usually comes from a business perspective, but it needs to be analyzed and framed in detail together with the data science team. This dialogue is essential for understanding which data is available and what can be done with it, so that you can set the focus for the work ahead. It is not a good idea to start capturing each and every piece of data that looks interesting enough to analyze. The first stage of the data science life cycle, capture, is therefore about framing the data you need by translating the business need into a concrete, well-defined problem or business opportunity.


FIGURE 1-1: The different stages of the data science life cycle.


The initial business problem and/or opportunity is not static; it will change over time as your data-driven understanding matures. Staying flexible regarding which data to capture, as well as which problems and/or opportunities matter most at a given point in time, is therefore essential to achieving your business goals.

The model shown in Figure 1-1 is meant to represent a view of the different stages of the data science life cycle, from capturing the business need and the data, through data preparation, exploration, and analysis, to reaching insights and acting on them.

The output of each full cycle produces new data that is the result of the previous cycle. This includes not only new data or results, which you can use to optimize your models, but may also generate new business needs, new problems, or even a new understanding of what the business priorities should be. The stages of the data science life cycle can also be seen as steps that not only describe the scope of data science but also represent layers in an architecture. The following sections walk through each stage of the data science life cycle.


Capture

The first stage of the life cycle has two distinct parts, because capture refers both to capturing the business need and to the extraction and acquisition of data. This stage is vital to the rest of the process. I'll start by explaining what it means to capture the business need. The starting point for detailing the business need is a high-level business request or business problem expressed by management or a similar entity, and it should include tasks such as

» Translating ambiguous business requests into concrete, well-defined problems or opportunities
» Deep-diving into the context of the requests to better understand what a potential solution could look like, including which data will be needed
» Outlining (if possible) strategic business priorities set by the company that might impact the data science work

Now that I've made clear the importance of capturing and understanding the business requests and the initial scoping of the data needed, I want to move on to describing aspects of the data capture process itself. It's the main interface to the data sources that you need to tap into and includes areas such as

» Managing data ownership and securing legal rights to data capture and usage
» Handling personal information and securing data privacy through different anonymization techniques
» Using hardware and software for acquiring the data through batch uploads or the real-time streaming of data
» Determining how frequently data will need to be acquired, because the frequency usually varies between data types and categories
» Mandating that the preprocessing of data occurs at the point of collection, or even before collection (at the edge of an IoT device, for example). This includes basic processing, like cleaning and aggregating data, but it can also include more advanced activities, such as anonymizing the data to remove sensitive information; a minimal sketch of such an anonymization step appears after this list. (Anonymizing refers to removing sensitive information such as a person's name, phone number, address, and so on from a data set.) In most cases, data must be anonymized before being transferred from the data source. Usually a procedure is also in place to validate data sets in terms of completeness. If the data isn't complete, the collection may need to be repeated several times to achieve the desired data scope. Performing this type of validation early on has a positive impact on both process speed and cost.
» Managing the data transfer process to the needed storage point (local and/or global). As part of the data transfer, you may have to transform the data, for example by aggregating it to make it smaller. You may need to do this if you're facing limits on the bandwidth capacity of the transfer links you use.
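As an illustration of the kind of anonymization that might be applied at or near the point of collection, here is a minimal sketch in Python. It pseudonymizes direct identifiers by hashing them with a secret key and drops free-text fields entirely. The column names and the key handling are assumptions made for the example, not a prescription.

<pre>
import hashlib
import hmac

import pandas as pd

# Hypothetical column layout; adjust to your own schema.
PII_COLUMNS = ["name", "phone", "address"]  # direct identifiers to pseudonymize
DROP_COLUMNS = ["free_text_notes"]          # fields too risky to keep at all

def pseudonymize(value, key: bytes) -> str:
    """Replace an identifier with a keyed hash so that records stay
    linkable across data sets without exposing the original value."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()

def anonymize(df: pd.DataFrame, key: bytes) -> pd.DataFrame:
    """Drop risky columns and hash the remaining direct identifiers."""
    out = df.drop(columns=[c for c in DROP_COLUMNS if c in df.columns])
    for col in PII_COLUMNS:
        if col in out.columns:
            out[col] = out[col].map(lambda v: pseudonymize(v, key))
    return out

# Toy usage; in practice the key would come from a secret store.
raw = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "phone": ["555-0100", "555-0101"],
    "address": ["1 Main St", "2 Elm St"],
    "free_text_notes": ["called twice", "prefers email"],
    "purchase_amount": [120.0, 75.5],
})
print(anonymize(raw, key=b"replace-with-a-secret-key"))
</pre>

Strictly speaking, keyed hashing is pseudonymization rather than full anonymization, because the mapping can be recreated by anyone holding the key; whether that is sufficient depends on the privacy requirements at hand.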
Maintain

Data maintenance activities include both storing and maintaining the data. Note that data is usually processed in many different steps throughout its life cycle. The need to protect data integrity during the life cycle of a data element is especially important during data processing activities. It's easy to accidentally corrupt a data set through human error when manually processing data, rendering the data set useless for analysis in the next step. The best way to protect data integrity is to automate as many steps as possible of the data management activities leading up to the point of data analysis. Keeping business trust in the data foundation is vital in order for business users to trust and make use of the derived insights.

When it comes to maintaining data, two important aspects are

» Data storage: Think of this as everything associated with what's happening in the data lake. Data storage activities include managing the different retention periods for different types of data, as well as cataloging data properly to ensure that data is easy to access and use.
» Data preparation: In the context of maintaining data, data preparation includes basic processing tasks such as second-level data cleansing, data staging, and data aggregation, all of which usually involve applying a filter directly when the data is put into storage. You don't want to put data with poor quality into your data lake.

Data retention periods can differ for the same data type, depending on its level of aggregation. For example, raw data might be worth saving for only a short time because it's usually very large in volume and therefore costly to store. Aggregated data, on the other hand, is often smaller in size, cheaper and easier to store, and can therefore be saved for longer periods, depending on the targeted use cases.

Process

Processing of data is the main data processing layer focused on preparing data for analysis, and it refers to using more advanced data engineering methodologies, such as

» Data classification: This refers to the process of organizing data into categories for even more effective and efficient use, including activities such as the labeling and tagging of data. A well-planned data classification system makes essential data easy to find and retrieve. This can also be of particular importance for areas such as legal and compliance.
» Data modeling: This helps with the visual representation of data and enforces established business rules regarding data. You would also build data models to enforce policies on how different data types should be correlated in a consistent manner. Data models also ensure consistency in naming conventions, default values, semantics, and security procedures, thus ensuring data quality.
» Data summarization: Here your aim is to use different ways to summarize data, for example by using different clustering techniques; a minimal clustering sketch follows at the end of this section.
» Data mining: This is the process of analyzing large data sets to identify patterns or deviations as well as to establish relationships, in order to enable problems to be solved through data analysis further down the road.

Data mining is a sort of data analysis focused on an enhanced understanding of data, also referred to as data literacy. Building data literacy in the data science teams is a key component of data science success. With low data literacy, and without truly understanding the data you're preparing, analyzing, and deriving insights from, you run a high risk of failing when it comes to your data science investment.
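To make the data summarization idea concrete, here is a minimal sketch of clustering-based summarization using scikit-learn: several hundred records are reduced to a handful of cluster profiles that describe typical behavior. The features and their values are invented for the example.

<pre>
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy usage data: one row per customer, two invented features.
rng = np.random.default_rng(seed=42)
usage = np.column_stack([
    rng.normal(loc=50, scale=10, size=300),   # e.g., monthly sessions
    rng.normal(loc=200, scale=40, size=300),  # e.g., data volume in MB
])

# Scale the features so that neither dominates the distance metric.
scaled = StandardScaler().fit_transform(usage)

# Summarize 300 records as 3 clusters with their sizes and mean profiles.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
for label in range(3):
    members = usage[kmeans.labels_ == label]
    print(f"cluster {label}: {len(members)} records, "
          f"mean profile = {members.mean(axis=0).round(1)}")
</pre>

The cluster sizes and mean profiles act as a compact summary of the full data set; which clustering technique fits best depends on the shape and scale of the data.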
Analyze

Data analysis is the stage where the data comes to life and you're finally able to derive insights from the application of different analytical techniques. Insights can be focused on understanding and explaining what has happened, in which case the analysis is descriptive and more reactive in nature. This is also the case with real-time analysis: it's still reactive even when it happens in the here and now. Then there are data analysis methods that aim to explain not only what happened but also why it happened. These types of data analysis are usually referred to as diagnostic analyses.

Both descriptive and diagnostic methods are usually grouped into the area of reporting, or business intelligence (BI). To be able to predict what will happen, you need to use a different set of analytical techniques and methods. Predictions about the future can be made strategically or in real-time settings. For a real-time prediction, you need to develop, train, and validate a model before deploying it on real-time data. The model can then search for certain data patterns and conditions that you have trained it to find, to help you predict a problem before it happens; a minimal sketch of this develop-train-validate flow appears at the end of this section.

Figure 1-2 shows the difference between reporting techniques about what has happened (in black) and analytics techniques about what is likely to happen, using statistical models and predictive models (in white).

FIGURE 1-2: The difference between reporting and analytics.

This list gives you examples of the kinds of questions you can ask using different reporting and BI techniques:

» Standard reports: What was the customer churn rate?
» Ad hoc reports: How did the code fix carried out on a certain date impact product performance?
» Query drill-down: Are similar product-quality issues reported in all geographical locations?
» Alerts: Customer churn has increased. What action is recommended?

And this list gives you examples of the kinds of questions you can ask using different analytics techniques:

» Statistical analysis: Which factors contribute most to the product-quality issues?
» Forecasting: What will bandwidth demand be in 6 months?
» Predictive modeling: Which customer segment is most likely to respond to this marketing campaign?
» Optimization: What is the optimal mix of customer, offering, price, and sales channel?

Analytics can also be separated into two categories: basic analytics and advanced analytics. Basic analytics uses rudimentary techniques and statistical methods to derive value from data, usually in a manual manner, whereas in advanced analytics the objective is to gain deeper insights, make predictions, or generate recommendations through an autonomous or semiautonomous examination of data or content, using more advanced and sophisticated statistical methods and techniques. Some examples of the differences are described in this list:

» Exploratory data analytics is a statistical approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. You can choose to use a statistical model or not, but if one is used, it is primarily for visualizing what the data can tell you beyond the formal modeling or hypothesis-testing task. This is categorized as basic analytics.
» Predictive analytics is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. This is categorized as advanced analytics.
» Regression analysis is a way of mathematically sorting out which variables have an impact. It answers these questions: Which factors matter most? Which can be ignored? How do those factors interact with each other? And, perhaps most importantly, how certain am I about all these factors? This is categorized as advanced analytics.
» Text mining, or text analytics, is the process of exploring and analyzing large amounts of unstructured text, aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. The overarching goal of text mining is to turn text into data for analysis via the application of natural language processing (NLP) and various analytical methods. Text mining can be done from a more basic perspective as well as from a more advanced perspective, depending on the use case.
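To ground the develop-train-validate flow described above, here is a minimal predictive-modeling sketch in Python using scikit-learn. The churn scenario, the features, and the data are all invented for illustration; the point is the separation between training on history, validating on held-out history, and only then scoring new records.

<pre>
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented historical data: two usage features and a churn label per customer.
rng = np.random.default_rng(seed=7)
X = rng.normal(size=(500, 2))  # e.g., usage drop, complaint count
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Develop and train the model on one part of the history...
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# ...and validate it on held-out history before trusting it on live data.
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))

# Only then score incoming (here: simulated) real-time records.
incoming = rng.normal(size=(3, 2))
print("churn risk:", model.predict_proba(incoming)[:, 1].round(2))
</pre>

In a production setting, the deployed model would score records as they stream in, and its predictions would feed the actuation stage described later on.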
Communicate

The communication stage of data science is about making sure that the insights and learnings from the data analysis are understood and communicated through different means, so that they are put to efficient use. It includes areas such as

» Data reporting: The process of collecting and submitting data in order to enable an accurate analysis of the facts on the ground. It's a vital part of communication, because inaccurate data reporting can lead to poorly informed decision-making based on inaccurate evidence.
» Data visualization: This can also be seen as visual communication, because it involves the creation and study of the visual representation of data and insights. To help you communicate the results of the data analysis clearly and efficiently, data visualization uses statistical graphics, plots, information graphics, and other tools. Effective visualization helps users analyze and reason about data and evidence, because it makes complex data more accessible, understandable, and usable.

Users may have been assigned particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphical visualization (showing comparisons or showing causality, in this example) follows the task. Tables are generally used where users can look up a specific measurement, and charts of various types are used to show patterns or relationships in the data for one or more variables.

Figure 1-3 exemplifies how data exploration could work using a table format. In this specific case, the data being explored concerns cars, and the hypothesis being tested is which car attribute impacts fuel consumption the most. Is it, for example, the car brand, the engine size, the horsepower, or perhaps the weight of the car?

FIGURE 1-3: Example of data exploration using a table. Figure 1-3 is based on a screenshot generated using SAS® Visual Analytics software. Copyright © 2019 SAS Institute Inc., Cary, NC, USA. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. All Rights Reserved. Used with permission.

As you can see, exploring the data using tables has its limitations and does not give an immediate overview. It requires you to go through the data in detail to discover relationships and patterns. Compare this with the graph shown in Figure 1-4, where the same data is visualized in a completely different way.

In Figure 1-4, a visualization in the shape of a linear regression graph has been generated for each car attribute, together with text explaining the strength of each relationship to fuel consumption. (Linear regression involves fitting a straight line to a data set while trying to minimize the error between the points and the fitted line; a small code sketch of this kind of fit follows below.) The graph in Figure 1-4 shows a very strong positive relationship between the weight of the car and fuel consumption. By studying the relationship between the other attributes and fuel consumption using the graph generated for each tab, it is much easier to find the strongest relationship than it is when using the table in Figure 1-3.

FIGURE 1-4: Visualizing your data. Figure 1-4 is based on a screenshot generated using SAS® Visual Analytics software. Copyright © 2019 SAS Institute Inc., Cary, NC, USA. SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. All Rights Reserved. Used with permission.

However, in data exploration the key is to stay flexible about which exploration methods to use. In this case, it was easier and quicker to find the relationship by using linear regression, but in another case a table might be enough, or neither of the approaches just mentioned works. If you have geographical data, for example, the best way to explore it might be to use a geo map, where the data is distributed based on geographical location. But more about that later on.
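To illustrate the kind of exploration that Figure 1-4 describes, here is a minimal sketch in Python that fits and plots a straight line for one car attribute against fuel consumption, using NumPy and matplotlib. The data values are invented stand-ins, not the actual data set behind the figures.

<pre>
import matplotlib.pyplot as plt
import numpy as np

# Invented sample: car weight in kg versus fuel consumption in l/100 km.
weight = np.array([850, 1100, 1250, 1400, 1600, 1850, 2000, 2200])
fuel = np.array([5.1, 6.0, 6.8, 7.4, 8.5, 9.6, 10.2, 11.5])

# Fit a straight line (degree-1 polynomial) minimizing the squared error.
slope, intercept = np.polyfit(weight, fuel, deg=1)

# Correlation coefficient as a rough strength-of-relationship indicator.
r = np.corrcoef(weight, fuel)[0, 1]

plt.scatter(weight, fuel, label="observed cars")
plt.plot(weight, slope * weight + intercept, label=f"fit (r = {r:.2f})")
plt.xlabel("Weight (kg)")
plt.ylabel("Fuel consumption (l/100 km)")
plt.title("Weight vs. fuel consumption")
plt.legend()
plt.show()
</pre>

Repeating the same fit for each numeric attribute and comparing the correlation strengths mirrors, in spirit, the per-attribute comparison shown in Figure 1-4.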
Actuate

The final stage in the data science life cycle is to actuate the insights derived from all the previous stages. This stage has not always been seen as part of the data science life cycle, but the more that society moves toward embracing automation, the more the interest in this area grows.

Decision-making for actuation refers to connecting an insight derived from data analysis to a human- or machine-driven decision-making process that identifies and decides among alternatives for the right action, based on the values, policies, preferences, or beliefs related to the business or the scope of the task. What actually occurs is that a human or a machine compares the insight with a predefined set of policies for what needs to be done when a certain set of criteria is fulfilled. If the criteria are fulfilled, this triggers a decision or an action. The actuation trigger can be directed toward a human (for example, a manager) for further decisions to be made in a larger context, or toward a machine when the insight falls within the scope of the predefined policies for actuation. Automation of tasks or decisions increases speed and reduces cost and, if set up properly, also produces continuous and reliable data on the outcome of the implemented action.

The stage where decisions are actuated, by either human hand or machine, is one of the most important areas of data science. It's fundamental because it provides data science professionals (also known as data scientists) with new data based on the results of the action (resolution or prevention of a problem, for example), which tells the data scientists whether their models and algorithms are performing as expected after deployment or whether they need to be corrected or improved. This follow-up on model and algorithm performance also supports the concept of continuous improvement.
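As a sketch of what comparing an insight against predefined policies can look like in code, here is a minimal Python example. The policy thresholds, the insight fields, and the routing between human and machine are all invented for illustration.

<pre>
from dataclasses import dataclass

@dataclass
class Insight:
    """A simplified insight coming out of the analysis stage."""
    metric: str
    value: float
    confidence: float

# Invented, human-preapproved policies: criteria plus the permitted action.
# Policies are ordered from most to least specific; the first match wins.
POLICIES = [
    {"metric": "churn_risk", "threshold": 0.9, "min_confidence": 0.8,
     "action": "machine: enqueue retention offer"},
    {"metric": "churn_risk", "threshold": 0.6, "min_confidence": 0.5,
     "action": "human: notify account manager"},
]

def actuate(insight: Insight) -> str:
    """Compare the insight with the policies; a fulfilled set of criteria
    triggers the corresponding action, otherwise the insight is only logged."""
    for policy in POLICIES:
        if (insight.metric == policy["metric"]
                and insight.value >= policy["threshold"]
                and insight.confidence >= policy["min_confidence"]):
            return policy["action"]
    return "no action: log insight for later review"

print(actuate(Insight(metric="churn_risk", value=0.93, confidence=0.85)))
print(actuate(Insight(metric="churn_risk", value=0.65, confidence=0.70)))
</pre>

Because each triggered action also produces outcome data, a real implementation would record what happened after the action fired; that outcome data is exactly what feeds the continuous improvement loop described above.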
PUTTING AUTOMATION IN THE CONTEXT OF DATA SCIENCE

What is the actual relationship between data science and automation? And can automation accelerate data science production and efficiency? Well, assuming that the technology evolution in society moves more and more toward automation, not only for simple process steps previously performed by humans but also for complex actions identified and decided on by intelligent machines powered by machine-learning-developed algorithms, the relationship will be a strong one, and data science production and efficiency will accelerate considerably thanks to automation. The decisions will, of course, not really be made by the machines, but will be based on human-preapproved policies that the machine then acts on.

Machine learning doesn't mean that the machine can learn unfettered, but rather that it always encounters boundaries for the learning set up by the data scientist, boundaries regulated by established policies. Within these policy boundaries, however, the machine can learn to optimize the analysis and execution of the tasks assigned to it. Despite the boundaries imposed on it, automation powered by machines will become more and more important in data science, not only as a means to increase speed (from detection to correction or prevention) but also as a way to lower cost and to secure the quality and consistency of data management, the actuation of insights, and the data generation based on the outcome.

When applying data science in your business, remember that data science is transformative. For it to fully empower your business, it isn't a question of just going out and hiring a couple of data scientists (if you can find them), putting them into a traditional software development department, and expecting miracles. For data science to thrive and generate full value, you need to be prepared to first transform your business into a data-driven organization.