Data Science Strategy: Building the Narrative
There's nothing wrong with drawing a picture when explaining a complicated process, so take a look at Figure 1-1, where you can see the main steps, or stages, in the data science life cycle. Keep in mind that the model visualized in Figure 1-1 assumes that you have already identified a high-level business problem or business opportunity as the starting point. This initial motivation usually comes from a business perspective, but it needs to be analyzed and framed in detail together with the data science team. That dialogue is important for understanding which data is available and what can be done with that data, so that you can set the focus of the work ahead. It isn't a good idea to start capturing any and all data that looks interesting enough to analyze. The first stage of the data science life cycle, capture, is therefore about framing the data you need by translating the business need into a concrete, well-defined problem or business opportunity.
FIGURE 1-1: The different stages of the data science life cycle.
The initial business problem and/or opportunity isn't static and will change over time as your data-driven understanding matures. Staying flexible about which data to capture, and about which problems and/or opportunities matter most at a given point in time, is therefore vital to reaching your business objectives.
The model shown in Figure 1-1 aims to represent a view of the different stages of
the data science life cycle, from capturing the business and data need through
preparing, exploring, and analyzing the data to reaching insights and acting on
them.
The output of each full cycle feeds the next cycle as new input. This input includes not only new data or results, which you can use to optimize your model, but can also generate new business needs, new problems, or even a new understanding of what the business priority should be.
These stages of the data science life cycle can also be seen as not only steps
describing the scope of data science but also layers in an architecture. More on
that later; let me start by explaining the different stages.
Capture
The first stage in the life cycle has two different parts, since capture refers both to capturing the business need and to the extraction and acquisition of data. This stage is vital to the rest of the process. I'll start by explaining what it means to capture the business need.
The starting point for detailing the business need is a high-level business request or business problem expressed by management or a similar entity. Detailing that need should include tasks such as

» Translating ambiguous business requests into concrete, well-defined problems or opportunities
» Deep-diving into the context of the requests to better understand what a potential solution could look like, including which data will be needed
» Outlining (if possible) strategic business priorities set by the company that might impact the data science work
Now that I’ve made clear the importance of capturing and understanding the
business requests and initial scoping of data needed, I want to move on to describ-
ing aspects of the data capture process itself. It’s the main interface to the data
source that you need to tap into and includes areas such as
» » Managing data ownership and securing legal rights to data capture and usage
» » Handling of personal information and securing data privacy through different
anonymization techniques
» » Using hardware and software for acquiring the data through batch uploads or
the real-time streaming of data
» Determining how frequently data will need to be acquired, because the frequency usually varies between data types and categories
» Mandating that the preprocessing of data occurs at the point of collection, or even before collection (at the edge of an IoT device, for example). This includes basic processing, like cleaning and aggregating data, but it can also include more advanced activities, such as anonymizing the data to remove sensitive information. (Anonymizing refers to removing sensitive information, such as a person's name, phone number, address, and so on, from a data set.) In most cases, data must be anonymized before being transferred from the data source. Usually a procedure is also in place to validate data sets in terms of completeness. If the data isn't complete, the collection may need to be repeated several times to achieve the desired data scope. Performing this type of validation early on has a positive impact on both process speed and cost. (A short sketch of this kind of capture-stage preprocessing follows this list.)
» Managing the data transfer process to the needed storage point (local and/or global). As part of the data transfer, you may have to transform the data — aggregating it to make it smaller, for example. You may need to do this if you're facing limits on the bandwidth capacity of the transfer links you use.
Maintain
Data maintenance activities include both storing and maintaining the data. Note that data is usually processed in many different steps throughout its life cycle.
The need to protect data integrity during the life cycle of a data element is especially important during data processing activities. It's easy to corrupt a data set through human error when manually processing data, rendering it useless for analysis in the next step. The best way to protect data integrity is to automate as many steps as possible of the data management activities leading up to the point of data analysis.
Maintaining business trust in the data foundation is vital if business users are to rely on, and make use of, the derived insights.
When it comes to maintaining data, two important aspects are
» Data storage: Think of this as everything associated with what's happening in the data lake. Data storage activities include managing the different retention periods for different types of data, as well as cataloging data properly to ensure that data is easy to access and use.
» Data preparation: In the context of maintaining data, data preparation includes basic processing tasks such as second-level data cleansing, data staging, and data aggregation, all of which usually involve applying a filter directly when the data is put into storage. You don't want to put data with poor quality into your data lake.
Data retention periods can differ for the same data type, depending on its level of aggregation. For example, raw data might be worth saving for only a short time because it's usually very large in volume and therefore costly to store. Aggregated data, on the other hand, is often smaller in size and cheaper and easier to store, and can therefore be saved for longer periods, depending on the targeted use cases.
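To make the storage trade-off concrete, here is a minimal sketch, assuming made-up per-second measurement data: aggregating raw events into hourly summaries shrinks the data set dramatically, which is why aggregates can be retained far longer than raw data.

import numpy as np
import pandas as pd

# Hypothetical raw data: one record per second for roughly three hours
timestamps = pd.date_range("2019-01-01", periods=10_000, freq="s")
raw = pd.DataFrame({
    "timestamp": timestamps,
    "bytes_sent": np.random.default_rng(0).integers(100, 10_000, len(timestamps)),
})

# Hourly aggregate: keep only the statistics the targeted use cases need
hourly = (
    raw.set_index("timestamp")
       .resample("h")["bytes_sent"]
       .agg(["sum", "mean", "count"])
)

print(f"raw rows: {len(raw)}, aggregated rows: {len(hourly)}")  # 10,000 versus 3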
Process
The process stage is the main data processing layer, focused on preparing data for analysis; it refers to using more advanced data engineering methodologies, such as
» Data classification: This refers to the process of organizing data into categories for even more effective and efficient use, including activities such as the labeling and tagging of data. A well-planned data classification system makes essential data easy to find and retrieve. This can also be of particular importance for areas such as legal and compliance.
» Data modeling: This helps with the visual representation of data and enforces established business rules regarding data. You would also build data models to enforce policies on how you should correlate different data types in a consistent manner. Data models also ensure consistency in naming conventions, default values, semantics, and security procedures, thus ensuring quality of data.
» Data summarization: Here your aim is to summarize the data in different ways, such as by using clustering techniques. (A clustering sketch follows this list.)
» Data mining: This is the process of analyzing large data sets to identify patterns or deviations as well as to establish relationships in order to enable problems to be solved through data analysis further down the road. Data mining is a sort of data analysis focused on an enhanced understanding of data, also referred to as data literacy. Building data literacy in the data science teams is a key component of data science success. With low data literacy, and without truly understanding the data you're preparing, analyzing, and deriving insights from, you run a high risk of failing when it comes to your data science investment.
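As promised in the data summarization bullet, here is a minimal clustering sketch using scikit-learn's KMeans on made-up customer data; each cluster centroid then serves as a compact summary of one group of similar records.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical features: monthly spend and number of support calls
customers = np.column_stack([
    rng.normal(50, 15, 300),   # spend
    rng.poisson(2, 300),       # support calls
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
for i, center in enumerate(kmeans.cluster_centers_):
    size = int(np.sum(kmeans.labels_ == i))
    print(f"cluster {i}: {size} customers, centroid = {center.round(1)}")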
Analyze
Data analysis is the stage where the data comes to life and you’re finally able to
derive insights from the application of different analytical techniques.
Insights can be focused on understanding and explaining what has happened,
which means that the analysis is descriptive and more reactive in nature. This is
also the case with real-time analysis: It’s still reactive even when it happens in
the here-and-now.
Then there are data analysis methods that aim to explain not only what happened but also why it happened. These types of data analysis are usually referred to as diagnostic analyses.
Both descriptive and diagnostic methods are usually grouped into the area of
reporting, or business intelligence (BI).
To be able to predict what will happen, you need to use a different set of analytical techniques and methods. Predictions about the future can be made strategically or in real-time settings. For a real-time prediction, you need to develop, train, and validate a model before deploying it on real-time data. The model can then search for certain data patterns and conditions that you have trained it to find, helping you predict a problem before it happens.
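As a minimal sketch of that develop/train/validate/deploy flow, the following Python snippet uses scikit-learn on synthetic data; "deployment" is reduced to scoring a single new record, standing in for real-time data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Develop: synthetic history of events, labeled problem / no problem
X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train, then validate on held-out data before any deployment
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))

# "Deploy": score an incoming record to predict a problem before it happens
new_event = X_test[:1]
print("predicted problem probability:", model.predict_proba(new_event)[0, 1])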
Figure 1-2 shows the difference between reporting techniques about what has
happened (in black) and analytics techniques about what is likely to happen, using
statistical models and predictive models (in white).
FIGURE 1-2: The difference between reporting and analytics.
This list gives you examples of the kinds of questions you can ask using different reporting and BI techniques:
» Standard reports: What was the customer churn rate?
» Ad hoc reports: How did the code fix carried out on a certain date impact product performance?
» Query drill-down: Are similar product-quality issues reported in all geographical locations?
» Alerts: Customer churn has increased. What action is recommended?
And this list gives you examples of the kinds of questions you can ask using different analytics techniques:
» Statistical analysis: Which factors contribute most to the product quality issues?
» Forecasting: What will bandwidth demand be in 6 months?
» Predictive modeling: Which customer segment is most likely to respond to this marketing campaign?
» Optimization: What is the optimal mix of customer, offering, price, and sales channel?
Analytics can also be separated into two categories: basic analytics and advanced
analytics. Basic analytics uses rudimentary techniques and statistical methods to
derive value from data, usually in a manual manner, whereas in advanced analytics,
the objective is to gain deeper insights, make predictions, or generate recommenda-
tions by way of an autonomous or semiautonomous examination of data or content
using more advanced and sophisticated statistical methods and techniques.
Some examples of the differences are described in this list:
» Exploratory data analytics is a statistical approach to analyzing data sets in order to summarize their main characteristics, often with visual methods. You can choose to use a statistical model or not, but if used, such a model is primarily for visualizing what the data can tell you beyond the formal modeling or hypothesis-testing task. This is categorized as basic analytics.
» Predictive analytics is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. This is categorized as advanced analytics.
» Regression analysis is a way of mathematically sorting out which variables have an impact. It answers these questions: Which factors matter most? Which can be ignored? How do those factors interact with each other? And, perhaps most importantly, how certain am I about all these factors? This is categorized as advanced analytics.
» Text mining, or text analytics, is the process of exploring and analyzing large amounts of unstructured text aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. The overarching goal of text mining is to turn text into data for analysis via the application of natural language processing (NLP) and various analytical methods. Text mining can be done from a more basic perspective as well as from a more advanced perspective, depending on the use case. (A basic text-mining sketch follows this list.)
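To make the text-mining bullet concrete, here is a minimal basic-perspective sketch using scikit-learn's TF-IDF vectorizer on three made-up support tickets; the top-weighted terms per document hint at the concepts and recurring topics a fuller NLP pipeline would extract.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Customer reports dropped calls after the latest software update.",
    "Billing portal is slow; customer wants an invoice correction.",
    "Dropped calls again in the same region after the update.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)   # turn text into numeric data

terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = terms[row.argsort()[-3:][::-1]]   # three highest-weighted terms
    print(f"doc {i}: {', '.join(top)}")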
Communicate
The communication stage of data science is about making sure that insights and learnings from the data analysis are understood and communicated by way of different means so that they can be put to efficient use. It includes areas such as
» Data reporting: The process of collecting and submitting data in order to enable an accurate analysis of the facts on the ground. It's a vital part of communication because inaccurate data reporting can lead to vastly uninformed decision-making based on inaccurate evidence.
» Data visualization: This can also be seen as visual communication because it involves the creation and study of the visual representation of data and insights. To help you communicate the results of the data analysis clearly and efficiently, data visualization uses statistical graphics, plots, information graphics, and other tools. Effective visualization helps users analyze and reason about data and evidence because it makes complex data more accessible, understandable, and usable.
Users may have been assigned particular analytical tasks, such as making
comparisons or understanding causality, and the design principle of the graphical
visualization (showing comparisons or showing causality, in this example) follows
the task. Tables are generally used where users can look up a specific measure-
ment, and charts of various types are used to show patterns or relationships in the
data for one or more variables.
Figure 1-3 shows how data exploration could work using a table format. In this specific case, the data being explored concerns cars, and the hypothesis being tested is which car attribute impacts fuel consumption the most. Is it, for example, the car brand, the engine size, the horsepower, or perhaps the weight of the car?
As you can see, exploring the data using tables has its limitations and does not give an immediate overview. It requires you to go through the data in detail to discover relationships and patterns. Compare this with the graph shown in Figure 1-4, where the same data is visualized in a completely different way.
FIGURE 1-3: Example of data exploration using a table.
Figure 1-3 is based on a screenshot generated using SAS® Visual Analytics software.
Copyright © 2019 SAS Institute Inc., Cary, NC, USA. SAS and all other SAS Institute Inc.
product or service names are registered trademarks or trademarks of SAS Institute Inc.
All Rights Reserved. Used with permission.
In Figure 1-4, a visualization in the shape of a linear regression graph has been generated for each car attribute, together with text explaining the strength of each relationship to fuel consumption. (Linear regression involves fitting a straight line to a data set while trying to minimize the error between the points and the fitted line.) The graph in Figure 1-4 shows a very strong positive relationship between the weight of the car and fuel consumption. By studying the relationship between the other attributes and fuel consumption using the graph generated for each tab, it's much easier to find the strongest relationship than it is by using the table in Figure 1-3.
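For readers who want to see the mechanics behind a graph like Figure 1-4, here is a minimal sketch that fits a straight line to made-up car data (the numbers are illustrative, not the book's data set) and reports the strength of the weight-to-fuel-consumption relationship.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical cars: weight in kg versus fuel consumption in l/100 km
weight_kg = np.array([[850], [1100], [1300], [1500], [1750], [2000]])
fuel_l_per_100km = np.array([4.8, 5.9, 6.8, 7.9, 9.1, 10.4])

model = LinearRegression().fit(weight_kg, fuel_l_per_100km)
r2 = model.score(weight_kg, fuel_l_per_100km)   # strength of the relationship

print(f"slope: {model.coef_[0]:.4f} litres per kg, R^2 = {r2:.3f}")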
However, in data exploration the key is to stay flexible in terms of which exploration methods to use. In this case, it was easier and quicker to find the relationship by using linear regression, but in another case a table might be enough, or neither of the approaches just mentioned works. If you have geographical data, for example, the best way to explore it might be by using a geo map, where the data is distributed based on geographical location. But more about that later on.
Actuate
The final stage in the data science life cycle is to actuate the insights derived from
all previous stages. This stage has not always been seen as part of the data science
life cycle, but the more that society moves toward embracing automation, the
more the interest in this area grows.
FIGURE 1-4: Visualizing your data.
Figure 1-4 is based on a screenshot generated using SAS® Visual Analytics software.
Copyright © 2019 SAS Institute Inc., Cary, NC, USA. SAS and all other SAS Institute Inc.
product or service names are registered trademarks or trademarks of SAS Institute Inc.
All Rights Reserved. Used with permission.
Decision-making for actuation refers to connecting an insight derived from data analysis to a human- or machine-driven decision-making process that identifies and decides among alternatives for the right action, based on the values, policies, preferences, or beliefs related to the business or the scope of the task.
What actually occurs is that a human or machine compares the insight with a
predefined set of policies for what needs to be done when a certain set of criteria
is fulfilled. If the criteria are fulfilled, this triggers a decision or an action. The
actuation trigger can be directed toward a human (for example, a manager) for
further decisions to be made in a larger context, or toward a machine when the
insight falls within the scope of the predefined policies for actuation.
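As a minimal sketch of this policy-comparison step, assuming made-up policy names and insight fields, the following Python snippet checks an insight against predefined, human-approved criteria and either triggers a machine action or escalates to a human.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    name: str
    criteria: Callable[[dict], bool]   # predefined, human-approved condition
    action: Callable[[dict], None]     # machine action or human escalation

def restart_cell(insight):
    print(f"machine action: restarting cell {insight['cell_id']}")

def escalate(insight):
    print(f"escalating to a manager for a wider decision: {insight}")

policies = [
    Policy("degradation", lambda i: i["drop_rate"] > 0.05, restart_cell),
    Policy("fallback", lambda i: True, escalate),   # outside known policies
]

insight = {"cell_id": "A17", "drop_rate": 0.08}    # output of the analyze stage
for policy in policies:
    if policy.criteria(insight):   # criteria fulfilled, so trigger the action
        policy.action(insight)
        break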
Automation of tasks or decisions increases speed and reduces cost and, if set up properly, also produces continuous and reliable data on the outcome of the implemented action.
The stage where decisions are actuated — by either human hand or a machine —
is one of the most important areas of data science. It’s fundamental because it will
provide data science professionals (also known as data scientists) with new data
based on the results of the action (resolution or prevention of a problem, for
example), which tells the data scientists whether their models and algorithms are
performing as expected after deployment or whether they need to be corrected or
improved. The follow-up regarding model and algorithm performance also sup-
ports the concept of continuous improvement.
PUTTING AUTOMATION IN THE CONTEXT OF DATA SCIENCE
What is actually the relationship between data science and automation? And, can
automation accelerate data science production and efficiency? Well, assuming that the
technology evolution in society moves more and more toward automation, not only
for simple process steps previously performed by humans but also for complex actions
identified and decided by intelligent machines powered by machine-learning-developed
algorithms, the relationship will be a strong one, and data science production and effi-
ciency will accelerate considerably due to automation.
The decisions will, of course, not really be made by the machines, but will be based on human-preapproved policies that the machine then acts on. Machine learning doesn't mean that the machine can learn unfettered, but rather that it always encounters boundaries for the learning set up by the data scientist — boundaries regulated by established policies. Within these policy boundaries, however, the machine can learn to optimize the analysis and execution of tasks assigned to it.
Despite the boundaries imposed on it, automation powered by machines will become
more and more important in data science, not only as a means to increase speed (from
detection to correction or prevention) but also to lower cost and secure quality and
consistency of data management, actuation of insights, and data generation based on
the outcome.
When applying data science in your business, remember that data science is transformative. For it to fully empower your business, it isn't a question of just going out and hiring a couple of data scientists (if you can find them), putting them into a traditional software development department, and expecting miracles. For data science to thrive and generate full value, you need to be prepared to first transform your business into a data-driven organization.