Data Science Strategy: Acquiring Data


Getting Data from There to Here

When a company decides to become data driven, the focus naturally falls on the data itself, which inevitably leads to a greater awareness of the variety of data needed to gain full, proactive, data-driven control of the current business. On top of that, companies soon realize that in order to expand beyond what is possible today, the data sets need to become even more varied. At this point, many companies start to realize that the data that is fundamental to becoming truly data driven might actually belong to someone else or be located in another country with other data regulations. This section explains how to strategically approach such practical challenges as part of your data acquisition.

Handling dependencies on data owned by others

Dealing with proprietary data is an unavoidable yet manageable challenge faced by any company striving to become fully data-driven. Typically, you have identified and carefully specified all the data you need in your data strategy, and when you then start looking into how to capture that data, you realize that you have a data ownership problem. If you use only data generated from your internal IT environment, you have, of course, less of a problem. If that’s the case, however, then your company probably isn’t truly data-driven in the proper sense. A data-driven business accounts for how its products and/or services are used and how they perform in real-life settings, not merely in the lab environment. And anytime you start using data generated by life in the real world, you run into the data ownership problem.

What kind of data am I talking about? First and foremost, this involves data owned by your customers, but it can also include data owned by your customers’ customers, depending on which business you’re in.

You have to take the time to truly understand the detailed context of the data you need. It can relate to issues of data privacy, but it doesn’t have to. It can simply be the case that the data you need in order to better understand your business performance or potential belongs to someone else.

Don’t get discouraged when it comes to ownership issues. Most situations can be solved from a legal perspective if you’re willing to address them openly with the data owners, explaining why you need the data and how you will treat it once it’s in your possession. It’s all about gaining trust with regard to how, and for what purpose, the data will be used. (It wouldn’t hurt to also spell out how your work may, if possible, contribute back to the owners of the data.)

At the end of the day, you need to be absolutely certain that you understand (and are complying with) the legal constraints that apply to each type of data you intend to use. Your use of the data must also be regulated by way of a contractual setup with the party owning the data, including what rights your company has related to data access, storage, and usage over time.

Laws and regulations have a habit of changing over time. Lately, the trend is to increase restrictions even further in order to protect an individual’s right to their own data. One recent example is the quite restrictive General Data Protection Regulation (GDPR) enacted by the European Union (EU), which went into effect in May 2018. Given recent news of the misuse of data by entities such as Cambridge Analytica and Facebook, the U.S. and Canada are definitely looking into legislation similar to the EU’s GDPR. Anything that helps to protect an individual’s right to privacy is all for the best, but just remember that the way you deal with privacy legislation today will most probably be quite different in the near future.
Therefore, you should strategically and proactively think through your infrastructure setup and your data needs to ensure that you account for these types of constraints in your current and evolving data science environment.

Managing data transfer and computation across country borders

If your company has divisions in a number of different countries or does business (and therefore has many customers) in many countries, one major challenge you might face is how to manage data that needs to cross international borders. You need to carefully consider a number of different aspects of the data puzzle if your company has an international component. Here’s a list of the major concerns:

» Legality: Legal constraints on moving data across borders are a consideration that a company must stay on top of. Laws and regulations differ from country to country, so different solutions may be possible, depending on which country you’re doing business in. The restrictions also differ depending on which type of data you’re moving out of a country. Data with personal information is usually much more difficult to move than non-sensitive data. Breaking laws related to data transfer can be quite costly and can severely damage the company brand if it is determined that you violated customer trust.

» Data transfer approach: This refers to how you actually execute the data transfer. It’s typically quite costly, and the cost differs from country to country. Depending on the volume of data you want transferred and the data transfer frequency, you can either rent space in existing connectivity infrastructures and data links or, if you cannot get your requirements met regarding aspects such as capacity, security, or exclusivity, invest in your own links.

» Possibilities for local computation and storage: If you can store the data and carry out the analysis in the country where the data has been captured, you might be able to lower the cost and increase the speed of delivery. However, to get this setup to work efficiently, you need to properly think through what your distributed computational architecture will look like. What will be done where? Where will the source data be kept? Will there be a central point of data storage and global analysis, or will there be only distributed setups? How you answer these questions depends a lot on what type of business is being conducted and what the setup looks like in different countries.
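
One way to picture the "what will be done where?" question is a setup in which raw records stay in the country where they were captured and only small, non-personal summaries travel to a central point for global analysis. The Python sketch below illustrates that split under stated assumptions: the `Record` fields, the `LocalSummary` type, and both function names are hypothetical and do not come from the text. Whether such summaries may legally leave a given country is a separate question that the code does not settle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical raw record; it stays in the country of capture."""
    country: str          # country where the record was captured
    customer_id: str      # personal identifier, never sent abroad
    usage_minutes: float  # illustrative measurement


@dataclass(frozen=True)
class LocalSummary:
    """Hypothetical non-personal summary that is sent to the central site."""
    country: str
    record_count: int
    total_usage_minutes: float


def summarize_locally(country: str, records: list[Record]) -> LocalSummary:
    """Runs inside one country: reduce its raw records to a summary."""
    local = [r for r in records if r.country == country]
    return LocalSummary(
        country=country,
        record_count=len(local),
        total_usage_minutes=sum(r.usage_minutes for r in local),
    )


def combine_globally(summaries: list[LocalSummary]) -> float:
    """Runs centrally: global mean usage computed from summaries only."""
    total_records = sum(s.record_count for s in summaries)
    if total_records == 0:
        raise ValueError("no records to aggregate")
    return sum(s.total_usage_minutes for s in summaries) / total_records
```

In this sketch, customer identifiers never appear in `LocalSummary`, so the central site sees only counts and totals. A fully centralized setup would instead ship every `Record` across the border, which is the case where the legality and data transfer concerns listed above weigh most heavily.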