Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms, both structured and unstructured. It is closely related to data mining.
Data science integrates statistics, data analysis, machine learning, and related methods to understand and analyze real-world phenomena using data. It draws techniques and theories from multiple fields, including mathematics, statistics, information science, and computer science.
Turing Award winner Jim Gray described data science as the "fourth paradigm" of science, following the empirical, theoretical, and computational paradigms: data-driven science.
He further stated that "everything about science is changing due to the impact of information technology" and the emergence of big data.
The term "data science" became a buzzword in 2012, when Harvard Business Review called the data scientist role "The Sexiest Job of the 21st Century." It is often used interchangeably with concepts such as business analytics, business intelligence, predictive modeling, and statistics. Hans Rosling had made a similar claim for statistics in a 2011 BBC documentary, stating, "Statistics is now the sexiest subject around." Nate Silver described data science as a "sexed-up term for statistics." The frequent rebranding of traditional analytics as "data science" has led to some dilution of the term's meaning.
Many universities now offer degrees in data science, yet there is no clear consensus on its precise definition or curriculum. Many data science and big data projects also fail because of poor management and inefficient use of resources.
History
The term "data science" has appeared in various contexts since the 1960s but gained widespread recognition only recently. Peter Naur first used it in the 1960s as a synonym for computer science and later proposed the term "datalogy." In 1974, Naur published Concise Survey of Computer Methods, which used "data science" to describe contemporary data processing techniques across numerous applications.
In 1996, members of the International Federation of Classification Societies (IFCS) met in Kobe for a biennial conference, where "data science" appeared in a conference title for the first time: "Data Science, Classification, and Related Methods." Chikio Hayashi introduced the term in a roundtable discussion.
In November 1997, C. F. Jeff Wu delivered an inaugural lecture titled "Statistics = Data Science?" for the H. C. Carver Professorship at the University of Michigan. He characterized statistics as a trilogy of:
Data collection
Data modeling & analysis
Decision making
Wu proposed renaming statistics as "data science" and statisticians as "data scientists."
In 2001, William S. Cleveland introduced data science as an independent discipline in "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," published in the International Statistical Review. He identified six technical areas crucial for data science:
Multidisciplinary investigation
Models and methods for data
Computing with data
Pedagogy
Tool evaluation
Theory
In 2002, the International Council for Science (ICSU) launched the Data Science Journal, focusing on data system descriptions, online publications, and related legal issues. In January 2003, Columbia University began publishing The Journal of Data Science, providing a platform for data professionals to exchange ideas. The National Science Board (2005) defined data scientists as professionals responsible for managing digital data collections, including "information scientists, database programmers, domain experts, curators, librarians, and archivists."
Around 2007, Jim Gray envisioned "data-driven science" as the fourth paradigm of knowledge, emphasizing large-scale computational data analysis as a primary scientific method. He predicted a future where "all scientific literature and data would be online and interoperable."
In 2012, the Harvard Business Review article "Data Scientist: The Sexiest Job of the 21st Century" credited DJ Patil and Jeff Hammerbacher with coining the term "data scientist" in 2008 to describe their roles at LinkedIn and Facebook. They described data scientists as a "new breed," highlighting a growing shortage of skilled professionals in the industry.
In 2013, the IEEE Task Force on Data Science and Advanced Analytics was launched. In the same year, the first European Conference on Data Analysis (ECDA) was held in Luxembourg, leading to the formation of the European Association for Data Science (EuADS). The first IEEE International Conference on Data Science and Advanced Analytics was held in 2014.
By 2014, data science bootcamps such as General Assembly and The Data Incubator had emerged, offering intensive training. The American Statistical Association (ASA) renamed its journal "Statistical Analysis and Data Mining: The ASA Data Science Journal" in 2014 and renamed its Statistical Learning and Data Mining section "Statistical Learning and Data Science" in 2016.
In 2015, Springer launched the International Journal on Data Science and Analytics to publish original research on data science and big data analytics. At the third ECDA conference in 2015, the Gesellschaft für Klassifikation (GfKl) added "Data Science Society" to its name.
Relationship with Statistics
The popularity of "data science" has skyrocketed in business and academia, reflected in the growing demand for professionals. However, some academics and journalists argue that data science is simply a rebranded term for statistics. Forbes writer Gil Press dismissed data science as a buzzword, while Nate Silver described data scientists as "sexed-up statisticians."
Despite the criticism, others argue that data science is evolving similarly to computer science, which was initially an interdisciplinary field before becoming a recognized discipline. NYU Stern professor Vasant Dhar emphasized that data science differs from traditional data analysis by focusing on actionable patterns for predictive modeling rather than just explaining datasets.
In 2015, Stanford professor David Donoho challenged three misconceptions:
Data science is not just about big data; scale is not a defining characteristic.
Data science is not just about computing skills; analytics is used across many disciplines.
Data science is more than applied statistics; modern academic programs need broader scopes.
Donoho, along with John Chambers and William Cleveland, advocated for an inclusive learning approach in data science, prioritizing predictive modeling over traditional statistical explanations.
Future advances in data science are expected to support open science, in which research data is freely available to all researchers. Organizations such as the US National Institutes of Health (NIH) aim to improve the transparency and reproducibility of research data.
See Also
Data Science: Machine Learning MindMap
Data Science: Components & Tools
Data Science: Algorithm Cheat Sheet