Data science is an emerging field which leverages approaches from statistics, computer science, machine learning and business analysis. Many organizations are in the process of growing their data science capabilities. The following is a collection of useful resources and links. I am currently working on a comprehensive manuscript, which will be made available on this web page. For researchers leaving academia for a data science role, incubators and training fellowships like the one run by the ASI can be very helpful.
Statistics is the basis for data science work. It is about finding and describing structure in data, identifying relationships and providing inference. Data scientists should be familiar with both the Bayesian and frequentist points of view (e.g., here), as well as basic techniques of hypothesis testing, A/B testing and statistical inference. We will provide a few more links in the future.
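As a small illustration of A/B testing, here is a sketch of a two-sided z-test for a difference in conversion rates, using only the Python standard library; the conversion counts and sample sizes are made-up numbers for illustration:

```python
from math import sqrt, erf

def ab_test_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis of no difference
    p = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical experiment: variant B converts at 12% vs 10% for A,
# with 5000 users in each group
p_value = ab_test_pvalue(500, 5000, 600, 5000)
print(p_value)
```

With these numbers the p-value comes out well below 0.05, so under the usual frequentist convention we would reject the null hypothesis that both variants convert equally well.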
Machine learning (ML) is a branch of computer science in which an algorithm is developed by training it on available data, rather than by hard coding rules. Such models can then be used to make predictions for new, previously unseen data. This is in contrast to conventional algorithms, which carry out tasks deterministically based on predefined rules.
If you are working in the Python ecosystem, then machine learning is made easy by the extensive scikit-learn library. It contains many instructive examples, and its implementations of common ML algorithms can be readily used. One usually distinguishes between:
- Supervised learning
- Unsupervised learning
In supervised learning we deal with data which we already understand to a certain extent, e.g. images labelled according to whether or not they show an animal, and we want to predict labels for additional unlabelled data. We then 'supervise' or train a learning algorithm to learn the labels, so that it can be applied in other contexts. We usually distinguish regression problems, where we predict a continuous variable, from classification problems, where we predict a label.
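The supervised workflow can be sketched with scikit-learn in a few lines. The choice of the iris dataset and of a random forest classifier here is purely illustrative; any labelled dataset and any scikit-learn classifier would follow the same fit/predict pattern:

```python
# A minimal supervised-learning sketch: train a classifier on
# labelled examples, then evaluate it on held-out, unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # features and known labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)             # 'supervise' the model with labels
accuracy = clf.score(X_test, y_test)  # accuracy on unseen examples
print(accuracy)
```

Holding out a test set is essential: evaluating on the training data would overstate how well the model generalises.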
In unsupervised learning we want to learn the structure of the data. One of the most common tasks is cluster analysis, i.e., finding out whether the data naturally separates into clusters, which are for instance identified by regions of higher density. When working with high dimensional data, techniques of dimensionality reduction are very useful. One standard technique is principal component analysis (PCA), with a good explicit introduction here.
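To make the PCA idea concrete, here is a sketch implementing it directly via an eigendecomposition of the covariance matrix, on synthetic data whose shape and variances are chosen purely for illustration (in practice one would typically use scikit-learn's PCA class instead):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data in 5 dimensions, with the first axis given
# much higher variance so that PCA has a direction to find
data = rng.normal(size=(200, 5))
data[:, 0] *= 5

# PCA: centre the data, then eigendecompose its covariance matrix
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort components by variance
components = eigvecs[:, order[:2]]      # keep the top two components

reduced = centred @ components          # project down to 2 dimensions
print(reduced.shape)
```

The projected data keeps the directions of largest variance: here the first component captures far more variance than the second, which is exactly the property that makes PCA useful for visualising high dimensional data.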
A particular ML technique is that of artificial neural networks. In recent years quite spectacular successes have been achieved with neural networks. Training a network with many hidden layers is termed deep learning. An excellent introductory online book about deep learning is Michael Nielsen's Neural Networks and Deep Learning. A state-of-the-art open source library for deep learning is Google's TensorFlow. A good high-level interface is Keras.
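To illustrate what a 'hidden layer' is, here is a sketch of a forward pass through a tiny network with one hidden layer, written in plain numpy; the layer sizes and random weights are arbitrary choices for illustration (a real network would learn its weights by training, e.g. with TensorFlow or Keras):

```python
import numpy as np

def relu(x):
    """Rectified linear unit, a common hidden-layer activation."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)
# Random (untrained) weights: 4 inputs -> 8 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def forward(x):
    hidden = relu(x @ W1 + b1)  # the hidden layer: affine map + nonlinearity
    logits = hidden @ W2 + b2
    # Softmax turns the raw outputs into class probabilities
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = forward(rng.normal(size=4))
print(probs)  # two probabilities summing to 1
```

'Deep' learning simply stacks many such hidden layers, and the nonlinearity between layers is what lets the network represent functions that a single linear map cannot.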
One primary language for data science work is Python, which has a large open source development community behind it. Jupyter or Zeppelin notebooks are convenient tools which run interactively in a browser and can be used for visualisation and data exploration. For statistical analysis, R is also very mature and powerful. RStudio is a nice development environment. Scala is also increasingly used in data science and engineering. We plan to provide a few more links to important concepts here.
In many settings data science is accompanied by data engineering. Whereas the former is more concerned with data exploration and predictive analytics, the latter provides the infrastructure and makes data applications scalable and robust. Data engineering also plays an important part in distributed infrastructures (e.g., Hadoop, Spark) and in the right choice of databases (e.g., MySQL, MongoDB, Cassandra, Neo4j).
There are many interesting publicly available data sets to be found on the internet. We will provide a few links: