Data Science Glossary

This data science glossary contains the most important terms, organized into the following categories:

Machine Learning Glossary

The machine learning glossary is structured into:

  • General terms
  • Reinforcement-learning terms
  • Supervised-learning terms
  • Unsupervised-learning terms

General Terms

  • Data Point
  • Data Wrangling
  • Machine Learning
  • Model Complexity
  • Model
  • Reinforcement Learning
  • Supervised Learning
  • Unsupervised Learning

Reinforcement Learning

  • Action
  • Agent
  • Environment
  • Observation
  • Policy
  • Reward
  • State

Supervised Learning

  • AUC
  • Categorical Outcome
  • Class
  • Classifier
  • Confusion Matrix
  • Cross-Validation
  • Dependent Variable
  • Estimate
  • False Positive Rate
  • Feature Engineering
  • Forecasting
  • Gold Standard
  • Ground Truth
  • Independent Variable
  • Inference
  • Interpretability
  • Label
  • Linear Model
  • Model Validation
  • Non-Linear Model
  • Observation
  • Outcome
  • Performance
  • Prediction
  • Quantitative Outcome
  • Regressor
  • Sensitivity
  • Specificity
  • Test Data
  • Training Data
  • True Positive Rate
  • Validation Data

Unsupervised Learning

  • Clustering
  • Dimensionality Reduction
  • k-means
  • PCA

Action

In reinforcement learning, agents try to perform actions that maximize the reward. Each action changes the learning environment and thus yields a new state.

Agent

In reinforcement learning, an agent is the learner that interacts with the environment. Based on a given state, the agent selects an appropriate action by considering previously earned rewards. The policy of an agent determines the action to be executed in each state.

AUC

AUC means area under the curve. When evaluating scoring classifiers, the term AUC usually refers to the ROC-AUC, i.e. the area under the ROC (receiver operating characteristic) curve. The ROC curve plots the true positive rate against the false positive rate for all possible cutoffs on the scores. When scores are available, the ROC-AUC is often preferable to threshold-dependent measures such as sensitivity and specificity because it summarizes performance across all cutoffs.
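
As a brief illustration, the ROC-AUC can be computed with scikit-learn's roc_auc_score; this is a minimal sketch assuming scikit-learn is installed, and the labels and scores are made-up values:

```python
# A minimal sketch of computing the ROC-AUC (made-up labels and scores).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]              # ground-truth class labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]  # classifier scores for the positive class

print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```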

Categorical Outcome

See Outcome.

Classifier

Classifiers (classification models) are used for the prediction of categorical outcomes. Classifiers that output quantitative scores rather than only hard class assignments are called scoring classifiers; their outputs are more informative than those of non-scoring classifiers.

Clustering

Clustering, one of the main applications of unsupervised learning, is used to assign each sample to a group of samples. These groups of samples are called clusters. Clustering can be used for the visual exploration of data or for the automated identification of outliers. One of the simplest and most well-known clustering algorithms is k-means.

Confusion Matrix

The confusion matrix is used to evaluate the predictive performance of a classifier. The name confusion matrix stems from the fact that the table illustrates which predictions are confused between the two classes. For binary classifiers, which differentiate between a positive (+1) and a negative (-1) class, the confusion matrix is a 2x2 table of the following form:

                     Ground Truth
  Predicted Class    +1    -1
  +1                 TP    FP
  -1                 FN    TN

The entries are defined as follows:

  • TP: The number of samples from the positive class that were correctly predicted as positive
  • FP: The number of samples from the negative class that were incorrectly predicted as positive
  • FN: The number of samples from the positive class that were incorrectly predicted as negative
  • TN: The number of samples from the negative class that were correctly predicted as negative

From the confusion matrix, one can determine performance metrics such as sensitivity, specificity, and the AUC.
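
As a brief sketch (assuming scikit-learn and made-up binary predictions, with class 1 as the positive class and class 0 as the negative class), the confusion matrix entries and the derived metrics can be computed as follows:

```python
# A minimal sketch of deriving metrics from a confusion matrix (made-up data).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# With the labels sorted as [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
fpr = fp / (fp + tn)          # false positive rate
print(tp, fp, fn, tn, sensitivity, specificity, fpr)
```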

Cross-Validation

Cross-validation is a strategy for evaluating the predictive performance of a model. In k-fold cross-validation, the data set is split into k folds such that each fold is used for testing once, while the remaining folds are used for training. Nested cross-validation adds an inner cross-validation loop that is used for model selection (e.g. hyperparameter tuning), while the outer loop is used for evaluating the selected model.
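
As a minimal sketch (assuming scikit-learn and one of its built-in toy data sets), 5-fold cross-validation could look as follows:

```python
# A minimal sketch of 5-fold cross-validation with a logistic regression classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Each of the 5 folds serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```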

Data Point

See Observation.

Data Wrangling

Data wrangling describes the unpopular task of transforming data into a machine-readable format. For example, data wrangling could entail transforming semi-structured data (e.g. from spreadsheets) to the CSV (comma-separated values) format. Data wrangling is often performed via automated scripts but may also involve manual steps. Note that data wrangling does not involve feature engineering.

Dependent Variable

See Outcome.

Dimensionality Reduction

In dimensionality reduction, data are projected to a low-dimensional subspace. This is either done in order to obtain better data visualizations or during feature engineering in the context of supervised learning. Dimensionality reduction techniques such as PCA are unsupervised methods.

Environment

In reinforcement learning, the environment determines the observable states and the actions that an agent can perform. A popular framework for specifying environments is OpenAI’s Gym.

Estimate

See Prediction.

False Positive Rate

Given a classifier, the false positive rate represents the ratio of false positive predictions among all samples from the negative class:

FPR = FP / (FP + TN)

See also Confusion Matrix.

Feature Matrix

See Features.

Features

Features are the independent variables in the supervised learning scenario. The columns of a feature matrix, \(X \in \mathbb{R}^{n \times p}\), represent the values of the p features. For example, to predict the weather, two possible features are the level of precipitation and the cloudiness.
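
As a minimal sketch (assuming NumPy and made-up weather measurements), a feature matrix with n = 4 observations and p = 2 features could look as follows:

```python
# A minimal sketch of a feature matrix: rows are observations, columns are features
# (precipitation level and cloudiness; all values are made up).
import numpy as np

X = np.array([
    [0.0, 0.1],   # dry, almost clear sky
    [2.5, 0.8],   # light rain, mostly cloudy
    [0.0, 0.3],
    [7.0, 1.0],   # heavy rain, overcast
])
print(X.shape)  # (4, 2), i.e. (n, p)
```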

Feature Engineering

Supervised learning aims at learning the general associations between features and outcomes. However, in their original form, the input data are often not well suited for this purpose. Feature engineering is concerned with transforming the data such that machine learning models can easily learn from them.

Forecasting

See Prediction.

Gold Standard

See Ground Truth.

Ground Truth

In order to perform supervised learning, it is necessary that the outcome for each data point is known. The measured outcome should reflect the ground truth. Otherwise, models are optimized with respect to the wrong values, a situation commonly summarized as garbage in, garbage out.

Independent Variable

See Features.

Inference

See Prediction.

Interpretability

Interpretability describes whether a model is able to produce results that humans can easily interpret. Interpretability is closely tied to model complexity (i.e. the effective number of model parameters). Simple models such as linear models have few parameters and can easily be understood and interpreted. Complex models such as deep neural networks have large numbers of parameters, which makes them hard to understand and interpret.

There are many application scenarios in which it is acceptable to sacrifice some predictive performance in favor of greater interpretability. This is because in machine learning applications such as decision support systems, it is key that human operators can understand the reasoning behind the model's predictions.

k-means

k-means is a simple yet powerful clustering algorithm that partitions the data into k clusters. After initializing k cluster centers, the algorithm alternates between assigning each sample to its nearest center and updating each center to the mean of its assigned samples. The algorithm terminates when the cluster centers have converged.
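
As a minimal sketch (assuming scikit-learn and NumPy, with randomly generated two-dimensional data), k-means could be applied as follows:

```python
# A minimal sketch of k-means clustering on two artificial blobs of points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),   # first blob
    rng.normal(loc=3.0, scale=0.5, size=(50, 2)),   # second blob
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the two identified cluster centers
print(kmeans.labels_[:5])       # cluster assignments of the first samples
```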

Label

In classification, labels are the values that are used to differentiate between individual classes. For example, one could use Sunny and Cloudy as labels for observations that have been made on sunny and cloudy days, respectively. However, many supervised learning algorithms require the labels to be encoded numerically, for example as +1 and -1.
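
As a brief sketch (assuming scikit-learn), string labels can be converted to a numeric encoding as follows; note that LabelEncoder assigns integers in alphabetical order (here 0 and 1) rather than +1 and -1:

```python
# A minimal sketch of numeric label encoding (the weather labels are made up).
from sklearn.preprocessing import LabelEncoder

labels = ["Sunny", "Cloudy", "Sunny", "Cloudy", "Sunny"]

encoder = LabelEncoder()
numeric_labels = encoder.fit_transform(labels)

print(numeric_labels)    # [1 0 1 0 1], with Cloudy -> 0 and Sunny -> 1
print(encoder.classes_)  # ['Cloudy' 'Sunny']
```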

Linear Model

See Model.

Machine Learning

Machine learning encompasses artificial intelligence approaches that are concerned with learning from data. There are three machine learning areas: supervised learning, unsupervised learning, and reinforcement learning.

Once a model has been fitted to the data, it is possible to make predictions given new data points (supervised learning), structure data (unsupervised learning), or select optimal actions in a dynamic environment (reinforcement learning).

Model

Models are the mathematical approximation of real-world phenomena. In supervised learning, models are constructed using pairs of input data and observed outcomes. In unsupervised learning, the outcomes are not available such that only the structure of the data is modeled. In reinforcement learning, models are constructed according to states, actions, and rewards.

Besides these machine learning approaches, which use optimization algorithms to fit models to data, there is a host of other models that are useful for specific tasks, for example, hidden Markov models, epidemiological models, and Bayesian models.

It is possible to differentiate between linear and non-linear models. While linear models assume a linear relationship between the features and the outcome, non-linear models can capture non-linear relationships.

One should always remember the following famous quote from the British statistician George E.P. Box:

All models are wrong but some are useful.

Model Complexity

Model complexity is defined by the effective number of parameters that make up a model. For example, deep learning models with many parameters are more complex than simple models, such as linear models. Complex models should be avoided when sufficient training data are not available.

Model Validation

Model validation entails the following steps:

  1. Fitting the model to a set of training data
  2. Tuning the hyperparameters of the model using a set of validation data
  3. Evaluating predictive performance on an independent test data set

The two most popular approaches for validation are:

  1. Splitting the data into a training, validation, and test set (a splitting sketch follows this list)
  2. Using cross-validation, in which the model is trained on various subsets of the data
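
As a minimal sketch of the first approach (assuming scikit-learn and one of its toy data sets; the 60/20/20 proportions are an arbitrary choice), the data can be split as follows:

```python
# A minimal sketch of splitting data into training, validation, and test sets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First split off the test set, then split the remainder into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 60% / 20% / 20%
```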

Non-Linear Model

See Model.

Outcome

In supervised learning, the outcome is a measurement of the ground truth. Principal types of outcomes are categorical outcomes (class labels) and quantitative outcomes. For example, when predicting the weather, Sunny and Cloudy would be categorical outcomes, while the amount of precipitation would be a quantitative outcome.

The underlying variable associated with the outcome is called the dependent variable.

Observation

In supervised learning, observations are the rows of the feature matrix. Observations are also called data points or samples. The number of observations is usually denoted by N.

For the use of the term observation in reinforcement learning, see State.

PCA

Principal component analysis (PCA) is a standard dimensionality reduction technique. It finds a projection onto orthogonal coordinates (the principal components) that retains as much of the variance in the data as possible.
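
As a minimal sketch (assuming scikit-learn and its built-in iris data set), a data set can be projected onto its first two principal components as follows:

```python
# A minimal sketch of PCA: projecting a toy data set onto two principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)

print(X_projected.shape)              # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance retained per component
```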

Policy

In reinforcement learning, the policy of an agent is a mapping from states to actions. This means that the policy defines the behavior of the agent in the environment. There are on-policy and off-policy reinforcement learning algorithms.

Performance

In supervised learning, predictive performance is the ability of a model to correctly predict the outcomes of previously unseen observations. To quantify predictive performance, metrics such as the AUC can be utilized.

Prediction

Prediction is the act of applying a model to a new data point in order to determine the estimated outcome. Inference is often used synonymously, although inference is geared towards learning about the data-generating process. Forecasting is a special form of prediction in which time series are used as the input.

The term estimate is a synonym for prediction that is popular in the statistical community because it underlines the fact that predictions are only approximations of reality.

Quantitative Outcome

See Outcome.

Regressor

Regressors (regression models) are used to predict quantitative outcomes. Compared to classifiers, they allow for more fine-grained predictions.
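
As a minimal sketch (using made-up data and scikit-learn's LinearRegression as one possible regressor), a regressor produces continuous predictions:

```python
# A minimal sketch of a regressor: a linear model fitted to made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # single feature
y = np.array([2.1, 3.9, 6.2, 8.1])          # quantitative outcome

model = LinearRegression().fit(X, y)
print(model.predict(np.array([[5.0]])))     # continuous, fine-grained prediction
```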

Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning in which one or multiple agents perform actions in an environment, after observing the state. Once an action has been performed, the agent receives a reward. By balancing exploration (finding novel states) and exploitation (reaping rewards), RL agents can learn an optimal policy, which identifies the best action to take for every state.

In recent years, reinforcement learning has gained in popularity due to the emergence of deep RL, in which deep neural networks are used to learn which states are associated with the greatest rewards.

Reward

In reinforcement learning, agents obtain rewards after performing an action. Agents adjust their policy in order to maximize the reward.

Sample

For its use in supervised learning, see Observation.

Sensitivity

The sensitivity of a classifier is defined by its true positive rate:

sensitivity = TPR = TP / (TP + FN)

See also Confusion Matrix.

Specificity

Specificity indicates the true negative rate of a classifier:

specificity = TN / (TN + FP) = 1 - FPR

Since specificity considers the number of false positives (FP), it allows for conclusions about the false positive rate (FPR). See also Confusion Matrix.

State

In reinforcement learning, the state indicates the observations that an agent has made at a given point in time. States are usually represented by numeric vectors or matrices. Crafting appropriate states is a form of feature engineering.

Supervised Learning

Supervised learning is an area of machine learning that is concerned with learning from pairs of input data and associated outcomes. Once a model has been trained on a set of training data, it is tuned using a validation data set, and, finally, evaluated on an independent test data set. The application of a supervised learning model on new data is called prediction, inference, or forecasting.

Models that are trained on labeled data (i.e. categorical outcomes) are called classifiers. Models that are trained on quantitative outcomes are called regressors.

Test Data

Test data refers to the set of data that is used for evaluating the predictive performance of a model.

Training Data

Training data refers to the set of data that is used for fitting a model.

True Positive Rate

See Sensitivity.

Unsupervised Learning

Unsupervised learning is an area of machine learning that is concerned with the identification of models that are capable of representing the properties of the data in a condensed manner, which allows for greater interpretability.

Evaluating the performance of unsupervised learning methods is more challenging than for supervised learning because there are no outcomes that provide the ground truth. Popular unsupervised methods include k-means and PCA.

Validation Data

Validation data refers to the set of data that is used for tuning the hyperparameters of a model.

Software Engineering Glossary

The software engineering glossary is structured into:

  • DevOps terms
  • General terms
  • Scrum terms
  • Testing terms

DevOps Terms

  • CI/CD
  • Continuous Delivery
  • Continuous Integration
  • Deployment
  • Infrastructure as Code

General Terms

  • Acceptance-Test-Driven Development
  • Behavior-Driven Development
  • Extreme Programming
  • Mob Programming
  • Pair Programming
  • Test-Driven Development

Scrum Terms

  • Backlog Item
  • Backlog
  • Daily
  • Development Team
  • Product Owner
  • Refinement
  • Scrum Master
  • Sprint Planning
  • Sprint Retrospective
  • Sprint Review

Testing Terms

  • End-to-End Test
  • Integration Test
  • Types of Tests
  • Unit Test


Acceptance-Test-Driven Development

Acceptance-test-driven development (ATDD) is based on the idea that automated acceptance tests should be specified before starting with the implementation of a new feature. Since the acceptance criteria typically reflect the requirements of the business stakeholders, these tests are typically formulated in such a way that they are understandable in layman’s terms, for example, using Cucumber.

Backlog Item

TODO

Behavior-Driven Development

Behavior-driven development (BDD) is related to acceptance-test-driven development (ATDD). BDD places a focus on the fact that the required behavior of the software is made explicit and is available in an easily understood manner, e.g. via Cucumber.

Backlog

The backlog is a collection of backlog items (e.g. features, bugs, enablers) that are planned to be implemented by the development team. The product backlog contains all the backlog items that are relevant for the product, while the sprint backlog contains only those backlog items that are relevant for the current sprint.

CI/CD

CI/CD is the shorthand for continuous integration and continuous delivery. CI/CD is realized through automated pipelines that trigger on code changes in the version control system. Such a pipeline implements CI through automated tests. If the tests are successful, CD is performed by deploying the software on a staging area or even on production instances.

Continuous Delivery

Continuous delivery (CD) refers to the ability to automatically deploy increments of software to a staging area. CD should not be confused with continuous deployment, which goes one step further by automatically deploying the software to the production instances.

Deployment

Deployment refers to the release of software in a non-local environment. Software typically passes through the following environments:

  1. Development environment (e.g. local machines of developers)
  2. Staging area (a server similar to the production environment)
  3. The production instances running the software that is available to the customers

In complex systems, multiple staging areas may be used.

Development Team

The development team engineers the software according to the product backlog.

End-to-End Test

End-to-end tests validate the functionality of full application workflows from a user perspective.

For example, an end-to-end test for an e-commerce business could consist of the following steps:

  1. User login
  2. Adding products to the shopping basket
  3. Ordering the selected items
  4. Receiving the invoice through a confirmation email

Ideally, end-to-end tests are implemented using a behavior-driven development approach.

Extreme Programming

Extreme programming (XP) refers to a collection of software engineering practices, most notably pair programming. Since its inception, many of these practices have become an integral part of what is considered agile software development today.

Infrastructure as Code

Infrastructure as code (IaC) is a practice in which a technical infrastructure is defined in terms of code such that the infrastructure can be maintained (e.g. reconfigured) by executing the code instead of manual interventions.

Examples of IaC frameworks include Ansible, Chef, and Puppet.

Integration Test

An integration test validates functional correctness across multiple application modules. See Types of Tests for a listing of all types of tests.

Mob Programming

Mob programming extends the concept of pair programming from pairs of developers to the entire team. This means that the entire team collaboratively works on the same piece of code. This coding practice is particularly useful when implementing complex program logic that requires knowledge from various team members.

Pair Programming

Pair programming is an extreme programming technique in which two developers collaboratively work on the same piece of code. One developer assumes the role of the driver, while the other assumes the role of the navigator.

The developer acting as the driver writes the code, while the developer acting as the navigator guides the implementation efforts. Typically, the roles are switched regularly (e.g. every thirty minutes).

Product Owner

In Scrum, the product owner (PO) is responsible for maximizing the value that is delivered by the development team. To achieve this goal, the PO has to prioritize backlog items by considering the value they offer to the customer as well as the technical complexity of the implementation. The development team informs the PO about new software features in the sprint review meeting.

Scrum

TODO

Sprint

In Scrum, a sprint refers to the period of time after which a new increment of software is produced. The duration of a sprint defines the interval in which the majority of Scrum meetings take place (e.g. planning, review, retrospective) and determines the rhythm according to which software is developed. Typically, sprints have a length of two weeks.

Sprint Review

In Scrum, the sprint review is a meeting in which the development team presents its progress on the sprint backlog to the product owner.

Test-Driven Development

Test-driven development (TDD) is a development practice that ensures that a sufficient number of varied unit tests are implemented during the development of a new functionality. The following pattern can be used for TDD (a minimal code sketch follows the list):

  1. Write up the boilerplate for the functionality you want to test (e.g. class, function) but do not implement the functionality yet.
  2. Write a failing test.
  3. Adapt the implementation such that the test is passed successfully.
  4. Continue with Step 2. Terminate the TDD process if you cannot think of any additional tests that would demonstrate a failure in your implementation.
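
Below is a minimal sketch of the code that could result from applying this pattern with pytest; the leap-year function and its tests are hypothetical examples, not part of the glossary:

```python
# A minimal, hypothetical TDD example using pytest.
# Steps 1-2: the function stub and the failing tests were written first;
# Step 3: the implementation below was refined until all tests passed.

def is_leap_year(year: int) -> bool:
    """Return True if the given year is a leap year in the Gregorian calendar."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def test_year_divisible_by_four_is_leap_year():
    assert is_leap_year(2024)


def test_century_is_not_a_leap_year():
    assert not is_leap_year(1900)


def test_year_divisible_by_four_hundred_is_leap_year():
    assert is_leap_year(2000)
```

Running pytest on a file containing this code executes the three tests and reports any failures.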

Types of Tests

Software can be tested using the following types of tests:

  • Unit tests
  • Integration tests
  • End-to-end tests

These types of tests form the test pyramid: most application test suites are made up of a large number of unit tests, a smaller number of integration tests, and just a few end-to-end tests.

Unit Test

Unit tests validate the functionality of a unit of application code. The term unit is not formally defined and could be a single, independent function, multiple functions, or a complete software module. A single unit test should test only a single type of functionality, if possible. This allows the precise identification of errors in the application. See Types of Tests for a listing of all types of tests.