Need a holiday from data science? Then this page is for you because this category encompasses all the posts that are not directly associated with data science. Until now, these posts have mostly dealt with blogging with Hugo but let’s see what the future brings. Anyway, I don’t plan to stray too far away from the intended focus of the blog, so there should never be too many posts under this category.
Machine learning is a field of artificial intelligence (AI) that is concerned with learning from data. Machine learning has three main branches:
Supervised learning: Fitting predictive models using data for which outcomes are available.
Unsupervised learning: Transforming and partitioning data where outcomes are not available.
Reinforcement learning: Online learning through interaction with an environment, often one in which not all events are observable. Reinforcement learning is frequently applied in robotics.

Posts on machine learning
In the following posts, machine learning is applied to solve problems using R.
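To make the difference between supervised and unsupervised learning concrete, here is a minimal sketch in base R. It is only an illustration and is not taken from any of the posts: the built-in data sets `mtcars` and `iris`, the choice of a linear model and k-means, and the hypothetical `new_car` are all assumptions of mine.

```r
# Supervised learning: the outcome (mpg) is available, so we fit a predictive model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3.0, hp = 120)  # hypothetical car, for illustration only
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)

# Unsupervised learning: ignore the species labels and partition the measurements.
set.seed(1)                     # k-means starts from random centers
features <- scale(iris[, 1:4])  # standardize the four numeric columns
clusters <- kmeans(features, centers = 3, nstart = 10)

# Compare the clusters with the (unused) labels, purely as a sanity check.
print(table(cluster = clusters$cluster, species = iris$Species))
```

In the supervised case the model is judged by how well it predicts a known outcome. In the unsupervised case no outcome was used for fitting, so the cross-table is an after-the-fact check only.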
Humans are visual creatures. Thus, visualization is one of the most important tools for conveying information, and data scientists should be adept at selecting appropriate visualizations.
Which plot is appropriate?
Choosing an appropriate plot for a given set of data can be hard because there are so many types of plots, such as scatter plots, box plots, and histograms. Fortunately, I have created an overview of the most important plots, when they are appropriate, and how they can be used in R.
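As a quick taste of the three plot types mentioned above, the following base R sketch draws each of them. The built-in `mtcars` data set and the chosen variables are assumptions made for illustration; the overview itself may use other data or plotting packages.

```r
# Scatter plot: relationship between two continuous variables.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter plot")

# Box plot: distribution of a continuous variable within each group.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "Miles per gallon",
              main = "Box plot")

# Histogram: distribution of a single continuous variable.
h <- hist(mtcars$mpg, breaks = 10,
          xlab = "Miles per gallon", main = "Histogram")
```

As a rule of thumb, a scatter plot suits two continuous variables, a box plot suits a continuous variable compared across groups, and a histogram suits the distribution of a single variable.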
As a data scientist, it is important to have a deep understanding of statistics. Here, I introduce basic statistical concepts and quantities.
Types of measurements and variables
Important statistical concepts include the following:
Types of measurement scales
Nomenclature for variables: dependent vs independent variables

Statistical quantities
You should definitely know about the following frequently used statistical quantities:
Centrality measures: mean, median, and mode
Measures of dispersion: standard deviation, variance, covariance, and interquartile range
Interval estimates: confidence intervals

Probability distributions
Commonly occurring probability distributions are:
Using statistical tests, it is possible to make a statement about the significance of a set of measurements by calculating a test statistic. If, assuming that the null hypothesis is true, it is unlikely to obtain a test statistic at least as extreme as the observed value, then the result is significant. For example, at a significance level of 5%, the probability of a false positive test result is bounded by 5%, provided that the assumptions of the test hold.
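The following base R sketch illustrates both ideas: comparing a p-value with the significance level, and checking the false positive rate by simulation. The choice of Welch's t-test, the built-in `mtcars` data set, and the simulation settings (2000 repetitions, samples of size 20) are my own assumptions for illustration.

```r
alpha <- 0.05  # significance level

# Welch two-sample t-test: does the mean mpg differ between automatic (am = 0)
# and manual (am = 1) cars?
result <- t.test(mpg ~ am, data = mtcars)
print(result$statistic)  # the test statistic
print(result$p.value)    # probability, under the null hypothesis, of a
                         # statistic at least as extreme as the observed one
is_significant <- result$p.value < alpha
print(is_significant)

# Simulation: when the null hypothesis is true (the mean really is 0),
# roughly a fraction alpha of the tests should still come out significant.
set.seed(42)
p_values <- replicate(2000, t.test(rnorm(20), mu = 0)$p.value)
false_positive_rate <- mean(p_values < alpha)
print(false_positive_rate)
```

The simulated rate lands close to 5% only because the data really are normally distributed, which is what the t-test assumes. If the assumptions are violated, the bound need not hold.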
Parametric vs non-parametric tests
There is a multitude of tests for determining statistical significance.