Bar plots display quantities as the heights of bars. Since standard bar plots do not indicate the level of variation in the data, they are most appropriate for showing individual values (e.g. count data) rather than aggregates of several values (e.g. arithmetic means). Although variation can be shown through error bars, error bars based on the standard deviation or standard error are only appropriate if the data are roughly normally distributed.
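As a quick sketch of a bar plot of group means with error bars in base R (the data and group names below are simulated for the example):

```r
# Simulated measurements for two hypothetical groups A and B
set.seed(1)
values <- c(rnorm(10, mean = 5), rnorm(10, mean = 7))
groups <- rep(c("A", "B"), each = 10)
# Group means and standard errors
means <- tapply(values, groups, mean)
ses <- tapply(values, groups, function(x) sd(x) / sqrt(length(x)))
# Bar plot with error bars drawn as vertical arrows
mids <- barplot(means, ylim = c(0, max(means + ses) * 1.1))
arrows(mids, means - ses, mids, means + ses,
       angle = 90, code = 3, length = 0.05)
```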
The box plot is useful for comparing the quartiles of quantitative variables. More specifically, the lower and upper ends of the box (the hinges) are defined by the first quartile (Q1) and the third quartile (Q3), and the median (Q2) is shown as a horizontal line within the box. The whiskers, whose definition is implementation-dependent, indicate the range of the remaining data; points beyond the whiskers are drawn individually as outliers. For example, in geom_boxplot of ggplot2, the whiskers extend no further than 1.5 * IQR from the hinges, where IQR = Q3 - Q1 is the inter-quartile range.
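The 1.5 * IQR rule can be sketched by hand with simulated data (note that quantile() and the hinge positions computed by a given box plot implementation can differ slightly for small samples):

```r
set.seed(1)
x <- c(rnorm(50), 5)             # 50 draws plus one extreme value
q <- quantile(x, c(0.25, 0.75))  # Q1 and Q3
iqr <- q[2] - q[1]               # inter-quartile range
# Points beyond Q1 - 1.5 * IQR or Q3 + 1.5 * IQR are drawn as outliers
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
boxplot(x)  # the extreme value appears as a point beyond the whisker
```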
R is a great tool for working with distributions. However, one has to know which specific function is the right one for the task. Here, I’ll discuss the functions that are available for dealing with the normal distribution: dnorm, pnorm, qnorm, and rnorm.
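The four functions follow base R’s d/p/q/r naming convention; a quick sketch of each for the standard normal distribution:

```r
dnorm(0)       # density at x = 0: 1 / sqrt(2 * pi), about 0.399
pnorm(1.96)    # cumulative probability P(X <= 1.96), about 0.975
qnorm(0.975)   # quantile function, the inverse of pnorm: about 1.96
set.seed(1)
rnorm(3)       # three random draws from the standard normal
```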
The scatter plot is probably the simplest type of plot there is: it does nothing more than show individual measurements as points. It is particularly useful for investigating whether two variables are associated.
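For example, using the built-in iris data set:

```r
# Scatter plot of two iris measurements (base R)
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal length", ylab = "Petal length")
# A correlation coefficient quantifies the association visible in the plot
cor(iris$Sepal.Length, iris$Petal.Length)
```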
It is always useful to spend some time exploring a new data set before processing and analyzing it further. One of the most convenient ways to get a feel for the data is to plot a histogram. The histogram visualizes the frequency of measurements by dividing them into bins and drawing a bar for each bin. Here we’ll take a closer look at how the histogram can be used in R.
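A minimal example with simulated data (note that the breaks argument is only a suggestion; hist may adjust the actual number of bins to obtain pretty cut points):

```r
set.seed(1)
x <- rnorm(1000)
h <- hist(x, breaks = 30)  # draw the histogram and keep the bin data
h$counts                   # the frequency in each bin
```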
Two of the most commonly used statistical measures are the mean and the median. Both measures indicate the central value of a distribution, that is, the value at which one would expect the majority of data points to lie. In many applications, however, it is useful to think about which of the two measures is more appropriate given the data at hand. In this post, we’ll investigate the differences between both quantities and give recommendations for when one should be preferred over the other.
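A small example showing why the choice matters: the mean is sensitive to outliers, while the median is robust:

```r
x <- c(1, 2, 3, 4, 100)  # four typical values and one outlier
mean(x)    # 22: pulled far toward the outlier
median(x)  # 3: unaffected by the outlier's magnitude
```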
The means of quantitative measurements from two groups can be compared using Student’s t-test. To compare the means of measurements for more than two levels of a categorical variable, one-way ANOVA can be used. Here, we’ll explore the parametric one-way ANOVA as well as its non-parametric counterpart, the Kruskal-Wallis test, which operates on ranks and is often interpreted as comparing medians.
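As a sketch, both tests can be run on the built-in PlantGrowth data set, where weight was measured for three levels of group:

```r
# Parametric: one-way ANOVA comparing the group means
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)
# Non-parametric alternative based on ranks
kruskal.test(weight ~ group, data = PlantGrowth)
```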
When planning statistical tests, it is important to think about the consequences of type 1 and type 2 errors. Typically, type 1 errors are considered to be the worse type of error. While the rate of type 1 errors is limited by the significance level, the rate of type 2 errors depends on the statistical power of the test. Here, we discuss how the null hypothesis should be chosen and how the two types of errors are related.
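The trade-off between the two error types can be explored with power.t.test from base R, which relates sample size, effect size, significance level, and power (1 minus the type 2 error rate); the numbers below are purely illustrative:

```r
# Power of a two-sample t-test for a hypothetical effect of one standard deviation
p05 <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05)$power
p01 <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.01)$power
# A stricter significance level (fewer type 1 errors) lowers the power,
# i.e. it raises the type 2 error rate
c(p05, p01)
```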
So, you performed a test for significance and obtained a positive result. That’s great, but it’s not time to celebrate yet. You may ask: why not? Isn’t a significant test result sufficient to show the existence of an effect? It is not, for two reasons. First, a significant result only indicates, but does not prove, the existence of an effect: at a significance level of 5%, an exact test will yield a false positive result in 5% of the cases in which the null hypothesis is true. Second, a significant result does not necessarily make a statement about the magnitude of the effect. In this post, we’ll investigate the difference between statistical significance and the effect size, which describes the magnitude of an effect.
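To make the distinction concrete, here is a sketch with simulated data in which a tiny effect tends to become statistically significant simply because the sample is large; Cohen’s d (computed by hand here to avoid extra packages) reveals that the effect is nonetheless negligible:

```r
set.seed(1)
a <- rnorm(1000, mean = 0)
b <- rnorm(1000, mean = 0.1)  # true effect of only 0.1 standard deviations
t.test(a, b)$p.value          # may well fall below 0.05 at this sample size
# Cohen's d: the standardized mean difference
pooled_sd <- sqrt((var(a) + var(b)) / 2)
d <- (mean(b) - mean(a)) / pooled_sd
d  # close to 0.1, a negligible effect by common conventions
```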