Although ordinary least-squares regression is often used, it is not appropriate for all types of data. Using the airquality data set, I try to find a generalized linear model that fits the data better. For this purpose, I use the following methods: weighted regression, Poisson regression, and imputation.
All posts with the R tag deal with applications of the statistical programming language R in the data science setting.
Posts about R
Linear machine learning models are very convenient for interpretation. This post discusses the following aspects: residuals, coefficients, standard errors, p-values, the F-statistic, and much more.
Box plots are limited since they only show Q1, Q2, and Q3. Box plot alternatives such as the beeswarm and violin plot, however, provide more information about the overall distribution of the data.
Line plots are ideally suited for visualizing time series data. Using some stock market data, I demonstrate how line plots can be generated using native R, the MTS package, and ggplot.
Bar plots are frequently used due to their simplicity. However, they do not convey much information. Here, I discuss how error bars can be used to visualize variance and under which circumstances bar charts should not be used.
Box plots are ideal for showing the variation of measurements because they visualize not only the first, second, and third quartiles but also outliers.
R was made for statistical computations, so it is very easy to work with distributions in R. Since there are multiple functions for each distribution, I exemplify their application using the normal distribution.
Scatter plots are a great tool for learning about individual data points. Here, I demonstrate the use of scatter plots for visualizing the correlation between two variables.
Histograms are an ideal tool for visualizing the distribution of a variable and are frequently used for data exploration. Here, I show how a histogram can aid in differentiating two distributions.
If your data follows a normal distribution, using the mean is fine. But what should you do in other cases? Here, I explore the implications of using one or the other measure.