Variables can be classified by the values they take as well as by their role. Depending on their values, variables are categorized as quantitative, categorical, or ordinal. Moreover, when variables are used in statistical models, additional terms indicate their role, such as dependent, independent, and confounding variables.
As a data scientist, it is important to have a deep understanding of statistics. Here, I introduce basic statistical concepts and quantities.
Types of measurements and variables
Important statistical concepts include the following:
- Types of measurement scales
- Nomenclature for variables: dependent vs independent variables
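To make the measurement scales and variable roles more concrete, here is a minimal sketch in R. The data frame `patients` and all of its columns are hypothetical and only serve as an illustration.

```r
# Hypothetical example data: names and values are made up for illustration
patients <- data.frame(
  weight_kg  = c(62.5, 80.1, 71.3, 90.4, 68.0, 77.2),     # quantitative
  blood_type = factor(c("A", "B", "O", "A", "AB", "O")),   # categorical (nominal)
  pain       = factor(c("low", "high", "medium", "high", "low", "medium"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE),                      # ordinal
  dose_mg    = c(10, 20, 15, 25, 10, 20)                   # quantitative
)

# Inspect how R stores each variable
str(patients)

# Ordered factors support comparisons between levels
above_low <- patients$pain > "low"
above_low

# Roles in a model: weight_kg is the dependent variable (left of ~),
# dose_mg is the independent variable (right of ~)
fit <- lm(weight_kg ~ dose_mg, data = patients)
coef(fit)
```

Storing ordinal data as an ordered factor keeps the ranking of the levels, which a plain character vector or an unordered factor would lose.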
You should definitely know about the following frequently used statistical quantities:
- Measures of central tendency: mean, median, and mode
- Measures of dispersion: standard deviation, variance, covariance (for pairs of variables), and interquartile range
- Interval estimates: confidence intervals
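The quantities listed above can all be computed with base R. The following is a small sketch on made-up vectors `x` and `y`; the helper `stat_mode` is not part of R and is defined here only for illustration.

```r
# Hypothetical sample: values are made up for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)
y <- c(1, 3, 2, 5, 6, 8, 10)

mean(x)     # arithmetic mean: 6
median(x)   # middle value of the sorted data: 5

# Base R has no function for the statistical mode (mode() returns the
# storage type), so a small helper is defined here
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)  # most frequent value: 4

var(x)      # sample variance (divides by n - 1): 10
sd(x)       # sample standard deviation, equal to sqrt(var(x))
cov(x, y)   # sample covariance of two variables
IQR(x)      # interquartile range (75th minus 25th percentile): 4

# 95% confidence interval for the mean, based on the t-distribution
ci <- t.test(x)$conf.int
ci
```

Note that `var` and `sd` return the sample versions of these quantities, that is, they divide by n - 1 rather than n.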
Commonly occurring probability distributions are:
- Uniform distribution: all values are equally likely
- Normal distribution: a bell-shaped curve, typical for many population characteristics (e.g. IQs, heights)
- Poisson distribution: a discrete distribution over the non-negative integers that is well suited for count data
- Exponential distribution: a right-skewed continuous distribution, commonly used for waiting times between events
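As a rough illustration, each of these distributions can be simulated in R with the corresponding `r*` function. The parameter values below are arbitrary choices for this sketch, not recommendations.

```r
set.seed(1)  # fixed seed so the simulated values are reproducible
n <- 10000

u <- runif(n, min = 0, max = 1)     # uniform on [0, 1]
z <- rnorm(n, mean = 100, sd = 15)  # normal, parameters resembling an IQ scale
k <- rpois(n, lambda = 3)           # Poisson with rate 3
w <- rexp(n, rate = 0.5)            # exponential with rate 0.5 (mean 2)

# Sample means should lie close to the theoretical means 0.5, 100, 3, and 2
c(uniform = mean(u), normal = mean(z), poisson = mean(k), exponential = mean(w))

# For a Poisson distribution, mean and variance are both equal to lambda
var(k)

# Poisson draws are non-negative integers
all(k >= 0 & k == round(k))
```

Plotting the samples, for example with `hist(w)`, makes the difference in shape between the distributions visible.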
Posts on basic statistics
You can find explanations of basic statistical concepts and their use in R in the following posts.
R was made for statistical computing, so working with distributions is straightforward. Because each distribution comes with several functions, I demonstrate their use with the normal distribution.
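As a brief sketch of what these functions look like: for the normal distribution, R provides `dnorm` (density), `pnorm` (cumulative probability), `qnorm` (quantiles), and `rnorm` (random draws). Other distributions follow the same prefix convention, for example `dpois` or `qexp`.

```r
# Density of the standard normal at 0, equal to 1 / sqrt(2 * pi)
d0 <- dnorm(0, mean = 0, sd = 1)
d0

# Cumulative probability P(X <= 1.96), approximately 0.975
p <- pnorm(1.96)
p

# Quantile function, the inverse of pnorm: approximately 1.96
q <- qnorm(0.975)
q

# Five random draws from the standard normal
set.seed(42)
draws <- rnorm(5, mean = 0, sd = 1)
draws

# Probability of falling within one standard deviation of the mean,
# approximately 0.683
within_one_sd <- pnorm(1) - pnorm(-1)
within_one_sd
```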
If your data follow a normal distribution, using the mean is fine. But what should you do in other cases? Here, I explore the implications of using the mean versus an alternative measure such as the median.
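A small sketch shows why the choice of measure matters for non-normal data. It assumes the median as the alternative to the mean, and the vector `incomes` is made up for illustration.

```r
# Hypothetical, right-skewed data (in thousands): values are made up
incomes <- c(28, 31, 33, 35, 36, 38, 40, 42, 45, 400)

mean(incomes)    # 72.8, pulled upwards by the single extreme value
median(incomes)  # 37, close to where most observations lie

# Removing the extreme value changes the mean strongly,
# but the median only slightly
mean(incomes[-10])
median(incomes[-10])  # 36
```

In this example, nine of the ten observations lie below the mean, so the median describes a typical observation better.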