
Statistical Analysis in R

Updated: May 18, 2026

# CHAPTER 17

1. Chapter Introduction

R was built by statisticians for statistical computing, and its base installation and package ecosystem cover a very wide range of statistical methods. This chapter covers descriptive statistics: measures of central tendency, measures of dispersion, and distribution shape.

2. Measures of Central Tendency

```r
# Dataset: employee salaries
salaries <- c(45000, 52000, 58000, 62000, 68000, 72000, 75000,
              78000, 85000, 92000, 105000, 125000, 180000)

# Mean (arithmetic)
mean(salaries)          # 84384.62, pulled upward by the 180000 salary

# Median (middle value)
median(salaries)        # 75000, resistant to outliers

# Mode: base R's mode() returns an object's storage mode, not the
# most frequent value, so define a helper:
mode_val <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]  # on ties, the first value seen wins
}
scores <- c(85, 90, 90, 78, 85, 85, 92, 88)
mode_val(scores)        # 85 (appears 3 times)

# Geometric mean (growth rates, ratios)
returns <- c(1.10, 0.95, 1.15, 1.08, 0.98)  # Annual growth multipliers
geo_mean <- exp(mean(log(returns)))
cat("Geometric mean return:", round((geo_mean - 1) * 100, 2), "%\n")

# Trimmed mean (remove extreme values)
mean(salaries, trim = 0.1)  # Drops floor(0.1 * n) values from each end (here 1 of 13)
```
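
To see why the geometric mean suits growth multipliers, the sketch below compounds a hypothetical starting value of 1000 through the same `returns` vector, then compares the result with compounding the geometric mean and the arithmetic mean over the same number of years. The starting value and the variable names `start`, `actual_end`, `geo_end`, and `arith_end` are illustrative assumptions, not part of the chapter's dataset.

```r
# Annual growth multipliers from the example above
returns <- c(1.10, 0.95, 1.15, 1.08, 0.98)
n_years <- length(returns)
start   <- 1000  # hypothetical starting value (assumption)

geo_mean   <- exp(mean(log(returns)))
arith_mean <- mean(returns)

actual_end <- start * prod(returns)        # value after compounding each year's return
geo_end    <- start * geo_mean^n_years     # compounding the geometric mean
arith_end  <- start * arith_mean^n_years   # compounding the arithmetic mean

cat("Actual:    ", round(actual_end, 2), "\n")
cat("Geometric: ", round(geo_end, 2), "\n")
cat("Arithmetic:", round(arith_end, 2), "\n")
```

Compounding the geometric mean reproduces the actual end value, while the arithmetic mean overstates it whenever the yearly multipliers are not all equal.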

3. Measures of Dispersion

```r
# Range
range(salaries)         # 45000 180000
diff(range(salaries))   # 135000 (range width)

# Variance and standard deviation
var(salaries)    # Sample variance (n-1 denominator)
sd(salaries)     # Sample standard deviation
# Population versions (n denominator):
var_pop <- function(x) mean((x - mean(x))^2)
sd_pop  <- function(x) sqrt(var_pop(x))

# Coefficient of variation (relative dispersion)
cv <- sd(salaries) / mean(salaries) * 100
cat("CV:", round(cv, 1), "%\n")  # Higher CV = more spread relative to the mean

# Quantiles and IQR
quantile(salaries)        # 0%, 25%, 50%, 75%, 100%
quantile(salaries, 0.90)  # 90th percentile
IQR(salaries)             # Interquartile range (Q3 - Q1) = 30000

# Five-number summary (+ mean)
summary(salaries)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#   45000   62000   75000   84385   92000  180000

# Outlier detection using the IQR rule
Q1 <- quantile(salaries, 0.25)
Q3 <- quantile(salaries, 0.75)
iqr <- IQR(salaries)
lower_fence <- Q1 - 1.5 * iqr   # 17000
upper_fence <- Q3 + 1.5 * iqr   # 137000
outliers <- salaries[salaries < lower_fence | salaries > upper_fence]
cat("Outliers:", outliers, "\n")  # 180000
```
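
The fence calculation above can be wrapped in a small reusable function. The following is a minimal sketch: `iqr_outliers` and its multiplier argument `k` are hypothetical names introduced here for illustration. Base R's `boxplot.stats()` is shown for comparison; it applies a similar 1.5 × IQR rule but computes the quartiles as hinges, so its results may differ slightly from `quantile()` on other data.

```r
# Hypothetical helper: return values outside [Q1 - k*IQR, Q3 + k*IQR]
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                                # ignore missing values
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
  fence_width <- k * (q[2] - q[1])
  x[x < q[1] - fence_width | x > q[2] + fence_width]
}

salaries <- c(45000, 52000, 58000, 62000, 68000, 72000, 75000,
              78000, 85000, 92000, 105000, 125000, 180000)

iqr_outliers(salaries)          # default 1.5 * IQR rule
iqr_outliers(salaries, k = 3)   # wider fences flag only more extreme values
boxplot.stats(salaries)$out     # base R's boxplot rule, for comparison
```

Making `k` an argument keeps the conventional 1.5 default while letting you widen the fences when the data are expected to be heavy-tailed.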

4. Distribution Shape

```r
library(moments)  # For skewness and kurtosis; install.packages("moments") if missing

# Skewness: asymmetry of the distribution
skewness(salaries)   # Positive = right-skewed (long right tail)
# The salary data is right-skewed (a few very high earners)

# Kurtosis: tail heaviness
kurtosis(salaries)
# Normal distribution kurtosis = 3
# Excess kurtosis (kurtosis - 3) tells if tails are heavier/lighter than normal

# Normality test
shapiro.test(salaries)   # W statistic, p-value
# If p < 0.05: reject the hypothesis of normality.
# A larger p-value means no evidence against normality, not proof of it.

# Comprehensive statistics report
describe_stats <- function(x, label = "Variable") {
  x <- x[!is.na(x)]                      # drop NAs once so every statistic uses the same values
  shapiro_p <- shapiro.test(x)$p.value   # run the test once and reuse the p-value
  cat(sprintf("=== %s ===\n", label))
  cat(sprintf("N:        %d\n", length(x)))
  cat(sprintf("Mean:     %.2f\n", mean(x)))
  cat(sprintf("Median:   %.2f\n", median(x)))
  cat(sprintf("Std Dev:  %.2f\n", sd(x)))
  cat(sprintf("CV:       %.1f%%\n", sd(x) / mean(x) * 100))
  cat(sprintf("Min:      %.2f\n", min(x)))
  cat(sprintf("Max:      %.2f\n", max(x)))
  cat(sprintf("IQR:      %.2f\n", IQR(x)))
  cat(sprintf("Skewness: %.3f\n", moments::skewness(x)))
  cat(sprintf("Kurtosis: %.3f\n", moments::kurtosis(x)))
  cat(sprintf("Shapiro p: %.4f %s\n", shapiro_p,
              if (shapiro_p < 0.05) "(NOT normal)" else "(consistent with normal)"))
  cat("\n")
}

describe_stats(salaries, "Employee Salary")

# Multi-variable correlation summary
library(corrplot)  # Only needed for the plot; cor() itself is base R
cor_matrix <- cor(mtcars[, c("mpg", "hp", "wt", "disp")])
print(round(cor_matrix, 2))
corrplot(cor_matrix, method = "number")  # Draw the matrix with coefficients
```
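
If the moments package is not available, skewness and kurtosis can be computed from central moments in base R. The sketch below uses the plain moment-based definitions (third and fourth central moments divided by powers of the second, each with an n denominator), which to my understanding is the definition `moments::skewness()` and `moments::kurtosis()` use. The helper names `central_moment`, `skew_base`, and `kurt_base` are hypothetical.

```r
# k-th central moment with an n denominator (population-style moment)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment scaled by the second moment^(3/2)
skew_base <- function(x) {
  x <- x[!is.na(x)]
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Kurtosis (not excess): fourth central moment over squared second moment
kurt_base <- function(x) {
  x <- x[!is.na(x)]
  central_moment(x, 4) / central_moment(x, 2)^2
}

salaries <- c(45000, 52000, 58000, 62000, 68000, 72000, 75000,
              78000, 85000, 92000, 105000, 125000, 180000)

skew_base(salaries)        # positive for this right-skewed sample
kurt_base(salaries) - 3    # excess kurtosis relative to a normal distribution
```

Other software may report bias-corrected versions of these statistics, so small differences between tools on the same data are expected.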

5. Common Mistakes

  • Using mean for skewed data: The mean salary is $84,385 but the median is $75,000. For right-skewed data (income, prices), the median is a better "typical" value because it is not pulled toward extreme values.
  • Sample vs population variance: var() and sd() divide by n-1 (sample statistics). When the data are the entire population, divide by n instead, for example with var_pop <- function(x) mean((x - mean(x))^2).
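
Both mistakes can be checked numerically. The sketch below drops the single largest salary to show how much each measure of centre moves, then confirms that the population and sample variances differ only by the factor (n - 1) / n. The names `without_top`, `mean_shift`, and `median_shift` are illustrative assumptions.

```r
salaries <- c(45000, 52000, 58000, 62000, 68000, 72000, 75000,
              78000, 85000, 92000, 105000, 125000, 180000)
n <- length(salaries)

# Mistake 1: mean vs median on skewed data.
# Remove the largest salary and see how far each statistic moves.
without_top  <- salaries[salaries != max(salaries)]
mean_shift   <- mean(salaries)   - mean(without_top)
median_shift <- median(salaries) - median(without_top)
cat("Mean shift:  ", round(mean_shift, 2), "\n")
cat("Median shift:", median_shift, "\n")

# Mistake 2: sample vs population variance.
var_pop <- function(x) mean((x - mean(x))^2)
# The two estimates differ only by the factor (n - 1) / n
all.equal(var_pop(salaries), var(salaries) * (n - 1) / n)
```

The mean moves several times further than the median when one extreme value is removed, which is the sense in which the median is robust.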

6. MCQs

Question 1

Median is preferred over mean when?

Question 2

IQR(x) computes?

Question 3

Positive skewness indicates?

Question 4

sd() in R uses?

Question 5

shapiro.test() tests for?

Question 6

Coefficient of Variation (CV) measures?

Question 7

Outlier by IQR rule: values outside?

Question 8

mean(x, trim=0.1) computes?

Question 9

summary(x) for numeric vector shows?

Question 10

Geometric mean is appropriate for?

7. Interview Questions

  • Q: When would you use median instead of mean?
  • Q: What does the IQR tell you about a dataset?

8. Summary

Central tendency: mean() (sensitive to outliers), median() (robust), mode (custom function). Dispersion: var(), sd(), IQR(), quantile(). Shape: skewness() (asymmetry), kurtosis() (tail weight). Normality: shapiro.test(). Outliers: values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR. Use summary() for a quick five-number summary plus the mean. Right-skewed data → prefer median for "typical" value.

9. Next Chapter Recommendation

In Chapter 18: Probability Distributions, we work with R's built-in distribution functions — normal, binomial, Poisson, and more.
