CHAPTER 28 Beginner

R Programming Interview Preparation

Updated: May 18, 2026

1. Chapter Introduction

R interviews test three dimensions: language proficiency, statistical knowledge, and data manipulation skill. This chapter compiles interview questions of the kind asked for data analyst, data scientist, and statistician roles, followed by coding challenges and multiple-choice questions.

---

Section A: R Language Questions (Q1-10)

Q1. What is the difference between <- and = in R? Both assign at the top level of an expression, and <- is the conventional assignment operator in scripts. Inside a function call the two differ: f(x = 1) passes 1 as the argument named x and creates no variable, whereas f(x <- 1) assigns x in the calling environment and then passes its value.
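A minimal sketch of that difference inside a function call. The wrapper `demo_assignment()` and the variable names are illustrative only; the wrapper exists so the check runs in a fresh local environment.

```r
demo_assignment <- function() {
  # `=` inside a call names the argument `x` of mean(); no local `x` is created.
  m1 <- mean(x = c(1, 2, 3))
  x_created <- exists("x", envir = environment(), inherits = FALSE)

  # `<-` inside a call assigns `y` in the calling environment, then passes the value.
  m2 <- mean(y <- c(4, 5, 6))
  y_created <- exists("y", envir = environment(), inherits = FALSE)

  list(m1 = m1, x_created = x_created, m2 = m2, y_created = y_created)
}

res <- demo_assignment()
str(res)  # m1 = 2, x_created = FALSE, m2 = 5, y_created = TRUE
```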

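Q2. Explain R's apply family. apply(m, margin, f): applies f over the rows (margin 1) or columns (margin 2) of a matrix. lapply(x, f): always returns a list. sapply(x, f): like lapply, but simplifies the result to a vector or matrix when possible. vapply(x, f, FUN.VALUE): like sapply, but with a declared return type, so it is type-safe. tapply(x, group, f): applies f to x within each group. mapply(f, ...): multivariate sapply that iterates over several inputs in parallel.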
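One call per family member, as a quick sketch. The toy inputs are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)            # rows are (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)          # margin 1 = rows:    9 12
col_sums <- apply(m, 2, sum)          # margin 2 = columns: 3 7 11

x <- list(a = 1:3, b = 4:6)
as_list   <- lapply(x, mean)          # list(a = 2, b = 5)
as_vector <- sapply(x, mean)          # named numeric vector: a = 2, b = 5
checked   <- vapply(x, mean, FUN.VALUE = numeric(1))  # errors if f returns anything else

# Group means: values 10 and 30 belong to "u", 20 and 40 to "v"
by_group <- tapply(c(10, 20, 30, 40), c("u", "v", "u", "v"), mean)  # u = 20, v = 30

# Element-wise over two inputs in parallel
pairwise <- mapply(function(a, b) a + b, 1:3, 4:6)  # 5 7 9
```

In interviews it is worth mentioning that vapply() is usually preferred over sapply() inside functions, because its return type does not depend on the input.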

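Q3. What is vectorization in R? Why is it important? Vectorized operations act on entire vectors without an explicit R-level loop: x * 2 doubles every element in a single call. The loop still runs, but in compiled C/Fortran code rather than in the R interpreter, so vectorized code is typically much faster than the equivalent for loop (often by an order of magnitude or more, depending on the operation) and is also shorter to read.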
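A small sketch showing that the loop and the vectorized form compute the same result. The input vector is arbitrary.

```r
x <- c(1.5, 2, 3.25, 4)

# Explicit loop: one interpreted R iteration per element.
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# Vectorized: a single call; the iteration happens in compiled code.
doubled_vec <- x * 2

same_result <- identical(doubled_loop, doubled_vec)  # TRUE
```

To see the speed difference on your own machine, wrap each version in system.time() with a much longer vector; the exact ratio depends on the operation and the R version.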

Q4. Difference between NA, NULL, NaN, and Inf? NA: missing value placeholder (length 1). NULL: empty object (length 0). NaN: undefined numeric result (0/0). Inf: infinity (1/0). Use is.na(), is.null(), is.nan(), is.infinite() to check each. Note that is.na() also returns TRUE for NaN, while is.nan(NA) is FALSE.
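The checks can be tried directly in the console; a short sketch, with result names chosen only for illustration:

```r
len_na   <- length(NA)       # 1
len_null <- length(NULL)     # 0

nan_is_nan <- is.nan(0 / 0)  # TRUE
nan_is_na  <- is.na(0 / 0)   # TRUE: is.na() also flags NaN
na_is_nan  <- is.nan(NA)     # FALSE
inf_check  <- is.infinite(1 / 0)  # TRUE

v <- c(1, NA, 3)
mean_with_na <- mean(v)                 # NA: missing values propagate
mean_no_na   <- mean(v, na.rm = TRUE)   # 2

dropped <- c(1, NULL, 3)     # NULL vanishes when combined: 1 3
```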

Q5. What is the difference between a list and a data frame? List: an ordered collection whose elements can be of any type and any length. Data frame: a list of equal-length vectors (the columns), which makes it the standard tabular data structure. Data frames carry row and column names; lists have no equal-length requirement.
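A brief sketch of the contrast, using made-up data:

```r
# A list may mix types and lengths freely.
person <- list(name = "Ada", scores = c(90, 85, 77), active = TRUE)
element_lengths <- lengths(person)   # name = 1, scores = 3, active = 1

# A data frame is a list of equal-length columns.
df <- data.frame(id = 1:3, score = c(90, 85, 77))
df_is_list <- is.list(df)            # TRUE
df_dim     <- dim(df)                # 3 rows, 2 columns

# Columns whose lengths cannot be recycled to match are rejected.
bad <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)
```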

Q6. Explain R's lexical scoping. A function looks up free variables in the environment where it was defined, not where it is called. This enables closures: functions that "remember" the environment they were created in.
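Two short illustrations, as a sketch. `make_counter()`, `f`, and `g` are hypothetical names used only for this example.

```r
# Lexical lookup: f() finds `y` where f was defined, not inside g().
y <- 10
f <- function() y
g <- function() {
  y <- 99   # local to g(); invisible to f()
  f()
}
scoped_value <- g()   # 10

# Closure: each counter keeps its own private `count`.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # modifies `count` in the enclosing environment
    count
  }
}
counter_a <- make_counter()
counter_b <- make_counter()
counter_a()
counter_a()
a_val <- counter_a()   # 3
b_val <- counter_b()   # 1
```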

Q7. What does do.call(f, args) do? Calls function f with the arguments supplied as a list. Example: do.call(paste, list("hello","world",sep="-")) returns "hello-world". It is used when the argument list is built dynamically.
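A common practical use is stacking a list of data frames whose length is not known in advance. A sketch with two made-up data frames:

```r
parts <- list(
  data.frame(id = 1:2, value = c(0.5, 1.5)),
  data.frame(id = 3:4, value = c(2.5, 3.5))
)

# Equivalent to rbind(parts[[1]], parts[[2]]), for any number of parts.
combined <- do.call(rbind, parts)
n_rows <- nrow(combined)   # 4

# Named list elements become named arguments.
args <- list("hello", "world", sep = "-")
joined <- do.call(paste, args)   # "hello-world"
```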

Q8. When would you use tryCatch()? When an expression might raise an error or warning but execution should continue with a fallback. Pattern: result <- tryCatch(expr, error=function(e) default_val, warning=function(w) ...).
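A minimal sketch of the pattern. `safe_log()` is a hypothetical helper, and returning NA for both warnings and errors is just one possible fallback policy.

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,  # e.g. log(-1) warns that NaNs were produced
    error   = function(e) NA_real_   # e.g. log("a") fails on a non-numeric argument
  )
}

# The bad inputs no longer stop the whole computation.
results <- vapply(list(1, -1, "a"), safe_log, numeric(1))   # 0 NA NA
```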

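Q9. What is the pipe operator %>%? Provided by the magrittr package (and re-exported by dplyr), it passes the result of the left side as the first argument to the right side: x %>% f(y) is equivalent to f(x, y). This enables readable left-to-right chains. The native R pipe |> (R 4.1+) is similar but less flexible.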
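A sketch comparing nested calls with a pipe chain. It uses the native pipe so that no package is needed, which assumes R 4.1 or later; with magrittr loaded, %>% could be substituted in the same positions.

```r
x <- c(4, 9, 16)

# Nested calls read inside-out.
nested <- round(mean(sqrt(x)), 1)

# The pipe reads left to right: take x, then sqrt, then mean, then round.
piped <- x |> sqrt() |> mean() |> round(1)

same_value <- identical(nested, piped)   # TRUE, both are 3
```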

Q10. How is a factor different from a character vector? A factor stores categories as integer codes plus a set of labels (levels). This fixes the set of allowed values, allows an explicit level order, and tells modeling functions to treat the variable as categorical. Characters are plain strings. Use factors for categorical variables with fixed, known levels.
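A short sketch with a made-up size variable, showing the integer codes and the effect of level order:

```r
sizes_chr <- c("medium", "small", "large", "small")
sizes_fct <- factor(sizes_chr,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

codes <- as.integer(sizes_fct)     # 2 1 3 1: positions in levels()
lvls  <- levels(sizes_fct)         # "small" "medium" "large"

# Characters sort alphabetically; factors sort by level order.
sorted_chr <- sort(sizes_chr)                  # large, medium, small, small
sorted_fct <- as.character(sort(sizes_fct))    # small, small, medium, large

# Ordered factors support comparisons against a level.
below_large <- sizes_fct < "large"             # TRUE TRUE FALSE TRUE
```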

---

Section B: Statistical Questions (Q11-15)

Q11. When do you use median instead of mean? For skewed distributions or data with outliers. The mean is sensitive to extreme values; the median is not. For right-skewed quantities such as income, salary, or housing prices, the median is usually the better "typical" value.
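A sketch with hypothetical salary data (in thousands), where a single extreme value pulls the mean far from the bulk of the data:

```r
salaries <- c(42, 45, 47, 50, 52, 55, 400)   # hypothetical; one extreme earner

avg <- mean(salaries)     # about 98.7, higher than six of the seven values
mid <- median(salaries)   # 50
```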

Q12. Explain p-value in plain English. P(data this extreme or more | H₀ is true). A small p-value means: "If the null hypothesis were true, seeing data this extreme would be very unlikely." It is NOT the probability that H₀ is true, and it does not by itself show that H₁ is probable.
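The definition can be checked by hand on a simple case. This sketch assumes a hypothetical experiment of 60 heads in 100 coin flips, with H₀ being a fair coin, and compares the manual sum with binom.test().

```r
n <- 100
observed <- 60

# Under H0 we expect 50 heads. "This extreme or more" (two-sided) means
# any count at least as far from 50 as the observed 60.
extreme <- c(0:40, 60:100)
p_manual <- sum(dbinom(extreme, size = n, prob = 0.5))

# The exact binomial test reports the same quantity.
p_test <- binom.test(observed, n, p = 0.5)$p.value

p_manual   # roughly 0.057
```

At α = 0.05 this result would not be declared significant, even though 60 heads may look lopsided at first glance.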

Q13. What is the Central Limit Theorem? The sampling distribution of the mean of independent observations approaches a normal distribution as sample size increases, regardless of the shape of the underlying distribution, provided it has finite variance. It is foundational to hypothesis testing and confidence intervals.

Q14. Difference between Type I and Type II error? Type I (α): false positive, rejecting a true H₀. Type II (β): false negative, failing to reject a false H₀. The significance level α is the Type I error rate you accept. Power (1-β) is the probability of correctly rejecting a false H₀, so higher power means a lower Type II error rate.
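Both error rates can be estimated by simulation. This is a sketch under assumed settings (normal data, 20 observations per group, a true difference of 1 standard deviation under the alternative); none of these numbers come from the chapter.

```r
set.seed(1)
alpha  <- 0.05
n_sims <- 2000

# H0 true: both groups come from the same distribution.
# Every rejection here is a Type I error.
p_null <- replicate(n_sims, t.test(rnorm(20), rnorm(20))$p.value)
type1_rate <- mean(p_null < alpha)       # should land close to alpha

# H0 false: the second group's mean is shifted by 1.
# Every non-rejection here is a Type II error.
p_alt <- replicate(n_sims, t.test(rnorm(20), rnorm(20, mean = 1))$p.value)
power_est  <- mean(p_alt < alpha)
type2_rate <- 1 - power_est
```

For a closed-form answer instead of a simulation, power.t.test() in the stats package computes power from the sample size, effect size, and significance level.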

Q15. When would you use a paired t-test vs two-sample t-test? Paired t-test: the same subjects measured twice (before/after). Two-sample t-test: independent groups (treatment vs control). The paired test is more powerful when the two measurements are positively correlated within subjects, because between-subject variation cancels out of the differences.
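A sketch using R's built-in sleep dataset, in which the same 10 patients were measured under two drugs, so the paired test is the appropriate one. The unpaired call is shown only for contrast.

```r
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]

# Paired: tests whether the mean within-patient difference is zero.
paired_p <- t.test(drug1, drug2, paired = TRUE)$p.value

# Two-sample (Welch by default): ignores the pairing.
unpaired_p <- t.test(drug1, drug2)$p.value

paired_p < unpaired_p   # TRUE: pairing removes between-patient variation
```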

---

Section C: Coding Challenges (Challenges 1-7)

```r
# Challenge 1: Remove duplicates and sort by frequency
words <- c("apple","banana","apple","cherry","banana","apple","date")
freq_table <- sort(table(words), decreasing=TRUE)
print(freq_table)
unique_words <- names(freq_table)  # unique words, most frequent first

# Challenge 2: Find outliers in a vector (1.5 * IQR rule)
find_outliers <- function(x) {
  x <- x[!is.na(x)]  # quantile() and IQR() error on NA by default
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  iqr <- IQR(x)
  x[x < Q1 - 1.5*iqr | x > Q3 + 1.5*iqr]
}
find_outliers(c(10,12,11,13,100,12,11,10,200))  # 100 200

# Challenge 3: Moving average function (trailing window)
moving_avg <- function(x, window=3) {
  n <- length(x)
  result <- rep(NA_real_, n)
  if (window > n) return(result)  # otherwise window:n would count downwards
  for (i in window:n) {
    result[i] <- mean(x[(i-window+1):i])
  }
  result
}
# Vectorized version using stats::filter().
# The namespace is explicit because dplyr (loaded below) masks filter().
moving_avg_fast <- function(x, k) as.numeric(stats::filter(x, rep(1/k, k), sides=1))

# Challenge 4: Summarise multiple columns at once
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp, wt), list(mean=mean, sd=sd), .names="{col}_{fn}"))

# Challenge 5: Build frequency table with percentage
# Uses %>%, arrange() and desc() from dplyr (loaded in Challenge 4).
freq_report <- function(x, label="Variable") {
  tbl <- table(x)
  pct <- round(prop.table(tbl) * 100, 1)
  result <- data.frame(
    Category = names(tbl),
    Count    = as.numeric(tbl),
    Percent  = as.numeric(pct)
  ) %>% arrange(desc(Count))
  cat(sprintf("=== %s ===\n", label))
  print(result)
  invisible(result)  # return the table without printing it a second time
}

# Challenge 6: Validate and clean email addresses
# The pattern is a pragmatic check, not a full RFC-compliant validator.
clean_emails <- function(emails) {
  pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
  valid   <- grepl(pattern, emails, perl=TRUE)
  cat("Valid:", sum(valid), "| Invalid:", sum(!valid), "\n")
  list(valid=emails[valid], invalid=emails[!valid])
}

# Challenge 7: Simulate Central Limit Theorem
set.seed(42)
n_samples <- 1000
rate <- 0.1  # Exp(rate) has mean 1/rate and sd 1/rate
sample_means <- replicate(n_samples, mean(rexp(30, rate=rate)))
cat("Mean of sample means:", round(mean(sample_means), 3), "\n")
cat("SE (observed):", round(sd(sample_means), 3), "\n")
cat("SE (expected):", round((1/rate)/sqrt(30), 3), "\n")  # theoretical sd / sqrt(n)
# The sampling distribution is approximately normal despite the skewed exponential parent
hist(sample_means, breaks=30, col="#1565C0", main="CLT: Means of Exp(0.1) samples")
```

4. MCQs

Question 1

What does do.call(rbind, listofdfs) do?

Question 2

What does Reduce("+", 1:5) return?

Question 3

When does tryCatch(expr, error=function(e) NA) return NA?

Question 4

Why is vectorized code faster than an explicit loop?

Question 5

What does which(x > 5) return?

Question 6

What does Sys.time() return?

Question 7

What does the CLT say about the sampling distribution of the mean?

Question 8

What does across(cols, funs) in dplyr apply, and to which columns?

Question 9

What does replicate(n, expr) do in R?

Question 10

What does prop.table(table(x)) * 100 give?
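Several of the calls named in these questions can be run directly in the console to check an answer. A sketch with made-up inputs:

```r
total <- Reduce("+", 1:5)                  # 15: folds the operator over the vector

idx <- which(c(3, 8, 5, 9) > 5)            # 2 4: positions, not the values themselves

time_class <- class(Sys.time())            # "POSIXct" "POSIXt": a date-time object

set.seed(7)
draws <- replicate(3, mean(runif(5)))      # expr is re-evaluated 3 times
n_draws <- length(draws)                   # 3

pct <- prop.table(table(c("a", "b", "a", "a"))) * 100   # a = 75, b = 25
```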

5. Summary

This chapter covered interview Q&A in three areas: language (assignment, the apply family, vectorization, NA/NULL, scoping, do.call, tryCatch, pipes, factors), statistics (median vs mean, p-values, CLT, Type I/II errors, t-tests), and coding (frequency tables, outlier detection, moving averages, grouped summaries, email validation, CLT simulation). Key themes: vectorized thinking, statistical interpretation, tidy workflows, and defensive programming with tryCatch().

6. Next Chapter Recommendation

In Chapter 29: Performance Optimization in R, we profile code, use data.table, vectorize bottlenecks, and parallelize computation.
