R Programming Interview Preparation
# CHAPTER 28
1. Chapter Introduction
R interviews test three dimensions: language proficiency, statistical knowledge, and data manipulation skill. This chapter compiles 50 interview questions from data analyst, data scientist, and statistician roles at top companies.
---
Section A: R Language Questions (Q1-20)
Q1. What is the difference between <- and = in R?
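Both assign values at the top level, but <- is the convention for variable assignment in scripts, while = is conventionally reserved for naming function arguments. The difference shows inside a call: median(x = 1:10) passes an argument named x and creates no variable, whereas median(x <- 1:10) also assigns x in the calling environment. <- also has higher operator precedence than =.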
Q2. Explain R's apply family.
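apply(m, margin, f): applies f over the rows (margin 1) or columns (margin 2) of a matrix. lapply(x, f): always returns a list. sapply(x, f): like lapply, but simplifies the result to a vector or matrix when possible. tapply(x, group, f): applies f to the subsets of x defined by a grouping variable. mapply(f, ...): multivariate sapply that steps through several inputs in parallel. vapply(x, f, FUN.VALUE): type-safe sapply; the expected type and length of each result are declared, and a mismatch is an error.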
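The differences are easiest to see side by side. The following is a minimal sketch using a small matrix and list invented for illustration (m, scores, and the grouping labels are not from any real dataset):

```r
# Illustrative inputs (hypothetical values)
m <- matrix(1:6, nrow = 2)           # 2 x 3 matrix, filled column-wise
scores <- list(a = 1:3, b = 4:6)

row_sums <- apply(m, 1, sum)         # margin 1 = rows    -> 9 12
col_sums <- apply(m, 2, sum)         # margin 2 = columns -> 3 7 11

as_list <- lapply(scores, mean)      # always a list
as_vec  <- sapply(scores, mean)      # simplified to a named numeric vector
as_safe <- vapply(scores, mean, FUN.VALUE = numeric(1))  # errors on any other result shape

# Sum values by group label
by_group <- tapply(c(10, 20, 30, 40), c("x", "y", "x", "y"), sum)

# Walk two vectors in parallel
pairwise <- mapply(function(a, b) a + b, 1:3, 4:6)
```

In interviews, vapply is usually the safest answer for production code, because sapply can silently change its return type when the input is empty or irregular.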
Q3. What is vectorization in R? Why is it important?
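Vectorization means applying an operation to an entire vector without an explicit R-level loop: x * 2 multiplies every element in a single call. Vectorized code is typically much faster than the equivalent loop (often 10-100x, depending on the operation and data size) because the iteration runs in compiled C/Fortran code rather than in the interpreter.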
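A short comparison makes the idea concrete. This sketch uses a made-up vector x and shows that the vectorized form and the explicit loop produce the same result; only the place where the loop runs differs:

```r
x <- c(1, 2, 3, 4, 5)

# Vectorized: one call, the element-wise loop runs in compiled code
doubled_vec <- x * 2

# Explicit loop: same result, but each iteration is interpreted R
doubled_loop <- numeric(length(x))   # preallocate the result
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

same <- identical(doubled_vec, doubled_loop)   # TRUE
```

On vectors this small the timing difference is negligible; it becomes visible with large inputs, which can be checked with system.time().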
Q4. Difference between NA, NULL, NaN, and Inf?
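NA: missing value placeholder (length 1). NULL: the empty object (length 0). NaN: undefined numeric result (0/0). Inf: infinity (1/0), with -Inf for negative infinity. Use is.na(), is.null(), is.nan(), and is.infinite() to check each. Note that is.na(NaN) is TRUE, while is.nan(NA) is FALSE.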
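The four values can be told apart with a few base R calls. This is a small illustrative sketch; the variable names are arbitrary:

```r
len_na   <- length(NA)      # 1: NA occupies a slot in a vector
len_null <- length(NULL)    # 0: NULL is the absence of an object

nan_val <- 0 / 0            # NaN
inf_val <- 1 / 0            # Inf

na_is_na   <- is.na(NA)         # TRUE
nan_is_na  <- is.na(nan_val)    # TRUE: is.na() also flags NaN
na_is_nan  <- is.nan(NA)        # FALSE: NA is not NaN
inf_check  <- is.infinite(inf_val)  # TRUE
null_check <- is.null(NULL)     # TRUE

# NA propagates through arithmetic unless it is removed explicitly
mean_with_na    <- mean(c(1, NA, 3))                # NA
mean_without_na <- mean(c(1, NA, 3), na.rm = TRUE)  # 2
```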
Q5. What is the difference between a list and a data frame?
List: an ordered collection whose elements can be of any type and any length. Data frame: a list of equal-length vectors with row and column names, the standard tabular data structure. Every data frame is a list, but a list does not require equal-length elements.
Q6. Explain R's lexical scoping.
Functions look up free variables in the environment where they were defined, not where they are called. This enables closures: functions that "remember" their creation environment.
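A counter is the classic closure example. In this sketch, make_counter is a hypothetical helper written for illustration; the inner function keeps access to count in the environment where it was created:

```r
make_counter <- function() {
  count <- 0                 # lives in make_counter's environment
  function() {
    count <<- count + 1      # <<- modifies the enclosing count, not a local copy
    count
  }
}

counter <- make_counter()
first  <- counter()          # 1
second <- counter()          # 2

count <- 100                 # a different variable with the same name
third <- counter()           # 3: the closure still uses its own count
```

The assignment count <- 100 outside the closure has no effect on it, which is the behavior interviewers usually want explained.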
Q7. What does do.call(f, args) do?
Calls function f with its arguments supplied as a list. Example: do.call(paste, list("hello", "world", sep = "-")) returns "hello-world". Useful when the argument list is built dynamically.
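Two common uses can be sketched as follows. The list parts is a hypothetical set of data frames with matching columns, standing in for chunks read from several files:

```r
# Arguments assembled at run time, including a named one
args <- list("hello", "world", sep = "-")
joined <- do.call(paste, args)        # "hello-world"

# Stack a list of data frames row-wise (hypothetical data)
parts <- list(
  data.frame(id = 1:2, v = c("a", "b")),
  data.frame(id = 3L,  v = "c")
)
combined <- do.call(rbind, parts)     # one data frame with 3 rows
```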
Q8. When would you use tryCatch()?
When a function might error but you want to continue execution. Pattern: result <- tryCatch(expr, error=function(e) default_val, warning=function(w) ...).
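As a minimal sketch of that pattern, the hypothetical wrapper safe_log below returns NA instead of stopping when log() warns or fails, so a loop over mixed inputs can finish:

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,   # e.g. log(-1) warns that NaNs were produced
    error   = function(e) NA_real_    # e.g. log("a") is an error
  )
}

inputs  <- list(1, -1, "a")
results <- vapply(inputs, safe_log, numeric(1))   # 0 NA NA
```

Returning a typed NA (NA_real_) keeps the result usable with vapply, which would reject a logical NA here.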
Q9. What is the pipe operator %>%?
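Passes the result of the left side as the first argument of the call on the right: x %>% f(y) is equivalent to f(x, y). This enables readable left-to-right chains. %>% comes from the magrittr package and is re-exported by dplyr. The native pipe |> (R 4.1+) works similarly but is less flexible: the right side must be a function call, and there is no . placeholder (R 4.2+ adds _ for named arguments only).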
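The rewrite from nested calls to a chain can be sketched with the native pipe, which needs no packages but assumes R 4.1 or later. The vector x is made up for illustration:

```r
x <- c(4, 9, 16)

# Nested calls read inside-out
nested <- round(mean(sqrt(x)), 1)

# The same computation as a left-to-right chain
piped <- x |> sqrt() |> mean() |> round(1)

same <- identical(nested, piped)   # TRUE
```

With magrittr loaded, replacing |> with %>% in the chain gives the same result.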
Q10. How is a factor different from a character vector?
Factors store categories as integer codes plus a set of level labels. This is compact, allows an explicit ordering of levels, and lets modeling functions treat the variable as categorical. Character vectors are plain strings. Use factors for categorical variables with a fixed, known set of levels.
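The integer-codes-plus-labels structure can be inspected directly. This sketch uses a made-up vector of clothing sizes:

```r
sizes_chr <- c("medium", "small", "large", "small")

# Ordered factor with an explicit level order
sizes_fct <- factor(sizes_chr,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

codes   <- as.integer(sizes_fct)   # 2 1 3 1: positions in levels()
labels  <- levels(sizes_fct)       # "small" "medium" "large"
smaller <- sizes_fct < "large"     # TRUE TRUE FALSE TRUE: uses level order

counts_fct <- table(sizes_fct)     # counts reported in level order
sorted_chr <- sort(sizes_chr)      # plain strings sort alphabetically
```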
---
Section B: Statistical Questions (Q21-35)
Q11. When do you use median instead of mean?
For skewed distributions or data with outliers. The mean is sensitive to extreme values; the median is not. For income, salaries, and housing prices, the median is usually the better "typical" value.
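One extreme value is enough to show the effect. The salaries below are hypothetical figures in thousands:

```r
salaries     <- c(40, 45, 50, 55, 60)   # symmetric: mean and median agree
with_outlier <- c(salaries, 1000)       # add one very large salary

mean_base   <- mean(salaries)           # 50
median_base <- median(salaries)         # 50

mean_out    <- mean(with_outlier)       # about 208.3, pulled up by the outlier
median_out  <- median(with_outlier)     # 52.5, barely moves
```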
Q12. Explain p-value in plain English.
A p-value is P(data this extreme or more extreme | H₀ is true). A small p-value means: "If the null hypothesis were true, seeing data like this would be very unlikely." It is NOT the probability that H₀ is true, and it does not by itself show that H₁ is probable.
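The definition can be checked by hand on a small case. As an illustrative assumption, suppose a coin lands heads 9 times in 10 flips and H₀ says the coin is fair. The two-sided p-value is the total probability, under H₀, of outcomes at least as extreme as 9 heads:

```r
n        <- 10
observed <- 9

# Under H0 (p = 0.5), outcomes as or more extreme than 9 heads: 0, 1, 9, 10
extreme  <- c(0, 1, 9, 10)
p_manual <- sum(dbinom(extreme, size = n, prob = 0.5))   # 22 / 1024

# The exact binomial test computes the same quantity
p_test <- binom.test(observed, n, p = 0.5)$p.value
```

Both values are about 0.021: if the coin were fair, a result this lopsided would occur in roughly 2% of experiments.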
Q13. What is the Central Limit Theorem?
The sampling distribution of the mean of independent observations approaches a normal distribution as sample size increases, regardless of the shape of the underlying distribution (provided its variance is finite). This is foundational to hypothesis testing and confidence intervals.
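A quick simulation illustrates the theorem. This sketch draws from an exponential distribution, which is strongly right-skewed; the sample size, number of replications, and seed are arbitrary choices:

```r
set.seed(42)          # arbitrary seed, for reproducibility only

n    <- 50            # size of each sample
reps <- 5000          # number of samples drawn

# Each replicate: draw n skewed values and keep their mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

center <- mean(sample_means)   # close to the population mean, 1
spread <- sd(sample_means)     # close to 1 / sqrt(n)

# hist(sample_means) shows a roughly bell-shaped histogram,
# even though hist(rexp(5000)) is heavily skewed
```

The spread shrinking like 1 / sqrt(n) is what makes standard errors and confidence intervals work.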
Q14. Difference between Type I and Type II error?
Type I (α): false positive, rejecting a true H₀. Type II (β): false negative, failing to reject a false H₀. The significance level α sets the Type I error rate. Power (1-β) is the probability of correctly rejecting a false H₀, so raising power (for example, with a larger sample) lowers the Type II error rate.
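The link between sample size, power, and β can be explored with power.t.test() from the stats package. The effect size, standard deviation, and group sizes here are assumptions chosen for illustration:

```r
# Two-sample t-test, true difference of 1 SD, alpha = 0.05
power_n20 <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05)$power
power_n50 <- power.t.test(n = 50, delta = 1, sd = 1, sig.level = 0.05)$power

# Type II error rate is the complement of power
beta_n20 <- 1 - power_n20
beta_n50 <- 1 - power_n50
```

With α held fixed, the larger sample gives higher power and therefore a smaller β.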
Q15. When would you use a paired t-test vs two-sample t-test?
Paired t-test: the same subjects measured twice (before/after). Two-sample t-test: independent groups (treatment vs control). The paired test is more powerful when the two measurements are positively correlated within subjects, because differencing removes between-subject variability.
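The gain in power shows up clearly on before/after data. The measurements below are hypothetical readings for six subjects:

```r
before <- c(120, 135, 128, 140, 132, 125)
after  <- c(115, 130, 126, 133, 129, 121)

paired   <- t.test(before, after, paired = TRUE)   # tests the mean of the differences
unpaired <- t.test(before, after)                  # Welch two-sample test, ignores pairing

# A paired t-test is a one-sample t-test on the differences
one_sample <- t.test(before - after)

paired_is_sharper <- paired$p.value < unpaired$p.value   # TRUE for these data
```

Every subject improved by a similar amount, so the differences vary little even though the subjects themselves vary a lot. Treating the groups as independent throws that information away.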
---
Section C: Coding Challenges (Q36-50)
4. MCQs
What does do.call(rbind, listofdfs) do?
What does Reduce("+", 1:5) return?
When does tryCatch(expr, error=function(e) NA) return NA?
Why is vectorized code faster?
What does which(x > 5) return?
What does Sys.time() return?
Why does the CLT matter for the sampling distribution of the mean?
What does across(cols, funs) do in dplyr?
What does replicate(n, expr) do in R?
What does prop.table(table(x)) * 100 give?
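Several of these can be verified directly at the console. The following sketch covers the base R items only, using small made-up inputs:

```r
total <- Reduce("+", 1:5)            # 15: folds + over the vector

x <- c(3, 8, 5, 9)
positions <- which(x > 5)            # 2 4: indices, not the values themselves

now <- Sys.time()                    # current date-time as a POSIXct object
now_class <- class(now)              # "POSIXct" "POSIXt"

draws <- replicate(3, 1 + 1)         # evaluates the expression 3 times: 2 2 2

answers <- c("a", "b", "a", "a")
percentages <- prop.table(table(answers)) * 100   # a: 75, b: 25
```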
5. Summary
50 interview Q&A: language (vectorization, apply, NA/NULL, scoping, pipes), statistics (p-value, CLT, t-tests, correlation), and coding (outlier detection, frequency tables, email validation, CLT simulation, moving averages). Key themes: vectorized thinking, statistical interpretation, tidy workflows, and defensive programming with tryCatch().