
Pandas and NumPy Interview Preparation

Updated: May 18, 2026
5 min read

# CHAPTER 28

Pandas & NumPy Interview Preparation

1. Chapter Introduction

This chapter compiles 25 frequently asked Pandas and NumPy interview questions of the kind reported from Google, Amazon, Meta, and other top tech companies, with answers, code examples, and 8 coding challenges.

---

Section A: NumPy (Q1-15)

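Q1. What is NumPy and why is it faster than Python lists? NumPy arrays store homogeneous data in contiguous memory blocks and run their loops in compiled C code (vectorized operations). Python lists are heterogeneous arrays of pointers to boxed objects, so every element costs a pointer dereference, a type check, and interpreter overhead. For numerical work NumPy is often one to two orders of magnitude faster, though the exact ratio depends on the operation, dtype, and array size.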
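A rough way to see the gap is to time the same reduction both ways. This is only a sketch: the array size and repeat count are arbitrary choices, and the measured ratio will vary with hardware and NumPy version.

```python
import timeit
import numpy as np

n = 100_000  # arbitrary size chosen for illustration
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Same computation both ways: sum of squares
list_result = sum(x * x for x in py_list)
numpy_result = int(np.dot(np_arr, np_arr))

list_time = timeit.timeit(lambda: sum(x * x for x in py_list), number=5)
numpy_time = timeit.timeit(lambda: np.dot(np_arr, np_arr), number=5)
print(f"list: {list_time:.4f}s  numpy: {numpy_time:.4f}s")
```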

Q2. Difference between np.array() and np.asarray()? np.array() copies its input by default (copy=True). np.asarray() returns the input unchanged when it is already an ndarray of the requested dtype and only copies when a conversion is needed, which makes it more memory-efficient.
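A small check of the copy behaviour, using np.shares_memory to see whether two arrays use the same buffer. The variable names are illustrative only.

```python
import numpy as np

a = np.arange(5)

b = np.array(a)                       # default copy=True: new buffer
c = np.asarray(a)                     # already an ndarray: returned as-is
d = np.asarray(a, dtype=np.float64)   # dtype differs, so a converted copy is made

print(np.shares_memory(a, b))  # False
print(c is a)                  # True
print(np.shares_memory(a, d))  # False
```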

Q3. What is broadcasting in NumPy? Broadcasting allows element-wise operations on arrays of different shapes. NumPy virtually stretches dimensions of size 1 (and any missing leading dimensions) to match the other array's shape, without copying data. Rules: compare shapes starting from the trailing (rightmost) dimension; each pair of dimensions must be equal, or one of them must be 1.
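The rules can be illustrated with a (2, 3) matrix combined with a row, a column, and an incompatible vector. The arrays are made-up examples.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,), treated as (1, 3)
col = np.array([[100], [200]])        # shape (2, 1)

plus_row = matrix + row   # row is stretched across both rows
plus_col = matrix + col   # col is stretched across all three columns
print(plus_row)
print(plus_col)
print(np.broadcast_shapes((2, 3), (3,), (2, 1)))  # (2, 3)

# Incompatible: trailing dimensions 3 and 2 are neither equal nor 1
try:
    matrix + np.array([1, 2])
    failed = False
except ValueError as exc:
    failed = True
    print("broadcast error:", exc)
```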

Q4. Difference between copy() and a view (slice)? Basic slicing creates a view that shares memory with the original, so modifying the view modifies the original. .copy() creates an independent array, as does indexing with an integer or boolean array. Check with np.shares_memory(a, b).
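A minimal demonstration of which indexing operations share memory with the original array:

```python
import numpy as np

a = np.arange(6)
view = a[1:4]            # basic slice: a view onto a's buffer
copied = a[1:4].copy()   # explicit, independent copy
fancy = a[[1, 2, 3]]     # integer-array indexing: also a copy

view[0] = 99             # writes through to a

print(a)                              # [ 0 99  2  3  4  5]
print(copied[0], fancy[0])            # 1 1
print(np.shares_memory(a, view))      # True
print(np.shares_memory(a, copied))    # False
```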

Q5. How does np.where() work?

python
import numpy as np

# Signature: np.where(condition, value_if_true, value_if_false)
# Element-wise conditional selection
arr = np.array([1, -2, 3, -4])
print(np.where(arr > 0, arr, 0))  # [1 0 3 0], the same result as a ReLU activation

Q6. What is np.vectorize()? Wraps a Python function to work element-wise on arrays. It's NOT truly vectorized (still loops internally) — just more convenient than explicit loops. For real speed, use native NumPy operations.
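For comparison, here is the same element-wise operation written with np.vectorize and with a native ufunc. The helper clip_negative is a hypothetical example function, not part of NumPy.

```python
import numpy as np

def clip_negative(x):
    # Plain Python scalar function (hypothetical example)
    return x if x > 0 else 0

arr = np.array([1, -2, 3, -4])

via_vectorize = np.vectorize(clip_negative)(arr)  # convenient, but a Python-level loop
native = np.maximum(arr, 0)                       # runs in compiled code

print(via_vectorize)  # [1 0 3 0]
print(native)         # [1 0 3 0]
```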

Q7. Explain np.einsum(). Einstein summation — compact notation for tensor operations. np.einsum('ij,jk->ik', A, B) = matrix multiply. More expressive than np.dot for complex operations.
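A few common subscript patterns, each checked against its conventional NumPy equivalent. The random inputs are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))
B = rng.random((3, 4))
M = rng.random((3, 3))

matmul = np.einsum('ij,jk->ik', A, B)   # matrix multiply, same as A @ B
trace = np.einsum('ii->', M)            # sum of the diagonal
row_sums = np.einsum('ij->i', A)        # sum over columns
transposed = np.einsum('ij->ji', A)     # transpose

print(np.allclose(matmul, A @ B))
```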

Q8. What are structured arrays? Arrays with named, typed fields per element — like a database row. dtype=[('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]. Access by field name: arr['name'].
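A short sketch using the dtype above; the three records are invented sample data.

```python
import numpy as np

dtype = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]
people = np.array(
    [('Ana', 31, 85000.0), ('Bo', 45, 92000.0), ('Cy', 28, 61000.0)],
    dtype=dtype,
)

print(people['name'])                # access one field across all records
older = people[people['age'] > 30]   # boolean mask built from a field
mean_salary = people['salary'].mean()
print(older['name'], mean_salary)
```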

Q9. How to find the N largest elements?

python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
n = 3
top_indices = np.argsort(arr)[-n:][::-1]  # indices of the top N, largest first
top_values  = arr[top_indices]             # values: [9 6 5]
# Or: np.partition(arr, -n)[-n:] returns the top N in arbitrary order (faster)

Q10. What is np.linalg.lstsq()? Least-squares solution to Ax = b when the system may be overdetermined (more equations than unknowns). Used for linear regression from scratch.
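As a minimal sketch, a straight line can be fitted by stacking x with a column of ones. The data here are noise-free points on y = 2x + 1, chosen so the expected coefficients are known.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                            # exact line, no noise

# Design matrix: one column for the slope, one for the intercept
A = np.column_stack([x, np.ones_like(x)])    # 5 equations, 2 unknowns

coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef
print(slope, intercept)
```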

Q11. Difference between * and @ for arrays? * is element-wise multiplication. @ is matrix multiplication (dot product). For 2D arrays A @ B == np.dot(A, B).
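A two-by-two example makes the difference concrete:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B   # multiplies matching entries
matmul = A @ B        # rows of A dotted with columns of B

print(elementwise)    # [[ 5 12] [21 32]]
print(matmul)         # [[19 22] [43 50]]
```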

Q12. How to efficiently compute a correlation matrix?

python
# NumPy built-in
corr = np.corrcoef(data.T)  # data is (n_samples, n_features)
# Pandas built-in
corr = df.corr()

Q13. What is np.memmap? Memory-mapped file — maps a portion of a file into memory. Enables working with arrays larger than RAM by reading only needed parts from disk on demand.
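A small sketch of writing and then reading a memory-mapped array. The file name, shape, and dtype are arbitrary, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")   # hypothetical file name

    # Create the file and write through the mapping
    writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(1000, 10))
    writer[:] = 1.0
    writer[5, 3] = 42.0
    writer.flush()
    del writer                             # release the mapping

    # Reopen read-only; only the pages that are touched get loaded
    reader = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 10))
    value = float(reader[5, 3])
    chunk_sum = float(reader[:100].sum())
    del reader

print(value, chunk_sum)
```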

Q14. How does np.argpartition() differ from np.argsort()? argsort fully sorts (O(n log n)). argpartition(arr, k) only guarantees that the element at index k is in its sorted position, with no larger elements before it and no smaller elements after it, in O(n) time. It is faster when you only need the top k.
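A sketch of the usual top-k pattern: partition first, then sort only the k selected elements if an ordering is needed.

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
k = 3

idx = np.argpartition(arr, -k)[-k:]   # indices of the k largest, in no particular order
top_unsorted = arr[idx]

# Sort just those k values, largest first
top_sorted = arr[idx[np.argsort(arr[idx])[::-1]]]
print(top_sorted)  # [9 6 5]
```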

Q15. What is np.unique() and what does it return?

python
arr = np.array([3,1,4,1,5,9,2,6,5,3,5])
vals, counts = np.unique(arr, return_counts=True)
# vals: [1 2 3 4 5 6 9], counts: [2 1 2 1 3 1 1]

---

Section B: Pandas (Q16-25)

Q16. Difference between loc and iloc? loc[label, label] — label-based. iloc[pos, pos] — integer position-based. loc slices are inclusive on both ends; iloc stop is exclusive.
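The inclusive versus exclusive slicing is easiest to see on a small frame with string labels (sample data only):

```python
import pandas as pd

df = pd.DataFrame({'score': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']   # labels a, b, c: the stop label is included
by_pos = df.iloc[0:2]        # positions 0 and 1: the stop position is excluded

print(len(by_label), len(by_pos))             # 3 2
print(df.loc['b', 'score'], df.iloc[1, 0])    # 20 20
```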

Q17. What is SettingWithCopyWarning? Occurs with chained indexing: df[mask]['col'] = x. Pandas can't determine if the result is a view or copy. Fix: df.loc[mask, 'col'] = x.
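A minimal sketch of the recommended single-step assignment, with the chained form left as a comment. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4], 'flag': [True, False, True, False]})
mask = df['flag']

# Chained indexing (avoid): the assignment may land on a temporary copy
# df[mask]['col'] = 0

# Single .loc call: row selection and column assignment happen in one step
df.loc[mask, 'col'] = 0
print(df['col'].tolist())  # [0, 2, 0, 4]
```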

Q18. How does groupby().transform() differ from agg()? agg() returns one row per group. transform() returns a Series with the same length as the original DataFrame — useful for adding group statistics as features.
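The difference in output shape, shown on a five-row sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['A', 'A', 'B', 'B', 'B'],
    'salary': [10, 20, 30, 40, 50],
})

per_group = df.groupby('dept')['salary'].agg('mean')                 # one value per group
df['dept_mean'] = df.groupby('dept')['salary'].transform('mean')     # one value per row

print(per_group.to_dict())        # {'A': 15.0, 'B': 40.0}
print(df['dept_mean'].tolist())   # [15.0, 15.0, 40.0, 40.0, 40.0]
```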

Q19. What is the purpose of pd.cut() vs pd.qcut()? pd.cut(): Equal-width bins (or explicit, fixed boundaries). pd.qcut(): Equal-frequency bins (quantile-based), so each bin holds approximately the same number of observations.
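A skewed sample shows the contrast: one large value stretches the equal-width bins, while the quantile bins stay balanced. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

equal_width = pd.cut(values, bins=3)   # three bins of equal width over [1, 100]
equal_freq = pd.qcut(values, q=3)      # three bins split at the 1/3 and 2/3 quantiles

# Count observations per bin, in bin order
width_counts = np.bincount(equal_width.cat.codes.to_numpy(), minlength=3).tolist()
freq_counts = np.bincount(equal_freq.cat.codes.to_numpy(), minlength=3).tolist()

print(width_counts)  # [8, 0, 1]
print(freq_counts)   # [3, 3, 3]
```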

Q20. How do you handle outliers in Pandas?

python
# IQR method
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df_clean = df[(df['col'] >= Q1-1.5*IQR) & (df['col'] <= Q3+1.5*IQR)]
# OR cap (winsorize):
df['col'] = df['col'].clip(lower=Q1-1.5*IQR, upper=Q3+1.5*IQR)

Q21. Explain merge types (inner, left, right, outer).

  • inner: Only matching rows (intersection)
  • left: All left rows + matching right (NaN for no match)
  • right: All right + matching left (NaN for no match)
  • outer: All rows from both (NaN where no match)
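The four join types above can be compared by merging two small frames that overlap on two keys. The frames are sample data.

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

# Row count for each join type
sizes = {
    how: len(left.merge(right, on='key', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(sizes)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

outer = left.merge(right, on='key', how='outer')
print(outer)  # keys 1 and 4 carry NaN on the side that had no match
```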

Q22. How to efficiently iterate over a DataFrame?

python
# DON'T: for i, row in df.iterrows():  (very slow)
# DO: vectorized operations
df['result'] = df['A'] * df['B']
# If you must iterate: itertuples() is 10-100x faster than iterrows()

Q23. What is pd.melt()? Reshapes "wide" format to "long" (tidy) format by unpivoting multiple columns into rows. It is roughly the inverse of pivot() (or of pivot_table when no aggregation is involved).
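A round trip on a two-row frame: melt to long format, then pivot back to wide. Column names are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({'id': [1, 2], 'q1': [10, 30], 'q2': [20, 40]})

# Wide to long: one row per (id, quarter) pair
long = wide.melt(id_vars='id', var_name='quarter', value_name='sales')
print(long)

# Long back to wide
back = long.pivot(index='id', columns='quarter', values='sales').reset_index()
print(back)
```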

Q24. What is the difference between fillna() and interpolate()? fillna() fills with a constant or a statistic such as the mean or median; forward and backward fill are available through ffill() and bfill(). interpolate() estimates missing values from surrounding known values, which suits time series where intermediate values should vary smoothly.
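The three strategies applied to the same gap, for comparison:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0)       # constant fill
forward = s.ffill()          # carry the last known value forward
linear = s.interpolate()     # linear interpolation between neighbours (the default)

print(constant.tolist())  # [1.0, 0.0, 0.0, 4.0]
print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(linear.tolist())    # [1.0, 2.0, 3.0, 4.0]
```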

Q25. How do you detect data leakage? Common signs: unexpectedly high model performance, feature that wouldn't exist at prediction time, fitting preprocessors on full dataset. Solution: strictly maintain train/test separation throughout the pipeline.
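One way to avoid the preprocessing form of leakage is to compute scaling statistics on the training rows only and reuse them for the test rows. This is a sketch with synthetic data and a plain positional split, both assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # synthetic feature matrix

split = 80                      # arbitrary 80/20 split
X_train, X_test = X[:split], X[split:]

# Fit the scaler on the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # reuse training statistics, never refit on test

# Leaky alternative (avoid): X.mean(axis=0) would let test rows influence the scaling
```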

---

Section C: 8 Coding Challenges

python
import pandas as pd
import numpy as np

# Challenge 1: Find top 3 products by revenue
def top_products(df, n=3):
    return df.groupby('Product')['Revenue'].sum().nlargest(n)

# Challenge 2: Calculate month-over-month growth
def mom_growth(df):
    monthly = df.groupby(df['Date'].dt.to_period('M'))['Revenue'].sum()
    return monthly.pct_change() * 100

# Challenge 3: Detect outliers using IQR
def remove_outliers(df, col):
    Q1, Q3 = df[col].quantile([0.25, 0.75])
    IQR = Q3 - Q1
    return df[(df[col] >= Q1 - 1.5*IQR) & (df[col] <= Q3 + 1.5*IQR)]

# Challenge 4: Normalize within groups
def normalize_by_group(df, value_col, group_col):
    df[f'{value_col}_normalized'] = df.groupby(group_col)[value_col].transform(
        lambda x: (x - x.mean()) / x.std()
    )
    return df

# Challenge 5: Fill missing with group median
def fill_with_group_median(df, col, group_col):
    df[col] = df.groupby(group_col)[col].transform(
        lambda x: x.fillna(x.median())
    )
    return df

# Challenge 6: Rolling 7-day average
def add_rolling_avg(df, col, window=7):
    df[f'{col}_MA{window}'] = df[col].rolling(window).mean()
    return df

# Challenge 7: Customer RFM analysis
def rfm_analysis(df, now=pd.Timestamp('2024-12-31')):
    rfm = df.groupby('CustomerID').agg(
        Recency=('Date', lambda x: (now - x.max()).days),
        Frequency=('OrderID', 'count'),
        Monetary=('Revenue', 'sum')
    )
    # Score each dimension 1-5
    for col in ['Recency', 'Frequency', 'Monetary']:
        ascending = col == 'Recency'  # Lower recency is better
        # Rank first so tied values (common for Frequency) cannot produce
        # duplicate bin edges, which would make pd.qcut raise a ValueError
        ranked = rfm[col].rank(method='first')
        rfm[f'{col}_Score'] = pd.qcut(ranked, q=5,
                                      labels=[5,4,3,2,1] if ascending else [1,2,3,4,5])
    rfm['RFM_Score'] = (rfm['Recency_Score'].astype(int) +
                        rfm['Frequency_Score'].astype(int) +
                        rfm['Monetary_Score'].astype(int))
    return rfm

# Challenge 8: Pivot wide to long
def wide_to_long(df, id_cols, value_cols):
    return pd.melt(df, id_vars=id_cols, value_vars=value_cols,
                   var_name='Variable', value_name='Value')

print("✅ All 8 challenges implemented!")

10 MCQs

Question 1

np.argsort(arr)[-3:] returns?

Question 2

df.itertuples() vs df.iterrows()?

Question 3

pd.melt() converts?

Question 4

SettingWithCopyWarning fix?

Question 5

np.partition(arr, -k)[-k:] vs np.argsort?

Question 6

RFM stands for?

Question 7

np.corrcoef(X.T) expects?

Question 8

groupby().transform(lambda x: x.fillna(x.median())) fills with?

Question 9

Data leakage most commonly occurs?

Question 10

np.einsum('ij,jk->ik', A, B) computes?

Interview Questions (Coding)

  • Write a function to compute rolling 7-day average and add it as a new column.
  • Find all customers who spent more than $1000 in a single month.
  • Normalize a DataFrame column within each department group.
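The second question in the list above is not covered by the challenges, so here is one possible sketch. It assumes an orders table with CustomerID, Date, and Revenue columns (the same names used in the challenges); the sample rows and the function name big_spenders are hypothetical.

```python
import pandas as pd

# Hypothetical sample orders
orders = pd.DataFrame({
    'CustomerID': ['A', 'A', 'B', 'B', 'C'],
    'Date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-01-10',
                            '2024-02-10', '2024-03-01']),
    'Revenue': [600.0, 500.0, 700.0, 700.0, 1500.0],
})

def big_spenders(df, threshold=1000):
    # Total revenue per customer per calendar month
    monthly = df.groupby(['CustomerID', df['Date'].dt.to_period('M')])['Revenue'].sum()
    # Keep customer-months above the threshold, then list the distinct customers
    over = monthly[monthly > threshold]
    return sorted(over.index.get_level_values('CustomerID').unique())

result = big_spenders(orders)
print(result)  # ['A', 'C']
```

Customer B spends 1400 in total but never more than 700 in one month, so grouping by month rather than by customer alone is what excludes them.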

Summary

25 interview questions covering NumPy internals (broadcasting, views, structured arrays) and Pandas operations (loc/iloc, groupby, merge, EDA). 8 coding challenges test practical skills: RFM analysis, outlier removal, group normalization, and time series rolling functions. Questions of this kind are common in data science interviews at major tech companies.

Next Chapter Recommendation

In Chapter 29: Performance Optimization, we systematically benchmark and optimize Pandas and NumPy code for production-scale workloads.
