Pandas and NumPy Interview Preparation
# CHAPTER 28
1. Chapter Introduction
This chapter compiles the 50 most-asked Pandas and NumPy interview questions from Google, Amazon, Meta, and top tech companies, with expert answers, code examples, and 20 coding challenges.
---
Section A: NumPy (Q1-15)
Q1. What is NumPy and why is it faster than Python lists?
NumPy arrays store homogeneous data in contiguous memory blocks and use C-level vectorized operations. Python lists hold pointers to separately allocated, possibly heterogeneous objects and incur interpreter overhead per element. For numerical operations NumPy is often one to two orders of magnitude faster, though the exact speedup depends on the operation and the array size.
Q2. Difference between np.array() and np.asarray()?
np.array() copies its input by default (copy=True). np.asarray() returns the input unchanged when it is already an ndarray with a compatible dtype, which makes it the more memory-efficient choice when no copy is needed.
Q3. What is broadcasting in NumPy?
Broadcasting allows operations on arrays of different shapes. NumPy virtually stretches axes of size 1 in the smaller array to match the larger array's shape, without actually copying data. Rules: compare the shapes starting from the trailing (rightmost) dimension; each pair of dimensions must be equal, or one of them must be 1. Missing leading dimensions are treated as 1.
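The rules above can be sketched with two small arrays. The array names here are illustrative only; `np.broadcast_to` is used just to make the "virtual" expansion visible.

```python
import numpy as np

# Shapes (3, 1) and (1, 4) broadcast to (3, 4): each size-1 axis is stretched.
col = np.arange(3).reshape(3, 1)      # [[0], [1], [2]]
row = np.arange(4).reshape(1, 4)      # [[0, 1, 2, 3]]
table = col * 10 + row                # shape (3, 4)

# broadcast_to returns a read-only view with stride 0 along the stretched
# axis, so the expansion does not copy any data.
expanded = np.broadcast_to(row, (3, 4))

# Trailing dimensions 3 and 4 are neither equal nor 1, so this fails.
try:
    np.ones((2, 3)) + np.ones(4)
    compatible = True
except ValueError:
    compatible = False
```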
Q4. Difference between copy() and a view (slice)?
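A basic slice creates a view: it shares memory with the original, so modifying the view modifies the original. Indexing with an integer or boolean array (fancy indexing) returns a copy instead. .copy() creates an independent array. Check with np.shares_memory(a, b).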
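A minimal sketch of the difference, using a small throwaway array:

```python
import numpy as np

a = np.arange(6)            # [0, 1, 2, 3, 4, 5]
view = a[1:4]               # basic slice: shares memory with a
copy = a[1:4].copy()        # independent buffer

view[0] = 100               # also changes a[1]
copy[1] = -1                # leaves a untouched

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, copy)
```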
Q5. How does np.where() work?
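np.where() has two call forms: with three arguments it selects element-wise between two sources based on a condition, and with one argument it returns the indices where the condition is true. A minimal sketch with made-up scores:

```python
import numpy as np

scores = np.array([35, 80, 52, 91, 47])

# Three-argument form: take "pass" where the condition holds, else "fail".
labels = np.where(scores >= 50, "pass", "fail")

# One-argument form: returns a tuple of index arrays, one per dimension.
idx = np.where(scores >= 50)[0]
```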
Q6. What is np.vectorize()?
Wraps a Python function to work element-wise on arrays. It's NOT truly vectorized (still loops internally) — just more convenient than explicit loops. For real speed, use native NumPy operations.
Q7. Explain np.einsum().
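Einstein summation: a compact index notation for tensor operations. Indices that appear in the inputs but not in the output are summed over, so np.einsum('ij,jk->ik', A, B) is matrix multiplication. More expressive than np.dot for operations such as batched products, traces, and axis reductions.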
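A few common patterns, sketched on random matrices (the shapes and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

matmul = np.einsum("ij,jk->ik", A, B)     # j is summed: same as A @ B
row_sums = np.einsum("ij->i", A)          # j is summed: same as A.sum(axis=1)
trace = np.einsum("ii->", A[:, :3])       # repeated index: diagonal sum of a 3x3 block

# Batched matrix product: b is kept, j is summed.
batched = np.einsum("bij,bjk->bik", np.stack([A, A]), np.stack([B, B]))
```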
Q8. What are structured arrays?
Arrays with named, typed fields per element — like a database row. dtype=[('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]. Access by field name: arr['name'].
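A short sketch using the dtype from the answer above; the three records are invented for illustration.

```python
import numpy as np

dtype = [("name", "U20"), ("age", "i4"), ("salary", "f8")]
staff = np.array(
    [("Ana", 31, 85000.0), ("Ben", 45, 120000.0), ("Caz", 28, 67000.0)],
    dtype=dtype,
)

names = staff["name"]                        # one field across all records
over_30 = staff[staff["age"] > 30]           # boolean mask on a field
by_salary = np.sort(staff, order="salary")   # sort records by a field
```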
Q9. How to find the N largest elements?
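Two common approaches are a full sort followed by a slice, or np.argpartition followed by sorting only the selected elements. A minimal sketch, with an arbitrary example array:

```python
import numpy as np

arr = np.array([12, 3, 45, 7, 29, 45, 1, 18])
n = 3

# Option 1: full sort, then take the last n and reverse to descending order.
top_sorted = np.sort(arr)[-n:][::-1]

# Option 2: partition so the n largest occupy the last n slots (in no
# particular order), then sort just those n values.
idx = np.argpartition(arr, -n)[-n:]
idx = idx[np.argsort(arr[idx])[::-1]]
top_partitioned = arr[idx]
```

Option 2 avoids sorting the whole array, which matters mainly when the array is large and n is small.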
Q10. What is np.linalg.lstsq()?
Least-squares solution to Ax = b when the system may be overdetermined (more equations than unknowns). Used for linear regression from scratch.
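As a sketch of "linear regression from scratch", the example below fits a line to noise-free points on y = 2x + 1, so the recovered coefficients can be checked by eye. The data is made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix with a column of ones for the intercept:
# 5 equations, 2 unknowns (overdetermined).
A = np.column_stack([x, np.ones_like(x)])

coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef
```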
Q11. Difference between * and @ for arrays?
* is element-wise multiplication. @ is matrix multiplication (dot product). For 2D arrays A @ B == np.dot(A, B).
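A quick side-by-side on two 2x2 matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B     # multiplies matching entries
matrix = A @ B          # rows of A dotted with columns of B
```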
Q12. How to efficiently compute a correlation matrix?
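One common approach is np.corrcoef, which by default treats each row as a variable; with samples in rows and features in columns, pass rowvar=False (or transpose). The sketch below uses synthetic data and also shows an equivalent manual computation via standardized columns.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                # rows = samples, columns = features
X[:, 2] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # feature 2 tracks feature 0

# Features are in columns, so tell corrcoef not to treat rows as variables.
corr = np.corrcoef(X, rowvar=False)

# Manual equivalent: standardize each column, then a single matrix product.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr_manual = Z.T @ Z / X.shape[0]
```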
Q13. What is np.memmap?
Memory-mapped file — maps a portion of a file into memory. Enables working with arrays larger than RAM by reading only needed parts from disk on demand.
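A minimal sketch that writes a small file-backed array and reopens it read-only. The file name and shape are arbitrary; a real use case would involve a file too large to load at once.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.dat")

# Create a file-backed array; every row is filled with 0..9 via broadcasting.
writer = np.memmap(path, dtype="float32", mode="w+", shape=(1000, 10))
writer[:] = np.arange(10, dtype="float32")
writer.flush()          # push pending changes to disk
del writer

# Reopen read-only; only the parts that are indexed are read from disk.
reader = np.memmap(path, dtype="float32", mode="r", shape=(1000, 10))
chunk_mean = float(reader[:100, 3].mean())
```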
Q14. How does np.argpartition() differ from np.argsort()?
argsort fully sorts (O(n log n)). argpartition(arr, k) only partitions: the k-th element lands in its sorted position, with smaller elements before it and larger ones after it, in O(n) time. Faster when you only need the top k and do not care about their order.
Q15. What is np.unique() and what does it return?
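np.unique() returns the sorted unique values of an array; optional flags add counts, first-occurrence indices, and the inverse mapping. A minimal sketch on a small array:

```python
import numpy as np

arr = np.array([3, 1, 2, 3, 1, 3])

values = np.unique(arr)                                # sorted unique values
values, counts = np.unique(arr, return_counts=True)    # plus occurrences of each
_, first_idx = np.unique(arr, return_index=True)       # first position of each value
_, inverse = np.unique(arr, return_inverse=True)       # indices that rebuild arr

rebuilt = values[inverse]
```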
---
Section B: Pandas (Q16-25)
Q16. Difference between loc and iloc?
loc[label, label] — label-based. iloc[pos, pos] — integer position-based. loc slices are inclusive on both ends; iloc stop is exclusive.
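The inclusive versus exclusive slicing difference is easy to miss; a small sketch with a string index makes it visible:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = df.loc["a":"c", "price"]     # label slice includes "c": 3 rows
by_position = df.iloc[0:2, 0]           # position slice excludes 2: 2 rows
```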
Q17. What is SettingWithCopyWarning?
Occurs with chained indexing: df[mask]['col'] = x. Pandas can't determine if the result is a view or copy. Fix: df.loc[mask, 'col'] = x.
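A minimal sketch of the fix, on an invented DataFrame. The chained form is shown only in a comment because its behavior depends on the pandas version and settings.

```python
import pandas as pd

df = pd.DataFrame({"dept": ["eng", "ops", "eng"], "bonus": [0, 0, 0]})
mask = df["dept"] == "eng"

# Chained indexing such as df[mask]["bonus"] = 500 may write to a temporary
# object. A single .loc call selects rows and column together and writes to df.
df.loc[mask, "bonus"] = 500

# When a filtered subset is meant to be independent, take an explicit copy.
eng = df[mask].copy()
eng["bonus"] = 999          # does not touch df
```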
Q18. How does groupby().transform() differ from agg()?
agg() returns one row per group. transform() returns a result aligned to the original index, with the same length as the input, which makes it useful for adding group statistics as features.
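A sketch of both calls on the same grouping; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["eng", "eng", "ops", "ops", "ops"],
    "salary": [100, 120, 60, 70, 80],
})

# agg: one value per group.
per_group = df.groupby("dept")["salary"].agg("mean")

# transform: one value per original row, so it can be assigned as a column.
df["dept_mean"] = df.groupby("dept")["salary"].transform("mean")
df["vs_dept"] = df["salary"] - df["dept_mean"]
```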
Q19. What is the purpose of pd.cut() vs pd.qcut()?
pd.cut(): Equal-width bins, or bins at explicit boundaries you supply. pd.qcut(): Equal-frequency bins (quantile-based), so each bin holds roughly the same number of observations.
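The difference shows up clearly on skewed data. In this sketch a single large value stretches the equal-width bins, while the quantile bins stay balanced:

```python
import pandas as pd

ages = pd.Series([18, 19, 20, 21, 22, 23, 24, 90])

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal width over the range
equal_freq = pd.qcut(ages, q=4)       # 4 intervals split at the quartiles

# Count per bin, in bin order. Seven of the eight values fall in the first
# equal-width bin; the quantile bins each hold two values.
width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
```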
Q20. How do you handle outliers in Pandas?
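One widely used approach is the IQR rule: flag values more than 1.5 times the interquartile range beyond the quartiles, then either drop or cap them. This is a sketch of that one technique on invented data, not the only option (z-scores and domain thresholds are alternatives).

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 300], name="amount")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

dropped = s[~is_outlier]                       # remove flagged rows
capped = s.clip(lower=lower, upper=upper)      # or cap them at the fences
```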
Q21. Explain merge types (inner, left, right, outer).
- inner: Only matching rows (intersection)
- left: All left rows + matching right (NaN for no match)
- right: All right + matching left (NaN for no match)
- outer: All rows from both (NaN where no match)
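The four join types above can be compared on two small tables that share only some keys. The tables are made up; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Caz"]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Oslo", "Lima", "Pune"]})

# Row count produced by each join type.
sizes = {
    how: len(left.merge(right, on="id", how=how))
    for how in ["inner", "left", "right", "outer"]
}

# _merge is "left_only", "right_only", or "both" for each row.
outer = left.merge(right, on="id", how="outer", indicator=True)
```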
Q22. How to efficiently iterate over a DataFrame?
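The usual advice is to avoid row-by-row iteration where a vectorized expression exists, and to prefer itertuples() over iterrows() when a loop is unavoidable. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Preferred: vectorized column arithmetic, no Python-level loop.
df["total"] = df["price"] * df["qty"]

# If a loop is needed, itertuples yields lightweight namedtuples;
# iterrows builds a Series for every row, which is slower.
totals_loop = [row.price * row.qty for row in df.itertuples(index=False)]
```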
Q23. What is pd.melt()?
Reshapes "wide" format to "long" (tidy) format by unpivoting multiple columns into rows. Roughly the inverse of pivot() (pivot_table() additionally aggregates duplicates).
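A sketch of the round trip from wide to long and back, with invented monthly sales columns:

```python
import pandas as pd

wide = pd.DataFrame({"store": ["A", "B"], "jan": [100, 80], "feb": [110, 95]})

# Unpivot the month columns into (month, sales) pairs, one row per pair.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# pivot reverses the reshape (the month columns come back in sorted order).
back = long.pivot(index="store", columns="month", values="sales")
```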
Q24. What is the difference between fillna() and interpolate()?
fillna() fills with a constant or a statistic such as the mean or median; ffill() and bfill() carry the previous or next valid value across the gap. interpolate() estimates missing values from the surrounding known values, which suits time series where intermediate values should vary smoothly.
Q25. How do you detect data leakage?
Common signs: unexpectedly high model performance, a feature that would not exist at prediction time, or preprocessors fitted on the full dataset. Solution: strictly maintain train/test separation throughout the pipeline.
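The three strategies side by side, on a series with a two-value gap:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0.0)     # replace gaps with a fixed value
forward = s.ffill()          # carry the last valid value forward
linear = s.interpolate()     # linear interpolation between neighbors (default)
```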
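As an illustration of the preprocessing case, the sketch below standardizes a column using statistics from the training rows only and contrasts that with a mean computed on the full data. The split and the data are invented; a real pipeline would more likely use a library scaler fitted on the training set.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
train, test = df.iloc[:4], df.iloc[4:]

# Leaky: a statistic computed on the full dataset includes the test row.
leaky_mean = df["x"].mean()

# Safe: estimate parameters on train only, then reuse them on test.
mean, std = train["x"].mean(), train["x"].std()
train_scaled = (train["x"] - mean) / std
test_scaled = (test["x"] - mean) / std
```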
---
Section C: 20 Coding Challenges
10 MCQs
np.argsort(arr)[-3:] returns?
df.itertuples() vs df.iterrows()?
pd.melt() converts?
SettingWithCopyWarning fix?
np.partition(arr, -k)[-k:] vs np.argsort?
RFM stands for?
np.corrcoef(X.T) expects?
groupby().transform(lambda x: x.fillna(x.median())) fills with?
Data leakage most commonly occurs?
np.einsum('ij,jk->ik', A, B) computes?
Interview Questions (Coding)
- Write a function to compute rolling 7-day average and add it as a new column.
- Find all customers who spent more than $1000 in a single month.
- Normalize a DataFrame column within each department group.
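One possible set of answers to the three questions above, as hedged sketches. The function names and the column names (`date`, `sales`, `customer`, `amount`, `dept`, `salary`) are assumptions for illustration; "a single month" is read as a calendar month, and "normalize" as a z-score within each group.

```python
import pandas as pd


def add_rolling_7d_avg(df, date_col="date", value_col="sales"):
    """Return a date-sorted copy with a 7-calendar-day rolling mean column."""
    out = df.sort_values(date_col).copy()
    rolled = out.set_index(date_col)[value_col].rolling("7D").mean()
    out[f"{value_col}_7d_avg"] = rolled.to_numpy()
    return out


def big_spenders(orders, threshold=1000):
    """Customers whose total amount in any one calendar month exceeds threshold."""
    monthly = (
        orders.assign(month=orders["date"].dt.to_period("M"))
        .groupby(["customer", "month"])["amount"]
        .sum()
    )
    hits = monthly[monthly > threshold]
    return sorted(hits.index.get_level_values("customer").unique())


def zscore_within_group(df, group_col="dept", value_col="salary"):
    """Z-score of value_col relative to its own group's mean and sample std."""
    grouped = df.groupby(group_col)[value_col]
    return (df[value_col] - grouped.transform("mean")) / grouped.transform("std")
```

The time-based window `"7D"` handles missing days correctly, whereas an integer window of 7 would count rows. Groups with a single row produce NaN in `zscore_within_group`, since the sample standard deviation is undefined there.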