ML and NLP interview questions
Machine Learning Questions
1. Difference between SGD / AdaGrad / RMSprop / Adam / AdamW?
- SGD: \(W_{t+1}=W_t-\lambda \nabla_w J(w)\). One fixed learning rate \(\lambda\) for all parameters.
- AdaGrad: \(W_{t+1}=W_t-\frac{\lambda}{\sqrt{G_t+\epsilon}} \nabla_w J(w)\), where \(G_t\) accumulates the squared gradients up to step \(t\), so each parameter's effective learning rate only shrinks over time.
- RMSprop: \(W_{t+1}=W_t-\frac{\lambda}{\sqrt{E\left[G^2\right]_t+\epsilon}} \nabla_w J(w)\). Replaces AdaGrad's running sum with an exponential moving average, so the effective learning rate does not decay to zero.
- Adam: adds momentum (a first-moment estimate) to RMSprop's second-moment estimate, with bias correction for both:
\begin{aligned} & m_t=\beta_1 m_{t-1}+\left(1-\beta_1\right) \nabla_w J(w) \\ & v_t=\beta_2 v_{t-1}+\left(1-\beta_2\right)\left(\nabla_w J(w)\right)^2 \\ & \hat{m}_t=\frac{m_t}{1-\beta_1^t} \\ & \hat{v}_t=\frac{v_t}{1-\beta_2^t} \\ & W_{t+1}=W_t-\frac{\lambda}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t \end{aligned}
- AdamW: \(W_{t+1}=W_t-\lambda\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}+\alpha W_t\right)\), where \(\alpha\) is the weight-decay coefficient. Weight decay is decoupled from the adaptive gradient update rather than folded into the gradient as L2 regularization.
[resource]: a good blog
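As a minimal sketch of the update above (not the linked blog's code), one AdamW step in NumPy with hyperparameters named as in the formulas; setting `weight_decay=0` recovers plain Adam:
```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (RMS term)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```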
2. Bias-Variance Tradeoff
Bias is the systematic error that comes from overly simple assumptions: how far the model's predictions are, on average, from the actual values. Variance is how much the predictions change when the model is trained on different samples of the data.
- Increasing model complexity decreases bias but increases variance.
- Simpler models have higher bias but lower variance.
3. Overfitting and Underfitting
How to Prevent Overfitting:
- Reduce model complexity: prune decision trees, use fewer layers in deep networks.
- Regularization: L1/L2 regularization, dropout in neural networks.
- Use more training data to help the model generalize better.
- Cross-validation: k-fold CV to test model robustness.
- Early stopping: stop training when validation loss increases.
How to Prevent Underfitting:
- Increase model complexity: use deeper networks, more features.
- Reduce regularization: decrease L1/L2 penalties.
- Use better feature engineering.
- Train for longer.
4. Random Forest vs. Gradient Boosting
Both Random Forest and Gradient Boosting are ensemble learning methods based on decision trees.
Feature | Random Forest (RF) | Gradient Boosting (GB) |
---|---|---|
Training | Parallel (independent trees) | Sequential (each tree fixes previous errors) |
Bias vs. Variance | Higher bias, lower variance | Higher variance, lower bias |
Overfitting | Less prone to overfitting | Can overfit if not tuned properly |
Speed | Faster (parallel computation) | Slower (sequential computation) |
Best For | General-purpose models, stability | High-performance tasks, structured data |
Robustness | Works well with default settings | Requires careful tuning |
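A quick way to see both in action, using scikit-learn's stock implementations on a toy dataset (data and settings are illustrative):
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```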
5. Decision tree for classification
Splits are chosen to minimize node impurity. Two common metrics:
- Gini Impurity: For each node, calculate \(Gini=1-\sum_{i=1}^C p_i^2\). \(C\) is the number of classes, \(p_i\) is the proportion of samples belonging to class \(i\) in a given node.
- Entropy and Information Gain: \(Entropy = -\sum_{i=1}^C p_i \log_2 p_i\). The information gain of a split is \(IG = Entropy_{parent} - \sum_i \frac{N_i}{N} \times Entropy_i\), where \(N_i\) is the number of samples in child node \(i\).
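A small NumPy sketch of both metrics (function names are ours):
```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)
```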
6. Bootstrap
Bootstrap is a resampling technique used to create multiple datasets from a single dataset by randomly sampling with replacement. It is commonly used in ensemble learning (like Random Forest) to improve model stability and performance.
Advantages of Bootstrap:
- Reduces variance by averaging multiple models.
- Improves generalization by training on diverse datasets.
- Works well for small datasets where splitting into train/test might lose valuable data.
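A minimal NumPy sketch: resample with replacement and estimate the spread of a statistic across resamples (toy data):
```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=50)   # toy dataset

# Each bootstrap sample draws n points *with replacement*.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]

# The spread of the resampled means estimates the standard error of the mean.
print(np.std(boot_means))
```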
7. L1 and L2 regularization
L1 Regularization (Lasso):
- Encourages sparsity (sets some weights to exactly zero).
- Leads to feature selection (removes less important features).
L2 Regularization (Ridge):
- Shrinks weights toward zero but does not eliminate them.
- Helps with multicollinearity (reduces variance but keeps all features).
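The sparsity difference is easy to verify with scikit-learn (toy data; the `alpha` values are illustrative):
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("L1 zeroed coefficients:", np.sum(lasso.coef_ == 0))  # several exact zeros
print("L2 zeroed coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```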
8. Explain Principal Component Analysis (PCA)
Steps:
- Standardize the Data: Each feature has mean = 0 and variance = 1.
- Compute the Covariance Matrix: Measures the relationships between features.
- Compute Eigenvalues & Eigenvectors
- Select the Top Principal Components
- Project Data onto New Axes
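The five steps map directly onto a few lines of NumPy (a sketch, not a drop-in replacement for a library PCA):
```python
import numpy as np

def pca(X, k):
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. standardize
    cov = np.cov(X, rowvar=False)              # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigenvalues & eigenvectors
    order = np.argsort(eigvals)[::-1]          # 4. top-k principal components
    components = eigvecs[:, order[:k]]
    return X @ components                      # 5. project onto new axes
```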
9. PCA vs LDA
PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) are both dimensionality reduction techniques. PCA: Focuses on preserving variance in the data. LDA focuses on maximizing class separability.
Why does PCA focus on the variance of the data?
"Variance in the data" means how spread out the data points are across dimensions. If a feature has high variance, its values are more spread out, which typically means it carries more distinguishing information. PCA therefore finds the directions of highest variance, because those directions retain the most information.
Feature | PCA (Principal Component Analysis) | LDA (Linear Discriminant Analysis) |
---|---|---|
Purpose | Maximizes variance to retain the most information | Maximizes class separability for better classification |
Supervision | Unsupervised (does not use class labels) | Supervised (requires class labels) |
What It Finds | Directions with the highest data variance | Directions that best separate different classes |
How It Works | Computes eigenvectors of the covariance matrix | Computes eigenvectors of the scatter matrices (between-class and within-class) |
Dimensionality Constraint | Can reduce dimensions up to the number of original features | Can reduce dimensions to at most C - 1 (where C = number of classes) |
Best For | Data compression, feature reduction, visualization | Classification tasks where class separation is important |
Output | Orthogonal principal components capturing the most variance | New axes optimized for class distinction |
10. Batch Normalization and Layer Normalization
Batch Normalization normalizes activations across the batch dimension separately for each feature, whereas Layer Normalization normalizes across the feature dimension separately for each input sample.
Feature | Batch Normalization | Layer Normalization |
---|---|---|
Normalization Axis | Across batch (for each feature) | Across features (for each sample) |
Dependency | Depends on batch size | Works independently of batch size |
Use Case | CNNs, large batch training | RNNs, Transformers, small batch training |
Training Stability | Can be unstable in small batches | Stable for all batch sizes |
Use BN when working with structured grid-like data. Use LN when working with sequential or non-grid data.
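The axis difference in one NumPy sketch (learnable scale and shift parameters omitted):
```python
import numpy as np

x = np.random.randn(4, 8)   # (batch, features)

# Batch norm: per-feature statistics, computed across the batch axis.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Layer norm: per-sample statistics, computed across the feature axis.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)
```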
11. Activation Functions
- ReLU: \(f(x)=\max (0, x)\)
- Leaky ReLU: \(f(x) = \max(0.01x, x)\)
- Sigmoid: \(f(x) = \frac{1}{1 + e^{-x}}\)
- Tanh: \(f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
- Softmax: \(f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}\)
Why is ReLU better and more often used than Sigmoid in neural networks?
1. Avoids the vanishing gradient problem: the derivative of the sigmoid is very small when \(x\) is large in magnitude, so gradients shrink layer by layer; ReLU's gradient is 1 for all positive inputs.
2. Computational efficiency: a simple \(\max(0, x)\) operation is much cheaper to compute than exponentials.
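All five functions in NumPy (softmax shifts by the max for numerical stability):
```python
import numpy as np

def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.maximum(0.01 * x, x)
def sigmoid(x):    return 1 / (1 + np.exp(-x))
def tanh(x):       return np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow
    return e / e.sum()
```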
12. Describe how convolution works
Key concepts: convolution, kernel, stride, receptive field.
Receptive field: the region of the input image that a neuron in a particular layer "sees". Deeper layers have larger receptive fields, capturing broader and more complex features; by the final layers, neurons can "see" almost the entire image. Layer by layer, \(RF_l = RF_{l-1} + (K_l - 1) \times \prod_{i<l} S_i\), where \(K_l\) is layer \(l\)'s kernel size and the product runs over the strides of all preceding layers. For example, two stacked 3×3 stride-1 convolutions give \(RF = 1 + 2 + 2 = 5\).
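A sketch that applies the recurrence layer by layer (the helper function is ours):
```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf, jump = 1, 1                 # jump = product of strides so far
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # two 3x3 stride-1 convs -> 5
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # 3 + 2*2 + 2*2 = 11
```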
13. Why do we use convolutions for images rather than just FC layers
- Preserve Spatial Structure: Images have local patterns (e.g., edges, textures), and convolutions capture them while FC layers lose spatial relationships.
- Reduce Parameters: A fully connected layer requires each neuron to connect to every pixel, leading to huge parameter sizes for high-resolution images.
- Translation Invariance: Convolutions allow the network to detect patterns anywhere in the image, while FC layers memorize specific positions.
14. Implement a sparse matrix class
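No reference answer is included; a minimal dictionary-of-keys (DOK) sketch:
```python
class SparseMatrix:
    """Dictionary-of-keys sparse matrix: stores only nonzero entries."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self.data = {}                       # (row, col) -> value

    def __setitem__(self, key, value):
        if value == 0:
            self.data.pop(key, None)         # keep the dict sparse
        else:
            self.data[key] = value

    def __getitem__(self, key):
        return self.data.get(key, 0)         # absent entries read as zero

    def dot_vector(self, v):
        """Matrix-vector product that touches only the nonzero entries."""
        result = [0] * self.shape[0]
        for (i, j), val in self.data.items():
            result[i] += val * v[j]
        return result
```
Other common layouts (CSR/CSC) trade slower insertion for faster row or column arithmetic.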
15. Reverse a bitstring
The snippet below reverses the byte order in place (yielding `C0 DE DE AD`); note that it does not reverse the bits inside each byte.
```python
data = b'\xAD\xDE\xDE\xC0'
my_data = bytearray(data)   # bytes objects are immutable; bytearray is mutable
my_data.reverse()           # in-place byte-order reversal
```
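To reverse the entire bitstring (the last bit becomes the first), one sketch goes through an integer round-trip (the helper name is ours):
```python
def reverse_bits(data: bytes) -> bytes:
    """Reverse all len(data) * 8 bits, not just the byte order."""
    width = len(data) * 8
    n = int.from_bytes(data, 'big')
    reversed_n = int(f'{n:0{width}b}'[::-1], 2)  # flip the zero-padded bit string
    return reversed_n.to_bytes(len(data), 'big')

print(reverse_bits(b'\xAD\xDE\xDE\xC0').hex())   # 037b7bb5
```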
16. What is the significance of Residual Networks?
Residual (skip) connections give layers direct access to features from earlier layers, which makes information propagation through the network much easier. Because the identity path lets gradients flow straight through, they also help solve the vanishing gradient problem, enabling much deeper networks to be trained.
17. How to deal with an imbalanced dataset?
- Oversampling or undersampling
- Data augmentation
- Algorithm-wise: class weight adjustment, penalizing mistakes on the minority class more heavily (see the sketch after this list)
- Using appropriate metrics: e.g. Precision, Recall, F1-score, AUC-ROC Curve
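A class-weight sketch with scikit-learn (toy 95/5 imbalance; `'balanced'` reweights each class inversely to its frequency):
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight='balanced').fit(Xtr, ytr)
print(f1_score(yte, clf.predict(Xte)))   # judge with minority-aware metrics
```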
18. Precision, Recall, F1-score, AUC-ROC Curve, P-R Curve
- ROC Curve: True Positive Rate (TPR) [y-axis], False Positive Rate (FPR) [x-axis]
- AUC: Area Under the ROC Curve
ROC Curve shows model performance at various thresholds.
P-R Curve: Precision (Positive Predictive Value) [y-axis], Recall (Sensitivity) [x-axis]
Scenario | Use ROC Curve? | Use P-R Curve? |
---|---|---|
Balanced dataset | ✅ Yes | ❌ No |
Imbalanced dataset (rare positives) | ❌ No | ✅ Yes |
False Positives matter more | ❌ No | ✅ Yes |
Overall classification performance | ✅ Yes | ❌ No |
Detecting rare events (e.g., fraud, disease) | ❌ No | ✅ Yes |
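Computing both summary areas with scikit-learn (toy data; `average_precision_score` is a standard summary of the P-R curve):
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
print("ROC-AUC:", roc_auc_score(y, scores))
print("PR-AUC :", average_precision_score(y, scores))
```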
19. Difference between supervised, unsupervised, and reinforcement learning.
Type | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
---|---|---|---|
Definition | Learns from labeled data (input-output pairs) | Learns patterns from unlabeled data | Learns by interacting with an environment and receiving rewards |
Data Type | Labeled (e.g., (image, label)) | Unlabeled (only input data) | Sequential decision-making data |
Goal | Minimize error between predictions and true labels | Find hidden structures/patterns | Maximize cumulative rewards |
Examples | Classification (e.g., spam detection), Regression (e.g., house price prediction) | Clustering (e.g., customer segmentation), Dimensionality Reduction (e.g., PCA) | Game playing (AlphaGo), Robotics, Self-driving cars |
Training Approach | Uses loss function (e.g., MSE, Cross-Entropy) | Finds similarities or distributions | Uses trial-and-error learning (policy optimization) |
20. Techniques for data augmentation in CV.
- Geometric Transformations: Rotation, Flipping, Scaling
- Color-Based Augmentations: Brightness Adjustment, Contrast Adjustment, Hue & Saturation Changes
- Noise-Based Augmentations: Gaussian Noise, Blur, Cutout
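A typical pipeline sketch with `torchvision.transforms` (parameter values are illustrative):
```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(15),                    # geometric
    transforms.RandomHorizontalFlip(),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),  # color-based
    transforms.GaussianBlur(kernel_size=3),           # blur/noise-based
    transforms.ToTensor(),
])
```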
21. What is vanishing gradient?
The vanishing gradient problem occurs in deep neural networks when gradients become extremely small during backpropagation, making it difficult for the early layers to learn.
Why Does It Happen?
- Activation Function Effects: such as Sigmoid and Tanh functions
- Weight Initialization Issues: such as weights are too small
- Deep Networks (Many Layers): If gradients are less than 1, they shrink exponentially.
How to address vanishing gradient?
- Use ReLU Instead of Sigmoid or Tanh
- Batch Normalization
- Better Weight Initialization
- Use Residual Networks (ResNet)
22. What is dropout?
During each forward pass, dropout randomly "drops" (sets to zero) a percentage of neurons in the layer. This prevents the network from becoming too dependent on specific neurons and forces it to learn more robust and generalizable features.
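An inverted-dropout forward pass in NumPy (scaling by \(1/(1-p)\) at train time keeps expected activations unchanged at inference):
```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero a fraction p of units and rescale the rest."""
    if not training:
        return x                      # no-op at inference time
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)
    return x * mask
```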
23. How does t-SNE work?
t-SNE visualizes high-dimensional data by placing similar points close together in a 2D or 3D space. It converts pairwise distances into similarity probabilities (a Gaussian kernel in the original space, a heavier-tailed Student-t in the low-dimensional space) and minimizes the KL divergence between the two distributions.
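Typical usage via scikit-learn (dataset and perplexity are illustrative):
```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)   # (1797, 2): one 2-D point per digit image
```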
24. Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) is a training method for aligning language models (LLMs) with human preferences without needing reinforcement learning. It serves as an alternative to Reinforcement Learning from Human Feedback (RLHF) while maintaining simplicity and efficiency.
The loss is built around the implicit-reward margin between the policy \(P_\theta\) and a frozen reference model \(P_{\text{ref}}\): \( margin = \log \frac{P_\theta\left(y_{\text{preferred}} \mid x\right)}{P_\theta\left(y_{\text{disliked}} \mid x\right)}-\log \frac{P_{\text{ref}}\left(y_{\text{preferred}} \mid x\right)}{P_{\text{ref}}\left(y_{\text{disliked}} \mid x\right)}\), with the full objective \(\mathcal{L} = -\log \sigma(\beta \cdot margin)\).
At initialization \(P_\theta = P_{\text{ref}}\), so the margin is zero and the loss sits at its starting value; the gradient there is nonzero, however, so training still shifts probability mass toward the preferred responses, gradually moving \(P_\theta\) away from \(P_{\text{ref}}\) to increase preference alignment.
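A PyTorch sketch of the loss, assuming each argument is the summed log-probability of a whole response under the corresponding model (argument names are ours):
```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from per-response log-probabilities (1-D tensors)."""
    # Implicit-reward margin: how much more the policy prefers the chosen
    # response than the frozen reference model does.
    margin = ((policy_chosen_logps - policy_rejected_logps)
              - (ref_chosen_logps - ref_rejected_logps))
    return -F.logsigmoid(beta * margin).mean()
```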
25. DPO vs RLHF
Feature | DPO (Direct Preference Optimization) | RLHF (Reinforcement Learning from Human Feedback) |
---|---|---|
Training Complexity | Simple (direct fine-tuning) | Complex (requires multiple stages: reward model + RL) |
Optimization Method | Supervised learning (log-ratio loss) | Reinforcement learning (PPO, policy optimization) |
Reward Model Needed? | ❌ No | ✅ Yes (trained separately using preference data) |
Loss Function | Log-ratio of preference probabilities | Reward function trained via RL (KL-regularized PPO) |
KL Regularization | Implicitly baked into the objective | Explicit KL penalty required to prevent divergence |
Gradient Stability | More stable (direct loss minimization) | Unstable (gradient explosion, KL tuning required) |
Computational Cost | Lower (direct preference fine-tuning) | Higher (training a reward model + RL updates) |
Convergence Speed | Faster | Slower |
Interpretability | More interpretable (directly optimizes preference scores) | Less interpretable (reward models can introduce bias) |
Primary Use Cases | Aligning LLMs with human preferences (e.g., chatbots, summarization) | Preference alignment, reinforcement learning tasks (e.g., AI safety, robotics) |
Challenges | Still under research, requires careful hyperparameter tuning | Instability, difficulty in fine-tuning rewards |
26. Difference between Discriminative models and Generative models
Feature | Discriminative Models | Generative Models |
---|---|---|
Definition | Learn the decision boundary between classes | Model the actual distribution of the data |
Probability Modeled | \(P(y \| X)\) (Conditional probability of labels given input) | \(P(X, y)\) (Joint probability of input and labels) |
Goal | Directly classify or predict labels | Generate new data or classify by computing likelihood |
Examples | Logistic Regression, SVM, Random Forest, Neural Networks | Naïve Bayes, Gaussian Mixture Model (GMM), GANs |
Advantage | Usually more accurate for classification tasks | Useful for data generation and handling missing data |
Disadvantage | Requires large labeled datasets, less flexible for missing data | Can be computationally expensive and harder to optimize |
In short:
- Discriminative models focus on learning boundaries for classification.
- Generative models learn the data distribution and can generate new samples.
27. How to deal with exploding gradients?
- Gradient Clipping: set a threshold and clip gradients during backpropagation (see the sketch after this list).
- Use Smaller Learning Rate: Reducing the learning rate prevents drastic weight updates.
- Weight Regularization
- Use Proper Weight Initialization
- Use Normalization
- Switch to a Different Activation Function: ReLU (instead of sigmoid/tanh) helps mitigate vanishing gradients but can cause exploding gradients. Try LeakyReLU or ELU instead.
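Gradient clipping in PyTorch (a toy model; `max_norm` is illustrative):
```python
import torch

model = torch.nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()

# Rescale all gradients so their global norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```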
28. Why does gradient vanishing happen?
- Activation Functions with Small Derivatives: such as Sigmoid and Tanh
- Chain Rule in Backpropagation: if each local derivative is a small number (e.g., < 1), their product shrinks exponentially as we go deeper, leaving near-zero gradients for the early layers (the demo below illustrates this).
- Improper Weight Initialization
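A two-line illustration of the chain-rule effect: sigmoid's derivative never exceeds 0.25, so even the best case shrinks exponentially with depth:
```python
# Upper bound on the gradient surviving a chain of sigmoid layers.
for depth in (5, 10, 20):
    print(depth, 0.25 ** depth)   # 9.8e-04, 9.5e-07, 9.1e-13
```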
NLP Questions
1. Transformers and Multi-head Self-attention
Please see the code: https://github.com/audreycs/transformer_from_scratch.
1. Why use multiple attention heads?
A: Multiple attention heads let the model attend to different aspects of the input simultaneously, each head working in its own learned subspace.
2. When calculating multi-head attention, why scale the scores before taking the softmax?
A: Dot products grow with the key dimension \(d_k\); dividing by \(\sqrt{d_k}\) keeps the scores in a range where the softmax is not saturated, which stabilizes gradients and improves numerical stability.
3. Why add positional encoding to the input?
A: Transformers have no recurrence (RNNs) or convolution (CNNs), meaning they do not inherently understand token order. Unlike RNNs, which process sequences sequentially, Transformers process all tokens in parallel, making them faster but order-agnostic. Positional Encoding (PE) is added to the input embeddings so that the model can capture token positions and sequence order.
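For quick reference alongside the linked repo, a single-head scaled dot-product attention sketch in NumPy:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # scale before the softmax
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values
```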