ML and NLP interview questions

Machine Learning Questions

1. Difference between SGD / AdaGrad / RMSprop / Adam / AdamW?

  • SGD:   \(W_{t+1}=W_t-\lambda \nabla_w J(w)\)
  • AdaGrad:   \(W_{t+1}=W_t-\frac{\lambda}{\sqrt{G_t+\epsilon}} \nabla_w J(w)\), where \(G_t\) is the accumulated sum of squared gradients.
  • RMSprop:   \(W_{t+1}=W_t-\frac{\lambda}{\sqrt{E\left[G^2\right]_t+\epsilon}} \nabla_w J(w)\), where \(E[G^2]_t\) is an exponentially decaying average of squared gradients.
  • Adam:
    \begin{aligned} & m_t=\beta_1 m_{t-1}+\left(1-\beta_1\right) \nabla_w J(w) \\ & v_t=\beta_2 v_{t-1}+\left(1-\beta_2\right)\left(\nabla_w J(w)\right)^2 \\ & \hat{m}_t=\frac{m_t}{1-\beta_1^t} \\ & \hat{v}_t=\frac{v_t}{1-\beta_2^t} \\ & W_{t+1}=W_t-\frac{\lambda}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t \end{aligned}
  • AdamW:   \(W_{t+1}=W_t-\lambda (\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}+\alpha W_t )\), which decouples the weight decay term \(\alpha W_t\) from the gradient-based update.
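
As a sanity check on the formulas, here is a minimal NumPy sketch of a single Adam/AdamW step (the names lr, beta1, beta2, wd are illustrative, not from any particular library):

import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    # first and second moment estimates (exponential moving averages)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # bias correction, since m and v are initialized at zero
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # AdamW: weight decay is decoupled from the adaptive gradient term;
    # set wd=0 to recover plain Adam
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v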

[resource]: a good blog


2. Bias-Variance Tradeoff

Bias is how far off the model’s predictions are from the actual values. Variance is how much the model’s predictions change when trained on different datasets.

  • Increasing model complexity decreases bias but increases variance.
  • Simpler models have higher bias but lower variance.


3. Overfitting and Underfitting

How to Prevent Overfitting:

  • Reduce model complexity: prune decision trees, use fewer layers in neural networks.
  • Regularization: L1/L2 regularization, dropout in neural networks.
  • Use more training data to help the model generalize better.
  • Cross-validation: k-fold CV to test model robustness.
  • Early stopping: stop training when validation loss increases.
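
For instance, early stopping is only a few lines in a training loop; a minimal sketch (the train_epoch / eval_loss helpers are hypothetical placeholders):

best_val, patience, bad_epochs = float('inf'), 5, 0
for epoch in range(100):
    train_epoch(model)               # hypothetical: one pass over the training data
    val_loss = eval_loss(model)      # hypothetical: loss on a held-out validation set
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0   # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # no improvement for `patience` epochs
            break                    # stop before the model overfits further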

How to Prevent Underfitting:

  • Increase model complexity: use deeper networks, more features.
  • Reduce regularization: decrease L1/L2 penalties.
  • Use better feature engineering.
  • Train for longer.


4. Random Forest vs. Gradient Boosting

Both Random Forest and Gradient Boosting are ensemble learning methods based on decision trees.

| Feature | Random Forest (RF) | Gradient Boosting (GB) |
| --- | --- | --- |
| Training | Parallel (independent trees) | Sequential (each tree fixes previous errors) |
| Bias vs. Variance | Higher bias, lower variance | Higher variance, lower bias |
| Overfitting | Less prone to overfitting | Can overfit if not tuned properly |
| Speed | Faster (parallel computation) | Slower (sequential computation) |
| Best For | General-purpose models, stability | High-performance tasks, structured data |
| Robustness | Works well with default settings | Requires careful tuning |
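
Both are available in scikit-learn; a minimal usage sketch on synthetic data (hyperparameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)   # trees trained independently
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=0)  # trees added sequentially

print("RF accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("GB accuracy:", cross_val_score(gb, X, y, cv=5).mean())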


5. Decision tree for classification

The goal of each split is to reduce impurity. Two common metrics:

  • Gini Impurity:   For each node, calculate \(Gini=1-\sum_{i=1}^C p_i^2\), where \(C\) is the number of classes and \(p_i\) is the proportion of samples belonging to class \(i\) in the node.
  • Entropy and Information Gain:   \(Entropy =-\sum_{i=1}^C p_i \log_2 p_i\), and \(IG = Entropy_{parent} - \sum_i \frac{N_i}{N} \times Entropy_i\), where \(N_i\) is the number of samples in child node \(i\) and \(N\) is the number of samples in the parent.
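
A quick NumPy sketch of both metrics for a single node, given its class labels (illustrative, not taken from any library):

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p_i in the node
    return 1.0 - np.sum(p**2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))     # p_i > 0 by construction of np.unique

node = np.array([0, 0, 0, 1, 1])       # 3 samples of class 0, 2 of class 1
print(gini(node))                       # 1 - (0.6^2 + 0.4^2) = 0.48
print(entropy(node))                    # ≈ 0.971 bits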


6. Bootstrap

Bootstrap is a resampling technique used to create multiple datasets from a single dataset by randomly sampling with replacement. It is commonly used in ensemble learning (like Random Forest) to improve model stability and performance.

Advantages of Bootstrap:

  • Reduces variance by averaging multiple models.
  • Improves generalization by training on diverse datasets.
  • Works well for small datasets where splitting into train/test might lose valuable data.
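
A minimal NumPy sketch of bootstrapping, here used to estimate the uncertainty of a mean (purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # one observed dataset

# draw B bootstrap datasets by sampling with replacement, same size as the original
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]

print("point estimate:", data.mean())
print("bootstrap 95% CI:", np.percentile(boot_means, [2.5, 97.5]))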


7. L1 and L2 regularization

L1 Regularization (Lasso):

  • Encourages sparsity (sets some weights to exactly zero).
  • Leads to feature selection (removes less important features).

L2 Regularization (Ridge):

  • Shrinks weights toward zero but does not eliminate them.
  • Helps with multicollinearity (reduces variance but keeps all features).
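
scikit-learn's Lasso and Ridge make the sparsity difference easy to see; a small sketch (the exact number of zeroed weights depends on alpha and the data):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 drives many coefficients exactly to zero; L2 only shrinks them
print("zero weights (Lasso):", np.sum(lasso.coef_ == 0))
print("zero weights (Ridge):", np.sum(ridge.coef_ == 0))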


8. Explain Principal Component Analysis (PCA)

Steps:

  1. Standardize the Data: Each feature has mean = 0 and variance = 1.
  2. Compute the Covariance Matrix: Measures the relationships between features.
  3. Compute Eigenvalues & Eigenvectors of the covariance matrix.
  4. Select the Top Principal Components: the eigenvectors with the largest eigenvalues.
  5. Project Data onto New Axes: multiply the standardized data by the selected eigenvectors.
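
These steps map directly to a few lines of NumPy; a from-scratch sketch for illustration (in practice sklearn.decomposition.PCA does the same work):

import numpy as np

def pca(X, k):
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # 1. standardize
    cov = np.cov(X, rowvar=False)                  # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]              # sort by explained variance, descending
    components = eigvecs[:, order[:k]]             # 4. top-k principal components
    return X @ components                          # 5. project onto the new axes

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)    # (100, 2)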


9. PCA vs LDA

PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis) are both dimensionality reduction techniques. PCA focuses on preserving variance in the data, while LDA focuses on maximizing class separability.

Why does PCA focus on the variance of the data?

"Variance in the data" means how spread out the data points are across different dimensions. Variance measures how much the values in a dataset differ from the mean. If a feature has high variance, its values are more spread out, which typically indicates it carries more distinguishing information. PCA finds the directions of highest variance because these directions contain the most information.

| Feature | PCA (Principal Component Analysis) | LDA (Linear Discriminant Analysis) |
| --- | --- | --- |
| Purpose | Maximizes variance to retain the most information | Maximizes class separability for better classification |
| Supervision | Unsupervised (does not use class labels) | Supervised (requires class labels) |
| What It Finds | Directions with the highest data variance | Directions that best separate different classes |
| How It Works | Computes eigenvectors of the covariance matrix | Computes eigenvectors of the scatter matrices (between-class and within-class) |
| Dimensionality Constraint | Can reduce dimensions up to the number of original features | Can reduce dimensions to at most C − 1 (where C = number of classes) |
| Best For | Data compression, feature reduction, visualization | Classification tasks where class separation is important |
| Output | Orthogonal principal components capturing the most variance | New axes optimized for class distinction |


10. Batch Normalization and Layer Normalization

Batch Normalization normalizes neuron outputs across the batch dimension for each feature, while Layer Normalization normalizes across the feature dimension for each individual input sample.

| Feature | Batch Normalization | Layer Normalization |
| --- | --- | --- |
| Normalization Axis | Across the batch (for each feature) | Across features (for each sample) |
| Dependency | Depends on batch size | Works independently of batch size |
| Use Case | CNNs, large-batch training | RNNs, Transformers, small-batch training |
| Training Stability | Can be unstable with small batches | Stable for all batch sizes |

Use BN when working with structured grid-like data. Use LN when working with sequential or non-grid data.
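
In PyTorch the two differ only in which axis the statistics are computed over; a minimal sketch:

import torch
import torch.nn as nn

x = torch.randn(32, 64)          # batch of 32 samples, 64 features each

bn = nn.BatchNorm1d(64)          # per-feature stats over the 32-sample batch dimension
ln = nn.LayerNorm(64)            # per-sample stats over the 64 features

print(bn(x).shape, ln(x).shape)  # both keep the shape: torch.Size([32, 64])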


11. Activation Functions

  • ReLU:   \(f(x)=\max (0, x)\)
  • Leaky ReLU:   \(f(x) = \max(0.01x, x)\)
  • Sigmoid:   \(f(x) = \frac{1}{1 + e^{-x}}\)
  • Tanh:   \(f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
  • Softmax:   \(f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}\)
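
All five are a few lines in NumPy; a numerically safe softmax subtracts the max before exponentiating:

import numpy as np

relu = lambda x: np.maximum(0, x)
leaky_relu = lambda x: np.maximum(0.01 * x, x)
sigmoid = lambda x: 1 / (1 + np.exp(-x))
tanh = np.tanh

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # sums to 1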


Why is ReLU better and more often used than Sigmoid in neural networks?
1. Avoids the vanishing gradient problem: the derivative of the sigmoid is very small when \(x\) is very large or very small, while ReLU's gradient is 1 for all positive inputs.
2. Computational efficiency: a simple max(0, x) operation is much faster than computing exponentials.

12. Describe how convolution works

Key concepts: convolution, kernel, stride, receptive field.

Receptive field: the region of the input image that a neuron in a particular layer "sees". As we go deeper into the network, receptive fields grow larger, capturing broader and more complex features. By the final layers, neurons can "see" almost the entire image. Layer by layer, \(RF = RF_{pre} + (K-1) \times S\), where \(K\) is the kernel size and \(S\) is the cumulative stride (the product of the strides of all preceding layers).
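
A small sketch applying that recurrence to a hypothetical stack of conv layers, each given as (kernel size, stride):

layers = [(3, 1), (3, 2), (3, 2)]   # hypothetical (kernel, stride) per conv layer

rf, jump = 1, 1                     # receptive field and cumulative stride at the input
for k, s in layers:
    rf += (k - 1) * jump            # RF = RF_pre + (K - 1) * cumulative stride
    jump *= s                       # update the cumulative stride for the next layer
    print(f"kernel={k}, stride={s} -> receptive field {rf}")   # prints 3, 5, 9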

13. Why do we use convolutions for images rather than just FC layers

  1. Preserve Spatial Structure: Images have local patterns (e.g., edges, textures), and convolutions capture them while FC layers lose spatial relationships.
  2. Reduce Parameters: A fully connected layer requires each neuron to connect to every pixel, leading to huge parameter sizes for high-resolution images.
  3. Translation Invariance: Convolutions allow the network to detect patterns anywhere in the image, while FC layers memorize specific positions.

14. Implement a sparse matrix class
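
The question gives no reference answer; a minimal dictionary-of-keys (DOK) sketch, storing only nonzero entries, might look like this:

class SparseMatrix:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = {}                       # (row, col) -> value; zeros are not stored

    def set(self, r, c, value):
        if not (0 <= r < self.rows and 0 <= c < self.cols):
            raise IndexError("index out of range")
        if value == 0:
            self.data.pop((r, c), None)      # keep the dict sparse
        else:
            self.data[(r, c)] = value

    def get(self, r, c):
        return self.data.get((r, c), 0)      # missing keys are implicit zeros

    def __matmul__(self, other):
        # sparse-sparse multiply: only iterate over stored nonzeros of self
        assert self.cols == other.rows
        out = SparseMatrix(self.rows, other.cols)
        for (i, k), a in self.data.items():
            for j in range(other.cols):
                b = other.get(k, j)
                if b:
                    out.set(i, j, out.get(i, j) + a * b)
        return out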

15. Reverse a bitstring

data = b'\xAD\xDE\xDE\xC0'
my_data = bytearray(data)   # bytes objects are immutable, so copy into a mutable bytearray
my_data.reverse()           # in-place reversal of the byte order
print(my_data.hex())        # 'c0dedead'

Note that this reverses the byte order; reversing the bits within each byte as well would require an additional per-byte bit reversal.

16. What is the significance of Residual Networks

Residual (skip) connections allow direct access to features from previous layers, which makes information propagation through the network much easier. They also help solve the vanishing gradient problem, since gradients can flow backward through the identity path, making very deep networks trainable.
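
A minimal PyTorch sketch of the idea, output = F(x) + x (simplified for illustration; real ResNet blocks add batch norm and handle shape changes with 1x1 convolutions):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # skip connection: gradients flow through the identity path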

17. How to deal with an imbalanced dataset?

  1. Oversampling or undersampling
  2. Data augmentation
  3. Algorithm-level: class weight adjustment (see the sketch below)
  4. Use appropriate metrics: e.g. Precision, Recall, F1-score, AUC-ROC curve
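
In scikit-learn, class weighting is a single argument; a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# 'balanced' reweights each class by n_samples / (n_classes * class_count),
# so errors on the rare class cost proportionally more
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)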

18. Precision, Recall, F1-score, AUC-ROC Curve, P-R Curve

  • ROC Curve: True Positive Rate (TPR) [y-axis], False Positive Rate (FPR) [x-axis]
  • AUC: Area Under the ROC Curve

ROC Curve shows model performance at various thresholds.

P-R Curve: Precision (Positive Predictive Value) [y-axis], Recall (Sensitivity) [x-axis]


| Scenario | Use ROC Curve? | Use P-R Curve? |
| --- | --- | --- |
| Balanced dataset | ✅ Yes | ❌ No |
| Imbalanced dataset (rare positives) | ❌ No | ✅ Yes |
| False Positives matter more | ❌ No | ✅ Yes |
| Overall classification performance | ✅ Yes | ❌ No |
| Detecting rare events (e.g., fraud, disease) | ❌ No | ✅ Yes |
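
Both curves come straight from scikit-learn given the model's predicted scores; a minimal sketch:

from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]        # model's predicted probabilities

fpr, tpr, _ = roc_curve(y_true, y_score)          # ROC: TPR vs FPR over thresholds
precision, recall, _ = precision_recall_curve(y_true, y_score)
print("AUC-ROC:", roc_auc_score(y_true, y_score))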

19. Difference between supervised, unsupervised, and reinforcement learning.

| Type | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Definition | Learns from labeled data (input-output pairs) | Learns patterns from unlabeled data | Learns by interacting with an environment and receiving rewards |
| Data Type | Labeled (e.g., (image, label)) | Unlabeled (only input data) | Sequential decision-making data |
| Goal | Minimize error between predictions and true labels | Find hidden structures/patterns | Maximize cumulative rewards |
| Examples | Classification (e.g., spam detection), Regression (e.g., house price prediction) | Clustering (e.g., customer segmentation), Dimensionality Reduction (e.g., PCA) | Game playing (AlphaGo), Robotics, Self-driving cars |
| Training Approach | Uses a loss function (e.g., MSE, Cross-Entropy) | Finds similarities or distributions | Uses trial-and-error learning (policy optimization) |

20. Techniques for data augmentation in CV.

  1. Geometric Transformations: Rotation, Flipping, Scaling
  2. Color-Based Augmentations: Brightness Adjustment, Contrast Adjustment, Hue & Saturation Changes
  3. Noise-Based Augmentations: Gaussian Noise, Blur, Cutout
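
With torchvision, these augmentations compose into a single pipeline; a minimal sketch (the exact parameter values are illustrative):

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),            # geometric
    transforms.RandomHorizontalFlip(p=0.5),           # geometric
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),  # color
    transforms.GaussianBlur(kernel_size=3),           # noise/blur
    transforms.ToTensor(),
])
# augmented = augment(pil_image)   # apply to a PIL image during training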

21. What is vanishing gradient?

The vanishing gradient problem occurs in deep neural networks when gradients become extremely small during backpropagation, making it difficult for the early layers of the network to learn.

Why Does It Happen?

  • Activation Function Effects: such as Sigmoid and Tanh functions
  • Weight Initialization Issues: such as weights are too small
  • Deep Networks (Many Layers): if each layer's local gradient is less than 1, the product shrinks exponentially with depth.

How to address vanishing gradient?

  • Use ReLU Instead of Sigmoid or Tanh
  • Batch Normalization
  • Better Weight Initialization
  • Use Residual Networks (ResNet)

22. What are dropouts?

During each forward pass, dropout randomly "drops" (sets to zero) a percentage of neurons in the layer. This prevents the network from becoming too dependent on specific neurons and forces it to learn more robust and generalizable features.
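
In PyTorch this is a single layer; note it is active only in training mode, and surviving activations are scaled by 1/(1−p) so the expected output is unchanged:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))    # roughly half the entries zeroed, survivors scaled to 2.0
drop.eval()
print(drop(x))    # identity at inference time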

23. How does t-SNE work?

t-SNE visualizes high-dimensional data by placing similar points close together in a 2D or 3D space. It converts pairwise distances into neighbor probabilities (a Gaussian in the high-dimensional space, a heavier-tailed Student t-distribution in the low-dimensional map) and minimizes the KL divergence between the two distributions via gradient descent.
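
Using scikit-learn, a minimal sketch on the digits dataset:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)           # 64-dimensional digit images
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)                               # (1797, 2): ready for a scatter plot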

24. Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a training method for aligning large language models (LLMs) with human preferences without needing reinforcement learning. It serves as an alternative to Reinforcement Learning from Human Feedback (RLHF) while maintaining simplicity and efficiency.


At the very beginning of training, the policy \(P_\theta\) equals the reference model \(P_{\text{ref}}\), so the preference margin \( \Delta = \log \frac{P_\theta\left(y_{\text{preferred}} \mid x\right)}{P_\theta\left(y_{\text{disliked}} \mid x\right)}-\log \frac{P_{\text{ref}}\left(y_{\text{preferred}} \mid x\right)}{P_{\text{ref}}\left(y_{\text{disliked}} \mid x\right)}\) starts at zero. (The full DPO objective wraps this margin as \(-\log \sigma(\beta \Delta)\), which therefore starts at \(\log 2\).)

Even though the initial margin is zero, gradient updates still occur: training gradually shifts \(P_\theta\) away from \(P_{\text{ref}}\), moving probability mass toward the preferred responses and increasing preference alignment.
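
A minimal PyTorch sketch of the DPO loss, given the summed log-probabilities of each response under the policy and the frozen reference model (the argument names are illustrative):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pref, policy_logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    # implicit reward margin between the preferred and rejected responses
    margin = (policy_logp_pref - policy_logp_rej) - (ref_logp_pref - ref_logp_rej)
    return -F.logsigmoid(beta * margin).mean()

# at initialization the policy equals the reference, so the margin is 0
zero = torch.zeros(4)
print(dpo_loss(zero, zero, zero, zero))   # log 2 ≈ 0.6931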

25. DPO vs RLHF

| Feature | DPO (Direct Preference Optimization) | RLHF (Reinforcement Learning from Human Feedback) |
| --- | --- | --- |
| Training Complexity | Simple (direct fine-tuning) | Complex (requires multiple stages: reward model + RL) |
| Optimization Method | Supervised learning (log-ratio loss) | Reinforcement learning (PPO, policy optimization) |
| Reward Model Needed? | ❌ No | ✅ Yes (trained separately using preference data) |
| Loss Function | Log-ratio of preference probabilities | Reward function trained via RL (KL-regularized PPO) |
| KL Regularization | Implicitly baked into the objective | Explicit KL penalty required to prevent divergence |
| Gradient Stability | More stable (direct loss minimization) | Unstable (gradient explosion, KL tuning required) |
| Computational Cost | Lower (direct preference fine-tuning) | Higher (training a reward model + RL updates) |
| Convergence Speed | Faster | Slower |
| Interpretability | More interpretable (directly optimizes preference scores) | Less interpretable (reward models can introduce bias) |
| Primary Use Cases | Aligning LLMs with human preferences (e.g., chatbots, summarization) | Preference alignment, reinforcement learning tasks (e.g., AI safety, robotics) |
| Challenges | Still under research, requires careful hyperparameter tuning | Instability, difficulty in fine-tuning rewards |

26. Difference between Discriminative models and Generative models

| Feature | Discriminative Models | Generative Models |
| --- | --- | --- |
| Definition | Learn the decision boundary between classes | Model the actual distribution of the data |
| Probability Modeled | \(P(y \mid X)\) (conditional probability of labels given input) | \(P(X, y)\) (joint probability of input and labels) |
| Goal | Directly classify or predict labels | Generate new data or classify by computing likelihood |
| Examples | Logistic Regression, SVM, Random Forest, Neural Networks | Naïve Bayes, Gaussian Mixture Model (GMM), GANs |
| Advantage | Usually more accurate for classification tasks | Useful for data generation and handling missing data |
| Disadvantage | Requires large labeled datasets, less flexible for missing data | Can be computationally expensive and harder to optimize |

In short:

  • Discriminative models focus on learning boundaries for classification.
  • Generative models learn the data distribution and can generate new samples.

27. How to deal with exploding gradient?

  • Gradient Clipping: Set a threshold to clip gradients during backpropagation.
  • Use Smaller Learning Rate: Reducing the learning rate prevents drastic weight updates.
  • Weight Regularization
  • Use Proper Weight Initialization
  • Use Normalization
  • Switch to a Different Activation Function: ReLU (instead of sigmoid/tanh) helps mitigate vanishing gradients but can cause exploding gradients. Try LeakyReLU or ELU instead.

28. Why does gradient vanishing happen?

  • Activation Functions with Small Derivatives: such as Sigmoid and Tanh
  • Chain Rule in Backpropagation: If each derivative is a small number (e.g., <1), their product shrinks exponentially as we go deeper, leading to almost zero gradients for early layers.
  • Improper Weight Initialization
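
A tiny numeric illustration of the chain-rule effect: the sigmoid's derivative is at most 0.25, so multiplying one such factor per layer shrinks the gradient geometrically:

max_sigmoid_grad = 0.25            # sigma'(x) = sigma(x)(1 - sigma(x)) peaks at x = 0
for depth in [5, 10, 20, 50]:
    print(depth, max_sigmoid_grad ** depth)   # upper bound on the product of derivatives
# at 50 layers: 0.25**50 ≈ 8e-31, effectively zero for the early layers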

NLP Questions

1. Transformers and Multi-head Self-attention

Please see the code: https://github.com/audreycs/transformer_from_scratch.

1. Why use multi-head attention?
A: Multiple attention heads allow the model to learn different aspects of the input simultaneously, with each head attending to a different representation subspace.

2. When calculating multi-head attention, why scale the scores before taking the softmax?
A: Dot products grow in magnitude with the key dimension \(d_k\), pushing the softmax into saturated regions with near-zero gradients; dividing by \(\sqrt{d_k}\) stabilizes gradients and improves numerical stability.

3. Why add positional encoding to the input?
A: Transformers have no recurrence (RNNs) or convolution (CNNs), meaning they do not inherently understand token order. Unlike RNNs, which process sequences sequentially, Transformers process all tokens in parallel, making them faster but order-agnostic. Adding Positional Encoding (PE) to the input embeddings lets the model capture token positions and sequence order.

2. Position Encoding

\begin{equation} \begin{gathered} PE(pos, 2i)=\sin\left(pos / 10000^{2i/d}\right) \\ PE(pos, 2i+1)=\cos\left(pos / 10000^{2i/d}\right) \end{gathered} \end{equation}

Here \(pos\) is the token position, \(i\) indexes pairs of embedding dimensions, and \(d\) is the embedding size.
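
A direct NumPy implementation of these formulas, producing a matrix of shape (max_len, d) to add to the input embeddings (assumes d is even):

import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]              # (max_len, 1) positions
    i = np.arange(d // 2)[None, :]                 # (1, d/2) dimension-pair indices
    angles = pos / np.power(10000, 2 * i / d)      # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

print(positional_encoding(50, 16).shape)           # (50, 16)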