Regularization
- Regularization is used to prevent overfitting of ML models.
- Overfitting is related to high variance.
- High variance means the model is sensitive to small fluctuations in the training dataset: if you slightly change the training data, the predictions change significantly.
- Bias-Variance Tradeoff:
- Tradeoff between the error introduced by bias and the error introduced by variance.
- High bias can cause a model to miss relevant relations between features and target outputs (underfitting).
- High variance can cause a model to model random noise in the training data, rather than the intended outputs (overfitting).
Criteria to use?
- Regularization is usually preferred when you see a big difference in model performance between the training and testing phases.
Types:
- L1, L2 and Elastic Net regularization {defined by us during model creation}
- Model hyperparameters, Dropout, early stopping, etc. {implicit to the model as hyperparameters}
L1(Lasso) Regularization:
- Adds a penalty equal to the absolute value of the magnitude of coefficients/weights.
L = L0 + λ∑∣w<sub>i</sub>∣, where L0 = original loss function, w<sub>i</sub> = model coefficients, and λ = regularization parameter that controls the strength of the penalty.
- L1 can yield sparse models where some coefficient weights are exactly zero.
- L1 is useful for feature selection, effectively determining which features are important for the prediction.
- Usage:
- Use L1 if you need a sparse model, where feature selection is important and only a subset of features are meaningful.
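A minimal sketch of the sparsity claim above, using scikit-learn's Lasso on synthetic data (the feature counts and alpha value here are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data where only 3 of the 10 features are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

# Lasso (L1) with a moderate penalty; alpha plays the role of lambda
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# L1 drives many coefficients exactly to zero -> implicit feature selection
n_zero = int(np.sum(lasso.coef_ == 0))
print("zeroed coefficients:", n_zero)
```

Inspecting `lasso.coef_` shows which features survived, which is the feature-selection behavior described above.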
L2(Ridge) Regularization:
- Adds a penalty equal to the square of the magnitude of coefficients/weights.
L = L0 + λ∑w<sub>i</sub><sup>2</sup> The terms are the same as those defined for L1.
- L2 does not zero out coefficients.
- L2 encourages coefficients to be small.
- All features get to contribute, albeit modestly, to the model.
- L2 often gives better prediction results and is more robust to multicollinearity.
- Usage:
- Use L2 when you expect that many small or medium effects are important.
- L2 can perform better when multicollinearity is present.
- L2 tends to have better prediction performance due to its ability to shrink coefficients evenly.
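A short sketch contrasting L2 with plain least squares: Ridge shrinks the coefficients but, unlike L1, does not zero them out (data and alpha are arbitrary for the demo):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge pulls all coefficients toward zero without making any exactly zero
print("OLS   |w| sum:", np.abs(ols.coef_).sum())
print("Ridge |w| sum:", np.abs(ridge.coef_).sum())
print("exact zeros in Ridge:", int(np.sum(ridge.coef_ == 0)))
```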
Elastic Net regularization:
- Combination of both L1 and L2 regularization.
- For example:
L = L0 + λ<sub>1</sub>∑∣w<sub>i</sub>∣ + λ<sub>2</sub>∑w<sub>i</sub><sup>2</sup>, where λ<sub>1</sub> and λ<sub>2</sub> control the strengths of the L1 and L2 penalties respectively.
- Usage:
- It is a good choice when you have correlated features and want to balance feature selection against overfitting prevention, or when you are initially unsure whether to use L1 or L2.
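Since the implementation section below covers only L1 and L2, here is a hedged Elastic Net sketch using scikit-learn's ElasticNet (data and parameter values are illustrative). Note that sklearn parameterizes the mix with `l1_ratio` rather than two separate lambdas:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=42)

# l1_ratio is the L1/L2 mixing parameter: 1.0 = pure Lasso, 0.0 = pure Ridge
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X, y)
print("coefficients:", enet.coef_)
```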
Model parameters:
- Adjusting the hyperparameters exposed by the respective model.
- For example: max_depth in tree models, colsample_bytree in XGBoost, etc., or in neural nets the number of layers, the number of neurons per layer, and many more…
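A quick sketch of how capping tree depth acts as regularization, using a scikit-learn decision tree on synthetic data (the depth of 3 and dataset sizes are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree can memorize the training set;
# capping max_depth limits model capacity and reduces variance
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("deep    train/test:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow train/test:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

Comparing the train/test gap of the two trees illustrates the variance reduction described above.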
Dropout regularization:
- Randomly dropping a certain fraction of neurons from the network during each training iteration, so the model effectively trains many overlapping sub-networks and generalizes better.
- Adjust the dropout rate while designing the model.
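A minimal NumPy sketch of inverted dropout, the scheme that layers like PyTorch's nn.Dropout and Keras's Dropout implement; the `dropout` function and the rate of 0.5 here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the
    survivors by 1/(1-rate) so the expected activation is unchanged
    at inference time (when dropout is disabled)."""
    if not training or rate == 0.0:
        return activations
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones((4, 10))
out = dropout(a, rate=0.5)
print(out)  # each unit is either dropped (0.0) or rescaled (2.0)
```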
Early Stopping:
- Using a validation dataset to monitor model performance during training, and stopping before performance starts to degrade.
- For example:
- In LightGBM and XGBoost: set early_stopping_rounds
- For LLMs: set up an Early Stopping Callback Function
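The booster examples later in this document train without early stopping; as a self-contained sketch of the same idea, scikit-learn's GradientBoostingClassifier exposes the analogous `n_iter_no_change` / `validation_fraction` parameters (values here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stop adding trees once the held-out validation score fails to improve
# for 5 consecutive rounds (sklearn's analogue of early_stopping_rounds)
gbc = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.2,
                                 n_iter_no_change=5, random_state=0)
gbc.fit(X, y)
print("trees actually fit:", gbc.n_estimators_)
```

`n_estimators_` ends up well below the 500-tree budget, showing that training stopped early.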
Outliers:
- Data points that deviate significantly from other observations.
- Outliers can skew and mislead the training process.
- To handle outliers:
- Transformation: Use transformation like logarithmic, square root, etc.
- Remove outliers from the dataset if not needed.
- Visualize: Visual methods like box plots, or clustering methods to identify outliers.
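A small sketch combining two of the handling steps above: the 1.5×IQR rule (the same fence a box plot draws) to flag outliers, and a log transform to compress the heavy tail. The injected values 150 and 200 are fabricated for the demo:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(50, 5, 200), [150.0, 200.0]])  # two injected outliers

# IQR rule (what a box plot visualizes): flag points beyond 1.5 * IQR of the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("flagged outliers:", outliers)

# A log transform compresses the right tail, reducing the outliers' leverage
logged = np.log(data)
print("max deviation before/after:", data.max() - data.mean(), logged.max() - logged.mean())
```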
Multicollinearity:
- Two or more predictor variables in a regression model are highly correlated.
- To handle:
- Use Regularization.
- Visualization: plot the correlation matrix of the predictors.
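A hedged sketch of the correlation check above, using pandas on fabricated data where `x2` is constructed to be nearly collinear with `x1`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 2 + rng.normal(scale=0.05, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                       # independent predictor
})

# High off-diagonal values in the correlation matrix signal multicollinearity
corr = df.corr()
print(corr.round(2))
```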
Example: Code Implementation:
Logistic Regression with L1 and L2:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Sample data (synthetic binary classification problem)
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# L1 Regularization: the 'solver' parameter specifies the algorithm used in the
# optimization problem; liblinear supports L1 and binary classification.
model_l1 = LogisticRegression(penalty='l1', solver='liblinear')
model_l1.fit(X_train, y_train)
# L2 Regularization
model_l2 = LogisticRegression(penalty='l2', solver='liblinear')
model_l2.fit(X_train, y_train)
XGBoost:
import xgboost as xgb
# Create DMatrix
d_train = xgb.DMatrix(X_train, label=y_train)
# Set parameters for L1 and L2
params = {
'alpha': 1.0, # L1 regularization term on weights
'lambda': 1.0, # L2 regularization term on weights
'objective': 'reg:squarederror'
}
# Train model
model = xgb.train(params, d_train, num_boost_round=10)
LightGBM:
import lightgbm as lgb
# Create dataset for LightGBM
d_train = lgb.Dataset(X_train, label=y_train)
# Set parameters for L1 and L2
params = {
'objective': 'binary',
'lambda_l1': 0.5, # L1 regularization
'lambda_l2': 0.5, # L2 regularization
}
# Train model
gbm = lgb.train(params, d_train, num_boost_round=100)
Large Language Model (Trainer from Hugging Face): adapted from the HF documentation
from transformers import TrainerCallback, TrainerControl, TrainerState
# This class monitors the evaluation loss after each evaluation step.
class EarlyStopping(TrainerCallback):
    def __init__(self, early_stopping_patience: int = 3, early_stopping_threshold: float = 0.0):
        self.early_stopping_patience = early_stopping_patience
        self.early_stopping_threshold = early_stopping_threshold
        self.best_metric = None
        self.patience_counter = 0

    # Called after each evaluation
    def on_evaluate(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        # Retrieve the evaluation metric from the state
        eval_metric = state.log_history[-1]['eval_loss']
        # Check if it's the best metric we've seen so far
        if self.best_metric is None or eval_metric < self.best_metric - self.early_stopping_threshold:
            self.best_metric = eval_metric
            self.patience_counter = 0
        else:
            self.patience_counter += 1
        # Check if we should stop training
        if self.patience_counter >= self.early_stopping_patience:
            print("No improvement in metric for", self.early_stopping_patience, "evaluation steps. Stopping training...")
            control.should_training_stop = True
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
training_args = TrainingArguments(
    output_dir='./',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=100,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumes train_dataset/eval_dataset were prepared earlier
    eval_dataset=eval_dataset,
    callbacks=[EarlyStopping(early_stopping_patience=3)],
)
trainer.train()