Overview of Machine Learning Libraries and Frameworks
- Scikit-learn:
- Good for small to medium datasets and supports models like regression, classification, clustering, and dimensionality reduction.
- Not usually preferred for large datasets or deep learning tasks.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
- TensorFlow and Keras:
- Used in high-performance numerical computation and deep learning tasks.
- Supports GPU (and TPU) acceleration.
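A minimal sketch of a small Keras classifier, assuming the iris data from scikit-learn; the layer sizes, epoch count, and batch size are illustrative choices, not prescribed values:

import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small dataset (4 features, 3 classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Define a small feed-forward network
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Compile with integer-label cross-entropy
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train and evaluate
model.fit(X_train, y_train, epochs=20, batch_size=16, verbose=0)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy:", acc)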
- PyTorch:
- An alternative to TensorFlow, also used for high-performance numerical computation and deep learning tasks.
- Supports data preprocessing and manipulation via torch.utils.data (Dataset and DataLoader).
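A minimal sketch of the equivalent training loop in PyTorch, again assuming the iris data; DataLoader handles the batching and shuffling mentioned above, and the network shape and epoch count are illustrative:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.datasets import load_iris

# Load data and convert to tensors
X, y = load_iris(return_X_y=True)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.long)

# DataLoader batches and shuffles the dataset
loader = DataLoader(TensorDataset(X_t, y_t), batch_size=16, shuffle=True)

# Small feed-forward network: 4 features -> 3 classes
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Standard training loop
for epoch in range(20):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

# Evaluate on the training data (illustration only)
with torch.no_grad():
    accuracy = (model(X_t).argmax(dim=1) == y_t).float().mean()
print("Training accuracy:", accuracy.item())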
- XGBoost: Extreme Gradient Boosting
- Optimized distributed gradient boosting library.
- Supports both classification and regression.
- Supports regularization to prevent overfitting.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a toy regression dataset
X, y = make_regression(n_samples=1000, n_features=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Set parameters
params = {
    "max_depth": 10,
    "eta": 0.1,
    "objective": "reg:squarederror"
}

# Train model (DMatrix is XGBoost's optimized data structure)
model = xgb.train(params, xgb.DMatrix(X_train, label=y_train), num_boost_round=10)

# Predict and evaluate
predictions = model.predict(xgb.DMatrix(X_test))
print("RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
- LightGBM:
- A gradient boosting framework that uses tree-based learning algorithms.
- Designed for efficient, distributed training on large datasets.
- Supports categorical features.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a toy binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create datasets for LightGBM
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Parameters
params = {
    'objective': 'binary',       # objective function
    'metric': 'binary_logloss',  # evaluation metric
    'num_leaves': 31,
    'learning_rate': 0.05
}

# Train model
gbm = lgb.train(params, train_data, num_boost_round=100, valid_sets=[test_data])

# Predict and evaluate (predict returns probabilities for the binary objective)
predictions = gbm.predict(X_test)
predicted_classes = [1 if prob > 0.5 else 0 for prob in predictions]
print("Accuracy:", accuracy_score(y_test, predicted_classes))
Classification and Regression Techniques
Classification:
- Objective:
- Predict a categorical outcome variable (qualitative label).
- For example: classifying an email as “spam” or “not spam”, or identifying whether a transaction is fraudulent.
- Evaluation Metrics: Accuracy, Precision, Recall, F1 score, and the area under the ROC curve (AUC-ROC); see the sketch after this list.
- Linear Models: Logistic Regression, Support Vector Machines (SVM).
- Tree-Based Models: Decision Trees, Random Forest, Gradient Boosting Machines (GBM), XGBoost, LightGBM.
- Neural Networks: MLPs (Multi-Layer Perceptrons), CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks).
- Neighbors-Based: K-Nearest Neighbors (KNN).
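As a concrete illustration of the metrics above, here is a minimal sketch using scikit-learn; the logistic regression model and the toy dataset are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy binary classification data
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a simple linear classifier
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]  # class-1 probabilities for AUC-ROC

print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Recall:", recall_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
print("AUC-ROC:", roc_auc_score(y_test, proba))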
Regression:
- Objective:
- Predict a continuous outcome variable (dependent variable) based on one or more predictor variables (independent variables).
- For example: predicting the price of a house based on its size, location, and age.
- Evaluation Metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared; see the sketch after this list.
- Linear Models: Linear Regression, Ridge, Lasso.
- Tree-Based Models: Regression Trees, Random Forest Regression, XGBoost Regression, LightGBM Regression.
- Neural Networks: Same architectures as in classification, but with a different output layer configuration (typically a single linear output unit).
- Support Vector Regression (SVR).
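And a matching sketch for the regression metrics, again with an illustrative toy dataset and a plain linear model:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy regression data
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

mse = mean_squared_error(y_test, pred)
print("MSE:", mse)
print("RMSE:", mse ** 0.5)
print("MAE:", mean_absolute_error(y_test, pred))
print("R-squared:", r2_score(y_test, pred))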
Handling Overfitting and Underfitting
Overfitting:
- It happens when a model learns the detail and noise in the training data to the extent that it degrades performance on new data.
Strategies to Handle Overfitting:
- Regularization: Techniques like L1, L2, and Elastic Net regularization are commonly used in linear and logistic regression.
- Pruning: Reducing the depth of the tree (tree-based models).
- Cross-validation: Using techniques like k-fold cross-validation (dividing the data into multiple folds and validating on each in turn); see the sketch after this list.
- Ensemble Methods: Techniques like bagging (which mainly reduces variance) and boosting (which mainly reduces bias).
- Dropout: Used in neural networks to randomly drop a certain number of neurons from the network during each training iteration.
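A minimal sketch combining two of these strategies, L2 regularization and k-fold cross-validation, with scikit-learn; the alpha value and the toy dataset are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Toy regression data
X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)

# alpha sets the strength of the L2 penalty; larger values shrink
# coefficients harder and help curb overfitting
model = Ridge(alpha=1.0)

# 5-fold cross-validation gives a more robust performance estimate
# than a single train/test split
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Mean 5-fold R-squared:", scores.mean())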
Underfitting:
- It happens when a model is too simple to capture the underlying pattern in the data.
Strategies to Handle Underfitting:
- Increasing model complexity, feature engineering, decreasing regularization, etc.; a sketch of adding model capacity follows.
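For instance, a minimal sketch of increasing model complexity through polynomial feature engineering; the degree and the toy dataset are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy regression data
X, y = make_regression(n_samples=300, n_features=3, noise=5.0, random_state=0)

# Degree-2 polynomial expansion adds interaction and squared terms,
# giving the linear model more capacity to fit nonlinear patterns
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("Training R-squared:", model.score(X, y))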
Note:
- Choosing the right model and techniques depends greatly on the nature of the data and the specific requirements of the application. Always start simple!