Overview of Machine Learning Libraries and Frameworks

  1. Scikit-learn:
    • Good for small to medium-sized datasets; provides models for regression, classification, clustering, and dimensionality reduction.
    • Not usually preferred for very large datasets or deep learning tasks.
       from sklearn.ensemble import RandomForestClassifier
       from sklearn.datasets import load_iris
       from sklearn.model_selection import train_test_split
       from sklearn.metrics import accuracy_score
      
       # Load dataset
       data = load_iris()
       X = data.data
       y = data.target
      
       # Split dataset
       X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
      
       # Train model
       model = RandomForestClassifier(n_estimators=100)
       model.fit(X_train, y_train)
      
       # Predict and evaluate
       predictions = model.predict(X_test)
       print("Accuracy:", accuracy_score(y_test, predictions))
      
  2. TensorFlow and Keras:
    • Used for high-performance numerical computation and deep learning tasks.
    • Supports GPU (and TPU) acceleration.
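    • A minimal Keras sketch (assuming TensorFlow 2.x is installed; the synthetic data and layer sizes are illustrative, not a recommended architecture):
        import numpy as np
        from tensorflow import keras

        # Synthetic binary-classification data (placeholder for a real dataset)
        X = np.random.rand(1000, 20).astype("float32")
        y = (X.sum(axis=1) > 10).astype("int32")

        # Simple feed-forward network
        model = keras.Sequential([
            keras.Input(shape=(20,)),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid")
        ])

        # Standard optimizer/loss for binary classification
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

        # Train and evaluate
        model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
        loss, acc = model.evaluate(X, y, verbose=0)
        print("Accuracy:", acc)
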
  3. PyTorch:
    • Open-source deep learning framework with dynamic computation graphs (define-by-run).
    • Supports GPU acceleration and automatic differentiation (autograd); widely used in research.
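    • A minimal training-loop sketch (assuming PyTorch is installed; the synthetic data and network size are illustrative):
        import torch
        import torch.nn as nn

        # Synthetic regression data (placeholder for a real dataset)
        X = torch.randn(100, 3)
        y = X.sum(dim=1, keepdim=True)

        # Simple feed-forward network
        model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
        loss_fn = nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        # Basic training loop
        for epoch in range(100):
            optimizer.zero_grad()           # reset gradients
            loss = loss_fn(model(X), y)     # forward pass and loss
            loss.backward()                 # backpropagation
            optimizer.step()                # parameter update

        print("Final loss:", loss.item())
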
  4. XGBoost (Extreme Gradient Boosting):
    • Optimized distributed gradient boosting library.
    • Supports both classification and regression.
    • Supports regularization to prevent overfitting.
        import xgboost as xgb
        from sklearn.datasets import load_diabetes
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_squared_error

        # Load an example regression dataset (illustrative choice; any numeric tabular data works)
        X, y = load_diabetes(return_X_y=True)

        # Split data
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

        # Set parameters
        params = {
            "max_depth": 10,
            "eta": 0.1,
            "objective": "reg:squarederror"
        }

        # Wrap the splits in DMatrix (XGBoost's internal data structure) and train
        dtrain = xgb.DMatrix(X_train, label=y_train)
        dtest = xgb.DMatrix(X_test)
        model = xgb.train(params, dtrain, num_boost_round=10)

        # Predict and evaluate
        predictions = model.predict(dtest)
        print("RMSE:", mean_squared_error(y_test, predictions) ** 0.5)
    
    
  5. LightGBM:
    • A gradient boosting framework that uses tree-based learning algorithms.
    • Designed for distributed, efficient training on large datasets.
    • Supports categorical features.
        import lightgbm as lgb
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score

        # Load an example binary-classification dataset (illustrative choice) and split it
        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

        # Create datasets for LightGBM
        train_data = lgb.Dataset(X_train, label=y_train)
        test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

        # Parameters
        params = {
            'objective': 'binary',         # objective function
            'metric': 'binary_logloss',    # evaluation metric
            'num_leaves': 31,
            'learning_rate': 0.05
        }

        # Train model
        gbm = lgb.train(params, train_data, num_boost_round=100, valid_sets=[test_data])

        # Predict and evaluate (predict returns probabilities for the binary objective)
        predictions = gbm.predict(X_test)
        predicted_classes = [1 if prob > 0.5 else 0 for prob in predictions]
        print("Accuracy:", accuracy_score(y_test, predicted_classes))
    
    

Classification and Regression Techniques

Classification:

  • Objective:
    • Predict a categorical outcome variable (qualitative label).
    • For example: classify an email as “spam” or “not spam”, or identify whether a transaction is fraudulent.
    • Evaluation Metrics: Accuracy, Precision, Recall, F1 score, and the area under the ROC curve (AUC-ROC); see the sketch after this list.
  • Linear Models: Logistic Regression, Support Vector Machines (SVM).
  • Tree-Based Models: Decision Trees, Random Forest, Gradient Boosting Machines (GBM), XGBoost, LightGBM.
  • Neural Networks: MLP (Multi-Layer Perceptrons), CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks).
  • Neighbors-Based: K-Nearest Neighbors (KNN).
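
A minimal sketch of training one of these classifiers and computing the metrics above (assuming scikit-learn is installed; logistic regression and the breast-cancer dataset are illustrative choices):

      from sklearn.datasets import load_breast_cancer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                   f1_score, roc_auc_score)

      # Illustrative binary-classification dataset
      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

      # Train a simple linear classifier
      clf = LogisticRegression(max_iter=5000)
      clf.fit(X_train, y_train)

      # Evaluate with the metrics listed above
      pred = clf.predict(X_test)
      proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
      print("Accuracy :", accuracy_score(y_test, pred))
      print("Precision:", precision_score(y_test, pred))
      print("Recall   :", recall_score(y_test, pred))
      print("F1 score :", f1_score(y_test, pred))
      print("AUC-ROC  :", roc_auc_score(y_test, proba))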

Regression:

  • Objective:
    • Predict a continuous outcome variable (dependent variable) based on one or more predictor variables (independent variables).
    • For example: predict the price of a house based on its size, location, and age.
    • Evaluation Metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared; see the sketch after this list.
  • Linear Models: Linear Regression, Ridge, Lasso.
  • Tree-Based Models: Regression Trees, Random Forest Regression, XGBoost Regression, LightGBM Regression.
  • Neural Networks: Same architectures as in Classification, but with different output layer configurations (e.g., a single linear output unit).
  • Support Vector Regression (SVR).
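
A minimal regression sketch computing the metrics above (assuming scikit-learn; linear regression and the diabetes dataset are illustrative choices):

      from sklearn.datasets import load_diabetes
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

      # Illustrative regression dataset
      X, y = load_diabetes(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

      # Train a simple linear model
      reg = LinearRegression()
      reg.fit(X_train, y_train)

      # Evaluate with the metrics listed above
      pred = reg.predict(X_test)
      mse = mean_squared_error(y_test, pred)
      print("MSE :", mse)
      print("RMSE:", mse ** 0.5)
      print("MAE :", mean_absolute_error(y_test, pred))
      print("R^2 :", r2_score(y_test, pred))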

Handling Overfitting and Underfitting

Overfitting:

  • It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.

Strategies to Handle Overfitting:

  • Regularization: Techniques like L1, L2, and Elastic Net regularization are commonly used in linear and
    logistic regression.
  • Pruning: Reducing the depth of the tree (tree-based models).
  • Cross-validation: Using techniques like k-fold cross-validation (splitting the data into multiple folds and validating on each in turn); see the sketch after this list.
  • Ensemble Methods: Techniques like bagging and boosting reduce variance and bias.
  • Dropout: Used in neural networks to randomly drop a fraction of neurons during each training iteration.
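
A minimal sketch combining two of the strategies above, L2 regularization (Ridge) and k-fold cross-validation (assuming scikit-learn; the dataset and alpha value are illustrative):

      from sklearn.datasets import load_diabetes
      from sklearn.linear_model import Ridge
      from sklearn.model_selection import cross_val_score

      X, y = load_diabetes(return_X_y=True)

      # Ridge applies an L2 penalty; alpha controls the regularization strength
      model = Ridge(alpha=1.0)

      # 5-fold cross-validation: each fold serves once as the validation set
      scores = cross_val_score(model, X, y, cv=5, scoring="r2")
      print("R^2 per fold:", scores)
      print("Mean R^2    :", scores.mean())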

Underfitting:

  • It happens when a model is too simple to learn the underlying pattern of the data and fails to capture the trend, performing poorly even on the training data.

Strategies to Handle Underfitting:

  • Increasing model complexity, feature engineering (e.g., adding polynomial or interaction features), decreasing regularization, etc.; see the sketch after this list.
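
A minimal sketch of one remedy, adding polynomial features so a linear model can fit a non-linear trend (assuming scikit-learn and NumPy; the synthetic data and degree are illustrative):

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import PolynomialFeatures

      # Synthetic non-linear data that a plain linear model underfits
      X = np.linspace(-3, 3, 200).reshape(-1, 1)
      y = X.ravel() ** 2 + np.random.normal(scale=0.5, size=200)

      # Plain linear model vs. linear model on polynomial features
      linear = LinearRegression().fit(X, y)
      poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

      print("Linear R^2    :", linear.score(X, y))
      print("Polynomial R^2:", poly.score(X, y))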

Note:

  • Choosing the right model and techniques depends greatly on the nature of the data and the specific requirements of the application. Always start with a simple model.