Step 1: Problem Statement:
- What is the objective?
- What is the expected output? For example, a binary variable (e.g., 0 = "first class", 1 = "second class").
Step 2: Data Collection:
- Ensure the data is collected from all possible sources; the data can be structured, unstructured, semi-structured, or a mix of all three.
Step 3: Data Preprocessing:
- 3.1 Data Cleaning:
  - Handling missing values:
    - The strategy depends on the type of data. For example, for a numerical column one could impute with the mean/median/mode, depending on the requirements.
    - For string-type data, missing values can be filled with "N/A", "unknown", etc., or the affected rows can be dropped.
  - Removing duplicates: Ensure that there are no duplicates in the dataset, to avoid skewing the results.
  - This preprocessing usually depends on the requirements (see the sketch after this list).
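A minimal pandas sketch of the cleaning steps above, assuming a hypothetical DataFrame df with a numerical column 'age' and a string column 'city':

    import pandas as pd

    df = pd.DataFrame({
        'age': [25, None, 31, 25],
        'city': ['Paris', 'Oslo', None, 'Paris'],
    })

    df['age'] = df['age'].fillna(df['age'].median())  # numerical: impute with the median
    df['city'] = df['city'].fillna('unknown')         # string: fill with a placeholder
    df = df.drop_duplicates()                         # remove exact duplicate rows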
- 3.2 Data Transformation:
  - Feature Engineering: This step involves (if needed) creating new features that could help with the prediction.
  - Normalization/Standardization: Scale the data if you're using models that are sensitive to the magnitude of input values (see the sketch below).
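A minimal sketch of both transformation steps, assuming hypothetical 'height_cm' and 'weight_kg' columns; the derived BMI feature is illustrative:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({'height_cm': [170, 180, 165], 'weight_kg': [70, 85, 60]})

    # Feature engineering: derive a new feature from existing ones
    df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2

    # Standardization: rescale to zero mean and unit variance
    # (matters for e.g. linear models, SVMs, k-NN; not for tree-based models)
    cols = ['height_cm', 'weight_kg', 'bmi']
    df[cols] = StandardScaler().fit_transform(df[cols])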
- 3.3 Handling Dates (if needed): Date parsing should be correct and consistent, and useful information (year, month, day of week, etc.) should be extracted from the dates (see the sketch below).
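A short sketch, assuming a hypothetical 'date' column stored as strings; the extracted columns are reused by the encoder example in 3.4.4:

    import pandas as pd

    df = pd.DataFrame({'date': ['2023-01-15', '2023-06-30']})
    df['date'] = pd.to_datetime(df['date'])        # parse consistently
    df['month'] = df['date'].dt.month              # extract useful parts
    df['day_of_week'] = df['date'].dt.day_name()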
- 3.4 Encoding Categorical Data: Convert categorical variables into a form that can be provided to ML models, using one-hot encoding, label encoding, or embedding layers for deep learning models.
- 3.4.1 Label Encoding: It is usually preferred when the categorical feature is ordinal (the categories have a meaningful order or rank).

    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    encoded = encoder.fit_transform(['small', 'medium', 'large'])  # -> array([2, 1, 0])
    # Note: LabelEncoder assigns codes alphabetically, not by rank
- 3.4.2 One-Hot Encoding with sklearn: It is usually preferred when you are implementing a pipeline or need to handle unseen data during model deployment. Similar to pd.get_dummies().

    from sklearn.preprocessing import OneHotEncoder

    # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
    encoder = OneHotEncoder(drop='first', sparse_output=False)
    encoded = encoder.fit_transform([['red'], ['green'], ['blue']])
- 3.4.3 Hashing: Preferred when the number of unique categories is large and dimensionality reduction is needed.

    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=6, input_type='string')
    hashed = hasher.transform([['cat_a'], ['cat_b']])  # expects an iterable of iterables of strings
- 3.4.4 Category Encoders: Advanced encoding techniques for categorical variables, from the category_encoders package.

    import category_encoders as ce

    encoder = ce.BinaryEncoder(cols=['month', 'day_of_week'])
    # encoder.fit_transform(df), where df contains the listed columns
Step 4: Feature Selection:
- 4.1 Correlation Analysis: Remove highly correlated features to reduce multicollinearity.
- 4.2 Importance Metrics: Use model-based metrics such as the feature importances of tree-based models (see the sketch below).
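A minimal sketch of both ideas, assuming a hypothetical feature DataFrame X and target y; the 0.9 correlation threshold is an illustrative choice:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # 4.1: drop one feature from each highly correlated pair
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
    X_reduced = X.drop(columns=to_drop)

    # 4.2: rank the remaining features by model-based importance
    model = RandomForestClassifier(random_state=42).fit(X_reduced, y)
    for name, score in zip(X_reduced.columns, model.feature_importances_):
        print(name, round(score, 3))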
Step 5: Data Partitioning:
Split the data into training, validation, and test sets.
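A common two-stage split with scikit-learn, assuming hypothetical X and y; the 60/20/20 proportions are an illustrative choice:

    from sklearn.model_selection import train_test_split

    # First carve out the test set, then split the remainder into train/validation
    X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
    # 0.25 of the remaining 80% = 20% of the full data for validation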
Step 6: Developing a Baseline Model:
Start with a simple model, for example logistic regression or a basic decision tree, to establish a baseline performance (see the sketch below).
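A minimal baseline sketch, reusing the split from Step 5:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print('Baseline validation accuracy:', accuracy_score(y_val, baseline.predict(X_val)))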
Step 7: Model Training and Evaluation:
- Train your model on the training set and evaluate it on the validation set.
- Adjust hyperparameters as needed, and consider upgrading the model based on the validation results, through fine-tuning or model enhancement (see the sketch below).
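A sketch of tuning a stronger candidate model against the validation set; the random forest and the parameter values are illustrative choices, not the required method:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    best_score, best_model = -1.0, None
    for n_estimators in (100, 300):
        for max_depth in (None, 10):
            model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                           random_state=42).fit(X_train, y_train)
            score = accuracy_score(y_val, model.predict(X_val))
            if score > best_score:
                best_score, best_model = score, model

    # Evaluate on the test set only once, at the very end
    print('Test accuracy:', accuracy_score(y_test, best_model.predict(X_test)))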