EDA - Exploratory Data Analysis
- This is one of the major steps before designing any machine learning model.
- It comes after loading the data from available resources into the system.
- Understand the data.
- Understand the independent and dependent features; "features" means the attributes (columns) of the data.
- Find the correlation between features.
- Perform feature engineering if needed.
- Perform standardization of data points if needed.
- Visualize the data.
- Remove unwanted rows or columns if any.
- Cleaning of the data.
- Etc.
Some common steps for EDA
- Step 0: Our dataset is named dataset and is based on some housing data in CA.
- Step 1: Convert to a data frame
import pandas as pd
df=pd.DataFrame(dataset.data,columns=dataset.feature_names)
- Step 2: Add the dependent feature to the data frame. In our case, it is the price.
df['Price']=dataset.target
- To get info on the data frame: df.info()
- To summarize the statistics of the data: df.describe()
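Steps 0 to 2 can be sketched end to end as follows. This is a minimal illustration, not the original notebook: the `dataset` object here is a hypothetical stand-in built from random numbers, and the two feature names are chosen only for illustration. Any object that exposes `.data`, `.feature_names`, and `.target` should behave the same way.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Hypothetical stand-in for `dataset` (assumption): random values, not real housing data.
rng = np.random.default_rng(0)
dataset = SimpleNamespace(
    data=rng.normal(size=(5, 2)),
    feature_names=["MedInc", "HouseAge"],  # illustrative column names
    target=rng.uniform(1.0, 5.0, size=5),
)

# Step 1: put the independent features into a data frame
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Step 2: append the dependent feature as a new column
df["Price"] = dataset.target

df.info()                # column names, dtypes, and non-null counts
summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary)
```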
- Step 3: Check for missing values
df.isnull().sum()
- If there are any missing values, fill them with the mean/median/mode or another value, as the problem statement requires.
- Empty columns can be dropped.
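One possible way to handle the missing values found in Step 3 is sketched below. The small data frame and its column names are made up for illustration, and the choice of median for one column and mean for the other is an assumption; the right fill value depends on the problem.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps (assumption): "empty" has no values at all.
df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 20.0, np.nan, 30.0],
    "empty": [np.nan, np.nan, np.nan, np.nan],
})

# Count the missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop columns in which every value is missing
cleaned = df.dropna(axis=1, how="all")

# Fill the remaining gaps with a per-column statistic
cleaned = cleaned.fillna({
    "rooms": cleaned["rooms"].median(),
    "age": cleaned["age"].mean(),
})
print(cleaned)
```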
- Step 4: Find correlation (applies to numerical data)
df.corr()
- A perfect correlation between two variables is expressed by either +1 or -1.
- From Wikipedia: The correlation coefficient is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect inverse (decreasing) linear relationship (anti-correlation), and some value in the open interval (−1,1) in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
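The two extreme cases described above can be checked on a toy example. The columns below are invented for illustration; the non-numeric column is filtered out with `select_dtypes` before calling `corr()`, since the correlation only applies to numerical data.

```python
import pandas as pd

# Toy data (assumption): two columns that are exact linear functions of x.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "double_x": [2.0, 4.0, 6.0, 8.0, 10.0],   # increases linearly with x
    "neg_x": [-1.0, -2.0, -3.0, -4.0, -5.0],  # decreases linearly with x
    "label": ["a", "b", "c", "d", "e"],       # non-numeric, excluded below
})

# Keep only numerical columns, then compute pairwise Pearson correlation
corr = df.select_dtypes(include="number").corr()
print(corr)
```

Here `corr.loc["x", "double_x"]` is the perfect direct case and `corr.loc["x", "neg_x"]` is the perfect inverse case.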
- Step 5: Standardization and Normalization
- Scale your data before feeding it into machine learning models
- Standardization: Use standardization when you want to center your data around zero with a standard deviation of one.
- Normalization: Use normalization when you want to rescale your data to fit within a specific range, typically [0, 1] or [-1, 1].
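- Why standardize or normalize?
- Improved model performance: gradient-based and distance-based algorithms often converge faster or behave better when features share a similar scale.
- Prevention of dominance: features with large numeric ranges no longer outweigh features with small ranges.
- Regularization: penalties such as L1 and L2 treat coefficients evenly only when the features are on comparable scales.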
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# The fit method calculates the mean and standard deviation of the training data, and the transform method applies the scaling.
X_train=scaler.fit_transform(X_train)
# Transform the test data using the same scaler. It is crucial to use the same scaler fitted on the training data to maintain consistency.
X_test=scaler.transform(X_test)
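The snippet above assumes that `X_train` and `X_test` already exist. A self-contained sketch is shown below; the synthetic data, the 80/20 split, and the variable names are assumptions made for illustration. It also includes `MinMaxScaler` as one way to do the normalization described in this step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features and target (assumption): not real housing data.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.normal(size=100)

# Split first, so the scalers never see the test data during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardization: zero mean and unit standard deviation per feature
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Normalization: rescale each feature to [0, 1] (the default feature_range)
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

Because the scalers are fitted on the training split only, the transformed test data will usually not have exactly zero mean, nor lie strictly inside [0, 1].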
- Step 6: Visualization using pairplot, scatter plot, and regplot