EDA - Exploratory Data Analysis

This is one of the major step before desigining any machine learning model.
It comes after loading the data from available resources into the system.
Understand the data
Understand the independent and dependent features, features means attributes(columns) of data
Find corerelation between features.
Perform feature engineering if needed.
Perform standardization of data points if needed.
Visualize the data.
Remove unwanted rows or columns if any.
Cleaning of the data.
Etc Etc

Some common steps for EDA

Step0: Our dataset name is dataset based on some housing data in CA.

Step1 : Converting to data frame

 df=pd.DataFrame(datase.data,columns=datase.feature_names)

Step2: Add dependent feature to data frame. In our case price
```
 df['Price']=dataset.target
```
- To get info on dataframe: df.info()
- To Summarizing The Stats of the data : df.describe()
Step3: Check for missing values
```
 df.isnull().sum()
```
- If there is any missing value, fill them with mean/media/mode or any other parameter as per the problem statement
- For empty columns we can drop them
Step4: Find Correlation - applied to numerical data
```
 df.corr()   
```
- Complete correlation between two variables is expressed by either + 1 or -1.
- From wikipedia: The correlation coefficient is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect inverse (decreasing) linear relationship (anti-correlation),and some value in the open interval (−1,1) in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.

Step 5: Standardization and Normalization

Scale your data before feeding it into machine learning models
Standardization: Use standardization when you want to center your data around zero with a standard deviation of one.
Normalization: Use normalization when you want to rescale your data to fit within a specific range, typically [0, 1] or [-1, 1].

Why Standardize or Normalize?

Improved Model Performance
Prevention of Dominance

Regularization

  from sklearn.preprocessing import StandardScaler
  scaler=StandardScaler()
  X_train=scaler.fit_transform(X_train) 
  """The fit method calculates the mean and standard deviation of the training data, and the transform method applies the scaling."""
  X_test=scaler.transform(X_test)
  """Transform the test data using the same scaler. It is crucial to use the same scaler fitted on the training data to maintain consistency."""

Step6: Visualization using pairplot, scatter, regplot

Satyam Mittal

EDA(Exploratory Data Analysis) - Basic Introduction

EDA - Exploratory Data Analysis

Some common steps for EDA

Pytorch_DataPreprocessing

Pytorch_DataManipulation