EDA - Exploratory Data Analysis

  • This is one of the major steps before designing any machine learning model.
  • It comes after loading the data from the available sources into the system.
  • Understand the data
  • Understand the independent and dependent features; features are the attributes (columns) of the data.
  • Find the correlation between features.
  • Perform feature engineering if needed.
  • Perform standardization of data points if needed.
  • Visualize the data.
  • Remove unwanted rows or columns if any.
  • Clean the data.
  • And so on.

Some common steps for EDA

  • Step 0: Our dataset, named dataset, is based on California housing data.

  • Step 1: Convert to a DataFrame
     df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    
  • Step 2: Add the dependent feature (in our case, the price) to the data frame
     df['Price'] = dataset.target
    
    • To get info on the dataframe: df.info()
    • To summarize the stats of the data: df.describe()
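Steps 0–2 can be sketched end to end. The notes load the real California housing data; the stand-in object below only mimics the same `data`/`feature_names`/`target` attributes so the snippet runs without a download, and its values are made up:

```python
import pandas as pd
from types import SimpleNamespace

# Tiny stand-in with the same attributes as scikit-learn's dataset objects
# (in the notes, `dataset` holds the California housing data)
dataset = SimpleNamespace(
    data=[[8.3, 41.0], [7.2, 21.0], [5.6, 52.0]],
    feature_names=['MedInc', 'HouseAge'],
    target=[4.5, 3.6, 3.5],
)

# Step 1: convert the feature matrix to a DataFrame
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Step 2: add the dependent feature as the 'Price' column
df['Price'] = dataset.target

df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics per column
```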
  • Step 3: Check for missing values
     df.isnull().sum()
    
    • If there are any missing values, fill them with the mean/median/mode or another value suited to the problem statement.
    • Columns that are entirely empty can be dropped.
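A minimal sketch of this step on a toy DataFrame (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'MedInc':   [8.3, np.nan, 5.6, 7.1],
    'HouseAge': [41.0, 21.0, np.nan, 52.0],
    'Empty':    [np.nan, np.nan, np.nan, np.nan],  # an entirely empty column
})

# Count missing values per column
print(df.isnull().sum())

# Fill numeric gaps with the column median (mean or mode are alternatives)
df['MedInc'] = df['MedInc'].fillna(df['MedInc'].median())
df['HouseAge'] = df['HouseAge'].fillna(df['HouseAge'].median())

# Drop columns that are entirely empty
df = df.dropna(axis=1, how='all')
```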
  • Step 4: Find correlation - applies to numerical data only
     df.corr(numeric_only=True)   # numeric_only is required in pandas >= 2.0 when non-numeric columns are present
    
    • Complete correlation between two variables is expressed by either + 1 or -1.
    • From Wikipedia: The correlation coefficient is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect inverse (decreasing) linear relationship (anti-correlation), and some value in the open interval (−1, 1) in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
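For example, two perfectly linearly related columns give a coefficient of exactly +1 (or −1 for an inverse relationship); the column names here are made up:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0],
    'y': [2.0, 4.0, 6.0, 8.0],      # y = 2x: perfect positive correlation
    'z': [4.0, 3.0, 2.0, 1.0],      # z decreases as x increases: perfect anti-correlation
    'label': ['a', 'b', 'c', 'd'],  # non-numeric column, skipped below
})

# numeric_only=True skips non-numeric columns
corr = df.corr(numeric_only=True)
print(corr)
```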
  • Step 5: Standardization and Normalization
    • Scale your data before feeding it into machine learning models
    • Standardization: Use standardization when you want to center your data around zero with a standard deviation of one.
    • Normalization: Use normalization when you want to rescale your data to fit within a specific range, typically [0, 1] or [-1, 1].
    • Why Standardize or Normalize?
      • Improved Model Performance
      • Prevention of Dominance
      • Regularization
          from sklearn.preprocessing import StandardScaler
          scaler=StandardScaler()
          X_train=scaler.fit_transform(X_train) 
          """The fit method calculates the mean and standard deviation of the training data, and the transform method applies the scaling."""
          X_test=scaler.transform(X_test)
          """Transform the test data using the same scaler. It is crucial to use the same scaler fitted on the training data to maintain consistency."""
        
  • Step 6: Visualize the data using pairplot, scatter, and regplot
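A minimal scatter-plot sketch with matplotlib (seaborn's pairplot and regplot take the same DataFrame); the data here is synthetic and the column names are borrowed from the housing example:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: Price roughly linear in MedInc, plus noise
rng = np.random.default_rng(0)
df = pd.DataFrame({'MedInc': rng.normal(5.0, 2.0, 100)})
df['Price'] = 0.5 * df['MedInc'] + rng.normal(0.0, 0.5, 100)

fig, ax = plt.subplots()
ax.scatter(df['MedInc'], df['Price'])
ax.set_xlabel('MedInc')
ax.set_ylabel('Price')
fig.savefig('medinc_vs_price.png')
```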

Pytorch_DataPreprocessing

Pytorch_DataManipulation