Deep Dive Into Machine Learning: Principal Component Analysis
Note: I prepared these notes for my own reference while learning the topic, drawing on various sources. I hope they help others understand the algorithm in a simple way.

What is Principal Component Analysis:
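Principal component analysis (PCA) is a dimensionality reduction technique that transforms a data set with many variables into a smaller set of new variables, called principal components, while keeping as much of the original variance as possible. Each principal component is a linear combination of the original features, and the components are uncorrelated with one another.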
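To make the definition concrete, here is a minimal sketch of the computation behind PCA using only NumPy: center the data, compute the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. The function name pca_from_scratch and the toy data are my own illustrations, not part of any library, and scikit-learn's PCA uses an SVD internally rather than this exact route.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center each feature at zero mean
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (features are columns)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the indices of the largest eigenvalues, largest first
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T          # shape (n_components, n_features)
    explained_variance = eigvals[order]
    return X_centered @ components.T, components, explained_variance

# Toy data (hypothetical): the third feature is nearly a copy of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X_toy = np.column_stack([a, b, a + 0.05 * rng.normal(size=200)])

projected, components, explained_variance = pca_from_scratch(X_toy, n_components=2)

# Share of the total variance kept by the two components
share = explained_variance.sum() / X_toy.var(axis=0, ddof=1).sum()
print(projected.shape)
print(f"variance kept: {share:.4f}")
```

Because one of the three toy features is almost redundant, two components should keep nearly all of the variance, which is the situation where PCA pays off.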
When to use Principal Component Analysis:
PCA is useful in several scenarios:
- Data preprocessing: PCA is often used to preprocess data for machine learning algorithms, for example to reduce the number of inputs or to remove correlation between them before training a model.
- Dimensionality reduction: PCA can help you reduce the dimensionality of your data by keeping only the directions (components) along which the data varies the most. This can be helpful when your data has a lot of variables and is difficult to analyze.
- Feature extraction: PCA can help you derive new features from your data that might be more insightful than the original features.
- Data visualization: PCA can help you visualize high-dimensional data in two or three dimensions, which can help you identify patterns and outliers.
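As a rough illustration of the preprocessing use case, the sketch below puts scaling and PCA in front of a classifier in a scikit-learn pipeline. The choice of LogisticRegression, two components, and 5-fold cross-validation are my assumptions for the example, not something these notes prescribe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale -> reduce to 2 components -> classify (the classifier is an illustrative choice)
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# Cross-validation refits the scaler and PCA inside each training fold,
# so no information from the held-out fold leaks into the preprocessing
scores = cross_val_score(model, X, y, cv=5)
mean_accuracy = scores.mean()
print(f"mean accuracy with 2 components: {mean_accuracy:.3f}")
```

Keeping PCA inside the pipeline, rather than transforming the whole data set up front, is the design choice that keeps the evaluation honest.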
When not to use Principal Component Analysis:
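PCA does not work well when the variables are only weakly correlated with one another. In that case there is little shared variance to combine, so each component explains about as much as a single original variable and few dimensions can be dropped without losing information.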
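A small experiment can show this effect. The sketch below compares how much variance the first component captures on independent features versus features driven by one shared factor. The synthetic data, seed, and variable names are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 1000

# Five independent features: no shared structure for PCA to exploit
independent = rng.normal(size=(n, 5))

# Five features driven by one hidden factor plus a little noise
factor = rng.normal(size=(n, 1))
correlated = factor + 0.1 * rng.normal(size=(n, 5))

# Fraction of total variance captured by the first principal component
ratio_independent = PCA(n_components=1).fit(independent).explained_variance_ratio_[0]
ratio_correlated = PCA(n_components=1).fit(correlated).explained_variance_ratio_[0]

print(f"independent features: {ratio_independent:.2f}")
print(f"correlated features:  {ratio_correlated:.2f}")
```

With independent features the first component should capture only a little more than one fifth of the variance, while with strongly correlated features it should capture almost all of it.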
Steps to implement Principal Component Analysis:
Import the libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Import the dataset, verify the type of the dataset, get the keys
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
type(cancer)
cancer.keys()
Create the data frame
df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
Verify the head of the dataframe
df.head()
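Preprocess the data by standardizing each feature to zero mean and unit variance, so that features with large numeric ranges do not dominate the components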
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df)
scaled_data = scaler.transform(df)
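It can be worth checking what the scaler actually did. The sketch below is a quick sanity check, assuming the same breast cancer data set as above. It compares the spread of the raw features and confirms that every scaled column has mean close to 0 and standard deviation close to 1. The variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer()["data"]

# Raw features live on very different scales
raw_stds = X.std(axis=0)
print(f"smallest raw std: {raw_stds.min():.4f}, largest raw std: {raw_stds.max():.1f}")

# fit_transform is equivalent to calling fit and then transform on the same data
scaled = StandardScaler().fit_transform(X)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)   # population std (ddof=0), matching StandardScaler
means_ok = np.allclose(col_means, 0.0, atol=1e-8)
stds_ok = np.allclose(col_stds, 1.0)
print(means_ok, stds_ok)
```

Without this step, PCA would mostly pick up the features with the largest raw variance, since it works on variance directly.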
Implement PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
Compare the dimensions before and after the PCA implementation
scaled_data.shape
output: (569, 30)
x_pca.shape
output: (569, 2)
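The shapes show that 30 columns became 2, but not how much information survived. A hedged sketch of one way to check this is the explained_variance_ratio_ attribute of a fitted PCA. The 95% threshold below is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled = StandardScaler().fit_transform(load_breast_cancer()["data"])

# Fraction of the total variance captured by each of the two components
pca_2 = PCA(n_components=2).fit(scaled)
ratios = pca_2.explained_variance_ratio_
retained = ratios.sum()
print(f"per component: {np.round(ratios, 3)}")
print(f"retained by 2 components: {retained:.3f}")

# A float in (0, 1) asks scikit-learn for the fewest components
# whose cumulative explained variance reaches that share
pca_95 = PCA(n_components=0.95).fit(scaled)
print(f"components needed for 95% variance: {pca_95.n_components_}")
```

If two components retain too little variance for your purpose, passing a variance share instead of a fixed count lets the data decide how many components to keep.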
Plot the data on the two principal components, colored by the target class
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
Inspect the PCA components (each row is one component, each column the weight of one original feature)
pca.components_
Create a dataframe of the components
df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])
Create a heatmap of the components to see how strongly each feature contributes to each component
sns.heatmap(df_comp,cmap='plasma')
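The heatmap gives a visual impression. To read the same information as text, one option is to rank the features by the absolute value of their weight in each component. This is a minimal sketch. The names loadings and top_features and the cutoff of five features are my own choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled = StandardScaler().fit_transform(cancer["data"])
pca = PCA(n_components=2).fit(scaled)

# One row per component, one column per original feature
loadings = pd.DataFrame(
    pca.components_,
    columns=cancer["feature_names"],
    index=["PC1", "PC2"],
)

# For each component, list the five features with the largest absolute weight
top_features = {
    pc: loadings.loc[pc].abs().sort_values(ascending=False).head(5).index.tolist()
    for pc in loadings.index
}
for pc, names in top_features.items():
    print(pc, names)
```

The sign of a weight only indicates direction along the component, so the absolute value is what matters when asking which features contribute most.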

Working Jupyter Notebook: