Deep Dive Into Machine Learning: Principal Component Analysis

Swarnalata Patel
2 min read · Oct 24, 2024

Note: I prepared these notes for my own reference while learning the topic, drawing on various sources. I hope they help others understand the algorithm in a simple way.

What is Principal Component Analysis:

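Principal component analysis (PCA) is a dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables, called principal components, while retaining as much of the variance in the data as possible.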
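To make the definition concrete, here is a minimal sketch of PCA written directly with NumPy. It is an illustration under stated assumptions, not how scikit-learn implements it internally: the function name pca_from_scratch and the synthetic data are made up for this example, and the approach shown is the textbook eigendecomposition of the covariance matrix.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # Center each feature so variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Keep the directions with the largest variance first.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    explained_variance = eigenvalues[order]
    return X_centered @ components, components, explained_variance

# Hypothetical data: the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

projected, components, explained_variance = pca_from_scratch(X_demo, n_components=2)
print(projected.shape)        # (200, 2)
print(explained_variance)     # variance captured by each kept component
```

Each column of projected is one principal component, and its variance equals the corresponding eigenvalue of the covariance matrix.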

When to use Principal Component Analysis:

PCA is commonly used in the following scenarios:

  • Data preprocessing: PCA is often applied before a machine learning algorithm to decorrelate the features and reduce their number.
  • Dimensionality reduction: PCA reduces the dimensionality of your data by keeping the directions (linear combinations of the original features) that capture the most variance. This can be helpful when your data has a lot of variables and is difficult to analyze.
  • Feature extraction: PCA can help you derive new features from your data that might be more insightful than the original features.
  • Data visualization: PCA can help you visualize high-dimensional data in two or three dimensions, which can help you identify patterns and outliers.
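As an illustration of the preprocessing use case, the sketch below places PCA between a scaler and a classifier in a scikit-learn pipeline. The choice of LogisticRegression, the 75/25 split, and the random_state value are assumptions made for this example only; any estimator could take the classifier's place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale, reduce 30 features to 2 components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy with 2 components: {accuracy:.3f}")
```

Putting the scaler and PCA inside the pipeline means both are fitted on the training split only, so no information from the test split leaks into the transformation.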

When not to use Principal Component Analysis:

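PCA works by exploiting correlation between variables. If the variables are only weakly correlated, the variance is spread almost evenly across the components, so dropping components discards a lot of information and PCA does not reduce the data well.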
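The effect can be seen on synthetic data. The sketch below is only an illustration with made-up data: it compares the share of variance captured by each component for independent features and for features driven by one shared factor.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 5

# Case 1: independent features, so no direction dominates the variance.
X_independent = rng.normal(size=(n_samples, n_features))

# Case 2: every feature is one shared latent factor plus a little noise.
latent = rng.normal(size=(n_samples, 1))
X_correlated = latent + 0.1 * rng.normal(size=(n_samples, n_features))

ratio_independent = PCA().fit(X_independent).explained_variance_ratio_
ratio_correlated = PCA().fit(X_correlated).explained_variance_ratio_

print("independent:", ratio_independent.round(2))
print("correlated: ", ratio_correlated.round(2))
```

In the independent case every component carries a similar share of the variance, so there is nothing to discard cheaply. In the correlated case the first component carries almost all of it.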

Steps to implement Principal Component Analysis:

Import the libraries:

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

Import the dataset, verify the type of the dataset, get the keys

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

type(cancer)

cancer.keys()

Create the data frame

df = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])

Verify the head of the dataframe

df.head()

Preprocess the data

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(df)

scaled_data = scaler.transform(df)
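A quick check shows what the scaler did and why it matters for PCA. The sketch below is self-contained, so it reloads the dataset; the variable names are chosen for this example. It confirms that each scaled feature has mean 0 and standard deviation 1, and compares the first component's share of variance with and without scaling.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X_raw)

# After scaling, every feature has mean ~0 and standard deviation ~1.
means = X_scaled.mean(axis=0)
stds = X_scaled.std(axis=0)
print("max |mean|:", np.abs(means).max())
print("std range:", stds.min(), stds.max())

# Without scaling, features with large numeric ranges dominate the variance.
first_ratio_raw = PCA(n_components=1).fit(X_raw).explained_variance_ratio_[0]
first_ratio_scaled = PCA(n_components=1).fit(X_scaled).explained_variance_ratio_[0]
print("first component share, raw:   ", round(float(first_ratio_raw), 3))
print("first component share, scaled:", round(float(first_ratio_scaled), 3))
```

PCA ranks directions by variance, so on unscaled data the first component mostly reflects whichever features happen to have the largest units.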

Implement PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=2)

pca.fit(scaled_data)

x_pca = pca.transform(scaled_data)
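To judge whether two components are enough, it helps to look at how much variance they retain. The following sketch uses the fitted model's explained_variance_ratio_ attribute, and also shows that scikit-learn accepts a fraction for n_components to pick the number of components automatically. The 0.95 threshold is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled_data = StandardScaler().fit_transform(load_breast_cancer().data)

# Share of the total variance retained by the first two components.
pca_2 = PCA(n_components=2).fit(scaled_data)
ratio_2 = pca_2.explained_variance_ratio_
print("per component:", ratio_2.round(3))
print("kept by 2 components:", round(float(ratio_2.sum()), 3))

# A float in (0, 1) asks for the fewest components reaching that fraction.
pca_95 = PCA(n_components=0.95).fit(scaled_data)
print("components needed for 95% variance:", pca_95.n_components_)
```

Two components are convenient for plotting, but a model that needs most of the information in the data may want the larger number reported by the second fit.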

Compare the dimensions before and after the PCA implementation

scaled_data.shape

output: (569, 30)

x_pca.shape

output: (569, 2)
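Going from 30 columns to 2 loses some information. One way to quantify the loss, sketched below, is to map the 2-component data back to the original space with inverse_transform and measure the reconstruction error. The variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled_data = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA(n_components=2)
x_pca = pca.fit_transform(scaled_data)

# Map the 2-dimensional points back into the 30-dimensional feature space.
reconstructed = pca.inverse_transform(x_pca)

# Mean squared error between the scaled data and its reconstruction.
mse = np.mean((scaled_data - reconstructed) ** 2)
variance_lost = 1.0 - pca.explained_variance_ratio_.sum()
print("reconstruction MSE:", round(float(mse), 4))
print("fraction of variance discarded:", round(float(variance_lost), 4))
```

Because every standardized feature has unit variance, the mean squared reconstruction error matches the fraction of variance that the dropped components carried.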

Plot a figure using the dimensions

plt.figure(figsize=(8,6))

plt.scatter(x_pca[:, 0], x_pca[:, 1], c=cancer['target'])

plt.xlabel('First Principal Component')

plt.ylabel('Second Principal Component')

Verify the PCA components

pca.components_

Create a dataframe of the components

df_comp = pd.DataFrame(pca.components_, columns=cancer['feature_names'])

Create a heatmap of the components to know which features constitute the components

sns.heatmap(df_comp, cmap='plasma')
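The heatmap can be hard to read with 30 columns, so a numeric summary may help. The sketch below lists, for each component, the five features with the largest absolute weights. The row labels PC1 and PC2 and the cutoff of five are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
scaled_data = StandardScaler().fit_transform(cancer["data"])
pca = PCA(n_components=2).fit(scaled_data)

# One row per component, one column per original feature.
df_comp = pd.DataFrame(
    pca.components_, columns=cancer["feature_names"], index=["PC1", "PC2"]
)

top_features = {}
for name, loadings in df_comp.iterrows():
    # Rank features by the size of their weight, ignoring the sign.
    top = loadings.abs().sort_values(ascending=False).head(5)
    top_features[name] = list(top.index)
    print(name, top_features[name])
```

Each row of pca.components_ is a unit-length vector, so the weights within a row can be compared directly with one another.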

Working Jupyter Notebook:
