Introduction to Principal Component Analysis (PCA)

Prev Tutorial: Support Vector Machines for Non-Linearly Separable Data

Original author Theodore Tsesmelis
Compatibility OpenCV >= 3.0

In this tutorial you will learn how to:

  • Use the OpenCV class cv::PCA to calculate the orientation of an object.

Principal Component Analysis (PCA) is a statistical procedure that extracts the most important features of a dataset.

Figure: a set of 2D points, with a blue line marking the direction along which the points vary the most (axes: Feature 1 and Feature 2).

Consider that you have a set of 2D points, as shown in the figure above. Each dimension corresponds to a feature you are interested in. At first glance the points may seem randomly scattered; however, if you take a closer look you will see that there is a linear pattern (indicated by the blue line) which is hard to dismiss. A key idea of PCA is dimensionality reduction: the process of reducing the number of dimensions of a given dataset. For example, in the above case it is possible to approximate the set of points by a single line and therefore reduce the dimensionality of the given points from 2D to 1D.

Moreover, you could also see that the points vary the most along the blue line, more than they vary along either the Feature 1 or the Feature 2 axis. This means that knowing the position of a point along the blue line gives you more information about the point than knowing only its position on the Feature 1 axis or the Feature 2 axis.

Hence, PCA allows us to find the direction along which our data varies the most. In fact, the result of running PCA on the set of points in the diagram consists of 2 vectors called eigenvectors, which are the principal components of the data set.

Figure: the two eigenvectors of the point set, drawn from the center of the data; their lengths correspond to the eigenvalues.

The size of each eigenvector is encoded in the corresponding eigenvalue, which indicates how much the data vary along that principal component. The eigenvectors begin at the center of all the points in the data set. Applying PCA to an N-dimensional data set yields N N-dimensional eigenvectors, N eigenvalues and 1 N-dimensional center point. Enough theory, let's see how these ideas translate into equations and code.

The goal is to transform a given data set X of dimension p to an alternative data set Y of smaller dimension L. Equivalently, we are seeking to find the matrix Y, where Y is the Karhunen–Loève transform (KLT) of matrix X:

\[ \mathbf{Y} = \mathbb{K} \mathbb{L} \mathbb{T} \{\mathbf{X}\} \]
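
In OpenCV all of these steps are bundled in the cv::PCA class. As a quick illustration, the following minimal sketch (the data values are arbitrary and purely illustrative) projects a small data set of p = 3 variables onto its first L = 2 principal components; pca.project() applies the KLT described above:

    #include <opencv2/core.hpp>
    #include <iostream>

    int main()
    {
        // Toy data set X: n = 5 observations (rows) of p = 3 variables (columns).
        // The values are arbitrary and only serve as an illustration.
        cv::Mat X = (cv::Mat_<double>(5, 3) <<
            2.5, 2.4, 0.5,
            0.5, 0.7, 1.9,
            2.2, 2.9, 0.8,
            1.9, 2.2, 1.1,
            3.1, 3.0, 0.3);

        // Keep L = 2 components; cv::PCA computes the mean, eigenvectors and eigenvalues.
        cv::PCA pca(X, cv::Mat(), cv::PCA::DATA_AS_ROW, 2);

        // Y is the projection of X onto the first L principal components (the KLT of X).
        cv::Mat Y = pca.project(X);
        std::cout << "Y = " << std::endl << Y << std::endl;
        return 0;
    }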

Organize the data set

Suppose you have data comprising a set of observations of p variables, and you want to reduce the data so that each observation can be described with only L variables, L < p. Suppose further that the data are arranged as a set of n data vectors \( x_1...x_n \) with each \( x_i \) representing a single grouped observation of the p variables.

  • Write \( x_1...x_n \) as row vectors, each of which has p columns.
  • Place the row vectors into a single matrix X of dimensions \( n\times p \).

Calculate the empirical mean

  • Find the empirical mean along each dimension \( j = 1, ..., p \).
  • Place the calculated mean values into an empirical mean vector u of dimensions \( p\times 1 \).

    \[ \mathbf{u[j]} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{X[i,j]} \]
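
In OpenCV terms, if X is stored as an n x p cv::Mat with one observation per row, the empirical mean can be computed with cv::reduce. A minimal sketch (the helper name is ours, just for illustration):

    #include <opencv2/core.hpp>

    // Empirical mean of an n x p data matrix X (one observation per row, CV_64F).
    // cv::reduce collapses the rows and returns the mean as a 1 x p row vector;
    // transpose it if you prefer the p x 1 column vector u used in the formula above.
    cv::Mat empiricalMean(const cv::Mat& X)
    {
        cv::Mat u;
        cv::reduce(X, u, 0, cv::REDUCE_AVG, CV_64F);
        return u;
    }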

Calculate the deviations from the mean

Mean subtraction is an integral part of the solution towards finding a principal component basis that minimizes the mean square error of approximating the data. Hence, we proceed by centering the data as follows:

  • Subtract the empirical mean vector u from each row of the data matrix X.
  • Store mean-subtracted data in the \( n\times p \) matrix B.

    \[ \mathbf{B} = \mathbf{X} - \mathbf{h}\mathbf{u^{T}} \]

    where h is an \( n\times 1 \) column vector of all 1s:

    \[ h[i] = 1, i = 1, ..., n \]
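
A corresponding sketch for the centering step: cv::repeat stacks the 1 x p mean row n times, which is exactly the \( \mathbf{h}\mathbf{u^{T}} \) term (the helper name is again ours):

    #include <opencv2/core.hpp>

    // Center the n x p data matrix X (CV_64F): B = X - h * u^T.
    // u is the 1 x p mean row, e.g. as returned by the empiricalMean() sketch above.
    cv::Mat centerData(const cv::Mat& X, const cv::Mat& u)
    {
        return X - cv::repeat(u, X.rows, 1);
    }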

Find the covariance matrix

  • Find the \( p\times p \) empirical covariance matrix C from the outer product of matrix B with itself:

    \[ \mathbf{C} = \frac{1}{n-1} \mathbf{B^{*}} \cdot \mathbf{B} \]

    where * is the conjugate transpose operator. Note that if B consists entirely of real numbers, which is the case in many applications, the "conjugate transpose" is the same as the regular transpose.
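
In code, this is just the matrix product B^T B scaled by 1/(n-1). A minimal sketch (helper name ours; alternatively, cv::calcCovarMatrix with the COVAR_NORMAL | COVAR_ROWS flags computes the unscaled product directly from X, and you divide by n-1 afterwards):

    #include <opencv2/core.hpp>

    // Empirical covariance of the centered n x p matrix B (CV_64F): C = B^T * B / (n - 1).
    cv::Mat covarianceMatrix(const cv::Mat& B)
    {
        return B.t() * B / (B.rows - 1);
    }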

Find the eigenvectors and eigenvalues of the covariance matrix

  • Compute the matrix V of eigenvectors which diagonalizes the covariance matrix C:

    \[ \mathbf{V^{-1}} \mathbf{C} \mathbf{V} = \mathbf{D} \]

    where D is the diagonal matrix of eigenvalues of C.

  • Matrix D will take the form of a \( p \times p \) diagonal matrix:

    \[ D[k,l] = \left\{\begin{matrix} \lambda_k, k = l \\ 0, k \neq l \end{matrix}\right. \]

    here, \( \lambda_k \) is the k-th eigenvalue of the covariance matrix C.

  • Matrix V, also of dimension p x p, contains p column vectors, each of length p, which represent the p eigenvectors of the covariance matrix C.
  • The eigenvalues and eigenvectors are ordered and paired: the j-th eigenvalue corresponds to the j-th eigenvector.
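
Since C is symmetric, cv::eigen can be used for this step: it returns the eigenvalues already sorted in descending order, with the corresponding eigenvectors stored as the rows of V, so the j-th eigenvalue pairs with V.row(j). A minimal sketch (helper name ours):

    #include <opencv2/core.hpp>

    // Eigen-decomposition of the symmetric p x p covariance matrix C (CV_64F).
    void eigenDecomposition(const cv::Mat& C, cv::Mat& eigenvalues, cv::Mat& V)
    {
        cv::eigen(C, eigenvalues, V); // eigenvalues descending, eigenvectors as rows of V
    }

In practice, cv::PCA performs all of the steps above (mean, centering, covariance and eigen-decomposition) in a single call, as the code below shows.
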
Note
sources [1], [2] and special thanks to Svetlin Penkov for the original tutorial.
Note
Another example using PCA for dimensionality reduction while maintaining an amount of variance can be found at opencv_source_code/samples/cpp/pca.cpp
  • Read image and convert it to binary

Here we apply the necessary pre-processing procedures in order to be able to detect the objects of interest.
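
This step is a standard grayscale conversion followed by a binary threshold; a minimal sketch along these lines (the file name is a placeholder, and Otsu thresholding is just one reasonable choice):

    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/highgui.hpp>

    int main()
    {
        // Load the input image (placeholder path) and convert it to a binary image
        cv::Mat src = cv::imread("pca_test1.jpg");
        if (src.empty()) return -1;

        cv::Mat gray, bw;
        cv::cvtColor(src, gray, cv::COLOR_BGR2GRAY);
        cv::threshold(gray, bw, 50, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

        cv::imshow("binary", bw);
        cv::waitKey();
        return 0;
    }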

  • Extract objects of interest

Then we find the contours, filter them by size and obtain the orientation of the remaining ones.
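
A sketch of this step (the area limits are illustrative; getOrientation() is only forward-declared here and sketched in the next snippet):

    #include <opencv2/imgproc.hpp>
    #include <vector>

    double getOrientation(const std::vector<cv::Point>& pts, cv::Mat& img); // sketched below

    // Find contours in the binary image bw, keep the ones of reasonable size
    // and compute their orientation; contours and axes are drawn into src.
    void processContours(cv::Mat& bw, cv::Mat& src)
    {
        std::vector<std::vector<cv::Point> > contours;
        cv::findContours(bw, contours, cv::RETR_LIST, cv::CHAIN_APPROX_NONE);

        for (size_t i = 0; i < contours.size(); i++)
        {
            double area = cv::contourArea(contours[i]);
            if (area < 1e2 || area > 1e5) continue;           // ignore too small / too large objects
            cv::drawContours(src, contours, (int)i, cv::Scalar(0, 0, 255), 2);
            getOrientation(contours[i], src);                 // PCA-based orientation
        }
    }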

  • Extract orientation

Orientation is extracted by calling the getOrientation() function, which performs the whole PCA procedure.

First the data need to be arranged in a matrix of size n x 2, where n is the number of data points we have. Then we can perform the PCA analysis. The calculated mean (i.e. the center of mass) is stored in the cntr variable, and the eigenvectors and eigenvalues are stored in the corresponding std::vectors.
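
A sketch of such a getOrientation() function, following the description above (it mirrors the variable names mentioned in this tutorial, but is not a verbatim copy of the official sample):

    #include <opencv2/core.hpp>
    #include <opencv2/imgproc.hpp>
    #include <cmath>
    #include <vector>

    double getOrientation(const std::vector<cv::Point>& pts, cv::Mat& img)
    {
        // Arrange the contour points as an n x 2 matrix, one point per row
        cv::Mat data_pts((int)pts.size(), 2, CV_64F);
        for (int i = 0; i < data_pts.rows; i++)
        {
            data_pts.at<double>(i, 0) = pts[i].x;
            data_pts.at<double>(i, 1) = pts[i].y;
        }

        // Perform the PCA analysis
        cv::PCA pca(data_pts, cv::Mat(), cv::PCA::DATA_AS_ROW);

        // The mean is the center of mass of the contour
        cv::Point cntr((int)pca.mean.at<double>(0, 0), (int)pca.mean.at<double>(0, 1));

        // Store the eigenvectors and eigenvalues
        std::vector<cv::Point2d> eigen_vecs(2);
        std::vector<double> eigen_val(2);
        for (int i = 0; i < 2; i++)
        {
            eigen_vecs[i] = cv::Point2d(pca.eigenvectors.at<double>(i, 0),
                                        pca.eigenvectors.at<double>(i, 1));
            eigen_val[i] = pca.eigenvalues.at<double>(i);
        }

        cv::circle(img, cntr, 3, cv::Scalar(255, 0, 255), 2);        // draw the center point
        double angle = std::atan2(eigen_vecs[0].y, eigen_vecs[0].x); // orientation in radians
        return angle;
    }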

  • Visualize result

The final result is visualized through the drawAxis() function, where the principal components are drawn as lines: each eigenvector is multiplied by its eigenvalue and translated to the mean position.
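
A sketch of that drawing step (drawAxis() is implemented here as a minimal arrowed-line helper, and visualizeAxes() is just an illustrative wrapper name; in the actual sample this drawing happens inside getOrientation() itself):

    #include <opencv2/imgproc.hpp>
    #include <vector>

    // A simple drawAxis() sketch: draw the (scaled) segment p -> q as an arrow.
    void drawAxis(cv::Mat& img, cv::Point p, cv::Point q, cv::Scalar colour, float scale = 1.f)
    {
        cv::Point end(p.x + (int)(scale * (q.x - p.x)), p.y + (int)(scale * (q.y - p.y)));
        cv::arrowedLine(img, p, end, colour, 1, cv::LINE_AA);
    }

    // Draw both principal components: each eigenvector is multiplied by its eigenvalue
    // and translated to the mean position cntr (the 0.02 factor just keeps the axes on screen).
    void visualizeAxes(cv::Mat& img, cv::Point cntr,
                       const std::vector<cv::Point2d>& eigen_vecs,
                       const std::vector<double>& eigen_val)
    {
        cv::Point p1 = cntr + 0.02 * cv::Point((int)(eigen_vecs[0].x * eigen_val[0]),
                                               (int)(eigen_vecs[0].y * eigen_val[0]));
        cv::Point p2 = cntr - 0.02 * cv::Point((int)(eigen_vecs[1].x * eigen_val[1]),
                                               (int)(eigen_vecs[1].y * eigen_val[1]));
        drawAxis(img, cntr, p1, cv::Scalar(0, 255, 0), 1);
        drawAxis(img, cntr, p2, cv::Scalar(255, 255, 0), 5);
    }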

The code opens an image, finds the orientation of the detected objects of interest and then visualizes the result by drawing the contours of the detected objects of interest, the center point, and the x- and y-axes of the extracted orientation.

Figure: the input image with the objects of interest.

Figure: the result, with the detected contours, center points and principal axes drawn.

