# A C++ code to compute OpenGL 4×4 GL_MODELVIEW_MATRIX from 2D-3D points homography

## The theory

I’ve recently run into this problem, which is one of the very well-known problems in computer vision: homography estimation. Given two equicardinal sets of $N$ points:

$\mathbf{x} = \{\mathbf{x}_1,\ldots,\mathbf{x}_N \}$ with $\mathbf{x}_i \in \mathbb{R}^2$

$\mathbf{X} = \{\mathbf{X}_1,\ldots,\mathbf{X}_N \}$ with $\mathbf{X}_i \in \mathbb{R}^3$

find the matrix $\mathbf{A}$ such that the following projective correspondence is valid:

$\mathbf{x}_i \sim \mathbf{A} \mathbf{X}_i \quad \forall i \in \{1,\ldots,N\}$

where the lowercase points $\mathbf{x}_i = (u_i,v_i)$ are the image coordinates, i.e. coordinates defined on a plane (in OpenGL terms the plane is typically the screen and the image points are pixel coordinates), and the uppercase points $\mathbf{X}_i = (x_i,y_i,z_i)$ are the 3D world coordinates of the points to be projected onto the screen plane. To make the computations easier it is useful to homogenize the coordinates. This is done by explicitly appending a $1$ to both image and world points, increasing their dimension by 1, so that $\mathbf{x}_i = (u_i,v_i,1)$, $\mathbf{X}_i = (x_i,y_i,z_i,1)$ and $\mathbf{A}$ is a $3 \times 4$ matrix. For a correct and formal introduction to homogeneous coordinates look for other resources; it is not my intent to introduce such a complicated topic here. I will just say that in homogeneous coordinates the projection of points onto planes can be expressed simply as a linear matrix-vector product (Ma, Soatto et al. have a good introduction to homogenization and to 3D view reconstruction in general).
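As a small illustration of the paragraph above, here is a minimal sketch of how a world point is homogenized, multiplied by $\mathbf{A}$ and brought back to pixel coordinates. The aliases `Mat34`, `Vec3`, `Vec2` and the function name `projectPoint` are assumptions made for this example only; they are not taken from the code discussed later.

```cpp
#include <array>
#include <cassert>

using Mat34 = std::array<std::array<double, 4>, 3>;  // 3x4 projection matrix A
using Vec3  = std::array<double, 3>;                 // world point (x, y, z)
using Vec2  = std::array<double, 2>;                 // image point (u, v)

// Projects a 3D world point X through A and returns the pixel coordinates.
// Hypothetical helper, for illustration only.
Vec2 projectPoint(const Mat34& A, const Vec3& X) {
    // Homogenize: append a 1 to the world point.
    const std::array<double, 4> Xh = {{X[0], X[1], X[2], 1.0}};

    // Linear step: h = A * Xh is the homogeneous image point.
    std::array<double, 3> h = {{0.0, 0.0, 0.0}};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 4; ++c)
            h[r] += A[r][c] * Xh[c];

    // De-homogenize: h is defined only up to scale, so divide by its last
    // component (non-zero for any point with a finite image).
    assert(h[2] != 0.0);
    return {{h[0] / h[2], h[1] / h[2]}};
}
```

Multiplying every entry of `A` by the same non-zero constant leaves the returned pixel coordinates unchanged, which is exactly what "up to a constant factor" means for the projective relation.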

The tilde $\sim$ means that this is a projective relation; in formal terms it is an ordinary equation up to a constant non-zero factor. For each correspondence $\mathbf{X}_i \leftrightarrow \mathbf{x}_i$, eliminating that unknown factor gives two linear equations in the entries of $\mathbf{A}$:

$\begin{bmatrix} \mathbf{0}^T & - \mathbf{X}_i^T & v_i \mathbf{X}_i^T \\ \mathbf{X}_i^T & \mathbf{0}^T & -u_i \mathbf{X}_i^T \end{bmatrix}\begin{pmatrix}\mathbf{A}^1\\ \mathbf{A}^2\\ \mathbf{A}^3\end{pmatrix}=\mathbf{0}$

where each $\mathbf{A}^{iT}$ is a row vector of 4 entries, the $i$-th row of $\mathbf{A}$, $\mathbf{X}_i$ is the homogenized world point, and $\mathbf{0}^T \in \mathbb{R}^{1 \times 4}$ is a row vector of 4 zeros. Stacking the equations of a set of $N$ point correspondences, we obtain a $2N \times 12$ matrix, called $\mathbf{G} \in \mathbb{R}^{2N \times 12}$.

For the curious reader, the matrix $\mathbf{G}\in \mathbb{R}^{2N \times 12}$ can be explicitly built in the following way:
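Written out entry by entry from the two-row relation above, and assuming the blocks are simply stacked in the order of the correspondences, the matrix reads:

$\mathbf{G} = \begin{bmatrix} 0 & 0 & 0 & 0 & -x_1 & -y_1 & -z_1 & -1 & v_1 x_1 & v_1 y_1 & v_1 z_1 & v_1 \\ x_1 & y_1 & z_1 & 1 & 0 & 0 & 0 & 0 & -u_1 x_1 & -u_1 y_1 & -u_1 z_1 & -u_1 \\ \vdots & & & & & & & & & & & \vdots \\ 0 & 0 & 0 & 0 & -x_N & -y_N & -z_N & -1 & v_N x_N & v_N y_N & v_N z_N & v_N \\ x_N & y_N & z_N & 1 & 0 & 0 & 0 & 0 & -u_N x_N & -u_N y_N & -u_N z_N & -u_N \end{bmatrix}$

The unknown is the 12-vector $\mathbf{a} = (\mathbf{A}^{1T}, \mathbf{A}^{2T}, \mathbf{A}^{3T})^T$, and the system to solve is $\mathbf{G}\mathbf{a} = \mathbf{0}$.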



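The matrix $\mathbf{A}$ is then found as the least-squares solution of $\mathbf{G}\mathbf{a}=\mathbf{0}$ subject to $\|\mathbf{a}\|=1$ (at least 6 correspondences are needed). Computing the singular value decomposition $\mathbf{G}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$, the elements of $\mathbf{A}$ are the 12 values contained in the last column of the unitary matrix $\mathbf{V}$, the one associated with the smallest singular value, rearranged row by row into a $3 \times 4$ matrix, since the unknown vector stacks the rows $\mathbf{A}^1, \mathbf{A}^2, \mathbf{A}^3$ one after the other.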

The following code snippet does the dirty job:
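A minimal sketch of the assembly step, written in plain C++ without Eigen so that it stands alone. The type aliases and the function names `buildG` and `residual` are illustrative assumptions, not the names used in the repository. The sketch only builds $\mathbf{G}$ and evaluates $\mathbf{G}\mathbf{a}$ for a candidate vector $\mathbf{a}$ that stacks the rows of $\mathbf{A}$; the null vector itself would come from an SVD routine, for instance Eigen's `JacobiSVD` and its `matrixV()` accessor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec2  = std::array<double, 2>;   // image point (u, v)
using Vec3  = std::array<double, 3>;   // world point (x, y, z)
using Row12 = std::array<double, 12>;  // one row of G

// Assembles the 2N x 12 matrix G from N correspondences X_i <-> x_i.
// Hypothetical helper; the layout follows the two-row block of the text.
std::vector<Row12> buildG(const std::vector<Vec3>& world,
                          const std::vector<Vec2>& image) {
    assert(world.size() == image.size());
    std::vector<Row12> G;
    G.reserve(2 * world.size());
    for (std::size_t i = 0; i < world.size(); ++i) {
        // Homogenized world point.
        const std::array<double, 4> X = {{world[i][0], world[i][1], world[i][2], 1.0}};
        const double u = image[i][0];
        const double v = image[i][1];
        Row12 r1{}, r2{};  // value-initialized to zeros
        for (int k = 0; k < 4; ++k) {
            r1[4 + k] = -X[k];      // first row:  [ 0^T | -X^T |  v X^T ]
            r1[8 + k] = v * X[k];
            r2[k]     = X[k];       // second row: [ X^T |  0^T | -u X^T ]
            r2[8 + k] = -u * X[k];
        }
        G.push_back(r1);
        G.push_back(r2);
    }
    return G;
}

// Evaluates G * a, where a stacks the three rows of A one after the other.
// For the exact A and noise-free points every entry is zero.
std::vector<double> residual(const std::vector<Row12>& G, const Row12& a) {
    std::vector<double> r;
    r.reserve(G.size());
    for (const Row12& row : G) {
        double s = 0.0;
        for (int k = 0; k < 12; ++k) s += row[k] * a[k];
        r.push_back(s);
    }
    return r;
}
```

With real, noisy measurements the residual is not exactly zero, which is why the solution is taken as the singular vector with the smallest singular value rather than an exact null vector.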

What is interesting now, and not very clear from the sources, is how to get from a 3×4 matrix like $\mathbf{A}$ to the two 4×4 matrices of OpenGL, namely GL_MODELVIEW_MATRIX and GL_PROJECTION_MATRIX.

The problem is that the OpenGL conventions differ from those usually adopted in the computer vision literature. In computer vision the image origin is in the upper-left corner, the y-axis points down and the camera looks down the positive z-axis; in OpenGL the window origin is in the lower-left corner, the y-axis points up and the camera looks down the negative z-axis. All these things make the computation of the OpenGL projection and modelview matrices (here denoted respectively with $\mathbf{P}\in\mathbb{R}^{4\times 4}$ and $\mathbf{MV}\in\mathbb{R}^{4\times 4}$) definitely a pain.

In order to get the OpenGL projection matrix from the computer vision projection matrix, one needs to understand better what the projection matrix $\mathbf{A}$ contains in terms of the camera model and how it can be decomposed into its extrinsic and intrinsic parts. This process is also known as camera resectioning or decomposition of the camera matrix.

### Homography and pinhole camera model

The homography matrix $\mathbf{A}$, also called projection matrix in the computer vision literature, has 11 degrees of freedom, which come from the pinhole camera model. The first 5 parameters are called intrinsic parameters and they encompass focal length, image format, and principal point.

In this simplified basic pinhole model, we consider the central projection of points in space onto a plane. The center of projection is the origin of a Euclidean coordinate system and the image (or focal) plane is the plane at $Z=f$, where $f$ is the focal distance. Under this model, a point with world coordinates $\mathbf{x}=(x,y,z)^T$ is mapped to the point on the image plane where the line joining $\mathbf{x}$ to the center of projection $\mathbf{C}$ meets the image plane, i.e. to $(fx/z, fy/z)$. The center of projection is called the camera center, also known as optical center, while the line from the camera center $\mathbf{C}$ perpendicular to the image plane is called the principal axis or principal ray. The plane through the camera center parallel to the image plane is called the principal plane of the camera.

The principal point is the intersection of the principal ray with the image plane, expressed in image coordinates. In a simple setup, with a camera centered at (0,0,0) looking down the z-axis at a screen resolution of 1024×768, the principal point sits at exactly half the width and half the height, i.e. at (512, 384). All the camera intrinsic parameters are contained in an upper-triangular $3\times 3$ matrix that we call $\mathbf{K}$.
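For reference, the layout of $\mathbf{K}$ most commonly found in the literature is sketched below. The symbols $f_x, f_y$ (focal length in pixel units along the two image axes), $s$ (skew) and $(c_x, c_y)$ (principal point) are notation introduced here for illustration and do not appear elsewhere in this article:

$\mathbf{K} = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$

These are the 5 intrinsic parameters. In the 1024×768 setup just described one would have $c_x = 512$ and $c_y = 384$, and for square pixels without skew $f_x = f_y$ and $s = 0$.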

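The camera extrinsic parameters are the 3 camera orientation angles and the 3 camera translations along the x, y, z axes, for a total of 6 extrinsic parameters. Numerically the camera center $\mathbf{C}$, in homogeneous coordinates, is the right null vector of $\mathbf{A}$, because $\mathbf{A}\mathbf{C}=\mathbf{0}$. Denoting by $\mathbf{p}_j$ the $j$-th column of $\mathbf{A}$, its four homogeneous coordinates (the first three are divided by $C_w$ to get the Euclidean position) can be obtained as:

$\begin{matrix} C_x=\det([\mathbf{p}_2,\mathbf{p}_3,\mathbf{p}_4]) & C_y=-\det([\mathbf{p}_1,\mathbf{p}_3,\mathbf{p}_4]) & C_z=\det([\mathbf{p}_1,\mathbf{p}_2,\mathbf{p}_4]) & C_w=-\det([\mathbf{p}_1,\mathbf{p}_2,\mathbf{p}_3]) \end{matrix}$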

while the orientations are better written in terms of a $3\times 3$ rotation matrix $\mathbf{R}$. Both the matrices $\mathbf{R}$ and $\mathbf{K}$ are computed from the upper-left 3×3 part of the original projection matrix $\mathbf{A}$. We can write:

$\mathbf{A}=[\mathbf{M} \,|\, - \mathbf{M}\mathbf{C}]=\mathbf{K}[\mathbf{R} \,|\, - \mathbf{R}\mathbf{C}]$
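The determinant formulas for the camera center can be sketched in a few lines of plain C++. The sketch uses the three determinants given earlier together with the fourth homogeneous coordinate $-\det([\mathbf{p}_1,\mathbf{p}_2,\mathbf{p}_3])$, by which the other three are divided. The names `det3` and `cameraCenter` and the `Mat34` alias are assumptions made for this example.

```cpp
#include <array>
#include <cassert>

using Mat34 = std::array<std::array<double, 4>, 3>;  // 3x4 projection matrix A
using Vec3  = std::array<double, 3>;

// Determinant of the 3x3 matrix formed by columns a, b, c (0-based) of A.
double det3(const Mat34& A, int a, int b, int c) {
    return A[0][a] * (A[1][b] * A[2][c] - A[1][c] * A[2][b])
         - A[0][b] * (A[1][a] * A[2][c] - A[1][c] * A[2][a])
         + A[0][c] * (A[1][a] * A[2][b] - A[1][b] * A[2][a]);
}

// Euclidean camera center, i.e. the right null vector of A divided by its
// last homogeneous coordinate. Hypothetical helper, for illustration only.
Vec3 cameraCenter(const Mat34& A) {
    const double cx =  det3(A, 1, 2, 3);
    const double cy = -det3(A, 0, 2, 3);
    const double cz =  det3(A, 0, 1, 3);
    const double cw = -det3(A, 0, 1, 2);
    assert(cw != 0.0);  // cw == 0 would mean a camera at infinity
    return {{cx / cw, cy / cw, cz / cw}};
}
```

Because all four determinants scale by the same factor when $\mathbf{A}$ is multiplied by a constant, the result does not depend on the arbitrary scale of the estimated matrix.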

Now we can form the intrinsic  $3\times 3$ matrix $\mathbf{K}$ and the extrinsic orientations from RQ decomposition of $\mathbf{M}$, paying attention that the axes are correct. It turns out that the correct code to accomplish the RQ decomposition is the following, where we require, in order to remove ambiguities, that K has positive diagonal entries:
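A minimal sketch of such a decomposition for the 3×3 case, in plain C++. Instead of a library routine it uses Gram-Schmidt orthogonalization on the rows of $\mathbf{M}$, starting from the last row, which yields $\mathbf{M}=\mathbf{K}\mathbf{R}$ with $\mathbf{K}$ upper triangular and with a positive diagonal by construction. The names `KR` and `rqDecompose` are hypothetical, and the sketch does not apply any OpenGL-specific axis change.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // row-major: M[row][col]

struct KR {
    Mat3 K;  // upper triangular, positive diagonal
    Mat3 R;  // rotation matrix, its rows are orthonormal
};

double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

double det(const Mat3& m) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

// RQ decomposition of a non-singular 3x3 matrix (hypothetical helper).
KR rqDecompose(Mat3 M) {
    // A is only defined up to scale, so M may be negated to get det(M) > 0.
    // Together with the positive diagonal of K this gives det(R) = +1.
    if (det(M) < 0.0)
        for (auto& row : M)
            for (double& x : row) x = -x;

    Mat3 K{}, R{};  // value-initialized to zeros
    for (int i = 2; i >= 0; --i) {
        Vec3 v = M[i];
        // Remove the components along the rows of R found so far.
        for (int j = 2; j > i; --j) {
            K[i][j] = dot(M[i], R[j]);
            for (int k = 0; k < 3; ++k) v[k] -= K[i][j] * R[j][k];
        }
        K[i][i] = std::sqrt(dot(v, v));  // length of what is left, >= 0
        assert(K[i][i] > 0.0);           // fails only if M is singular
        for (int k = 0; k < 3; ++k) R[i][k] = v[k] / K[i][i];
    }
    return {K, R};
}
```

Dividing `K` by `K[2][2]` afterwards gives the usual normalization in which the bottom-right entry of the intrinsic matrix is 1.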

## From intrinsic and extrinsic matrices to OpenGL Projection and ModelView matrices

Now that we have developed all the necessary knowledge on homography estimation, pinhole model and camera resectioning, we can obtain the infamous OpenGL matrices.

First compute the GL_PROJECTION_MATRIX, this is done in the following code snippet:
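A minimal sketch of one common way to do this, which may differ in its conventions from the class in the repository. It assumes that $\mathbf{K}$ is normalized so that its bottom-right entry is 1, that pixel coordinates have their origin at the top-left corner of a `width` × `height` image with the vertical axis pointing down, and that the near and far clipping distances `zNear` and `zFar` are chosen by the user. All names are hypothetical.

```cpp
#include <array>
#include <cassert>

using Mat3 = std::array<std::array<double, 3>, 3>;  // row-major intrinsics K
using Mat4 = std::array<std::array<double, 4>, 4>;  // row-major 4x4 matrix

// Builds an OpenGL-style projection matrix (mathematical row/column layout)
// from the intrinsic matrix K. Assumes K[2][2] == 1 and a top-left pixel origin.
Mat4 projectionFromIntrinsics(const Mat3& K, double width, double height,
                              double zNear, double zFar) {
    assert(width > 0.0 && height > 0.0 && zNear > 0.0 && zFar > zNear);
    const double fx = K[0][0], s = K[0][1], cx = K[0][2];
    const double fy = K[1][1], cy = K[1][2];

    Mat4 P{};  // value-initialized to zeros
    P[0][0] = 2.0 * fx / width;
    P[0][1] = -2.0 * s / width;          // skew; sign follows the y flip
    P[0][2] = 1.0 - 2.0 * cx / width;    // horizontal principal point offset
    P[1][1] = 2.0 * fy / height;
    P[1][2] = 2.0 * cy / height - 1.0;   // vertical offset, image y points down
    P[2][2] = -(zFar + zNear) / (zFar - zNear);
    P[2][3] = -2.0 * zFar * zNear / (zFar - zNear);
    P[3][2] = -1.0;                      // camera looks down the negative z-axis
    return P;
}

// OpenGL stores matrices column by column; this flattens P accordingly,
// in the layout expected by calls such as glLoadMatrixd.
std::array<double, 16> toColumnMajor(const Mat4& P) {
    std::array<double, 16> out{};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[c * 4 + r] = P[r][c];
    return out;
}
```

With the principal point at the image center the third column reduces to that of a symmetric perspective frustum, which is a quick sanity check for the signs.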

then the simpler GL_MODELVIEW_MATRIX is obtained in the following snippet:
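A minimal sketch of this step, again under stated assumptions rather than as a copy of the repository code. It takes the rotation $\mathbf{R}$ and the Euclidean camera center $\mathbf{C}$ from the decomposition $\mathbf{A}=\mathbf{K}[\mathbf{R}\,|\,-\mathbf{R}\mathbf{C}]$, assumes the computer vision camera frame has x to the right, y down and z forward, and flips the y and z axes to reach the OpenGL eye frame. The function name `modelViewFromExtrinsics` is hypothetical.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;  // row-major rotation R
using Mat4 = std::array<std::array<double, 4>, 4>;  // row-major 4x4 matrix

// Builds the modelview matrix (mathematical row/column layout) that maps
// world coordinates to OpenGL eye coordinates: MV = [ D*R | -D*R*C ; 0 0 0 1 ]
// with D = diag(1, -1, -1).
Mat4 modelViewFromExtrinsics(const Mat3& R, const Vec3& C) {
    const double flip[3] = {1.0, -1.0, -1.0};  // y and z change sign in OpenGL
    Mat4 MV{};  // value-initialized to zeros
    for (int r = 0; r < 3; ++r) {
        double t = 0.0;  // r-th component of the translation t = -R * C
        for (int c = 0; c < 3; ++c) {
            MV[r][c] = flip[r] * R[r][c];
            t -= R[r][c] * C[c];
        }
        MV[r][3] = flip[r] * t;
    }
    MV[3][3] = 1.0;
    return MV;
}
```

Since OpenGL stores matrices column by column, the result has to be transposed into a flat array of 16 values before being handed to the fixed-function pipeline.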

## The code

In this long article I have tried to explain as best I can how to get both GL_MODELVIEW_MATRIX and GL_PROJECTION_MATRIX from just the two sets of point correspondences discussed above. I’ve written a C++ class that helps the user in the process. The full code is hosted on my GitHub page:

https://github.com/CarloNicolini/OpenGL-CameraCalibration

Once downloaded, you can try to compile it. If you have CMake you just need to run:

$> git clone https://github.com/CarloNicolini/OpenGL-CameraCalibration CameraCalibration
$> cd CameraCalibration
$> mkdir build
$> cd build
$> cmake ../
$> make testCameraCalibration
$> ./testCameraCalibration [2D_points_file.txt] [3D_points_file.txt]

For the matrix computations I use the Eigen matrix library (http://eigen.tuxfamily.org/), a freely available and very powerful templated, header-only C++ library.

## Example:

In our example we already know the GL_MODELVIEW_MATRIX and the GL_PROJECTION_MATRIX, and we want to recover them from the point correspondences alone. The 3D points are:

1 0 0
0 1 0
0 0 1
-1 -1 -1
1 1 1
0 0 0
-0.5 0.5 0
-0.2 -0.4 0.2
-0.4 0.1 0.8
0.2 0.51 0.118
5 1 2
1 4 6
10 -1 5

and their corresponding 2D images are:

817.258 513.731
769.14 562.358
848.552 519.778
730.073 405.62
851.791 594.5
791.141 500.383
766.171 524.836
805.826 476.392
823.965 516.715
791.94 536.803
992.483 651.365
1113.56 940.746
1261.94 641.017

## References

Yi Ma, S.Soatto – “An invitation to 3D vision”

R.Hartley, A.Zisserman – “Multiple view geometry in computer vision”, 2nd edition.