# VO

## Feature points method

The feature point method is popular because it is quite stable and insensitive to illumination changes and dynamic objects.

Features include corners, edges, and blocks. Corners are the easiest to detect, followed by edges and then blocks.

Plain corners also have problems: a point may look like a corner from far away but no longer be one once we move close to it, and after the camera rotates we may fail to recognize the same corner. For this reason, computer vision researchers designed more elaborate features such as SIFT, SURF, and ORB. A good feature should have these properties:

  1. Repeatability: the same region can be detected again in different images.
  2. Distinctiveness: different regions have different descriptions.
  3. Efficiency: the number of feature points is much smaller than the number of pixels.
  4. Locality: each feature depends only on a small image region.

A feature point consists of a key-point and a descriptor. The key-point is the point's location in the image, sometimes together with an orientation and a scale. The descriptor is usually a vector that describes the pixels surrounding the key-point. Hence, if two descriptors are close in descriptor space, we can treat them as the same feature.

SIFT is very accurate but computationally expensive. We usually use the much cheaper ORB instead.

## ORB feature

The ORB feature consists of key-points and descriptors. Its key-point is called "Oriented FAST". Extracting an ORB feature therefore needs two steps:

  1. FAST corner extraction: detect FAST corners and compute each key-point's orientation, which gives the subsequent BRIEF descriptor rotation invariance.

  2. BRIEF descriptor: describe the surroundings of each key-point.

### FAST key-point

FAST is a corner detector that finds places where the local pixel gray level changes significantly; it is famous for its speed. Its idea is that if a pixel's brightness differs markedly from its neighbours, it is probably a corner. The detection steps are as follows:

  1. Pick a pixel p; its brightness is $I_p$.
  2. Set a threshold $T$ (for example, 20% of $I_p$).
  3. Take p as the center of a circle and pick the 16 pixels on the circle of radius 3 around it.
  4. If the circle contains $N$ contiguous points whose brightness is greater than $I_p + T$ or less than $I_p - T$, treat p as a key-point ($N$ is usually 12).
  5. Repeat the above steps for every pixel.

To be faster, the test first checks only pixels 1, 5, 9, and 13 and rejects most candidates early. Besides, we use non-maximal suppression to keep only the corner with the maximal response in each neighbourhood, which avoids corners clustering together. A minimal detection sketch follows.
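As a concrete illustration, below is a minimal sketch of FAST detection with OpenCV's `cv::FAST`; the image file name and the threshold value are placeholder choices, not taken from the text above.

```cpp
// Minimal FAST detection sketch with OpenCV (image path and threshold are placeholders).
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;

    std::vector<cv::KeyPoint> keypoints;
    const int threshold = 40;             // brightness difference threshold T
    const bool nonmaxSuppression = true;  // keep only locally maximal responses
    cv::FAST(img, keypoints, threshold, nonmaxSuppression);
    // keypoints now holds the detected FAST corners
    return 0;
}
```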

Problems in FAST:

  1. The number of corners is large and uncertain, but we usually want a fixed number of features. Therefore, ORB modifies the original FAST algorithm: it computes the Harris response of each candidate and keeps the N corners with the largest responses as the final corner set.
  2. FAST has a scale problem: something may look like a corner from far away but not when we get close, and plain FAST corners have no orientation. ORB therefore adds scale and rotation descriptions: scale invariance is obtained with an image pyramid, and the feature orientation is computed with the intensity centroid method.

The intensity centroid method uses the gray values of an image block as weights to define a centroid. The steps are as follows (a small code sketch follows the list):

  1. In a small image block $B$, define the moments of the block:

$$m_{pq} = \sum_{x,y \in B} x^p y^q I(x,y), \quad p, q \in \{0, 1\}$$

  2. Through the moments we can get the centroid of the image block:

$$C = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right)$$

  3. Connect the geometric center $O$ of the block and the centroid $C$ to get a direction vector $\overrightarrow{OC}$; the direction of the key-point is then defined as:

$$\theta = \arctan\left( \frac{m_{01}}{m_{10}} \right)$$
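A small sketch of this orientation computation, assuming a square grayscale (CV_8U) patch has already been cut out around the key-point; the helper name `computeOrientation` is ours, not a library call.

```cpp
// Sketch: intensity-centroid orientation of a grayscale patch (assumed CV_8U, square).
#include <opencv2/core.hpp>
#include <cmath>

double computeOrientation(const cv::Mat& patch) {
    double m01 = 0.0, m10 = 0.0;
    const int cx = patch.cols / 2, cy = patch.rows / 2;   // geometric center O
    for (int y = 0; y < patch.rows; ++y) {
        for (int x = 0; x < patch.cols; ++x) {
            const double I = patch.at<uchar>(y, x);
            m10 += (x - cx) * I;   // first-order moment in x
            m01 += (y - cy) * I;   // first-order moment in y
        }
    }
    // angle of the vector from O to the intensity centroid C
    return std::atan2(m01, m10);   // theta = arctan(m01 / m10)
}
```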

### BRIEF descriptor

After extracting the Oriented FAST key-points, we compute the descriptor of every point. ORB uses an improved BRIEF feature.

BRIEF is a binary descriptor: its description vector consists of many 0s and 1s, where each bit encodes the brightness relationship between two pixels $p$ and $q$ near the key-point: if $I(p)$ is larger than $I(q)$ the bit is set to 1, otherwise 0. If we pick 128 such pairs, we get a 128-dimensional vector of 0s and 1s. The pairs are picked randomly following some probability distribution.
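A toy sketch of this bit test, assuming a square grayscale patch around the key-point. The pairs here are drawn with a fixed random seed purely for illustration; the real ORB implementation uses a fixed, precomputed 256-pair pattern.

```cpp
// Toy BRIEF-style descriptor: 128 brightness comparisons inside a patch (assumed CV_8U).
#include <opencv2/core.hpp>
#include <bitset>
#include <random>

std::bitset<128> briefDescriptor(const cv::Mat& patch) {
    std::bitset<128> desc;
    std::mt19937 rng(42);  // fixed seed so every patch uses the same pair pattern
    std::uniform_int_distribution<int> pickRow(0, patch.rows - 1);
    std::uniform_int_distribution<int> pickCol(0, patch.cols - 1);
    for (int i = 0; i < 128; ++i) {
        const uchar p = patch.at<uchar>(pickRow(rng), pickCol(rng));
        const uchar q = patch.at<uchar>(pickRow(rng), pickCol(rng));
        desc[i] = p > q;   // bit is 1 if I(p) > I(q), otherwise 0
    }
    return desc;
}
```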

### Feature matching

Feature matching is a significant step in SLAM: it deals with the data association problem, i.e., determining the correspondence between the landmarks we see now and the landmarks we have seen before. With accurate matching we reduce the burden on pose estimation and optimization, but mismatches always limit the performance of SLAM.

How do we match these points? The easiest method is the Brute-Force Matcher: for every key-point in the first image, compare the distance of its descriptor to every descriptor in the second image and keep the nearest one. For binary descriptors such as BRIEF, the distance is the Hamming distance.

But as the number of key-points grows, the brute-force matcher becomes very slow. We use FLANN (Fast Library for Approximate Nearest Neighbors) when there is a large number of key-points.
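A minimal end-to-end sketch with OpenCV's `ORB` and a Hamming-distance `BFMatcher`; the file names are placeholders and the usual distance-based filtering of matches is omitted.

```cpp
// ORB extraction + brute-force Hamming matching (file names are placeholders).
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat img1 = cv::imread("img1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("img2.png", cv::IMREAD_GRAYSCALE);

    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);   // keep at most 500 Oriented FAST corners
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    orb->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    orb->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    cv::BFMatcher matcher(cv::NORM_HAMMING);       // BRIEF is binary, so use Hamming distance
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);
    // For very large key-point sets, cv::FlannBasedMatcher (e.g. with an LSH index) is the usual alternative.
    return 0;
}
```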

Then we hope to estimate the camera's motion from the matched points. The approach depends on the type of camera:

  1. When the camera is monocular, we only know the 2D pixel coordinates, so we estimate the motion from pairs of 2D points using epipolar geometry.
  2. When the camera is stereo or RGB-D, we also know the distances, so we estimate the motion from pairs of 3D points using ICP.
  3. If we have 3D points and their projections in the image, we solve the motion using PnP.

## 2D-2D: Epipolar Geometry

### Epipolar constraint

We want to solve the motion between two frames $I_1$ and $I_2$. Suppose the motion from the first frame to the second frame is $R, t$, and the camera centers are $O_1$ and $O_2$. Let $p_1$ be a key-point in $I_1$ and $p_2$ its match in $I_2$. The rays $O_1 p_1$ and $O_2 p_2$ intersect at a spatial point $P$. The three points $O_1$, $O_2$, $P$ determine a plane called the epipolar plane. The line $O_1 O_2$ intersects the image planes $I_1, I_2$ at $e_1$ and $e_2$; these points are called the epipoles, and $O_1 O_2$ is called the baseline. The lines $l_1 = p_1 e_1$ and $l_2 = p_2 e_2$ are called the epipolar lines.

Suppose the spatial coordinates of $P$ in the first camera frame are:

$$P = [X, Y, Z]^T$$

The pixel coordinates of $p_1$ and $p_2$ satisfy:

$$s_1 p_1 = K P, \quad s_2 p_2 = K (R P + t)$$

Here $K$ is the camera intrinsic matrix, and $R, t$ describe the motion of the camera. If we use homogeneous coordinates, these formulas hold up to multiplication by a non-zero constant, i.e., up to scale:

$$p_1 \simeq K P, \quad p_2 \simeq K (R P + t)$$

Now, suppose:

$$x_1 = K^{-1} p_1, \quad x_2 = K^{-1} p_2$$

$x_1$ and $x_2$ here are the normalized coordinates. Substituting them into the formula above, we get:

$$x_2 \simeq R x_1 + t$$

Left-multiply by $t^\wedge$ (the skew-symmetric matrix of $t$):

$$t^\wedge x_2 \simeq t^\wedge R x_1$$

Then left-multiply by $x_2^T$:

$$x_2^T t^\wedge x_2 = x_2^T t^\wedge R x_1$$

Looking at the left side, $t^\wedge x_2$ is a vector perpendicular to both $t$ and $x_2$, so its inner product with $x_2$ is 0. Hence:

$$x_2^T t^\wedge R x_1 = 0$$

Substituting $p_1, p_2$ back in, we get:

$$p_2^T K^{-T} t^\wedge R K^{-1} p_1 = 0$$

These two equations are called the epipolar constraint. Their geometric meaning is that $O_1$, $P$, $O_2$ lie in the same plane. We name the middle parts of the constraint as two matrices, the fundamental matrix $F$ and the essential matrix $E$, and simplify the epipolar constraint:

$$E = t^\wedge R, \quad F = K^{-T} E K^{-1}, \quad x_2^T E x_1 = p_2^T F p_1 = 0$$

The camera pose estimation problem then turns into the following two steps (see the sketch below):

  1. Solve for E or F from the pixel positions of the matched points.

  2. Solve for R, t from E or F.
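With OpenCV, the two steps can be sketched as follows, assuming `pts1`/`pts2` are the matched pixel coordinates and `K` is the 3x3 intrinsic matrix; `poseFrom2d2d` is a name we introduce here.

```cpp
// Sketch: estimate E from matched pixels, then decompose it into R, t.
#include <opencv2/opencv.hpp>
#include <vector>

void poseFrom2d2d(const std::vector<cv::Point2f>& pts1,
                  const std::vector<cv::Point2f>& pts2,
                  const cv::Mat& K, cv::Mat& R, cv::Mat& t) {
    // Step 1: solve E from the paired pixel positions (RANSAC also rejects mismatches).
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC);
    // Step 2: decompose E into R, t; the cheirality check picks one of the 4 candidates.
    cv::recoverPose(E, pts1, pts2, K, R, t);
}
```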

### Essential matrix

By definition, the essential matrix $E = t^\wedge R$ is a 3x3 matrix. Its properties are as follows:

  1. Multiplying E by any non-zero constant still satisfies the epipolar constraint; this is called scale equivalence.
  2. Its singular values must be of the form $[\sigma, \sigma, 0]^T$.
  3. Because of scale equivalence, E has 6 - 1 = 5 degrees of freedom (t and R contribute 3 each, minus one for scale).

We can use the eight-point algorithm to estimate E.

Consider a pair of matched points with normalized coordinates $x_1 = [u_1, v_1, 1]^T$ and $x_2 = [u_2, v_2, 1]^T$. According to the epipolar constraint:

$$(u_2, v_2, 1) \begin{pmatrix} e_1 & e_2 & e_3 \\ e_4 & e_5 & e_6 \\ e_7 & e_8 & e_9 \end{pmatrix} \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} = 0$$

Expand the matrix E into a vector:

$$e = [e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9]^T$$

Write the constraint in terms of e:

$$[u_2 u_1, \; u_2 v_1, \; u_2, \; v_2 u_1, \; v_2 v_1, \; v_2, \; u_1, \; v_1, \; 1] \cdot e = 0$$

Put all eight point pairs into one linear system:

$$\begin{pmatrix} u_2^1 u_1^1 & u_2^1 v_1^1 & u_2^1 & v_2^1 u_1^1 & v_2^1 v_1^1 & v_2^1 & u_1^1 & v_1^1 & 1 \\ \vdots & & & & & & & & \vdots \\ u_2^8 u_1^8 & u_2^8 v_1^8 & u_2^8 & v_2^8 u_1^8 & v_2^8 v_1^8 & v_2^8 & u_1^8 & v_1^8 & 1 \end{pmatrix} e = 0$$

With eight pairs of points we can solve for E, and then recover R, t. This recovery is done through the SVD:

$$E = U \Sigma V^T$$

U and V are orthogonal matrices, and $\Sigma$ is the singular value matrix; from E's intrinsic property we know $\Sigma = \mathrm{diag}(\sigma, \sigma, 0)$. Through the SVD decomposition, for any E there are two possible corresponding t, R pairs:

$$t_1^\wedge = U R_Z(\tfrac{\pi}{2}) \Sigma U^T, \quad R_1 = U R_Z^T(\tfrac{\pi}{2}) V^T$$

$$t_2^\wedge = U R_Z(-\tfrac{\pi}{2}) \Sigma U^T, \quad R_2 = U R_Z^T(-\tfrac{\pi}{2}) V^T$$

$R_Z(\pi/2)$ represents a rotation of 90 degrees around the Z axis. Because $-E$ is equivalent to E, negating t also gives a valid solution, so there are 4 possible solutions in total. Only one of them places the observed point in front of both cameras, which is how the correct one is selected.

A last problem: the E obtained from the linear equations may not satisfy its intrinsic property that the singular values are $(\sigma, \sigma, 0)$. The usual remedy is: take the SVD of the estimated E to get the singular value matrix $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3)$ with $\sigma_1 \geq \sigma_2 \geq \sigma_3$, and then set:

$$E = U \,\mathrm{diag}\!\left(\frac{\sigma_1+\sigma_2}{2}, \frac{\sigma_1+\sigma_2}{2}, 0\right) V^T$$

This amounts to projecting the estimated matrix onto the manifold that E lives on. Of course, an even simpler choice is to set the singular values to $\mathrm{diag}(1, 1, 0)$, because E has scale equivalence.
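A sketch of this projection with Eigen; the function name `projectToEssential` is ours.

```cpp
// Project an estimated E onto the essential-matrix manifold: force singular values to (s, s, 0).
#include <Eigen/Dense>

Eigen::Matrix3d projectToEssential(const Eigen::Matrix3d& E) {
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(E, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Vector3d s = svd.singularValues();      // sigma1 >= sigma2 >= sigma3
    double sigma = 0.5 * (s(0) + s(1));            // average of the two largest values
    Eigen::Vector3d fixed(sigma, sigma, 0.0);      // diag(1, 1, 0) would also do, up to scale
    return svd.matrixU() * fixed.asDiagonal() * svd.matrixV().transpose();
}
```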

### Homography matrix

If all of the feature points in the scene lie on the same plane, we can use the homography matrix to estimate the motion instead.

### Discussion

E, F, and H can all be decomposed to recover the motion, but H requires all the feature points to lie on a plane. The t obtained from the decomposition has scale equivalence: multiplying it by any non-zero constant still fits the decomposition. Hence, we usually normalize t so that its length equals 1.

#### Scale ambiguity

The normalization of t leads to scale ambiguity: if we multiply t by a constant, the equations still hold. In monocular SLAM, normalizing t during initialization fixes the scale of the whole trajectory and map.

Another normalization method is to set the average scene depth to 1, which tends to be numerically more stable.

#### Pure rotation in initialization

If the camera only rotates, then t is zero, $E = t^\wedge R$ is also zero, and we cannot solve for R. So during initialization the camera must translate, not only rotate.

#### More than 8 pairs of points

For the eight-point method, we can stack the constraints into a coefficient matrix A:

$$A e = 0$$

For exactly 8 points the size of A is 8x9. If the number of matched point pairs is larger than 8, the system is over-determined and may have no exact non-zero solution e. Hence, we instead minimize a quadratic cost:

$$\min_e \| A e \|_2^2, \quad \text{s.t. } \|e\| = 1$$

This gives E in the least-squares sense, but when mismatches exist we usually prefer Random Sample Consensus (RANSAC).
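A sketch of the over-determined linear solve with Eigen, assuming the inputs are already normalized coordinates: each matched pair contributes one row of A, and the minimizer of $\|Ae\|$ under $\|e\|=1$ is the right singular vector associated with the smallest singular value. The function name `estimateE` is ours, and no RANSAC is included.

```cpp
// Least-squares eight-point estimate of E from n >= 8 normalized correspondences.
#include <Eigen/Dense>
#include <vector>

Eigen::Matrix3d estimateE(const std::vector<Eigen::Vector2d>& x1,
                          const std::vector<Eigen::Vector2d>& x2) {
    const int n = static_cast<int>(x1.size());
    Eigen::MatrixXd A(n, 9);
    for (int i = 0; i < n; ++i) {
        const double u1 = x1[i](0), v1 = x1[i](1);
        const double u2 = x2[i](0), v2 = x2[i](1);
        A.row(i) << u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0;
    }
    // min ||A e|| s.t. ||e|| = 1  ->  last column of V.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeFullV);
    Eigen::VectorXd e = svd.matrixV().col(8);
    Eigen::Matrix3d E;
    E << e(0), e(1), e(2),
         e(3), e(4), e(5),
         e(6), e(7), e(8);
    return E;   // remember to project onto the essential manifold afterwards
}
```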

## Triangulation

Once we have estimated the camera motion, we still need to estimate the spatial coordinates of the feature points; this is done by triangulation.

Because of noise, the two rays may not exactly intersect, so we solve for the point with a least squares method.
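A sketch with OpenCV's `cv::triangulatePoints`, under the assumptions that the two 3x4 pose matrices `T1`, `T2` are CV_32F and that the matched points are given in normalized camera coordinates; `triangulate` is a helper name we introduce.

```cpp
// Triangulate matched points given two 3x4 (CV_32F) pose matrices in normalized coordinates.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Point3d> triangulate(const cv::Mat& T1, const cv::Mat& T2,
                                     const std::vector<cv::Point2f>& pts1,
                                     const std::vector<cv::Point2f>& pts2) {
    cv::Mat pts4d;
    cv::triangulatePoints(T1, T2, pts1, pts2, pts4d);   // least-squares solution per point
    std::vector<cv::Point3d> result;
    for (int i = 0; i < pts4d.cols; ++i) {
        cv::Mat x = pts4d.col(i);
        x /= x.at<float>(3, 0);                          // homogeneous -> inhomogeneous
        result.emplace_back(x.at<float>(0, 0), x.at<float>(1, 0), x.at<float>(2, 0));
    }
    return result;
}
```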

### Triangulation discussion

Triangulation requires translation: if the camera only rotates, we cannot triangulate. There are two ways to raise the accuracy: 1. improve the measurement accuracy of the feature points; 2. increase the translation. But we cannot increase the translation arbitrarily, because with a large baseline the illumination and appearance of the scene change significantly and matching becomes harder.

There are also depth filters, which reduce the variance of the depth estimate by fusing repeated measurements.

## 3D-2D: PnP

PnP uses known 3D points, e.g. from a stereo or RGB-D sensor, so it avoids the initialization, pure rotation, and scale problems of the 2D-2D methods.

There are many solution methods: P3P, DLT, EPnP, UPnP, and so on. Besides, we can use nonlinear optimization to build a least squares problem; this is Bundle Adjustment.

### DLT

Consider a spatial point P with homogeneous coordinates $P = (X, Y, Z, 1)^T$. In image $I_1$ its projection is $p_1 = (u_1, v_1, 1)^T$, and R, t are unknown. Define the augmented matrix $[R|t]$ of size 3x4 and expand the projection:

$$s \begin{pmatrix} u_1 \\ v_1 \\ 1 \end{pmatrix} = \begin{pmatrix} t_1 & t_2 & t_3 & t_4 \\ t_5 & t_6 & t_7 & t_8 \\ t_9 & t_{10} & t_{11} & t_{12} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

Eliminating s with the last row, we get two constraints:

$$u_1 = \frac{t_1 X + t_2 Y + t_3 Z + t_4}{t_9 X + t_{10} Y + t_{11} Z + t_{12}}, \quad v_1 = \frac{t_5 X + t_6 Y + t_7 Z + t_8}{t_9 X + t_{10} Y + t_{11} Z + t_{12}}$$

To simplify the expression, define the row vectors:

$$\mathbf{t}_1 = (t_1, t_2, t_3, t_4)^T, \quad \mathbf{t}_2 = (t_5, t_6, t_7, t_8)^T, \quad \mathbf{t}_3 = (t_9, t_{10}, t_{11}, t_{12})^T$$

Then we have:

$$\mathbf{t}_1^T P - \mathbf{t}_3^T P \, u_1 = 0$$

and

$$\mathbf{t}_2^T P - \mathbf{t}_3^T P \, v_1 = 0$$

$\mathbf{t}$ is the unknown to solve, and every key-point provides two linear constraints. Suppose we have N key-points; we can list the linear system:

$$\begin{pmatrix} P_1^T & 0 & -u_1 P_1^T \\ 0 & P_1^T & -v_1 P_1^T \\ \vdots & \vdots & \vdots \\ P_N^T & 0 & -u_N P_N^T \\ 0 & P_N^T & -v_N P_N^T \end{pmatrix} \begin{pmatrix} \mathbf{t}_1 \\ \mathbf{t}_2 \\ \mathbf{t}_3 \end{pmatrix} = 0$$

$\mathbf{t}$ has 12 dimensions in total, so we need at least 6 pairs of matched points. This method is called the Direct Linear Transform (DLT). When we have more than 6 pairs of points, we can compute the least squares solution by SVD.

The DLT solution ignores the constraint that the left 3x3 block must be a rotation matrix, so afterwards we should project the solved matrix back onto the SE(3) manifold.

### P3P

3D points: A, B, C are given in the world coordinate frame.

2D points: a, b, c are their projections on the image plane, and O is the camera optical center.

Consider the triangles Oab and OAB (and similarly for the other pairs). Using the law of cosines:

$$\begin{matrix} OA^2 + OB^2 - 2\,OA \cdot OB \cdot \cos\langle a,b\rangle = AB^2 \\ OB^2 + OC^2 - 2\,OB \cdot OC \cdot \cos\langle b,c\rangle = BC^2 \\ OA^2 + OC^2 - 2\,OA \cdot OC \cdot \cos\langle a,c\rangle = AC^2 \end{matrix}$$

Divide all three equations by $OC^2$ and write $x = OA/OC$, $y = OB/OC$:

$$\begin{matrix} x^2 + y^2 - 2xy\cos\langle a,b\rangle = AB^2 / OC^2 \\ y^2 + 1 - 2y\cos\langle b,c\rangle = BC^2 / OC^2 \\ x^2 + 1 - 2x\cos\langle a,c\rangle = AC^2 / OC^2 \end{matrix}$$

Write $v = AB^2/OC^2$, $u = BC^2/AB^2$, $w = AC^2/AB^2$; then we can conclude:

$$\begin{matrix} x^2 + y^2 - 2xy\cos\langle a,b\rangle - v = 0 \\ y^2 + 1 - 2y\cos\langle b,c\rangle - uv = 0 \\ x^2 + 1 - 2x\cos\langle a,c\rangle - wv = 0 \end{matrix}$$

Replace v in the last two equations with its expression from the first one:

$$\begin{matrix} (1-u)y^2 - ux^2 - 2y\cos\langle b,c\rangle + 2uxy\cos\langle a,b\rangle + 1 = 0 \\ (1-w)x^2 - wy^2 - 2x\cos\langle a,c\rangle + 2wxy\cos\langle a,b\rangle + 1 = 0 \end{matrix}$$

Now look at the known and unknown variables. Because we know the 2D positions in the image, the cosines $\cos\langle a,b\rangle$, $\cos\langle b,c\rangle$, $\cos\langle a,c\rangle$ are known. At the same time, $u = BC^2/AB^2$ and $w = AC^2/AB^2$ can be computed from the world coordinates of A, B, C. However, x and y are unknown. Similar to E, the system has up to 4 solutions; we verify them with an extra point, keep the most probable one, obtain the 3D coordinates of A, B, C in the camera frame, and finally compute R, t from the 3D-3D pairs.

P3P's disadvantage:

  1. It only uses 3 points, wasting most of the matched data.
  2. When the points are affected by noise or mismatched, the solution becomes unreliable.

To solve these problems, we use EPnP, UPnP, and so on. In SLAM, the usual approach is to solve an initial pose with P3P/EPnP first and then refine it with Bundle Adjustment, as sketched below.
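A sketch of the first stage with OpenCV's `solvePnP` and its EPnP solver, assuming `pts3d` are the known 3D points, `pts2d` their observed projections, and `K` the intrinsics; the BA refinement that would follow is not shown, and `pnpInitialPose` is a name we introduce.

```cpp
// Initial pose from 3D-2D correspondences via EPnP (to be refined by BA afterwards).
#include <opencv2/opencv.hpp>
#include <vector>

void pnpInitialPose(const std::vector<cv::Point3f>& pts3d,
                    const std::vector<cv::Point2f>& pts2d,
                    const cv::Mat& K, cv::Mat& R, cv::Mat& t) {
    cv::Mat rvec;   // rotation as an axis-angle vector
    cv::solvePnP(pts3d, pts2d, K, cv::Mat(), rvec, t, false, cv::SOLVEPNP_EPNP);
    cv::Rodrigues(rvec, R);   // convert axis-angle to a rotation matrix
}
```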

### Bundle Adjustment

The linear methods introduced above first solve the camera pose and then compute the positions of the spatial points; nonlinear optimization instead treats them all as optimization variables and optimizes them jointly. This idea applies to both PnP and ICP. In the PnP case, this Bundle Adjustment problem is a minimization of the reprojection error.

Consider n spatial points P and their projections p; we wish to compute the camera pose R, t, whose Lie algebra representation is $\xi$. Suppose a spatial point has coordinates $P_i = [X_i, Y_i, Z_i]^T$ and its projection has pixel coordinates $u_i = [u_i, v_i]^T$.

The relation between the pixel position and the spatial point, written with the Lie algebra in matrix form, is:

$$s_i u_i = K \exp(\xi^\wedge) P_i$$

Because of the imprecise camera pose and the observation noise, this equation has an error. We sum the errors over all points, construct a least squares problem, and minimize it to find the best camera pose:

$$\xi^* = \arg\min_\xi \frac{1}{2}\sum_{i=1}^{n} \left\| u_i - \frac{1}{s_i} K \exp(\xi^\wedge) P_i \right\|_2^2$$

The error term compares the observed pixel coordinate with the projection of the 3D point under the currently estimated pose, which is why it is called the reprojection error. Using homogeneous coordinates the error would have 3 dimensions, but the last dimension is always 1 and its error is always zero, so we usually use inhomogeneous coordinates and the error has only 2 components. We know through feature matching that $p_1$ and $p_2$ are projections of the same spatial point P, but we do not know the camera pose. So we adjust the camera pose to shrink this error. Because the adjustment must take many points into account at once, the error usually cannot be reduced exactly to zero.

Using the Lie algebra, we can build an unconstrained optimization problem and solve it with G-N or L-M. Before applying G-N or L-M, we need the derivative of each error term with respect to the optimization variables, i.e., the linearization:

$$e(x + \Delta x) \approx e(x) + J \Delta x$$

If e is the pixel coordinate error (2 dimensions) and x is the camera pose (6 dimensions), J will be a 2x6 matrix. We now derive the form of J.

Recall from the Lie algebra chapter how to compute derivatives with respect to a Lie algebra using the perturbation model. First, denote the point transformed into the camera frame as P', taking the first 3 dimensions of the transformed homogeneous point:

$$P' = \left(\exp(\xi^\wedge) P\right)_{1:3} = [X', Y', Z']^T$$

The camera projection model relative to P' is:

$$s\, u = K P'$$

Expanding it, we get:

$$\begin{pmatrix} s u \\ s v \\ s \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X' \\ Y' \\ Z' \end{pmatrix}$$

Eliminating s with the 3rd row:

$$u = f_x \frac{X'}{Z'} + c_x, \quad v = f_y \frac{Y'}{Z'} + c_y$$

With the intermediate variable P' defined, we left-multiply a small perturbation $\delta\xi$ onto $\xi$ and differentiate the error with respect to it. By the chain rule:

$$\frac{\partial e}{\partial \delta\xi} = \lim_{\delta\xi \to 0} \frac{e(\delta\xi \oplus \xi) - e(\xi)}{\delta\xi} = \frac{\partial e}{\partial P'} \frac{\partial P'}{\partial \delta\xi}$$

The $\oplus$ here represents left-multiplying the perturbation in the Lie algebra. The first factor is the derivative of the error with respect to the projected point, which follows from the projection equations above:

$$\frac{\partial e}{\partial P'} = - \begin{pmatrix} \frac{\partial u}{\partial X'} & \frac{\partial u}{\partial Y'} & \frac{\partial u}{\partial Z'} \\ \frac{\partial v}{\partial X'} & \frac{\partial v}{\partial Y'} & \frac{\partial v}{\partial Z'} \end{pmatrix} = - \begin{pmatrix} \frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^2} \\ 0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^2} \end{pmatrix}$$

Leaving several intermediate steps out (the second factor is $\partial P' / \partial \delta\xi = [I, -P'^\wedge]$ with translation first), we get:

$$\frac{\partial e}{\partial \delta\xi} = - \begin{pmatrix} \frac{f_x}{Z'} & 0 & -\frac{f_x X'}{Z'^2} & -\frac{f_x X' Y'}{Z'^2} & f_x + \frac{f_x X'^2}{Z'^2} & -\frac{f_x Y'}{Z'} \\ 0 & \frac{f_y}{Z'} & -\frac{f_y Y'}{Z'^2} & -f_y - \frac{f_y Y'^2}{Z'^2} & \frac{f_y X' Y'}{Z'^2} & \frac{f_y X'}{Z'} \end{pmatrix}$$
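A small sketch of this 2x6 Jacobian as code, using Eigen and the translation-first convention of the derivation above; `reprojJacobian` is a name we introduce, and `Pc` stands for the transformed point P'.

```cpp
// 2x6 Jacobian of the reprojection error w.r.t. a left pose perturbation (translation first).
#include <Eigen/Dense>

Eigen::Matrix<double, 2, 6> reprojJacobian(const Eigen::Vector3d& Pc, double fx, double fy) {
    const double X = Pc(0), Y = Pc(1), Z = Pc(2);
    const double Z2 = Z * Z;
    Eigen::Matrix<double, 2, 6> J;
    J << -fx / Z, 0,       fx * X / Z2, fx * X * Y / Z2,      -(fx + fx * X * X / Z2), fx * Y / Z,
         0,       -fy / Z, fy * Y / Z2, fy + fy * Y * Y / Z2, -fy * X * Y / Z2,        -fy * X / Z;
    return J;
}
```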

#### Using BA optimization

g2o vertices: the camera pose $\xi$ and all of the spatial point positions.

g2o edges: the projection of every 3D point into the second camera, described by the observation equation:

$$z_j = h(\xi, P_j)$$

After we optimize the pose using BA, we can see that the R matrix changes slightly compared with the linear PnP result.

## 3D-3D: ICP

Suppose we have a set of matched 3D points:

$$P = \{p_1, \dots, p_n\}, \quad P' = \{p'_1, \dots, p'_n\}$$

Now we want to find a Euclidean transformation R, t such that:

$$\forall i, \; p_i = R p'_i + t$$

This can be solved with the Iterative Closest Point (ICP) method. Notice that no camera model appears in the 3D-3D formulation. In laser SLAM the point-to-point correspondences are unknown, which is why the method iterates over the closest points; in visual SLAM, feature matching gives us the correspondences directly. The problem can be solved in two ways: the SVD method and nonlinear optimization.

### SVD method

We first define the error of the i-th pair of points:

$$e_i = p_i - (R p'_i + t)$$

Then we construct the least squares problem and minimize the sum of squared errors:

$$\min_{R,t} \frac{1}{2}\sum_{i=1}^{n} \left\| p_i - (R p'_i + t) \right\|_2^2$$
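This least squares problem has a closed-form solution: subtract the centroids of the two point sets, accumulate $W = \sum_i q_i q_i'^T$ from the de-meaned points, take its SVD $W = U \Sigma V^T$, and set $R = U V^T$, $t = \bar{p} - R \bar{p}'$. A sketch with Eigen follows; the function name `icpSvd` is ours.

```cpp
// Closed-form 3D-3D alignment (SVD method): p_i = R * p'_i + t (up to noise).
#include <Eigen/Dense>
#include <vector>

void icpSvd(const std::vector<Eigen::Vector3d>& p,
            const std::vector<Eigen::Vector3d>& p_prime,
            Eigen::Matrix3d& R, Eigen::Vector3d& t) {
    const size_t n = p.size();
    Eigen::Vector3d c1 = Eigen::Vector3d::Zero(), c2 = Eigen::Vector3d::Zero();
    for (size_t i = 0; i < n; ++i) { c1 += p[i]; c2 += p_prime[i]; }
    c1 /= double(n); c2 /= double(n);                        // centroids of the two sets

    Eigen::Matrix3d W = Eigen::Matrix3d::Zero();
    for (size_t i = 0; i < n; ++i)
        W += (p[i] - c1) * (p_prime[i] - c2).transpose();    // accumulate q_i * q_i'^T

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(W, Eigen::ComputeFullU | Eigen::ComputeFullV);
    R = svd.matrixU() * svd.matrixV().transpose();
    if (R.determinant() < 0) R = -R;                         // guard against a reflection
    t = c1 - R * c2;
}
```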

### Nonlinear optimization method

When we represent the pose with a Lie algebra, we can write the objective function as:

$$\min_\xi \frac{1}{2}\sum_{i=1}^{n} \left\| p_i - \exp(\xi^\wedge) p'_i \right\|_2^2$$

We can also use PnP and ICP together: for key-points with known depth we compute the 3D-3D error, and for key-points with unknown depth we use the 3D-2D reprojection error.