COMPARATIVE ANALYSIS OF IMAGE REGISTRATION: NORMALIZED CORRELATION VERSUS SIFT-BASED REGISTRATION

This paper compares two image registration algorithms: classical normalized cross-correlation (a representative of intensity-based algorithms) and a SIFT-based algorithm (feature-based registration). A gradient subpixel correction algorithm was also applied to the normalized correlation. We compared their effectiveness on real images (including a terrain map) under modeled artificial distortions. The accuracy of determining the position (shift) of one image relative to another in the presence of rotation and scale changes was studied. The experiment was carried out using a simulation model written in the Python programming language with the OpenCV computer vision library.
The results of the experiments show that, in the absence of rotation and scale changes between the registered images, normalized correlation provides a slightly smaller root-mean-square error. However, when even small distortions of this kind are present, for example a rotation of more than 2 degrees or a scale change of more than 2 percent, the probability of correct registration for normalized correlation drops sharply. The advantages of normalized correlation are its almost 5 times higher speed and its applicability to small fragments (50x50 pixels or less), for which the SIFT algorithm struggles to extract a sufficient number of keypoints.
It was also shown that a two-stage algorithm (SIFT-based registration at the first stage, followed by optimization with normalized correlation as the criterion at the second) achieves both high accuracy and robustness to rotation and scale changes, but at the cost of high computational expense.


Introduction
By now, a large variety of image registration algorithms has been developed. They are usually divided into two groups: area-based (or intensity-based) and feature-based [1, 2]. It should also be noted that a new direction is actively developing in which trained neural networks are used as a measure of similarity [3].
Intensity-based algorithms are commonly used to determine the shift between images when there is little or no rotation or scale change. If the considered model of geometric transformation between images is more complex (when, for example, the presence of rotation and scale change cannot be neglected), then either the reliability of such algorithms drops sharply, or an excessive amount of computation is required for their normal operation (because it is necessary to iterate over all possible sets of transformation parameters, calculating the correlation similarity measure each time). The most common methods in this group are correlation methods.
At the same time, feature-based methods work quite stably with a variety of geometric transformation models. However, their accuracy is inferior to that of correlation methods, and they are often slower as well [4]. Among feature-based methods, the most important place is occupied by those based on keypoints. In this case, keypoints are first detected in the two images being registered. Let us call one of them (usually the larger) the reference image (RI), and the second the current image (CI). Then a descriptor, an array of values describing the keypoint, is calculated for each keypoint. As a result, we have a set of descriptors for the RI and a set of descriptors for the CI. If we find correspondences between them, we can estimate the geometric transformation between the images (using the found correspondences between the keypoints).
For comparison, from the group of intensity-based methods, the classic variant of the normalized cross-correlation method (hereafter NCC) was chosen [5, 6], one of the most popular correlation methods. As a representative of the second group, an algorithm that uses the SIFT detector (and keypoint descriptor) was chosen [7]. SIFT was created at the turn of the 21st century, and although many improvements to the original algorithm have appeared since then (for example, [8]), this work uses the classic implementation available in the OpenCV package [9]. This is sufficient to investigate the basic patterns and obtain approximate quantitative characteristics.
Aim and tasks of the research. The purpose of this work is to compare qualitatively and quantitatively (including computational performance) algorithms from the two main classes. There are many works making comparisons within each class (for example, [10] for intensity-based and [11] for feature-based methods), but almost none between classes. Comparing the two algorithms considered in this article allows, in addition to confirming the expected general patterns, obtaining specific quality and reliability indicators that are useful when choosing a tool for image registration.
The second part of the work also shows (and quantifies) that a combined algorithm (SIFT at the first stage, NCC at the second) joins the advantages of both approaches and achieves high estimation accuracy for complex geometric transformation models.
The novelty of this work lies not in the ideas of the approaches themselves, which are generally known, but in the specific numerical performance indicators obtained by a simulation experiment (including real terrain maps among the registered images).

Simulation setup
The comparison was performed using a simulation model created in the Python programming language (using the OpenCV, SciPy [12], and NumPy [13] libraries). The structure of the computational experiment is shown in Fig. 1. First, a reference image is selected, for example, a satellite image (or its region). Then the model randomly selects a fragment of a given size from this image (a 200 by 200 pixel fragment in section 2.1, and 50x50 in section 2.2) and adds distortions (rotation, scaling, random subpixel shift, and noise). The result is the CI. The RI and CI are passed to both registration algorithms (NCC and SIFT-based), and the coordinates of the CI relative to the RI are estimated. We take the position of the CI center as the coordinates of the CI.
In the CI formation model, we calculated new pixel positions after geometric transformation and then used bicubic interpolation to find their intensities. In all experiments, the values of the subpixel shifts along X and Y were modeled with a uniform distribution in the intervals from -0.5 to 0.5.
The matchTemplate function (from OpenCV) performed the NCC evaluation. For the NCC, the gradient method of subpixel shift correction [5] was also used.
Implementations of the SIFT feature extraction and feature matching algorithms were also taken from the OpenCV library. The descriptors are first calculated using the detectAndCompute method of the SIFT_create class. The resulting descriptors of the two images (RI and CI) are then matched. OpenCV implements two matchers, BFMatcher and FLANN-based; all experiments in this work were performed with the BFMatcher class, which showed higher performance than FLANN in our case. After that, the findHomography and getPerspectiveTransform functions are used to calculate the coordinates of the CI relative to the RI.
The process of forming a CI and searching for it in the original image (RI) was repeated 100 times for each parameter set. Based on these trials, the probability of correct registration (P) and the root-mean-square error (RMSE) were calculated. A registration was considered erroneous when the obtained estimate of the CI position differed from the true value by more than 2 pixels. All of the following results were obtained with a signal-to-noise ratio of 20 (the standard deviation of the image is 20 times greater than that of the noise).
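These statistics can be computed as follows. We assume (consistently with the gaps in the RMSE graphs when P = 0, mentioned later) that the RMSE is taken over the successful trials only:

```python
import numpy as np

def registration_stats(true_xy, est_xy, threshold=2.0):
    """Probability of correct registration P and RMSE over the trials
    whose position error does not exceed the threshold (2 px)."""
    err = np.linalg.norm(np.asarray(est_xy, float) - np.asarray(true_xy, float),
                         axis=1)
    ok = err <= threshold
    p = float(ok.mean())
    # RMSE is undefined when no trial succeeded (P = 0)
    rmse = float(np.sqrt(np.mean(err[ok] ** 2))) if ok.any() else float("nan")
    return p, rmse
```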

NCC and SIFT for real images registration
We used two main images (as RI) in our experiments: a real terrain snapshot of the National aerospace university "Kharkov aviation institute" (downloaded using [14], see Fig. 2, a) and a raccoon picture (Fig. 2, b).
Fig. 2. Test images (RI): a: satellite image of the territory of NAU "KhAI", 1000x1000; b: raccoon image, 1280x1080.
The SIFT descriptor is a vector of 128 values. For the RI in Fig. 2, a, the algorithm finds 17453 keypoints, while for a fragment (CI) the number varies (depending on the fragment) from 500 to 800 keypoints.
The average runtime for NCC is 0.0264 seconds. For the SIFT-based algorithm, this value is 0.44 seconds. If we consider that the keypoints for the RI are calculated in advance, then the execution time decreases to 0.1 seconds, which is still almost four times longer than for NCC.
At first, the rotation (α) and the scale change (s) were not simulated. In this case, the probability of correct matching for both algorithms was equal to 1 for signal-to-noise ratios greater than 10 (modern video cameras provide very low noise levels). The RMSE was less than 0.05 pixels (slightly lower for NCC than for SIFT). Thus, in the absence of scale changes and rotation, the NCC has a slight advantage in accuracy. Now let us analyze what happens when there are a rotation and a scale change (not accounted for in the NCC algorithm).
Fig. 3 shows that for the NCC the probability of successful registration P begins to decline sharply when the absolute value of the rotation angle exceeds about 2 degrees for the image in Fig. 2, a and 1 degree for the image in Fig. 2, b. Fig. 4 shows that the accuracy of the NCC registration also worsens as the absolute value of the angle increases. At the same time, the probability for the SIFT-based algorithm is almost always equal to one, and its accuracy deteriorates only slightly with increasing angle modulus. When P is equal to zero, the RMSE cannot be calculated; therefore the RMSE graphs have gaps.
We see a similar situation in the presence of scale distortion (in this work, only figures for the probability of correct registration are provided). With an increasing scale change, the results of the NCC drop sharply. This appears earlier for the image in Fig. 2, a, which is obviously less "smooth" and accordingly has a higher frequency content.
Fig. 3. Effect of rotation α on the probability of correct registration P: a: for the image in Fig. 2, a; b: for the image in Fig. 2, b.
Fig. 4. Effect of rotation α on the RMSE (along the X axis): a: for the image in Fig. 2, a; b: for the image in Fig. 2, b.
Fig. 5. Effect of scale s on the probability of correct registration P: a: for the image in Fig. 2, a; b: for the image in Fig. 2, b.

Small CI case (50x50)
While a CI of size 200x200 was used in the previous experiments, here we check what happens with smaller fragments. Fig. 6 shows plots of P versus rotation and scale change (for the image in Fig. 2, a). It can be seen that the SIFT-based algorithm starts to work much worse. This is because it is not always possible to extract a sufficient number of keypoints from such a small CI. Although in this case the situation can be somewhat improved by tuning the keypoint selection settings (for example, changing the threshold of Lowe's ratio test [7] and thereby accepting less reliable keypoints), the SIFT-based algorithm will still have problems when working with small fragments.
Fig. 6. Probability of correct registration P for a small CI (50x50) and the RI from Fig. 2, a with respect to: a: rotation α; b: scale s.

Combining SIFT-based registration and NCC
In this subsection, a two-stage algorithm is investigated: the SIFT-based algorithm is used at the first stage, and the obtained estimates are then set as initial values for a modified NCC at the second stage. The modified NCC consists in solving an optimization problem with the NCC objective function in four parameters: shifts along the X and Y axes, rotation, and scale. Such an algorithm was used, for example, in [15]. As can be seen from Fig. 7, this two-stage algorithm is robust to rotation (it behaves similarly for scaling) and provides high accuracy. At the same time, it requires almost 10 times more computation than the SIFT-based algorithm (with RI descriptors precomputed).
It should also be noted that such high accuracy (thousandths of a pixel) as in Fig. 7 became possible due to the complete coincidence of the interpolation algorithms used in the CI formation model and in the registration algorithm. In reality this is not the case, which leads to a noticeably higher RMSE (at least hundredths of a pixel).
Fig. 7. Effect of rotation α on the RMSE (along the X axis) for the image in Fig. 2, a (similar to Fig. 4, a, with a SIFT+NCC graph added).