^{②} (Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science & Technology, Nanjing 210044, China)
^{③} (Zhejiang Province Key Laboratory for Signal Processing, Zhejiang University of Technology, Hangzhou 310023, China)
^{④} (Guangxi Key Lab of MultiSource Information Mining and Security, Guangxi Normal University, Guilin 541004, China)
^{⑤} (Key Laboratory of GeoSpatial Information Technology, Ministry of Land and Resources, Chengdu University of Technology, Chengdu 610059, China)
^{⑥} (MLR Key Laboratory of Metallogeny and Mineral Assessment Institute of Mineral Resources, Chinese Academy of Geological Sciences, Beijing 100037, China)
Different sensors describe the same scene differently. Infrared sensors are sensitive to strong heat radiation within a region. They extract targets according to the difference in infrared energy between the target and the background, and this energy can pass through a certain thickness of soil and even concrete. However, compared with Synthetic Aperture Radar (SAR) imaging, infrared imaging is vulnerable to clouds, rain, and fog. SAR is an irreplaceable reconnaissance tool owing to its all-day, all-weather operation and long detection range. In some cases, however, the image information obtained by a single SAR is insufficient for analyzing and understanding the target or scene^{[1]}. Fusing SAR and infrared images combines the advantages of both kinds of reconnaissance and can therefore greatly improve reconnaissance efficiency. The "LANTIRN" pod on the American F-16 fighter uses infrared reconnaissance as the main means of low-altitude reconnaissance and combines it with SAR reconnaissance to good effect. Ref. [2] treats the fusion of SAR data and infrared data as one of the core issues of missile multimode guidance.
The fusion of infrared and SAR images yields a fused image that is more suitable for human visual perception or for computer processing and analysis. It compensates for the limited information obtained by a single sensor and improves the clarity and information content of the resulting image, which is conducive to more accurate, more reliable, and more comprehensive access to target or scene information. It is mainly used in military operations, national defense, resource surveys, and other fields.
In recent years, methods based on multiscale decomposition have received extensive attention in image fusion^{[3–6]}. However, these methods have some drawbacks. Firstly, some multiscale decomposition tools lack shift invariance, while those that are shift-invariant have quite high computational complexity. Secondly, the low-frequency components obtained by multiscale decomposition tools are approximate representations of the source images in which few pixel gray levels are close to zero. As a result, the low-frequency information of the source images cannot be described sparsely, and it is not convenient to capture the salient features of the source images. Therefore, in this paper, we apply the Complex Contourlet Transform (CCT) proposed in Ref. [7] to remote sensing image fusion. This multiscale decomposition tool is fast and shift-invariant, which reduces the influence of inaccurate image registration on the fusion results. In Ref. [8], the complex contourlet transform is applied to image denoising and achieves relatively good results, but its use in image fusion is still at the exploratory stage.
In recent years, Sparse Representation (SR) has been applied to image fusion as a new signal processing model. The image fusion methods using the sparse representation model or the joint sparse representation model in Refs. [9,10] improve the fusion effect, but both methods carry out the fusion directly in the sparse representation domain. Since a multiscale decomposition tool describes the details of an image at multiple scales, the fused image cannot inherit the detailed information of the source images well if no such tool is used. In Ref. [11], a method for fusing an infrared image and a visible image based on the Nonsubsampled Contourlet Transform (NSCT) and sparse representation is proposed. However, the combined use of sparse representation and NSCT has high computational complexity. In addition, in view of the grayscale difference between the infrared image and the SAR image and the interference of speckle noise in the SAR image, fusing the low-frequency components directly without sparse representation may cause pixel confusion and leave the target in the fused image insufficiently salient.
To this end, an image fusion method in the CCT domain based on joint sparse representation is proposed to fuse the SAR image and the infrared image. The fused image obtained by the proposed method combines the advantages of the SAR and infrared images well and has better visual quality.
2 Complex Contourlet Transform and Joint Sparse Representation
2.1 Complex contourlet transform
The complex contourlet transform is obtained by combining the contourlet transform with the dual-tree complex wavelet transform. Its principle is as follows: after the original image is decomposed by the dual-tree complex wavelet transform, a dual-tree structure is formed. Then 2-dimensional Directional Filter Banks (DFB) are used to separate the high-frequency components in six directions, so that the number of subbands can be expanded to 2^{n}. The essence of CCT is to replace the Laplacian Pyramid (LP) filter structure in the contourlet transform with the dual-tree structure of the Dual-Tree Complex Wavelet Transform (DTCWT). The original single high-frequency component is thereby replaced with high-frequency components in six directions, which capture the details of the image better. CCT takes into account the amplitude and phase information of the original signal, and its decomposition is fast. Meanwhile, it retains the property of shift invariance. The principle of CCT is shown in Fig. 1.
By using an overcomplete dictionary matrix that contains M atoms, a signal can be represented as a sparse linear combination of these atoms, thus revealing the essential features of the original image more sparsely. The mathematical definition of the sparse representation model is:
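$\mathop {\arg \min }\limits_{\alpha} {\left\| \alpha \right\|_0}, \quad {\rm{s}}.{\rm{t}}.\;\left\| {x - D\alpha} \right\|_2^2 < \varepsilon$

(1) 
where $x$ is the signal, $D$ is the overcomplete dictionary, $\alpha$ is the sparse coefficient vector, $\left\| \cdot \right\|_0$ counts the nonzero entries of a vector, and $\varepsilon$ is the allowed approximation error.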
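The Joint Sparse Model (JSM) has been developed from the theory of sparse representation, and the models JSM-1, JSM-2, and JSM-3 were proposed in succession^{[12]}. These models consider that each original signal contains both a sparse portion common to all signals and a sparse portion unique to that signal. Each signal in the signal ensemble can be expressed as
$V_i = V_{\rm{c}} + V_{u_i} = D(s_{\rm{c}} + s_i),\;i = 1,2, \cdots ,K$

(2) 
where $V_{\rm{c}}$ and $V_{u_i}$ are the common portion and the unique portion of the $i$-th signal $V_i$, $s_{\rm{c}}$ and $s_i$ are the corresponding sparse coefficients over the dictionary $D$, and $K$ is the number of signals. The whole ensemble can then be written as
$V = D_{\rm{JSR}}S$

(3) 
where $V$ stacks the signals $V_1, \cdots, V_K$, $S$ stacks the sparse coefficients $s_{\rm{c}}, s_1, \cdots, s_K$, and $D_{\rm{JSR}}$ is the joint dictionary built from $D$.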
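The joint model of Eqs. (2) and (3) can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration: the random dictionary, the toy sizes, the chosen nonzero positions, and the helper names `omp` and `joint_dictionary` are hypothetical and do not come from the paper. The greedy solver is a plain orthogonal matching pursuit used here as one common way to approximate the problem in Eq. (1).

```python
import numpy as np

def omp(D, x, max_atoms, tol=1e-6):
    """Greedy OMP for: min ||a||_0  s.t.  ||x - D a||_2^2 < tol."""
    norms = np.linalg.norm(D, axis=0)            # atoms need not be unit-norm
    residual = x.astype(float)
    support, sol = [], np.zeros(0)
    while len(support) < max_atoms and residual @ residual >= tol:
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual) / norms))
        if k in support:                         # no new atom helps, stop
            break
        support.append(k)
        # least-squares refit on all selected atoms
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef = np.zeros(D.shape[1])
    if support:
        coef[support] = sol
    return coef

def joint_dictionary(D, K):
    """Build D_JSR so that stacking V_i = D (s_c + s_i) gives V = D_JSR S."""
    n, M = D.shape
    D_jsr = np.zeros((K * n, (K + 1) * M))
    for i in range(K):
        D_jsr[i * n:(i + 1) * n, :M] = D                        # common part
        D_jsr[i * n:(i + 1) * n, (i + 1) * M:(i + 2) * M] = D   # unique part i
    return D_jsr

# Toy data (hypothetical sizes): n-dimensional signals, M atoms, K signals.
rng = np.random.default_rng(0)
n, M, K = 16, 32, 2
D = rng.standard_normal((n, M))
D /= np.linalg.norm(D, axis=0)

s_c = np.zeros(M); s_c[[3, 17]] = [1.5, -2.0]    # shared sparse part
s_1 = np.zeros(M); s_1[5] = 0.8                  # unique part of signal 1
s_2 = np.zeros(M); s_2[20] = -1.2                # unique part of signal 2

V = np.concatenate([D @ (s_c + s_1), D @ (s_c + s_2)])   # Eq. (2), stacked
S = np.concatenate([s_c, s_1, s_2])
D_jsr = joint_dictionary(D, K)
assert np.allclose(D_jsr @ S, V)                 # Eq. (3) holds

S_hat = omp(D_jsr, V, max_atoms=8)
print(np.count_nonzero(S_hat), np.linalg.norm(V - D_jsr @ S_hat))
```

The first block of `S_hat` plays the role of the common coefficients and the remaining blocks the unique coefficients. Greedy pursuit is not guaranteed to recover the exact support, so the sketch only reports the number of atoms used and the remaining residual.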
The low-frequency components obtained by CCT are approximate representations of the source images, but they are not sufficiently sparse. Since the acquired multisource remote sensing images describe the same scene from different aspects, the low-frequency components of the two source images are correlated, i.e., there is joint sparsity between the low-frequency components of the images to be fused, while some differences remain between them. Therefore, joint sparse representation is applied to the low-frequency components of the original images. In this way the common features and the unique features of the low-frequency components are separated. The fusion is performed by selecting the unique features with the larger l_{1} norm, while the common features remain unchanged. The specific steps are as follows.
Step 1 Create the training sample set. Let the low-frequency components of the two images to be fused be L_{1} and L_{2}. A sliding window (step size 1) is used to divide each of them into a series of 4×4 image blocks in a row-first manner. All image blocks are then rearranged into column vectors in a row-first manner to form the matrices V_{1} and V_{2}, from which the training sample set is chosen randomly.
Step 2 Joint sparse representation. The matrices V_{1} and V_{2} obtained in Step 1 are merged into a union matrix V_{3}. The K-SVD^{[13]} method is used to train on the samples and construct the dictionary of V_{3}. According to the joint sparse representation model:
$V = \left[ \begin{array}{c} V_1 \\ V_2 \end{array} \right] = \left[ \begin{array}{ccc} D & D & 0 \\ D & 0 & D \end{array} \right] \left[ \begin{array}{c} s_{\rm{c}} \\ s_1 \\ s_2 \end{array} \right] = D_{\rm{JSR}}S$

(4) 
Then the Orthogonal Matching Pursuit (OMP) method^{[14]} is used to find the sparse representation coefficients in Eq. (4).
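Step 3 Fusion of sparse representation coefficients of low-frequency components. The fusion consists of two parts: the selection of the activity evaluation index and the design of the fusion rule. The l_{1} norm of the unique sparse coefficients is used as the evaluation index of the activity degree. Let the sparse representation coefficient after fusion be S_{F}. As stated above, S_{F} keeps the common features unchanged and, for each image block, selects the unique features with the larger l_{1} norm. The fused vector matrix V_{F} is then obtained from S_{F} and the dictionary.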
Step 4 Reconstruction. Reconstructing the low-frequency component from V_{F} is an inverse sliding window process: the column vectors of the fused vector matrix V_{F} are restored into image blocks. Since the step size of the sliding window is 1, adjacent image blocks partially overlap. The overlapping parts of adjacent image blocks are therefore combined by weighted averaging to obtain the fused low-frequency component.
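Steps 1, 3, and 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the fusion rule is written as the common coefficients plus the unique coefficients with the larger l_{1} norm (one plausible reading of the rule described above), and the weighted averaging in Step 4 is taken to use uniform weights. Dictionary training and sparse coding (Step 2) are not shown.

```python
import numpy as np

def extract_patches(img, p=4):
    """Step 1: slide a p x p window (step 1) in row-first order.
    Each block is flattened row-first into one column of the result."""
    H, W = img.shape
    cols = [img[r:r + p, c:c + p].reshape(-1)
            for r in range(H - p + 1) for c in range(W - p + 1)]
    return np.stack(cols, axis=1)

def fuse_coefficients(S_c, S_1, S_2):
    """Step 3 (assumed form): keep the common coefficients and, per column,
    take the unique coefficients whose l1 norm is larger."""
    pick_1 = np.abs(S_1).sum(axis=0) >= np.abs(S_2).sum(axis=0)
    S_u = np.where(pick_1[None, :], S_1, S_2)
    return S_c + S_u

def reconstruct(V, shape, p=4):
    """Step 4: put each column back as a block and average overlapping
    pixels (uniform weights assumed)."""
    H, W = shape
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    idx = 0
    for r in range(H - p + 1):
        for c in range(W - p + 1):
            acc[r:r + p, c:c + p] += V[:, idx].reshape(p, p)
            cnt[r:r + p, c:c + p] += 1
            idx += 1
    return acc / cnt

# Round trip on a toy 6 x 6 "low-frequency component".
img = np.arange(36, dtype=float).reshape(6, 6)
V = extract_patches(img)                  # 16 x 9: nine 4 x 4 blocks
restored = reconstruct(V, img.shape)      # averaging identical overlaps

# Fusion rule on toy coefficients (3 atoms, 2 blocks).
S_c = np.ones((3, 2))
S_1 = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
S_2 = np.array([[0.0, 0.0], [0.0, -2.0], [0.0, 0.0]])
S_F = fuse_coefficients(S_c, S_1, S_2)
```

In a full pipeline the columns of the fused coefficient matrix would be multiplied by the dictionary before `reconstruct` is called. The round trip on unmodified patches returns the input exactly, which is a convenient sanity check for the overlap averaging.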
3.2 Fusion of high-frequency components
The high-frequency components contain the details of the source images, such as textures and edges. The larger the high-frequency coefficients, the richer the information of the region where the central pixel is located. When the central pixel of a local image region is a target pixel, the gray levels of the region are more dispersed and the region information entropy is larger. When the central pixel is a background pixel, the gray levels are less dispersed and the region information entropy is smaller. However, when background information remains in all directions of the high-frequency components, both the region information entropy and the region energy are large. Because the human eye is more sensitive to local changes in the image, the visual sensitivity coefficient can be used to distinguish background pixels from target pixels^{[15]}. In addition, since the gray levels of the local area around target pixels are generally more dispersed than those around background pixels, the fusion rule for the high-frequency components is designed by combining the visual sensitivity coefficient with the energy matching degree, so that the fused high-frequency components better inherit the detail information of the source images and improve the visual effect. The visual sensitivity coefficient $\eta$ and the energy matching degree $\rho$ are defined as
$\eta \left[ C_{k,l}^{x}\left( i,j \right) \right] = \frac{C_{k,l}^{x}\left( i,j \right)}{\overline{C^{x}\left( i,j \right)}}$

(5) 
$\rho \left( i,j \right) = \frac{2C_{k,l}^{A}\left( i,j \right)C_{k,l}^{B}\left( i,j \right)}{\left[ C_{k,l}^{A}\left( i,j \right) \right]^2 + \left[ C_{k,l}^{B}\left( i,j \right) \right]^2}$

(6) 
where $C_{k,l}^{x}\left( i,j \right)$ ($x = A, B$) is the high-frequency coefficient of source image $x$ at position $\left( i,j \right)$ in the subband indexed by $k$ and $l$.
Let the energy matching degree threshold be T. When $\rho \left( i,j \right) \ge T$, the two coefficients are well matched and are combined by weighting:
$C_{k,l}^{\rm{F}}\left( i,j \right) = \eta_{k,l}^{A}\left( i,j \right)\max \left[ C_{k,l}^{A}\left( i,j \right), C_{k,l}^{B}\left( i,j \right) \right] + \eta_{k,l}^{B}\left( i,j \right)\min \left[ C_{k,l}^{A}\left( i,j \right), C_{k,l}^{B}\left( i,j \right) \right]$

(7) 
where $\eta_{k,l}^{x}\left( i,j \right)$ denotes $\eta \left[ C_{k,l}^{x}\left( i,j \right) \right]$ from Eq. (5). When $\rho \left( i,j \right) < T$, the larger coefficient is selected:
$C_{k,l}^{\rm{F}}\left( i,j \right) = \left\{ \begin{aligned} & C_{k,l}^{A}\left( i,j \right), \quad C_{k,l}^{A}\left( i,j \right) \ge C_{k,l}^{B}\left( i,j \right) \\ & C_{k,l}^{B}\left( i,j \right), \quad C_{k,l}^{A}\left( i,j \right) < C_{k,l}^{B}\left( i,j \right) \end{aligned} \right.$

(8) 
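The high-frequency rule of Eqs. (5)–(8) can be transcribed almost literally into NumPy. The sketch below rests on several assumptions that are not stated in the text: the overline in Eq. (5) is taken to be a 3×3 local mean of the coefficient magnitudes, the threshold value T = 0.7 is arbitrary, Eq. (7) is assumed to apply where the matching degree reaches T and Eq. (8) elsewhere, and the weights in Eq. (7) are used exactly as written, without normalisation. All function names are hypothetical.

```python
import numpy as np

def local_mean(a, size=3):
    """Mean over a size x size neighbourhood with edge replication
    (assumed meaning of the overline in Eq. (5))."""
    pad = size // 2
    ap = np.pad(a, pad, mode="edge")
    out = np.zeros(a.shape, dtype=float)
    for dr in range(size):
        for dc in range(size):
            out += ap[dr:dr + a.shape[0], dc:dc + a.shape[1]]
    return out / size ** 2

def visual_sensitivity(C, eps=1e-12):
    """Eq. (5): coefficient divided by its (assumed) local mean magnitude."""
    return C / (local_mean(np.abs(C)) + eps)

def matching_degree(CA, CB, eps=1e-12):
    """Eq. (6): energy matching degree, in [-1, 1]."""
    return 2.0 * CA * CB / (CA ** 2 + CB ** 2 + eps)

def fuse_high(CA, CB, T=0.7):
    """Apply Eq. (7) where rho >= T (assumed) and Eq. (8) elsewhere."""
    CA = np.asarray(CA, dtype=float)
    CB = np.asarray(CB, dtype=float)
    rho = matching_degree(CA, CB)
    weighted = (visual_sensitivity(CA) * np.maximum(CA, CB)
                + visual_sensitivity(CB) * np.minimum(CA, CB))   # Eq. (7)
    selected = np.where(CA >= CB, CA, CB)                        # Eq. (8)
    return np.where(rho >= T, weighted, selected)

# Toy subbands with opposite signs: rho is about -1, so Eq. (8) is used.
A = np.full((4, 4), 2.0)
B = np.full((4, 4), -2.0)
fused = fuse_high(A, B)
rho_same = matching_degree(A, A)    # identical coefficients match fully
```

Because the matching degree is computed per pixel, the two rules can be mixed within one subband, which is the point of introducing the threshold.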
The procedure of the proposed image fusion method based on CCT and joint sparse representation is shown in Fig. 2.
To evaluate the performance of the proposed image fusion method, SAR images and infrared images of the same scene from the SAHARA project of the Royal Military Academy in Belgium are fused, as shown in Figs. 3(a)–3(f). The source images are 256×256 in size. The proposed image fusion method is compared with the method based on LP, the method based on the Wavelet Transform (WT), the method based on NSCT, the method based on DTCWT, and the method based on sparse representation in Ref. [11]. The experimental results of the six methods are shown in Fig. 4, Fig. 5, and Fig. 6.
From Fig. 4, Fig. 5, and Fig. 6, it can be seen that the fused image obtained by the LP fusion method is blurred, its overall brightness is relatively dim, its local contrast is slightly low, and the target is not salient. The WT fusion method improves the overall brightness of the fused image, but the edges are still blurred, and some parts of the target and background are mixed together. The fused image obtained by the NSCT fusion method retains the contours of the source images better, but the contrast is still relatively low and the overall brightness is relatively dark. The result of the DTCWT fusion method is slightly worse than that of the NSCT fusion method. Compared with these four methods, the method in Ref. [11] further improves the overall brightness of the fused image and the contrast between the target and the background. However, some obvious halos appear in the image. For instance, there are obvious artifacts in the bottom-right part of Figs. 4(e), 5(e), and 6(e). The fused image obtained by the method proposed in this paper has the best visual effect and no obvious artifacts. The overall brightness agrees better with human visual perception, the image texture is continuous, and the image details are clear. The image contrast is higher, and the fused image inherits the original contour information of objects in the source images.
In this paper, six objective evaluation indices^{[16]}, namely Information Entropy (IE), Mutual Information (MI), Correlation Coefficient (CC), Spatial Frequency (SF), Average Gradient (AG), and Standard Deviation (SD), together with the running time (Time), are used to compare the experimental results of the six fusion methods. Tab. 1 gives the quantitative evaluation results of the six methods.
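For readers who want to reproduce such a comparison, the six indices can be computed as sketched below. The paper does not give its formulas, so these are commonly used textbook definitions for 8-bit grayscale images and should be read as assumptions; in particular, the exact normalisation of SF and AG and the way MI is combined over the two source images may differ from Ref. [16]. All function names are hypothetical.

```python
import numpy as np

def information_entropy(img):
    """IE: Shannon entropy (bits) of the 8-bit gray-level histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(a, b):
    """MI (bits) between two images from their joint 256 x 256 histogram."""
    h, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=256,
                             range=[[0, 256], [0, 256]])
    pxy = h / h.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def correlation_coefficient(a, b):
    """CC: Pearson correlation between the pixel values of two images."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def spatial_frequency(img):
    """SF: root of the mean squared horizontal and vertical differences."""
    img = img.astype(float)
    rf2 = np.mean((img[:, 1:] - img[:, :-1]) ** 2)   # row frequency squared
    cf2 = np.mean((img[1:, :] - img[:-1, :]) ** 2)   # column frequency squared
    return float(np.sqrt(rf2 + cf2))

def average_gradient(img):
    """AG: mean of sqrt((gx^2 + gy^2) / 2) over forward differences."""
    img = img.astype(float)
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def standard_deviation(img):
    """SD: standard deviation of the gray levels."""
    return float(np.std(img.astype(float)))

# Toy check on a horizontal ramp: every gray level 0..255 appears equally often.
ramp = np.tile(np.arange(256.0), (256, 1))
ie = information_entropy(ramp)
mi = mutual_information(ramp, ramp)
cc = correlation_coefficient(ramp, ramp)
sf = spatial_frequency(ramp)
ag = average_gradient(ramp)
sd_flat = standard_deviation(np.full((8, 8), 7.0))
```

IE, SF, AG, and SD are computed on the fused image alone, whereas MI and CC compare the fused image with a source image. A common convention, assumed here, is to evaluate them against each source image and sum or average the two values.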
The experiments show that the fusion time of the proposed method is clearly shorter than that of the fusion method in Ref. [11]. Compared with the other classical fusion methods, the proposed method has no great advantage in time, but the improvement in fusion accuracy comes at the expense of fusion time. As can be seen from Tab. 1, the information entropy and standard deviation of the proposed method are always higher than those of the other five methods, while the other indices are sometimes slightly lower than those of other methods. This indicates that the fused image contains more detail information and has higher local contrast, and that the robustness and overall performance of the proposed method are the best, which is consistent with the subjective analysis. Notably, the sparse representation fusion method in Ref. [11] gives higher spatial frequency and average gradient for the second group of infrared and SAR images. The actual reason, however, is that this method cannot discriminate between the common features and the unique features of the low-frequency components of the source images, which results in image distortion. In the proposed method, the infrared image and the SAR image are decomposed by the complex contourlet transform, and the common features and unique features of the low-frequency components of the source images are distinguished from each other by the joint sparse representation. By combining the visual sensitivity coefficient and the energy matching degree to fuse the high-frequency components, the rich detail information of the two source images is captured. The fusion result highlights the target and enhances the background, texture, and other details. On the whole, the proposed method is superior to the other five methods in both subjective visual effect and objective quantitative evaluation.
5 Conclusion
A novel method for fusing SAR and infrared images in the complex contourlet domain based on joint sparse representation is proposed in this paper. The method takes full advantage of both the SAR image and the infrared image. Experimental results demonstrate that the proposed fusion method achieves higher performance and better visual quality.
[1]  Chen Lei, Yang Fengbao, Wang Zhishe, et al. Mixed fusion algorithm of SAR and visible images with feature level and pixel[J]. Opto-Electronic Engineering, 2014, 41(3): 55–60.
[2]  Zeng Xianwei, Fang Yangwang, Wu Youli, et al. A new guidance law based on information fusion and optimal control of structure stochastic jump system[C]. Proceedings of 2007 IEEE International Conference on Automation and Logistics, Jinan, China, 2007: 624–627.
[3]  Ye Chunqi, Wang Baoshu, and Miao Qiguang. Fusion algorithm of SAR and panchromatic images based on region segmentation in NSCT domain[J]. Systems Engineering and Electronics, 2010, 32(3): 609–613.
[4]  Xu Xing, Li Ying, Sun Jinqiu, et al. An algorithm for image fusion based on curvelet transform[J]. Journal of Northwestern Polytechnical University, 2008, 26(3): 395–398.
[5]  Shi Zhi, Zhang Zhuo, and Yue Yangang. Adaptive image fusion algorithm based on shearlet transform[J]. Acta Photonica Sinica, 2013, 42(1): 115–120. DOI:10.3788/gzxb
[6]  Liu Jian, Lei Yingjie, Xing Yaqiong, et al. Fusion technique for SAR and gray visible image based on hidden Markov model in nonsubsample shearlet transform domain[J]. Control and Decision, 2016, 31(3): 453–457.
[7]  Chen Dipeng and Li Qi. The use of complex contourlet transform on fusion scheme[C]. Proceedings of World Academy of Science, Engineering and Technology, Prague, Czech Republic, 2005: 342–347.
[8]  Wu Yiquan, Wan Hong, and Ye Zhilong. Fabric defect image noise reduction based on complex contourlet transform and anisotropic diffusion[J]. CAAI Transactions on Intelligent Systems, 2013, 8(3): 214–219.
[9]  Wei Qi, Bioucas-Dias J, Dobigeon N, et al. Hyperspectral and multispectral image fusion based on a sparse representation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015, 53(7): 3658–3668. DOI:10.1109/TGRS.2014.2381272
[10]  Yu Nannan, Qiu Tianshuang, Bi Feng, et al. Image features extraction and fusion based on joint sparse representation[J]. IEEE Journal of Selected Topics in Signal Processing, 2011, 5(5): 1074–1082. DOI:10.1109/JSTSP.2011.2112332
[11]  Wang Jun, Peng Jinye, Feng Xiaoyi, et al. Image fusion with nonsubsampled contourlet transform and sparse representation[J]. Journal of Electronic Imaging, 2013, 22(4): 043019. DOI:10.1117/1.JEI.22.4.043019
[12]  Duarte M F, Sarvotham S, Baron D, et al. Distributed compressed sensing of jointly sparse signals[C]. Proceedings of Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2005: 1537–1541.
[13]  Aharon M, Elad M, and Bruckstein A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation[J]. IEEE Transactions on Signal Processing, 2006, 54(11): 4311–4322. DOI:10.1109/TSP.2006.881199
[14]  Mallat S G and Zhang Zhifeng. Matching pursuits with time-frequency dictionaries[J]. IEEE Transactions on Signal Processing, 1993, 41(12): 3397–3415. DOI:10.1109/78.258082
[15]  Kong Weiwei and Lei Yingjie. Technique for image fusion based on NSST domain and human visual characteristics[J]. Journal of Harbin Engineering University, 2013, 34(6): 777–782.
[16]  Fan Xinnan, Zhang Ji, Li Min, et al. A multi-sensor image fusion algorithm based on local feature difference[J]. Journal of Optoelectronics·Laser, 2014, 25(10): 2025–2032.