Online Supplement

Reference-based Multi-stage Progressive Restoration for Multi-degraded Images

Yi Zhang, Qixue Yang, Damon M. Chandler, and Xuanqin Mou

 

This webpage serves as the online supplement to the paper “Reference-based Multi-stage Progressive Restoration for Multi-degraded Images”, submitted to the IEEE Transactions on Image Processing.

 

Ref-IRT is a reference-based multi-degraded image restoration (MDIR) method that transfers similar textures from a reference image to the distorted image so that the edge/texture/structure information lost during quality degradation can be supplemented and restored. The method operates in three successive stages. The first stage performs a preliminary restoration of the distorted image so that similar edges/textures can be located more reliably in the reference image; the second (primary restoration) stage improves the restoration by transferring similar edges/textures from the reference image to the distorted image to help recover the lost information; and the third stage performs the final restoration, in which more accurate texture features are transferred to further enhance the restored image quality.

Due to the page-length limit of the journal article, we first introduce the proposed XRIR dataset in more detail here. Then, we describe the subjective study conducted to quantify the visual improvement achieved by the different MDIR methods tested on multi-degraded images. Next, we report the performance of MPENet in predicting distortion parameters. We also present more details about Ref-IRT+, a modified version of the Ref-IRT approach based on a practical degradation model [1]. Finally, we show visual results of different MDIR methods tested on real-world images taken from the LIVE Challenge dataset [2] and/or captured by our own camera.

[Download the code and dataset]

 

1. XJTU-referenced image restoration (XRIR) dataset

The XRIR dataset contains 200 high-resolution pristine images, each of which has a corresponding reference image. Among the 200 image pairs, 137 were carefully collected from the Internet and 63 were captured with our own camera. Compared with CUFED5 [3] and WR_SR [4], XRIR offers advantages in both the number of images and content diversity. Specifically, the 200 image pairs roughly cover 11 categories of image content: indoor, outdoor, building, landmark, animal, plant, ocean/lake, forest, mountain, human, and others (mainly man-made objects that do not belong to any of the aforementioned 10 categories). Sample images from the different categories of XRIR are shown in Figure 1, and the dataset distribution is given in Table 1. The image resolutions in XRIR range from a minimum of 1200×1600 pixels to a maximum of 5874×3810 pixels. Thus, the dataset can serve as a benchmark for evaluating both reference-based image super-resolution and restoration methods.

 

 

[Figure: eleven sample pristine-reference image pairs, one per category. Top group: indoor, outdoor, building, landmark, animal, plant; bottom group: ocean/lake, forest, mountain, human, others. Each group shows an “input” (pristine) row above a “reference” row.]

 

Figure 1. Sample pristine-reference image pairs from different categories in the XRIR dataset.

 

 

Table 1. Common image statistics computed for different image categories in the XRIR dataset.

| Image category | indoor | outdoor | building | landmark | animal | plant | ocean/lake | forest | mountain | human | others |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Percentage of total images | 6.5% | 13% | 19.5% | 12.5% | 3.5% | 7.5% | 16.5% | 4% | 9% | 3% | 5% |
| L* Mean | 46.53 | 51.24 | 50.62 | 52.36 | 56.88 | 49.91 | 51.37 | 45.53 | 51.73 | 47.07 | 45.46 |
| L* Variance | 455.16 | 660.02 | 638.14 | 463.98 | 454.62 | 541.61 | 505.16 | 564.37 | 505.79 | 617.38 | 629.65 |
| L* Skewness | 0.26 | 0.04 | -0.02 | -0.18 | -0.24 | 0.19 | -0.22 | 0.49 | 0.18 | -0.18 | 0.32 |
| L* Kurtosis | 2.99 | 2.34 | 2.09 | 2.93 | 2.53 | 2.47 | 2.31 | 2.62 | 2.54 | 2.22 | 2.46 |
| L* -Slope | 1.57 | 1.36 | 1.42 | 1.41 | 1.45 | 1.37 | 1.38 | 1.42 | 1.40 | 1.55 | 1.43 |
| a* Mean | 2.83 | 0.15 | 0.00 | 3.58 | -8.90 | -2.60 | -0.75 | -3.50 | -2.27 | 5.15 | 7.20 |
| a* Variance | 47.83 | 31.74 | 36.19 | 32.45 | 51.25 | 119.13 | 29.08 | 50.47 | 43.62 | 164.12 | 121.49 |
| a* Skewness | 1.33 | 1.16 | 1.01 | 1.15 | 0.73 | 0.07 | 1.19 | 0.20 | 0.30 | 1.94 | 0.94 |
| a* Kurtosis | 15.95 | 17.39 | 13.54 | 14.78 | 15.80 | 4.68 | 12.91 | 7.39 | 9.38 | 12.47 | 8.61 |
| a* -Slope | 1.50 | 1.38 | 1.43 | 1.37 | 1.36 | 1.43 | 1.36 | 1.53 | 1.39 | 1.58 | 1.51 |
| b* Mean | 11.70 | 1.58 | 2.61 | 5.36 | -0.23 | 6.22 | -3.69 | 9.08 | 0.64 | 10.28 | 7.15 |
| b* Variance | 114.35 | 173.74 | 125.49 | 214.57 | 161.74 | 216.19 | 142.48 | 105.86 | 199.23 | 216.71 | 120.28 |
| b* Skewness | 0.17 | 0.39 | 0.43 | -0.09 | 1.08 | 0.61 | 0.68 | 0.76 | 0.24 | 0.32 | 0.39 |
| b* Kurtosis | 7.19 | 4.82 | 4.96 | 3.54 | 8.62 | 4.02 | 4.65 | 4.98 | 3.53 | 5.60 | 5.81 |
| b* -Slope | 1.59 | 1.51 | 1.53 | 1.50 | 1.51 | 1.50 | 1.48 | 1.59 | 1.50 | 1.64 | 1.53 |

 

We also report some common image statistics of our dataset. Specifically, each image was first converted from the RGB to the CIE L*a*b* color space (using the rgb2lab function in Matlab, which assumes a D65 white point and sRGB input by default). Then, for each channel (L*, a*, and b*), the mean, variance, skewness, kurtosis, and the (negative) slope of the 1D-averaged magnitude spectrum were computed. Histograms of these statistics computed over all dataset images are shown in Figure 2; in each plot, the x-axis represents the statistic value and the y-axis the corresponding probability. Observe that the mean, skewness, and slope values generally follow a Gaussian distribution, whereas the variance and kurtosis values generally follow a Weibull distribution. The same statistics are also broken down by image category in Table 1, in which each entry is the average statistic value computed over all images belonging to that category. According to Table 1, the image categories differ mainly in the mean/skewness values of the a*/b* channels, suggesting that color statistics might be a useful feature for distinguishing between categories in the XRIR dataset. In summary, the dataset contains a reasonable variety of mean luminances and chromaticities, levels of activity/contrast, levels of sparseness, and smooth vs. busy regions.
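
For concreteness, the following Python sketch (our own minimal re-implementation; the statistics reported here were computed in Matlab, and the file name below is a placeholder) shows how these five per-channel statistics can be computed for a single image. Kurtosis is computed in the Pearson convention (a Gaussian gives 3), matching the values in Table 1.

```python
import numpy as np
from scipy import stats
from skimage import color, io

def neg_spectral_slope(channel):
    """Negative slope of the 1D (radially) averaged log magnitude spectrum
    versus log spatial frequency, fit by least squares."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(channel)))
    cy, cx = mag.shape[0] // 2, mag.shape[1] // 2
    y, x = np.indices(mag.shape)
    r = np.hypot(y - cy, x - cx).astype(int)
    radial = np.bincount(r.ravel(), weights=mag.ravel()) / np.bincount(r.ravel())
    freqs = np.arange(1, min(cy, cx))  # skip the DC component
    slope, _ = np.polyfit(np.log(freqs), np.log(radial[freqs]), 1)
    return -slope

img = io.imread('example.png')  # placeholder file name
lab = color.rgb2lab(img)        # sRGB input, D65 white point (as in Matlab's rgb2lab)
for idx, name in enumerate(('L*', 'a*', 'b*')):
    ch = lab[..., idx]
    v = ch.ravel()
    # Pearson kurtosis (fisher=False), so a Gaussian channel yields 3
    print(name, v.mean(), v.var(), stats.skew(v), stats.kurtosis(v, fisher=False),
          neg_spectral_slope(ch))
```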

 

Figure 2. Histograms of common image statistics computed for all 200 image pairs in the XRIR dataset.

 

 

2. Subjective study

Since Ref-IRT transfers similar textures from the reference to the target image, and deep-learning methods are prone to introducing texture artifacts that can be visually unpleasant, we conducted a subjective study to quantify the visual improvement achieved by the proposed method. To this end, 20 multi-degraded images were randomly selected from each of the three datasets (i.e., CUFED5 [3], WR_SR [4], and XRIR), resulting in 60 images in total. For each distorted image, 19 MDIR methods (see the paper for more details) were applied to yield 19 restored images. During the test, for each of the 60 distorted images, the subjects were presented with the 19 restored versions and asked to select the image(s) with the highest perceived quality. Specifically, if a subject was fully confident that one image was the best, that single image was selected and recorded; otherwise, the subject could select the 2 or 3 images that he/she considered to have the best quality. No more than 3 images could be selected per trial. In each trial, the 19 restored images were presented in random order, one image at a time, and the subject did not know which image was produced by which method, in order to avoid potential bias. In total, 14 subjects of different genders, aged 19-40 years, participated in the study.

Results for the three groups of dataset images are shown in Table 2, in which each entry denotes how many times the restored image produced by applying the method in the row to the distorted image in the column was selected. The last column of Table 2 reports the average probability (the method's share of all recorded selections on that dataset), which indicates how likely the method in the row is to produce the most preferred result. As can be observed, in most cases our method was selected more frequently than the others, demonstrating that images produced by our method are generally perceived to be of higher visual quality than those produced by the competing methods.

 

Table 2. Subjective results tested on sample images randomly selected from the CUFED5, WR_SR, and XRIR datasets.

CUFED5

| Image ID | '004' | '005' | '011' | '024' | '025' | '026' | '031' | '042' | '057' | '058' | '066' | '069' | '073' | '077' | '079' | '086' | '089' | '091' | '097' | '111' | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RL-Restore | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.006 |
| OWAN | 0 | 1 | 2 | 0 | 4 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0.043 |
| HOWAN | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006 |
| RMBN | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.012 |
| MEPS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.009 |
| DnCNN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 |
| DuRN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.003 |
| MIRNet | 0 | 1 | 2 | 0 | 4 | 0 | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0.043 |
| COLA-Net | 0 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.024 |
| SwinIR | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.012 |
| Restormer | 0 | 2 | 1 | 0 | 5 | 0 | 0 | 1 | 1 | 3 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 2 | 1 | 2 | 0.070 |
| DoubleUNet | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.003 |
| W-Net | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0.006 |
| StackUNet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.009 |
| TTSR | 0 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.024 |
| RefVAE | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.006 |
| MASA | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0.015 |
| DATSR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.003 |
| Ref-IRT | 14 | 11 | 9 | 13 | 6 | 14 | 12 | 12 | 11 | 11 | 14 | 13 | 12 | 12 | 9 | 12 | 11 | 11 | 12 | 12 | 0.704 |

WR_SR

| Image ID | '002' | '004' | '014' | '016' | '019' | '024' | '026' | '028' | '029' | '030' | '032' | '037' | '039' | '053' | '058' | '060' | '061' | '063' | '067' | '079' | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RL-Restore | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 |
| OWAN | 0 | 4 | 5 | 4 | 0 | 5 | 4 | 0 | 2 | 0 | 2 | 0 | 4 | 1 | 0 | 0 | 3 | 3 | 0 | 1 | 0.089 |
| HOWAN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.012 |
| RMBN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 1 | 0.021 |
| MEPS | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0.007 |
| DnCNN | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.014 |
| DuRN | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.014 |
| MIRNet | 0 | 3 | 2 | 3 | 5 | 4 | 3 | 2 | 3 | 3 | 0 | 3 | 1 | 3 | 1 | 0 | 4 | 7 | 0 | 1 | 0.113 |
| COLA-Net | 0 | 0 | 0 | 2 | 0 | 4 | 3 | 3 | 0 | 0 | 0 | 4 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0.045 |
| SwinIR | 0 | 0 | 0 | 1 | 0 | 2 | 5 | 3 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 2 | 0.042 |
| Restormer | 1 | 4 | 2 | 1 | 4 | 2 | 2 | 2 | 1 | 5 | 0 | 0 | 2 | 1 | 2 | 0 | 7 | 5 | 0 | 1 | 0.099 |
| DoubleUNet | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0.024 |
| W-Net | 0 | 1 | 0 | 4 | 5 | 5 | 1 | 1 | 3 | 2 | 0 | 2 | 0 | 2 | 1 | 2 | 1 | 5 | 0 | 0 | 0.082 |
| StackUNet | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.016 |
| TTSR | 0 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0.021 |
| RefVAE | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.009 |
| MASA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0.009 |
| DATSR | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.024 |
| Ref-IRT | 13 | 10 | 12 | 5 | 6 | 2 | 6 | 6 | 8 | 8 | 10 | 0 | 7 | 10 | 7 | 10 | 9 | 5 | 13 | 5 | 0.358 |

XRIR

| Image ID | '040' | '046' | '050' | '055' | '058' | '074' | '083' | '093' | '102' | '104' | '105' | '111' | '132' | '146' | '153' | '158' | '164' | '171' | '178' | '185' | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RL-Restore | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.005 |
| OWAN | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 1 | 0 | 3 | 2 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0.034 |
| HOWAN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0.002 |
| RMBN | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0.029 |
| MEPS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0.005 |
| DnCNN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.007 |
| DuRN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.000 |
| MIRNet | 0 | 4 | 0 | 2 | 6 | 2 | 3 | 3 | 5 | 6 | 2 | 3 | 1 | 3 | 2 | 3 | 1 | 0 | 1 | 4 | 0.124 |
| COLA-Net | 0 | 0 | 0 | 0 | 1 | 0 | 6 | 1 | 0 | 0 | 1 | 1 | 3 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 0.044 |
| SwinIR | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.019 |
| Restormer | 0 | 1 | 0 | 7 | 3 | 8 | 0 | 0 | 2 | 5 | 2 | 6 | 3 | 6 | 2 | 0 | 1 | 0 | 0 | 4 | 0.122 |
| DoubleUNet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0.010 |
| W-Net | 0 | 1 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 1 | 1 | 2 | 2 | 0 | 1 | 0 | 1 | 0 | 2 | 0 | 0.039 |
| StackUNet | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.007 |
| TTSR | 0 | 6 | 0 | 0 | 2 | 4 | 1 | 2 | 2 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0.061 |
| RefVAE | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.012 |
| MASA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.017 |
| DATSR | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.015 |
| Ref-IRT | 13 | 10 | 14 | 12 | 5 | 7 | 6 | 5 | 8 | 8 | 4 | 7 | 10 | 4 | 9 | 13 | 13 | 12 | 12 | 12 | 0.448 |

 

 

3. Performance of MPENet

In Ref-IRT, MPENet is employed to predict the three distortion parameter values of a multi-degraded image, which are then used to add the same levels of distortion to the reference image for better content/texture matching. Here, we evaluate the performance of MPENet in distortion parameter estimation. To this end, the PLCC/SROCC values between the estimated distortion parameters and the recorded ground-truth distortion parameters were calculated for each dataset. Table 3 shows the PLCC/SROCC results computed for each distortion type in the three datasets (i.e., CUFED5 [3], WR_SR [4], and XRIR). PLCC/SROCC values closer to one indicate better prediction.
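
As a minimal sketch of this evaluation (assuming SciPy; the arrays below are synthetic placeholders rather than our data), the two correlation metrics can be computed as follows:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 4.0, 100)        # placeholder ground-truth values (e.g., blur strength)
pred = gt + rng.normal(0.0, 0.1, 100)  # placeholder MPENet estimates
plcc, _ = pearsonr(pred, gt)           # Pearson linear correlation coefficient (PLCC)
srocc, _ = spearmanr(pred, gt)         # Spearman rank-order correlation coefficient (SROCC)
print(f'PLCC = {plcc:.3f}, SROCC = {srocc:.3f}')
```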

 

Table 3. PLCC and SROCC values computed for each distortion type in the three testing datasets.

| Metric | CUFED5 Blur | CUFED5 Noise | CUFED5 JPEG | WR_SR Blur | WR_SR Noise | WR_SR JPEG | XRIR Blur | XRIR Noise | XRIR JPEG |
|---|---|---|---|---|---|---|---|---|---|
| PLCC | 0.980 | 0.999 | 0.999 | 0.966 | 0.999 | 0.999 | 0.967 | 0.999 | 0.998 |
| SROCC | 0.985 | 0.998 | 1.000 | 0.961 | 0.998 | 1.000 | 0.970 | 0.998 | 1.000 |

 

As observed in Table 3, in general, all distortion parameter values can be predicted well, though the performance on Gaussian blur is somewhat weaker. This is because the blur artifact can easily be masked by the subsequent noise and compression distortions. In addition, the non-strict boundary between a blurry image and its pristine version adds further difficulty to blur parameter estimation; for example, a pristine image that focuses on a specific object can also exhibit blur in the surrounding areas. Despite this inaccuracy, we believe that the proposed MPENet is adequate for the distortion parameter estimation task, because only similar (not identical) distortions are required on the reference image to achieve a decent matching result.

 

4. More details about Ref-IRT+

(1) Network architecture of the modified MPENet

To narrow the performance gap of Ref-IRT between synthesized distortions and real-world applications, we additionally trained our method on images corrupted using a practical degradation model [1]. In addition to JPEG compression, this degradation model considers different Gaussian blur kernels (i.e., isotropic and anisotropic) to generate the blur distortion, different Gaussian noise models (i.e., channel-independent additive white Gaussian noise (AWGN), gray-scale AWGN, and the general case) to generate the noise distortion, and different sequential orders in which the distortions are applied, thereby expanding the degradation space.

 


Figure 3. Network architecture of the modified MPENet in Ref-IRT+.

 

To enable our approach to work with this new degradation model, the distortion parameter estimation (DPE) block in MPENet has to be modified, since more distortion parameters are required: (1) the 2×2 covariance matrix of the multivariate normal distribution used to generate the isotropic/anisotropic Gaussian blur kernels; (2) the 3×3 covariance matrix of the Gaussian noise model; and (3) the quality parameter of the JPEG compression. Since the two covariance matrices are symmetric, the numbers of distortion parameters for the three distortion types are three, six, and one, respectively. The modified MPENet must also predict the sequential order in which the blur and noise distortions were added to the image, because the blur distortion can reduce the perceived noise strength. The modified MPENet is shown in Figure 3, in which branch (a) predicts the sequential order and branch (b) predicts the ten distortion parameters. Note that (a) and (b) are fed the same three feature vectors, which are obtained from the three preceding average pooling layers. In addition, the modified MPENet takes the RGB color image as input, instead of the luminance image, because channel-independent, gray-scale, or generalized AWGN may be added across the three channels.
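
For illustration, the following Python sketch (a simplified stand-in for the practical degradation model of [1], under one fixed blur-noise ordering; not the authors' training code) shows how the ten parameters enter the degradation: three from the symmetric 2×2 blur covariance, six from the symmetric 3×3 noise covariance, and one JPEG quality factor.

```python
import io
import numpy as np
from PIL import Image
from scipy.ndimage import convolve

def gaussian_kernel(cov, size=21):
    """Isotropic/anisotropic blur kernel from a 2x2 covariance matrix
    (3 free parameters, since the matrix is symmetric)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    pts = np.stack([xx, yy], axis=-1)
    k = np.exp(-0.5 * np.einsum('...i,ij,...j', pts, np.linalg.inv(cov), pts))
    return k / k.sum()

def degrade(img, cov_blur, cov_noise, q):
    """Blur -> correlated Gaussian noise -> JPEG (one possible ordering)."""
    x = img.astype(np.float64) / 255.0
    k = gaussian_kernel(cov_blur)
    x = np.stack([convolve(x[..., c], k) for c in range(3)], axis=-1)
    # general-case noise: 3x3 symmetric channel covariance (6 free parameters)
    n = np.random.multivariate_normal(np.zeros(3), cov_noise, x.shape[:2])
    x = np.clip(x + n, 0.0, 1.0)
    buf = io.BytesIO()
    Image.fromarray((x * 255).astype(np.uint8)).save(buf, 'JPEG', quality=q)  # 1 parameter
    buf.seek(0)
    return np.array(Image.open(buf))

# e.g.: degrade(np.array(Image.open('ref.png').convert('RGB')),
#               np.array([[4.0, 1.5], [1.5, 1.0]]), 1e-3 * np.eye(3), 40)
```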

 

(2) Visual results on different distortion types/intensities

Ref-IRT+ was tested on three distortion combinations (blur + JPEG, noise + JPEG, and blur + noise + JPEG) at three distortion intensities (mild, moderate, and severe); thus, nine distortion scenarios were considered in the test. Figure 4 shows the visual results of Ref-IRT+ on sample distorted images generated from the pristine images in the CUFED5 dataset. The first column shows the input images; the second and third columns show the results produced by the first stage (Ref-IRT-I+) and the final stage of Ref-IRT+, respectively. As can be observed, the reference image is less likely to help when images are corrupted by noise and JPEG compression. Also, when images are only mildly distorted, the reference image might be unnecessary, because a deep CNN model alone can be good enough to recover mildly-distorted image content.

 

[Figure: three groups of results — Blur + JPEG, Noise + JPEG, and Blur + Noise + JPEG — each shown at Mild, Moderate, and Severe intensities; within each case, the columns show the Input, Ref-IRT-I+ (first stage), and Ref-IRT+ (final stage) images.]

Figure 4. Visual results of Ref-IRT+ tested on sample distorted images generated from the pristine images in the CUFED5 dataset with different distortion types and intensities.

 

5. Test on real-world images

We also tested our algorithm on real-world images. To this end, images taken from the LIVE Challenge dataset and images captured by our own camera were used for testing. For the LIVE Challenge test, the reference images were randomly selected from the 127 pristine images in the LIVE [5], CSIQ [6], and CBSD68 [7] datasets. For the test on our own images, the target image was captured by a lower-quality web camera, and the reference image was captured by a higher-quality camera viewing a similar scene. Visual results of Ref-IRT+ on sample images from the LIVE Challenge dataset [2] are shown in Figure 5, and a visual comparison of Ref-IRT vs. other MDIR methods on an image captured by our own camera is shown in Figure 6. As can be observed, Ref-IRT+ is able to remove the blur, noise, and compression artifacts in real-world images thanks to training with the practical degradation model [1]. Moreover, by exploiting a reference image, Ref-IRT achieves better results than the other MDIR methods.

 

[Figure: two Input / Ref-IRT+ result pairs.]

Figure 5. Visual results of Ref-IRT+ tested on sample real-world images from the LIVE Challenge dataset.

 

 

[Figure: a grid of results — Input, RL-Restore, OWAN, HOWAN, RMBN, MEPS, DnCNN, DuRN, MIRNet, COLA-Net, SwinIR, Restormer, DoubleUNet, W-Net, StackUNet, TTSR, RefVAE, MASA, DATSR, and Ref-IRT.]

 

Figure 6. Visual comparison of Ref-IRT vs. other MDIR methods tested on a real-world image captured by using a web camera.

 

References

[1] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4791–4800.

[2] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2016.

[3] Z. Zhang, Z. Wang, Z. Lin, and H. Qi, “Image super-resolution by neural texture transfer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7982–7991.

[4] Y. Jiang, K. C. Chan, X. Wang, C. C. Loy, and Z. Liu, “Robust reference-based super-resolution via C2-matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2103–2112.

[5] H. R. Sheikh, Z. Wang, A. C. Bovik, and L. K. Cormack, “Image and video quality assessment research at LIVE.” [Online]. Available: http://live.ece.utexas.edu/research/quality/.

[6] E. C. Larson and D. M. Chandler, “Most apparent distortion: full reference image quality assessment and the role of strategy,” Journal of Electronic Imaging, vol. 19, no. 1, p. 011006, 2010.

[7] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in IEEE International Conference on Computer Vision (ICCV), vol. 2, 2001, pp. 416–423.