diff --git a/paper/RCN_CVPR_ID0000.tex b/paper/RCN_CVPR_ID0000.tex index fbb1c65..90ab19d 100644 --- a/paper/RCN_CVPR_ID0000.tex +++ b/paper/RCN_CVPR_ID0000.tex @@ -139,38 +139,36 @@ \subsection{Recursive Neural Network in Computer Vision} In addition, we demonstrate that very deep recursions significantly boost the performance. We apply the same convolution up to 25 times (previous maximum is three). It is an interesting future direction to see if a single-recursive-layer approach can work for other tasks. \section{Proposed Method} -In this section, our proposed method is explained. We first propose a model using recursive convolutions and discuss its limitations. Then we explain our extensions to overcome the issues and the training procedure. +In this section, we explain our proposed method. We first present a basic model using recursive convolutions and discuss its limitations. Then we propose an improved model and its training procedure. \begin{figure*}[t] \includegraphics[width=\textwidth]{figs/f1} - \caption {Overview of our method. (a): Our model architecture. A given input, low resolution image, passes through embedding network, inference network, and reconstruction network to be super-resolved. (b): Detailed description of embedding network. (c): Detailed description of inference network. (d): Detailed description of reconstruction network.} + \caption {Architecture of our basic model. It consists of three parts: embedding network, inference network and reconstruction network. The inference network has a recursive layer; its unfolded version is shown in Figure \ref{fig:inference_network}.} \label{fig:overview} \end{figure*} -\subsection{Simple Model} +\subsection{Basic Model} -Simonyan and Zisserman \cite{simonyan2015very} have demonstrated the effectiveness of stacking small filters many times to make the network (very) deep. We similarly use small filters ($3\times3$) for all conv. layers. -Our network configuration is outlined in Figure \ref{fig:overview}. +Our network, outlined in Figure \ref{fig:overview}, consists of three sub-networks: embedding, inference and reconstruction networks. \textbf{Embedding net} is used to represent the given image as feature maps ready for inference. Next, \textbf{inference net} solves the task. Once inference is done, the final feature maps in the inference net are fed into \textbf{reconstruction net} to generate the output image. -It consists of three sub-networks: embedding, inference and reconstruction networks. \textbf{Embedding net} is used to represent the given image as feature maps ready for inference. Next, \textbf{inference net} solves the task. Once inference is done, final feature maps in inference net are fed into \textbf{reconstruction net} to generate the output image. +We now look into each sub-network. \textbf{Embedding net} takes the input image (grayscale or RGB) and represents it as a set of feature maps. The intermediate representation used to pass information to the inference net largely depends on how the inference net internally represents its feature maps in its hidden layers. Learning this representation is done end-to-end, together with learning the other sub-networks. -We now look into each sub-networks. \textbf{Embedding net} takes the input image (grayscale or RGB) and represent it as a set of feature maps. Intermediate representation used to pass information to inference net largely depends on how inference net internally represent its feature maps in their hidden layers. 
Learning this representation is done end-to-end altogether with learning other sub-networks. - -\textbf{Inference net} is the main component that solves the task, super-resolution. It needs to serve two purposes: analysis and synthesis. Analysis requires large region to understand the given image and synthesis needs a highly non-linear regression function. Both are handled entirely with a single convolutional layer deeply-recursive. Each recursion widens receptive field and increases non-linearities at the same time. As analysis and synthesis are intertwined in the recursive layer, division of network capacity between analysis and synthesis is naturally handled while learning the model. +\textbf{Inference net} is the main component that solves the task, super-resolution. Analyzing a large image region is done by a single recursive layer. Each recursion applies the same convolution followed by a rectified linear unit. With convolution filters larger than $1\times 1$, the receptive field is widened with every recursion. \begin{figure} \includegraphics[width=0.5\textwidth]{figs/f2} - \caption {An illustration of inference network. \textbf{Left}: A folded version of inference network. \textbf{Right}: An unfolded version of inference network along with recursion. The same filters W are applied to feature maps recursively. By doing this, our model can utilize very large context just with the relatively small number of parameters.} + \caption {Unfolding the inference network. \textbf{Left}: A recursive layer. \textbf{Right}: Unfolded structure. The same filters $W$ are applied to feature maps recursively. Our model can utilize a very large context without adding new weight parameters.} + \label{fig:inference_network} \end{figure} -While feature maps from the final application of the recursive layer represent the high-resolution image, transformation of them back into the original image space is necessary. This is done by \textbf{reconstruction net}. +While feature maps from the final application of the recursive layer represent the high-resolution image, transforming them (multi-channel) back into the original image space (1- or 3-channel) is necessary. This is done by \textbf{reconstruction net}. -We have a single hidden layer for each sub-net. Only the layer for inference net is recursive. Other sub-nets are vastly similar to standard mutilayer perceptrons (MLP) with a single hidden layer. For MLP, full connection of $F$ neurons is equivalent to a convolution with $1\times 1\times F \times F$. In our sub-nets, we use $3\times 3\times F \times F$ filters. This is because image gradients are more informative than the raw intensities for super-resolution (TODO cite). +We have a single hidden layer for each sub-net. Only the layer for the inference net is recursive. The other sub-nets are vastly similar to standard multilayer perceptrons (MLPs) with a single hidden layer. For an MLP, a full connection of $F$ neurons is equivalent to a convolution with $1\times 1\times F \times F$. In our sub-nets, we use $3\times 3\times F \times F$ filters. For the embedding net, we use $3\times 3$ filters because image gradients are more informative than raw intensities for super-resolution. For the inference net, $3\times 3$ convolutions imply that hidden states are passed to adjacent pixels only. The reconstruction net also considers direct neighbors for the transformation.
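For concreteness, the following is a minimal PyTorch-style sketch of this basic architecture. It is an illustration only, not the MatConvNet implementation we use for our experiments; the class name, the number of feature maps and the number of recursions are placeholders.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicModel(nn.Module):
    # Placeholder sizes; every filter is 3x3 and is followed by a ReLU.
    def __init__(self, channels=1, feats=64, depth=16):
        super().__init__()
        # Embedding net: input layer plus one hidden layer.
        self.embed1 = nn.Conv2d(channels, feats, 3, padding=1)
        self.embed2 = nn.Conv2d(feats, feats, 3, padding=1)
        # Inference net: a single convolution reused at every recursion.
        self.recursive = nn.Conv2d(feats, feats, 3, padding=1)
        # Reconstruction net: one hidden layer, then back to image space.
        self.recon1 = nn.Conv2d(feats, feats, 3, padding=1)
        self.recon2 = nn.Conv2d(feats, channels, 3, padding=1)
        self.depth = depth

    def forward(self, x):  # x: interpolated low-resolution image
        h = F.relu(self.embed2(F.relu(self.embed1(x))))  # embedding net
        for _ in range(self.depth):
            h = F.relu(self.recursive(h))                # same convolution each recursion
        return self.recon2(F.relu(self.recon1(h)))       # reconstruction net
\end{verbatim}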
-\textbf{Mathematical Formulation} Now we give the mathematical formulation of our model. The network takes an interpolated input image (to the desired size) as input and predicts the target image as in SRCNN \cite{Dong2014}. Given an input image ${\bf x}$ (low-resolution) and an output image ${\bf y}$ (high-resolution), our goal is to learn a model $f$ that predicts values $\mathbf{\hat{y}}=f(\mathbf{x})$. We also discuss predicting residuals ${\bf y - x} $ in later sections, but we assume ${\bf y}$ is the output of our network for the moment. +\textbf{Mathematical Formulation} Now we give the mathematical formulation of our model. The network takes an interpolated input image (to the desired size) as input ${\bf x}$ and predicts the target image ${\bf y}$ as in SRCNN \cite{Dong2014}. Our goal is to learn a model $f$ that predicts values $\mathbf{\hat{y}}=f(\mathbf{x})$. - Let $f_1, f_2, f_3$ denote sub-net functions: embedding, inference and reconstruction, respectively. Our model is the composition of three functions: $f({\bf y}) = f_3(f_2 (f_1({\bf x}))).$ We use convolution followed by ReLUs (rectified linear unit) for all hidden layers. + Let $f_1, f_2, f_3$ denote sub-net functions: embedding, inference and reconstruction, respectively. Our model is the composition of three functions: $f({\bf x}) = f_3(f_2 (f_1({\bf x}))).$ Embedding net $f_1({\bf x})$ takes the input vector ${\bf x}$ and computes the matrix output $H_0$, which is an input to the inference net $f_2$. Hidden layer values are denoted by $H_{-1}$. The formula for embedding net is as follows: \begin{align} @@ -178,7 +176,7 @@ \subsection{Simple Model} H_0 &= max(0, W_{0}*H_{-1} + b_0)\\ f_1({\bf x}) &= H_0, \end{align} -the operator $*$ denotes a convolution and $max(0,\cdot)$ corresponds to a ReLU. Weight and bias matrices are $W_{-1},W_0$ and $b_{-1},b_0$. +where the operator $*$ denotes a convolution and $max(0,\cdot)$ corresponds to a ReLU. The weight and bias matrices are $W_{-1},W_0$ and $b_{-1},b_0$. Inference net $f_2$ takes the input matrix $H_0$ and computes the matrix output $H_{D}$. Here, we use the same weight and bias matrices $W$ and $b$ for all operations. Let $g$ denote the function modeled by a single recursion of the recursive layer: $g(H)=max(0,W*H+b)$. The recurrence relation is \begin{equation} @@ -186,7 +184,7 @@ \subsection{Simple Model} \end{equation} for $d = 1, ..., D$. -Inference net $f_2$ is equivalent to the composition of elementary function: +Inference net $f_2$ is equivalent to the composition of the same elementary function $g$: \begin{equation} f_2(H) = g \circ g \circ \cdots \circ g(H) = g^{D}(H). \end{equation} @@ -200,49 +198,55 @@ \subsection{Simple Model} \textbf{Model Properties} Now we have all components for our model $f({\bf x}) = f_3(f_2 (f_1({\bf x})))$. The recursive model has pros and cons. One good thing is that the model requires no new parameters for additional recursions. That means widening receptive field to utilize larger image context can be done with the network capacity fixed. -While the recursive model is simple and powerful, we find training a deeply-recursive network very difficult. The maximum number of recursions successful in previous methods is three \cite{Liang_2015_CVPR}. Among many reasons, two severe problems are \textit{vanishing} and \textit{exploding gradients} \cite{bengio1994learning, pascanu2013difficulty}. TODO Fixed-vector memory problems??? +While the recursive model is simple and powerful, we find training a deeply-recursive network very difficult. Also, the maximum number of recursions used successfully in previous methods is three \cite{Liang_2015_CVPR}. Among many reasons, two severe problems are \textit{vanishing} and \textit{exploding gradients} \cite{bengio1994learning, pascanu2013difficulty}.
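The following toy snippet makes this concrete (it is purely illustrative and not part of our model): backpropagating through $D$ applications of one shared scalar weight scales the gradient reaching the earliest recursion by the $D$-th power of that weight, so the gradient vanishes or explodes exponentially with depth.

\begin{verbatim}
import torch

def grad_wrt_earliest_state(weight, depth):
    # Reuse the same weight at every recursion and measure the gradient
    # of the final state with respect to the first state.
    h0 = torch.tensor(1.0, requires_grad=True)
    w = torch.tensor(weight)
    h = h0
    for _ in range(depth):
        h = w * h
    h.backward()
    return h0.grad.item()  # equals weight ** depth

for w in (0.9, 1.1):
    print(w, [round(grad_wrt_earliest_state(w, d), 3) for d in (1, 10, 25)])
# weight 0.9: 0.9, 0.349, 0.072  (gradients vanish with depth)
# weight 1.1: 1.1, 2.594, 10.835 (gradients explode with depth)
\end{verbatim}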
-The \textit{exploding gradients} problem refers to the large increase in the norm +\textit{Exploding gradients} refer to the large increase in the norm of the gradient during training. Such events are due to -the explosion of the long term components, which can grow exponentially more than short term ones. +the multiplicative nature of chained gradients. Long-term components can grow exponentially, and this problem arises only for deep recursions. The -\textit{vanishing gradients} problem refers to the opposite behavior, -when long term components go exponentially -fast to norm 0, making it impossible for the model to -learn correlation between distant pixels. +\textit{vanishing gradients} problem refers to the opposite behavior. Long-term components approach the zero vector exponentially +fast, which makes learning the relation between distant pixels very hard. In addition to gradient problems, there exists an issue with finding the optimal number of recursions. In theory, very large recursions are always good since the network sees more image region and we hope it learns to keep the important information and discard the unnecessary. But in practice, a fixed-sized vector carrying all context information until the end of all recursions might lack capacity if recursions are too deep. -To resolve the gradient and capacity issues, we propose an extended model. +To resolve the gradient problems and the issue of choosing the number of recursions, we propose an advanced model. -\subsection{Extended Model} -To overcome the weaknesses of the simple model, we propose two extensions: deep-supervision and residual-learning. Deep-supervision for convolutional network is first proposed in Lee et al \cite{lee2014deeply}. Their method simultaneously minimizes classification error while improving the directness and transparency of the hidden layer learning process. +\subsection{Advanced Model} +To overcome the weaknesses of the basic model, we propose two extensions: recursive-supervision and skip-connection. Deep-supervision for convolutional networks was first proposed by Lee et al. \cite{lee2014deeply}. Their method simultaneously minimizes classification error while improving the directness and transparency of the hidden layer learning process. -\textbf{Deep-Supervision} We deeply-supervise all recursions in order to alleviate the effect of vanishing/exploding gradients. There are two significant differences from our deep-supervision to the original deep-supervision. In \cite{lee2014deeply}, they associate one classifier for each hidden layer. For each additional layer, new classifier has to be introduced and new parameters thereby. If this approach is used, our modified network looks as in Figure \ref{fig: TODO}. We need $D$ different reconstruction networks. This is against our original purpose of using recursive networks: not introducing new parameters while stacking more layers. +\textbf{Recursive-Supervision} We supervise all recursions in order to alleviate the effect of vanishing/exploding gradients. There are two significant differences between our recursive-supervision and the original deep-supervision. In \cite{lee2014deeply}, one classifier is associated with each hidden layer. For each additional layer, a new classifier, and thereby new parameters, has to be introduced.
If this approach is used, our modified network looks as in Figure \ref{fig:recursive_supervision}. We need $D$ different reconstruction networks. This is against our original purpose of using recursive networks: not introducing new parameters while stacking more layers. -As we have assumed that the same representation can be used again and again during convolutions in inference net, much better regularization is to use the same recon. net for all recursions. Our recon. net now outputs $D$ predictions and all predictions are simultaneously supervised during training. +As we have assumed that the same representation can be used again and again during the convolutions in the inference net, a much better regularization is to use the same recon. net for all recursions. Our recon. net now outputs $D$ predictions, and all predictions are simultaneously supervised during training (Figure \ref{fig:recursive_supervision}). The second difference of our method to the original deep-supervision is that we use all $D$ intermediate predictions to compute the final output. All predictions are averaged during testing. The optimal weights are automatically learned during training. In contrast, \cite{lee2014deeply} discard all intermediate classifiers during testing. -Our deep-supervision naturally eases the difficulty of training recursive networks. Backpropagation goes through small number of layers if supervising signal goes directly from loss layer to early recursion. Summing all gradients backpropagated from different prediction losses give smoothing effect. The effect of vanishing/exploding gradients along one backpropagation path is alleviated. +Our recursive-supervision naturally eases the difficulty of training recursive networks. Backpropagation goes through only a small number of layers when the supervising signal passes directly from the loss layer to an early recursion. Summing all gradients backpropagated from the different prediction losses gives a smoothing effect, so the effect of vanishing/exploding gradients along any single backpropagation path is alleviated. + +Moreover, the importance of picking the optimal number of recursions is reduced, as our supervision utilizes predictions from all intermediate layers. If recursions are too deep for the given task, we expect the weights for the intermediate predictions to be high. By looking at the weights of the predictions, we can estimate the marginal gain from additional recursions. -Moreover, the importance of picking the optimal number of recursions is reduced as our deep-supervision enables utilizing predictions from all intermediate layers. If recursions are too many for the given task, we expect the weight for the intermediate prediction high. By looking at weights of predictions, we can figure out the effectiveness of additional recursions. +\textbf{Skip-Connection} Now we describe our second extension: the skip-connection. We find that input and output images are highly correlated, so carrying most, if not all, of the input values until the end of the network is necessary. Due to the gradient problems, however, learning that input and output are largely similar is very difficult with deep recursions. -\textbf{Residual-Learning} Now we describe our second extension: residual-learning. If input and output signals are in the same vector space and highly correlated, predicting only the difference is very useful for two reasons. First, network capacity to store the input signal is saved. Second, the exact copy of input signal can be easily lost during many recursions.
Instead of learning direct mapping from input to output, we predict the residual only and then input signal is added back for output. +Adding layer skips \cite{bishop2006pattern} is successfully used for a semantic segmentation network \cite{long2014fully}. We employ a similar idea: the input image is directly fed into the recon. net. The skip-connection has two advantages. First, network capacity to store the input signal during recursions is saved. Second, an exact copy of the input signal can be used during target prediction. -Our residual learning is simple yet very effective. In super-resolution, LR and HR images are vastly similar. In most regions, differences are zero and only small number of locations have non-zero values. Modeling image details is often used in super-resolution methods \cite{Timofte2013, Timofte, bevilacqua2012,bevilacqua2013super}, but we demonstrate that this domain-specific knowledge can significantly improve an general end-to-end learning method like deep-learning, especially if the net is very recursive. -As illustrated in Figure \ref{fig: TODO}, the main network now learns very sparse predictions and the required capacity to solve the task gets significantly reduced. +Our skip-connection is simple yet very effective. In super-resolution, LR and HR images are vastly similar. In most regions, differences are zero and only a small number of locations have non-zero values. Modeling image details is often used in super-resolution methods \cite{Timofte2013, Timofte, bevilacqua2012,bevilacqua2013super}, but we demonstrate that this domain-specific knowledge can significantly improve a general end-to-end learning method like deep-learning, especially if the net is deeply-recursive. +As illustrated in Figure \ref{fig: TODO}, the inference network now learns a mapping to very sparse predictions and the required capacity to solve the task is significantly reduced. -\textbf{Mathematical Formulation} We revisit the mathematical formulation of our model with two extensions. In the simple model, the prediction is $\hat{{\bf y}} = f_3(H_D)$, where $f_3$ and $H_D$ denote the recon net and the final hidden state of the inference net, respectively. With residual-learning, +\textbf{Mathematical Formulation} We revisit the mathematical formulation of our model with the two extensions. In the basic model, the prediction is $\hat{{\bf y}} = f_3(H_D)$, where $f_3$ and $H_D$ denote the recon. net and the final hidden state of the inference net, respectively. With the skip-connection, \begin{equation} -\hat{{\bf y}} = {\bf x} + f_3(f_2(f_1({\bf x}))) = {\bf x} + f_3(g^{(D)}(f_1({\bf x}))). +\hat{{\bf y}} = f_3'({\bf x}, f_2(f_1({\bf x}))). \end{equation} -Each intermediate prediction under deep-supervision is + +Each intermediate prediction under recursive-supervision is \begin{equation} -\hat{{\bf y}}_{d} = {\bf x} + f_3(g^{(d)}(f_1({\bf x}))). +\hat{{\bf y}}_{d} = f_3'({\bf x}, g^{(d)}(f_1({\bf x}))). \end{equation} +The recon. net with skip-connection, $f_3'({\bf x}, H_d)$, can take various functional forms. For example, the input can be concatenated to the feature maps $H_d$. As the input is an interpolated input image (roughly speaking, $\hat{\bf y} \approx {\bf x}$), we find that $f_3'({\bf x}, H_d) = {\bf x} + f_3(H_d)$ is enough for our purpose.
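A minimal PyTorch-style sketch of the advanced model is given below (again an illustration with placeholder names and sizes, not the MatConvNet implementation we use): the shared recon. net is applied after every recursion, the interpolated input is added back through the skip-connection, and the intermediate predictions are combined with learned weights $w_d$ as described next.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdvancedModel(nn.Module):
    # Recursive-supervision with a shared recon. net, plus skip-connection.
    def __init__(self, channels=1, feats=64, depth=16):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(channels, feats, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU())
        self.recursive = nn.Conv2d(feats, feats, 3, padding=1)    # g, reused D times
        self.recon = nn.Sequential(                               # f3, shared by all recursions
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feats, channels, 3, padding=1))
        self.w = nn.Parameter(torch.full((depth,), 1.0 / depth))  # learned weights w_d
        self.depth = depth

    def forward(self, x):
        h = self.embed(x)
        preds = []
        for _ in range(self.depth):
            h = F.relu(self.recursive(h))
            preds.append(x + self.recon(h))  # skip-connection: y_d = x + f3(H_d)
        y_final = sum(w * y for w, y in zip(self.w, preds))  # weighted combination
        return preds, y_final
\end{verbatim}

All $D$ predictions in \texttt{preds} are supervised during training, and \texttt{y\_final} is the output used at test time.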
+ + Now, the final output is the weighted average of all intermediate predictions: \begin{equation} \hat{{\bf y}}^{(final)} = \sum_{d=1}^{D} w_d \cdot \hat{{\bf y}}_d. @@ -261,19 +265,18 @@ \subsection{Extended Model} \includegraphics[width=\textwidth]{figs/f3} \caption{Deep supervision illustration. (a): Original form of deep supervision. To supervise the feature maps at every $d = 1, ..., D$, we need a corresponding reconstruction network for each of them, which results in a large number of parameters. (b): Our recursive-supervision. Unlike in (a), the reconstruction network is shared among feature maps in a directed acyclic graph manner. This is possible because the space of feature maps remains the same in our method. Furthermore, we not only use the supervision to regularize intermediate feature maps, but also use all the outputs from the intermediate recursions to obtain the final output by a weighted sum. So we can utilize useful information even though it emerges from an early stage of recursion.} \end{center} +\label{fig:recursive_supervision} \end{figure*} \subsection{Training} -\textbf{Objective} We now describe the training objective to minimize to find optimal parameters of our model. Let ${\bf x}$ denote an interpolated low-resolution image and ${\bf y}$ a high-resolution image. -Given training dataset $\{{\bf x}^{(i)},{\bf y}^{(i)}\}{}_{i=1}^{N}$, our goal is to learn a model $f$ that predicts values $\mathbf{\hat{y}}=f(\mathbf{x})$. +\textbf{Objective} We now describe the training objective minimized to find the optimal parameters of our model. Given a training dataset $\{({\bf x}^{(i)},{\bf y}^{(i)})\}_{i=1}^{N}$, our goal is to find the best model $f$ that predicts values $\mathbf{\hat{y}}=f(\mathbf{x})$. -In the least-squares regression setting, typically used in super-resolution -problems, the mean squared error $\frac{1}{2}||\mathbf{y}-f(\mathbf{x})||^{2}$ +In the least-squares regression setting, typical in SR, the mean squared error $\frac{1}{2}||\mathbf{y}-f(\mathbf{x})||^{2}$ averaged over training set is minimized. This favors high Peak Signal-to-Noise -Ratio (PSNR), a widely-used evaluation criteria for SR. +Ratio (PSNR), a widely-used evaluation criterion. -With deep-supervision, we have $D+1$ objectives: supervising $D$ outputs from recursions and the final output. For intermediate outputs, we have the loss function +With recursive-supervision, we have $D+1$ objectives: supervising the $D$ outputs from the recursions and the final output. For the intermediate outputs, we have the loss function \begin{equation} l_1(\theta) = \sum_{d=1}^D \sum_{i=1}^N \frac{1}{2}||{\bf y}^{(i)} - \hat{\bf y}_d^{(i)} ||^{2}, \end{equation} @@ -288,7 +291,16 @@ \subsection{Training} \end{equation} where $\alpha$ denotes the importance of the companion objective on the intermediate outputs and $\beta$ denotes the multiplier of weight decay. Setting $\alpha$ high makes the training procedure stable as early recursions easily converge. As training progresses, $\alpha$ decays to boost the performance of the final output. -Training is carried out by optimizing the regression objective using mini-batch gradient descent based on back-propagation (LeCun et al. \cite{lecun1998gradient}). We set the momentum parameter to 0.9 and weight decay to 0.0001. +Training is carried out by optimizing the regression objective using mini-batch gradient descent based on back-propagation (LeCun et al. \cite{lecun1998gradient}). We implement our model using the \textit{MatConvNet}\footnote{\url{http://www.vlfeat.org/matconvnet/}} package \cite{arXiv:1412.4564}. + +\section{Experimental Results} +TODO replace.
Simonyan and Zisserman \cite{simonyan2015very} have demonstrated the effectiveness of stacking small filters many times and making a network (very) deep. We similarly use small filters ($3\times3$) for all conv. layers. + + +TODO replace. We set the momentum parameter to 0.9 and weight decay to 0.0001. + +TODO initialization + % %Training deep models often fail to converge. He et al. \cite{he2015delving} uses a theoretically sound initialization method which helps very deep models converge when training from scratch and they succeed in training 30 weight layers. They, however, report no benefit from training extremely deep models for their problem. In our work, adding layers are beneficial in general. For large scale factors, deep models exploiting contextual information spread in very large field are dominant. % @@ -296,190 +308,9 @@ \subsection{Training} % %\textcolor{red}{Data preparation is similar to SRCNN \cite{Dong2014} with some differences. Input patch size is equal to the size of receptive field and images are divided into sub-images with no overlap. 64 sub-images constitue a mini-batch, where sub-images from different scales can be in the same batch.} % -%We implement our model using the \textit{MatConvNet}\footnote{\url{ http://www.vlfeat.org/matconvnet/}} package \cite{arXiv:1412.4564}. -% -%\section{Understanding Properties} -% -%\subsection{Effectiveness of Recursion} -%In this section, we compare our recursive network to canonical CNNs. -% -%TODO plot performance curve and parameter curves (RCN flat) -% -%In this section, we study three properties of our proposed method. First, we show our method with a single network performs as well as a method using multiple networks trained for each scale. We can effectively reduce model capacity (the number of parameters) of multi-network approaches. -% -%Second, we show our residual-learning network converges very fast in relative to the standard CNN. Moreover, our network gives a significant boost in performance. -% -%Third, we show large depth is necessary for the task of SR. A very deep network utilizes more contextual information in an image \textcolor{red}{and models complex functions with many nonlinear layers.} We experimentally confirm that deeper networks give better performances than shallow ones. -% -%\subsection{Single Model for Multiple Scales} -%Scale augmentation during training is a key technique to equip a network with super-resolution machines of multiple scales. Many SR processes for different scales can be executed with our multi-scale machine with much smaller capacity than that of single-scale machines combined. -% -%We start with an interesting experiment as follows: we train our network with a scale factor $s_{\text{train}}$ and it is tested under another scale factor $s_{\text{test}}$. Here, factors 2,3 and 4 that are widely used in SR comparisons are considered. Possible pairs ($s_{\text{train}}$,$s_{\text{test}}$) are tried for the dataset `Set5' \cite{bevilacqua2012}. Experimental results are summarized in Table \ref{tab:SRCNN_Factor_Test}. -% -%Performance is degraded if $s_{\text{train}} \neq s_{\text{test}}$. For scale factor 2, the model trained with factor 2 gives PSNR of 37.10 (in dB), whereas models trained with factor 3 and 4 give 30.05 and 28.13, respectively. A network trained over single-scale data is not capable of handling other scales. In many tests, it is even worse than bicubic interpolation, the method used for generating the input image. 
-% -%We now test if a model trained with scale augmentation is capable of performing SR at multiple scale factors. The same network used above is trained with multiple scale factors $s_{\text{train}} = \{2,3,4\}$. In addition, we experiment with the cases $s_{\text{train}} = \{2,3\}, \{2,4\}, \{3,4\}$ for more comparisons. -% -%We observe the network copes with any scale used during training. When $s_{\text{train}} = \{2,3,4\}$ ($\times 2, 3, 4$ in Table \ref{tab:SRCNN_Factor_Test}), its PSNR for each scale is comparable to those achieved from the corresponding result of single-scale network: 37.06 vs. 37.10 ($\times 2$), 33.27 vs. 32.89 ($\times 3$), 30.95 vs. 30.86 ($\times 4$). -% -%Another pattern is that for large scales ($\times 3,4$), our multi-scale network performs over single-scale network: our model ($\times 2,3$), ($\times 3,4$) and ($\times 2, 3,4$) give PSNRs 33.22, 33.24 and 33.27 for test scale 3, respectively, whereas ($\times 3$) gives 32.89. Similarly, ($\times 2,4$), ($\times 3,4$) and ($\times 2, 3,4$) give 30.86, 30.94 and 30.95 (vs. 30.84 by $\times 4$ model), respectively. From this, we observe training multiple scales boost the performance for large scales. -% -%\subsection{Residual Images for Better Learning} -%\label{sec:residual} -% -%As we already have low-resolution image as input, predicting high-frequency components is enough for the purpose of SR. Predicting residuals (HR - ILR) is widely used in several previous methods \cite{Timofte2013, Timofte,zeyde2012single}. However, it has not been studied in the context of deep-learning-based SR. -% -%In this work, we have proposed a network structure that learns from a residual image (ground truth minus input, i.e. high-resolution image minus interpolated low-resolution). We now study the effect of this modification to a standard CNN structure in detail. -% -%First, we find this residual network converge much faster. Two networks are compared experimentally: residual network and standard network. We use depth 10 (weight layers) and scale factor 2. Performance curves for various learning rates are shown in Figure \ref{fig:residual2}. All use the same learning rate scheduling mechanism that has been mentioned above. -% -%Second, at convergence, the residual network shows superior performance. In \textcolor{red}{Figure \ref{fig:residual2}}, residual networks give higher PSNR when training is done. % -%In short, this simple modification to a standard network structure is beneficial and one can explore the validity of the idea in other image restoration problems. % -%\begin{figure} -%\vspace{-1cm} -%\centering -%\includegraphics[scale=0.3]{figs/fig4_sffsr.pdf} -%\vspace{-0.7cm} -%\caption{Receptive field for a neuron in network grows as layers are stacked. In our work, up to 20 layers are used reaching 41$\times$41 at maximum. } -%\label{fig:receptive_field} -%\end{figure} -% -% -%\begin{figure*} -%\begin{center} -%\begin{tabular}{ccccc} -%\graphicspath{{figs/}}\includegraphics[width=0.18\textwidth]{img_082_1_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.18\textwidth]{img_082_6_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.18\textwidth]{img_082_7_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.18\textwidth]{img_082_10_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.18\textwidth]{img_082_9_w.png} -%\\ -%Original / PSNR (dB) &A+ / 29.84 &SRCNN / 29.20 &Huang et al. 
/ 30.17 &RCN (Ours) / 30.86 \\ -%\end{tabular} -%\end{center} -%\vspace{-.5cm} -%\caption{Super-resolution results (Urban100) with scale factor $\times$4 Our result is visually pleasing. }\label{fig:c1} -%\end{figure*} -% -%\begin{figure*} -%\begin{center} -%\begin{tabular}{cccc} -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_053_1_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_053_6_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_053_7_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_053_9_w.png} -%\\ -%Original / PSNR (dB) &A+ / 22.31 &SRCNN / 22.34 &RCN (Ours) / 23.13 \\ -%\end{tabular} -%\end{center} -%\vspace{-.5cm} -%\caption{Super-resolution results (Urban100) with scale factor $\times$3. Our result is visually pleasing.}\label{fig:c2} -%\end{figure*} -% -% -%\begin{figure*} -%\begin{center} -%\begin{tabular}{cccc} -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_058_1_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_058_6_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_058_7_w.png} & -%\graphicspath{{figs/}}\includegraphics[width=0.23\textwidth]{img_058_9_w.png} -%\\ -%Original / PSNR (dB) &A+ / 25.87 &SRCNN / 25.76 &RCN (Ours) / 26.57 \\ -%\end{tabular} -%\end{center} -%\vspace{-.5cm} -%\caption{Super-resolution results (Urban100) with scale factor $\times$3. Our result is visually pleasing.}\label{fig:c3} -%\end{figure*} -% -%\subsection{High Depths for Large Contexts} -%In this section, we study the depth of a convolutional neural network (CNN) in the context of super-resolution. We first start with the definition of receptive field in a CNN. -% -%CNNs exploit spatially-local correlation by enforcing a local connectivity pattern between neurons of adjacent layers \cite{Bengio-et-al-2015-Book}. In other words, hidden units in layer $m$ take as input a subset of units in layer $m-1$. They form spatially contiguous receptive fields (Figure \ref{fig:receptive_field}). -% -%Imagine that layer $m-1$ is the input image. In the figure, units in layer $m$ have receptive fields of 3$\times$3 in the input and are thus only connected to 9 adjacent neurons in the input layer. Units in layer $m+1$ have a similar connectivity with the layer below. We say that their receptive field with respect to the layer below is 3$\times$3, but their receptive field with respect to the input is larger (5$\times$5). Each unit is unresponsive to variations outside of its receptive field with respect to the input. The architecture thus ensures that the learned filters produce the strongest response to a spatially local input pattern. -% -%However, as shown above, stacking many such layers leads to filters that become increasingly “global” (i.e. responsive to a larger region of pixel space). For example, the unit in hidden layer $m+1$ can encode a non-linear feature of 5$\times$5 (in terms of image pixel space). -% -%In this work, we use small filters of the same size 3$\times$3 for all layers. For the first layer, its receptive field is of size 3$\times$3. For the next layers, the size of its receptive field increases by 2 in both height and width. For depth $M$ network, its receptive field has size $(2M+1)\times(2M+1)$. Its size is proportional to the depth. -% -%In the task of SR, this corresponds to the amount of contextual information that can be exploited to infer high-frequency components. 
Large receptive field means the network can use more context from large image region to predict details of an image. As SR is an ill-posed inverse problem, collecting and analyzing more neighbor pixels give more clues. For example, if there are some image patterns entirely contained in a receptive field, it is plausible that this pattern is recognized and used to super-resolve the image. -% -%We now experimentally show that very deep networks significantly improve performance. Since our layers are of the same type, we train and test networks of depth ranging from 5 to 20 (only counting weight layers excluding nonlinearity layers). -% -%In Figure \ref{fig:depth}, we show the results. In most cases, performance increases as depth increases. The exception is scale factor 2. As depth increases, performance on scale factors 3 and 4 improve rapidly. Since we use Euclidean loss, it tends to correct severe errors first, which are more prevalent in high scale factors. Different loss functions can be explored to balance the importance between scales and these are left as a future work. -% -%\begin{table*} -%\footnotesize -%\begin{center} -%\begin{tabular}{ |c|c|c|c|c|c|c|c|c|c| } -%\hline -%\multirow{2}{*}{Dataset} & \multirow{2}{*}{Scale} & {Bicubic} & {A+} & {SRCNN} & {Huang et al.} & {RCN} & {RCN-291} & {RCN+} & {RCN-291+ (Ours)}\\ -% -% & & PSNR/Time & PSNR/Time & PSNR/Time & PSNR/Time & PSNR/Time & PSNR/Time & PSNR/Time & PSNR/Time\\ -%\hline -%\hline -%\multirow{3}{*}{Set5} -%& $\times$2& 33.66 / - & 36.55 / 0.44& 36.66 / 2.12& -& 37.06 / 0.31& \color{blue} 37.38 / 0.15& 37.24 / 0.18& \color{red} 37.53 / 0.19\\ -%& $\times$3& 30.39 / - & 32.59 / 0.23& 32.75 / 2.13& -& 33.27 / 0.22& 33.42 / 0.15& \color{blue} 33.50 / 0.18& \color{red} 33.67 / 0.19\\ -%& $\times$4& 28.42 / - & 30.28 / 0.16& 30.49 / 2.21& -& 30.95 / 0.22& 31.09 / 0.15& \color{blue} 31.22 / 0.18& \color{red} 31.35 / 0.19\\ -%\hline -%\hline -%\multirow{3}{*}{Set14} -%& $\times$2& 30.23 / - & 32.28 / 0.92& 32.45 / 3.79& -& 32.61 / 0.32& \color{blue} 32.84 / 0.21& 32.77 / 0.26& \color{red} 33.04 / 0.27\\ -%& $\times$3& 27.54 / - & 29.13 / 0.48& 29.30 / 3.78& -& 29.48 / 0.32& 29.63 / 0.21& \color{blue} 29.64 / 0.26& \color{red} 29.77 / 0.27\\ -%& $\times$4& 26.00 / - & 27.32 / 0.34& 27.50 / 3.80& -& 27.71 / 0.32& 27.85 / 0.21& \color{blue} 27.91 / 0.26& \color{red} 28.01 / 0.27\\ -%\hline -%\hline -%\multirow{3}{*}{Urban100} -%& $\times$2& 26.61 / - & 28.57 / 3.15& 28.82 / 12.78& -& 29.22 / 0.76& \color{blue} 29.52 / 0.51& 29.41 / 0.67& \color{red} 29.76 / 0.68\\ -%& $\times$3& 24.36 / - & 25.82 / 1.68& 25.99 / 12.85& -& 26.35 / 0.76& 26.52 / 0.51& \color{blue} 26.53 / 0.67& \color{red} 26.76 / 0.68\\ -%& $\times$4& 23.07 / - & 24.22 / 1.17& 24.40 / 12.60& 24.55 / -& 24.62 / 0.77& 24.78 / 0.51& \color{blue} 24.82 / 0.67& \color{red} 24.99 / 0.68\\ -%\hline -%\hline -%\multirow{3}{*}{B100} -%& $\times$2& 29.32 / - & 30.77 / 0.58& 30.89 / 2.43& 30.70 / -& 31.02 / 0.25& \color{blue} 31.21 / 0.17& 31.13 / 0.21& \color{red} 31.30 / 0.22\\ -%& $\times$3& 27.15 / - & 28.18 / 0.31& 28.28 / 2.48& 28.10 / -& 28.42 / 0.25& \color{blue} 28.55 / 0.17& 28.52 / 0.21& \color{red} 28.64 / 0.22\\ -%& $\times$4& 25.92 / - & 26.77 / 0.22& 26.84 / 2.39& 26.74 / -& 26.98 / 0.25& \color{blue} 27.10 / 0.17& 27.08 / 0.20& \color{red} 27.20 / 0.21\\ -%\hline -%\end{tabular} -%\end{center} -%\caption{\textcolor{red}{PSNR for scale factor $\times$2, $\times$3 and $\times$4 on datasets `Set5', `Set14', `Urban100', and `B100'.} {\color{red} Red color} indicates 
the best performance and {\color{blue}blue color} indicates the second best one.} \label{table_all} -%\end{table*} -% -%%\begin{table*} -%%\small -%% \centering -%% \caption{PSNR for scale factor $\times$3 for Set14. {\color{red}Red color} indicates the best performance and {\color{blue}{blue color}} indicates the second best one.} -%% \begin{tabular} -%% {|c|c|c|c|c|c|c|c|c|c|} -%% \hline -%% Set14 & Scale & {Bicubic} & {Yang et al.} & {Zeyde et al.} & {ANR} & {NE+LLE} & {SRCNN} & {A+} & {Ours (SFFSR)} \\ -%% \hline -%% baboon & $\times$3 & 23.21 & 23.21 & 23.52 & 23.56 & 23.55 & 23.60 & \color{blue} 23.62 & \color{red} 23.66 \\ -%% barbara & $\times$3 & 26.25 & 26.25 & \color{red} 26.76 & 26.69 & \color{blue} 26.74 & 26.66 & 26.47 & 26.31 \\ -%% bridge & $\times$3 & 24.40 & 24.40 & 25.02 & 25.01 & 24.98 & 25.07 & \color{red} 25.17 & \color{blue} 25.16 \\ -%% coastguard & $\times$3 & 26.55 & 26.55 & 27.15 & 27.08 & 27.07 & \color{blue} 27.20 & \color{red} 27.27 & 27.12 \\ -%% comic & $\times$3 & 23.12 & 23.12 & 23.96 & 24.04 & 23.98 & \color{blue} 24.39 & 24.38 & \color{red} 24.81 \\ -%% face & $\times$3 & 32.82 & 32.82 & 33.53 & 33.62 & 33.56 & 33.58 & \color{blue} 33.76 & \color{red} 33.77 \\ -%% flowers & $\times$3 & 27.23 & 27.23 & 28.43 & 28.49 & 28.38 & 28.97 & \color{blue} 29.05 & \color{red} 29.56 \\ -%% foreman & $\times$3 & 31.18 & 31.18 & 33.19 & 33.23 & 33.21 & 33.35 & \color{blue} 34.30 & \color{red} 34.46 \\ -%% lenna & $\times$3 & 31.68 & 31.68 & 33.00 & 33.08 & 33.01 & 33.39 & \color{blue} 33.52 & \color{red} 33.76 \\ -%% man & $\times$3 & 27.01 & 27.01 & 27.90 & 27.92 & 27.87 & 28.18 & \color{blue} 28.28 & \color{red} 28.52 \\ -%% monarch & $\times$3 & 29.43 & 29.43 & 31.10 & 31.09 & 30.95 & \color{blue} 32.39 & 32.14 & \color{red} 33.94 \\ -%% pepper & $\times$3 & 32.39 & 32.39 & 34.07 & 33.82 & 33.80 & 34.35 & \color{blue} 34.74 & \color{red} 35.00 \\ -%% ppt3 & $\times$3 & 23.71 & 23.71 & 25.23 & 25.03 & 24.94 & 26.02 & \color{blue} 26.09 & \color{red} 26.63 \\ -%% zebra & $\times$3 & 26.63 & 26.63 & 28.49 & 28.43 & 28.31 & 28.87 & \color{blue} 28.98 & \color{red} 29.37 \\ -%% \hline -%% \hline -%% \bf Average & $\times$3 & 27.54 & 27.54 & 28.67 & 28.65 & 28.60 & 29.00 & \color{blue} 29.13 & \color{red} 29.43 \\ -%% \hline -%% \end{tabular} -%%\end{table*} -% -\section{Experimental Results} + In this section, we evaluate the performance of our method on several datasets. We first describe datasets used for training and testing our method. Next, parameters necessary for training are given. After outlining our experimental setup, we compare our method with several state-of-the-art SISR methods.