Naive Bayes Super-Resolution ForestJordi SalvadorEduardo Pérez-PelliteroTechnicolor R&I chnicolor.comAbstract32.5NBSRF32.41. IntroductionOngoing research on example-based super resolution hasrecently provided large performance improvements in termsof accuracy and efficiency thanks to the introduction of thelatest machine learning approaches. One possible strategyis internal learning [10, 8, 29], where the cross-scale selfsimilarity property of small (square) natural image patchesis exploited to regress the appearance of high-resolutionpatches. Despite the nice properties of these methods,e.g. their implicit adaptivity to the image contents, an important practical drawback is the computational cost induced by the required nearest-neighbor search.In order to alleviate this problem, one possibility is to exploit the latest and efficient approximate nearest-neighborsearch algorithms [2, 13, 11, 22, 25]. The alternative is theprogressive reduction of the search range for similar patchespresented by the latest best performing internal learningmethods [8, 29]. The evolution from the work of Glasner etal. [10] to that of Yang et al. [29] is remarkable: The search32.3PSNR (dB)This paper presents a fast, high-performance method forsuper resolution with external learning. The first contribution leading to the excellent performance is a bimodaltree for clustering, which successfully exploits the antipodal invariance of the coarse-to-high-res mapping of naturalimage patches and provides scalability to finer partitionsof the underlying coarse patch space. During training anensemble of such bimodal trees is computed, providing different linearizations of the mapping. The second and maincontribution is a fast inference algorithm, which selects themost suitable mapping function within the tree ensemblefor each patch by adopting a Local Naive Bayes formulation. The experimental validation shows promising scalability properties that reflect the suitability of the proposedmodel, which may also be generalized to other tasks. Theresulting method is beyond one order of magnitude fasterand performs objectively and subjectively better than thecurrent state of the art.ASRFA 32.2SRCNN32.13231.9Zeyde et al.31.8101ANR100Slower ---10-1Running time (s)10-2--- FasterFigure 1. The proposed Naive Bayes Super-Resolution Forestclearly improves PSNR and runtime in comparison to the state ofthe art. (SRCNN uses the public slower implementation.)on the entire multi-scale pyramid of each intermediate image towards the desired scale in the former is reduced to aniterated in-place example selection (one of just 9 possiblepatches in the immediately lower scale) in the latter thanksto the incorporation of additional prior knowledge from offline training. The exploitation of this additional prior alsorequires a search, which is common to most external learning approaches as we observe below. An additional limitation in internal learning methods exploiting cross-scaleself-similarity is that the processing must be iteratively applied in small scaling factors to fullfill the self-similarityassumption, which translates into a costlier processing.In contrast, recent external learning approaches [23, 19,24, 20] cast the super-resolution problem as a mapping fromeither low-res or coarse (upscaled by interpolation) patchesto high-res ones in a single step. Since the mapping function is non-linear, it is beneficious to properly split it intolocally linear mapping functions. During an offline trainingstage, low-res or coarse patches can be clustered (e.g. extracting centroids with K-SVD [1]) under the assumption ofmapping function linearity close to each centroid. This approach, which contrasts with former efforts in costly sparsecoding-based regression [30, 31, 18], provides a streamlined online inference: (1) determine the best cluster (linearization) for each patch; (2) apply the precomputed linearmapping function; and (3) reconstruct by patch overlapping.325

Contributions. We present a novel example-based superresolution method with external learning that provides afast, scalable and accurate solution by tackling the most demanding problem in the inference stage (i.e. the selection ofthe local linearization for each patch) and the corresponding training process. In more detail, the contributions ofthis paper are: (1) a hierarchical manifold learning strategywith bimodal trees that allows grouping antipodal patchesand provides fast local linearization search (Section 3); (2)an efficient Local Naive Bayes strategy for per-patch treeselection (Section 4); and (3) the resulting method, namelyNaive Bayes Super-Resolution Forest (NBSRF) that amplyimproves speed and both objective and perceived quality incomparison to the state of the art, as Fig. 1 reflects.full image reconstruction can be accomplished by overlapping all reconstructed patches.A major improvement on both speed and accuracy wasobtained by [31] with the introduction of K-SVD [1] to traina coarse dictionary (for coarse patches only, without involving the high-res ones in the optimization) and OrthogonalMatching Pursuit (OMP) [26] to solve the decompositionproblem (similar to the one in Eq. 2) during the inferencestage. Despite the improvements, the use of OMP during inference is clearly the bottleneck. Consequently, Anchored Neighbor Regression (ANR) [23] proposed to relaxthe sparsity constraint in Eq. 2 and anchor the set of mostsimilar atoms, resulting in a L2 -regularized regression thatcan be computed in closed form during training:min ky Nl (y)αk22 λkαk22 ,2. Related workαThe super-resolution problem is that of estimating ahigh-resolution version X of a low-resolution observed image Y generated by the modelY (X Hd ) s,(1)where denotes image convolution, Hd is a filter preventingaliasing from appearing in Y and s is a downsamplingoperator with scaling factor s (i.e. the desired upscalingfactor). A common asumption in the example-based superresolution literature is that Hd does not introduce additionalblur in Y , so the latter is assumed to be sharp. In [17] amethod is proposed, based on internal learning, to estimatethe overall blur kernel for super-resolving blurred images,but in the following we keep the no-blur assumption to beconsistent with most recent methods. Both images X andY are decomposed into overlapping patches {x} and {y}.2.1. External learningYang et al. [30] proposed introducing additional knowledge about the scaling process by means of Sparse Coding.During the training stage, a large set of patches extractedfrom natural images and their downscaled (low-res) counterparts are used to generate two coupled sparse dictionaries, one for high-res Dh and another one for low-res Dl .The inference is based on a decomposition of each low-respatch y as a sparse linear combination of the atoms in Dl ,which results in a costly optimization process for each patchthat can be expressed asmin ky Dl αk22 λkαk0 .α(2)The data term enforces similarity between the observedpatch and the reconstruction, whereas the sparse regularization enforces a small number of non-zero entries in thedecomposition α. The reconstruction of the high-res patchx is straight-forward once α is known: x Dh α, and the(3)where Nl (y) contains just the nearest neighbors within Dlof one of the atoms. The resulting linear mapping for thechosen atom in the dictionary is Nh (y)α, where Nh (y) isthe high-res counterpart to Nl (y) and the inference stage isreduced to a relatively costly exhaustive nearest-neighborsearch, a matrix multiplication (locally linear mappingfunction) and reconstruction by overlapping.The Simple Functions approach of Yang and Yang [28]proposes an alternative model where clustering is applieddirectly onto low-res patches and then a separate mappingfrom low-res to high-res patches is learned. Their simplemapping function of choice is an affine transformation thatcan be learned offline for each cluster. Again, a practicallimitation of the method is that the affine mapping for eachpatch is determined by an exhaustive search over all possible clusters.More recently, [19, 24] have proposed to obtain a dictionary trained with K-SVD like in [23] with the differencethat, instead of learning the linear regression function ofeach atom in the dictionary using only other atoms, the entire set of training samples are used. This produces a bettersampling of the coarse manifold and also provides a unification of [28] and [23].2.2. Locally linear map searchAs hinted above, perhaps the most affecting practicalbottleneck in many super-resolution approaches relying onlocal linearizations of the mapping function is the exhaustive search stage [23, 28, 24]. One possible solution is presented in [19], where a fast nearest-neighbor search is builtsuch that the clustering induced by the K-SVD dictionaryatoms can be efficiently exploited in the inference stage,resulting in large speed-ups. The idea is to construct thisfast search strategy based on Spherical Hashing [11] to getlogarithmic-time access to the anchor points (and their corresponding linear mapping functions) obtained by K-SVD.However, whereas K-SVD usually leads to a good selection326

of the most relevant dictionary atoms for reconstructing allpossible observed data, it also involves a rather slow processthat might be impractical for large training datasets.An alternative and more uniform solution is to performunsupervised hierarchical clustering, e.g. [9, 16], in order toboth determine clusters of similar patches (or features) andprovide an intrinisic logarithmic-time search during the inference stage. This strategy is not new to super-resolution asthe random projections approach [9] (which approximates[16] without computing the PCA of the elements in each hierarchical cluster) has already been employed for in-placeregression [29]. Very recently, [20, 12] have also proposedto combine regressions from random tree ensembles to improve the accuracy. Since hierarchical approaches are capable of solving both the clustering and efficient inferenceproblems, in the following we explore the applicability onour fast and accurate super-resolution onAbovesimilaritythresholdPartition 1Partition 23. Hierarchical manifold learningUnimodalOur model is in essence related to [28], in the sense thatwe aim at providing a direct mapping from coarse to highres patches by dividing the input coarse space into clustersand then learning a locally linear mapping to the high-respatches, and also to [24, 19], so that our mapping is actually a correction layer for an initial coarse estimation of thehigh-res patch. The coarse estimate of X is thus first obtained via iterative back-projection (IBP) from the low-resinput YX̃ (n 1) : X̃ (n) Hu ((Y (X̃ (n) Hd ) s) s), (4)where the superscript (n) indicates the iteration and Hu isan interpolation filter. Typically, a small number of iterations, e.g. 2 or 3, is sufficient to reach convergence. At apatch level, the reconstruction model continues asx x̃ R(x̃ x̄) x̃ R(x̃0 ),(5)where R is a nonlinear regression function and x̃0 is themean-subtracted (or floating) version of x̃. Whereas otherapproaches (e.g. [31, 23, 24, 19]) use handcrafted featuresas input to the regression problem, below we explain howfeatures can be adaptively computed in order to select oneof the M available local linearizations Ri of R obtainedduring training.We further observe that given a patch x with a certainstructure, the scaled version αx contains the same structure with different contrast. Since this argument is valid forboth x and x̃0 when they are related by a linear transform,it makes sense to group all patches with the same structureby normalizing them (x̂ x̃0 kx̃0 k 12 ). The result is that allpossible patch structures lie on a unitary hypersphere centered at the origin akin to the toy example in Fig. 2 top.Manifold ofnormalizedpatchesPartition 1Partition 2BimodalFigure 2. Top, typical unimodal partitions on a normalized-featurespace. The spherical cap (delimited by the solid line) is the fraction of the manifold that can be described with a single principaldirection. Bottom, clustering for antipodal patch grouping withunimodal and bimodal partitions. Note the more balanced partitions of the (colored) data possible with the bimodal scheme.Based on the good results in [29] for in-place exampleregression one might consider a similar space partitioningstrategy, i.e. using unimodal partition trees: for each node,split data based on the thresholding of a feature (the response to a certain filter). This is the mechanism underlying the PCA tree [16], its random projection approximation [9] and also the faster k-D tree. In the latter, a setof features is precomputed for all data and the spliting isbased on the thresholding of the most sensitive feature foreach node, whereas the PCA tree and its approximation provide an adaptive computation of relevant features during theroot-to-leaf tree traversal.This strategy is valid when data are distributed as a thinmanifold [9]. However, inspecting the configuration of ourproblem, depicted in Fig. 2 top (with all data lying on a hypersphere), we observe that partitioning data in this fashionwould be rather suboptimal. The resulting partition withunimodal approaches is rather unbalanced in our setup: theset of data lying out of the inclusion partition (projectionabove threshold) is much more heterogeneus than the onelying inside (below threshold).3.1. Antipodal symmetryIn addition to the normalization of data, it is also important to observe that antipodal points, i.e. x̂ and x̂, are also327

representations of the same structure with a scaling factor.An example of such pair of points is also shown in Fig. 2top. Hence, an efficient hierarchical partition of the spacein our problem should aim to also keep all anti