Academic and Research Information Guide: Computer Science (05): Semantics; Clustering algorithms; Cross-modal hashing

Top-cited papers

A multi-view embedding space for modeling internet images, tags, and their semantics.
Gong, Y., Ke, Q., Isard, M. and 1 more (2014) International Journal of Computer Vision, 106 (2), pp. 210-233.

This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.
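
The abstract mentions explicit nonlinear kernel mappings without naming them. As a rough illustration only, the sketch below uses random Fourier features, one well-known explicit mapping that approximates a Gaussian (RBF) kernel with an ordinary dot product, so that a linear method such as CCA can be run on the mapped features. The function names, the bandwidth, and the feature count are all assumptions made for this example, not details taken from the abstract.

```python
import math
import random


def make_rff_map(input_dim, num_features, bandwidth, seed=0):
    """Build an explicit feature map whose dot product approximates an RBF kernel."""
    rng = random.Random(seed)
    # Frequencies drawn from N(0, 1/bandwidth^2), phases from U[0, 2*pi].
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(input_dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return feature_map


def rbf_kernel(x, y, bandwidth):
    """Exact Gaussian kernel, used here only to check the approximation."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


phi = make_rff_map(input_dim=3, num_features=2000, bandwidth=1.0)
x, y = [0.2, -0.1, 0.4], [0.0, 0.3, 0.1]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product of mapped vectors
exact = rbf_kernel(x, y, 1.0)
```

With more random features the dot product tracks the exact kernel value more closely, at the cost of a longer feature vector.
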

Multi-View Intact Space Learning.
Xu, C., Tao, D., Xu, C. (2015) IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (12), pp. 2531-2544.

It is practical to assume that an individual view is unlikely to be sufficient for effective multi-view learning. Therefore, integration of multi-view information is both valuable and necessary. In this paper, we propose the Multi-view Intact Space Learning (MISL) algorithm, which integrates the encoded complementary information in multiple views to discover a latent intact representation of the data. Even though each view on its own is insufficient, we show theoretically that by combining multiple views we can obtain abundant information for latent intact space learning. Employing the Cauchy loss (a technique used in statistical learning) as the error measurement strengthens robustness to outliers. We propose a new definition of multi-view stability and then derive the generalization error bound based on multi-view stability and Rademacher complexity, and show that the complementarity between multiple views is beneficial for the stability and generalization. MISL is efficiently optimized using a novel Iteratively Reweight Residuals (IRR) technique, whose convergence is theoretically analyzed. Experiments on synthetic data and real-world datasets demonstrate that MISL is an effective and promising algorithm for practical applications.
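
To give a feel for why a Cauchy loss resists outliers, here is a minimal sketch on a one-dimensional location problem. It is not the MISL algorithm or its IRR optimizer; it only shows the general idea of iteratively reweighting residuals under the loss log(1 + (r/c)^2). The names `cauchy_loss`, `robust_location`, and the `scale` parameter are assumptions made for this example.

```python
import math


def cauchy_loss(residual, scale=1.0):
    """Cauchy loss: grows logarithmically, so large residuals are damped."""
    return math.log(1.0 + (residual / scale) ** 2)


def robust_location(values, scale=1.0, iterations=50):
    """Estimate a location by iteratively reweighting residuals.

    Setting the derivative of the summed Cauchy loss to zero gives a
    weighted mean with weights 1 / (1 + (r / scale)^2).
    """
    estimate = sum(values) / len(values)  # start from the plain mean
    for _ in range(iterations):
        weights = [1.0 / (1.0 + ((v - estimate) / scale) ** 2) for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


samples = [1.0, 1.1, 0.9, 1.05, 0.95, 100.0]  # the last value is an outlier
plain_mean = sum(samples) / len(samples)
robust = robust_location(samples)
```

The plain mean is pulled far toward the outlier, while the reweighted estimate stays close to the bulk of the samples because the outlier receives a very small weight.
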

Collective matrix factorization hashing for multimodal data.
Ding, G., Guo, Y., Zhou, J. (2014) Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2083-2090.

Nearest neighbor search methods based on hashing have attracted considerable attention for effective and efficient large-scale similarity search in the computer vision and information retrieval communities. In this paper, we study the problem of learning hash functions in the context of multimodal data for cross-view similarity search. We put forward a novel hashing method, which is referred to as Collective Matrix Factorization Hashing (CMFH). CMFH learns unified hash codes by collective matrix factorization with a latent factor model from different modalities of one instance, which not only supports cross-view search but also increases the search accuracy by merging multiple view information sources. We also prove that CMFH, a similarity-preserving hashing learning method, has upper and lower boundaries. Extensive experiments verify that CMFH significantly outperforms several state-of-the-art methods on three different datasets.
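
The retrieval step that unified hash codes make possible can be sketched as follows. This is not the CMFH learning procedure: the latent vectors are hypothetical stand-ins for factors a matrix factorization might produce, and thresholding each dimension at its mean is an assumed quantizer chosen for simplicity. Once every item has one code regardless of modality, a query from either view is answered by ranking on Hamming distance.

```python
def binarize(latent_vectors):
    """Turn real-valued latent vectors into bit tuples.

    Thresholding each dimension at its mean is an assumed quantizer.
    """
    dims = len(latent_vectors[0])
    means = [sum(v[d] for v in latent_vectors) / len(latent_vectors)
             for d in range(dims)]
    return [tuple(1 if v[d] > means[d] else 0 for d in range(dims))
            for v in latent_vectors]


def hamming(code_a, code_b):
    """Number of bit positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def nearest(query_code, database_codes, k=1):
    """Indices of the k database codes closest to the query in Hamming distance."""
    ranked = sorted(range(len(database_codes)),
                    key=lambda i: hamming(query_code, database_codes[i]))
    return ranked[:k]


# Hypothetical shared latent factors for four items.
latent = [[0.9, -0.8, 0.7], [0.8, -0.9, 0.6], [-0.7, 0.9, -0.8], [-0.9, 0.8, -0.6]]
codes = binarize(latent)
top_two = nearest(codes[2], codes, k=2)
```

Comparing short bit codes is far cheaper than comparing real-valued vectors, which is the efficiency argument the abstract makes for hashing-based search.
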

Cross-modal retrieval with correspondence autoencoder.
Feng, F., Wang, X., Li, R. (2014) MM 2014 - Proceedings of the 2014 ACM Conference on Multimedia, pp. 7-16.

The problem of cross-modal retrieval, e.g., using a text query to search for images and vice versa, is considered in this paper. A novel model involving a correspondence autoencoder (Corr-AE) is proposed here for solving this problem. The model is constructed by correlating hidden representations of two uni-modal autoencoders. A novel optimal objective, which minimizes a linear combination of representation learning errors for each modality and correlation learning error between hidden representations of two modalities, is used to train the model as a whole. Minimization of correlation learning error forces the model to learn hidden representations with only common information in different modalities, while minimization of representation learning error makes hidden representations good enough to reconstruct the input of each modality. A parameter $\alpha$ is used to balance the representation learning error and the correlation learning error. Based on two different multi-modal autoencoders, Corr-AE is extended to two other correspondence models, which we call Corr-Cross-AE and Corr-Full-AE. The proposed models are evaluated on three publicly available data sets from real scenes. We demonstrate that the three correspondence autoencoders perform significantly better than three canonical correlation analysis based models and two popular multi-modal deep models on cross-modal retrieval tasks.
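
The objective described above can be written down for a single image-text pair as a small sketch. The abstract only says the loss is a linear combination balanced by $\alpha$; the (1 - alpha) / alpha weighting and the use of squared error for every term are assumptions made here, and the function and argument names are hypothetical.

```python
def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def corr_ae_loss(image, image_recon, text, text_recon,
                 hidden_image, hidden_text, alpha):
    """Combine reconstruction and correlation errors for one image-text pair.

    The weighting scheme and squared-error terms are assumptions.
    """
    # Representation learning error: how well each autoencoder rebuilds its input.
    representation_error = (squared_error(image, image_recon)
                            + squared_error(text, text_recon))
    # Correlation learning error: how far apart the two hidden codes are.
    correlation_error = squared_error(hidden_image, hidden_text)
    return (1.0 - alpha) * representation_error + alpha * correlation_error


loss = corr_ae_loss(
    image=[1.0, 0.0], image_recon=[0.5, 0.0],
    text=[0.0, 2.0], text_recon=[0.0, 1.0],
    hidden_image=[1.0, 1.0], hidden_text=[1.0, 0.0],
    alpha=0.2,
)
```

Under this weighting, raising `alpha` pushes the two hidden representations together, while lowering it favors faithful reconstruction of each modality.
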

On deep multi-view representation learning.
Wang, W., Arora, R., Livescu, K. and 1 more (2015) 32nd International Conference on Machine Learning, ICML 2015, 2, pp. 1083-1092.

Recent online services rely heavily on automatic personalization to recommend relevant content to a large number of users. This requires systems to scale promptly to accommodate the stream of new users visiting the online services for the first time. In this work, we propose a content-based recommendation system to address both the recommendation quality and the system scalability. We propose to use a rich feature set to represent users, according to their web browsing history and search queries. We use a Deep Learning approach to map users and items to a latent space where the similarity between users and their preferred items is maximized. We extend the model to jointly learn from features of items from different domains and user features by introducing a multi-view Deep Learning model. We show how to make this rich-feature based user representation scalable by reducing the dimension of the inputs and the amount of training data. The rich user feature representation allows the model to learn relevant user behavior patterns and give useful recommendations for users who do not have any interaction with the service, given that they have adequate search and browsing history. The combination of different domains into a single model for learning helps improve the recommendation quality across all the domains, as well as having a more compact and a semantically richer user latent feature vector. We experiment with our approach on three real-world recommendation systems acquired from different sources of Microsoft products: Windows Apps recommendation, News recommendation, and Movie/TV recommendation. Results indicate that our approach is significantly better than the state-of-the-art algorithms (up to 49% enhancement on existing users and 115% enhancement on new users). In addition, experiments on a publicly open data set also indicate the superiority of our method in comparison with transitional generative topic models, for modeling cross-domain recommender systems. Scalability analysis shows that our multi-view DNN model can easily scale to encompass millions of users and billions of item entries. Experimental results also confirm that combining features from all domains produces much better performance than building separate models for each domain.