ArXiv Weekly: 10 NLP Papers You May Want to Read
[NLP paper 1/10]
Why you may want to read this: Newest paper from Shih-Fu Chang (Professor of Electrical Engineering and Computer Science, Columbia University).
Training with Streaming Annotation.
Tongtao Zhang, Heng Ji, Shih-Fu Chang, Marjorie Freedman
In this paper, we address a practical scenario where training data is released in a sequence of small-scale batches and annotation in earlier phases has lower quality than the later counterparts. To tackle the situation, we utilize a pre-trained transformer network to preserve and integrate the most salient document information from the earlier batches while focusing on the annotation (presumably with higher quality) from the current batch. Using event extraction as a case study, we demonstrate in the experiments that our proposed framework can perform better than conventional approaches (with absolute F-score gains ranging from 3.6% to 14.9%), especially when there is more noise in the early annotation; our approach also saves 19.1% of training time compared with the best conventional method.
[NLP paper 2/10]
Why you may want to read this: Newest paper from Richard Socher (Chief Scientist at Salesforce).
Limits of Detecting Text Generated by Large-Scale Language Models.
Lav R. Varshney, Nitish Shirish Keskar, Richard Socher
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns. Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated. We show that error exponents for particular language models are bounded in terms of their perplexity, a standard measure of language generation performance. Under the assumption that human language is stationary and ergodic, the formulation is extended from considering specific language models to considering maximum likelihood language models, among the class of k-order Markov approximations; error probabilities are characterized. Some discussion of incorporating semantic side information is also given.
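A minimal sketch of the hypothesis-testing view: score a text by its perplexity under a reference language model and threshold that score. The bigram model, the threshold value, and the "suspiciously low perplexity means generated" decision rule below are illustrative assumptions, not the paper's actual detector or its error-exponent bounds.

```python
import math
from collections import Counter

def train_bigram_lm(corpus_tokens, vocab):
    """Fit an add-one smoothed bigram LM: P(w_t | w_{t-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_tokens:
        padded = ["<s>"] + sent
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    V = len(vocab) + 1  # +1 for the <s> symbol
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

def perplexity(prob, sent):
    padded = ["<s>"] + sent
    log_p = sum(math.log(prob(p, w)) for p, w in zip(padded, padded[1:]))
    return math.exp(-log_p / len(sent))

def classify(prob, sent, threshold):
    """Hypothesis-test surrogate: flag text whose perplexity under the
    reference model is suspiciously low as 'generated' (illustrative rule)."""
    return "generated" if perplexity(prob, sent) < threshold else "genuine"

# toy usage
human = [["the", "cat", "sat", "on", "the", "mat"],
         ["the", "dog", "chased", "the", "cat"]]
vocab = {w for s in human for w in s}
lm = train_bigram_lm(human, vocab)
print(classify(lm, ["the", "cat", "sat", "on", "the", "mat"], threshold=5.0))
```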
[NLP paper 3/10]
Why you may want to read this: Newest paper from Jeffrey Xu Yu (Chinese University of Hong Kong).
Joint Embedding in Named Entity Linking on Sentence Level.
Wei Shi, Siyuan Zhang, Zhiwei Zhang, Hong Cheng, Jeffrey Xu Yu
Named entity linking maps an ambiguous mention in a document to an entity in a knowledge base. The task is challenging because a mention typically has multiple candidate entities. Linking is especially difficult when a mention appears several times in a document, since the contexts around its appearances can conflict, and because training datasets are small, as mapping mentions to entities is done manually. Among the many studies in the literature, recent embedding methods learn entity vectors from the training data at the document level. To address these issues, we focus on linking mentions at the sentence level, which reduces the noise introduced by different appearances of the same mention in a document, at the cost of less available context. We propose a new unified embedding method that maximizes the relationships learned from knowledge graphs, and we confirm its effectiveness in experimental studies.
[NLP paper 4/10]
Why you may want to read this: Newest paper from Jianfeng Gao (Microsoft Research, Redmond), Minlie Huang (computer science, Tsinghua University).
ConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems.
Qi Zhu, Zheng Zhang, Yan Fang, Xiang Li, Ryuichi Takanobu, Jinchao Li, Baolin Peng, Jianfeng Gao, Xiaoyan Zhu, Minlie Huang
We present ConvLab-2, an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. As the successor of ConvLab (Lee et al., 2019b), ConvLab-2 inherits ConvLab's framework but integrates more powerful dialogue models and supports more datasets. Besides, we have developed an analysis tool and an interactive tool to assist researchers in diagnosing dialogue systems. The analysis tool presents rich statistics and summarizes common mistakes from simulated dialogues, which facilitates error analysis and system improvement. The interactive tool provides a user interface that allows developers to diagnose an assembled dialogue system by interacting with the system and modifying the output of each system component.
[NLP paper 5/10]
Why you may want to read this: Newest paper from Zheng Chen (Principal Researcher of Microsoft Research Asia).
Pre-Training for Query Rewriting in A Spoken Language Understanding System.
Zheng Chen, Xing Fan, Yuan Ling, Lambert Mathias, Chenlei Guo
Query rewriting (QR) is an increasingly important technique to reduce customer friction caused by errors in a spoken language understanding pipeline, where the errors originate from various sources such as speech recognition errors, language understanding errors or entity resolution errors. In this work, we first propose a neural-retrieval based approach for query rewriting. Then, inspired by the wide success of pre-trained contextual language embeddings, and also as a way to compensate for insufficient QR training data, we propose a language-modeling (LM) based approach to pre-train query embeddings on historical user conversation data with a voice assistant. In addition, we propose to use the NLU hypotheses generated by the language understanding system to augment the pre-training. Our experiments show that pre-training provides rich prior information and helps the QR task achieve strong performance, and that joint pre-training with NLU hypotheses brings further benefit. Finally, after pre-training, we find that a small set of rewrite pairs is enough to fine-tune the QR model so that it outperforms a strong baseline trained on the full QR training data.
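A minimal sketch of the retrieval step in a neural-retrieval QR approach: embed the (possibly defective) query and return the nearest historical rewrite by cosine similarity. The random vectors stand in for embeddings from a pre-trained query encoder; all names and the example rewrites are hypothetical, not from the paper.

```python
import numpy as np

def cosine_retrieve(query_vec, index_vecs, rewrites, top_k=1):
    """Nearest-neighbour rewrite retrieval by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    X = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = X @ q
    best = np.argsort(-scores)[:top_k]
    return [(rewrites[i], float(scores[i])) for i in best]

# toy usage: random vectors stand in for embeddings from the pre-trained encoder
rng = np.random.default_rng(0)
rewrites = ["play thriller by michael jackson", "turn off the lights", "set a timer"]
index_vecs = rng.normal(size=(3, 16))
noisy_query_vec = index_vecs[0] + 0.1 * rng.normal(size=16)  # an ASR-noisy query near rewrite 0
print(cosine_retrieve(noisy_query_vec, index_vecs, rewrites))
```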
[NLP paper 6/10]
Why you may want to read this: Newest paper from Abhinav Gupta (Associate Professor, Robotics Institute, Carnegie Mellon University).
Exploring Structural Inductive Biases in Emergent Communication.
Agnieszka Słowik, Abhinav Gupta, William L. Hamilton, Mateja Jamnik, Sean B. Holden, Christopher Pal
Human language and thought are characterized by the ability to systematically generate a potentially infinite number of complex structures (e.g., sentences) from a finite set of familiar components (e.g., words). Recent works in emergent communication have discussed the propensity of artificial agents to develop a systematically compositional language through playing co-operative referential games. The degree of structure in the input data was found to affect the compositionality of the emerged communication protocols. Thus, we explore various structural priors in multi-agent communication and propose a novel graph referential game. We compare the effect of structural inductive bias (bag-of-words, sequences and graphs) on the emergence of compositional understanding of the input concepts measured by topographic similarity and generalization to unseen combinations of familiar properties. We empirically show that graph neural networks induce a better compositional language prior and a stronger generalization to out-of-domain data. We further perform ablation studies that show the robustness of the emerged protocol in graph referential games.
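Topographic similarity, the compositionality measure mentioned above, is commonly computed as the Spearman correlation between pairwise distances in the input-concept space and in the message space. A small sketch under that standard definition; the Hamming distances and toy data below are illustrative choices, not the paper's setup.

```python
import itertools
from scipy.stats import spearmanr

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def topographic_similarity(meanings, messages):
    """Spearman correlation between pairwise distances in meaning space
    and in message space (higher = more compositional protocol)."""
    pairs = list(itertools.combinations(range(len(meanings)), 2))
    d_meaning = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    d_message = [hamming(messages[i], messages[j]) for i, j in pairs]
    return spearmanr(d_meaning, d_message)[0]

# toy usage: attribute-value concepts and the messages agents emit for them
meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]
messages = ["aa", "ab", "ba", "bb"]   # perfectly compositional -> similarity 1.0
print(topographic_similarity(meanings, messages))
```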
[NLP paper 7/10]
Why you may want to read this: Newest paper from Diane Litman (University of Pittsburgh).
Abstractive Summarization for Low Resource Data using Domain Transfer and Data Synthesis.
Ahmed Magooda, Diane Litman
Training abstractive summarization models typically requires large amounts of data, which can be a limitation for many domains. In this paper we explore using domain transfer and data synthesis to improve the performance of recent abstractive summarization methods when applied to small corpora of student reflections. First, we explored whether tuning a state-of-the-art model trained on newspaper data could boost performance on student reflection data. Evaluations demonstrated that summaries produced by the tuned model achieved higher ROUGE scores than models trained on only student reflection data or only newspaper data. The tuned model also achieved higher scores than extractive summarization baselines, and was additionally judged to produce more coherent and readable summaries in human evaluations. Second, we explored whether synthesizing summaries of student data could further boost performance. We proposed a template-based model to synthesize new data, which when incorporated into training further increased ROUGE scores. Finally, we showed that combining data synthesis with domain transfer achieved higher ROUGE scores than using either approach alone.
[NLP paper 8/10]
Why you may want to read this: Newest paper from Ke Xu (Rutgers University).
Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models.
Wangchunshu Zhou, Ke Xu
Automated evaluation of open domain natural language generation (NLG) models remains a challenge, and widely used metrics such as BLEU and Perplexity can be misleading in some cases. In our paper, we propose to evaluate natural language generation models by learning to compare a pair of generated sentences by fine-tuning BERT, which has been shown to have good natural language understanding ability. We also propose to evaluate the model-level quality of NLG models by aggregating sample-level comparison results with a skill rating system. While our model can be trained in a fully self-supervised fashion, it can be further fine-tuned with a small amount of human preference annotation to better imitate human judgment. In addition to evaluating trained models, we propose to apply our model as a performance indicator during training for better hyperparameter tuning and early stopping. We evaluate our approach on both story generation and chit-chat dialogue response generation. Experimental results show that our model correlates better with human preference than previous automated evaluation approaches. Training with the proposed metric yields better performance in human evaluation, which further demonstrates the effectiveness of the proposed model.
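A minimal sketch of turning sample-level comparisons into a model-level score with a skill rating system; Elo is used here as a representative rating scheme, and the paper's exact system may differ.

```python
def elo_update(r_a, r_b, a_wins, k=32.0):
    """One Elo-style skill-rating update from a single pairwise comparison.
    r_a, r_b: current ratings of model A and model B; a_wins: 1.0 if the
    comparison model preferred A's sample, 0.0 otherwise (0.5 for a tie)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (a_wins - expected_a)
    r_b_new = r_b + k * ((1.0 - a_wins) - (1.0 - expected_a))
    return r_a_new, r_b_new

# toy usage: aggregate sample-level comparison outcomes into model-level ratings
ratings = {"model_A": 1000.0, "model_B": 1000.0}
outcomes = [1, 1, 0, 1]  # the comparator preferred A's sample in 3 of 4 pairs
for a_wins in outcomes:
    ratings["model_A"], ratings["model_B"] = elo_update(
        ratings["model_A"], ratings["model_B"], float(a_wins))
print(ratings)
```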
[NLP paper 9/10]
Why you may want to read this: Newest paper from Patrick Gallinari (Professor Sorbonne University / Criteo AI Lab).
Incorporating Visual Semantics into Sentence Representations within a Grounded Space.
Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari
Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations --- the focus of this paper --- as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
[NLP paper 10/10]
Why you may want to read this: Newest paper from Wang-Chiew Tan (Megagon Labs (was Recruit Institute of Technology)).
Snippext: Semi-supervised Opinion Mining with Augmented Data.
Zhengjie Miao, Yuliang Li, Xiaolan Wang, Wang-Chiew Tan
Online services are interested in solutions to opinion mining, the problem of extracting aspects, opinions, and sentiments from text. One method to mine opinions is to leverage the recent success of pre-trained language models, which can be fine-tuned to obtain high-quality extractions from reviews. However, fine-tuning language models still requires a non-trivial amount of training data. In this paper, we study how to significantly reduce the amount of labeled training data required to fine-tune language models for opinion mining. We describe Snippext, an opinion mining system built over a language model that is fine-tuned through semi-supervised learning with augmented data. A novelty of Snippext is its clever use of a two-pronged approach to achieve state-of-the-art (SOTA) performance with little labeled training data: (1) data augmentation to automatically generate more labeled training data from existing data, and (2) a semi-supervised learning technique to leverage the massive amount of unlabeled data in addition to the (limited amount of) labeled data. We show with extensive experiments that Snippext performs comparably to, and can even exceed, previous SOTA results on several opinion mining tasks with only half the training data required. Furthermore, it achieves new SOTA results when all training data are leveraged. Compared to a baseline pipeline, Snippext extracts significantly more fine-grained opinions, which enables new opportunities for downstream applications.
ArXiv Weekly: 10 CV Papers You May Want to Read
[CV paper 1/10]
Why you may want to read this: Newest paper from Andrea Vedaldi (University of Oxford), Andrew Zisserman (University of Oxford).
Automatically Discovering and Learning New Visual Categories with Ranking Statistics.
Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman
We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data. In this work we address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labeled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use rank statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin.
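A minimal sketch of idea (2), using rank statistics to generate pairwise pseudo-labels: two unlabelled samples are treated as belonging to the same (novel) class when their top-k most activated feature dimensions coincide. The features below are random stand-ins for the learned representation, and k is an illustrative choice.

```python
import numpy as np

def rank_stat_pairwise_labels(features, k=3):
    """Pseudo-label pairs of unlabelled samples as 'same class' when the
    top-k most activated feature dimensions coincide (the rank statistic)."""
    n = features.shape[0]
    topk = [set(np.argsort(-f)[:k]) for f in features]
    same = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            same[i, j] = topk[i] == topk[j]
    return same   # can supervise a pairwise (binary cross-entropy) clustering head

# toy usage with random features standing in for the learned representation
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 32))
print(rank_stat_pairwise_labels(feats, k=3).astype(int))
```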
[CV paper 2/10]
Why you may want to read this: Newest paper from Leonidas Guibas (Professor of Computer Science, Stanford University).
Continuous Geodesic Convolutions for Learning on 3D Shapes.
Zhangsihao Yang, Or Litany, Tolga Birdal, Srinath Sridhar, Leonidas Guibas
The majority of descriptor-based methods for geometric processing of non-rigid shapes rely on hand-crafted descriptors. Recently, learning-based techniques have been shown to be effective, achieving state-of-the-art results in a variety of tasks. Yet, even though these methods can in principle work directly on raw data, most still rely on hand-crafted descriptors at the input layer. In this work, we wish to challenge this practice and use a neural network to learn descriptors directly from the raw mesh. To this end, we introduce two modules into our neural architecture. The first is a local reference frame (LRF) used to explicitly make the features invariant to rigid transformations. The second is continuous convolution kernels that provide robustness to sampling. We show the efficacy of our proposed network in learning on raw meshes using two cornerstone tasks: shape matching and human body parts segmentation. Our method shows superior results over baseline methods that use hand-crafted descriptors.
[CV paper 3/10]
Why you may want to read this: Newest paper from Mubarak Shah (Trustee Chair Professor of Computer Science, University of Central Florida).
Subspace Capsule Network.
Marzieh Edraki, Nazanin Rahnavard, Mubarak Shah
Convolutional neural networks (CNNs) have become a key asset in most fields of AI. Despite their successful performance, CNNs suffer from a major drawback: they fail to capture the hierarchy of spatial relations among different parts of an entity. As a remedy to this problem, the idea of capsules was proposed by Hinton. In this paper, we propose the SubSpace Capsule Network (SCN), which exploits the idea of capsule networks to model possible variations in the appearance or implicitly defined properties of an entity through a group of capsule subspaces, instead of simply grouping neurons to create capsules. A capsule is created by projecting an input feature vector from a lower layer onto the capsule subspace using a learnable transformation. This transformation finds the degree of alignment of the input with the properties modeled by the capsule subspace. We show that SCN is a general capsule network that can successfully be applied to both discriminative and generative models without incurring computational overhead compared to CNNs at test time. The effectiveness of SCN is evaluated through a comprehensive set of experiments on supervised image classification, semi-supervised image classification, and high-resolution image generation using the generative adversarial network (GAN) framework. SCN significantly improves the performance of the baseline models in all three tasks.
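A minimal sketch of the capsule-creation step in isolation: project an input feature vector onto each capsule subspace and use the projection norm as the degree of alignment. In SCN the subspaces are learned end-to-end inside the network; the random matrices below are stand-ins.

```python
import numpy as np

def capsule_activations(x, subspaces):
    """Project input feature vector x onto each capsule subspace and return
    the projection norms as the capsules' degrees of alignment.
    subspaces: list of (d, c) matrices whose columns span each subspace."""
    acts = []
    for W in subspaces:
        # orthogonal projection of x onto span(W): W (W^T W)^{-1} W^T x
        proj = W @ np.linalg.solve(W.T @ W, W.T @ x)
        acts.append(np.linalg.norm(proj))
    return np.array(acts)

# toy usage: 2 capsule subspaces of dimension 4 inside a 16-d feature space
rng = np.random.default_rng(0)
x = rng.normal(size=16)
subspaces = [rng.normal(size=(16, 4)) for _ in range(2)]
print(capsule_activations(x, subspaces))
```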
[CV paper 4/10]
Why you may want to read this: Newest paper from Jonathon Shlens (Google Research).
Revisiting Spatial Invariance with Low-Rank Local Connectivity.
Gamaleldin F. Elsayed, Prajit Ramachandran, Jonathon Shlens, Simon Kornblith
Convolutional neural networks are among the most successful architectures in deep learning. This success is at least partially attributable to the efficacy of spatial invariance as an inductive bias. Locally connected layers, which differ from convolutional layers in their lack of spatial invariance, usually perform poorly in practice. However, these observations still leave open the possibility that some degree of relaxation of spatial invariance may yield a better inductive bias than either convolution or local connectivity. To test this hypothesis, we design a method to relax the spatial invariance of a network layer in a controlled manner. In particular, we create a \textit{low-rank} locally connected layer, where the filter bank applied at each position is constructed as a linear combination of a basis set of filter banks. By varying the number of filter banks in the basis set, we can control the degree of departure from spatial invariance. In our experiments, we find that relaxing spatial invariance improves classification accuracy over both convolution and locally connected layers across MNIST, CIFAR-10, and CelebA datasets. These results suggest that spatial invariance in convolution layers may be overly restrictive.
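A toy 1-D version of a low-rank locally connected layer, assuming the construction described above: the filter bank applied at each position is a per-position linear combination of a shared basis of filter banks, with the basis size (rank) controlling the departure from spatial invariance. Shapes and names are illustrative, not the authors' implementation.

```python
import numpy as np

def low_rank_local_conv1d(x, basis_filters, combine_weights):
    """1-D toy version of a low-rank locally connected layer.
    x: (length, c_in) input.
    basis_filters: (rank, k, c_in, c_out) basis set of filter banks.
    combine_weights: (length_out, rank) per-position mixing coefficients;
    rank controls how far the layer departs from spatial invariance."""
    rank, k, c_in, c_out = basis_filters.shape
    length_out = x.shape[0] - k + 1
    y = np.zeros((length_out, c_out))
    for p in range(length_out):
        # the filter bank used *at this position* is a mixture of the shared basis
        w_p = np.tensordot(combine_weights[p], basis_filters, axes=(0, 0))  # (k, c_in, c_out)
        y[p] = np.tensordot(x[p:p + k], w_p, axes=([0, 1], [0, 1]))
    return y

# toy usage: rank-2 basis; rank 1 with constant mixing weights would reduce
# to an ordinary convolution, pure local connectivity needs rank = length_out
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))
basis = rng.normal(size=(2, 3, 3, 4))   # rank 2, kernel 3, 3 -> 4 channels
mix = rng.normal(size=(8, 2))           # one mixing vector per output position
print(low_rank_local_conv1d(x, basis, mix).shape)   # (8, 4)
```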
[CV paper 5/10]
Why you may want to read this: Newest paper from Kevin W. Bowyer (Schubmehl-Prein Family Professor of Computer Science and Engineering, University of …).
How Does Gender Balance In Training Data Affect Face Recognition Accuracy?.
Vítor Albiero, Kai Zhang, Kevin W. Bowyer
Even though deep learning methods have greatly increased the overall accuracy of face recognition, an old problem still persists: accuracy is higher for men than for women. Previous researchers have speculated that the difference could be due to cosmetics, head pose, or hair covering the face. It is also often speculated that the lower accuracy for women is caused by women being under-represented in the training data. This work aims to investigate whether gender imbalance in the training data is actually the cause of lower accuracy for females. Using a state-of-the-art deep CNN, three different loss functions, and two training datasets, we train each on seven subsets with different male/female ratios, totaling forty-two trainings. The trained face matchers are then tested on three different testing datasets. Results show that gender-balancing the dataset has an overall positive effect, with higher accuracy for most of the combinations of loss functions and datasets when a balanced subset is used. However, for the best combination of loss function and dataset, the original training dataset shows better accuracy in 3 out of 4 cases. We observe that test accuracy for males is higher when the training data is all male. However, test accuracy for females is not maximized when the training data is all female. For a number of combinations of loss function and test dataset, accuracy for females is higher when only 75% of the training data is female than when 100% of the training data is female. This suggests that lower accuracy for females is not a simple result of the fraction of female training data. By clustering face features, we show that, in general, male faces are closer to other male faces than to female faces, and female faces are closer to other female faces than to male faces.
[CV paper 6/10]
Why you may want to read this: Newest paper from Philip H. S. Torr (Professor, University of Oxford).
Image-to-Image Translation with Text Guidance.
Bowen Li, Xiaojuan Qi, Philip H. S. Torr, Thomas Lukasiewicz
The goal of this paper is to embed controllable factors, i.e., natural language descriptions, into image-to-image translation with generative adversarial networks, which allows text descriptions to determine the visual attributes of synthetic images. We propose four key components: (1) the implementation of part-of-speech tagging to filter out non-semantic words in the given description, (2) the adoption of an affine combination module to effectively fuse different modality text and image features, (3) a novel refined multi-stage architecture to strengthen the differential ability of discriminators and the rectification ability of generators, and (4) a new structure loss to further improve discriminators to better distinguish real and synthetic images. Extensive experiments on the COCO dataset demonstrate that our method has a superior performance on both visual realism and semantic consistency with given descriptions.
[CV paper 7/10]
Why you may want to read this: Newest paper from Xiaopeng Chen (Associate Professor, Beijing Institute of Technology,Visiting Scholar, Carnegie Mellon …).
Ensemble of Deep Convolutional Neural Networks for Automatic Pavement Crack Detection and Measurement.
Zhun Fan, Chong Li, Ying Chen, Paola Di Mascio, Xiaopeng Chen, Guijie Zhu, Giuseppe Loprencipe
Automated pavement crack detection and measurement are important road maintenance tasks, as agencies have to guarantee road safety. Conventional crack detection and measurement algorithms can be extremely time-consuming and inefficient. Therefore, innovative algorithms have recently received increased attention from researchers. In this paper, we propose an ensemble of convolutional neural networks (without a pooling layer) based on probability fusion for automated pavement crack detection and measurement. Specifically, an ensemble of convolutional neural networks was employed to identify the structure of small cracks from raw images. Secondly, the outputs of the individual convolutional neural network models in the ensemble were averaged to produce the final crack probability value of each pixel, yielding a predicted probability map. Finally, the morphological features of the predicted cracks were measured using a skeleton extraction algorithm. To validate the proposed method, experiments were performed on two public crack databases (CFD and AigleRN) and the results were compared with different state-of-the-art methods. The experimental results show that the proposed method outperforms the other methods. For crack measurement, crack length and width can be measured for different crack types (complex, common, thin, and intersecting cracks). The results show that the proposed algorithm can be effectively applied to crack measurement.
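A minimal sketch of the fusion-and-measurement stages: average the per-pixel probability maps from the ensemble, threshold, skeletonize, and derive length and mean width from the skeleton and mask area. The synthetic crack and the area/length width estimate are illustrative simplifications, not the paper's exact measurement procedure.

```python
import numpy as np
from skimage.morphology import skeletonize

def fuse_and_measure(prob_maps, threshold=0.5, pixel_size_mm=1.0):
    """Average per-pixel crack probabilities from an ensemble, threshold,
    then estimate crack length and mean width from the skeleton."""
    fused = np.mean(prob_maps, axis=0)            # probability fusion
    mask = fused > threshold
    skeleton = skeletonize(mask)
    length_px = int(skeleton.sum())               # skeleton pixels ~ crack length
    area_px = int(mask.sum())
    mean_width_px = area_px / max(length_px, 1)   # area / length ~ average width
    return length_px * pixel_size_mm, mean_width_px * pixel_size_mm

# toy usage: three 'network' outputs for a synthetic 3-pixel-wide crack
h, w = 32, 32
truth = np.zeros((h, w))
truth[14:17, :] = 1.0
rng = np.random.default_rng(0)
prob_maps = [np.clip(truth + 0.2 * rng.normal(size=(h, w)), 0, 1) for _ in range(3)]
print(fuse_and_measure(prob_maps))
```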
[CV paper 8/10]
Why you may want to read this: Newest paper from C.-C. Jay Kuo (Distinguished Professor of ECE and CS, University of Southern California).
PointHop++: A Lightweight Learning Model on Point Sets for 3D Classification.
Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, C.-C. Jay Kuo
The PointHop method was recently proposed by Zhang et al. for 3D point cloud classification with unsupervised feature extraction. It has an extremely low training complexity while achieving state-of-the-art classification performance. In this work, we improve the PointHop method further in two respects: 1) reducing its model complexity in terms of the number of model parameters, and 2) ordering discriminant features automatically based on the cross-entropy criterion. The resulting method is called PointHop++. The first improvement is essential for wearable and mobile computing while the second improvement bridges statistics-based and optimization-based machine learning methodologies. With experiments conducted on the ModelNet40 benchmark dataset, we show that the PointHop++ method performs on par with deep neural network (DNN) solutions and surpasses other unsupervised feature extraction methods.
[CV paper 9/10]
Why you may want to read this: Newest paper from Jerry L. Prince (Professor of Electrical and Computer Engineering, Johns Hopkins University).
Finding novelty with uncertainty.
Jacob C. Reinhold, Yufan He, Shizhong Han, Yunqiang Chen, Dashan Gao, Junghoon Lee, Jerry L. Prince, Aaron Carass
Medical images are often used to detect and characterize pathology and disease; however, automatically identifying and segmenting pathology in medical images is challenging because the appearance of pathology across diseases varies widely. To address this challenge, we propose a Bayesian deep learning method that learns to translate healthy computed tomography images to magnetic resonance images and simultaneously calculates voxel-wise uncertainty. Since high uncertainty occurs in pathological regions of the image, this uncertainty can be used for unsupervised anomaly segmentation. We show encouraging experimental results on an unsupervised anomaly segmentation task by combining two types of uncertainty into a novel quantity we call scibilic uncertainty.
[CV paper 10/10]
Why you may want to read this: Newest paper from Jerry L. Prince (Professor of Electrical and Computer Engineering, Johns Hopkins University).
Validating uncertainty in medical image translation.
Jacob C. Reinhold, Yufan He, Shizhong Han, Yunqiang Chen, Dashan Gao, Junghoon Lee, Jerry L. Prince, Aaron Carass
Medical images are increasingly used as input to deep neural networks to produce quantitative values that aid researchers and clinicians. However, standard deep neural networks do not provide a reliable measure of uncertainty in those quantitative values. Recent work has shown that using dropout during training and testing can provide estimates of uncertainty. In this work, we investigate using dropout to estimate epistemic and aleatoric uncertainty in a CT-to-MR image translation task. We show that both types of uncertainty are captured, as defined, providing confidence in the output uncertainty estimates.
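A minimal sketch of the standard Monte Carlo dropout recipe assumed here (not the authors' exact translation network): keep dropout active at test time, treat the variance of the predicted means across stochastic passes as epistemic uncertainty, and the average predicted variance as aleatoric uncertainty. Names, sizes, and the toy regressor are illustrative.

```python
import torch
import torch.nn as nn

class DropoutRegressor(nn.Module):
    """Small regressor that predicts a mean and a log-variance per output,
    with dropout that will be kept active at test time for MC sampling."""
    def __init__(self, d_in=16, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Dropout(0.5))
        self.mean_head = nn.Linear(d_hidden, 1)
        self.logvar_head = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h)

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Epistemic uncertainty = variance of the mean across stochastic passes;
    aleatoric uncertainty = average of the predicted variances."""
    model.train()  # keep dropout on
    means, variances = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            mu, logvar = model(x)
            means.append(mu)
            variances.append(torch.exp(logvar))
    means = torch.stack(means)
    epistemic = means.var(dim=0)
    aleatoric = torch.stack(variances).mean(dim=0)
    return means.mean(dim=0), epistemic, aleatoric

# toy usage on random features standing in for CT patches
model = DropoutRegressor()
pred, epi, alea = mc_dropout_uncertainty(model, torch.randn(8, 16))
print(pred.shape, epi.shape, alea.shape)
```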
ArXiv Weekly: 10 ML Papers You May Want to Read
[ML paper 1/10]
Why you may want to read this: Newest paper from Michael I. Jordan (Professor of EECS and Professor of Statistics, University of California, Berkeley).
Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization.
Samuel Horváth, Lihua Lei, Peter Richtárik, Michael I. Jordan
Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the "geometrization" technique introduced by Lei & Jordan 2016 and the \texttt{SARAH} algorithm of Nguyen et al., 2017, we propose the Geometrized \texttt{SARAH} algorithm for non-convex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-\L{}ojasiewicz (PL) constant if present. In addition, it achieves the best-available convergence rate for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives.
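For context, a minimal sketch of the plain SARAH recursive gradient estimator on a toy finite-sum problem; the geometrized variant's adaptive inner-loop lengths and step sizes are not reproduced here, and all names below are illustrative.

```python
import numpy as np

def sarah_epoch(grad_full, grad_i, x, step, n, inner_steps, rng):
    """One outer loop of the (plain) SARAH recursive gradient estimator:
    v_0 = full gradient; v_t = grad_i(x_t) - grad_i(x_{t-1}) + v_{t-1}."""
    v = grad_full(x)
    x_prev = x.copy()
    x = x - step * v
    for _ in range(inner_steps):
        i = rng.integers(n)
        v = grad_i(x, i) - grad_i(x_prev, i) + v
        x_prev = x.copy()
        x = x - step * v
    return x

# toy finite-sum least squares: f(x) = (1/n) sum_i (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d = 100, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: 2 * (A[i] @ x - b[i]) * A[i]
grad_full = lambda x: 2 * A.T @ (A @ x - b) / n
x = np.zeros(d)
for _ in range(20):
    x = sarah_epoch(grad_full, grad_i, x, step=0.01, n=n, inner_steps=50, rng=rng)
print(np.linalg.norm(A @ x - b) ** 2 / n)   # final average loss
```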
[ML paper 2/10]
Why you may want to read this: Newest paper from Gunnar Rätsch (Professor, ETH Zürich), Bernhard Schölkopf (Director, Max Planck Institute for Intelligent Systems; and Distinguished Amazon Scholar).
Weakly-Supervised Disentanglement Without Compromises.
Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, Michael Tschannen
Intelligent agents should be able to learn useful representations by observing changes in their environment. We model such observations as pairs of non-i.i.d. images sharing at least one of the underlying factors of variation. First, we theoretically show that only knowing how many factors have changed, but not which ones, is sufficient to learn disentangled representations. Second, we provide practical algorithms that learn disentangled representations from pairs of images without requiring annotation of groups, individual factors, or the number of factors that have changed. Third, we perform a large-scale empirical study and show that such pairs of observations are sufficient to reliably learn disentangled representations on several benchmark data sets. Finally, we evaluate our learned representations and find that they are simultaneously useful on a diverse suite of tasks, including generalization under covariate shifts, fairness, and abstract reasoning. Overall, our results demonstrate that weak supervision enables learning of useful disentangled representations in realistic scenarios.
[ML paper 3/10]
Why you may want to read this: Newest paper from Nitish Srivastava (Student, University of Toronto), Ruslan Salakhutdinov (Associate Professor, Machine Learning Department, CMU).
Capsules with Inverted Dot-Product Attention Routing.
Yao-Hung Hubert Tsai, Nitish Srivastava, Hanlin Goh, Ruslan Salakhutdinov
We introduce a new routing algorithm for capsule networks, in which a child capsule is routed to a parent based only on agreement between the parent's state and the child's vote. The new mechanism 1) designs routing via inverted dot-product attention; 2) imposes Layer Normalization as normalization; and 3) replaces sequential iterative routing with concurrent iterative routing. When compared to previously proposed routing algorithms, our method improves performance on benchmark datasets such as CIFAR-10 and CIFAR-100, and it performs at-par with a powerful CNN (ResNet-18) with 4x fewer parameters. On a different task of recognizing digits from overlayed digit images, the proposed capsule model performs favorably against CNNs given the same number of layers and neurons per layer. We believe that our work raises the possibility of applying capsule networks to complex real-world tasks. Our code is publicly available at: https://github.com/apple/ml-capsules-inverted-attention-routing
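A simplified sketch of the routing rule: agreement is the dot product between the current parent state and each child's vote, routing weights are a softmax over parents (the "inverted" direction), and parent states are re-estimated concurrently with Layer Normalization. The learned vote transformations and the surrounding network are omitted; the votes below are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def inverted_dot_product_routing(votes, n_iters=3):
    """Concurrent routing sketch. votes: (n_child, n_parent, d) votes v_{ij}
    from child capsule i to parent capsule j."""
    n_child, n_parent, d = votes.shape
    parents = np.zeros((n_parent, d))
    for _ in range(n_iters):
        agreement = np.einsum('ijd,jd->ij', votes, parents)          # (n_child, n_parent)
        weights = np.exp(agreement - agreement.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)                 # softmax over parents
        parents = layer_norm(np.einsum('ij,ijd->jd', weights, votes))
    return parents

# toy usage: 6 child capsules, 4 parents, 8-d poses
rng = np.random.default_rng(0)
votes = rng.normal(size=(6, 4, 8))
print(inverted_dot_product_routing(votes).shape)   # (4, 8)
```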
[ML paper 4/10]
Why you may want to read this: Newest paper from Ulrich Kerzel (IUBH).
Cyclic Boosting -- an explainable supervised machine learning algorithm.
Felix Wick, Ulrich Kerzel, Michael Feindt
Supervised machine learning algorithms have seen spectacular advances and surpassed human-level performance in a wide range of specific applications. However, using complex ensemble or deep learning algorithms typically results in black-box models, where the path leading to individual predictions cannot be followed in detail. To address this issue, we propose the novel "Cyclic Boosting" machine learning algorithm, which efficiently performs accurate regression and classification while allowing a detailed understanding of how each individual prediction was made.
[ML paper 5/10]
Why you may want to read this: Newest paper from Zoubin Ghahramani (Professor, University of Cambridge, and Chief Scientist, Uber).
DynamicPPL: Stan-like Speed for Dynamic Probabilistic Models.
Mohamed Tarek, Kai Xu, Martin Trapp, Hong Ge, Zoubin Ghahramani
We present the preliminary high-level design and features of DynamicPPL.jl, a modular library providing a lightning-fast infrastructure for probabilistic programming. Besides a computational performance that is often close to or better than Stan, DynamicPPL provides an intuitive DSL that allows the rapid development of complex dynamic probabilistic programs. Being entirely written in Julia, a high-level dynamic programming language for numerical computing, DynamicPPL inherits a rich set of features available through the Julia ecosystem. Since DynamicPPL is a modular, stand-alone library, any probabilistic programming system written in Julia, such as Turing.jl, can use DynamicPPL to specify models and trace their model parameters. The main features of DynamicPPL are: 1) a meta-programming based DSL for specifying dynamic models using an intuitive tilde-based notation; 2) a tracing data-structure for tracking RVs in dynamic probabilistic models; 3) a rich contextual dispatch system allowing tailored behaviour during model execution; and 4) a user-friendly syntax for probabilistic queries. Finally, we show in a variety of experiments that DynamicPPL, in combination with Turing.jl, achieves computational performance that is often close to or better than Stan.
[ML paper 6/10]
Why you may want to read this: Newest paper from Richard Socher (Chief Scientist at Salesforce).
Explore, Discover and Learn: Unsupervised Discovery of State-Covering Skills.
Víctor Campos, Alexander Trott, Caiming Xiong, Richard Socher, Xavier Giro-i-Nieto, Jordi Torres
Acquiring abilities in the absence of a task-oriented reward function is at the frontier of reinforcement learning research. This problem has been studied through the lens of empowerment, which draws a connection between option discovery and information theory. Information-theoretic skill discovery methods have garnered much interest from the community, but little research has been conducted in understanding their limitations. Through theoretical analysis and empirical evidence, we show that existing algorithms suffer from a common limitation -- they discover options that provide a poor coverage of the state space. In light of this, we propose 'Explore, Discover and Learn' (EDL), an alternative approach to information-theoretic skill discovery. Crucially, EDL optimizes the same information-theoretic objective derived from the empowerment literature, but addresses the optimization problem using different machinery. We perform an extensive evaluation of skill discovery methods on controlled environments and show that EDL offers significant advantages, such as overcoming the coverage problem, reducing the dependence of learned skills on the initial state, and allowing the user to define a prior over which behaviors should be learned.
[ML paper 7/10]
Why you may want to read this: Newest paper from Masashi Sugiyama (Director, RIKEN Center for Advanced Intelligence Project / Professor, The University of …), Dacheng Tao (The University of Sydney).
Towards Mixture Proportion Estimation without Irreducibility.
Yu Yao, Tongliang Liu, Bo Han, Mingming Gong, Gang Niu, Masashi Sugiyama, Dacheng Tao
\textit{Mixture proportion estimation} (MPE) is a fundamental problem of practical significance, where we are given data from only a \textit{mixture} and one of its two \textit{components} to identify the proportion of each component. All existing MPE methods that are distribution-independent explicitly or implicitly rely on the \textit{irreducible} assumption---the unobserved component is not a mixture containing the observable component. If this is not satisfied, those methods will lead to a critical estimation bias. In this paper, we propose \textit{Regrouping-MPE}, which works without the irreducible assumption: it builds a new irreducible MPE problem and solves the new problem. It is worthwhile to change the problem: we prove that if the assumption holds, our method will not affect anything; if the assumption does not hold, the bias from changing the problem is less than the bias from violating the irreducible assumption in the original problem. Experiments show that our method outperforms all state-of-the-art MPE methods on various real-world datasets.
[ML paper 8/10]
Why you may want to read this: Newest paper from Francis Bach (Inria - Ecole Normale Supérieure).
On the Effectiveness of Richardson Extrapolation in Machine Learning.
Francis Bach, SIERRA
Richardson extrapolation is a classical technique from numerical analysis that can improve the approximation error of an estimation method by linearly combining several estimates obtained from different values of one of its hyperparameters, without needing to know in detail the inner structure of the original estimation method. The main goal of this paper is to study when Richardson extrapolation can be used within machine learning, beyond the existing applications to step-size adaptations in stochastic gradient descent. We identify two situations where Richardson extrapolation can be useful: (1) when the hyperparameter is the number of iterations of an existing iterative optimization algorithm, with applications to averaged gradient descent and Frank-Wolfe algorithms (where we obtain asymptotic rates of O(1/k^2) on polytopes, where k is the number of iterations), and (2) when it is a regularization parameter, with applications to Nesterov smoothing techniques for minimizing non-smooth functions (where we obtain asymptotic rates close to O(1/k^2) for non-smooth functions) and to ridge regression. In all these cases, we show that extrapolation techniques come with no significant loss in performance, but sometimes with strong gains, and we provide theoretical justifications based on asymptotic expansions for such gains, as well as empirical illustrations on classical problems from machine learning.
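A minimal worked example of Richardson extrapolation itself: combine estimates computed at hyperparameter values h and h/2 so that the leading O(h) error term cancels. The forward-difference derivative below is a textbook illustration, not one of the machine learning applications studied in the paper.

```python
import math

def richardson(estimate, h, order=1):
    """Combine two estimates computed at hyperparameter values h and h/2 to
    cancel the leading O(h^order) error term:
    R = (2^order * f(h/2) - f(h)) / (2^order - 1)."""
    f_h, f_half = estimate(h), estimate(h / 2)
    return (2 ** order * f_half - f_h) / (2 ** order - 1)

# toy usage: forward-difference derivative of sin at 1.0 (error is O(h))
deriv = lambda h: (math.sin(1.0 + h) - math.sin(1.0)) / h
h = 0.1
print(abs(deriv(h) - math.cos(1.0)))               # plain estimate error
print(abs(richardson(deriv, h) - math.cos(1.0)))   # extrapolated estimate error
```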
[ML paper 9/10]
Why you may want to read this: Newest paper from Michael L. Littman (Brown University).
Learning State Abstractions for Transfer in Continuous Control.
Kavosh Asadi, David Abel, Michael L. Littman
Can simple algorithms with a good representation solve challenging reinforcement learning problems? In this work, we answer this question in the affirmative, where we take "simple learning algorithm" to be tabular Q-Learning, the "good representations" to be a learned state abstraction, and "challenging problems" to be continuous control tasks. Our main contribution is a learning algorithm that abstracts a continuous state-space into a discrete one. We transfer this learned representation to unseen problems to enable effective learning. We provide theory showing that learned abstractions maintain a bounded value loss, and we report experiments showing that the abstractions empower tabular Q-Learning to learn efficiently in unseen tasks.
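A minimal sketch of the overall recipe: map continuous states through an abstraction phi into a discrete table and run tabular Q-learning on the abstract states. Uniform binning is used below as a stand-in for the learned abstraction in the paper; names and hyperparameters are illustrative.

```python
import numpy as np

def make_abstraction(bins_per_dim, low, high):
    """phi: continuous state -> discrete index (uniform binning stand-in
    for a learned state abstraction)."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    def phi(s):
        ratios = np.clip((np.asarray(s, float) - low) / (high - low), 0, 1 - 1e-9)
        idx = (ratios * bins_per_dim).astype(int)
        return int(np.ravel_multi_index(idx, [bins_per_dim] * len(low)))
    return phi, bins_per_dim ** len(low)

def q_learning_step(Q, phi, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning on abstract states: simple learner, learned representation."""
    z, z_next = phi(s), phi(s_next)
    Q[z, a] += alpha * (r + gamma * Q[z_next].max() - Q[z, a])
    return Q

# toy usage on a 2-d continuous state space with 3 actions
phi, n_states = make_abstraction(bins_per_dim=10, low=[-1, -1], high=[1, 1])
Q = np.zeros((n_states, 3))
Q = q_learning_step(Q, phi, s=[0.2, -0.4], a=1, r=1.0, s_next=[0.25, -0.35])
print(Q[phi([0.2, -0.4])])
```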
[ML paper 10/10]
Why you may want to read this: Newest paper from Kyunghyun Cho (New York University, Facebook AI Research).
Consistency of a Recurrent Language Model With Respect to Incomplete Decoding.
Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, Kyunghyun Cho
Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition. We study the related issue of receiving infinite-length sequences from a recurrent language model when using common decoding algorithms. To analyze this issue, we first define inconsistency of a decoding algorithm, meaning that the algorithm can yield an infinite-length sequence that has zero probability under the model. We prove that commonly used incomplete decoding algorithms - greedy search, beam search, top-k sampling, and nucleus sampling - are inconsistent, despite the fact that recurrent language models are trained to produce sequences of finite length. Based on these insights, we propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model. Empirical results show that inconsistency occurs in practice, and that the proposed methods prevent inconsistency.
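A minimal sketch of one proposed remedy, a consistent variant of top-k sampling, under the interpretation that the end-of-sequence token is always kept in the candidate set so that termination has non-zero probability at every step; the paper's exact formulation may differ, and the toy distribution below is illustrative.

```python
import numpy as np

def consistent_top_k_sample(probs, k, eos_id, rng):
    """Top-k sampling that always keeps the end-of-sequence token in the
    candidate set, restoring a non-zero chance of stopping at every step.
    Plain top-k would drop eos whenever it falls outside the k most
    probable tokens."""
    top = set(np.argsort(-probs)[:k])
    top.add(eos_id)                       # the 'consistent' modification
    idx = np.array(sorted(top))
    p = probs[idx] / probs[idx].sum()
    return int(rng.choice(idx, p=p))

# toy usage: eos (id 0) has tiny probability and would be excluded by plain top-2
rng = np.random.default_rng(0)
probs = np.array([0.01, 0.5, 0.3, 0.19])
print(consistent_top_k_sample(probs, k=2, eos_id=0, rng=rng))
```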
You are welcome to subscribe to the daily updated edition of the paper podcast: http://www.buzzsprout.com/632479.