POSTERS

Shlomo E. Chazan, Sharon Gannot and Jacob Goldberger

Bar-Ilan University

Multi-microphone, DNN-based speech enhancement and speaker separation/extraction algorithms have recently gained increasing popularity. The enhancement capabilities of a spatial processor can be very high, provided that all its building blocks are accurately estimated. Data-driven estimation approaches can be very attractive since they do not rely on accurate statistical models, which are usually unavailable. However, training a DNN with multi-microphone data is a challenging task, due to inevitable differences between the training and test phases. In this work, we present an estimation procedure for controlling a linearly-constrained minimum variance (LCMV) beamformer for speaker extraction and noise reduction. We propose an attention-based DNN for speaker diarization that is applicable to the task at hand. In the proposed scheme, each microphone signal propagates through a dedicated DNN and an attention mechanism selects the most informative microphone. This approach has the potential to mitigate the mismatch between training and test phases and can therefore lead to improved speaker extraction performance.

Aviad Elyashar, Jorge Bendahan, Rami Puzis, Maria-Amparo Sanmateu

Ben-Gurion University

Today, people tend to consume news from social media rather than from traditional news sources. The nature of online news publication has changed, to the point that traditional fact checking and vetting are sometimes incomplete due to the flood of material from content aggregators. As a result, fake news is spreading widely on online social media. We propose an approach for measuring fake news in online social media based on estimating the distribution of fake news promoters among the accounts that contributed to a given online discussion.

Motoya Ohnishi, Masahiro Yukawa, Mikael Johansson, Masashi Sugiyama

Keio Univ. / KTH / RIKEN AIP

Motivated by the success of reinforcement learning (RL) for discrete-time tasks such as AlphaGo and Atari games, there has been a recent surge of interest in using RL for continuous-time control of physical systems (cf. many challenging tasks in OpenAI Gym and DeepMind Control Suite). Since discretization of time is susceptible to error, it is methodologically more desirable to handle the system dynamics directly in continuous time. However, very few techniques exist for continuous-time RL and they lack flexibility in value function approximation. In this paper, we propose a novel framework for model-based continuous-time value function approximation in reproducing kernel Hilbert spaces. The resulting framework is so flexible that it can accommodate any kind of kernel-based approach, such as Gaussian processes and kernel adaptive filters, and it allows us to handle uncertainties and nonstationarity without prior knowledge about the environment or what basis functions to employ. We demonstrate the validity of the presented framework through experiments.

Tsuyoshi Okita, Hirotaka Hachiya, Sozo Inoue, Naonori Ueda

Kyushu Institute of Technology & RIKEN AIP

Cross-Modal Translation of Continuous Signals for Activity Recognition: We propose a method for cross-modal translation of continuous signals, such as accelerometer, video, and motion-capture data, using an encoder-decoder model. We then propose an unsupervised activity recognizer built on top of this model.

Lihi Shiloh, Avishay Eyal and Raja Giryes

Tel Aviv University

Developing automatic algorithmic tools for target detection and classification in a fiber-optic Distributed Acoustic Sensing (DAS) system is a challenging task. The main hurdle is the need to produce a large-scale dataset of tagged events to facilitate the training of the algorithms. This task requires considerable resources in terms of manpower, computing time and computer memory. In contrast, generating a training dataset via a computer simulation can significantly simplify the development stage and allow tremendous savings in time and cost. This approach, however, requires highly accurate modeling of the optical DAS system, the generation and propagation of the seismic/acoustic waves in the medium, and the interaction between the waves and the fiber. The physical parameters and details needed for such modeling are rarely available. In this paper, a novel approach for efficient generation of training data is introduced and demonstrated. It is based on using a Generative Adversarial Network (GAN) to transform simulation data so that it accurately mimics genuine data, based on a relatively small, manually labeled experimental database. The new approach is verified with experimental data taken from a 5 km long DAS sensor, yielding 94% classification accuracy between ambient noise and human steps in the vicinity of the buried fiber.

Ryuichiro Hataya, Hideki Nakayama

RIKEN AIP

Deep convolutional neural networks excel in image recognition, but they are also known to be fragile to label corruption. To mitigate this problem, we propose to stochastically switch between two loss functions, categorical cross entropy and mean absolute error, using a Bernoulli distribution, in order to exploit the advantages of both. We employ a bilevel programming approach to simultaneously optimize the base CNN and the hyper-parameter of the distribution. Our proposed method requires only minor modification of the optimization process of the original supervised problem, yet achieves results on par with other state-of-the-art methods.
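The switching step described above can be sketched as follows. This is a minimal illustration only: the function names and the parameter `p_cce` (the Bernoulli probability of choosing cross entropy) are assumptions made for this sketch, and the bilevel optimization of that parameter is not shown.

```python
import math
import random

def cross_entropy(probs, label):
    """Categorical cross entropy for one example (label is a class index)."""
    return -math.log(max(probs[label], 1e-12))

def mean_absolute_error(probs, label):
    """Mean absolute error between the predicted distribution and the one-hot target."""
    errors = [abs(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs)]
    return sum(errors) / len(errors)

def switched_loss(probs, label, p_cce, rng):
    """Draw z ~ Bernoulli(p_cce); use cross entropy if z = 1, otherwise MAE."""
    if rng.random() < p_cce:
        return cross_entropy(probs, label)
    return mean_absolute_error(probs, label)

# Hypothetical usage: one predicted distribution, true class 0, five random draws.
rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
losses = [switched_loss(probs, 0, p_cce=0.5, rng=rng) for _ in range(5)]
```

Setting `p_cce` to 1.0 or 0.0 recovers plain cross entropy or plain MAE, so the Bernoulli parameter interpolates between the two training regimes.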

Shunsuke Kanda, Yasuo Tabei

RIKEN AIP

Recently, it has become popular to randomly map vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches. Such random mapping is called similarity-preserving hashing and approximates a similarity metric by the Hamming distance. Although several efficient similarity-preserving hashing algorithms producing integer sketches have been developed thus far, most recent similarity search methods for sketches are designed for binary sketches. In this paper, we present a novel similarity search method over integer sketches that employs edge-labeled trees called tries. Our method builds a trie-based index from sketches and efficiently solves the problem by traversing nodes. To handle massive databases of vectorial data, we develop a novel succinct trie representation with high space efficiency and also propose a practically fast traversal algorithm for the representation. Empirical results using huge databases of vectorial data demonstrate that our similarity search is up to one order of magnitude faster than state-of-the-art methods.
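To make the trie traversal concrete, here is a minimal sketch of Hamming-threshold search over integer sketches. It uses plain nested dictionaries rather than the succinct representation developed in the poster, assumes all sketches have the same length, and the function names are illustrative.

```python
def build_trie(sketches):
    """Build a trie of nested dicts; the ids of stored sketches sit under the key None."""
    root = {}
    for idx, sketch in enumerate(sketches):
        node = root
        for symbol in sketch:
            node = node.setdefault(symbol, {})
        node.setdefault(None, []).append(idx)
    return root

def search(node, query, max_dist, depth=0, dist=0):
    """Return ids of all sketches within Hamming distance max_dist of query."""
    if depth == len(query):
        return list(node.get(None, []))
    hits = []
    for symbol, child in node.items():
        if symbol is None:
            continue
        new_dist = dist + (symbol != query[depth])
        if new_dist <= max_dist:  # prune subtrees that already exceed the threshold
            hits.extend(search(child, query, max_dist, depth + 1, new_dist))
    return hits

# Hypothetical database of four integer sketches of length 4.
sketches = [(0, 1, 2, 3), (0, 1, 2, 0), (3, 3, 3, 3), (0, 1, 2, 3)]
trie = build_trie(sketches)
near = sorted(search(trie, (0, 1, 2, 3), max_dist=1))
```

Sketches that share a prefix share a path, so a mismatch budget that is exhausted on a prefix prunes every sketch below it in one step.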

Masahiro Ikeda, Atsushi Miyauchi, Yuuki Takai, Yuuichi Yoshida

RIKEN AIP

Cheeger's inequality states that a tightly connected subset can be extracted from a graph $G$ using an eigenvector of the normalized Laplacian associated with $G$. More specifically, we can compute a subset with conductance $O(\sqrt{\phi_G})$, where $\phi_G$ is the minimum conductance of a set in $G$. It has recently been shown that Cheeger's inequality can be extended to hypergraphs. However, as the normalized Laplacian of a hypergraph is no longer a matrix, we can only approximate its eigenvectors; this causes a loss in the conductance of the obtained subset. To address this problem, we here consider the heat equation on hypergraphs, which is a differential equation exploiting the normalized Laplacian. We show that the heat equation has a unique solution and that we can extract a subset with conductance $\sqrt{\phi_G}$ from the solution. An analogous result also holds for directed graphs.
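For the classical (ordinary graph) case, the subset is usually extracted by a sweep cut over a vertex score vector. The sketch below shows only that extraction step on an unweighted, connected graph; the score vector is hand-picked here as a stand-in for a degree-scaled eigenvector (or, in the poster's setting, a heat-equation solution), and all names are illustrative assumptions.

```python
def conductance(adj, subset):
    """Conductance of subset: number of cut edges / min(vol(subset), vol(complement))."""
    subset = set(subset)
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_s = sum(len(adj[u]) for u in subset)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_s
    return cut / min(vol_s, vol_rest)

def sweep_cut(adj, scores):
    """Sort vertices by score and return the prefix set with the smallest conductance."""
    order = sorted(adj, key=lambda v: scores[v])
    best_set, best_phi = None, float("inf")
    for k in range(1, len(order)):
        phi = conductance(adj, order[:k])
        if phi < best_phi:
            best_set, best_phi = set(order[:k]), phi
    return best_set, best_phi

# Two triangles joined by the single edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = {0: -1.0, 1: -0.9, 2: -0.5, 3: 0.5, 4: 0.9, 5: 1.0}  # assumed, not computed
best_set, best_phi = sweep_cut(adj, scores)
```

On this toy graph the sweep separates the two triangles, cutting one edge against a volume of 7 on each side.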

Yuto Ogino, Masahiro Yukawa

RIKEN AIP Center / Keio University

Spectral clustering is an empirically successful approach to separating a dataset into groups with possibly complex shapes based on pairwise affinity. Identifying the number of clusters automatically is still an open issue, although many heuristics have been proposed. In this paper, we propose imposing sparsity on the eigenvectors of the graph Laplacian to attain reasonable approximations of the so-called cluster-indicator vectors, from which the clusters as well as the number of clusters are identified. The proposed algorithm enjoys low computational complexity as it only computes a relevant subset of eigenvectors. It also achieves better clustering quality than existing methods, as shown by simulations using nine real datasets.

Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor

Technion

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care with this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.
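As a small illustration of the multi-step greedy operator itself (not of the soft-update algorithms analyzed in the poster), the sketch below computes an h-step greedy policy on a tabular MDP as the 1-step greedy policy with respect to h-1 optimal Bellman backups of a value estimate. The toy MDP and all names are assumptions made for this example.

```python
def bellman_backup(P, R, gamma, v):
    """One application of the optimal Bellman operator; returns (values, greedy actions)."""
    values, actions = [], []
    for s in range(len(P)):
        q = [R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
             for a in range(len(P[s]))]
        best = max(range(len(q)), key=q.__getitem__)
        values.append(q[best])
        actions.append(best)
    return values, actions

def h_step_greedy(P, R, gamma, v, h):
    """h-step greedy policy w.r.t. v: 1-step greedy w.r.t. (h - 1) backups of v."""
    for _ in range(h - 1):
        v = bellman_backup(P, R, gamma, v)[0]
    return bellman_backup(P, R, gamma, v)[1]

# Toy 3-state chain: action 0 stays, action 1 moves right; state 2 is absorbing.
# P[s][a] is a list of (probability, next_state) pairs, R[s][a] the reward.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 2)]],
     [[(1.0, 2)], [(1.0, 2)]]]
R = [[1.0, 0.0], [0.0, 0.0], [10.0, 10.0]]
v0 = [0.0, 0.0, 0.0]
one_step = h_step_greedy(P, R, 0.9, v0, h=1)
three_step = h_step_greedy(P, R, 0.9, v0, h=3)
```

With the zero value estimate, the 1-step greedy policy stays in state 0 to collect the small immediate reward, while the 3-step greedy policy looks far enough ahead to move toward the high-reward state.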

Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry

Technion

Over the past few years, batch-normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remain unclear, and several shortcomings hinder its use for certain tasks. In this work we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple the weights' norm from the underlying optimized objective. We also improve the use of weight-normalization and show the connection between practices such as normalization, weight decay and learning-rate adjustments. Finally, we suggest several alternatives to the widely used L2 batch-norm, using normalization in L1 and L∞ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations.
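One way to picture an L1 alternative to the usual variance-based normalization is to divide by the mean absolute deviation instead of the standard deviation. The sketch below is an assumption-laden illustration, not the poster's exact formulation: the factor sqrt(pi/2) is chosen here because, for Gaussian data, the standard deviation equals sqrt(pi/2) times the mean absolute deviation, and the learnable scale and shift of a full normalization layer are omitted.

```python
import math

def l1_batch_norm(values, eps=1e-5):
    """Center a batch and divide by sqrt(pi/2) times its mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(x - mean) for x in values) / n
    scale = math.sqrt(math.pi / 2) * mad + eps
    return [(x - mean) / scale for x in values]

# Hypothetical batch of activations for a single feature.
normalized = l1_batch_norm([1.0, 2.0, 3.0, 4.0])
```

The statistic needs no squares or square roots over the batch, which is the kind of property that matters in low-precision arithmetic.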

Dor Bank, Raja Giryes

Tel Aviv University

Dropout is a popular regularization technique in neural networks. Yet, the reason for its success is still not fully understood. This paper provides a new interpretation of Dropout from a frame theory perspective. This leads to a novel regularization technique for neural networks that minimizes the cross-correlation between filters in the network. We demonstrate its applicability in convolutional and fully connected layers in both feed-forward and recurrent networks.

Han Bao, Gang Niu, Masashi Sugiyama

The University of Tokyo / RIKEN AIP

Supervised learning needs a huge amount of labeled data, which can be a major bottleneck when there are privacy concerns or labeling costs are high. To overcome this problem, we propose a new weakly-supervised learning setting, called SU classification, where only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points are needed instead of fully labeled data. We show that an unbiased estimator of the classification risk can be obtained from SU data alone, and that the estimation error of its empirical risk minimizer achieves the optimal parametric convergence rate. Finally, we demonstrate the effectiveness of the proposed method through experiments.

Yuval Atzmon, Gal Chechik

Bar-Ilan University

In zero-shot learning (ZSL), a classifier is trained to recognize visual classes without any image samples. Instead, it is given semantic information about the class, like a textual description or a set of attributes. Learning from attributes could benefit from explicitly modeling the structure of the attribute space. Unfortunately, learning a general structure from empirical samples is hard with typical dataset sizes.
Here we describe LAGO, a probabilistic model designed to capture natural soft and-or relations across groups of attributes. We show how this model can be learned end-to-end with a deep attribute-detection model. The soft group structure can be learned from data jointly as part of the model, and can also readily incorporate prior knowledge about groups if available. The soft and-or structure succeeds in capturing meaningful and predictive structures, improving the accuracy of zero-shot learning on two of three benchmarks.
Finally, LAGO reveals a unified formulation over two ZSL approaches: DAP and ESZSL.

Eitan Richardson, Yair Weiss

The Hebrew University

A longstanding problem in machine learning is to find unsupervised methods that can learn the statistical structure of high dimensional signals. In recent years, GANs have gained much attention as a possible solution to the problem, and in particular have shown the ability to generate remarkably realistic high resolution sampled images. At the same time, many authors have pointed out that GANs may fail to model the full distribution ("mode collapse") and that using the learned models for anything other than generating samples may be very difficult.
In this paper, we examine the utility of GANs in learning statistical models of images by comparing them to perhaps the simplest statistical model, the Gaussian Mixture Model. First, we present a simple method to evaluate generative models based on relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss how GMMs can be used to generate sharp images.
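The bin-based evaluation idea can be sketched roughly as follows. This is one possible reading, under stated assumptions: bins are defined by nearest predetermined center, and a bin counts as "different" when a two-proportion z-test rejects equality at a chosen threshold. The function names, the test, and the threshold are illustrative choices, not details taken from the poster.

```python
import math

def bin_counts(points, centers):
    """Count how many points fall into each bin (nearest center, squared Euclidean)."""
    counts = [0] * len(centers)
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        counts[dists.index(min(dists))] += 1
    return counts

def num_different_bins(ref_counts, gen_counts, z_threshold=1.96):
    """Number of bins whose proportions differ under a two-proportion z-test."""
    n_ref, n_gen = sum(ref_counts), sum(gen_counts)
    different = 0
    for r, g in zip(ref_counts, gen_counts):
        pooled = (r + g) / (n_ref + n_gen)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_ref + 1 / n_gen))
        if se > 0 and abs(r / n_ref - g / n_gen) / se > z_threshold:
            different += 1
    return different

# Toy 1-D example: reference data covers two modes, the "collapsed" sample only one.
centers = [(0.0,), (10.0,)]
reference = [(0.0,)] * 50 + [(10.0,)] * 50
collapsed = [(0.0,)] * 100
ref_counts = bin_counts(reference, centers)
same = num_different_bins(ref_counts, bin_counts(reference, centers))
collapse = num_different_bins(ref_counts, bin_counts(collapsed, centers))
```

A sample set that drops a mode over-fills some bins and starves others, so mode collapse shows up directly as a nonzero count without any auxiliary network.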

Itay Evron, Edward Moroshko, Koby Crammer

Technion

In extreme classification problems, learning algorithms are required to map instances to labels from an extremely large label set. We build on a recent extreme classification framework with logarithmic time and space, and on a general approach for error-correcting output coding (ECOC) with loss-based decoding, and introduce a flexible and efficient approach accompanied by theoretical bounds. Our framework employs output codes induced by graphs, for which we show how to perform efficient loss-based decoding to potentially improve accuracy. In addition, our framework offers a tradeoff between accuracy, model size and prediction time. We show how to find the sweet spot of this tradeoff using only the training data. Our experimental study demonstrates the validity of our assumptions and claims, and shows that our method is competitive with state-of-the-art algorithms.
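To illustrate why loss-based decoding can differ from plain Hamming decoding in ECOC, the sketch below decodes a vector of real-valued binary-classifier scores against a small ±1 code matrix. The code matrix is a hand-made toy rather than a graph-induced code, and the hinge loss is just one possible choice; both are assumptions of this example.

```python
def hinge(margin):
    """Hinge loss on a signed margin."""
    return max(0.0, 1.0 - margin)

def loss_based_decode(code_matrix, scores, loss=hinge):
    """Return the class whose codeword incurs the smallest total loss on the scores."""
    totals = [sum(loss(bit * s) for bit, s in zip(codeword, scores))
              for codeword in code_matrix]
    return totals.index(min(totals))

def hamming_decode(code_matrix, scores):
    """Return the class whose codeword disagrees in sign with the fewest scores."""
    totals = [sum(1 for bit, s in zip(codeword, scores) if bit * s <= 0)
              for codeword in code_matrix]
    return totals.index(min(totals))

# Three classes, three binary problems; rows are codewords with entries in {-1, +1}.
code_matrix = [[1, 1, 1], [-1, -1, -1], [1, -1, 1]]
scores = [0.1, 0.1, -5.0]  # two weak positive votes, one confident negative vote
by_hamming = hamming_decode(code_matrix, scores)
by_loss = loss_based_decode(code_matrix, scores)
```

Hamming decoding counts the two weak votes as heavily as the confident one and picks class 0, whereas loss-based decoding weighs the margins and picks class 1.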

Chang Liu, Mark Last, Armin Shmilovici

Ben-Gurion University

Understanding a complex story is a unique ability of human beings. A typical story includes many elements, such as protagonist, opponent, desire, turning points, battle, and victory. The capability of identifying as many elements as possible can undoubtedly help in understanding the entire story. However, this task is challenging because of its complexity and subjectiveness. In this paper, we extend the two clocks theory, originally validated on scripts of theatre plays, to identifying the turning points in the story of a cartoon movie. The two clocks theory monitors the timeline of a story with two clocks: an event clock, which measures the regular time flow of the story; and a weighted clock, which measures the timing of the story events. We conducted an experiment to evaluate our extension of the two clocks theory and achieved satisfactory results: 78.6% accuracy for turning point identification and 100% for key story element detection. The initial experiments were performed on the Flintstones Season 1 cartoon series (28 episodes), because the stories of these cartoons are usually simple and unambiguous, which makes them easier to analyze automatically than cinema movies or TV series. Based on our encouraging results, we believe that this is just the first step towards automated understanding of stories in cinema movies and eventually in amateur videos uploaded to the Internet.

Veltzer Doron

Tel Aviv University

This poster proposes Recursive Neural Networks (RNNs) as phonological models. To demonstrate their effectiveness, I revisit Becker (2009) and Becker et al. (2011), summarize the OT account of Turkish stem-final voicing alternations, and criticize it on the grounds of implausible learnability. I then show how models based on RNN structure would handle the same phenomenon in a simpler and more learnable manner. I end by displaying results of an RNN topology used to model the phenomenon, tracing the motivation for its development to such innate facts as the temporal nature of speech and the articulator

Shani Gamrian, Yoav Goldberg

Bar-Ilan University

Deep Reinforcement Learning has managed to achieve state-of-the-art results in learning control policies directly from raw pixels. However, despite its remarkable success, it fails to generalize, a fundamental component required in a stable Artificial Intelligence system. Using the Atari game Breakout, we demonstrate the difficulty of a trained agent in adjusting to simple modifications in the raw image, ones that a human could adapt to trivially. In transfer learning, the goal is to use the knowledge gained from the source task to make the training of the target task faster and better. We show that using various forms of fine-tuning, a common method for transfer learning, is not effective for adapting to such small visual changes. In fact, it is often easier to re-train the agent from scratch than to fine-tune a trained agent. We suggest that in some cases transfer learning can be improved by adding a dedicated component whose goal is to learn to visually map between the known domain and the new one. Concretely, we use Unaligned Generative Adversarial Networks (GANs) to create a mapping function to translate images in the target task to corresponding images in the source task. These mapping functions allow us to transform between various variations of the Breakout game, as well as between different levels of a Nintendo game, Road Fighter. We show that learning this mapping is substantially more efficient than re-training.

Ha Quang Minh

RIKEN AIP

We will present our recent work on the infinite-dimensional generalization of many widely used distances and divergences on the set of symmetric, positive definite (SPD) matrices. Our focus will be on the sets of positive definite trace class and Hilbert-Schmidt operators, in particular RKHS covariance operators. The theoretical framework will be accompanied by numerical experiments.

Sho Sonoda, Isao Ishikawa, Masahiro Ikeda, Kei Hagihara, Yoshihiro Sawano, Takuo Matsubara, and Noboru Murata

RIKEN AIP

We consider the supervised learning problem with shallow neural networks. In unpublished experiments conducted several years prior to this study, we noticed an interesting similarity between the distribution of hidden parameters after backpropagation (BP) training and the ridgelet spectrum of the same dataset. We therefore conjectured that the distribution can be expressed as a version of the ridgelet transform, but this was not proven until this study. One difficulty is that both the local minimizers and the ridgelet transforms have an infinite number of varieties, and no relations are known between them. By using the integral representation, we reformulate BP training as a strongly convex optimization problem and find a global minimizer. Finally, by developing ridgelet analysis on a reproducing kernel Hilbert space (RKHS), we write the minimizer explicitly and succeed in proving the conjecture. The modified ridgelet transform has an explicit expression that can be computed by numerical integration, which suggests that we can obtain the global minimizer of BP without BP.

Gabi Shalev, Yossi Adi, Joseph Keshet

Bar-Ilan University

Deep Neural Networks are powerful models that have attained remarkable results on a variety of tasks. These models have been shown to be extremely efficient when training and test data are drawn from the same distribution. However, it is not clear how a network will act when it is fed an out-of-distribution example. In this work, we consider the problem of out-of-distribution detection in neural networks. We propose to use multiple semantic dense representations instead of a sparse representation as the target label. Specifically, we propose to use several word representations obtained from different corpora or architectures as target labels. We evaluated the proposed model on computer vision and speech command detection tasks and compared it to previous methods. Results suggest that our method compares favorably with previous work. In addition, we demonstrate the efficiency of our approach in detecting wrongly classified and adversarial examples.

Hila Gonen, Yoav Goldberg

Bar-Ilan University

Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training. We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we present an effective training protocol that integrates small amounts of code-switched data with large amounts of monolingual data, for both the generative and discriminative cases.

Chao Li, Mohammad Emtiyaz Khan, Zhun Sun, Qibin Zhao

RIKEN AIP

Low-rank tensor decomposition is a promising approach for analysis and understanding of real-world data. In this paper, we derive conditions for exact recovery for a general class of tensor decomposition methods where each latent tensor component can be reshuffled into a low-rank matrix of arbitrary shape.
The reshuffling operation generalizes the traditional unfolding operation, and provides flexibility to recover the true latent factors of complex data structures. We prove that exact recovery can be guaranteed by using a convex program when a type of incoherence measure is upper bounded. The results on image steganography show that our method achieves state-of-the-art performance.
The theoretical analysis in this paper is expected to be useful to derive similar results for other types of tensor-decomposition methods.

The First Japan-Israel

Machine Learning Workshop

19-20 Nov. 2018

Posters