On the Evaluation of Deep Generative Models

Author: Sharon Zhou
Language: en

Book Description
Evaluation drives and tracks progress in every field. Evaluation metrics are designed to assess important criteria in an area and help us understand the quantitative differences between one breakthrough and another. In machine learning, evaluation metrics have historically acted as north stars towards which researchers have optimized and organized their methods and findings. While evaluation metrics have been straightforward to construct and implement in some subfields of machine learning, they have been notoriously difficult to design for generative models. Several reasons explain this: (1) there are no gold-standard outputs to compare against, unlike with held-out test sets; (2) because of their diverse training methods and formulations, inherent model properties are difficult to measure consistently, and sampled outputs are often used for evaluation instead; (3) metrics depend on external (pretrained) models that add another layer of bias and uncertainty; and (4) results are inconsistent without a large number of samples. As a result, generative models have suffered from noisy assessments that occupy a changing evaluation landscape, in contrast to the relative stability of their discriminative counterparts. In this manuscript, we examine several important criteria for generative models and introduce evaluation metrics to address each one while discussing the aforementioned issues in generative model evaluation. In particular, we examine the challenge of measuring the perceptual realism of generated outputs and introduce a human-in-the-loop evaluation system that leverages psychophysics theory to ground the method in human perception literature and crowdsourcing techniques to construct an efficient, reliable, and consistent method for comparing different models.
In addition, we analyze disentanglement, an increasingly important property for assessing learned representations, by measuring an intrinsic property of a generative model's data manifold using persistent homology. The final work in this manuscript takes a step towards assessing a generative model and its different modes with a key application in mind, specifically the stylistic fidelity across different generated modes in a multimodal setting.