
Statistics Department Seminar Series: Florentina Bunea, Professor, Department of Statistics and Data Science, Cornell University

"Optimal estimation of topic distributions in topic models with applications to Wasserstein document-distance calculations"
Friday, September 17, 2021
10:00-11:00 AM
340 West Hall
The focus of this talk is on the estimation of high-dimensional, discrete, possibly sparse, mixture models in the context of topic models. The data consist of p-dimensional multinomial count vectors, corresponding to the p words in a given dictionary, across n independent samples, the documents in a corpus. In topic models, the p x n expected word-frequency matrix is assumed to factorize as the product of a p x K word-topic matrix A and a K x n topic-document matrix T. Since the columns of both matrices are probability vectors, the columns of A are viewed as p-dimensional mixture components that are common to all documents, while the columns of T, the topic distributions, are viewed as the K-dimensional mixture weights that are document specific and are allowed to be sparse.
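The generative model described above can be sketched in a few lines of NumPy. All dimensions, names, and distributional choices here are illustrative assumptions for the sketch, not details from the talk:

```python
# Sketch of the topic-model factorization: expected word frequencies Pi = A @ T,
# with multinomial count vectors observed per document. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p, K, n, N = 50, 3, 20, 500  # dictionary size, topics, documents, words per document

# Word-topic matrix A: each column is a probability vector over the p words.
A = rng.dirichlet(np.ones(p), size=K).T          # shape (p, K)
# Topic-document matrix T: each column is a (possibly sparse) topic distribution.
T = rng.dirichlet(np.ones(K) * 0.3, size=n).T    # shape (K, n)

Pi = A @ T                                       # expected word frequencies, (p, n)
# Observed data: one p-dimensional multinomial count vector per document.
X = np.stack([rng.multinomial(N, Pi[:, j]) for j in range(n)], axis=1)
```

Because the columns of A and T are probability vectors, each column of Pi is again a probability vector over the dictionary.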

The main interest is to provide sharp, finite-sample, l1-norm convergence rates for estimators of the possibly sparse mixture weights T when A is either known or unknown. For known A, we suggest MLE estimation of T. Despite the widespread application of these models, and the simplicity of the method, this analysis is, surprisingly, still open, owing in part to the fact that T is typically on the boundary of its domain. Our non-standard analysis of the MLE not only establishes its l1 convergence rate, but also reveals a remarkable property: the MLE, with no extra regularization, can be exactly sparse and contain the true zero pattern of T. We further show that the MLE is both minimax optimal and adaptive to the unknown sparsity in a large class of sparse topic distributions. When A is unknown, we estimate T by optimizing the likelihood function corresponding to a plug-in, generic estimator of A. For any such estimator that satisfies carefully detailed conditions of proximity to A, we show that the resulting estimator of T retains the properties established for the MLE. Our theoretical results allow the ambient dimensions K and p to grow with the sample sizes.
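For known A, the MLE of a single document's weight vector maximizes the multinomial log-likelihood over the simplex. A minimal sketch, using a standard EM-style fixed-point iteration for mixture weights (an assumed illustration, not the talk's analysis):

```python
# EM-style computation of the MLE of mixture weights t for one document,
# given a known word-topic matrix A. Maximizes sum_i x_i * log((A t)_i)
# over the probability simplex.
import numpy as np

def mle_topic_weights(x, A, iters=2000):
    p, K = A.shape
    t = np.full(K, 1.0 / K)          # uniform initialization on the simplex
    N = x.sum()
    for _ in range(iters):
        mix = A @ t                   # current fitted word frequencies, (p,)
        # multiplicative EM update; preserves nonnegativity and sum-to-one
        t = t * (A.T @ (x / np.maximum(mix, 1e-12))) / N
    return t
```

Entries of t corresponding to absent topics shrink toward zero under this iteration, consistent with the boundary behavior discussed in the abstract, though an iterative scheme only approaches exact zeros in the limit.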

Our main application is to the estimation of 1-Wasserstein distances between document-generating distributions. We propose, estimate, and analyze new 1-Wasserstein distances between alternative probabilistic document representations, at the word and topic level, respectively. We derive finite-sample bounds on the estimates of the proposed 1-Wasserstein distances. For word-level document distances, we contrast our rates with existing rates for the 1-Wasserstein distance between standard empirical frequency estimates. The effectiveness of the proposed 1-Wasserstein distances is illustrated by an analysis of an IMDB movie reviews data set.
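A 1-Wasserstein distance between two discrete document representations can be computed as a small linear program, given a ground cost between support points (e.g., between words or topics). A self-contained sketch via SciPy's `linprog`; the cost matrix and inputs are illustrative assumptions:

```python
# 1-Wasserstein distance between two discrete distributions a (length m) and
# b (length n), under a ground cost matrix C (m x n), solved as the optimal
# transport linear program: minimize <C, P> s.t. P >= 0, P 1 = a, P^T 1 = b.
import numpy as np
from scipy.optimize import linprog

def wasserstein1(a, b, C):
    m, n = C.shape
    c = C.ravel()                     # objective over the flattened plan P
    A_eq = []
    for i in range(m):                # row-sum constraints: sum_j P_ij = a_i
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):                # column-sum constraints: sum_i P_ij = b_j
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([a, b])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return res.fun
```

This dense LP formulation is only practical for small supports (e.g., topic-level representations with modest K); specialized OT solvers are used at word-level dictionary sizes.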


Florentina Bunea is a Professor in the Department of Statistics and Data Science at Cornell University. Her research is broadly centered on statistical machine learning theory and high-dimensional statistical inference.

https://stat.cornell.edu/people/faculty/florentina-bunea
Building: West Hall
Event Type: Workshop / Seminar
Tags: seminar
Source: Happening @ Michigan from Department of Statistics, Department of Statistics Seminar Series