hiercat: Hierarchical Probabilistic Latent Categorization for MALACH

What is hiercat?

hiercat is an automatic text classifier that uses the hierarchical structure of class labels to improve classification performance. It implements the model of Gaussier et al. [1]. It was originally developed as part of the AMSC 663/664 project courses at the University of Maryland, College Park [2].

Is it difficult to use?

The short answer is no. Effort has gone into making hiercat easy to compile and run. If you run into problems, email me and I will try to help.

Is it free?

hiercat is free, both as in speech and as in beer. It is released under the GNU General Public License (GPL).

Abstract:

A hierarchical probabilistic latent categorizer for co-occurrence data in the MALACH project [3] is proposed. Assuming a probabilistic generative model for word-document co-occurrences, we may estimate the conditional probability of a document given a class by iteratively maximizing the model's likelihood function; this may be done using a tempered Expectation-Maximization (EM) algorithm. Bayes' theorem then gives the posterior probability of each class, and we expect the class with the largest posterior to be the document's category. Such a categorizer, as well as a non-hierarchical categorizer using the k-nearest-neighbor algorithm (as a basis for comparison), will be developed and applied to the MALACH data set.
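
The final classification step described above can be illustrated with a minimal sketch. It assumes the per-class log-likelihoods log P(d | c) have already been estimated (for example by EM) and that class priors P(c) are available. The function names and the dictionary inputs are hypothetical and are not taken from the hiercat source.

```python
import math

def posterior_class_probs(log_likelihoods, priors):
    """Return P(c | d) for every class c using Bayes' theorem.

    log_likelihoods: dict mapping class -> log P(d | c)  (assumed given)
    priors:          dict mapping class -> P(c)          (assumed given)
    """
    # log of the unnormalized posterior: log P(d | c) + log P(c)
    log_joint = {c: log_likelihoods[c] + math.log(priors[c])
                 for c in log_likelihoods}
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint.values())
    unnorm = {c: math.exp(v - m) for c, v in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}

def classify(log_likelihoods, priors):
    """Pick the class with the largest posterior probability."""
    post = posterior_class_probs(log_likelihoods, priors)
    return max(post, key=post.get)
```

Working in log space is a design choice for this sketch: document likelihoods are products of many small word probabilities, so their direct values tend to underflow.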

Context:

MALACH (Multilingual Access to Large spoken ArCHives) is a collaboration between researchers at UMD, JHU, IBM and the Survivors of the Shoah Visual History Foundation (VHF), whose purpose is to "dramatically improve access to large multilingual collections of recorded speech in oral history archives."

Transcriptions of survivor testimonies are being produced using automatic speech recognition; these transcriptions must then be categorized for efficient information retrieval. This project seeks to leverage hierarchical properties of these categories (e.g., Berlin is in Germany, which is a location) to improve categorization.
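
As an illustration of what a label hierarchy looks like, the sketch below stores the example above as a child-to-parent map and walks from a leaf category up to the root. The dictionary representation and the name `path_to_root` are assumptions made for this example only; they do not describe how hiercat stores its hierarchy.

```python
# Hypothetical child -> parent map built from the example in the text.
PARENT = {
    "Berlin": "Germany",
    "Germany": "location",
    "location": None,  # root of this small hierarchy
}

def path_to_root(label, parent=PARENT):
    """Return the label followed by all of its ancestors, leaf first."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path
```

Under these assumptions, `path_to_root("Berlin")` yields the leaf and its two ancestors, so a document labeled Berlin also carries evidence for Germany and for location.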

hiercat

NSF Site Visit

Final presentation:

Semester progress report:

Initial proposal (drafts):



olssonmath.umd.edu