Hiercat Manual v. 01
Quickstart:
Create a subdirectory under the hiercat directory for your data files. The
data_20news_small directory already exists, so I'll use it as an example.
Define the hierarchy
In that directory, you will need to create an xml description of the hierarchy
to use, following the syntax of the example 20news-hierarchy.pruned.xml.
A "class" is a word topic that has training documents associated with it,
while a "topic" is a word topic that may sit above a class in the hierarchy,
but does not have training documents associated with it. The root node will
probably always be a topic and a leaf will always be a class.
Once that is produced, the hierarchy can be parsed using the included
script (in the hiercat directory), parseHierarchy.pl. The script just parses
the xml file as a tree and produces the files:
classes.dat: a list of the classes and their internally used id numbers.
topics.dat: a list of the topics and their internally used id numbers.
hierarchy.dat: a binary file which describes the hierarchy, to be used
internally by hiercat.
You will need the classes.dat file to convert the hiercat output back to
whatever form you want, because the output gives probabilities for classes
in terms of the internally used class numbers. More on that below.
Pre-process training and testing documents
Before data is processed by hiercat, it must be pre-processed using the
following included scripts:
step1.pl: Usage: "step1.pl [-c] training_docnames". This maps each
training doc name to its internal doc id number (mapping is placed in
docnums.dat), and each unique word in the training set to its internally
used word id number (mapping stored in wordnums.dat). The script produces
an encoded version of each training document in the working directory, each
having the name "#.sig", where '#' is the training document id number. The
optional switch `-c` tells step1.pl to continue from previous parsing (ie, to
not start the word mapping over from scratch). You would use this if you wanted
to add documents to your training set.
step2.pl: Usage: "step2.pl testing_docnames". This maps each testing
doc name to its internal doc id number (mapping is placed in docnums.test.dat).
The testing docs are each encoded (using the same word-to-wordnum mapping
as with the training docs) and saved in the working directory as "#.test.sig",
where '#' is the testing document id number.
The last thing you'll need is the training data file, which should be named
"hiercat.training". An example is in the data_20news_small directory. The
file has the form:
0 13,14,5
1 3,27
...
This means that training document 0 (again using the mapping contained in
docnums.dat) is labeled with classes 13, 14, and 5 (using the class number
mapping in classes.dat).
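As a rough sketch of the format just described, the following Python reads
the contents of a hiercat.training file into a dictionary that maps each
training docnum to its list of class numbers. The function name and the
choice to skip blank lines are assumptions made for illustration; they are
not part of hiercat.

```python
def parse_training_labels(text):
    """Map each training docnum to its list of class numbers."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no labels and are skipped
        # Each line is "<docnum> <classnum>,<classnum>,..."
        docnum, classes = line.split(None, 1)
        labels[int(docnum)] = [int(c) for c in classes.split(",")]
    return labels


sample = "0 13,14,5\n1 3,27\n"
print(parse_training_labels(sample))
# prints {0: [13, 14, 5], 1: [3, 27]}
```

The docnums and class numbers in the result still need docnums.dat and
classes.dat to be turned back into document and class names.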
Run hiercat
Now you should be ready to run. Hiercat has the following usage:
hiercat [-d dir] [-hvr] [-t file] [-c "session comment"] [-o file]
The command line options have the following effects:
- -d dirname : Specify the directory which contains all the files hiercat
needs to run (hierarchy.dat, classes.dat, topics.dat, docnums.dat,
docnums.test.dat, wordnums.dat). This will also be the default output
directory. If -d is not used, the default directory is the current working
directory (probably not what you want).
- -h : print a help message
- -v : Increase the verbosity level. You will want at least one -v to see
almost any output on stdout. Output is always logged to the run output
file regardless.
- -t filename : Specify the binary file which defines the hierarchy. The
default is "hierarchy.dat", in the directory used for data files (specified
with the -d switch). This isn't really useful yet.
- -r : restore trained parameters and begin run with testing. The training
phase is far and away the most time-consuming part of running hiercat. Each
time a training phase is completed, hiercat automatically saves all the
trained parameter values in a file named "params.dump" in the data
directory. If you have already trained (ie, the params.dump file already
exists), you can change the testing documents and begin hiercat with testing
using the previously calculated parameter values. If you complete a
large training run, you will want to make a backup of this params.dump file.
- -c "session comment" : the comment included in quotes will be included
in the run output file. This is a useful way to remember what the purpose
of a particular run was and how it was run.
- -o filename : specify the file to dump categorization results to. The
default is "hiercat.output" in the data directory. Note that this file also
includes some basic information about the run (duration, start/end date, some
information about the data run on, and the session comment).
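To make the options above concrete, here is a minimal Python sketch that
assembles a hiercat command line from them. It only uses the switches
documented in this manual. The helper name, its parameters, and the
assumption that the executable is reachable as "hiercat" are mine, not part
of the tool.

```python
import shlex


def build_hiercat_command(data_dir, verbosity=1, restore=False,
                          comment=None, output=None):
    """Assemble a hiercat argument list from the documented switches."""
    cmd = ["hiercat", "-d", data_dir]  # assumption: hiercat is on the PATH
    cmd += ["-v"] * verbosity          # each -v raises the verbosity level
    if restore:
        cmd.append("-r")               # reuse params.dump, start with testing
    if comment is not None:
        cmd += ["-c", comment]         # session comment for the output file
    if output is not None:
        cmd += ["-o", output]          # file for the categorization results
    return cmd


cmd = build_hiercat_command("data_20news_small", restore=True,
                            comment="retest with new docs")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd) to start the run.
Passing the comment as its own list element avoids any shell quoting
problems with the spaces in it.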
Once a run is complete, the output file (default being "hiercat.output")
will be placed in the data directory. That file will include, in the
"testing-guesses" tag, categorization output of the form:
0:0=0.00985549|1=0.00857243|2=0.0335464
1:0=0.00601823|1=0.00954432|2=0.0193113
2:0=0.00325507|1=0.000626959|2=0.00729339
3:0=0.00694253|1=0.0100118|2=0.026098
4:0=0.00974872|1=0.010095|2=0.0271683
5:0=0.00785829|1=0.0101077|2=0.0151373
6:0=0.00965198|1=0.00563102|2=0.0239918
7:0=0.00994895|1=0.00455565|2=0.0207622
8:0=0.00758394|1=0.016238|2=0.027036
Each line has the format:
testing docnum:classnum_0=class_0 probability|classnum_1=class_1 probability|
classnum_2=class_2 probability ...
The class number with the highest associated probability is the most likely
class for that document, the next highest is the second most likely, and so
on. Note that the entries are sorted by class number, not by probability.
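Since the entries are not sorted by probability, picking the most likely
class takes a little parsing. The sketch below splits one output line into
its docnum and per-class probabilities and returns the highest-scoring class
number. The function names are hypothetical, and the code assumes you have
already extracted the lines from the "testing-guesses" tag.

```python
def parse_guess_line(line):
    """Split one output line into (docnum, {classnum: probability})."""
    docnum, _, rest = line.strip().partition(":")
    probs = {}
    for pair in rest.split("|"):
        if not pair:
            continue  # tolerate a trailing '|'
        classnum, _, prob = pair.partition("=")
        probs[int(classnum)] = float(prob)
    return int(docnum), probs


def best_class(probs):
    """Return the class number with the highest probability."""
    return max(probs, key=probs.get)


docnum, probs = parse_guess_line("8:0=0.00758394|1=0.016238|2=0.027036")
print(docnum, best_class(probs))
# prints 8 2
```

The class number returned here is the internal one, so look it up in
classes.dat to get the class name.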