Hiercat Manual v. 01

Quickstart:

Create a subdirectory under the hiercat directory for your data files. The data_20news_small directory already exists, so I'll use it as an example.

Define the hierarchy

In that directory, create an XML description of the hierarchy, following the syntax of the example 20news-hierarchy.pruned.xml. A "class" is a word topic that has training documents associated with it, while a "topic" is a word topic that may sit above a class in the hierarchy but has no training documents associated with it. The root node will almost always be a topic, and a leaf will always be a class.

Once the XML file is written, parse it with the included script parseHierarchy.pl (in the hiercat directory). The script reads the XML file as a tree and produces the following files:

  • classes.dat: a list of the classes and their internal ID numbers.
  • topics.dat: a list of the topics and their internal ID numbers.
  • hierarchy.dat: a binary file describing the hierarchy, used internally by hiercat.

You will need the classes.dat file to convert the hiercat output back into whatever form you want, because the output gives class probabilities in terms of the internal class numbers. The output format is described at the end of this manual.

    Pre-process training and testing documents

    Before data is processed by hiercat, it must be pre-processed using the following included scripts:
  • step1.pl: Usage: "step1.pl [-c] training_docnames". This maps each training document name to its internal document ID number (the mapping is stored in docnums.dat), and each unique word in the training set to its internal word ID number (the mapping is stored in wordnums.dat). The script writes an encoded version of each training document to the working directory, named "#.sig", where '#' is the training document ID number. The optional switch `-c` tells step1.pl to continue from a previous run (i.e., not to start the word mapping over from scratch). Use it when you want to add documents to your training set.
  • step2.pl: Usage: "step2.pl testing_docnames". This maps each testing document name to its internal document ID number (the mapping is stored in docnums.test.dat). Each testing document is encoded (using the same word-to-wordnum mapping as the training documents) and saved in the working directory as "#.test.sig", where '#' is the testing document ID number.
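
To make the word-numbering idea concrete, here is a minimal Python sketch of an incremental word-to-ID mapping. It is not the code of step1.pl: the function name assign_word_ids, the whitespace tokenization, and the zero-based numbering are all assumptions made for illustration. Passing in an existing mapping mirrors the intent of the `-c` switch, where known words keep their numbers and new words are appended.

```python
def assign_word_ids(documents, word_ids=None):
    """Encode documents as lists of word IDs, growing the mapping as needed.

    Hypothetical helper: passing an existing word_ids dict continues a
    previous mapping instead of starting from scratch.
    """
    if word_ids is None:
        word_ids = {}
    encoded = []
    for text in documents:
        ids = []
        for word in text.split():  # assumption: words are whitespace-separated
            if word not in word_ids:
                word_ids[word] = len(word_ids)  # next unused ID
            ids.append(word_ids[word])
        encoded.append(ids)
    return encoded, word_ids


# First pass over two toy training documents.
encoded, vocab = assign_word_ids(["the cat sat", "the dog sat"])
print(encoded)  # [[0, 1, 2], [0, 3, 2]]

# Continuing with the same mapping: "cat" keeps ID 1, "a" gets the next ID.
more, vocab = assign_word_ids(["a cat"], vocab)
print(more)  # [[4, 1]]
```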

    The last thing you'll need is the training data file, which should be named "hiercat.training". An example is in the data_20news_small directory. The file has the form:

    0 13,14,5
    1 3,27
    ...
    

    This means that training document 0 (again using the mapping contained in docnums.dat) is labeled with classes 13, 14, and 5 (using the class number mapping in classes.dat).
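
If you want to check a training file before a run, the format above can be read with a few lines of Python. This is only a sketch based on the two example lines shown; the function names parse_training_line and parse_training are made up for illustration and are not part of hiercat.

```python
def parse_training_line(line):
    """Parse one line such as '0 13,14,5' into (doc_id, [class_ids])."""
    parts = line.split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected 'docnum class,class,...': %r" % line)
    doc_part, label_part = parts
    class_ids = [int(tok) for tok in label_part.split(",") if tok.strip()]
    return int(doc_part), class_ids


def parse_training(text):
    """Parse the whole file content into {doc_id: [class_ids]}."""
    labels = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        doc_id, class_ids = parse_training_line(line)
        labels[doc_id] = class_ids
    return labels


sample = "0 13,14,5\n1 3,27\n"
labels = parse_training(sample)
print(labels)  # {0: [13, 14, 5], 1: [3, 27]}
```

For a real file, read its contents first, for example with open("hiercat.training").read(), and pass the resulting string to parse_training.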

    Run hiercat

    Now you should be ready to run. Hiercat has the following usage:
    hiercat [-d dir] [-hvtr] [-t file] [-c "session comment"] [-o file]
    The command line options have the following effects:

    Once a run is complete, the output file (named "hiercat.output" by default) will be placed in the data directory. That file will include, in the "testing-guesses" tag, categorization output of the form:

    0:0=0.00985549|1=0.00857243|2=0.0335464
    1:0=0.00601823|1=0.00954432|2=0.0193113
    2:0=0.00325507|1=0.000626959|2=0.00729339
    3:0=0.00694253|1=0.0100118|2=0.026098
    4:0=0.00974872|1=0.010095|2=0.0271683
    5:0=0.00785829|1=0.0101077|2=0.0151373
    6:0=0.00965198|1=0.00563102|2=0.0239918
    7:0=0.00994895|1=0.00455565|2=0.0207622
    8:0=0.00758394|1=0.016238|2=0.027036
    

    Each line has the format:
    testing docnum:classnum_0=class_0 probability|classnum_1=class_1 probability|classnum_2=class_2 probability ...

    The class number with the highest associated probability is the most likely class for that document, the next highest is the second most likely, and so on. Note that the entries on each line are sorted by class number, not by probability.
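
Since the entries are not sorted by probability, a small script is handy for ranking them. The Python sketch below parses one output line in the format shown above and orders the classes from most to least likely. The function names parse_guess_line and rank_classes are hypothetical, and converting class numbers back to class names via classes.dat is left out because that file's layout is not described here.

```python
def parse_guess_line(line):
    """Parse '0:0=0.0098|1=0.0085|...' into (doc_id, {class_id: probability})."""
    doc_part, scores_part = line.strip().split(":", 1)
    scores = {}
    for field in scores_part.split("|"):
        class_id, prob = field.split("=", 1)
        scores[int(class_id)] = float(prob)
    return int(doc_part), scores


def rank_classes(scores):
    """Return (class_id, probability) pairs, highest probability first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


doc_id, scores = parse_guess_line("0:0=0.00985549|1=0.00857243|2=0.0335464")
ranking = rank_classes(scores)
print(doc_id, ranking[0])  # 0 (2, 0.0335464)
```

For the first example line, class 2 comes out on top, followed by class 0 and then class 1.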