CURSED: training_em.c training_em_spmat.c
-T_av(r) being calculated wrong (namely, the denominator summations)

********************************************************************************

Thu Feb 5 11:40:29 EST 2004
-when the signature files get built, the training file needs to be built also (hiercat.training)---to map between #.sig and the correct training entry.

Thu Feb 12 22:07:42 EST 2004
-I'm not sure that P_iGa is getting built correctly (need to test the values)
+I'm allocating the parameters memory wrong (except for P_a and P_iGa)

Wed Feb 18 21:18:37 EST 2004
+P(a) is getting calculated wrong because step1.pl does not calculate the correct value of L!!! (ie, L is wrong when examples can have more than one associated label!!)
++correction, step1.pl is returning LARGEST_CLASS (correctly), AND LARGEST_CLASS != L except when labels/segment = 1

Thu Feb 19 11:42:26 EST 2004
+test P(i|alpha) with small dataset
++\sum_{i,alpha} P(i|alpha) = number of distinct classes
-why does perl print \n's in step1.pl on enkidu but not malach?

Tue Feb 24 17:44:03 EST 2004
+segfaulting somewhere in em.c
--assert memory allocation after every row for parameters (params.c)

Sat Feb 28 18:51:20 EST 2004
+I'm getting NaN's when trying to calculate the Frobenius norm in em.c
++fixed several small bugs; summing over wrong ranges, typos in variable alias (pointing to wrong memory)

Thu Mar 4 17:35:22 EST 2004
+no longer allocating a dummy tuple at the beginning of the tuple list; this seems to have fixed a problem with em (which was summing over the false tuple)

Sat Mar 6 15:25:00 EST 2004
-data3 test
-need to only sum over word topics which are ancestors of a class
--this will require an agenda function?
---temporarily being addressed using enumerated zero'ing. eventually a robust routine will sit in enforceHierarchy()

Mon Mar 8 22:54:18 EST 2004
+I think I'm calculating P^{(t+1)}(j|\nu) incorrectly
++DOH! tmp vars weren't getting zeroed!! YIPPEE!!
:)

Tue Mar 9 16:50:33 EST 2004
+fixed tmp var zeroing problem in training_em
+made file naming saner
-free unused memory after training_em (backup of params)
-P_dGa is not converging properly in testing_em.c
--I know this (I think) because the sum of P(d|a)'s is not #alpha

Wed Mar 10 23:44:13 EST 2004
-params->S is dumb wrong; S does not run over the total number of co-occurrences in all testing data, but only the number of co-occurrences in a particular testing document. doh!
--this doesn't seem to explain why getTesting_EM isn't working yet...
---could it be stuck in a local maximum? (test this by making a better initial guess, and also after we implement scheduled annealing)

Tue Mar 16 00:27:27 EST 2004
+added getP_aGd_bayes, which uses Bayes' rule to calculate P(alpha|d) without an iterative EM algorithm---it works!! :)

Tue Mar 16 13:03:38 EST 2004
+function: scheduleAnnealing()
-add annealing (powers of beta) into EM routines
-add command line functionality
-might have to do some clever memory management for large test sets?
-enforceHierarchy needs to be generalized for at least a naive bayes run
--we need a clever way to map a more complicated hierarchy into enforceHierarchy

Thu Mar 18 13:23:32 EST 2004
+implemented scheduleAnnealing(), although it's not being used yet
-beginning to run 20news tests
+step1.pl modified to allow continuation of building sig mappings (because there are too many files to map all at once)
+a data directory for .sig files ought to be specifiable at runtime
+fixed scheduling bug (forgot to prototype!! doh!)
+began implementing a version of training_em that calculates all the values of T_avr first and stores them in sparse matrices (should speed up calculations a ton)
+migrated from REAL to HIER_REAL to avoid namespace collision with meschach stuff
+command line args (datadir, help, verbose)
+args.c added
-getting sig-Killed in training_em_spmat.c's buildSparseTs(): bad memory alloc?
+fixed stupid bug in buildSparseTs: when will I ever stop forgetting to zero out vars before using them!??!? doh!!!
+training_em_spmat now produces same output as training_em (after a small fix)

Fri Mar 19 19:59:12 EST 2004
-build (poly?)hierarchy data structure (to be used in enforceHierarchy)
-could we improve memory usage/performance by storing T_avr (in training_em_spmat2.c) as a 2D matrix of size (LARGEST_CLASS x WORD_TOPICS) x L (or something similar)?
-should we be training on much smaller sets?

Mon Mar 22 00:09:16 EST 2004
+hierarchies can now be defined and parsed in/from xml!! :) (parseHierarchy.pl)
+removed now unnecessary script (old step2.pl) to process training file to determine number of classes
-change name from LARGEST_CLASS to WORD_CLASSES (fits with WORD_TOPICS)
+added parsing of hierarchy.dat files in enforceHierarchy()
+hierarchy dat files may be specified at the command line with -t
-defines need to be determined at runtime by hiercat (instead of by scripts to runparams.h)
-why is /* STEP FORWARD P^{(t+1)}(j|\nu) */ so slow in training_em_spmat?

Mon Mar 22 14:56:34 EST 2004
-step1.pl now updates rather than clobbers runparams.h
--it's not producing the correct number of UNIQUE_WORDS or TRAINING_DOCS
+changed wordmangling in step1.pl (now remove all \W)
+added parameter saving/loading for after training. We now always dump params to disk when training is complete, and the -r flag will restore those params and begin in the testing phase

Tue Mar 23 02:01:09 EST 2004
+DUMB DUMB DUMB!! stupid bug fixed in summation in denominator of T_av(r) (Gausser eq 6)
--bug still in training_em and training_em_spmat (see CURSED)

Tue Mar 23 14:04:09 EST 2004
-adding categorize function

Wed Mar 24 15:04:06 EST 2004
+no longer using runparams.h system!!
:)
++these values are now determined in setrunparams.c, by setRunParams()
+edited pre-processing perl scripts to remove kludges for runparams system

Thu Mar 25 02:05:18 EST 2004
-setting up testing tuple list storage as a list of lists; something is horked.

Fri Mar 26 01:27:20 EST 2004
-hmph. it appears that lists (ie, test_lists[1...testing_docs]) are being set up but then getting overwritten...?

Fri Mar 26 15:21:51 EST 2004
+changed while to do...while in testing_em.c

Sat Mar 27 01:59:34 EST 2004
+searching for memory bug in data_tuple list, in data_tuple.c; using electric fence without much luck
++see Sat Mar 27 20:19:29 EST 2004 fix
-trying to fclose fp while building test lists returns error (something about wrong ioctl for device?)

Sat Mar 27 20:19:29 EST 2004
+valgrind, I love you! unsigned long * params->S was getting allocated with training_docs rather than testing_docs elements!!

Sun Mar 28 00:50:55 EST 2004
+all memory leaks are gone and all memory is properly free'd at program completion (yay valgrind!)
-standardize error reporting system
+can now specify file to dump categorization results to
+lots of clean up (things are looking pretty spiffy at this point)
-need to add license, license info to hello, and place code online

Sun Mar 28 20:50:31 EST 2004
-getting error (in matrixio.c from restoreParams()) when trying to load params (is this just on a large run?)
+hiercat.output now dumps in data directory (if specified) unless specified explicitly
+rewriting dumpParams and restoreParams to not use meschach calls

Mon Mar 29 23:36:39 EST 2004
+improving output code (now to an .xml formatted file, including information besides just the categorization guesses)
++this will be very helpful for running lots of experiments
--small formatting issues
-compress output file?

Fri Apr 9 01:15:18 EDT 2004
+first pass at parallelizing the code with MPI

Sat Apr 10 02:22:39 EDT 2004
+a sloppy parallelized version is now working!
:)
+some of the communications are horked (ie, categorization is wrong for np>1)

Tue Apr 13 12:41:03 EDT 2004
+sums are correct over parameter matrices

Thu Apr 15 12:48:01 EDT 2004
-compiling without -DBINARY_SIGS will now allow sig files to be in ascii text

Fri Apr 16 03:12:56 EDT 2004
+implementing from scratch sparse matrix routines.... they are either currently broken or just extremely slow.... hmm.

Fri Apr 16 19:31:44 EDT 2004
+implemented non-sparse implementation (just storing tons of zeros; who'd have thunk it!). Oh, were memory without end!

Tue Apr 20 12:59:45 EDT 2004
+as compile time option, rows of sparse matrices can now be stored as either linked lists or avl trees (see sparse.c)
+added getSpmatRowSum for both linked list and avl tree type spmats
+timing is indicating the big expense is in updating P_jGv...trying to speed that up (updating P_jGv averaging about 11-12 seconds)
++also building T_v_ra (for efficient access in P_jGv) brought this time down to 3 seconds.

Wed Apr 21 02:25:55 EDT 2004
-some nan's are being introduced in testing_em routine, getC
-added non-recursive avl tree summing code, but it seems to actually do worse than the recursive version

Sat May 1 00:52:40 EDT 2004
+standardized all output to ensure that output buffers are flushed and messages are also logged

Mon May 3 02:18:49 EDT 2004
+fixed logging with mpi

Tue May 4 13:47:01 EDT 2004
+norm is now adjusted to compensate for length (FIXME: is it being done properly now?)

Thu May 6 18:23:09 EDT 2004
+added accumulators (rather than treesums) in training_em; it seems to be much faster! :)
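
A rough sketch of the accumulator idea from the entries above, assuming a linked-list row layout. The names SpNode, SpRow and the sprow_* functions are made up for illustration and are not the ones in sparse.c or training_em. The only point is that keeping a running sum next to the row makes the row sum O(1), where walking the stored entries costs O(nonzeros) on every call. It also zeroes every sum before use and asserts each allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* One nonzero entry of a sparse row (hypothetical layout, not sparse.c's). */
typedef struct SpNode {
    unsigned long col;
    double val;
    struct SpNode *next;
} SpNode;

/* A sparse row as a linked list, plus a running sum (the accumulator). */
typedef struct {
    SpNode *head;
    double acc;
} SpRow;

/* Zero everything before use. */
void sprow_init(SpRow *row)
{
    row->head = NULL;
    row->acc = 0.0;
}

/* Add val to entry col, creating the node if needed; acc is kept in step. */
void sprow_add(SpRow *row, unsigned long col, double val)
{
    SpNode *n;

    row->acc += val;
    for (n = row->head; n != NULL; n = n->next) {
        if (n->col == col) {
            n->val += val;
            return;
        }
    }
    n = malloc(sizeof *n);
    assert(n != NULL);          /* assert every allocation */
    n->col = col;
    n->val = val;
    n->next = row->head;
    row->head = n;
}

/* Row sum by walking every stored entry: O(nonzeros) per call. */
double sprow_sum_walk(const SpRow *row)
{
    double s = 0.0;             /* zeroed before accumulating */
    const SpNode *n;

    for (n = row->head; n != NULL; n = n->next)
        s += n->val;
    return s;
}

/* Row sum read from the accumulator: O(1) per call. */
double sprow_sum_cached(const SpRow *row)
{
    return row->acc;
}

/* Free all nodes and reset the row to the empty, zeroed state. */
void sprow_free(SpRow *row)
{
    SpNode *n = row->head;

    while (n != NULL) {
        SpNode *next = n->next;
        free(n);
        n = next;
    }
    sprow_init(row);
}
```

The two sums agree as long as every update goes through sprow_add; anything that writes to a node directly would leave acc stale, which is the price of the O(1) read.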