Class TopicModelTrainer

  • All Implemented Interfaces:
    org.apache.uima.analysis_component.AnalysisComponent

    public class TopicModelTrainer
    extends uk.gov.dstl.baleen.uima.BaleenTask
    A task that creates a topic model for a set of documents stored in MongoDB.

    Uses Latent Dirichlet Allocation following Newman, Asuncion, Smyth and Welling, Distributed Algorithms for Topic Models, JMLR (2009), with the SparseLDA sampling scheme and data structure from Yao, Mimno and McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD (2009).
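
    To illustrate the kind of model being trained, the following is a minimal sketch of Latent Dirichlet Allocation using plain, single-threaded collapsed Gibbs sampling. It is not the distributed or SparseLDA scheme cited above, and it is not this class's implementation. The class name LdaGibbsSketch, the method train, and the hyperparameters alpha and beta are assumptions made for illustration; numTopics and iterations loosely correspond to the number-of-topics and number-of-iterations parameters documented below.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: dense collapsed Gibbs sampling for LDA (hypothetical class). */
public class LdaGibbsSketch {

  /**
   * Trains on documents given as arrays of word ids in [0, vocabSize)
   * and returns the topic-by-word count matrix.
   */
  public static int[][] train(int[][] docs, int vocabSize, int numTopics,
                              int iterations, double alpha, double beta, long seed) {
    Random random = new Random(seed);
    int[][] docTopic = new int[docs.length][numTopics];   // tokens per (document, topic)
    int[][] topicWord = new int[numTopics][vocabSize];    // tokens per (topic, word)
    int[] topicTotal = new int[numTopics];                // tokens per topic
    int[][] assignment = new int[docs.length][];          // topic of each token

    // Start from a uniformly random topic assignment for every token.
    for (int d = 0; d < docs.length; d++) {
      assignment[d] = new int[docs[d].length];
      for (int i = 0; i < docs[d].length; i++) {
        int t = random.nextInt(numTopics);
        assignment[d][i] = t;
        docTopic[d][t]++;
        topicWord[t][docs[d][i]]++;
        topicTotal[t]++;
      }
    }

    double[] weights = new double[numTopics];
    for (int iter = 0; iter < iterations; iter++) {
      for (int d = 0; d < docs.length; d++) {
        for (int i = 0; i < docs[d].length; i++) {
          int w = docs[d][i];
          int old = assignment[d][i];

          // Remove the token's current assignment from the counts.
          docTopic[d][old]--;
          topicWord[old][w]--;
          topicTotal[old]--;

          // Unnormalised conditional probability of each topic for this token.
          double sum = 0.0;
          for (int t = 0; t < numTopics; t++) {
            weights[t] = (docTopic[d][t] + alpha)
                * (topicWord[t][w] + beta) / (topicTotal[t] + vocabSize * beta);
            sum += weights[t];
          }

          // Draw a topic in proportion to the weights.
          double u = random.nextDouble() * sum;
          int chosen = 0;
          double acc = weights[0];
          while (acc < u && chosen < numTopics - 1) {
            chosen++;
            acc += weights[chosen];
          }

          // Record the new assignment.
          assignment[d][i] = chosen;
          docTopic[d][chosen]++;
          topicWord[chosen][w]++;
          topicTotal[chosen]++;
        }
      }
    }
    return topicWord;
  }

  public static void main(String[] args) {
    // Toy corpus: word ids 0-1 and 2-3 tend to co-occur.
    int[][] docs = {{0, 1, 0, 1}, {2, 3, 2, 3}, {0, 1, 2, 3}};
    int[][] topicWord = train(docs, 4, 2, 100, 0.1, 0.01, 42L);
    for (int[] row : topicWord) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

    The SparseLDA scheme cited above samples from the same conditional distribution but avoids the dense loop over all topics for each token, which matters when the number of topics is large.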

    • Method Summary

      Modifier and Type Method Description
      protected void execute(uk.gov.dstl.baleen.uima.JobSettings settings)
      void initialize(org.apache.uima.UimaContext context)
      • Methods inherited from class uk.gov.dstl.baleen.uima.BaleenTask

        createMonitor, destroy, doDestroy, doInitialize, getMonitor, process
      • Methods inherited from class org.apache.uima.analysis_component.JCasAnnotator_ImplBase

        getRequiredCasInterface, process
      • Methods inherited from class org.apache.uima.analysis_component.Annotator_ImplBase

        getCasInstancesRequired, hasNext, next
      • Methods inherited from class org.apache.uima.analysis_component.AnalysisComponent_ImplBase

        batchProcessComplete, collectionProcessComplete, getContext, getLogger, getResultSpecification, reconfigure, setResultSpecification
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • TOPIC_FIELD

        public static final java.lang.String TOPIC_FIELD
        Field in which the topic information is stored
        See Also:
        Constant Field Values
      • KEYWORDS_FIELD

        public static final java.lang.String KEYWORDS_FIELD
        Field in which the keyword information is stored
        See Also:
        Constant Field Values
      • TOPIC_NUMBER_FIELD

        public static final java.lang.String TOPIC_NUMBER_FIELD
        Field in which the topic number is stored
        See Also:
        Constant Field Values
      • KEY_STOPWORDS

        public static final java.lang.String KEY_STOPWORDS
        Key for the connection to the shared stopwords resource
        See Also:
        Constant Field Values
      • stopwordResource

        protected uk.gov.dstl.baleen.resources.SharedStopwordResource stopwordResource
      • PARAM_STOPLIST

        public static final java.lang.String PARAM_STOPLIST
        The stoplist to use. If the value matches one of the enum values provided in SharedStopwordResource.StopwordList, that list is loaded.

        Otherwise, the string is taken to be a file path and that file is used. The format of the file is expected to be one stopword per line.

        See Also:
        Constant Field Values
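
        The resolution rule described for the stoplist (built-in list name first, otherwise a file path with one stopword per line) can be sketched as follows. This is an illustration under stated assumptions, not the Baleen implementation: the enum BuiltInList is a hypothetical stand-in for SharedStopwordResource.StopwordList with made-up contents, the case-insensitive name match and the skipping of blank lines are assumptions, and the class and method names are invented for this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: resolves a stoplist parameter to a set of stopwords. */
public class StoplistSketch {

  /** Hypothetical stand-in for SharedStopwordResource.StopwordList. */
  enum BuiltInList {
    DEFAULT(Arrays.asList("a", "an", "the"));

    final List<String> words;

    BuiltInList(List<String> words) {
      this.words = words;
    }
  }

  /** Returns the built-in list named by the value, or reads the value as a file path. */
  public static Set<String> resolve(String stoplist) throws IOException {
    // First, try to match the value against the built-in list names
    // (case-insensitive matching is an assumption of this sketch).
    for (BuiltInList list : BuiltInList.values()) {
      if (list.name().equalsIgnoreCase(stoplist)) {
        return new LinkedHashSet<>(list.words);
      }
    }

    // Otherwise, treat the value as a file path: one stopword per line.
    Set<String> words = new LinkedHashSet<>();
    for (String line : Files.readAllLines(Paths.get(stoplist), StandardCharsets.UTF_8)) {
      String word = line.trim();
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }

  public static void main(String[] args) throws IOException {
    // A built-in list selected by name.
    System.out.println(resolve("DEFAULT"));

    // A custom list read from a temporary file.
    Path file = Files.createTempFile("stoplist", ".txt");
    Files.write(file, Arrays.asList("foo", "bar"), StandardCharsets.UTF_8);
    System.out.println(resolve(file.toString()));
    Files.delete(file);
  }
}
```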
      • stoplist

        protected java.lang.String stoplist
      • PARAM_DOCUMENT_COLLECTION

        public static final java.lang.String PARAM_DOCUMENT_COLLECTION
        The name of the Mongo collection to read from and write to
        See Also:
        Constant Field Values
      • PARAM_CONTENT_FIELD

        public static final java.lang.String PARAM_CONTENT_FIELD
        The name of the field in the Mongo document that stores the content
        See Also:
        Constant Field Values
      • PARAM_NUMBER_OF_TOPICS

        public static final java.lang.String PARAM_NUMBER_OF_TOPICS
        Number of topics in the model
        See Also:
        Constant Field Values
      • PARAM_NUMBER_OF_ITERATIONS

        public static final java.lang.String PARAM_NUMBER_OF_ITERATIONS
        Number of training iterations to run
        See Also:
        Constant Field Values
      • PARAM_NUMBER_OF_THREADS

        public static final java.lang.String PARAM_NUMBER_OF_THREADS
        Number of threads to use for training
        See Also:
        Constant Field Values
      • PARAM_MODEL_FILE

        public static final java.lang.String PARAM_MODEL_FILE
        Path of the file to which the trained model is written
        See Also:
        Constant Field Values
    • Constructor Detail

      • TopicModelTrainer

        public TopicModelTrainer()
    • Method Detail

      • initialize

        public void initialize(org.apache.uima.UimaContext context)
                        throws org.apache.uima.resource.ResourceInitializationException
        Specified by:
        initialize in interface org.apache.uima.analysis_component.AnalysisComponent
        Overrides:
        initialize in class uk.gov.dstl.baleen.uima.BaleenTask
        Throws:
        org.apache.uima.resource.ResourceInitializationException
      • execute

        protected void execute(uk.gov.dstl.baleen.uima.JobSettings settings)
                        throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
        Specified by:
        execute in class uk.gov.dstl.baleen.uima.BaleenTask
        Throws:
        org.apache.uima.analysis_engine.AnalysisEngineProcessException