Package uk.gov.dstl.baleen.jobs.triage
Class TopicModelTrainer
java.lang.Object
  org.apache.uima.analysis_component.AnalysisComponent_ImplBase
    org.apache.uima.analysis_component.Annotator_ImplBase
      org.apache.uima.analysis_component.JCasAnnotator_ImplBase
        org.apache.uima.fit.component.JCasAnnotator_ImplBase
          uk.gov.dstl.baleen.uima.BaleenTask
            uk.gov.dstl.baleen.jobs.triage.TopicModelTrainer
- All Implemented Interfaces:
org.apache.uima.analysis_component.AnalysisComponent
public class TopicModelTrainer extends uk.gov.dstl.baleen.uima.BaleenTask
A task to create a topic model for a set of documents (stored in Mongo). Uses Latent Dirichlet Allocation following Newman, Asuncion, Smyth and Welling, "Distributed Algorithms for Topic Models", JMLR (2009), with the SparseLDA sampling scheme and data structure from Yao, Mimno and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections", KDD (2009).
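For orientation, the sampling distribution that the cited SparseLDA paper works with can be written as below. The notation is the conventional one for collapsed Gibbs sampling in LDA and is an assumption here, not something taken from this class: n_{t|d} counts tokens in document d assigned to topic t, n_{w|t} counts tokens of word type w assigned to topic t, n_t is the total for topic t, V is the vocabulary size, and all counts exclude the token being resampled.

```latex
% Collapsed Gibbs conditional for the topic z of one token of word type w in document d
P(z = t \mid w, d) \;\propto\; (\alpha_t + n_{t|d}) \, \frac{\beta + n_{w|t}}{\beta V + n_t}

% SparseLDA splits the normalising mass into three buckets, s + r + q
\sum_t (\alpha_t + n_{t|d}) \, \frac{\beta + n_{w|t}}{\beta V + n_t}
  = \underbrace{\sum_t \frac{\alpha_t \, \beta}{\beta V + n_t}}_{s}
  + \underbrace{\sum_t \frac{n_{t|d} \, \beta}{\beta V + n_t}}_{r}
  + \underbrace{\sum_t \frac{(\alpha_t + n_{t|d}) \, n_{w|t}}{\beta V + n_t}}_{q}
```

The r term is non-zero only for topics present in the document and the q term only for topics in which the word occurs, which is what makes the sparse data structure worthwhile.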
-
Field Summary
Fields
Modifier and Type | Field | Description
static java.lang.String | KEY_STOPWORDS | Connection to Stopwords Resource
static java.lang.String | KEYWORDS_FIELD | Field in which the keyword information is stored
static java.lang.String | PARAM_CONTENT_FIELD | The name of the field in the Mongo document storing the content
static java.lang.String | PARAM_DOCUMENT_COLLECTION | The name of the Mongo collection to read from and write to
static java.lang.String | PARAM_MODEL_FILE | Output model file path
static java.lang.String | PARAM_NUMBER_OF_ITERATIONS | Number of iterations
static java.lang.String | PARAM_NUMBER_OF_THREADS | Number of threads
static java.lang.String | PARAM_NUMBER_OF_TOPICS | Number of topics
static java.lang.String | PARAM_STOPLIST | The stoplist to use.
protected java.lang.String | stoplist |
protected uk.gov.dstl.baleen.resources.SharedStopwordResource | stopwordResource |
static java.lang.String | TOPIC_FIELD | Field in which the topic information is stored
static java.lang.String | TOPIC_NUMBER_FIELD | Field in which the topic number is stored
-
Constructor Summary
Constructors
Constructor | Description
TopicModelTrainer() |
-
Method Summary
Modifier and Type | Method
protected void | execute(uk.gov.dstl.baleen.uima.JobSettings settings)
void | initialize(org.apache.uima.UimaContext context)
-
Methods inherited from class uk.gov.dstl.baleen.uima.BaleenTask
createMonitor, destroy, doDestroy, doInitialize, getMonitor, process
-
Methods inherited from class org.apache.uima.analysis_component.JCasAnnotator_ImplBase
getRequiredCasInterface, process
-
Methods inherited from class org.apache.uima.analysis_component.Annotator_ImplBase
getCasInstancesRequired, hasNext, next
-
Field Detail
-
TOPIC_FIELD
public static final java.lang.String TOPIC_FIELD
Field in which the topic information is stored
- See Also:
- Constant Field Values
-
KEYWORDS_FIELD
public static final java.lang.String KEYWORDS_FIELD
Field in which the keyword information is stored
- See Also:
- Constant Field Values
-
TOPIC_NUMBER_FIELD
public static final java.lang.String TOPIC_NUMBER_FIELD
Field in which the topic number is stored
- See Also:
- Constant Field Values
-
KEY_STOPWORDS
public static final java.lang.String KEY_STOPWORDS
Connection to Stopwords Resource
- See Also:
- Constant Field Values
-
stopwordResource
protected uk.gov.dstl.baleen.resources.SharedStopwordResource stopwordResource
-
PARAM_STOPLIST
public static final java.lang.String PARAM_STOPLIST
The stoplist to use. If the stoplist matches one of the enum constants provided in SharedStopwordResource.StopwordList, then that list will be loaded. Otherwise, the string is taken to be a file path and that file is used. The file is expected to contain one stopword per line.
- See Also:
- Constant Field Values
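The resolution rule described for PARAM_STOPLIST can be sketched as follows. This is a minimal illustration, not the Baleen implementation: the class StoplistResolutionSketch, the enum BuiltInList and its EXAMPLE constant are hypothetical stand-ins for SharedStopwordResource.StopwordList, and the case-insensitive name match, the trimming of lines and the skipping of blank lines are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class StoplistResolutionSketch {

  /** Hypothetical stand-in for SharedStopwordResource.StopwordList; the constant and its words are invented. */
  enum BuiltInList {
    EXAMPLE("a", "an", "the");

    final Set<String> words;

    BuiltInList(String... words) {
      this.words = new LinkedHashSet<>(Arrays.asList(words));
    }
  }

  /** Returns a built-in list if the value names one, otherwise reads the value as a file path. */
  static Set<String> resolve(String stoplist) throws IOException {
    // Step 1: try to match the value against the names of the built-in lists
    // (case-insensitive matching is an assumption of this sketch).
    for (BuiltInList list : BuiltInList.values()) {
      if (list.name().equalsIgnoreCase(stoplist)) {
        return list.words;
      }
    }
    // Step 2: otherwise treat the value as a path to a file with one stopword per line.
    Set<String> words = new LinkedHashSet<>();
    for (String line : Files.readAllLines(Paths.get(stoplist), StandardCharsets.UTF_8)) {
      String word = line.trim();
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }

  public static void main(String[] args) throws IOException {
    // Resolves the hypothetical built-in list by name.
    System.out.println(resolve("example"));
  }
}
```

One consequence of this ordering is that a file whose path happens to equal a built-in list name would never be read, so custom stoplist files are best given distinct names.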
-
stoplist
protected java.lang.String stoplist
-
PARAM_DOCUMENT_COLLECTION
public static final java.lang.String PARAM_DOCUMENT_COLLECTION
The name of the Mongo collection to read from and write to
- See Also:
- Constant Field Values
-
PARAM_CONTENT_FIELD
public static final java.lang.String PARAM_CONTENT_FIELD
The name of the field in the Mongo document storing the content
- See Also:
- Constant Field Values
-
PARAM_NUMBER_OF_TOPICS
public static final java.lang.String PARAM_NUMBER_OF_TOPICS
Number of topics
- See Also:
- Constant Field Values
-
PARAM_NUMBER_OF_ITERATIONS
public static final java.lang.String PARAM_NUMBER_OF_ITERATIONS
Number of iterations
- See Also:
- Constant Field Values
-
PARAM_NUMBER_OF_THREADS
public static final java.lang.String PARAM_NUMBER_OF_THREADS
Number of threads
- See Also:
- Constant Field Values
-
PARAM_MODEL_FILE
public static final java.lang.String PARAM_MODEL_FILE
Output model file path
- See Also:
- Constant Field Values
-
Method Detail
-
initialize
public void initialize(org.apache.uima.UimaContext context) throws org.apache.uima.resource.ResourceInitializationException
- Specified by:
initialize
in interface org.apache.uima.analysis_component.AnalysisComponent
- Overrides:
initialize
in class uk.gov.dstl.baleen.uima.BaleenTask
- Throws:
org.apache.uima.resource.ResourceInitializationException
-
execute
protected void execute(uk.gov.dstl.baleen.uima.JobSettings settings) throws org.apache.uima.analysis_engine.AnalysisEngineProcessException
- Specified by:
execute
in class uk.gov.dstl.baleen.uima.BaleenTask
- Throws:
org.apache.uima.analysis_engine.AnalysisEngineProcessException