models.coherencemodel – Topic coherence pipeline
Module for calculating topic coherence in Python. This is an implementation of the four-stage topic coherence pipeline from the paper [1]. The four stages are:
Segmentation -> Probability Estimation -> Confirmation Measure -> Aggregation.
Implementing the pipeline this way lets the user, in essence, "make" a coherence measure of their choice by choosing a method for each stage of the pipeline.
[1] Michael Roeder, Andreas Both and Alexander Hinneburg. Exploring the space of topic coherence measures. http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
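The four stages can be sketched in plain Python to make the data flow concrete. This is a minimal illustration, not this module's implementation: the toy documents, the helper names (`segment`, `estimate_probabilities`, `confirm`, `aggregate`), the one-preceding segmentation, the document-count statistics, and the smoothing constant `eps` are all assumptions chosen for the example.

```python
import math
from itertools import combinations

# Toy corpus: each document is a list of tokens (illustrative data only).
docs = [
    ["human", "computer", "interface"],
    ["computer", "system", "interface"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
    ["human", "system"],
]
topics = [["computer", "interface", "system"], ["graph", "trees", "minors"]]


def segment(topic):
    """Stage 1: pair every word with each word that precedes it in the topic."""
    return [(topic[i], topic[j]) for i in range(1, len(topic)) for j in range(i)]


def estimate_probabilities(docs, words):
    """Stage 2: count the documents containing each word and each word pair."""
    doc_sets = [set(d) for d in docs]
    single = {w: sum(w in d for d in doc_sets) for w in words}
    pair = {
        frozenset(p): sum(p[0] in d and p[1] in d for d in doc_sets)
        for p in combinations(sorted(words), 2)
    }
    return single, pair


def confirm(segments, single, pair, eps=1.0):
    """Stage 3: smoothed log conditional probability, averaged over the pairs."""
    scores = [
        math.log((pair[frozenset((w, w_prev))] + eps) / single[w_prev])
        for w, w_prev in segments
    ]
    return sum(scores) / len(scores)


def aggregate(topic_scores):
    """Stage 4: arithmetic mean over all topics."""
    return sum(topic_scores) / len(topic_scores)


vocab = {w for topic in topics for w in topic}
single, pair = estimate_probabilities(docs, vocab)
per_topic = [confirm(segment(t), single, pair) for t in topics]
coherence = aggregate(per_topic)
```

Swapping any one of the four functions for another method yields a different coherence measure, which is the point of keeping the stages separate.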
gensim.models.coherencemodel.CoherenceModel(model=None, topics=None, texts=None, corpus=None, dictionary=None, window_size=None, keyed_vectors=None, coherence='c_v', topn=20, processes=-1)
Bases: gensim.interfaces.TransformationABC
Objects of this class allow for building and maintaining a model for topic coherence.
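The main method is get_coherence(), which returns the topic coherence.
Pipeline phases can also be executed individually. The methods for doing this are segment_topics(), estimate_probabilities(), get_coherence_per_topic() and aggregate_measures(topic_coherences).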
One way of using this feature is through providing a trained topic model. A dictionary has to be explicitly provided if the model does not contain a dictionary already:
from gensim.models.coherencemodel import CoherenceModel
cm = CoherenceModel(model=tm, corpus=corpus, coherence='u_mass')  # tm is the trained topic model
cm.get_coherence()
Another way of using this feature is through providing tokenized topics such as:
topics = [['human', 'computer', 'system', 'interface'],
          ['graph', 'minors', 'trees', 'eps']]
# Note that a dictionary has to be provided.
cm = CoherenceModel(topics=topics, corpus=corpus, dictionary=dictionary, coherence='u_mass')
cm.get_coherence()
Model persistence is achieved via the load/save methods.
aggregate_measures(topic_coherences)
Aggregate the individual topic coherence measures using the pipeline's aggregation function.
compare_model_topics(model_topics)
Perform the coherence evaluation for each of the models.
This first precomputes the probabilities once, then evaluates coherence for each model.
Because the probabilities are already precomputed, each evaluation only reads the statistics accumulated in the CoherenceModel, so it should be fast.
| Parameters: | model_topics (list) – One entry per model; each entry is that model's topics, given as lists of top-N words. |
|---|---|
| Return type: | list |
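The "precompute once, evaluate many" idea can be sketched as follows. This is an illustrative stand-in rather than the method's actual code: the toy documents, the `postings` index, the Jaccard-style `topic_score`, and the `(per_topic, average)` result shape are all assumptions made for the example.

```python
from itertools import combinations

# Toy corpus as sets of tokens (illustrative data only).
docs = [
    {"human", "computer", "interface"},
    {"computer", "system", "interface"},
    {"graph", "trees"},
    {"graph", "minors", "trees"},
]

# One entry per model; each entry is that model's topics as top-N word lists.
model_topics = [
    [["computer", "interface"], ["graph", "trees"]],  # hypothetical model A
    [["computer", "graph"], ["interface", "trees"]],  # hypothetical model B
]

# Step 1: accumulate statistics once, over the union of words from all models.
vocab = {w for topics in model_topics for topic in topics for w in topic}
postings = {w: {i for i, d in enumerate(docs) if w in d} for w in vocab}


def topic_score(topic):
    """Mean Jaccard overlap of the document sets of every word pair in a topic."""
    scores = [
        len(postings[a] & postings[b]) / len(postings[a] | postings[b])
        for a, b in combinations(topic, 2)
    ]
    return sum(scores) / len(scores)


# Step 2: evaluate each model using only the cached statistics.
results = []
for topics in model_topics:
    per_topic = [topic_score(t) for t in topics]
    results.append((per_topic, sum(per_topic) / len(per_topic)))
```

The corpus is scanned only while building `postings`; scoring an additional model afterwards costs a few set operations per word pair.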
compare_models(models)
estimate_probabilities(segmented_topics=None)
Accumulate word occurrences and co-occurrences from texts or corpus using the optimal method for the chosen coherence metric. This operation may take quite some time for the sliding-window-based coherence methods.
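To see why the sliding-window methods cost more than document-level counting, here is a rough sketch of window-based accumulation. It is an assumption-laden illustration, not the library's accumulator: `window_counts`, its arguments, and the choice to treat a text shorter than the window as a single window are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def window_counts(texts, window_size, relevant_words):
    """Count in how many sliding windows each word and each word pair occurs."""
    word_counts, pair_counts, num_windows = Counter(), Counter(), 0
    for tokens in texts:
        # Assumption: a text shorter than the window forms one window.
        last_start = max(len(tokens) - window_size, 0)
        for start in range(last_start + 1):
            window = set(tokens[start:start + window_size]) & relevant_words
            num_windows += 1
            word_counts.update(window)
            pair_counts.update(
                frozenset(p) for p in combinations(sorted(window), 2)
            )
    return word_counts, pair_counts, num_windows


texts = [["human", "computer", "interface", "system"], ["graph", "trees"]]
relevant = {"computer", "system", "graph", "trees"}
word_counts, pair_counts, num_windows = window_counts(texts, 3, relevant)
```

Every token position starts a new window, so the work grows with the total number of tokens times the window size, whereas document-level counting touches each document once.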
for_models(models, dictionary, topn=20, **kwargs)
Initialize a CoherenceModel with estimated probabilities for all of the given models.
| Parameters: | models (list) – List of models to evaluate the coherence of; the only requirement is that each has a get_topics method. |
|---|
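A model's topics have to be turned into top-N word lists before coherence can be computed. The sketch below shows one way that conversion could look, assuming that get_topics yields one row of term weights per topic, indexed by dictionary id. The helper `top_words`, the `id2word` mapping, and the weights are hypothetical values for illustration.

```python
def top_words(topic_weights, id2word, topn=20):
    """Return the topn highest-weighted words of one topic, best first."""
    ranked = sorted(
        range(len(topic_weights)),
        key=lambda term_id: topic_weights[term_id],
        reverse=True,
    )
    return [id2word[term_id] for term_id in ranked[:topn]]


# Hypothetical dictionary and topic-term weights (one row per topic).
id2word = {0: "human", 1: "computer", 2: "graph", 3: "trees"}
topic_term = [
    [0.40, 0.45, 0.10, 0.05],
    [0.05, 0.10, 0.50, 0.35],
]

topics = [top_words(row, id2word, topn=2) for row in topic_term]
```

Lists built this way for several models have the nested shape that for_topics describes: one list of topics per model, each topic a list of words.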
for_topics(topics_as_topn_terms, **kwargs)
Initialize a CoherenceModel with estimated probabilities for all of the given topics.
| Parameters: | topics_as_topn_terms (list of lists) – Each element in the top-level list should be the list of topics for a model. The topics for the model should be a list of top-N words, one per topic. |
|---|
get_coherence()
Return coherence value based on pipeline parameters.
get_coherence_per_topic(segmented_topics=None, with_std=False, with_support=False)
Return list of coherence values for each topic based on pipeline parameters.
load(fname, mmap=None)
Load a previously saved object from file (also see save).
If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.
If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.
measure
model
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)
Save the object to file (also see load).
fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.
You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.
ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.
pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
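The effect of ignore and pickle_protocol can be imitated with the standard pickle module. This sketch only mirrors the behaviour described above and is not the library's save/load code: the `Stats` class and the `save` and `load` helpers are hypothetical, and separate array storage is left out entirely.

```python
import io
import pickle


class Stats:
    """Hypothetical object with a cache that should not be serialized."""

    def __init__(self):
        self.topn = 20
        self.cache = {"big": "recomputable"}


def save(obj, handle, ignore=frozenset(), pickle_protocol=2):
    """Pickle the attributes of obj, replacing ignored ones with None."""
    state = {k: (None if k in ignore else v) for k, v in vars(obj).items()}
    pickle.dump(state, handle, protocol=pickle_protocol)


def load(cls, handle):
    """Rebuild an instance of cls from a pickled attribute dictionary."""
    obj = cls.__new__(cls)
    obj.__dict__.update(pickle.load(handle))
    return obj


buffer = io.BytesIO()  # an open file-like object, as save() also accepts
save(Stats(), buffer, ignore=frozenset(["cache"]))
buffer.seek(0)
restored = load(Stats, buffer)
```

After the round trip, `restored.topn` keeps its value while `restored.cache` is None, matching the description of ignore.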
segment_topics()
top_topics_as_word_lists(model, dictionary, topn=20)
topics
topn