topic_coherence.text_analysis – Analyzing the texts of a corpus to accumulate statistical information about word occurrences
This module contains classes for analyzing the texts of a corpus to accumulate statistical information about word occurrences.
gensim.topic_coherence.text_analysis.AccumulatingWorker(input_q, output_q, accumulator, window_size)
Bases: multiprocessing.process.Process
Accumulate stats from texts fed in from queue.
authkey
daemon
Return whether process is a daemon
exitcode
Return exit code of process or None if it has yet to stop
ident
Return identifier (PID) of process or None if it has yet to start
is_alive()
Return whether process is alive
join(timeout=None)
Wait until child process terminates
name
pid
Return identifier (PID) of process or None if it has yet to start
reply_to_master()
run()
start()
Start child process
terminate()
Terminate process; sends SIGTERM signal or uses TerminateProcess()
gensim.topic_coherence.text_analysis.BaseAnalyzer(relevant_ids)
Bases: object
Base class for corpus and text analyzers.
analyze_text(text, doc_num=None)
get_co_occurrences(word_id1, word_id2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word_id)
Return number of docs the word occurs in, once accumulate has been called.
num_docs
gensim.topic_coherence.text_analysis.CorpusAccumulator(*args)
Bases: gensim.topic_coherence.text_analysis.InvertedIndexBased
Gather word occurrence stats from a corpus by iterating over its BoW representation.
accumulate(corpus)
analyze_text(text, doc_num=None)
get_co_occurrences(word_id1, word_id2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word_id)
Return number of docs the word occurs in, once accumulate has been called.
index_to_dict()
num_docs
gensim.topic_coherence.text_analysis.InvertedIndexAccumulator(relevant_ids, dictionary)
Bases: gensim.topic_coherence.text_analysis.WindowedTextsAnalyzer, gensim.topic_coherence.text_analysis.InvertedIndexBased
Build an inverted index from a sequence of corpus texts.
accumulate(texts, window_size)
analyze_text(window, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
index_to_dict()
num_docs
text_is_relevant(text)
Return True if the text has any relevant words, else False.
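As a rough illustration of the inverted-index idea (not gensim's actual implementation), the sketch below treats every sliding window as a virtual document, maps each relevant word to the set of virtual-document numbers it appears in, and answers occurrence and co-occurrence queries with set sizes and intersections. The helpers `sliding_windows` and `build_inverted_index`, and the choice to yield a short text whole, are assumptions made for this example.

```python
from collections import defaultdict


def sliding_windows(text, window_size):
    """Yield consecutive windows; a text shorter than the window is yielded whole."""
    if len(text) <= window_size:
        yield text
        return
    for start in range(len(text) - window_size + 1):
        yield text[start:start + window_size]


def build_inverted_index(texts, window_size, relevant_words):
    """Map each relevant word to the set of virtual documents containing it."""
    index = defaultdict(set)
    num_docs = 0
    for text in texts:
        for window in sliding_windows(text, window_size):
            for word in window:
                if word in relevant_words:
                    index[word].add(num_docs)
            num_docs += 1  # every window counts as one virtual document
    return index, num_docs


texts = [["human", "computer", "interface"], ["graph", "trees", "human", "graph"]]
index, num_docs = build_inverted_index(texts, 2, {"human", "graph", "computer"})

# Occurrences are set sizes; co-occurrences are intersection sizes.
occurrences = len(index["human"])
co_occurrences = len(index["human"] & index["graph"])
```

Storing document numbers rather than counts keeps co-occurrence queries cheap for any pair of words, at the cost of memory that grows with the number of windows.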
gensim.topic_coherence.text_analysis.InvertedIndexBased(*args)
Bases: gensim.topic_coherence.text_analysis.BaseAnalyzer
Analyzer that builds up an inverted index to accumulate stats.
analyze_text(text, doc_num=None)
get_co_occurrences(word_id1, word_id2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word_id)
Return number of docs the word occurs in, once accumulate has been called.
index_to_dict()
num_docs
gensim.topic_coherence.text_analysis.ParallelWordOccurrenceAccumulator(processes, *args, **kwargs)
Bases: gensim.topic_coherence.text_analysis.WindowedTextsAnalyzer
Accumulate word occurrences in parallel.
accumulate(texts, window_size)
analyze_text(text, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
merge_accumulators(accumulators)
Merge the list of accumulators into a single WordOccurrenceAccumulator with all occurrence and co-occurrence counts, and a num_docs that reflects the total observed by all the individual accumulators.
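The merge step can be pictured as summing counters. The following is a minimal sketch, assuming each worker's partial result can be reduced to plain occurrence counts, pair counts, and a document total; `TinyAccumulator` and `merge_all` are hypothetical names and do not reflect how gensim stores its counts internally.

```python
from collections import Counter


class TinyAccumulator:
    """Hypothetical stand-in holding the partial counts from one worker."""

    def __init__(self, occurrences, co_occurrences, num_docs):
        self.occurrences = Counter(occurrences)
        self.co_occurrences = Counter(co_occurrences)
        self.num_docs = num_docs

    def merge(self, other):
        # Counter.update adds counts instead of overwriting them.
        self.occurrences.update(other.occurrences)
        self.co_occurrences.update(other.co_occurrences)
        self.num_docs += other.num_docs


def merge_all(accumulators):
    """Fold a list of partial accumulators into one combined accumulator."""
    merged = TinyAccumulator({}, {}, 0)
    for accumulator in accumulators:
        merged.merge(accumulator)
    return merged


first = TinyAccumulator({"human": 2}, {("graph", "human"): 1}, 3)
second = TinyAccumulator({"human": 1, "graph": 2}, {("graph", "human"): 1}, 2)
merged = merge_all([first, second])
```

Because addition is order-independent, the workers' results can be merged in whatever order they arrive.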
num_docs
queue_all_texts(q, texts, window_size)
Sequentially place batches of texts on the given queue until texts is consumed. The texts are filtered so that only those with at least one relevant token are queued.
start_workers(window_size)
Set up an input and output queue and start processes for each worker.
The input queue is used to transmit batches of documents to the workers. The output queue is used by workers to transmit the WordOccurrenceAccumulator instances.
Returns: tuple of (list of workers, input queue, output queue).
terminate_workers(input_q, output_q, workers, interrupted=False)
Wait until all workers have transmitted their WordOccurrenceAccumulator instances, then terminate each. We do not use join here because it has been shown to have some issues in Python 2.7 (and even in later versions). This method also closes both the input and output queue.
If interrupted is False (normal execution), a None value is placed on the input queue for each worker. The workers are looking for this sentinel value and interpret it as a signal to terminate themselves. If interrupted is True, a KeyboardInterrupt occurred. The workers are programmed to recover from this and continue on to transmit their results before terminating. So in this instance, the sentinel values are not queued, but the rest of the execution continues as usual.
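The sentinel protocol described above can be sketched as a worker loop. This is a simplified illustration under stated assumptions: `worker_loop` is a hypothetical function, the "statistic" is just a document count, and the workers are run one after another in a single process with `queue.Queue` standing in for the multiprocessing queues and processes that the real class uses.

```python
import queue


def worker_loop(input_q, output_q):
    """Consume batches until a None sentinel arrives, then report the result."""
    docs_seen = 0
    try:
        while True:
            batch = input_q.get()
            if batch is None:  # sentinel: the master has no more work
                break
            docs_seen += len(batch)
    except KeyboardInterrupt:
        pass  # recover and fall through, so partial results are still sent
    output_q.put(docs_seen)


input_q, output_q = queue.Queue(), queue.Queue()
num_workers = 2

for batch in (["doc a", "doc b"], ["doc c"]):
    input_q.put(batch)
for _ in range(num_workers):
    input_q.put(None)  # one sentinel per worker

# Stand-in for real processes: run each worker loop in turn.
for _ in range(num_workers):
    worker_loop(input_q, output_q)

results = [output_q.get() for _ in range(num_workers)]
total_docs = sum(results)
```

Queuing one sentinel per worker matters: each worker removes exactly one `None` from the queue, so fewer sentinels than workers would leave some worker blocked on `get()`.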
text_is_relevant(text)
Return True if the text has any relevant words, else False.
yield_batches(texts)
Return a generator over the given texts that yields batches of batch_size texts at a time.
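Filtering and batching can be combined as in the sketch below. It is an illustration only: these are free functions rather than methods, and `relevant_words` and `batch_size` are passed explicitly here, which is an assumption of the example (the documented `yield_batches` takes only `texts`).

```python
from itertools import islice


def text_is_relevant(text, relevant_words):
    """Return True if the text has any relevant words, else False."""
    return any(word in relevant_words for word in text)


def yield_batches(texts, batch_size):
    """Yield lists of up to batch_size texts until the input is exhausted."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


texts = [["human", "interface"], ["trees"], ["graph", "minors"], ["survey"], ["graph"]]
relevant_words = {"human", "graph"}

# Drop irrelevant texts lazily, then group the survivors into batches of two.
relevant_texts = (text for text in texts if text_is_relevant(text, relevant_words))
batches = list(yield_batches(relevant_texts, 2))
```

Working on iterators throughout means the corpus is never fully materialized; only one batch is held in memory at a time.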
gensim.topic_coherence.text_analysis.PatchedWordOccurrenceAccumulator(*args)
Bases: gensim.topic_coherence.text_analysis.WordOccurrenceAccumulator
Monkey patched for multiprocessing worker usage, to move some of the logic to the master process.
accumulate(texts, window_size)
analyze_text(window, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
merge(other)
num_docs
partial_accumulate(texts, window_size)
Meant to be called several times to accumulate partial results. The final accumulation should be performed with the accumulate method as opposed to this one. This method does not ensure the co-occurrence matrix is in lil format and does not symmetrize it after accumulation.
text_is_relevant(text)
Return True if the text has any relevant words, else False.
gensim.topic_coherence.text_analysis.UsesDictionary(relevant_ids, dictionary)
Bases: gensim.topic_coherence.text_analysis.BaseAnalyzer
A BaseAnalyzer that uses a Dictionary, hence can translate tokens to counts. The standard BaseAnalyzer can only deal with token ids since it doesn’t have the token2id mapping.
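To make the token-to-id translation concrete, here is a minimal sketch in which a plain dict stands in for the Dictionary's token2id mapping. `IdCounts` and `TokenCounts` are hypothetical classes invented for this illustration, and the counts are made up; they are not gensim's classes or data.

```python
class IdCounts:
    """Hypothetical analyzer that only understands integer word ids."""

    def __init__(self, occurrences_by_id):
        self._occurrences = occurrences_by_id

    def get_occurrences(self, word_id):
        return self._occurrences.get(word_id, 0)


class TokenCounts(IdCounts):
    """Hypothetical variant that accepts tokens and looks up their ids first."""

    def __init__(self, occurrences_by_id, token2id):
        super().__init__(occurrences_by_id)
        self.token2id = token2id

    def get_occurrences(self, word):
        # Translate the token to its id, then reuse the id-based lookup.
        return super().get_occurrences(self.token2id[word])


token2id = {"human": 0, "graph": 1}           # toy stand-in for a token2id mapping
counts = TokenCounts({0: 3, 1: 2}, token2id)  # made-up document counts keyed by id
human_docs = counts.get_occurrences("human")
graph_docs = counts.get_occurrences("graph")
```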
analyze_text(text, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
num_docs
gensim.topic_coherence.text_analysis.WindowedTextsAnalyzer(relevant_ids, dictionary)
Bases: gensim.topic_coherence.text_analysis.UsesDictionary
Gather some stats about relevant terms of a corpus by iterating over windows of texts.
accumulate(texts, window_size)
analyze_text(text, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
num_docs
text_is_relevant(text)
Return True if the text has any relevant words, else False.
gensim.topic_coherence.text_analysis.WordOccurrenceAccumulator(*args)
Bases: gensim.topic_coherence.text_analysis.WindowedTextsAnalyzer
Accumulate word occurrences and co-occurrences from a sequence of corpus texts.
accumulate(texts, window_size)
analyze_text(window, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
merge(other)
num_docs
partial_accumulate(texts, window_size)
Meant to be called several times to accumulate partial results. The final accumulation should be performed with the accumulate method as opposed to this one. This method does not ensure the co-occurrence matrix is in lil format and does not symmetrize it after accumulation.
text_is_relevant(text)
Return True if the text has any relevant words, else False.
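The split between partial accumulation and a final symmetrizing step can be illustrated with plain counters. This sketch assumes pairs are stored only once (smaller word first) while counting and mirrored afterwards; `count_windows` and `symmetrize` are hypothetical helpers, and a `Counter` stands in for the sparse co-occurrence matrix mentioned above.

```python
from collections import Counter
from itertools import combinations


def count_windows(windows, relevant_words):
    """Count per-window occurrences and upper-triangle co-occurrences."""
    occurrences = Counter()
    upper = Counter()  # each pair stored once, as (smaller, larger)
    num_docs = 0
    for window in windows:
        present = sorted(set(window) & relevant_words)
        occurrences.update(present)
        upper.update(combinations(present, 2))
        num_docs += 1
    return occurrences, upper, num_docs


def symmetrize(upper):
    """Mirror every (a, b) count onto (b, a) so lookups work in either order."""
    full = Counter(upper)
    for (a, b), count in upper.items():
        full[(b, a)] += count
    return full


windows = [["human", "graph"], ["graph", "trees"], ["human", "graph"]]
occurrences, upper, num_docs = count_windows(windows, {"human", "graph", "trees"})
co_occurrences = symmetrize(upper)
```

Deferring the mirroring to a single final pass avoids repeating that work after every partial batch.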
gensim.topic_coherence.text_analysis.WordVectorsAccumulator(relevant_ids, dictionary, model=None, **model_kwargs)
Bases: gensim.topic_coherence.text_analysis.UsesDictionary
Accumulate context vectors for words using word vector embeddings.
accumulate(texts, window_size)
analyze_text(text, doc_num=None)
get_co_occurrences(word1, word2)
Return number of docs the words co-occur in, once accumulate has been called.
get_occurrences(word)
Return number of docs the word occurs in, once accumulate has been called.
ids_similarity(ids1, ids2)
not_in_vocab(words)
num_docs
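The page does not say how ids_similarity compares two groups of words. One common approach with word embeddings, assumed here purely for illustration, is the cosine similarity between the mean vectors of the two groups. The function names and the two-dimensional toy embeddings below are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def words_similarity(words1, words2, embeddings):
    """Cosine similarity between the mean vectors of two word groups."""
    mean1 = mean_vector([embeddings[word] for word in words1])
    mean2 = mean_vector([embeddings[word] for word in words2])
    return cosine(mean1, mean2)


# Toy embeddings, made up for the example.
embeddings = {"human": [1.0, 0.0], "computer": [0.0, 1.0], "graph": [1.0, 1.0]}
same_direction = words_similarity(["human", "computer"], ["graph"], embeddings)
orthogonal = words_similarity(["human"], ["computer"], embeddings)
```

Words missing from the embedding table would raise a `KeyError` in this sketch, so a caller would need to filter them out first.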