scripts.segment_wiki – Convert Wikipedia dump to json-line format
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2), extract titles, section names and section content, and save them in json-line format, where each line contains three fields:
'title' (str) - title of article,
'section_titles' (list) - list of titles of sections,
'section_texts' (list) - list of content from sections.
The English Wikipedia dump is available from the Wikimedia dumps site (dumps.wikimedia.org). Processing it takes approximately 2.5 hours (i7-6700HQ, SSD).
Examples
Convert wiki to json-lines format:
python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest.json.gz
Read json-lines dump:
>>> import json
>>> from smart_open import smart_open
>>>
>>> # iterate over the json-lines file we just created (smart_open decompresses .gz transparently)
>>> for line in smart_open('enwiki-latest.json.gz'):
>>>     # decode JSON into a Python object
>>>     article = json.loads(line)
>>>
>>>     # each article has "title", "section_titles" and "section_texts" fields
>>>     print("Article title: %s" % article['title'])
>>>     for section_title, section_text in zip(article['section_titles'], article['section_texts']):
>>>         print("Section title: %s" % section_title)
>>>         print("Section text: %s" % section_text)
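If smart_open is not available, the same json-lines file can be read with the standard library alone. The following is a minimal sketch, not part of gensim: the helper name `iter_articles` and the two-article sample file are hypothetical and exist only so the example runs end to end.

```python
import gzip
import json
import os
import tempfile


def iter_articles(path):
    """Yield one decoded article dict per line of a gzipped json-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Hypothetical sample data standing in for a real converted dump.
sample = [
    {"title": "A", "section_titles": ["Introduction"],
     "section_texts": ["Text of A."]},
    {"title": "B", "section_titles": ["Introduction", "History"],
     "section_texts": ["Text of B.", "More."]},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample:
        handle.write(json.dumps(record) + "\n")

articles = list(iter_articles(sample_path))
for article in articles:
    print(article["title"], len(article["section_titles"]))
```

Because the reader is a generator, articles are decoded one line at a time, which matters for a full Wikipedia dump that will not fit comfortably in memory.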
gensim.scripts.segment_wiki.extract_page_xmls(f)
Extract pages from a MediaWiki database dump.
| Parameters: | f (file) – File descriptor of MediaWiki dump. |
|---|---|
| Yields: | str – XML strings for page tags. |
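The idea of streaming page tags out of a dump can be illustrated with the standard library. This is a sketch under stated assumptions, not gensim's implementation: `iter_page_xmls` is a hypothetical name, and the tiny in-memory dump omits the XML namespace and most of the fields that real MediaWiki dumps carry.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_xmls(stream):
    """Yield each <page> element of a MediaWiki-style dump as an XML string."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Real dumps qualify tags with a namespace, e.g. "{...}page".
        if elem.tag == "page" or elem.tag.endswith("}page"):
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # release the memory held by the processed page


# Hypothetical two-page dump used only for illustration.
sample_dump = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>Text of A</text></revision></page>"
    b"<page><title>B</title><revision><text>Text of B</text></revision></page>"
    b"</mediawiki>"
)

pages = list(iter_page_xmls(sample_dump))
titles = [ET.fromstring(page).find("title").text for page in pages]
print(titles)
```

Parsing incrementally and clearing each element after use keeps memory usage low, since the whole dump is never held as one tree.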
gensim.scripts.segment_wiki.segment(page_xml)
Parse the content inside a page tag.
| Parameters: | page_xml (str) – Content from page tag. |
|---|---|
| Returns: | Structure contains (title, [(section_heading, section_content)]). |
| Return type: | (str, list of (str, str)) |
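To make the returned structure concrete, here is a rough sketch of how a page could be split into (section_heading, section_content) pairs. It is an illustration, not the library's code: the function name `segment_sketch`, the heading regex, the "Introduction" label for the lead text, and the namespace-free sample page are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Matches top-level wikitext headings such as "==History==" but not "===Sub===".
TOP_LEVEL_HEADING = re.compile(r"^==([^=\n]+?)==\s*$", re.MULTILINE)


def segment_sketch(page_xml, lead_heading="Introduction"):
    """Return (title, [(section_heading, section_content), ...]) for one page."""
    root = ET.fromstring(page_xml)
    title = root.findtext("title")
    text = root.findtext("revision/text") or ""
    # With one capture group, split() returns
    # [lead_text, heading_1, body_1, heading_2, body_2, ...].
    parts = TOP_LEVEL_HEADING.split(text)
    headings = [lead_heading] + [h.strip() for h in parts[1::2]]
    contents = [c.strip() for c in parts[0::2]]
    return title, list(zip(headings, contents))


# Hypothetical page used only for illustration.
sample_page = (
    "<page><title>Example</title><revision><text>"
    "Lead paragraph.\n"
    "==History==\n"
    "Some history.\n"
    "===Early years===\n"
    "Nested text stays inside its parent section.\n"
    "==Usage==\n"
    "How it is used."
    "</text></revision></page>"
)

title, sections = segment_sketch(sample_page)
for heading, content in sections:
    print(heading, "->", len(content), "characters")
```

In this sketch, sub-headings are left inside the text of their parent section, so the list contains one entry per top-level section plus the lead text.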
gensim.scripts.segment_wiki.segment_all_articles(file_path, min_article_character=200)
Extract article titles and sections from a MediaWiki bz2 database dump.
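| Parameters: | file_path (str) – Path to the MediaWiki dump, typically <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2. |
|---|---|
| | min_article_character (int, optional) – Minimum number of characters an article must contain to be included (default 200). |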
| Yields: | (str, list of (str, str)) – Structure contains (title, [(section_heading, section_content), …]). |
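The effect of min_article_character can be pictured as a filter over the yielded tuples. The sketch below is an assumption about the intent suggested by the parameter name, not a copy of gensim's logic: `filter_short_articles` is a hypothetical helper, and both the choice to count only section text and the use of `>=` are guesses.

```python
def filter_short_articles(articles, min_article_character=200):
    """Yield only articles whose combined section text reaches the threshold."""
    for title, sections in articles:
        # Assumption: only section contents count, not titles or headings.
        total = sum(len(content) for _heading, content in sections)
        if total >= min_article_character:
            yield title, sections


# Hypothetical articles in the (title, [(heading, content), ...]) shape.
sample_articles = [
    ("Stub", [("Introduction", "Too short.")]),
    ("Full", [("Introduction", "x" * 150), ("History", "y" * 100)]),
]

kept = list(filter_short_articles(sample_articles))
print([title for title, _sections in kept])
```

Lowering the threshold keeps stubs and redirect-like pages in the corpus, while raising it trades coverage for longer, more substantive articles.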
gensim.scripts.segment_wiki.segment_and_write_all_articles(file_path, output_file, min_article_character=200)
Write article titles and sections to output_file. The output is a json-line file in which each line has three fields:
'title' - title of article,
'section_titles' - list of titles of sections,
'section_texts' - list of content from sections.
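| Parameters: | file_path (str) – Path to the MediaWiki dump, typically <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2. |
|---|---|
| | output_file (str) – Path to the output file in json-line format. |
| | min_article_character (int, optional) – Minimum number of characters an article must contain to be included (default 200). |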
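The mapping from (title, [(section_heading, section_content), …]) tuples to the three-field json-line records can be sketched as follows. This is an illustrative stand-in, not gensim's function: `write_articles` is a hypothetical name, and it writes to any file-like object rather than taking a path.

```python
import io
import json


def write_articles(articles, output):
    """Write (title, [(heading, content), ...]) tuples as json-line records."""
    count = 0
    for title, sections in articles:
        record = {
            "title": title,
            "section_titles": [heading for heading, _content in sections],
            "section_texts": [content for _heading, content in sections],
        }
        output.write(json.dumps(record) + "\n")  # one article per line
        count += 1
    return count


# Hypothetical article used only for illustration.
buffer = io.StringIO()
written = write_articles(
    [("Example", [("Introduction", "Lead text."), ("History", "Some history.")])],
    buffer,
)

first_record = json.loads(buffer.getvalue().splitlines()[0])
print(first_record["section_titles"])
```

Keeping section titles and section texts as two parallel lists means the pairs can be recovered with `zip`, as in the reading example above.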