aglyph.compat.ipyetree — an ElementTree parser for IronPython
Release: 2.1.1
This module defines an xml.etree.ElementTree.XMLParser that
delegates to the .NET
System.Xml.XmlReader XML
parser to parse an Aglyph XML context document.
IronPython is not able to load CPython’s
xml.parsers.expat module, and so the default parser used by
ElementTree is unavailable.
New in version 2.0.0: To address the missing xml.parsers.expat module, this module now defines the CLRXMLParser class, which replaces XmlReaderTreeBuilder and is used by aglyph.context.XMLContext as the default parser when running under IronPython.
Alternatively, IronPython developers may wish to install expat or an
expat-compatible library as a site package. However, this has not
been tested with Aglyph.
class aglyph.compat.ipyetree.CLRXMLParser(target=None, validating=False)
Bases: xml.etree.ElementTree.XMLParser
An xml.etree.ElementTree.XMLParser that delegates parsing to the .NET System.Xml.XmlReader parser.
If target is omitted, a standard TreeBuilder instance is used.
If validating is True, the System.Xml.XmlReader parser will be configured for DTD validation.
feed(data)
Add more XML data to be parsed.
Parameters: data (str) – raw XML read from a stream
Note
All data across calls to this method are buffered internally; the parser itself is not actually created until the close() method is called.
close()
Parse the XML from the internal buffer to build an element tree.
Returns: the root element of the XML document
Return type: xml.etree.ElementTree.ElementTree
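The feed()/close() contract described above (buffer everything, parse only on close) can be illustrated with a small stand-in. This is a minimal sketch, not Aglyph's implementation: the class name BufferingParser is hypothetical, and it parses with the standard library's xml.etree.ElementTree.fromstring() rather than System.Xml.XmlReader so that it runs under CPython 3.

```python
import xml.etree.ElementTree as ET


class BufferingParser(object):
    """Hypothetical stand-in that mimics the feed()/close() contract."""

    def __init__(self):
        self._chunks = []  # raw data accumulated across feed() calls

    def feed(self, data):
        # Nothing is parsed here; the chunk is only remembered.
        self._chunks.append(data)

    def close(self):
        # All parsing happens now, over the complete buffered document.
        document = "".join(self._chunks)
        self._chunks = []
        return ET.fromstring(document)


parser = BufferingParser()
parser.feed("<doc><it")            # an incomplete tag is fine at this point
parser.feed("em id='x'/></doc>")
root = parser.close()              # the document is parsed only here
```

Because no parsing happens in feed(), a chunk boundary may fall anywhere in the document, including in the middle of a tag.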
class aglyph.compat.ipyetree.XmlReaderTreeBuilder(validating=False)
Bases: aglyph.compat.ipyetree.CLRXMLParser
Build an ElementTree using the .NET System.Xml.XmlReader XML parser.
Changed in version 2.0.0: It is no longer necessary for IronPython applications to use this class explicitly. aglyph.context.XMLContext now uses CLRXMLParser by default if running under IronPython.
Deprecated since version 2.0.0: This class has been renamed to CLRXMLParser. XmlReaderTreeBuilder will be removed in release 3.0.0.
A note on IronPython Unicode issues
IronPython does not have an encoded-bytes str type; rather, the
str and unicode types are one and the same:
>>> str is unicode
True
Unfortunately, this means that IronPython cannot properly decode byte streams/sequences to Unicode strings using Python language facilities. Consider the simple example of a UTF-8-encoded XML file test.xml:
<?xml version="1.0" encoding="utf-8"?>
<test>façade</test>
CPython
>>> open("test.xml", "rb").read()
'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'
IronPython
>>> open("test.xml", "rb").read()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'
The byte sequence C3 A7 in a UTF-8-encoded byte string represents a single
Unicode code point (U+00E7 LATIN SMALL LETTER C WITH CEDILLA), while the
character sequence C3 A7 in a Unicode string is the pair of Unicode code
points U+00C3 LATIN CAPITAL LETTER A WITH TILDE followed by
U+00A7 SECTION SIGN. Clearly the latter is incorrect.
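The two readings of C3 A7 can be checked directly. The following is a sketch for CPython 3, where bytes and text are distinct types; decoding as Latin-1 is used here as an assumed equivalent of IronPython's byte-per-character behavior, since it maps each byte to the code point with the same number.

```python
import unicodedata

raw = b"fa\xc3\xa7ade"  # the UTF-8 bytes of the text node in test.xml

# Correct reading: interpret the bytes as UTF-8.
decoded = raw.decode("utf-8")

# Byte-per-character reading: every byte becomes its own code point.
misread = raw.decode("latin-1")

# Look up the Unicode names of the characters that C3 A7 turned into.
names_decoded = [unicodedata.name(ch) for ch in decoded[2:3]]
names_misread = [unicodedata.name(ch) for ch in misread[2:4]]
```

The correctly decoded string has six characters, while the misread string has seven, because the single two-byte character became two separate characters.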
In many cases, this difference between CPython and IronPython will be transparent. For example:
CPython
>>> "fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
IronPython
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
However, IronPython's behavior poses a problem for Aglyph XML context
parsing because the xml.etree.ElementTree.ElementTree class uses
open(source, "rb") (as in the first comparison) to access the file contents
when the source argument to xml.etree.ElementTree.ElementTree.parse()
is a string (filename). This would cause the XML parser to return the Unicode
string u"fa\xc3\xa7ade" as the value of the text node under <test>.
If, for example, this were in an Aglyph <str> or <bytes> element (e.g.
<str encoding="iso-8859-1">façade</str>), Aglyph would attempt (correctly)
to encode the Unicode string using ISO-8859-1, which would result in an
incorrect ISO-8859-1 string under IronPython:
>>> u"fa\xc3\xa7ade".encode("iso-8859-1")
u'fa\xc3\xa7ade'
This happens because both '\xc3' and '\xa7' represent valid ISO-8859-1
characters (LATIN CAPITAL LETTER A WITH TILDE and SECTION SIGN, respectively),
so the encode step succeeds without any error even though the result is wrong.
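The silent failure can be reproduced outside IronPython. This is an illustrative sketch for CPython 3; the variable names are hypothetical, and the misread string is written out literally rather than produced by an actual IronPython parse.

```python
misread = u"fa\xc3\xa7ade"   # text node as the byte-per-character reading yields it
intended = u"fa\xe7ade"      # text node as the document author meant it

# Both strings encode to ISO-8859-1 without raising an exception,
# so the mistake cannot be detected at this point.
wrong_bytes = misread.encode("iso-8859-1")
right_bytes = intended.encode("iso-8859-1")
```

The wrong result is one byte longer than the right one: it still contains the two UTF-8 bytes C3 A7 where the single ISO-8859-1 byte E7 was expected.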
One workaround is to use the .NET System.IO.StreamReader class instead of
the Python built-in function open():
>>> from System.IO import StreamReader
>>> from System.Text import Encoding
>>> sr = StreamReader("test.xml", Encoding.UTF8)
>>> sr.ReadToEnd()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xe7ade</test>\n'
Unfortunately, this requires knowledge of the file encoding prior to reading, which isn’t always possible when parsing XML. (Arguably, it should not need to be known in advance for XML parsing, since the XML declaration should convey this piece of metadata to the XML parser.)
Aglyph’s aglyph.compat.ipyetree.XmlReaderTreeBuilder takes a two-step
approach to work around IronPython's Unicode issues when parsing an Aglyph
XML context document:
1. Save the document encoding from the XML declaration.
2. Use the document encoding to decode data before handing it off to
   aglyph.context.XMLContext.
Step #1 is possible because, luckily, the System.Xml.XmlReader class reports XmlNodeType.XmlDeclaration.
Note
If the XML document does not specify an explicit encoding in the XML
declaration, XmlReaderTreeBuilder assumes UTF-8.
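The effect of step #1, including the UTF-8 fallback, can be approximated in pure Python. This sketch is an assumption-laden stand-in: Aglyph obtains the encoding from the XmlDeclaration node reported by System.Xml.XmlReader, whereas the hypothetical helper declared_encoding() below uses a regular expression on the document text and does not validate the declaration.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    r"""^\s*<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(document, default="utf-8"):
    """Return the encoding named in the XML declaration, else the default."""
    match = _DECL_ENCODING.match(document)
    return match.group(1) if match else default


explicit = declared_encoding('<?xml version="1.0" encoding="iso-8859-1"?><test/>')
implicit = declared_encoding('<?xml version="1.0"?><test/>')
missing = declared_encoding('<test/>')
```

Both a declaration without an encoding and a document with no declaration at all fall back to the default.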
Step #2 works because the same “glitch” that causes IronPython's Unicode issues can be exploited to work around it:
>>> str is unicode
True
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
>>> "no non-ascii bytes".decode("utf-8")
'no non-ascii bytes'
Because of this, the text node string u"fa\xc3\xa7ade" can actually be
decoded to u"fa\xe7ade" before being handed off to
aglyph.context.XMLContext, allowing XMLContext to remain ignorant
of IronPython's Unicode issues.
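Under CPython 3 a str has no decode() method, so the same repair has to be spelled out in two stages. The following is a minimal sketch, assuming that the misread text contains one character per original byte; repair_text() is a hypothetical helper, not part of Aglyph.

```python
def repair_text(text, encoding="utf-8"):
    """Hypothetical helper: undo a byte-per-character misreading of text."""
    try:
        # Turn each character back into the byte with the same number.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # A code point above U+00FF cannot be a misread byte, so the text
        # is assumed to be properly decoded already.
        return text
    # Decode the recovered bytes using the document's declared encoding.
    return raw.decode(encoding)


repaired = repair_text(u"fa\xc3\xa7ade")        # misread UTF-8 text node
unchanged = repair_text("no non-ascii bytes")   # ASCII passes through as-is
```

Note that this sketch is only safe on text known to be misread: applying it with "utf-8" to an already correct string such as u"fa\xe7ade" raises UnicodeDecodeError, because the recovered bytes are not valid UTF-8.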