:mod:`xml.dom.minidom` --- Minimal DOM implementation
=====================================================

.. module:: xml.dom.minidom
   :synopsis: Minimal Document Object Model (DOM) implementation.
.. moduleauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Paul Prescod <paul@prescod.net>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. versionadded:: 2.0

**Source code:** :source:`Lib/xml/dom/minidom.py`

--------------

:mod:`xml.dom.minidom` is a minimal implementation of the Document Object
Model interface, with an API similar to that in other languages. It is intended
to be simpler than the full DOM and also significantly smaller. Users who are
not already proficient with the DOM should consider using the
:mod:`xml.etree.ElementTree` module for their XML processing instead.


.. warning::

   The :mod:`xml.dom.minidom` module is not secure against
   maliciously constructed data. If you need to parse untrusted or
   unauthenticated data see :ref:`xml-vulnerabilities`.


DOM applications typically start by parsing some XML into a DOM. With
:mod:`xml.dom.minidom`, this is done through the parse functions::

   from xml.dom.minidom import parse, parseString

   dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

   datasource = open('c:\\temp\\mydata.xml')
   dom2 = parse(datasource)  # parse an open file

   dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')

The :func:`parse` function can take either a filename or an open file object.


.. function:: parse(filename_or_file[, parser[, bufsize]])

   Return a :class:`Document` from the given input. *filename_or_file* may be
   either a file name, or a file-like object. *parser*, if given, must be a SAX2
   parser object.
   This function will change the document handler of the parser and
   activate namespace support; other parser configuration (like setting an entity
   resolver) must have been done in advance.

If you have XML in a string, you can use the :func:`parseString` function
instead:


.. function:: parseString(string[, parser])

   Return a :class:`Document` that represents the *string*. This function creates a
   :class:`~StringIO.StringIO` object for the string and passes that on to :func:`parse`.

Both functions return a :class:`Document` object representing the content of the
document.

What the :func:`parse` and :func:`parseString` functions do is connect an XML
parser with a "DOM builder" that can accept parse events from any SAX parser and
convert them into a DOM tree. The names of the functions are perhaps misleading,
but are easy to grasp when learning the interfaces. The parsing of the document
will be completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.

You can also create a :class:`Document` by calling a method on a "DOM
Implementation" object. You can get this object by calling the
:func:`getDOMImplementation` function in either the :mod:`xml.dom` package or the
:mod:`xml.dom.minidom` module. Using the implementation from the
:mod:`xml.dom.minidom` module will always return a :class:`Document` instance
from the minidom implementation, while the version from :mod:`xml.dom` may
provide an alternate implementation (this is likely if you have the `PyXML
package <http://pyxml.sourceforge.net/>`_ installed).
Once you have a :class:`Document`, you can add child nodes to it to populate the DOM::

   from xml.dom.minidom import getDOMImplementation

   impl = getDOMImplementation()

   newdoc = impl.createDocument(None, "some_tag", None)
   top_element = newdoc.documentElement
   text = newdoc.createTextNode('Some textual content.')
   top_element.appendChild(text)

Once you have a DOM document object, you can access the parts of your XML
document through its properties and methods. These properties are defined in
the DOM specification. The main property of the document object is the
:attr:`documentElement` property. It gives you the main element in the XML
document: the one that holds all others. Here is an example program::

   from xml.dom.minidom import parseString

   dom3 = parseString("<myxml>Some data</myxml>")
   assert dom3.documentElement.tagName == "myxml"

When you are finished with a DOM tree, you may optionally call the
:meth:`unlink` method to encourage early cleanup of the now-unneeded
objects. :meth:`unlink` is an :mod:`xml.dom.minidom`\ -specific
extension to the DOM API that renders the node and its descendants
essentially useless. Otherwise, Python's garbage collector will
eventually take care of the objects in the tree.

.. seealso::

   `Document Object Model (DOM) Level 1 Specification <https://www.w3.org/TR/REC-DOM-Level-1/>`_
      The W3C recommendation for the DOM supported by :mod:`xml.dom.minidom`.


.. _minidom-objects:

DOM Objects
-----------

The definition of the DOM API for Python is given as part of the :mod:`xml.dom`
module documentation. This section lists the differences between the API and
:mod:`xml.dom.minidom`.


.. method:: Node.unlink()

   Break internal references within the DOM so that it will be garbage collected on
   versions of Python without cyclic GC.
   Even when cyclic GC is available, using
   this can make large amounts of memory available sooner, so calling this on DOM
   objects as soon as they are no longer needed is good practice. This only needs
   to be called on the :class:`Document` object, but may be called on child nodes
   to discard children of that node.


.. method:: Node.writexml(writer, indent="", addindent="", newl="")

   Write XML to the writer object. The writer should have a :meth:`write` method
   that matches that of the file object interface. The *indent* parameter is the
   indentation of the current node. The *addindent* parameter is the incremental
   indentation to use for subnodes of the current one. The *newl* parameter
   specifies the string to use to terminate lines.

   For the :class:`Document` node, an additional keyword argument *encoding* can
   be used to specify the encoding field of the XML header.

   .. versionchanged:: 2.1
      The optional keyword parameters *indent*, *addindent*, and *newl* were added to
      support pretty output.

   .. versionchanged:: 2.3
      For the :class:`Document` node, an additional keyword argument
      *encoding* can be used to specify the encoding field of the XML header.


.. method:: Node.toxml([encoding])

   Return the XML that the DOM represents as a string.

   With no argument, the XML header does not specify an encoding, and the result is
   a Unicode string if the default encoding cannot represent all characters in the
   document. Encoding this string in an encoding other than UTF-8 is likely
   incorrect, since UTF-8 is the default encoding of XML.

   With an explicit *encoding* [1]_ argument, the result is a byte string in the
   specified encoding. It is recommended that this argument is always specified. To
   avoid :exc:`UnicodeError` exceptions in case of unrepresentable text data, the
   encoding argument should be specified as "utf-8".

   .. versionchanged:: 2.3
      The *encoding* argument was introduced; see :meth:`writexml`.


.. method:: Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

   Return a pretty-printed version of the document. *indent* specifies the
   indentation string and defaults to a tab character; *newl* specifies the string
   emitted at the end of each line and defaults to ``\n``.

   .. versionadded:: 2.1

   .. versionchanged:: 2.3
      The *encoding* argument was introduced; see :meth:`writexml`.

The following standard DOM methods have special considerations with
:mod:`xml.dom.minidom`:


.. method:: Node.cloneNode(deep)

   Although this method was present in the version of :mod:`xml.dom.minidom`
   packaged with Python 2.0, it was seriously broken. This has been corrected for
   subsequent releases.


.. _dom-example:

DOM Example
-----------

This example is a fairly realistic, if simple, program. In this
particular case, we do not take much advantage of the flexibility of the DOM.

.. literalinclude:: ../includes/minidom-example.py


.. _minidom-and-dom:

minidom and the DOM standard
----------------------------

The :mod:`xml.dom.minidom` module is essentially a DOM 1.0-compatible DOM with
some DOM 2 features (primarily namespace features).

Usage of the DOM interface in Python is straightforward. The following mapping
rules apply:

* Interfaces are accessed through instance objects. Applications should not
  instantiate the classes themselves; they should use the creator functions
  available on the :class:`Document` object. Derived interfaces support all
  operations (and attributes) from the base interfaces, plus any new operations.

* Operations are used as methods. Since the DOM uses only :keyword:`in`
  parameters, the arguments are passed in normal order (from left to right).
  There are no optional arguments.
  ``void`` operations return ``None``.

* IDL attributes map to instance attributes. For compatibility with the OMG IDL
  language mapping for Python, an attribute ``foo`` can also be accessed through
  accessor methods :meth:`_get_foo` and :meth:`_set_foo`. ``readonly``
  attributes must not be changed; this is not enforced at runtime.

* The types ``short int``, ``unsigned int``, ``unsigned long long``, and
  ``boolean`` all map to Python integer objects.

* The type ``DOMString`` maps to Python strings. :mod:`xml.dom.minidom` supports
  either byte or Unicode strings, but will normally produce Unicode strings.
  Values of type ``DOMString`` may also be ``None`` where allowed to have the IDL
  ``null`` value by the DOM specification from the W3C.

* ``const`` declarations map to variables in their respective scope (e.g.
  ``xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE``); they must not be changed.

* ``DOMException`` is currently not supported in :mod:`xml.dom.minidom`.
  Instead, :mod:`xml.dom.minidom` uses standard Python exceptions such as
  :exc:`TypeError` and :exc:`AttributeError`.

* :class:`NodeList` objects are implemented using Python's built-in list type.
  Starting with Python 2.2, these objects provide the interface defined in the DOM
  specification, but with earlier versions of Python they do not support the
  official API. They are, however, much more "Pythonic" than the interface
  defined in the W3C recommendations.
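
As a rough illustration of a few of the mapping rules above, the following
sketch reads IDL attributes as plain instance attributes, compares a node type
against a ``const`` on :class:`Node`, and treats a :class:`NodeList` as a
Python list. The sample document and the variable names are made up for this
example, and it assumes a Python version in which :class:`NodeList` provides
the DOM-style ``length`` attribute and ``item()`` method::

   from xml.dom.minidom import parseString, Node

   # Hypothetical sample document, used only for illustration.
   doc = parseString("<root><item>a</item><item>b</item></root>")

   # IDL attributes are read as ordinary instance attributes.
   root = doc.documentElement
   assert root.tagName == "root"

   # ``const`` declarations are available as names on the Node class.
   assert root.nodeType == Node.ELEMENT_NODE

   # A NodeList can be iterated, indexed and measured like a list,
   # alongside the DOM-style ``length`` and ``item()``.
   items = doc.getElementsByTagName("item")
   texts = [node.firstChild.data for node in items]
   assert len(items) == items.length
   assert items.item(0) is items[0]

   # Release the tree once it is no longer needed.
   doc.unlink()
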

The following interfaces have no implementation in :mod:`xml.dom.minidom`:

* :class:`DOMTimeStamp`

* :class:`DocumentType` (added in Python 2.1)

* :class:`DOMImplementation` (added in Python 2.1)

* :class:`CharacterData`

* :class:`CDATASection`

* :class:`Notation`

* :class:`Entity`

* :class:`EntityReference`

* :class:`DocumentFragment`

Most of these reflect information in the XML document that is not of general
utility to most DOM users.

.. rubric:: Footnotes

.. [1] The encoding string included in XML output should conform to the
   appropriate standards. For example, "UTF-8" is valid, but "UTF8" is
   not. See https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl
   and https://www.iana.org/assignments/character-sets/character-sets.xhtml.