Name                               Date         Size     #Lines  LOC
..                                 -            -
doc/                               22-Nov-2023  -        1,213   775
html5lib/                          22-Nov-2023  -        14,161  11,788
utils/                             22-Nov-2023  -        147     117
.gitignore                         22-Nov-2023  1.8 KiB  83      71
.gitmodules                        22-Nov-2023  109      4       3
.travis.yml                        22-Nov-2023  580      38      31
AUTHORS.rst                        22-Nov-2023  651      44      38
CHANGES.rst                        22-Nov-2023  5.3 KiB  218     133
CONTRIBUTING.rst                   22-Nov-2023  2.4 KiB  61      43
LICENSE                            22-Nov-2023  1.1 KiB  21      17
MANIFEST.in                        22-Nov-2023  149      7       6
README.chromium                    22-Nov-2023  291      12      9
README.rst                         22-Nov-2023  4.2 KiB  158     103
debug-info.py                      22-Nov-2023  779      38      26
flake8-run.sh                      22-Nov-2023  393      15      11
parse.py                           22-Nov-2023  8.9 KiB  242     196
requirements-install.sh            22-Nov-2023  537      17      12
requirements-optional-2.6.txt      22-Nov-2023  126      6       4
requirements-optional-cpython.txt  22-Nov-2023  143      6       4
requirements-optional.txt          22-Nov-2023  334      14      10
requirements-test.txt              22-Nov-2023  58       6       4
requirements.txt                   22-Nov-2023  4        2       1
setup.py                           22-Nov-2023  2.2 KiB  59      53
tox.ini                            22-Nov-2023  513      31      27

README.chromium

Name: html5lib-python
Short Name: html5lib
URL: https://github.com/html5lib/html5lib-python
Version: 01b1ebb7ce0146b8082b1a7315431aac023eb046
License: MIT

Description:
Standards-compliant library for parsing and serializing HTML documents and
fragments in Python

Local Modifications: None

README.rst

html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
  :target: https://travis-ci.org/html5lib/html5lib-python

html5lib is a pure-Python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as implemented by all major
web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance.
Whenever possible, html5lib chooses the accelerated ``ElementTree``
implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

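As a minimal sketch of working with the default tree, the following
parses a fragment and reads it back through the standard ``ElementTree``
API. Passing ``namespaceHTMLElements=False`` is a choice made here for
brevity: by default element tags carry the XHTML namespace, so lookups
would need a ``{http://www.w3.org/1999/xhtml}`` prefix on each tag name.

.. code-block:: python

  import html5lib

  # Parse without XHTML namespaces so plain tag names work in lookups.
  document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)

  # The parser supplies the implied <html>, <head> and <body> elements,
  # so the paragraph is found under the body.
  paragraph = document.find("body/p")
  print(document.tag)
  print(paragraph.text)
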
Two other tree types are supported: ``xml.dom.minidom`` and
``lxml.etree``. To use an alternative format, specify the name of
a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When used with ``urllib2`` (Python 2), the charset from the HTTP response
should be passed to html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, encoding=f.info().getparam("charset"))

When used with ``urllib.request`` (Python 3), the charset from the HTTP
response should be passed to html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, encoding=f.info().get_content_charset())

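To see what the Python 3 call hands to the parser without making a
network request, the sketch below applies the same
``get_content_charset()`` method to a hand-built header. On Python 3 the
object returned by ``f.info()`` is an ``email.message.Message`` subclass,
which is why a plain ``Message`` can stand in for it here; the
``charset_of`` helper is hypothetical and exists only for illustration.

.. code-block:: python

  from email.message import Message

  def charset_of(content_type):
      """Return the charset parameter of a Content-Type value, or None."""
      message = Message()
      message["Content-Type"] = content_type
      return message.get_content_charset()

  # The charset name is returned in lower case.
  print(charset_of("text/html; charset=ISO-8859-1"))
  # No charset parameter in the header yields None.
  print(charset_of("text/html"))

When the header carries no charset the result is ``None``, which matches
the default of the ``encoding`` argument, so the parser should then fall
back to its own encoding detection.
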
To have more control over the parser, create a parser object explicitly.
For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

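A short sketch of how strict mode might be handled in practice, assuming
the exception raised is ``html5lib.html5parser.ParseError`` and that a
non-strict parser records its errors in the ``errors`` attribute. The
input deliberately lacks a doctype, which the specification treats as a
parse error.

.. code-block:: python

  import html5lib
  from html5lib.html5parser import ParseError

  # Strict mode: the first parse error is raised as an exception.
  strict_parser = html5lib.HTMLParser(strict=True)
  try:
      strict_parser.parse("<p>Hello World!")
      outcome = "parsed"
  except ParseError as error:
      outcome = "rejected"
      print("parse error:", error)

  # Default mode: the document is still built and the errors are collected.
  lenient_parser = html5lib.HTMLParser()
  lenient_parser.parse("<p>Hello World!")
  print(len(lenient_parser.errors), "error(s) recorded")
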
When you're instantiating parser objects explicitly, pass a treebuilder
class as the ``tree`` keyword argument to use an alternative document
format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

html5lib works on CPython 2.6+, CPython 3.2+ and PyPy.  To install it,
use:

.. code-block:: bash

    $ pip install html5lib


Optional Dependencies
---------------------

The following third-party libraries may be used for additional
functionality:

- ``datrie`` can be used to improve parsing performance (though in
  almost all cases the improvement is marginal);

- ``lxml`` is supported as a tree format (for both building and
  walking) under CPython (but *not* PyPy where it is known to cause
  segfaults);

- ``genshi`` has a treewalker (but not builder);

- ``charade`` can be used as a fallback when character encoding cannot
  be determined; ``chardet``, from which it was forked, can also be used
  on Python 2; and

- ``ordereddict`` can be used under Python 2.6
  (``collections.OrderedDict`` is used instead on later versions) to
  serialize attributes in alphabetical order.

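One way to check which of these libraries are present in an environment
is sketched below. It is not part of html5lib: the
``available_optional_modules`` helper is hypothetical, the import names
are assumed to match the package names listed above, and
``importlib.util.find_spec`` requires Python 3.4 or later.

.. code-block:: python

  import importlib.util

  # Import names are assumed to match the package names listed above.
  OPTIONAL_MODULES = (
      "datrie", "lxml", "genshi", "charade", "chardet", "ordereddict",
  )

  def available_optional_modules(names=OPTIONAL_MODULES):
      """Map each module name to True if it can be imported, else False."""
      return {
          name: importlib.util.find_spec(name) is not None
          for name in names
      }

  for name, present in sorted(available_optional_modules().items()):
      print(name, "available" if present else "missing")
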

Bugs
----

Please report any bugs on the `issue tracker
<https://github.com/html5lib/html5lib-python/issues>`_.


Tests
-----

Unit tests require the ``nose`` library and can be run using the
``nosetests`` command in the root directory; ``ordereddict`` is
required under Python 2.6. All tests should pass.

Test data are contained in a separate `html5lib-tests
<https://github.com/html5lib/html5lib-tests>`_ repository and included
as a submodule, so for git checkouts the submodule must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your
system, you can run tests on all of them using the ``tox`` utility,
which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
though you may get a quicker response asking on IRC in `#whatwg on
irc.freenode.net <http://wiki.whatwg.org/wiki/IRC>`_.