Writing Extensions for Python-Markdown
======================================

Overview
--------

Python-Markdown includes an API for extension writers to plug their own
custom functionality and/or syntax into the parser. There are preprocessors
which allow you to alter the source before it is passed to the parser,
inline patterns which allow you to add, remove or override the syntax of
any inline elements, and postprocessors which allow munging of the
output of the parser before it is returned. If you really want to dive in,
there are also blockprocessors which are part of the core BlockParser.

As the parser builds an [ElementTree][] object which is later rendered
as Unicode text, there are also some helpers provided to ease manipulation of
the tree. Each part of the API is discussed in its respective section below.
Additionally, reading the source of some [[Available Extensions]] may be helpful.
For example, the [[Footnotes]] extension uses most of the features documented
here.

* [Preprocessors][]
* [InlinePatterns][]
* [Treeprocessors][]
* [Postprocessors][]
* [BlockParser][]
* [Working with the ElementTree][]
* [Integrating your code into Markdown][]
    * [extendMarkdown][]
    * [OrderedDict][]
    * [registerExtension][]
    * [Config Settings][]
    * [makeExtension][]

<h3 id="preprocessors">Preprocessors</h3>

Preprocessors munge the source text before it is passed into the Markdown
core. This is an excellent place to clean up bad syntax, extract things the
parser may otherwise choke on and perhaps even store it for later retrieval.

Preprocessors should inherit from ``markdown.preprocessors.Preprocessor`` and
implement a ``run`` method with one argument ``lines``. The ``run`` method of
each Preprocessor will be passed the entire source text as a list of Unicode
strings. Each string will contain one line of text.
The ``run`` method should
return either that list, or an altered list of Unicode strings.

A pseudo example:

    class MyPreprocessor(markdown.preprocessors.Preprocessor):
        def run(self, lines):
            # MYREGEX is a placeholder for your own compiled regular expression
            new_lines = []
            for line in lines:
                m = MYREGEX.match(line)
                if m:
                    pass  # do stuff with the matching line
                else:
                    new_lines.append(line)
            return new_lines

<h3 id="inlinepatterns">Inline Patterns</h3>

Inline Patterns implement the inline HTML element syntax for Markdown such as
``*emphasis*`` or ``[links](http://example.com)``. Pattern objects should be
instances of classes that inherit from ``markdown.inlinepatterns.Pattern`` or
one of its children. Each pattern object uses a single regular expression and
must have the following methods:

* **``getCompiledRegExp()``**:

    Returns a compiled regular expression.

* **``handleMatch(m)``**:

    Accepts a match object and returns an ElementTree element or a plain
    Unicode string.

Note that any regular expression returned by ``getCompiledRegExp`` must capture
the whole block. Therefore, they should all start with ``r'^(.*?)'`` and end
with ``r'(.*?)$'``. When using the default ``getCompiledRegExp()`` method
provided in the ``Pattern`` you can pass in a regular expression without that
and ``getCompiledRegExp`` will wrap your expression for you. This means that
the first group of your match will be ``m.group(2)`` as ``m.group(1)`` will
match everything before the pattern.

For an example, consider this simplified emphasis pattern:

    class EmphasisPattern(markdown.inlinepatterns.Pattern):
        def handleMatch(self, m):
            el = markdown.etree.Element('em')
            # group(1) is the text before the match, so the first group
            # of our own pattern is group(2)
            el.text = m.group(2)
            return el

As discussed in [Integrating Your Code Into Markdown][], an instance of this
class will need to be provided to Markdown.
That instance would be created
like so:

    # an oversimplified regex
    MYPATTERN = r'\*([^*]+)\*'
    # pass in pattern and create instance
    emphasis = EmphasisPattern(MYPATTERN)

Actually it would not be necessary to create that pattern (and not just because
a more sophisticated emphasis pattern already exists in Markdown). The fact is,
that example pattern is not very DRY. A pattern for `**strong**` text would
be almost identical, with the exception that it would create a 'strong' element.
Therefore, Markdown provides a number of generic pattern classes that can
provide some common functionality. For example, both emphasis and strong are
implemented with separate instances of the ``SimpleTagPattern`` listed below.
Feel free to use or extend any of these Pattern classes.

**Generic Pattern Classes**

* **``SimpleTextPattern(pattern)``**:

    Returns simple text of ``group(2)`` of a ``pattern``.

* **``SimpleTagPattern(pattern, tag)``**:

    Returns an element of type "`tag`" with a text attribute of ``group(3)``
    of a ``pattern``. ``tag`` should be a string of a HTML element (i.e.: 'em').

* **``SubstituteTagPattern(pattern, tag)``**:

    Returns an element of type "`tag`" with no children or text (i.e.: 'br').

There may be other Pattern classes in the Markdown source that you could extend
or use as well. Read through the source and see if there is anything you can
use. You might even get a few ideas for different approaches to your specific
situation.

<h3 id="treeprocessors">Treeprocessors</h3>

Treeprocessors manipulate an ElementTree object after it has passed through the
core BlockParser. This is where additional manipulation of the tree takes
place. Additionally, the InlineProcessor is a Treeprocessor which steps through
the tree and runs the InlinePatterns on the text of each Element in the tree.
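
To make the idea of a tree-walking processor concrete, here is a minimal
stand-alone sketch. It is not Markdown's actual InlineProcessor:
``ToyInlineTreeprocessor`` and ``EM_RE`` are hypothetical names, the class
deliberately does not inherit from ``markdown.treeprocessors.Treeprocessor``
so that it runs with only the standard library, and it handles just the first
match in each element's text.

```python
import re
import xml.etree.ElementTree as etree

# Oversimplified emphasis regex (hypothetical, for illustration only).
EM_RE = re.compile(r'\*([^*]+)\*')


class ToyInlineTreeprocessor:
    """Toy stand-in for a treeprocessor: walks the tree and rewrites text."""

    def run(self, root):
        # Snapshot the elements first so newly created <em> nodes
        # are not visited again.
        for el in list(root.iter()):
            if not el.text:
                continue
            m = EM_RE.search(el.text)
            if m:
                em = etree.Element('em')
                em.text = m.group(1)
                em.tail = el.text[m.end():]    # text after the match
                el.text = el.text[:m.start()]  # text before the match
                el.insert(0, em)
        return root


root = etree.Element('div')
p = etree.SubElement(root, 'p')
p.text = 'some *emphasized* text'
ToyInlineTreeprocessor().run(root)
html = etree.tostring(root, encoding='unicode')
print(html)  # <div><p>some <em>emphasized</em> text</p></div>
```

The point of the sketch is the shape of the work: the text before the match
stays in ``el.text``, the new child carries the matched text, and the text
after the match moves to the child's ``tail``.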

A Treeprocessor should inherit from ``markdown.treeprocessors.Treeprocessor``,
over-ride the ``run`` method which takes one argument ``root`` (an ElementTree
object) and returns either that root element or a modified root element.

A pseudo example:

    class MyTreeprocessor(markdown.treeprocessors.Treeprocessor):
        def run(self, root):
            # do stuff
            return my_modified_root

For specifics on manipulating the ElementTree, see
[Working with the ElementTree][] below.

<h3 id="postprocessors">Postprocessors</h3>

Postprocessors manipulate the document after the ElementTree has been
serialized into a string. Postprocessors should be used to work with the
text just before output.

A Postprocessor should inherit from ``markdown.postprocessors.Postprocessor``
and over-ride the ``run`` method which takes one argument ``text`` and returns
a Unicode string.

Postprocessors are run after the ElementTree has been serialized back into
Unicode text. For example, this may be an appropriate place to add a table of
contents to a document:

    class TocPostprocessor(markdown.postprocessors.Postprocessor):
        def run(self, text):
            return MYMARKERRE.sub(MyToc, text)

<h3 id="blockparser">BlockParser</h3>

Sometimes, pre/tree/postprocessors and Inline Patterns aren't going to do what
you need. Perhaps you want a new type of block that needs to be integrated
into the core parsing. In such a situation, you can add/change/remove
functionality of the core ``BlockParser``. The BlockParser is composed of a
number of Blockprocessors. The BlockParser steps through each block of text
(split by blank lines) and passes each block to the appropriate Blockprocessor.
That Blockprocessor parses the block and adds it to the ElementTree. The
[[Definition Lists]] extension would be a good example of an extension that
adds/modifies Blockprocessors.
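
The dispatch loop described above can be sketched with the standard library
alone. This is a simplified illustration under stated assumptions, not the
real ``BlockParser``: ``HeaderProcessor``, ``ParagraphProcessor`` and
``parse_blocks`` are hypothetical names, and the processors only mimic the
``test``/``run`` interface rather than inheriting from Markdown's classes.

```python
import re
import xml.etree.ElementTree as etree


class HeaderProcessor:
    """Toy processor: handles blocks whose first line starts with '# '."""

    def test(self, parent, block):
        return block.startswith('# ')

    def run(self, parent, blocks):
        block = blocks.pop(0)  # a processor must consume the first block
        first, _, rest = block.partition('\n')
        h1 = etree.SubElement(parent, 'h1')
        h1.text = first[2:].strip()
        if rest:
            # Push unparsed text back for whichever processor claims it.
            blocks.insert(0, rest)


class ParagraphProcessor:
    """Toy catch-all processor: anything else becomes a paragraph."""

    def test(self, parent, block):
        return True

    def run(self, parent, blocks):
        p = etree.SubElement(parent, 'p')
        p.text = blocks.pop(0).strip()


def parse_blocks(parent, blocks, processors):
    # Hand each block to the first processor whose test() accepts it.
    while blocks:
        for proc in processors:
            if proc.test(parent, blocks[0]):
                proc.run(parent, blocks)
                break


text = "# Title\nintro line\n\nSecond paragraph."
root = etree.Element('div')
blocks = re.split(r'\n\s*\n', text)  # split the source on blank lines
parse_blocks(root, blocks, [HeaderProcessor(), ParagraphProcessor()])
html = etree.tostring(root, encoding='unicode')
print(html)  # <div><h1>Title</h1><p>intro line</p><p>Second paragraph.</p></div>
```

Note that the order of the processors matters here: the catch-all must come
last, otherwise it would claim every block.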

A Blockprocessor should inherit from ``markdown.blockprocessors.BlockProcessor``
and implement both the ``test`` and ``run`` methods.

The ``test`` method is used by BlockParser to identify the type of block.
Therefore the ``test`` method must return a boolean value. If the test returns
``True``, then the BlockParser will call that Blockprocessor's ``run`` method.
If it returns ``False``, the BlockParser will move on to the next
BlockProcessor.

The **``test``** method takes two arguments:

* **``parent``**: The parent etree Element of the block. This can be useful as
  the block may need to be treated differently if it is inside a list, for
  example.

* **``block``**: A string of the current block of text. The test may be a
  simple string method (such as ``block.startswith(some_text)``) or a complex
  regular expression.

The **``run``** method takes two arguments:

* **``parent``**: A pointer to the parent etree Element of the block. The run
  method will most likely attach additional nodes to this parent. Note that
  nothing is returned by the method. The ElementTree object is altered in place.

* **``blocks``**: A list of all remaining blocks of the document. Your run
  method must remove (pop) the first block from the list (which is altered in
  place, not returned) and parse that block. You may find that a block of text
  legitimately contains multiple block types. Therefore, after processing the
  first type, your processor can insert the remaining text into the beginning
  of the ``blocks`` list for future parsing.

Please be aware that a single block can span multiple text blocks. For example,
the official Markdown syntax rules state that a blank line does not end a
Code Block. If the next block of text is also indented, then it is part of
the previous block. Therefore, the BlockParser was specifically designed to
address these types of situations.
If you look at the ``CodeBlockProcessor``
in the core, you will note that it checks the last child of the ``parent``.
If the last child is a code block (``<pre><code>...</code></pre>``), then it
appends that block to the previous code block rather than creating a new
code block.

Each BlockProcessor has the following utility methods available:

* **``lastChild(parent)``**:

    Returns the last child of the given etree Element or ``None`` if it had no
    children.

* **``detab(text)``**:

    Removes one level of indent (four spaces by default) from the front of each
    line of the given text string.

* **``looseDetab(text, level)``**:

    Removes "level" levels of indent (defaults to 1) from the front of each line
    of the given text string. However, this method allows secondary lines to
    not be indented, as do some parts of the Markdown syntax.

Each BlockProcessor also has a pointer to the containing BlockParser instance at
``self.parser``, which can be used to check or alter the state of the parser.
The BlockParser tracks its state in a stack at ``parser.state``. The state
stack is an instance of the ``State`` class.

**``State``** is a subclass of ``list`` and has the additional methods:

* **``set(state)``**:

    Set a new state to string ``state``. The new state is appended to the end
    of the stack.

* **``reset()``**:

    Step back one step in the stack. The last state at the end is removed from
    the stack.

* **``isstate(state)``**:

    Test that the top (current) level of the stack is of the given string
    ``state``.

Note that to ensure that the state stack doesn't become corrupted, each time a
state is set for a block, that state *must* be reset when the parser finishes
parsing that block.

An instance of the **``BlockParser``** is found at ``Markdown.parser``.
``BlockParser`` has the following methods:

* **``parseDocument(lines)``**:

    Given a list of lines, an ElementTree object is returned. This should be
    passed an entire document and is the only method the ``Markdown`` class
    calls directly.

* **``parseChunk(parent, text)``**:

    Parses a chunk of markdown text composed of multiple blocks and attaches
    those blocks to the ``parent`` Element. The ``parent`` is altered in place
    and nothing is returned. Extensions would most likely use this method for
    block parsing.

* **``parseBlocks(parent, blocks)``**:

    Parses a list of blocks of text and attaches those blocks to the ``parent``
    Element. The ``parent`` is altered in place and nothing is returned. This
    method will generally only be used internally to recursively parse nested
    blocks of text.

While it is not recommended, an extension could subclass or completely replace
the ``BlockParser``. The new class would have to provide the same public API.
However, be aware that other extensions may expect the core parser provided
and will not work with such a drastically different parser.

<h3 id="working_with_et">Working with the ElementTree</h3>

As mentioned, the Markdown parser converts a source document to an
[ElementTree][] object before serializing that back to Unicode text.
Markdown has provided some helpers to ease that manipulation within the context
of the Markdown module.

First, to get access to the ElementTree module, import ElementTree from
``markdown`` rather than importing it directly. This will ensure you are using
the same version of ElementTree as markdown. The module is named ``etree``
within Markdown.

    from markdown import etree

``markdown.etree`` tries to import ElementTree from any known location, first
as a standard library module (from ``xml.etree`` in Python 2.5), then as a third
party package (``elementtree``).
In each instance, ``cElementTree`` is tried
first, then ``ElementTree`` if the faster C implementation is not available on
your system.

Sometimes you may want text inserted into an element to be parsed by
[InlinePatterns][]. In such a situation, simply insert the text as you normally
would and the text will be automatically run through the InlinePatterns.
However, if you do *not* want some text to be parsed by InlinePatterns,
then insert the text as an ``AtomicString``.

    some_element.text = markdown.AtomicString(some_text)

Here's a basic example which creates an HTML table (note that the contents of
the second cell (``td2``) will be run through InlinePatterns later):

    table = etree.Element("table")
    table.set("cellpadding", "2")                      # Set cellpadding to 2
    tr = etree.SubElement(table, "tr")                 # Add child tr to table
    td1 = etree.SubElement(tr, "td")                   # Add child td1 to tr
    td1.text = markdown.AtomicString("Cell content")   # Add plain text content
    td2 = etree.SubElement(tr, "td")                   # Add second td to tr
    td2.text = "*text* with **inline** formatting."    # Add markup text
    table.tail = "Text after table"                    # Add text after table

You can also manipulate an existing tree. Consider the following example which
adds a ``class`` attribute to ``<a>`` elements:

    def set_link_class(self, element):
        for child in element:
            if child.tag == "a":
                child.set("class", "myclass")  # set the class attribute
            self.set_link_class(child)  # run recursively on children

For more information about working with ElementTree see the ElementTree
[Documentation](http://effbot.org/zone/element-index.htm)
([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)).
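
As a runnable illustration of manipulating an existing tree, the following
sketch applies the same recursive idea outside of Markdown. It assumes the
standard library's ``xml.etree.ElementTree`` in place of ``markdown.etree``
purely so that it runs on its own, and the ``css_class`` parameter and the
sample document are made up for the example.

```python
import xml.etree.ElementTree as etree


def set_link_class(element, css_class="myclass"):
    """Recursively set a class attribute on every <a> below `element`."""
    for child in element:
        if child.tag == "a":
            child.set("class", css_class)
        set_link_class(child, css_class)  # descend into the children


# A small hand-written tree standing in for the parser's output.
root = etree.fromstring(
    '<div><p>See <a href="http://example.com">this</a>.</p>'
    '<a href="#top">top</a></div>'
)
set_link_class(root)

classes = [a.get("class") for a in root.iter("a")]
print(classes)
print(etree.tostring(root, encoding="unicode"))
```

Both links receive the attribute, including the one nested inside the
paragraph, because the function recurses into every child rather than only
the top level.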

<h3 id="integrating_into_markdown">Integrating Your Code Into Markdown</h3>

Once you have the various pieces of your extension built, you need to tell
Markdown about them and ensure that they are run in the proper sequence.
Markdown accepts an ``Extension`` instance for each extension. Therefore, you
will need to define a class that extends ``markdown.Extension`` and over-rides
the ``extendMarkdown`` method. Within this class you will manage configuration
options for your extension and attach the various processors and patterns to
the Markdown instance.

It is important to note that the order of the various processors and patterns
matters. For example, if we replace ``http://...`` links with ``<a>`` elements, and
*then* try to deal with inline html, we will end up with a mess. Therefore,
the various types of processors and patterns are stored within an instance of
the Markdown class in [OrderedDict][]s. Your ``Extension`` class will need to
manipulate those OrderedDicts appropriately. You may insert instances of your
processors and patterns into the appropriate location in an OrderedDict, remove
a built-in instance, or replace a built-in instance with your own.

<h4 id="extendmarkdown">extendMarkdown</h4>

The ``extendMarkdown`` method of a ``markdown.Extension`` class accepts two
arguments:

* **``md``**:

    A pointer to the instance of the Markdown class. You should use this to
    access the [OrderedDict][]s of processors and patterns.
    They are found
    under the following attributes:

    * ``md.preprocessors``
    * ``md.inlinePatterns``
    * ``md.parser.blockprocessors``
    * ``md.treeprocessors``
    * ``md.postprocessors``

    Some other things you may want to access in the markdown instance are:

    * ``md.htmlStash``
    * ``md.output_formats``
    * ``md.set_output_format()``
    * ``md.registerExtension()``

* **``md_globals``**:

    Contains all the various global variables within the markdown module.

Of course, with access to those items, theoretically you have the option of
changing anything through various [monkey_patching][] techniques. However, you
should be aware that the various undocumented or private parts of markdown
may change without notice and your monkey_patches may break with a new release.
Therefore, what you really should be doing is inserting processors and patterns
into the markdown pipeline. Consider yourself warned.

[monkey_patching]: http://en.wikipedia.org/wiki/Monkey_patch

A simple example:

    class MyExtension(markdown.Extension):
        def extendMarkdown(self, md, md_globals):
            # Insert instance of 'mypattern' before 'references' pattern
            md.inlinePatterns.add('mypattern', MyPattern(md), '<references')

<h4 id="ordereddict">OrderedDict</h4>

An OrderedDict is a dictionary-like object that retains the order of its
items. The items are ordered in the order in which they were appended to
the OrderedDict. However, an item can also be inserted into the OrderedDict
in a specific location in relation to the existing items.

Think of OrderedDict as a combination of a list and a dictionary as it has
methods common to both. For example, you can get and set items using the
``od[key] = value`` syntax and the methods ``keys()``, ``values()``, and
``items()`` work as expected with the keys, values and items returned in the
proper order.
At the same time, you can use ``insert()``, ``append()``, and
``index()`` as you would with a list.

Generally speaking, within Markdown extensions you will be using the special
helper method ``add()`` to add additional items to an existing OrderedDict.

The ``add()`` method accepts three arguments:

* **``key``**: A string. The key is used for later reference to the item.

* **``value``**: The object instance stored in this item.

* **``location``**: Optional. The item's location in relation to other items.

    Note that the location can consist of a few different values:

    * The special strings ``"_begin"`` and ``"_end"`` insert that item at the
      beginning or end of the OrderedDict respectively.

    * A less-than sign (``<``) followed by an existing key (i.e.:
      ``"<somekey"``) inserts that item before the existing key.

    * A greater-than sign (``>``) followed by an existing key (i.e.:
      ``">somekey"``) inserts that item after the existing key.

Consider the following example:

    >>> import markdown
    >>> od = markdown.OrderedDict()
    >>> od['one'] = 1           # The same as: od.add('one', 1, '_begin')
    >>> od['three'] = 3         # The same as: od.add('three', 3, '>one')
    >>> od['four'] = 4          # The same as: od.add('four', 4, '_end')
    >>> od.items()
    [('one', 1), ('three', 3), ('four', 4)]

Note that when building an OrderedDict in order, the extra features of the
``add`` method offer no real value and are not necessary. However, when
manipulating an existing OrderedDict, ``add`` can be very helpful. So let's
insert another item into the OrderedDict.

    >>> od.add('two', 2, '>one')         # Insert after 'one'
    >>> od.values()
    [1, 2, 3, 4]

Now let's insert another item.

    >>> od.add('twohalf', 2.5, '<three') # Insert before 'three'
    >>> od.keys()
    ['one', 'two', 'twohalf', 'three', 'four']

Note that we also could have set the location of "twohalf" to be 'after two'
(i.e.: ``'>two'``). However, it's unlikely that you will have control over the
order in which extensions will be loaded, and this could affect the final
sorted order of an OrderedDict. For example, suppose an extension adding
'twohalf' in the above examples was loaded before a separate extension which
adds 'two'. You may need to take this into consideration when adding your
extension components to the various markdown OrderedDicts.

Once an OrderedDict is created, the items are available via key:

    MyNode = od['somekey']

Therefore, to delete an existing item:

    del od['somekey']

To change the value of an existing item (leaving location unchanged):

    od['somekey'] = MyNewObject()

To change the location of an existing item:

    od.link('somekey', '<otherkey')

<h4 id="registerextension">registerExtension</h4>

Some extensions may need to have their state reset between multiple runs of the
Markdown class. For example, consider the following use of the [[Footnotes]]
extension:

    md = markdown.Markdown(extensions=['footnotes'])
    html1 = md.convert(text_with_footnote)
    md.reset()
    html2 = md.convert(text_without_footnote)

Without calling ``reset``, the footnote definitions from the first document will
be inserted into the second document as they are still stored within the class
instance. Therefore the ``Extension`` class needs to define a ``reset`` method
that will reset the state of the extension (i.e.: ``self.footnotes = {}``).
However, as many extensions do not have a need for ``reset``, ``reset`` is only
called on extensions that are registered.
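
The relationship between registration and ``reset`` can be modelled in a few
lines. This is only a toy model under stated assumptions: ``ToyMarkdown`` and
``ToyFootnoteExtension`` are hypothetical stand-ins, and the
``registeredExtensions`` list is an assumed name for wherever the real class
keeps its registered extensions.

```python
class ToyMarkdown:
    """Hypothetical stand-in for the Markdown class (reset handling only)."""

    def __init__(self):
        self.registeredExtensions = []  # assumed attribute name

    def registerExtension(self, extension):
        self.registeredExtensions.append(extension)

    def reset(self):
        # Only extensions that registered themselves are reset.
        for extension in self.registeredExtensions:
            extension.reset()


class ToyFootnoteExtension:
    """Hypothetical extension that keeps per-document state."""

    def __init__(self):
        self.footnotes = {}

    def extendMarkdown(self, md, md_globals):
        md.registerExtension(self)

    def reset(self):
        self.footnotes = {}


md = ToyMarkdown()
ext = ToyFootnoteExtension()
ext.extendMarkdown(md, {})

ext.footnotes['1'] = 'left over from the first document'
md.reset()
print(ext.footnotes)  # {}
```

An extension that never calls ``registerExtension`` would simply keep its
state across ``reset`` calls in this model.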

To register an extension, call ``md.registerExtension`` from within your
``extendMarkdown`` method:

    def extendMarkdown(self, md, md_globals):
        md.registerExtension(self)
        # insert processors and patterns here

Then, each time ``reset`` is called on the Markdown instance, the ``reset``
method of each registered extension will be called as well. You should also
note that ``reset`` will be called on each registered extension after it is
initialized the first time. Keep that in mind when over-riding the extension's
``reset`` method.

<h4 id="configsettings">Config Settings</h4>

If an extension uses any parameters that the user may want to change,
those parameters should be stored in ``self.config`` of your
``markdown.Extension`` class in the following format:

    self.config = {parameter_1_name : [value1, description1],
                   parameter_2_name : [value2, description2] }

When stored this way the config parameters can be over-ridden from the
command line or at the time Markdown is initiated:

    markdown.py -x myextension(SOME_PARAM=2) inputfile.txt > output.txt

Note that parameters should always be assumed to be set to string
values, and should be converted at run time. For example:

    i = int(self.getConfig("SOME_PARAM"))

<h4 id="makeextension">makeExtension</h4>

Each extension should ideally be placed in its own module starting
with the ``mdx_`` prefix (e.g. ``mdx_footnotes.py``). The module must
provide a module-level function called ``makeExtension`` that takes
an optional parameter consisting of a dictionary of configuration over-rides
and returns an instance of the extension.
An example from the footnote
extension:

    def makeExtension(configs=None):
        return FootnoteExtension(configs=configs)

By following the above example, when Markdown is passed the name of your
extension as a string (i.e.: ``'footnotes'``), it will automatically import
the module and call the ``makeExtension`` function to initialize your extension.

You may have noted that the extensions packaged with Python-Markdown do not
use the ``mdx_`` prefix in their module names. This is because they are all
part of the ``markdown.extensions`` package. Markdown will first try to import
from ``markdown.extensions.extname`` and upon failure, ``mdx_extname``. If both
fail, Markdown will continue without the extension.

However, Markdown will also accept an already existing instance of an extension.
For example:

    import markdown
    import myextension
    configs = {...}
    myext = myextension.MyExtension(configs=configs)
    md = markdown.Markdown(extensions=[myext])

This is useful if you need to implement a large number of extensions with more
than one residing in a module.

[Preprocessors]: #preprocessors
[InlinePatterns]: #inlinepatterns
[Treeprocessors]: #treeprocessors
[Postprocessors]: #postprocessors
[BlockParser]: #blockparser
[Working with the ElementTree]: #working_with_et
[Integrating your code into Markdown]: #integrating_into_markdown
[extendMarkdown]: #extendmarkdown
[OrderedDict]: #ordereddict
[registerExtension]: #registerextension
[Config Settings]: #configsettings
[makeExtension]: #makeextension
[ElementTree]: http://effbot.org/zone/element-index.htm