:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python. The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type. The exact type can be determined by
checking the ``exact_type`` property on the :term:`named tuple` returned from
:func:`tokenize.tokenize`.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object which provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects. Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found. The line passed (the last tuple
   item) is the *logical* line; continuation lines are included. The 5-tuple
   is returned as a :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens. For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking
   for a UTF-8 BOM or encoding cookie, according to :pep:`263`.


All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.

Another function is provided to reverse the tokenization process. This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code. The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string. The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured. The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   It returns bytes, encoded using the :data:`~token.ENCODING` token, which
   is the first token sequence output by :func:`.tokenize`.
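
For example, a minimal round trip through :func:`.tokenize` and
:func:`.untokenize` might look like the sketch below (the source string is
only an illustration; any bytes-mode *readline* works the same way)::

    from io import BytesIO
    from tokenize import tokenize, untokenize

    source = b"total = 1 + 2\n"
    tokens = list(tokenize(BytesIO(source).readline))

    # Each item is a named tuple; exact_type refines OP tokens.
    for tok in tokens:
        print(tok.type, tok.string, tok.start, tok.end, tok.exact_type)

    # Feeding the full token sequence back reproduces the source as bytes,
    # encoded according to the leading ENCODING token.
    print(untokenize(tokens))
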
:func:`.tokenize` needs to detect the encoding of source files it tokenizes.
The function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file. It requires one argument,
   *readline*, in the same way as the :func:`.tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as the encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`.open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.


.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be
raised. They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
tokenization of their contents.
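
As a small illustration of encoding detection (the byte string below is just
an example input), :func:`detect_encoding` can be driven by the ``readline``
method of an :class:`io.BytesIO` object::

    from io import BytesIO
    from tokenize import detect_encoding

    source = b"# -*- coding: utf-8 -*-\nprint('hi')\n"
    encoding, lines = detect_encoding(BytesIO(source).readline)

    print(encoding)  # 'utf-8', taken from the encoding cookie
    print(lines)     # the raw, undecoded line(s) read while detecting
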
.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
------------------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line. The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: shell-session

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

    $ python -m tokenize -e hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          LPAR           '('
    1,14-1,15:          RPAR           ')'
    1,15-1,16:          COLON          ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           LPAR           '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          RPAR           ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           LPAR           '('
    4,10-4,11:          RPAR           ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''
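
Roughly the same information as the ``-e`` output above can be obtained
programmatically. The sketch below assumes a file named :file:`hello.py`
containing the script shown above, and opens it in binary mode because
:func:`.tokenize` expects *readline* to return bytes::

    import tokenize

    # hello.py is assumed to contain the example script from above.
    with open("hello.py", "rb") as f:
        for tok in tokenize.tokenize(f.readline):
            print("%d,%d-%d,%d:" % (tok.start + tok.end),
                  tokenize.tok_name[tok.exact_type], repr(tok.string))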