:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell. This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string). This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.


.. function:: quote(s)

   Return a shell-escaped version of the string *s*. The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe::

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole::

      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object. The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string. If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute. If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin". The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode. When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules. The *punctuation_chars* argument provides a way to make the
   behaviour even closer to how real shells parse. This can take a number of
   values: the default value, ``False``, preserves the behaviour seen under
   Python 3.5 and earlier.
   If set to ``True``, then parsing of the characters
   ``();<>|&`` is changed: any run of these characters (considered punctuation
   characters) is returned as a single token. If set to a non-empty string of
   characters, those characters will be used as the punctuation characters. Any
   characters in the :attr:`wordchars` attribute that appear in
   *punctuation_chars* will be removed from :attr:`wordchars`. See
   :ref:`improved-shell-compatibility` for more information.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.


.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below) this method is given the following token as argument, and expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument.
   If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles
   ``#include "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack. If the filename argument is
   specified, it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.


.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
   accented characters in the Latin-1 set are also included. If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be treated as escape characters. This is only used in
   POSIX mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments. If this attribute is ``True``,
   :attr:`punctuation_chars` will have no effect, and splitting will happen
   only on whitespace. When using :attr:`punctuation_chars`, which is
   intended to provide parsing closer to that implemented by shells, it is
   advisable to leave ``whitespace_split`` as ``False`` (the default value).


.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default. If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream.
   Source
   requests may be stacked any number of levels deep.


.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior. If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   Characters that will be considered punctuation. Runs of punctuation
   characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, '>>>' could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token.
  If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`. The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`shlex` class provides compatibility with the parsing performed by
common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
this compatibility, specify the ``punctuation_chars`` argument in the
constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token.
While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise. To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> import shlex
   >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
   >>> list(shlex.shlex(text))
   ['a', '&', '&', 'b', ';', 'c', '&', '&', 'd', '|', '|', 'e', ';', 'f', '>',
   "'abc'", ';', '(', 'def', '"ghi"', ')']
   >>> list(shlex.shlex(text, punctuation_chars=True))
   ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', "'abc'",
   ';', '(', 'def', '"ghi"', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the punctuation_chars parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation. For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``. That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)
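
As a minimal sketch of combining ``punctuation_chars=True`` with
``posix=True``, the following splits a command line into words and operator
tokens. The helper name ``tokenize_command`` and the sample command line are
illustrative assumptions and are not part of the module:

```python
import shlex


def tokenize_command(line):
    """Split a shell-like command line into words and operator tokens.

    ``tokenize_command`` is a hypothetical helper for illustration only.
    """
    # posix=True strips quotes from quoted words; punctuation_chars=True
    # groups runs of ();<>|& into single tokens such as '&&'.
    lexer = shlex.shlex(line, posix=True, punctuation_chars=True)
    return list(lexer)


tokens = tokenize_command("make && ./run 'my file' > out.log")
print(tokens)
# ['make', '&&', './run', 'my file', '>', 'out.log']
```

The quoted argument comes back as the single word ``my file`` with its quotes
removed, while ``&&`` and ``>`` are returned as separate tokens. As noted
above, no check is made that the operator tokens are valid for any particular
shell.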