:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell. This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string). This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.

   .. deprecated:: 3.9
      Passing ``None`` for *s* will raise an exception in future Python
      versions.

.. function:: join(split_command)

   Concatenate the tokens of the list *split_command* and return a string.
   This function is the inverse of :func:`split`.

      >>> from shlex import join
      >>> print(join(['echo', '-n', 'Multiple words']))
      echo -n 'Multiple words'

   The returned value is shell-escaped to protect against injection
   vulnerabilities (see :func:`quote`).

   .. versionadded:: 3.8


.. function:: quote(s)

   Return a shell-escaped version of the string *s*. The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe:

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole:

      >>> from shlex import quote
      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> from shlex import split
      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object. The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string. If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute. If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin".
   The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode. When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules. The *punctuation_chars* argument provides a way to make the
   behaviour even closer to how real shells parse. This can take a number of
   values: the default value, ``False``, preserves the behaviour seen under
   Python 3.5 and earlier. If set to ``True``, then parsing of the characters
   ``();<>|&`` is changed: any run of these characters (considered punctuation
   characters) is returned as a single token. If set to a non-empty string of
   characters, those characters will be used as the punctuation characters. Any
   characters in the :attr:`wordchars` attribute that appear in
   *punctuation_chars* will be removed from :attr:`wordchars`. See
   :ref:`improved-shell-compatibility` for more information. *punctuation_chars*
   can be set only upon :class:`~shlex.shlex` instance creation and can't be
   modified later.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.


.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests. (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below) this method is given the following token as argument, and expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages.
   This is the same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.


.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
   accented characters in the Latin-1 set are also included. If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there. If :attr:`whitespace_split` is set to ``True``, this will have no
   effect.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens. By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This will be only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments. When used in combination with
   :attr:`punctuation_chars`, tokens will be split on whitespace in addition to
   those characters.

   .. versionchanged:: 3.8
      The :attr:`punctuation_chars` attribute was made compatible with the
      :attr:`whitespace_split` attribute.


.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default.
   If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells. That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream. Source
   requests may be stacked any number of levels deep.


.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior. If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``) in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   A read-only property. Characters that will be considered punctuation. Runs of
   punctuation characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, '>>>' could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.


* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`. The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`shlex` class provides compatibility with the parsing performed by
common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
this compatibility, specify the ``punctuation_chars`` argument in the
constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token. While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise. To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

    >>> import shlex
    >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
    >>> s = shlex.shlex(text, posix=True)
    >>> s.whitespace_split = True
    >>> list(s)
    ['a', '&&', 'b;', 'c', '&&', 'd', '||', 'e;', 'f', '>abc;', '(def', 'ghi)']
    >>> s = shlex.shlex(text, posix=True, punctuation_chars=True)
    >>> s.whitespace_split = True
    >>> list(s)
    ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', 'abc', ';',
    '(', 'def', 'ghi', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the ``punctuation_chars`` parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation.
For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``. That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

   However, to match the shell as closely as possible, it is recommended to
   always use ``posix`` and :attr:`~shlex.whitespace_split` when using
   :attr:`~shlex.punctuation_chars`, which will negate
   :attr:`~shlex.wordchars` entirely.

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)
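
As a rough illustration of the recommendation above, the following sketch
combines ``posix=True``, ``punctuation_chars=True`` and
:attr:`~shlex.whitespace_split` in one place. The helper name
``tokenize_command`` and the sample command line are hypothetical and not part
of the module; the sketch assumes Python 3.8 or later, where
:attr:`~shlex.punctuation_chars` and :attr:`~shlex.whitespace_split` can be
used together.

.. code-block:: python

   import shlex

   def tokenize_command(line):
       """Hypothetical helper: split *line* into shell-like tokens."""
       lexer = shlex.shlex(line, posix=True, punctuation_chars=True)
       # Split words on whitespace only; runs of punctuation characters
       # such as ';' or '|' are still returned as separate tokens.
       lexer.whitespace_split = True
       return list(lexer)

   tokens = tokenize_command('echo "hello world" > out.txt; ls -l | wc -l')
   print(tokens)

The returned tokens are not checked for validity as shell syntax, so any such
checks remain up to the caller.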