:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

.. note::

   Unless explicitly noted otherwise, these functions support :class:`int`,
   :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
   Behaviour with other types (whether in the numeric tower or not) is
   currently unsupported.  Collections with a mix of types are also undefined
   and implementation-dependent.  If your input data consists of mixed types,
   you may be able to use :func:`map` to ensure a consistent result, for
   example ``map(float, input_data)``.
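
As a small illustration of the coercion suggested in the note above, the
sketch below converts a mixed collection to :class:`float` before averaging.
The ``input_data`` values are invented for the example, and converting to
float may lose precision for some ``Decimal`` or ``Fraction`` values.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Hypothetical mixed-type input: int, Fraction, Decimal and float together.
input_data = [1, Fraction(1, 2), Decimal("2.5"), 4.0]

# Coerce every value to float so the data has one consistent type.
uniform_data = list(map(float, input_data))

result = mean(uniform_data)
print(result)
```

Whether float is the right target type depends on the data; coercing
everything to ``Fraction`` or ``Decimal`` instead would follow the same
pattern.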

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for readability, most of the examples show sorted sequences.
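
A brief check of the point above, using invented data: the same values in a
different order, or supplied through an iterator, give the same results.

```python
from statistics import mean, median

sorted_data = [1, 3, 5, 7, 9]
shuffled_data = [7, 1, 9, 3, 5]  # same values, different order

same_median = median(sorted_data) == median(shuffled_data)
same_mean = mean(sorted_data) == mean(shuffled_data)

# An iterator is accepted as well as a sequence.
mean_from_iterator = mean(iter(shuffled_data))

print(same_median, same_mean, mean_from_iterator)
```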

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, which can be a sequence or iterator.

   The arithmetic mean is the sum of the data divided by the number of data
   points.  It is commonly called "the average", although it is only one of many
   different mathematical averages.  It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points.  For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`.  (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population.  If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.
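
To make the definition of :func:`mean` and its sensitivity to outliers
concrete, the sketch below computes the mean directly as sum divided by count,
then adds a single extreme value. The value 1000 is an arbitrary outlier
chosen for the example.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 4]

# The arithmetic mean by definition: sum of the data over the number of points.
by_definition = sum(data) / len(data)

# One extreme value drags the mean far from the bulk of the data,
# while the median barely moves.
with_outlier = data + [1000]
mean_with_outlier = mean(with_outlier)      # 169
median_with_outlier = median(with_outlier)  # 3.5

print(by_definition, mean_with_outlier, median_with_outlier)
```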


.. function:: harmonic_mean(data)

   Return the harmonic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The harmonic mean, sometimes called the subcontrary mean, is the
   reciprocal of the arithmetic :func:`mean` of the reciprocals of the
   data. For example, the harmonic mean of three values *a*, *b* and *c*
   will be equivalent to ``3/(1/a + 1/b + 1/c)``.

   The harmonic mean is a type of average, a measure of the central
   location of the data.  It is often appropriate when averaging quantities
   which are rates or ratios, such as speeds.  For example:

   Suppose an investor purchases an equal value of shares in each of
   three companies, with P/E (price/earnings) ratios of 2.5, 3 and 10.
   What is the average P/E ratio for the investor's portfolio?

   .. doctest::

      >>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
      3.6

   Using the arithmetic mean would give an average of about 5.167, which
   is too high.

   :exc:`StatisticsError` is raised if *data* is empty, or any element
   is less than zero.

   .. versionadded:: 3.6
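
The formula quoted for :func:`harmonic_mean` can be sketched as below.
``harmonic_mean_sketch`` is a hypothetical helper written for this
illustration, not the module's implementation: it works in plain floating
point and does no validation of empty, zero or negative input. The speed
example (equal distances at 40 and 60 km/h) is likewise invented.

```python
import math
from statistics import harmonic_mean


def harmonic_mean_sketch(values):
    """Reciprocal of the arithmetic mean of the reciprocals (illustrative only)."""
    values = list(values)
    return len(values) / sum(1 / x for x in values)


pe_ratios = [2.5, 3, 10]
sketch_result = harmonic_mean_sketch(pe_ratios)
library_result = harmonic_mean(pe_ratios)

# Rates over equal distances: 40 km/h out, 60 km/h back.
average_speed = harmonic_mean([40, 60])

# Compare with a tolerance, since the sketch accumulates rounding error.
print(math.isclose(sketch_result, library_result), average_speed)
```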


.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method.  If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterator.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data.  When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suitable when your data are discrete, and you don't mind that the
   median may not be an actual data point.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
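
The "mean of middle two" method described for :func:`median` can be sketched
roughly as follows. ``median_sketch`` is a hypothetical name used only here,
and it raises a plain :exc:`ValueError` as a stand-in for the module's own
exception.

```python
from statistics import median


def median_sketch(data):
    """Illustrative 'mean of middle two' median; not the library code."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median_sketch([5, 1, 3])
even_result = median_sketch([7, 1, 5, 3])
print(odd_result, even_result)
```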


.. function:: median_low(data)

   Return the low median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterator.

   The low median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of numeric data.  If *data* is empty,
   :exc:`StatisticsError` is raised.  *data* can be a sequence or iterator.

   The high median is always a member of the data set.  When the number of data
   points is odd, the middle value is returned.  When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.
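
A side-by-side comparison may help when choosing between :func:`median`,
:func:`median_low` and :func:`median_high`. The ``readings`` list is made up
for the example and has an even number of points, which is the only case
where the three functions differ.

```python
from statistics import median, median_high, median_low

readings = [7, 1, 5, 3]  # even number of points, deliberately unsorted

interpolated = median(readings)  # 4.0, not one of the readings
low = median_low(readings)       # 3, the smaller middle value
high = median_high(readings)     # 5, the larger middle value

print(interpolated, low, high)
```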


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation.  If *data* is empty, :exc:`StatisticsError`
   is raised.  *data* can be a sequence or iterator.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc.  With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1.  Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats.  This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <https://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
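
One common way to write the interpolation used for grouped data is
``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of the class
containing the middle value, ``cf`` is the number of points below that class
and ``f`` is the number of points in it. The sketch below assumes this textbook
formula; ``median_grouped_sketch`` is a hypothetical helper, and while it
agrees with :func:`median_grouped` on the examples above, it is not claimed to
match the library in every case.

```python
import math
from statistics import median_grouped


def median_grouped_sketch(data, interval=1):
    """Illustrative grouped median via linear interpolation within a class."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    x = ordered[n // 2]                 # midpoint of the median class
    lower_limit = x - interval / 2      # L: lower edge of that class
    below = sum(1 for value in ordered if value < x)  # cf
    in_class = ordered.count(x)                       # f
    return lower_limit + interval * (n / 2 - below) / in_class


rounded = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
sketch_result = median_grouped_sketch(rounded)
wide_result = median_grouped_sketch([1, 3, 3, 5, 7], interval=2)
print(sketch_result, wide_result)
```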


.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*.  The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic which also applies
   to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'
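
The idea of "exactly one most common value" can be sketched with
:class:`collections.Counter`. ``unique_mode_sketch`` is a hypothetical helper
for illustration only; it raises a plain :exc:`ValueError` where the text
above describes :exc:`StatisticsError`, and it is not the module's
implementation.

```python
from collections import Counter
from statistics import mode


def unique_mode_sketch(data):
    """Return the single most common value, or raise if there is none."""
    counts = Counter(data)
    if not counts:
        raise ValueError("no mode for empty data")
    ranked = counts.most_common(2)
    # A tie for first place means there is no unique mode.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        raise ValueError("no unique mode")
    return ranked[0][0]


colours = ["red", "blue", "blue", "red", "green", "red", "red"]
sketch_result = unique_mode_sketch(colours)
print(sketch_result)
```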


.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance).  See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251
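
The relationship stated for :func:`pstdev` can be checked directly. The
comparison below uses :func:`math.isclose` because the two routes may differ
in the last floating-point digit.

```python
import math
from statistics import pstdev, pvariance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# Population standard deviation as the square root of the population variance.
from_variance = math.sqrt(pvariance(data))
direct = pstdev(data)

print(math.isclose(from_variance, direct))
```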


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population.  To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*.  Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ².  When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument.  Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.
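
As a rough illustration of what "variance with N degrees of freedom" means
for :func:`pvariance`, the sketch below averages the squared deviations from
the mean over all N points. ``pvariance_sketch`` is a hypothetical helper, not
the library code: it does no input validation and uses naive summation.

```python
import math
from fractions import Fraction
from statistics import pvariance


def pvariance_sketch(data, mu=None):
    """Mean squared deviation from mu, dividing by N (illustrative only)."""
    values = list(data)
    if mu is None:
        mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
sketch_result = pvariance_sketch(data)

# With Fraction input the arithmetic stays exact.
exact_result = pvariance_sketch([Fraction(1, 4), Fraction(5, 4), Fraction(1, 2)])

print(sketch_result, exact_result)
```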


.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance).  See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827
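
The sketch below relates :func:`stdev` to :func:`variance` and to
:func:`pstdev` on the same data. The factor ``sqrt(n / (n - 1))`` follows from
the sample variance dividing by ``n - 1`` where the population variance
divides by ``n``; results are compared with a tolerance because of
floating-point rounding.

```python
import math
from statistics import pstdev, stdev, variance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]
n = len(data)

# Sample standard deviation as the square root of the sample variance.
from_variance = math.sqrt(variance(data))

# The same quantity obtained by rescaling the population standard deviation.
from_pstdev = pstdev(data) * math.sqrt(n / (n - 1))

print(math.isclose(stdev(data), from_variance),
      math.isclose(stdev(data), from_pstdev))
```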


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers.  Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data.  A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*.  If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*.  Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom.  Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ, you should pass it to
      the :func:`pvariance` function as the *mu* parameter to get the variance
      of a sample.
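
Bessel's correction, as described for :func:`variance`, amounts to dividing
the sum of squared deviations by ``n - 1`` rather than ``n``.
``variance_sketch`` below is a hypothetical helper written for this
illustration; it uses naive summation and a plain :exc:`ValueError`, so it is
only an approximation of what the library does.

```python
import math
from statistics import pvariance, variance


def variance_sketch(data, xbar=None):
    """Sum of squared deviations divided by n - 1 (illustrative only)."""
    values = list(data)
    n = len(values)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    if xbar is None:
        xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
sketch_result = variance_sketch(data)

# Equivalent view: rescale the population variance by n / (n - 1).
n = len(data)
rescaled = pvariance(data) * n / (n - 1)

print(math.isclose(sketch_result, variance(data)),
      math.isclose(rescaled, variance(data)))
```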

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
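
Because :exc:`StatisticsError` is a subclass of :exc:`ValueError`, it can be
caught either specifically or through the broader class. The sketch below
shows one possible pattern; ``mean_or_default`` is a hypothetical helper, not
part of the module.

```python
from statistics import StatisticsError, mean


def mean_or_default(data, default=None):
    """Return mean(data), or *default* when the data is empty."""
    try:
        return mean(data)
    except StatisticsError:
        return default


empty_result = mean_or_default([])
normal_result = mean_or_default([1, 2, 3])
is_value_error = issubclass(StatisticsError, ValueError)

print(empty_result, normal_result, is_value_error)
```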

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;
