Distribution estimator

daspi.statistics.estimation.DistributionEstimator(samples, dist=None, possible_dists=DIST.COMMON, nan_policy='omit')

Bases: BaseEstimator

A class to estimate the distribution of a given 1D numeric sample.

This class provides methods to estimate the distribution by fitting continuous distributions from scipy.stats to the provided samples. It uses the Kolmogorov-Smirnov test to evaluate how well each distribution fits the data. Among the candidates, the distribution with the highest p-value is considered the best fit.

PARAMETER DESCRIPTION
samples

The 1D numeric sample for which the distribution is to be estimated. This should be a Series or array-like object containing numeric values.

TYPE: NumericSample1D

dist

Distribution to which the data is assumed to be subject. Only continuous distributions of scipy.stats are allowed. If None, the distribution is estimated from the samples.

TYPE: str | rv_continuous | None DEFAULT: None

possible_dists

Candidate distributions to which the data may be subject. Only continuous distributions of scipy.stats are allowed. By default DIST.COMMON.

TYPE: tuple of strings or rv_continuous DEFAULT: DIST.COMMON

nan_policy

How to handle NaN values in the samples:
- 'propagate': NaN values are preserved in the analysis.
- 'raise': Raises an error if NaN values are found.
- 'omit': Omits NaN values from the analysis (default).

TYPE: Literal['propagate', 'raise', 'omit'] DEFAULT: 'omit'

RAISES DESCRIPTION
ValueError

If NaN values are found in the samples and nan_policy is set to 'raise'.

UserWarning

If NaN values are found in the samples and nan_policy is set to 'omit' or 'propagate'. The warning indicates that NaN values will be omitted from the analysis or may lead to unexpected results.

Sources

The theoretical quantiles and percentiles are calculated the same way as statsmodels ProbPlot class does, see: https://www.statsmodels.org/dev/_modules/statsmodels/graphics/gofplots.html

possible_dists = possible_dists instance-attribute

Distributions given during initialization to which the data may be subject.

dist_name property

Get the name of the estimated distribution (read-only).

dist property writable

This is the generic continuous distribution class of the provided or evaluated distribution.

Set the distribution to be used for estimation. If a string is provided, it will be converted to a continuous distribution class using ensure_generic. If None, the distribution will be estimated from the samples.

frozen property

This is the frozen continuous RV object of dist property (read-only).

D property

Get the Kolmogorov-Smirnov test statistic, either D, D+ or D-.

shape_params property

Estimates for any distribution shape parameters (if applicable), followed by those for location and scale. For most random variables, shape statistics will be returned, but there are exceptions (e.g. norm). Can be used to generate values with the help of the dist attribute (read-only).

loc property

Get the loc parameter from shape_params (read-only).

scale property

Get the scale parameter from shape_params (read-only).

p_ks property

Get the two-tailed p-value of the Kolmogorov-Smirnov test for the provided or fitted distribution. A higher p-value indicates a better fit to the data (read-only).

excess property

Get the Fisher kurtosis (excess) of the filtered samples (read-only). Calculations are corrected for statistical bias.

When the excess is close to zero, the tail weight of the distribution corresponds to that of a normal distribution. Distributions with negative excess kurtosis are said to be platykurtic; they produce fewer and/or less extreme outliers than the normal distribution (e.g. the uniform distribution has no outliers). Distributions with positive excess kurtosis are said to be leptokurtic; they produce more outliers than the normal distribution (e.g. the Laplace distribution, whose tails approach zero more slowly than those of a Gaussian).

- excess < 0: less extreme outliers than normal distribution
- excess > 0: more extreme outliers than normal distribution

p_excess property

Get the probability that the excess of the population that the sample was drawn from is the same as that of a corresponding normal distribution (read-only).
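
To make the sign convention of the excess concrete, the following sketch compares the distributions named above using scipy.stats directly rather than this class. The helper name `sample_excess` is hypothetical, and the use of `bias=False` and `scipy.stats.kurtosistest` is an assumption about how the bias correction and the p-value might be obtained, not a statement about the class internals.

```python
import numpy as np
from scipy import stats

# Theoretical excess (Fisher) kurtosis of the distributions mentioned above.
excess_uniform = float(stats.uniform.stats(moments="k"))  # platykurtic (< 0)
excess_norm = float(stats.norm.stats(moments="k"))        # reference (0)
excess_laplace = float(stats.laplace.stats(moments="k"))  # leptokurtic (> 0)


def sample_excess(samples):
    """Return the bias-corrected Fisher kurtosis and a kurtosis-test p-value."""
    x = np.asarray(samples, dtype=float)
    x = x[~np.isnan(x)]  # drop NaN values, comparable to nan_policy='omit'
    excess = stats.kurtosis(x, fisher=True, bias=False)
    p_value = stats.kurtosistest(x).pvalue
    return float(excess), float(p_value)


rng = np.random.default_rng(seed=0)
excess, p_excess = sample_excess(rng.uniform(size=2000))
```

For the uniform sample the estimated excess is negative, in line with the platykurtic case described above.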

skew property

Get the skewness of the filtered samples (read-only). Calculations are corrected for statistical bias.

For normally distributed data, the skewness should be about zero. For unimodal continuous distributions, a skewness value greater than zero means that there is more weight in the right tail of the distribution.

- skew < 0: left-skewed -> long tail left
- skew > 0: right-skewed -> long tail right

p_skew property

Get the probability that the skewness of the population that the sample was drawn from is the same as that of a corresponding normal distribution (read-only).

p_ad property

Get the probability that the filtered samples follow a normal distribution by performing an Anderson-Darling test (read-only).

theoretical_percentiles property

Get the theoretical percentiles (CDF values) of the sample data (read-only).

theoretical_quantiles property

Get the theoretical quantiles (osm, or order statistic medians) of the filtered samples. These quantiles are calculated the same way as in the scipy.stats.probplot() function (read-only).

sample_quantiles property

Get the sample quantiles (sorted filtered samples) (read-only).
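
As a point of comparison, the sketch below obtains order statistic medians and sorted samples from scipy.stats.probplot itself. It does not use this class; the sample data and the choice of the normal distribution are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
samples = rng.normal(loc=10.0, scale=2.0, size=50)

# probplot returns the order statistic medians (osm) and the ordered
# responses (osr); the second tuple holds the least-squares line fit.
(osm, osr), (slope, intercept, r) = stats.probplot(samples, dist="norm")
```

Here `osm` plays the role of the theoretical quantiles and `osr` the role of the sample quantiles; plotting one against the other gives a probability plot.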

sample_percentiles property

Get the empirical percentiles (CDF values) of the sample data (read-only).

predicted property

Get the predicted values of the provided or evaluated distribution by using the order statistic medians (read-only).

log_likelihood property

Get the log-likelihood of the provided or evaluated distribution (read-only).

ss property

Get the sum of squared residuals (SS) using the sorted values and the predicted values (read-only).

aic property

Get the Akaike information criterion (AIC) (read-only).

bic property

Get the Bayesian information criterion (BIC) (read-only).
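
For orientation, the textbook definitions of both criteria can be sketched as follows. The function names are hypothetical, and whether the class counts loc and scale among the fitted parameters is an assumption that this page does not confirm.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood
```

For both criteria a lower value indicates a better trade-off between goodness of fit and the number of parameters.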

plotting_positions(nobs, alpha=0.0, beta=None) staticmethod

Generate a sequence of plotting positions.

PARAMETER DESCRIPTION
nobs

Number of probability points to plot

TYPE: int

alpha

alpha parameter for the plotting position of an expected order statistic

TYPE: float DEFAULT: 0.0

beta

beta parameter for the plotting position of an expected order statistic. If None, then beta is set to alpha.

TYPE: float | None DEFAULT: None

RETURNS DESCRIPTION
Series

The plotting positions

Notes

The plotting positions are given by

\[ i \in [1, nobs] \]
\[ \frac{i - \alpha}{nobs + 1 - \alpha - \beta} \]

For additional information on alpha and beta, see: scipy.stats.mstats.plotting_positions
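
The formula in the Notes can be written out in a few lines of plain Python. This is a minimal sketch under the stated defaults; the function name `plotting_positions_sketch` is hypothetical, and it returns a list rather than the Series documented above.

```python
def plotting_positions_sketch(nobs, alpha=0.0, beta=None):
    """Return (i - alpha) / (nobs + 1 - alpha - beta) for i = 1..nobs."""
    if beta is None:
        beta = alpha  # same fallback as described for the beta parameter
    denominator = nobs + 1 - alpha - beta
    return [(i - alpha) / denominator for i in range(1, nobs + 1)]


positions_default = plotting_positions_sketch(3)             # alpha = beta = 0
positions_hazen = plotting_positions_sketch(4, alpha=0.5)    # alpha = beta = 0.5
```

With alpha = beta = 0 the positions reduce to i / (nobs + 1); with alpha = beta = 0.5 they become (i - 0.5) / nobs.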

distribution()

Estimate the distribution by selecting the one from the provided distributions that best reflects the filtered data.

RETURNS DESCRIPTION
dist

A generic continuous distribution class of best fit

TYPE: scipy.stats rv_continuous

p

The two-tailed p-value for the best fit

TYPE: float

shape_params

Estimates for any shape parameters (if applicable), followed by those for location and scale. For most random variables, shape statistics will be returned, but there are exceptions (e.g. norm). Can be used to generate values with the help of the returned dist.

TYPE: Tuple[float, ...]

Notes

First, a Kolmogorov-Smirnov test is performed for each candidate distribution to obtain a p-value that indicates how well it fits the samples. The distribution with the highest p-value is selected as the best fit, because a higher p-value means there is less evidence against the hypothesis that the samples follow that distribution.
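
The selection procedure described in the Notes can be sketched with scipy.stats alone. This is an illustration of the idea, not the implementation of this method: the function name `best_fit_by_ks` and the candidate tuple are hypothetical, and the contents of DIST.COMMON are not assumed here.

```python
import numpy as np
from scipy import stats


def best_fit_by_ks(samples, candidates=("norm", "expon", "logistic")):
    """Fit each candidate and keep the one with the highest KS p-value."""
    x = np.asarray(samples, dtype=float)
    x = x[~np.isnan(x)]  # drop NaN values, comparable to nan_policy='omit'
    best = None
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(x)  # shape parameters (if any), then loc and scale
        p_value = stats.kstest(x, dist.cdf, args=params).pvalue
        if best is None or p_value > best[1]:
            best = (name, float(p_value), params)
    return best


rng = np.random.default_rng(seed=42)
name, p, shape_params = best_fit_by_ks(rng.exponential(scale=2.0, size=500))
```

Because the parameters are fitted on the same data that is then tested, the resulting p-values tend to be optimistic; they are best read as a ranking score rather than as exact probabilities.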

stable_variance(alpha=0.05, n_sections=3)

Test whether the variance remains stable across the samples.

The sample data is divided into sections and the variances of these sections are compared using the Levene test.

PARAMETER DESCRIPTION
alpha

Alpha risk of hypothesis tests. If a p-value is below this limit, the null hypothesis is rejected

TYPE: float DEFAULT: 0.05

n_sections

Number of sections to divide the filtered samples into, by default 3

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
stable

True if the p-value > alpha

TYPE: bool
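
A minimal sketch of such a check, using scipy.stats.levene on consecutive sections, could look as follows. The function name is hypothetical, and splitting the samples into consecutive, roughly equal sections is an assumption about how the subgroups are formed.

```python
import numpy as np
from scipy import stats


def stable_variance_sketch(samples, alpha=0.05, n_sections=3):
    """Return True if the Levene test does not reject equal variances."""
    x = np.asarray(samples, dtype=float)
    x = x[~np.isnan(x)]  # drop NaN values, comparable to nan_policy='omit'
    sections = np.array_split(x, n_sections)  # consecutive sections
    p_value = stats.levene(*sections).pvalue
    return bool(p_value > alpha)


rng = np.random.default_rng(seed=0)
# Spread grows from section to section, so the variance is not stable.
growing_spread = np.concatenate(
    [rng.normal(0.0, sd, size=100) for sd in (1.0, 10.0, 100.0)])
variance_is_stable = stable_variance_sketch(growing_spread)
```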

stable_mean(alpha=0.05, n_sections=3)

Test whether the mean remains stable across the samples.

The sample data is divided into sections and the means of these sections are compared using the F test.

PARAMETER DESCRIPTION
alpha

Alpha risk of hypothesis tests. If a p-value is below this limit, the null hypothesis is rejected

TYPE: float DEFAULT: 0.05

n_sections

Number of sections to divide the filtered samples into, by default 3

TYPE: int DEFAULT: 3

RETURNS DESCRIPTION
stable

True if the p-value > alpha

TYPE: bool
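
The mean check can be sketched in the same spirit with a one-way ANOVA F test from scipy.stats. The function name is hypothetical, and both the use of `scipy.stats.f_oneway` and the split into consecutive sections are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def stable_mean_sketch(samples, alpha=0.05, n_sections=3):
    """Return True if the F test does not reject equal section means."""
    x = np.asarray(samples, dtype=float)
    x = x[~np.isnan(x)]  # drop NaN values, comparable to nan_policy='omit'
    sections = np.array_split(x, n_sections)  # consecutive sections
    p_value = stats.f_oneway(*sections).pvalue
    return bool(p_value > alpha)


rng = np.random.default_rng(seed=0)
# Level shifts from section to section, so the mean is not stable.
drifting_level = np.concatenate(
    [rng.normal(mu, 1.0, size=50) for mu in (0.0, 5.0, 10.0)])
mean_is_stable = stable_mean_sketch(drifting_level)
```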

follows_norm_curve(alpha=0.05, excess_test=True, skew_test=True, ad_test=False)

Check whether the sample data follows a normal distribution by performing one or more of the following tests (depending on the input):
- Skewness test
- Excess (kurtosis) test
- Anderson-Darling test

PARAMETER DESCRIPTION
alpha

Alpha risk of hypothesis tests. If a p-value is below this limit, the null hypothesis is rejected

TYPE: float DEFAULT: 0.05

skew_test

If true, a skewness test will also be carried out, by default True

TYPE: bool DEFAULT: True

excess_test

If true, an excess test will also be carried out, by default True

TYPE: bool DEFAULT: True

ad_test

If true, an Anderson-Darling test will also be carried out, by default False

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
remain_h0

True if all p-values of the tests performed are greater than alpha, otherwise False

TYPE: bool

RAISES DESCRIPTION
AssertionError

If all flags are False.
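
The decision rule (every selected test must have a p-value above alpha) can be sketched with scipy.stats. The function name is hypothetical, the use of `scipy.stats.kurtosistest` and `scipy.stats.skewtest` is an assumption, and the Anderson-Darling branch is left out of this sketch.

```python
import numpy as np
from scipy import stats


def follows_norm_curve_sketch(samples, alpha=0.05,
                              excess_test=True, skew_test=True):
    """Return True if no selected test rejects normality at level alpha."""
    assert excess_test or skew_test, "At least one test must be selected"
    x = np.asarray(samples, dtype=float)
    x = x[~np.isnan(x)]  # drop NaN values, comparable to nan_policy='omit'
    p_values = []
    if excess_test:
        p_values.append(stats.kurtosistest(x).pvalue)
    if skew_test:
        p_values.append(stats.skewtest(x).pvalue)
    return all(p > alpha for p in p_values)


rng = np.random.default_rng(seed=0)
# Exponential data is strongly right-skewed, so normality is rejected.
skewed_is_normal = follows_norm_curve_sketch(rng.exponential(size=1000))
```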