ANOVA Package

`daspi.anova`

ANOVA and measurement system analysis package.

This package provides tools for fitting, simplifying, and reporting linear models in the context of Analysis of Variance (ANOVA), designed experiments (DOE), and measurement system analysis (MSA).

`convert`: Utility functions for reversing patsy categorical encoding and serialising DataFrames to HTML.

`tables`: Stateless functions that compute ANOVA tables, effect sizes, variance inflation factors, and term p-values from statsmodels regression result objects.

`model`: The three main model classes:

- `LinearModel` — general OLS model with backward elimination,
  ANOVA reporting, and response optimisation.
- `GageStudyModel` — MSA Type-1 study with full GUM uncertainty
  budget (CAL, RE, BI, LIN, EVR).
- `GageRnRModel` — Gage R&R study decomposing measurement
  variation into repeatability (EV) and reproducibility (AV).

All public names from each submodule are re-exported at the package level.

Module reference

Module	Contents
Linear & Gage Models	`LinearModel`, `GageStudyModel`, `GageRnRModel`
Gage Study Model	`GageStudyModel` — MSA Type-1 with GUM uncertainty budget
Gage R&R Model	`GageRnRModel` — crossed Gage R&R variance components

Utility functions

`daspi.anova.convert`

Conversion helpers for patsy-encoded model terms and HTML output.

This module provides small, stateless utility functions used internally by the ANOVA model and table modules.

FUNCTION	DESCRIPTION
`get_term_name`	Strips the patsy categorical encoding (`[T.<value>]` suffixes) from a column name, including interaction terms separated by `:`. The result is the original user-facing factor name.
`frames_to_html`	Serialises one or more DataFrames to an HTML string with per-table `<caption>` elements; used by `BaseHTMLReprModel` to build the notebook HTML representation.

`get_term_name(name)`

Return the original term name for a patsy-encoded categorical column name, including interactions.

PARAMETER	DESCRIPTION
`name`	The encoded column name. TYPE: `str`

RETURNS	DESCRIPTION
`str`	The original term name of the categorical column name.

Notes

Patsy encodes categorical columns by appending `[T.<value>]` to the original term name. Interactions between features are represented by joining the feature names with `:`. This function extracts the original term name from the encoded name and handles each part of an interaction.

Examples:

from daspi.anova import get_term_name

encoded_name = 'Category[T.Value]:OtherCategory[T.OtherValue]'
term_name = get_term_name(encoded_name)
print(term_name)

Category:OtherCategory
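The stripping step described in the notes can be sketched with a regular expression. This is a minimal illustration, not the daspi implementation: the helper name `strip_patsy_encoding` and the pattern are assumptions, and the sketch assumes that category values contain no `]` character.

```python
import re

# Hypothetical helper, not part of daspi: matches every "[T.<value>]"
# suffix that patsy appends to an encoded categorical column name.
_ENCODING = re.compile(r"\[T\.[^\]]*\]")


def strip_patsy_encoding(name: str) -> str:
    """Remove all "[T.<value>]" suffixes, leaving ":" separators intact."""
    return _ENCODING.sub("", name)


print(strip_patsy_encoding("Category[T.Value]:OtherCategory[T.OtherValue]"))
# Category:OtherCategory
print(strip_patsy_encoding("x1"))
# x1
```

Names without an encoding suffix, such as continuous predictors, pass through unchanged.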

`frames_to_html(dfs, captions)`

Converts one or more DataFrames to HTML tables with captions.

PARAMETER	DESCRIPTION
`dfs`	The DataFrame(s) to be converted to HTML. TYPE: `DataFrame or list/tuple of DataFrames`
`captions`	The captions to be used for the HTML tables. The number of captions must match the number of DataFrames. TYPE: `str or list/tuple of str`

RETURNS	DESCRIPTION
`str`	The HTML representation of the DataFrames with captions.
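One possible way to produce captioned tables is sketched below using the pandas `DataFrame.to_html` method. The function name `frames_to_html_sketch`, the `ValueError` on a length mismatch, and the way the caption is spliced in are assumptions made for illustration; the actual daspi function may differ.

```python
import html
from typing import Sequence, Union

import pandas as pd


def frames_to_html_sketch(
        dfs: Union[pd.DataFrame, Sequence[pd.DataFrame]],
        captions: Union[str, Sequence[str]]) -> str:
    """Hypothetical sketch: render DataFrames as captioned HTML tables."""
    # Accept a single DataFrame or caption as well as sequences of them.
    if isinstance(dfs, pd.DataFrame):
        dfs = [dfs]
    if isinstance(captions, str):
        captions = [captions]
    if len(dfs) != len(captions):
        raise ValueError("Number of captions must match number of DataFrames.")

    tables = []
    for df, caption in zip(dfs, captions):
        table = df.to_html()
        # Split at the end of the opening <table ...> tag and place the
        # <caption> element directly after it.
        head, sep, tail = table.partition(">")
        tables.append(
            f"{head}{sep}\n  <caption>{html.escape(caption)}</caption>{tail}")
    return "\n".join(tables)


demo = pd.DataFrame({"DF": [1, 8], "SS": [12.5, 40.0]}, index=["A", "Residual"])
print(frames_to_html_sketch(demo, "ANOVA table"))
```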

`daspi.anova.tables`

Low-level table-building functions for ANOVA results.

This module contains pure functions that compute and format the numerical summaries produced by LinearModel.

FUNCTION	DESCRIPTION
`uniques`	Order-preserving de-duplication of a sequence; used to keep factor names in the order they were entered.
`terms_effect`	Calculates the standardised effect size for each model term (|coefficient| / standard error), used in effect plots.
`variance_inflation_factor`	Computes VIF scores for all predictors; flags multicollinearity.
`anova_table`	Builds a tidy Type I / II / III ANOVA table from a fitted statsmodels `RegressionResultsWrapper`, including SS, MS, F, and p-values.
`terms_probability`	Extracts per-term p-values from a fitted model, applying the `get_term_name` conversion so that the index matches original factor names.

Notes

All functions in this module operate on statsmodels result objects or pandas data structures. They do not carry state and are safe to call independently of the model classes.

`uniques(seq)`

Get a list of unique elements from a sequence while preserving the original order.

PARAMETER	DESCRIPTION
`seq`	The input sequence. TYPE: `Iterable`

RETURNS	DESCRIPTION
`List[Any]`	A list of unique elements from the input sequence, preserving the original order.

Notes

This function is based on the 'uniqify' algorithm by Peter Bengtsson. Source: https://www.peterbe.com/plog/uniqifiers-benchmark

Examples:

from daspi.anova import uniques

sequence = [1, 2, 3, 2, 1, 4, 5, 4]
unique_elements = uniques(sequence)
print(unique_elements)

[1, 2, 3, 4, 5]
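Order-preserving de-duplication is commonly written with a "seen" set. The sketch below shows that general technique; the name `uniques_sketch` is hypothetical, the code is not taken from daspi, and it assumes that all elements are hashable.

```python
from typing import Any, Iterable, List


def uniques_sketch(seq: Iterable[Any]) -> List[Any]:
    """Hypothetical sketch: keep the first occurrence of each element."""
    seen = set()
    result = []
    for item in seq:
        if item not in seen:
            seen.add(item)       # remember the element
            result.append(item)  # keep it in first-seen order
    return result


print(uniques_sketch([1, 2, 3, 2, 1, 4, 5, 4]))
# [1, 2, 3, 4, 5]
print(uniques_sketch(["B", "A", "B", "A:B", "A"]))
# ['B', 'A', 'A:B']
```

For hashable elements, `list(dict.fromkeys(seq))` gives the same result because dictionaries preserve insertion order.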

`terms_effect(model)`

Calculates the effect of each term on the target. Each effect is the absolute value of the term's coefficient divided by its standard error.

PARAMETER	DESCRIPTION
`model`	Statsmodels regression results of fitted model. TYPE: `RegressionResultsWrapper`

RETURNS	DESCRIPTION
`Series`	A pandas Series containing the effects of each feature on the target variable.
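The ratio described above can be worked through with plain numbers. In this sketch the function name `standardized_effects` and all values are made up for illustration; with a fitted statsmodels result, the coefficients and standard errors would typically come from its `params` and `bse` attributes.

```python
from typing import Dict


def standardized_effects(
        coefficients: Dict[str, float],
        standard_errors: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical sketch: |coefficient| / standard error for every term."""
    return {
        term: abs(coef) / standard_errors[term]
        for term, coef in coefficients.items()
    }


# Made-up numbers, for illustration only.
coefficients = {"Intercept": 12.0, "A": -3.0, "B": 0.5}
standard_errors = {"Intercept": 2.0, "A": 0.75, "B": 1.0}

print(standardized_effects(coefficients, standard_errors))
# {'Intercept': 6.0, 'A': 4.0, 'B': 0.5}
```

Because a coefficient divided by its standard error is the t statistic, each value is the absolute t-value of the corresponding term.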

`variance_inflation_factor(model, threshold=5, generalized=True)`

Calculate the variance inflation factor (VIF) and the generalized variance inflation factor (GVIF) for each predictor variable in the fitted model.

This function takes a fitted regression model and returns a DataFrame containing the VIF, the GVIF (= VIF^(1/(2*dof))), the threshold for the GVIF, the collinearity status, and the calculation kind for each predictor variable in the model. The VIF and GVIF are measures of multicollinearity and help identify predictors that are highly correlated with each other.

PARAMETER	DESCRIPTION
`model`	The regression model to analyze. TYPE: `RegressionResultsWrapper`
`threshold`	The threshold for deciding whether a predictor is collinear. Common values are 5 and 10. By default 5. TYPE: `int` DEFAULT: `5`
`generalized`	Whether to calculate the generalized VIF or not, by default True. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`DataFrame`	A DataFrame containing the VIF, GVIF, threshold, collinearity status and performed method for each predictor variable in the model.

Notes

The VIF indicates the degree to which the standard error of a predictor is increased due to the predictor's correlation with the other predictors in the model. VIF values greater than 10 (or Tolerance values less than 0.10), corresponding to a multiple correlation of 0.95, indicate that multicollinearity may be a problem (Hair Jr, JF, Anderson, RE, Tatham, RL and Black, WC, 1998).

Fox and Weisberg also comment that the straightforward VIF cannot be used if there are variables with more than one degree of freedom (e.g. polynomial and other contrasts relating to categorical variables with more than two levels) and recommend using the gvif function (generalized variance inflation factor) in the car package in R in these cases. For predictors with a single degree of freedom, gvif is the square root of the VIF and can therefore be used equivalently.

More generally, generalized variance-inflation factors consist of the VIF corrected by the number of degrees of freedom (df) of the predictor variable: GVIF = VIF^(1/(2*df)). They may be compared to thresholds of 10^(1/(2*df)) to assess collinearity using the stepVIF function in R (source code: https://github.com/cran/car/blob/master/R/vif.R).

source: https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/Collinearity

See Also: https://www.rdocumentation.org/packages/car/versions/3.1-2/topics/vif https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
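The degrees-of-freedom correction from the notes can be checked with a few lines of arithmetic. This is only a sketch of the formula as stated above: the function names are hypothetical, the input VIF is a made-up number, and scaling the threshold with the same exponent is an assumption based on the notes rather than a confirmed detail of the daspi implementation.

```python
def gvif(vif: float, df: int) -> float:
    """GVIF as given in the notes: VIF ** (1 / (2 * df))."""
    return vif ** (1 / (2 * df))


def scaled_threshold(threshold: float, df: int) -> float:
    """Threshold put on the same scale: threshold ** (1 / (2 * df))."""
    return threshold ** (1 / (2 * df))


# Made-up example: a categorical predictor with 3 levels (df = 2)
# and a VIF of 10, compared against the default threshold of 5.
vif, df, threshold = 10.0, 2, 5

print(round(gvif(vif, df), 3))                         # 1.778
print(round(scaled_threshold(threshold, df), 3))       # 1.495
print(gvif(vif, df) > scaled_threshold(threshold, df)) # True

# With df = 1 the corrected value is the square root of the VIF.
print(round(gvif(9.0, 1), 3))                          # 3.0
```

Since both sides are raised to the same positive exponent, comparing the corrected values gives the same verdict as comparing the raw VIF with the raw threshold.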

`anova_table(model, typ)`

Perform an analysis of variance (ANOVA) on the fitted model.

PARAMETER	DESCRIPTION
`model`	A fitted regression model of the statsmodels package. TYPE: `RegressionResultsWrapper`
`typ`	The type of ANOVA to perform. Default is 'III'; see the notes for more information about the types. TYPE: `Literal['I', 'II', 'III']`

Accepted values for `typ`:

- `''`: If no type or an invalid type is specified, Type II is used if the model has no significant interactions. Otherwise, Type III is used for hierarchical models and Type I is used for non-hierarchical models.
- `'I'`: Type I sum of squares ANOVA.
- `'II'`: Type II sum of squares ANOVA.
- `'III'`: Type III sum of squares ANOVA.

RETURNS	DESCRIPTION
`DataFrame`	The ANOVA table as a DataFrame containing the following columns:

- `DF`: Degrees of freedom for model terms.
- `SS`: Sum of squares for model terms.
- `F`: F statistic value for significance of adding model terms.
- `p`: P-value for significance of adding model terms.
- `n2`: Eta-square as effect size (proportion of explained variance).
- `np2`: Partial eta-square as partial effect size.
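The two effect-size columns follow the textbook definitions: eta-square is a term's sum of squares divided by the total sum of squares, and partial eta-square is the term's sum of squares divided by the sum of the term and residual sums of squares. The sketch below applies these definitions to made-up numbers. The function name, the `"Residual"` row label, and taking the total as the sum of all listed rows are assumptions; the daspi implementation is not shown in this reference.

```python
from typing import Dict


def effect_sizes(
        sum_of_squares: Dict[str, float],
        residual_key: str = "Residual") -> Dict[str, Dict[str, float]]:
    """Hypothetical sketch: eta-square and partial eta-square per term."""
    ss_total = sum(sum_of_squares.values())   # assumed total: all rows
    ss_resid = sum_of_squares[residual_key]
    sizes = {}
    for term, ss in sum_of_squares.items():
        if term == residual_key:
            continue
        sizes[term] = {
            "n2": ss / ss_total,            # share of total variation
            "np2": ss / (ss + ss_resid),    # share excluding other terms
        }
    return sizes


# Made-up sums of squares, for illustration only.
ss = {"A": 30.0, "B": 10.0, "Residual": 60.0}
for term, values in effect_sizes(ss).items():
    print(term, round(values["n2"], 3), round(values["np2"], 3))
# A 0.3 0.333
# B 0.1 0.143
```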

Notes

The ANOVA table provides information about the significance of each factor and interaction in the model. The type of ANOVA determines how the sum of squares is partitioned among the factors.

SAS and Minitab use Type III by default. Type III is also the only type that provides an SS and a p-value for the intercept. A discussion of which type to use can be found here: https://stats.stackexchange.com/a/93031

A concise summary of the differences between the types:

- Type I: We choose the most "important" independent variable and it will receive the maximum amount of variation possible.
- Type II: We ignore the shared variation: no interaction is assumed. If this is true, the Type II sums of squares are statistically more powerful. However, if there is an interaction effect in reality, the model will be wrong and the conclusions of the analysis will be affected.
- Type III: If there is an interaction effect and we are looking for an "equal" split between the independent variables, Type III should be used.

source: https://towardsdatascience.com/anovas-three-types-of-estimating-sums-of-squares-don-t-make-the-wrong-choice-91107c77a27a
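The fallback rule documented for an empty or invalid `typ` can be written as a small decision function. This is a sketch of the rule as worded in the parameter description only: the function name and the two boolean inputs are hypothetical, and how daspi decides whether interactions are significant or whether a model is hierarchical is not covered here.

```python
def choose_anova_type(
        has_significant_interactions: bool,
        is_hierarchical: bool) -> str:
    """Hypothetical sketch of the documented fallback for typ=''."""
    if not has_significant_interactions:
        return "II"    # no significant interactions: Type II
    if is_hierarchical:
        return "III"   # interactions in a hierarchical model: Type III
    return "I"         # interactions in a non-hierarchical model: Type I


print(choose_anova_type(False, True))   # II
print(choose_anova_type(True, True))    # III
print(choose_anova_type(True, False))   # I
```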

`terms_probability(model)`

Compute the p-values for the terms in a regression model using a Type III ANOVA table.

PARAMETER	DESCRIPTION
`model`	The regression model to compute the p-values for. TYPE: `RegressionResultsWrapper`

RETURNS	DESCRIPTION
`Series[float]`	A Series containing the p-values for each term in the model. If the ANOVA table could not be calculated, the p-values will be set to NaN.

Notes

A Type III ANOVA table is used because it is the only type that provides a p-value for the intercept.
