ANOVA Package

daspi.anova

ANOVA and measurement system analysis package.

This package provides tools for fitting, simplifying, and reporting linear models in the context of Analysis of Variance (ANOVA), designed experiments (DOE), and measurement system analysis (MSA).

- `convert`: Utility functions for reversing patsy categorical encoding and serialising DataFrames to HTML.

- `tables`: Stateless functions that compute ANOVA tables, effect sizes, variance inflation factors, and term p-values from statsmodels regression result objects.

- `model`: The three main model classes:

- `LinearModel` — general OLS model with backward elimination,
  ANOVA reporting, and response optimisation.
- `GageStudyModel` — MSA Type-1 study with full GUM uncertainty
  budget (CAL, RE, BI, LIN, EVR).
- `GageRnRModel` — Gage R&R study decomposing measurement
  variation into repeatability (EV) and reproducibility (AV).

All public names from each submodule are re-exported at the package level.

Module reference

| Module | Contents |
| --- | --- |
| Linear & Gage Models | `LinearModel`, `GageStudyModel`, `GageRnRModel` |
| Gage Study Model | `GageStudyModel`: MSA Type-1 with GUM uncertainty budget |
| Gage R&R Model | `GageRnRModel`: crossed Gage R&R variance components |

Utility functions

daspi.anova.convert

Conversion helpers for patsy-encoded model terms and HTML output.

This module provides small, stateless utility functions used internally by the ANOVA model and table modules.

FUNCTION DESCRIPTION
- `get_term_name`: strips the patsy categorical encoding (`[T.<value>]` suffixes) from a column name, including interaction terms separated by `:`. The result is the original user-facing factor name.
- `frames_to_html`: serialises one or more DataFrames to an HTML string with per-table `<caption>` elements; used by `BaseHTMLReprModel` to build the notebook HTML representation.

get_term_name(name)

Get the original term name of a patsy encoded categorical column name, including interactions.

PARAMETER DESCRIPTION
name

The encoded column name.

TYPE: str

RETURNS DESCRIPTION
str

The original term name of the categorical column name.

Notes

Patsy encodes categorical columns by appending '[T.<value>]' to the original term name, where <value> is the factor level. Interactions between features are represented by separating the feature names with ':'. This function extracts the original term name from the encoded feature name, taking interactions into account.

Examples:

>>> from daspi.anova.convert import get_term_name
>>> encoded_name = 'Category[T.Value]:OtherCategory[T.OtherValue]'
>>> term_name = get_term_name(encoded_name)
>>> print(term_name)
Category:OtherCategory
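To make the stripping rule in the notes concrete, here is a minimal sketch of how such a conversion could be done with a regular expression. It is not the daspi implementation: `strip_patsy_encoding` is a hypothetical name, and the sketch assumes that factor levels never contain a `]` character.

```python
import re

# Assumption: every categorical part carries a "[T.<value>]" suffix and the
# level value itself contains no "]" character.
_PATSY_LEVEL = re.compile(r"\[T\.[^\]]*\]")


def strip_patsy_encoding(name: str) -> str:
    """Hypothetical stand-in that removes all "[T.<value>]" suffixes."""
    # Removing every suffix leaves the ":" separators of interactions intact.
    return _PATSY_LEVEL.sub("", name)


print(strip_patsy_encoding("Category[T.Value]:OtherCategory[T.OtherValue]"))
# prints: Category:OtherCategory
print(strip_patsy_encoding("x1"))
# prints: x1
```

Names without a categorical suffix, such as continuous predictors, pass through unchanged.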

frames_to_html(dfs, captions)

Converts one or more DataFrames to HTML tables with captions.

PARAMETER DESCRIPTION
dfs

The DataFrame(s) to be converted to HTML.

TYPE: DataFrame or list/tuple of DataFrames

captions

The captions to be used for the HTML tables. The number of captions must match the number of DataFrames.

TYPE: str or list/tuple of str

RETURNS DESCRIPTION
str

The HTML representation of the DataFrames with captions.

daspi.anova.tables

Low-level table-building functions for ANOVA results.

This module contains pure functions that compute and format the numerical summaries produced by LinearModel.

FUNCTION DESCRIPTION
- `uniques`: order-preserving de-duplication of a sequence; used to keep factor names in the order they were entered.
- `terms_effect`: calculates the standardised effect size for each model term (|coefficient| / standard error), used in effect plots.
- `variance_inflation_factor`: computes VIF scores for all predictors; flags multicollinearity.
- `anova_table`: builds a tidy Type I / II / III ANOVA table from a fitted statsmodels `RegressionResultsWrapper`, including SS, MS, F, and p-values.
- `terms_probability`: extracts per-term p-values from a fitted model, applying the `get_term_name` conversion so that the index matches original factor names.

Notes

All functions in this module operate on statsmodels result objects or pandas data structures. They do not carry state and are safe to call independently of the model classes.

uniques(seq)

Get a list of unique elements from a sequence while preserving the original order.

PARAMETER DESCRIPTION
seq

The input sequence.

TYPE: Iterable

RETURNS DESCRIPTION
List[Any]

A list of unique elements from the input sequence, preserving the original order.

Notes

This function is based on the 'uniqify' algorithm by Peter Bengtsson. Source: https://www.peterbe.com/plog/uniqifiers-benchmark

Examples:

>>> from daspi.anova.tables import uniques
>>> sequence = [1, 2, 3, 2, 1, 4, 5, 4]
>>> unique_elements = uniques(sequence)
>>> print(unique_elements)
[1, 2, 3, 4, 5]
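For readers who want to see the idea behind order-preserving de-duplication, the following is a minimal sketch using a set of already-seen items. It is an illustration only, not necessarily the code used in daspi; `uniques_sketch` is a hypothetical name, and the sketch assumes all elements are hashable.

```python
from typing import Any, Iterable, List


def uniques_sketch(seq: Iterable[Any]) -> List[Any]:
    """Return the first occurrence of each element, in input order."""
    seen = set()            # elements already emitted (must be hashable)
    result: List[Any] = []
    for item in seq:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(uniques_sketch([1, 2, 3, 2, 1, 4, 5, 4]))
# prints: [1, 2, 3, 4, 5]
print(uniques_sketch(["B", "A", "B", "A:B"]))
# prints: ['B', 'A', 'A:B']
```

Unlike `list(set(seq))`, this keeps factor names in the order in which they first appear.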

terms_effect(model)

Calculates the impact of each term on the target. Each effect is the absolute value of the parameter coefficient divided by its standard error.

PARAMETER DESCRIPTION
model

Statsmodels regression results of fitted model.

TYPE: RegressionResultsWrapper

RETURNS DESCRIPTION
Series

A pandas Series containing the effects of each feature on the target variable.
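The arithmetic behind the effect definition can be sketched with plain numbers. The coefficients and standard errors below are made up for illustration, and `standardised_effects` is a hypothetical helper, not part of daspi. On a statsmodels results object the corresponding inputs are available as `model.params` and `model.bse`.

```python
from typing import Dict


def standardised_effects(
    coefficients: Dict[str, float], std_errors: Dict[str, float]
) -> Dict[str, float]:
    """Hypothetical helper: |coefficient| / standard error per term."""
    return {
        term: abs(coefficients[term]) / std_errors[term]
        for term in coefficients
    }


# Made-up example values, not taken from any real model.
coefficients = {"A": 2.5, "B": -0.6, "A:B": 0.2}
std_errors = {"A": 0.5, "B": 0.3, "A:B": 0.4}

effects = standardised_effects(coefficients, std_errors)
for term, effect in effects.items():
    print(f"{term}: {effect:.2f}")
```

The sign of the coefficient is discarded, so the result ranks terms by the strength of their effect rather than by its direction.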

variance_inflation_factor(model, threshold=5, generalized=True)

Calculate the variance inflation factor (VIF) and the generalized variance inflation factor (GVIF) for each predictor variable in the fitted model.

This function takes a regression model as input and returns a DataFrame containing the VIF, GVIF (= VIF^(1/(2*dof))), threshold for GVIF, collinearity status and calculation kind for each predictor variable in the model. The VIF and GVIF are measures of multicollinearity, which can help identify variables that are highly correlated with each other.

PARAMETER DESCRIPTION
model

The regression model to analyze.

TYPE: RegressionResultsWrapper

threshold

The threshold for deciding whether a predictor is collinear. Common values are 5 and 10. By default 5.

TYPE: int DEFAULT: 5

generalized

Whether to calculate the generalized VIF or not, by default True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
DataFrame

A DataFrame containing the VIF, GVIF, threshold, collinearity status and performed method for each predictor variable in the model.

Notes

The VIF indicates the degree to which the standard error of a predictor is increased due to the predictor's correlation with the other predictors in the model. VIF values greater than 10 (or tolerance values less than 0.10), corresponding to a multiple correlation of 0.95, indicate that multicollinearity may be a problem (Hair Jr, JF, Anderson, RE, Tatham, RL and Black, WC, 1998).

Fox and Weisberg also note that the straightforward VIF can't be used if there are variables with more than one degree of freedom (e.g. polynomial and other contrasts relating to categorical variables with more than two levels) and recommend using the gvif function (generalized variance inflation factor) in the car package in R in these cases. gvif is the square root of the VIF for individual predictors and thus can be used equivalently.

More generally, generalized variance-inflation factors consist of the VIF corrected by the number of degrees of freedom (df) of the predictor variable: GVIF = VIF^(1/(2*df)). They may be compared to thresholds of 10^(1/(2*df)) to assess collinearity using the stepVIF function in R (source code: https://github.com/cran/car/blob/master/R/vif.R).

source: https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/Collinearity

See Also: https://www.rdocumentation.org/packages/car/versions/3.1-2/topics/vif https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
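The degrees-of-freedom correction from the notes can be worked through numerically. The sketch below only applies the formula GVIF = VIF^(1/(2*df)) and the matching threshold scaling; the function names and the input values are hypothetical, and it does not reproduce how daspi or the car package derive the VIF itself.

```python
def gvif(vif: float, df: int) -> float:
    """VIF corrected by the predictor's degrees of freedom."""
    return vif ** (1.0 / (2.0 * df))


def gvif_threshold(threshold: float, df: int) -> float:
    """Threshold scaled the same way so both sides stay comparable."""
    return threshold ** (1.0 / (2.0 * df))


def is_collinear(vif: float, df: int, threshold: float = 5.0) -> bool:
    """Flag a predictor whose corrected VIF exceeds the corrected threshold."""
    return gvif(vif, df) > gvif_threshold(threshold, df)


# Made-up values: a three-level factor (df=2) and a continuous predictor (df=1).
print(round(gvif(16.0, df=2), 3), round(gvif_threshold(5.0, df=2), 3))
print(is_collinear(16.0, df=2))
print(is_collinear(4.0, df=1))
```

For a predictor with one degree of freedom the correction reduces to the square root of the VIF, which matches the remark about gvif above.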

anova_table(model, typ)

Perform an analysis of variance (ANOVA) on the fitted model.

PARAMETER DESCRIPTION
model

A fitted regression model of the statsmodels package.

TYPE: RegressionResultsWrapper

typ

The type of ANOVA to perform. Default is 'III'; see the notes for more information about the types.

- '' : If no type or an invalid type is specified, Type II is used if the model has no significant interactions. Otherwise, Type III is used for hierarchical models and Type I is used for non-hierarchical models.
- 'I' : Type I sum of squares ANOVA.
- 'II' : Type II sum of squares ANOVA.
- 'III' : Type III sum of squares ANOVA.

TYPE: Literal['I', 'II', 'III']
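The fallback rule described for an empty or invalid `typ` can be read as a small decision function. The sketch below only restates that rule; `choose_anova_type` and its two boolean arguments are hypothetical names, and how daspi actually detects significant interactions or a hierarchical model is not shown here.

```python
def choose_anova_type(
    has_significant_interactions: bool, is_hierarchical: bool
) -> str:
    """Hypothetical restatement of the documented fallback rule."""
    if not has_significant_interactions:
        # No significant interactions: Type II.
        return "II"
    # Significant interactions: Type III for hierarchical models, else Type I.
    return "III" if is_hierarchical else "I"


print(choose_anova_type(False, True))   # prints: II
print(choose_anova_type(True, True))    # prints: III
print(choose_anova_type(True, False))   # prints: I
```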

RETURNS DESCRIPTION
DataFrame

The ANOVA table as a DataFrame containing the following columns:

- DF : Degrees of freedom for model terms.
- SS : Sum of squares for model terms.
- F : F statistic value for significance of adding model terms.
- p : P-value for significance of adding model terms.
- n2 : Eta-squared as effect size (proportion of explained variance).
- np2 : Partial eta-squared as partial effect size.
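The two effect-size columns follow from the sums of squares. The sketch below uses the common textbook definitions, eta-squared = SS_term / SS_total and partial eta-squared = SS_term / (SS_term + SS_residual), with made-up numbers. It assumes that SS_total is the sum of all listed rows including the residual, which may differ from how daspi computes the columns, in particular for Type II and Type III tables.

```python
from typing import Dict, Tuple


def effect_sizes(
    ss_terms: Dict[str, float], ss_residual: float
) -> Dict[str, Tuple[float, float]]:
    """Return (eta-squared, partial eta-squared) per term."""
    # Assumption: the total is the sum of the term rows plus the residual.
    ss_total = sum(ss_terms.values()) + ss_residual
    return {
        term: (ss / ss_total, ss / (ss + ss_residual))
        for term, ss in ss_terms.items()
    }


# Made-up sums of squares for two factors and the residual.
sizes = effect_sizes({"A": 30.0, "B": 10.0}, ss_residual=60.0)
for term, (n2, np2) in sizes.items():
    print(f"{term}: n2={n2:.3f}, np2={np2:.3f}")
```

Partial eta-squared is never smaller than eta-squared for the same term, because its denominator leaves out the variation attributed to the other terms.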

Notes

The ANOVA table provides information about the significance of each factor and interaction in the model. The type of ANOVA determines how the sum of squares is partitioned among the factors.

SAS and Minitab use Type III by default. It is also the only type that gives an SS and a p-value for the intercept. A discussion on which one to use can be found here: https://stats.stackexchange.com/a/93031

A concise summary of the differences between the types:

- Type I: We choose the most "important" independent variable and it receives the maximum amount of variation possible.
- Type II: We ignore the shared variation: no interaction is assumed. If this is true, the Type II sums of squares are statistically more powerful. However, if in reality there is an interaction effect, the model will be wrong and the conclusions of the analysis will be affected.
- Type III: If there is an interaction effect and we are looking for an "equal" split between the independent variables, Type III should be used.

source: https://towardsdatascience.com/anovas-three-types-of-estimating-sums-of-squares-don-t-make-the-wrong-choice-91107c77a27a

terms_probability(model)

Compute the p-values for the terms in a regression model using an ANOVA Type III table.

PARAMETER DESCRIPTION
model

The regression model to compute the p-values for.

TYPE: RegressionResultsWrapper

RETURNS DESCRIPTION
Series[float]

A Series containing the p-values for each term in the model. If the ANOVA table could not be calculated, the p-values will be set to NaN.

Notes

An ANOVA Type III table is used because it is the only type that gives a p-value for the intercept.