ANOVA Package¶
daspi.anova
¶
ANOVA and measurement system analysis package.
This package provides tools for fitting, simplifying, and reporting linear models in the context of Analysis of Variance (ANOVA), designed experiments (DOE), and measurement system analysis (MSA).
convert
Utility functions for reversing patsy categorical encoding and
serialising DataFrames to HTML.
tables
Stateless functions that compute ANOVA tables, effect sizes,
variance inflation factors, and term p-values from statsmodels
regression result objects.
model
The three main model classes:
- `LinearModel` — general OLS model with backward elimination,
ANOVA reporting, and response optimisation.
- `GageStudyModel` — MSA Type-1 study with full GUM uncertainty
budget (CAL, RE, BI, LIN, EVR).
- `GageRnRModel` — Gage R&R study decomposing measurement
variation into repeatability (EV) and reproducibility (AV).
All public names from each submodule are re-exported at the package level.
Module reference¶
| Module | Contents |
|---|---|
| Linear & Gage Models | LinearModel, GageStudyModel, GageRnRModel |
| Gage Study Model | GageStudyModel — MSA Type-1 with GUM uncertainty budget |
| Gage R&R Model | GageRnRModel — crossed Gage R&R variance components |
Utility functions¶
daspi.anova.convert
¶
Conversion helpers for patsy-encoded model terms and HTML output.
This module provides small, stateless utility functions used internally by the ANOVA model and table modules.
| FUNCTION | DESCRIPTION |
|---|---|
| `get_term_name` | Strips the patsy categorical encoding ( |
| `frames_to_html` | Serialises one or more DataFrames to an HTML string with per-table |
get_term_name(name)
¶
Get the original term name of a patsy encoded categorical column name, including interactions.
| PARAMETER | DESCRIPTION |
|---|---|
| name | The encoded column name. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The original term name of the categorical column name. |
Notes
Patsy encodes categorical columns by appending '[T.
Examples:
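As a rough illustration of the idea (not daspi's actual implementation), the sketch below removes the `[T.level]` suffix that patsy appends to treatment-coded categorical columns and leaves interaction separators intact. The helper name `strip_patsy_levels` and the regular expression are assumptions made for this example; the real `get_term_name` may handle further encodings.

```python
import re

# Hypothetical helper, not part of daspi: matches the "[T.<level>]" suffix
# that patsy adds to treatment-coded categorical column names.
_LEVEL_SUFFIX = re.compile(r"\[T\.[^\]]*\]")


def strip_patsy_levels(name: str) -> str:
    """Remove every "[T.<level>]" suffix, keeping ":" between interaction parts."""
    return _LEVEL_SUFFIX.sub("", name)


print(strip_patsy_levels("A[T.b]"))          # main effect
print(strip_patsy_levels("A[T.b]:B[T.c]"))   # two-way interaction
print(strip_patsy_levels("x"))               # continuous column, unchanged
```

Level labels that themselves contain `]` are not handled by this simple pattern.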
frames_to_html(dfs, captions)
¶
Converts one or more DataFrames to HTML tables with captions.
| PARAMETER | DESCRIPTION |
|---|---|
| dfs | The DataFrame(s) to be converted to HTML. |
| captions | The captions to be used for the HTML tables. The number of captions must match the number of DataFrames. |

| RETURNS | DESCRIPTION |
|---|---|
| str | The HTML representation of the DataFrames with captions. |
daspi.anova.tables
¶
Low-level table-building functions for ANOVA results.
This module contains pure functions that compute and format the
numerical summaries produced by LinearModel.
| FUNCTION | DESCRIPTION |
|---|---|
|---|---|
| `uniques` | Order-preserving de-duplication of a sequence; used to keep factor names in the order they were entered. |
| `terms_effect` | Calculates the standardised effect size for each model term (\|coefficient\| / standard error), used in effect plots. |
| `variance_inflation_factor` | Computes VIF scores for all predictors; flags multicollinearity. |
| `anova_table` | Builds a tidy Type I / II / III ANOVA table from a fitted statsmodels |
| `terms_probability` | Extracts per-term p-values from a fitted model, applying the |
Notes
All functions in this module operate on statsmodels result objects or pandas data structures. They do not carry state and are safe to call independently of the model classes.
uniques(seq)
¶
Get a list of unique elements from a sequence while preserving the original order.
| PARAMETER | DESCRIPTION |
|---|---|
| seq | The input sequence. |

| RETURNS | DESCRIPTION |
|---|---|
| List[Any] | A list of unique elements from the input sequence, preserving the original order. |
Notes
This function is based on the 'uniqify' algorithm by Peter Bengtsson. Source: https://www.peterbe.com/plog/uniqifiers-benchmark
Examples:
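A minimal sketch of order-preserving de-duplication in the spirit of the "uniqify" approach referenced above. The function name `uniques_sketch` is made up for this example and the code is not copied from daspi; it assumes the elements are hashable.

```python
from typing import Any, Iterable, List


def uniques_sketch(seq: Iterable[Any]) -> List[Any]:
    """Return the first occurrence of each element, in input order."""
    seen = set()
    result = []
    for item in seq:
        if item not in seen:  # keep only the first occurrence
            seen.add(item)
            result.append(item)
    return result


print(uniques_sketch(["B", "A", "B", "C", "A"]))
```

Unlike `list(set(seq))`, this keeps factor names in the order in which they were entered.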
terms_effect(model)
¶
Calculates the impact of each term on the target. Each effect is the absolute value of the parameter coefficient divided by its standard error.
| PARAMETER | DESCRIPTION |
|---|---|
| model | Statsmodels regression results of fitted model. |

| RETURNS | DESCRIPTION |
|---|---|
| Series | A pandas Series containing the effects of each feature on the target variable. |
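To make the definition concrete, the sketch below computes |coefficient| / standard error for the slope of a simple one-predictor least-squares fit, using only the standard library. It is an illustration under that simplifying assumption, not daspi's code: `terms_effect` works on a fitted statsmodels result, and the name `standardized_effect` is hypothetical.

```python
import math
from typing import Sequence


def standardized_effect(x: Sequence[float], y: Sequence[float]) -> float:
    """|slope| / SE(slope) for the simple regression y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and mean squared error with n - 2 residual DF.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    se_slope = math.sqrt(mse / sxx)
    return abs(slope) / se_slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(standardized_effect(x, y), 2))
```

The ratio is the absolute t statistic of the coefficient, so the sign of the relationship does not change the effect size.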
variance_inflation_factor(model, threshold=5, generalized=True)
¶
Calculate the variance inflation factor (VIF) and the generalized variance inflation factor (GVIF) for each predictor variable in the fitted model.
This function takes a regression model as input and returns a DataFrame containing the VIF, the GVIF (= VIF^(1/(2*dof))), the threshold for the GVIF, the collinearity status, and the calculation kind for each predictor variable in the model. The VIF and GVIF are measures of multicollinearity and help identify variables that are highly correlated with each other.
| PARAMETER | DESCRIPTION |
|---|---|
| model | The regression model to analyze. |
| threshold | The threshold for deciding whether a predictor is collinear. Common values are 5 and 10. By default 5. |
| generalized | Whether to calculate the generalized VIF or not, by default True. |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A DataFrame containing the VIF, GVIF, threshold, collinearity status and performed method for each predictor variable in the model. |
Notes
The VIF describes the degree to which the standard error of a predictor is increased due to that predictor's correlation with the other predictors in the model. VIF values greater than 10 (or tolerance values less than 0.10), corresponding to a multiple correlation of 0.95, indicate that multicollinearity may be a problem (Hair Jr, JF, Anderson, RE, Tatham, RL and Black, WC, 1998).
Fox and Weisberg also comment that the straightforward VIF can't be used if there are variables with more than one degree of freedom (e.g. polynomial and other contrasts relating to categorical variables with more than two levels) and recommend using the gvif function (generalized variance inflation factor) in the car package in R in these cases. gvif is the square root of the VIF for individual predictors and thus can be used equivalently.
More generally, generalized variance-inflation factors consist of the VIF corrected by the number of degrees of freedom (df) of the predictor variable: GVIF = VIF^(1/(2*df)). They may be compared to thresholds of 10^(1/(2*df)) to assess collinearity using the stepVIF (source code: https://github.com/cran/car/blob/master/R/vif.R) function in R.
source: https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/Collinearity
See Also: https://www.rdocumentation.org/packages/car/versions/3.1-2/topics/vif https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
anova_table(model, typ)
¶
Perform an analysis of variance (ANOVA) on the fitted model.
| PARAMETER | DESCRIPTION |
|---|---|
| model | A fitted regression model of the |
| typ | The type of ANOVA to perform. Default is 'III'; see the notes for more information about the types. The accepted values are listed below. |

Accepted values for typ:
- '' : If no type or an invalid type is specified, Type II is used if the model has no significant interactions. Otherwise, Type III is used for hierarchical models and Type I is used for non-hierarchical models.
- 'I' : Type I sum of squares ANOVA.
- 'II' : Type II sum of squares ANOVA.
- 'III' : Type III sum of squares ANOVA.

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | The ANOVA table as a DataFrame containing the columns listed below. |

Columns of the returned table:
- DF : Degrees of freedom for model terms.
- SS : Sum of squares for model terms.
- F : F statistic value for significance of adding model terms.
- p : P-value for significance of adding model terms.
- n2 : Eta-square as effect size (proportion of explained variance).
- np2 : Partial eta-square as partial effect size.
Notes
The ANOVA table provides information about the significance of each factor and interaction in the model. The type of ANOVA determines how the sum of squares is partitioned among the factors.
SAS and Minitab use Type III by default. It is also the only type that provides an SS and a p-value for the intercept. A discussion of which type to use can be found here: https://stats.stackexchange.com/a/93031
A concise summary of the differences between the types:
- Type I: We choose the most "important" independent variable, and it receives the maximum amount of variation possible.
- Type II: We ignore the shared variation, i.e. no interaction is assumed. If this holds, the Type II sums of squares are statistically more powerful. However, if there is in fact an interaction effect, the model is wrong and the conclusions of the analysis are affected.
- Type III: If there is an interaction effect and we are looking for an "equal" split between the independent variables, Type III should be used.
source: https://towardsdatascience.com/anovas-three-types-of-estimating-sums-of-squares-don-t-make-the-wrong-choice-91107c77a27a
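To show how the DF, SS, F, n2 and np2 columns relate to each other, here is a small standard-library sketch of a balanced one-way ANOVA. It is an illustration only: the function name `one_way_anova` and the dictionary layout are assumptions for this example, and daspi derives its table from a fitted statsmodels result instead. The p-value is omitted because it needs the F distribution, which the standard library does not provide.

```python
from typing import Dict, List


def one_way_anova(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Return DF, SS, F, eta-square and partial eta-square for one factor."""
    values = [v for obs in groups.values() for v in obs]
    grand_mean = sum(values) / len(values)
    # Between-group (factor) and within-group (residual) sums of squares.
    ss_factor = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
        for obs in groups.values()
    )
    ss_resid = sum(
        (v - sum(obs) / len(obs)) ** 2 for obs in groups.values() for v in obs
    )
    df_factor = len(groups) - 1
    df_resid = len(values) - len(groups)
    f_value = (ss_factor / df_factor) / (ss_resid / df_resid)
    return {
        "DF": df_factor,
        "SS": ss_factor,
        "F": f_value,
        "n2": ss_factor / (ss_factor + ss_resid),   # SS_term / SS_total
        "np2": ss_factor / (ss_factor + ss_resid),  # SS_term / (SS_term + SS_resid)
    }


table = one_way_anova({"A": [4.0, 5.0, 6.0], "B": [7.0, 8.0, 9.0], "C": [10.0, 11.0, 12.0]})
print(table)
```

With a single factor the total sum of squares equals SS_term + SS_resid, so n2 and np2 coincide; they differ once the model contains more than one term.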
terms_probability(model)
¶
Compute the p-values for the terms in a regression model using a Type III ANOVA table.
| PARAMETER | DESCRIPTION |
|---|---|
| model | The regression model to compute the p-values for. |

| RETURNS | DESCRIPTION |
|---|---|
| Series[float] | A Series containing the p-values for each term in the model. If the ANOVA table could not be calculated, the p-values will be set to NaN. |
Notes
A Type III ANOVA table is used because it is the only type that provides a p-value for the intercept.