Linear model

daspi.anova.model.LinearModel(source, target, features, covariates=[], alpha=0.05, order=1, skip_intercept_as_least=True, generalized_vif=True, encode_features=True, fit_at_init=True)

Bases: BaseHTMLReprModel

This class is used to create and simplify linear models so that only significant features remain in the model.

Balanced models (DOEs or EVOPs), including continuous covariates, can be analyzed. With this class, you can create an encoded design matrix with all factor levels, including their interactions. All non-significant factors can then be eliminated automatically. Furthermore, this class allows the examination of main effects, the sum of squares (explained variation), and the ANOVA table in more detail.

PARAMETER DESCRIPTION
source

Pandas DataFrame as tabular data in a long format used for the model.

TYPE: pandas DataFrame

target

Column name of the endogenous variable.

TYPE: str

features

Column names of the exogenous variables that can be actively changed (factor levels for DOE or EVOP). Interactions are also created for these variables if the order is set > 1.

TYPE: List[str]

covariates

Column names for exogenous variables that are logged but cannot be influenced. No interactions are created for these variables.

TYPE: List[str] DEFAULT: []

alpha

Threshold as alpha risk. All terms, including covariates and the intercept, whose p-value is greater than alpha are removed during the automatic elimination of the factors. Default is 0.05.

TYPE: float DEFAULT: 0.05

order

The maximum interaction order built between the feature variables. Default is 1.

TYPE: int DEFAULT: 1

generalized_vif

If True, the generalized VIF is calculated when possible. Otherwise, only the straightforward VIF (via R2) is calculated. Default is True.

TYPE: bool DEFAULT: True

skip_intercept_as_least

If True, the intercept is not removed as the least significant term when using recursive feature elimination. Also if True, the intercept does not appear when calling the least_parameter method, by default True.

TYPE: bool DEFAULT: True

encode_features

If True, all provided feature variables are encoded using one-hot encoding by changing the data type to category. Otherwise, they are interpreted as continuous variables when possible, by default True.

TYPE: bool DEFAULT: True

fit_at_init

If True, the model is fitted at initialization. Otherwise, the model is fitted after calling the fit method. Default is True.

TYPE: bool DEFAULT: True

Examples:

Do some ANOVA and statistics on a dataset. Run the example below in a Jupyter Notebook to see the results.

import pandas as pd
import daspi as dsp

df = dsp.load_dataset('painkillers-dissolution')
model = dsp.LinearModel(
    source=df,
    target='dissolution',
    features=['employee', 'stirrer', 'brand', 'catalyst', 'water'],
    covariates=['temperature', 'preparation'],
    order=2)

# Store goodness-of-fit values for each elimination step
df_gof = pd.concat(model.recursive_elimination())

# Plot residual and parameter relevance analysis
dsp.ResidualsCharts(model).plot().stripes().label(info=True)
dsp.ParameterRelevanceCharts(model).plot().label(info=True)

# Get HTML output
model
Notes

Always be careful when removing the intercept. If the intercept is missing, Patsy will automatically add all one-hot encoded levels for a categorical variable to compensate for this missing term. This will result in extremely high VIFs. That means that the parameters are not linearly independent.
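The rank deficiency described above can be illustrated without Patsy, using a plain one-hot design matrix (a minimal sketch for illustration, not daspi internals):

```python
import numpy as np
import pandas as pd

# Illustrative sketch (not daspi code): with an intercept present, one
# dummy level is dropped (reference coding). Without the intercept, all
# levels are kept, and [intercept | all dummies] is rank-deficient.
levels = pd.Series(['A', 'B', 'C', 'A', 'B', 'C'])
dummies = pd.get_dummies(levels).to_numpy(float)  # one column per level

X = np.column_stack([np.ones(len(levels)), dummies])
rank = np.linalg.matrix_rank(X)  # 3, not 4: the dummy columns sum to ones
```

Because the columns are linearly dependent, the coefficients are not uniquely identified, which is exactly what the extreme VIFs indicate.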

Terminology
  • feature: The original name as it appears in the provided data.
  • term: The name of the term as it appears in the design matrix. All feature names are converted to "xi", where i is an ascending number: the first feature becomes "x0", the second "x1", and so on. The covariate variables are converted in the same way, but with the letter "e" instead of "x".
  • parameter: The variable names in connection with a coefficient, but in the composition with the original feature names.
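The naming scheme above can be sketched in a few lines (assumed logic for illustration, not daspi's actual implementation; the names come from the dissolution example):

```python
# Map original feature/covariate names to internal term names:
# features become x0, x1, ...; covariates become e0, e1, ...
features = ['employee', 'stirrer', 'brand']
covariates = ['temperature']

feature_terms = {f: f'x{i}' for i, f in enumerate(features)}
covariate_terms = {c: f'e{i}' for i, c in enumerate(covariates)}
```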

target = target instance-attribute

The name of the target variable for the linear model.

features = features instance-attribute

The list of features used in the linear model.

covariates = covariates instance-attribute

The list of covariate variables used in the linear model.

target_map = {target: 'y'} instance-attribute

A dictionary that maps the original target name to the encoded name used in the model.

feature_map = {f: _f for f, _f in (zip(features, f_main_terms))} | {c: _c for c, _c in (zip(covariates, d_main_terms))} instance-attribute

A dictionary that maps the original feature names to the encoded names (term) used in the model.

main_term_map = {v: k for k, v in (self.feature_map.items())} instance-attribute

A dictionary that maps the main term names (no interactions) back to the original feature names used in the model.

skip_intercept_as_least = skip_intercept_as_least instance-attribute

If True, the intercept is not treated as a least significant term

generalized_vif = generalized_vif instance-attribute

If True, the generalized VIF is calculated when possible. Otherwise, only the straightforward VIF (via R2) is calculated.

excluded = set() instance-attribute

A set of feature names that should be excluded from the model.

data = source.rename(columns=(self.feature_map | self.target_map))[list(self.feature_map.values()) + list(self.target_map.values())].copy() instance-attribute

The Pandas DataFrame containing the data for the linear model.

fitted property

Whether the model is fitted (read-only).

model property

Get regression results of fitted model. Calls fit method if no model is fitted yet (read-only).

initial_formula property

Get the initial formula for the ANOVA model (read-only).

The initial formula is constructed from the target variable and the initial terms, which are the terms with an interaction order less than or equal to the specified order.

uncertainty property

Get uncertainty of the model as square root of MS_Residual (read-only).

alpha property writable

Alpha risk as significance threshold for the p-value of exogenous factors.

design_info property

Get the DesignInfo instance of current fitted model (read-only).

terms property

Get the encoded names of all variables for the current fitted model (read-only).

parameters property

Get the names of all variables for the current fitted model in the composition using the original feature and covariates names (read-only).

term_map property

Get the names of all internally used and original term variables for the current fitted model as a dict (read-only).

main_parameters property

Get all main parameters of current model excluding intercept (read-only).

formula property

Get the formula used for the linear model, excluding any factors in the excluded set. If the intercept is excluded, the formula includes '-1' as the first term. Otherwise, the formula joins the remaining original terms (excluding the excluded factors) with '+' (read-only).

effect_threshold property

Calculates the threshold for the effect of adding a term to the model. The threshold is calculated as the inverse survival function (inverse of sf) at the given alpha (read-only).
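The inverse survival function mentioned above can be evaluated with scipy. Note that both the t-distribution and the degrees of freedom below are assumptions for illustration; this section does not state which distribution daspi uses.

```python
from scipy import stats

# Hedged sketch: the threshold is the inverse survival function (isf)
# evaluated at alpha. Distribution and dof are assumed here.
alpha = 0.05
dof = 20  # hypothetical residual degrees of freedom
threshold = stats.t.isf(alpha, dof)  # effects above this count as significant
```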

design_matrix property

Get the design matrix of the current fitted model (read-only).

is_hierarchical()

Check if current fitted model is hierarchical.

effects()

Calculates the impact of each term on the target. The effects are described as the absolute value of the parameter coefficients divided by their standard errors.

p_values()

Get the p-value for the significance of adding model terms, using the ANOVA Type-III table of the current model.

least_parameter()

Get the name of the least significant parameter, based on the ANOVA Type-III table of the current fitted model, or on the effects if p-values are unavailable.

RETURNS DESCRIPTION
str

The name of the parameter with the largest p-value if available, otherwise the name of the parameter with the smallest effect.

Notes

This method checks whether any p-values are missing (NaN). If there are missing p-values, it returns the term name that has the smallest effect on the target variable. Otherwise, it returns the term name with the largest p-value for the F-statistics from the current ANOVA table.

fit(**kwds)

Create and fit an ordinary least squares model using the current formula. To fit with a user-defined formula, use the formula keyword argument.

PARAMETER DESCRIPTION
**kwds

Pass formula and other keyword arguments to the ols function of statsmodels.formula.api.

DEFAULT: {}

r2_pred()

Calculate the predicted R-squared (R2_pred) for the fitted model.

RETURNS DESCRIPTION
float

The predicted R-squared value for the fitted model.

Notes

The predicted R-squared is a measure of how well the model would predict new observations. It is calculated as:

R2_pred = 1 - (Sum of Squared Prediction Errors / Total Sum of Squares)

Where the prediction errors are calculated as the residuals divided by (1 - leverage), where leverage is the diagonal elements of the projection matrix P = X(X'X)^(-1)X'.
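The PRESS-based formula above can be sketched as a standalone helper (an illustration of the calculation, not the daspi API):

```python
import numpy as np

def r2_pred(X: np.ndarray, y: np.ndarray) -> float:
    """Predicted R-squared via the PRESS statistic."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    # Leverage: diagonal of the projection (hat) matrix P = X (X'X)^-1 X'
    leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    press = np.sum((residuals / (1.0 - leverage)) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total
```

An exact fit yields R2_pred = 1; noise pulls it below the ordinary R-squared, because each residual is inflated by its leverage.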

References

Calculations are made according to: https://support.minitab.com/de-de/minitab/help-and-how-to/statistical-modeling/doe/how-to/factorial/analyze-factorial-design/methods-and-formulas/goodness-of-fit-statistics/#press

Further information about projection matrix (influence matrix or hat matrix) can be found in: https://en.wikipedia.org/wiki/Projection_matrix

gof_metrics(index=0)

Get different goodness-of-fit metrics (read-only).

PARAMETER DESCRIPTION
index

Value is set as the index. When using the recursive_elimination method, the current step is passed as the index.

TYPE: int | str DEFAULT: 0

RETURNS DESCRIPTION
DataFrame

The goodness-of-fit metrics table as DataFrame containing the following columns:
  • 'formula' : current formula
  • 's' : uncertainty of the model as square root of MS_Residual
  • 'aic' : Akaike's information criterion
  • 'r2' : R-squared of the model
  • 'r2_adj' : adjusted R-squared
  • 'least_parameter' : the least significant term
  • 'p_least' : the p-value of the least significant term, coming from the ANOVA Type-III table
  • 'hierarchical' : True if the model is hierarchical

summary(anova_typ=None, vif=True, **kwds)

Generate a summary of the fitted model.

PARAMETER DESCRIPTION
anova_typ

If not None, add an ANOVA table of provided type to the summary, by default None.

TYPE: Literal['', 'I', 'II', 'III', None] DEFAULT: None

vif

If True, variance inflation factors (VIF) are added to the ANOVA table. This is only considered if anova_typ is not None, by default True.

TYPE: bool DEFAULT: True

**kwds

Additional keyword arguments to be passed to the summary method of statsmodels.regression.linear_model.RegressionResults class.

DEFAULT: {}

RETURNS DESCRIPTION
Summary

A summary object containing information about the fitted model.

eliminate(parameter)

Removes the given parameter from the model by adding it to the excluded set. Call fit to refit the model.

PARAMETER DESCRIPTION
parameter

The feature name, the covariate name, or the interaction of multiple features to be removed from the model.

TYPE: str

RETURNS DESCRIPTION
Self

The current instance of the model, for method chaining.

Examples:

Prepare a LinearModel instance, fit the model and plot the relevance of the parameters. If you run the following code in a Jupyter Notebook, the plot and the html representation of the model will be displayed.

import pandas as pd
import daspi as dsp

df = dsp.load_dataset('painkillers-dissolution')
lm = dsp.LinearModel(
        source=df,
        target='dissolution',
        features=['employee', 'stirrer', 'brand', 'catalyst', 'water'],
        covariates=['temperature', 'preparation'],
        alpha=0.05,
        order=3,
        encode_features=False
    )
dsp.ParameterRelevanceCharts(lm).plot().stripes().label()
lm
Now remove the least significant term from the model and refit. Repeat the process until the model contains only significant terms.

lm.eliminate('stirrer:brand:catalyst')
dsp.ParameterRelevanceCharts(lm).plot().stripes().label()
lm

To add a feature back to the model, use the include method. For automatic elimination of insignificant terms, use the recursive_elimination method.

Notes

Always be careful when removing the intercept. If the intercept is missing, Patsy will automatically add all one-hot encoded levels for a categorical variable to compensate for this missing term. This will result in extremely high VIFs.

RAISES DESCRIPTION
AssertionError:

If the given term is not in the model.

include(parameter)

Adds the given feature to the model by removing it from the excluded set. Call fit to refit the model.

PARAMETER DESCRIPTION
parameter

The feature name, the covariate name, or the interaction of multiple features to be added to the model.

TYPE: str

RETURNS DESCRIPTION
Self

The current instance of the model, for method chaining.

Examples:

See eliminate method. After removing a term from the model, you can add it again by using the include method.

lm.include('stirrer:brand:catalyst').fit()
dsp.ParameterRelevanceCharts(lm).plot().stripes().label()
lm

recursive_elimination(rsquared_max=0.99, ensure_hierarchy=True, **kwds)

Perform a recursive parameter elimination on the fitted model.

This function starts with the complete model and recursively eliminates parameters based on their p-values, until only significant parameters remain in the model. The function yields the goodness-of-fit metrics at each step of the elimination process.

PARAMETER DESCRIPTION
rsquared_max

If given, the model must have a lower R^2 value than the given threshold, by default 0.99

TYPE: float in (0, 1) DEFAULT: 0.99

ensure_hierarchy

Adds features at the end to ensure the model is hierarchical, by default True.

TYPE: bool DEFAULT: True

**kwds

Additional keyword arguments for ols function of statsmodels.formula.api.

DEFAULT: {}

YIELDS DESCRIPTION
DataFrame

The goodness-of-fit metrics at each step of the recursive feature elimination.

Examples:

Prepare a LinearModel instance, fit the model, automatically eliminate insignificant terms and plot the relevance of the parameters and the residuals. In the DataFrame df_gof you can see when which feature was removed and the current goodness-of-fit values can also be viewed. If you run the following code in a Jupyter Notebook, the plots and the html representation of the model will be displayed.

import pandas as pd
import daspi as dsp

df = dsp.load_dataset('painkillers-dissolution')
lm = dsp.LinearModel(
        source=df,
        target='dissolution',
        features=['employee', 'stirrer', 'brand', 'catalyst', 'water'],
        covariates=['temperature', 'preparation'],
        alpha=0.05,
        order=3,
        encode_features=False
    )
df_gof = pd.concat(list(lm.recursive_elimination()))
dsp.ParameterRelevanceCharts(lm).plot().stripes().label(info=True)
dsp.ResidualsCharts(lm).plot().stripes().label(info=True)
lm

anova(typ='', vif=False)

Perform an analysis of variance (ANOVA) on the fitted model.

PARAMETER DESCRIPTION
typ

The type of ANOVA to perform; see notes for more information about the types.
  • '' : If no or an invalid type is specified, Type-II is used if the model has no significant interactions. Otherwise, Type-III is used for hierarchical models and Type-I for non-hierarchical models.
  • 'I' : Type I sum of squares ANOVA.
  • 'II' : Type II sum of squares ANOVA.
  • 'III' : Type III sum of squares ANOVA.

TYPE: Literal['', 'I', 'II', 'III'] DEFAULT: ''

vif

If True, variance inflation factors (VIF) are added to the ANOVA table, by default False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
DataFrame

The ANOVA table as DataFrame containing the following columns:
  • DF : degrees of freedom for model terms.
  • SS : sum of squares for model terms.
  • F : F statistic value for significance of adding model terms.
  • p : p-value for significance of adding model terms.
  • n2 : eta-squared as effect size (proportion of explained variance).
  • np2 : partial eta-squared as partial effect size.

Notes

The ANOVA table provides information about the significance of each factor and interaction in the model. The type of ANOVA determines how the sum of squares is partitioned among the factors.

SAS and also Minitab use Type III by default. This type is also the only one that gives an SS and a p-value for the intercept, so Type-III is also used internally for evaluating the least significant term. A discussion on which one to use can be found here: https://stats.stackexchange.com/a/93031

A nice conclusion about the differences between the types:
  • Type-I: We choose the most "important" independent variable, and it will receive the maximum amount of variation possible.
  • Type-II: We ignore the shared variation: no interaction is assumed. If this is true, the Type II sums of squares are statistically more powerful. However, if there actually is an interaction effect, the model will be wrong and there will be a problem in the conclusions of the analysis.
  • Type-III: If there is an interaction effect and we are looking for an "equal" split between the independent variables, Type-III should be used.

source: https://towardsdatascience.com/anovas-three-types-of-estimating-sums-of-squares-don-t-make-the-wrong-choice-91107c77a27a

Examples:

import daspi as dsp
df = dsp.load_dataset('anova3')
lm = dsp.LinearModel(df, 'Cholesterol', ['Sex', 'Risk', 'Drug'])
print(lm.anova(typ='III').round(3))
Typ-III    DF       SS       MS        F      p     n2
source                                                
Intercept   1  390.868  390.868  453.467  0.000  0.864
Sex         1    2.075    2.075    2.407  0.127  0.005
Risk        1   11.332   11.332   13.147  0.001  0.025
Drug        2    0.816    0.408    0.473  0.626  0.002
Residual   55   47.407    0.862      NaN    NaN  0.105

parameter_statistics(alpha=0.05, use_t=True)

Calculate the parameter statistics for the fitted model.

PARAMETER DESCRIPTION
alpha

The significance level for the confidence intervals, by default 0.05.

TYPE: float DEFAULT: 0.05

use_t

If True, use t-distribution for hypothesis testing and confidence intervals. If False, use normal distribution, by default True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
DataFrame

A pandas DataFrame containing the parameter statistics for the fitted model. The DataFrame includes columns for the parameter estimates, standard errors, t-values (or z-values), and p-values.

variance_inflation_factor(threshold=5)

Calculate the variance inflation factor (VIF) and the generalized variance inflation factor (GVIF) for each predictor variable in the fitted model.

This method returns a DataFrame containing the VIF, the GVIF (= VIF^(1/(2*DF))), the threshold for the GVIF, the collinearity status, and the calculation kind for each predictor variable in the model. The VIF and GVIF are measures of multicollinearity, which can help identify variables that are highly correlated with each other.

PARAMETER DESCRIPTION
threshold

The threshold for deciding whether a predictor is collinear. Common values are 5 and 10. By default 5.

TYPE: int DEFAULT: 5

RETURNS DESCRIPTION
DataFrame

A DataFrame containing the VIF, GVIF, threshold, collinearity status and performed method for each predictor variable in the model.
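The straightforward VIF via R-squared mentioned earlier can be sketched as follows (an illustrative helper, not the daspi implementation): each predictor is regressed on all the others, and VIF_j = 1 / (1 - R2_j).

```python
import numpy as np

def straightforward_vif(X: np.ndarray) -> np.ndarray:
    """VIF per column: regress each predictor on the remaining ones."""
    vifs = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs
```

Orthogonal predictors give VIF = 1; strongly correlated predictors push the VIF toward infinity, which is why thresholds such as 5 or 10 flag collinearity.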

highest_parameters(features_only=False)

Determines all main and interaction parameters that do not occur within a higher-order interaction in this constellation.

PARAMETER DESCRIPTION
features_only

If True, the intercept and interaction parameters are not returned. By default False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
List[str]

List of highest parameters.
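The idea behind "highest" parameters can be sketched as follows (assumed logic for illustration, not the daspi implementation): a term counts as highest if its factors do not all reappear inside a higher-order interaction.

```python
# Hypothetical parameter names for illustration.
parameters = ['Intercept', 'A', 'B', 'C', 'A:B']

def is_highest(term, all_terms):
    factors = set(term.split(':'))
    # Strict subset test: is this term contained in a bigger interaction?
    return not any(factors < set(other.split(':')) for other in all_terms)

highest = [
    p for p in parameters
    if p != 'Intercept' and is_highest(p, parameters)
]
# 'A' and 'B' are covered by 'A:B'; 'C' and 'A:B' remain
```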

has_insignificant_term(rsquared_max=0.99)

Check if the fitted model has any insignificant terms.

PARAMETER DESCRIPTION
rsquared_max

The maximum R^2 value that the model can have to be considered significant, by default 0.99.

TYPE: float in (0, 1) DEFAULT: 0.99

RETURNS DESCRIPTION
bool

Returns True if the model has any insignificant terms, and False otherwise.

has_significant_interactions()

True if the fitted model has significant interactions.

predict(xs)

Predict y with the given xs. Ensure that all non-interaction parameters are given in xs.

PARAMETER DESCRIPTION
xs

The values for which you want to predict. Make sure that all non-interaction parameters are given in xs. If multiple values are to be predicted, provide a list of values for each factor level.

TYPE: Dict[str, Any]

RETURNS DESCRIPTION
DataFrame

A DataFrame containing the predicted values for the given values of the predictor variables.
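Conceptually, the prediction is a dot product of the coefficients with the design row built from xs. The sketch below illustrates this with made-up coefficients and term names (daspi builds the design matrix for you):

```python
import pandas as pd

# Hypothetical fitted coefficients and prediction inputs.
coefs = {'Intercept': 1.0, 'x0': 2.0, 'e0': 0.5}
xs = {'x0': [0.0, 1.0], 'e0': [4.0, 4.0]}  # all non-interaction terms given

X = pd.DataFrame(xs)
y_pred = coefs['Intercept'] + coefs['x0'] * X['x0'] + coefs['e0'] * X['e0']
```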

optimize(maximize=True, bounds={})

Optimize the prediction by optimizing the parameters.

PARAMETER DESCRIPTION
maximize

Whether to maximize the prediction or minimize it.

TYPE: bool DEFAULT: True

bounds

Bounds for the parameters to optimize.
  • You can freeze a parameter by setting it to the desired value. For example, to fix the value of a parameter to 1, set bounds = {'param_name': 1}.
  • To keep an ordinal or metric parameter within a specific range, set bounds = {'param_name': (lower, upper)}. For example, to constrain a parameter to be between 0 and 1, set bounds = {'param_name': (0, 1)}.
  • To limit the selection of a nominal parameter to a subset of the originally contained values, set bounds = {'param_name': (val1, val2, ...)}. For example, to limit the selection of a nominal parameter to the values 'A' and 'B', set bounds = {'param_name': ('A', 'B')}.

TYPE: Dict[str, Any] DEFAULT: {}

RETURNS DESCRIPTION
Dict[str, Any]

The optimized parameters.
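The three bounds forms described above can be combined in one dict. The parameter names below are borrowed from the dissolution example and the level values are hypothetical:

```python
# Hypothetical bounds dict combining all three forms.
bounds = {
    'water': 1,               # freeze the parameter at a fixed value
    'temperature': (20, 40),  # keep a metric parameter within a range
    'brand': ('A', 'B'),      # restrict a nominal parameter to a subset
}
```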

residual_data()

Get the residual data from the fitted model.

RETURNS DESCRIPTION
DataFrame

The residual data containing the residuals, observation index, and predicted values.

Examples:

import daspi as dsp
df = dsp.load_dataset('partial_factorial')
target = 'Yield'
features = [c for c in df.columns if c != target]
lm = dsp.LinearModel(df, target, features)
print(lm.residual_data())
    Observation      Residuals  Prediction
0             1  9.250000e+00       46.75
1             2  2.000000e+00       51.00
2             3 -1.050000e+01       73.50
3             4 -2.500000e-01       65.25
4             5 -1.421085e-14       53.00
5             6  1.025000e+01       44.75
6             7 -2.500000e-01       67.25
7             8 -1.050000e+01       71.50
8             9  3.750000e+00       65.25
9            10 -1.200000e+01       57.00
10           11 -1.500000e+00       79.50
11           12  9.250000e+00       83.75
12           13 -1.000000e+01       59.00
13           14 -3.250000e+00       63.25
14           15  9.250000e+00       85.75
15           16  4.500000e+00       77.50