Linear model
daspi.anova.model.LinearModel(source, target, features, covariates=[], alpha=0.05, order=1, skip_intercept_as_least=True, generalized_vif=True, encode_features=True, fit_at_init=True)
¶
Bases: BaseHTMLReprModel
This class is used to create and simplify linear models so that only significant features describe the model.
Balanced models (DOEs or EVOPs) including continuous variables can be analyzed. With this class, you can create an encoded design matrix with all factor levels, including their interactions. All non-significant factors can then be eliminated automatically. Furthermore, this class allows a closer examination of the main effects, the sums of squares (explained variation), and the ANOVA table.
| PARAMETER | DESCRIPTION |
|---|---|
| `source` | Pandas DataFrame as tabular data in long format used for the model. |
| `target` | Column name of the endogenous variable. |
| `features` | Column names of the exogenous variables that can be actively changed (factor levels for DOE or EVOP). Interactions are also created for these variables if `order` is set > 1. |
| `covariates` | Column names of exogenous variables that are logged but cannot be influenced. No interactions are created for these variables. |
| `alpha` | Threshold as alpha risk. All features, including continuous ones and the intercept, whose p-value is greater than alpha are removed during the automatic elimination of the factors. Default is 0.05. |
| `skip_intercept_as_least` | If True, the intercept is not removed as the least significant term during recursive feature elimination. Also if True, the intercept does not appear when calling the `least_parameter` method. |
| `encode_features` | If True, all provided feature variables are one-hot encoded by changing their data type to category. Otherwise they are interpreted as continuous variables where possible. Default is True. |
| `fit_at_init` | If True, the model is fitted at initialization. Otherwise, the model is fitted after calling the `fit` method. |
Examples:
Do some ANOVA and statistics on a dataset. Run the example below in a Jupyter Notebook to see the results.

```python
import pandas as pd
import daspi as dsp

df = dsp.load_dataset('painkillers-dissolution')
model = dsp.LinearModel(
    source=df,
    target='dissolution',
    features=['employee', 'stirrer', 'brand', 'catalyst', 'water'],
    covariates=['temperature', 'preparation'],
    order=2)
# Store goodness-of-fit values for each elimination step
df_gof = pd.concat(model.recursive_elimination())
# Plot residual and parameter relevance analyses
dsp.ResidualsCharts(model).plot().stripes().label(info=True)
dsp.ParameterRelevanceCharts(model).plot().label(info=True)
# Show the HTML representation of the model
model
```
Notes
Always be careful when removing the intercept. If the intercept is missing, Patsy will automatically add all one-hot encoded levels for a categorical variable to compensate for this missing term. This will result in extremely high VIFs. That means that the parameters are not linearly independent.
Terminology
- feature: The original name as it appears in the provided data.
- term: The name of the term as it appears in the design matrix. All feature names are converted to "xi", where i is an ascending number: the first feature becomes "x0", the second "x1", and so on. Covariates are converted in the same way, but with the letter "e" instead of "x".
- parameter: The variable names in connection with a coefficient, but composed of the original feature names.
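The naming scheme in the term definition above can be sketched as a plain dict comprehension (a minimal illustration only; the feature and covariate names below are taken from the examples further down and are not special):

```python
# Illustrative sketch of the feature/covariate -> term naming scheme
features = ['employee', 'stirrer', 'brand']
covariates = ['temperature', 'preparation']

feature_map = {f: f'x{i}' for i, f in enumerate(features)}
feature_map |= {c: f'e{i}' for i, c in enumerate(covariates)}

print(feature_map)
# {'employee': 'x0', 'stirrer': 'x1', 'brand': 'x2',
#  'temperature': 'e0', 'preparation': 'e1'}
```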
target = target
instance-attribute
¶
The name of the target variable for the linear model.
features = features
instance-attribute
¶
The list of features used in the linear model.
covariates = covariates
instance-attribute
¶
The list of covariate variables used in the linear model.
target_map = {target: 'y'}
instance-attribute
¶
A dictionary that maps the original target name to the encoded name 'y' used in the model.
feature_map = {f: _f for f, _f in (zip(features, f_main_terms))} | {c: _c for c, _c in (zip(covariates, d_main_terms))}
instance-attribute
¶
A dictionary that maps the original feature names to the encoded names (term) used in the model.
main_term_map = {v: k for k, v in (self.feature_map.items())}
instance-attribute
¶
A dictionary that maps the main term names (no interactions) back to the original feature names used in the model.
skip_intercept_as_least = skip_intercept_as_least
instance-attribute
¶
If True, the intercept is not treated as a least significant term
generalized_vif = generalized_vif
instance-attribute
¶
If True, the generalized VIF is calculated when possible. Otherwise, only the straightforward VIF (via R2) is calculated.
excluded = set()
instance-attribute
¶
A set of feature names that should be excluded from the model.
data = source.rename(columns=(self.feature_map | self.target_map))[list(self.feature_map.values()) + list(self.target_map.values())].copy()
instance-attribute
¶
The Pandas DataFrame containing the data for the linear model.
fitted
property
¶
Whether the model is fitted (read-only).
model
property
¶
Get regression results of fitted model. Calls fit method
if no model is fitted yet (read-only).
initial_formula
property
¶
Get the initial formula for the ANOVA model (read-only).
The initial formula is constructed from the target variable and the initial terms, which are the terms with an interaction order less than or equal to the specified order.
uncertainty
property
¶
Get uncertainty of the model as square root of MS_Residual (read-only).
alpha
property
writable
¶
Alpha risk as significance threshold for the p-values of exogenous factors.
design_info
property
¶
Get the DesignInfo instance of current fitted model (read-only).
terms
property
¶
Get the encoded names of all variables for the current fitted model (read-only).
parameters
property
¶
Get the names of all variables for the current fitted model, composed of the original feature and covariate names (read-only).
term_map
property
¶
Get the internally used and original term names for the current fitted model as a dict (read-only).
main_parameters
property
¶
Get all main parameters of current model excluding intercept (read-only).
formula
property
¶
Get the formula used for the linear model, excluding any
factors listed in the excluded attribute. If the intercept is
excluded, the formula includes '-1' as the first term.
Otherwise, the formula consists of the remaining original terms
(excluding the excluded factors) separated by '+' (read-only).
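The construction described above can be roughly sketched as follows (a simplification: the real property also handles interactions and covariates, and the `'Intercept'` term name is assumed here):

```python
def build_formula(target: str, terms: list[str], excluded: set[str]) -> str:
    """Sketch: join the remaining terms with '+', prepend '-1' when
    the intercept is excluded (term names follow the xi convention)."""
    kept = [t for t in terms if t not in excluded]
    rhs = (['-1'] if 'Intercept' in excluded else []) + kept
    return f"{target} ~ {' + '.join(rhs)}"

print(build_formula('y', ['x0', 'x1', 'x0:x1'], {'x0:x1'}))
# y ~ x0 + x1
```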
effect_threshold
property
¶
Calculates the threshold for the effect of adding a term to the model, computed as the inverse survival function (isf) at the given alpha (read-only).
design_matrix
property
¶
Get the design matrix of the current fitted model (read-only).
is_hierarchical()
¶
Check if current fitted model is hierarchical.
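As a rough illustration of what "hierarchical" means here: a model is hierarchical if every interaction term is accompanied by all of its lower-order component terms. Assuming ':'-separated interaction names (as in Patsy-style formulas), such a check could look like this sketch:

```python
from itertools import combinations

def is_hierarchical(terms: list[str]) -> bool:
    """Sketch: True if every interaction's lower-order sub-terms
    are also present in the model."""
    present = set(terms)
    for term in terms:
        factors = term.split(':')
        # Every sub-combination of the interaction must be a model term
        for r in range(1, len(factors)):
            for sub in combinations(factors, r):
                if ':'.join(sub) not in present:
                    return False
    return True

print(is_hierarchical(['x0', 'x1', 'x0:x1']))  # True
print(is_hierarchical(['x0', 'x0:x1']))        # False, 'x1' is missing
```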
effects()
¶
Calculates the impact of each term on the target. The effect of a term is the absolute value of its parameter coefficient divided by its standard error.
p_values()
¶
Get the p-values for the significance of the model terms from the ANOVA Type-III table of the current model.
least_parameter()
¶
Get the parameter name with the least effect or the least significant p-value from an ANOVA Type-III table of the current fitted model.
| RETURNS | DESCRIPTION |
|---|---|
| `str` | The name of the least significant parameter by p-value if possible, otherwise the parameter with the smallest effect. |
Notes
This method checks whether any p-values are missing (NaN). If so, it returns the term that has the smallest effect on the target variable. Otherwise, it returns the term whose F-statistic p-value in the current ANOVA table indicates the least significance.
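The selection logic described in the note can be sketched as follows (an illustration only, assuming "least significant" corresponds to the largest p-value; the term names are hypothetical):

```python
import math

def least_parameter(p_values: dict[str, float],
                    effects: dict[str, float]) -> str:
    """Sketch: fall back to the smallest effect when any p-value is
    missing, otherwise pick the least significant (largest) p-value."""
    if any(math.isnan(p) for p in p_values.values()):
        return min(effects, key=effects.__getitem__)
    return max(p_values, key=p_values.__getitem__)

print(least_parameter({'x0': 0.01, 'x1': 0.70}, {'x0': 5.2, 'x1': 0.4}))
# x1
```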
fit(**kwds)
¶
Create and fit an ordinary least squares model using the current
formula. To fit with a user-defined formula, use the
`formula` keyword argument.
| PARAMETER | DESCRIPTION |
|---|---|
| `**kwds` | Pass `formula` and other keyword arguments through to the underlying fit. |
r2_pred()
¶
Calculate the predicted R-squared (R2_pred) for the fitted model.
| RETURNS | DESCRIPTION |
|---|---|
| `float` | The predicted R-squared value for the fitted model. |
Notes
The predicted R-squared is a measure of how well the model would predict new observations. It is calculated as:
R2_pred = 1 - (Sum of Squared Prediction Errors / Total Sum of Squares)
The prediction errors are the residuals divided by (1 - leverage), where leverage is the diagonal of the projection matrix P = X(X'X)^(-1)X'.
References
Calculations are made according to: https://support.minitab.com/de-de/minitab/help-and-how-to/statistical-modeling/doe/how-to/factorial/analyze-factorial-design/methods-and-formulas/goodness-of-fit-statistics/#press
Further information about projection matrix (influence matrix or hat matrix) can be found in: https://en.wikipedia.org/wiki/Projection_matrix
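The PRESS-based calculation above can be sketched with plain NumPy (a minimal sketch under the stated formulas, not the library's implementation):

```python
import numpy as np

def r2_pred(X: np.ndarray, y: np.ndarray) -> float:
    """Predicted R-squared via PRESS for an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    # Leverage: diagonal of the hat matrix P = X (X'X)^-1 X'
    hat_core = np.linalg.pinv(X.T @ X)
    leverage = np.einsum('ij,jk,ik->i', X, hat_core, X)
    press = np.sum((residuals / (1.0 - leverage)) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

# Toy data: y is almost perfectly linear in x
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 3.0 + 2.0 * x + np.array([0.1, -0.2, 0.05, 0.0, 0.1,
                              -0.1, 0.2, -0.05, 0.0, -0.1])
print(round(r2_pred(X, y), 4))
```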
gof_metrics(index=0)
¶
Get different goodness-of-fit metrics (read-only).
| PARAMETER | DESCRIPTION |
|---|---|
| `index` | Value set as the index of the returned row. When using the method `recursive_elimination`, the current step is passed as index. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The goodness-of-fit metrics table containing the columns: 'formula' (current formula), 's' (uncertainty of the model as square root of MS_Residual), 'aic' (Akaike's information criterion), 'r2' (R-squared of the model), 'r2_adj' (adjusted R-squared), 'least_parameter' (the least significant term), 'p_least' (p-value of the least significant term from the Type-III ANOVA table), 'hierarchical' (True if the model is hierarchical). |
summary(anova_typ=None, vif=True, **kwds)
¶
Generate a summary of the fitted model.
| PARAMETER | DESCRIPTION |
|---|---|
| `anova_typ` | If not None, add an ANOVA table of the given type to the summary. Default is None. |
| `vif` | If True, variance inflation factors (VIF) are added to the ANOVA table. Only considered if `anova_typ` is not None. Default is True. |
| `**kwds` | Additional keyword arguments passed through to the underlying summary. |
| RETURNS | DESCRIPTION |
|---|---|
| `Summary` | A summary object containing information about the fitted model. |
eliminate(parameter)
¶
Removes the given parameter from the model by adding it to the excluded set. Call `fit` to refit the model.
| PARAMETER | DESCRIPTION |
|---|---|
| `parameter` | The feature name, covariate name, or interaction of multiple features to be removed from the model. |
| RETURNS | DESCRIPTION |
|---|---|
| `Self` | The current instance of the model, for method chaining. |
Examples:
Prepare a LinearModel instance, fit the model and plot the relevance of the parameters. If you run the following code in a Jupyter Notebook, the plot and the HTML representation of the model will be displayed.

```python
import daspi as dsp

df = dsp.load_dataset('painkillers-dissolution')
lm = dsp.LinearModel(
    source=df,
    target='dissolution',
    features=['employee', 'stirrer', 'brand', 'catalyst', 'water'],
    covariates=['temperature', 'preparation'],
    alpha=0.05,
    order=3,
    encode_features=False
)
dsp.ParameterRelevanceCharts(lm).plot().stripes().label()
lm
```

To add a feature back to the model, use the `include` method.
For automatic elimination of insignificant terms, use the
`recursive_elimination` method.
Notes
Always be careful when removing the intercept. If the intercept is missing, Patsy will automatically add all one-hot encoded levels for a categorical variable to compensate for this missing term. This will result in extremely high VIFs.
| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If the given term is not in the model. |
include(parameter)
¶
Adds the given parameter to the model by removing it from the excluded set. Call `fit` to refit the model.
| PARAMETER | DESCRIPTION |
|---|---|
| `parameter` | The feature name, covariate name, or interaction of multiple features to be added to the model. |
| RETURNS | DESCRIPTION |
|---|---|
| `Self` | The current instance of the model, for method chaining. |
Examples:
See the `eliminate` method. After removing a term from the model, you can add it back using the `include` method.
recursive_elimination(rsquared_max=0.99, ensure_hierarchy=True, **kwds)
¶
Perform a recursive parameter elimination on the fitted model.
This function starts with the complete model and recursively eliminates parameters based on their p-values, until only significant parameters remain in the model. The function yields the goodness-of-fit metrics at each step of the elimination process.
| PARAMETER | DESCRIPTION |
|---|---|
| `rsquared_max` | If given, the model must have an R² value lower than this threshold. Default is 0.99. |
| `ensure_hierarchy` | If True, features are added at the end to ensure the model is hierarchical. Default is True. |
| `**kwds` | Additional keyword arguments passed through to `fit`. |
| YIELDS | DESCRIPTION |
|---|---|
| `DataFrame` | The goodness-of-fit metrics at each step of the recursive feature elimination. |
Examples:
Prepare a LinearModel instance, fit the model, automatically eliminate insignificant terms, and plot the relevance of the parameters and the residuals. In the DataFrame `df_gof` you can see which feature was removed at each step, together with the goodness-of-fit values at that step. If you run the following code in a Jupyter Notebook, the plots and the HTML representation of the model will be displayed.

```python
import pandas as pd
import daspi as dsp

df = dsp.load_dataset('painkillers-dissolution')
lm = dsp.LinearModel(
    source=df,
    target='dissolution',
    features=['employee', 'stirrer', 'brand', 'catalyst', 'water'],
    covariates=['temperature', 'preparation'],
    alpha=0.05,
    order=3,
    encode_features=False
)
df_gof = pd.concat(list(lm.recursive_elimination()))
dsp.ParameterRelevanceCharts(lm).plot().stripes().label(info=True)
dsp.ResidualsCharts(lm).plot().stripes().label(info=True)
lm
```
anova(typ='', vif=False)
¶
Perform an analysis of variance (ANOVA) on the fitted model.
| PARAMETER | DESCRIPTION |
|---|---|
| `typ` | The type of ANOVA to perform; see the notes for more information about the types. '' (default): if no or an invalid type is specified, Type-II is used if the model has no significant interactions; otherwise Type-III is used for hierarchical models and Type-I for non-hierarchical models. 'I': Type I sum of squares ANOVA. 'II': Type II sum of squares ANOVA. 'III': Type III sum of squares ANOVA. |
| `vif` | If True, variance inflation factors (VIF) are added to the ANOVA table. Default is False. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The ANOVA table containing the columns: DF (degrees of freedom for model terms), SS (sum of squares), F (F statistic for the significance of adding the term), p (p-value), n2 (eta-squared as effect size, the proportion of explained variance), np2 (partial eta-squared as partial effect size). |
Notes
The ANOVA table provides information about the significance of each factor and interaction in the model. The type of ANOVA determines how the sum of squares is partitioned among the factors.
SAS and also Minitab use Type III by default. It is the only type that yields an SS and a p-value for the intercept, which is why Type-III is also used internally for evaluating the least significant term. A discussion on which type to use can be found here: https://stats.stackexchange.com/a/93031
A concise summary of the differences between the types:
- Type-I: The most "important" independent variable is chosen first and receives the maximum amount of variation possible.
- Type-II: The shared variation is ignored; no interaction is assumed. If this holds, the Type II sums of squares are statistically more powerful. If an interaction effect does exist, however, the model is wrong and the conclusions of the analysis become problematic.
- Type-III: If there is an interaction effect and an "equal" split between the independent variables is wanted, Type-III should be used.
source: https://towardsdatascience.com/anovas-three-types-of-estimating-sums-of-squares-don-t-make-the-wrong-choice-91107c77a27a
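The types only differ when several non-orthogonal factors compete for shared variation; for a single factor all three coincide. As a minimal, library-free illustration of the sum-of-squares partitioning itself:

```python
def one_way_anova(groups: list[list[float]]) -> tuple[float, float, float]:
    """Sketch: between/within sums of squares and F for one factor.
    With a single factor, Type-I/II/III all give the same SS."""
    all_y = [v for g in groups for v in g]
    grand = sum(all_y) / len(all_y)
    ss_total = sum((v - grand) ** 2 for v in all_y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = ss_total - ss_between
    df_b = len(groups) - 1
    df_w = len(all_y) - len(groups)
    f_stat = (ss_between / df_b) / (ss_within / df_w)
    return ss_between, ss_within, f_stat

ss_b, ss_w, f = one_way_anova([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(ss_b, ss_w, f)  # 13.5 4.0 13.5
```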
parameter_statistics(alpha=0.05, use_t=True)
¶
Calculate the parameter statistics for the fitted model.
| PARAMETER | DESCRIPTION |
|---|---|
| `alpha` | The significance level for the confidence intervals. Default is 0.05. |
| `use_t` | If True, use the t-distribution for hypothesis tests and confidence intervals; if False, use the normal distribution. Default is True. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The parameter statistics for the fitted model, with columns for the parameter estimates, standard errors, t-values (or z-values), and p-values. |
variance_inflation_factor(threshold=5)
¶
Calculate the variance inflation factor (VIF) and the generalized variance inflation factor (GVIF) for each predictor variable in the fitted model.
This function takes a regression model as input and returns a DataFrame containing the VIF, the GVIF (= VIF^(1/(2*dof))), the threshold for the GVIF, the collinearity status, and the calculation kind for each predictor variable in the model. The VIF and GVIF are measures of multicollinearity and help identify variables that are highly correlated with each other.
| PARAMETER | DESCRIPTION |
|---|---|
| `threshold` | The threshold for deciding whether a predictor is collinear. Common values are 5 and 10. Default is 5. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The VIF, GVIF, threshold, collinearity status and method used for each predictor variable in the model. |
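The straightforward (R²-based) VIF mentioned in the class parameters can be sketched with NumPy: regress each predictor on all the others and take 1/(1 - R²). A sketch, not the library's implementation:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """R-squared based VIF for each column of X (X without intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all other columns
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ beta) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

# Orthogonal (balanced two-level) design -> no inflation, VIF == 1
X = np.array([[1., 1.], [-1., 1.], [1., -1.], [-1., -1.]])
print(vif(X))  # [1. 1.]
```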
highest_parameters(features_only=False)
¶
Determines all main and interaction parameters that do not occur within a higher-order interaction in the same constellation.
| PARAMETER | DESCRIPTION |
|---|---|
| `features_only` | If True, the intercept and interaction parameters are not returned. Default is False. |
| RETURNS | DESCRIPTION |
|---|---|
| `List[str]` | List of highest parameters. |
has_insignificant_term(rsquared_max=0.99)
¶
Check if the fitted model has any insignificant terms.
| PARAMETER | DESCRIPTION |
|---|---|
| `rsquared_max` | The maximum R² value the model may have to be considered significant. Default is 0.99. |
| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if the model has any insignificant terms, False otherwise. |
has_significant_interactions()
¶
True if fitted model has significant interactions.
predict(xs)
¶
Predict y for the given xs. Ensure that all non-interaction parameters are given in xs.
| PARAMETER | DESCRIPTION |
|---|---|
| `xs` | The values for which to predict. Make sure that all non-interaction parameters are given in xs. If multiple values are to be predicted, provide a list of values for each factor level. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The predicted values for the given values of the predictor variables. |
optimize(maximize=True, bounds={})
¶
Optimize the prediction by optimizing the parameters.
| PARAMETER | DESCRIPTION |
|---|---|
| `maximize` | Whether to maximize the prediction or minimize it. |
| `bounds` | Bounds for the parameters to optimize. You can freeze a parameter by setting it to the desired value, e.g. `bounds = {'param_name': 1}` fixes it to 1. To keep an ordinal or metric parameter within a specific range, set `bounds = {'param_name': (lower, upper)}`, e.g. `(0, 1)` constrains it to be between 0 and 1. To limit a nominal parameter to a subset of its originally contained values, set `bounds = {'param_name': (val1, val2, ...)}`, e.g. `('A', 'B')` limits the selection to 'A' and 'B'. |
| RETURNS | DESCRIPTION |
|---|---|
| `Dict[str, Any]` | The optimized parameters. |
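For discrete factor levels, the kind of search described above can be illustrated by brute force over all level combinations (an illustrative sketch only; the actual method and its predict interface may differ, and the factor names and response below are hypothetical):

```python
from itertools import product

def optimize(predict, levels: dict, maximize: bool = True) -> dict:
    """Sketch: evaluate predict() on every combination of factor
    levels and return the best parameter setting."""
    candidates = [dict(zip(levels, combo))
                  for combo in product(*levels.values())]
    pick = max if maximize else min
    return pick(candidates, key=predict)

# Hypothetical response over one metric and one nominal factor
predict = lambda xs: 2.0 * xs['water'] + (1.0 if xs['catalyst'] == 'A' else 0.0)
best = optimize(predict, {'water': [0, 1, 2], 'catalyst': ['A', 'B']})
print(best)  # {'water': 2, 'catalyst': 'A'}
```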
residual_data()
¶
Get the residual data from the fitted model.
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame` | The residual data containing the residuals, observation index, and predicted values. |
Examples:

```python
import daspi as dsp

df = dsp.load_dataset('partial_factorial')
target = 'Yield'
features = [c for c in df.columns if c != target]
lm = dsp.LinearModel(df, target, features)
print(lm.residual_data())
```

```
    Observation     Residuals  Prediction
0             1  9.250000e+00       46.75
1             2  2.000000e+00       51.00
2             3 -1.050000e+01       73.50
3             4 -2.500000e-01       65.25
4             5 -1.421085e-14       53.00
5             6  1.025000e+01       44.75
6             7 -2.500000e-01       67.25
7             8 -1.050000e+01       71.50
8             9  3.750000e+00       65.25
9            10 -1.200000e+01       57.00
10           11 -1.500000e+00       79.50
11           12  9.250000e+00       83.75
12           13 -1.000000e+01       59.00
13           14 -3.250000e+00       63.25
14           15  9.250000e+00       85.75
15           16  4.500000e+00       77.50
```