Regression Redress¶
abstract¶
Computers deal in precision, not in accuracy or science. We describe a statistical program that keeps some precision and gains much accuracy. One way to address bias is to add a constraint to the statistical modelling. Regression Redress controls bias by segregating the leftover correlation.
precision or accuracy¶
You or your machine can calculate numbers with many digits beyond the decimal point. Here are a few guidelines:
- Avoid rounding numbers while making calculations – you lose precision if you do.
- When reporting numbers, round to the unit level that you think you could realistically measure and recognize in nature.
- Also, think about the unit level that would be useful to make management decisions or conclusions.
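A minimal Python sketch of these guidelines, with made-up measurement values and a hypothetical reporting unit of one decimal:

```python
# Keep full precision during calculation; round only when reporting.
# The measurements and the reporting unit (one decimal) are made up.
measurements_mm = [12.3471, 12.3512, 12.3388]

mean_mm = sum(measurements_mm) / len(measurements_mm)  # no intermediate rounding
reported = round(mean_mm, 1)  # round once, at reporting time

print(mean_mm, reported)
```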
Precision is the quality of carrying many digits beyond the decimal point. In statistics, precision is the clustering of data about their own average. The indicators of dispersion are relevant only to the sampled population: the coefficient of variation or the variance (or the standard deviation, in the unit of measure). Statistical models of data are excellent at precision. They do not work on accuracy or truth.
Accuracy is how close a measurement or estimate is to the "true" value. Truth is not available in your machine.
accuracy and dispersion¶
Suppose that we mind the consequences of our studies. Under this hypothesis, we may welcome some variance in estimates if it helps responsibility in results. We prefer accuracy over precision. Trust is only a consequence.
regression analysis¶
Regression analysis is a set of statistical processes for designing estimators. In statistics, an estimator is a rule for calculating an estimate of an outcome (held in a "dependent variable") based on observed data (held in "explanatory variables").
The most common form of regression analysis is linear regression, in which the researcher designs a linear combination that fits the data.
Regression analysis focuses on the conditional probability distribution of the outcome given the values of the explanatory variables, rather than on the joint probability distribution of all arguments.
What is your purpose?
- Precision of machine learning: If the goal is variance reduction in prediction or forecasting, regression can be used to fit a predictive model to an observed data set of values of the dependent and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
- Accuracy of inference: If the goal is to interpret variation in the dependent variable that can be attributed to variation in the explanatory variables, regression can be applied to quantify the strength of the relationship between the response and the explanatory variables. Most times, there is a relationship between the arguments, which causes imprecision.
Often, linear regression analysis renders an attenuated slope: the estimated factors are too small. (This behaviour is called "regression dilution".) An example is represented in Figure 2. Solutions include:
- Total least squares is a type of errors-in-variables regression, a least squares data modeling technique, in which observational errors on both dependent and explanatory variables are taken into account.
- Regression Redress: ⚠️ we accept variance and we segregate the leftover correlation. Regression Redress tackles "bias" by containing corr(y, residuals).
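Regression dilution can be reproduced on synthetic data. The sketch below is our own illustration, not taken from a reference: adding measurement noise to the explanatory variable shrinks the least-squares slope toward zero.

```python
# Illustration of regression dilution on synthetic data:
# noise on the explanatory variable attenuates the least-squares slope.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x_true = rng.normal(0.0, 1.0, n)               # latent explanatory variable
y = 2.0 * x_true                               # true slope is 2
x_observed = x_true + rng.normal(0.0, 1.0, n)  # measurement noise, variance 1

slope_clean = np.polyfit(x_true, y, 1)[0]      # recovers the true slope
slope_noisy = np.polyfit(x_observed, y, 1)[0]  # attenuated toward zero

print(slope_clean, slope_noisy)
```

With equal signal and noise variances, the expected attenuation factor is 1/2, so the noisy slope lands near 1 instead of 2.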
motivation is bias control¶
In statistics, bias can be represented as the forces that push a measure or a prediction away from the "true" value. Here we take "bias" as "a systematic tendency in which the methods used to gather data and generate statistics present an inaccurate, skewed or biased depiction" (Wikipedia, CC BY-SA 4.0).
Bias at gathering 🚧️ We may pretend that measures are "true". This may seem reasonable in the laboratory but is certainly wrong in nature. The methods of recording observations induce inaccuracy:
- Statistical variability, measurement error or random noise in the dependent variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates a correct slope.
- However, variability, measurement error or random noise in an explanatory variable causes bias in the estimated slope (as well as imprecision).
Bias at generation 🚧 We have regression attenuation: the skewing of the relationship estimate towards zero.
demarcation¶
"A theory that has withstood rigorous testing should be deemed to have received a high measure of corroboration, and may be retained provisionally as the best available theory until it is finally falsified and/or is superseded by a better theory."
"Science, in Popper’s view, starts with problems rather than with observations—it is, indeed, precisely in the context of grappling with a problem that the scientist makes observations in the first instance: his observations are selectively designed to test the extent to which a given theory functions as a satisfactory solution to a given problem."
Stephen Thornton (2022)
fragility¶
It is practical but seldom justified to "correct" regressed coefficients by applying a multiplying factor to each. A regression model is considered to be conditional on known explanatory values. We may keep this illusion until we admit that the variables we observe are contaminated by (random or social) errors. Statistical justification of redress is possible in a few cases, including:
- when the source of the errors in observations is known and their variance σ²ᵥ can be calculated;
- when repeating measurements of the same unit are available and when the ratio θ = (σ²ᵥ + σ²ᵩ) / σ²ᵩ is known, where σ²ᵩ is the variance of the latent variable xᵩ: x = xᵩ + 𝑣 (and the error or noise 𝑣 is assumed to not depend on the true value xᵩ). In this case, a consistent estimate of slope is equal to the least-squares estimate multiplied by θ.
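A sketch of the second case on synthetic data, with made-up variances assumed known: the least-squares slope is multiplied by θ to recover a consistent estimate.

```python
# Sketch of the known-ratio correction; the variances below are made up
# and assumed known, as the text requires.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
var_latent, var_noise = 1.0, 0.5

x_latent = rng.normal(0.0, np.sqrt(var_latent), n)
y = 3.0 * x_latent + rng.normal(0.0, 0.1, n)           # true slope is 3
x = x_latent + rng.normal(0.0, np.sqrt(var_noise), n)  # contaminated observation

slope_ls = np.polyfit(x, y, 1)[0]              # attenuated least-squares slope
theta = (var_noise + var_latent) / var_latent  # the ratio from the text
slope_redressed = theta * slope_ls             # consistent estimate, near 3

print(slope_ls, slope_redressed)
```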
features¶
We follow the method and the qualities from Treder et al. (2021):
- Optimality. Formulating both statistical model training and redress as a single constrained optimization problem allows us to show that the resultant models are statistically optimal on the training dataset. In other words, of all potential solutions that control for the target, our solution has the highest accuracy (not precision).
- Interpretability. An advantage of unifying prediction and redress in a single statistical model is better interpretability because the entire operation of the model is represented by its estimated weights, and quantities derived from them. Furthermore, these quantities are not affected by the choice of the correlation bound (a hyperparameter in our model).
Method of Regression Redress¶
We model the distribution as response = deterministic plus unpredicted. We try to have as much of the explanatory power as possible reside in the deterministic component. Conversely, if we identify non-randomness in the second – supposedly stochastic – term, then the model underwhelms.
Our idea is that bias may reveal itself in the fit failing to capture an unknown relationship: the model leftovers are too "tied" to the dependent values. We choose two indicators: the residual (the difference between the value on record and the estimated value), and its correlation to the value on record.
To control for leftover correlation from the training data, we do the usual regression but add a constraint that caps the permitted magnitude of correlation between the dependent variable and the residuals. The same loss function as before is minimized. However, the set of feasible solutions is limited to solutions for which |corr(y, residuals)| – the absolute value of the correlation – does not exceed ρ, where ρ≥0 is the correlation bound selected by the user.
Leftover
|corr(y, residuals)| ≤ ρ
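A minimal sketch of the constraint on synthetic data, for the strictest bound ρ = 0. It assumes, as the section on interpretation of coefficients states, that the constrained solution is a scaled version of the least-squares fit, b = θβ; choosing θ = var(y) / cov(y, ŷ) then drives corr(y, residuals) to zero. This is an illustration, not the authors' published implementation; a general bound ρ > 0 would call for a constrained optimizer.

```python
# Regression Redress sketch for rho = 0 on synthetic data (illustration only).
# Assumption: the constrained solution is a scaled least-squares fit, b = theta * beta.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -0.7]) + rng.normal(0.0, 1.0, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)        # ordinary least squares
y_hat = X @ beta

theta = np.var(y, ddof=1) / np.cov(y, y_hat)[0, 1]  # redress factor for rho = 0
b = theta * beta                                    # redressed weights
residuals = y - X @ b

print(np.corrcoef(y, y - y_hat)[0, 1])              # positive leftover correlation
print(np.corrcoef(y, residuals)[0, 1])              # ~0 after redress
```

Under this setup, intermediate bounds ρ > 0 correspond to values of θ between 1 (plain least squares) and the value computed above.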
For assessing conformity to our arbitrary bound, we could choose any statistical association between outcome and residuals. An indicator robust to non-linearity would measure the dependence between the rankings of the two variables: Spearman's rank correlation, named after Charles Spearman, who graduated from the University of Leipzig, Germany, in 1906. Reference. Application and Python example are published at data.yt.
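A sketch of such a rank-based check using SciPy's `spearmanr`, with made-up numbers:

```python
# Spearman's rank correlation as a non-linearity-robust check on the leftover
# association between outcome and residuals (numbers made up for illustration).
from scipy.stats import spearmanr

y = [1.0, 2.0, 3.0, 4.0, 5.0]
residuals = [0.1, -0.2, 0.0, 0.3, -0.1]

rho, p_value = spearmanr(y, residuals)
print(rho)  # 0.0: no monotone association between the two rankings here
```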
Interpretation of coefficients¶
We set a hyperparameter ρ to put a maximum on |corr(y, residuals)|. Importantly, the choice of the correlation bound ρ does not change the interpretation. Let θ ∈ ℝ be the redress factor. Since the regressed weights in our models are just scaled versions of the original regressed weights, bᵢ = θβᵢ, the choice of ρ does not affect the ratio between any pair of weights.
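A tiny check with made-up weights: scaling by θ preserves every ratio between weights.

```python
# Made-up weights: scaling every weight by theta preserves all their ratios,
# so the interpretation of the model is unchanged.
beta = [1.5, -0.7, 0.2]
theta = 2.3  # arbitrary redress factor
b = [theta * w for w in beta]

print(b[0] / b[1], beta[0] / beta[1])  # identical ratios
```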
Scaling does not "level" the influence of each argument; it sets the standard deviation to 1. Scaling does not establish what is "small" or "important". In other words, βs do not account for reality.
Explanatory variables may have non-zero correlations with each other. Actually, in some multivariate analyses, collinearity is encouraged, say, for example, when operating a dependent variable with several similar measures. Nimon et al. (2010) noted that correlated arguments can “complicate result interpretation… a fact that has led many to bemoan the presence of multicollinearity among observed variables” (p. 707). A solution is to examine the relationships between variables:
A structure coefficient can be defined as the correlation between an explanatory variable and the synthetic variable ŷ = Xb (Thompson, 1988, 2006). Each structure coefficient can be calculated as the correlation between the estimate Xb and a column of X – where the i-th column is denoted Xᵢ. Since the correlation factor is invariant to constant shifts and scaling, we have corr(Xᵢ, Xb) = corr(Xᵢ, Xβ); that is, structure coefficients are invariant to the choice of ρ, the bound on |corr(y, residuals)|.
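A sketch of structure coefficients on synthetic data, including a check that scaling the weights by an arbitrary θ (mimicking a different choice of ρ) leaves them unchanged:

```python
# Structure coefficients: the correlation between each explanatory variable
# and the fitted values X @ b, on synthetic data. Scaling the weights
# (another choice of rho) leaves every structure coefficient unchanged.
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(0.0, 1.0, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

structure = [np.corrcoef(X[:, i], y_hat)[0, 1] for i in range(3)]
theta = 1.7  # arbitrary positive scaling of the weights
scaled = [np.corrcoef(X[:, i], X @ (theta * b))[0, 1] for i in range(3)]

print(structure)
```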
references¶
Brunner (2023). "Structural Equation Models"
Capelletti, Raimondo, de Nicolao (2022). "Regression dilution effects in wind power prediction from wind speed forecasts"
Frost, Jim (2024). "[Check Your Residual Plots to Ensure Trustworthy Regression Results!](https://statisticsbyjim.com/regression/check-residual-plots-regression-analysis/)"
Murphy, TJ (2022). "Statistical Design and Analysis of Experiments with R"
Nimon, K., Henson, R., and Gates, M. (2010). "Revisiting interpretation of canonical correlation analysis: a tutorial and demonstration of canonical commonality analysis". Multivariate Behav. Res. 45, 702–724.
Thompson, B. (1988). "Canonical correlation analysis: An explanation with comments on correct practice". Paper presented at the annual meeting of the American Educational Research Association, New Orleans
Thompson, B. (2006). "Foundations of behavioral statistics: An insight-based approach". Guilford Publications, state of New York
Thornton, Stephen (2022). "Karl Popper"
Treder et al. (2021). "Correlation Constraints for Regression Models: Controlling Bias in Brain Age Prediction"
attribution¶
The present document is authored by Eric Maugendre and is available for re-use under the CC BY-SA licence.