# Bayesian Robust Regression

Just as with Pearson’s correlation coefficient, the normality assumption adopted by classical regression methods makes them very sensitive to noisy or non-normal data. When the regression model has errors that follow a normal distribution, and a particular (conjugate) form of prior distribution is assumed, explicit results are available for the posterior distributions of the model’s parameters; the interesting question, however, is how such a model behaves when faced with noisy, non-normal data. From a probabilistic standpoint, the relationship between the variables is formalised as a probability distribution of the response around the regression line.

A robust alternative replaces the normal error distribution with a Student-t distribution. The variance of the Student-t distribution is a function of both the scale and the degrees-of-freedom parameters:

\[
\Var(X) = \frac{\nu}{\nu - 2} \sigma^2.
\]

(The asymmetric Laplace distribution (ALD), used in Bayesian quantile regression, is another alternative; however, the ALD displays medium tails and it is not suitable for data characterized by strong deviations from the Gaussian hypothesis.)

The plotting function used below draws the inferred linear regression and reports some handy posterior statistics on the parameters alpha (intercept), beta (slope) and y_pred (predicted values). We will also calculate the column medians of y.pred, which serve as posterior point estimates of the predicted response for the values in x.pred; such estimates should lie on the estimated regression line, as this line represents the predicted mean response.
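The variance formula above can be sanity-checked by simulation. The following is a minimal pure-Python sketch (the helper `sample_student_t` is ours, not from the original R/Stan code): it builds Student-t draws via the classic $Z / \sqrt{V/\nu}$ construction and compares the sample variance against $\nu \sigma^2 / (\nu - 2)$.

```python
import math
import random

def sample_student_t(nu, mu=0.0, sigma=1.0, rng=random):
    """One Student-t draw via Z / sqrt(V / nu), where Z is standard normal and
    V is chi-squared with nu df (built as a sum of nu squared normals, so this
    illustrative helper requires integer nu)."""
    z = rng.gauss(0.0, 1.0)
    v = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
    return mu + sigma * z / math.sqrt(v / nu)

rng = random.Random(42)
nu, sigma, n = 5, 2.0, 200_000
draws = [sample_student_t(nu, 0.0, sigma, rng) for _ in range(n)]

mean = sum(draws) / n
sample_var = sum((d - mean) ** 2 for d in draws) / (n - 1)
theoretical_var = nu / (nu - 2) * sigma ** 2  # nu/(nu-2) * sigma^2

print(sample_var, theoretical_var)
```

With a couple hundred thousand draws the two numbers agree to within a few percent, which is all a Monte Carlo check of this kind can promise.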
Like OLS, Bayesian linear regression with normally distributed errors is sensitive to outliers. The most common robust alternative replaces the normal likelihood with a Student-t likelihood (A. Gelman, Carlin, et al. 2013, Ch. 17):

\[
y_i \sim \dt\left(\nu, \mu_i, \sigma \right)
\]

Thus, by replacing the normal distribution above by a t-distribution, and incorporating ν as an extra parameter in the model, we can allow the distribution of the regression line to be as normal or non-normal as the data imply, while still capturing the underlying relationship between the variables. While several robust methods have been proposed in frequentist frameworks, statistical inference there is not necessarily straightforward; the Bayesian formulation handles it naturally.

As a worked example, we will construct a Bayesian model of simple linear regression, which uses Abdomen to predict the response variable Bodyfat. A nice property of this approach is the interpretation of its intervals: a Bayesian 95% prediction interval (which is just an HPD interval of the inferred distribution of y_pred) does not just mean that we are ‘confident’ that a given value of x should be paired to a value of y within that interval 95% of the time; it actually means that we have sampled random response values relating to that x-value through MCMC, and we have observed 95% of such values to be in that interval.

So, let’s run our Bayesian regression model on the clean data first. (Note that the model has to be compiled the first time it is run.) Since these data are somewhat too clean for my taste, let’s then sneak some extreme outliers in.
A related robust model for binary outcomes is the robit, which swaps the logistic link for the CDF of a Student-t distribution with ν degrees of freedom:

\[
\begin{aligned}[t]
y_i &\sim \dBinom \left(n_i, \pi_i \right) \\
\pi_i &= \int_{-\infty}^{\eta_i} \mathsf{StudentT}\left(x | \nu, 0, (\nu - 2)/ \nu \right) dx \\
\eta_i &= \alpha + X \beta
\end{aligned}
\]

Since the variance of a random variable distributed Student-t is $$\nu / (\nu - 2)$$ times the squared scale, this choice of scale fixes the variance of the link distribution at 1.

Returning to the regression example: the Stan code for the model is reproduced below, and can be found in the file robust_regression.stan. The traces show convergence of the four MCMC chains to the same distribution for each parameter, and we can see that the posterior of nu covers relatively large values, indicating that the data are normally distributed (remember that a t-distribution with high nu is equivalent to a normal distribution). For a given Abdominal circumference, our probability that the mean bodyfat percentage is in the interval given by the dotted lines is 0.95.
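The unit-variance claim for the robit link is a one-line algebraic fact, but it is easy to verify numerically. Here is a tiny Python sketch (the helper `t_variance` is ours, purely for illustration), checking that a Student-t with scale $\sqrt{(\nu - 2)/\nu}$ has variance exactly 1 for any ν > 2:

```python
import math

def t_variance(nu, scale):
    # Var of a Student-t with df nu and scale s is s**2 * nu / (nu - 2), nu > 2
    return scale ** 2 * nu / (nu - 2)

# The robit's scale sqrt((nu - 2) / nu) cancels the df-dependent factor,
# so the link distribution has variance 1 regardless of nu.
for nu in (3, 4, 7, 30):
    s = math.sqrt((nu - 2) / nu)
    print(nu, t_variance(nu, s))
```

Every printed variance is 1, which is what lets the robit's degrees-of-freedom parameter control tail heaviness without also rescaling the linear predictor.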
Robust regression refers to regression methods which are less sensitive to outliers, and Bayesian robust regression, being fully parametric, relies heavily on heavy-tailed distributions: it replaces the normal with distributions that have wider tails, such as the Student-t (whose degrees-of-freedom parameter is sometimes called the kurtosis parameter, and is either fixed to a low value, say $$\nu \in [3, 7]$$, or given a prior distribution) or the Laplace distribution, the Bayesian analog of least absolute deviations. To test the notion of robust regression, we create two models: one based on normally distributed observational errors, and a second based on the Student-t distribution, which we expect to be less influenced by outliers.

A very interesting detail is that, while the confidence intervals that are typically calculated in a conventional linear model are derived using a formula (which assumes the data to be normally distributed around the regression line), in the Bayesian approach we actually infer the parameters of the line’s distribution, and then draw random samples from this distribution in order to construct an empirical posterior probability interval. The same applies to the prediction intervals: while they are typically obtained through a formulation derived from a normality assumption, here MCMC sampling is used to obtain empirical distributions of response values drawn from the model’s posterior. The credible and prediction intervals reflect the distributions of mu_cred and y_pred, respectively.
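Constructing such an empirical interval from MCMC draws can be sketched in a few lines. The helper below (`hpd_interval`, our own illustrative name) finds the shortest interval containing a given fraction of the samples, analogous to what `coda::HPDinterval` computes in R; the demo uses an exponential distribution to show why HPD intervals differ from equal-tailed ones on skewed posteriors:

```python
import math
import random

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing `prob` of the samples (empirical HPD)."""
    s = sorted(samples)
    n = len(s)
    k = int(math.ceil(prob * n))             # number of points inside the interval
    widths = [s[i + k - 1] - s[i] for i in range(n - k + 1)]
    i = widths.index(min(widths))            # left edge of the narrowest window
    return s[i], s[i + k - 1]

# For Exponential(1) draws, the HPD interval hugs the mode at 0, whereas an
# equal-tailed interval would cut off the lowest 2.5% of the mass.
rng = random.Random(7)
draws = [rng.expovariate(1.0) for _ in range(100_000)]
lo, hi = hpd_interval(draws, 0.95)
print(lo, hi)   # lo should be near 0, hi near -log(0.05), roughly 3.0
```

The same function applied column-by-column to MCMC samples of mu_cred or y_pred yields the shaded bands plotted later.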
The equation for the line defines y (the response variable) as a linear function of x (the explanatory variable). In this equation, ε represents the error in the linear relationship: if no noise were allowed, then the paired x- and y-values would need to be arranged in a perfect straight line (for example, as in y = 2x + 1). Because heavy-tailed distributions assign non-negligible probability to extreme values, outliers have less of an effect on the log-posterior of models using them. If the noise introduced by the outliers were not accommodated in nu (that is, if we used a normal distribution), then it would have to be accommodated in the other parameters, resulting in a deviated regression line like the one estimated by the lm function. In fact, let’s compare the fit with the line inferred from the clean data by our model, and with the line estimated by the conventional linear model (lm).

A reasonable prior for the degrees-of-freedom parameter is $$\nu \sim \dgamma(2, 0.1)$$. This density places the majority of the prior mass on values $$\nu < 50$$, the region in which the Student-t distribution is substantively different from the normal distribution. Since the mean of a Student-t exists only for $$\nu > 1$$ and the variance only for $$\nu > 2$$, it is often useful to restrict the support of $$\nu$$ to at least 1 or 2 (or even higher) to ensure the existence of a mean or variance. Stan Development Team (2016) discusses reparameterizing the Student-t distribution as a mixture of gamma distributions in Stan. A different form of robust regression, and one that often serves a different purpose, is quantile regression, which models conditional quantiles rather than the conditional mean.

The arguments cred.int and pred.int indicate the posterior probability of the intervals to be plotted (by default, 95% for ‘credible’ (HPD) intervals around the line, and 90% for prediction intervals).
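The claim that a gamma(2, 0.1) prior concentrates its mass below ν = 50 can be checked in closed form, since for integer shape 2 the gamma CDF is elementary. A small Python check (the helper `gamma2_cdf` is ours, valid only for shape 2):

```python
import math

def gamma2_cdf(x, rate):
    # CDF of Gamma(shape=2, rate) in closed form: F(x) = 1 - exp(-r*x) * (1 + r*x)
    return 1.0 - math.exp(-rate * x) * (1.0 + rate * x)

mass_below_50 = gamma2_cdf(50.0, 0.1)
print(round(mass_below_50, 3))  # about 0.96 of the prior mass lies below nu = 50
```

So roughly 96% of the prior probability sits in the region where the t-distribution behaves noticeably differently from the normal, while the remaining tail still allows the data to push ν toward effectively-normal values.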
A Stan model that implements the scale-mixture-of-normals representation of the Student-t distribution is lm_student_t_2.stan. A caveat of such reparameterizations is that they can create posterior correlations between the parameters, making it more difficult to sample the posterior distribution.

Although linear regression models are fundamental tools in statistical science, the estimation results can be sensitive to outliers. Historically, robust models were mostly developed on a case-by-case basis; examples include robust linear regression, robust mixture models, and bursty topic models.

The columns of y.pred contain the MCMC samples of the randomly drawn y_pred values (posterior predicted response values) for the x-values in x.pred. Let’s see those credible intervals; in fact, we’ll plot highest posterior density (HPD) intervals instead of equal-tailed credible intervals, as they are more informative and easy to obtain with the coda package. These HPD intervals can be seen as a more realistic, data-driven measure of the uncertainty concerning the position of the regression line. The arguments iter, warmup, chains and seed are passed to the stan function and can be used to customise the sampling.
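The scale-mixture representation says that if $\lambda^{-2} \sim \dgamma(\nu/2, \nu/2)$ and $y \mid \lambda \sim \mathsf{Normal}(\mu, \lambda\sigma)$, then marginally $y \sim \dt(\nu, \mu, \sigma)$. A pure-Python Monte Carlo sketch of that equivalence (variable names are ours; `gammavariate` takes a shape and a *scale*, so the rate ν/2 becomes scale 2/ν), checking the marginal variance against the Student-t formula:

```python
import math
import random

rng = random.Random(123)
nu, mu, sigma, n = 8, 0.0, 1.5, 200_000

draws = []
for _ in range(n):
    # lambda^-2 ~ Gamma(shape=nu/2, rate=nu/2)  =>  gammavariate(nu/2, scale=2/nu)
    inv_lambda_sq = rng.gammavariate(nu / 2.0, 2.0 / nu)
    lam = 1.0 / math.sqrt(inv_lambda_sq)
    # y | lambda ~ Normal(mu, lam * sigma): a normal with a randomly inflated scale
    draws.append(rng.gauss(mu, lam * sigma))

mean = sum(draws) / n
var = sum((d - mean) ** 2 for d in draws) / (n - 1)
print(var, nu / (nu - 2) * sigma ** 2)  # both should be close for nu=8, sigma=1.5
```

The mixture draws reproduce the Student-t variance ν σ²/(ν − 2) up to Monte Carlo error, which is the sense in which the Stan reparameterization and the direct t-likelihood describe the same model.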
In the Student-t regression model above, $$\nu \in \R^{+}$$ is a degrees-of-freedom parameter, $$\mu_i \in \R$$ are observation-specific locations often modeled with a regression, and $$\sigma \in \R^{+}$$ is a scale parameter. The Student-t distribution asymptotically approaches the normal distribution as the degrees of freedom increases. In the standard model, by contrast, the response variable follows a normal distribution with mean equal to the regression line and some standard deviation σ; OLS, correspondingly, is a model of the conditional mean $$E(y | x)$$. What we need instead is a model that recognises the linear relationship present in the data, while accounting for the outliers as infrequent, atypical observations. The formulation of the robust simple linear regression Bayesian model is given below.

In Bayesian statistics the correlation model can likewise be made robust to outliers quite easily, by replacing the bivariate normal distribution by a bivariate Student’s t-distribution, as Bååth explains in his second post on Bayesian correlation.

More specifically, the credible intervals are obtained by drawing MCMC samples of the mean response (mu_cred = alpha + beta * x_cred) at regularly spaced points along the x-axis (x_cred), while the prediction intervals are obtained by first drawing samples of the mean response (mu_pred) at particular x-values of interest (x_pred), and then, for each of these samples, drawing a random y-value (y_pred) from a t-distribution with location mu_pred (see the model code above). What we need are the HPD intervals derived from each column, which will give us the higher and lower ends of the interval to plot at each point.
The standard approach to linear regression is defining the equation for a straight line that represents the relationship between the variables as accurately as possible. Now, the normally-distributed-error assumption of the standard linear regression model doesn’t deal well with non-normal outliers (as they indeed break the model’s assumption), and so the estimated regression line comes to a disagreement with the relationship displayed by the bulk of the data points. This frequently results in an underestimation of the relationship between the variables, as the normal distribution needs to shift its location in parameter space in order to accommodate the outliers as well as possible. Heavy-tailed error distributions avoid this, and since the degrees of freedom are inferred from the data, one can use this approach without having any extra prior knowledge about the dataset. (The double exponential (Laplace) distribution is another robust option, although it still has a thinner tail than the Student-t at higher values.)

In the body fat example, we regress Bodyfat on the predictor Abdomen. For a new man with a given Abdominal circumference, our probability that his bodyfat percentage is in the interval given by the dashed lines is 0.95.

Exercises: write a user function to calculate the Student-t log-PDF; implement the model as a scale mixture of normal distributions; and estimate some examples with known outliers, comparing against a normal likelihood. For more on robust regression see A. Gelman and Hill (2007, Sec. 6.6), A. Gelman, Carlin, et al. (2013), and Liu (2005); for more on heteroskedasticity see A. Gelman, Carlin, et al. (2013, Sec. 14.7).
The simplest way of estimating parameters in a regression model that is less sensitive to outliers than the least squares estimates is to use least absolute deviations; the Laplace error distribution is its Bayesian analog. The most commonly used Bayesian model for robust regression, however, is a linear regression with independent Student-t errors (Geweke 1993; A. Gelman, Carlin, et al. 2013, Ch. 17). This is because the normal distribution has narrow tail probabilities, with approximately 99.8% of the probability within three standard deviations, while the Student-t places far more mass in the tails. A parameterization in which σ is the standard deviation rather than the scale is

\[
y_i \sim \dt\left(\nu, \mu_i, \sigma \sqrt{\frac{\nu - 2}{\nu}} \right)
\]

A reasonable prior distribution for the degrees-of-freedom parameter is a gamma distribution with shape parameter 2 and an inverse-scale (rate) parameter of 0.1 (Juárez and Steel 2010; @Stan-prior-choices). Even then, gross outliers can still have a considerable impact on the model, motivating research into even more robust approaches.

Let’s plot the regression line from this model, using the posterior mean estimates of alpha and beta; such a probability distribution of the regression line is illustrated in the figure below. The effect of the outliers is much more severe in the line inferred by the lm function from the noisy data (orange).
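The contrast between least squares and least absolute deviations is easy to see on a toy dataset. A pure-Python sketch (the data, the through-the-origin restriction, and the helper `lad_loss` are all our own illustrative choices): nine points lie exactly on y = 2x and one observation is grossly corrupted.

```python
# Nine points on y = 2x plus one gross outlier; slope fitted through the origin.
xs = list(range(1, 11))
ys = [2.0 * x for x in xs]
ys[-1] = -50.0                       # corrupt the last observation

# Closed-form least-squares slope (no intercept): sum(x*y) / sum(x*x)
b_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Least-absolute-deviations slope by brute-force grid search over candidates
def lad_loss(b):
    return sum(abs(y - b * x) for x, y in zip(xs, ys))

b_lad = min((i / 100.0 for i in range(301)), key=lad_loss)

print(b_ls, b_lad)   # b_ls is dragged far below 2 by the outlier; b_lad stays at 2.0
```

The squared loss lets the single outlier dominate, pulling the least-squares slope well below the true value of 2, while the absolute-deviation loss leaves the fitted slope at 2.0; the Student-t likelihood achieves a similar effect while retaining a fully generative probability model.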
MCMC sampling can take a while, depending on the number of iterations and chains we use, but for a model this small it shouldn’t be long; some unimportant warning messages might show up during compilation, before sampling starts. If sampling problems do appear, the pairs() plot is a useful diagnostic (Stan Development Team 2016, Sec. 8.4).

On the clean data, the robust model fits the normally distributed data just as well as the standard linear model fitted with lm(). On the noisy data, the estimates of alpha and beta haven’t changed that much, but notice the difference in the posterior of nu: small values of nu indicate that the t-distribution has heavy tails this time, in order to accommodate the outliers. Because the error distribution does not consider those extreme observations unusual, the robust model will more or less ignore them, and the inferred regression line stays on the spot. We’ll also take the opportunity to obtain prediction intervals for a couple of arbitrary x-values.

The arguments xlab and ylab are passed to the plot function, and can be used to specify the axis labels for the plot. If no prediction of new values is needed, the x.pred argument can simply be omitted.

Now, what’s your excuse for sticking with conventional linear regression?