STAT 331: Applied Linear Models

3 Multiple Linear Regression: Case Studies

In the following examples, we shall cover some of the basic tools and techniques of regression analysis in \(\R\) .

3.1 Reading a Regression Output

The dataset satisfaction.csv contains four variables collected from \(n = 46\) patients in a given hospital:

  • Satisfaction : The degree of satisfaction with the quality of care (higher values indicate greater satisfaction).
  • Age : The age of the patient in years.
  • Severity : The severity of the patient’s condition (higher values are more severe).
  • Stress : The patient’s self-reported degree of stress (higher values indicate more stress).

Let’s start by fitting the multiple linear regression model \[\begin{equation} \mtt{Satisfaction}_i = \beta_0 + \beta_1 \cdot \mtt{Age}_i + \beta_2 \cdot \mtt{Severity}_i + \beta_3 \cdot \mtt{Stress}_i + \eps_i. \tag{3.1} \end{equation}\] Model (3.1) has three covariates ( Age , Severity , Stress ), but the design matrix \(X\) has four columns, \[ X = \begin{bmatrix} 1 & \mtt{Age}_1 & \mtt{Severity}_1 & \mtt{Stress}_1 \\ 1 & \mtt{Age}_2 & \mtt{Severity}_2 & \mtt{Stress}_2 \\ \vdots & & & \vdots \\ 1 & \mtt{Age}_{46} & \mtt{Severity}_{46} & \mtt{Stress}_{46} \end{bmatrix}, \] such that \(p = 4\) . The \(\R\) code for fitting this model is:
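The original code listing is not reproduced in this copy; a minimal version, assuming satisfaction.csv sits in the working directory with the column names listed above, and calling the fitted object M to match the summary(M) call referenced below, might be:

```r
# read the data and fit model (3.1)
satisfaction <- read.csv("satisfaction.csv")
M <- lm(Satisfaction ~ Age + Severity + Stress, data = satisfaction)
```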

3.1.1 Basic Components of a Regression Output

Table 3.1 displays the output of the summary(M) command.

Table 3.1: Regression output for fitting a linear model to patient satisfaction with hospital care as a function of age, severity of condition, and self-reported stress.

The meanings of some of the terms in this output are:

Estimate : These are the \(\hat \beta = (X'X)^{-1}X'y\) . To interpret each \(\beta_j\) , recall that for multiple linear regression we have \(\E[y \mid x] = x' \beta\) . Therefore, if \(x\) and \(\tilde x\) are two covariate values that differ only by one unit in covariate \(j\) , i.e., \[ \tilde x - x = (0, \ldots, 0, \underset{\substack{\uparrow\\\rlap{\text{covariate $j$}}}}{1}, 0, \ldots 0), \] then \(\E[y \mid \tilde x] - \E[y \mid x] = \beta_j\) . In other words, \(\beta_j\) is the effect on the mean of a change of 1 unit of covariate \(j\) if all other covariates are held the same. For example, \(\beta_\tx{Age}\) is the change per extra year of age in expected Satisfaction for two people with the same Severity and same Stress values, regardless of what these values are.

Std. Error : These are the so-called standard errors of \(\hat \beta\) . The standard error \(\se(\hat \beta_j)\) of \(\hat \beta_j\) is defined as a sample-based approximation of the estimator’s standard deviation, \(\sd(\hat\beta_j) = \sqrt{\var(\hat\beta_j)}\) . In the regression context, we have \(\var(\hat \beta) = \sigma^2 (X'X)^{-1}\) , such that \(\var(\hat \beta_j)\) is the \(j\)th diagonal element of this variance matrix: \[ \var(\hat \beta_j) = \sigma^2 [\ixtx]_{jj}. \] Therefore, a natural choice for the standard error of \(\hat \beta_j\) is \[ \se(\hat \beta_j) = \hat \sigma \sqrt{[\ixtx]_{jj}}. \] That is, we simply replace the unknown \(\sigma\) by \(\hat\sigma\), the square root of the unbiased estimator \(\hat\sigma^2\).
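As a check on this formula, the reported standard errors can be reproduced directly from the design matrix; a sketch, assuming the model object M fit above:

```r
X <- model.matrix(M)                          # design matrix (includes the intercept column)
sigma.hat <- sigma(M)                         # residual standard error, sqrt(e'e/(n - p))
se.beta <- sigma.hat * sqrt(diag(solve(crossprod(X))))   # hat(sigma) * sqrt([(X'X)^{-1}]_jj)
cbind(se.beta, summary(M)$coefficients[, "Std. Error"])  # the two columns should agree
```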

t value : The values of the test statistics for each null hypothesis \(H_0: \beta_j = 0\) . These are calculated as \[ T_j = \frac{\hat \beta_j}{\tx{se}(\hat \beta_j)}. \]

Pr(>|t|) : The \(p\) -value for each of these null hypotheses. To obtain these, recall that under \(H_0: \beta_j = 0\) , \[ Z = \frac{\hat \beta_j}{\sigma \sqrt{[(X'X)^{-1}]_{jj}}} \sim \N(0, 1), \qquad W = \frac{n-p}{\sigma^2}\cdot \hat \sigma^2 \sim \chi^2_{(n-p)}, \] and since \(\hat \beta_j\) and \(\hat \sigma^2\) are independent, so are \(Z\) and \(W\) . Therefore, \[ \frac{Z}{\sqrt{W/(n-p)}} = \frac{\hat \beta_j}{\hat \sigma \sqrt{[(X'X)^{-1}]_{jj}}} = \frac{\hat \beta_j}{\tx{se}(\hat \beta_j)} \sim t_{(n-p)}. \] Thus, \(\R\) uses the CDF of the \(t\) -distribution to calculate the \(p\) -value \[ p_j = \Pr(\abs{T_j} > \abs{T_{j,\obs}} \mid \beta_j = 0). \]
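The same calculation can be reproduced from the regression output; a sketch using the model object M from above:

```r
# two-sided p-values from the t distribution with n - p degrees of freedom
tstat <- summary(M)$coefficients[, "t value"]
pval <- 2 * pt(abs(tstat), df = df.residual(M), lower.tail = FALSE)
cbind(pval, summary(M)$coefficients[, "Pr(>|t|)"])   # should agree
```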

Residual standard error : The square root of the unbiased estimator, \[ \hat \sigma = \sqrt{\frac{\sum_{i=1}^n (y_i - x_i'\hat\beta)^2}{n-p}} = \sqrt{\frac{e'e}{n-p}}. \]

degrees of freedom : The degrees of freedom for the problem, which is defined as \(n-p\) . Note that for this example we have 3 covariates, but \(p = 4\) since we are fitting four \(\beta\) ’s, including the intercept.

So what are the significant factors in predicting patient satisfaction with the quality of hospital care? At first glance, the \(p\) -values in Table 3.1 indicate that only the patient’s age (and the intercept) are significant at the 5% level. However, these \(p\) -values only check whether covariates are statistically significant (i.e., there’s evidence that \(\beta_j \neq 0\) ) in the presence of all the others . Indeed, rerunning the analysis without Severity produces:
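The original refit and its output are not shown in this copy; a sketch of one way to carry it out, with the object name M_noSev chosen for illustration:

```r
# drop Severity from model M and refit
M_noSev <- update(M, . ~ . - Severity)
summary(M_noSev)
```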

In other words, Stress is not statistically significant in the presence of both Age and Severity , but if Severity is removed, then Stress adds statistically significant predictive power to the linear model containing only Age and the intercept.

3.1.2 Coefficient of Determination

Researchers are often interested in understanding how well a model explains the observed data. To this end, recall that for response \(y\) , model matrix \(X\) , and residual vector \(e\) , we showed that \[ e'e = y'y - \hat \beta' X'X \hat \beta \implies y'y = \hat \beta' X'X \hat \beta + e'e. \] By subtracting \(n \bar y^2\) from each side of this equality we obtain the following decomposition: \[\begin{align} y'y - n \bar y^2 & = (\hat \beta' X'X \hat \beta - n \bar y^2) + e'e, \nonumber \\ \sumi 1 n (y_i - \bar y)^2 & = \left(\sumi 1 n (x_i'\hat\beta)^2 - n \bar y^2\right) + \sumi 1 n (y_i - x_i'\hat\beta)^2. \tag{3.2} \end{align}\]

The Sum of Squares decomposition (3.2) takes on a special meaning when an intercept term \(\beta_0\) is included in the model.

The terms in (3.2) are named as follows:

  • The term \(\sst = \sumi 1 n (y_i - \bar y)^2\) is the Total Sum of Squares .
  • The term \(\sse = \sumi 1 n (y_i - x_i'\hat\beta)^2\) , in some sense, corresponds to the model prediction error. That is, how far are the actual values \(y_i\) from their predicted values \(\hat y_i = x_i'\hat \beta\) . For this reason, this term is called the Error Sum of Squares or “residual sum of squares”.
  • The term \(\ssr = \sumi 1 n (x_i'\hat \beta - \bar y)^2\) is called the Regression Sum of Squares .

In fact, \(\ssr\) can be thought of as the “explained sum of squares” in light of the following argument.

Given the regression model \[\begin{equation} y_i \mid x_i \ind \N(x_i'\beta, \sigma^2), \tag{3.3} \end{equation}\] we have so far considered the \(x_i\) to be fixed. However, in most applications we would not be able to control the values of the covariates. For instance, in the hospital satisfaction example, the 46 patients (presumably) were randomly sampled, leading to random values of Age , Severity , and Stress . If we consider the \(x_i \iid f_X(x)\) to be randomly sampled from some distribution \(f_X(x)\) , then the \(y_i \iid f_Y(y)\) are also randomly sampled from a distribution \[ f_Y(y) = \int f_{Y\mid X}(y\mid x) f_X(x) \ud x, \] where the conditional distribution \(f_{Y\mid X}(y\mid x)\) is the PDF of the normal distribution \(y \mid x \sim \N(x'\beta, \sigma^2)\) implied by the regression model (3.3) . In this sense the variance \(\sigma^2_y = \var(y)\) can be defined without conditioning on \(x\) . The mean and variance in (3.3) are then the conditional mean and variance of \(y\) : \[ x_i'\beta = \E[y_i \mid x_i] \and \sigma^2 = \var(y_i \mid x_i). \]

Let us now recall the law of double variance: \[ \var(y) = \E[\var(y \mid x)] + \var(\E[y \mid x]). \] If \(y\) is “well predicted” by \(x\) , then \(\E[y \mid x]\) is close to \(y\) and \(\var(\E[y \mid x])/\var(y) \approx 1\) . If \(y\) is “not well predicted” by \(x\) , then \(\E[y \mid x]\) shouldn’t change too much with \(x\) , such that \(\var(\E[y \mid x])/\var(y) \approx 0\) . Based on these considerations, a measure of the quality of a linear model is \[\begin{equation} \frac{\var(\E[y \mid x])}{\var(y)} = 1 - \frac{\E[\var(y \mid x)]}{\var(y)} = 1 - \frac{\sigma^2}{\sigma^2_y}. \tag{3.4} \end{equation}\] This is commonly referred to as the “fraction of variance explained by the model”. A sample-based measure of this quantity is the so-called coefficient of determination : \[\begin{equation} R^2 = 1 - \frac{\sse}{\sst}. \tag{3.5} \end{equation}\] This value is reported as the Multiple R-squared in Table 3.1 .

When the intercept is included in the model, we’ve seen that \(\frac 1 n \sumi 1 n x_i'\hat\beta = \bar y\) , such that if \(e_i = y_i - x_i'\hat \beta\) denotes the \(i\) th residual, then \[ \bar e = \frac 1 n \sumi 1 n (y_i - x_i'\hat \beta) = \bar y - \bar y = 0 \implies \sumi 1 n (y_i - x_i'\hat\beta)^2 = \sumi 1 n (e_i - \bar e)^2, \] such that \[ R^2 = 1 - \frac{\sse/(n-1)}{\sst/(n-1)} = 1 - \frac{\frac 1 {n-1} \sumi 1 n (y_i - x_i'\hat \beta)^2}{\frac 1 {n-1} \sumi 1 n (y_i - \bar y)^2} \] can be written in terms of the sample variances of \(e\) and \(y\) . If, however, the goal is to estimate \(1 - \sigma^2/\sigma^2_y\) , then the numerator should employ the unbiased estimator \(\hat \sigma^2\) . This leads to the adjusted coefficient of determination \[ R^2_\tx{adj} = 1 - \frac{\hat \sigma^2}{\sst/(n-1)}. \] This value is reported as Adjusted R-squared in Table 3.1 .
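As a check, both quantities can be computed by hand and compared to the values in Table 3.1; a sketch using the model object M from above:

```r
y <- satisfaction$Satisfaction
SST <- sum((y - mean(y))^2)                   # total sum of squares
SSE <- sum(residuals(M)^2)                    # error sum of squares
n <- length(y); p <- length(coef(M))
R2 <- 1 - SSE/SST                             # Multiple R-squared
R2.adj <- 1 - (SSE/(n - p)) / (SST/(n - 1))   # Adjusted R-squared
c(R2 = R2, R2.adj = R2.adj)
```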

3.2 Basic Nonlinear Models

The dataset tortoise.csv contains 18 measurements of tortoise hatch yields as a function of size. Specifically, the dataset contains the variables:

  • NumEggs : the number of eggs in the tortoise’s hatch
  • CarapDiam : the diameter of its carapace (cm)


Figure 3.1: Hatch size as a function of tortoise carapace size. Linear and quadratic regression fits are plotted in red and blue, respectively.

The data in Figure 3.1 are poorly described by a straight regression line, and the fitted slope is only marginally significant.

A second look at Figure 3.1 reveals that hatch size increases and then decreases as a function of carapace size. It turns out that the shell of a tortoise grows throughout its lifetime, whereas its fertility increases up to a certain point and then starts decreasing. This nonlinear trend can be captured using the linear modeling machinery by adding a quadratic term to the regression. That is, we can fit a model of the form \[ \mtt{NumEggs}_i = \beta_0 + \beta_1 \mtt{CarapDiam}_i + \beta_2 \mtt{CarapDiam}_i^2 + \eps_i, \quad \eps_i \iid \N(0, \sigma^2). \] The \(\R\) command for fitting this model is
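The original listing is not reproduced here; a minimal version, assuming tortoise.csv is in the working directory and calling the fitted object Mquad:

```r
tortoise <- read.csv("tortoise.csv")
# quadratic regression: carapace diameter and its square as covariates
Mquad <- lm(NumEggs ~ CarapDiam + I(CarapDiam^2), data = tortoise)
summary(Mquad)
```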

Note the use of the “as is” function I() . This is because operations such as "+,-,*,:" have a different meaning in formula objects, which is the first argument to lm :
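For instance, without I() the term CarapDiam^2 is interpreted by the formula parser as crossing CarapDiam with itself, which collapses back to CarapDiam; a quick illustration (data frame tortoise as above):

```r
lm(NumEggs ~ CarapDiam + CarapDiam^2, data = tortoise)    # same fit as NumEggs ~ CarapDiam
lm(NumEggs ~ CarapDiam + I(CarapDiam^2), data = tortoise) # the intended quadratic model
```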

This simple example serves to illustrate the great flexibility of linear models. Indeed, we can use regression methods to fit any model of the form \[ y_i = \sum_{j=1}^p \beta_j f_j(x_i) + \eps_i, \quad \eps_i \iid \N(0, \sigma^2), \] where the \(f_j(x)\) are predetermined functions which can be arbitrarily nonlinear. However, I would not recommend, for example, including terms of the form \(x, x^2, x^3, \ldots\) to capture an unknown nonlinear effect in the response. Typically \(x\) and \(x^2\) are sufficient, and if not, there is most likely something else going on that requires further (graphical) investigation.

\(\R\) code to produce Figure 3.1 :
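The original plotting code is not reproduced in this copy; a rough sketch that overlays both fits (Mquad as above, Mlin the straight-line fit; colors follow the figure caption):

```r
Mlin <- lm(NumEggs ~ CarapDiam, data = tortoise)
plot(NumEggs ~ CarapDiam, data = tortoise,
     xlab = "Carapace Diameter (cm)", ylab = "Number of Eggs")
cd <- seq(min(tortoise$CarapDiam), max(tortoise$CarapDiam), length.out = 100)
lines(cd, predict(Mlin, newdata = data.frame(CarapDiam = cd)), col = "red")    # linear fit
lines(cd, predict(Mquad, newdata = data.frame(CarapDiam = cd)), col = "blue")  # quadratic fit
```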

3.2.1 Interpretation of the Intercept

In the non-linear regression example above, we have \(\E[y \mid x] = \beta_0 + \beta_1 x + \beta_2 x^2\) . So how do we interpret the parameters? We can complete the square to write \[\begin{align*} \E[y \mid x] & = \beta_2(x + \tfrac 1 2 \beta_1/\beta_2)^2 - \tfrac 1 4 \beta_1^2/\beta_2 + \beta_0, \end{align*}\] such that \(\gamma = -\tfrac 1 2 \beta_1/\beta_2\) is the shell size at which tortoises produce the maximum expected number of eggs, and \(\beta_2\) is the curvature of the expected value at the mode. But what about \(\beta_0\) ? We have \(\E[y \mid x = 0] = \beta_0\) , such that \(\beta_0\) is the expected number of eggs for a tortoise of size zero. Thus, \(\beta_0\) in this model does not have a meaningful interpretation. One way to change this is to shift the \(x\) values in the regression. That is, consider the regression model \[ y_i = \gamma_0 + \gamma_1 (x_i-a) + \gamma_2 (x_i-a)^2 + \eps_i, \] where \(a\) is a pre-specified constant. Then the intercept of this model \(\gamma_0\) is the expected number of eggs for a tortoise of size \(a\) . The following code shows how to fit this regression when \(a\) is the mean of the observed shell sizes.
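A sketch of such a fit, with the shifted model called Mshift (a name chosen here) and \(a\) taken to be the mean carapace diameter:

```r
a <- mean(tortoise$CarapDiam)
Mshift <- lm(NumEggs ~ I(CarapDiam - a) + I((CarapDiam - a)^2), data = tortoise)
coef(Mshift)["(Intercept)"]   # expected number of eggs at the average shell size
```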

Thus, the estimated mean number of eggs for a tortoise of average size is 9.98. Interestingly, the shifted and unshifted quadratic regression models have exactly the same predicted values:
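This can be verified numerically; a quick check using the objects defined above:

```r
# shifting the covariate changes the coefficients but not the fitted values
range(predict(Mshift) - predict(Mquad))   # differences are numerically zero
```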

3.3 Categorical Predictors

The dataset fishermen_mercury.csv contains data on the mercury concentration found in \(n = 135\) Kuwaiti villagers. Some of the variables in the dataset are:

  • fisherman : Whether or not the subject is a fisherman: 1 = yes, 0 = no.
  • age : Age of the subject (years).
  • height : Height (cm).
  • weight : Weight (kg).
  • fishmlwk : Number of fish meals per week.
  • fishpart : Parts of fish consumed: 0 = none, 1 = muscle tissue only, 2 = muscle tissue and sometimes whole fish, 3 = whole fish.
  • MeHg : Concentration of methyl mercury extracted from subject’s hair sample (mg/g).

Suppose we wish to predict MeHg as a function of fishpart . The variable fishpart is called a categorical predictor (as opposed to a continuous variable) as it only takes on discrete values 0,1,2,3. Moreover, these values do not correspond to integers, but rather to non-numeric categories: none, muscle tissue only, etc. Therefore, while the \(\R\) command
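(A sketch of that call; reading the data into a frame called fishermen is an assumption.)

```r
fishermen <- read.csv("fishermen_mercury.csv")
Mnum <- lm(MeHg ~ fishpart, data = fishermen)   # fishpart treated as an ordinary numeric covariate
```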

compiles without errors, it assumes that the change in predicted MeHg between fishpart=0 and fishpart=1 is exactly the same as that between fishpart=1 and fishpart=2:
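A quick way to see this, using the naive fit Mnum from above:

```r
pred <- predict(Mnum, newdata = data.frame(fishpart = 0:3))
diff(pred)   # the increments between consecutive categories are all equal to the slope
```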

A more appropriate treatment of categorical predictors is to include a different effect for each category. This can be fit into the linear regression framework through the use of indicator variables: \[ \mtt{MeHg}_i = \beta_0 + \beta_1\cdot \tx{I}[\mtt{fishpart}_i=1] + \beta_2\cdot \tx{I}[\mtt{fishpart}_i=2] + \beta_3\cdot \tx{I}[\mtt{fishpart}_i=3] + \eps_i. \] Here, \[ \tx{I}[\mtt{fishpart}_i=j] = \begin{cases}1 & \tx{$i$th subject has $\mtt{fishpart}=j$} \\ 0& \tx{otherwise}. \end{cases} \] Note that there is no indicator for fishpart=0 . When this is the case, all three other indicators are equal to zero, such that \[ \E[\tx{MeHg} \mid \mtt{fishpart} = j] = \begin{cases} \beta_0, & j = 0 \\ \beta_0 + \beta_j, & j > 0. \end{cases} \] The corresponding regression model is fit in \(\R\) by first converting fishpart to a factor object:
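A sketch of that conversion and fit; the labels used here for levels 1 and 3 are assumptions, while "none" and "tissue+whole" appear in the discussion below:

```r
fishermen$fishpart <- factor(fishermen$fishpart, levels = 0:3,
                             labels = c("none", "muscle", "tissue+whole", "whole"))
Mfac <- lm(MeHg ~ fishpart, data = fishermen)   # "none" becomes the baseline category
summary(Mfac)
```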

The design matrix \(X\) for this regression can be obtained through the command
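Presumably via model.matrix(); a sketch, with the displayed rows chosen so that the observations mentioned below are visible:

```r
X <- model.matrix(Mfac)   # indicator columns for each non-baseline level of fishpart
X[130:135, ]              # inspect a few rows of the design matrix
```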

This reveals that observation 131 has fishpart = tissue+whole , whereas observations 130 and 135 have fishpart = none . Alternatively, we can fit the model with individual coefficients for each level of fishpart by forcibly removing the intercept from the model:
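A sketch of the no-intercept fit (object name Mfac0 chosen here):

```r
Mfac0 <- lm(MeHg ~ fishpart - 1, data = fishermen)
coef(Mfac0)   # one coefficient per level of fishpart, i.e., the group means
```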

3.3.1 Significance of Multiple Predictors

Consider the linear model \[ \mtt{MeHg}_i = \mtt{fishmlwk}_i + \mtt{fishpart}_i + \eps_i, \quad \eps_i \iid \N(0, \sigma^2). \] We have seen that the \(p\) -value of fishmlwk contained in the Pr(>|t|) column of the regression summary indicates the statistical significance of fishmlwk as a predictor when fishpart is included in the model. That is, Pr(>|t|) measures the probability, under \(H_0: \beta_{\mtt{fishmlwk}} = 0\), of obtaining more evidence against \(H_0\) than we found in the present dataset.

Question: How to now test the reverse, i.e., whether fishpart significantly improves predictions when fishmlwk is part of the model?

To answer this, let’s consider a more general problem. Suppose we have the linear regression model \[ y \sim \N(X \beta, \sigma^2 I) \quad \iff \quad y_i \mid x_i \ind \N\big(\textstyle\sum_{j=1}^p x_{ij} \beta_j, \sigma^2\big), \quad i = 1,\ldots, n. \] Now we wish to test whether \(q < p\) of the coefficients, let’s say \(\rv \beta q\) , are all equal to zero: \[ H_0: \beta_j = 0, \ 1 \le j \le q. \]

Setting up the null hypothesis is the first step. The next step is to measure evidence against \(H_0\) . We could use \(\max_{1\le j\le q} |\hat \beta_j|\) , or perhaps \(\sum_{j=1}^q \hat\beta_j^2\) . In some sense, the second measure of evidence is more sensitive than the first, but why weight the \(\hat \beta_j^2\) equally since they have different standard deviations? Indeed, recall that \(\hat \beta \sim \N(\beta, \sigma^2 (X'X)^{-1})\) . Now, let \(\gamma = (\rv \beta q)\) and \(\hat \gamma = (\rv {\hat \beta} q)\) . Then \[ \hat \gamma \sim \N(\gamma, \sigma^2 V), \] where \(V_{q\times q}\) are the top-left entries of \((X'X)^{-1}\) . Recall that \(Z = L^{-1}(\hat \gamma - \gamma) \sim \N(0, I_q)\) , where \(\sigma^2 V = LL'\) . Thus, if \(Z = (\rv Z q)\) , we have \[ \sum_{j=1}^q Z_j^2 = Z'Z = (\hat \gamma - \gamma)'(L^{-1})'L^{-1}(\hat \gamma - \gamma) = \frac 1 {\sigma^2} (\hat \gamma - \gamma)'V^{-1}(\hat \gamma - \gamma). \] So under \(H_0: \gamma = 0\) , we have \[ x_1 = \frac 1 {\sigma^2} \hat \gamma'V^{-1}\hat \gamma = \sum_{j=1}^q Z_j^2 \sim \chi^2_{q}. \] Moreover, this is independent of \[ x_2 = \frac {n-p} {\sigma^2} \hat \sigma^2 \sim \chi^2_{(n-p)}. \] The commonly used measure of evidence against \(H_0: \gamma = 0\) is given by \[\begin{equation} F = \frac{x_1/q}{x_2/(n-p)} = \frac{\hat \gamma' V^{-1} \hat \gamma}{q\hat \sigma^2}. \tag{3.6} \end{equation}\] Thus, \(F\) is a (scaled) ratio of independent chi-squared random variables and its distribution does not depend on the values of \(\alpha = (\rv [q+1] \beta p)\) when \(H_0\) is true. The distribution of \(F\) under \(H_0\) is so important that it gets its own name.

3.3.2 Full and Reduced Models

The so-called “ \(F\) -statistic” defined in (3.6) can be derived from a different perspective. That is, suppose that we partition the design matrix \(X\) as \[ X_{n\times p} = [W_{n\times q} \mid Z_{n\times(p-q)}]. \] Then for \(\gamma = (\rv \beta q)\) and \(\alpha = (\rv [q+1] \beta p)\) , the regression model can be written as \[ M_\tx{full}: y \sim \N(X\beta, \sigma^2 I) \iff y \sim \N(W \gamma + Z \alpha, \sigma^2 I). \] Under the null hypothesis \(H_0: \gamma = 0\) , we are considering the “reduced” model \[ M_\tx{red}: y \sim \N(Z\alpha, \sigma^2 I). \] Notice that \(M_\tx{red}\) is nested in \(M_\tx{full}\) . That is, \(M_\tx{red}\) is within the family of models specified by \(M_\tx{full}\) , under the restriction that the first \(q\) elements of \(\beta\) are equal to zero. It turns out that if \({\sse}\sp{\tx{full}}\) and \({\sse}\sp{\tx{red}}\) are the residual sum-of-squares from the full and reduced models, then the numerator of the \(F\) -statistic in (3.6) can be written as \[ \hat \gamma' V^{-1} \hat \gamma = {\sse}\sp{\tx{red}} - {\sse}\sp{\tx{full}}. \] If we let \(\tx{df}_\tx{full} = n - p\) and \(\tx{df}_\tx{red} = n - (p-q)\) denote the degrees of freedom in each model, then we may write \[\begin{equation} F = \frac{({\sse}\sp{\tx{red}} - {\sse} \sp{\tx{full}})/(\tx{df}_\tx{red}-\tx{df}_\tx{full})}{ {\sse}\sp{\tx{full}}/\tx{df}_{\tx{full}}}. \tag{3.7} \end{equation}\] Indeed, this is the method that \(\R\) uses to calculate the \(F\) -statistic:
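For the MeHg example, the comparison of the reduced model (fishmlwk only) against the full model (fishmlwk plus fishpart) might look as follows; the object names are chosen for illustration:

```r
Mred  <- lm(MeHg ~ fishmlwk, data = fishermen)              # reduced model
Mfull <- lm(MeHg ~ fishmlwk + fishpart, data = fishermen)   # full model
anova(Mred, Mfull)   # reports the F-statistic (3.7) and its p-value
```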

In the special case where the model has \(p\) predictors and an intercept term, \[ y_i \mid x_i \ind \N(\beta_0 + x_i'\beta, \sigma^2), \] the \(F\) -statistic against the null hypothesis \[ H_0: \beta_1 = \cdots = \beta_p = 0 \] of all coefficients being equal to zero except for the intercept is \[ F = \frac{(\sst-\sse)/p}{\sse/(n-p-1)}. \] This quantity and its \(p\) -value are reported in Table 3.1 as F-statistic and p-value .

3.4 Interaction Effects

The dataset real_estate.txt contains the sale price of 521 houses in California and some other characteristics of the house:

  • SalePrice : Sale price ($).
  • SqrFeet : Floor area (square feet).
  • Bedrooms : Number of bedrooms.
  • Bathrooms : Number of bathrooms.
  • Air : Air conditioning: 1 = yes, 0 = no.
  • CarGarage : Size of garage (number of cars).
  • Pool : Is there a pool: 1 = yes, 0 = no.
  • Lotsize : Size of property (square feet).

The model \[ M_1: \mtt{SalePrice}_i = \beta_0 + \beta_1 \mtt{SqrFeet}_i + \beta_2 \cdot \tx{I}[\mtt{Air}_i = 1] + \eps_i, \quad \eps_i \iid \N(0,\sigma^2) \] can be fit in \(\R\) with the command:
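A minimal version of that call, assuming the file is read into a data frame called realestate:

```r
realestate <- read.table("real_estate.txt", header = TRUE)
M1 <- lm(SalePrice ~ SqrFeet + Air, data = realestate)
```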


Figure 3.2: Sale price as a function of floor area for 521 houses in California. Solid lines correspond to predictions with AC as an additive effect, and dotted lines account for interaction between AC and floor area.

Model \(M_1\) produces parallel regression lines. That is, the increase in the expected sale price for a given house per square foot of floor area is the same for houses with and without air conditioning: \[\begin{equation} \begin{split} & \E[\mtt{SalePrice} \mid \mtt{SqrFeet} = a_2, \mtt{Air} = \color{blue}{\tx{yes}}] - \E[\mtt{SalePrice} \mid \mtt{SqrFeet} = a_1, \mtt{Air} = \color{blue}{\tx{yes}}] \\ = & \E[\mtt{SalePrice} \mid \mtt{SqrFeet} = a_2, \mtt{Air} = \color{red}{\tx{no}}] - \E[\mtt{SalePrice} \mid \mtt{SqrFeet} = a_1, \mtt{Air} = \color{red}{\tx{no}}] \\ = & \beta_1(a_2-a_1). \end{split} \tag{3.8} \end{equation}\] If we wish to model the increase in SalePrice per unit of SqrFeet differently for houses with and without air conditioning, we can do this in \(\R\) with the command
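Sketched below; M2 is the name used for this model in the discussion that follows:

```r
M2 <- lm(SalePrice ~ SqrFeet * Air, data = realestate)
summary(M2)
```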

The formula term SqrFeet*Air expands to SqrFeet + Air + SqrFeet:Air , and this corresponds to the model \[ M_2: \mtt{SalePrice}_i = \beta_0 + \beta_1 \mtt{SqrFeet}_i + \beta_2 \cdot \tx{I}[\mtt{Air}_i = 1] + \beta_3 \big(\mtt{SqrFeet}_i\cdot \tx{I}[\mtt{Air}_i = 1]\big) + \eps_i. \] The last term in this model enters the design matrix \(X\) as \[ \mtt{SqrFeet}_i\cdot I[\mtt{Air}_i = 1] = \begin{cases} 0 & \tx{house $i$ has no AC}, \\ \mtt{SqrFeet}_i & \tx{house $i$ has AC}. \end{cases} \] The predictions of model \(M_2\) are the dotted lines in Figure 3.2 . The output of summary(M2) indicates that SqrFeet:Air significantly increases the fit to the data in the presence of only SqrFeet and Air . The term SqrFeet:Air is called an interaction term. In its absence, SqrFeet and Air are said to enter additively into model \(M_1\) in reference to equation (3.8) . That is, Air has a purely additive effect in model \(M_1\) , since it “adds on” the same value of \(\beta_2\) to the predicted sale price for any floor area. In general, we have the following:

We can also have interactions between continuous covariates. For example, suppose that we have:

  • \(y\) : The yield of a crop.
  • \(x\) : The amount of fertilizer.
  • \(w\) : The amount of pesticide.

Then a basic interaction model is of the form \[ M: y_i = \beta_0 + \beta_1 x_i + \beta_2 w_i + \beta_3 x_i w_i + \eps_i. \] Thus for every fixed amount of fertilizer, there is a different linear relation between amount of pesticide and yield: \[ \E[y \mid x, w] = \beta_{0x} + \beta_{1x} w, \] where \(\beta_{0x} = \beta_0 + \beta_1 x\) and \(\beta_{1x} = \beta_2 + \beta_3 x\) . This also provides an interpretation of \(\beta_3\) . If \(\tilde x = x + 1\) , then \[ \beta_3 = \beta_{1\tilde x} - \beta_{1x}. \] In other words, \(\beta_3\) is the difference in the effect per unit of pesticide on yield for a change of one unit of fertilizer.

\(\R\) code to produce Figure 3.2 :
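The original plotting code is not reproduced in this copy; a rough sketch of the overlay (color choices are an assumption; solid lines are the additive fit M1 and dotted lines the interaction fit M2, as in the caption):

```r
with(realestate,
     plot(SqrFeet, SalePrice, col = ifelse(Air == 1, "blue", "red"),
          xlab = "Floor Area (sq. ft.)", ylab = "Sale Price ($)"))
sq <- seq(min(realestate$SqrFeet), max(realestate$SqrFeet), length.out = 100)
for (ac in 0:1) {
  nd <- data.frame(SqrFeet = sq, Air = ac)
  clr <- ifelse(ac == 1, "blue", "red")
  lines(sq, predict(M1, newdata = nd), col = clr, lty = 1)   # additive model
  lines(sq, predict(M2, newdata = nd), col = clr, lty = 3)   # interaction model
}
```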

3.5 Heteroskedastic Errors

The basic regression model \[ y_i = x_i'\beta + \eps_i, \quad \eps_i \iid \N(0, \sigma^2) \] is said to have homoskedastic errors. That is, the variance of each error term \(\eps_i\) is constant: \(\var(\eps_i) \equiv \sigma^2\) . A model with heteroskedastic errors is of the form \[ y_i = x_i'\beta + \eps_i, \quad \eps_i \ind \N(0, \sigma_i^2), \] where the error of each subject/experiment \(i = 1,\ldots,n\) is allowed to have a different variance. Two common examples of heteroskedastic errors are presented below.

3.5.1 Replicated Measurements

The hospitals.csv dataset consists of measurements of post-surgery complications in 31 hospitals in New York state. The variables in the dataset are:

  • SurgQual : A quality index for post-surgery complications aggregated over all patients in the hospital (higher values mean fewer complications).
  • Difficulty : A measure of the average difficulty of surgeries for that hospital (higher means more difficult cases).
  • N : The number of patients used to calculate these indices.


Figure 3.3: Aggregated measure of post-surgery complications as a function of difficulty of case load for 31 hospitals in New York state. The size of each data point is proportional to the number of patients used to compute the aggregate. Simple and weighted regression lines are indicated by the solid and dotted lines.

To formalize this, suppose that \(y_{ij}\) is the quality score for patient \(j\) in hospital \(i\) . If each \(y_{ij}\) has common variance about the hospital mean \[ x_i'\beta = \beta_0 + \beta_1 \mtt{Difficulty}_i, \] we could model this using \[ y_{ij} = x_i'\beta + \eps_{ij}, \quad \eps_{ij} \iid \N(0,\sigma^2). \] Thus, each hospital score \(\mtt{SurgQual}_i\) is an average of the scores of individual patients, \[ \mtt{SurgQual}_i = y_i = \frac 1 {N_i} \sum_{j=1}^{N_i} y_{ij}, \] such that \[\begin{equation} y_i = x_i' \beta + \eps_i, \quad \eps_i = \frac 1 {N_i} \sum_{j=1}^{N_i} \eps_{ij} \ind \N(0, \sigma^2/N_i). \tag{3.9} \end{equation}\]

To fit the “repeated measurements” model (3.9) , let \[\begin{align*} \tilde y_i & = N_i^{1/2} y_i, & \tilde x_i & = N_i^{1/2} x_i, & \tilde \eps_i & = N_i^{1/2} \eps_i. \end{align*}\] Then model (3.9) becomes \[ \tilde y_i = \tilde x_i' \beta + \tilde \eps_i, \quad \tilde \eps_i \iid \N(0, \sigma^2). \] In other words, by rescaling the response and predictors by the square-root of the sample size, we can transform the heteroskedastic repeated measurements regression problem into a homoskedastic one. If \(y\) , \(\tilde y\) and \(X\) , \(\tilde X\) denote the original and transformed response vector and design matrix, then \[ \tilde y = D^{1/2} y \and \tilde X = D^{1/2} X, \where D = \left[\begin{smallmatrix} N_1 & & 0\\ & \ddots & \\ 0& & N_n \end{smallmatrix} \right]. \] The regression parameter estimates are then \[\begin{align*} \hat \beta & = (\tilde X' \tilde X)^{-1} \tilde X' \tilde y = (X'DX)^{-1} X'Dy, \\ \hat \sigma^2 & = \frac{\sumi 1 n (\tilde y_i - \tilde x_i'\hat \beta)^2}{n-p} = \frac{(y - X\hat\beta)'D(y - X\hat\beta)}{n-p}. \end{align*}\] This “weighted regression” is achieved in \(\R\) with the weights argument to lm() :
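A sketch of that call, assuming the data are read into a data frame called hospitals:

```r
hospitals <- read.csv("hospitals.csv")
# weight each hospital by its number of patients: var(eps_i) = sigma^2 / N_i
Mw <- lm(SurgQual ~ Difficulty, data = hospitals, weights = N)
```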

Figure 3.3 indicates that weighting the hospitals proportionally to sample size attenuates the negative relationship between the post-surgery quality index and the case load difficulty index. This is because many of the low-performing hospitals with difficult case loads had few patients, whereas some of the well-performing hospitals with less difficult cases had a very large number of patients.

\(\R\) code to produce Figure 3.3 :
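The original plotting code is not reproduced here; a rough sketch (Mw as above, Ms the unweighted fit; the point-size scaling is a guess at the original):

```r
Ms <- lm(SurgQual ~ Difficulty, data = hospitals)   # simple (unweighted) regression
with(hospitals,
     plot(Difficulty, SurgQual, cex = 2 * N / max(N),   # point size proportional to N
          xlab = "Case Difficulty Index", ylab = "Surgical Quality Index"))
abline(Ms, lty = 1)   # simple regression (solid)
abline(Mw, lty = 3)   # weighted regression (dotted)
```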

3.5.2 Multiplicative Error Models

The dataset Leinhardt in the \(\R\) package car contains data on infant mortality and per-capita income for 105 world nations circa 1970. The data can be loaded with the following commands:
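Those commands are not reproduced in this copy; loading the data amounts to:

```r
# install.packages("car")   # if the car package is not yet installed
library(car)
data(Leinhardt)
```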

The variables in the dataset are:

  • income : Per-capita income in U. S. dollars.
  • infant : Infant-mortality rate per 1000 live births.
  • region : Continent, with levels Africa , Americas , Asia , Europe .
  • oil : Oil-exporting country, with levels no , yes .

We begin by fitting an additive (i.e., main effects only) model to infant mortality as a function of the three other covariates.
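A sketch of that fit (object name Madd chosen here):

```r
Madd <- lm(infant ~ income + region + oil, data = Leinhardt)
summary(Madd)
```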

Perhaps surprisingly, this model indicates a very minor relation between infant and income . To assess the adequacy of this model, the standard “residual plot” of model residuals vs. model predictions is displayed in Figure 3.4 a.


Figure 3.4: Data analysis on the original scale. (a) Residuals vs. predicted values. (b) Infant mortality vs. income.

If the model assumption of iid normal errors adequately describes the data, then the residuals should exhibit no specific pattern about the horizontal axis. However, Figure 3.4 a indicates two glaring patterns.

The first is that the predicted values seem to cluster into 4-5 distinct categories. This is not in itself problematic: since the effect of the continuous covariate income is small, we expect the predictions to largely be grouped according to the discrete covariates region and oil (if there are no continuous covariates in the model, then there are exactly as many distinct predicted values as there are discrete covariate categories).

The second pattern in Figure 3.4 a is that the magnitude of the residuals seems to increase with the predicted values. This is one of the most common patterns of heteroskedasticity in linear modeling.

So far, we have seen additive error models of the form \[ y = \mu(x) + \eps, \qquad \eps \sim [0, \sigma^2]. \] Here, the square brackets indicate the mean and variance of \(\eps\) , i.e., \(\E[\eps] = 0\) and \(\var(\eps) = \sigma^2\) . Under such a model, we have \[ y \sim [\mu(x), \sigma^2]. \] In contrast, a multiplicative error model is of the form \[ y = \mu(x) \cdot \eps, \qquad \eps \sim [1, \sigma^2]. \] When the error is multiplicative, we have \[ y \sim [\mu(x), \mu(x)^2 \sigma^2], \] such that both the conditional mean and conditional variance of \(y\) depend on \(\mu(x)\) . Fortunately, a simple treatment of multiplicative models is to put an additive model on the log of \(y\) , such that \[ \log(y) = \mu(x) + \log(\eps) \iff y = \exp\{\mu(x)\} \cdot \eps. \]

In the linear regression context, the multiplicative model we consider is \[\begin{equation} \log(y_i) = \tilde y_i = x_i'\beta + \tilde \eps_i, \qquad \tilde \eps_i \iid \N(0,\sigma^2). \tag{3.10} \end{equation}\] Under such a model, we can easily obtain the MLE of \(\beta\) , \[ \hat \beta = \ixtx X' \tilde y. \] Similarly we have the unbiased estimate \(\hat \sigma^2\) , confidence intervals for \(\beta\) , hypothesis tests using \(t\) and \(F\) distributions, and so on. All of these directly follow from the usual additive model (3.10) formulated on the log scale. There are, however, two caveats to consider.

Mean Response Estimates . Suppose we wish to estimate \(\mu(x) = \E[y \mid x]\) on the regular scale. Then \[ \mu(x) = \E[y \mid x] = \E[ \exp(x'\beta + \tilde \eps) ] = \exp(x'\beta) \cdot \E[ \eps], \] where \(\eps = \exp(\tilde \eps)\) .

Since \(\tilde \eps \sim \N(0, \sigma^2)\) , \(\eps = \exp(\tilde \eps)\) follows a lognormal distribution, for which \(\E[\eps] = \exp(\sigma^2/2)\) , such that \(\E[y \mid x] = \exp(x'\beta + \sigma^2/2)\) . Based on data, the usual estimate of the conditional expectation is \[ \hat \mu(x) = \exp(x'\hat\beta + \hat \sigma^2/2). \]

Prediction Intervals . While \(\E[x'\hat \beta] = x'\beta\) , unfortunately \(\hat \mu(x)\) is not an unbiased estimate for \(\mu(x) = \E[y \mid x]\) . However, suppose that we wish to make a prediction interval for \(y_\star\) , a new observation at covariate value \(x_\star\) . By considering the log-scale model (3.10) , we can construct a \(1-\alpha\) level prediction interval for \(\tilde y_\star = \log(y_\star)\) in the usual way: \[ (L, U) = x_\star' \hat \beta \pm q_\alpha s(x_\star), \] where \(s(x_\star) = \hat \sigma \sqrt{1 + x_\star' \ixtx x_\star}\) , and for \(T \sim t_{(n-p)}\) , we have \(\Pr( \abs T < q_\alpha) = 1-\alpha\) . That is \(L = L(X, y, x_\star)\) and \(U = U(X, y, x_\star)\) are random variables such that \[ \Pr( L < \tilde y_\star < U \mid \theta) = 1-\alpha, \] regardless of the true parameter values \(\theta = (\beta, \sigma)\) . But since \(\exp(\cdot)\) is a monotone increasing function, \[ \Pr( L < \tilde y_\star < U \mid \theta) = \Pr( \exp(L) < y_\star < \exp(U) \mid \theta), \] such that \((\exp(L), \exp(U))\) is a \(1-\alpha\) level prediction interval for \(y_\star\) .

Let’s return to the Leinhardt dataset. A scatterplot of the data is displayed in Figure 3.4 b with different symbols for the points according to region and oil .

Figure 3.4 b shows why a linear relation between infant and income clearly fails: the data are highly skewed to the right on each of these axes. Since both variables are also constrained to be positive, there is a good chance that taking logs can “spread out” the values near zero. Indeed, Figure 3.5 b changes the axes of Figure 3.4 b to the log-scale by adding a log argument to the plot:
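A sketch of that call:

```r
with(Leinhardt,
     plot(income, infant, log = "xy",
          xlab = "Per-Capita Income ($)", ylab = "Infant Mortality (per 1000 live births)"))
```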


Figure 3.5: Data analysis on the log scale. (a) Residuals vs. predicted values. (b) Infant mortality vs. income.

The relationship between \(\log(\mtt{infant})\) and \(\log(\mtt{income})\) is far closer to linear, so we fit the log-log model in \(\R\) as follows:
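A sketch of that fit (whether the discrete covariates region and oil are retained in the log-log model is an assumption here; the object is called Mlog):

```r
Mlog <- lm(log(infant) ~ log(income) + region + oil, data = Leinhardt)
summary(Mlog)
```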

The residual plot in Figure 3.5 a does not indicate egregious departures from homoskedasticity, and the analysis of the \(\R\) output confirms a highly significant slope coefficient of log-per-capita income on the log-infant mortality rate.

\(\R\) code to produce Figure 3.5 a:
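The original plotting code is not reproduced in this copy; a minimal residual plot for the log-scale fit Mlog:

```r
plot(fitted(Mlog), residuals(Mlog),
     xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, lty = 2)
```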

Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between  two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Table of contents

  • Assumptions of multiple linear regression
  • How to perform a multiple linear regression
  • Interpreting the results
  • Presenting the results
  • Frequently asked questions about multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression :

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality : The data follows a normal distribution .

Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula

The formula for a multiple linear regression is:

\[ y = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \epsilon \]

  • \(y\) = the predicted value of the dependent variable
  • \(\beta_0\) = the intercept (the value of \(y\) when all independent variables equal zero)
  • \(\beta_1 X_1\) = the regression coefficient of the first independent variable times that variable's value
  • … = do the same for however many independent variables you are testing
  • \(\beta_n X_n\) = the regression coefficient of the last independent variable times its value
  • \(\epsilon\) = the model error

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:
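The code listing is not reproduced in this copy; a sketch of the fit, with variable names taken from the article's description (the exact names in the downloadable file may differ):

```r
heart.data <- read.csv("heart.data.csv")
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
```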

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease using the equation for the linear model: lm() .

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function:

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

[Image: R multiple linear regression summary output]

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and have similar spread on either side, as these do (median 0.03, min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:

The most important things to note in this output table are the next two tables – the estimates for the independent variables.

The Estimate column gives the estimated effect , also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr( > | t | ) column shows the p value . This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low ( p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

[Image: Multiple regression in R graph]

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the independent variable, we calculated these predicted values while holding smoking constant at the minimum, mean , and maximum observed rates of smoking.



A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.


Regression: the Mother of all Models – Retail Case Study Example (Part 9)


The Olympics – by Roopam

Welcome back to our retail case study example for marketing analytics. In the previous 8 parts, we have covered some of the key tasks of data science such as:

In this part, we will learn about estimation through the mother of all models – multiple linear regression. A sound understanding of regression analysis and modeling provides a solid foundation for analysts to gain a deeper understanding of virtually every other modeling technique, such as neural networks, logistic regression, etc. But before moving to regression, let’s try to put some fundamental ideas behind statistics in perspective by using the most followed event of the summer Olympics.

100 Meters Sprint

The first Olympic Games I followed were the 1988 games held in Seoul, South Korea. That was the same Olympics where Ben Johnson broke the then world record for the 100 meters sprint by completing the race in 9.79 seconds. Later, Johnson tested positive for performance-enhancing drugs; he was disqualified from the race and stripped of his gold medal. For a sporting event that lasts close to just 10 seconds, the 100 meters sprint is arguably the most followed event of the summer Olympics. In the 2012 Olympics, Usain Bolt set a new record by finishing the race in 9.63 seconds. The following is the list of medal holders for the 2012 Olympics (source: Wikipedia):

Rank  Lane  Name           Nationality  Reaction  Result
1     7     Usain Bolt     Jamaica      0.165     9.63
2     5     Yohan Blake    Jamaica      0.179     9.75
3     6     Justin Gatlin  USA          0.178     9.79

Usain Bolt is widely regarded as the fastest man in the world. However, I must say that…

You Can Beat Usain Bolt in 100 Meters Sprint

Before I explain how, let us go back to the medal holders of the 2012 Olympics. For instance, if we make Usain Bolt run the 100 meters race one thousand times, he will finish each race with a different time, mostly close to his record time in the Olympics. The same is also true for the other medal holders, Yohan Blake and Justin Gatlin. For argument’s sake, let’s assume the following distributions of race completion times for the three medal holders. These distributions are all normal, or Gaussian, distributions. The normal distribution is a good assumption for most natural phenomena, such as the running speed of humans.

[Figure: 100 Meters Race]

Using the above distributions, the most likely outcome is still that the gold medal stays with Usain Bolt. However, there are still cases in which either of the other sprinters can win the gold medal. This, according to me, is the foundation of statistical thinking.

Now, coming back to the title of this section: if you compete with Usain Bolt a googolplex number of times, then there is still a chance that you will win at least one race against the fastest man in the world. Yay!

Googol (10^100): this is a really large number. Googol is also the inspiration behind the name for Google (the search engine) – yes, the smart founders of Google misspelled it.
Googolplex (10^googol): this is an unfathomably large number. Google’s corporate headquarters in California is called the Googleplex.

Regression Analysis – Retail Case Study Example

Now let’s come back to our case study example, where you are the Chief Analytics Officer & Business Strategy Head at an online shopping store called DresSMart Inc. You have set the following two objectives:

Objective 1: Improve the conversion rate of the campaigns, i.e., the number of customers buying products from the marketing product catalog.

Objective 2: Improve the profit generated through the converted customers.

You have achieved the first objective in the previous few parts of this case study example. The classification models ( Part 5 , Part 6 , Part 7 & Part 8 ) were used to estimate the propensities of customers to respond to campaigns. This leaves you with the second objective: to estimate the expected profit generated from each customer if he/she responds to the campaign. This is a classical regression problem. To develop a regression model, you will use the data for the 4200 customers, out of one hundred thousand solicited customers, who responded to the previous campaigns. All these 4200 customers live in different locations that can be grouped into the following three categories:

  • Large Cities
  • Mid-Sized Cities
  • Small Towns

Incidentally, these customers are evenly divided into these three categories, with 1400 customers in each group. The first thing you checked is the average value of profit generated from these three categories of cities. As you can see in the figure below, the average values for profits are different across these categories. Keep these average values in mind; they will come in handy when we develop our regression model.

[Figure: Average Profits]

Now the second question is whether these average values for profits are significantly different or not. This question is answered using the location-category-wise distributions of profit for all the 4200 customers. The above figure shows a representation of these distributions (towards the right). For our original data, the following are the location-category-wise density distributions for all the 4200 customers. Notice that profit is negative for some cases in this distribution because of products returned by customers, and other losses.

[Figure: Profit Distribution]

There are a couple of intuitive insights in the above plots:

  • The large cities have a bigger average value for profits than the others because of higher earning capacity and disposable income for residents of the large metropolitan cities.
  • The large cities also have a wider distribution of profit than other two categories because of greater socio-economic diversity for the large metropolitan cities.

Keeping the above insights in mind, let’s create our simple regression model with these categories as the predictor variables. The following are the results of our regression model:

                  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)             46      0.4691    98.06    <2e-16
Mid-Sized Cities         8      0.6635    12.06    <2e-16
Large Cities            22      0.6635    33.16    <2e-16

Multiple R-squared: 0.2069
Adjusted R-squared: 0.2065
F-statistic p-value: < 2.2e-16

The following is the linear equation for this regression model

Profit = 46 + 8 × (Mid-Sized Cities) + 22 × (Large Cities)

Notice that the model has just mid-sized and large cities as the predictor variables. The information about small towns is absorbed into the intercept. Also, these predictor variables are dummy variables, hence they can only take the values 0 or 1. For instance, if the location is a small town then Mid-Sized Cities = 0 and Large Cities = 0, hence the profit is:

Profit = 46 + 8 × 0 + 22 × 0 = 46

Recall the average figures above: this is the same as the average value for small towns. Now, if the location is a mid-sized city, then

Profit = 46 + 8 × 1 + 22 × 0 = 54

Again, this is the same as the average value for mid-sized cities. Finally, the estimated profit from a resident customer of a large city is:

Profit = 46 + 8 × 0 + 22 × 1 = 68
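The case-study data and code are not public, but the mechanics of a dummy-variable regression recovering the group means can be illustrated with simulated data; everything below (apart from the group sizes and means quoted above) is an assumption made for illustration:

```r
set.seed(331)
# three location categories, 1400 customers each, with the group means described above
location <- factor(rep(c("Small Town", "Mid-Sized City", "Large City"), each = 1400),
                   levels = c("Small Town", "Mid-Sized City", "Large City"))
profit <- rnorm(4200, mean = c(46, 54, 68)[as.integer(location)], sd = 30)
fit <- lm(profit ~ location)
coef(fit)   # intercept ~ small-town mean; other coefficients ~ differences from that baseline
```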

Now the next question is: how good is this model? To answer it, we have to scroll up to the regression model results and look at the following three things:

  • P-values for individual coefficients: Look at the rightmost column for the coefficients. The values are really small (<2e-16), which means it is extremely unlikely we would see estimates this far from zero if the true coefficients were zero. This is similar to your chances of beating Usain Bolt, i.e., extremely low but not zero.
  • Adjusted R-squared value: for our model this is 0.2065. This means that the category of location alone explains about 20% of the variation in profit. This is not bad for a single categorical variable. If we keep adding more significant variables to the above model, the value of adjusted R-squared will keep increasing.
  • F-statistic: again, the p-value here is really small, i.e., <2.2e-16. This means the model has a very low chance of being purely random — similar to your chances of randomly beating Usain Bolt.

Sign-off Note

The following statements summarize the essential ideas behind the Olympic games. The most important thing in the Olympic Games is not to win but to take part. The essential thing is not to have conquered but to have fought well.

So go out, play well, and most importantly enjoy even if the opponent is the fastest man on the planet. See you soon with a new post.

9 thoughts on “ Regression: the Mother of all Models – Retail Case Study Example (Part 9) ”

very intuitive and one of the best effort to explain data science

Great article. Can you please elaborate bit more on how dummy variables will be assigned? I think we need to create two dummy variables one for mid-size city and other for large-size city. The values for small town is removed to avoid dummy variable trap. The removed dummy then becomes the base category against which the other categories are compared. In this case it is included part of intercept. Is this understanding right?

Yes Reva, your understanding is right. The intercept in this case is the average value for small towns. In case there were more than one dummy variables then the intercept will absorb the information for all the baseline values for these dummy variables.

Why is the chance to beat Usain Bolt one in a Googolplex number of times? And how can this be connected to the p value which is nowhere as close to even Googol at 2.20E-16. Could you please explain this concept in more detail Roopam?

R uses this notation (<2.20E-16) to denote very small probabilities - this is because of the computational limitation. Notice the 'less than' sign. This is kind of similar to -infinity in the mathematical terms. It is essentially a tiny probability. But for all practical purposes, it doesn't matter how small it is.

Thank you Roopam 🙂 this is super useful




13 Multiple Linear Regression

Barbara Illowsky; Margo Bergman; and Susan Dean

Student Learning Outcomes

By the end of this chapter, the student should be able to:

  • Perform and interpret multiple regression analysis
  • State the assumptions of OLS regression and why they are important
  • Discuss the causes and corrections of multicollinearity
  • Explain the purpose and method of including dummy variables
  • Explain the purpose and method of logarithmic transformations
  • Develop predictions of data with multiple regression equations

The Multiple Linear Regression Equation

As previously stated, regression analysis is a statistical technique that can test the hypothesis that a variable is dependent upon one or more other variables. Further, regression analysis can provide an estimate of the magnitude of the impact of a change in one variable on another. This last feature, of course, is all important in predicting future values.

Regression analysis is based upon a functional relationship among variables and further, assumes that the relationship is linear. This linearity assumption is required because, for the most part, the theoretical statistical properties of non-linear estimation are not well worked out yet by the mathematicians and econometricians. This presents us with some difficulties in economic analysis because many of our theoretical models are nonlinear. The marginal cost curve, for example, is decidedly nonlinear as is the total cost function, if we are to believe in the effect of specialization of labor and the Law of Diminishing Marginal Product. There are techniques for overcoming some of these difficulties, exponential and logarithmic transformation of the data for example, but at the outset we must recognize that standard ordinary least squares (OLS) regression analysis will always use a linear function to estimate what might be a nonlinear relationship. When there is only one independent, or explanatory variable, we call the relationship simple linear regression. When there is more than one independent variable, we refer to performing multiple linear regression.

The general multiple linear regression model can be stated by the equation:

\[ y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \epsilon_i \]

As with all probability distributions, this model works only if certain assumptions hold. These are that the Y is normally distributed, the errors are also normally distributed with a mean of zero and a constant standard deviation, and that the error terms are independent of the size of X and independent of each other.

Assumptions of the Ordinary Least Squares Regression Model

Each of these assumptions needs a bit more explanation. If one of these assumptions fails to be true, then it will have an effect on the quality of the estimates. Some of the failures of these assumptions can be fixed while others result in estimates that quite simply provide no insight into the questions the model is trying to answer or worse, give biased estimates.


  • The error term is a random variable with a mean of zero and a constant variance. The meaning of this is that the variance of the error term is independent of the value of the independent variable. Consider the relationship between personal income and the quantity of a good purchased as an example of a case where the variance is dependent upon the value of the independent variable, income. It is plausible that as income increases the variation around the amount purchased will also increase, simply because of the flexibility provided with higher levels of income. The assumption of constant variance with respect to the magnitude of the independent variable is called homoscedasticity. If the assumption fails, then it is called heteroscedasticity. Figure 1 shows the case of homoscedasticity where all three distributions have the same variance around the predicted value of Y regardless of the magnitude of X.
  • While the independent variables are treated as fixed values, the dependent variable is assumed to be normally distributed about the regression line. This can be seen in Figure 1 by the shape of the distributions placed on the predicted line at the expected value of Y for each relevant value of X.
  • The independent variables are independent of Y, but are also assumed to be independent of the other X variables. The model is designed to estimate the effects of independent variables on some dependent variable in accordance with a proposed theory. The case where two or more of the independent variables are correlated is not unusual. There may be no cause and effect relationship among the independent variables, but nevertheless they move together. Take the case of a simple supply curve where quantity supplied is theoretically related to the price of the product and the prices of inputs. There may be multiple inputs that may over time move together from general inflationary pressure. The input prices will therefore violate this assumption of regression analysis. This condition is called multicollinearity, which will be taken up in detail later.
  • The error terms are uncorrelated with each other. This situation arises from an effect on one error term from another error term. While not exclusively a time series problem, it is here that we most often see this case. An X variable in time period one has an effect on the Y variable, but this effect then has an effect in the next time period. This effect gives rise to a relationship among the error terms. This case is called autocorrelation, “self-correlated.” The error terms are now not independent of each other, but rather have their own effect on subsequent error terms.


Multicollinearity

Our discussion earlier indicated that like all statistical models, the OLS regression model has important assumptions attached. Each assumption, if violated, has an effect on the ability of the model to provide useful and meaningful estimates. The Gauss-Markov Theorem has assured us that the OLS estimates are unbiased and minimum variance, but this is true only under the assumptions of the model. Here we will look at the effects on OLS estimates if the independent variables are correlated. The other assumptions and the methods to mitigate the difficulties they pose if they are found to be violated are examined in Econometrics courses. We take up multicollinearity because it is so often prevalent in Economic models and it often leads to frustrating results.

The OLS model assumes that all the independent variables are independent of each other. This assumption is easy to test for a particular sample of data with simple correlation coefficients. Correlation, like much in statistics, is a matter of degree: a little is not good, and a lot is terrible.


Multicollinearity has a further deleterious impact on the OLS estimates. The correlation between the two independent variables also shows up in the formulas for the estimate of the variance for the coefficients. If the correlation is zero as assumed in the regression model, then the formula collapses to the familiar ratio of the variance of the errors to the variance of the relevant independent variable. If however the two independent variables are correlated, then the variance of the estimate of the coefficient increases. This results in a smaller t-value for the test of hypothesis of the coefficient. In short, multicollinearity results in failing to reject the null hypothesis that the X variable has no impact on Y when in fact X does have a statistically significant impact on Y. Said another way, the large standard errors of the estimated coefficient created by multicollinearity suggest statistical insignificance even when the hypothesized relationship is strong.
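As a rough illustration of this point, the following R sketch (on simulated, hypothetical data) creates two nearly identical predictors and shows the inflated standard errors and variance inflation factor that result.

set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.95 * x1 + 0.05 * rnorm(n)      # x2 is almost a copy of x1: strong collinearity
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

cor(x1, x2)                 # simple correlation between the two predictors
summary(lm(y ~ x1 + x2))    # note the large standard errors on x1 and x2
summary(lm(y ~ x1))         # dropping x2 shrinks the standard error on x1 markedly

# Variance inflation factor for x1, computed by regressing x1 on x2:
1 / (1 - summary(lm(x1 ~ x2))$r.squared)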

Dummy Variables

Thus far the analysis of the OLS regression technique assumed that the independent variables in the models tested were continuous random variables. There are, however, no restrictions in the regression model against independent variables that are binary. This opens the regression model for testing hypotheses concerning categorical variables such as gender, race, region of the country, before a certain date, after a certain date, and innumerable others. These categorical variables take on only two values, 1 and 0, success or failure, from the binomial probability distribution. The form of the equation becomes:

\hat{y} = b_0 + b_1X_1 + b_2X_2

An example of the use of a dummy variable is the work estimating the impact of gender on salaries. There is a full body of literature on this topic and dummy variables are used extensively. For this example the salaries of elementary and secondary school teachers for a particular state are examined. Using a homogeneous job category, school teachers, and a single state reduces many of the variations that naturally affect salaries, such as differential physical risk, cost of living in a particular state, and other working conditions. The estimating equation in its simplest form specifies salary as a function of various teacher characteristics that economic theory would suggest could affect salary. These would include education level as a measure of potential productivity, age and/or experience to capture on-the-job training, again as a measure of productivity. Because the data are for school teachers employed in public school districts rather than workers in a for-profit company, the school district’s average revenue per average daily student attendance is included as a measure of ability to pay. The results of the regression analysis using data on 24,916 school teachers are presented below.

 

Variable                                     Coefficient    Standard Error
Intercept                                    4269.9
Gender (male = 1)                            632.38         13.39
Total Years of Experience                    52.32          1.10
Years of Experience in Current District      29.97          1.52
Education                                    629.33         13.16
Total Revenue per ADA                        90.24          3.76
R²                                           .725
n                                            24,916

Table 1: Earnings Estimate for Elementary and Secondary School Teachers

The coefficients for all the independent variables are significantly different from zero as indicated by the standard errors. Dividing each coefficient by its standard error results in a t-value greater than 1.96, which is the required level for 95% significance. The binary variable, our dummy variable of interest in this analysis, is gender, where male is given a value of 1 and female a value of 0. The coefficient is significantly different from zero, with a dramatic t-statistic of 47 standard deviations. We thus cannot accept the null hypothesis that the coefficient is equal to zero. Therefore we conclude that there is a premium of $632 paid to male teachers after holding constant experience, education and the wealth of the school district in which the teacher is employed. It is important to note that these data are from some time ago and the $632 represents a six percent salary premium at that time. A graph of this example of dummy variables is presented below (Figure 3).

[Figure 3: Salary versus total years of experience for female and male teachers; the gender dummy variable shifts the estimated line upward in parallel for males.]

In two dimensions, salary is the dependent variable on the vertical axis and total years of experience was chosen as the continuous independent variable on the horizontal axis. Any of the other independent variables could have been chosen to illustrate the effect of the dummy variable. The relationship between salary and total years of experience has a slope of $52.32 per year of experience, and the estimated line has an intercept of $4,269 when the gender variable is equal to zero, i.e., for females. When the gender variable is equal to 1, for males, the coefficient for the gender variable is added to the intercept, and thus the relationship between total years of experience and salary is shifted upward in parallel, as indicated on the graph. Also marked on the graph are various points for reference. A female school teacher with 10 years of experience receives a salary of $4,792 on the basis of her experience only, but this is still $109 less than a male teacher with zero years of experience.

A more complex interaction between a dummy variable and the dependent variable can also be estimated. It may be that the dummy variable has more than a simple shift effect on the dependent variable, but also interacts with one or more of the other continuous independent variables. While not tested in the example above, it could be hypothesized that the impact of gender on salary was not a one-time shift, but also affected the value of additional years of experience on salary. That is, female school teachers' salaries were discounted at the start and, further, did not grow at the same rate with experience as male school teachers' salaries. This would show up as a different slope for the relationship between salary and total years of experience for males than for females. If this is so, then female school teachers would not just start behind their male colleagues (as measured by the shift in the estimated regression line), but would fall further and further behind as time and experience increased.
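A minimal R sketch of both models, on simulated (hypothetical) data patterned loosely on the teacher-salary example, would look something like this:

set.seed(3)
n          <- 500
experience <- runif(n, 0, 30)
male       <- rbinom(n, 1, 0.5)                 # dummy variable: 1 = male, 0 = female
salary     <- 4000 + 50 * experience + 600 * male +
              10 * male * experience + rnorm(n, sd = 200)   # simulated salaries

fit_shift <- lm(salary ~ experience + male)     # parallel shift only
fit_inter <- lm(salary ~ experience * male)     # different intercepts AND different slopes
summary(fit_inter)    # the experience:male coefficient tests whether the slopes differ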

The graph below shows how this hypothesis can be tested with the use of dummy variables and an interaction variable ( Figure 4 ).

[Figure 4: Salary versus total years of experience when a gender-by-experience interaction is included, so that the lines for male and female teachers have different slopes as well as different intercepts.]

Interpretation of Regression Coefficients: Elasticity and Logarithmic Transformation

As we have seen, the coefficient of an equation estimated using OLS regression analysis provides an estimate of the slope of a straight line that is assumed to be the relationship between the dependent variable and at least one independent variable. From the calculus, the slope of the line is the first derivative and tells us the magnitude of the impact of a one unit change in the X variable upon the value of the Y variable, measured in the units of the Y variable. As we saw in the case of dummy variables, this can show up as a parallel shift in the estimated line or even a change in the slope of the line through an interactive variable. Here we wish to explore the concept of elasticity and how we can use a regression analysis to estimate the various elasticities in which economists have an interest.

The concept of elasticity is borrowed from engineering and physics, where it is used to measure a material’s responsiveness to a force, typically a physical force such as a stretching/pulling force. It is from here that we get the term an “elastic” band. In economics, the force in question is some market force such as a change in price or income. Elasticity is measured as a percentage change/response in both engineering applications and in economics. The value of measuring in percentage terms is that the units of measurement do not play a role in the value of the measurement, which thus allows direct comparison between elasticities. As an example, if the price of gasoline increased, say, 50 cents from an initial price of $3.00 and generated a decline in monthly consumption for a consumer from 50 gallons to 48 gallons, we calculate the elasticity to be 0.25. The price elasticity is the percentage change in quantity resulting from some percentage change in price. A 16 percent increase in price has generated only a 4 percent decrease in demand: 16% price change → 4% quantity change, or .04/.16 = .25. This is called an inelastic demand, meaning a small response to the price change. This comes about because there are few if any real substitutes for gasoline; perhaps public transportation, a bicycle or walking. Technically, of course, the percentage change in demand from a price increase is a decline in demand, thus price elasticity is a negative number. The common convention, however, is to talk about elasticity as the absolute value of the number. Some goods have many substitutes: pears for apples, for plums, for grapes, and so on. The elasticity for such goods is larger than one, and they are said to be elastic in demand. Here a small percentage change in price will induce a large percentage change in quantity demanded. The consumer will easily shift the demand to the close substitute.

While this discussion has been about price changes, any of the independent variables in a demand equation will have an associated elasticity. Thus, there is an income elasticity that measures the sensitivity of demand to changes in income: not much for the demand for food, but very sensitive for yachts. If the demand equation contains a term for substitute goods, say candy bars in a demand equation for cookies, then the responsiveness of demand for cookies from changes in prices of candy bars can be measured. This is called the cross-price elasticity of demand and to an extent can be thought of as brand loyalty from a marketing view. How responsive is the demand for Coca-Cola to changes in the price of Pepsi?

Now imagine the demand for a product that is very expensive. Again, the measure of elasticity is in percentage terms, thus the elasticity can be directly compared to that for gasoline: an elasticity of 0.25 for gasoline conveys the same information as an elasticity of 0.25 for a $25,000 car. Both goods are considered by the consumer to have few substitutes and thus have inelastic demand curves, elasticities less than one.

The mathematical formulae for various elasticities are:

\eta_p=\frac{\% \Delta Q}{\% \Delta P}

\eta_Y=\frac{\% \Delta Q}{\% \Delta Y}

Where Y is used as the symbol for income.

\eta_{P_2}=\frac{\% \Delta Q_1}{\% \Delta P_2}

Where P_2 is the price of the substitute good.

Examining closer the price elasticity we can write the formula as:

\eta_p=\frac{\% \Delta Q}{\% \Delta P}=\frac{dQ}{dP}\left(\frac{P}{Q}\right)=b\left(\frac{P}{Q}\right)

Where b is the estimated coefficient for price in the OLS regression, i.e., b is the estimate of \frac{dQ}{dP}.

Along a straight-line demand curve the percentage change, thus elasticity, changes continuously as the scale changes, while the slope, the estimated regression coefficient, remains constant. Going back to the demand for gasoline, a change in price from $3.00 to $3.50 was a 16 percent increase in price. If the beginning price were $5.00, then the same 50¢ increase would be only a 10 percent increase, generating a different elasticity. Every straight-line demand curve has a range of elasticities, starting at the top left, high prices, with large elasticity numbers, elastic demand, and decreasing as one goes down the demand curve, inelastic demand.

In order to provide a meaningful estimate of the elasticity of demand the convention is to estimate the elasticity at the point of means. Remember that all OLS regression lines will go through the point of means. At this point is the greatest weight of the data used to estimate the coefficient. The formula to estimate an elasticity when an OLS demand curve has been estimated becomes:

\eta_p=b\left(\frac{\bar{P}}{\bar{Q}}\right)

The same method can be used to estimate the other elasticities for the demand function by using the appropriate mean values of the other variables; income and price of substitute goods for example.
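A minimal R sketch of this calculation, using simulated (hypothetical) demand data, is:

set.seed(4)
n     <- 200
price <- runif(n, 2, 6)
qty   <- 80 - 8 * price + rnorm(n, sd = 4)   # simulated linear demand data

fit <- lm(qty ~ price)
b   <- coef(fit)["price"]            # estimated slope, dQ/dP
b * mean(price) / mean(qty)          # price elasticity evaluated at the point of means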

Logarithmic Transformation of the Data

Ordinary least squares estimates typically assume that the population relationship among the variables is linear, thus of the form presented in The Regression Equation. In this form the interpretation of the coefficients is as discussed above; quite simply, the coefficient provides an estimate of the impact of a one unit change in X on Y, measured in units of Y. It does not matter just where along the line one wishes to make the measurement because it is a straight line with a constant slope, thus a constant estimated level of impact per unit change. It may be, however, that the analyst wishes to estimate not the simple unit measured impact on the Y variable, but the magnitude of the percentage impact on Y of a one unit change in the X variable. Such a case might be how a unit change in experience, say one year, affects not the absolute amount of a worker’s wage, but the percentage impact on the worker’s wage. Alternatively, it may be that the question asked is the unit measured impact on Y of a specific percentage increase in X. An example may be “by how many dollars will sales increase if the firm spends X percent more on advertising?” The third possibility is the case of elasticity discussed above. Here we are interested in the percentage impact on quantity demanded for a given percentage change in price, income, or perhaps the price of a substitute good. All three of these cases can be estimated by transforming the data to logarithms before running the regression. The resulting coefficients will then provide a percentage change measurement of the relevant variable.

To summarize, there are four cases:

Case 1: The ordinary least squares case begins with the linear model developed above:

Y=b_0+b_1x

Case 2: The underlying estimated equation is:

log Y=b_0+b_1X

Multiplying by 100 to convert to percentages and rearranging terms gives:

100\, b_1 = \frac{\% \Delta Y}{\text{unit } \Delta X}

Case 3: In this case the question is “what is the unit change in Y resulting from a percentage change in X?” What is the dollar loss in revenues of a five percent increase in price or what is the total dollar cost impact of a five percent increase in labor costs? The estimated equation for this case would be:

Y=b_0+b_1 log (X)

Here the calculus differential of the estimated equation is:

dY=b_1d(logX)

Dividing by 100 to get percentages and rearranging terms gives:

\frac{b_1}{100}=\frac{dY}{100\frac{dX}{X}}=\frac{Unit \Delta Y}{\% \Delta X}

Case 4: This is the elasticity case where both the dependent and independent variables are converted to logs before the OLS estimation. This is known as the log-log case or double log case, and provides us with direct estimates of the elasticities of the independent variables. The estimated equation is:

log Y=b_0+b_1 log (X)

Differentiating we have:

d(\log Y)=b_1\, d(\log X)

Rearranging terms gives:

b_1=\frac{d(\log Y)}{d(\log X)}=\frac{dY/Y}{dX/X}=\frac{\% \Delta Y}{\% \Delta X},

our definition of elasticity. We conclude that we can directly estimate the elasticity of a variable through double log transformation of the data. The estimated coefficient is the elasticity. It is common to use double log transformation of all variables in the estimation of demand functions to get estimates of all the various elasticities of the demand curve.
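All four cases are easy to fit in R by wrapping variables in log() inside the model formula. The sketch below uses simulated (hypothetical) data generated with a constant elasticity of 0.8:

set.seed(5)
n <- 300
x <- runif(n, 1, 10)
y <- exp(1 + 0.8 * log(x) + rnorm(n, sd = 0.1))   # constant-elasticity relationship

case1 <- lm(y ~ x)                 # b1 = unit change in y per unit change in x
case2 <- lm(log(y) ~ x)            # 100*b1 = % change in y per unit change in x
case3 <- lm(y ~ log(x))            # b1/100 = unit change in y per 1% change in x
case4 <- lm(log(y) ~ log(x))       # b1 = elasticity (% change in y per % change in x)
coef(case4)["log(x)"]              # should be close to the true elasticity of 0.8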

Predicting with a Multiple Regression Equation

One important value of an estimated regression equation is its ability to predict the effects on Y of a change in one or  more values of the independent variables. The value of this is obvious. Careful policy cannot be made without estimates of the effects that may result. Indeed, it is the desire for particular results that drive the formation of most policy. Regression models can be, and have been, invaluable aids in forming such policies.

The Gauss-Markov theorem assures us that the point estimate of the impact on the dependent variable derived by putting in the equation the hypothetical values of the independent variables one wishes to simulate will result in an estimate of the dependent variable which is minimum variance and unbiased. That is to say that from this equation comes the best unbiased point estimate of y given the values of x.

\hat{y}=b_0+b_1X_{1i}+b_2X_{2i}+\cdots+b_kX_{ki}

Remember that point estimates do not carry a particular level of probability, or level of confidence, because points have  no “width” above which there is an area to measure. This was why we developed confidence intervals for the mean and proportion earlier. The same concern arises here also. There are actually two different approaches to the issue of developing estimates of changes in the independent variable, or variables, on the dependent variable. The first approach wishes to measure the expected mean value of y from a specific change in the value of x: this specific value implies the expected value. Here the question is: what is the mean impact on y that would result from multiple hypothetical experiments on y at this specific value of x. Remember that there is a variance around the estimated parameter of x and thus each experiment will result in a bit of a different estimate of the predicted value of y.

The second approach to estimate the effect of a specific value of x on y treats the event as a single experiment: you choose x and multiply it times the coefficient and that provides a single estimate of y. Because this approach acts as if there were a single experiment the variance that exists in the parameter estimate is larger than the variance associated with the expected value approach.

The conclusion is that we have two different ways to predict the effect of values of the independent variable(s) on the dependent variable and thus we have two different intervals. Both are correct answers to the question being asked, but there are two different questions. To avoid confusion, the first case where we are asking for the expected value of the mean of the estimated y, is called a confidence interval as we have named this concept before. The second case, where we are asking for the estimate of the impact on the dependent variable y of a single experiment using a value of x, is called the prediction interval . The test statistics for these two interval measures within which the estimated value of y will fall are:

[The formulas for these two interval estimates appeared here as an image in the original and are not reproduced.]
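For reference, the standard textbook forms of these two intervals for a regression with k independent variables (stated here as an assumption about what the original figure showed) are:

\hat{y}_0 \pm t_{\alpha/2,\, n-k-1}\, s_e \sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \quad \text{(confidence interval for the mean value of } y \text{ at } \mathbf{x}_0\text{)}

\hat{y}_0 \pm t_{\alpha/2,\, n-k-1}\, s_e \sqrt{1+\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \quad \text{(prediction interval for a single new observation)}

Here s_e is the standard error of the regression and \mathbf{x}_0 is the vector of chosen values of the independent variables, including a 1 for the intercept.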

The mathematical computations of these two test statistics are complex. Various computer regression software packages provide programs within the regression functions to provide answers to inquires of estimated predicted values of y given various values chosen for the x variable(s). It is important to know just which interval is being tested in the computer package because the difference in the size of the standard deviations will change the size of the interval estimated. This is shown in Figure 5.

[Figure 5: The confidence interval for the expected value of y compared with the wider prediction interval for a single observation.]

Figure 5 shows visually the difference the standard deviation makes in the size of the estimated intervals. The confidence interval, measuring the expected value of the dependent variable, is smaller than the prediction interval for the same level of confidence. The expected value method assumes that the experiment is conducted multiple times rather than just once as in the other method. The logic here is similar, although not identical, to that discussed when developing the relationship between the sample size and the confidence interval using the Central Limit Theorem. There, as the number of experiments increased, the distribution narrowed and the confidence interval became tighter around the expected value of the mean.
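In R, for example, both intervals are available from predict(); a minimal sketch on simulated (hypothetical) data:

set.seed(6)
n <- 100
x <- runif(n, 0, 10)
y <- 5 + 2 * x + rnorm(n)           # simulated data
fit <- lm(y ~ x)

new <- data.frame(x = c(2, 5, 9))
predict(fit, newdata = new, interval = "confidence")   # interval for the mean of y at each x
predict(fit, newdata = new, interval = "prediction")   # wider interval for a single new y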

It is also important to note that the intervals around a point estimate are highly dependent upon the range of data used to estimate the equation, regardless of which approach is being used for prediction. Remember that all regression equations go through the point of means, that is, the mean value of y and the mean values of all independent variables in the equation. As the value of x chosen to estimate the associated value of y moves further from the point of means, the width of the estimated interval around the point estimate increases. Choosing values of x beyond the range of the data used to estimate the equation poses even greater danger of creating estimates with little use: very large intervals and risk of error. Figure 6 shows this relationship.

[Figure 6: The estimated intervals widen as the chosen value of x moves away from the point of means.]

Third Exam/Final Exam Example

We found the equation of the best-fit line for the final exam grade as a function of the grade on the third-exam. We can now use the least-squares regression line for prediction. Assume the coefficient for X was determined to be significantly different from zero.

Suppose you want to estimate, or predict, the mean final exam score of statistics students who received 73 on the third exam. The exam scores (x-values) range from 65 to 75. Since 73 is between the x-values 65 and 75, we feel comfortable substituting x = 73 into the equation. Then:

\hat{y} = -173.51 + 4.83(73) = 179.08

We predict that statistics students who earn a grade of 73 on the third exam will earn a grade of 179.08 on the final exam, on average.

1. What would you predict the final exam score to be for a student who scored a 66 on the third exam?

Solution 1: Since 66 is within the observed range of x values (65 to 75), we can substitute x = 66 into the equation: \hat{y} = -173.51 + 4.83(66) = 145.27.

2. What would you predict the final exam score to be for a student who scored a 90 on the third exam?

Solution 2:

The x values in the data are between 65 and 75. Ninety is outside of the domain of the observed x values in the data (independent variable), so you cannot reliably predict the final exam score for this student. (Even though it is possible to enter 90 into the equation for x and calculate a corresponding y value, the y value that you get will have a confidence interval that may not be meaningful.)

To see just how unreliable the prediction can be outside of the range of x values observed in the data, make the substitution x = 90 into the equation.

\hat{y} = -173.51 + 4.83(90) = 261.19

The final-exam score is predicted to be 261.19. The largest the final-exam score can be is 200.
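A quick R sketch of these calculations, using the fitted line reported above:

yhat <- function(x) -173.51 + 4.83 * x   # the estimated regression line from the example
yhat(66)   # 145.27 -- inside the observed range of x (65 to 75)
yhat(73)   # 179.08
yhat(90)   # 261.19 -- extrapolation well beyond the observed x values; not reliable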

Quantitative Analysis for Business Copyright © by Barbara Illowsky; Margo Bergman; and Susan Dean is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Decision-Making Using Regression Analysis: A Case Study on Top Tier Holidays LLP

Publication date: 7 February 2023

Issue publication date: 27 March 2023

Teaching notes

Research methodology.

This study aims to investigate the factors that contribute to the overall tour experience and services provided by Top Tier Holidays. The study is mixed in nature, and the researchers have used analytical tools to analyse the data factually. Multiple regression using MS Excel is used in the study.

Case overview/synopsis

This case is based on the experiences of a real-life travel and tour company located in New Delhi, India. The case helps understand regression analysis to identify independent variables significantly impacting the tour experience. The CEO of the company is focused on improving the overall customer experience. The CEO has identified six principal determinants (variables) applicable to tour companies’ success. These variables are hotel experience, transportation, cab driver, on-tour support, itinerary planning and pricing.

Multiple regression analysis using Microsoft Excel is conducted on the above determinants (the independent variables) and the overall tour experience (the dependent variable). This analysis would help identify the relationship between the independent and dependent variables and find the variables that significantly impact the dependent variable. This case also helps us appreciate the importance of various parameters that affect the overall customer tour experience and the challenges a tour operator company faces in the current competitive business environment.

Complexity academic level

This case is designed for discussion in undergraduate courses in business management, commerce and tourism management programmes. The case builds up readers’ understanding of linear regression with multiple variables and shows how multiple linear regression can help companies identify the significant variables affecting business outcomes.

  • Multiple regression
  • Correlation coefficient
  • Tourism industry
  • Customer ratings
  • Regression model optimisation

Acknowledgements

The authors would like to extend their sincere appreciation and gratitude to the CEO and executives of ‘Top Tier Holidays LLP’ for providing relevant information for writing this case study. The authors are truly thankful for their support.

Funding : The authors have not received any funding from any government or non-government sources for writing this case study.

Disclaimer. This case is intended to be used as the basis for class discussion rather than to illustrate either effective or ineffective handling of a management situation. The case was compiled from published sources.

Kumar, N. , Rath, A. , Singh, A.K. and Akoijam, S.L.S. (2023), "Decision-making using regression analysis: a case study on Top Tier Holidays LLP", , Vol. 19 No. 2, pp. 273-289. https://doi.org/10.1108/TCJ-01-2022-0004

Emerald Publishing Limited

Copyright © 2023, Emerald Publishing Limited


Multivariable Methods

Multiple Linear Regression Analysis


Multiple linear regression analysis is an extension of simple linear regression analysis, used to assess the association between two or more independent variables and a single continuous dependent variable. The multiple linear regression equation is as follows:

\hat{Y} = b_0 + b_1X_1 + b_2X_2 + \cdots + b_pX_p

Multiple regression analysis is also used to assess whether confounding exists. Since multiple linear regression analysis allows us to estimate the association between a given independent variable and the outcome holding all other variables constant, it provides a way of adjusting for (or accounting for) potentially confounding variables that have been included in the model.

Suppose we have a risk factor or an exposure variable, which we denote X 1 (e.g., X 1 =obesity or X 1 =treatment), and an outcome or dependent variable which we denote Y. We can estimate a simple linear regression equation relating the risk factor (the independent variable) to the dependent variable as follows:

\hat{Y} = b_0 + b_1X_1

where b 1 is the estimated regression coefficient that quantifies the association between the risk factor and the outcome.

Suppose we now want to assess whether a third variable (e.g., age) is a confounder . We denote the potential confounder X 2 , and then estimate a multiple linear regression equation as follows:

\hat{Y} = b_0 + b_1X_1 + b_2X_2

In the multiple linear regression equation, b 1 is the estimated regression coefficient that quantifies the association between the risk factor X 1 and the outcome, adjusted for X 2 (b 2 is the estimated regression coefficient that quantifies the association between the potential confounder and the outcome). As noted earlier, some investigators assess confounding by assessing how much the regression coefficient associated with the risk factor (i.e., the measure of association) changes after adjusting for the potential confounder. In this case, we compare b 1 from the simple linear regression model to b 1 from the multiple linear regression model. As a rule of thumb, if the regression coefficient from the simple linear regression model changes by more than 10%, then X 2 is said to be a confounder.

Once a variable is identified as a confounder, we can then use multiple linear regression analysis to estimate the association between the risk factor and the outcome adjusting for that confounder. The test of significance of the regression coefficient associated with the risk factor can be used to assess whether the association between the risk factor is statistically significant after accounting for one or more confounding variables. This is also illustrated below.
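A minimal R sketch of this strategy on simulated (hypothetical) data, comparing the crude and adjusted coefficients and applying the 10% rule:

set.seed(7)
n  <- 1000
x2 <- rnorm(n)                         # potential confounder (e.g., age)
x1 <- 0.5 * x2 + rnorm(n)              # risk factor, associated with the confounder
y  <- 1 + 0.3 * x1 + 0.6 * x2 + rnorm(n)

b1_crude    <- coef(lm(y ~ x1))["x1"]         # unadjusted (crude) estimate
b1_adjusted <- coef(lm(y ~ x1 + x2))["x1"]    # estimate adjusted for x2
100 * (b1_crude - b1_adjusted) / b1_crude     # % change; more than 10% suggests confounding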

Example - The Association Between BMI and Systolic Blood Pressure 

Suppose we want to assess the association between BMI and systolic blood pressure using data collected in the seventh examination of the Framingham Offspring Study. A total of n=3,539 participants attended the exam, and their mean systolic blood pressure was 127.3 with a standard deviation of 19.0. The mean BMI in the sample was 28.2 with a standard deviation of 5.3.

A simple linear regression analysis reveals the following:

Independent Variable    Regression Coefficient    T        P-value
Intercept               108.28                    62.61    0.0001
BMI                     0.67                      11.06    0.0001

The simple linear regression model is:

\hat{Y} = 108.28 + 0.67 \times \text{BMI}

Suppose we now want to assess whether age (a continuous variable, measured in years), male gender (yes/no), and treatment for hypertension (yes/no) are potential confounders, and if so, appropriately account for these using multiple linear regression analysis. For analytic purposes, treatment for hypertension is coded as 1=yes and 0=no. Gender is coded as 1=male and 0=female. A multiple regression analysis reveals the following:

Independent Variable          Regression Coefficient    T        P-value
Intercept                     68.15                     26.33    0.0001
BMI                           0.58                      10.30    0.0001
Age                           0.65                      20.22    0.0001
Male gender                   0.94                      1.58     0.1133
Treatment for hypertension    6.44                      9.74     0.0001

 The multiple regression model is:

\hat{Y} = 68.15 + 0.58 \times \text{BMI} + 0.65 \times \text{Age} + 0.94 \times \text{Male gender} + 6.44 \times \text{Treatment for hypertension}

Notice that the association between BMI and systolic blood pressure is smaller (0.58 versus 0.67) after adjustment for age, gender and treatment for hypertension. BMI remains statistically significantly associated with systolic blood pressure (p=0.0001), but the magnitude of the association is lower after adjustment. The regression coefficient decreases by 13%.


Using the informal rule (i.e., a change in the coefficient in either direction by 10% or more), we meet the criteria for confounding. Thus, part of the association between BMI and systolic blood pressure is explained by age, gender and treatment for hypertension.

This also suggests a useful way of identifying confounding. Typically, we try to establish the association between a primary risk factor and a given outcome after adjusting for one or more other risk factors. One useful strategy is to use multiple regression models to examine the association between the primary risk factor and the outcome before and after including possible confounding factors. If the inclusion of a possible confounding variable in the model causes the association between the primary risk factor and the outcome to change by 10% or more, then the additional variable is a confounder.

Assessing only the p-values suggests that these three independent variables are equally statistically significant. The magnitude of the t statistics provides a means to judge relative importance of the independent variables. In this example, age is the most significant independent variable, followed by BMI, treatment for hypertension and then male gender. In fact, male gender does not reach statistical significance (p=0.1133) in the multiple regression model.

Some investigators argue that regardless of whether an important variable such as gender reaches statistical significance it should be retained in the model. Other investigators only retain variables that are statistically significant.


This is yet another example of the complexity involved in multivariable modeling. The multiple regression model produces an estimate of the association between BMI and systolic blood pressure that accounts for differences in systolic blood pressure due to age, gender and treatment for hypertension.

A one unit increase in BMI is associated with a 0.58 unit increase in systolic blood pressure holding age, gender and treatment for hypertension constant. Each additional year of age is associated with a 0.65 unit increase in systolic blood pressure, holding BMI, gender and treatment for hypertension constant.

Men have higher systolic blood pressures, by approximately 0.94 units, holding BMI, age and treatment for hypertension constant and persons on treatment for hypertension have higher systolic blood pressures, by approximately 6.44 units, holding BMI, age and gender constant. The multiple regression equation can be used to estimate systolic blood pressures as a function of a participant's BMI, age, gender and treatment for hypertension status. For example, we can estimate the blood pressure of a 50 year old male, with a BMI of 25 who is not on treatment for hypertension as follows:

\hat{Y} = 68.15 + 0.58(25) + 0.65(50) + 0.94(1) + 6.44(0) = 116.09

We can estimate the blood pressure of a 50 year old female, with a BMI of 25 who is on treatment for hypertension as follows:

\hat{Y} = 68.15 + 0.58(25) + 0.65(50) + 0.94(0) + 6.44(1) = 121.59
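These two predictions can be reproduced directly from the reported coefficients; a minimal R sketch (the object names are ours):

b <- c(intercept = 68.15, bmi = 0.58, age = 0.65, male = 0.94, treat = 6.44)

# 50-year-old male, BMI of 25, not on treatment for hypertension:
b["intercept"] + b["bmi"] * 25 + b["age"] * 50 + b["male"] * 1 + b["treat"] * 0   # about 116.1

# 50-year-old female, BMI of 25, on treatment for hypertension:
b["intercept"] + b["bmi"] * 25 + b["age"] * 50 + b["male"] * 0 + b["treat"] * 1   # about 121.6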

On page 4 of this module we considered data from a clinical trial designed to evaluate the efficacy of a new drug to increase HDL cholesterol. One hundred patients enrolled in the study and were randomized to receive either the new drug or a placebo. The investigators were at first disappointed to find very little difference in the mean HDL cholesterol levels of treated and untreated subjects.

 

            Sample Size    Mean HDL    Standard Deviation of HDL
New Drug    50             40.16       4.46
Placebo     50             39.21       3.91

However, when they analyzed the data separately in men and women, they found evidence of an effect in men, but not in women. We noted that when the magnitude of association differs at different levels of another variable (in this case gender), it suggests that effect modification is present.

 

Women

            Sample Size    Mean HDL    Standard Deviation of HDL
New Drug    40             38.88       3.97
Placebo     41             39.24       4.21

Men

            Sample Size    Mean HDL    Standard Deviation of HDL
New Drug    10             45.25       1.89
Placebo     9              39.06       2.22

Multiple regression analysis can be used to assess effect modification. This is done by estimating a multiple regression equation relating the outcome of interest (Y) to independent variables representing the treatment assignment, sex and the product of the two (called the treatment by sex interaction variable ). For the analysis, we let T = the treatment assignment (1=new drug and 0=placebo), M = male gender (1=yes, 0=no) and TM, i.e., T * M or T x M, the product of treatment and male gender. In this case, the multiple regression analysis revealed the following: 

Independent Variable            Regression Coefficient    T        P-value
Intercept                       39.24                     65.89    0.0001
T (Treatment)                   -0.36                     -0.43    0.6711
M (Male Gender)                 -0.18                     -0.13    0.8991
TM (Treatment x Male Gender)    6.55                      3.37     0.0011

The multiple regression model is:

\hat{Y} = 39.24 - 0.36\,T - 0.18\,M + 6.55\,TM

The details of the test are not shown here, but note in the table above that in this model, the regression coefficient associated with the interaction term, b 3 , is statistically significant (i.e., H 0 : b 3 = 0 versus H 1 : b 3 ≠ 0). The fact that this is statistically significant indicates that the association between treatment and outcome differs by sex.

The model shown above can be used to estimate the mean HDL levels for men and women who are assigned to the new medication and to the placebo. In order to use the model to generate these estimates, we must recall the coding scheme (i.e., T = 1 indicates new drug, T=0 indicates placebo, M=1 indicates male sex and M=0 indicates female sex).
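In R, the interaction model and the four predicted cell means can be obtained as in the following sketch; the data are simulated (hypothetical), with effects similar to those reported above:

set.seed(8)
n <- 100
t <- rbinom(n, 1, 0.5)                   # treatment: 1 = new drug, 0 = placebo
m <- rbinom(n, 1, 0.5)                   # sex: 1 = male, 0 = female
y <- 39.2 - 0.4 * t - 0.2 * m + 6.5 * t * m + rnorm(n, sd = 4)   # simulated HDL

fit <- lm(y ~ t * m)                     # includes t, m and the t:m interaction term
cells <- expand.grid(t = c(0, 1), m = c(0, 1))
cbind(cells, predicted = predict(fit, newdata = cells))   # predicted mean HDL in each cell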

The expected or predicted HDL for men (M=1) assigned to the new drug (T=1) can be estimated as follows:

\hat{Y} = 39.24 - 0.36(1) - 0.18(1) + 6.55(1) = 45.25

The expected HDL for men (M=1) assigned to the placebo (T=0) is:

\hat{Y} = 39.24 - 0.36(0) - 0.18(1) + 6.55(0) = 39.06

Similarly, the expected HDL for women (M=0) assigned to the new drug (T=1) is:

\hat{Y} = 39.24 - 0.36(1) - 0.18(0) + 6.55(0) = 38.88

The expected HDL for women (M=0)assigned to the placebo (T=0) is:

\hat{Y} = 39.24 - 0.36(0) - 0.18(0) + 6.55(0) = 39.24

Notice that the expected HDL levels for men and women on the new drug and on placebo are identical to the means shown in the table summarizing the stratified analysis. Because there is effect modification, separate simple linear regression models are estimated to assess the treatment effect in men and women:

Men

Independent Variable    Regression Coefficient    T        P-value
Intercept               39.08                     57.09    0.0001
T (Treatment)           6.19                      6.56     0.0001

Women

Independent Variable    Regression Coefficient    T        P-value
Intercept               39.24                     61.36    0.0001
T (Treatment)           -0.36                     -0.40    0.6927

The regression models are: in men, \hat{Y} = 39.08 + 6.19\,T, and in women, \hat{Y} = 39.24 - 0.36\,T.

In men, the regression coefficient associated with treatment (b 1 =6.19) is statistically significant (details not shown), but in women, the regression coefficient associated with treatment (b 1 = -0.36) is not statistically significant (details not shown).

Multiple linear regression analysis is a widely applied technique. In this section we showed how it can be used to assess and account for confounding and to assess effect modification. The techniques we described can be extended to adjust for several confounders simultaneously and to investigate more complex effect modification (e.g., three-way statistical interactions).

There is an important distinction between confounding and effect modification. Confounding is a distortion of an estimated association caused by an unequal distribution of another risk factor. When there is confounding, we would like to account for it (or adjust for it) in order to estimate the association without distortion. In contrast, effect modification is a biological phenomenon in which the magnitude of association differs at different levels of another factor, e.g., a drug that has an effect in men, but not in women. In the example presented above, it would be inappropriate to pool the results in men and women. Instead, the goal should be to describe effect modification and report the different effects separately.

There are many other applications of multiple regression analysis. A popular application is to assess the relationships between several predictor variables simultaneously, and a single, continuous outcome. For example, it may be of interest to determine which predictors, in a relatively large set of candidate predictors, are most important or most strongly associated with an outcome. It is always important in statistical analysis, particularly in the multivariable arena, that statistical modeling is guided by biologically plausible associations.

Independent variables in regression models can be continuous or dichotomous. Regression models can also accommodate categorical independent variables. For example, it might be of interest to assess whether there is a difference in total cholesterol by race/ethnicity. The module on Hypothesis Testing presented analysis of variance as one way of testing for differences in means of a continuous outcome among several comparison groups. Regression analysis can also be used. However, the investigator must create indicator variables to represent the different comparison groups (e.g., different racial/ethnic groups). The set of indicator variables (also called dummy variables) is considered in the multiple regression model simultaneously as a set of independent variables. For example, suppose that participants indicate which of the following best represents their race/ethnicity: White, Black or African American, American Indian or Alaskan Native, Asian, Native Hawaiian or Pacific Islander, or Other Race. This categorical variable has six response options. To consider race/ethnicity as a predictor in a regression model, we create five indicator variables (one less than the total number of response options) to represent the six different groups. To create the set of indicators, or set of dummy variables, we first decide on a reference group or category. In this example, the reference group is the racial group that we will compare the other groups against. Indicator variables are created for the remaining groups, coded 1 for participants who are in that group (e.g., are of the specific race/ethnicity of interest) and 0 for all others. In the multiple regression model, the regression coefficients associated with each of the dummy variables (representing, in this example, each race/ethnicity group) are interpreted as the expected difference in the mean of the outcome variable for that race/ethnicity as compared to the reference group, holding all other predictors constant. The example below uses an investigation of risk factors for low birth weight to illustrate this technique as well as the interpretation of the regression coefficients in the model.
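In R, this coding is handled automatically when the categorical variable is stored as a factor; a minimal sketch on simulated (hypothetical) data, with White as the reference group:

set.seed(9)
n    <- 500
race <- factor(sample(c("White", "Black", "Hispanic", "Other"), n, replace = TRUE))
race <- relevel(race, ref = "White")                      # choose the reference category
chol <- 190 + 5 * (race == "Black") + rnorm(n, sd = 10)   # simulated total cholesterol

fit <- lm(chol ~ race)   # R creates the indicator (dummy) variables behind the scenes
summary(fit)             # each coefficient compares that group with the reference group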

An observational study is conducted to investigate risk factors associated with infant birth weight. The study involves 832 pregnant women. Each woman provides demographic and clinical data and is followed through the outcome of pregnancy. At the time of delivery, the infant's birth weight is measured, in grams, as is the gestational age, in weeks. Birth weights vary widely and range from 404 to 5400 grams. The mean birth weight is 3367.83 grams with a standard deviation of 537.21 grams. Investigators wish to determine whether there are differences in birth weight by infant gender, gestational age, mother's age and mother's race. In the study sample, 421/832 (50.6%) of the infants are male and the mean gestational age at birth is 39.49 weeks with a standard deviation of 1.81 weeks (range 22-43 weeks). The mean mother's age is 30.83 years with a standard deviation of 5.76 years (range 17-45 years). Approximately 49% of the mothers are white; 41% are Hispanic; 5% are black; and 5% identify themselves as other race. A multiple regression analysis is performed relating infant gender (coded 1=male, 0=female), gestational age in weeks, mother's age in years and 3 dummy or indicator variables reflecting mother's race. The results are summarized in the table below.

Independent Variable      Regression Coefficient    T         P-value
Intercept                 -3850.92                  -11.56    0.0001
Male infant               174.79                    6.06      0.0001
Gestational age, weeks    179.89                    22.35     0.0001
Mother's age, years       1.38                      0.47      0.6361
Black race                -138.46                   -1.93     0.0535
Hispanic race             -13.07                    -0.37     0.7103
Other race                -68.67                    -1.05     0.2918

Many of the predictor variables are statistically significantly associated with birth weight. Male infants are approximately 175 grams heavier than female infants, adjusting for gestational age, mother's age and mother's race/ethnicity. Gestational age is highly significant (p=0.0001), with each additional gestational week associated with an increase of 179.89 grams in birth weight, holding infant gender, mother's age and mother's race/ethnicity constant. Mother's age does not reach statistical significance (p=0.6361). Mother's race is modeled as a set of three dummy or indicator variables. In this analysis, white race is the reference group. Infants born to black mothers have lower birth weight by approximately 140 grams (as compared to infants born to white mothers), adjusting for gestational age, infant gender and mother's age. This difference is marginally significant (p=0.0535). There are no statistically significant differences in birth weight in infants born to Hispanic versus white mothers, or in infants born to women who identify themselves as other race as compared to white mothers.


Multiple Regression Case Study


The following is a sample Multiple Regression Case Study. There are several key elements to a successful regression analysis. The first is choosing the right functional model. The second consists of assessing whether the regression assumptions are fulfilled.

These two elements go hand in hand and depend on each other. That is, after choosing a functional model, the assumptions need to be verified; if they are not met, we may well need to revise the structure of the functional model, or consider whether a different link function or a ridge regression should be used, and so on.

The possibilities are endless, and an expert eye is required. One thing is clear: having a regression model that does not meet the assumptions is as useful as not having any model at all.

A Multiple Linear Model for Life Expectancy

A multiple linear regression model is constructed in order to predict Life Expectancy. Six possible predictors were considered, and using stepwise regression the final model consisted of only two predictors: Human Development Index and Index of Democratization.

1. Introduction: A theoretical Approach, Argument and Hypotheses

The objective of this paper is to obtain a multiple linear regression model for Life Expectancy , based on the predictors found in the cs2003 comprehensive.sav SPSS dataset. For the purpose of the analysis, the following predictors will be used in a multiple regression model for predicting life expectancy: Human Development Index , Unemployment (% total labor force) , Democratization , Hospital Beds per 1000 people , Health Expenditure (% GDP) and Urban Population (% of total) .

All of these variables are expected to reasonably affect the average life expectancy of a country, and for this reason they are going to be included in the model, or at least an attempt will be made to include them. Then, by a process of model building, the best model containing the above-mentioned variables will be constructed, using the following principles: parsimony, maximum explained variance, and smallest standard error. For the purpose of testing the validity of our model, four cases will be held out for testing purposes. The holdout countries will be Marshall Islands, Palau, Micronesia and Samoa.

2. Descriptions of Data, Indicators and Slippage

For the purpose of the analysis, the SPSS file cs2003 comprehensive.sav will be used. This file contains 235 variables and 212 cases, corresponding to the countries of the world. The variables included in the dataset are demographic and macroeconomic variables that, taken together, give a very good idea of the metrics of any given country.

3. Analysis of Findings

The purpose of this section is to fully describe the results of a regression analysis performed in order to address the research question stated in the previous sections. First, the possible linear correlation between the dependent variable (DV) Life Expectancy and the predictors is assessed.

[Correlation matrix of Life Expectancy and the candidate predictors, not reproduced here.]

As can be observed above, all of the candidate predictors have a significant and positive degree of linear association with the DV.

Now graphically:

[Scatterplots of Life Expectancy against each of the candidate predictors, not reproduced here.]

There is a clear degree of linear association between the DV and the potential predictors, which confirms the results obtained in the correlation matrix.

Now that we know that the predictors have a significant linear association with the response variable, a multiple linear regression analysis is performed:

[Multiple regression output for the full model, not reproduced here.]

It is observed that the model is significant overall, F(6, 29) = 21.821, p < .001. The model seems to have good predictive value, since 78.1% of the variation in Life Expectancy is explained by it. There are no problems with multicollinearity, since all the VIFs are lower than 5. But we also observe that not all predictors are individually significant. In order to drop the redundant predictors, a stepwise regression will be performed.

[Stepwise regression output, not reproduced here.]

Observe that only two variables enter the final model: Human Development Index and Index of Democratization. This model explains 78% of the variation in Life Expectancy. The model is:

Life Expectancy = 33.928 + 51.842*Human Development Index -0.128*Index of Democratization
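The original analysis was run in SPSS. As a rough illustration only, a comparable selection procedure could be sketched in R as below, on simulated (hypothetical) data and using step(), which selects by AIC rather than by the p-value criteria SPSS uses:

set.seed(10)
n    <- 30
hdi  <- runif(n, 0.4, 0.95)
dem  <- runif(n, 0, 30)
unem <- runif(n, 2, 20)                                # an irrelevant candidate predictor
life <- 34 + 52 * hdi - 0.13 * dem + rnorm(n, sd = 2)  # simulated life expectancy
dat  <- data.frame(life, hdi, dem, unem)

full <- lm(life ~ hdi + dem + unem, data = dat)
step(full, direction = "both")    # stepwise selection; irrelevant predictors are dropped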

The following residual plots are obtained:

[Histogram of the regression residuals, not reproduced here.]

The histogram of residuals doesn’t seem to show any strong violation from normality.

[Plot of residuals versus predicted values, not reproduced here.]

The plot of residuals versus predicted values above doesn’t show any pattern suggesting any kind of problem with heteroskedasticity. The regression assumptions seem to be met.

4. Conclusions and Policy Implications

First of all, it is important to point out that the dataset exhibited a great many missing values, which is something that could be worrisome for the validity of the conclusions of this analysis. In fact, out of 212 cases, only 30 turned out to be valid for performing the regression analysis. It was found that only two variables entered the final model: Human Development Index and Index of Democratization. This model, reported above, explains 78% of the variation in Life Expectancy.

Hence, an increase of 0.01 in the Human Development Index brings an average increase of 0.51842 years in life expectancy, whereas an increase of 1 point in the Index of Democratization decreases life expectancy by an average of 0.128 years. Overall, the model found seems to be reliable, with a high percentage of explained variation (78%), and apparently the regression assumptions are met. One possible flaw is that the number of valid cases for the regression analysis was quite low, which could eventually affect the validity of the results.


Checking the validity of the model using the holdout data:

The dataset contains a lot of missing values, so the countries originally considered as holdout countries don't have the required variables to produce an estimate of life expectancy. Hence, we choose three countries with complete data on the variables required by the regression model:

Actual Life Expectancy    HDI      Index of Democratization    Predicted    Error     Absolute Error    Percent Error
79.1463                   0.933    27.4                        78.7894      0.3569    0.356914          0.45%
68.8402                   0.659    16.8                        65.9415      2.8987    2.898722          4.21%
62.4597                   0.594    23.8                        61.6757      0.7840    0.783952          1.26%

MAPE = 1.97%

The mean absolute percentage error (MAPE) is 1.97%, which suggests that the model predicts the holdout cases well.
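The predictions and the MAPE in the table can be reproduced directly from the reported model; a minimal R sketch using the values above:

predict_le <- function(hdi, dem) 33.928 + 51.842 * hdi - 0.128 * dem   # reported model

actual <- c(79.1463, 68.8402, 62.4597)   # actual life expectancies of the holdout countries
hdi    <- c(0.933, 0.659, 0.594)
dem    <- c(27.4, 16.8, 23.8)

pred <- predict_le(hdi, dem)             # about 78.79, 65.94, 61.68
mean(abs(actual - pred) / actual) * 100  # MAPE, about 1.97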


Regression Tutorial with Analysis Examples

By Jim Frost

[Figure: fitted line plot showing the curved relationship between BMI and body fat percentage]

If you’re learning regression analysis, you might want to bookmark this tutorial!

When to Use Regression and the Signs of a High-Quality Analysis

Before we get to the regression tutorials, I’ll cover several overarching issues.

Why use regression at all? What are common problems that trip up analysts? And, how do you differentiate a high-quality regression analysis from a less rigorous study? Read these posts to find out:

  • When Should I Use Regression Analysis? : Learn what regression can do for you and when you should use it.
  • Five Regression Tips for a Better Analysis : These tips help ensure that you perform a top-quality regression analysis.

Tutorial: Choosing the Right Type of Regression Analysis

There are many different types of regression analysis. Choosing the right procedure depends on your data and the nature of the relationships, as these posts explain.

  • Choosing the Correct Type of Regression Analysis : Reviews different regression methods by focusing on data types.
  • How to Choose Between Linear and Nonlinear Regression : Determining which one to use by assessing the statistical output.
  • The Difference between Linear and Nonlinear Models : Both kinds of models can fit curves, so what’s the difference?

Tutorial: Specifying the Regression Model

[Figure: linear model with a cubic term]

Model specification is an iterative process. The interpretation and assumption confirmation sections of this tutorial explain how to assess your model and how to change the model based on the statistical output and graphs.

  • Model Specification: Choosing the Correct Regression Model : I review standard statistical approaches, difficulties you may face, and offer some real-world advice.
  • Using Data Mining to Select Your Regression Model Can Create Problems : This approach to choosing a model can produce misleading results. Learn how to detect and avoid this problem.
  • Guide to Stepwise Regression and Best Subsets Regression : Two common tools for identifying candidate variables during the investigative stages of model building.
  • Overfitting Regression Models : Overly complicated models can produce misleading R-squared values, regression coefficients, and p-values. Learn how to detect and avoid this problem.
  • Curve Fitting Using Linear and Nonlinear Regression : When your data don’t follow a straight line, the model must fit the curvature. This post covers various methods for fitting curves.
  • Understanding Interaction Effects : When the effect of one variable depends on the value of another variable, you need to include an interaction effect in your model; otherwise, the results will be misleading.
  • When Do You Need to Standardize the Variables? : In specific situations, standardizing the independent variables can uncover statistically significant results.
  • Confounding Variables and Omitted Variable Bias : The variables that you leave out of the model can bias the variables that you include.
  • Proxy Variables: The Good Twin of Confounding Variables : Find ways to incorporate valuable information in your models and avoid confounders.

Tutorial: Interpreting Regression Results

After choosing the type of regression and specifying the model, you need to interpret the results. The next set of posts explain how to interpret the results for various regression analysis statistics:

  • Coefficients and p-values
  • Constant (Y-intercept)
  • Comparing regression slopes and constants with hypothesis tests
  • How high does R-squared need to be?
  • Interpreting a model with a low R-squared
  • Adjusted R-squared and Predicted R-squared
  • Standard error of the regression (S) vs. R-squared
  • Five Reasons Your R-squared can be Too High : A high R-squared can occasionally signify a problem with your model.
  • F-test of overall significance
  • Identifying the Most Important Independent Variables : After settling on a model, analysts frequently ask, “Which variable is most important?”

Tutorial: Using Regression to Make Predictions

Analysts often use regression analysis to make predictions. In this section of the regression tutorial, learn how to make predictions and assess their precision.

  • Making Predictions with Regression Analysis : This guide uses BMI to predict body fat percentage.
  • Predicted R-squared : This statistic evaluates how well a model predicts the dependent variable for new observations.
  • Understand Prediction Precision to Avoid Costly Mistakes : Research shows that presentation affects the number of interpretation mistakes. Covers prediction intervals.
  • Prediction intervals versus other intervals : Prediction intervals indicate the precision of the predictions. I compare prediction intervals to different types of intervals.

Tutorial: Checking Regression Assumptions and Fixing Problems


  • The Seven Classical Assumptions of OLS Linear Regression
  • Residual plots : Shows what the graphs should look like and why they might not!
  • Heteroscedasticity : The residuals should have a constant scatter (homoscedasticity). Shows how to detect this problem and various methods of fixing it.
  • Multicollinearity : Highly correlated independent variables can be problematic, but not always! Explains how to identify this problem and several ways of resolving it.

Examples of Different Types of Regression Analyses

The last part of the regression tutorial contains regression analysis examples. Some of the examples are included in previous tutorial sections. Most of these regression examples include the datasets so you can try it yourself! Also, try using Excel to perform regression analysis with a step-by-step example!

  • Linear regression with a double-log transformation : Models the relationship between mammal mass and metabolic rate using a fitted line plot.
  • Understanding Historians’ Rankings of U.S. Presidents using Regression Models : Models rankings of U.S. Presidents to various predictors.
  • Modeling the relationship between BMI and Body Fat Percentage with linear regression .
  • Curve fitting with linear and nonlinear regression .

If you’re learning regression and like the approach I use in my blog, check out my eBook!

[Image: cover of the ebook Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models]


Reader Interactions


August 11, 2024 at 9:26 am

Hi Jim, I am looking for some scientific papers on validating goodness-of-fit parameters for atmospheric conditions using a polynomial model. Even a benchmark would be helpful to determine the range of the parameters’ threshold values.


April 10, 2024 at 5:29 pm

Hello Jim, Finding your site super-helpful. Wondering if you can provide some examples to simply illustrate multiple regression output (from spss). I would like to illustrate the overall effects of the independent variables on the dependent without creating a histogram. Ideally something that shows the strength, direction and significance in a box plot, line graph, bubble chart or other smart graphic. So appreciate your guidance.


January 2, 2023 at 10:02 am

Hi Jim, I just bought all 3 of your books via your Website Store that took me to Amazon – $66.72 plus tax. I don’t see how to get the PDF versions though without an additional cost to buy them additionally. Can you help me with how to get access to the PDF’s? Also, I am reviewing these to see if I want to add them to the courses I teach in Data Analytics. Do you have academic pricing available for students? Both hardcopy and e-copy?


January 3, 2023 at 12:17 am

Hi Anthony,

Look for an email from me.

Edited to add: I just sent an email to the address you used for the comment, but it bounced back saying it was “Blocked.” Please provide a method for me to contact you. You can use the contact form on my website and provide a different email address. Thanks!


July 18, 2022 at 7:15 am

Dear Dr, how are you? Thank you very much for your contribution. I have one question, which might not be related to this post. In my cross-tabulation between two categorical variables, one cell has just 8 observations for the “no” and 33 observations for the “yes” of the second variable. Can I continue with this for the descriptive statistics, or should I collapse the categories to increase the sample size? Do I then use the new variable with fewer categories in my regression analysis? Your help is much appreciated.


May 5, 2021 at 1:46 pm

Thanks for teaching us about Stats intuitively.

Is your book Regression Analysis available in PDF format? I’m a student learning Stats and would like it only in PDF format (no Kindle)

May 5, 2021 at 1:49 pm

Yes, if you buy it through My Website Store , you’ll get it in PDF format.


February 22, 2021 at 4:08 pm

Thank you for your valuable advice. The change is in the way I am putting data into the software. When I put in averaged data, the GLM output shows ingestion has no significant effect on mortality. When I input data with replications, the GLM output shows a significant effect of ingestion on mortality. My standard deviations are large, but the data show homoscedasticity and a normal distribution.

Your comments will really be helpful in this regard.

February 22, 2021 at 4:11 pm

If you have replications, I’d enter that data to maintain the separate data points and NOT use the average. That provides the model with more information!

February 22, 2021 at 6:39 am

I have a question about generalized linear models. I am getting different outputs for the same response variable when I apply a GLM using 1) data with replications and 2) averaged data. Mortality is my response variable and number of particles ingested is my predictor variable; the other two predictors are categorical.

Looking forward to your advice.

February 22, 2021 at 3:44 pm

I’m not sure what you’re changing in your analysis to get the different outputs?

Replications are good because they help the model estimate pure error.

Average data is okay too but just be aware of incorporating that into the interpretations.


September 7, 2020 at 1:02 pm

I know that we can use linear or nonlinear models to fit a line to a dataset with curvature. My question is: when we have many independent variables, how can we tell whether there is curvature?

Do you think we should start with simple linear regression, then fit polynomial, reciprocal, log, and nonlinear regressions and compare the results for all of them to find which model works best?

Thanks a lot for your very good and easy to understand book.


August 6, 2020 at 8:49 pm

I wonder if you have any recommended articles as to how to interpret the actual p-values and confidence interval for multiple regressions? I am struggling to find examples/templates of reporting these results.

I truly appreciate your help.

August 6, 2020 at 10:29 pm

I’ve written a post about interpreting p-value for regression coefficients , which I think would be helpful.

For confidence intervals of regression coefficients, think about sample means and CIs for means as a starting point. You can use the mean as the sample estimate of the population mean. However, because we’re working with a sample, we know there is a margin of error around that estimate. The CI captures that margin of error. If you have a CI for a sample mean, you know that the true population parameter is likely to be in that range.

In the regression context, the coefficient is also a mean. It’s a mean effect or the mean change in the dependent variable given a one-unit change in the independent variable. However, because we’re working with a sample, we know there is a margin of error around that mean effect. Consequently, with a CI for a regression coefficient, we know that the true mean effect of that coefficient is likely to fall within that coefficient CI.
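In R, for example, those coefficient confidence intervals come straight from confint(); the data frame and variable names below are hypothetical:

    fit <- lm(outcome ~ predictor1 + predictor2, data = mydata)
    summary(fit)                  # coefficients, standard errors, p-values
    confint(fit, level = 0.95)    # 95% CI for each mean effect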

I hope that helps!


June 28, 2020 at 2:57 am

Hi Jim, first of all, thanks for all your great work. I’m setting up a linear regression analysis in which the standardized coefficients are considered, but my dependent variable is energy use intensity, so a lower value is better than a higher value. Correct me if I’m wrong, but I think SPSS treats a higher value as better and a lower one as worse, so in my case this could reverse the interpretation of the standardized coefficient (beta). Is that right? And what is your suggestion?


June 23, 2020 at 3:51 am

Hello Jim, I want help with an econometric model or equation that can be used when there is one independent variable (dam) and a dependent variable (5 livelihood outcomes). I am confused about whether I can use a binary regression model, treating the 5 outcomes as indicators of the dependent variable (livelihood outcomes), or whether I have to consider the 5 livelihood outcomes as 5 dependent variables and use multivariate regression. Please reply as soon as possible. Thank you so much.

June 28, 2020 at 12:31 am

It really depends on the nature of the variables. I don’t know what you’re assessing, but here are two possibilities.

You label the independent variable, which I’m assuming is continuous but I don’t know for sure, and the 5 indicator/binary outcomes. This is appropriate if you think the IV affects, or at least predicts, those five indicators. Use this approach if the goal of your analysis is to use the IV to predict the probability of those binary outcomes. Use binary logistic regression. You’ll need to run five different models. In each model, one of the binary outcomes/indicators is your DV and you’d use the same IV for each model. This type of model allows you to use the value of the IV to predict the probability of the binary outcome.

However, if you instead want to use the binary indicators to predict the continuous variable, you’d need to use multiple regression. The continuous variable is your DV and the five indicators are your IVs. This type of model allows you to use the values of the five indicators to predict the mean value of the continuous variable.

Which approach you take is a mix of theory and what your study needs to learn.
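As a minimal R sketch of the two approaches (the data frame dat and all variable names are hypothetical; the five indicators are assumed to be coded 0/1):

    # Approach 1: one binary logistic model per livelihood indicator
    indicators <- c("ind1", "ind2", "ind3", "ind4", "ind5")
    logit_fits <- lapply(indicators, function(y) {
      glm(as.formula(paste(y, "~ dam")), data = dat, family = binomial)
    })

    # Approach 2: the five indicators together predict a continuous outcome
    ols_fit <- lm(outcome ~ ind1 + ind2 + ind3 + ind4 + ind5, data = dat)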


June 3, 2020 at 5:17 pm

Is it normal that the signs in the “Regression equation in uncoded units” are sometimes different from the signs in the “Coded coefficients” table? In my regression results, for some terms, the sign of a coefficient is negative in the “Coded coefficients” table but positive in the regression equation. I am a little confused here. I thought the signs should be the same.

Thanks, Behnaz

June 3, 2020 at 8:07 pm

There is nothing unusual about the coded and uncoded coefficients having different signs. Suppose a coded coefficient has a negative sign but the uncoded coefficient has a positive sign. Your software uses one of several processes that translate the raw data (uncoded) into coded values that help the model estimation process. Sometimes that conversion process causes data values to switch signs.


October 17, 2019 at 5:31 am

Hello Jim, I am looking to do an R-squared line for a multiple regression series. I’m not so confident that the 3rd, 4th, or 5th number in the correlations will help make a better line. I’m basically looking at data to predict stock prices (getting a better R2). So, for example, Enterprise Value/Sales to growth rate has a high R2 of about .48, but we know for sure that Free cash flow/revenue to percent down from 52-week high is about .299.

I have no clue how to get this to work in a 3D chart or to make a formula and find the new R2. Any help would be great.

I don’t have Excel, and I’m not a programmer; I just have some Google Sheets experience.

October 17, 2019 at 3:42 pm

Hi Jonathan,

I’m not 100% sure what you mean by an R-squared line? Or, by the 3rd, 4th, 5th, number in the correlations? Are you fitting several models where each one has just one independent variable?

It sounds to me like you’ll need to learn more about multiple regression. Fortunately, I’ve written an ebook about it that will take you from a novice to being able to perform multiple regression effectively. Learn about my intuitive guide to regression ebook.

It also sounds like you’ll need to obtain some statistical software! I’m not sure what statistics if any you can perform in Google Sheets.


July 18, 2019 at 9:44 am

Forgive me if these questions have obvious answers, but I could not find the answers yet. Still reading and learning. Is 3 the minimum number of samples needed to calculate a regression? Why? I’m guessing the equations used require at least 3 sets of X, Y data to calculate a regression, but I do not see a good explanation of why. I’m not asking about how many sets make the strongest fit. And with only two sets we would get a straight line and no chance of curvature……

I am working on a stability analysis report. For some lots we only have two time points, zero and three months. The software will not calculate the regression. Obviously, it needs three time points…..but why? For example: the standard error cannot be calculated with only two results and therefore the rest of the equations will not work…or maybe it is related to degrees of freedom? (In the meantime what I will do is run through the equations by hand. The problem is I’m relying so heavily on the software, in other words being lazy. At least I’m questioning, though. I’ve been told not to worry about it and just submit the report with “regression requires three data points”.)
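The degrees-of-freedom hunch in this comment is the key point: simple linear regression estimates two parameters (intercept and slope), so with only two points the residual degrees of freedom are n − 2 = 0 and the error variance, and therefore every standard error and p-value, cannot be computed. A minimal R sketch with made-up numbers shows the difference a third point makes:

    # two time points: the line fits perfectly, residual df = 0, SEs are NaN
    fit2 <- lm(y ~ x, data = data.frame(x = c(0, 3), y = c(10.2, 9.7)))
    summary(fit2)

    # three time points: 1 residual df, so sigma and standard errors exist
    fit3 <- lm(y ~ x, data = data.frame(x = c(0, 3, 6), y = c(10.2, 9.7, 9.1)))
    summary(fit3)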


July 12, 2019 at 5:34 pm

Hello, please, how can I construct a model on carbon pricing? Thanks in anticipation of a timely response.

July 15, 2019 at 11:12 am

The first step is to do a lot of research to see what others have done. That’ll get you started in the right direction. It’ll help you identify the data you’ll need to collect, variables to include in the model, and the type and form of model that is likely to fit your data. I’ve also written a post about choosing the correct regression model that you should read. That post describes the model fitting process and how to determine which model is the best.

Best of luck with your analysis!


June 25, 2019 at 11:47 am

Hi Jim: I recently read your book Regression Analysis and found it very helpful. It covered a lot of material but I continue to have some questions about basic workflow when conducting regression analysis in social science research. For someone who wants to create an explanatory multiple regression model(s) as part of an observational study in anthropology, what are the basic chronological steps one should follow to analyze the data (eg: choose model type based on type of data collected; create scatterplots between Y and X’s; calculate correlation coefficients; specify model . . .)? I am looking for the basic steps to follow in the order that they should be completed. Once a researcher has identified a research question and collected and stored data in a dataset, what should the step-by-step work flow look like for a regression / model building analysis? Having a basic chronology of steps will help me better organize (and use) the material in your book. Thanks!

June 25, 2019 at 10:03 pm

First, thanks so much for buying my ebook. I’m so happy to hear that it was helpful. You ask a great question. And, in my next book I tackle the actual process of performing statistical studies that use the scientific method. For now, I can point you towards a blog post that covers this topic: Five Steps for Conducting Studies with Statistical Analyses

And, because you’re talking about an observation study, I recommend my post about observational studies . It talks about how they’re helpful, what to watch out for, and some tips. Also be sure to read about confounding variables in regression analysis, which starts on page 158 in the book.

Additionally, starting on p. 150 in the ebook, I talk about how to determine which variables to include in the model.

Throughout all of those posts and the ebook, you’ll notice a common theme: that you need to do a lot of advance research to figure out what you need to measure and how to measure it. It’s also important to ensure that you don’t accidentally fail to measure a variable and have omitted variable bias affect your results. That’s where all the literature research will be helpful.

Now, in terms of analyzing the data, it’s hard to come up with one general approach. Hopefully, the literature review will tell you what has worked and hasn’t worked for similar studies. For example, maybe you’ll see that similar studies use OLS but need to use a log transformation. It also strongly depends on the nature of your data. The type of dependent variable(s) plays a huge role in what type of model you should use. See page 315 for more about that. It’s really a mix of what type of data you have (particularly the DVs) and what has worked/not worked for similar studies.

Sometimes, even armed with all that advanced knowledge, you’ll go to fit the model with what seems to be the best choice, and it just doesn’t fit your data. Then, you need to go back to the drawing board and try something else. It’s definitely an iterative process. But, looking at what similar studies have done and understanding your data can give you a better chance of starting out with the right type of model. And, then use the tips starting on page 150 to see about the actual process of specifying the model, which is also an iterative process. You might well start out with the correct type of model, but have to go through several iterations to settle on the best form of it.

  • See what other studies have done
  • Understand your own data.
  • Use information from step 1 and 2 to settle on a good type of model to start with and what variables to include in it.
  • Try to obtain a good fit using that type of model. This step is an iterative process of fitting models, assessing the fit and significance, and possibly making adjustments.
  • If you can obtain a good fit in step 4, you’re done after settling on the best form.
  • If you cannot obtain a good fit in step 4, do more research to find another type of model you can try and go back to step 3.

Best of luck with your study!


June 5, 2019 at 11:50 am

Thank you! Much appreciated!!

June 5, 2019 at 12:24 pm

You’re very welcome, Svend! Because your study uses regression, you might consider buying my ebook about regression . I cover a lot more in it than I do on the blog.

June 4, 2019 at 6:17 am

Hi Jim! Did you notice my question from 28. May…?? Svend

June 4, 2019 at 11:06 am

Hi Svend, Sorry about the delay in replying. Sometimes life gets busy! I will reply to your previous comment right now.


June 2, 2019 at 5:12 am

Thank you so much for such timely responses! They helped clarify a lot of things for me 🙂

May 28, 2019 at 3:53 am

Thank you for a very informative blog! I have a question regarding “overfitting” of a multivariable regression analysis that I have performed; 368 patients (ACL-reconstructed + concomitant cartilage lesions) with 5-year FU after ACL-reconstruction. The dependent variable was continuous (PROM). I have included 14 independent variables (sex/age/time from surgery etc., all of which were previously shown to be clinically important for the outcome), including two different types of surgery for the concomitant cartilage injury. No surgery to the concomitant lesions was used as reference (n=203), debridement (n=70), and microfracture (n=95). My main objective was to investigate the effect on PROMs of those 2 treatments. My initial understanding was that it was OK to include that many independent variables as long as there were 368 patients included/PROMs at FU. But I have had comments that as long as the number of patients for some of the independent variables, e.g. (debridement and microfracture), is lower than the model as a whole, the number of independent variables should be based on the variable with the fewest observations…? I guess my question is: does the lowest number of observations for an independent variable dictate the size of the model/how many predictors you can use..? -And also the power..? Thanks!

June 4, 2019 at 11:23 am

I’m not sure if you’ve read my post about overfitting . If you haven’t, you should read it. It’ll answer some of your questions.

For your specific case, in general, yes, I think you have enough observations. In my blog post, I’m talking mainly about continuous variables. However, if I’m understanding correctly, you’re referring to a categorical variable for reference/debridement? If so, the rules are a bit different but I still think you’re good.

Regression and ANOVA are really the same analysis. So, you can think of your analysis as an ANOVA where you’re comparing groups in your data. And, it’s true that groups with smaller numbers will produce less precise estimates than groups with larger numbers. And, you generally require more observations for categorical variables than you do for continuous variables. However, it appears that your smallest group has an n=70, and that’s a very good sample size. In ANOVA, having more than 15-20 observations per group is usually good from an assumptions point of view (it might not produce sufficient statistical power depending on the effect size). So, you’re way over that. If some of your groups had very few observations, you might have needed to worry about the estimates for that variable–but that’s not the case.

And, given your number of observations (368) and number of model terms requiring estimates overall (14), I don’t see any obvious reason to worry about overfitting on that basis either. Just be sure that you’re counting interaction terms and polynomials in the number of model terms. Additionally, a categorical variable can use more degrees of freedom than a single continuous variable.

In short, I don’t see any reason for concern about overfitting given what you have written. Power depends on the effect size, which I don’t know. However, based on the number of observations/terms in model, I again don’t see an obvious problem.

I hope this helps! Best of luck with your analysis!

May 26, 2019 at 5:25 am

Also, another query. I want to run a multiple regression, but my demographics and one of my IVs weren’t significant in the initial correlation I ran. What variables should I put in my regression test now? Should I skip all those that weren’t significant? Or just the demographics? I have read that if you have literature backing up the relationship, you can run a regression analysis regardless of how it appeared in your preliminary analysis. How true is that? What would be the best approach in this case? It would mean a lot if you help me out on this one.

May 27, 2019 at 10:25 pm

Hi again Aisha,

Two different answers for you. One, be wary of the correlation results. The problem is, again, the potential for confounding variables. Correlation doesn’t factor in other variables. Confounding variables can mess up the correlation results just like they can bias a regression model, as I explained in my other comment. You have reason to believe that some of your demographic variables won’t be significant until you add your main IVs. So, you should try that to see what happens. Read the post about confounding variables and keep that in mind as you work through this!

And, yes, if you have strong theory or evidence from other studies for including IVs in the model, it’s ok to include them in your model even if it’s not significant. Just explain that in the write up.

For more about that, and model building in general, read my post about specifying the correct model !

May 25, 2019 at 8:13 am

Hi! I can’t believe I didn’t find this blog earlier; it would have saved me a lot of trouble for my research 😀 Anyway, I have a question. Is it possible for your demographic variables to become significant predictors in the final model of a hierarchical regression? I can’t seem to understand why it is the case with mine when they came out to be nonsignificant in the first model (even in the correlation test when tested earlier) but became significant when I put them with the rest of my (main) IVs. Are there practical reasons for that, or is it poor statistical skills? :-/

May 27, 2019 at 10:19 pm

Thanks for writing with a fantastic question. It really touches on a number of different issues.

Statistics is a funny field. There’s the field of statistics, but then many scientists/researchers in different fields use statistics within their own fields. And, I’ve observed in different fields that there are different terminology and practices for statistical procedures. Often I’ll hear a term for a statistical procedure and at first I won’t know what it is. But, then the person will describe it to me and I’ll know it by another name.

At one point, hierarchical regression was like this for me. I’ve never used it myself, but it appears to be common in social sciences research. The idea is that you add variables to the model in several groups, such as the demographic variables in one group, and then some other variables in the next group. There’s usually a logic behind the grouping. The idea is to see how much the model improves with the addition of each group.

I have some issues with this practice, and I think your case illustrates them. The idea behind this method is that each model in the process isn’t as good as the subsequent model, but it’s still a valid comparison. Unfortunately, if you look at a model knowing that you’re leaving out significant predictors, there’s a chance that the model with fewer IVs is biased. This problem occurs more frequently with observational studies, which I believe are more common in the social sciences. It’s the problem of confounding variables. And, what you describe is consistent with there being confounding variables that are not in the model with demographic variables until you add the main IVs. For more details, read my post about how confounding variables that are not in the model can bias your results .

Chances are that some of your main IVs are correlated with one or more demographic variables and the DV. That condition will bias coefficients in your demographic IV model because that model excludes the confounding variables.

So, that’s the likely practical reason for what you’re observing. Not poor statistical skills! And, I’m not a fan of hierarchical regression for that reason. Perhaps there’s value to it that I’m not understanding. I’ve never used it in practice. But there doesn’t seem to be much to gain by assessing that first (in your case) demographic IV model when it appears to be excluding confounding variables and is, consequently, biased!
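A small simulated example in R (all names and numbers invented purely for illustration) sketches this kind of bias: the demographic variable looks unimportant on its own but shows its effect once the correlated main IV enters the model:

    set.seed(1)
    n    <- 500
    demo <- rnorm(n)                      # demographic variable
    main <- 0.8 * demo + rnorm(n)         # main IV, correlated with demo
    y    <- 1.0 * main - 0.8 * demo + rnorm(n)

    coef(summary(lm(y ~ demo)))           # demo looks ~ null (main IV omitted)
    coef(summary(lm(y ~ demo + main)))    # demo's effect now appears clearly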

However, I know that methodology is common in some fields, so it’s probably best to roll with it! 🙂 But, that’s what I think is happening.


May 19, 2019 at 6:30 am

Hello Jim, I need your help please. I have this question: can you perform a multiple regression with two independent variables when one of them is constant? For example, I have this data:

Angle (Theta)   Length ratio (%)   Force (kN)
0               1                  52.1
0.174444444     1                  52.9
0.261666667     1                  53.3
0.348888889     1                  55.5
0.436111111     1                  58.1

May 20, 2019 at 2:42 pm

Hi Ibrahim,

Thanks for writing with the good question!

The heart of regression analysis is determining how changes in an independent variable correlate with changes in the dependent variable. However, if an independent variable does not change (i.e., it is constant), there is no way for the analysis to determine how changes in it correlate with changes in the DV. It’s just not possible. So, to answer your question, you can’t perform regression with a constant variable.
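A quick R illustration of this point, using (rounded) values from the data posted above; lm() simply returns NA for the coefficient of a predictor that never varies:

    dat <- data.frame(
      force = c(52.1, 52.9, 53.3, 55.5, 58.1),
      angle = c(0, 0.174, 0.262, 0.349, 0.436),
      ratio = rep(1, 5)                   # constant in this sample
    )
    coef(lm(force ~ angle + ratio, data = dat))
    # the 'ratio' coefficient is NA: a constant predictor carries no information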

I hope this helps!


February 27, 2019 at 6:13 pm

Thank you very much for this awesome site!


February 27, 2019 at 11:01 am

Hello sir, I need to learn about regression and ANOVA. Could you help me please?

February 27, 2019 at 11:46 am

You’re in the right spot! Read through my blog posts and you’ll learn about these topics. Additionally, within a couple of weeks, I’ll be releasing an ebook that’s all about learning regression!


February 20, 2019 at 12:09 pm

Very nice tutorial. I’m reading them all! Are there any articles explaining how the regression model gets trained? Something about gradient descent?


February 11, 2019 at 11:55 am

Thanks alot for your precious time sir

February 11, 2019 at 11:58 am

You’re very welcome! 🙂

February 10, 2019 at 5:05 am

Hey sir, hope you are well. This is a really wonderful platform for learning regression. Sir, I have a problem: I’m using cross-sectional data and my dependent variable is continuous. It’s basically MICS data and I’m using OLS, but the problem is that there are some missing observations in some variables, so the sample size is not equal across all the variables. Does it still make sense to use OLS?

February 11, 2019 at 11:40 am

In the normal course of events, yes, when an observation has a missing value in one of the variables, OLS will exclude the entire observation when it fits the model. If observations with missing values are a small portion of your dataset, it’s probably not a problem. You do have to be aware of whether certain types of respondents are more likely to have missing values because that can skew your results. You want the missing values to occur randomly through the observations rather than systematically occurring more frequently in particular types of observations. But, again, if the vast majority of your observations don’t have missing values, OLS can still be a good choice.

Assuming that OLS makes sense for your data, one difficulty with missing values is that there really is no alternative analysis that you can use to handle them. If OLS is appropriate for your data, you’re pretty much stuck with it even if you have problematic missing values. However, there are methods of estimating the missing values so you can use those observations. This process is particularly helpful if the missing values don’t occur randomly (as I describe above). I don’t know which software you are using, but SPSS has a particularly good method for imputing missing values. If you think missing values are a problem for your dataset, you should investigate ways to estimate those missing values, and then use OLS.
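A minimal R sketch of the workflow described here (the data frame survey and the variable names are hypothetical):

    colSums(is.na(survey))        # how many missing values per variable
    fit <- lm(outcome ~ x1 + x2 + x3, data = survey)  # rows with any NA are dropped
    nobs(fit)                     # number of observations actually used
    # if missingness looks systematic, consider multiple imputation
    # (e.g., the mice package) before refitting with OLS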


January 20, 2019 at 10:33 am

Hi Jim, I was quite excited to see you post this, but then there was no following article, only related subjects.

Binary logistic regression

By Jim Frost

Binary logistic regression models the relationship between a set of predictors and a binary response variable. A binary response has only two possible values, such as win and lose. Use a binary regression model to understand how changes in the predictor values are associated with changes in the probability of an event occurring.

Is the lesson on binary logistic regression to follow, or what am I missing?

Thank you for your time.

Antonio Padua

January 20, 2019 at 1:20 pm

Hi Antonio,

That’s a glossary term. On my blog, glossary terms have a special link. If you hover the pointer over the link, you’ll see a tooltip that displays the glossary term. Or, if you click the link, you go to the glossary term itself. You can also find all the glossary terms by clicking Glossary in the menu across the top of the screen. It seems like you probably clicked the link to get to the glossary term for binary logistic regression.

I’ve had several requests for articles about this topic. So, I’m putting it on my to-do list! Although, it probably won’t be for a number of months. In the mean time, you can read my post where I show an example of binary logistic regression .

Thanks for writing!


November 2, 2018 at 1:24 pm

Thanks so much, your blog is really helpful! I was wondering whether you have some suggestions on published articles that use OLS (nothing fancy, just very plain OLS) and that could be used in class for learning interpreting regression outputs. I’d love to use “real” work and make students see that what they learn is relevant in academia. I mostly find work that is too complicated for someone just starting to learn regression techniques, so any advice would be appreciated!

Thanks, Hanna


October 25, 2018 at 7:52 pm

Hi Jim. Did you write on instrumental variables and the 2SLS method? I am interested in them. Thanks for all the excellent things you have done on this site.

October 25, 2018 at 10:29 pm

I haven’t yet, but those might be good topics for the future!


October 23, 2018 at 2:33 pm

Jim. Thank you so much. Especially for such a prompt response! The slopes are coming from IT segment stock valuations over 150 years. The slopes are derived from valuation troughs and peaks. So it is a graph like you’d see for the S&P. Sorry I was not clear on this.

October 23, 2018 at 12:14 pm

Jim, could you recommend a model based on the following:

1. I can see a strong visual correlation between the left side trough and peak and the right side. When the left has a steep vector, so does the right, for example.

2. This does not need to be the case; the left could have a much steeper slope than the right, or a much shallower one.

3. The parallels intrigue me and I would like to measure if the left slope can be explained by the right to any degree.

4. I am measuring the rise and fall of industry valuations over time. (it is the rise and fall in these valuations over time that create these ~ parallel slopes.

5. My data set since 1886 only provides 6 events, but they are consistent as described.

6. I attempted to correlate the rising slope against the declining one.

October 23, 2018 at 2:04 pm

I’m having a hard time figuring out what you’re describing. I’m not sure what slopes you’re referring to, and I don’t know what you mean by the left versus right slopes.

If you only have 6 data points, you’ll only be able to fit an extremely simple model. You’ll usually need at least 10 data points (absolute minimum but probably more) to even include one independent variable.

If you have two slopes for something and you want to see if one slope explains the other, you could try using linear regression. Use one slope as an independent variable and another as a dependent variable. Slopes would be a continuous variable and so that might work. The underlying data for each slope would have to be independent from data used for other slopes. And, you’ll have to worry about time order effects such as autocorrelation.


October 2, 2018 at 1:37 am

Thank you Jim.

October 2, 2018 at 1:31 am

Hi Jim, I have a doubt regarding which regression analysis is to be conducted. The data set consists of categorical independent variables (ordinal) and one dependent variable which is of continuous type. Moreover, most of the data pertaining to an independent variable is concentrated towards first category (70%). My objective is to capture the factors influencing the dependent variable and its significance. In that case should I consider the ind. variables to be continuous or as categorical? Thanks in advance.

October 2, 2018 at 2:26 am

I think I already answered your question on this. Although, it looks like you’re now saying that you have an ordinal independent variable rather than a categorical variable. Ordinal data can be difficult. I’d still try using linear regression to fit the data.

You have two options that you can try.

1) You can include the ordinal data as continuous data. Doing this assumes that going from 1 to 2 is the same scale change as going from 2 to 3 and so on. Just like with actual continuous data. Although, you can add polynomials and transformations to improve the fit.

2) However, that doesn’t always work. Sometimes ordinal data don’t behave like continuous data. For example, the 2nd place finisher in a race doesn’t necessarily take twice as long as the 1st place finisher. And the difference between 3rd and 2nd isn’t the same as between 1st and 2nd. Etc. In that case, you can include it as a categorical variable. Using this approach, you estimate the mean differences between the different ordinal levels and you don’t have to assume they’ll be the same.

There’s an important caveat about including them as categorical variables. When you include categorical variables, you’re actually using indicator variables. A 5 point Likert scale (ordinal) actually includes 4 indicator variables. If you have many Likert variables, you’re actually including 4 variables for each one. That can be problematic. If you add enough of these variables, it can lead to overfitting . Depending on your software, you might not even see these indicator variables because they code and include them behind the scenes. It’s something to be aware of. If you have many such variables, it’s preferable to include them as continuous variables if possible.

You’ll have to think about whether your data seems more like continuous or categorical data. And, try both methods if you’re not sure. Check the residuals to make sure the model provides a good fit.

Ordinal data can be tricky because they’re not really continuous data nor categorical data–a bit of both! So, you’ll have to experiment and assess how well the different approaches work.
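In R, the two options might look like the sketch below (dat, y, and score are hypothetical names):

    # Option 1: treat the ordinal score as if it were continuous
    fit_num <- lm(y ~ score, data = dat)

    # Option 2: treat it as categorical; each level gets its own mean shift
    fit_cat <- lm(y ~ factor(score), data = dat)

    # compare the fits and check the residuals of whichever you prefer
    AIC(fit_num, fit_cat)
    plot(fit_num, which = 1)      # residuals versus fitted values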

Good luck with your analysis!

October 1, 2018 at 2:32 am

Hello Jim, I have a set of data consisting of a dependent variable which is of continuous type and independent variables which are of categorical type. The interesting thing which I found is that the majority (more than 70%) of the data for an independent variable belongs to category 1. The category values range from 1 to 5. I would like to know the appropriate sampling technique to be used. Is it appropriate to use linear regression or should I use other alternatives? Or is any preprocessing of the data required? Please help me with the above.

Thanks in advance Raju.

October 1, 2018 at 9:40 pm

I’d try linear regression first. You can include that categorical variable as the independent variable with no problem. As always, be sure to check the residual plots. You can also use one-way ANOVA, which would be the more usual choice for this type of analysis. But, linear regression and ANOVA are really the same analysis “under the hood.” So, you can go either way.


September 23, 2018 at 4:28 am

Hello Jim, I’d like to know what your suggestions are with regard to the choice of regression for prediction: the dependent variable is count data but does not follow a Poisson distribution, and the independent variables include categorical and continuous data. I’d appreciate your thoughts on it. Thanks!

September 24, 2018 at 11:08 pm

Hi Sarkhani,

Having count data that don’t follow the Poisson happens fairly often. The top alternatives that I’m aware of are negative binomial regression and zero inflated models. I talk about those options a bit in my post about choosing the correct type of regression analysis . The count data section is near the end. I hope this information points you in the right direction!


August 29, 2018 at 9:38 am

Hi jim i’m really happy to find your blog


August 11, 2018 at 1:42 pm

Independent variables range from 0 to 1 and the corresponding dependent variables range from 1 to 5. If we apply regression analysis to the above and predict the value of y for any value of x that also ranges from 0 to 1, will the value of y always lie in the range 1 to 5?

August 11, 2018 at 4:18 pm

In my experience, the predicted values will fall outside the range of the actual dependent variable. Assuming that you are referring to actual limits at 1 and 5, the regression analysis does not “understand” that those are hard limits. The extent that the predicted values fall outside these limits depends on the amount of error in the model.


August 8, 2018 at 4:18 am

Very Good Explanation about regression ….Thank you sir for such a wonderful post….


March 29, 2018 at 11:43 am

Hi Jim, I would like to see you writing something about Cross Validation (Training and test).


February 20, 2018 at 8:30 am

thank you Jim this is helpful

February 21, 2018 at 4:08 pm

You’re very welcome, Lisa! I’m glad you found it to be helpful!


January 21, 2018 at 10:39 am

Hello Jim, I’d like to know what your suggestions are with regard to the choice of regression for predicting the likelihood of participants falling into one of two categories (low fear group coded 1 and high fear coded 2) when looking at scores from several variables (e.g., external other locus of control, external social locus of control, internal locus of control, social phobia, and sleep quality). It was suggested that I break the question up into smaller components. I’d appreciate your thoughts on it. Thanks!

January 22, 2018 at 2:30 pm

Because you have a binary response (dependent variable), you’ll need to use binary logistic regression. I don’t know what types of predictors you have. If they’re continuous, you can just use them in the model and see how it works.

If they’re ordinal data, such as a Likert scale, you can still try using them as predictors in the model. However, ordinal data are less likely to satisfy all the assumptions. Check the residual plots. If including the ordinal data in the model doesn’t work, you can recode them as indicator variables (1s and 0s only, based on whether an observation meets a criterion or not). For example, if you have a scale of -2, -1, 0, 1, 2, you could recode it so observations with a positive score get a 1 while all other scores get a 0.
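A minimal R sketch of that recoding idea (variable names are hypothetical; the binary DV is assumed to be coded 0/1):

    # recode a -2..2 ordinal item into a 0/1 indicator (positive vs. not)
    dat$item_pos <- ifelse(dat$item > 0, 1, 0)

    # fear_group coded 0 = low fear, 1 = high fear
    fit <- glm(fear_group ~ item_pos + sleep_quality, data = dat, family = binomial)
    summary(fit)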

Those are some ideas to try. Of course, what works best for your case depends on the subject area and types of data that you have.


January 21, 2018 at 5:04 am

I am using stepwise regression to select significant variables in the model for prediction. How should I interpret BIC in variable selection?

regards, Zishan

January 22, 2018 at 5:36 pm

Hi, when comparing candidate models, you look for models with a lower BIC. A lower BIC indicates that a model is more likely to be the true model. BIC identifies the model that is more likely to have generated the observed data.
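In R, for example, BIC-based selection can be sketched as follows (model and variable names are hypothetical):

    full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

    # step() with k = log(n) penalizes by the BIC instead of the default AIC
    best <- step(full, direction = "both", k = log(nrow(dat)))

    # or compare specific candidate models directly; the smaller BIC is preferred
    BIC(lm(y ~ x1 + x2, data = dat), lm(y ~ x1 + x2 + x3, data = dat))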


January 18, 2018 at 2:44 pm

Yes, the language of the topic is very easy. I would appreciate it, sir, if you could let me know: if the rank correlation is r = 0.8 and the sum of D squared = 33, how do we calculate/find the number of observations (n)?

January 18, 2018 at 3:00 pm

I’m not sure what you mean by “D” square, but I believe you’ll need more information for that.


January 6, 2018 at 11:08 pm

Hi, Jim! I’m really happy to find your blog. It’s really helping, especially that you use basic English so non-native speaker can understand it better than reading most textbooks. Thanks!

January 7, 2018 at 12:49 am

Hi Dina, you’re welcome! And, thanks so much for your kind words–you made my day!


December 21, 2017 at 12:30 am

Can you write on Logistic regression please!

December 21, 2017 at 12:45 am

Hi! You bet! I plan to write about it in the near future!


December 16, 2017 at 2:33 am

Great work by a great man; it is an easily accessible source for scholars. Sir, I am going to analyse data; please send me guidelines for selecting the best simple linear / multiple linear regression model. Thanks.

December 17, 2017 at 12:21 am

Hi, thank you so much for your kind words. I really appreciate it! I’ve written a blog post that I think is exactly what you need. It’ll help you choose the best regression model .


December 8, 2017 at 8:47 am

such a splendid compilation, Thanks Jim


December 3, 2017 at 10:00 pm

would you also throw some ideas on Instrumental variable and 2 SLS method please?

December 3, 2017 at 10:40 pm

Those are great ideas! I’ll write about them in future posts.





Open access | Published: 27 September 2024

Implementing multiple imputations for addressing missing data in multireader multicase design studies

Zhemin Pan, Yingyi Qin, Wangyang Bai, Qian He, Xiaoping Yin & Jia He

BMC Medical Research Methodology, volume 24, Article number: 217 (2024)

Abstract

In computer-aided diagnosis (CAD) studies utilizing multireader multicase (MRMC) designs, missing data might occur when there are instances of misinterpretation or oversight by the reader or problems with measurement techniques. Improper handling of these missing data can lead to bias. However, little research has been conducted on addressing the missing data issue within the MRMC framework.

We introduced a novel approach that integrates multiple imputation with MRMC analysis (MI-MRMC). An elaborate simulation study was conducted to compare the efficacy of our proposed approach with that of the traditional complete case analysis strategy within the MRMC design. Furthermore, we applied these approaches to a real MRMC design CAD study on aneurysm detection via head and neck CT angiograms to further validate their practicality.

Compared with traditional complete case analysis, the simulation study demonstrated the MI-MRMC approach provides an almost unbiased estimate of diagnostic capability, alongside satisfactory performance in terms of statistical power and the type I error rate within the MRMC framework, even in small sample scenarios. In the real CAD study, the proposed MI-MRMC method further demonstrated strong performance in terms of both point estimates and confidence intervals compared with traditional complete case analysis.

Within MRMC design settings, the adoption of an MI-MRMC approach in the face of missing data can facilitate the attainment of unbiased and robust estimates of diagnostic capability.


Introduction

The accuracy of imaging diagnostic modalities is shaped by not only the technical specifications of the diagnostic equipment or the algorithms but also the skill set, education, and sensory and cognitive capacities of the interpreting clinicians/readers (e.g., radiologists) [1, 2, 3]. The multireader multicase (MRMC) design, which involves various readers assessing each case, enables the quantification of the impact that reader variability has on the accuracy of imaging diagnostic modalities. As a result, MRMC design studies can enhance the generalizability of study findings and strengthen the overall robustness of the research [4]. MRMC design is currently needed for the clinical evaluation of computer-aided diagnostic (CAD) devices and imaging diagnostic modalities by regulatory agencies, including the Food and Drug Administration in the United States [5] and the National Medical Products Administration in China [6, 7].

For the analysis of MRMC design data, a lack of independence in reader performance is a critical consideration [8]. Traditional statistical methods may not be suitable for this complexity. The Dorfman–Berbaum–Metz (DBM) [9] method and the Obuchowski–Rockette (OR) [8] method are commonly used approaches to address the intricate correlations present in MRMC studies [10]. In DBM analysis, to address the lack of independence in readers’ performance, jackknife pseudovalues are computed for each test-reader combination, and a mixed-effects analysis of variance (ANOVA) is subsequently performed on these pseudovalues to carry out significance testing. For OR analysis, the correlations are addressed by adjusting the F statistic to account for the underlying correlation structures.

As with any study, the challenge of missing data is ubiquitous. Missing data can occur in MRMC design studies when there are instances of misreading or omissions by the reader, substandard specimen collection, issues with measurement techniques, errors during the data collection process, or when results exceed threshold values [4, 11, 12, 13]. Despite this commonality, the majority of MRMC design clinical trials fail to disclose whether they grappled with missing data issues [10]. Consequently, it remains unclear whether the analytical outcomes were derived from complete or incomplete datasets or if a suitable method for handling missing data was employed. This stands in contrast to the Checklist for Artificial Intelligence in Medical Imaging [14] and the Standards for Reporting of Diagnostic Accuracy Studies [15] guidelines, which both explicitly mandate the transparent reporting of missing data and the strategies employed to address them. Within the framework of causal inference, the ambiguity surrounding the status of missing data can introduce uncertainties about the conditions under which results are inferred and may even potentially result in biased estimates [16, 17].

Currently, there is limited research on methods specifically designed for handling missing data in MRMC studies. This might explain why missing data are rarely reported in such studies. For those that do address missing data issues, the complete case analysis method is arguably the most commonly used approach. This method involves discarding any case that contains missing data, including all evaluations of that case by all readers [ 18 , 19 ]. This approach typically requires that the type of missing data be missing completely at random (MCAR); otherwise, the results obtained might be biased. Furthermore, the complete case method can lead to further loss of information due to the reduction in sample size, which might also affect the accuracy of the trial results and decrease the statistical power [ 17 ]. Additionally, from the perspective of causal inference, accuracy estimates derived from complete case analyses represent only the subset of the population with complete records, failing to accurately reflect the estimator of the entire target population [ 20 ]. Hence, missing data handling approaches, especially for MRMC designs, are urgently needed.

In 1976, Donald Rubin [ 21 ] introduced the concept of multiple imputation (MI), which involves imputing each missing value multiple times according to a selected imputation model, analyzing the imputed datasets individually, and combining the results on the basis of Rubin’s rules. Thus, MI is able to reflect the uncertainty associated with the data imputation process by increasing the variability of the imputed data. This approach has gained widespread adoption for managing missing data in various research contexts, including drug clinical trials [ 22 ] and observational studies [ 23 ], and addressing verification bias in diagnostic studies [ 24 , 25 ]. However, the implementation of MI within the MRMC design framework remains relatively unexplored. The successful implementation of MI hinges on the congruence between the imputation model and the analysis model, necessitating that the imputation method captures all variables and characteristics pertinent to the analysis model. This requirement ensures unbiased parameter estimates and correctly calculated standard errors [ 26 , 27 ]. The complexity inherent in MRMC designs, however, poses significant challenges to this congruence.

In light of these gaps, in this study, we aim to establish a missing data handling approach that integrates MI theory with MRMC analysis to maximize the use of available data and minimize biases resulting from the exclusion of cases with missing information. We validate the feasibility and suitability of the proposed approach through both simulation studies and a real CAD study. In doing so, we aim to provide a reliable solution for managing missing data within the MRMC framework and thereby enhance the reliability of diagnostic trial outcomes in real-world clinical settings.

The structure of this paper is as follows: First, the approach to address the issue of missing data within MRMC design studies is presented. The specifics of the simulation study are then detailed, including both the setup and the findings obtained. Subsequently, the proposed approach is implemented in a real MRMC design study that includes instances of missing data. The paper concludes with a discussion of the implications of the work and offers practical recommendations.

Basic settings and notations

In this study, a two-test receiver operating characteristic (ROC) paradigm MRMC study design is assumed. Each reader is tasked with interpreting all cases and assigning a confidence-of-disease score that reflects their assessment of the presence of disease. The true disease status of each case is verified by experienced, independent readers who serve as the gold standard. Instances of missing data may occur during the evaluation phase, leading to the absence of interpretation results. We assume that these missing data arise under the MCAR or missing-at-random (MAR) mechanisms. The term ‘test’ will be used to refer to the imaging system, modality, or image processing throughout this article.

In terms of notation, \(X_{ijk}\) represents the confidence-of-disease score assigned to the \(k\)-th case by reader \(j\) on the basis of the \(i\)-th test. The observed data consist of \(X_{ijk}\), with \(i=1,\dots,I\), \(j=1,\dots,J\), and \(k=1,\dots,K\), where \(I\) is the number of diagnostic tests evaluated (taken to be two here for ease of illustration), \(J\) denotes the number of readers, and \(K\) is the total number of cases examined.
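To make this notation concrete, the following R sketch lays out such data in long format. The object and column names are purely illustrative assumptions, not code from the paper.

```r
# Hedged sketch: a toy long-format MRMC dataset holding the scores X_ijk.
set.seed(1)
I <- 2; J <- 3; K <- 10                        # tests, readers, cases
truth <- rbinom(K, 1, 0.5)                     # true disease status per case
mrmc <- expand.grid(test = 1:I, reader = 1:J, case = 1:K)
mrmc$truth <- truth[mrmc$case]
mrmc$score <- round(runif(nrow(mrmc), 0, 10))  # confidence-of-disease scores
head(mrmc)
```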

Traditional approach: complete case (CC) analysis

Under the complete case (CC) analysis framework, any instance where a single reading record or assessment is missing results in the exclusion of all interpretation results associated with that case. This exclusion applies across all readers and modalities, ensuring that the dataset—referred to as the complete case dataset—comprises only cases with fully observed data.
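Assuming the toy `mrmc` data frame sketched above, with `NA` entries in `score` marking missing readings, the complete case dataset could be constructed roughly as follows.

```r
# Hedged sketch: drop every record belonging to a case that has at least one
# missing reading, across all readers and all tests.
incomplete_cases <- unique(mrmc$case[is.na(mrmc$score)])
mrmc_cc <- subset(mrmc, !(case %in% incomplete_cases))
```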

In this study, DBM [ 9 , 28 , 29 ] analysis was subsequently conducted on the complete case dataset. This analysis method transforms correlated figures of merit (FOM), specifically the area under the ROC curve (AUC), into independent test-reader-case-level jackknife pseudovalues, thereby addressing the complex correlation structure inherent in MRMC data.

The formula for calculating the jackknife pseudovalue is as follows:

\[ Y_{ijk} = K \hat{\theta}_{ij} - (K-1)\, \hat{\theta}_{ij(k)}, \]

where \(Y_{ijk}\) represents the jackknife pseudovalue of the AUC for the \(i\)-th test, \(j\)-th reader, and \(k\)-th case, \(\hat{\theta}_{ij}\) is the AUC estimate derived from all the cases for the \(i\)-th test and \(j\)-th reader, and \(\hat{\theta}_{ij(k)}\) is the corresponding AUC estimate computed with the \(k\)-th case excluded. The jackknife pseudovalue for the \(k\)-th case can thus be viewed as a weighted difference between the accuracy estimates computed with and without that case. When the FOM is the Wilcoxon AUC, the pseudovalues averaged over the case index are identical to the respective FOM estimates.
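As a concrete illustration (not code from the paper), the R sketch below computes the Wilcoxon AUC and the corresponding jackknife pseudovalues for a single test–reader combination, given a score vector and the binary truth status of the cases.

```r
# Hedged sketch: Wilcoxon AUC and jackknife pseudovalues for one test-reader
# combination; `score` and `truth` (1 = diseased, 0 = nondiseased) are assumed inputs.
wilcoxon_auc <- function(score, truth) {
  x <- score[truth == 1]                       # scores of diseased cases
  y <- score[truth == 0]                       # scores of nondiseased cases
  mean(outer(x, y, function(a, b) (a > b) + 0.5 * (a == b)))
}

jackknife_pseudovalues <- function(score, truth) {
  K <- length(score)
  theta_all <- wilcoxon_auc(score, truth)      # AUC from all K cases
  vapply(seq_len(K), function(k) {
    theta_minus_k <- wilcoxon_auc(score[-k], truth[-k])  # AUC without case k
    K * theta_all - (K - 1) * theta_minus_k              # pseudovalue Y_ijk
  }, numeric(1))
}
```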

Using \(Y_{ijk}\) as the response, the DBM method for testing the effect of the imaging diagnostic tests can be specified via a three-factor ANOVA, with the test effect treated as a fixed factor and the reader and case effects treated as random factors to account for the variability among different readers, cases and their interactions:

\[ Y_{ijk} = \mu + \tau_i + R_j + C_k + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + \epsilon_{ijk}, \]

where \(\tau_i\) represents the fixed effect attributable to the \(i\)-th imaging test, and \(R_j\) and \(C_k\) are the random effects associated with the \(j\)-th reader and \(k\)-th case, respectively. Interaction terms, represented by multiple symbols in parentheses, are treated as random effects. The error term \(\epsilon_{ijk}\) captures the residual variability not explained by the model. The DBM approach assumes that the random effects, including the interaction terms, are mutually independent and follow normal distributions with means of zero.

The DBM F statistic for testing the test effect is based on the conventional mixed model and was later modified by Hillis to ensure that the type I error rates remain within acceptable bounds [ 30 ].

Consequently, for the complete case dataset, the estimated effect size (the difference in the FOM across tests) and corresponding statistics are as follows, where the subscript CC denotes metrics calculated for the complete case dataset:

Mean-square quantities calculated based on pseudovalues [ 29 ]:
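The mean-square expressions are standard ANOVA quantities and are not reproduced here. As a rough illustration only, the sketch below fits the three-factor pseudovalue ANOVA with R's `aov` and extracts the mean squares; the data frame `pv` is hypothetical, and the Hillis-style denominator shown is one common form rather than necessarily the exact statistic used by the authors.

```r
# Hedged sketch: pseudovalue ANOVA for the DBM analysis. `pv` is a hypothetical
# data frame with columns test, reader, case and the pseudovalue response Y.
# With one observation per test x reader x case cell, the residual mean square
# plays the role of the test x reader x case term.
pv$test   <- factor(pv$test)
pv$reader <- factor(pv$reader)
pv$case   <- factor(pv$case)
fit <- aov(Y ~ test * reader + test * case + reader * case, data = pv)
tab <- summary(fit)[[1]]
ms  <- setNames(tab[, "Mean Sq"], trimws(rownames(tab)))

# One common Hillis-style F statistic for the test effect (illustration only).
F_T <- ms["test"] / (ms["test:reader"] + max(ms["test:case"] - ms["Residuals"], 0))
```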

Proposed MI-MRMC Approach

Step 1. Imputation

The multiple imputation by chained equations (MICE) algorithm was implemented to impute the missing data [ 31 ]. The MICE algorithm addresses missing data by generating multiple imputations that reflect the posterior predictive distribution \(P(X_{miss} \mid X_{obs})\). This process involves constructing a sequence of prediction models, with the imputation of each variable being conditional on the observed and previously imputed values of the other variables. By iteratively producing multiple imputed datasets (\(M\) datasets in our implementation), the MICE approach encapsulates the uncertainty inherent in the imputation process.

In the construction of the abovementioned predictive models for the MICE algorithm within MRMC studies, the typical scarcity of auxiliary variables poses a methodological challenge. To circumvent this limitation, an imputation model is proposed that leverages the intrinsic correlations among different readers’ interpretations. Since these readers assess identical case sets, their interrelated evaluations provide a solid basis for the imputation model. In addition, given that the interpretation ratings by each reader are typically treated as continuous variables, the predictive mean matching method was incorporated to enhance the imputation process [ 32 ]. Moreover, to accommodate potential variations that may arise when readers evaluate cases across different tests and disease statuses, the model is further calibrated using a subset of the data stratified by modality and disease status.

For diseased cases under test 1, let the variable \(X_j\) represent the interpretation results of reader \(j\) (\(j = 1, 2, \dots, J\)). The observed part of the data from all readers is denoted \(x_{(0)} = \{X_{1(0)}, \dots, X_{j(0)}, \dots, X_{J(0)}\}\), and \(x_{j(1)}\) represents the missing part of \(X_j\). The imputation of missing data proceeds through the following process:

Create the initial imputations for the missing data: \(x_{1(1)}^{(0)}, x_{2(1)}^{(0)}, \dots, x_{J(1)}^{(0)}\).

In the current iteration \(t+1\), the imputed values from the previous iteration \(t\), denoted \(x_{1(1)}^{(t)}, \dots, x_{J(1)}^{(t)}\), are updated for each variable in turn. This update is achieved by applying the specific predictive formula provided below:
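In practice, these chained-equations updates are handled by standard MI software. The sketch below is a hedged illustration of how one stratum (here, diseased cases under test 1) might be imputed with the `mice` package using predictive mean matching; the reshaping step and object names are assumptions, not code from the paper.

```r
library(mice)

# Hedged sketch: predictive-mean-matching imputation within one stratum,
# using the other readers' scores of the same cases as predictors.
# `mrmc` is the toy long-format data frame sketched earlier.
sub  <- subset(mrmc, test == 1 & truth == 1, select = c(case, reader, score))
wide <- reshape(sub, idvar = "case", timevar = "reader", direction = "wide")

imp <- mice(wide[, -1], m = 5, method = "pmm", seed = 2024, printFlag = FALSE)
imputed_list <- lapply(seq_len(imp$m), function(m) complete(imp, action = m))
```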

Step 2. Analysis of the individual imputed datasets

For the analysis of the \(M\) imputed datasets, the DBM method was again used so that the results are directly comparable across approaches. Thus, for the \(m\)-th imputed dataset (\(m = 1, 2, \dots, M\)), the estimated effect size and corresponding statistics are as follows, with the mean-square quantities calculated analogously to Eq. (7):

Step 3. Pooling results

After the analysis of the individual imputed datasets, the estimates obtained from each imputed dataset are combined according to Rubin's rules [ 31 ].

Step 3a. Pooling the effect size

The point estimate of \(\theta\), derived from multiple imputation, is calculated as the mean of the point estimates \(\hat{\theta}_m\) obtained from each of the \(M\) imputed datasets (\(m = 1, 2, \dots, M\)):

\[ \hat{\theta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m. \]

Step 3b. Pooling variance

The total variance of the parameter estimate \(\theta\) is composed of two components: the between-imputation variance (\(V_B\)) and the within-imputation variance (\(V_W\)). The between-imputation variance captures the variability among the estimates from the different imputed datasets, and the within-imputation variance is determined by each individual imputed dataset itself.

Within-imputation variance, with \(V_m\) denoting the variance estimate of \(\hat{\theta}_m\) from the \(m\)-th imputed dataset:

\[ V_W = \frac{1}{M} \sum_{m=1}^{M} V_m. \]

Between-imputation variance:

\[ V_B = \frac{1}{M-1} \sum_{m=1}^{M} \left(\hat{\theta}_m - \hat{\theta}\right)^2. \]

Total variance:

\[ V_{total} = V_W + \left(1 + \frac{1}{M}\right) V_B. \]

The pooled standard error:

\[ \mathrm{SE}_{pooled} = \sqrt{V_{total}}. \]

Step 3c. Significance testing

Wald statistics for MI-MRMC

The Wald statistic is constructed by dividing the estimated effect size by its pooled standard error, \(t = \hat{\theta} / \mathrm{SE}_{pooled}\), and under the null hypothesis this ratio follows a t-distribution.

Degrees of freedom for MI-MRMC

It is proposed that the degrees of freedom for statistical inference should adequately represent the uncertainty from both the MRMC process and the MI process. To achieve this, the average degrees of freedom from the \(M\) imputed datasets is chosen as a proxy for the degrees of freedom attributable to the MRMC phase. This average is then integrated with the degrees of freedom prescribed by the multiple imputation procedure, in accordance with the small-sample adjustment of Barnard and Rubin [ 33 ]. The resulting composite degrees of freedom are then used to conduct the statistical tests, ensuring that the final inferences are sensitive to the complexities and uncertainties inherent in both the MRMC and MI processes.

Confidence interval for MI-MRMC.

The confidence interval can then be obtained as

\[ \hat{\theta} \pm t_{\nu,\, 1-\alpha/2} \cdot \mathrm{SE}_{pooled}, \]

where \(\nu\) is the composite degrees of freedom described above and \(1-\alpha\) is the confidence level.
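Steps 3a–3c can be collected into a short pooling routine. The sketch below is a hedged illustration of Rubin-style pooling with a Barnard–Rubin small-sample degrees-of-freedom adjustment; the input names (per-imputation effect sizes, their variance estimates, and the MRMC-phase degrees of freedom) are assumptions, and the details may differ from the authors' implementation.

```r
# Hedged sketch of the MI-MRMC pooling steps.
#   theta_m : per-imputation effect-size estimates (difference in AUC)
#   var_m   : per-imputation variance estimates (squared standard errors)
#   df_m    : per-imputation MRMC (e.g. Hillis) degrees of freedom
pool_mi_mrmc <- function(theta_m, var_m, df_m, conf = 0.95) {
  M      <- length(theta_m)
  theta  <- mean(theta_m)                       # pooled effect size
  v_w    <- mean(var_m)                         # within-imputation variance
  v_b    <- sum((theta_m - theta)^2) / (M - 1)  # between-imputation variance
  v_tot  <- v_w + (1 + 1 / M) * v_b             # total variance
  se     <- sqrt(v_tot)
  lambda <- (1 + 1 / M) * v_b / v_tot           # fraction of missing information
  df_old <- (M - 1) / lambda^2                  # Rubin (1987) degrees of freedom
  df_com <- mean(df_m)                          # proxy for the MRMC-phase df
  df_obs <- (df_com + 1) / (df_com + 3) * df_com * (1 - lambda)
  df_adj <- 1 / (1 / df_old + 1 / df_obs)       # Barnard-Rubin adjusted df
  t_stat <- theta / se                          # Wald statistic
  p_val  <- 2 * pt(-abs(t_stat), df = df_adj)
  half   <- qt(1 - (1 - conf) / 2, df = df_adj) * se
  c(estimate = theta, se = se, t = t_stat, df = df_adj, p = p_val,
    lower = theta - half, upper = theta + half)
}
```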

Simulation study

Original complete dataset generation.

The generation of the original complete datasets was based on the Roe and Metz model [ 34 ], which is built on a binormal distribution framework. In this simulation, it was assumed that every simulated reader evaluated all cases under both imaging modalities and assigned a confidence-of-disease score for each interpretation.

Let \(X_{ijkt}\) represent the confidence-of-disease score of the Roe and Metz model for test \(i\) (\(i=1,\dots,I\)), reader \(j\) (\(j=1,2,\dots,J\)), case \(k\) (\(k=1,2,\dots,K\)), and truth state \(t\) (\(t=0\) denotes a nondiseased case image, \(t=1\) a diseased case image):

\[ X_{ijkt} = \mu_t + \tau_{it} + R_{jt} + C_{kt} + (\tau R)_{ijt} + (\tau C)_{ikt} + (RC)_{jkt} + \epsilon_{ijkt}, \]

where \(\mu_t\) is 0 for nondiseased cases, \(\tau_{it}\) is the fixed effect of each modality, and the remaining terms are random effects that are mutually independent and normally distributed with zero means. The test \(\times\) reader \(\times\) case random effect is combined into the error term, considering that these two effects are inseparable without repeated readings.

To simplify, it is assumed that the variances of the random effects are identical for nondiseased and diseased cases.

Thus, for nondiseased cases,

For diseased cases,

In the context of hypothesis testing, under the null hypothesis both \(\tau_{A0}\) and \(\tau_{B0}\) are equal to zero and \(\tau_{A1}\) is equal to \(\tau_{B1}\). Conversely, under the alternative hypothesis, \(\tau_{A0}\) and \(\tau_{B0}\) remain zero while \(\tau_{A1}\) and \(\tau_{B1}\) are not equal, indicating a difference in diagnostic ability between the two tests.

The within-reader correlation \(\rho_{WR}\) and the between-reader correlation \(\rho_{BR}\) were also specified to define the different correlation structure settings [ 35 ].
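As a rough illustration of this data-generating process, the sketch below draws scores from a Roe–Metz-type linear model for a single truth status. The function name, default variance components, and effect sizes are placeholders rather than the settings used in the paper, and the within- and between-reader correlations are induced implicitly by the choice of variance components.

```r
# Hedged sketch: Roe-Metz-type score generation for one truth status.
simulate_rm_scores <- function(J, K, tau, mu = 0,
                               var_R = 0.01, var_C = 0.1, var_TR = 0.01,
                               var_TC = 0.1, var_RC = 0.2, var_eps = 0.2) {
  I  <- length(tau)                              # number of tests
  R  <- rnorm(J, 0, sqrt(var_R))                 # reader effects
  C  <- rnorm(K, 0, sqrt(var_C))                 # case effects
  TR <- matrix(rnorm(I * J, 0, sqrt(var_TR)), I, J)
  TC <- matrix(rnorm(I * K, 0, sqrt(var_TC)), I, K)
  RC <- matrix(rnorm(J * K, 0, sqrt(var_RC)), J, K)
  out <- expand.grid(test = 1:I, reader = 1:J, case = 1:K)
  out$score <- mu + tau[out$test] + R[out$reader] + C[out$case] +
    TR[cbind(out$test, out$reader)] + TC[cbind(out$test, out$case)] +
    RC[cbind(out$reader, out$case)] + rnorm(nrow(out), 0, sqrt(var_eps))
  out
}

# Example: diseased cases under an alternative hypothesis (unequal test effects).
diseased <- simulate_rm_scores(J = 5, K = 50, tau = c(1.5, 1.8))
```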

Introducing missingness

The simulation study employed two missing data mechanisms, MCAR and MAR, to evaluate their impact on the analytical results.

Under the MAR mechanism, the probability of a missing observation is assumed to depend on observable variables, specifically the reader and the test. The missingness indicator \(R_{ijk}\), which denotes whether the interpretation by reader \(j\) for case \(k\) under test \(i\) is observed (\(R_{ijk}=0\)) or missing (\(R_{ijk}=1\)), is modeled via logistic regression:

\[ \mathrm{logit}\, \Pr(R_{ijk}=1) = \gamma_0 + \gamma_1 \cdot \mathrm{reader}_j + \gamma_2 \cdot \mathrm{test}_i. \]

The parameters \(\gamma_1\) and \(\gamma_2\) represent the effects of the reader and the test, respectively, on the log-odds of the observation being missing. Specifically, \(\gamma_1\) is set to \(-0.1\) and \(\gamma_2\) is set to \(0.15\), and the parameter \(\gamma_0\) is varied to achieve different missing rates.

Conversely, by setting \(\gamma_1 = \gamma_2 = 0\), the MCAR mechanism is simulated, in which the missingness is independent of the observed data. In this case, the missingness indicator \(R_{ijk}\) is determined solely by the intercept \(\gamma_0\):

\[ \mathrm{logit}\, \Pr(R_{ijk}=1) = \gamma_0. \]
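A hedged sketch of how such missingness indicators might be generated is given below; the numeric coding of the reader and test covariates is an assumption on our part, and `diseased` refers to the simulated scores from the previous sketch.

```r
# Hedged sketch: delete scores according to the logistic missingness model;
# gamma1 = gamma2 = 0 yields MCAR, nonzero values yield the MAR setting above.
add_missingness <- function(dat, gamma0, gamma1 = -0.1, gamma2 = 0.15) {
  p_miss <- plogis(gamma0 + gamma1 * dat$reader + gamma2 * dat$test)
  dat$score[runif(nrow(dat)) < p_miss] <- NA
  dat
}

with_missing <- add_missingness(diseased, gamma0 = -2.5)  # gamma0 tunes the missing rate
```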

The simulation scenarios are detailed in Table 1 and Supplementary Table S1 and are primarily based on the settings established by Roe and Metz [ 34 ]. A total of 1728 scenarios were considered, and under each scenario, 1000 simulations were conducted to limit Monte Carlo sampling error.

Evaluation of analysis approaches

In this simulation study, datasets incorporating instances of missing data were analyzed via the MI-MRMC approach, as well as via the CC approach, to obtain estimates of the parameter of interest—namely, the difference in the ROC AUC between two diagnostic tests. For the purpose of comparison, the DBM analysis was also conducted on the original complete datasets, which were void of any missing data. This approach will be referred to as ‘original’ hereafter.

The following metrics were calculated to compare the performance of the proposed approach in terms of statistical performance, point estimation accuracy and confidence interval coverage: (1) type I error rate (under null hypothesis settings); (2) power (under alternative hypothesis setting); (3) root mean squared error (RMSE); (4) bias; (5) 95% confidence interval coverage rate; and (6) confidence interval width.
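For reference, a hedged sketch of how these operating characteristics might be computed over simulation replicates is shown below; the `res` data frame and the true effect `delta_true` are hypothetical names, not objects from the paper.

```r
# Hedged sketch: operating characteristics across simulation replicates.
# `res` holds one row per replicate with columns estimate, se, df and p;
# `delta_true` is the true difference in AUC used to generate the data.
alpha     <- 0.05
rejection <- mean(res$p < alpha)          # type I error under H0, power under H1
bias      <- mean(res$estimate - delta_true)
rmse      <- sqrt(mean((res$estimate - delta_true)^2))
half      <- qt(1 - alpha / 2, df = res$df) * res$se
coverage  <- mean(res$estimate - half <= delta_true &
                  delta_true <= res$estimate + half)
ci_width  <- mean(2 * half)
```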

All simulation computations were executed via R (version 4.1.2) [ 36 ].

Real example

Data Source

The proposed analysis approach was applied to a real ROC-paradigm MRMC design CAD study. The study was conducted at the Affiliated Hospital of HeBei University, the China-Japan Union Hospital of Jilin University, and Peking University People's Hospital, with ethics approval obtained from the ethics committees of these hospitals.

Study Design

This study evaluated the efficacy of aneurysm detection with and without the assistance of a deep learning model in the context of head and neck CT angiograms. A total of 280 subjects were included, 135 of whom had at least one aneurysm. Ten qualified radiologists interpreted all the images and rated each on a scale from 0 to 10, where 0 indicated definite absence of an aneurysm and 10 indicated definite presence.

Out of the 5,600 interpretations (280 subjects × 10 radiologists × 2 tests), there were 17 instances of missing data: 12 were due to radiologists not evaluating some images, and 5 were attributable to failures in generating reconstructed images. The overall missing data rate was 0.30%. Both the MI-MRMC and the CC approaches were applied to handle the missing data. To establish a benchmark for evaluating the analyses, an “original complete dataset” was also created, in which the previously missing interpretations were subsequently re-interpreted by the original radiologists; a DBM analysis was then conducted on this dataset.

Results

Figure 1 displays the mean type I error rates under the null hypothesis setting for the MI-MRMC, CC, and original approaches, by factor level for each of the simulation study factors across the various scenarios, differentiated by sample size. The MI-MRMC approach exhibits a somewhat lower type I error rate than the original complete datasets, whereas the CC approach demonstrates context-dependent performance. Despite these slight variations, both the CC and the MI-MRMC approaches yield generally comparable results under MAR and MCAR conditions. Specifically, the type I error rates of both approaches closely align with those observed for the original complete datasets and approximate the nominal significance level of 0.05.

Fig. 1 Mean type I error performance under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset

The statistical power under the alternative hypothesis (Fig. 2 ) reveals that the MI-MRMC approach maintains strong performance in terms of power. Notably, for this approach, any reduction in power is slight, even as the rate of missing data increases. In contrast, the CC approach results in a significant decrease in power, which is exacerbated by increases in both the missing data rate and the total number of readers. Furthermore, for both approaches, a decrease in the AUC is associated with a reduction in statistical power. Performance comparisons across different settings of variance components show that the outcomes are broadly similar, indicating that the statistical power of these approaches is relatively consistent regardless of variance component configurations.

Fig. 2 Mean power performance under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset

The mean RMSE values are detailed in Fig. 3  and Supplementary Figure S1 . For all the considered scenarios, whether under MAR or MCAR conditions, the RMSE associated with the CC approach is greater than that associated with the MI-MRMC approach. Moreover, the RMSE for the CC approach increases significantly as the sample size diminishes and the rates of missing data increase.

Fig. 3 Mean RMSE performance under different scenarios for the original, CC analysis and MI-MRMC approaches. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset, NA: not applicable

In line with the RMSE findings, the bias is greater when the CC approach is employed than with the MI-MRMC approach, particularly under limited sample sizes, elevated missing rates, and lower AUC values (Fig. 4, Supplementary Figure S2).

Fig. 4 Mean bias performance under different scenarios for the original, CC analysis and MI-MRMC approaches. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset, NA: not applicable

The 95% confidence interval coverage rate, as shown in Fig. 5  and Supplementary Figure S3 , is consistent across all the scenarios for these three approaches, closely approximating the ideal 95%.

Fig. 5 Mean 95% confidence interval coverage rate performance under different scenarios for the original, CC analysis and MI-MRMC approaches. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset, NA: not applicable

Regarding the width of the confidence interval, for all approaches, scenarios with smaller sample sizes, lower AUC settings, and higher missing rates are associated with wider confidence intervals. Compared with the performance of original complete datasets, the MI-MRMC approach shows a modest increase in confidence interval width. The CC approach, however, results in even wider confidence intervals, particularly under higher missing rates, as illustrated in Supplementary Figure S4 .

The detailed simulation results can be found in Supplementary Table S2 .

Table  2 summarizes the results of the CAD study. All methods indicate a significant difference in the AUC when comparing scenarios with and without the use of the deep learning model for aneurysm detection by head and neck CT angiograms. Notably, the proposed MI-MRMC approach produces results that align more closely with those from the original complete datasets—more so than the CC approach—regarding point estimates, confidence intervals, and P values.

Discussion

In this study, an MI-MRMC approach was developed to handle missing data in MRMC design CAD studies. To assess the feasibility and suitability of this approach, we conducted a simulation study comparing its performance against that of the CC approach across 1728 scenarios. Additionally, we applied the approach to a real-world CAD study to evaluate its performance under actual clinical conditions.

Our findings reveal that, with respect to point estimation, the CC approach performs marginally worse than the MI-MRMC and original approaches, resulting in slightly elevated bias and RMSE. The CC approach also yields substantially wider confidence intervals, which consequently leads to markedly reduced statistical power in comparison with both the MI-MRMC and the original approach. This loss of power becomes more pronounced as the rate of missing data and the number of readers increase. These findings underscore the potential for inherent bias and highlight the inadequacy of complete case analysis for managing missing data in MRMC settings. Our results align with observations from other diagnostic test trials beyond MRMC studies, where the limitations of complete case analysis have been similarly noted [ 37 , 38 , 39 , 40 ]. Notably, Newman has labeled complete case analysis ‘suboptimal, potentially unethical, and totally unnecessary’, noting that even minimal missing data can reduce study power and bias results, making findings applicable only to those with complete data [ 41 ]. Despite these identified limitations and the critique from the broader research community, the CC approach remains the most commonly employed approach in CAD studies. This prevalent use, juxtaposed with the method’s recognized deficiencies, emphasizes the need for a shift toward more robust and reliable methods for handling missing data in MRMC designs.

In contrast, the MI-MRMC approach consistently demonstrates strong statistical power while maintaining the type I error rate close to the nominal 5% level. This is complemented by superior performance metrics, including low RMSE, minimal bias, and accurate 95% confidence interval coverage. These favorable outcomes persist across various conditions, encompassing different missing data mechanisms, diverse sample sizes of cases and readers, a range of missing rates, and various variance structures. Regarding confidence interval width, our findings indicate that MI-MRMC tends to produce slightly wider confidence intervals compared with the original complete dataset. This observation aligns with previous literature on MI, which suggests that the wider MI confidence intervals reflect a realistic addition of uncertainty introduced by missing data and the subsequent imputation process [ 42 ]. We observed that MI-MRMC demonstrates relatively wider confidence intervals, particularly in scenarios with low correlation structures (LL, LH) or limited reader sample sizes. This results in a comparatively lower type I error rate under these conditions. However, it’s important to note that despite the relatively lower type I error rate, MI-MRMC still maintains strong statistical power compared with the traditional CC approach.

When deriving the degrees of freedom for the MI-MRMC, we adopted the methodology proposed by Barnard and Rubin [ 33 ] over the framework suggested by Rubin in 1987 [ 31 ]. This decision is informed by the unique characteristics of MRMC studies, which typically feature a modest proportion of missing data and in which individual observations, such as the confidence-of-disease score, exert limited influence on the endpoint, specifically the AUC. This results in minimal between-imputation variance. In addition, Rubin’s 1987 method in this context may inflate the degrees of freedom compared with those derived from the original complete data, potentially skewing significance testing toward optimism. Conversely, the approach of Barnard and Rubin [ 33 ], which accounts for the degrees of freedom from both the observed datasets and the imputation phase, offers a more accurate estimation. It enables the integration of the degrees of freedom inherent to the MRMC phase, optimized through Hillis’s contributions [ 28 ], ensuring a balanced and precise evaluation of statistical significance.

In our exploratory studies, the joint model algorithm was also evaluated, yielding results comparable to those obtained with the MICE algorithm. Given that the joint model algorithm requires a stringent assumption of multivariate normal distribution [ 43 ], the MICE algorithm was selected. Regarding the optimal number of imputations, a preliminary simulation study was conducted using ten imputations. The results indicated marginal gains in precision beyond five imputations, consistent with the recommendations of Little and Rubin [ 17 ]. Consequently, five imputations were deemed sufficient for this investigation. Future studies may explore the impact of varying the number of imputations, taking into account real-world application situations and computational constraints.

Multiple imputation, which originated in the 1970s [ 21 ], addresses the uncertainty associated with missing data by generating multiple imputed datasets. Since its inception, MI has gained widespread acceptance across various fields, including survey research [ 44 ], clinical trials [ 22 , 45 ], and observational studies [ 23 ]. Specifically, in the realm of diagnostic testing, MI has been explored as a solution for mitigating verification bias caused by missing gold standard data [ 24 , 25 ], as well as for handling missing data in index tests in the non-MRMC design context [ 37 , 46 , 47 ]. Through comprehensive simulations and practical diagnostic trials, MI has proven to be highly effective in these areas [ 48 ], establishing itself as a key technique for addressing challenges associated with missing data. Consistent with these prior findings, and by integrating MI within the MRMC framework, our approach further underscores the robustness of the MI theory. This shows compelling statistical performance, even when dealing with missing data within the complex correlated data structures characteristic of MRMC designs, contributing to the expanding evidence of MI’s significant potential to enhance research methodologies in scenarios plagued by missing data.

Furthermore, the estimate from MI-MRMC corresponds to the randomized/enrolled population, aligning with the ICH E9(R1) framework [ 49 ] and principles of causal inference [ 50 ]. In contrast, the CC approach violates the randomization principle and may introduce selection bias due to the deletion of cases with missing data. Thus, MI-MRMC can serve as an actionable sensitivity analysis approach when missing data occur in real clinical settings.

It is important to acknowledge the limitations of this research. First, in our real-case study, the original complete dataset relied on ad hoc re-interpretation, which may introduce biases such as inter-reader variance. However, finding a balance between representing the actual missing data scenario and maintaining dataset integrity has proven challenging. Second, our simulation study, while covering 1728 scenarios, may not fully replicate real-world conditions. For example, there may be situations in which the variances across different tests and truth statuses vary [ 51 ]. Therefore, future research should consider applying our approach to more sophisticated scenarios to further evaluate its efficacy. Finally, our investigation focused solely on the MCAR and MAR mechanisms, given that missing-not-at-random occurrences are infrequent in CAD studies. To increase robustness, future studies could incorporate other sensitivity analysis methods, such as the tipping point approach, alongside our proposed MI-MRMC framework [ 13 ].

In conclusion, this study is the first to address the critical yet often overlooked issue of missing data in MRMC designs. The proposed MI-MRMC approach addresses this issue through multiple imputation, thereby producing estimates that are representative of the randomized/enrolled population. By comparing the traditional CC approach with the MI-MRMC approach in both simulation studies and a real-world application, the substantial benefits of MI-MRMC are highlighted, particularly in enhancing accuracy and statistical power while maintaining good control of the type I error rate in the presence of missing data. Consequently, this method offers an effective solution for managing the challenges associated with missing data in MRMC designs and can serve as a sensitivity analysis approach in real clinical environments, thereby paving the way for more robust and reliable research outcomes in future endeavors.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Gallas BD, Chan HP, D’Orsi CJ, Dodd LE, Giger ML, Gur D, Krupinski EA, Metz CE, Myers KJ, Obuchowski NA, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. 2012;19(4):463–77.


Wagner RF, Metz CE, Campbell G. Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol. 2007;14(6):723–48.


Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Findings from a national sample. Arch Intern Med. 1996;156(2):209–13.


Yu T, Li Q, Gray G, Yue LQ. Statistical innovations in diagnostic device evaluation. J Biopharm Stat. 2016;26(6):1067–77.

Clinical Performance Assessment. Considerations for Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data in Premarket Notification (510(k)) Submissions: Guidance for Industry and Food and Drug Administration Staff [ https://www.fda.gov/media/77642/download ].

Guiding Principles for Technical Review of Breast X-ray System Registration. [ https://www.cmde.org.cn//flfg/zdyz/zdyzwbk/20210701103258337.html ].

Key Points for Review of Medical Device Software Assisted by Deep Learning. [ https://www.cmde.org.cn//xwdt/zxyw/20190628151300923.html ].

Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol. 1995;2(Suppl 1):S22–29. discussion S57-64, S70-21 pas.


Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis. Generalization to the population of readers and patients with the jackknife method. Invest Radiol. 1992;27(9):723–31.

Wang L, Wang H, Xia C, Wang Y, Tang Q, Li J, Zhou XH. Toward standardized premarket evaluation of computer aided diagnosis/detection products: insights from FDA-approved products. Expert Rev Med Devices. 2020;17(9):899–918.

Obuchowski NA, Bullen J. Multireader Diagnostic Accuracy Imaging studies: fundamentals of Design and Analysis. Radiology. 2022;303(1):26–34.

Campbell G, Pennello G, Yue L. Missing data in the regulation of medical devices. J Biopharm Stat. 2011;21(2):180–95.

Campbell G, Yue LQ. Statistical innovations in the medical device world sparked by the FDA. J Biopharm Stat. 2016;26(1):3–16.

Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029.

Cohen JF, Korevaar DA, Altman DG, Bruns DE, Gatsonis CA, Hooft L, Irwig L, Levine D, Reitsma JB, de Vet HC, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799.

Stahlmann K, Reitsma JB, Zapf A. Missing values and inconclusive results in diagnostic studies - a scoping review of methods. Stat Methods Med Res. 2023;32(9):1842–55.

Little RJA, Rubin DB. Statistical Analysis with Missing Data, 3rd Edition. John Wiley & Sons; 2020.

Schuetz GM, Schlattmann P, Dewey M. Use of 3x2 tables with an intention to diagnose approach to assess clinical performance of diagnostic tests: meta-analytical evaluation of coronary CT angiography studies. BMJ. 2012;345:e6717.

Shinkins B, Thompson M, Mallett S, Perera R. Diagnostic accuracy studies: how to report and analyse inconclusive test results. BMJ. 2013;346:f2778.

Mitroiu M, Oude Rengerink K, Teerenstra S, Petavy F, Roes KCB. A narrative review of estimands in drug development and regulatory evaluation: old wine in new barrels? Trials. 2020;21(1):671.


Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–92.


Jakobsen JC, Gluud C, Wetterslev J, Winkel P. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med Res Methodol. 2017;17(1):162.

Pedersen AB, Mikkelsen EM, Cronin-Fenton D, Kristensen NR, Pham TM, Pedersen L, Petersen I. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157–66.

Harel O, Zhou XH. Multiple imputation for correcting verification bias. Stat Med. 2006;25(22):3769–86.

Harel O, Zhou XH. Multiple imputation for the comparison of two screening tests in two-phase Alzheimer studies. Stat Med. 2007;26(11):2370–88.

Meng XL. Multiple-imputation inferences with uncongenial sources of input. Stat Sci. 1994;9(4):538–73.

Bartlett JW, Seaman SR, White IR, Carpenter JR, Alzheimer’s Disease Neuroimaging I. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87.

Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Acad Radiol. 2008;15(5):647–61.

Chakraborty DP. Observer performance methods for diagnostic imaging: foundations, modeling, and applications with r-based examples. 1st edition. Boca Raton: CRC Press; 2017.

Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Stat Med. 2007;26(3):596–619.

Rubin DB, Wiley I. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.


Landerman LR, Land KC, Pieper CF. An empirical evaluation of the predictive mean matching method for imputing missing values. Sociol Methods Res. 1997;26(1):3–33.

Barnard J, Rubin DB. Miscellanea. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–55.

Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol. 1997;4(4):298–303.

Hillis SL. Relationship between Roe and Metz simulation model for multireader diagnostic data and Obuchowski-Rockette model parameters. Stat Med. 2018;37(13):2067–93. https://doi.org/10.1002/sim.7616 .

R. A Language and Environment for Statistical Computing [ https://www.R-project.org/ ].

Gad AM, Ali AA, Mohamed RH. A multiple imputation approach to evaluate the accuracy of diagnostic tests in presence of missing values. Commun Math Biol Neurosci. 2022;21:1–19.

Kohn MA, Carpenter CR, Newman TB. Understanding the direction of bias in studies of diagnostic test accuracy. Acad Emerg Med. 2013;20(11):1194–206.

Whiting PF, Rutjes AW, Westwood ME, Mallett S, Group Q-S. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol. 2013;66(10):1093–104.

Van der Heijden GJ, Donders ART, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59(10):1102–9.

Newman DA. Missing data: five practical guidelines. Organizational Res Methods. 2014;17(4):372–411.

Buuren Sv. Flexible imputation of missing data. Boca Raton, FL: CRC; 2012.

Hickey GL, Philipson P, Jorgensen A, Kolamunnage-Dona R. Joint modelling of time-to-event and multivariate longitudinal outcomes: recent developments and issues. BMC Med Res Methodol. 2016;16(1):117.

He Y, Zaslavsky AM, Landrum MB, Harrington DP, Catalano P. Multiple imputation in a large-scale complex survey: a practical guide. Stat Methods Med Res. 2010;19(6):653–70.

Barnes SA, Lindborg SR, Seaman JW Jr. Multiple imputation techniques in small sample clinical trials. Stat Med. 2006;25(2):233–45.

Long Q, Zhang X, Hsu CH. Nonparametric multiple imputation for receiver operating characteristics analysis when some biomarker values are missing at random. Stat Med. 2011;30(26):3149–61.

Cheng W, Tang N. Smoothed empirical likelihood inference for ROC curve in the presence of missing biomarker values. Biom J. 2020;62(4):1038–59.

Karakaya J, Karabulut E, Yucel RM. Sensitivity to imputation models and assumptions in receiver operating characteristic analysis with incomplete data. J Stat Comput Simul. 2015;85(17):3498–511.

FDA. E9(R1) Statistical Principles for Clinical Trials: Addendum: Estimands and Sensitivity Analysis in Clinical Trials. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9r1-statistical-principles-clinicaltrials-addendum-estimands-and-sensitivity-analysis-clinical . Accessed 5 Sep 2024.

Westreich D, Edwards JK, Cole SR, Platt RW, Mumford SL, Schisterman EF. Imputation approaches for potential outcomes in causal inference. Int J Epidemiol. 2015;44(5):1731–7.

Hillis SL. Simulation of unequal-variance binormal multireader ROC decision data: an extension of the Roe and Metz simulation model. Acad Radiol. 2012;19(12):1518–28.


Acknowledgements

We would like to express our sincere gratitude to Prof. Stephen L. Hillis, Prof. Dev P. Chakraborty, and Mr. Ning Li for their invaluable support and guidance throughout the development of this paper. We extend our gratitude to Shanghai United Imaging Intelligence Co., Ltd., for sponsoring the real example study and sharing the data. We also acknowledge the valuable support from the investigators of the real example study: Xiaoping Yin and Jianing Wang from the Affiliated Hospital of HeBei University; Lin Liu and Zhanhao Mo from the China-Japan Union Hospital of Jilin University; and Nan Hong and Lei Chen from Peking University People’s Hospital.

Funding

This study was conducted under grants from the Shanghai Municipal Health Commission Special Research Project in Emerging Interdisciplinary Fields (2022JC011) and the Shanghai Science and Technology Development Funds (22QA1411400).

Author information

Zhemin Pan and Yingyi Qin contributed equally to this work.

Authors and Affiliations

Tongji University School of Medicine, 1239 Siping Road, Yangpu District, Shanghai, 200092, China

Zhemin Pan, Wangyang Bai & Jia He

Department of Military Health Statistics, Naval Medical University, 800 Xiangyin Road, Yangpu District, Shanghai, 200433, China

Yingyi Qin, Qian He & Jia He

Department of Radiology, the Affiliated Hospital of Hebei University, 212 Eastern Yuhua Road, Baoding City, Hebei Province, 071000, China

Xiaoping Yin


Contributions

ZM Pan and YY Qin contributed equally to this work. ZM Pan and YY Qin designed the simulation and wrote the main manuscript text. WY Bai prepared figures and tables. Q He conducted the analysis of the real example. XP Yin provided substantial contributions during the revisions. J He provided critical input to the manuscript. All authors reviewed the manuscript and approved the final version of this paper.

Corresponding author

Correspondence to Jia He .

Ethics declarations

Ethics approval and consent to participate.

The case study has ethics approvals from the Ethics Committee of Peking University People’s Hospital (9 November 2022), the Ethics Committee of the China-Japan Union Hospital of Jilin University (25 November 2022), and the Ethics Committee of the Affiliated Hospital of HeBei University (26 December 2022). Participating patients provided informed consent, and the research methods followed national and international guidelines.

Competing interests

The authors declare no competing interests.

Consent for publication

Not applicable.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

12874_2024_2321_MOESM1_ESM.tiff

Supplementary Material 1: Supplementary Figure S1. Mean RMSE under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.

12874_2024_2321_MOESM2_ESM.tiff

Supplementary Material 2: Supplementary Figure S2. Mean bias under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset

12874_2024_2321_MOESM3_ESM.tiff

Supplementary Material 3: Supplementary Figure S3. Mean 95% confidence interval coverage rate under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset

12874_2024_2321_MOESM4_ESM.tiff

Supplementary Material 4:Supplementary Figure S4. Mean confidence interval width under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset

Supplementary Material 5: Supplementary Table S1. Simulation settings.

Supplementary Material 6: Supplementary Table S2. Detailed simulation results.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article.

Pan, Z., Qin, Y., Bai, W. et al. Implementing multiple imputations for addressing missing data in multireader multicase design studies. BMC Med Res Methodol 24 , 217 (2024). https://doi.org/10.1186/s12874-024-02321-3


Received : 06 March 2024

Accepted : 27 August 2024

Published : 27 September 2024

DOI : https://doi.org/10.1186/s12874-024-02321-3


Keywords

  • Missing data
  • Multiple imputation
  • Multireader multicase
  • Computer-aided diagnosis

