Statistics By Jim
Making statistics intuitive
By Jim Frost 85 Comments
If you’re learning regression analysis, you might want to bookmark this tutorial!
Before we get to the regression tutorials, I’ll cover several overarching issues.
Why use regression at all? What are common problems that trip up analysts? And, how do you differentiate a high-quality regression analysis from a less rigorous study? Read these posts to find out:
There are many different types of regression analysis. Choosing the right procedure depends on your data and the nature of the relationships, as these posts explain.
Model specification is an iterative process. The interpretation and assumption confirmation sections of this tutorial explain how to assess your model and how to change the model based on the statistical output and graphs.
After choosing the type of regression and specifying the model, you need to interpret the results. The next set of posts explain how to interpret the results for various regression analysis statistics:
Analysts often use regression analysis to make predictions. In this section of the regression tutorial, learn how to make predictions and assess their precision.
The last part of the regression tutorial contains regression analysis examples. Some of the examples are included in previous tutorial sections. Most of these regression examples include the datasets so you can try them yourself! Also, try using Excel to perform regression analysis with a step-by-step example!
If you’re learning regression and like the approach I use in my blog, check out my eBook!
August 11, 2024 at 9:26 am
Hi Jim, I am looking for some scientific papers that validate goodness-of-fit parameters under different atmospheric conditions using a polynomial model. Even a benchmark would be helpful for determining the range of the parameters' threshold values.
April 10, 2024 at 5:29 pm
Hello Jim, Finding your site super-helpful. Wondering if you can provide some examples to simply illustrate multiple regression output (from spss). I would like to illustrate the overall effects of the independent variables on the dependent without creating a histogram. Ideally something that shows the strength, direction and significance in a box plot, line graph, bubble chart or other smart graphic. So appreciate your guidance.
January 2, 2023 at 10:02 am
Hi Jim, I just bought all 3 of your books via your Website Store, which took me to Amazon – $66.72 plus tax. I don’t see how to get the PDF versions, though, without paying for them additionally. Can you help me with how to get access to the PDFs? Also, I am reviewing these to see if I want to add them to the courses I teach in Data Analytics. Do you have academic pricing available for students? Both hardcopy and e-copy?
January 3, 2023 at 12:17 am
Hi Anthony,
Look for an email from me.
Edited to add: I just sent an email to the address you used for the comment, but it bounced back saying it was “Blocked.” Please provide a method for me to contact you. You can use the contact form on my website and provide a different email address. Thanks!
July 18, 2022 at 7:15 am
Dear Dr., how are you? Thank you very much for your contribution. I have one question, which might not be related to this post. In my cross tabulation between two categorical variables, one cell has just 8 observations for the “no” level and 33 observations for the “yes” level of the second variable. Can I continue with this for the descriptive statistics, or should I collapse the categories to increase the sample size? Do I then use the new variable with fewer categories in my regression analysis? Your help is much appreciated.
May 5, 2021 at 1:46 pm
Thanks for teaching us about Stats intuitively.
Is your book Regression Analysis available in PDF format? I’m a student learning Stats and would like it only in PDF format (no Kindle)
May 5, 2021 at 1:49 pm
Yes, if you buy it through My Website Store , you’ll get it in PDF format.
February 22, 2021 at 4:08 pm
Thank you for your valuable advice. The change is in the way I am putting data into the software. When I input averaged data, the GLM output shows that ingestion has no significant effect on mortality. When I input data with replications, the GLM output shows a significant effect of ingestion on mortality. My standard deviations are large, but the data show homoscedasticity and normal distribution.
Your comments will really be helpful in this regard.
February 22, 2021 at 4:11 pm
If you have replications, I’d enter that data to maintain the separate data points and NOT use the average. That provides the model with more information!
February 22, 2021 at 6:39 am
I have a question about generalized linear models. I am getting different outputs for the same response variable when I apply a GLM using 1) data with replications and 2) averaged data. Mortality is my response variable and number of particles ingested is my predictor variable; the other two predictors are categorical.
Looking forward to your advice.
February 22, 2021 at 3:44 pm
I’m not sure what you’re changing in your analysis to get the different outputs?
Replications are good because they help the model estimate pure error.
Average data is okay too but just be aware of incorporating that into the interpretations.
September 7, 2020 at 1:02 pm
I know that we can use linear or nonlinear models to fit a line to a dataset with curvature. My question is: when we have many independent variables, how can we tell whether there is curvature?
Do you think we should start with simple linear regression, then model polynomial, Reciprocal, log, and nonlinear regression and compare the result for all of them to find which model works the best?
Thanks a lot for your very good and easy to understand book.
August 6, 2020 at 8:49 pm
I wonder if you have any recommended articles as to how to interpret the actual p-values and confidence interval for multiple regressions? I am struggling to find examples/templates of reporting these results.
I truly appreciate your help.
August 6, 2020 at 10:29 pm
I’ve written a post about interpreting p-value for regression coefficients , which I think would be helpful.
For confidence intervals of regression coefficients, think about sample means and CIs for means as a starting point. You can use the mean as the sample estimate of the population mean. However, because we’re working with a sample, we know there is a margin of error around that estimate. The CI captures that margin of error. If you have a CI for a sample mean, you know that the true population parameter is likely to be in that range.
In the regression context, the coefficient is also a mean. It’s a mean effect or the mean change in the dependent variable given a one-unit change in the independent variable. However, because we’re working with a sample, we know there is a margin of error around that mean effect. Consequently, with a CI for a regression coefficient, we know that the true mean effect of that coefficient is likely to fall within that coefficient CI.
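If it helps to see those pieces together, here’s a quick sketch in Python with simulated data (the variable names are invented for illustration):

```python
# Sketch: point estimates, p-values, and 95% CIs for regression coefficients.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 3 + 2.0 * df["x1"] - 1.5 * df["x2"] + rng.normal(size=100)

model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.params)                # each coefficient: a mean effect per one-unit change
print(model.pvalues)               # p-values for those coefficients
print(model.conf_int(alpha=0.05))  # 95% CIs likely to contain each true mean effect
```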
I hope that helps!
June 28, 2020 at 2:57 am
Hi Jim, first of all, thanks for all your great work. I’m setting up a linear regression analysis in which the standardized coefficient is considered, but the problem is my dependent variable, which is energy usage intensity, so a lower value is better than a higher value. Correct me if I’m wrong: I think SPSS evaluates a high value as the best and a lower one as the worst, so in my case it could reverse the effect on the result (standardized beta coefficient). Is that right? And what is your suggestion?
June 23, 2020 at 3:51 am
Hello Jim, I want help with an econometric model or equation that can be used when there is one independent variable (dam) and a dependent variable (5 livelihood outcomes). I am confused about whether I can use a binary regression model, treating the 5 outcomes as indicators of the dependent variable (livelihood outcomes), or whether I have to treat the 5 livelihood outcomes as 5 dependent variables and use multivariate regression. Please reply ASAP. Thank you so much.
June 28, 2020 at 12:31 am
It really depends on the nature of the variables. I don’t know what you’re assessing, but here are two possibilities.
In the first, you use the independent variable, which I’m assuming is continuous (though I don’t know for sure), to predict the 5 indicator/binary outcomes. This is appropriate if you think the IV affects, or at least predicts, those five indicators. Use this approach if the goal of your analysis is to use the IV to predict the probability of those binary outcomes. Use binary logistic regression. You’ll need to run five different models. In each model, one of the binary outcomes/indicators is your DV, and you’d use the same IV for each model. This type of model allows you to use the value of the IV to predict the probability of the binary outcome.
However, if you instead want to use the binary indicators to predict the continuous variable, you’d need to use multiple regression. The continuous variable is your DV and the five indicators are your IVs. This type of model allows you to use the values of the five indicators to predict the mean value of the continuous variable.
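To make the first approach concrete, here’s a rough sketch in Python with simulated data (variable names like dam_exposure are just placeholders, not from the question):

```python
# Hypothetical sketch: five separate binary logistic models, one per
# livelihood indicator, each using the same independent variable.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"dam_exposure": rng.normal(size=200)})
for k in range(1, 6):  # five invented 0/1 livelihood indicators
    p = 1 / (1 + np.exp(-(0.5 * df["dam_exposure"] + rng.normal(0, 0.5, 200))))
    df[f"outcome{k}"] = rng.binomial(1, p)

for k in range(1, 6):
    model = smf.logit(f"outcome{k} ~ dam_exposure", data=df).fit(disp=0)
    print(f"outcome{k}:", model.params.to_dict())
```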
Which approach you take depends on a mix of theory and what your study needs to learn.
June 3, 2020 at 5:17 pm
Is this normal that the signs in “Regression equation in uncoded units” are sometimes different from the signs in the “Coded coefficients table”? In my regression results, for some terms, while the sign of a coefficient is negative in “Coded coefficients table”, it is positive in the regression equation. I am a little confused here. I thought the signs should be the same.
Thanks, Behnaz
June 3, 2020 at 8:07 pm
There is nothing unusual about the coded and uncoded coefficients having different signs. Suppose a coded coefficient has a negative sign but the uncoded coefficient has a positive sign. Your software uses one of several processes to translate the raw (uncoded) data into coded values that help the model estimation process. Sometimes that conversion causes coefficient signs to switch.
October 17, 2019 at 5:31 am
Hello Jim, I am looking to do an R-squared line for a multiple regression series. I’m not so confident that the 3rd, 4th, or 5th number in the correlations will help make a better line. I’m basically looking at data to predict stock prices (getting a better R2). For example, Enterprise Value/Sales to growth rate has a high R2 of about .48, but we know for sure that Free Cash Flow/Revenue to percent down from 52-week high is about .299.
I have no clue how to get this to work in a 3D chart or how to make a formula and find the new R2. Any help would be great.
I don’t have Excel, and I’m not a programmer; I just have some Google Sheets experience.
October 17, 2019 at 3:42 pm
Hi Jonathan,
I’m not 100% sure what you mean by an R-squared line? Or, by the 3rd, 4th, 5th, number in the correlations? Are you fitting several models where each one has just one independent variable?
It sounds to me like you’ll need to learn more about multiple regression. Fortunately, I’ve written an ebook about it that will take you from a novice to being able to perform multiple regression effectively. Learn about my intuitive guide to regression ebook.
It also sounds like you’ll need to obtain some statistical software! I’m not sure what statistics, if any, you can perform in Google Sheets.
July 18, 2019 at 9:44 am
Forgive me if these questions have obvious answers, but I could not find the answers yet. Still reading and learning. Is 3 the minimum number of samples needed to calculate a regression? Why? I’m guessing the equations used require at least 3 sets of X,Y data to calculate a regression, but I do not see a good explanation of why. I’m not wondering about how many sets make the strongest fit. And with only two sets we would get a straight line and no chance of curvature...
I am working on a stability analysis report. For some lots we only have two time points: zero and three months. The software will not calculate the regression. Obviously, it needs three time points... but why? For example: the standard error cannot be calculated with only two results, and therefore the rest of the equations will not work... or maybe it is related to degrees of freedom? (In the meantime, what I will do is run through the equations by hand. The problem is I’m relying so heavily on the software; in other words, being lazy. At least I’m questioning, though. I’ve been told not to worry about it and just submit the report with “regression requires three data points”.)
July 12, 2019 at 5:34 pm
Hello, please, how can I construct a model of carbon pricing? Thanks in anticipation of a timely response.
July 15, 2019 at 11:12 am
The first step is to do a lot of research to see what others have done. That’ll get you started in the right direction. It’ll help you identify the data you’ll need to collect, variables to include in the model, and the type and form of model that is likely to fit your data. I’ve also written a post about choosing the correct regression model that you should read. That post describes the model fitting process and how to determine which model is the best.
Best of luck with your analysis!
June 25, 2019 at 11:47 am
Hi Jim: I recently read your book Regression Analysis and found it very helpful. It covered a lot of material but I continue to have some questions about basic workflow when conducting regression analysis in social science research. For someone who wants to create an explanatory multiple regression model(s) as part of an observational study in anthropology, what are the basic chronological steps one should follow to analyze the data (eg: choose model type based on type of data collected; create scatterplots between Y and X’s; calculate correlation coefficients; specify model . . .)? I am looking for the basic steps to follow in the order that they should be completed. Once a researcher has identified a research question and collected and stored data in a dataset, what should the step-by-step work flow look like for a regression / model building analysis? Having a basic chronology of steps will help me better organize (and use) the material in your book. Thanks!
June 25, 2019 at 10:03 pm
First, thanks so much for buying my ebook. I’m so happy to hear that it was helpful. You ask a great question, and in my next book I tackle the actual process of performing statistical studies that use the scientific method. For now, I can point you toward a blog post that covers this topic: Five Steps for Conducting Studies with Statistical Analyses
And, because you’re talking about an observational study, I recommend my post about observational studies. It talks about how they’re helpful, what to watch out for, and some tips. Also be sure to read about confounding variables in regression analysis, which starts on page 158 in the book.
Additionally, starting on p. 150 in the ebook, I talk about how to determine which variables to include in the model.
Throughout all of those posts and the ebook, you’ll notice a common theme: you need to do a lot of advance research to figure out what you need to measure and how to measure it. It’s also important to ensure that you don’t accidentally fail to measure a variable and have omitted variable bias affect your results. That’s where all the literature research will be helpful.
Now, in terms of analyzing the data, it’s hard to come up with one general approach. Hopefully, the literature review will tell you what has worked and hasn’t worked for similar studies. For example, maybe you’ll see that similar studies use OLS but need to use a log transformation. It also strongly depends on the nature of your data. The type of dependent variable(s) plays a huge role in what type of model you should use. See page 315 for more about that. It’s really a mix of what type of data you have (particularly the DVs) and what has worked/not worked for similar studies.
Sometimes, even armed with all that advanced knowledge, you’ll go to fit the model with what seems to be the best choice, and it just doesn’t fit your data. Then, you need to go back to the drawing board and try something else. It’s definitely an iterative process. But, looking at what similar studies have done and understanding your data can give you a better chance of starting out with the right type of model. And, then use the tips starting on page 150 to see about the actual process of specifying the model, which is also an iterative process. You might well start out with the correct type of model, but have to go through several iterations to settle on the best form of it.
Best of luck with your study!
June 5, 2019 at 11:50 am
Thank you! Much appreciated!!
June 5, 2019 at 12:24 pm
You’re very welcome, Svend! Because your study uses regression, you might consider buying my ebook about regression. I cover a lot more in it than I do on the blog.
June 4, 2019 at 6:17 am
Hi Jim! Did you notice my question from 28. May…?? Svend
June 4, 2019 at 11:06 am
Hi Svend, Sorry about the delay in replying. Sometimes life gets busy! I will reply to your previous comment right now.
June 2, 2019 at 5:12 am
Thank you so much for such timely responses! They helped clarify a lot of things for me 🙂
May 28, 2019 at 3:53 am
Thank you for a very informative blog! I have a question regarding “overfitting” of a multivariable regression analysis that I have performed: 368 patients (ACL-reconstructed + concomitant cartilage lesions) with 5-year follow-up (FU) after ACL reconstruction. The dependent variable was continuous (PROM). I included 14 independent variables (sex/age/time from surgery, etc., all of which were previously shown to be clinically important for the outcome), including two different types of surgery for the concomitant cartilage injury. No surgery to the concomitant lesions was used as the reference (n=203), debridement (n=70), and microfracture (n=95). My main objective was to investigate the effect on PROMs of those 2 treatments. My initial understanding was that it was OK to include that many independent variables as long as there were 368 patients included/PROMs at FU. But I have had comments that as long as the number of patients for some of the independent variables (e.g., debridement and microfracture) is lower than for the model as a whole, the number of independent variables should be based on the variable with the fewest observations...? I guess my question is: does the lowest number of observations for an independent variable dictate the size of the model/how many predictors you can use? And also the power? Thanks!
June 4, 2019 at 11:23 am
I’m not sure if you’ve read my post about overfitting . If you haven’t, you should read it. It’ll answer some of your questions.
For your specific case, in general, yes, I think you have enough observations. In my blog post, I’m talking mainly about continuous variables. However, if I’m understanding correctly, you’re referring to a categorical variable for reference/debridement? If so, the rules are a bit different but I still think you’re good.
Regression and ANOVA are really the same analysis. So, you can think of your analysis as an ANOVA where you’re comparing groups in your data. And, it’s true that groups with smaller numbers will produce less precise estimates than groups with larger numbers. And, you generally require more observations for categorical variables than you do for continuous variables. However, it appears that your smallest group has n = 70, and that’s a very good sample size. In ANOVA, having more than 15-20 observations per group is usually good from an assumptions point of view (though it might not produce sufficient statistical power, depending on the effect size). So, you’re way over that. If some of your groups had very few observations, you might have needed to worry about the estimates for that variable, but that’s not the case.
And, given your number of observations (368) and the number of model terms requiring estimates overall (14), I don’t see any obvious reason to worry about overfitting on that basis either. Just be sure that you’re counting interaction terms and polynomials in the number of model terms. Additionally, a categorical variable can use more degrees of freedom than a single continuous variable.
In short, I don’t see any reason for concern about overfitting given what you have written. Power depends on the effect size, which I don’t know. However, based on the number of observations/terms in model, I again don’t see an obvious problem.
I hope this helps! Best of luck with your analysis!
May 26, 2019 at 5:25 am
Also, another query. I want to run a multiple regression, but my demographics and one of my IVs weren’t significant in the initial correlation I ran. What variables should I put in my regression test now? Should I skip all those that weren’t significant? Or just the demographics? I have read that if you have literature backing up the relationship, you can run a regression analysis regardless of how it appeared in your preliminary analysis. How true is that? What would be the best approach in this case? It would mean a lot if you could help me out on this one.
May 27, 2019 at 10:25 pm
Hi again Aisha,
Two different answers for you. One, be wary of the correlation results. The problem is, again, the potential for confounding variables. Correlation doesn’t factor in other variables. Confounding variables can mess up the correlation results just like they can bias a regression model, as I explained in my other comment. You have reason to believe that some of your demographic variables won’t be significant until you add your main IVs. So, you should try that and see what happens. Read the post about confounding variables and keep that in mind as you work through this!
And, yes, if you have strong theory or evidence from other studies for including IVs in the model, it’s ok to include them in your model even if it’s not significant. Just explain that in the write up.
For more about that, and model building in general, read my post about specifying the correct model !
May 25, 2019 at 8:13 am
Hi! I can’t believe I didn’t find this blog earlier; it would have saved me a lot of trouble for my research 😀 Anyway, I have a question. Is it possible for your demographic variables to become significant predictors in the final model of a hierarchical regression? I can’t seem to understand why it is the case with mine, when they came out to be nonsignificant in the first model (even in the correlation test when tested earlier) but became significant when I put them with the rest of my (main) IVs. Are there practical reasons for that, or is it poor statistical skills? :-/
May 27, 2019 at 10:19 pm
Thanks for writing with a fantastic question. It really touches on a number of different issues.
Statistics is a funny field. There’s the field of statistics, but then many scientists/researchers in different fields use statistics within their own fields. And, I’ve observed in different fields that there are different terminology and practices for statistical procedures. Often I’ll hear a term for a statistical procedure and at first I won’t know what it is. But, then the person will describe it to me and I’ll know it by another name.
At one point, hierarchical regression was like this for me. I’ve never used it myself, but it appears to be common in social sciences research. The idea is you add variables to the model in several groups, such as the demographic variables in one group, and then some other variables in the next group. There’s usually a logic behind the grouping. The idea is to see how much the model improves with the addition of each group.
I have some issues with this practice, and I think your case illustrates them. The idea behind this method is that each model in the process isn’t as good as the subsequent model, but it’s still a valid comparison. Unfortunately, if you look at a model knowing that you’re leaving out significant predictors, there’s a chance that the model with fewer IVs is biased. This problem occurs more frequently with observational studies, which I believe are more common in the social sciences. It’s the problem of confounding variables. And, what you describe is consistent with there being confounding variables that are not in the model with demographic variables until you add the main IVs. For more details, read my post about how confounding variables that are not in the model can bias your results .
Chances are that some of your main IVs are correlated with one or more demographic variables and the DV. That condition will bias coefficients in your demographic-IV model because that model excludes the confounding variables.
So, that’s the likely practical reason for what you’re observing. Not poor statistical skills! And, I’m not a fan of hierarchical regression for that reason. Perhaps there’s value to it that I’m not understanding. I’ve never used it in practice. But there doesn’t seem to be much to gain by assessing that first (in your case) demographic IV model when it appears to be excluding confounding variables and is, consequently, biased!
However, I know that methodology is common in some fields, so it’s probably best to roll with it! 🙂 But, that’s what I think is happening.
May 19, 2019 at 6:30 am
Hello Jim, I need your help please. I have this question: Can you perform a multiple regression with two independent variables when one of them is constant? For example, I have this data:
Angle (Theta)   Length ratio (%)   Force (kN)
0               1                  52.1
0.174444444     1                  52.9
0.261666667     1                  53.3
0.348888889     1                  55.5
0.436111111     1                  58.1
May 20, 2019 at 2:42 pm
Hi Ibrahim,
Thanks for writing with the good question!
The heart of regression analysis is determining how changes in an independent variable correlate with changes in the dependent variable. However, if an independent variable does not change (i.e., it is constant), there is no way for the analysis to determine how changes in it correlate with changes in the DV. It’s just not possible. So, to answer your question, you can’t perform regression with a constant variable.
I hope this helps!
February 27, 2019 at 6:13 pm
Thank you very much for this awesome site!
February 27, 2019 at 11:01 am
Hello sir, I need to know about regression and ANOVA. Could you help me, please?
February 27, 2019 at 11:46 am
You’re in the right spot! Read through my blog posts and you’ll learn about these topics. Additionally, within a couple of weeks, I’ll be releasing an ebook that’s all about learning regression!
February 20, 2019 at 12:09 pm
Very nice tutorial. I’m reading them all! Are there any articles explaining how the regression model gets trained? Something about gradient descent?
February 11, 2019 at 11:55 am
Thanks a lot for your precious time, sir.
February 11, 2019 at 11:58 am
You’re very welcome! 🙂
February 10, 2019 at 5:05 am
Hey sir, hope you are doing well. This is a really wonderful platform for learning regression. Sir, I have a problem: I’m using cross-sectional data and my dependent variable is continuous. It’s basically MICS data and I’m using OLS, but the problem is that there are some missing observations in some variables, so the sample size is not equal across all the variables. Does OLS still make sense?
February 11, 2019 at 11:40 am
In the normal course of events, yes, when an observation has a missing value in one of the variables, OLS will exclude the entire observation when it fits the model. If observations with missing values are a small portion of your dataset, it’s probably not a problem. You do have to be aware of whether certain types of respondents are more likely to have missing values because that can skew your results. You want the missing values to occur randomly through the observations rather than systematically occurring more frequently in particular types of observations. But, again, if the vast majority of your observations don’t have missing values, OLS can still be a good choice.
Assuming that OLS makes sense for your data, one difficulty with missing values is that there really is no alternative analysis that you can use to handle them. If OLS is appropriate for your data, you’re pretty much stuck with it even if you have problematic missing values. However, there are methods of estimating the missing values so you can use those observations. This process is particularly helpful if the missing values don’t occur randomly (as I describe above). I don’t know which software you are using, but SPSS has a particularly good method for imputing missing values. If you think missing values are a problem for your dataset, you should investigate ways to estimate those missing values, and then use OLS.
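Here’s a quick sketch in Python of the two routes (simulated data; in your case you’d first check whether the missingness looks random):

```python
# Sketch: listwise deletion vs. model-based imputation before OLS.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["y", "x1", "x2"])
df.loc[rng.choice(200, 20, replace=False), "x1"] = np.nan  # inject missing values

# Route 1: listwise deletion (what most software does by default)
ols_drop = smf.ols("y ~ x1 + x2", data=df.dropna()).fit()

# Route 2: estimate the missing values first, then fit OLS on the full sample
filled = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                      columns=df.columns)
ols_imputed = smf.ols("y ~ x1 + x2", data=filled).fit()
print(ols_drop.nobs, ols_imputed.nobs)  # 180.0 vs. 200.0 observations used
```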
January 20, 2019 at 10:33 am
Hi Jim, I was quite excited to see you post this, but then there was no following article, only related subjects.
Binary logistic regression
By Jim Frost
Binary logistic regression models the relationship between a set of predictors and a binary response variable. A binary response has only two possible values, such as win and lose. Use a binary regression model to understand how changes in the predictor values are associated with changes in the probability of an event occurring.
Is the lesson on binary logistic regression to follow, or what am I missing?
Thank you for your time.
Antonio Padua
January 20, 2019 at 1:20 pm
Hi Antonio,
That’s a glossary term. On my blog, glossary terms have a special link. If you hover the pointer over the link, you’ll see a tooltip that displays the glossary term. Or, if you click the link, you go to the glossary term itself. You can also find all the glossary terms by clicking Glossary in the menu across the top of the screen. It seems like you probably clicked the link to get to the glossary term for binary logistic regression.
I’ve had several requests for articles about this topic. So, I’m putting it on my to-do list! Although, it probably won’t be for a number of months. In the meantime, you can read my post where I show an example of binary logistic regression.
Thanks for writing!
November 2, 2018 at 1:24 pm
Thanks so much, your blog is really helpful! I was wondering whether you have some suggestions on published articles that use OLS (nothing fancy, just very plain OLS) and that could be used in class for learning interpreting regression outputs. I’d love to use “real” work and make students see that what they learn is relevant in academia. I mostly find work that is too complicated for someone just starting to learn regression techniques, so any advice would be appreciated!
Thanks, Hanna
October 25, 2018 at 7:52 pm
Hi Jim. Did you write on instrumental variables and the 2SLS method? I am interested in them. Thanks for all the excellent things you do on this site.
October 25, 2018 at 10:29 pm
I haven’t yet, but those might be good topics for the future!
October 23, 2018 at 2:33 pm
Jim. Thank you so much. Especially for such a prompt response! The slopes are coming from IT segment stock valuations over 150 years. The slopes are derived from valuation troughs and peaks. So it is a graph like you’d see for the S&P. Sorry I was not clear on this.
October 23, 2018 at 12:14 pm
Jim, could you recommend a model based on the following:
1. I can see a strong visual correlation between the left side trough and peak and the right side. When the left has a steep vector, so does the right, for example.
2. This does not need to be the case; the left could provide a much steeper slope than the right, or a much shallower one.
3. The parallels intrigue me and I would like to measure if the left slope can be explained by the right to any degree.
4. I am measuring the rise and fall of industry valuations over time. (it is the rise and fall in these valuations over time that create these ~ parallel slopes.
5. My data set since 1886 only provides 6 events, but they are consistent as described.
6. I attempted to correlate the rising slope against the declining one.
October 23, 2018 at 2:04 pm
I’m having a hard time figuring out what you’re describing. I’m not sure what slopes you’re referring to, and I don’t know what you mean by the left versus right slopes.
If you only have 6 data points, you’ll only be able to fit an extremely simple model. You’ll usually need at least 10 data points (absolute minimum but probably more) to even include one independent variable.
If you have two slopes for something and you want to see if one slope explains the other, you could try using linear regression. Use one slope as an independent variable and another as a dependent variable. Slopes would be a continuous variable and so that might work. The underlying data for each slope would have to be independent from data used for other slopes. And, you’ll have to worry about time order effects such as autocorrelation.
October 2, 2018 at 1:37 am
Thank you Jim.
October 2, 2018 at 1:31 am
Hi Jim, I have a doubt regarding which regression analysis is to be conducted. The data set consists of categorical independent variables (ordinal) and one dependent variable which is of continuous type. Moreover, most of the data pertaining to an independent variable is concentrated towards first category (70%). My objective is to capture the factors influencing the dependent variable and its significance. In that case should I consider the ind. variables to be continuous or as categorical? Thanks in advance.
October 2, 2018 at 2:26 am
I think I already answered your question on this. Although, it looks like you’re now saying that you have an ordinal independent variable rather than a categorical variable. Ordinal data can be difficult. I’d still try using linear regression to fit the data.
You have two options that you can try.
1) You can include the ordinal data as continuous data. Doing this assumes that going from 1 to 2 is the same scale change as going from 2 to 3 and so on. Just like with actual continuous data. Although, you can add polynomials and transformations to improve the fit.
2) However, that doesn’t always work. Sometimes ordinal data don’t behave like continuous data. For example, the 2nd place finisher in a race doesn’t necessarily take twice as long as the 1st place finisher. And the difference between 3rd and 2nd isn’t the same as between 1st and 2nd. Etc. In that case, you can include it as a categorical variable. Using this approach, you estimate the mean differences between the different ordinal levels and you don’t have to assume they’ll be the same.
There’s an important caveat about including them as categorical variables. When you include categorical variables, you’re actually using indicator variables. A 5 point Likert scale (ordinal) actually includes 4 indicator variables. If you have many Likert variables, you’re actually including 4 variables for each one. That can be problematic. If you add enough of these variables, it can lead to overfitting . Depending on your software, you might not even see these indicator variables because they code and include them behind the scenes. It’s something to be aware of. If you have many such variables, it’s preferable to include them as continuous variables if possible.
You’ll have to think about whether your data seems more like continuous or categorical data. And, try both methods if you’re not sure. Check the residuals to make sure the model provides a good fit.
Ordinal data can be tricky because they’re not really continuous data nor categorical data–a bit of both! So, you’ll have to experiment and assess how well the different approaches work.
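If you want to experiment, here’s a small sketch in Python showing both options with a made-up 5-point ordinal predictor:

```python
# Sketch: treat an ordinal predictor as continuous vs. categorical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame({"rating": rng.integers(1, 6, size=300)})  # ordinal 1-5
df["y"] = 2.0 * df["rating"] + rng.normal(size=300)

# Option 1: as continuous -- one coefficient, assumes equal spacing of levels
fit_cont = smf.ols("y ~ rating", data=df).fit()

# Option 2: as categorical -- C() expands the 5 levels into 4 indicator variables
fit_cat = smf.ols("y ~ C(rating)", data=df).fit()
print(fit_cont.params, fit_cat.params, sep="\n\n")
```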
Good luck with your analysis!
October 1, 2018 at 2:32 am
Hello Jim, I have a set of data consisting of a dependent variable that is continuous and independent variables that are categorical. The interesting thing I found is that the majority (more than 70%) of the values of an independent variable belong to category 1. The category values range from 1 to 5. I would like to know the appropriate sampling technique to use. Is it appropriate to use linear regression, or should I use other alternatives? Or is any preprocessing of the data required? Please help me with the above.
Thanks in advance Raju.
October 1, 2018 at 9:40 pm
I’d try linear regression first. You can include that categorical variable as the independent variable with no problem. As always, be sure to check the residual plots. You can also use one-way ANOVA, which would be the more usual choice for this type of analysis. But, linear regression and ANOVA are really the same analysis “under the hood.” So, you can go either way.
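If you want to see that equivalence for yourself, here’s a small sketch in Python with made-up groups:

```python
# Sketch: regression with a categorical IV and one-way ANOVA share the same F-test.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"group": np.repeat(list("ABCDE"), 40)})
df["y"] = df["group"].map({"A": 0, "B": 1, "C": 1, "D": 2, "E": 2}) + rng.normal(size=200)

fit = smf.ols("y ~ C(group)", data=df).fit()   # regression with indicator variables
print(sm.stats.anova_lm(fit, typ=2))           # the one-way ANOVA table
print(fit.fvalue, fit.f_pvalue)                # identical overall F-test
```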
September 23, 2018 at 4:28 am
Hello Jim, I’d like to know your suggestions regarding the choice of regression for prediction: the dependent variable is count data but does not follow a Poisson distribution, and the independent variables include categorical and continuous data. I’d appreciate your thoughts on it. Thanks!
September 24, 2018 at 11:08 pm
Hi Sarkhani,
Having count data that don’t follow the Poisson happens fairly often. The top alternatives that I’m aware of are negative binomial regression and zero inflated models. I talk about those options a bit in my post about choosing the correct type of regression analysis . The count data section is near the end. I hope this information points you in the right direction!
August 29, 2018 at 9:38 am
Hi Jim, I’m really happy to find your blog.
August 11, 2018 at 1:42 pm
The independent variable ranges from 0 to 1 and the corresponding dependent variable ranges from 1 to 5. If we apply regression analysis to the above and predict the value of y for any value of x that also ranges from 0 to 1, will the value of y always lie in the range 1 to 5?
August 11, 2018 at 4:18 pm
In my experience, the predicted values can fall outside the range of the actual dependent variable. Assuming that you are referring to actual limits at 1 and 5, the regression analysis does not “understand” that those are hard limits. The extent to which the predicted values fall outside these limits depends on the amount of error in the model.
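Here’s a small simulation in Python that illustrates the point (invented data where the DV saturates at its limits):

```python
# Sketch: an OLS line fit to a DV bounded in [1, 5] can predict outside that range.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 200)
# The DV flattens out near its hard limits of 1 and 5 (an S-shaped relationship)
y = 1 + 4 / (1 + np.exp(-10 * (x - 0.5))) + rng.normal(scale=0.1, size=200)
fit = smf.ols("y ~ x", data=pd.DataFrame({"x": x, "y": y})).fit()

preds = fit.predict(pd.DataFrame({"x": [0.0, 1.0]}))
print(preds.values)  # the straight line predicts below 1 and above 5
```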
August 8, 2018 at 4:18 am
Very good explanation of regression. Thank you, sir, for such a wonderful post!
March 29, 2018 at 11:43 am
Hi Jim, I would like to see you writing something about Cross Validation (Training and test).
February 20, 2018 at 8:30 am
thank you Jim this is helpful
February 21, 2018 at 4:08 pm
You’re very welcome, Lisa! I’m glad you found it to be helpful!
January 21, 2018 at 10:39 am
Hello Jim, I’d like to know your suggestions regarding the choice of regression for predicting the likelihood of participants falling into one of two categories (low fear group coded 1 and high fear coded 2) when looking at scores from several variables (e.g., external other locus of control, external social locus of control, internal locus of control, social phobia, and sleep quality). It was suggested that I break the question up into smaller components. I’d appreciate your thoughts on it. Thanks!
January 22, 2018 at 2:30 pm
Because you have a binary response (dependent variable), you’ll need to use binary logistic regression. I don’t know what types of predictors you have. If they’re continuous, you can just use them in the model and see how it works.
If they’re ordinal data, such as a Likert scale, you can still try using them as predictors in the model. However, ordinal data are less likely to satisfy all the assumptions. Check the residual plots. If including the ordinal data in the model doesn’t work, you can recode them as indicator variables (1s and 0s only, based on whether an observation meets a criterion or not). For example, if you have a scale of -2, -1, 0, 1, 2, you could recode it so observations with a positive score get a 1 while all other scores get a 0.
Those are some ideas to try. Of course, what works best for your case depends on the subject area and types of data that you have.
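To show the recoding idea concretely, here’s a hypothetical sketch in Python (all variable names are invented):

```python
# Sketch: collapse a -2..2 ordinal predictor into a 0/1 indicator, then fit
# binary logistic regression for a two-category fear-group outcome.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"locus": rng.integers(-2, 3, size=300)})   # ordinal -2..2
p = 1 / (1 + np.exp(-0.8 * (df["locus"] > 0)))                # made-up true effect
df["fear_group"] = rng.binomial(1, p)                         # 0 = low, 1 = high fear

df["locus_pos"] = (df["locus"] > 0).astype(int)               # recoded indicator
fit = smf.logit("fear_group ~ locus_pos", data=df).fit(disp=0)
print(fit.params)
```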
January 21, 2018 at 5:04 am
I am using stepwise regression to select significant variables in the model for prediction. How do I interpret BIC in variable selection?
regards, Zishan
January 22, 2018 at 5:36 pm
Hi, when comparing candidate models, you look for models with a lower BIC. A lower BIC indicates that a model is more likely to be the true model. BIC identifies the model that is more likely to have generated the observed data.
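For instance, here’s a quick sketch in Python comparing candidate models by BIC (simulated data):

```python
# Sketch: fit several candidate models and prefer the one with the lowest BIC.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(size=150), "x2": rng.normal(size=150),
                   "noise": rng.normal(size=150)})           # an irrelevant variable
df["y"] = 1 + 2 * df["x1"] + df["x2"] + rng.normal(size=150)

for formula in ["y ~ x1", "y ~ x1 + x2", "y ~ x1 + x2 + noise"]:
    fit = smf.ols(formula, data=df).fit()
    print(f"{formula:22s} BIC = {fit.bic:.1f}")
```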
January 18, 2018 at 2:44 pm
Yes, the language of the topic is very easy; I appreciate it, sir. Could you let me know: if the rank correlation is r = 0.8 and the sum of D squared = 33, how do we calculate/find the number of observations (n)?
January 18, 2018 at 3:00 pm
I’m not sure what you mean by “D” square, but I believe you’ll need more information for that.
January 6, 2018 at 11:08 pm
Hi, Jim! I’m really happy to find your blog. It’s really helpful, especially because you use basic English, so non-native speakers can understand it better than most textbooks. Thanks!
January 7, 2018 at 12:49 am
Hi Dina, you’re welcome! And, thanks so much for your kind words–you made my day!
December 21, 2017 at 12:30 am
Can you write on Logistic regression please!
December 21, 2017 at 12:45 am
Hi! You bet! I plan to write about it in the near future!
December 16, 2017 at 2:33 am
Great work by a great man; it is an easily accessible resource for scholars. Sir, I am going to analyze data. Please send me guidelines for selecting the best simple/multiple linear regression model. Thanks!
December 17, 2017 at 12:21 am
Hi, thank you so much for your kind words. I really appreciate it! I’ve written a blog post that I think is exactly what you need. It’ll help you choose the best regression model .
December 8, 2017 at 8:47 am
such a splendid compilation, Thanks Jim
December 8, 2017 at 11:09 am
December 3, 2017 at 10:00 pm
Would you also throw out some ideas on instrumental variables and the 2SLS method, please?
December 3, 2017 at 10:40 pm
Those are great ideas! I’ll write about them in future posts.
BMC Medical Research Methodology, volume 24, Article number: 217 (2024)
In computer-aided diagnosis (CAD) studies utilizing multireader multicase (MRMC) designs, missing data might occur when there are instances of misinterpretation or oversight by the reader or problems with measurement techniques. Improper handling of these missing data can lead to bias. However, little research has been conducted on addressing the missing data issue within the MRMC framework.
We introduced a novel approach that integrates multiple imputation with MRMC analysis (MI-MRMC). An elaborate simulation study was conducted to compare the efficacy of our proposed approach with that of the traditional complete case analysis strategy within the MRMC design. Furthermore, we applied these approaches to a real MRMC design CAD study on aneurysm detection via head and neck CT angiograms to further validate their practicality.
Compared with traditional complete case analysis, the simulation study demonstrated that the MI-MRMC approach provides an almost unbiased estimate of diagnostic capability, alongside satisfactory performance in terms of statistical power and the type I error rate within the MRMC framework, even in small-sample scenarios. In the real CAD study, the proposed MI-MRMC method further demonstrated strong performance in terms of both point estimates and confidence intervals compared with traditional complete case analysis.
Within MRMC design settings, the adoption of an MI-MRMC approach in the face of missing data can facilitate the attainment of unbiased and robust estimates of diagnostic capability.
The accuracy of imaging diagnostic modalities is shaped by not only the technical specifications of the diagnostic equipment or the algorithms but also the skill set, education, and sensory and cognitive capacities of the interpreting clinicians/readers (e.g., radiologists) [ 1 , 2 , 3 ]. The multireader multicase (MRMC) design, which involves various readers to assess each case, enables the quantification of the impact that reader variability has on the accuracy of imaging diagnostic modalities. As a result, MRMC design studies can enhance the generalizability of study findings and strengthen the overall robustness of the research [ 4 ]. MRMC design is currently needed for the clinical evaluation of computer-aided diagnostic (CAD) devices and imaging diagnostic modalities by regulatory agencies, including the Food and Drug Administration in the United States [ 5 ] and the National Medical Products Administration in China [ 6 , 7 ].
For the analysis of MRMC design data, a lack of independence in reader performance is a critical consideration [ 8 ]. Traditional statistical methods may not be suitable for this complexity. The Dorfman–Berbaum–Metz (DBM) [ 9 ] method and the Obuchowski–Rockette (OR) [ 8 ] method are commonly used approaches to address the intricate correlations present in MRMC studies [ 10 ]. In DBM analysis, to address the lack of independence in readers’ performance, jackknife pseudovalues are computed for each test-reader combination, and a mixed-effects analysis of variance (ANOVA) is subsequently performed on these pseudovalues to carry out significance testing. In OR analysis, the correlations are addressed by adjusting the F statistic to account for the underlying correlation structures.
As with any study, the challenge of missing data is ubiquitous. Missing data can occur in MRMC design studies when there are instances of misreading or omissions by the reader, substandard specimen collection, issues with measurement techniques, errors during the data collection process, or when results exceed threshold values [ 4 , 11 , 12 , 13 ]. Despite this commonality, the majority of MRMC design clinical trials fail to disclose whether they grappled with missing data issues [ 10 ]. Consequently, it remains unclear whether the analytical outcomes were derived from complete or incomplete datasets or if a suitable method for handling missing data was employed. This stands in contrast to the Checklist for Artificial Intelligence in Medical Imaging [ 14 ] and the Standards for Reporting of Diagnostic Accuracy Studies [ 15 ] guidelines, which both explicitly mandate the transparent reporting of missing data and the strategies employed to address them. Within the framework of causal inference, the ambiguity surrounding the status of missing data can introduce uncertainties about the conditions under which results are inferred and may even potentially result in biased estimates [ 16 , 17 ].
Currently, there is limited research on methods specifically designed for handling missing data in MRMC studies. This might explain why missing data are rarely reported in such studies. For those that do address missing data issues, the complete case analysis method is arguably the most commonly used approach. This method involves discarding any case that contains missing data, including all evaluations of that case by all readers [ 18 , 19 ]. This approach typically requires that the type of missing data be missing completely at random (MCAR); otherwise, the results obtained might be biased. Furthermore, the complete case method can lead to further loss of information due to the reduction in sample size, which might also affect the accuracy of the trial results and decrease the statistical power [ 17 ]. Additionally, from the perspective of causal inference, accuracy estimates derived from complete case analyses represent only the subset of the population with complete records, failing to accurately reflect the estimator of the entire target population [ 20 ]. Hence, missing data handling approaches, especially for MRMC designs, are urgently needed.
In 1976, Donald Rubin [ 21 ] introduced the concept of multiple imputation (MI), which involves imputing each missing value multiple times according to a selected imputation model, analyzing the imputed datasets individually, and combining the results on the basis of Rubin’s rules. Thus, MI is able to reflect the uncertainty associated with the data imputation process by increasing the variability of the imputed data. This approach has gained widespread adoption for managing missing data in various research contexts, including drug clinical trials [ 22 ] and observational studies [ 23 ], and addressing verification bias in diagnostic studies [ 24 , 25 ]. However, the implementation of MI within the MRMC design framework remains relatively unexplored. The successful implementation of MI hinges on the congruence between the imputation model and the analysis model, necessitating that the imputation method captures all variables and characteristics pertinent to the analysis model. This requirement ensures unbiased parameter estimates and correctly calculated standard errors [ 26 , 27 ]. The complexity inherent in MRMC designs, however, poses significant challenges to this congruence.
In light of these gaps, in this study, we aim to establish a missing data handling approach that integrates MI theory with MRMC analysis to maximize the use of available data and minimize biases resulting from the exclusion of cases with missing information. We intend to validate the feasibility and suitability of the proposed approach through both simulation studies and a real CAD study. Thus, providing a reliable solution for managing missing data within the MRMC framework enhances the reliability of diagnostic trial outcomes in real-world clinical settings.
The structure of this paper is as follows: First, the approach to address the issue of missing data within MRMC design studies is presented. The specifics of the simulation study are then detailed, including both the setup and the findings obtained. Subsequently, the proposed approach is implemented in a real MRMC design study that includes instances of missing data. The paper concludes with a discussion of the implications of the work and offers practical recommendations.
In this study, a two-test receiver operating characteristic (ROC) paradigm MRMC study design is assumed. Each reader is tasked with interpreting all cases and assigning a confidence-of-disease score that reflects their assessment of the presence of disease. The true disease status of the patients was verified by experienced, independent readers who served as the gold standard. Instances of missing data may occur during the evaluation phase, leading to the absence of interpretation results. We assume that these missing data arise under the MCAR or missing-at-random (MAR) mechanisms. The term ‘test’ will be used to refer to the imaging system, modality, or image processing throughout this article.
In terms of notation, \(X_{ijk}\) represents the confidence-of-disease score assigned to the \(k\)-th case by reader \(j\) on the basis of the \(i\)-th test. The observed data consist of \(X_{ijk}\), with \(i=1,\dots,I\), \(j=1,\dots,J\), and \(k=1,\dots,K\), where \(I\) is the number of diagnostic tests evaluated (two here, for ease of illustration), \(J\) denotes the number of readers, and \(K\) is the total number of cases examined.
Under the complete case (CC) analysis framework, any instance where a single reading record or assessment is missing results in the exclusion of all interpretation results associated with that case. This exclusion applies across all readers and modalities, ensuring that the dataset—referred to as the complete case dataset—comprises only cases with fully observed data.
In this study, DBM [ 9 , 28 , 29 ] analysis was subsequently conducted on the complete case dataset. This analysis method transforms correlated figures of merit (FOM), specifically the area under the ROC curve (AUC), into independent test-reader-case-level jackknife pseudovalues, thereby addressing the complex correlation structure inherent in MRMC data.
The formula for calculating the jackknife pseudovalue is as follows:

\(Y_{ijk} = K\,\hat{\theta}_{ij} - (K-1)\,\hat{\theta}_{ij(k)}\)

where \(Y_{ijk}\) represents the jackknife pseudovalue of the AUC for the \(i\)-th test, \(j\)-th reader, and \(k\)-th case, \(\hat{\theta}_{ij}\) is the AUC estimate derived from all cases for the \(i\)-th test and the \(j\)-th reader, and \(\hat{\theta}_{ij(k)}\) corresponds to the AUC estimate computed excluding the \(k\)-th case. The jackknife pseudovalue of the \(k\)-th case can be viewed as a weighted difference in accuracy. When the FOM is the Wilcoxon AUC, the pseudovalues averaged across the case index are identical to the respective FOM estimates.
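As an illustrative sketch (not the study’s own code), the pseudovalue computation for a single test-reader combination, assuming the Wilcoxon AUC as the FOM, can be written as:

```python
# Sketch: jackknife pseudovalues of the Wilcoxon AUC for one test-reader pair.
import numpy as np

def wilcoxon_auc(scores, truth):
    """Empirical AUC: P(diseased score > nondiseased score), ties counted half."""
    pos, neg = scores[truth == 1], scores[truth == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 50)                # 50 nondiseased, 50 diseased cases
scores = truth * 1.2 + rng.normal(size=100)  # one reader's confidence scores

K = len(scores)
auc_all = wilcoxon_auc(scores, truth)
pseudo = np.empty(K)
for k in range(K):                           # leave out one case at a time
    keep = np.arange(K) != k
    pseudo[k] = K * auc_all - (K - 1) * wilcoxon_auc(scores[keep], truth[keep])

print(auc_all, pseudo.mean())  # for the Wilcoxon AUC, these two values agree
```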
Using \(Y_{ijk}\) as the response, the DBM method for testing the effect of the imaging diagnostic tests can be specified via a three-factor ANOVA, with the test effect treated as a fixed factor and the reader and case effects treated as random factors to account for the variability among different readers, cases, and interactions:

\(Y_{ijk} = \mu + \tau_i + R_j + C_k + (\tau R)_{ij} + (\tau C)_{ik} + (RC)_{jk} + \epsilon_{ijk}\)

where \(\tau_i\) represents the fixed effect attributable to the \(i\)-th imaging test, and \(R_j\) and \(C_k\) are the random effects associated with the \(j\)-th reader and \(k\)-th case, respectively. Interaction terms, represented by multiple symbols in parentheses, are considered random effects. The error term \(\epsilon_{ijk}\) captures the residual variability not explained by the model. The DBM approach assumes that the random effects, including the interaction terms, are mutually independent and follow normal distributions with means of zero.
The DBM F statistic for testing the test effect is based on the conventional mixed model and was later modified by Hillis to ensure that type I error rates remain within acceptable bounds [30].
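Since the original F statistic formulas are not reproduced here, the sketch below is one reading of the DBM procedure with Hillis's modified denominator and degrees of freedom [28, 30]. It assumes the pseudovalues sit in a long-format data frame `pv` with factor columns `test`, `reader`, `case` and numeric column `Y` (names illustrative).

```r
# DBM F test on pseudovalues with Hillis's denominator; with one
# observation per test-reader-case cell, the residual MS plays the
# role of MS(test:reader:case)
dbm_hillis <- function(pv) {
  fit <- aov(Y ~ test + reader + case + test:reader + test:case + reader:case,
             data = pv)
  tab <- summary(fit)[[1]]
  ms <- tab[["Mean Sq"]]
  names(ms) <- trimws(rownames(tab))
  I <- nlevels(pv$test)
  J <- nlevels(pv$reader)
  # Hillis denominator: MS(T:R) + max(MS(T:C) - MS(T:R:C), 0)
  denom <- ms["test:reader"] + max(ms["test:case"] - ms["Residuals"], 0)
  F_stat <- ms["test"] / denom
  # Hillis denominator degrees of freedom
  ddf <- denom^2 / (ms["test:reader"]^2 / ((I - 1) * (J - 1)))
  p <- pf(F_stat, I - 1, ddf, lower.tail = FALSE)
  list(F = unname(F_stat), ddf = unname(ddf), p = unname(p))
}
```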
Consequently, for the complete case dataset, the estimated effect size (the difference in the FOM across tests) and the corresponding statistics are computed as follows, with the subscript CC denoting quantities calculated from the complete case dataset:
Mean-square quantities calculated based on pseudovalues [29]:
Step 1. Imputation
The multiple imputation by chained equations (MICE) algorithm was implemented to impute the missing data [31]. MICE addresses missing data by generating multiple imputations that reflect the posterior predictive distribution \(P(X_{miss} \mid X_{obs})\). This process involves constructing a sequence of prediction models, with the imputation of each variable conditional on the observed and previously imputed values of the other variables. By iteratively producing \(M\) imputed datasets, the MICE approach captures the uncertainty inherent in the imputation process.
In the construction of the abovementioned predictive models for the MICE algorithm within MRMC studies, the typical scarcity of auxiliary variables poses a methodological challenge. To circumvent this limitation, an imputation model is proposed that leverages the intrinsic correlations among different readers’ interpretations. Since these readers assess identical case sets, their interrelated evaluations provide a solid basis for the imputation model. In addition, given that the interpretation ratings by each reader are typically treated as continuous variables, the predictive mean matching method was incorporated to enhance the imputation process [ 32 ]. Moreover, to accommodate potential variations that may arise when readers evaluate cases across different tests and disease statuses, the model is further calibrated using a subset of the data stratified by modality and disease status.
For diseased cases under test 1, let the variable \(X_j\) represent the interpretation results of reader \(j\) (\(j=1,2,\dots,J\)). The observed dataset comprising the results from all readers is denoted \(x_{(0)} = \{X_{1(0)}, \dots, X_{j(0)}, \dots, X_{J(0)}\}\), and \(x_{j(1)}\) represents the missing part of \(X_j\). The imputation of missing data proceeds through the following process:
Create the initial imputation for the missing data: \(x_{1(1)}^{(0)}, x_{2(1)}^{(0)}, \dots, x_{J(1)}^{(0)}\).
In the current iteration \((t+1)\), the imputed values from the previous iteration \((t)\), denoted \(x_{1(1)}^{(t)}, \dots, x_{J(1)}^{(t)}\), are updated variable by variable. For each reader \(j\), the update draws from the conditional predictive distribution given the current completed values of all the other readers' results:
\(x_{j(1)}^{(t+1)} \sim P\left(X_j \mid x_{1}^{(t+1)}, \dots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \dots, x_{J}^{(t)}\right)\)
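This chained update with predictive mean matching is what the mice package provides; a minimal sketch, assuming `wide` is a cases-by-readers data frame of scores for one stratum (one test and one disease status), with NAs for missing interpretations. Object names and the per-stratum call are illustrative.

```r
library(mice)

# Impute one stratum (e.g., diseased cases under test 1) with predictive
# mean matching; each reader's column is imputed from the other readers
impute_stratum <- function(wide, M = 5) {
  mice(wide, m = M, method = "pmm", printFlag = FALSE)
}

# Example usage: extract the first completed dataset
# imp <- impute_stratum(wide_test1_diseased)
# completed_1 <- complete(imp, 1)
```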
For the analysis of the \(M\) imputed datasets, the DBM method was again utilized for comparability. Thus, for the \(m\)-th imputed dataset (\(m=1,2,\dots,M\)), the estimated effect size and the corresponding statistics are as follows, with the mean-square quantities calculated as in Eq. (7):
After the analysis of the individual imputed datasets, the estimates obtained from each imputed dataset are combined according to Rubin's rules [31].
The point estimate of \(\theta\) derived from multiple imputation is the mean of the point estimates \(\widehat{\theta}_m\) obtained from the \(M\) imputed datasets (\(m=1,2,\dots,M\)):
\(\widehat{\theta}_{MI} = \frac{1}{M}\sum_{m=1}^{M}\widehat{\theta}_m\)
The total variance of the parameter estimate \(\theta\) comprises two components: the between-imputation variance (\(V_B\)), which captures the variability among the estimates from the different imputed datasets, and the within-imputation variance (\(V_W\)), which is determined by each individual imputed dataset itself.
Within-imputation variance:
\(V_W = \frac{1}{M}\sum_{m=1}^{M}\widehat{V}_m\), where \(\widehat{V}_m\) is the variance estimate from the \(m\)-th imputed dataset.
Between-imputation variance:
\(V_B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\widehat{\theta}_m - \widehat{\theta}_{MI}\right)^2\)
Total variance:
\(V_T = V_W + \left(1 + \frac{1}{M}\right)V_B\)
The pooled standard error:
\(SE_{pooled} = \sqrt{V_T}\)
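Rubin's pooling rules are straightforward to implement; a minimal R sketch, assuming `theta` and `V` are length-\(M\) numeric vectors of the per-imputation effect estimates and their variance estimates (names illustrative):

```r
# Pool M per-dataset estimates and variances via Rubin's rules
pool_rubin <- function(theta, V) {
  M <- length(theta)
  theta_bar <- mean(theta)                       # pooled point estimate
  V_W <- mean(V)                                 # within-imputation variance
  V_B <- sum((theta - theta_bar)^2) / (M - 1)    # between-imputation variance
  V_T <- V_W + (1 + 1 / M) * V_B                 # total variance
  list(estimate = theta_bar, se = sqrt(V_T), V_W = V_W, V_B = V_B, V_T = V_T)
}
```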
Wald statistic for MI-MRMC
The Wald statistic is constructed by dividing the estimated effect size by its pooled standard error; under the null hypothesis, this ratio follows a t-distribution:
\(t = \widehat{\theta}_{MI} / SE_{pooled}\)
Degrees of freedom for MI-MRMC
It is proposed that the degrees of freedom for statistical inference should reflect the uncertainty from both the MRMC process and the MI process. To achieve this, the average degrees of freedom across the \(M\) imputed datasets is used as a proxy for the degrees of freedom attributable to the MRMC phase. This average is then combined with the degrees of freedom prescribed by the multiple imputation procedure, in accordance with the principles outlined in Rubin's rules as refined by Barnard and Rubin [33]. The composite degrees of freedom are then used to conduct the statistical tests, ensuring that the final inferences are sensitive to the complexities and uncertainties inherent in both the MRMC and MI processes.
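A sketch of the composite degrees of freedom, following the Barnard-Rubin small-sample formula as commonly implemented (for example, in mice::pool), with the complete-data degrees of freedom `nu_com` taken here as the average Hillis ddf over the \(M\) imputed datasets; variable names are illustrative, and `V_B`, `V_W`, `V_T` are the pooled quantities defined above.

```r
# Barnard-Rubin adjusted degrees of freedom
df_barnard_rubin <- function(M, V_B, V_T, nu_com) {
  lambda <- (1 + 1 / M) * V_B / V_T      # fraction of missing information
  nu_rubin <- (M - 1) / lambda^2         # Rubin (1987) degrees of freedom
  nu_obs <- nu_com * (nu_com + 1) / (nu_com + 3) * (1 - lambda)
  1 / (1 / nu_rubin + 1 / nu_obs)        # composite (adjusted) df
}
```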
Confidence interval for MI-MRMC
The confidence interval can then be obtained as
\(\widehat{\theta}_{MI} \pm t_{df,\,0.975} \times SE_{pooled}\)
where \(t_{df,\,0.975}\) is the 97.5th percentile of the t-distribution with the composite degrees of freedom derived above.
Original complete dataset generation
The original complete datasets were generated from the Roe and Metz model [34], which rests on a binormal distribution framework. In this simulation, it was assumed that all Monte Carlo simulation readers evaluated all cases across two imaging modalities and assigned a confidence-of-disease score for each interpretation.
Let \(X_{ijkt}\) represent the confidence-of-disease score of the Roe and Metz model for test \(i\) (\(i=1,\dots,I\)), reader \(j\) (\(j=1,2,\dots,J\)), case \(k\) (\(k=1,2,\dots,K\)), and truth state \(t\) (\(t=0\) for a nondiseased case image, \(t=1\) for a diseased case image):
\(X_{ijkt} = \mu_t + \tau_{it} + R_{jt} + C_{kt} + (\tau R)_{ijt} + (\tau C)_{ikt} + (RC)_{jkt} + \epsilon_{ijkt}\)
where \(\mu_t\) is 0 for nondiseased cases, \(\tau_{it}\) is the fixed effect of each modality, and the remaining terms are random effects that are mutually independent and normally distributed with zero means. The test×reader×case random effect was absorbed into the error term, as these two effects are inseparable without repeated readings.
To simplify, the variances of the random effects were assumed to be identical for nondiseased and diseased cases.
Thus, for nondiseased cases,
\(X_{ijk0} \sim N\left(0,\; \sigma_{R}^{2} + \sigma_{C}^{2} + \sigma_{\tau R}^{2} + \sigma_{\tau C}^{2} + \sigma_{RC}^{2} + \sigma_{\epsilon}^{2}\right)\)
For diseased cases,
\(X_{ijk1} \sim N\left(\mu_1 + \tau_{i1},\; \sigma_{R}^{2} + \sigma_{C}^{2} + \sigma_{\tau R}^{2} + \sigma_{\tau C}^{2} + \sigma_{RC}^{2} + \sigma_{\epsilon}^{2}\right)\)
In the context of hypothesis testing, under the null hypothesis it is assumed that both \(\tau_{A0}\) and \(\tau_{B0}\) equal zero and that \(\tau_{A1}\) equals \(\tau_{B1}\). Conversely, under the alternative hypothesis, \(\tau_{A0}\) and \(\tau_{B0}\) remain zero while \(\tau_{A1}\) and \(\tau_{B1}\) differ, indicating a difference in diagnostic ability between the tests.
The within-reader correlation \(\rho_{WR}\) and between-reader correlation \(\rho_{BR}\) were also specified to define different correlation structure settings [35].
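A sketch of generating one original complete dataset from this model follows. The variance components and fixed effects are illustrative placeholders for the Table 1 settings, not values taken from the paper; `tau[, t+1]` holds the two tests' fixed effects for truth state \(t\).

```r
# Simulate one Roe-Metz dataset: two tests, J readers,
# K0 nondiseased and K1 diseased cases
simulate_rm <- function(J = 5, K0 = 50, K1 = 50, mu1 = 1.5,
                        tau = matrix(0, nrow = 2, ncol = 2),
                        var_R = 0.011, var_C = 0.1, var_TR = 0.011,
                        var_TC = 0.1, var_RC = 0.2, var_E = 0.6) {
  gen_truth <- function(K, mu_t, tau_t) {
    R  <- rnorm(J, 0, sqrt(var_R))                      # reader effects
    C  <- rnorm(K, 0, sqrt(var_C))                      # case effects
    TR <- matrix(rnorm(2 * J, 0, sqrt(var_TR)), 2, J)   # test x reader
    TC <- matrix(rnorm(2 * K, 0, sqrt(var_TC)), 2, K)   # test x case
    RC <- matrix(rnorm(J * K, 0, sqrt(var_RC)), J, K)   # reader x case
    out <- expand.grid(test = 1:2, reader = 1:J, case = 1:K)
    eps <- rnorm(nrow(out), 0, sqrt(var_E))  # error (incl. test x reader x case)
    out$score <- mu_t + tau_t[out$test] + R[out$reader] + C[out$case] +
      TR[cbind(out$test, out$reader)] + TC[cbind(out$test, out$case)] +
      RC[cbind(out$reader, out$case)] + eps
    out
  }
  nondiseased <- gen_truth(K0, 0,   tau[, 1]); nondiseased$truth <- 0
  diseased    <- gen_truth(K1, mu1, tau[, 2]); diseased$truth    <- 1
  diseased$case <- diseased$case + K0   # distinct case IDs across truth states
  rbind(nondiseased, diseased)
}
```

Under the alternative hypothesis, one would set unequal values in `tau[, 2]` while keeping `tau[, 1]` at zero, mirroring the hypothesis settings above.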
Introducing missingness
The simulation study employed two missing data mechanisms, MCAR and MAR, to evaluate their impact on the analytical results.
Under the MAR mechanism, the probability of a missing observation is related to observable variables, specifically the reader and the test. The missingness indicator \(R_{ijk}\), which denotes whether the interpretation by reader \(j\) for case \(k\) under test \(i\) is observed (\(R_{ijk}=0\)) or missing (\(R_{ijk}=1\)), is modeled via logistic regression:
\(\text{logit}\left\{P\left(R_{ijk}=1\right)\right\} = \gamma_0 + \gamma_1 \cdot \text{reader}_j + \gamma_2 \cdot \text{test}_i\)
The parameters \(\gamma_1\) and \(\gamma_2\) represent the effects of the reader and the test, respectively, on the log-odds of an observation being missing. Specifically, \(\gamma_1\) was set to −0.1 and \(\gamma_2\) to 0.15, and the intercept \(\gamma_0\) was varied to achieve different missing rates.
Conversely, setting \(\gamma_1=\gamma_2=0\) yields the MCAR mechanism, in which missingness is independent of the observed data. In this case, the missingness indicator \(R_{ijk}\) is determined solely by the intercept \(\gamma_0\):
\(\text{logit}\left\{P\left(R_{ijk}=1\right)\right\} = \gamma_0\)
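A sketch of introducing missingness under this model, chaining naturally after `simulate_rm` above; the use of the raw reader and test indices as covariates is one plausible coding and an assumption here, not confirmed by the paper.

```r
# Blank out scores according to the logistic missingness model;
# gamma1 = gamma2 = 0 reduces this to MCAR
add_missingness <- function(data, gamma0, gamma1 = -0.1, gamma2 = 0.15) {
  p_miss <- plogis(gamma0 + gamma1 * data$reader + gamma2 * data$test)
  miss <- rbinom(nrow(data), size = 1, prob = p_miss) == 1   # R_ijk = 1
  data$score[miss] <- NA
  data
}
```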
The simulation scenarios are detailed in Table 1 and Supplementary Table S1 and are primarily based on the settings established by Roe and Metz [34]. A total of 1728 scenarios were considered, and under each scenario, 1000 simulations were conducted to mitigate sampling bias.
Evaluation of analysis approaches
In this simulation study, datasets incorporating instances of missing data were analyzed via the MI-MRMC approach as well as the CC approach to estimate the parameter of interest, namely, the difference in the ROC AUC between the two diagnostic tests. For comparison, DBM analysis was also conducted on the original complete datasets, which were free of any missing data; this approach is referred to as 'original' hereafter.
The following metrics were calculated to compare the approaches in terms of statistical performance, point estimation accuracy, and confidence interval coverage: (1) type I error rate (under null hypothesis settings); (2) power (under alternative hypothesis settings); (3) root mean squared error (RMSE); (4) bias; (5) 95% confidence interval coverage rate; and (6) confidence interval width.
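A sketch of how these metrics could be summarized over simulation replicates, assuming `res` is a data frame with one row per replicate and columns `est` (effect estimate), `p` (p-value), `lo` and `hi` (95% CI bounds), and `true_delta` is the true AUC difference (all names illustrative).

```r
# Performance metrics across replicates: the rejection rate is the type I
# error rate under the null and the power under the alternative
summarise_performance <- function(res, true_delta, alpha = 0.05) {
  c(rejection_rate = mean(res$p < alpha),
    rmse     = sqrt(mean((res$est - true_delta)^2)),
    bias     = mean(res$est - true_delta),
    coverage = mean(res$lo <= true_delta & res$hi >= true_delta),
    ci_width = mean(res$hi - res$lo))
}
```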
All simulation computations were executed in R (version 4.1.2) [36].
Data Source
The proposed analysis approach was applied to a real ROC-paradigm MRMC CAD study. The study was conducted at the Affiliated Hospital of HeBei University, the China-Japan Union Hospital of Jilin University, and Peking University People's Hospital, with ethics approval obtained from the ethics committees of these hospitals.
Study Design
This study evaluated the efficacy of aneurysm detection with and without the assistance of a deep learning model in the context of head and neck CT angiograms. A total of 280 subjects were included, 135 of whom had at least one aneurysm. Ten qualified radiologists interpreted all the images and rated each image, where 0 indicated a definitive absence of an aneurysm and 10 indicated a definitive presence.
Out of the 5,600 interpretations (280 subjects × 10 radiologists × 2 tests), there were 17 instances of missing data: twelve were due to radiologists not evaluating some images, and five were attributable to failures in generating reconstructed images. The overall missing data rate was 0.30%. The MI-MRMC and CC approaches were applied to handle the missing data. To establish a benchmark for evaluating the analysis, an 'original complete dataset' was also created, in which the previously missing interpretations were re-interpreted by the original radiologists; a DBM analysis was then conducted on this dataset.
Figure 1 displays the mean type I error rates under the null hypothesis setting for the MI-MRMC, CC, and original approaches by factor level for each of the simulation study factors across various scenarios, differentiated by sample size. The MI-MRMC approach exhibits a relatively lower type I error rate than the original complete datasets, whereas the CC approach demonstrates context-dependent performance. Despite these slight variations, both the CC and MI-MRMC approaches yield generally comparable results under the MAR and MCAR conditions. Specifically, the type I error rates of both approaches closely align with those observed for the original complete datasets and approximate the nominal significance level of 0.05.
Mean type I error performance under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset
The statistical power under the alternative hypothesis (Fig. 2 ) reveals that the MI-MRMC approach maintains strong performance in terms of power. Notably, for this approach, any reduction in power is slight, even as the rate of missing data increases. In contrast, the CC approach results in a significant decrease in power, which is exacerbated by increases in both the missing data rate and the total number of readers. Furthermore, for both approaches, a decrease in the AUC is associated with a reduction in statistical power. Performance comparisons across different settings of variance components show that the outcomes are broadly similar, indicating that the statistical power of these approaches is relatively consistent regardless of variance component configurations.
Mean power performance under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset
The mean RMSE values are detailed in Fig. 3 and Supplementary Figure S1 . For all the considered scenarios, whether under MAR or MCAR conditions, the RMSE associated with the CC approach is greater than that associated with the MI-MRMC approach. Moreover, the RMSE for the CC approach increases significantly as the sample size diminishes and the rates of missing data increase.
Mean RMSE performance under different scenarios for the original, CC analysis and MI-MRMC approaches. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset, NA: not applicable
Similarly, in line with the RMSE findings, in comparison with the MI-MRMC approach, the bias is greater when the CC approach is employed, particularly in conditions of limited sample size, elevated missing rates, and lower AUC values (Fig. 4 , Supplementary Figure S2 ).
Mean bias performance under different scenarios for the original, CC analysis and MI-MRMC approaches. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset, NA: not applicable
The 95% confidence interval coverage rate, as shown in Fig. 5 and Supplementary Figure S3 , is consistent across all the scenarios for these three approaches, closely approximating the ideal 95%.
Mean 95% confidence interval coverage rate performance under different scenarios for the original, CC analysis and MI-MRMC approaches. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset, NA: not applicable
Regarding the width of the confidence interval, for all approaches, scenarios with smaller sample sizes, lower AUC settings, and higher missing rates are associated with wider confidence intervals. Compared with the performance of original complete datasets, the MI-MRMC approach shows a modest increase in confidence interval width. The CC approach, however, results in even wider confidence intervals, particularly under higher missing rates, as illustrated in Supplementary Figure S4 .
The detailed simulation results can be found in Supplementary Table S2 .
Table 2 summarizes the results of the CAD study. All methods indicate a significant difference in the AUC when comparing scenarios with and without the use of the deep learning model for aneurysm detection by head and neck CT angiograms. Notably, the proposed MI-MRMC approach produces results that align more closely with those from the original complete datasets—more so than the CC approach—regarding point estimates, confidence intervals, and P values.
In this study, an MI-MRMC approach was developed to handle missing data issues within MRMC design CAD studies. To assess the feasibility and suitability of this approach, we conducted a simulation study comparing its performance against that of the CC approach across 1728 scenarios. Additionally, we implemented a real-world CAD study to evaluate the performance of the MI-MRMC approach under actual clinical conditions.
Our findings reveal that, with respect to point estimation, the CC approach performs marginally worse than the MI-MRMC and original approaches, yielding slightly elevated bias and RMSE. The CC approach also produces substantially wider confidence intervals, which leads to markedly reduced statistical power relative to both the MI-MRMC and original approaches. This disparity in power becomes more pronounced as the rate of missing data and the size of the reader sample increase. These findings underscore the potential for inherent bias and highlight the inadequacy of the CC approach in effectively managing missing data within MRMC settings. Our results align with observations from other diagnostic test trials beyond MRMC studies, where the limitations of complete case analysis have been similarly noted [37, 38, 39, 40]. Notably, Newman has labeled complete case analysis 'suboptimal, potentially unethical, and totally unnecessary', noting that even minimal missing data can reduce study power and bias results, making findings applicable only to those with complete data [41]. Despite these identified limitations and the critique from the broader research community, the CC approach remains the most commonly employed approach in CAD studies. This prevalent use, juxtaposed with the method's recognized deficiencies, underscores the need for a paradigm shift toward more robust and reliable methods for handling missing data in MRMC designs.
In contrast, the MI-MRMC approach consistently demonstrates strong statistical power while maintaining the type I error rate close to the nominal 5% level. This is complemented by superior performance metrics, including low RMSE, minimal bias, and accurate 95% confidence interval coverage. These favorable outcomes persist across various conditions, encompassing different missing data mechanisms, diverse sample sizes of cases and readers, a range of missing rates, and various variance structures. Regarding confidence interval width, our findings indicate that MI-MRMC tends to produce slightly wider confidence intervals than the original complete dataset. This observation aligns with previous literature on MI, which suggests that the wider MI confidence intervals reflect a realistic accounting of the uncertainty introduced by missing data and the subsequent imputation process [42]. MI-MRMC shows relatively wider confidence intervals, and hence a comparatively lower type I error rate, particularly in scenarios with low correlation structures (LL, LH) or limited reader sample sizes. Nevertheless, even under these conditions, MI-MRMC maintains strong statistical power compared with the traditional CC approach.
When deriving the degrees of freedom for the MI-MRMC, we adopted the methodology proposed by Barnard and Rubin [33] over the framework suggested by Rubin in 1987 [31]. This decision is informed by the unique characteristics of MRMC studies, which typically feature a modest proportion of missing data and in which individual observations, such as a single confidence-of-disease score, exert limited influence on the endpoint, specifically the AUC. This results in minimal between-imputation variance. In this context, Rubin's 1987 method may inflate the degrees of freedom relative to those derived from the original complete data, potentially skewing significance testing toward optimism. Conversely, the approach of Barnard and Rubin [33], which accounts for the degrees of freedom from both the observed datasets and the imputation phase, offers a more accurate estimation. It enables the integration of the degrees of freedom inherent to the MRMC phase, as refined through Hillis's contributions [28], ensuring a balanced and precise evaluation of statistical significance.
In our exploratory studies, the joint model algorithm was also evaluated, yielding results comparable to those obtained with the MICE algorithm. Given that the joint model algorithm requires a stringent assumption of multivariate normality [43], the MICE algorithm was selected. Regarding the optimal number of imputations, a preliminary simulation study was conducted using ten imputations. The results indicated only marginal gains in precision beyond five imputations, consistent with the recommendations of Little and Rubin [17]. Consequently, five imputations were deemed sufficient for this investigation. Future studies may explore the impact of varying the number of imputations, taking into account real-world application scenarios and computational constraints.
Multiple imputation, which originated in the 1970s [21], addresses the uncertainty associated with missing data by generating multiple imputed datasets. Since its inception, MI has gained widespread acceptance across various fields, including survey research [44], clinical trials [22, 45], and observational studies [23]. Specifically, in the realm of diagnostic testing, MI has been explored as a solution for mitigating verification bias caused by missing gold standard data [24, 25], as well as for handling missing data in index tests in non-MRMC designs [37, 46, 47]. Through comprehensive simulations and practical diagnostic trials, MI has proven highly effective in these areas [48], establishing itself as a key technique for addressing missing data challenges. Consistent with these prior findings, our integration of MI within the MRMC framework further underscores the robustness of MI theory, showing compelling statistical performance even within the complex correlated data structures characteristic of MRMC designs and contributing to the expanding evidence of MI's potential to strengthen research methodologies in scenarios plagued by missing data.
Furthermore, the MI-MRMC estimate corresponds to the randomized/enrolled population, aligning with the ICH E9(R1) framework [49] and principles of causal inference [50]. In contrast, the CC approach violates the randomization principle and may introduce selection bias through the deletion of cases with missing data. Thus, MI-MRMC can serve as an actionable sensitivity analysis approach when missing data occur in real clinical settings.
It is important to acknowledge the limitations of this research. First, in our real-case study, the original complete dataset relied on ad hoc re-interpretation, which may introduce biases such as inter-reader variance. However, finding a balance between representing the actual missing data scenario and maintaining dataset integrity has proven challenging. Second, our simulation study, while covering 1728 scenarios, may not fully replicate real-world conditions; for example, there may be situations in which variances differ across tests and truth statuses [51]. Future research should therefore apply our approach to more sophisticated scenarios to further evaluate its efficacy. Finally, our investigation focused solely on the MCAR and MAR mechanisms, given that missing-not-at-random occurrences are infrequent in CAD studies. To increase robustness, future studies could incorporate other sensitivity analysis methods, such as the tipping point approach, alongside our proposed MI-MRMC framework [13].
In conclusion, this study is the first to address the critical yet often overlooked issue of missing data in MRMC designs. The proposed MI-MRMC approach addresses this issue through multiple imputation, thereby producing estimates that are representative of the randomized/enrolled population. By comparing the traditional CC approach with the MI-MRMC approach in both simulation studies and a real-world application, the substantial benefits of MI-MRMC are highlighted, particularly in enhancing accuracy and statistical power while maintaining good control of the type I error rate in the presence of missing data. Consequently, this method offers an effective solution for managing missing data in MRMC designs and can serve as a sensitivity analysis approach in real clinical environments, paving the way for more robust and reliable research outcomes.
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Gallas BD, Chan HP, D’Orsi CJ, Dodd LE, Giger ML, Gur D, Krupinski EA, Metz CE, Myers KJ, Obuchowski NA, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. 2012;19(4):463–77.
Wagner RF, Metz CE, Campbell G. Assessment of medical imaging systems and computer aids: a tutorial review. Acad Radiol. 2007;14(6):723–48.
Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Findings from a national sample. Arch Intern Med. 1996;156(2):209–13.
Yu T, Li Q, Gray G, Yue LQ. Statistical innovations in diagnostic device evaluation. J Biopharm Stat. 2016;26(6):1067–77.
Clinical Performance Assessment: Considerations for Computer-Assisted Detection Devices Applied to Radiology Images and Radiology Device Data in Premarket Notification (510(k)) Submissions: Guidance for Industry and Food and Drug Administration Staff [ https://www.fda.gov/media/77642/download ].
Guiding Principles for Technical Review of Breast X-ray System Registration. [ https://www.cmde.org.cn//flfg/zdyz/zdyzwbk/20210701103258337.html ].
Key Points for Review of Medical Device Software Assisted by Deep Learning. [ https://www.cmde.org.cn//xwdt/zxyw/20190628151300923.html ].
Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol. 1995;2(Suppl 1):S22–29; discussion S57–64, S70–71.
Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis. Generalization to the population of readers and patients with the jackknife method. Invest Radiol. 1992;27(9):723–31.
Wang L, Wang H, Xia C, Wang Y, Tang Q, Li J, Zhou XH. Toward standardized premarket evaluation of computer aided diagnosis/detection products: insights from FDA-approved products. Expert Rev Med Devices. 2020;17(9):899–918.
Obuchowski NA, Bullen J. Multireader Diagnostic Accuracy Imaging studies: fundamentals of Design and Analysis. Radiology. 2022;303(1):26–34.
Campbell G, Pennello G, Yue L. Missing data in the regulation of medical devices. J Biopharm Stat. 2011;21(2):180–95.
Campbell G, Yue LQ. Statistical innovations in the medical device world sparked by the FDA. J Biopharm Stat. 2016;26(1):3–16.
Mongan J, Moy L, Kahn CE Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029.
Cohen JF, Korevaar DA, Altman DG, Bruns DE, Gatsonis CA, Hooft L, Irwig L, Levine D, Reitsma JB, de Vet HC, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799.
Stahlmann K, Reitsma JB, Zapf A. Missing values and inconclusive results in diagnostic studies - a scoping review of methods. Stat Methods Med Res. 2023;32(9):1842–55.
Little RJA, Rubin DB. Statistical Analysis with Missing Data, 3rd Edition. John Wiley & Sons; 2020.
Schuetz GM, Schlattmann P, Dewey M. Use of 3x2 tables with an intention to diagnose approach to assess clinical performance of diagnostic tests: meta-analytical evaluation of coronary CT angiography studies. BMJ. 2012;345:e6717.
Shinkins B, Thompson M, Mallett S, Perera R. Diagnostic accuracy studies: how to report and analyse inconclusive test results. BMJ. 2013;346:f2778.
Mitroiu M, Oude Rengerink K, Teerenstra S, Petavy F, Roes KCB. A narrative review of estimands in drug development and regulatory evaluation: old wine in new barrels? Trials. 2020;21(1):671.
Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–92.
Jakobsen JC, Gluud C, Wetterslev J, Winkel P. When and how should multiple imputation be used for handling missing data in randomised clinical trials - a practical guide with flowcharts. BMC Med Res Methodol. 2017;17(1):162.
Pedersen AB, Mikkelsen EM, Cronin-Fenton D, Kristensen NR, Pham TM, Pedersen L, Petersen I. Missing data and multiple imputation in clinical epidemiological research. Clin Epidemiol. 2017;9:157–66.
Harel O, Zhou XH. Multiple imputation for correcting verification bias. Stat Med. 2006;25(22):3769–86.
Harel O, Zhou XH. Multiple imputation for the comparison of two screening tests in two-phase Alzheimer studies. Stat Med. 2007;26(11):2370–88.
Meng XL. Multiple-imputation inferences with uncongenial sources of input. Stat Sci. 1994;9(4):538–73.
Bartlett JW, Seaman SR, White IR, Carpenter JR, Alzheimer's Disease Neuroimaging Initiative. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87.
Hillis SL, Berbaum KS, Metz CE. Recent developments in the Dorfman-Berbaum-Metz procedure for multireader ROC study analysis. Acad Radiol. 2008;15(5):647–61.
Chakraborty DP. Observer performance methods for diagnostic imaging: foundations, modeling, and applications with r-based examples. 1st edition. Boca Raton: CRC Press; 2017.
Hillis SL. A comparison of denominator degrees of freedom methods for multiple observer ROC analysis. Stat Med. 2007;26(3):596–619.
Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
Landerman LR, Land KC, Pieper CF. An empirical evaluation of the predictive mean matching method for imputing missing values. Sociol Methods Res. 1997;26(1):3–33.
Barnard J, Rubin DB. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86(4):948–55.
Roe CA, Metz CE. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol. 1997;4(4):298–303.
Hillis SL. Relationship between Roe and Metz simulation model for multireader diagnostic data and Obuchowski-Rockette model parameters. Stat Med. 2018;37(13):2067–93. https://doi.org/10.1002/sim.7616 .
R Core Team. R: A Language and Environment for Statistical Computing [ https://www.R-project.org/ ].
Gad AM, Ali AA, Mohamed RH. A multiple imputation approach to evaluate the accuracy of diagnostic tests in presence of missing values. Commun Math Biol Neurosci. 2022;21:1–19.
Kohn MA, Carpenter CR, Newman TB. Understanding the direction of bias in studies of diagnostic test accuracy. Acad Emerg Med. 2013;20(11):1194–206.
Whiting PF, Rutjes AW, Westwood ME, Mallett S, Group Q-S. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol. 2013;66(10):1093–104.
Van der Heijden GJ, Donders ART, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol. 2006;59(10):1102–9.
Newman DA. Missing data: five practical guidelines. Organizational Res Methods. 2014;17(4):372–411.
van Buuren S. Flexible imputation of missing data. Boca Raton, FL: CRC Press; 2012.
Hickey GL, Philipson P, Jorgensen A, Kolamunnage-Dona R. Joint modelling of time-to-event and multivariate longitudinal outcomes: recent developments and issues. BMC Med Res Methodol. 2016;16(1):117.
He Y, Zaslavsky AM, Landrum MB, Harrington DP, Catalano P. Multiple imputation in a large-scale complex survey: a practical guide. Stat Methods Med Res. 2010;19(6):653–70.
Barnes SA, Lindborg SR, Seaman JW Jr. Multiple imputation techniques in small sample clinical trials. Stat Med. 2006;25(2):233–45.
Long Q, Zhang X, Hsu CH. Nonparametric multiple imputation for receiver operating characteristics analysis when some biomarker values are missing at random. Stat Med. 2011;30(26):3149–61.
Cheng W, Tang N. Smoothed empirical likelihood inference for ROC curve in the presence of missing biomarker values. Biom J. 2020;62(4):1038–59.
Karakaya J, Karabulut E, Yucel RM. Sensitivity to imputation models and assumptions in receiver operating characteristic analysis with incomplete data. J Stat Comput Simul. 2015;85(17):3498–511.
FDA. E9(R1) Statistical Principles for Clinical Trials: Addendum: Estimands and Sensitivity Analysis in Clinical Trials. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9r1-statistical-principles-clinicaltrials-addendum-estimands-and-sensitivity-analysis-clinical . Accessed 5 Sep 2024.
Westreich D, Edwards JK, Cole SR, Platt RW, Mumford SL, Schisterman EF. Imputation approaches for potential outcomes in causal inference. Int J Epidemiol. 2015;44(5):1731–7.
Hillis SL. Simulation of unequal-variance binormal multireader ROC decision data: an extension of the Roe and Metz simulation model. Acad Radiol. 2012;19(12):1518–28.
We would like to express our sincere gratitude to Prof. Stephen L. Hillis, Prof. Dev P. Chakraborty, and Mr. Ning Li for their invaluable support and guidance throughout the development of this paper. We extend our gratitude to Shanghai United Imaging Intelligence Co., Ltd., for sponsoring the real example study and sharing the data. We also acknowledge the valuable support from the investigators of the real example study: Xiaoping Yin and Jianing Wang from the Affiliated Hospital of HeBei University; Lin Liu and Zhanhao Mo from the China-Japan Union Hospital of Jilin University; and Nan Hong and Lei Chen from Peking University People’s Hospital.
This study was supported by grants from the Shanghai Municipal Health Commission Special Research Project in Emerging Interdisciplinary Fields (2022JC011) and the Shanghai Science and Technology Development Funds (22QA1411400).
Zhemin Pan and Yingyi Qin contributed equally to this work.
Tongji University School of Medicine, 1239 Siping Road, Yangpu District, Shanghai, 200092, China
Zhemin Pan, Wangyang Bai & Jia He
Department of Military Health Statistics, Naval Medical University, 800 Xiangyin Road, Yangpu District, Shanghai, 200433, China
Yingyi Qin, Qian He & Jia He
Department of Radiology, the Affiliated Hospital of Hebei University, 212 Eastern Yuhua Road, Baoding City, Hebei Province, 071000, China
Xiaoping Yin
ZM Pan and YY Qin contributed equally to this work. ZM Pan and YY Qin designed the simulation and wrote the main manuscript text. WY Bai prepared the figures and tables. Q He conducted the analysis of the real example. XP Yin provided substantial contributions during the revisions. J He provided critical input to the manuscript. All authors reviewed the manuscript and approved the final version of this paper.
Correspondence to Jia He.
Ethics approval and consent to participate
The case study received ethics approval from the Ethics Committee of Peking University People's Hospital (9 November 2022), the Ethics Committee of China-Japan Union Hospital of Jilin University (25 November 2022), and the Ethics Committee of the Affiliated Hospital of HeBei University (26 December 2022). Participating patients provided informed consent, and the research methods followed national and international guidelines.
The authors declare no competing interests.
Not applicable.
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Material 1: Supplementary Figure S1. Mean RMSE under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
Supplementary Material 2: Supplementary Figure S2. Mean bias under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
Supplementary Material 3: Supplementary Figure S3. Mean 95% confidence interval coverage rate under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
Supplementary Material 4: Supplementary Figure S4. Mean confidence interval width under different scenarios differentiated by sample size for the original, CC and MI-MRMC approaches. A) Under the MCAR mechanism, B) under the MAR mechanism. CC: complete case analysis, MI-MRMC: multiple imputation under the MRMC framework, Original: DBM analysis on the original complete dataset.
Supplementary Material 6: Supplementary Table S2. Detailed simulation results.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .
Cite this article
Pan, Z., Qin, Y., Bai, W. et al. Implementing multiple imputations for addressing missing data in multireader multicase design studies. BMC Med Res Methodol 24, 217 (2024). https://doi.org/10.1186/s12874-024-02321-3
Received: 06 March 2024
Accepted: 27 August 2024
Published: 27 September 2024
DOI: https://doi.org/10.1186/s12874-024-02321-3