• Methodology
  • Open access
  • Published: 19 December 2019

Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses

Daniël A. Korevaar (ORCID: orcid.org/0000-0002-7979-7897), Gowri Gopalakrishna, Jérémie F. Cohen & Patrick M. Bossuyt

Diagnostic and Prognostic Research volume 3, Article number: 22 (2019)

Most randomized controlled trials evaluating medical interventions have a pre-specified hypothesis, which is statistically tested against the null hypothesis of no effect. In diagnostic accuracy studies, study hypotheses are rarely pre-defined and sample size calculations are usually not performed, which may jeopardize scientific rigor and can lead to over-interpretation or “spin” of study findings. In this paper, we propose a strategy for defining meaningful hypotheses in diagnostic accuracy studies. Based on the role of the index test in the clinical pathway and the downstream consequences of test results, the consequences of test misclassifications can be weighed, to arrive at minimally acceptable criteria for pre-defined test performance: levels of sensitivity and specificity that would justify the test’s intended use. Minimally acceptable criteria for test performance should form the basis for hypothesis formulation and sample size calculations in diagnostic accuracy studies.

The randomized controlled trial (RCT) has become the undisputed cornerstone of evidence-based health care [1]. RCTs typically evaluate the benefits and harms of pharmaceuticals (and other interventions) by comparing health outcomes between one group of participants who receive the drug to be evaluated, and a second group of participants who receive a placebo or an alternative drug [2]. Most RCTs have as a pre-specified hypothesis that the intervention under evaluation improves health outcomes, which is statistically tested against the null hypothesis of no effect (Table 1). The sample size of the trial is then calculated based on this pre-specified hypothesis and on the desired magnitude of the type I and type II errors [3]. Based on the collected data, investigators then typically calculate a test statistic and the corresponding p value. This is done alongside estimating effect sizes, such as the mean difference, relative risk, or odds ratio, and their precision, such as confidence intervals.

The situation is very different for diagnostic tests. Comparative trials that focus on the effects of testing on patient outcomes are relatively rare [4]. There is, in general, no requirement to demonstrate a reasonable benefits-to-harms balance for new tests before they can be introduced to the market [5]. The clinical performance of medical tests is often evaluated in diagnostic accuracy studies. Such studies evaluate a diagnostic test’s ability to correctly distinguish between patients with and without a target condition, by comparing the results of the test against the results of a reference standard (Table 2) [6].

Diagnostic accuracy studies typically report results in terms of accuracy statistics, such as sensitivity and specificity. Many fail to report measures of statistical precision [8]. Somewhat surprisingly, most diagnostic accuracy studies do not pre-specify a study hypothesis; they are usually reported without any explicit statistical test of a null hypothesis. In an analysis of 126 published diagnostic accuracy studies, Ochodo and colleagues observed that only 12% reported any statistical test of a hypothesis somewhat related to the study objectives, and no more than 11% reported a sample size justification [9]. Similar evaluations found that a sample size justification was reported in only 5% of diagnostic accuracy studies published in eight leading medical journals, in 3% of diagnostic accuracy studies of depression screening tools, and in 3% of diagnostic accuracy studies in ophthalmology [10, 11, 12].

We believe the logic of having clear and pre-specified study hypotheses could and should extend to diagnostic accuracy studies. Scientific rigor is likely to benefit from this, as explicitly defining study hypotheses forces researchers to express minimally acceptable criteria for accuracy values that would make a test clinically fit-for-purpose, before initiating a study. A clearly defined study hypothesis also enables an informed judgment of the appropriateness of the study’s design, sample size, statistical analyses, and conclusions. It may also prevent the authors from over-interpreting their findings [9, 13, 14], as the absence of a pre-specified hypothesis leaves ample room for “spin”: generous presentations of the study findings, inviting the readers to conclude that the test is useful, even though the estimates of sensitivity and specificity do not support such a conclusion.

Below, we propose a strategy for defining meaningful hypotheses in diagnostic accuracy studies, based on the consequences of using the test in clinical practice. With the exposition below, we invite researchers who are designing diagnostic accuracy studies to derive meaningful study hypotheses and minimally acceptable criteria for test accuracy: targeted test evaluation.

Meaningful hypotheses about diagnostic accuracy

Since there are typically two measures of accuracy in a diagnostic accuracy study (Table 2 and Fig. 1), we need a joint hypothesis, with one component about the test’s sensitivity and a second about its specificity. Having a hypothesis about sensitivity only is usually pointless for quantitative tests, since one can always arbitrarily set the test positivity rate, by changing the positivity threshold, to match the desired sensitivity. That, in itself, does not guarantee that the corresponding specificity is sufficiently high for the test to be clinically useful. The same applies to only having a hypothesis about specificity.

Fig. 1 Typical output of a diagnostic accuracy study: the contingency table (or “2 × 2 table”)
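As a minimal sketch of how the two accuracy measures follow from the 2 × 2 table, the Python snippet below computes sensitivity and specificity from the four cell counts. The class name `TwoByTwo` and the counts in the usage line are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a 2 x 2 table (index test versus reference standard)."""
    tp: int  # index test positive, target condition present
    fp: int  # index test positive, target condition absent
    fn: int  # index test negative, target condition present
    tn: int  # index test negative, target condition absent

    def sensitivity(self) -> float:
        # Proportion of those WITH the target condition who test positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of those WITHOUT the target condition who test negative.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts, for illustration only.
table = TwoByTwo(tp=45, fp=10, fn=5, tn=90)
print(f"sensitivity={table.sensitivity():.2f}, specificity={table.specificity():.2f}")
```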

Informative tests produce a higher rate of positive test results in patients with the target condition than in those without the target condition. In ROC (receiver operating characteristic) space, the combination of sensitivity and specificity for these tests will then be in the upper left triangle (Fig. 2). Yet, in contrast to RCTs of interventions, where a null hypothesis of “no effect” works quite well in most cases, a null hypothesis of “not informative” is not very useful for evaluations of the clinical performance of diagnostic tests. Such a hypothesis may be relevant in the early discovery phase of biomarkers, but it will no longer be informative when a test has been developed, based on that marker, and when that test is evaluated for its added value to clinical practice. By the time a diagnostic accuracy study is initiated, one usually already knows that the test to be evaluated is more informative than just rolling a die.

Fig. 2 Receiver operating characteristic (ROC) space with “target region” based on minimally acceptable criteria for accuracy. ROC space has two dimensions: sensitivity (y-axis) and 1-specificity (x-axis). When the sum of sensitivity and specificity is ≥ 1.0, the test’s accuracy will be a point somewhere in the upper left triangle. The “target region” of a diagnostic accuracy study will always touch the upper left corner of ROC space, which is the point for perfect tests, where both sensitivity and specificity are 1.0. From there, the rectangle extends down, to MAC for sensitivity, and extends to the right, towards MAC for specificity. The gray square represents the target region of a diagnostic accuracy study with a MAC (sensitivity) of 0.70, and a MAC (specificity) of 0.60. MAC, minimally acceptable criteria

For many tests, both sensitivity and specificity will be higher than 0.50. A very simple study hypothesis then stipulates that both sensitivity and specificity be at least 0.50:

H1: {Sensitivity ≥ 0.50 and Specificity ≥ 0.50}

This could be evaluated against the following joint null hypothesis:

H0: {Sensitivity < 0.50 and/or Specificity < 0.50}

This hypothesis is also not very helpful in evaluations of the clinical performance of tests, because it can be too lenient in some cases and too strict in others. For example, if a test is meant to rule out disease, the number of false negatives should clearly be low. This means that a very high sensitivity is required, and a value barely exceeding 0.50 will not be enough. A useful triage test may combine a sensitivity of 0.999 with a specificity of 0.30, since it would mean that the triage test prevents further testing in 30% of those without the target condition, while missing only 1 in 1000 of those with the target condition. If one wants a new, expensive test to replace an existing, inexpensive test, the accuracy of that new test should substantially exceed that of the existing test. Simply concluding that sensitivity and specificity exceed 0.50 will not be enough.

From these examples, we can conclude that the required levels of sensitivity and specificity will depend on the clinical context in which the new test will be used. This implies that we should explore that context explicitly when specifying hypotheses. Therefore, what would be more useful to know is not whether tests are informative, but whether they are informative enough, or in other words, whether the test meets “minimally acceptable criteria” (MAC) for a pre-defined test performance, i.e., levels of sensitivity and specificity that would justify the intended use. The study hypotheses then become:

H1: {Sensitivity ≥ MAC (Sensitivity) and Specificity ≥ MAC (Specificity)}

H0: {Sensitivity < MAC (Sensitivity) and/or Specificity < MAC (Specificity)}

In ROC space, this can be defined as a rectangle in the upper left corner that corresponds to MAC (Fig. 2). The test will be considered acceptable if both the sensitivity and specificity are in this rectangle, which we will refer to as the “target region” in ROC space.

A diagnostic accuracy study will produce point estimates of sensitivity and specificity, along with confidence intervals around them. If we position these in ROC space, then both the point estimates and the confidence intervals should lie entirely within the target region. If MAC for sensitivity is set at 0.85 and MAC for specificity at 0.90, the lower limit of the confidence interval for sensitivity should exceed 0.85, and for specificity, it should exceed 0.90.
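The check described above can be sketched in a few lines of Python. The paper does not prescribe a particular interval method, so this sketch assumes a one-sided Wilson score lower limit at the 95% level; the function names and the counts in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower(successes: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Wilson score lower limit for a proportion.

    Assumption: the Wilson method is used here for convenience; an exact
    (Clopper-Pearson) limit would be an equally reasonable choice.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - half_width) / denom


def in_target_region(tp: int, fn: int, tn: int, fp: int,
                     mac_sens: float, mac_spec: float,
                     alpha: float = 0.05) -> bool:
    """True only if BOTH lower confidence limits exceed their MAC."""
    sens_lower = wilson_lower(tp, tp + fn, alpha)
    spec_lower = wilson_lower(tn, tn + fp, alpha)
    return sens_lower > mac_sens and spec_lower > mac_spec


# MAC values from the text (0.85 and 0.90); the counts are hypothetical.
print(in_target_region(tp=190, fn=10, tn=380, fp=20, mac_sens=0.85, mac_spec=0.90))
# Here the sensitivity point estimate (0.875) is above MAC, but its lower limit is not.
print(in_target_region(tp=175, fn=25, tn=380, fp=20, mac_sens=0.85, mac_spec=0.90))
```

The second call illustrates why the point estimate alone is not sufficient: the estimate clears the MAC, yet the confidence interval still extends outside the target region.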

Targeted test evaluation: defining minimally acceptable criteria for diagnostic accuracy

Below, we provide a series of steps that could be used for defining minimally acceptable criteria for diagnostic accuracy (Fig. 3). A case example for each of the steps is reported in Table 3 and Fig. 4.

Fig. 3 Defining minimally acceptable criteria (MAC) for diagnostic accuracy

Fig. 4 External validation of the diagnostic accuracy of rules-based selective testing strategies (figure derived from Cohen and colleagues [16]). Graph shows sensitivity and specificity estimates with their one-sided rectangular 95% confidence regions. Numbers indicate the rules-based selective testing strategies

Identify the existing clinical pathway in which the index test will be used

The diagnostic accuracy of a test is not a fixed property: it typically varies depending on the clinical setting in which it is applied, and on how the test is used [21]. Consequently, the sensitivity and specificity of a single test are likely to differ across settings and applications. Consequences of testing may also vary across different settings. Tests, therefore, should be evaluated in a setting that mirrors the clinical context in which they will be used. This can only be done by first defining the existing clinical pathway.

The identification of a clinical pathway is recommended in the evaluation of a diagnostic test by agencies such as the US Preventive Services Task Force (USPSTF); the Agency for Healthcare Research and Quality (AHRQ); the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) Working Group; and the Cochrane Collaboration [22, 23]. Likewise, the STARD (Standards for Reporting Diagnostic Accuracy) 2015 statement recommends that authors report the intended use and clinical role of the index test [24, 25].

To help define the existing clinical pathway, we propose a number of guiding questions that authors of diagnostic accuracy studies can use:

What is the target condition to be diagnosed? The target condition can be defined as the disease, disease stage, or severity or, more generally, the condition that the investigated test is intended to detect.

Who are the targeted patients? The patients undergoing testing can be those presenting with certain signs or symptoms, or those having undergone specific previous tests, or just selected based on age, sex, or other risk factors, as in screening.

In which setting will the test be used? The setting may be primary, secondary, or tertiary care, or, more specifically, the emergency department, outpatient clinic, or in the general community.

What are the other tests in the existing clinical pathway? The test under evaluation may be positioned before or after other tests in the specific clinical setting as defined in the guiding question above. Also, a number of additional testing procedures may need to be considered, depending on the results of testing, before the diagnostic work-up can be closed and a clinical decision on further management is taken.

Define the role of the index test in the clinical pathway

Defining the role of the index test in the existing clinical pathway is critical for defining eligibility criteria for participants for the study. This step involves defining where in the existing clinical pathway the test under evaluation will be positioned. There are several possible roles for diagnostic tests relative to an existing test: triage, add-on, replacement, or new test [26, 27]:

A triage test is used before the existing test(s), and its results determine which patients will undergo the existing test.

An add-on test is used after an existing test to improve the diagnostic accuracy of the testing strategy.

A replacement test aims to replace an existing test, either because it is expected to have higher diagnostic accuracy, is less invasive, is less costly, or is easier to use than the existing test.

A new test is a test that opens up a completely new test-treatment pathway. The latter would be the case with a new population screening strategy, for example, where, at present, no screening for the target condition is performed.

Define the expected proportion of patients with the target condition

Depending on the proportion of tested patients who have the target condition, absolute numbers of false-positive and false-negative results will vary. If 100 patients are tested by a test with a sensitivity of 0.90 and a specificity of 0.90, and 50 of them have the target condition, one can expect, on average, 5 false positives and 5 false negatives. However, when only 10 of the 100 have the target condition, there will only be 1 false negative versus 9 false positives, even if these are tested with the very same test. As a consequence, the potentially harmful downstream consequences of the test will depend on how many of the tested patients have the target condition.
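The arithmetic in this paragraph can be reproduced with a short Python sketch. The function name is hypothetical; the inputs in the usage lines are the two scenarios from the text (100 patients, sensitivity and specificity of 0.90, with 50 or 10 patients having the target condition).

```python
def expected_misclassifications(n_tested: int, prevalence: float,
                                sensitivity: float, specificity: float):
    """Return the expected (false negatives, false positives) among n_tested."""
    with_condition = n_tested * prevalence
    without_condition = n_tested - with_condition
    false_negatives = with_condition * (1 - sensitivity)
    false_positives = without_condition * (1 - specificity)
    return false_negatives, false_positives


for prevalence in (0.50, 0.10):
    fn, fp = expected_misclassifications(100, prevalence, 0.90, 0.90)
    print(f"prevalence={prevalence:.2f}: "
          f"false negatives={fn:.0f}, false positives={fp:.0f}")
```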

Several strategies can be used for defining the expected proportion of those with the target condition in a specific clinical setting. Ideally, a systematic review is identified or performed, to estimate this proportion, and to define relevant determinants. Alternatively, or additionally, a small pilot study can be performed, or clinical experts consulted.

Identify the downstream consequences of test results

Bearing in mind the positioning of the index test in the clinical pathway, the downstream consequences of test results (i.e., test positives and test negatives) need to be defined. These refer to clinical management decisions, such as additional confirmatory tests patients may undergo if they are considered positive, or treatments that may be initiated or withheld as a result. Explicitly defining downstream consequences of the index test is important as they also determine the extent to which index test misclassifications (false-positive and false-negative results) could lead to harm to patients being tested.

Weigh the consequences of test misclassifications

Defining MAC for sensitivity and specificity comes down to weighing the downstream consequences of test misclassifications: false-positive results versus false-negative results. The weight given to each type of misclassification depends on the role of the index test in the clinical pathway and on the downstream consequences of a false-positive or false-negative result. Take, for example, triage tests aimed at ruling out disease. These typically need to have high sensitivity, while specificity may be less important. In such a scenario, a false-negative result may be more detrimental than a false-positive result, as one does not want to miss true positive cases at the triage stage, especially if early detection and treatment are crucial. Further down the clinical pathway, however, it may be crucial to keep the number of false positives to a minimum, since positive test results may lead to radical treatment decisions with potentially serious side effects. Therefore, add-on tests generally require higher specificity than triage tests. In other words, the weight given to the consequences of a false-positive result is higher in this scenario. For replacement tests, sensitivity and specificity should commonly both be at least as good as those of the existing test. When weighing the consequences of test misclassifications, the following questions should eventually be considered:

Considering 100 patients suspected of the target condition, how many false-negative results are acceptable, considering the potential harms of such misclassifications?

Considering 100 patients suspected of the target condition, how many false-positive results are acceptable, considering the potential harms of such misclassifications?

Define the study hypothesis by setting minimally acceptable criteria for sensitivity and specificity

Based on the weighted consequences of false-positive and false-negative test results and taking into account the expected proportion of patients with the target condition (as defined earlier), MAC for sensitivity and specificity can be defined and the target region in the ROC space can be drawn (Fig. 2).
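One possible way to turn the answers to the two "per 100 patients" questions into MAC values is sketched below in Python. This translation is an assumption made for illustration, not a formula given in the paper: it divides the acceptable number of each type of misclassification by the expected number of patients with and without the target condition. The function name and the numbers in the usage line are hypothetical.

```python
def mac_from_acceptable_errors(acceptable_fn_per_100: float,
                               acceptable_fp_per_100: float,
                               prevalence: float):
    """Translate acceptable misclassifications per 100 tested into (MAC sens, MAC spec)."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    with_condition = 100 * prevalence
    without_condition = 100 - with_condition
    if acceptable_fn_per_100 > with_condition or acceptable_fp_per_100 > without_condition:
        raise ValueError("acceptable errors exceed the number of patients in that group")
    # Sensitivity = 1 - (false negatives / patients with the condition).
    mac_sensitivity = 1 - acceptable_fn_per_100 / with_condition
    # Specificity = 1 - (false positives / patients without the condition).
    mac_specificity = 1 - acceptable_fp_per_100 / without_condition
    return mac_sensitivity, mac_specificity


# Hypothetical inputs: 30% prevalence, at most 3 false negatives and 14 false positives per 100.
mac_sens, mac_spec = mac_from_acceptable_errors(3, 14, prevalence=0.30)
print(f"MAC sensitivity={mac_sens:.2f}, MAC specificity={mac_spec:.2f}")
```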

Pepe and colleagues recently provided a relatively simple method for specifying MAC that is based on weighing the harms and benefits of being detected with the target condition [28]. Their approach focuses on the threshold for starting the next action: the minimally required probability, after testing, of having the target condition that would justify subsequent management guided by testing, such as starting treatment, or ordering additional testing after a positive test result. From this threshold, and from the proportion of those with the target condition in the group in which the test under evaluation is going to be used, they derive minimum likelihood ratios: the combinations of sensitivity and specificity that would lead to the required post-test probability.
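The link between the action threshold, the proportion with the target condition, and a minimum positive likelihood ratio can be sketched with the standard odds form of Bayes' theorem (post-test odds = pre-test odds × likelihood ratio). The Python below is an illustration of that relation only and is not taken from the article by Pepe and colleagues; function names and input values are hypothetical.

```python
def min_positive_likelihood_ratio(prevalence: float, threshold_probability: float) -> float:
    """Smallest LR+ that lifts the pre-test probability to the action threshold."""
    pre_test_odds = prevalence / (1 - prevalence)
    threshold_odds = threshold_probability / (1 - threshold_probability)
    return threshold_odds / pre_test_odds


def post_test_probability(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability of the target condition after a positive test result."""
    lr_positive = sensitivity / (1 - specificity)
    post_test_odds = prevalence / (1 - prevalence) * lr_positive
    return post_test_odds / (1 + post_test_odds)


# Hypothetical inputs: 1% have the target condition; action is justified at 5%.
lr_min = min_positive_likelihood_ratio(prevalence=0.01, threshold_probability=0.05)
print(f"minimum LR+ = {lr_min:.2f}")
# Any (sensitivity, specificity) pair with sensitivity / (1 - specificity) >= lr_min qualifies.
```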

In their article, Pepe and colleagues argue that such thresholds can be inferred from comparisons with existing situations in which comparable actions are justified. An example is the probability of having colorectal cancer or its precursors in those referred for colonoscopy in a population screening program for colorectal cancer. A new marker would have MAC for sensitivity and specificity that would lead to a post-test probability that at least exceeds that probability.

The minimum positive likelihood ratio defines a specific region in ROC space: a triangle that includes the upper left corner. This area also includes very low values of sensitivity, which may not be clinically useful. The approach of Pepe and colleagues can be further refined by defining the acceptable number needed to test. This is the number of patients that must undergo testing in order to generate one positive result. It is the inverse of the positivity rate, which depends on the proportion of tested patients with the target condition and on the sensitivity and specificity. For expensive, invasive, or burdensome tests, the acceptable number needed to test will be lower than for simple, less costly tests.
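The number needed to test follows directly from this definition. In the Python sketch below, the positivity rate is the sum of true positives and false positives per tested patient; function names and input values are hypothetical.

```python
def positivity_rate(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Expected proportion of tested patients with a positive result."""
    true_positive_rate = prevalence * sensitivity
    false_positive_rate = (1 - prevalence) * (1 - specificity)
    return true_positive_rate + false_positive_rate


def number_needed_to_test(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Patients tested per positive result: the inverse of the positivity rate."""
    return 1 / positivity_rate(prevalence, sensitivity, specificity)


# Hypothetical inputs: 10% prevalence, sensitivity and specificity both 0.90.
print(f"{number_needed_to_test(0.10, 0.90, 0.90):.1f} patients tested per positive result")
```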

Our framework focuses on weighing the consequences of test misclassifications for arriving at MAC for sensitivity and specificity. There are obviously other appropriate methods to define these. One option is to perform a survey among a panel of experts, directly asking what they would consider an appropriate MAC. Gieseker and colleagues, for example, evaluated the accuracy of multiple testing strategies for diagnosing Streptococcus pyogenes pharyngitis (“strep throat”); they performed a sample survey of pediatricians to identify a MAC for sensitivity and report: “67 (80%) of 84 were willing to miss no more than 5% of streptococcal infections” [29]. A similar method was used to identify minimally acceptable interpretative performance criteria for screening mammography [30]. In some areas, there are clearly established MAC. In triaging strategies to safely exclude pulmonary embolism without imaging, for example, it is now common practice to require that the 3-month thrombo-embolic risk does not exceed 3% in test-negatives. This failure rate corresponds to that observed after a negative pulmonary angiography [31].

Perform a sample size calculation

Based on the MAC for sensitivity and specificity and the expected proportion of patients with the target condition, a sample size calculation can be performed. It yields the number of participants (i.e., patients suspected of having the target condition) that need to be included in the study to conclude that the point estimates and lower limits of the confidence intervals for sensitivity and specificity fall within the “target region,” by rejecting the null hypothesis that they do not. The statistical tests and methods for sample size calculations have all been defined before in the literature [32].

Additional file 1 provides an example of a sample size calculator that can be used for this purpose, with background information on the formula used in Additional file 2. The required inputs are α and β (see Table 1 for details), MAC for sensitivity and specificity, and the expected values of sensitivity and specificity. The output of the calculator is the minimal numbers of participants with and without the target condition that need to be included; the final sample size will depend on the expected prevalence of the target condition.
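The formula in Additional file 2 is not reproduced here. As a rough stand-in, the Python sketch below uses the common normal-approximation sample size for a one-sided test of a single proportion, applied separately to sensitivity and to specificity; this choice, the function names, the default α and β, and the input values are all assumptions made for illustration and may differ from the authors' calculator.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_for_one_proportion(mac: float, expected: float,
                         alpha: float = 0.05, beta: float = 0.20) -> int:
    """Participants needed to show that a proportion exceeds `mac`,
    given its expected value (normal approximation, one-sided test)."""
    if not 0 < mac < expected < 1:
        raise ValueError("need 0 < mac < expected < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    numerator = (z_alpha * sqrt(mac * (1 - mac))
                 + z_beta * sqrt(expected * (1 - expected)))
    return ceil((numerator / (expected - mac)) ** 2)


def total_sample_size(n_with: int, n_without: int, prevalence: float) -> int:
    """Total participants so that both subgroups reach their required size."""
    return max(ceil(n_with / prevalence), ceil(n_without / (1 - prevalence)))


# Hypothetical inputs: MAC 0.85 / 0.90, expected accuracy 0.95 / 0.96, prevalence 0.25.
n_with = n_for_one_proportion(mac=0.85, expected=0.95)     # drives sensitivity
n_without = n_for_one_proportion(mac=0.90, expected=0.96)  # drives specificity
print(n_with, n_without, total_sample_size(n_with, n_without, prevalence=0.25))
```

The last step mirrors the remark in the text: the required numbers with and without the target condition are fixed by the hypotheses, while the total depends on the expected prevalence.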

Arrive at meaningful conclusions

Upon completion of the study, estimates of sensitivity and specificity are compared with the pre-defined MAC for sensitivity and specificity. This can be done by (1) assessing whether the point estimates of sensitivity and specificity and the lower confidence interval limits are above MAC, or (2) performing formal statistical testing of the null hypothesis and arriving at a p value. As diagnostic accuracy studies have a joint hypothesis (one for sensitivity and one for specificity), one cannot reject the null hypothesis if only one of these fulfills the criteria for MAC and the other does not. Nor can one reject the null hypothesis if the lower confidence limit of sensitivity or specificity is below MAC. Obviously, this “statistically negative” result does not mean that the diagnostic test is useless. Firstly, one should consider the possibility that the study was too small, for example, due to incorrect assumptions during the sample size calculations, which may have led to wide confidence intervals. Secondly, one should consider that the pre-specified criteria for MAC may have been too strict, or that the test may have added value in another clinical setting, or in a different role in the existing clinical pathway. On the other hand, a significant p value does not mean that the test under evaluation is fit-for-purpose; the study may be biased (e.g., due to many missing results) or have low generalizability.
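Option (2) can be sketched as two one-sided exact binomial tests, one per accuracy measure, with the joint null hypothesis rejected only when both component tests reject. The paper does not specify which test to use, so the exact binomial test, the function names, and the counts below are assumptions made for illustration.

```python
from math import exp, lgamma, log


def binomial_upper_tail(successes: int, n: int, p0: float) -> float:
    """P(X >= successes) for X ~ Binomial(n, p0), with 0 < p0 < 1.

    Computed in log space so that large n does not overflow.
    """
    log_p, log_q = log(p0), log(1 - p0)
    total = 0.0
    for k in range(successes, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += exp(log_pmf)
    return min(total, 1.0)


def joint_test(tp: int, fn: int, tn: int, fp: int,
               mac_sens: float, mac_spec: float, alpha: float = 0.05):
    """Return (p_sensitivity, p_specificity, reject_joint_null)."""
    p_sens = binomial_upper_tail(tp, tp + fn, mac_sens)
    p_spec = binomial_upper_tail(tn, tn + fp, mac_spec)
    # The joint null is rejected only if BOTH component nulls are rejected.
    return p_sens, p_spec, (p_sens < alpha and p_spec < alpha)


# Hypothetical counts; MAC values of 0.85 and 0.90 as in the earlier example.
print(joint_test(tp=190, fn=10, tn=380, fp=20, mac_sens=0.85, mac_spec=0.90))
print(joint_test(tp=175, fn=25, tn=380, fp=20, mac_sens=0.85, mac_spec=0.90))
```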


Targeted test evaluation will usually require the expertise of multiple professionals. There should be clinical experts to identify the management actions that will result from positive or negative test results and who can weigh the downstream consequences of test results. In some cases, it may be desirable to also include patients or their advocates in this process. There should also be methodological and statistical experts, to avoid mistakes in drawing the clinical pathway, to promote consistency in the process, and to arrive at adequate sample size calculations based on the defined MAC for test accuracy.

There is a growing recognition that explicitly specifying study hypotheses, and reporting how these were specified in the protocol-development phase of the study, is crucial in test accuracy research. The STARD 2015 statement for reporting diagnostic accuracy studies now requires authors to report “study hypotheses” (item 4) and the “intended sample size and how it was determined” (item 18) [24, 25]. Similar methods for focusing on MAC of test performance are also increasingly being implemented among systematic reviews and clinical guidelines. The Cochrane Handbook for Diagnostic Test Accuracy Reviews, for example, now encourages authors to describe the clinical pathway in which the test under evaluation will be implemented, including prior tests, the role of the index test and alternative tests, if applicable [23]. A similar practice is advised by the recently established GRADE (Grading of Recommendations Assessment, Development and Evaluation) quality assessment criteria for diagnostic accuracy studies, which encourage guideline developers to focus on and weigh consequences of testing [33].

The process described here is not that different from hypothesis formulation and sample size calculations in RCTs. Even though most superiority RCTs have a simple null hypothesis (i.e., no effect), the calculation of the required sample size depends on the definition of a “minimum important difference”: the smallest difference in the primary outcome that the trial should be able to detect. The DELTA (Difference ELicitation in TriAls) group recently provided a systematic overview of methods for specifying the target difference in RCTs [34]. These methods are subdivided into those for specifying an important difference (e.g., by weighing resource costs and health outcomes to estimate the overall net benefit of the intervention), those for specifying a realistic difference (e.g., through a pilot study), or both (e.g., through opinion seeking among health professionals).

We realize that our framework has some potential shortcomings. We focused on MAC for the sensitivity and specificity of a new test, and null hypotheses based on these criteria, to be used in the evaluation of a single test with dichotomous test results. Defining MAC may be more difficult in other situations, although the general principles should be the same. In some cases, for example, diagnostic accuracy studies do not focus on a single test but compare two or more tests or testing strategies. Hayen and colleagues have described how one can use meaningful measures and statistics in such studies, such as the relative likelihood ratios [27]. In other situations, the index test does not produce a dichotomous test result, but a continuous one. This is, for example, often the case with laboratory tests. We believe that our framework could, with some adaptations, also be useful in those cases, as evaluating continuous tests generally comes down to finding a clinically relevant test threshold at which the test is useful for ruling in or ruling out the target condition. Currently, studies on continuous tests very often select an optimal threshold for sensitivity and specificity based on, for example, Youden’s index. In many cases, this leads to a test threshold that is clinically not useful, as both sensitivity and specificity are too low for decision-making. An alternative would be to pre-define MAC for sensitivity and specificity, as outlined, and investigate whether there is a test threshold that is able to fulfill these criteria.
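For a continuous index test, the alternative described above can be sketched as a simple search over candidate thresholds. The Python below assumes that higher values indicate the target condition and compares point estimates only; in practice, the lower confidence limits would also have to clear MAC, and a threshold selected on the same data tends to look better than it will in new patients. The function name and the measurements are hypothetical.

```python
def thresholds_meeting_mac(values_with, values_without, mac_sens, mac_spec):
    """Return (threshold, sensitivity, specificity) for every observed value
    that, used as a cut-off (positive if value >= threshold), meets both MAC."""
    candidates = sorted(set(values_with) | set(values_without))
    results = []
    for threshold in candidates:
        sens = sum(v >= threshold for v in values_with) / len(values_with)
        spec = sum(v < threshold for v in values_without) / len(values_without)
        if sens >= mac_sens and spec >= mac_spec:
            results.append((threshold, sens, spec))
    return results


# Hypothetical measurements in patients with and without the target condition.
with_condition = [5.1, 6.3, 7.0, 7.4, 8.2, 9.0]
without_condition = [2.0, 3.1, 3.9, 4.4, 5.5, 6.0]
for threshold, sens, spec in thresholds_meeting_mac(
        with_condition, without_condition, mac_sens=0.80, mac_spec=0.80):
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

An empty result would indicate that no threshold fulfills the pre-defined criteria in these data, which is itself an informative finding.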

Mainly due to technological innovations, the field of diagnostic testing evolves quickly. Premature incorporation of new diagnostic tests into clinical practice may lead to unnecessary testing, waste of resources, and faulty clinical decision-making. Defining MAC before initiating new diagnostic accuracy studies should improve methodological study quality and help produce more meaningful evidence syntheses of such studies.

Availability of data and materials

Not applicable.

References

Smith R, Rennie D. Evidence-based medicine--an oral history. JAMA. 2014;311(4):365–7.

Kendall JM. Designing a research project: randomised controlled trials and their principles. Emerg Med J. 2003;20(2):164–8.

Jones SR, Carley S, Harrison M. An introduction to power and sample size estimation. Emerg Med J. 2003;20(5):453–8.

Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem. 2012;58(12):1636–43.

Ferrante di Ruffano L, Davenport C, Eisinga A, Hyde C, Deeks JJ. A capture-recapture analysis demonstrated that randomized controlled trials evaluating the impact of diagnostic tests on patient outcomes are rare. J Clin Epidemiol. 2012;65(3):282–7.

Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem. 2012;58(9):1292–301.

Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PM. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol. 2003;56(11):1129–35.

Korevaar DA, van Enst WA, Spijker R, Bossuyt PM, Hooft L. Reporting quality of diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adherence to STARD. Evid Based Med. 2014;19(2):47–54.

Ochodo EA, de Haan MC, Reitsma JB, Hooft L, Bossuyt PM, Leeflang MM. Overinterpretation and misreporting of diagnostic accuracy studies: evidence of “spin”. Radiology. 2013;267(2):581–8.

Bachmann LM, Puhan MA, ter Riet G, Bossuyt PM. Sample sizes of studies on diagnostic accuracy: literature survey. BMJ. 2006;332(7550):1127–9.

Bochmann F, Johnson Z, Azuara-Blanco A. Sample size in studies on diagnostic accuracy in ophthalmology: a literature survey. Br J Ophthalmol. 2007;91(7):898–900.

Thombs BD, Rice DB. Sample sizes and precision of estimates of sensitivity and specificity from primary studies on the diagnostic accuracy of depression screening tools: a survey of recently published studies. Int J Methods Psychiatr Res. 2016;25(2):145–52.

Lumbreras B, Parker LA, Porta M, Pollan M, Ioannidis JP, Hernandez-Aguado I. Overinterpretation of clinical applicability in molecular diagnostic research. Clin Chem. 2009;55(4):786–94.

McGrath TA, McInnes MDF, van Es N, Leeflang MMG, Korevaar DA, Bossuyt PMM. Overinterpretation of research findings: evidence of “spin” in systematic reviews of diagnostic accuracy studies. Clin Chem. 2017;63(8):1353–62.

Shaikh N, Leonard E, Martin JM. Prevalence of streptococcal pharyngitis and streptococcal carriage in children: a meta-analysis. Pediatrics. 2010;126(3):e557–64.

Cohen JF, Cohen R, Levy C, Thollot F, Benani M, Bidet P, Chalumeau M. Selective testing strategies for diagnosing group A streptococcal infection in children with pharyngitis: a systematic review and prospective multicentre external validation study. Can Med Assoc J. 2015;187(1):23–32.

Group ESTG, Pelucchi C, Grigoryan L, Galeone C, Esposito S, Huovinen P, Little P, Verheij T. Guideline for the management of acute sore throat. Clin Microbiol Infect. 2012;18(Suppl 1):1–28.

Shulman ST, Bisno AL, Clegg HW, Gerber MA, Kaplan EL, Lee G, Martin JM, Van Beneden C. Clinical practice guideline for the diagnosis and management of group A streptococcal pharyngitis: 2012 update by the Infectious Diseases Society of America. Clin Infect Dis. 2012;55(10):1279–82.

Linder JA, Bates DW, Lee GM, Finkelstein JA. Antibiotic treatment of children with sore throat. JAMA. 2005;294(18):2315–22.

Cohen JF, Bertille N, Cohen R, Chalumeau M. Rapid antigen detection test for group A streptococcus in children with pharyngitis. Cochrane Database Syst Rev. 2016;7:CD010502.

Irwig L, Bossuyt P, Glasziou P, Gatsonis C, Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ. 2002;324(7338):669–71.

Gopalakrishna G, Mustafa RA, Davenport C, Scholten RJ, Hyde C, Brozek J, Schunemann HJ, Bossuyt PM, Leeflang MM, Langendam MW. Applying Grading of Recommendations Assessment, Development and Evaluation (GRADE) to diagnostic tests was challenging but doable. J Clin Epidemiol. 2014;67(7):760–8.

Deeks JJ, Wisniewski S, Davenport C. Chapter 4: guide to the contents of a Cochrane Diagnostic Test Accuracy Protocol. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors), Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0.0. The Cochrane Collaboration; 2013. Available from: http://srdta.cochrane.org/ .

Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, de Vet HC, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.

Cohen JF, Korevaar DA, Altman DG, Bruns DE, Gatsonis CA, Hooft L, Irwig L, Levine D, Reitsma JB, de Vet HC, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6(11):e012799.

Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ. 2006;332(7549):1089–92.

Hayen A, Macaskill P, Irwig L, Bossuyt P. Appropriate statistical methods are required to assess diagnostic tests for replacement, add-on, and triage. J Clin Epidemiol. 2010;63(8):883–91.

Pepe MS, Janes H, Li CI, Bossuyt PM, Feng Z, Hilden J. Early-phase studies of biomarkers: what target sensitivity and specificity values might confer clinical utility? Clin Chem. 2016;62(5):737–42.

Gieseker KE, Roe MH, MacKenzie T, Todd JK. Evaluating the American Academy of Pediatrics diagnostic standard for Streptococcus pyogenes pharyngitis: backup culture versus repeat rapid antigen testing. Pediatrics. 2003;111(6 Pt 1):e666–70.

Carney PA, Sickles EA, Monsees BS, Bassett LW, Brenner RJ, Feig SA, Smith RA, Rosenberg RD, Bogart TA, Browning S, et al. Identifying minimally acceptable interpretive performance criteria for screening mammography. Radiology. 2010;255(2):354–61.

Righini M, Van Es J, Den Exter PL, Roy PM, Verschuren F, Ghuysen A, Rutschmann OT, Sanchez O, Jaffrelot M, Trinh-Duc A, et al. Age-adjusted D-dimer cutoff levels to rule out pulmonary embolism: the ADJUST-PE study. JAMA. 2014;311(11):1117–24.

Pepe MS. The statistical evaluation of medical tests for classification and prediction. New York: Oxford University Press; 2003. Chapter 8: Study design and hypothesis testing, Section 8.2: Sample sizes for phase 2 studies. Available online at https://research.fhcrc.org/content/dam/stripe/diagnostic-biomarkers-statistical-center/files/excerpt.pdf

Schunemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE, Williams JW Jr, Kunz R, Craig J, Montori VM, et al. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies. BMJ. 2008;336(7653):1106–10.

Hislop J, Adewuyi TE, Vale LD, Harrild K, Fraser C, Gurung T, Altman DG, Briggs AH, Fayers P, Ramsay CR, et al. Methods for specifying the target difference in a randomised controlled trial: the Difference ELicitation in TriAls (DELTA) systematic review. PLoS Med. 2014;11(5):e1001645.

Author information

Authors and affiliations.

Department of Respiratory Medicine, Academic Medical Center, Amsterdam University Medical Centers, Amsterdam, the Netherlands

Daniël A. Korevaar

Department of Epidemiology and Biostatistics, Vrije University Medical Centre, Amsterdam University Medical Centers, Amsterdam, the Netherlands

Gowri Gopalakrishna

Department of General Pediatrics and Pediatric Infectious Diseases, Necker-Enfants Malades Hospital, APHP, Paris Descartes University, Paris, France

Jérémie F. Cohen

Inserm U1153, Obstetrical, Perinatal and Pediatric Epidemiology Research Team, Centre of Research in Epidemiology and Statistics Sorbonne Paris Cité (CRESS), Paris Descartes University, Paris, France

Jérémie F. Cohen

Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, Amsterdam University Medical Centers, Amsterdam, the Netherlands

Patrick M. Bossuyt

All authors contributed to the design of the proposed framework and the writing of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Daniël A. Korevaar.

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

An example of a sample size calculator.

Additional file 2.

Formulas used for the calculator provided in Additional File 1.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article.

Korevaar, D.A., Gopalakrishna, G., Cohen, J.F. et al. Targeted test evaluation: a framework for designing diagnostic accuracy studies with clear study hypotheses. Diagn Progn Res 3, 22 (2019). https://doi.org/10.1186/s41512-019-0069-2

Received: 03 October 2019

Accepted: 04 December 2019

Published: 19 December 2019

DOI: https://doi.org/10.1186/s41512-019-0069-2

Diagnostic and Prognostic Research

ISSN: 2397-7523

Formulating Hypotheses for Different Study Designs


  • 1 Department of Clinical Immunology and Rheumatology, Sanjay Gandhi Postgraduate Institute of Medical Sciences, Lucknow, India.
  • 2 Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust (Teaching Trust of the University of Birmingham, UK), Russells Hall Hospital, Dudley, UK.
  • 3 Department of Internal Medicine #2, Danylo Halytsky Lviv National Medical University, Lviv, Ukraine.
  • 4 Department of Biology and Biochemistry, South Kazakhstan Medical Academy, Shymkent, Kazakhstan.
  • 5 Departments of Rheumatology and Research and Development, Dudley Group NHS Foundation Trust (Teaching Trust of the University of Birmingham, UK), Russells Hall Hospital, Dudley, UK.
  • 6 Centre for Epidemiology versus Arthritis, University of Manchester, Manchester, UK.
  • PMID: 34962112
  • PMCID: PMC8728594
  • DOI: 10.3346/jkms.2021.36.e338

Generating a testable working hypothesis is the first step towards conducting original research. Such research may prove or disprove the proposed hypothesis. Case reports, case series, online surveys and other observational studies, clinical trials, and narrative reviews help to generate hypotheses. Observational and interventional studies help to test hypotheses. A good hypothesis is usually based on previous evidence-based reports. Hypotheses without evidence-based justification and a priori ideas are not received favourably by the scientific community. Original research to test a hypothesis should be carefully planned to ensure appropriate methodology and adequate statistical power. While hypotheses can challenge conventional thinking and may be controversial, they should not be destructive. A hypothesis should be tested by ethically sound experiments with meaningful ethical and clinical implications. The coronavirus disease 2019 pandemic has brought into sharp focus numerous hypotheses, some of which were proven (e.g. effectiveness of corticosteroids in those with hypoxia) while others were disproven (e.g. effectiveness of hydroxychloroquine and ivermectin).

Keywords: Hypotheses; Pandemic; Research Ethics; Study Design.

© 2021 The Korean Academy of Medical Sciences.

Publication types

  • COVID-19 / epidemiology
  • COVID-19 Drug Treatment*
  • Ethics, Research
  • Peer Review
  • Pilot Projects
  • Research Design*
  • SARS-CoV-2*
  • Research article
  • Open access
  • Published: 17 June 2007

Problem formulation by medical students: an observation study

  • Francois Auclair 1  

BMC Medical Education volume 7, Article number: 16 (2007)

8075 Accesses

15 Citations

Medical problems are often complex and ill-structured. In formulating the problem, one has to discriminate pertinent elements from irrelevant information in order to effectively find a solution. In this observation study, we describe how medical students formulate the problem of a complex case.

Thirty-two third-year medical students were presented with a complex case of endocarditis. They were asked to synthesize the case and give the best formulation of the problem. They were then asked to provide a diagnosis. A subsequent group of 25 students was presented with the problem already formulated and was also asked for the diagnosis. We analyzed the students' problem formulations using the presence or absence of essential elements of the case, the use of higher-order concepts and the use of relations between concepts.

12/32 students presented with the case made the correct diagnosis. Diagnostic accuracy was significantly associated with the use of higher-order concepts and relations between concepts. Establishing explicit relations was particularly important. Almost all students who missed the diagnosis could not elicit any relations between concepts but only reported factual observations. When presented with an already formulated problem, 19/25 students made the correct diagnosis (p < 0.05).

When faced with a complex new case, students may not have the structured knowledge to recognize the nature of the problem. They have to build new schema or problem representation. Our observations suggest that this process involves using higher-order concepts and establishing new relations between concepts. The fact that students could recognize the disease when presented with a formulated problem but had more difficulty when presented with the original complex case indicates that knowledge of the clinical features may be necessary but not sufficient for problem formulation. Our hypothesis is that problem formulation represents a distinct ability.

Problem formulation is necessary because the world we experience is complex. The problems we face are often ill-structured. One or several elements of the problem may be unknown; the same elements may differ between contexts; there may be uncertainty about the concepts necessary for a solution; and the relationship between concepts and rules may be inconsistent between cases. We need to organize the information into sensible patterns [ 1 ].

We can define problem formulation as a structure of concepts linked by relations. The role of such a system is to organize the problematic experience in such a way that enables better understanding of the state of affairs so that we can more effectively search for solutions [ 2 , 3 ]. As stated by Rosenhead [ 4 ]: "the most demanding and troubling task in formative decision situations is to decide what the problem is".

Problem formulation is not an objective procedure. It is a creation that is probably dependent on norms, values, knowledge, perception of the situation, problem environment and past experience [ 2 , 5 ]. In medicine, problem formulation is a created synthesis that ideally would contain only those attributes that are deemed significant [ 6 ]. It has been well documented that integration of irrelevant information from past experience may lead to inaccurate diagnosis [ 7 ].

In this study, we investigated how medical students formulate the problem of a complex medical case. We analyzed the characteristics of their formulated problems and examined if there was any correlation with diagnostic accuracy. We also tested the hypothesis that problem formulation is a creative ability in which knowledge of signs and symptoms of a disease may be necessary but insufficient for making a diagnosis.

Seven small groups of third year medical students (total 32 students) doing their Internal Medicine rotation were successively presented with a complex case of endocarditis from the New England Journal of Medicine [ 8 ]. This was done in the context of clinical reasoning sessions in which students are presented with an unknown case followed by interactive discussion on solving the problem. The dual purpose was to learn about a specific infectious disease and foster reflective practice. Students were instructed to regroup the clinical information into the smallest number of concepts that represent the whole case and to write it down as their best possible formulation of the problem. They were given a written example of what was expected. They were given all the time required to complete the task. At the end, they were asked for their diagnosis. This was followed by interactive discussion that included teaching about the disease and resolution of the case at the end. The duration of the session was two hours.

We considered five elements or observations to be essential to the case, namely the occurrence of pneumonia, the positive blood cultures with Streptococcus pneumoniae, the complication of endophthalmitis, the presence of a heart murmur, and the development of hemodynamic problems. These were used for presenting a formulated problem. The formulated problem was constructed according to accepted criteria for the clinical diagnosis of endocarditis [ 9 ].

We analyzed the students' problem formulations for the following characteristics: the presence or absence of the above essential elements, the use of higher-order concepts (new conceptual categories), and the mention of relations between concepts. As higher-order concepts we counted any introduction of a new conceptual category from the observation terms. These included semantic qualifiers as described by Bordage [ 10 ], in which the content of an observation is given an abstract form along oppositional relationships. For example, pneumonia could be qualified as acute or chronic. We also included any abstraction into a larger set, such as a generalization from multiple terms. For example, the combination of leucopenia and use of corticosteroids could be subsumed under 'immunocompromised'.

The relations we looked for were related to causation. We looked for direct expressions of causation, in which the student stated that one element brings about or progresses into another, or that one element was an effect of another. We also looked for temporal relations of continuity or succession, in which the student stated that an event follows or is preceded by another. These are not necessarily causal but are often associated with causation. We examined whether there was any significant association between the characteristics of problem formulation and diagnostic accuracy using Fisher's exact test.

In order to examine the role of previous knowledge, we presented a formulated problem of the same case of endocarditis to five subsequent groups of third year students (total 25 students) also doing their rotation in Internal Medicine. The students were asked to make a diagnosis based on the following formulated problem: "A 43 year old man presents with a history of complicated pneumococcal pneumonia with empyema, positive blood cultures, and subsequent development of left eye endophthalmitis, heart murmur and sudden onset of heart failure requiring intubation". The problem was used as an illustration of the importance of problem formulation and was followed by different case presentations.

The study was reviewed and approved by the Education Research committee of the Faculty of Medicine of the University of Ottawa. No further ethical approval was requested as the primary focus of these sessions is teaching. The clinical reasoning sessions are part of the regular curriculum for third year medical students. Our study is a report on observations made during teaching sessions. The collection of data was made with the informed consent of participating students and participation in the sessions was voluntary.

12/32 students presented with the case made a correct diagnosis of endocarditis.

We observed significant differences in the nature of problem formulation between students. Students with the correct diagnosis were more likely to use higher-order concepts and to make relations between concepts explicit than those with incorrect diagnosis (p < 0.05) (Table 1). There was no difference in the frequency of including any of the five essential elements between students who had the right diagnosis and those who did not (Table 2).

The role of relations between concepts was particularly revealing. Although the students who missed the diagnosis elicited the same number of relevant clinical findings as those who made the diagnosis, they were unlikely to make explicit any relation between concepts (1/20).

The group of students presented with a formulated case seemed to have sufficient knowledge to recognize the disease. When first presented with a synthesis of the case, a high proportion of those students were able to make an accurate diagnosis (19/25) compared with the groups of students presented with the original complex case (12/32) (p < 0.05).
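As a rough illustration of how the comparison above could be checked, the following sketch runs a two-sided Fisher's exact test on the two reported proportions (12/32 and 19/25). It assumes that the same Fisher's exact test named in the methods was applied to this between-group comparison, which the text does not state explicitly. The function name `fisher_exact_two_sided` and the table layout are illustrative choices, not part of the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper written for this illustration only.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, row1)

    def prob(k: int) -> float:
        # Hypergeometric probability that the top-left cell equals k
        # when all row and column totals are held fixed.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Counts reported in the text:
# original case:      12 correct, 20 incorrect (12/32)
# formulated problem: 19 correct,  6 incorrect (19/25)
p_value = fisher_exact_two_sided(12, 20, 19, 6)
print(f"two-sided p = {p_value:.4f}")
```

The sketch uses only the standard library, so the hypergeometric sum is written out explicitly; a statistics package would normally be used instead.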

Our first set of observations suggests that the use of higher-order concepts and making explicit relations between concepts are associated with diagnostic accuracy. Bordage and Chang [ 10 , 11 ] have demonstrated that diagnostic accuracy correlates with semantic level of the problem's description. The use of semantic qualifiers that describe the content at a more abstract level was associated with better diagnostic accuracy. Abstractions from observation concepts constitute interpretations. Those interpretations confer additional meaning [ 12 ]. Qualifying pneumonia as recurrent or persistent is an interpretation of a series of events and adds to the meaning of the concept of pneumonia and may thus increase diagnostic accuracy.

In another experiment, Nendaz and Bordage [ 13 ] instructed students how to use more semantic qualifiers. They found that students could learn to introduce more semantic qualifiers but there was no difference in diagnostic accuracy suggesting that the use of semantic qualifiers correlates with diagnosis but may not be causal.

Our observations may provide a possible explanation for this apparent lack of causation between use of semantic qualifiers and diagnostic accuracy. We found that relations between concepts need to be established. Absence of explicit relations between concepts in problem formulation was associated with incorrect diagnosis. The use of conceptual abstraction may be necessary but not sufficient for problem formulation; a critical element is the structure resulting from relations established between concepts. Establishing meaningful relations has also been shown by Norman [ 14 ] to be important in problem solving. Physicians with different levels of experience were presented with complex nephrology problems and asked to solve them while thinking aloud. Experienced physicians solved the problems by clustering data into more meaningful relations than less experienced ones. We have observed that conceptual relations need to be established early in the problem formulation.

Why would establishing relations between concepts be associated with better diagnostic accuracy? Our observations suggest that the structure of problem formulation is analogous to that of a model of a theory [ 15 ]. In the semantic view of theory, a model can be a linguistic entity on which observation concepts are abstracted into theoretical terms. Those are organized in a structure which contains as minimal requirements: concepts and a set of relations or operations on those concepts. The relations are made explicit. With such a structure the model may represent the world and have an explanatory function [ 16 ]. Like a model, a problem formulation can make explicit functional and causal relations between concepts. A model of a case of endocarditis will link bacteremia, valvular disease and embolic phenomena in causal relationships and these have explanatory utility. The formulated problem will allow the physician to see the case as belonging to a 'theory' of endocarditis. What is shared between the model and the theory is not only a set of features of individual concepts but the same pattern of abstract relationships. Choosing the pertinent concepts and establishing relations most likely involve analytic and non-analytic processes [ 17 ]. Non-analytical recognition of similar cases from past experience has been associated with expertise [ 18 ] and likely involves seeing relations between concepts. On the other hand, analysis of specific features, weighting prior probability, and consideration for simplicity must also play a role in formulating the problem.

In the second set of observations, we found that a majority of students could readily recognize the disease when presented with an already formulated problem. This was in contrast to the groups of students presented with the original case. A possible explanation would be that students who made the diagnosis already had structured knowledge of endocarditis. It would be unexpected for those groups to have such structured knowledge, since there was no new course or teaching on the subject in those groups. Moreover, several preceding groups had significant difficulty in making the diagnosis.

More likely, the explanation is that when presented with the original complex case, the difficulty was in structuring the elements of the problem into a recognizable form. Very few students identified the regurgitant murmur as an essential feature. Possibly, when presented with the formulated case in which the murmur is mentioned, students recognized the disease as endocarditis. Faced with the large amount of clinical data of the original case, students may have had difficulty seeing relations between pertinent clinical features. Medin [ 19 ] has shown that when relationships between properties were exhibited, particularly causal relationships, subjects were better able to assign objects to a similar category.

There are several limitations to our study. These were observations made during teaching sessions. The groups of students were not randomized and were seen in sequence with the last five groups presented with the formulated case. It is possible that students who made the diagnosis had different prior experiences with similar cases or different knowledge structures. This should not invalidate the observations on the structure of problem formulation but would cast doubt on our hypothesis that problem formulation is a creative ability. The formulated problem and the criteria for higher-order concepts and what constitutes relations between concepts were not validated by other physicians. The formulated problem was based on accepted clinical criteria of endocarditis and would be unlikely to cause significant dissension among experts. The choice of higher-order concepts and relations would be subject to interpretation. However, the interpretation would concern more the type of relations or concepts than whether a relation is made explicit or not or whether a new conceptual category is introduced or not. No observational term was accepted as a new concept. This study was also limited to the use of one complex case. Conceptual analysis with abstractions and relations may not be so important in other cases. In many instances, the single ability to detect critical features may be more important. These observations may not apply to less complex cases. No doubt, there are many ways to adequately formulate a problem.

With practice, experts develop schema or scripts that enable them to recognize a situation as belonging to a certain class of problem [ 20 , 21 ]. Such schema can make sense of new complex information and consist of structured knowledge. There is now evidence that acquiring multiple representations of knowledge is more important in clinical reasoning than any particular strategy such as hypothetico-deductive reasoning [ 17 , 22 ]. In novel situations, schema may not be directly available for searching the relevant elements of the problem and one must build a new problem representation [ 1 ]. This process involves creative thinking. Cognitive and psychometric approaches in the study of creativity suggest that mental representations are involved in new combinations with processes like association, synthesis, analogical transfer or categorical reduction where elements are reduced to more primitive descriptions [ 23 ]. Our observations suggest that problem formulation involves not only abstraction but also making new relations between concepts and that it may be a creative ability for which knowledge of clinical features of a disease may be necessary but not sufficient.

Reeves WR: Cognition and complexity. The cognitive science of managing complexity. 1996, Lanham: Scarecrow Press

Lyles MA, Mitroff II: Organizational problem formulation: an empirical study. Administrative Science Quarterly. 1980, 25: 102-119. 10.2307/2392229.

Heylighen F: Formulating the problem of problem-formulation. Cybernetics and Systems. Edited by: Trappl R. 1988, Dordrecht: Kluwer Academic Publishers, 949-957.

Rosenhead J, Mingers J: A new paradigm of analysis. Rational Analysis for a Problematic World Revisited. Edited by: Rosenhead J, Mingers J. 2001, Chichester: John Wiley & Sons, Ltd, 1-19. 2

Kelsey JGT: Learning from teaching: problems, problem-formulation, and the enhancement of problem-solving capability. Cognitive Perspectives on Educational Leadership. Edited by: Hallinger P, Leithwood K, Murphy J. 1993, New York: Teacher College Press, 231-252.

Barrows Howard S, Pickell Garfield C: Developing Clinical Problem-Solving Skills. A Guide to More Effective Diagnosis and Treatment. 1991, New York: Norton Medical Books

Hatala R, Norman GR, Brooks LR: Influence of a single example on subsequent electrocardiogram interpretation. Teach Learn Med. 1999, 11: 110-117. 10.1207/S15328015TL110210.

Case Records of the Massachusetts General Hospital: Case 7–2003: A 43-year-old man with fever, rapid loss of vision of the left eye, and cardiac findings. N Engl J Med. Edited by: Rubin RH, King ME, Mark EJ. 2003, 348: 834-43. 10.1056/NEJMcpc020032.

Durack DT, Lukes AS, Bright DK: New criteria for diagnosis of infective endocarditis. Am J Med. 1994, 96: 200-209. 10.1016/0002-9343(94)90143-0.

Bordage G, Lemieux M: Semantic structures and diagnostic thinking of experts and novices. Acad Med. 1991, 66: S70-S72. 10.1097/00001888-199109000-00045.

Chang RW, Bordage G, Connell KJ: The importance of early problem representation during case presentations. Acad Med. 1998, 73: S109-S111. 10.1097/00001888-199810000-00062.

Bordage G, Connell KJ, Chang RW, Gecht MR, Sinacore JM: Assessing the semantic content of clinical case presentations: studies of reliability and concurrent validity. Acad Med. 1997, 72: S37-S39. 10.1097/00001888-199710000-00036.

Nendaz MR, Bordage G: Promoting diagnostic problem representation. Med Educ. 2002, 6: 760-766. 10.1046/j.1365-2923.2002.01279.x.

Norman GR, Trott AD, Brooks LR, Smith EKM: Cognitive differences in clinical reasoning related to postgraduate training. Teaching and learning in medicine. 1994, 6: 114-120.

Suppes P: Representation and invariance of scientific structures. 2002, Stanford: CSLI Publications

Giere RN: Using models to represent reality. Model-based reasoning in scientific discovery. Edited by: Magnani L, Nersessian NJ, Thagard P. 1999, New York: Kluwer Academic/Plenum Publishers, 41-57.

Eva KW: What every teacher needs to know about clinical reasoning. Med Educ. 2004, 39: 98-106. 10.1111/j.1365-2929.2004.01972.x.

Codere S, Mandin H, Harasym PH, Fick GH: Diagnostic reasoning strategies and diagnostic success. Med Educ. 2003, 37: 695-703. 10.1046/j.1365-2923.2003.01577.x.

Medin DL, Wattenmaker WD, Hampson SE: Family resemblance, conceptual cohesiveness, and category construction. Cognitive Psychology. 1987, 19: 242-279. 10.1016/0010-0285(87)90012-0.

Schank R, Abelson R: Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. 1977, Hillsdale: Lawrence Erlbaum

Schmidt HG, Norman GR, Boshuizen PA: A cognitive perspective on medical expertise: theory and implications. Acad Med. 1990, 65: 611-621. 10.1097/00001888-199010000-00001.

Norman G: Research in clinical reasoning: past history and current trends. Med Educ. 2005, 39: 418-427. 10.1111/j.1365-2929.2005.02127.x.

Sternberg RJ: Wisdom, Intelligence, and Creativity Synthesized. 2003, Cambridge: Cambridge University Press

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6920/7/16/prepub

This study was funded by a grant in Innovations in Medical Education and Education Research from the Department of Medicine of the University of Ottawa.

Author information

Authors and affiliations.

Department of Medicine, University of Ottawa, Canada

Francois Auclair

Corresponding author

Correspondence to Francois Auclair.

Additional information

Competing interests.

The author(s) declare that they have no competing interests.

Authors' contributions

The author has developed the clinical reasoning sessions in Infectious Diseases and has analysed the structure of the problems formulated by the medical students.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article.

Auclair, F. Problem formulation by medical students: an observation study. BMC Med Educ 7, 16 (2007). https://doi.org/10.1186/1472-6920-7-16


Received: 06 November 2006

Accepted: 17 June 2007

Published: 17 June 2007

DOI: https://doi.org/10.1186/1472-6920-7-16


  • Medical Student
  • Diagnostic Accuracy
  • Endocarditis
  • Problem Formulation
  • Endophthalmitis

BMC Medical Education

ISSN: 1472-6920


How Do You Formulate (Important) Hypotheses?

  • Open Access
  • First Online: 03 December 2022


  • James Hiebert 6 ,
  • Jinfa Cai 7 ,
  • Stephen Hwang 7 ,
  • Anne K Morris 6 &
  • Charles Hohensee 6  

Part of the book series: Research in Mathematics Education (RME)


Building on the ideas in Chap. 1, we describe formulating, testing, and revising hypotheses as a continuing cycle of clarifying what you want to study, making predictions about what you might find together with developing your reasons for these predictions, imagining tests of these predictions, revising your predictions and rationales, and so on. Many resources feed this process, including reading what others have found about similar phenomena, talking with colleagues, conducting pilot studies, and writing drafts as you revise your thinking. Although you might think you cannot predict what you will find, it is always possible—with enough reading and conversations and pilot studies—to make some good guesses. And, once you guess what you will find and write out the reasons for these guesses you are on your way to scientific inquiry. As you refine your hypotheses, you can assess their research importance by asking how connected they are to problems your research community really wants to solve.


Part I. Getting Started

We want to begin by addressing a question you might have had as you read the title of this chapter. You are likely to hear, or read in other sources, that the research process begins by asking research questions. For reasons we gave in Chap. 1, and more we will describe in this and later chapters, we emphasize formulating, testing, and revising hypotheses. However, it is important to know that asking and answering research questions involve many of the same activities, so we are not describing a completely different process.

We acknowledge that many researchers do not actually begin by formulating hypotheses. In other words, researchers rarely get a researchable idea by writing out a well-formulated hypothesis. Instead, their initial ideas for what they study come from a variety of sources. Then, after they have the idea for a study, they do lots of background reading and thinking and talking before they are ready to formulate a hypothesis. So, for readers who are at the very beginning and do not yet have an idea for a study, let’s back up. Where do research ideas come from?

There are no formulas or algorithms that spawn a researchable idea. But as you begin the process, you can ask yourself some questions. Your answers to these questions can help you move forward.

What are you curious about? What are you passionate about? What have you wondered about as an educator? These are questions that look inward, questions about yourself.

What do you think are the most pressing educational problems? Which problems are you in the best position to address? What change(s) do you think would help all students learn more productively? These are questions that look outward, questions about phenomena you have observed.

What are the main areas of research in the field? What are the big questions that are being asked? These are questions about the general landscape of the field.

What have you read about in the research literature that caught your attention? What have you read that prompted you to think about extending the profession’s knowledge about this? What have you read that made you ask, “I wonder why this is true?” These are questions about how you can build on what is known in the field.

What are some research questions or testable hypotheses that have been identified by other researchers for future research? This, too, is a question about how you can build on what is known in the field. Taking up such questions or hypotheses can help by providing some existing scaffolding that others have constructed.

What research is being done by your immediate colleagues or your advisor that is of interest to you? These are questions about topics for which you will likely receive local support.

Exercise 2.1

Brainstorm some answers for each set of questions. Record them. Then step back and look at the places of intersection. Did you have similar answers across several questions? Write out, as clearly as you can, the topic that captures your primary interest, at least at this point. We will give you a chance to update your responses as you study this book.

Part II. Paths from a General Interest to an Informed Hypothesis

There are many different paths you might take from conceiving an idea for a study, maybe even a vague idea, to formulating a prediction that leads to an informed hypothesis that can be tested. We will explore some of the paths we recommend.

We will assume you have completed Exercise 2.1 in Part I and have some written answers to the six questions that preceded it as well as a statement that describes your topic of interest. This very first statement could take several different forms: a description of a problem you want to study, a question you want to address, or a hypothesis you want to test. We recommend that you begin with one of these three forms, the one that makes most sense to you. There is an advantage to using all three and flexibly choosing the one that is most meaningful at the time and for a particular study. You can then move from one to the other as you think more about your research study and you develop your initial idea. To get a sense of how the process might unfold, consider the following alternative paths.

Beginning with a Prediction If You Have One

Sometimes, when you notice an educational problem or have a question about an educational situation or phenomenon, you quickly have an idea that might help solve the problem or answer the question. Here are three examples.

You are a teacher, and you noticed a problem with the way the textbook presented two related concepts in two consecutive lessons. Almost as soon as you noticed the problem, it occurred to you that the two lessons could be taught more effectively in the reverse order. You predicted better outcomes if the order was reversed, and you even had a preliminary rationale for why this would be true.

You are a graduate student and you read that students often misunderstand a particular aspect of graphing linear functions. You predicted that, by listening to small groups of students working together, you could hear new details that would help you understand this misconception.

You are a curriculum supervisor and you observed sixth-grade classrooms where students were learning about decimal fractions. After talking with several experienced teachers, you predicted that beginning with percentages might be a good way to introduce students to decimal fractions.

We begin with the path of making predictions because we see the other two paths as leading into this one at some point in the process (see Fig. 2.1). Starting with this path does not mean you did not sense a problem you wanted to solve or a question you wanted to answer.

Fig. 2.1 Three Pathways to Formulating Informed Hypotheses (flow diagram leading from a problem situation, through a question and a prediction, to a hypothesis)

Notice that your predictions can come from a variety of sources—your own experience, reading, and talking with colleagues. Most likely, as you write out your predictions you also think about the educational problem for which your prediction is a potential solution. Writing a clear description of the problem will be useful as you proceed. Notice also that it is easy to change each of your predictions into a question. When you formulate a prediction, you are actually answering a question, even though the question might be implicit. Making that implicit question explicit can generate a first draft of the research question that accompanies your prediction. For example, suppose you are the curriculum supervisor who predicts that teaching percentages first would be a good way to introduce decimal fractions. In an obvious shift in form, you could ask, “In what ways would teaching percentages benefit students’ initial learning of decimal fractions?”

A question simply asks what you will find, whereas a prediction also says what you expect to find.

There are advantages to starting with the prediction form if you can make an educated guess about what you will find. Making a prediction forces you to think now about several things you will need to think about at some point anyway. It is better to think about them earlier rather than later. If you state your prediction clearly and explicitly, you can begin to ask yourself three questions about your prediction: Why do I expect to observe what I am predicting? Why did I make that prediction? (These two questions essentially ask what your rationale is for your prediction.) And, how can I test to see if it’s right? This is where the benefits of making predictions begin.

Asking yourself why you predicted what you did, and then asking yourself why you answered the first “why” question as you did, can be a powerful chain of thought that lays the groundwork for an increasingly accurate prediction and an increasingly well-reasoned rationale. For example, suppose you are the curriculum supervisor above who predicted that beginning by teaching percentages would be a good way to introduce students to decimal fractions. Why did you make this prediction? Maybe because students are familiar with percentages in everyday life so they could use what they know to anchor their thinking about hundredths. Why would that be helpful? Because if students could connect hundredths in percentage form with hundredths in decimal fraction form, they could bring their meaning of percentages into decimal fractions. But how would that help? If students understood that a decimal fraction like 0.35 meant 35 of 100, then they could use their understanding of hundredths to explore the meaning of tenths, thousandths, and so on. Why would that be useful? By continuing to ask yourself why you gave the previous answer, you can begin building your rationale and, as you build your rationale, you will find yourself revisiting your prediction, often making it more precise and explicit. If you were the curriculum supervisor and continued the reasoning in the previous sentences, you might elaborate your prediction by specifying the way in which percentages should be taught in order to have a positive effect on particular aspects of students’ understanding of decimal fractions.

Developing a Rationale for Your Predictions

Keeping your initial predictions in mind, you can read what others already know about the phenomenon. Your reading can now become targeted with a clear purpose.

By reading and talking with colleagues, you can develop more complete reasons for your predictions. It is likely that you will also decide to revise your predictions based on what you learn from your reading. As you develop sound reasons for your predictions, you are creating your rationales, and your predictions together with your rationales become your hypotheses. The more you learn about what is already known about your research topic, the more refined will be your predictions and the clearer and more complete your rationales. We will use the term more informed hypotheses to describe this evolution of your hypotheses.


Developing more informed hypotheses is a good thing because it means: (1) you understand the reasons for your predictions; (2) you will be able to imagine how you can test your hypotheses; (3) you can more easily convince your colleagues that they are important hypotheses—they are hypotheses worth testing; and (4) at the end of your study, you will be able to more easily interpret the results of your test and to revise your hypotheses to demonstrate what you have learned by conducting the study.

Imagining Testing Your Hypotheses

Because we have tied together predictions and rationales to constitute hypotheses, testing hypotheses means testing predictions and rationales. Testing predictions means comparing empirical observations, or findings, with the predictions. Testing rationales means using these comparisons to evaluate the adequacy or soundness of the rationales.

Imagining how you might test your hypotheses does not mean working out the details for exactly how you would test them. Rather, it means thinking ahead about how you could do this. Recall the descriptor of scientific inquiry: “experience carefully planned in advance” (Fisher, 1935). Asking whether predictions are testable and whether rationales can be evaluated is simply planning in advance.

You might read that testing hypotheses means simply assessing whether predictions are correct or incorrect. In our view, it is more useful to think of testing as a means of gathering enough information to compare your findings with your predictions, revise your rationales, and propose more accurate predictions. So, asking yourself whether hypotheses can be tested means asking whether information could be collected to assess the accuracy of your predictions and whether the information will show you how to revise your rationales to sharpen your predictions.

Cycles of Building Rationales and Planning to Test Your Predictions

Scientific reasoning is a dialogue between the possible and the actual, an interplay between hypotheses and the logical expectations they give rise to: there is a restless to-and-fro motion of thought, the formulation and rectification of hypotheses (Medawar, 1982, p. 72).

As you ask yourself about how you could test your predictions, you will inevitably revise your rationales and sharpen your predictions. Your hypotheses will become more informed, more targeted, and more explicit. They will make clearer to you and others what, exactly, you plan to study.

When will you know that your hypotheses are clear and precise enough? Because of the way we define hypotheses, this question asks about both rationales and predictions. If a rationale you are building lets you make a number of quite different predictions that are equally plausible rather than a single, primary prediction, then your hypothesis needs further refinement by building a more complete and precise rationale. Also, if you cannot briefly describe to your colleagues a believable way to test your prediction, then you need to phrase it more clearly and precisely.

Each time you strengthen your rationales, you might need to adjust your predictions. And, each time you clarify your predictions, you might need to adjust your rationales. The cycle of going back and forth to keep your predictions and rationales tightly aligned has many payoffs down the road. Every decision you make from this point on will be in the interests of providing a transparent and convincing test of your hypotheses and explaining how the results of your test dictate specific revisions to your hypotheses. As you make these decisions (described in the succeeding chapters), you will probably return to clarify your hypotheses even further. But, you will be in a much better position, at each point, if you begin with well-informed hypotheses.

Beginning by Asking Questions to Clarify Your Interests

Instead of starting with predictions, a second path you might take devotes more time at the beginning to asking questions as you zero in on what you want to study. Some researchers suggest you start this way (e.g., Gournelos et al., 2019). Specifically, with this second path, the first statement you write to express your research interest would be a question. For example, you might ask, “Why do ninth-grade students change the way they think about linear equations after studying quadratic equations?” or “How do first graders solve simple arithmetic problems before they have been taught to add and subtract?”

The first phrasing of your question might be quite general or vague. As you think about your question and what you really want to know, you are likely to ask follow-up questions. These questions will almost always be more specific than your first question. The questions will also express more clearly what you want to know. So, the question “How do first graders solve simple arithmetic problems before they have been taught to add and subtract?” might evolve into “Before first graders have been taught to solve arithmetic problems, what strategies do they use to solve arithmetic problems with sums and products below 20?” As you read and learn about what others already know about your questions, you will continually revise your questions toward clearer, more explicit, and more precise versions that zero in on what you really want to know. The question above might become, “Before they are taught to solve arithmetic problems, what strategies do beginning first graders use to solve arithmetic problems with sums and products below 20 if they are read story problems and given physical counters to help them keep track of the quantities?”

Imagining Answers to Your Questions

If you monitor your own thinking as you ask questions, you are likely to begin forming some guesses about answers, even to the early versions of the questions. What do students learn about quadratic functions that influences changes in their proportional reasoning when dealing with linear functions? It could be that if you analyze the moments during instruction on quadratic equations that are extensions of the proportional reasoning involved in solving linear equations, there are times when students receive further experience reasoning proportionally. You might predict that these are the experiences that have a “backward transfer” effect (Hohensee, 2014).

These initial guesses about answers to your questions are your first predictions. The first predicted answers are likely to be hunches or fuzzy, vague guesses. This simply means you do not know very much yet about the question you are asking. Your first predictions, no matter how unfocused or tentative, represent the most you know at the time about the question you are asking. They help you gauge where you are in your thinking.

Shifting to the Hypothesis Formulation and Testing Path

Research questions can play an important role in the research process. They provide a succinct way of capturing your research interests and communicating them to others. When colleagues want to know about your work, they will often ask “What are your research questions?” It is good to have a ready answer.

However, research questions have limitations. They do not capture the three images of scientific inquiry presented in Chap. 1. Due, in part, to this less expansive depiction of the process, research questions do not take you very far. They do not provide a guide that leads you through the phases of conducting a study.

Consequently, when you can imagine an answer to your research question, we recommend that you move onto the hypothesis formulation and testing path. Imagining an answer to your question means you can make plausible predictions. You can now begin clarifying the reasons for your predictions and transform your early predictions into hypotheses (predictions along with rationales). We recommend you do this as soon as you have guesses about the answers to your questions because formulating, testing, and revising hypotheses offers a tool that puts you squarely on the path of scientific inquiry. It is a tool that can guide you through the entire process of conducting a research study.

This does not mean you are finished asking questions. Predictions are often created as answers to questions. So, we encourage you to continue asking questions to clarify what you want to know. But your target shifts from only asking questions to also proposing predictions for the answers and developing reasons the answers will be accurate predictions. It is by predicting answers, and explaining why you made those predictions, that you become engaged in scientific inquiry.

Cycles of Refining Questions and Predicting Answers

An example might provide a sense of how this process plays out. Suppose you are reading about Vygotsky’s (1987) zone of proximal development (ZPD), and you realize this concept might help you understand why your high school students had trouble learning exponential functions. Maybe they were outside this zone when you tried to teach exponential functions. In order to recognize students who would benefit from instruction, you might ask, “How can I identify students who are within the ZPD around exponential functions?” What would you predict? Maybe students in this ZPD are those who already had knowledge of related functions. You could write out some reasons for this prediction, like “students who understand linear and quadratic functions are more likely to extend their knowledge to exponential functions.” But what kind of data would you need to test this? What would count as “understanding”? Are linear and quadratic the functions you should assess? Even if they are, how could you tell whether students who scored well on tests of linear and quadratic functions were within the ZPD of exponential functions? How, in the end, would you measure what it means to be in this ZPD? So, asking a series of reasonable questions raised some red flags about the way your initial question was phrased, and you decide to revise it.

You set the stage for revising your question by defining ZPD as the zone within which students can solve an exponential function problem by making only one additional conceptual connection between what they already know and exponential functions. Your revised question is, “Based on students’ knowledge of linear and quadratic functions, which students are within the ZPD of exponential functions?” This time you know what kind of data you need: the number of conceptual connections students need to bridge from their knowledge of related functions to exponential functions. How can you collect these data? Would you need to see into the minds of the students? Or, are there ways to test the number of conceptual connections someone makes to move from one topic to another? Do methods exist for gathering these data? You decide this is not realistic, so you now have a choice: revise the question further or move your research in a different direction.

Notice that we do not use the term research question for all these early versions of questions that begin clarifying for yourself what you want to study. These early versions are too vague and general to be called research questions. In this book, we save the term research question for a question that comes near the end of the work and captures exactly what you want to study. By the time you are ready to specify a research question, you will be thinking about your study in terms of hypotheses and tests. When your hypotheses are in final form and include clear predictions about what you will find, it will be easy to state the research questions that accompany your predictions.

To reiterate one of the key points of this chapter: hypotheses carry much more information than research questions. Using our definition, hypotheses include predictions about what the answer might be to the question plus reasons for why you think so. Unlike research questions, hypotheses capture all three images of scientific inquiry presented in Chap. 1 (planning, observing and explaining, and revising one’s thinking). Your hypotheses represent the most you know, at the moment, about your research topic. The same cannot be said for research questions.

Beginning with a Research Problem

When you wrote answers to the six questions at the end of Part I of this chapter, you might have identified a research interest by stating it as a problem. This is the third path you might take to begin your research. Perhaps your description of your problem might look something like this: “When I tried to teach my middle school students by presenting them with a challenging problem without showing them how to solve similar problems, they didn’t exert much effort trying to find a solution but instead waited for me to show them how to solve the problem.” You do not have a specific question in mind, and you do not have an idea for why the problem exists, so you do not have a prediction about how to solve it. Writing a statement of this problem as clearly as possible could be the first step in your research journey.

As you think more about this problem, it will feel natural to ask questions about it. For example, why did some students show more initiative than others? What could I have done to get them started? How could I have encouraged the students to keep trying without giving away the solution? You are now on the path of asking questions—not research questions yet, but questions that are helping you focus your interest.

As you continue to think about these questions, reflect on your own experience, and read what others know about this problem, you will likely develop some guesses about the answers to the questions. They might be somewhat vague answers, and you might not have lots of confidence they are correct, but they are guesses that you can turn into predictions. Now you are on the hypothesis-formulation-and-testing path. This means you are on the path of asking yourself why you believe the predictions are correct, developing rationales for the predictions, asking what kinds of empirical observations would test your predictions, and refining your rationales and predictions as you read the literature and talk with colleagues.

A simple diagram that summarizes the three paths we have described is shown in Fig. 2.1. Each row of arrows represents one pathway for formulating an informed hypothesis. The dotted arrows in the first two rows represent parts of the pathways that a researcher may have implicitly travelled through already (without an intent to form a prediction) but that ultimately inform the researcher’s development of a question or prediction.

Part III. One Researcher’s Experience Launching a Scientific Inquiry

Martha is in the third year of her doctoral program and is beginning to identify a topic for her dissertation. Based on (a) her experience as a high school mathematics teacher and a curriculum supervisor, (b) the reading she has done to this point, and (c) her conversations with her colleagues, she has developed an interest in what kinds of professional development experiences (let’s call them learning opportunities [LOs] for teachers) are most effective. Where does she go from here?

Exercise 2.2

Before you continue reading, please write down some suggestions for Martha about where she should start.

A natural thing for Martha to do at this point is to ask herself some additional questions, questions that specify further what she wants to learn: What kinds of LOs do most teachers experience? How do these experiences change teachers’ practices and beliefs? Are some LOs more effective than others? What makes them more effective?

To focus her questions and decide what she really wants to know, she continues reading but now targets her reading toward everything she can find that suggests possible answers to these questions. She also talks with her colleagues to get more ideas about possible answers to these or related questions. Over several weeks or months, she finds herself being drawn to questions about what makes LOs effective, especially for helping teachers teach more conceptually. She zeroes in on the question, “What makes LOs for teachers effective for improving their teaching for conceptual understanding?”

This question is more focused than her first questions, but it is still too general for Martha to define a research study. How does she know it is too general? She uses two criteria. First, she notices that the predictions she makes about the answers to the question are all over the place; they are not constrained by the reasons she has assembled for her predictions. One prediction is that LOs are more effective when they help teachers learn content. Martha makes this guess because previous research suggests that effective LOs for teachers include attention to content. But this rationale allows lots of different predictions. For example, LOs are more effective when they focus on the content teachers will teach; LOs are more effective when they focus on content beyond what teachers will teach so teachers see how their instruction fits with what their students will encounter later; and LOs are more effective when they are tailored to the level of content knowledge participants have when they begin the LOs. The rationale she can provide at this point does not point to a particular prediction.

A second criterion Martha uses to decide her question is too general is that the predictions she can make regarding the answers seem very difficult to test. How could she test, for example, whether LOs should focus on content beyond what teachers will teach? What does “content beyond what teachers teach” mean? How could she tell whether teachers use their new knowledge of later content to inform their teaching?

Before anticipating what Martha’s next question might be, it is important to pause and recognize how predicting the answers to her questions moved Martha into a new phase in the research process. As she makes predictions, works out the reasons for them, and imagines how she might test them, she is immersed in scientific inquiry. This intellectual work is the main engine that drives the research process. Also notice that revisions in the questions asked, the predictions made, and the rationales built represent the updated thinking (Chap. 1) that occurs as Martha continues to define her study.

Based on all these considerations and her continued reading, Martha revises the question again. The question now reads, “Do LOs that engage middle school mathematics teachers in studying mathematics content help teachers teach this same content with more of a conceptual emphasis?” Although she feels like the question is more specific, she realizes that the answer to the question is either “yes” or “no.” This, by itself, is a red flag. Answers of “yes” or “no” would not contribute much to understanding the relationships between these LOs for teachers and changes in their teaching. Recall from Chap. 1 that understanding how things work, explaining why things work, is the goal of scientific inquiry.

Martha continues by trying to understand why she believes the answer is “yes.” When she tries to write out reasons for predicting “yes,” she realizes that her prediction depends on a variety of factors. If teachers already have deep knowledge of the content, the LOs might not affect them as much as other teachers. If the LOs do not help teachers develop their own conceptual understanding, they are not likely to change their teaching. By trying to build the rationale for her prediction—thus formulating a hypothesis—Martha realizes that the question still is not precise and clear enough.

Martha uses what she learned when developing the rationale and rephrases the question as follows: “Under what conditions do LOs that engage middle school mathematics teachers in studying mathematics content help teachers teach this same content with more of a conceptual emphasis?” Through several additional cycles of thinking through the rationale for her predictions and how she might test them, Martha specifies her question even further: “Under what conditions do middle school teachers who lack conceptual knowledge of linear functions benefit from LOs that engage them in conceptual learning of linear functions as assessed by changes in their teaching toward a more conceptual emphasis on linear functions?”

Each version of Martha’s question has become more specific. This has occurred as she has (a) identified a starting condition for the teachers—they lack conceptual knowledge of linear functions, (b) specified the mathematics content as linear functions, and (c) included a condition or purpose of the LO—it is aimed at conceptual learning.

Because of the way Martha’s question is now phrased, her predictions will require thinking about the conditions that could influence what teachers learn from the LOs and how this learning could affect their teaching. She might predict that if teachers engaged in LOs that extended over multiple sessions, they would develop deeper understanding which would, in turn, prompt changes in their teaching. Or she might predict that if the LOs included examples of how their conceptual learning could translate into different instructional activities for their students, teachers would be more likely to change their teaching. Reasons for these predictions would likely come from research about the effects of professional development on teachers’ practice.

As Martha thinks about testing her predictions, she realizes it will probably be easier to measure the conditions under which teachers are learning than the changes in the conceptual emphasis in their instruction. She makes a note to continue searching the literature for ways to measure the “conceptualness” of teaching.

As she refines her predictions and expresses her reasons for the predictions, she formulates a hypothesis (in this case several hypotheses) that will guide her research. As she makes predictions and develops the rationales for these predictions, she will probably continue revising her question. She might decide, for example, that she is not interested in studying the condition of different numbers of LO sessions and so decides to remove this condition from consideration by including in her question something like “. . . over five 2-hour sessions . . .”

At this point, Martha has developed a research question, articulated a number of predictions, and developed rationales for them. Her current question is: “Under what conditions do middle school teachers who lack conceptual knowledge of linear functions benefit from five 2-hour LO sessions that engage them in conceptual learning of linear functions as assessed by changes in their teaching toward a more conceptual emphasis on linear functions?” Her hypothesis is:

Prediction: Participating teachers will show changes in their teaching with a greater emphasis on conceptual understanding, with larger changes on linear function topics directly addressed in the LOs than on other topics.

Brief Description of Rationale: (1) Past research has shown correlations between teachers’ specific mathematics knowledge of a topic and the quality of their teaching of that topic. This does not mean an increase in knowledge causes higher quality teaching but it allows for that possibility. (2) Transfer is usually difficult for teachers, but the examples developed during the LO sessions will help them use what they learned to teach for conceptual understanding. This is because the examples developed during the LO sessions are much like those that will be used by the teachers. So larger changes will be found when teachers are teaching the linear function topics addressed in the LOs.

Notice it is more straightforward to imagine how Martha could test this prediction because it is more precise than previous predictions. Notice also that by asking how to test a particular prediction, Martha will be faced with a decision about whether testing this prediction will tell her something she wants to learn. If not, she can return to the research question and consider how to specify it further and, perhaps, constrain further the conditions that could affect the data.

As Martha formulates her hypotheses and goes through multiple cycles of refining her question(s), articulating her predictions, and developing her rationales, she is constantly building the theoretical framework for her study. Because the theoretical framework is the topic for Chap. 3, we will pause here and pick up Martha’s story in the next chapter. Spoiler alert: Martha’s experience contains some surprising twists and turns.

Before leaving Martha, however, we point out two aspects of the process in which she has been engaged. First, it can be useful to think about the process as identifying (1) the variables targeted in her predictions, (2) the mechanisms she believes explain the relationships among the variables, and (3) the definitions of all the terms that are special to her educational problem. By variables, we mean things that can be measured and, when measured, can take on different values. In Martha’s case, the variables are the conceptualness of teaching and the content topics addressed in the LOs. The mechanisms are cognitive processes that enable teachers to see the relevance of what they learn in professional development (PD) to their own teaching and that enable the transfer of learning from one setting to another. Definitions are the precise descriptions of how the important ideas relevant to the research are conceptualized. In Martha’s case, definitions must be provided for terms like conceptual understanding, linear functions, LOs, each of the topics related to linear functions, instructional setting, and knowledge transfer.

A second aspect of the process is a practice that Martha acquired as part of her graduate program, a practice that can go unnoticed. Martha writes out, in full sentences, her thinking as she wrestles with her research question, her predictions of the answers, and the rationales for her predictions. Writing is a tool for organizing thinking and we recommend you use it throughout the scientific inquiry process. We say more about this at the end of the chapter.

Here are the questions Martha wrote as she developed a clearer sense of what question she wanted to answer and what answer she predicted. The list shows the increasing refinement that occurred as she continued to read, think, talk, and write.

Early questions: What kinds of LOs do most teachers experience? How do these experiences change teachers’ practices and beliefs? Are some LOs more effective than others? What makes them more effective?

First focused question: What makes LOs for teachers effective for improving their teaching for conceptual understanding?

Question after trying to predict the answer and imagining how to test the prediction: Do LOs that engage middle school mathematics teachers in studying mathematics content help teachers teach this same content with more of a conceptual emphasis?

Question after developing an initial rationale for her prediction: Under what conditions do LOs that engage middle school mathematics teachers in studying mathematics content help teachers teach this same content with more of a conceptual emphasis?

Question after developing a more precise prediction and richer rationale: Under what conditions do middle school teachers who lack conceptual knowledge of linear functions benefit from five 2-hour LO sessions that engage them in conceptual learning of linear functions as assessed by changes in their teaching toward a more conceptual emphasis on linear functions?

Part IV. An Illustrative Dialogue

The story of Martha described the major steps she took to refine her thinking. However, there is a lot of work that went on behind the scenes that wasn’t part of the story. For example, Martha had conversations with fellow students and professors that sharpened her thinking. What do these conversations look like? Because they are such an important part of the inquiry process, it will be helpful to “listen in” on the kinds of conversations that students might have with their advisors.

Here is a dialogue between a beginning student, Sam (S), and their advisor, Dr. Avery (A). They are meeting to discuss data Sam collected for a course project. The dialogue below is happening very early on in Sam’s conceptualization of the study, prior even to systematic reading of the literature. Sam speaks first, and the two alternate turns from there; our comments on the conversation appear between parts of the dialogue.

Thanks for meeting with me today. As you know, I was able to collect some data for a course project a few weeks ago, but I’m having trouble analyzing the data, so I need your help. Let me try to explain the problem. As you know, I wanted to understand what middle-school teachers do to promote girls’ achievement in a mathematics class. I conducted four observations in each of three teachers’ classrooms. I also interviewed each teacher once about the four lessons I observed, and I interviewed two girls from each of the teachers’ classes. Obviously, I have a ton of data. But when I look at all these data, I don’t really know what I learned about my topic. When I was observing the teachers, I thought I might have observed some ways the teachers were promoting girls’ achievement, but then I wasn’t sure how to interpret my data. I didn’t know if the things I was observing were actually promoting girls’ achievement.

What were some of your observations?

Well, in a couple of my classroom observations, teachers called on girls to give an answer, even when the girls didn’t have their hands up. I thought that this might be a way that teachers were promoting the girls’ achievement. But then the girls didn’t say anything about that when I interviewed them and also the teachers didn’t do it in every class. So, it’s hard to know what effect, if any, this might have had on their learning or their motivation to learn. I didn’t want to ask the girls during the interview specifically about the teacher calling on them, and without the girls bringing it up themselves, I didn’t know if it had any effect.

Well, why didn’t you want to ask the girls about being called on?

Because I wanted to leave it as open as possible; I didn’t want to influence what they were going to say. I didn’t want to put words in their mouths. I wanted to know what they thought the teacher was doing that promoted their mathematical achievement and so I only asked the girls general questions, like “Do you think the teacher does things to promote girls’ mathematical achievement?” and “Can you describe specific experiences you have had that you believe do and do not promote your mathematical achievement?”

So then, how did they answer those general questions?

Well, with very general answers, such as that the teacher knows their names, offers review sessions, grades their homework fairly, gives them opportunities to earn extra credit, lets them ask questions, and always answers their questions. Nothing specific that helps me know what teaching actions specifically target girls’ mathematics achievement.

OK. Any ideas about what you might do next?

Well, I remember that when I was planning this data collection for my course, you suggested I might want to be more targeted and specific about what I was looking for. I can see now that more targeted questions would have made my data more interpretable in terms of connecting teaching actions to the mathematical achievement of girls. But I just didn’t want to influence what the girls would say.

Yes, I remember when you were planning your course project, you wanted to keep it open. You didn’t want to miss out on discovering something new and interesting. What do you think now about this issue?

Well, I still don’t want to put words in their mouths. I want to know what they think. But I see that if I ask really open questions, I have no guarantee they will talk about what I want them to talk about. I guess I still like the idea of an open study, but I see that it’s a risky approach. Leaving the questions too open meant I didn’t constrain their responses and there were too many ways they could interpret and answer the questions. And there are too many ways I could interpret their responses.

By this point in the dialogue, Sam has realized that open data (i.e., data not testing a specific prediction) is difficult to interpret. In the next part, Dr. Avery explains why collecting open data was not helping Sam achieve the goals for their study that had motivated collecting open data in the first place.

Yes, I totally agree. Even for an experienced researcher, it can be difficult to make sense of this kind of open, messy data. However, if you design a study with a more specific focus, you can create questions for participants that are more targeted because you will be interested in their answers to these specific questions. Let’s reflect back on your data collection. What can you learn from it for the future?

When I think about it now, I realize that I didn’t think about the distinction between all the different constructs at play in my study, and I didn’t choose which one I was focusing on. One construct was the teaching moves that teachers think could be promoting achievement. Another was what teachers deliberately do to promote girls’ mathematics achievement, if anything. Another was the teaching moves that actually do support girls’ mathematics achievement. Another was what teachers were doing that supported girls’ mathematics achievement versus the mathematics achievement of all students. Another was students’ perception of what their teacher was doing to promote girls’ mathematics achievement. I now see that any one of these constructs could have been the focus of a study and that I didn’t really decide which of these was the focus of my course project prior to collecting data.

So, since you told me that the topic of this course project is probably what you’ll eventually want to study for your dissertation, which of these constructs are you most interested in?

I think I’m more interested in the teacher moves that teachers deliberately do to promote girls’ achievement. But I’m still worried about asking teachers directly and getting too specific about what they do because I don’t want to bias what they will say. And I chose qualitative methods and an exploratory design because I thought it would allow for a more open approach, an approach that helps me see what’s going on and that doesn’t bias or predetermine the results.

Well, it seems to me you are conflating three issues. One issue is how to conduct an unbiased study. Another issue is how specific to make your study. And the third issue is whether or not to choose an exploratory or qualitative study design. Those three issues are not the same. For example, designing a study that’s more open or more exploratory is not how researchers make studies fair and unbiased. In fact, it would be quite easy to create an open study that is biased. For example, you could ask very open questions and then interpret the responses in a way that unintentionally, and even unknowingly, aligns with what you were hoping the findings would say. Actually, you could argue that by adding more specificity and narrowing your focus, you’re creating constraints that prevent bias. The same goes for an exploratory or qualitative study; they can be biased or unbiased. So, let’s talk about what is meant by getting more specific. Within your new focus on what teachers deliberately do, there are many things that would be interesting to look at, such as teacher moves that address math anxiety, moves that allow girls to answer questions more frequently, moves that are specifically fitted to student thinking about specific mathematical content, and so on. What are one or two things that are most interesting to you? One way to answer this question is by thinking back to where your interest in this topic began.

In the preceding part of the dialogue, Dr. Avery explained how the goals Sam had for their study were not being met with open data. In the next part, Sam begins to articulate a prediction, which Sam and Dr. Avery then sharpen.

Actually, I became interested in this topic because of an experience I had in college when I was in a class of mostly girls. During whole class discussions, we were supposed to critically evaluate each other’s mathematical thinking, but we were too polite to do that. Instead, we just praised each other’s work. But it was so different in our small groups. It seemed easier to critique each other’s thinking and to push each other to better solutions in small groups. I began wondering how to get girls to be more critical of each other’s thinking in a whole class discussion in order to push everyone’s thinking.

Okay, this is great information. Why not use this idea to zoom in on a more manageable and interpretable study? You could look specifically at how teachers support girls in critically evaluating each other’s thinking during whole class discussions. That would be a much more targeted and specific topic. Do you have predictions about what teachers could do in that situation, keeping in mind that you are looking specifically at girls’ mathematical achievement, not students in general?

Well, what I noticed was that small groups provided more social and emotional support for girls, whereas the whole class discussion did not provide that same support. The girls felt more comfortable critiquing each other’s thinking in small groups. So, I guess I predict that when the social and emotional supports that are present in small groups are extended to the whole class discussion, girls would be more willing to evaluate each other’s mathematical thinking critically during whole class discussion. I guess ultimately, I’d like to know how the whole class discussion could be used to enhance, rather than undermine, the social and emotional support that is present in the small groups.

Okay, then where would you start? Would you start with a study of what the teachers say they will do during whole class discussion and then observe if that happens during whole class discussion?

But part of my prediction also involves the small groups. So, I’d also like to include small groups in my study if possible. If I focus on whole groups, I won’t be exploring what I am interested in. My interest is broader than just the whole class discussion.

That makes sense, but there are many different things you could look at as part of your prediction, more than you can do in one study. For instance, if your prediction is that when the social and emotional supports that are present in small groups are extended to whole class discussions, girls would be more willing to evaluate each other’s mathematical thinking critically during whole class discussions, then you could ask the following questions: What are the social and emotional supports that are present in small groups? In which small groups do they exist? Is it groups that are made up only of girls? Does every small group do this, and for groups that do this, when do these supports get created? What kinds of small group activities that teachers ask them to work on are associated with these supports? Do the same social and emotional supports that apply to small groups even apply to whole group discussion?

All your questions make me realize that my prediction about extending social and emotional supports to whole class discussions first requires me to have a better understanding of the social and emotional supports that exist in small groups. In fact, I first need to find out whether those supports commonly exist in small groups or whether that is just my experience working in small groups. So, I think I will first have to figure out what small groups do to support each other and then, in a later study, I could ask a teacher to implement those supports during whole class discussions and find out how that can be done. Yeah, now I’m seeing that.

The previous part of the dialogue illustrates how continuing to ask questions about one’s initial prediction is a good way to make it more and more precise (and researchable). In the next part, we see how developing a precise prediction has the added benefit of setting the researcher up for future studies.

Yes, I agree that for your first study, you should probably look at small groups. In other words, you should focus on only a part of your prediction for now, namely the part that says there are social and emotional supports in small groups that support girls in critiquing each other’s thinking. That begins to sharpen the focus of your prediction, but you’ll want to continue to refine it. For example, right now, the question that this prediction leads to is a question with a yes or no answer, but what you’ve said so far suggests to me that you are looking for more than that.

Yes, I want to know more than just whether there are supports. I’d like to know what kinds. That’s why I wanted to do a qualitative study.

Okay, this aligns more with my thinking about research as being prediction driven. It’s about collecting data that would help you revise your existing predictions into better ones. What I mean is that you would focus on collecting data that would allow you to refine your prediction, make it more nuanced, and go beyond what is already known. Does that make sense, and if so, what would that look like for your prediction?

Oh yes, I like that. I guess that would mean that, based on the data I collect for this next study, I could develop a more refined prediction that, for example, more specifically identifies and differentiates between different kinds of social and emotional supports that are present in small groups, or maybe that identifies the kinds of small groups that they occur in, or that predicts when and how frequently or infrequently they occur, or about the features of the small group tasks in which they occur, etc. I now realize that, although I chose qualitative research to make my study be more open, really the reason qualitative research fits my purposes is because it will allow me to explore fine-grained aspects of social and emotional supports that may exist for girls in small groups.

Yes, exactly! And then, based on the data you collect, you can include in your revised prediction those new fine-grained aspects. Furthermore, you will have a story to tell about your study in your written report, namely the story about your evolving prediction. In other words, your written report can largely tell how you filled out and refined your prediction as you learned more from carrying out the study. And even though you might not use them right away, you are also going to be able to develop new predictions that you would not have even thought of about social and emotional supports in small groups and your aim of extending them to whole-class discussions, had you not done this study. That will set you up to follow up on those new predictions in future studies. For example, you might have more refined ideas after you collect the data about the goals for critiquing student thinking in small groups versus the goals for critiquing student thinking during whole class discussion. You might even begin to think that some of the social and emotional supports you observe are not even replicable or even applicable to or appropriate for whole-class discussions, because the supports play different roles in different contexts. So, to summarize what I’m saying, what you look at in this study, even though it will be very focused, sets you up for a research program that will allow you to more fully investigate your broader interest in this topic, where each new study builds on your prior body of work. That’s why it is so important to be explicit about the best place to start this research, so that you can build on it.

I see what you are saying. We started this conversation talking about my course project data. What I think I should have done was figure out explicitly what I needed to learn with that study with the intention of then taking what I learned and using it as the basis for the next study. I didn’t do that, and so I didn’t collect data that pushed forward my thinking in ways that would guide my next study. It would be as if I was starting over with my next study.

Sam and Dr. Avery have just explored how specifying a prediction reveals additional complexities that could become fodder for developing a systematic research program. Next, we watch Sam beginning to recognize the level of specificity required for a prediction to be testable.

One thing that would have really helped would have been if you had had a specific prediction going into your data collection for your course project.

Well, I didn’t really have much of an explicit prediction in mind when I designed my methods.

Think back, you must have had some kind of prediction, even if it was implicit.

Well, yes, I guess I was predicting that teachers would enact moves that supported girls’ mathematical achievement. And I observed classrooms to identify those teacher moves, I interviewed teachers to ask them about the moves I observed, and I interviewed students to see if they mentioned those moves as promoting their mathematical achievement. The goal of my course project was to identify teacher moves that support girls’ mathematical achievement. And my specific research question was: What teacher moves support girls’ mathematical achievement?

So, really you were asking the teacher and students to show and tell you what those moves are and the effects of those moves, as a result putting the onus on your participants to provide the answers to your research question for you. I have an idea: let’s try a thought experiment. You come up with data collection methods for testing the prediction that there are social and emotional supports in small groups that support girls in critiquing each other’s thinking, methods that still put the onus on the participants. And then I’ll see if I can think of data collection methods that would not put the onus on the participants.

Hmm, well... I guess I could simply interview girls who participated in small groups and ask them, “Are there social and emotional supports that you use in small groups that support your group in critiquing each other’s thinking and, if so, what are they?” In that case, I would be putting the onus on them to be aware of the social dynamics of small groups and to have thought about these constructs as much as I have. Okay, now can you continue the thought experiment? What might the data collection methods look like if I didn’t put the onus on the participants?

First, I would pick a setting in which it was only girls at this point to reduce the number of variables. Then, personally I would want to observe a lot of groups of girls interacting in groups around tasks. I would be looking for instances when the conversation about students’ ideas was shut down and instances when the conversation about students’ ideas involved critiquing of ideas and building on each other’s thinking. I would also look at what happened just before and during those instances, such as: did the student continue to talk after their thinking was critiqued, did other students do anything to encourage the student to build on their own thinking (i.e., constructive criticism), or how did they support or shut down continued participation. In fact, now that I think about it, “critiquing each other’s thinking” can be defined in a number of different ways. It could mean just commenting on someone’s thinking, judging correctness and incorrectness, constructive criticism that moves the thinking forward, etc. If you put the onus on the participants to answer your research question, you are stuck with their definition, and they won’t have thought about this very much, if at all.

I think that what you are also saying is that my definitions would affect my data collection. If I think that critiquing each other’s thinking means that the group moves their thinking forward toward more valid and complete mathematical solutions, then I’m going to focus on different moves than if I define it another way, such as just making a comment on each other’s thinking and making each other feel comfortable enough to keep participating. In fact, am I going to look at individual instances of critiquing or look at entire sequences in which the critiquing leads to a goal? This seems like a unit of analysis question, and I would need to develop a more nuanced prediction that would make explicit what that unit of analysis is.

I agree, your definition of “critiquing each other’s thinking” could entirely change what you are predicting. One prediction could be based on defining critiquing as a one-shot event in which someone makes one comment on another person’s thinking. In this case the prediction would be that there are social and emotional supports in small groups that support girls in making an evaluative comment on another student’s thinking. Another prediction could be based on defining critiquing as a back-and-forth process in which the thinking gets built on and refined. In that case, the prediction would be something like that there are social and emotional supports in small groups that support girls in critiquing each other’s thinking in ways that do not shut down the conversation but that lead to sustained conversations that move each other toward more valid and complete solutions.

Well, I think I am more interested in the second prediction because it is more compatible with my long-term interests, which are that I’m interested in extending small group supports to whole class discussions. The second prediction is more appropriate for eventually looking at girls in whole class discussion. During whole class discussion, the teacher tries to get a sustained conversation going that moves the students’ thinking forward. So, if I learn about small group supports that lead to sustained conversations that move each other toward more valid and complete solutions, those supports might transfer to whole class discussions.

In the previous part of the dialogue, Dr. Avery and Sam showed how narrowing down a prediction to one that is testable requires making numerous important decisions, including how to define the constructs referred to in the prediction. In the final part of the dialogue, Dr. Avery and Sam begin to outline the reading Sam will have to do to develop a rationale for the specific prediction.

Do you see how your prediction and definitions are getting more and more specific? You now need to read extensively to further refine your prediction.

Well, I should probably read about micro dynamics of small group interactions, anything about interactions in small groups, and what is already known about small group interactions that support sustained conversations that move students’ thinking toward more valid and complete solutions. I guess I could also look at research on whole-class discussion methods that support sustained conversations that move the class to more mathematically valid and complete solutions, because it might give me ideas for what to look for in the small groups. I might also need to focus on research about how learners develop understandings about a particular subject matter so that I know what “more valid and complete solutions” look like. I also need to read about social and emotional supports but focus on how they support students cognitively, rather than in other ways.

Sounds good, let’s get together after you have processed some of this literature and we can talk about refining your prediction based on what you read and also the methods that will best suit testing that prediction.

Great! Thanks for meeting with me. I feel like I have a much better set of tools that push my own thinking forward and allow me to target something specific that will lead to more interpretable data.

Part V. Is It Always Possible to Formulate Hypotheses?

In Chap. 1, we noted you are likely to read that research does not require formulating hypotheses. Some sources describe doing research without making predictions and developing rationales for these predictions. Some researchers say you cannot always make predictions because you do not know enough about the situation. In fact, some argue for the value of not making predictions (e.g., Glaser & Holton, 2004; Merton, 1968; Nemirovsky, 2011). These are important points of view, so we will devote this section to discussing them.

Can You Always Predict What You Will Find?

One reason some researchers say you do not need to make predictions is that it can be difficult to imagine what you will find. This argument comes up most often for descriptive studies. Suppose you want to describe the nature of a situation you do not know much about. Can you still make a prediction about what you will find? We believe that, although you do not know exactly what you will find, you probably have a hunch or, at a minimum, a very fuzzy idea. It would be unusual to ask a question about a situation you want to know about without at least a fuzzy inkling of what you might find. The original question just would not occur to you. We acknowledge you might have only a vague idea of what you will find and you might not have much confidence in your prediction. However, we expect that if you monitor your own thinking, you will discover you have developed a suspicion along the way, regardless of how vague the suspicion might be. Through the cyclic process we discussed above, that suspicion or hunch gradually evolves and turns into a prediction.

The Benefits of Making Predictions Even When They Are Wrong: An Example from the 1970s

One of us was a graduate student at the University of Wisconsin in the late 1970s, assigned as a research assistant to a project that was investigating young children’s thinking about simple arithmetic. A new curriculum was being written, and the developers wanted to know how to introduce the earliest concepts and skills to kindergarten and first-grade children. The directors of the project did not know what to expect because, at the time, there was little research on five- and six-year-olds’ pre-instruction strategies for adding and subtracting.

After consulting what literature was available, talking with teachers, analyzing the nature of different types of addition and subtraction problems, and debating with each other, the research team formulated some hypotheses about children’s performance. Following the usual assumptions at the time and recognizing the new curriculum would introduce the concepts, the researchers predicted that, before instruction, most children would not be able to solve the problems. Based on the rationale that some young children did not yet recognize the simple form for written problems (e.g., 5 + 3 = ___), the researchers predicted that the best chance for success would be to read problems as stories (e.g., Jesse had 5 apples and then found 3 more. How many does she have now?). They reasoned that, even though children would have difficulty on all the problems, some story problems would be easier because the semantic structure is easier to follow. For example, they predicted the above story about adding 3 apples to 5 would be easier than a problem like, “Jesse had some apples in the refrigerator. She put in 2 more and now has 6. How many were in the refrigerator at the beginning?” Based on the rationale that children would need to count to solve the problems and that it can be difficult to keep track of the numbers, they predicted children would be more successful if they were given counters. Finally, accepting the common reasoning that larger numbers are more difficult than smaller numbers, they predicted children would be more successful if all the numbers in a problem were below 10.

Although these predictions were not very precise and the rationales were not strongly convincing, these hypotheses prompted the researchers to design the study to test their predictions. This meant they would collect data by presenting a variety of problems under a variety of conditions. Because the goal was to describe children’s thinking, problems were presented to students in individual interviews. Problems with different semantic structures were included, counters were available for some problems but not others, and some problems had sums to 9 whereas others had sums to 20 or more.

The punchline of this story is that gathering data under these conditions, prompted by the predictions, made all the difference in what the researchers learned. Contrary to predictions, children could solve addition and subtraction problems before instruction. Counters were important because almost all the solution strategies were based on counting, which meant that memory was an issue: many strategies require counting in two ways simultaneously. For example, subtracting 4 from 7 was usually solved by counting down from 7 while counting up from 1 to 4 to keep track of counting down. Because children acted out the stories with their counters, the semantic structure of the story was also important. Stories that were easier to act out were also easier to solve.
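The double-counting strategy described above can be made concrete with a short sketch. The Python below is only an illustration under stated assumptions: the function name `count_down_with_tracking` is hypothetical, and the choice to say the next-lower number on each step (rather than starting by saying 7) is one plausible variant, not a record of what any child in the study did.

```python
def count_down_with_tracking(start, take_away):
    """Subtract by counting down from `start` while a second count
    runs up from 1 to `take_away` to keep track of the steps taken.

    Assumption for illustration: each step says the next-lower number,
    so the last number said is the answer.
    """
    current = start
    steps = []  # pairs of (tracking count, number said while counting down)
    for tally in range(1, take_away + 1):
        current -= 1                 # count down by one
        steps.append((tally, current))  # tracking count goes up by one
    return current, steps


# Subtracting 4 from 7: two counts must be held in mind at once.
answer, steps = count_down_with_tracking(7, 4)
for tally, said in steps:
    print(f"tracking count {tally}: counting down to {said}")
print("answer:", answer)
```

The two simultaneous counts (the `tally` going up and `current` going down) are the reason memory becomes an issue, and the reason counters helped.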

To make a very long story very short, other researchers were, at about the same time, reporting similar results about children’s pre-instruction arithmetic capabilities. A clear pattern emerged regarding the relative difficulty of different problem types (semantic structures) and the strategies children used to solve each type. As the data were replicated, the researchers recognized that kindergarten and first-grade teachers could make good use of this information when they introduced simple arithmetic. This is how Cognitively Guided Instruction (CGI) was born (Carpenter et al., 1989; Fennema et al., 1996).

To reiterate, the point of this example is that the study conducted to describe children’s thinking would have looked quite different if the researchers had made no predictions. They would have had no reason to choose the particular problems and present them under different conditions. The fact that some of the predictions were completely wrong is not the point. The predictions created the conditions under which the predictions were tested which, in turn, created learning opportunities for the researchers that would not have existed without the predictions. The lesson is that even research that aims to simply describe a phenomenon can benefit from hypotheses. As signaled in Chap. 1, this also serves as another example of “failing productively.”

Suggestions for What to Do When You Do Not Have Predictions

There likely are exceptions to our claim about being able to make a prediction about what you will find. For example, there could be rare cases where researchers truly have no idea what they will find and can come up with no predictions and even no hunches. And, no research has been reported on related phenomena that would offer some guidance. If you find yourself in this position, we suggest one of three approaches: revise your question, conduct a pilot study, or choose another question.

Because there are many advantages to making predictions explicit and then writing out the reasons for these predictions, one approach is to adjust your question just enough to allow you to make a prediction. Perhaps you can build on descriptions that other researchers have provided for related situations and consider how you can extend this work. Building on previous descriptions will enable you to make predictions about the situation you want to describe.

A second approach is to conduct a small pilot study or, better, a series of small pilot studies to develop some preliminary ideas of what you might find. If you can identify a small sample of participants who are similar to those in your study, you can try out at least some of your research plans to help make and refine your predictions. As we detail later, you can also use pilot studies to check whether key aspects of your methods (e.g., tasks, interview questions, data collection methods) work as you expect.

A third approach is to return to your list of interests and choose one that has been studied previously. Sometimes this is the wisest choice. It is very difficult for beginning researchers to conduct research in brand-new areas where no hunches or predictions are possible. In addition, the contributions of this research can be limited. Recall the earlier story about one of us “failing productively” by completing a dissertation in a somewhat new area. If, after an exhaustive search, you find that no one has investigated the phenomenon in which you are interested or even related phenomena, it can be best to move in a different direction. You will read recommendations in other sources to find a “gap” in the research and develop a study to “fill the gap.” This can be helpful advice if the gap is very small. However, if the gap is large, too large to predict what you might find, the study will present severe challenges. It will be more productive to extend work that has already been done than to launch into an entirely new area.

Should You Always Try to Predict What You Will Find?

In short, our answer to the question in the heading is “yes.” But this calls for further explanation.

Suppose you want to observe a second-grade classroom in order to investigate how students talk about adding and subtracting whole numbers. You might think, “I don’t want to bias my thinking; I want to be completely open to what I see in the classroom.” Sam shared a similar point of view at the beginning of the dialogue: “I wanted to leave it as open as possible; I didn’t want to influence what they were going to say.” Some researchers say that beginning your research study by making predictions is inappropriate precisely because it will bias your observations and results. The argument is that by bringing a set of preconceptions, you will confirm what you expected to find and be blind to other observations and outcomes. The following quote illustrates this view: “The first step in gaining theoretical sensitivity is to enter the research setting with as few predetermined ideas as possible—especially logically deducted, a priori hypotheses. In this posture, the analyst is able to remain sensitive to the data by being able to record events and detect happenings without first having them filtered through and squared with pre-existing hypotheses and biases” (Glaser, 1978, pp. 2–3).

We take a different point of view. In fact, we believe there are several compelling reasons for making your predictions explicit.

Making Your Predictions Explicit Increases Your Chances of Productive Observations

Because your predictions are an extension of what is already known, they prepare you to identify more nuanced relationships that can advance our understanding of a phenomenon. For example, rather than simply noticing, in a general sense, that students talking about addition and subtraction leads them to better understandings, you might, based on your prediction, make the specific observation that talking about addition and subtraction in a particular way helps students to think more deeply about a particular concept related to addition and subtraction. Going into a study without predictions can bring less sensitivity rather than more to the study of a phenomenon. Drawing on knowledge about related phenomena by reading the literature and conducting pilot studies allows you to be much more sensitive and your observations to be more productive.

Making Your Predictions Explicit Allows You to Guard Against Biases

Some genres and methods of educational research are, in fact, rooted in philosophical traditions (e.g., Husserl, 1929/1973) that explicitly call for researchers to temporarily “bracket” or set aside existing theory as well as their prior knowledge and experience to better enter into the experience of the participants in the research. However, this does not mean ignoring one’s own knowledge and experience or turning a blind eye to what has been learned by others. Much more than the simplistic image of emptying one’s mind of preconceptions and implicit biases (arguably an impossible feat to begin with), the goal is to be as reflective as possible about one’s prior knowledge and conceptions and as transparent as possible about how they may guide observations and shape interpretations (Levitt et al., 2018).

We believe it is better to be honest about the predictions you are almost sure to have because then you can deliberately plan to minimize the chances they will influence what you find and how you interpret your results. For starters, it is important to recognize that acknowledging you have some guesses about what you will find does not make them more influential. Because you are likely to have them anyway, we recommend being explicit about what they are. It is easier to deal with biases that are explicit than those that lurk in the background and are not acknowledged.

What do we mean by “deal with biases”? Some journals require you to include a statement about your “positionality” with respect to the participants in your study and the observations you are making to gather data. Formulating clear hypotheses is, in our view, a direct response to this request. The reasons for your predictions are your explicit statements about your positionality. Often there are methodological strategies you can use to protect the study from undue influences of bias. In other words, making your vague predictions explicit can help you design your study so you minimize the bias of your findings.

Making Your Predictions Explicit Can Help You See What You Did Not Predict

Making your predictions explicit does not need to blind you to what is different than expected. It does not need to force you to see only what you want to see. Instead, it can actually increase your sensitivity to noticing features of the situation that are surprising, features you did not predict. Results can stand out when you did not expect to see them.

In contrast, not bringing your biases to consciousness might subtly shift your attention away from these unexpected results in ways that you are not aware of. This path can lead to claiming no biases and no unexpected findings without being conscious of them. You cannot observe everything, and some things inevitably will be overlooked. If you have predicted what you will see, you can design your study so that the unexpected results become more salient rather than less.

Returning to the example of observing a second-grade classroom, we note that the field already knows a great deal about how students talk about addition and subtraction. Being cognizant of what others have observed allows you to enter the classroom with some clear predictions about what will happen. The rationales for these predictions are based on all the related knowledge you have before stepping into the classroom, and the predictions and rationales help you to better deal with what you see. This is partly because you are likely to be surprised by the things you did not anticipate. There is almost always something that will surprise you because your predictions will almost always be incomplete or too general. This sensitivity to the unanticipated—the sense of surprise that sparks your curiosity—is an indication of your openness to the phenomenon you are studying.

Making Your Predictions Explicit Allows You to Plan in Advance

Recall from Chap. 1 the descriptor of scientific inquiry: “Experience carefully planned in advance.” If you make no predictions about what might happen, it is very difficult, if not impossible, to plan your study in advance. Again, you cannot observe everything, so you must make decisions about what you will observe. What kind of data will you plan to collect? Why would you collect these data instead of others? If you have no idea what to expect, on what basis will you make these consequential decisions? Even if your predictions are vague and your rationales for the predictions are a bit shaky, at least they provide a direction for your plan. They allow you to explain why you are planning this study and collecting these data. They allow you to “carefully plan in advance.”

Making Your Predictions Explicit Allows You to Put Your Rationales in Harm’s Way

Rationales are developed to justify the predictions. Rationales represent your best reasoning about the research problem you are studying. How can you tell whether your reasoning is sound? You can try it out with colleagues. However, the best way to test it is to put it in “harm’s way” (Cobb, Confrey, diSessa, Lehrer, & Schauble, 2003, p. 10). And the best approach to putting your reasoning in harm’s way is to test the predictions it generates. Regardless of whether you are conducting a qualitative or quantitative study, rationales can be improved only if they generate testable predictions. This is possible only if predictions are explicit and precise. As we described earlier, rationales are evaluated for their soundness and refined in light of the specific differences between predictions and empirical observations.

Making Your Predictions Explicit Forces You to Organize and Extend Your (and the Field’s) Thinking

By writing out your predictions (even hunches or fuzzy guesses) and by reflecting on why you have these predictions and making these reasons explicit for yourself, you are advancing your thinking about the questions you really want to answer. This means you are making progress toward formulating your research questions and your final hypotheses. Making more progress in your own thinking before you conduct your study increases the chances your study will be of higher quality and will be exactly the study you intended. Making predictions, developing rationales, and imagining tests are tools you can use to push your thinking forward before you even collect data.

Suppose you wonder how preservice teachers (PSTs) in your university’s teacher preparation program will solve particular kinds of math problems. You are interested in this question because you have noticed several PSTs solve them in unexpected ways. As you ask the question you want to answer, you make predictions about what you expect to see. When you reflect on why you made these predictions, you realize that some PSTs might use particular solution strategies because they were taught to use some of them in an earlier course, and they might believe you expect them to solve the problems in these ways. By being explicit about why you are making particular predictions, you realize that you might be answering a different question than you intend (“How much do PSTs remember from previous courses?” or even “To what extent do PSTs believe different instructors have similar expectations?”). Now you can either change your question or change the design of your study (i.e., the sample of students you will use) or both. You are advancing your thinking by being explicit about your predictions and why you are making them.

The Costs of Not Making Predictions

Avoiding making predictions, for whatever reason, comes with significant costs. It prevents you from learning very much about your research topic. It would require not reading related research, not talking with your colleagues, and not conducting pilot studies because, if you do, you are likely to find a prediction creeping into your thinking. Not doing these things would forgo the benefits of advancing your thinking before you collect data. It would amount to conducting the study with as little forethought as possible.

Part VI. How Do You Formulate Important Hypotheses?

We provided a partial answer in Chap. 1 to the question of a hypothesis’ importance when we encouraged considering the ultimate goal to which a study’s findings might contribute. You might want to reread Part III of Chap. 1 where we offered our opinions about the purposes of doing research. We also recommend reading the March 2019 editorial in the Journal for Research in Mathematics Education (Cai et al., 2019b) in which we address what constitutes important educational research.

As we argued in Chap. 1 and in the March 2019 editorial, a worthy ultimate goal for educational research is to improve the learning opportunities for all students. However, arguments can be made for other ultimate goals as well. To gauge the importance of your hypotheses, think about how clearly you can connect them to a goal the educational community considers important. In addition, given the descriptors of scientific inquiry proposed in Chap. 1, think about how testing your hypotheses will help you (and the community) understand what you are studying. Will you have a better explanation for the phenomenon after your study than before?

Although we address the question of importance again, and in more detail, in Chap. 5, it is useful to know here that you can determine the significance or importance of your hypotheses when you formulate them. The importance need not depend on the data you collect or the results you report. The importance can come from the fact that, based on the results of your study, you will be able to offer revised hypotheses that help the field better understand an important issue. In large part, it is these revised hypotheses rather than the data that determine a study’s importance.

A critical caveat to this discussion is that few hypotheses are self-evidently important. They are important only if you make the case for their importance. Even if you follow closely the guidelines we suggest for formulating an important hypothesis, you must develop an argument that convinces others. This argument will be presented in the research paper you write.

Consider Martha’s hypothesis presented earlier. When we left Martha, she predicted that “Participating teachers will show changes in their teaching with a greater emphasis on conceptual understanding with larger changes on linear function topics directly addressed in the LOs than on other topics.” For researchers and educators not intimately familiar with this area of research, it is not apparent why someone should spend a year or more conducting a dissertation to test this prediction. Her rationale, summarized earlier, begins to describe why this could be an important hypothesis. But it is by writing a clear argument that explains her rationale to readers that she will convince them of its importance.

How Martha fills in her rationale so she can create a clear written argument for its importance is taken up in Chap. 3. As we indicated, Martha’s work in this regard led her to make some interesting decisions, in part due to her own assessment of what was important.

Part VII. Beginning to Write the Research Paper for Your Study

It is common to think that researchers conduct a study and then, after the data are collected and analyzed, begin writing the paper about the study. We recommend an alternative, especially for beginning researchers. We believe it is better to write drafts of the paper at the same time you are planning and conducting your study. The paper will gradually evolve as you work through successive phases of the scientific inquiry process. Consequently, we will call this paper your evolving research paper.

You will use your evolving research paper to communicate your study, but you can also use writing as a tool for thinking and organizing your thinking while planning and conducting the study. When you use writing as a tool for thinking, you can write drafts of your ideas to check on the clarity of your thinking, and then you can step back and reflect on how to clarify it further. Be sure to avoid jargon and general terms that are not well defined. Ask yourself whether someone not in your field, maybe a sibling, a parent, or a friend, would be able to understand what you mean. You are likely to write multiple drafts with lots of scribbling, crossing out, and revising.

When you use writing as a tool for communicating, writing the best version of what you know before moving to the next phase will help you record your decisions and the reasons for them before you forget important details. This best-version-for-now paper also provides the basis for your thinking about the next phase of your scientific inquiry.

At this point in the process, you will be writing your (research) questions, the answers you predict, and the rationales for your predictions. The predictions you make should be direct answers to your research questions and should flow logically from (or be directly supported by) the rationales you present. In addition, you will have a written statement of the study’s purpose or, said another way, an argument for the importance of the hypotheses you will be testing. It is in the early sections of your paper that you will convince your audience about the importance of your hypotheses.

In our experience, presenting research questions is a more common form of stating the goal of a research study than presenting well-formulated hypotheses. Authors sometimes present a hypothesis, often as a simple prediction of what they might find. The hypothesis is then forgotten and not used to guide the analysis or interpretations of the findings. In other words, authors seldom use hypotheses to do the kind of work we describe. This means that many research articles you read will not treat hypotheses as we suggest. We believe these are missed opportunities to present research in a more compelling and informative way. We intend to provide enough guidance in the remaining chapters for you to feel comfortable organizing your evolving research paper around formulating, testing, and revising hypotheses.

While we were editing one of the leading research journals in mathematics education (JRME), we conducted a study of reviewers’ critiques of papers submitted to the journal. Two of the five most common concerns were: (1) the research questions were unclear, and (2) the answers to the questions did not make a substantial contribution to the field. These are likely to be major concerns for the reviewers of all research journals. We hope the knowledge and skills you have acquired working through this chapter will allow you to write the opening to your evolving research paper in a way that addresses these concerns. Much of the chapter should help make your research questions clear, and the prior section on formulating “important hypotheses” will help you convey the contribution of your study.

Exercise 2.3

Look back at your answers to the sets of questions before Part II of this chapter.

Think about how you would argue for the importance of your current interest.

Write your interest in the form of (1) a research problem, (2) a research question, and (3) a prediction with the beginnings of a rationale. You will update these as you read the remaining chapters.

Part VIII. The Heart of Scientific Inquiry

In this chapter, we have described the process of formulating hypotheses. This process is at the heart of scientific inquiry. It is where doing research begins. Conducting research always involves formulating, testing, and revising hypotheses. This is true regardless of your research questions and whether you are using qualitative, quantitative, or mixed methods. Without engaging in this process in a deliberate, intense, relentless way, your study will reveal less than it could. By engaging in this process, you are maximizing what you, and others, can learn from conducting your study.

In the next chapter, we build on the ideas we have developed in the first two chapters to describe the purpose and nature of theoretical frameworks. The term theoretical framework, along with closely related terms like conceptual framework, can be somewhat mysterious for beginning researchers and can seem like a requirement for writing a paper rather than an aid for conducting research. We will show how theoretical frameworks grow from formulating hypotheses, that is, from developing rationales for the predicted answers to your research questions. We will propose some practical suggestions for building theoretical frameworks and show how useful they can be. In addition, we will continue Martha’s story from the point at which we paused earlier: developing her theoretical framework.

Cai, J., Morris, A., Hohensee, C., Hwang, S., Robison, V., Cirillo, M., Kramer, S. L., & Hiebert, J. (2019b). Posing significant research questions. Journal for Research in Mathematics Education, 50(2), 114–120. https://doi.org/10.5951/jresematheduc.50.2.0114

Carpenter, T. P., Fennema, E., Peterson, P. L., Chiang, C. P., & Loef, M. (1989). Using knowledge of children’s mathematics thinking in classroom teaching: An experimental study. American Educational Research Journal, 26 (4), 385–531.

Fennema, E., Carpenter, T. P., Franke, M. L., Levi, L., Jacobs, V. R., & Empson, S. B. (1996). A longitudinal study of learning to use children’s thinking in mathematics instruction. Journal for Research in Mathematics Education, 27 (4), 403–434.

Glaser, B. G., & Holton, J. (2004). Remodeling grounded theory. Forum: Qualitative Social Research, 5(2). https://www.qualitative-research.net/index.php/fqs/article/view/607/1316

Gournelos, T., Hammonds, J. R., & Wilson, M. A. (2019). Doing academic research: A practical guide to research methods and analysis. Routledge.

Hohensee, C. (2014). Backward transfer: An investigation of the influence of quadratic functions instruction on students’ prior ways of reasoning about linear functions. Mathematical Thinking and Learning, 16 (2), 135–174.

Husserl, E. (1973). Cartesian meditations: An introduction to phenomenology (D. Cairns, Trans.). Martinus Nijhoff. (Original work published 1929).

Levitt, H. M., Bamberg, M., Creswell, J. W., Frost, D. M., Josselson, R., & Suárez-Orozco, C. (2018). Journal article reporting standards for qualitative primary, qualitative meta-analytic, and mixed methods research in psychology: The APA Publications and Communications Board Task Force report. American Psychologist, 73 (1), 26–46.

Medawar, P. (1982). Pluto’s republic [no typo]. Oxford University Press.

Merton, R. K. (1968). Social theory and social structure (Enlarged edition). Free Press.

Nemirovsky, R. (2011). Episodic feelings and transfer of learning. Journal of the Learning Sciences, 20 (2), 308–337. https://doi.org/10.1080/10508406.2011.528316

Vygotsky, L. (1987). The development of scientific concepts in childhood: The design of a working hypothesis. In A. Kozulin (Ed.), Thought and language (pp. 146–209). The MIT Press.

Author information

Authors and affiliations.

School of Education, University of Delaware, Newark, DE, USA

James Hiebert, Anne K Morris & Charles Hohensee

Department of Mathematical Sciences, University of Delaware, Newark, DE, USA

Jinfa Cai & Stephen Hwang

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


Copyright information

© 2023 The Author(s)

About this chapter

Hiebert, J., Cai, J., Hwang, S., Morris, A.K., Hohensee, C. (2023). How Do You Formulate (Important) Hypotheses?. In: Doing Research: A New Researcher’s Guide. Research in Mathematics Education. Springer, Cham. https://doi.org/10.1007/978-3-031-19078-0_2


DOI : https://doi.org/10.1007/978-3-031-19078-0_2

Published : 03 December 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-19077-3

Online ISBN : 978-3-031-19078-0

eBook Packages : Education Education (R0)


Enago Academy

Quick Guide to Biostatistics in Clinical Research: Hypothesis Testing


In this article series, we will be looking at some of the important concepts of biostatistics in clinical trials and clinical research. Statistics is frequently used to analyze quantitative research data. Clinical trials and clinical research both often rely on statistics. Clinical trials proceed through many phases. Contract Research Organizations (CROs) can be hired to conduct a clinical trial. Clinical trials are an important step in deciding if a treatment can be safely and effectively used in medical practice. Once the clinical trial phases are completed, biostatistics is used to analyze the results.

Research generally proceeds in an orderly fashion as shown below.

[Figure: Research Process]

Once you have identified the research question you need to answer, it is time to frame a good hypothesis. The hypothesis is the starting point for biostatistics and is usually based on a theory. Experiments are then designed to test the hypothesis. What is a hypothesis? A research hypothesis is a testable statement describing a relationship between two or more variables. A good hypothesis is clear, specific, objective, free of moral judgments, and relevant to the research question. Above all, a hypothesis must be testable.

A simple hypothesis would contain one predictor and one outcome variable. For instance, if your hypothesis was, “Chocolate consumption is linked to type II diabetes” the predictor would be whether or not a person eats chocolate and the outcome would be developing type II diabetes. A good hypothesis would also be specific. This means that it should be clear which subjects and research methodology will be used to test the hypothesis. An example of a specific hypothesis would be, “Adults who consume more than 20 grams of milk chocolate per day, as measured by a questionnaire over the course of 12 months, are more likely to develop type II diabetes than adults who consume less than 10 grams of milk chocolate per day.”

Null and Alternative Hypothesis

In statistics, the null hypothesis (H0) states that there is no relationship between the predictor and the outcome variable in the population being studied. For instance, “There is no relationship between a family history of depression and the probability that a person will attempt suicide.” The alternative hypothesis (H1) states that there is a relationship between the predictor (a history of depression) and the outcome (attempted suicide). It is impossible to prove a statement by making several observations, but it is possible to disprove a statement with a single observation. If you always saw red tulips, it is not proof that no other colors exist. However, seeing a single tulip that was not red would immediately prove that the statement, “All tulips are red” is false. This is why statistics tests the null hypothesis. It is also why the alternative hypothesis cannot be tested directly.

The alternative hypothesis proposed in medical research may be one-tailed or two-tailed. A one-tailed alternative hypothesis would predict the direction of the effect. Clinical studies may have an alternative hypothesis that patients taking the study drug will have a lower cholesterol level than those taking a placebo. This is an example of a one-tailed hypothesis. A two-tailed alternative hypothesis would only state that there is an association without specifying a direction. An example would be, “Patients who take the study drug will have a significantly different cholesterol level than those patients taking a placebo”. The alternative hypothesis does not state if that level will be higher or lower in those taking the placebo.
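The difference between a one-tailed and a two-tailed alternative can be sketched numerically. The example below is a minimal illustration, assuming the test statistic follows a standard normal distribution under the null hypothesis; the function name `p_value_from_z` and the z value of -1.8 are hypothetical and do not come from the article.

```python
from statistics import NormalDist


def p_value_from_z(z: float, alternative: str = "two-sided") -> float:
    """Return the p-value for a z statistic under a standard normal null.

    alternative: "two-sided" (any difference), "less" (effect predicted
    to be negative), or "greater" (effect predicted to be positive).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "less":
        return std_normal.cdf(z)
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'less', or 'greater'")


# Hypothetical z statistic: the drug group has lower cholesterol than placebo.
z = -1.8
p_one = p_value_from_z(z, "less")        # one-tailed: drug lowers cholesterol
p_two = p_value_from_z(z, "two-sided")   # two-tailed: cholesterol differs
print(f"one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

When the observed effect lies in the predicted direction, the one-tailed p-value is half the two-tailed one, so the same data can fall on different sides of a 0.05 threshold depending on which alternative was specified in advance.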

The P-Value Approach to Test Hypothesis

Once the hypothesis has been formulated, statistical tests help you decide whether to reject or fail to reject the null hypothesis. Statistical tests determine the p-value associated with the research data. The p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis (H0) is true. You reject the null hypothesis if the p-value falls below the predetermined level of statistical significance. Usually, the level of statistical significance is set at 0.05. If the p-value is less than 0.05, you would reject the null hypothesis that there is no relationship between the predictor and the outcome in the population.

However, if the p-value is greater than the predetermined level of significance, then there is no statistically significant association between the predictor and the outcome variable. This does not mean that there is no association between the predictor and the outcome in the population. It only means that the observed relationship is not clearly distinguishable from what random chance alone could have produced.

For example, null hypothesis (H0): Patients who take the study drug after a heart attack are no less likely to have a second heart attack over the next 24 months than patients who take a placebo.

Data suggest that those who did not take the study drug were twice as likely to have a second heart attack, with a p-value of 0.08. This p-value indicates that, if the null hypothesis were true, there would be an 8% chance of seeing a result at least this extreme (people on the placebo being at least twice as likely to have a second heart attack) because of random chance alone. Because 0.08 is above the 0.05 significance level, the null hypothesis would not be rejected.
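To make the decision rule concrete, here is a rough sketch of a two-proportion z-test. The article does not give event counts, so the numbers below (18 of 200 on placebo versus 9 of 200 on the drug) are invented purely to produce a risk ratio of 2; the function name `two_proportion_z_test` is likewise hypothetical, and a two-sided test is assumed.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (risk_ratio, z, p_value). The standard error uses the pooled
    proportion, i.e. it assumes equal risk in both groups under H0.
    """
    risk_a = events_a / n_a
    risk_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (risk_a - risk_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return risk_a / risk_b, z, p_value


# Hypothetical counts (not from the article): second heart attacks in 24 months.
risk_ratio, z, p_value = two_proportion_z_test(18, 200, 9, 200)

ALPHA = 0.05  # predetermined level of statistical significance
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"risk ratio = {risk_ratio:.1f}, z = {z:.2f}, p = {p_value:.3f} -> {decision}")
```

With these made-up counts the risk is doubled on placebo, yet the p-value stays above 0.05, which mirrors the situation described above: a seemingly large effect that is not statistically significant at the chosen threshold.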

The hypothesis is not a trivial part of the clinical research process. It is a key element in a good biostatistics plan regardless of the clinical trial phase. There are many other concepts that are important for analyzing data from clinical trials. In our next article in the series, we will examine hypothesis testing for one or many populations, as well as error types.


NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Hypothesis testing, p values, confidence intervals, and significance.

Jacob Shreffler; Martin R. Huecker.

Last Update: March 13, 2023.

  • Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

  • Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may be unable to make clinical decisions without relying purely on the level of significance deemed by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis. An example of a null hypothesis is below.

Null Hypothesis: There are no significant differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22.

The null hypothesis is deemed true until a study presents significant data to support rejecting it. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (if the data did not provide sufficient evidence of significant differences or associations).

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent. Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low.
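The remark about sample size can be illustrated with a small calculation. This is a sketch under simplifying assumptions: a z-test for a difference in means with a known common standard deviation and equal group sizes. The effect size (0.1), the standard deviation (1.0), and the function name `two_sided_p` are all hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means (z-test, known common SD)."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same small observed difference, evaluated at increasing sample sizes.
sample_sizes = [10, 100, 1000, 10000]
p_values = [two_sided_p(mean_diff=0.1, sd=1.0, n_per_group=n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>5}: p = {p:.4g}")
```

The observed difference never changes, but the p-value shrinks as the groups grow, which is one reason a very small p-value from a very large study says little by itself about how large or important the effect is.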

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1]  When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]
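A Type I error (rejecting a null hypothesis that is actually true) can be made tangible with a simulation. The sketch below assumes two groups drawn from the same normal distribution, so the null hypothesis holds by construction, and uses a z-test with known standard deviation; the function name, the number of simulated studies, and the seed are arbitrary illustrative choices. Type II errors are not simulated here.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_type_i_error_rate(n_studies=2000, n_per_group=50, alpha=0.05, seed=0):
    """Fraction of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = sqrt(2 / n_per_group)  # SE of the mean difference when SD = 1
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: there is no true effect.
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            rejections += 1  # a false positive, i.e. a Type I error
    return rejections / n_studies


type_i_rate = simulate_type_i_error_rate()
print(f"Share of true-null studies rejected at alpha = 0.05: {type_i_rate:.3f}")
```

The rejected share should land close to the chosen significance level, which is the sense in which the threshold limits, but cannot eliminate, Type I errors.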


Significance is a term to describe the substantive importance of medical research. Statistical significance concerns the likelihood that results are due to chance. [3]  Healthcare providers should always distinguish statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research. [4]  When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5]  One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability of observing an effect at least as large as the one seen in the study if, in reality, there were no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have argued that the 0.05 level should be lowered, it is still widely used. [6]  Hypothesis testing allows us to determine whether an effect is statistically distinguishable from no effect; on its own, it does not tell us the size of the effect.

Examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n = 100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p < 0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude that there are significant differences. Notably, as can be seen in the two statements above, some researchers will report findings with < or > and others will provide an exact p-value (0.000001) but never zero. [6]  When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7]  The inclusion of all p values provides evidence for study validity and limits suspicion of selective reporting or data mining.

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8]  P-values alone do not allow us to understand the size or the extent of the differences or associations. [3]  In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values with a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study. [7]  The p-value debate has smoldered since the 1950s [10], and replacement with confidence intervals has been suggested since the 1980s. [11]

Confidence Intervals

A confidence interval provides a range of values that, with a given level of confidence (e.g., 95%), includes the true value of the statistical parameter within a targeted population. [12]  Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13]  A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14]  Therefore, a CI of 95% indicates that if a study were to be carried out 100 times, the range would be expected to contain the true value in about 95 of them. [15]  Confidence intervals provide more evidence regarding the precision of an estimate compared to p-values. [6]
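The "repeat the study many times" reading of a 95% CI can be checked by simulation. The following is a minimal sketch, assuming normally distributed data with a known standard deviation so that a simple z-interval applies; the true mean, standard deviation, sample size, seed, and function name are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_ci_coverage(true_mean=10.0, sd=2.0, n=40, confidence=0.95,
                         n_studies=2000, seed=1):
    """Fraction of simulated studies whose CI contains the true mean."""
    rng = random.Random(seed)
    # Critical value, e.g. about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z_crit * sd / sqrt(n)
    covered = 0
    for _ in range(n_studies):
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            covered += 1
    return covered / n_studies


coverage = simulate_ci_coverage()
print(f"Share of 95% CIs containing the true mean: {coverage:.3f}")
```

Each simulated study produces a different interval; the confidence level describes how often this procedure captures the true value across repetitions, not the probability that any single published interval is correct.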

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size will reduce the precision of the CI (increase its width). [14]  A larger width indicates a smaller sample size or a larger variability. [16]  A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
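The dependence of CI width on sample size and variability follows directly from the usual formula for a z-interval around a mean, estimate ± z × SD / √n. The sketch below assumes that formula applies (known SD, normal approximation); the sample mean of 1.45, the SD of 0.1, and the function names are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_for_mean(sample_mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean (known SD)."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z_crit * sd / sqrt(n)  # shrinks as n grows, grows with sd
    return sample_mean - half_width, sample_mean + half_width


def width(ci):
    """Distance between the upper and lower bounds of an interval."""
    return ci[1] - ci[0]


# Hypothetical study summaries: same mean and SD, different sample sizes.
small_study = ci_for_mean(1.45, sd=0.1, n=100)
large_study = ci_for_mean(1.45, sd=0.1, n=400)

print(f"n = 100: {small_study[0]:.3f} to {small_study[1]:.3f}")
print(f"n = 400: {large_study[0]:.3f} to {large_study[1]:.3f}")
```

Because the half-width scales with 1/√n, quadrupling the sample size only halves the width of the interval, so gains in precision become progressively more expensive.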

The null value (zero for differences and 1 for ratios) is sometimes used as a reference point when interpreting a CI. However, CIs provide more information than that. [15]  Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, most of the range lies on the side of reduced wait times. Thus, while the p-value used to detect statistical significance for this may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.
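Checking whether an interval contains the null value is a simple comparison, sketched below with the intervals quoted in this article. The function name `ci_includes_null` and the `measure` labels are hypothetical; the null values (zero for differences, 1 for ratios) are the ones named in the text, and the ratio interval in the last line is an invented example.

```python
def ci_includes_null(lower, upper, measure="difference"):
    """Return True if the interval [lower, upper] contains the null value.

    measure: "difference" (null value 0) or "ratio" (null value 1).
    """
    null_values = {"difference": 0.0, "ratio": 1.0}
    if measure not in null_values:
        raise ValueError("measure must be 'difference' or 'ratio'")
    return lower <= null_values[measure] <= upper


# Wait-time reduction from the emergency department example (minutes).
wait_time_crosses_null = ci_includes_null(-2.5, 41)

# Mean difference in days to recovery from the Drug 23 example.
recovery_crosses_null = ci_includes_null(1.9, 7.8)

# Hypothetical ratio measure (for instance, a relative risk).
ratio_crosses_null = ci_includes_null(0.8, 1.3, measure="ratio")

print(wait_time_crosses_null, recovery_crosses_null, ratio_crosses_null)
```

A yes-or-no check like this reproduces the significance verdict and nothing more; the bounds themselves still need to be read against what would count as a clinically meaningful change.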

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14]  In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13]  An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

  • Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and large sample size may be of no interest to clinicians, whereas a study with smaller sample size and statistically non-significant results could impact clinical practice. [14]  Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI or both). [4]  Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work, as statistical significance never has been and never will be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on significant or insignificant results by researchers but through their understanding of study limitations and practical implications.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. 


Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this Page Shreffler J, Huecker MR. Hypothesis Testing, P Values, Confidence Intervals, and Significance. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Similar articles in PubMed

  • The reporting of p values, confidence intervals and statistical significance in Preventive Veterinary Medicine (1997-2017). Messam LLM, Weng HY, Rosenberger NWY, Tan ZH, Payet SDM, Santbakshsing M. PeerJ. 2021; 9:e12453. Epub 2021 Nov 24.
  • Review: Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. Ferrill MJ, Brown DA, Kyle JA. J Pharm Pract. 2010 Aug; 23(4):344-51. Epub 2010 Apr 13.
  • Interpreting "statistical hypothesis testing" results in clinical research. Sarmukaddam SB. J Ayurveda Integr Med. 2012 Apr; 3(2):65-9.
  • Confidence intervals in procedural dermatology: an intuitive approach to interpreting data. Alam M, Barzilai DA, Wrone DA. Dermatol Surg. 2005 Apr; 31(4):462-6.
  • Review: Is statistical significance testing useful in interpreting data? Savitz DA. Reprod Toxicol. 1993; 7(2):95-100.






  1. B.ed 2nd sem / Formulation of Action Hypothesis / need and importance

  2. Hypothesis Formulation

  3. Selecting the Appropriate Hypothesis Test [FIL]

  4. How to frame the Hypothesis statement in your Research

  5. صياغة الفرضية Formulation of the Hypothesis

  6. Formulating the research question


  1. Formulating Hypotheses for Different Study Designs

    Formulating Hypotheses for Different Study Designs. Generating a testable working hypothesis is the first step towards conducting original research. Such research may prove or disprove the proposed hypothesis. Case reports, case series, online surveys and other observational studies, clinical trials, and narrative reviews help to generate ...

  2. General Principles of Preclinical Study Design

    1. An Overview. Broadly, preclinical research can be classified into two distinct categories depending on the aim and purpose of the experiment, namely, "hypothesis generating" (exploratory) and "hypothesis testing" (confirmatory) research (Fig. 1).Hypothesis generating studies are often scientifically-informed, curiosity and intuition-driven explorations which may generate testable ...

  3. Formulating research questions for evidence-based studies

    The importance of formulating a sound and proper research question is summarized in three main motives: 1. Conducting an evidence-based study: Evidence-based studies, particularly, the systematic reviews in this case, rely on a research question developed to specifically address the problem with all required details. 2.

  4. Primary Question and Hypothesis Testing in Randomized Controlled

    Hypothesis testing. The methods for answering scientific questions from data collected in clinical trials belong to statistical inference. An important part of statistical inference is hypothesis testing, the foundation of which was laid by Fisher, Neyman, and Pearson among others [ 5]. Hypotheses consist of null hypothesis (H 0) and ...

  5. Targeted test evaluation: a framework for designing diagnostic accuracy

    Most randomized controlled trials evaluating medical interventions have a pre-specified hypothesis, which is statistically tested against the null hypothesis of no effect. In diagnostic accuracy studies, study hypotheses are rarely pre-defined and sample size calculations are usually not performed, which may jeopardize scientific rigor and can lead to over-interpretation or "spin" of study ...

  6. Formulating Hypotheses for Different Study Designs

    Generating a testable working hypothesis is the first step towards conducting original research. Such research may prove or disprove the proposed hypothesis. Case reports, case series, online surveys and other observational studies, clinical trials, and narrative reviews help to generate hypotheses. Observational and interventional studies help ...

  7. Generating a Testable Hypothesis and Underlying Principles of Clinical
