Computer Science > Artificial Intelligence

Title: Artificial Intelligence for Literature Reviews: Opportunities and Challenges

Abstract: This manuscript presents a comprehensive review of the use of Artificial Intelligence (AI) in Systematic Literature Reviews (SLRs). An SLR is a rigorous and organised methodology that assesses and integrates previous research on a given topic. Numerous tools have been developed to assist and partially automate the SLR process. The increasing role of AI in this field shows great potential for providing more effective support to researchers, moving towards the semi-automatic creation of literature reviews. Our study focuses on how AI techniques are applied in the semi-automation of SLRs, specifically in the screening and extraction phases. We examine 21 leading SLR tools using a framework that combines 23 traditional features with 11 AI features. We also analyse 11 recent tools that leverage large language models for searching the literature and assisting academic writing. Finally, the paper discusses current trends in the field, outlines key research challenges, and suggests directions for future research.


Using artificial intelligence methods for systematic review in health sciences: A systematic review

Affiliations

  • 1 Department of Pharmacotherapy, College of Pharmacy, University of Utah, Utah, USA.
  • 2 Faculty of Pharmacy, Chiang Mai University, Chiang Mai, Thailand.
  • 3 School of Computing, Robert Gordon University, Aberdeen, Scotland, UK.
  • 4 The Rowett Institute, University of Aberdeen, Aberdeen, Scotland, UK.
  • 5 School of Medicine, Faculty of Health and Medical Sciences, Taylors University, Selangor, Malaysia.
  • 6 School of Pharmacy, Monash University Malaysia, Selangor, Malaysia.
  • 7 IDEAS Center, Veterans Affairs Salt Lake City Healthcare System, Salt Lake City, Utah, USA.
  • PMID: 35174972
  • DOI: 10.1002/jrsm.1553

The exponential increase in published articles makes a thorough and expedient review of the literature increasingly challenging. This review delineated automated tools and platforms that employ artificial intelligence (AI) approaches and evaluated the reported benefits and challenges of using such methods. A search was conducted in four databases (Medline, Embase, CDSR, and Epistemonikos) up to April 2021 for systematic reviews and other related reviews implementing AI methods. To be included, a review had to use some form of AI method, including machine learning, deep learning, neural networks, or any other application enabling the full or semi-autonomous performance of one or more stages of evidence synthesis. Twelve reviews were included, using nine different tools to implement 15 different AI methods. Eleven methods were used in the screening stages of the review (73%); the rest were divided between data extraction (two methods, 13%) and risk of bias assessment (two methods, 13%). The ambiguous benefits for data extraction, combined with the advantages reported in 10 reviews, indicate that AI platforms have taken hold, with varying success, in evidence synthesis. However, the results are qualified by the reliance on self-reporting by the review authors. Extensive human validation still appears to be required at this stage of implementing AI methods, and further evaluation is needed to define the overall contribution of such platforms to efficiency and quality in evidence synthesis.

Keywords: artificial intelligence; evidence synthesis; machine learning; systematic reviews.

© 2022 John Wiley & Sons Ltd.

Publication types

  • Systematic Review
  • Artificial Intelligence*
  • Machine Learning
  • Systematic Reviews as Topic*
  • Open access
  • Published: 15 January 2022

Automation of literature screening using machine learning in medical evidence synthesis: a diagnostic test accuracy systematic review protocol

  • Yuelun Zhang 1   na1 ,
  • Siyu Liang 2   na1 ,
  • Yunying Feng 3   na1 ,
  • Qing Wang 4 ,
  • Feng Sun 5 ,
  • Shi Chen 2 ,
  • Yiying Yang 3 ,
  • Huijuan Zhu 2 &
  • Hui Pan 2  

Systematic Reviews, volume 11, Article number: 11 (2022)


Systematic review is an indispensable tool for optimal evidence collection and evaluation in evidence-based medicine. However, the explosive increase in original literature makes it difficult to accomplish critical appraisal and regular updates. Artificial intelligence (AI) algorithms have been applied to automate the literature screening procedure in medical systematic reviews. These studies used different algorithms and reported widely varying results. It is therefore imperative to systematically review and analyse the automatic methods developed for literature screening and their effectiveness as reported in current studies.

An electronic search will be conducted in the PubMed, Embase, ACM Digital Library, and IEEE Xplore Digital Library databases, supplemented by a search in Google Scholar, for automatic methods for literature screening in systematic reviews. Two reviewers will independently conduct the primary screening of the articles and the data extraction, with disagreements resolved by discussion with a methodologist. Data will be extracted from eligible studies, including the basic characteristics of each study, information on the training and validation sets, and the function and performance of the AI algorithms, and summarised in a table. The risk of bias and applicability of the eligible studies will be assessed independently by the two reviewers using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Quantitative analyses, if appropriate, will also be performed.

Automating the systematic review process is of great help in reducing the workload of evidence-based practice. The results of this systematic review will provide an essential summary of the current development of AI algorithms for automatic literature screening in medical evidence synthesis and help to inspire further studies in this field.

Systematic review registration

PROSPERO CRD42020170815 (28 April 2020).

Peer Review reports

Systematic reviews synthesise the results of multiple original publications to provide clinicians with comprehensive knowledge and the current optimal evidence for answering specific research questions. The major steps of a systematic review are defining a structured review question, developing inclusion criteria, searching databases, screening for relevant studies, collecting data from relevant studies, critically assessing the risk of bias, undertaking meta-analyses where appropriate, and assessing reporting biases [ 1 , 2 , 3 ]. A systematic review aims to provide a complete, exhaustive summary of the current literature relevant to a research question using an objective and transparent approach. In light of these characteristics, systematic reviews, particularly those combining high-quality evidence, were once placed at the very top of the medical evidence pyramid [ 4 ], are now regarded as an indispensable tool for viewing evidence [ 5 ], and are widely used in the practice of evidence-based medicine.

However, conducting systematic reviews for clinical decision making is time-consuming and labour-intensive: reviewers must perform a thorough search to identify any potentially relevant articles, read through all abstracts of the retrieved records, and identify candidates for further full-text screening [ 6 ]. For original research articles, the median time from publication to first inclusion in a systematic review ranges from 2.5 to 6.5 years [ 7 ], and it usually takes over a year from the literature search to the publication of a systematic review [ 8 ]. Meanwhile, with advances in clinical research, this evidence, and the systematic review conclusions it generates, may be out of date within a few years. Given the explosive increase in original research articles, reviewers find it difficult to identify the most relevant evidence in time, let alone update systematic reviews periodically [ 9 ]. Researchers are therefore exploring automatic methods to improve the efficiency of evidence synthesis while reducing the workload of systematic reviews.

Recent progress in computer science suggests a promising future in which more intelligent work can be accomplished with the aid of automatic technologies such as pattern recognition and machine learning (ML). As a subset of artificial intelligence (AI), ML uses algorithms to build mathematical models from training data in order to make predictions or decisions without being explicitly programmed [ 10 ]. ML has been applied across the medical field, for example in diagnosis, prognosis, genetic analysis, and drug screening, to support clinical decision making [ 11 , 12 , 13 , 14 ]. As for automatic methods for systematic reviews, models for automatic literature screening have been explored to reduce repetitive work and save reviewers' time [ 15 , 16 ].
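At their core, such screening models score each candidate record by its textual similarity to known-relevant material and surface the highest-scoring records first. As a much-simplified, dependency-free illustration (this is not any specific tool from the cited studies; the seed text and candidate abstracts below are invented), records can be ranked by bag-of-words cosine similarity to a seed description of the review topic:

```python
import math
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lower-cased, whitespace-tokenised term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Seed description of the review topic (invented for illustration).
seed = bag_of_words(
    "machine learning for automatic literature screening in systematic reviews")

# Candidate abstracts retrieved by a search (also invented).
candidates = {
    "A": "deep learning model for screening titles and abstracts in systematic reviews",
    "B": "randomised trial of a new antihypertensive drug in elderly patients",
}

# Present the most similar records to the reviewers first.
ranked = sorted(candidates,
                key=lambda key: cosine(seed, bag_of_words(candidates[key])),
                reverse=True)
```

Real systems replace the raw term counts with TF-IDF or learned embeddings and retrain as reviewers label records, but the ranking principle is the same.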

To date, limited research has focused on automatic methods for biomedical literature screening in the systematic review process. Automated literature classification systems [ 17 ] and hybrid relevance rating models [ 18 ] have been tested on specific datasets, but broader review datasets and further performance improvements are still needed. To address this gap in knowledge, this article describes the protocol for a systematic review that aims to summarise existing automatic methods for screening relevant biomedical literature in the systematic review process and to evaluate the accuracy of the AI tools.

The primary objective of this review is to assess the diagnostic accuracy of AI algorithms (index test) compared with gold-standard human investigators (reference standard) for screening relevant records from those identified by electronic search in a systematic review. The secondary objective is to describe the time and work saved by AI algorithms in literature screening. Additionally, we plan to conduct subgroup analyses to explore the factors potentially associated with the accuracy of AI algorithms.

Study registration

We prepared this protocol following the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) [ 19 ]. This systematic review has been registered on PROSPERO (Registration number: CRD42020170815, 28 April 2020).

Review question

Our review question was refined using the PRISMA-DTA framework, as detailed in Table 1. In this systematic review, “literatures” refers to the records that are the subjects of the diagnostic test (the “participants” in Table 1), and “studies” refers to the studies included in our review.

Inclusion and exclusion criteria

We will include studies in medical research that report a structured study question, describe the source of the training or validation sets, develop or employ AI models for automatic literature screening, and use the screening results from human investigators as the reference standard.

We will exclude traditional clinical studies in human participants, editorials, commentaries, or other non-original reports. Pure methodological studies in AI algorithms without application in evidence synthesis will be excluded as well.

Information source and search strategy

An experienced methodologist will conduct searches in major public electronic medical and computer science databases, including PubMed, Embase, ACM Digital Library, and IEEE Xplore Digital Library, for publications from January 2000 to the present. We set this time range because, to the best of our knowledge, AI algorithms published before 2000 are unlikely to be applicable to evidence synthesis [ 20 ]. In addition to the database search, we will identify further relevant studies by checking the reference lists of the included studies. Related abstracts and preprints will be searched in Google Scholar. There are no language restrictions. We will use free-text words, MeSH/EMTREE terms, IEEE Terms, INSPEC Terms, and the ACM Computing Classification System to develop strategies around three major concepts: systematic review, literature screening, and AI. Multiple synonyms for each concept will be incorporated into the search. The Systematic Review Toolbox ( http://systematicreviewtools.com/ ) will also be used to detect potential automation methods in medical research evidence synthesis. The detailed search strategy used in PubMed is shown in Supplementary Material 1.
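The concept-based strategy described above (synonyms joined with OR within a concept, concepts joined with AND) can be sketched as follows; the concept names and synonym lists here are illustrative stand-ins, not the protocol's actual search terms, which are in Supplementary Material 1:

```python
def build_boolean_query(concepts: dict[str, list[str]]) -> str:
    """Join the synonyms within each concept with OR, then join the
    concept groups with AND, in the style of a boolean database query."""
    groups = ["(" + " OR ".join(f'"{term}"' for term in terms) + ")"
              for terms in concepts.values()]
    return " AND ".join(groups)

# Illustrative synonym lists only.
query = build_boolean_query({
    "systematic review": ["systematic review", "evidence synthesis"],
    "literature screening": ["literature screening", "citation screening",
                             "study selection"],
    "AI": ["artificial intelligence", "machine learning", "text mining"],
})
```

In practice each database has its own field tags and controlled vocabulary (MeSH, EMTREE, IEEE Terms), so a query like this would be adapted per database rather than reused verbatim.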

Study selection

Titles and abstracts of records retrieved from the online electronic databases will be downloaded and, after duplicates are removed, imported into EndNote X9.3.2 (Thomson Reuters, Toronto, Ontario, Canada) for further processing.

All studies will be screened independently by two authors based on titles and abstracts. Records that do not meet the inclusion criteria will be excluded, with the specific reason recorded. Disagreements will be resolved by discussion with a methodologist if necessary. After this initial screening, the full texts of potentially relevant studies will be independently reviewed by the two authors to make final inclusion decisions. Conflicts will be resolved in the same way as in the initial screening. Excluded studies will be listed, with reasons, according to the PRISMA-DTA flowchart.
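Although the protocol does not prescribe a specific agreement statistic, dual independent screening of this kind is commonly summarised with Cohen's kappa, the chance-corrected agreement between the two screeners. A minimal sketch (the decision lists below are invented):

```python
def cohens_kappa(screener_a: list[str], screener_b: list[str]) -> float:
    """Chance-corrected agreement between two independent screeners,
    given parallel lists of per-record decisions."""
    if not screener_a or len(screener_a) != len(screener_b):
        raise ValueError("decision lists must be non-empty and of equal length")
    n = len(screener_a)
    # Observed proportion of records on which the screeners agree.
    observed = sum(a == b for a, b in zip(screener_a, screener_b)) / n
    # Agreement expected by chance from each screener's marginal rates.
    categories = set(screener_a) | set(screener_b)
    expected = sum((screener_a.count(c) / n) * (screener_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Six records screened by both authors; they disagree on one (invented data).
first = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
second = ["include", "exclude", "include", "include", "exclude", "exclude"]
kappa = cohens_kappa(first, second)
```

A kappa near 1 indicates the two reviewers rarely disagree beyond what chance would predict; low values would signal that the inclusion criteria need clarification before full-text review.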

Data collection

A data collection form will be used for information extraction. Data from the eligible studies will be independently extracted and verified by two investigators. Disagreements will be resolved through discussion and by consulting the original publication. We will also try to contact the authors to collect missing data. If a study does not report detailed accuracy data, or does not provide enough data to calculate them, it will be omitted from the quantitative data synthesis.

The following data will be extracted from the original studies: study characteristics, information on the training and validation sets, and the function and performance of the AI algorithms. The definitions of the variables in the data extraction are shown in Table 2.

Risk of bias assessment, applicability, and levels of evidence

Two authors will independently assess risk of bias and applicability with a checklist based on the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [ 21 ]. QUADAS-2 contains four domains: patient selection, index test, reference standard, and flow and timing. The risk of bias is classified as “low”, “high”, or “unclear”. Studies with a high risk of bias will be excluded in the sensitivity analysis.

In this systematic review, the “participants” are records of the literature rather than human subjects, and the index test is the AI model used for automatic literature screening. We will therefore slightly revise QUADAS-2 to fit our research context (Table 3). We deleted one signalling question, “was there an appropriate interval between index test and reference standard”. In the original QUADAS-2, this question judges the bias caused by a change in disease status between the index test and the reference standard. The “disease status”, i.e. the final inclusion status of a record in our research context, does not change, so there is no such concern.

The level of the body of evidence will be evaluated using the Grading of Recommendations, Assessment, Development and Evaluations (GRADE) framework [ 22 ].

Diagnostic accuracy measures

For each study, we will extract the data for a two-by-two contingency table from the formal publication text, the appendices, or by contacting the main authors, and will collect sensitivity, specificity, precision, negative predictive value (NPV), positive predictive value (PPV), negative likelihood ratio (NLR), positive likelihood ratio (PLR), diagnostic odds ratio (DOR), F-measure, and accuracy with 95% CIs. If the outcomes cannot be formulated in a two-by-two contingency table, we will extract the reported performance data. Where possible, we will also assess the area under the curve (AUC), as the two-by-two contingency table may not be available in some scenarios.
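All of the measures listed above derive from the same two-by-two table of AI screening decisions against the human reference standard. A minimal sketch, with invented counts:

```python
def dta_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Diagnostic-accuracy measures from a two-by-two table: rows are the
    AI screening decision, columns the human reference standard."""
    sens = tp / (tp + fn)            # sensitivity (recall of relevant records)
    spec = tn / (tn + fp)            # specificity
    ppv = tp / (tp + fp)             # positive predictive value (precision)
    npv = tn / (tn + fn)             # negative predictive value
    plr = sens / (1 - spec)          # positive likelihood ratio
    nlr = (1 - sens) / spec          # negative likelihood ratio
    return {
        "sensitivity": sens, "specificity": spec, "PPV": ppv, "NPV": npv,
        "PLR": plr, "NLR": nlr,
        "DOR": plr / nlr,            # diagnostic odds ratio
        "F-measure": 2 * ppv * sens / (ppv + sens),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Invented counts: of 90 truly relevant records the AI flags 85 (tp) and
# misses 5 (fn); of 910 irrelevant records it wrongly flags 100 (fp).
metrics = dta_metrics(tp=85, fp=100, fn=5, tn=810)
```

This is why screening lends itself to the diagnostic-test-accuracy framework: each record plays the role of a "patient", and inclusion by the human reviewers is the reference diagnosis.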

Qualitative and quantitative synthesis of results

We will qualitatively describe the application of AI to literature screening and evaluate and compare the accuracy of the AI tools. If there are adequate details and sufficiently homogeneous data for a quantitative meta-analysis, we will combine the accuracy of AI algorithms in literature screening using the random-effects Rutter-Gatsonis hierarchical summary receiver operating characteristic (HSROC) model, which is recommended by the Cochrane Collaboration for combining evidence on diagnostic accuracy [ 23 ]. The model will incorporate the effect of threshold, allowing heterogeneous thresholds among studies. The combined point estimates of accuracy will be derived from the summary receiver operating characteristic (ROC) curve.

Subgroup analyses and meta-regression will be used to explore between-study heterogeneity. We will explore the following predefined sources of heterogeneity: (1) AI algorithm type; (2) study area of the validation set (targeted specific diseases, interventions, or a general area); (3) electronic databases searched (PubMed, EMBASE, or others); and (4) proportion of eligible to original studies (the number of eligible records identified in the screening step divided by the number of records identified during the electronic search). Furthermore, we will analyse possible sources of heterogeneity from both dataset and methodological perspectives by entering them as covariates in the HSROC model, following the recommendations of the Cochrane Handbook for Diagnostic Test Accuracy Reviews [ 23 ]. We will regard a factor as a source of heterogeneity if the coefficient of its covariate in the HSROC model is statistically significant. We will not evaluate reporting bias (e.g. publication bias), since the hypotheses underlying the commonly used methods, such as the funnel plot or Egger's test, may not be satisfied in our research context. Data will be analysed using R software, version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria), with a two-tailed type I error probability of 0.05 (α = 0.05).
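To make the pooling idea concrete, the sketch below performs a drastically simplified fixed-effect, inverse-variance pooling of logit-transformed sensitivities, with invented counts. This is only a pedagogical stand-in: the protocol itself specifies the random-effects Rutter-Gatsonis HSROC model, which models sensitivity and specificity jointly across thresholds and is fitted with specialised statistical software:

```python
import math

def pooled_sensitivity(studies: list[tuple[int, int]]) -> float:
    """Fixed-effect inverse-variance pooling of logit-transformed
    sensitivities; each study is a (true positives, false negatives) pair."""
    weighted_sum = weight_total = 0.0
    for tp, fn in studies:
        sens = tp / (tp + fn)
        logit = math.log(sens / (1 - sens))
        variance = 1 / tp + 1 / fn   # approximate variance of logit(sens)
        weighted_sum += logit / variance
        weight_total += 1 / variance
    pooled_logit = weighted_sum / weight_total
    return 1 / (1 + math.exp(-pooled_logit))  # back-transform to a proportion

# Three invented studies, each as a (tp, fn) pair.
summary = pooled_sensitivity([(90, 10), (45, 5), (80, 20)])
```

Pooling on the logit scale keeps the estimate within (0, 1) and gives larger, more precise studies more weight; the HSROC model extends this idea with random effects and an explicit threshold parameter.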

Systematic reviews have developed rapidly over recent decades and play a key role in spreading evidence-based practice. A systematic review, though less costly than primary research, is still time-consuming and labour-intensive. It begins with an electronic database search for a specific research question; then at least two reviewers read each abstract of the retrieved records to identify candidates for full-text screening. On average, only 2.9% of retrieved records are relevant and included in the final synthesis [ 24 ]; typically, reviewers must find the proverbial needle in a haystack of irrelevant titles and abstracts. Computational scientists have developed various algorithms for automatic literature screening. An automatic literature screening instrument would save resources and improve the quality of systematic reviews by liberating reviewers from repetitive work. In this systematic review, we aim to describe and evaluate the development processes and algorithms used in various AI literature screening systems, in order to build a pipeline for updating existing tools and creating new models.

The accuracy of automatic literature screening instruments varies widely across algorithms and review topics [ 17 ]. Automatic screening systems can reach a sensitivity as high as 95%, albeit at the expense of specificity, since reviewers try to include every publication relevant to the review topic. Because the automatic systems may have low specificity, it is also important to evaluate how much reviewing work can be saved at the screening step. We will therefore not only assess the diagnostic accuracy of AI screening algorithms compared with human investigators but also collect information on the work saved by AI algorithms in literature screening. Additionally, we plan to conduct subgroup analyses to identify factors potentially associated with the accuracy and efficiency of AI algorithms.
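Work saved is often quantified as "work saved over sampling" at a fixed recall level (e.g. WSS@95%): the fraction of records reviewers can leave unread once 95% of the relevant records have been found, minus the 5% a random ordering would leave unread at the same recall. A minimal sketch with an invented ranking (the protocol itself does not commit to this particular metric):

```python
import math

def wss_at(ranked_labels: list[int], recall_level: float = 0.95) -> float:
    """Work saved over sampling at a given recall level.

    `ranked_labels` holds the reference-standard relevance labels
    (1 = relevant, 0 = irrelevant), ordered from the highest to the
    lowest AI-assigned score. Returns the fraction of records reviewers
    can leave unread once the target recall is reached, minus what a
    random ordering would save at the same recall."""
    total = len(ranked_labels)
    needed = math.ceil(recall_level * sum(ranked_labels))
    found = 0
    for position, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return (total - position) / total - (1 - recall_level)
    return 0.0

# Invented ranking: all 10 relevant records land in the first 10 positions
# of 100, so reviewers could stop after reading 10% of the records.
saving = wss_at([1] * 10 + [0] * 90)
```

A perfect ranking of this example yields WSS@95% = 0.85; a ranking no better than random yields roughly 0, which is why the metric pairs naturally with the sensitivity/specificity analysis above.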

To the best of our knowledge, this will be the first systematic review to evaluate AI algorithms for automatic literature screening in evidence synthesis. Few systematic reviews have focused on the application of AI algorithms in medical practice, and the search strategies of previously published systematic reviews rarely use specific algorithms as search terms; most use general terms such as “artificial intelligence” and “machine learning”, which may miss studies that report only one specific algorithm. To include as many AI-related studies as possible, our search strategy contains all the AI algorithms commonly used in the past 50 years, and it was reviewed by an expert in ML. The process of literature screening can be assessed within the framework of a diagnostic test. Findings from this proposed systematic review will provide a comprehensive and essential summary of the application of AI algorithms to automatic literature screening in evidence synthesis. It may also help to improve and promote automatic methods in evidence synthesis by locating and identifying potential weaknesses in current AI models and methods.

Availability of data and materials

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

  • AI: Artificial intelligence
  • AUC: Area under the curve
  • DOR: Diagnostic odds ratio
  • GRADE: Grading of Recommendations, Assessment, Development and Evaluations
  • HSROC: Hierarchical summarised receiver operating characteristic curve
  • NLR: Negative likelihood ratio
  • NPV: Negative predictive value
  • PLR: Positive likelihood ratio
  • PPV: Positive predictive value
  • PRISMA-P: Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols
  • QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies
  • ROC: Receiver operating characteristic curve
  • SVM: Support vector machine

Higgins J, Thomas J, Chandler J, et al. Cochrane handbook for systematic reviews of interventions, version 6.0 (updated July 2019). Cochrane; 2019.


Mulrow CD, Cook D. Systematic reviews: synthesis of best evidence for health care decisions: ACP Press; 1998.

Armstrong R, Hall BJ, Doyle J, Waters E. ‘Scoping the scope’ of a cochrane review. J Public Health. 2011;33(1):147–50.


Paul M, Leibovici L. Systematic review or meta-analysis? Their place in the evidence hierarchy. Clin Microbiol Infect. 2014;20(2):97–100.


Murad MH, Asi N, Alsawas M, Alahdab F. New evidence pyramid. Evid Based Med. 2016;21(4):125.

Bigby M. Evidence-based medicine in a nutshell: a guide to finding and using the best evidence in caring for patients. Arch Dermatol. 1998;134(12):1609–18.


Bragge P, Clavisi O, Turner T, Tavender E, Collie A, Gruen RL. The global evidence mapping initiative: scoping research in broad topic areas. BMC Med Res Methodol. 2011;11(1):92.

Sampson M, Shojania KG, Garritty C, Horsley T, Ocampo M, Moher D. Systematic reviews can be produced and published faster. J Clin Epidemiol. 2008;61(6):531–6.

Shojania K, Sampson M, Ansari M, Ji J, Doucette S, Moher D. How quickly do systematic reviews go out of date? A survival analysis. Ann Intern Med. 2007;147(4):224–33.

Bishop CM. Pattern recognition and machine learning: Springer; 2006.

Wang L-Y, Chakraborty A, Comaniciu D. Molecular diagnosis and biomarker identification on SELDI proteomics data by ADTBoost method. Paper presented at: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. 2006.

Cetin MS, Houck JM, Vergara VM, Miller RL, Calhoun V. Multimodal based classification of schizophrenia patients. Paper presented at: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). 2015.

Sun Y, Loparo K. Information extraction from free text in clinical trials with knowledge-based distant supervision. Paper presented at: 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC). 2019.

Li M, Lu Y, Niu Z, Wu F-X. United complex centrality for identification of essential proteins from PPI networks. IEEE/ACM Transact Comput Biol Bioinform. 2015;14(2):370–80.

Whittington C, Feinman T, Lewis SZ, Lieberman G, Del Aguila M. Clinical practice guidelines: machine learning and natural language processing for automating the rapid identification and annotation of new evidence. J Clin Oncol. 2019;37.

Turner MD, Chakrabarti C, Jones TB, et al. Automated annotation of functional imaging experiments via multi-label classification. Front Neurosci. 2013;7:240.

Cohen AM, Hersh WR, Peterson K, Yen P-Y. Reducing workload in systematic review preparation using automated citation classification. J Am Med Inform Assoc. 2006;13(2):206–19.


Rúbio TR, Gulo CA. Enhancing academic literature review through relevance recommendation: using bibliometric and text-based features for classification. Paper presented at: 2016 11th Iberian Conference on Information Systems and Technologies (CISTI). 2016.

Shamseer L, Moher D, Clarke M, et al. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation. BMJ. 2015;350:g7647.

Jonnalagadda SR, Goyal P, Huffman MD. Automating data extraction in systematic reviews: a systematic review. Syst Rev. 2015;4:78.

Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–36.

Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924–6.

Macaskill P, Gatsonis C, Deeks J, Harbord R, Takwoingi Y. Cochrane handbook for systematic reviews of diagnostic test accuracy. Version 09 0. London: The Cochrane Collaboration; 2010.

Sampson M, Tetzlaff J, Urquhart C. Precision of healthcare systematic review searches in a cross-sectional sample. Res Synth Methods. 2011;2(2):119–25.


Acknowledgements

We thank Professor Siyan Zhan (Department of Epidemiology and Biostatistics, School of Public Health, Peking University Health Science Center, [email protected] ) for her critical comments in designing this study. We also thank Dr. Bin Zhang (Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, [email protected] ) for her critical suggestions in developing search strategies.

This study will be supported by the Undergraduate Innovation and Entrepreneurship Training Program (Number 202010023001). The sponsors have no role in study design, data collection, data analysis, interpretation of findings, or decisions about dissemination.

Author information

Yuelun Zhang, Siyu Liang, and Yunying Feng contributed equally to this work and should be regarded as co-first authors.

Authors and Affiliations

Medical Research Center, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

Yuelun Zhang

Department of Endocrinology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 1 Shuaifuyuan, Dongcheng District, Beijing, China

Siyu Liang, Shi Chen, Huijuan Zhu & Hui Pan

Eight-year Program of Clinical Medicine, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

Yunying Feng, Yiying Yang & Xin He

Research Institute of Information and Technology, Tsinghua University, Beijing, China

Department of Epidemiology and Biostatistics, School of Public Health, Peking University Health Science Center, Beijing, China


Contributions

H Pan conceived this research. This protocol was designed by YL Zhang, SY Liang, and YY Feng. YY Yang, X He, Q Wang, F Sun, S Chen, and HJ Zhu provided critical suggestions and comments on the manuscript. YL Zhang, SY Liang, and YY Feng wrote the manuscript. All authors read and approved the final manuscript. H Pan is the guarantor for this manuscript.

Corresponding author

Correspondence to Hui Pan .

Ethics declarations

Ethics approval and consent to participate

This research is exempt from ethics approval because the work is carried out on published documents.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Supplementary Table 1. Search strategy for PubMed.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Zhang, Y., Liang, S., Feng, Y. et al. Automation of literature screening using machine learning in medical evidence synthesis: a diagnostic test accuracy systematic review protocol. Syst Rev 11 , 11 (2022). https://doi.org/10.1186/s13643-021-01881-5


Received : 20 August 2020

Accepted : 27 December 2021

Published : 15 January 2022

DOI : https://doi.org/10.1186/s13643-021-01881-5


  • Evidence-based practice
  • Natural language processing
  • Systematic review
  • Diagnostic test accuracy

Systematic Reviews

ISSN: 2046-4053

  • Open access
  • Published: 01 February 2021

An open source machine learning framework for efficient and transparent systematic reviews

  • Rens van de Schoot   ORCID: orcid.org/0000-0001-7736-2091 1 ,
  • Jonathan de Bruin   ORCID: orcid.org/0000-0002-4297-0502 2 ,
  • Raoul Schram 2 ,
  • Parisa Zahedi   ORCID: orcid.org/0000-0002-1610-3149 2 ,
  • Jan de Boer   ORCID: orcid.org/0000-0002-0531-3888 3 ,
  • Felix Weijdema   ORCID: orcid.org/0000-0001-5150-1102 3 ,
  • Bianca Kramer   ORCID: orcid.org/0000-0002-5965-6560 3 ,
  • Martijn Huijts   ORCID: orcid.org/0000-0002-8353-0853 4 ,
  • Maarten Hoogerwerf   ORCID: orcid.org/0000-0003-1498-2052 2 ,
  • Gerbrich Ferdinands   ORCID: orcid.org/0000-0002-4998-3293 1 ,
  • Albert Harkema   ORCID: orcid.org/0000-0002-7091-1147 1 ,
  • Joukje Willemsen   ORCID: orcid.org/0000-0002-7260-0828 1 ,
  • Yongchao Ma   ORCID: orcid.org/0000-0003-4100-5468 1 ,
  • Qixiang Fang   ORCID: orcid.org/0000-0003-2689-6653 1 ,
  • Sybren Hindriks 1 ,
  • Lars Tummers   ORCID: orcid.org/0000-0001-9940-9874 5 &
  • Daniel L. Oberski   ORCID: orcid.org/0000-0001-7467-2297 1 , 6  

Nature Machine Intelligence volume 3, pages 125–133 (2021)

72k Accesses

212 Citations

165 Altmetric

  • Computational biology and bioinformatics
  • Computer science
  • Medical research

A preprint version of the article is available at arXiv.

To help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible, we designed a tool to accelerate the step of screening titles and abstracts. For many tasks—including but not limited to systematic reviews and meta-analyses—the scientific literature needs to be checked systematically. Scholars and practitioners currently screen thousands of studies by hand to determine which studies to include in their review or meta-analysis. This is error prone and inefficient because of extremely imbalanced data: only a fraction of the screened studies is relevant. The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We therefore developed an open source machine learning-aided pipeline applying active learning: ASReview. We demonstrate by means of simulation studies that active learning can yield far more efficient reviewing than manual reviewing while providing high quality. Furthermore, we describe the options of the free and open source research software and present the results from user experience tests. We invite the community to contribute to open source projects such as our own that provide measurable and reproducible improvements over current practice.

With the emergence of online publishing, the number of scientific manuscripts on many topics is skyrocketing 1 . All of these textual data present opportunities to scholars and practitioners while simultaneously confronting them with new challenges. Scholars often conduct systematic reviews and meta-analyses to develop comprehensive overviews of the relevant topics 2 . The process entails several explicit and, ideally, reproducible steps, including identifying all likely relevant publications in a standardized way, extracting data from eligible studies and synthesizing the results. Systematic reviews differ from traditional literature reviews in that they are more replicable and transparent 3 , 4 . Such systematic overviews of literature on a specific topic are pivotal not only for scholars, but also for clinicians, policy-makers, journalists and, ultimately, the general public 5 , 6 , 7 .

Given that screening the entire research literature on a given topic is too labour intensive, scholars often develop quite narrow searches. Developing a search strategy for a systematic review is an iterative process aimed at balancing recall and precision 8 , 9 ; that is, including as many potentially relevant studies as possible while simultaneously limiting the total number of studies retrieved. The vast number of publications in the field of study often leads to a relatively precise search, with the risk of missing relevant studies. The process of systematic reviewing is error prone and extremely time intensive 10 . In fact, if the literature of a field is growing faster than the amount of time available for systematic reviews, adequate manual review of this field then becomes impossible 11 .

The rapidly evolving field of machine learning has aided researchers by allowing the development of software tools that assist in developing systematic reviews 11 , 12 , 13 , 14 . Machine learning offers approaches to overcome the manual and time-consuming screening of large numbers of studies by prioritizing relevant studies via active learning 15 . Active learning is a type of machine learning in which a model can choose the data points (for example, records obtained from a systematic search) it would like to learn from and thereby drastically reduce the total number of records that require manual screening 16 , 17 , 18 . In most so-called human-in-the-loop 19 machine-learning applications, the interaction between the machine-learning algorithm and the human is used to train a model with a minimum number of labelling tasks. Unique to systematic reviewing is that not only do all relevant records (that is, titles and abstracts) need to be seen by a researcher, but an extremely diverse range of concepts also needs to be learned, thereby requiring flexibility in the modelling approach as well as careful error evaluation 11 . In the case of systematic reviewing, the algorithm(s) are interactively optimized for finding the most relevant records, instead of finding the most accurate model. The term researcher-in-the-loop was introduced 20 as a special case of human-in-the-loop with three unique components: (1) the primary output of the process is a selection of the records, not a trained machine learning model; (2) all records in the relevant selection are seen by a human at the end of the process 21 ; (3) the use case requires a reproducible workflow and complete transparency 22 .

Existing tools that implement such an active learning cycle for systematic reviewing are described in Table 1 ; see the Supplementary Information for an overview of all of the software that we considered (note that this list was based on a review of software tools 12 ). However, existing tools have two main drawbacks. First, many are closed source applications with black box algorithms, which is problematic as transparency and data ownership are essential in the era of open science 22 . Second, to our knowledge, existing tools lack the necessary flexibility to deal with the large range of possible concepts to be learned by a screening machine. For example, in systematic reviews, the optimal type of classifier will depend on variable parameters, such as the proportion of relevant publications in the initial search and the complexity of the inclusion criteria used by the researcher 23 . For this reason, any successful system must allow for a wide range of classifier types. Benchmark testing is crucial to understand the real-world performance of any machine learning-aided system, but such benchmark options are currently mostly lacking.

In this paper we present an open source machine learning-aided pipeline with active learning for systematic reviews called ASReview. The goal of ASReview is to help scholars and practitioners to get an overview of the most relevant records for their work as efficiently as possible while being transparent in the process. The open, free and ready-to-use software ASReview addresses all concerns mentioned above: it is open source, uses active learning and allows multiple machine learning models. It also has a benchmark mode, which is especially useful for comparing and designing algorithms. Furthermore, it is intended to be easily extensible, allowing third parties to add modules that enhance the pipeline. Although we focus this paper on systematic reviews, ASReview can handle any text source.

In what follows, we first present the pipeline for manual versus machine learning-aided systematic reviews. We then show how ASReview has been set up and how ASReview can be used in different workflows by presenting several real-world use cases. We subsequently demonstrate the results of simulations that benchmark performance and present the results of a series of user-experience tests. Finally, we discuss future directions.

Pipeline for manual and machine learning-aided systematic reviews

The pipeline of a systematic review without active learning traditionally starts with researchers doing a comprehensive search in multiple databases 24 , using free text words as well as controlled vocabulary to retrieve potentially relevant references. The researcher then typically verifies that the key papers they expect to find are indeed included in the search results. The researcher downloads a file with records containing the text to be screened into a reference manager; in the case of systematic reviewing, this contains the titles and abstracts (and potentially other metadata such as the authors’ names, journal name and DOI) of potentially relevant references. Ideally, two or more researchers then screen the records’ titles and abstracts on the basis of the eligibility criteria established beforehand 4 . After all records have been screened, the full texts of the potentially relevant records are read to determine which of them will be ultimately included in the review. Most records are excluded in the title and abstract phase. Typically, only a small fraction of the records belong to the relevant class, making title and abstract screening an important bottleneck in the systematic reviewing process 25 . For instance, a recent study analysed 10,115 records and excluded 9,847 after title and abstract screening, a drop of more than 95% 26 . ASReview therefore focuses on this labour-intensive step.

The research pipeline of ASReview is depicted in Fig. 1 . The researcher starts with a search exactly as described above and subsequently uploads a file containing the records (that is, metadata containing the text of the titles and abstracts) into the software. Prior knowledge is then selected, which is used for training the first model and presenting the first record to the researcher. As screening is a binary classification problem, the reviewer must select at least one key record to include and at least one to exclude on the basis of background knowledge. More prior knowledge may result in improved efficiency of the active learning process.

Figure 1. The symbols indicate whether the action is taken by a human, a computer, or whether both options are available.

A machine learning classifier is trained to predict study relevance (labels) from a representation of the record-containing text (feature space) on the basis of prior knowledge. We have purposefully chosen not to include an author name or citation network representation in the feature space to prevent authority bias in the inclusions. In the active learning cycle, the software presents one new record to be screened and labelled by the user. The user’s binary label (1 for relevant versus 0 for irrelevant) is subsequently used to train a new model, after which a new record is presented to the user. This cycle continues until a user-specified stopping criterion has been reached. The user now has a file with (1) records labelled as either relevant or irrelevant and (2) unlabelled records ordered from most to least probable to be relevant as predicted by the current model. This set-up helps the user to move through a large database much more quickly than in the manual process, while the decision process simultaneously remains transparent.
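The cycle described above (train, query the record most likely to be relevant, obtain the human label, retrain) can be sketched as follows. This is an illustrative simplification, not ASReview's actual API: the `active_learning_loop` helper and the `labels_by_oracle` list standing in for the human reviewer are hypothetical, though the model combination (TF–IDF features, naive Bayes, certainty-based querying) follows the defaults described later in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def active_learning_loop(texts, labels_by_oracle, prior, n_queries=5):
    """Simplified researcher-in-the-loop cycle.

    texts: list of title/abstract strings.
    labels_by_oracle: stand-in for the human reviewer's decisions.
    prior: list of (record index, label) pairs of prior knowledge;
           must contain at least one relevant and one irrelevant record.
    """
    X = TfidfVectorizer().fit_transform(texts)  # fixed feature matrix
    labelled = dict(prior)                      # {record index: 0/1}
    for _ in range(n_queries):
        idx = list(labelled)
        clf = MultinomialNB().fit(X[idx], [labelled[i] for i in idx])
        pool = [i for i in range(len(texts)) if i not in labelled]
        if not pool:
            break
        # certainty-based query: present the most likely relevant record
        probs = clf.predict_proba(X[pool])[:, 1]
        query = pool[int(np.argmax(probs))]
        labelled[query] = labels_by_oracle[query]  # the human decision
    return labelled
```

The output is exactly what the paper describes: a growing set of human-labelled records, with the remaining pool rankable by the current model's predicted relevance.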

Software implementation for ASReview

The source code 27 of ASReview is available open source under an Apache 2.0 license, including documentation 28 . Compiled and packaged versions of the software are available on the Python Package Index 29 or Docker Hub 30 . The free and ready-to-use software ASReview implements oracle, simulation and exploration modes. The oracle mode is used to perform a systematic review with interaction by the user, the simulation mode is used for simulation of the ASReview performance on existing datasets, and the exploration mode can be used for teaching purposes and includes several preloaded labelled datasets.

The oracle mode presents records to the researcher, who classifies them. Multiple file formats are supported: (1) RIS files, which are used by digital libraries such as IEEE Xplore, Scopus and ScienceDirect; the citation managers Mendeley, RefWorks, Zotero and EndNote support the RIS format too. (2) Tabular datasets with the .csv, .xlsx and .xls file extensions. CSV files should be comma separated and UTF-8 encoded; for CSV files, the software accepts a set of predetermined labels in line with the ones used in RIS files. Each record in the dataset should hold the metadata on, for example, a scientific publication. Mandatory metadata is text and can, for example, be titles or abstracts from scientific papers. If available, both are used to train the model, but at least one is needed. An advanced option is available that splits the title and abstracts in the feature-extraction step and weights the two feature matrices independently (for TF–IDF only). Other metadata such as author, date, DOI and keywords are optional but not used for training the models. When using ASReview in the simulation or exploration mode, an additional binary variable is required to indicate historical labelling decisions. This column, which is automatically detected, can also be used in the oracle mode as background knowledge for previous selection of relevant papers before entering the active learning cycle. If unavailable, the user has to select at least one relevant record, which can be identified by searching the pool of records. At least one irrelevant record should also be identified; the software allows the user to search for specific records or presents random records, which are most likely to be irrelevant given the extremely imbalanced data.
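The input requirements above (each record needs a title or an abstract; a binary label column is optional) can be illustrated with a small loader. The `load_records` helper and the `included` column name are hypothetical choices for this sketch, not ASReview's actual loader or schema:

```python
import csv
import io

def load_records(csv_text):
    """Read a UTF-8, comma-separated dataset: require a title or an
    abstract per record, and pick up an optional binary label column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in rows:
        title = (row.get("title") or "").strip()
        abstract = (row.get("abstract") or "").strip()
        if not (title or abstract):
            raise ValueError("each record needs a title or an abstract")
        label = row.get("included")  # historical labelling decision
        records.append({
            # the model trains on whatever text is available
            "text": " ".join(t for t in (title, abstract) if t),
            "label": int(label) if label not in (None, "") else None,
        })
    return records
```

Records with a label could seed the prior knowledge for the active learning cycle; unlabelled records form the screening pool.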

The software has a simple yet extensible default model: a naive Bayes classifier, TF–IDF feature extraction, a dynamic resampling balance strategy 31 and certainty-based sampling 17 , 32 for the query strategy. These defaults were chosen on the basis of their consistently high performance in benchmark experiments across several datasets 31 . Moreover, the low computation time of these default settings makes them attractive in applications, given that the software should be able to run locally. Users can change the settings, shown in Table 2 , and technical details are described in our documentation 28 . Users can also add their own classifiers, feature extraction techniques, query strategies and balance strategies.

ASReview has a number of implemented features (see Table 2 ). First, there are several classifiers available: (1) naive Bayes; (2) support vector machines; (3) logistic regression; (4) neural networks; (5) random forests; (6) LSTM-base, which consists of an embedding layer, an LSTM layer with one output, a dense layer and a single sigmoid output node; and (7) LSTM-pool, which consists of an embedding layer, an LSTM layer with many outputs, a max pooling layer and a single sigmoid output node. The feature extraction techniques available are Doc2Vec 33 , embedding LSTM, embedding with IDF or TF–IDF 34 (the default is unigram, with the option to run n -grams while other parameters are set to the defaults of Scikit-learn 35 ) and sBERT 36 . The available query strategies for the active learning part are (1) random selection, ignoring model-assigned probabilities; (2) uncertainty-based sampling, which chooses the most uncertain record according to the model (that is, closest to 0.5 probability); (3) certainty-based sampling (max in ASReview), which chooses the record most likely to be included according to the model; and (4) mixed sampling, which uses a combination of random and certainty-based sampling.
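The four query strategies can be contrasted in a few lines. This is a sketch of the definitions given above, not ASReview's implementation; the `query` function and the 5% random share used for mixed sampling are illustrative assumptions:

```python
import numpy as np

def query(probs, labelled_mask, strategy="max", rng=None):
    """Pick the next record index from model-assigned relevance
    probabilities. `labelled_mask` is a boolean array marking records
    already screened."""
    pool = np.flatnonzero(~labelled_mask)
    p = probs[pool]
    if strategy == "random":        # ignore model-assigned probabilities
        rng = rng or np.random.default_rng(0)
        return int(rng.choice(pool))
    if strategy == "uncertainty":   # most uncertain: closest to 0.5
        return int(pool[np.argmin(np.abs(p - 0.5))])
    if strategy == "max":           # certainty-based: most likely relevant
        return int(pool[np.argmax(p)])
    if strategy == "mixed":         # mostly max, occasionally random
        rng = rng or np.random.default_rng(0)
        pick = "random" if rng.random() < 0.05 else "max"
        return query(probs, labelled_mask, pick, rng)
    raise ValueError(strategy)
```

Note how `max` and `uncertainty` serve different goals: `max` surfaces relevant records as early as possible (the reviewing objective), whereas `uncertainty` would optimize the model itself.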

There are several balance strategies that rebalance and reorder the training data. This is necessary because the data is typically extremely imbalanced; we have therefore implemented the following balance strategies: (1) full sampling, which uses all of the labelled records; (2) undersampling the irrelevant records so that the included and excluded records are in some particular ratio (closer to one); and (3) dynamic resampling, a novel method similar to undersampling in that it decreases the imbalance of the training data 31 . However, in dynamic resampling, the number of irrelevant records is decreased, whereas the number of relevant records is increased by duplication such that the total number of records in the training data remains the same. The ratio between relevant and irrelevant records is not fixed over iterations, but dynamically updated depending on the number of labelled records, the total number of records and the ratio between relevant and irrelevant records. Details on all of the described algorithms can be found in the code and documentation referred to above.
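The core idea of dynamic resampling (shrink the irrelevant majority, duplicate the relevant minority, keep the total size constant) can be sketched as follows. This is a reading of the description above rather than the exact ASReview algorithm: in particular, the fixed 50/50 target ratio here is a simplification, since the real method updates the ratio dynamically as labelling progresses.

```python
import numpy as np

def dynamic_resample(idx_rel, idx_irr, rng=None):
    """Rebalance training indices: undersample the irrelevant records
    and duplicate the relevant ones so the total size stays the same."""
    rng = rng or np.random.default_rng(42)
    total = len(idx_rel) + len(idx_irr)
    n_rel = total // 2                 # simplified fixed target ratio
    n_irr = total - n_rel
    rel = rng.choice(idx_rel, size=n_rel, replace=True)   # duplicate
    irr = rng.choice(idx_irr, size=n_irr, replace=False)  # undersample
    return np.concatenate([rel, irr])
```

With, say, 2 relevant and 10 irrelevant labelled records, the resampled training set still has 12 rows, but now 6 of them are (duplicated) relevant records.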

By default, ASReview converts the records’ texts into a document-term matrix; terms are converted to lowercase and no stop words are removed (although this can be changed). As the document-term matrix is identical in each iteration of the active learning cycle, it is generated in advance of model training and stored in the (active learning) state file. Each row of the document-term matrix can easily be requested from the state file. Records are internally identified by their row number in the input dataset. In oracle mode, the record that is selected to be classified is retrieved from the state file and the record text and other metadata (such as title and abstract) are retrieved from the original dataset (from the file or the computer’s memory). ASReview can run on your local computer, or on a (self-hosted) local or remote server. Data (all records and their labels) remain on the user’s computer. Data ownership and confidentiality are crucial and no data are processed or used in any way by third parties. This is unique by comparison with some of the existing systems, as shown in the last column of Table 1 .
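The preprocessing defaults described above (lowercase, no stop-word removal, matrix built once before the cycle, rows identified by record number) correspond roughly to the following scikit-learn configuration. This is a sketch of the described behaviour, not ASReview's internal code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the document-term matrix once, before the active learning
# cycle starts; each row corresponds to one record, identified by
# its row number in the input dataset.
vectorizer = TfidfVectorizer(lowercase=True, stop_words=None)
texts = ["An Example Abstract", "another example abstract"]
dtm = vectorizer.fit_transform(texts)  # cached, reused every iteration
```

Because the matrix never changes between iterations, only the (cheap) classifier refit happens inside the loop, which is what keeps local execution fast.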

Real-world use cases and high-level function descriptions

Below we highlight a number of real-world use cases and high-level function descriptions for using the pipeline of ASReview.

ASReview can be integrated in classic systematic reviews or meta-analyses. Such reviews or meta-analyses entail several explicit and reproducible steps, as outlined in the PRISMA guidelines 4 . Scholars identify all likely relevant publications in a standardized way, screen retrieved publications to select eligible studies on the basis of defined eligibility criteria, extract data from eligible studies and synthesize the results. ASReview fits into this process, particularly in the abstract screening phase. ASReview does not replace the initial step of collecting all potentially relevant studies. As such, results from ASReview depend on the quality of the initial search process, including selection of databases 24 and construction of comprehensive searches using keywords and controlled vocabulary. However, ASReview can be used to broaden the scope of the search (by keyword expansion or omitting limitations in the search query), resulting in a higher number of initial papers and limiting the risk of missing relevant papers during the search step (that is, more focus on recall instead of precision).

Furthermore, many reviewers nowadays move towards meta-reviews when analysing very large literature streams, that is, systematic reviews of systematic reviews 37 . This can be problematic as the various reviews included could use different eligibility criteria and are therefore not always directly comparable. Due to the efficiency of ASReview, scholars using the tool could conduct the study by analysing the papers directly instead of using the systematic reviews. Furthermore, ASReview supports the rapid update of a systematic review. The included papers from the initial review are used to train the machine learning model before screening of the updated set of papers starts. This allows the researcher to quickly screen the updated set of papers on the basis of decisions made in the initial run.

As an example case, let us look at the current literature on COVID-19 and the coronavirus. An enormous number of papers are being published on COVID-19. It is very time consuming to manually find relevant papers (for example, to develop treatment guidelines). This is especially problematic as urgent overviews are required. Medical guidelines rely on comprehensive systematic reviews, but the medical literature is growing at breakneck pace and the quality of the research is not universally adequate for summarization into policy 38 . Such reviews must entail adequate protocols with explicit and reproducible steps, including identifying all potentially relevant papers, extracting data from eligible studies, assessing potential for bias and synthesizing the results into medical guidelines. Researchers need to screen (tens of) thousands of COVID-19-related studies by hand to find relevant papers to include in their overview. Using ASReview, this can be done far more efficiently by selecting key papers that match their (COVID-19) research question in the first step; this starts the active learning cycle and leads to the most relevant COVID-19 papers for their research question being presented next. A plug-in was therefore developed for ASReview 39 , which contains three databases that are updated automatically whenever a new version is released by the owners of the data: (1) the CORD-19 database, developed by the Allen Institute for AI, containing publications on COVID-19 and other coronavirus research (for example, SARS and MERS) from PubMed Central, the WHO COVID-19 database of publications, the preprint servers bioRxiv and medRxiv, and papers contributed by specific publishers 40 . The CORD-19 dataset is updated daily by the Allen Institute for AI and is likewise updated daily in the plug-in. (2) In addition to the full dataset, we automatically construct a daily subset of the database with studies published after December 1st, 2019 to search for relevant papers published during the COVID-19 crisis. (3) A separate dataset of COVID-19-related preprints, containing metadata of preprints from over 15 preprint servers across disciplines, published since January 1st, 2020 41 . The preprint dataset is updated weekly by the maintainers and then automatically updated in ASReview as well. As this dataset is not readily available to researchers through regular search engines (for example, PubMed), its inclusion in ASReview provided added value to researchers interested in COVID-19 research, especially if they want a quick way to screen preprints specifically.

Simulation study

To evaluate the performance of ASReview on a labelled dataset, users can employ the simulation mode. As an example, we ran simulations based on four labelled datasets with version 0.7.2 of ASReview. All scripts to reproduce the results in this paper can be found on Zenodo ( https://doi.org/10.5281/zenodo.4024122 ) 42 , whereas the results are available at OSF ( https://doi.org/10.17605/OSF.IO/2JKD6 ) 43 .

First, we analysed the performance for a study systematically describing studies that performed viral metagenomic next-generation sequencing in common livestock such as cattle, small ruminants, poultry and pigs 44 . Studies were retrieved from Embase ( n  = 1,806), Medline ( n  = 1,384), Cochrane Central ( n  = 1), Web of Science ( n  = 977) and Google Scholar ( n  = 200, the top relevant references). After deduplication, 2,481 studies remained from the initial search, of which 120 were inclusions (4.84%).

A second simulation study was performed on the results for a systematic review of studies on fault prediction in software engineering 45 . Studies were obtained from the ACM Digital Library, IEEE Xplore and the ISI Web of Science. Furthermore, a snowballing strategy and a manual search were conducted, yielding a total of 8,911 publications of which 104 were included in the systematic review (1.2%).

A third simulation study was performed on a review of longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of post-traumatic stress assessed after trauma exposure 46 , 47 ; 5,782 studies were obtained by searching PubMed, Embase, PsycINFO and Scopus and through a snowballing strategy in which both the references and the citations of the included papers were screened. Thirty-eight studies were included in the review (0.66%).

A fourth simulation study was performed on the results for a systematic review on the efficacy of angiotensin-converting enzyme inhibitors, from a study collecting various systematic review datasets from the medical sciences 15 . The collection is a subset of 2,544 publications from the TREC 2004 Genomics Track document corpus 48 . This is a static subset from all MEDLINE records from 1994 through 2003, which allows for replicability of results. Forty-one publications were included in the review (1.6%).

Performance metrics

We evaluated the four datasets using three performance metrics. We first assess the work saved over sampling (WSS), which is the percentage reduction in the number of records needed to screen achieved by using active learning instead of screening records at random; WSS is measured at a given level of recall of relevant records, for example 95%, indicating the work reduction in screening effort at the cost of failing to detect 5% of the relevant records. For some researchers it is essential that all relevant literature on the topic is retrieved; this entails that the recall should be 100% (that is, WSS@100%). We also propose the proportion of relevant references found after having screened the first 10% of the records (RRF@10%). This is a useful metric for getting a quick overview of the relevant literature.
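Given the order in which a screening strategy presents records, both metrics follow directly from their definitions above. The functions below are a straightforward reading of those definitions (the evaluation code used in the paper may differ in details such as tie-breaking):

```python
import math

def wss(labels_in_order, recall=0.95):
    """Work saved over sampling: the fraction of screening avoided,
    relative to random screening, when stopping at the given recall.
    `labels_in_order` lists the 0/1 relevance labels in the order the
    records were presented for screening."""
    n, n_rel = len(labels_in_order), sum(labels_in_order)
    target = math.ceil(recall * n_rel)
    found = 0
    for k, y in enumerate(labels_in_order, start=1):
        found += y
        if found >= target:
            # screened k of n records; random screening would need
            # a fraction `recall` of n, hence the (1 - recall) offset
            return (n - k) / n - (1 - recall)
    raise ValueError("recall level never reached")

def rrf(labels_in_order, fraction=0.10):
    """Proportion of all relevant records found after screening the
    first `fraction` of the ranked records (RRF@10% by default)."""
    k = math.ceil(fraction * len(labels_in_order))
    return sum(labels_in_order[:k]) / sum(labels_in_order)
```

For example, if all 3 relevant records in a 10-record dataset appear within the first 4 presented, WSS@100% is (10 − 4)/10 − 0 = 0.6.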

For every dataset, 15 runs were performed with one random inclusion and one random exclusion (see Fig. 2 ). The classical review performance with randomly found inclusions is shown by the dashed line. The average work saved over sampling at 95% recall for ASReview is 83% and ranges from 67% to 92%. Hence, 95% of the eligible studies will be found after screening only between 8% and 33% of the studies. Furthermore, the number of relevant abstracts found after reading 10% of the abstracts ranges from 70% to 100%. In short, our software would have saved many hours of work.

Figure 2. a – d , Results of the simulation study for a study systematically reviewing studies that performed viral metagenomic next-generation sequencing in common livestock ( a ), a systematic review of studies on fault prediction in software engineering ( b ), longitudinal studies that applied unsupervised machine learning techniques to longitudinal data of self-reported symptoms of post-traumatic stress assessed after trauma exposure ( c ), and a systematic review on the efficacy of angiotensin-converting enzyme inhibitors ( d ). Fifteen runs (shown with separate lines) were performed for every dataset, with only one random inclusion and one random exclusion. The classical review performances with randomly found inclusions are shown by the dashed lines.

Usability testing (user experience testing)

We conducted a series of user experience tests to learn from end users how they experience the software and implement it in their workflow. The study was approved by the Ethics Committee of the Faculty of Social and Behavioral Sciences of Utrecht University (ID 20-104).

Unstructured interviews

The first user experience (UX) test—carried out in December 2019—was conducted with an academic research team in a substantive research field (public administration and organizational science) that has conducted various systematic reviews and meta-analyses. It was composed of three university professors (ranging from assistant to full) and three PhD candidates. In one 3.5 h session, the participants used the software and provided feedback via unstructured interviews and group discussions. The goal was to provide feedback on installing the software and testing the performance on their own data. After these sessions we prioritized the feedback in a meeting with the ASReview team, which resulted in the release of v.0.4 and v.0.6. An overview of all releases can be found on GitHub 27 .

A second UX test was conducted with four experienced researchers developing medical guidelines based on classical systematic reviews, and two experienced reviewers working at a pharmaceutical non-profit organization who work on updating reviews with new data. In four sessions, held in February to March 2020, these users tested the software following our testing protocol. After each session we implemented the feedback provided by the experts and asked them to review the software again. The main feedback was about how to upload datasets and select prior papers. Their feedback resulted in the release of v.0.7 and v.0.9.

Systematic UX test

In May 2020 we conducted a systematic UX test. Two groups of users were distinguished: an inexperienced group and an experienced group who had already used ASReview. Due to the COVID-19 lockdown, the usability tests were conducted via video calling, a set-up known as human-moderated remote testing 49 . During the tests, one person (SH) asked the questions and helped the participant with the tasks; the other person, a user experience professional at the IT department of Utrecht University (MH), observed and made notes.

To analyse the notes, we used thematic analysis, a method that divides the information into themes with distinct meanings 50 , using the NVivo 12 software 51 . When something went wrong, the text was coded as 'showstopper'; when something did not go smoothly, it was coded as 'doubtful'; and when something went well, it was coded as 'superb'. The features the participants requested for future versions of the ASReview tool were discussed with the lead engineer of the ASReview team and were submitted to GitHub as issues or feature requests.

The answers to the quantitative questions can be found at the Open Science Framework 52 . The participants ( N  = 11) rated the tool with a grade of 7.9 (s.d. = 0.9) on a scale from one to ten (Table 2 ). The inexperienced users on average rated the tool with an 8.0 (s.d. = 1.1, N  = 6). The experienced users on average rated the tool with a 7.8 (s.d. = 0.9, N  = 5). The participants described the usability test with words such as helpful, accessible, fun, clear and obvious.

The UX tests resulted in the new releases v0.10 and v0.10.1 and the major release v0.11, a major revision of the graphical user interface. The documentation was upgraded to make installing and launching ASReview more straightforward. We made setting up a project, selecting a dataset and finding past knowledge more intuitive and flexible, and added a project dashboard with information on progress and advanced settings.

Continuous input via the open source community

Finally, the ASReview development team receives continuous feedback from the open science community about, among other things, the user experience. In every new release we implement features listed by our users. Recurring UX tests are performed to keep up with the needs of users and improve the value of the tool.

We designed a system to accelerate the step of screening titles and abstracts, to help researchers conduct a systematic review or meta-analysis as efficiently and transparently as possible. Our system uses active learning to train a machine learning model that predicts relevance from texts using a limited number of labelled examples. The classifier, feature extraction technique, balance strategy and active learning query strategy are interchangeable. We provide an open source software implementation, ASReview, and compared its performance with state-of-the-art systems across a wide range of real-world systematic reviewing applications. Based on our experiments, ASReview provides defaults for its parameters, which exhibited good performance on average across the applications we examined. However, we stress that in practical applications these defaults should be carefully examined; for this purpose, the software provides users with a simulation mode. We encourage users and developers to evaluate the proposed approach further in their own applications, and to take advantage of the open source nature of the project by contributing further developments.
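The loop described above, train a model on the labelled records, present the record most likely to be relevant, let the researcher label it, and retrain, can be sketched in a few lines. The sketch below is a hypothetical stand-in, not ASReview's implementation: a tiny bag-of-words naive Bayes plays the role of the interchangeable classifier and feature extractor, and a simple oracle function simulates the reviewing researcher.

```python
# Minimal certainty-based active learning loop for screening, assuming a
# bag-of-words naive Bayes relevance model (label 1 = relevant).
import math
from collections import Counter

def train(labelled):
    """Per-class word counts from (text, label) pairs."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for text, label in labelled:
        for word in text.lower().split():
            counts[label][word] += 1
            totals[label] += 1
    return counts, totals

def relevance(text, counts, totals):
    """Laplace-smoothed log-odds that a record is relevant."""
    vocab = len(set(counts[0]) | set(counts[1])) or 1
    score = 0.0
    for word in text.lower().split():
        p_rel = (counts[1][word] + 1) / (totals[1] + vocab)
        p_irr = (counts[0][word] + 1) / (totals[0] + vocab)
        score += math.log(p_rel / p_irr)
    return score

def screen(pool, prior, oracle, n_queries):
    """Certainty-based active learning: retrain, then always present the
    record the current model considers most likely to be relevant."""
    labelled, pool = list(prior), list(pool)
    for _ in range(min(n_queries, len(pool))):
        counts, totals = train(labelled)
        best = max(pool, key=lambda t: relevance(t, counts, totals))
        pool.remove(best)
        labelled.append((best, oracle(best)))  # researcher-in-the-loop step
    return labelled

# Toy run: two prior papers seed the model, then three records are queried.
prior = [("active learning for screening", 1), ("crop irrigation survey", 0)]
pool = ["screening with active learning",
        "irrigation methods compared",
        "learning to screen abstracts"]
oracle = lambda text: int("screen" in text or "learning" in text)
result = screen(pool, prior, oracle, n_queries=3)
```

In a real screening task the loop would stop early, after a run of consecutive irrelevant records, which is where the workload saving over random-order screening comes from.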

Drawbacks of machine learning-based screening systems, including our own, remain. First, although the active learning step greatly reduces the number of manuscripts that must be screened, it also prevents a straightforward evaluation of the system’s error rates without further onerous labelling. Providing users with an accurate estimate of the system’s error rate in the application at hand is therefore a pressing open problem. Second, although, as argued above, the use of such systems is not limited in principle to reviewing, no empirical benchmarks of actual performance in these other situations yet exist to our knowledge. Third, machine learning-based screening systems automate the screening step only; although the screening step is time-consuming and a good target for automation, it is just one part of a much larger process, including the initial search, data extraction, coding for risk of bias, summarizing results and so on. Although some other works, similar to our own, have looked at (semi-)automating some of these steps in isolation 53 , 54 , to our knowledge the field is still far removed from an integrated system that would truly automate the review process while guaranteeing the quality of the produced evidence synthesis. Integrating the various tools that are currently under development to aid the systematic reviewing pipeline is therefore a worthwhile topic for future development.

Possible future research could also focus on the performance of identifying full-text articles with different document lengths and domain-specific terminology, or even other types of text, such as newspaper articles and court cases. When the selection of past knowledge based on expert knowledge is not possible, alternative methods could be explored; for example, unsupervised learning or pseudo-labelling algorithms could be used to improve training 55 , 56 . In addition, as the NLP community pushes forward the state of the art in feature extraction methods, these can easily be added to our system as well. In all cases, performance benefits should be carefully evaluated using benchmarks for the task at hand. To this end, common benchmark challenges should be constructed that allow for an even comparison of the various tools now available. To facilitate such a benchmark, we have constructed a repository of publicly available systematic reviewing datasets 57 .

The future of systematic reviewing will be an interaction with machine learning algorithms to deal with the enormous increase of available text. We invite the community to contribute to open source projects such as our own, as well as to common benchmark challenges, so that we can provide measurable and reproducible improvement over current practice.

Data availability

The results described in this paper are available at the Open Science Framework ( https://doi.org/10.17605/OSF.IO/2JKD6 ) 43 . The answers to the quantitative questions of the UX test can be found at the Open Science Framework (OSF.IO/7PQNM) 52 .

Code availability

All code to reproduce the results described in this paper can be found on Zenodo ( https://doi.org/10.5281/zenodo.4024122 ) 42 . All code for the software ASReview is available under an Apache 2.0 license ( https://doi.org/10.5281/zenodo.3345592 ) 27 , is maintained on GitHub 63 and includes documentation ( https://doi.org/10.5281/zenodo.4287120 ) 28 .

Bornmann, L. & Mutz, R. Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. J. Assoc. Inf. Sci. Technol. 66 , 2215–2222 (2015).

Gough, D., Oliver, S. & Thomas, J. An Introduction to Systematic Reviews (Sage, 2017).

Cooper, H. Research Synthesis and Meta-analysis: A Step-by-Step Approach (SAGE Publications, 2015).

Liberati, A. et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. J. Clin. Epidemiol. 62 , e1–e34 (2009).

Boaz, A. et al. Systematic Reviews: What have They Got to Offer Evidence Based Policy and Practice? (ESRC UK Centre for Evidence Based Policy and Practice London, 2002).

Oliver, S., Dickson, K. & Bangpan, M. Systematic Reviews: Making Them Policy Relevant. A Briefing for Policy Makers and Systematic Reviewers (UCL Institute of Education, 2015).

Petticrew, M. Systematic reviews from astronomy to zoology: myths and misconceptions. Brit. Med. J. 322 , 98–101 (2001).

Lefebvre, C., Manheimer, E. & Glanville, J. in Cochrane Handbook for Systematic Reviews of Interventions (eds. Higgins, J. P. & Green, S.) 95–150 (John Wiley & Sons, 2008); https://doi.org/10.1002/9780470712184.ch6 .

Sampson, M., Tetzlaff, J. & Urquhart, C. Precision of healthcare systematic review searches in a cross-sectional sample. Res. Synth. Methods 2 , 119–125 (2011).

Wang, Z., Nayfeh, T., Tetzlaff, J., O’Blenis, P. & Murad, M. H. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE 15 , e0227742 (2020).

Marshall, I. J. & Wallace, B. C. Toward systematic review automation: a practical guide to using machine learning tools in research synthesis. Syst. Rev. 8 , 163 (2019).

Harrison, H., Griffin, S. J., Kuhn, I. & Usher-Smith, J. A. Software tools to support title and abstract screening for systematic reviews in healthcare: an evaluation. BMC Med. Res. Methodol. 20 , 7 (2020).

O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M. & Ananiadou, S. Using text mining for study identification in systematic reviews: a systematic review of current approaches. Syst. Rev. 4 , 5 (2015).

Wallace, B. C., Trikalinos, T. A., Lau, J., Brodley, C. & Schmid, C. H. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinf. 11 , 55 (2010).

Cohen, A. M., Hersh, W. R., Peterson, K. & Yen, P.-Y. Reducing workload in systematic review preparation using automated citation classification. J. Am. Med. Inform. Assoc. 13 , 206–219 (2006).

Kremer, J., Steenstrup Pedersen, K. & Igel, C. Active learning with support vector machines. WIREs Data Min. Knowl. Discov. 4 , 313–326 (2014).

Miwa, M., Thomas, J., O’Mara-Eves, A. & Ananiadou, S. Reducing systematic review workload through certainty-based screening. J. Biomed. Inform. 51 , 242–253 (2014).

Settles, B. Active Learning Literature Survey (Minds@UW, 2009); https://minds.wisconsin.edu/handle/1793/60660

Holzinger, A. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Inform. 3 , 119–131 (2016).

Van de Schoot, R. & De Bruin, J. Researcher-in-the-loop for Systematic Reviewing of Text Databases (Zenodo, 2020); https://doi.org/10.5281/zenodo.4013207

Kim, D., Seo, D., Cho, S. & Kang, P. Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec. Inf. Sci. 477 , 15–29 (2019).

Nosek, B. A. et al. Promoting an open research culture. Science 348 , 1422–1425 (2015).

Kilicoglu, H., Demner-Fushman, D., Rindflesch, T. C., Wilczynski, N. L. & Haynes, R. B. Towards automatic recognition of scientifically rigorous clinical research evidence. J. Am. Med. Inform. Assoc. 16 , 25–31 (2009).

Gusenbauer, M. & Haddaway, N. R. Which academic search systems are suitable for systematic reviews or meta‐analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 11 , 181–217 (2020).

Borah, R., Brown, A. W., Capers, P. L. & Kaiser, K. A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 7 , e012545 (2017).

de Vries, H., Bekkers, V. & Tummers, L. Innovation in the Public Sector: a systematic review and future research agenda. Public Adm. 94 , 146–166 (2016).

Van de Schoot, R. et al. ASReview: Active Learning for Systematic Reviews (Zenodo, 2020); https://doi.org/10.5281/zenodo.3345592

De Bruin, J. et al. ASReview Software Documentation 0.14 (Zenodo, 2020); https://doi.org/10.5281/zenodo.4287120

ASReview PyPI Package (ASReview Core Development Team, 2020); https://pypi.org/project/asreview/

Docker container for ASReview (ASReview Core Development Team, 2020); https://hub.docker.com/r/asreview/asreview

Ferdinands, G. et al. Active Learning for Screening Prioritization in Systematic Reviews—A Simulation Study (OSF Preprints, 2020); https://doi.org/10.31219/osf.io/w6qbg

Fu, J. H. & Lee, S. L. Certainty-enhanced active learning for improving imbalanced data classification. In 2011 IEEE 11th International Conference on Data Mining Workshops 405–412 (IEEE, 2011).

Le, Q. V. & Mikolov, T. Distributed representations of sentences and documents. Preprint at https://arxiv.org/abs/1405.4053 (2014).

Ramos, J. Using TF–IDF to determine word relevance in document queries. In Proc. 1st Instructional Conference on Machine Learning Vol. 242, 133–142 (ICML, 2003).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using siamese BERT-networks. Preprint at https://arxiv.org/abs/1908.10084 (2019).

Smith, V., Devane, D., Begley, C. M. & Clarke, M. Methodology in conducting a systematic review of systematic reviews of healthcare interventions. BMC Med. Res. Methodol. 11 , 15 (2011).

Wynants, L. et al. Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal. Brit. Med. J . 369 , 1328 (2020).

Van de Schoot, R. et al. Extension for COVID-19 Related Datasets in ASReview (Zenodo, 2020). https://doi.org/10.5281/zenodo.3891420 .

Lu Wang, L. et al. CORD-19: The COVID-19 open research dataset. Preprint at https://arxiv.org/abs/2004.10706 (2020).

Fraser, N. & Kramer, B. Covid19_preprints (FigShare, 2020); https://doi.org/10.6084/m9.figshare.12033672.v18

Ferdinands, G., Schram, R., Van de Schoot, R. & De Bruin, J. Scripts for ‘ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews’ (Zenodo, 2020); https://doi.org/10.5281/zenodo.4024122

Ferdinands, G., Schram, R., van de Schoot, R. & de Bruin, J. Results for ‘ASReview: Open Source Software for Efficient and Transparent Active Learning for Systematic Reviews’ (OSF, 2020); https://doi.org/10.17605/OSF.IO/2JKD6

Kwok, K. T. T., Nieuwenhuijse, D. F., Phan, M. V. T. & Koopmans, M. P. G. Virus metagenomics in farm animals: a systematic review. Viruses 12 , 107 (2020).

Hall, T., Beecham, S., Bowes, D., Gray, D. & Counsell, S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 38 , 1276–1304 (2012).

van de Schoot, R., Sijbrandij, M., Winter, S. D., Depaoli, S. & Vermunt, J. K. The GRoLTS-Checklist: guidelines for reporting on latent trajectory studies. Struct. Equ. Model. Multidiscip. J. 24 , 451–467 (2017).

van de Schoot, R. et al. Bayesian PTSD-trajectory analysis with informed priors based on a systematic literature search and expert elicitation. Multivar. Behav. Res. 53 , 267–291 (2018).

Cohen, A. M., Bhupatiraju, R. T. & Hersh, W. R. Feature generation, feature selection, classifiers, and conceptual drift for biomedical document triage. In Proc. 13th Text Retrieval Conference (TREC, 2004).

Vasalou, A., Ng, B. D., Wiemer-Hastings, P. & Oshlyansky, L. Human-moderated remote user testing: protocols and applications. In 8th ERCIM Workshop, User Interfaces for All Vol. 19 (ERCIM, 2004).

Joffe, H. in Qualitative Research Methods in Mental Health and Psychotherapy: A Guide for Students and Practitioners (eds Harper, D. & Thompson, A. R.) Ch. 15 (Wiley, 2012).

NVivo v. 12 (QSR International Pty, 2019).

Hindriks, S., Huijts, M. & van de Schoot, R. Data for UX-test ASReview - June 2020. OSF https://doi.org/10.17605/OSF.IO/7PQNM (2020).

Marshall, I. J., Kuiper, J. & Wallace, B. C. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J. Am. Med. Inform. Assoc. 23 , 193–201 (2016).

Nallapati, R., Zhou, B., dos Santos, C. N., Gulcehre, Ç. & Xiang, B. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proc. 20th SIGNLL Conference on Computational Natural Language Learning 280–290 (Association for Computational Linguistics, 2016).

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T. & Le, Q. V. Unsupervised data augmentation for consistency training. Preprint at https://arxiv.org/abs/1904.12848 (2019).

Ratner, A. et al. Snorkel: rapid training data creation with weak supervision. VLDB J. 29 , 709–730 (2020).

Systematic Review Datasets (ASReview Core Development Team, 2020); https://github.com/asreview/systematic-review-datasets

Wallace, B. C., Small, K., Brodley, C. E., Lau, J. & Trikalinos, T. A. Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. In Proc. 2nd ACM SIGHIT International Health Informatics Symposium 819–824 (Association for Computing Machinery, 2012).

Cheng, S. H. et al. Using machine learning to advance synthesis and use of conservation and environmental evidence. Conserv. Biol. 32 , 762–764 (2018).

Yu, Z., Kraft, N. & Menzies, T. Finding better active learners for faster literature reviews. Empir. Softw. Eng . 23 , 3161–3186 (2018).

Ouzzani, M., Hammady, H., Fedorowicz, Z. & Elmagarmid, A. Rayyan—a web and mobile app for systematic reviews. Syst. Rev. 5 , 210 (2016).

Przybyła, P. et al. Prioritising references for systematic reviews with RobotAnalyst: a user study. Res. Synth. Methods 9 , 470–488 (2018).

ASReview: Active learning for Systematic Reviews (ASReview Core Development Team, 2020); https://github.com/asreview/asreview

Acknowledgements

We would like to thank the Utrecht University Library, focus area Applied Data Science, and departments of Information and Technology Services, Test and Quality Services, and Methodology and Statistics, for their support. We also want to thank all researchers who shared data, participated in our user experience tests or who gave us feedback on ASReview in other ways. Furthermore, we would like to thank the editors and reviewers for providing constructive feedback. This project was funded by the Innovation Fund for IT in Research Projects, Utrecht University, the Netherlands.

Author information

Authors and affiliations

Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands

Rens van de Schoot, Gerbrich Ferdinands, Albert Harkema, Joukje Willemsen, Yongchao Ma, Qixiang Fang, Sybren Hindriks & Daniel L. Oberski

Department of Research and Data Management Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands

Jonathan de Bruin, Raoul Schram, Parisa Zahedi & Maarten Hoogerwerf

Utrecht University Library, Utrecht University, Utrecht, the Netherlands

Jan de Boer, Felix Weijdema & Bianca Kramer

Department of Test and Quality Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands

Martijn Huijts

School of Governance, Faculty of Law, Economics and Governance, Utrecht University, Utrecht, the Netherlands

Lars Tummers

Department of Biostatistics, Data management and Data Science, Julius Center, University Medical Center Utrecht, Utrecht, the Netherlands

Daniel L. Oberski

Contributions

R.v.d.S. and D.O. originally designed the project, with later input from L.T. J.d.Br. is the lead engineer, software architect and supervises the code base on GitHub. R.S. coded the algorithms and simulation studies. P.Z. coded the very first version of the software. J.d.Bo., F.W. and B.K. developed the systematic review pipeline. M.Huijts led the UX tests, supported by S.H. M.Hoogerwerf developed the architecture of the produced (meta)data. G.F. conducted the simulation study together with R.S. A.H. performed the literature search comparing the different tools together with G.F. J.W. designed all the artwork and helped with formatting the manuscript. Y.M. and Q.F. are responsible for the preprocessing of the metadata under the supervision of J.d.Br. R.v.d.S., D.O. and L.T. wrote the paper with input from all authors. Each co-author has written parts of the manuscript.

Corresponding author

Correspondence to Rens van de Schoot .

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Jian Wu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Overview of software tools supporting systematic reviews.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

van de Schoot, R., de Bruin, J., Schram, R. et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell 3 , 125–133 (2021). https://doi.org/10.1038/s42256-020-00287-7

Download citation

Received : 04 June 2020

Accepted : 17 December 2020

Published : 01 February 2021

Issue Date : February 2021

DOI : https://doi.org/10.1038/s42256-020-00287-7




Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda

Yogesh Kumar

1 Department of Computer Engineering, Indus Institute of Technology and Engineering, Indus University, Ahmedabad, 382115 India

Apeksha Koul

2 Shri Mata Vaishno Devi University, Jammu, India

Ruchi Singla

3 Department of Research, Innovations, Sponsored Projects and Entrepreneurship, CGC Landran, Mohali, India

Muhammad Fazal Ijaz

4 Department of Intelligent Mechatronics Engineering, Sejong University, Seoul, 05006 South Korea

Artificial intelligence can assist providers in a variety of patient care settings and intelligent health systems. Artificial intelligence techniques ranging from machine learning to deep learning are prevalent in healthcare for disease diagnosis, drug discovery, and patient risk identification. Numerous medical data sources are required to diagnose diseases accurately using artificial intelligence techniques, such as ultrasound, magnetic resonance imaging, mammography, genomics, computed tomography scans, etc. Furthermore, artificial intelligence has enhanced the hospital experience and sped up preparing patients to continue their rehabilitation at home. This article covers a comprehensive survey of artificial intelligence techniques for diagnosing numerous diseases, such as Alzheimer's disease, cancer, diabetes, chronic heart disease, tuberculosis, stroke and cerebrovascular disease, hypertension, and skin and liver disease. We conducted an extensive survey including the medical imaging datasets used and their feature extraction and classification processes for prediction. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines are used to select the articles published up to October 2020 on the Web of Science, Scopus, Google Scholar, PubMed, Excerpta Medica Database, and Psychology Information for early prediction of distinct kinds of diseases using artificial intelligence-based techniques. Based on the study of different articles on disease diagnosis, the results are also compared using various quality parameters such as prediction rate, accuracy, sensitivity, specificity, area under the curve (AUC), precision, recall, and F1-score.
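The quality parameters named above have standard confusion-matrix definitions; a minimal sketch (the counts below are illustrative, not taken from any study in the survey):

```python
# Standard confusion-matrix definitions of the comparison metrics named
# above. AUC is omitted because it requires ranked scores, not counts alone.
def diagnostic_metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # recall / true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision   = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Illustrative counts: 80 TP, 10 FP, 95 TN, 15 FN
acc, sens, spec, prec, f1 = diagnostic_metrics(80, 10, 95, 15)
print(round(acc, 3), round(f1, 3))  # 0.875 0.865
```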

Introduction

Healthcare is being reshaped before our eyes by advances in digital healthcare technologies such as artificial intelligence (AI), 3D printing, robotics, and nanotechnology. Digitized healthcare presents numerous opportunities for reducing human errors, improving clinical outcomes, tracking data over time, and more. AI methods, from machine learning to deep learning, play a crucial role in many health-related domains, including developing new clinical systems, managing patient information and records, and treating various illnesses (Usyal et al. 2020 ; Zebene et al. 2019 ). AI techniques are also highly effective in diagnosing different types of diseases. The emergence of AI as a means of improving medical services offers unprecedented opportunities to improve patient and clinical team outcomes, reduce costs, and more. Its applications are not limited to automation; they include supporting patients, families (Musleh et al. 2019 ; Dabowsa et al. 2017 ), and healthcare professionals with information creation and recommendations, as well as surfacing data for shared decision-making. AI can also help to recognize the precise demographic or geographic areas where the incidence of illness or high-risk behaviours is concentrated. Researchers have effectively used deep learning classifiers in diagnostic approaches, for example to compute links between the built environment and obesity frequency (Bhatt et al. 2019 ; Plawiak et al. 2018 ).

AI algorithms must be trained on population-representative data to achieve the performance levels required for scalable success. Trends such as the falling cost of storing and managing data, data collection through electronic health records (Minaee et al. 2020 ; Kumar 2020 ), and the exponential growth of user-generated data have created a data-rich healthcare ecosystem. This growth in healthcare data, however, collides with the lack of well-organized mechanisms for integrating and reconciling these data beyond their current silos. Nevertheless, numerous frameworks and standards facilitate aggregation and help achieve adequate data quantity for AI (Vasal et al. 2020 ). The challenges to the operational adoption of AI technologies in healthcare systems remain considerable, despite the fact that this is one of the most vital growth areas in biomedical research (Kumar et al. 2020 ). The AI community must build an integrated best-practice approach for implementation and maintenance, incorporating established practices of ethical inclusivity, software development, implementation science, and human–computer interaction. AI applications have an enormous potential to improve patient outcomes; at the same time, they could pose significant risks in terms of inappropriate patient risk assessment, diagnostic inaccuracy, treatment recommendations, privacy breaches, and other harms (Gouda et al. 2020 ; Khan and Member 2020 ).

Researchers have applied various AI-based techniques, such as machine and deep learning models, to detect diseases that need to be diagnosed early, for example diseases of the skin, liver, and heart, and Alzheimer's. Accordingly, the related work presents techniques such as the Boltzmann machine, k-nearest neighbour (kNN), support vector machine (SVM), decision tree, logistic regression, fuzzy logic, and artificial neural networks for diagnosing diseases, along with their accuracies. For example, Dabowsa et al. (2017) used a backpropagation neural network to diagnose skin disease and achieved the highest level of accuracy; the authors used real-world data collected from a dermatology department. Ansari et al. (2011) used a recurrent neural network (RNN) to diagnose hepatitis-related liver disease and achieved 97.59% accuracy, while a feed-forward neural network achieved 100%. Owasis et al. (2019) obtained an area under the curve of 97.057 by combining a residual neural network with long short-term memory to diagnose gastrointestinal disease. Khan and Member (2020) introduced a computerized classification framework to recover data patterns. They proposed a five-phase machine learning pipeline, with each phase further organized into several sub-levels, and built a classifier framework with data-transformation and feature-selection procedures embedded in a test and data-analysis design. Skaane et al. (2013) investigated the effect of digital breast tomosynthesis on interval and screen-detected cancers in population-based screening. They performed an independent double-reading trial enrolling women aged 50–69 years, comparing full-field digital mammography plus tomosynthesis with full-field digital mammography alone. Adding tomosynthesis resulted in a non-significant increase in sensitivity to 76.2% and a significant increase in specificity to 96.4%.

Tigga et al. (2020) aimed to assess diabetes risk among patients based on their lifestyle, daily routines, health problems, etc. They experimented on 952 records collected via offline and online questionnaires and also applied the approach to the Pima Indian Diabetes database; the random forest classifier proved to be the best algorithm. Alfian et al. (2018) presented a personalized healthcare monitoring system using Bluetooth-based sensors and real-time data processing; it gathers the user's vital signs, such as blood pressure, heart rate, weight, and blood glucose, from sensor nodes to a smartphone. Katherine et al. (2019) gave an overview of the types of data encountered in the setting of chronic disease and, using various machine learning algorithms, applied extreme value theory to better quantify severity and risk in chronic disease. Gonsalves et al. (2019) aimed to predict coronary heart disease from historical medical data using machine learning. Their work used three supervised learning techniques, Naïve Bayes, support vector machine, and decision tree, to find correlations in coronary heart disease that would help improve the prediction rate; the authors worked on the South African Heart Disease dataset of 462 instances with 10-fold cross-validation. Momin et al. (2019) proposed a secure Internet-of-Things-based healthcare system built on a body sensor network, called BSN-Care, to meet the requirements efficiently; the sensor nodes used an analogue-to-digital converter, a microcontroller, a cloud database, a network, etc. Ijaz et al. (2018) used IoT in a home healthcare monitoring system for diabetes and hypertension patients, with personal healthcare devices that sense and estimate a person's biomedical signals; the system can notify health personnel in real time when a patient experiences an emergency. Shabut et al. (2018) introduced a study to develop a smart, mobile-enabled expert system to perform automatic detection of tuberculosis. They applied supervised machine learning to achieve binary classification from 18 lower-order colour moments; their test indicated an accuracy of 98.4% for tuberculosis antigen-specific antibody detection on the mobile platform. Tran et al. (2019) surveyed the global trends and developments of artificial intelligence applications related to stroke and heart disease to identify research gaps and suggest future research directions. Matusoka et al. (2020) stated that awareness, treatment, and control of hypertension are the most significant factors in preventing stroke and cardiovascular disease. Rathod et al. (2018) proposed an automated image-based retrieval system for skin disease using machine learning classification. Srinivasu et al. (2021a, b) proposed an effective model that can help doctors diagnose skin disease efficiently. The system combined MobileNet V2 with long short-term memory (LSTM), reaching an accuracy of 85% and exceeding other state-of-the-art deep learning models. It analyses, processes, and classifies the image data, predicting on the basis of various features; as a result, it gave higher accuracy and faster results than traditional methods. Uehara et al. (2018) studied extremely obese Japanese patients using artificial intelligence with a rule-extraction procedure. They analysed 79 non-alcoholic steatohepatitis (NASH) and 23 non-NASH patients to build the desired model, achieving a predictive accuracy of 79.2%. Ijaz et al. (2020) proposed a cervical cancer prediction model for early prediction of cervical cancer using risk factors as inputs. The authors utilized several machine learning approaches and outlier detection for the different pre-processing tasks. Srinivasu et al. (2021a, b) used the AW-HARIS algorithm to perform automated segmentation of CT scan images to identify abnormalities in the human liver; the proposed approach outperformed alternatives in the majority of cases, with an accuracy of 78%.

To fully understand how AI assists in the diagnosis and prediction of disease, it is essential to understand the use and applicability of diverse techniques such as SVM, kNN, Naïve Bayes, decision tree, AdaBoost, random forest, k-means clustering, RNN, convolutional neural networks (CNN), deep CNN, generative adversarial networks (GAN), long short-term memory (LSTM), and many others in disease detection systems (Owasis et al. 2019; Nithya et al. 2020). We conducted an extensive survey of machine and deep learning models for disease diagnosis, covering various diseases and their diagnostic methods based on AI techniques. The contribution is framed by four research questions: RQ1. What is the state-of-the-art research for AI in disease diagnosis? RQ2. What are the various types of diseases wherein AI is applied? RQ3. What are the emergent limitations and challenges that the literature identifies for this research area? RQ4. What are the future avenues in healthcare that might benefit from the application of AI? The rest of the work is organized as follows. Section 1, the introduction, gives a brief description of AI in healthcare and of disease diagnosis using various machine and deep learning techniques; its contribution subsection includes Fig. 1, which summarizes the papers drawn from different organized sources for various diseases. Section 2, Materials and Methods, covers the quality assessment and the investigation questions regarding AI techniques and applications. Section 3 covers symptoms of diseases and challenges to diagnostics, a framework for AI in disease-detection modelling, and various AI applications in healthcare. Section 4 reports work on multiple diseases and provides a comparative analysis of different techniques, giving the dataset used and the machine and deep learning methods applied, with outcomes computed in terms of accuracy, sensitivity, specificity, area under the curve, and F-score. Section 5 presents a discussion that answers the investigation questions posed in Sect. 2. Finally, Sect. 6 concludes the work, helping researchers choose the best approach for diagnosing diseases, and outlines the future scope.

Fig. 1 Distribution of published papers for disease diagnosis using artificial intelligence techniques

Contribution

Diseases are usually characterized by signs and symptoms. A sign is an objective manifestation of a disease that doctors can identify, whereas a symptom is a subjective indication of the patient's illness (Plawiak et al. 2018). Every disease has various signs and symptoms, and some, such as fever, are found in countless conditions.

Figure 1 shows the number of papers reviewed under the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for different types of diseases using AI, from 2009 to 2020. The present work emphasizes various diseases and their diagnostic measures using machine and deep learning classifiers. To the best of our knowledge, most past work has focused on diagnostic systems for one or two diseases; the present study instead explores the symptoms and AI-based detection of ten different diseases. Furthermore, this paper is unique in containing an elaborate discussion of various disease diagnoses and predictions, based on an extensive survey of detection methods.

Materials and methods

We have conducted this review according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. The survey offers the reader wide-ranging knowledge of the literature on AI in healthcare (Zhang et al. 2017), covering in particular the following techniques.

Decision tree: breaks the dataset down into progressively smaller subsets. To build the tree, two kinds of entropy are computed from frequency tables, where S is a discrete random variable whose values occur with probabilities p(1), ..., p(c) and the base-2 logarithm gives units of bits (Shannons). The entropy from the frequency table of one attribute is given as (Sabottke and Spieler 2020)

E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

and the entropy from the frequency table of two attributes is given as

E(S, X) = \sum_{c \in X} P(c)\, E(c).

k-nearest neighbour (kNN): a supervised machine learning technique used to solve classification problems; it computes the distance between the test data and the input to produce a prediction, using the Euclidean distance formula, in which p and q are two points in Euclidean n-space and q_i and p_i are the components of the Euclidean vectors starting from the origin of the space (Zaar et al. 2020):

d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}.

Linear regression: determines the relationship between independent and dependent variables (Kolkur et al. 2018):

Y = a + bX + \varepsilon,

where Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope, and ε is the residual error.

Naïve Bayes: provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). The Naïve Bayes classifier assumes that the effect of the value of an attribute x on a given class c is independent of the values of the other predictors (Spann et al. 2020):

P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)},

where P(c|x) is the posterior probability of the class given the attribute, P(x|c) is the likelihood, i.e. the probability of the attribute given the class, P(x) is the prior probability of the attribute, and P(c) is the prior probability of the class.

k-means (Fujita et al. 2020): defines k centres, one for each cluster, placed as far from each other as possible. The algorithm minimizes an objective function known as the squared error function, given by

J = \sum_{j=1}^{C} \sum_{i=1}^{C_j} \lVert x_i - v_j \rVert^2,

where ||x_i − v_j|| is the Euclidean distance between x_i and v_j, C_j is the number of data points in the j-th cluster, and C is the number of cluster centres.

Convolutional neural network: a type of feed-forward artificial neural network in which the connectivity pattern between neurons is inspired by the organization of the animal visual cortex. Convolution is the first step in the process that a convolutional neural network performs (Zhang et al. 2019):

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau,

where f and g are the functions being convolved, t is a real-valued variable, and τ is the integration variable.

Recurrent neural network: used for handling sequential data; the hidden state h(t) is a function f of the previous hidden state h(t − 1) and the current input x(t), with θ the parameters of f (Yang et al. 2020):

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta).

Boltzmann machine: optimizes the weights and a quantity related to the particular problem; its main objective is to maximize the consensus function (CF), given by the following formula (Zhou et al. 2019):

CF = \sum_{i} \sum_{j \le i} w_{ij}\, U_i\, U_j,

where U_i and U_j are the states of the units and w_{ij} is the fixed weight between them.

Gradient descent: an iterative process formulated as (Chang et al. 2018)

\theta_1 = \theta_0 - \alpha\, \nabla J(\theta_0),

where θ_1 is the next position, θ_0 is the current position, α is the step size, and ∇J(θ_0) is the direction of fastest increase.

The extensive survey also serves to expose prevailing knowledge gaps and to identify paths for future research (Lin et al. 2019). The current study follows a structure that draws wide-ranging article-evaluation standards from earlier published articles. Articles included in our research are selected using keywords like "Artificial Intelligence", "Disease Detection", "Disease diagnosis using machine learning", "Disease diagnosis using deep learning", "Artificial Intelligence in Healthcare", and combinations of these keywords. In addition, research articles on the application of AI-based techniques to predicting and diagnosing diseases are included for review. Table 1 lists the criteria by which publications are included or excluded, such as the time window defining how old papers/articles may be, the problem on which an article is based, the presence of a comparative analysis, the methods representing the techniques used, and the research design for analysing the results obtained. These criteria helped us carry out the study efficiently, without wasting time on irrelevant or unnecessary searches and investigations. The inclusion and exclusion standards are developed from the requirements of an article's problem.
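To make the decision-tree entropy formulas above concrete, the following sketch (an illustration written for this review, not code from any surveyed study; the toy symptom table is hypothetical) computes the entropy of a label set and the information gain of splitting on one attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels: E(S) = -sum p_i log2 p_i."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Reduction in entropy after splitting on one attribute: E(S) - E(S, X)."""
    total = len(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    weighted = sum((len(subset) / total) * entropy(subset) for subset in by_value.values())
    return entropy(labels) - weighted

# Hypothetical symptom table: (fever, cough) -> diagnosis.
rows = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
labels = ["flu", "flu", "cold", "healthy"]
print(round(entropy(labels), 3))                    # entropy of the full label set: 1.5
print(round(information_gain(rows, labels, 0), 3))  # gain from splitting on fever: 1.0
```

A decision-tree learner greedily chooses the attribute with the highest information gain at each node; here the "fever" column separates the classes best.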

Inclusion and exclusion parameters

Quality assessment

Research articles included in this review are identified using several quality-evaluation constraints. The significance of each study is assessed against the inclusion and exclusion standards. All research articles included for review encompass machine or deep learning-based prediction models for automatically detecting and diagnosing diseases. Each research work incorporated in this study involved empirical research and reported experimental outcomes. These research articles are described in a separate subsection entitled "literature survey".

The comprehensive selection of research papers is carried out in four phases. (1) Identification: records are identified through various databases. At this phase, we run the planned searches over the chosen abstract and citation databases, noting how many results each search returns. We may also include records found in other places, such as Google Scholar or the reference lists of related papers. All records retrieved from the searches are then aggregated in a single citation-management application. Each database has its own rules for searching terms of interest and combining keywords for an efficient search, so the search technique may vary slightly from database to database. (2) Screening: the selection process is made transparent by reporting the decisions taken at each stage of the systematic review. One investigator reviews the title and abstract of each record to see whether the publication provides information that might be useful or relevant to the systematic review. In certain situations, title and abstract screening is done by two investigators. They do not split the job between them: each investigator screens every title and abstract, and their judgements are then compared. If one decides to exclude an item that the other thinks should be included, they may review the full text together and reach a common conclusion, or enlist a third party (usually the project manager or principal investigator) to decide whether the study should be included. Care is taken to record the most defensible justification for excluding an item. (3) Eligibility: we study the complete contents of the articles that passed title and abstract screening to see whether they can help answer our research questions. Two investigators perform this full-text screening; each examines the entire content of each article before deciding whether to include it. As in the title/abstract screening, we record the number of articles removed and the number excluded under each reason. (4) Included: the full-text articles that pass assessment enter the qualitative analysis, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow chart depicted in Fig. 2. At this stage we know how many papers remain in the systematic review after irrelevant studies have been removed at full-text screening, and we assess how many of these studies can enter a quantitative synthesis, commonly known as a "meta-analysis", in the fourth and final screening stage.
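The four-phase flow above boils down to simple bookkeeping over record counts. As a minimal sketch (written for this review; the function name and toy record titles are hypothetical), deduplication happens between identification and screening, and each later phase subtracts its exclusions:

```python
def prisma_flow(database_records, other_records, title_abstract_excluded, fulltext_excluded):
    """Return the record count at each PRISMA phase.

    database_records / other_records: lists of record titles (may overlap);
    duplicates are removed before screening, as the flow chart requires.
    """
    deduplicated = {title.strip().lower() for title in database_records + other_records}
    screened = len(deduplicated)
    full_text_assessed = screened - title_abstract_excluded  # survived title/abstract screen
    included = full_text_assessed - fulltext_excluded        # enter qualitative synthesis
    return {
        "identified": len(database_records) + len(other_records),
        "screened": screened,
        "full_text_assessed": full_text_assessed,
        "included": included,
    }

counts = prisma_flow(
    database_records=["Paper A", "Paper B", "Paper C"],
    other_records=["paper b", "Paper D"],  # "paper b" duplicates "Paper B"
    title_abstract_excluded=1,
    fulltext_excluded=1,
)
print(counts)  # {'identified': 5, 'screened': 4, 'full_text_assessed': 3, 'included': 2}
```

Reporting these four numbers, together with the reasons for each exclusion, is exactly what the PRISMA flow chart in Fig. 2 visualizes.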

Fig. 2 PRISMA flow chart

To address RQ1, RQ2, RQ3, and RQ4, the current survey examined the number of articles on different disease diagnoses using AI techniques from various data sources, including Psychological Information, the Excerpta Medica Database, Google Scholar, PubMed, Scopus, and Web of Science. These sources are popular sources of articles on AI in health informatics in previous studies. As explained above, articles are chosen based on the specified inclusion and exclusion criteria (Zhang et al. 2017), which were derived from Behera et al. (2019), where the authors established and agreed the variations. To better understand the state of research on AI in disease detection, peer-reviewed papers are cited. The current review suggests that AI and healthcare have developed a strong present-day synergy.

Investigation

Investigation 1: Why do we need AI?

Investigation 2: What is the impact of AI on medical diagnosis and treatment?

Investigation 3: Why is AI important, and how is it used to analyse these diseases?

Investigation 4: Which AI-based algorithm is used in disease diagnosis?

Investigation 5: What are the challenges faced by the researchers while using AI models in several disease diagnoses?

Investigation 6: How are AI-based techniques helping doctors in diagnosing diseases?

Artificial intelligence in disease diagnosis

Detecting an infectious disease is largely a reactive activity, and preventing its spread requires real-time data and analysis. Acting rapidly on accurate data therefore has a significant social and financial effect on the lives of individuals around the globe (Minaee et al. 2020). The promise of applying AI in healthcare ranges from gathering and processing valuable data to programming surgical robots. This section expounds the various techniques and applications of artificial intelligence, disease symptoms, diagnostic issues, and a framework for disease-detection modelling using learning models and AI in healthcare applications (Kumar and Singla 2021).

Framework for AI in disease detection modelling

AI describes the capability of a machine to learn the way a human learns, e.g., through image identification and detecting patterns in problematic situations. AI in healthcare alters how information is collected, analysed, and developed for patient care (Ali et al. 2019).

System design is the fundamental abstract architecture of the system. It includes the framework's views, the arrangement of its components, and how the framework behaves under defined conditions. A solid grasp of the architecture helps the user understand the limits and boundaries of the framework. Figure 3 shows a pictorial representation of the disease-detection model using machine and deep learning classification strategies. Pre-processing: real-world data requires maintenance and pre-processing before being fed to the algorithm (Jo et al. 2019). For understandable reasons, real-world data regularly contains errors with respect to the measures used, and algorithms cannot handle such errors directly. Accordingly, data pre-processing takes this raw data, processes it, eliminates errors, and saves it for further analysis. Data goes through a series of steps during pre-processing (Chen et al. 2019a, b). Data cleaning: data is cleaned through various strategies, which involve completing the data, such as filling in fields that were left blank, or reducing it, such as removing commas or other unknown characters. Data integration: data is combined from a variety of sources; it is then corrected for any mixture of errors, which are promptly handled. Data transformation: data in this step is normalized, depending on the given algorithm. Data normalization can be performed in several ways (Nasser et al. 2019). This step is obligatory in most data-mining algorithms, as the data needs to be as clean as possible; the data is then merged and developed. Data reduction: this step aims to reduce the data to more manageable levels. Training and test data: the dataset is separated into training and test sets. The training data is used to estimate the actual patterns in the data (Sarao et al. 2020); because data is needed for both training and testing, test data is often drawn from the same dataset. After pre-processing, the next step is to test the accuracy of the framework. Analytical model: analytical modelling strategies are used to calculate the probability of a given outcome from the input factors, and they are very productive in disease prediction. Such a model attempts to infer what a person is experiencing in light of their input symptoms and prior diagnoses (Keenan et al. 2020; Rajalakshmi et al. 2018).
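The pre-processing and data-splitting steps described above can be sketched in a few lines. This is an illustration written for this review (not from any surveyed system); the imputation strategy (column-mean fill), min-max scaling, and the toy vital-sign records are all assumptions chosen for simplicity:

```python
import random

def impute_and_normalize(rows):
    """Fill missing values (None) with the column mean, then min-max scale each column to [0, 1]."""
    cleaned_cols = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        mean = sum(known) / len(known)
        filled = [mean if v is None else v for v in col]
        lo, hi = min(filled), max(filled)
        span = (hi - lo) or 1.0  # avoid division by zero on constant columns
        cleaned_cols.append([(v - lo) / span for v in filled])
    return [list(row) for row in zip(*cleaned_cols)]

def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle once (reproducibly), then hold out the last test_fraction of records."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(rows) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

# Hypothetical vital-sign records: (systolic BP, glucose); None marks a missing reading.
records = [(120.0, 90.0), (140.0, None), (None, 110.0), (160.0, 130.0)]
clean = impute_and_normalize(records)
X_train, y_train, X_test, y_test = train_test_split(clean, [0, 1, 0, 1])
print(len(X_train), len(X_test))  # 3 1
```

The analytical model is then fit on `X_train`/`y_train` and its accuracy estimated on the held-out `X_test`/`y_test`, exactly as the framework in Fig. 3 prescribes.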

Fig. 3 Framework for disease detection system

Medical imaging for diseases diagnosis

Clinical imaging denotes the set of procedures that produce pictures of the interior of the body. These procedures and processes are used to take pictures of the human body for clinical purposes, such as revealing, analysing, or examining injury, dysfunction, and pathology (Bibault et al. 2020). Computed tomography (CT) scans are good examples of useful diagnostic imaging that supports accurate diagnosis, intervention, and evaluation of the injuries and dysfunctions that physical therapists address routinely (Chen et al. 2017). Further studies demonstrate the overuse of imaging, for example X-rays or magnetic resonance imaging (MRI), for acute and complicated cases, as shown in Table 2.

Medical imaging types with their respective descriptions

Symptoms of diseases and challenges to diagnostics

A disease may be acute, chronic, malignant, or benign. Of these terms, chronic and acute refer to the duration of a disease, while malignant and benign refer to its potential for causing death. Additionally, manifestations that may seem irrelevant can be warnings of a more serious illness or situation. The following are a few diseases with their signs and symptoms:

  • Heart attack signs include pain, anxiety, a squeezing sensation, or a feeling of fullness in the centre of the chest that lasts more than a few minutes; pain or discomfort in other areas of the upper body; shortness of breath; cold sweat; vomiting; or light-headedness (Aggarwal et al. 2020).
  • Stroke signs include facial drooping, arm weakness, difficulty with speech, rapidly developing dizziness or loss of balance, sudden lack of sensation or weakness, loss of vision, confusion, or severe headache (Lukwanto et al. 2015).
  • Reproductive health problems manifest through signs such as bleeding or spotting between periods; itching, burning, or irritation in the genital area; pain or discomfort during intercourse; heavy or painful menstrual bleeding; severe pelvic/abdominal pain; unusual vaginal discharge; a feeling of fullness in the lower abdomen; and frequent urination or urinary urgency (Kather et al. 2019).
  • Breast problem symptoms include nipple discharge, unusual breast tenderness or pain, breast or nipple skin changes, and a lump or thickening in or near the breast or in the underarm area (Memon et al. 2019).
  • Lung problem symptoms include coughing up blood, shortness of breath, laboured breathing, persistent cough, repeated episodes of bronchitis or pneumonia, and wheezing (Ma et al. 2020).
  • Stomach or digestive problem symptoms include rectal bleeding, blood in the stool or black stools, changes in bowel habits or inability to control the bowels, constipation, diarrhoea, heartburn or indigestion, or vomiting blood (Kather et al. 2019).
  • Bladder problem symptoms include difficult or painful urination, frequent urination, loss of bladder control, blood in the urine, waking regularly at night to urinate or wetting the bed at night, or leaking urine (Shkolyar et al. 2019).
  • Skin problem symptoms include changes in skin moles, recurrent flushing and redness of the face and neck, jaundice, skin lesions that do not disappear or heal, new growths or moles on the skin, and thick, red skin with bright patches (Rodrigues et al. 2020).
  • Emotional problems include anxiety, sadness, fatigue, feeling tense, flashbacks and nightmares, lack of engagement in daily activities, suicidal thoughts, hallucinations, and delusions (Krittanawong et al. 2018).
  • Headache problems (excluding ordinary tension headaches) include headaches that come on suddenly, "the worst headache of your life", and headache accompanied by extreme agitation, nausea, vomiting, and inability to walk (Mueller 2020).

Above, we have described a variety of illness signals and their symptoms. However, diagnostic errors in medicine are fairly common, can have serious consequences, and are only now beginning to feature prominently in patient-safety efforts. The following are critical issues for various diagnostic types in detecting particular diseases (Chuang 2011; Park et al. 2020):

  • A diagnosis that was unintentionally delayed, wrong, or missed, as judged from the eventual appreciation of more definitive information.
  • Any fault or failure in the diagnostic process leading to a missed finding or a delayed conclusion. This comprises breakdowns in timely access to care; in the elicitation or interpretation of symptoms, signs, or laboratory results; in the formulation and weighing of a differential diagnosis; and in timely follow-up and specialty referral or assessment.

Healthcare applications

The healthcare system has long been an early adopter of innovative technologies. Today, artificial intelligence and its subsets, machine and deep learning, are on their way to becoming a mainstream element of the healthcare system, from creating new health-check procedures to managing patient records and accounts. One of the greatest burdens on physician practices today is the organization and performance of administrative tasks (Fukuda et al. 2019). By automating them, healthcare institutions could relieve this burden and allow physicians to do what they do best, i.e., spend more time with patients. Details of artificial intelligence techniques in healthcare applications are given in Table 3:

Healthcare applications and their purpose

Reported work

This section highlights the best findings for different diseases and their diagnosis via machine and deep learning algorithms. It covers an extensive survey of various diseases, including Alzheimer's, cancer, diabetes, chronic disease, heart disease, tuberculosis, stroke and cerebrovascular disease, hypertension, and skin and liver disease (Chui et al. 2020).

Diagnosis of Alzheimer’s disease

Alzheimer's is a disease in which dementia symptoms worsen over several years (Zebene et al. 2019). In its early stage it causes memory loss, but eventually patients lose the ability to carry on a conversation and respond to the environment. Usyal et al. (2020) analysed dementia in Alzheimer's by investigating neuroimages. They utilized the Alzheimer's Disease Neuroimaging Initiative protocol, which comprises T1-weighted magnetic resonance data, for diagnosis; the predictive classification models were evaluated by accuracy, sensitivity, and specificity. Ljubic et al. (2020) presented a method to diagnose Alzheimer's disease from electronic medical record (EMR) data; the results showed 90% accuracy on the SCRL dataset. Soundarya et al. (2020) proposed a methodology in which the description of shrinking brain tissue is used for the early analysis of Alzheimer's disease. They implemented various machine and deep learning algorithms, and the deep learning algorithm was considered the better solution for recognizing the ailment at its primary stage with reasonable accuracy. Park et al. (2020) used a vast range of administrative health data to test the ability of machine learning models to predict the future incidence of Alzheimer's disease. Lin et al. (2019) proposed a method that used spectrogram features extracted from speech data to identify Alzheimer's disease. The system used voice data collected via the Internet of Things (IoT) and transmitted to a cloud server where the original data is stored; the received data is used to train the model to identify symptoms of Alzheimer's disease.

As seen in Fig. 4, Subasi (2020) proposed a broad framework for detecting Alzheimer's illness using AI methods. The learning process optimizes model parameters using a training dataset or prior practice. Learning models can be predictive (predicting the future), descriptive (collecting knowledge from input data sources), or a combination of the two. Two critical stages are performed in machine learning and deep learning: pre-processing the vast input data and improving the model. The second phase involves effectively testing the learning model and assembling the answer. Oh et al. (2019) offered a technique demonstrating end-to-end learning of four binary classification problems using a volumetric convolutional neural network. The trials were performed on the ADNI database, and the results indicated that the suggested technique obtained an accuracy of 86.60% and a precision of 73.95%. Raza et al. (2019) proposed a unique AI-based examination and monitoring of Alzheimer's disorder; the analysis results showed an 82% improvement in contrast with notable existing procedures.

Fig. 4  Alzheimer’s disease detection using artificial intelligence techniques (Subasi 2020 )

Additionally, above 95% accuracy is accomplished in classifying the activities of daily living, which is very reassuring for monitoring the activity profile of the subject. Lodha et al. (2018) used machine-learning algorithms to process data obtained by neuroimaging technologies to detect Alzheimer’s in its primitive stage. They applied various algorithms, including support vector machine (SVM), gradient boosting, K-nearest neighbour, random forest, and a neural network, which showed accuracy rates of 97.56%, 97.25%, 95.00%, 97.86%, and 98.36%, respectively. Lei et al. (2020) state that clinical score prediction using neuroimaging data is extremely valuable for evaluating Alzheimer’s disease, since it can adequately reveal the disease status. Their proposed structure comprises three parts: feature selection based on joint learning, feature encoding based on a deep polynomial network, and ensemble learning for regression via the support vector regression technique. Jo et al. (2019) applied deep learning to neuroimaging data for the diagnostic classification of Alzheimer’s disease. An autoencoder for feature selection produced accuracies of up to 98.8% for classification and 83.7% for predicting conversion from mild cognitive impairment, a prodromal stage of Alzheimer’s disease.

A deep neural network using neuroimaging data without pre-processing for feature selection yields accuracies of up to 96.0% for Alzheimer’s disease categorization and 84.2% for the mild cognitive impairment conversion problem (Oomman et al. 2018). Chen et al. (2017) hypothesized that combining diffusivity and kurtosis in diffusion kurtosis imaging would increase its capacity for detecting Alzheimer’s disease. The method was applied to 53 subjects, including 27 Alzheimer’s patients, and provided an accuracy of 96.23%. Janghel et al. (2020) used a convolutional neural network to improve classification accuracy. They demonstrated a deep learning technique for identifying Alzheimer’s disease using data from the Alzheimer’s disease neuroimaging initiative database, which included magnetic resonance imaging and positron emission tomography scans of Alzheimer’s patients as well as images of healthy individuals. The experiment attained an average classification accuracy of 99.95% for the magnetic resonance imaging dataset and 73.46% for the positron emission tomography dataset. Balaji et al. (2020) presented a gait classification system based on machine learning to help clinicians diagnose the stage of Parkinson’s disease. They used four supervised machine learning algorithms, namely decision tree, support vector machine, ensemble classifier, and Bayes’ classifier, for statistical and kinematic analysis to predict the severity of Parkinson’s disease.

Diagnosis of cancer disease

Artificial intelligence methods can affect several facets of cancer therapy, including drug discovery, drug development, and the clinical validation of these drugs. Pradhan et al. (2020) evaluated several machine learning algorithms suitable for lung cancer recognition in connection with the internet of things. They reviewed various papers on predicting different diseases using machine learning algorithms, and identified and depicted various research directions based on the existing methodologies. Memon et al. (2019) proposed an AI algorithm-based diagnostic framework that effectively classified malignant and benign cases in an internet of things environment. They tested the proposed strategy on the Wisconsin Diagnostic Breast Cancer dataset and showed that the recursive feature elimination algorithm selected the best subset of features, with a support vector machine classifier that achieved a high classification accuracy of 99%, a sensitivity of 98%, and a Matthews coefficient of 99%. Das et al. (2019) proposed a framework called the watershed Gaussian-based deep learning technique to delineate cancer lesions in computed tomography images of the liver. They used a sample of 225 images to build the proposed model. Yue et al. (2018) reviewed machine learning techniques, including artificial neural networks, support vector machines, decision trees, and k-nearest neighbour, for disease diagnosis. The authors investigated breast cancer-related applications and applied them to the Wisconsin breast cancer database. Han et al. (2020) focused on the research and user-friendly design of an intelligent recommendation model for cancer patients’ rehabilitation schemes; their prediction accuracy reached up to 92%. Rodrigues et al. (2020) proposed utilizing transfer learning and deep learning in an IoT framework to help specialists analyse common skin lesions, atypical nevi, and melanoma. This investigation utilized two datasets, the first provided by the International Skin Imaging Collaboration (ISIC) at the International Symposium on Biomedical Imaging. The DenseNet201 extraction model, combined with a K-nearest-neighbour classifier, accomplished an accuracy of 96.805% on the ISIC dataset. Huang et al. (2020) reviewed the literature on the application of artificial intelligence for cancer diagnosis and prognosis and demonstrated how these methods are advancing the field. Kather et al. (2019) used deep learning to mine clinically helpful information from histology; it can also predict survival and molecular alterations in gastrointestinal and liver cancer, and these methods could serve as an inexpensive biomarker once integrated into pathology workflows. Kohlberger et al. (2019) built a convolutional neural network (ConvFocus) to localize and quantify the severity of out-of-focus regions on digitized slides. Compared against pathologist-graded focus quality, ConvFocus achieved Spearman rank coefficients of 0.81 and 0.94 on two scanners and reproduced the expected patterns from stack scanning. Tschandl et al. (2019) built image-based artificial intelligence for skin cancer diagnosis to address the effects of varied representations of clinical expertise and multiple clinical workflows. They found that good-quality artificial intelligence-based clinical decision support improved diagnostic accuracy over earlier artificial intelligence or physicians alone, and observed that the least experienced clinicians gain the most from AI-based support. Chambi et al. (2019) worked on volumetric optical coherence tomography datasets acquired from resected brain tissue samples of 21 patients with glioma tumours of various stages. The samples were labelled as either non-invaded or glioma-invaded based on histopathology assessment of the tissue. Unlabelled optical coherence tomography images from another nine patients were utilized as the validation dataset to assess the method’s detection performance. Chen et al. ( 2019a , b ) proposed a cost-effective technique, the augmented reality microscope (ARM), which overlays artificial intelligence-based information onto the current view of the sample in real time, enabling a seamless integration of artificial intelligence into routine workflows. They anticipated that the augmented reality microscope would remove the barrier to using AI to enhance the accuracy and efficiency of cancer analysis.
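Memon et al. above report a Matthews coefficient alongside accuracy and sensitivity. The Matthews correlation coefficient (MCC) is computed from the four confusion counts; the counts in this sketch are invented for illustration:

```python
import math

# Matthews correlation coefficient: +1 = perfect prediction,
# 0 = no better than chance, -1 = totally inverted prediction.
# The confusion counts used below are invented examples.

def mcc(tp, tn, fp, fn):
    """MCC from true/false positive and negative counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0          # degenerate confusion matrix: treat as chance
    return (tp * tn - fp * fn) / denom

perfect  = mcc(5, 5, 0, 0)   # classifier that is always right
inverted = mcc(0, 0, 5, 5)   # classifier that is always wrong
```

Unlike plain accuracy, MCC stays informative when one class (e.g. malignant cases) is rare, which is why it appears alongside accuracy in such studies.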

Diabetes detection

Diabetes mellitus, commonly known as diabetes, is the leading cause of high blood sugar. AI is a cost-effective means of reducing the ophthalmic complications and preventable blindness associated with diabetes. This section covers the work of various researchers on detecting diabetes in patients (Chaki et al. 2020). Kaur and Kumari (2018) used machine learning models on the Pima Indian diabetes dataset to find patterns in risk factors with the help of the R data manipulation tool. They also analysed five predictive models built with the R data manipulation tool and support vector machine learning algorithms, including a linear-kernel support vector machine, multifactor dimensionality reduction, and a radial basis function kernel.

As shown in Fig.  5 , blood glucose prediction approaches fall into three categories: physiology-based, data-driven, and hybrid. Woldaregy et al. (2019) developed a compact guide to machine learning and hybrid systems focused on predicting the blood glucose level in type 1 diabetes. They covered various machine learning methods crucial to regulating an artificial pancreas, decision support systems, and blood glucose alarm applications. They also portrayed the knowledge behind blood glucose predictors, which give information to track and predict blood glucose levels, since many factors can affect those levels, such as BMI, stress, illness, medications, and amount of sleep. Blood glucose prediction thus forecasts an individual’s blood glucose level based on the patient’s past and current history, so that an alarm can be raised to delay any complications. Chaki et al. (2020) provided detailed information on detecting diabetes mellitus and on self-management techniques to demonstrate their importance to scientists working in this area. They also analysed the diagnosis of diabetes mellitus via its datasets, pre-processing techniques, feature extraction methods, machine learning algorithms, classification, etc. Mercaldo et al. (2017) proposed a method to classify diabetes-affected patients using a set of characteristics selected by the World Health Organization, obtaining precision and recall values of 0.770 and 0.775, respectively, with the Hoeffding tree algorithm. Mujumdar et al. (2019) proposed a model for the prediction and classification of diabetes that considers external factors such as glucose, body mass index, insulin, and age. They also found that classification accuracy proved to be much higher with the new dataset than with the dataset previously used. Kavakiotis et al. 
( 2017 ) conducted a systematic review of the machine learning applications, data mining techniques, and tools used in the diabetes field to showcase the prediction and diagnosis of diabetes, its complications, genetic conditions, and physical-condition care management. After an in-depth search, it was found that supervised learning methods accounted for 85% of the studies and unsupervised learning methods for the remaining 15%. Aggarwal et al. (2020) demonstrated the use of non-linear heart rate variability in the prediction of diabetes using an artificial neural network and a support vector machine. The authors computed 526 datasets and obtained a classification accuracy of 90.5% with the support vector machine. Besides that, they evaluated thirteen non-linear heart rate variability parameters for the training and testing of artificial neural networks. Lukmanto et al. (2015) worked on data from many diabetes mellitus patients to give researchers an advantage in fighting the disease. Their main objective was to leverage a fuzzy support vector machine and F-score feature selection to classify and detect diabetes mellitus. The methodology was applied to the Pima Indian Diabetes dataset, where they obtained an accuracy of 89.02% in predicting diabetes mellitus patients. Wang et al. (2017) proposed a weighted rank support vector machine to overcome the class-imbalance problem seen in daily drug-dose data, which leads to poor prediction results. They employed the area under the curve (AUC) to show the model’s effectiveness and improved the average precision of their proposed algorithm. Carter et al. (2018) showcased the performance of 46 different machine learning models compared on re-sampled training and test data. The models obtained areas under the curve of 0.73 on training data and 0.90 on test data. Nazir et al. 
( 2019 ) proposed a technique to precisely detect the different stages of diabetic retinopathy via tetragonal local octa-pattern features, which are further classified by an extreme learning machine. For classifying periodic heart rate variability signals and diabetes, Swapna et al. (2018) presented a deep learning architecture. The authors used long short-term memory and a convolutional neural network to extract the dynamic features of heart rate variability. They achieved an accuracy of 95.7% using electrocardiography signals along with support vector machine classification.
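Several of the diabetes studies above (e.g. Wang et al. 2017; Carter et al. 2018) report the area under the ROC curve. AUC can be computed directly as the probability that a randomly chosen positive case is scored above a randomly chosen negative one; the scores and labels below are invented for illustration:

```python
# Rank-based AUC: the probability that a positive example outranks a
# negative one (ties count half). Scores and labels here are invented.

def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # tied scores count as half a win
    return wins / (len(pos) * len(neg))

perfect_ranking = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 means the scores carry no ranking information; 1.0 means every diseased case is scored above every healthy one, which is why the metric is threshold-free.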

Fig. 5  Blood glucose prediction approaches (Woldaregy et al. 2019 )
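The data-driven branch of the approaches in Fig. 5 can be reduced to a toy sketch: forecast the next reading from a short sliding window of past readings, and raise an alarm when the forecast leaves a safe band. The readings, window size, and thresholds below are invented for illustration and are far simpler than the models Woldaregy et al. survey:

```python
# Toy data-driven glucose forecaster with an alarm, in the spirit of the
# prediction-plus-alarm applications described above. All values invented.

def forecast_next(history, window=3):
    """Predict the next reading as the mean of the last `window` readings."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def glucose_alarm(forecast, low=70, high=180):
    """Flag hypo-/hyperglycaemia when the forecast leaves the safe band (mg/dL)."""
    if forecast < low:
        return "low"
    if forecast > high:
        return "high"
    return "ok"

readings = [110, 120, 150, 190, 210]   # invented, steadily rising trace
nxt = forecast_next(readings)           # mean of the last three readings
status = glucose_alarm(nxt)             # rising trend trips the high alarm
```

Real predictors replace the moving average with recurrent networks or physiological models, but the forecast-then-alarm structure is the same.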

Diagnosis of chronic diseases

Researchers have shown that artificial intelligence helps streamline the care of chronic diseases. Therefore, various machine learning algorithms have been developed to identify patients at higher risk of chronic disease. Further AI-based techniques are described below (Jain et al. 2018).

Jain et al. (2018) presented a survey of feature selection and classification methods to analyse and predict chronic illnesses. They utilized dimensionality reduction strategies to improve the performance of machine learning algorithms. In short, they introduced different feature selection techniques and their inherent advantages and limitations. He et al. (2019) proposed a kernel-based structure for training a chronic illness detector to forecast and track the disease’s progression. Their approach was based on an enhanced version of a structured-output support vector machine for longitudinal data processing. Tang et al. (2020) utilized deep residual networks to identify chronic obstructive pulmonary disease automatically. After gathering data from the PanCad project, which includes ex-smokers and current smokers at high risk of lung cancer, the residual network was trained to diagnose chronic obstructive pulmonary disease using computed tomography scans. With three-fold cross-validation, the experiment achieved an area under the curve of 0.889. Ma et al. (2020) proposed a heterogeneous modified artificial neural network to detect, segment, and diagnose chronic renal failure on an internet of medical things platform. The proposed strategy combined a support vector machine and a multilayer perceptron with the backpropagation algorithm. They used ultrasound images and later performed segmentation on them; in kidney segmentation especially, the approach performed very well, achieving high results. Aldhyani et al. (2020) proposed a system to increase the accuracy of detecting chronic disease using machine learning algorithms. Machine learning methods such as Naïve Bayes, support vector machine, K-nearest neighbour, and random forest were presented and compared. 
They also used a rough k-means algorithm to resolve ambiguity in chronic disease data and improve performance. The Naïve Bayes method with rough k-means achieved an accuracy of 80.55% for diabetic disease, while the support vector machine achieved 100% accuracy for kidney disease and 97.53% for cancer disease. Chui and Alhalabi (2017) reviewed chronic disease diagnosis in smart healthcare. They provide a summarized view of optimization algorithms and machine learning algorithms, give information regarding Alzheimer’s disease, dementia, tuberculosis, etc., and discuss the challenges faced during the deployment phase of disease diagnosis. Nam et al. (2019) introduced the internet of things and digital biomarkers and their relationships to artificial intelligence and other current trends, and discussed the role of artificial intelligence in the internet of things for chronic disease detection. Battineni et al. (2020) reviewed the applications of predictive machine learning models for diagnosing chronic disease. After going through 453 papers, they selected only 22 studies, from which it was concluded that there is no standard method for determining the best approach in real-time clinical practice; the commonly used algorithms were support vector machine, logistic regression, etc. Wang et al. (2018) analysed chronic kidney disease using machine learning techniques on a chronic kidney disease dataset and performed ten-fold cross-validation testing. The dataset had been pre-processed to complete and normalize the missing data. They achieved a detection accuracy of 99%, further tested on four patient data samples to predict the disease. Kim et al. (2019) predicted chronic diseases in individual patients using a character recurrent neural network that treats the data in each class as a word, which is useful mainly when a large portion of the data values are missing. 
They applied the character recurrent neural network to classify cases from the Korea National Health and Nutrition Examination Survey, and reported higher accuracy for the character recurrent neural network than for the conventional multilayer perceptron model. Ani et al. (2017) proposed a patient monitoring system for stroke-affected people that reduced future recurrence by alerting the doctor, and provided data analytics and decision-making based on the patient’s real-time health parameters. This helped the doctors in systematic diagnosis followed by tailored treatment of the disease.
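Ten-fold cross-validation, as used by Wang et al. above for chronic kidney disease, repeatedly holds out one fold for testing and trains on the rest. A minimal sketch of the split-and-average logic (the dataset and scoring function are invented placeholders):

```python
# Generic k-fold cross-validation: partition the data, hold out each fold
# in turn, and average the per-fold scores. Data here is a placeholder.

def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous, near-equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, evaluate):
    """Average the score of `evaluate(train, test)` over k train/test splits."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for test_idx in folds:
        test = [data[i] for i in test_idx]
        train = [data[i] for i in range(len(data)) if i not in test_idx]
        scores.append(evaluate(train, test))
    return sum(scores) / k
```

In practice the folds are shuffled (and often stratified by class) before splitting; the contiguous split above keeps the sketch deterministic.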

Heart disease diagnosis

Researchers suggest that artificial intelligence can predict the likely periods of death for heart disease patients; thus, multiple algorithms have been used to predict heart disease severity and support its diagnosis. Escamila et al. (2019) proposed a dimensionality reduction strategy to discover the key features of coronary illness using feature selection. The dataset used was the UC Irvine machine learning repository heart disease dataset, which contains 74 features. The highest accuracy was achieved by chi-square and principal component analysis along with a random forest classifier. Tuli et al. (2019) proposed a HealthFog framework to integrate deep learning into edge computing devices and applied it to the real-life problem of heart disease detection. It consists of hardware and software components, including a body area sensor network, gateway, FogBus module, data filtering, pre-processing, resource manager, deep learning module, and ensembling module. The HealthFog model is an internet of things-based, fog-enabled model that can effectively manage the data of heart patients and diagnose it to identify heart disease severity.

George et al. (2018) aimed to describe the obstacles Indian nurses face in becoming active and valued members of the cardiovascular healthcare team, as cardiovascular disease imposes substantial and increasing physical, psychological, societal, and financial burdens. As shown in Fig.  6 , there are numerous possible avenues for intelligent health interventions to support cardiovascular health and decrease the risk of cardiovascular disease. The focus has therefore started to shift towards the prevention of cardiovascular disease and, more importantly, the promotion of cardiovascular health. Several findings revealed that depression is connected with inferior cardiovascular health among adults without cardiovascular disease.

Fig. 6  Cardiovascular health promotion and disease prevention (George et al. 2018 )

Haq et al. (2018) created a machine learning-based system to diagnose heart disease using its dataset and worked with seven prominent feature learning-based algorithms. It was also observed that the machine learning-based decision support system assisted doctors in diagnosing heart patients effectively. Khan and Member (2020) proposed a framework to estimate cardiac disease using a customized deep convolutional network that categorizes the fetched sensor information into normal and abnormal states. Their results demonstrated that, given the largest number of records, the multi-task cascaded convolutional neural network achieved an accuracy of 98.2%. Ahmed (2017) explained an architecture for heart rate analysis and related techniques using machine learning algorithms such as K-nearest-neighbour classification to predict heart attacks from collected heart rate datasets. The author also mentioned six data types for predicting heart attack at three different levels (Patel 2016). The dataset used consists of 303 instances and 76 attributes. Patel (2016) worked on a technique that could reduce the number of deaths from heart diseases, comparing various decision tree algorithms for heart disease diagnosis using the Waikato Environment for Knowledge Analysis; the aim was to fetch hidden patterns linked to heart disease by using data mining techniques to predict its presence. Saranya et al. (2019) proposed a cloud-based, sensor-driven approach for an automated disease prediction system that measures various patient parameters such as blood pressure, heartbeat rate, and temperature. In their view, this method could quickly provide medical treatment while reducing the time burden on doctor and patient, and anyone could access it from anywhere. Isravel et al. (2020) presented a pre-processing approach that can enhance accuracy in classifying electrocardiographic signals. 
They evaluated the classification using different algorithms, such as K-nearest neighbour, Naïve Bayes, and decision tree, to detect normal and irregular heartbeat sounds, and found that the pre-processing approach increased the performance of the classifying algorithms. The devices utilized for the IoT setup were the LM35 sensor, a pulse sensor, the AD8232 electrocardiographic sensor, and an Arduino Uno. Thai et al. (2017) proposed a new lightweight method to remove noise from electrocardiographic signals for accurate diagnosis and prediction. Initially, they used a sequential recursive algorithm to transform the signals into digital format; the output was then passed to a discrete wavelet transform algorithm to detect the peaks in the data and remove the noise. Features were then extracted from the electrocardiographic dataset from the Massachusetts Institute of Technology-Beth Israel Hospital to perform diagnosis and prediction, with redundant features removed using Fisher’s linear discriminant. Nashif et al. (2018) proposed a cloud-based heart disease prediction system for detecting heart disease using machine learning models derived from the Java-based open-access data mining platform, Waikato Environment for Knowledge Analysis. They obtained an accuracy level of 97.53% using a support vector machine, with 97.50% sensitivity and 94.94% specificity. They used an efficient software tool to train on the large dataset and compared multiple machine learning techniques. A smartphone was used to detect and predict heart disease based on the information acquired from patients, and hardware components were used to monitor the system continuously. Babu et al. (2019) aimed to determine whether a heart attack could be hereditary. To investigate this, they collected previous data from parents and compared it with their children’s dataset to find predictive and accurate values. 
This could help determine how healthy the child is. The authors used different dependent and independent parameters to find whether a person would suffer a heart attack.
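The pipeline of Thai et al. above (denoise the ECG, then locate peaks) can be caricatured in a few lines. The sketch below substitutes a simple moving average for their discrete wavelet transform and uses an invented signal and threshold; it illustrates the idea, not their method:

```python
# Toy ECG processing: light smoothing, then peak picking. The "signal"
# and the threshold are invented stand-ins for real ECG data.

def moving_average(signal, width=3):
    """Light denoising: replace each sample by the mean of its window."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def find_peaks(signal, threshold):
    """Indices where a sample tops both neighbours and the threshold."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > threshold
            and signal[i] > signal[i - 1]
            and signal[i] > signal[i + 1]]

ecg = [0, 1, 0, 5, 0, 1, 0, 6, 0]      # invented trace with two R-like spikes
peaks = find_peaks(ecg, threshold=2)    # locations of the two spikes
```

The interval between successive detected peaks is what heart rate and heart rate variability features, as used by several studies above, are derived from.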

Tuberculosis disease detection

AI is positioned as an answer to aid in the battle against tuberculosis. Artificial intelligence applications in diagnostic radiology may provide accurate means of recognizing the infection in low-income countries. Romero et al. (2020) performed classification tree analysis to reveal the associations between predictors of tuberculosis in England. They worked on APHA data ranging over demographic herd properties and tuberculosis variables, using Sam tuberculosis management. They used a machine-learning algorithm, performed data preparation, data reduction, and data analysis, and finally obtained the results. Horvath et al. (2020) performed automatic scanning and analysis on 531 tuberculosis slides, of which 56 were from positive specimens. They validated a scanning and analysis system combining fully automated microscopy with deep learning analysis; their proposed system achieved the highest sensitivity by detecting 40 of the 56 positive slides. Sathitratanacheewin et al. (2020) developed a convolutional neural network model for tuberculosis. They used a chest X-ray dataset from the National Library of Medicine Shenzhen No. 3 Hospital collection and tested it with a non-tuberculosis chest X-ray dataset from the National Institutes of Health Clinical Center. The deep convolutional neural network model achieved areas under the receiver operating characteristic curve of 0.9845 and 0.8502 for detecting tuberculosis, with a specificity of 82% and a sensitivity of 72%. Bahadur et al. (2020) proposed an automatic technique to detect abnormal chest X-ray images containing at least one pathology, such as infiltration, fibrosis, or pleural effusion, caused by tuberculosis. The technique is based on a hierarchical structure for feature extraction, where feature sets are used at two hierarchy levels to separate healthy from unhealthy subjects. 
The authors used 800 chest X-ray images taken from two public datasets, Montgomery and Shenzhen. López-Úbeda et al. (2020) explored machine learning methods to detect tuberculosis in Spanish radiology reports. They also evaluated and compared deep learning classification algorithms for this task, using data from 5947 radiology reports collected from HT Médica. Ullah et al. (2020) presented a study of Raman spectroscopy and machine learning, based on principal component analysis and hierarchical cluster analysis, to classify tuberculosis samples as positive or negative. They also showed Raman results indicating irregularities in the blood composition collected from tuberculosis-negative patients. Panicker et al. (2018) introduced an automatic technique for the detection of tuberculosis bacilli from microscopic smear images. They performed image binarization and classification of the detected regions using a convolutional neural network, with an assessment on 22 sputum smear microscopic images. The results demonstrated 97.13% recall, 78.4% precision, and an 86.76% F-score for predicting tuberculosis. Lai et al. (2020) compared the outcomes of an artificial neural network, support vector machine, and random forest in diagnosing anti-tuberculosis drug responses in Taipei Medical University Wanfang Hospital patients. They selected the features via univariate risk factor analysis and literature evaluation, achieving a specificity of 90.4% and a sensitivity of 80%. Gao et al. (2019) investigated the application of computed tomography pulmonary images to detect tuberculosis at five levels of severity. They proposed a deep ResNet to predict the severity scores and analyse the probability of high severity. They also calculated the overall severity probability along with separate probabilities for high- and low-severity cases. Singh et al. 
( 2020 ) worked to detect tuberculosis lesions in the lungs. They proposed an automated recognition strategy utilizing a deep learning technique, the anti-aliased convolutional neural network proposed by Richard Zhang. Their dataset consisted of 3D computed tomography images, which were cut into 2D slices. They applied segmentation to each slice using the UNet and LinkNet architectures.

Stroke and cerebrovascular disease detection

AI can analyse and detect stroke signs in medical images: if the system suspects a stroke in the patient, it immediately signals the patient or doctor. Researchers have proposed various methodologies to showcase the impact of AI in stroke and cerebrovascular detection (Singh et al. 2009). O’Connell et al. (2017) assessed the diagnostic capability and temporal stability of a method for the detection of stroke, observing mostly identical patterns between the stroke patients and controls across ten patients, and achieved a specificity and sensitivity of 90% across the research. Labovitz et al. (2017) described the use of AI for daily monitoring of patients for identification and medication adherence, achieving a 50% improvement in plasma drug concentration levels. Abedi et al. (2020) also presented a framework for building a decision support system using an artificial neural network, which improved patient care and outcomes. Singh et al. (2009) compared different methods of predicting stroke on the cardiovascular health study dataset. They used a decision tree algorithm for the feature selection process, principal component analysis to reduce the dimensionality for the classification algorithm, and a backpropagation neural network for classification. Biswas et al. (2020) introduced an AI-based system for the detection and measurement of carotid plaque, using carotid intima-media thickness, for atherosclerotic carotid wall detection and plaque measurement.

Hypertension disease detection

Researchers have found that AI can diagnose hypertension using input data such as blood pressure and demographics. Krittanawong et al. (2018) summarized recent advancements in computer science and the medical field. They illustrated the innovative use of artificial intelligence to predict the early stages of hypertension and stated that AI plays a vital role in investigating hypertension risk factors, although its use remains restricted because of limitations in study design and related issues. Arsalan et al. (2019) conducted experiments using publicly available datasets, including digital retinal images for vessel extraction (DRIVE) and the structured analysis of the retina (STARE) dataset, for hypertension detection. Across the datasets they achieved a sensitivity, specificity, area under the curve, and accuracy of 80.22%, 98.1%, 98.2%, and 96.55%, respectively. Kanegae et al. (2020) used machine learning techniques to validate the prediction of risk for new-onset hypertension. They used a data split for model construction and development, with validation to test performance. The models they used were XGBoost and an ensemble, of which the XGBoost model was considered the best predictor, based on the behaviour of systolic blood pressure in cardio-ankle vascular measurements. Figure  7 shows the structure of the heart in its normal phase as well as in the hypertension phase. In pulmonary hypertension, the pulmonary arteries become constricted, so the right ventricle cannot pump enough blood into the lungs.

Fig. 7  Pulmonary hypertension (Kanegae et al. 2020 )

Koshimizu et al. (2020) described artificial intelligence in blood pressure management, used to predict the risk of high blood pressure from large-scale data. The authors also focused on measures to control blood pressure using an artificial neural network. In a nutshell, they argued that an artificial neural network is beneficial for managing high blood pressure and can also be used to create medical evidence for the practical management of hypertension. Mueller et al. (2020) stated that applying artificial analytic tools to large hypertension datasets would generate questionable results and would also miss treatments and potential targets. The author also stated that this vision for hypertension would be challenging to achieve and may well not happen in the future. Chaikijuraja et al. (2020) also noted the merits of using artificial intelligence to detect hypertension, as artificial intelligence can recognize hypertension’s risk factors and phenotypes.

Moreover, it can be used to interpret data from randomized trials with blood pressure targets associated with cardiovascular outcomes. Kiely et al. ( 2019 ) investigated a predictive model based on healthcare resource data that could be used to screen large populations and identify patients at high risk of pulmonary arterial hypertension. They took data on 709 patients with pulmonary arterial hypertension from 2008 to 2016, compared them with a cohort of 2,812,458 individuals classified as non-pulmonary arterial hypertension, and developed and validated the predictive model using cross-validation. Kwon et al. ( 2020 ) applied ensemble learning to data collected from consecutive patients at two healthcare centres to predict pulmonary hypertension from electrocardiography with the help of artificial intelligence. Sakr et al. ( 2018 ) assessed and analysed AI strategies, such as LogitBoost, Bayesian network classifiers, locally weighted naïve Bayes, artificial neural networks, support vector machines, and random forests, for predicting and identifying people with hypertension. Thus, AI provides insights for hypertension healthcare and supports predictive, personalized, and pre-emptive methodologies in clinical practice.
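The evaluation pattern behind studies such as Sakr et al. ( 2018 ), comparing several classifier families on the same data, can be sketched as below. The data are synthetic and only three of the named families are shown; this illustrates the comparison workflow, not the cited study's results:

```python
# Illustrative comparison of classifier families (naïve Bayes, SVM,
# random forest) via 5-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=0)
models = {
    "naive_bayes": GaussianNB(),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
}
# Mean cross-validated accuracy for each classifier family.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```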

Skin disease diagnosis

Researchers have developed AI systems that can precisely classify cutaneous skin disorders and serve as auxiliary instruments to improve the diagnostic accuracy of clinicians. Chakraborty et al. ( 2017 ) proposed a neural-network-based detection technique for various skin disorders, using two diseased skin image classes, basal cell carcinoma and skin angioma. A non-dominated sorting genetic algorithm was used to train the artificial neural network, which was compared with a particle swarm optimization-trained and a cuckoo search-trained neural network classifier. Zaar et al. ( 2020 ) collected clinical images of skin disease from the Department of Dermatology at Sahlgrenska University and applied artificial intelligence algorithms for classification, achieving a diagnostic accuracy of 56.4% for the top five suggested diseases. Kumar et al. ( 2019 ) used a dual-stage approach that combined computer vision and machine learning to evaluate and recognize skin diseases. During training and testing, the method produced an accuracy of up to 95%. Kolkur et al. ( 2018 ) developed a system that identified skin diseases from input symptoms. They collected data on the symptoms of ten skin diseases and achieved above 90% accuracy.
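A symptom-driven classifier in the spirit of the Kolkur et al. ( 2018 ) system can be sketched as follows: each patient is a binary symptom vector and the model predicts a disease label. The symptom and disease names, patterns, and data are all invented for illustration:

```python
# Minimal sketch of symptom-based skin disease identification
# on synthetic data (symptom/disease names are hypothetical).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

symptoms = ["itching", "rash", "scaling", "redness", "blistering"]
diseases = ["disease_a", "disease_b", "disease_c"]

rng = np.random.default_rng(1)
# Each synthetic disease has a characteristic symptom pattern.
patterns = np.array([[1, 1, 0, 1, 0],
                     [1, 0, 1, 1, 0],
                     [0, 1, 0, 1, 1]])
y = rng.integers(0, 3, 300)
# Generate cases from the patterns, flipping ~10% of symptoms as noise.
X = np.abs(patterns[y] - (rng.random((300, 5)) < 0.1))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"test accuracy: {acc:.2f}")
```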

Liver disease detection

Researchers have found that AI can support the early diagnosis of liver disease, improving survival and cure rates. Abdar et al. ( 2018 ) demonstrated efficient early liver disease recognition using a multilayer perceptron neural network combined with decision tree algorithms, such as chi-squared automatic interaction detection and classification and regression trees with a boosting strategy; their technique was able to analyse and classify liver disease proficiently. Khaled et al. ( 2018 ) introduced an artificial neural network for the diagnosis of the hepatitis virus. Protein and histology measurements were used as input variables for the neural network model, which predicted the correct diagnosis in 93% of cases. Spann et al. ( 2020 ) reviewed the strengths and potential of machine learning tools as applied to liver disease research, including clinical, molecular, demographic, pathological, and radiological data. Nahar and Ara ( 2018 ) explored the early prediction of liver disease using various decision tree techniques: J48, logistic model tree (LMT), random forest, random tree, REPTree, decision stump, and Hoeffding tree. Their primary purpose was to calculate and compare the performance of these techniques. Farokhzad et al. ( 2016 ) used fuzzy logic for diagnosing liver disease; using triangular and Gaussian membership functions, they reached 79–83% accuracy.
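The multilayer-perceptron style of liver disease classification described above can be sketched as below. The clinical features (an enzyme level, total protein, a histology score), their distributions, and the labels are synthetic stand-ins, not the cited studies' data:

```python
# Hedged sketch: multilayer perceptron for liver disease classification
# on synthetic clinical-style features (all values illustrative).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([
    rng.normal(40, 20, n),   # liver enzyme level (synthetic)
    rng.normal(7, 1, n),     # total protein (synthetic)
    rng.normal(2, 1, n),     # histology score (synthetic)
])
# Synthetic label driven by enzyme level and histology score.
y = (X[:, 0] + 15 * X[:, 2] + rng.normal(0, 10, n) > 70).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)
scaler = StandardScaler().fit(X_tr)  # MLPs train poorly on unscaled inputs
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=2).fit(scaler.transform(X_tr), y_tr)
acc = mlp.score(scaler.transform(X_te), y_te)
print(f"test accuracy: {acc:.2f}")
```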

Comparative analysis

In addition to the reported work above, the comparative analysis in Table  4 presents detailed information such as the type of dataset, the techniques used, and the predicted outcomes of the work done by researchers on different diseases, which in turn helps identify the best technique for detecting or diagnosing a particular disease.

Comparative analysis for different disease detection

From Table  4 , we can observe that AI techniques have proven to be highly effective for detecting diseases with improved results. AI uses machine and deep learning models that work on training and testing datasets so that a system can recognize a disease and diagnose it early. An AI-based model must first be trained, much as a human learns, to remember the data and provide accurate results. However, this also raises a problem: if the training data produce an incorrect analysis of a disease because of insufficient information, artificial intelligence cannot compensate for it. The result can be dangerous for patients, as AI cannot guarantee that a prediction regarding disease detection is accurate.

On assessing the accuracy of algorithms in diagnosing disease, deep learning classifiers have dominated machine learning models in the field of disease diagnosis. Deep learning models achieved 99% accuracy for scalp disease, 96% for Alzheimer's disease, 99% for thyroid disease, 96% for skin disease, 99.37% for arrhythmia, and 95.7% for diabetes, whereas machine learning models achieved 89% for diabetes, 88.67% for tuberculosis, 86.84% for Alzheimer's disease, and so on.

In the current review, we have presented recently published research studies that employed AI-based learning techniques for disease diagnosis. This study highlights research on predicting disease diagnoses and on predicting the post-operative life expectancy of diseased patients using AI-based learning techniques.

Investigation 1 : Why do we need AI?

We know that AI is the simulation of human processes by machines (computer systems), and that this simulation includes learning, reasoning, and self-correction. We need AI because the amount of work we must perform rises daily, so automating routine tasks makes sense: it saves the organization’s staff effort and boosts productivity (Vasal et al. 2020 ).

In terms of the healthcare industry, AI in health refers to a set of diverse technologies that enable machines to detect, comprehend, act, and learn in order to execute administrative and clinical healthcare activities. AI has the potential to transform healthcare by addressing some of the industry’s most pressing issues. For example, AI can result in improved patient outcomes and increased productivity and efficiency in care delivery (Gouda et al. 2020 ). It can also enhance healthcare practitioners’ daily lives by freeing more time to care for patients, thereby increasing staff morale and retention. In addition, it may potentially help bring life-saving medicines to market more quickly. Figure  8 shows the significance of AI in the medical field.

Fig. 8 Importance of artificial intelligence in healthcare

Investigation 2 : Why is AI important, and how is it used to analyse the disease?

The emergence of new diseases remains a critical threat to human health and society, and advances in AI allow for the rapid processing and analysis of such massive and complex data. As mentioned in the literature, AI can recommend the correct decision for over ten different diseases with at least 98% accuracy.

Doctors use technologies such as computed tomography or magnetic resonance imaging to produce a detailed 3D map of the area to be diagnosed. AI technology then analyses the system-generated image using machine and deep learning models to spot the diseased area’s features in seconds. As shown in the framework section, an artificial intelligence model using machine and deep learning algorithms is initially trained on a particular disease dataset (Owasis et al. 2019 ). The dataset is pre-processed using data cleaning and transformation techniques so that the disease symptoms, in the form of feature vectors, can be extracted and further diagnosed.
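The cleaning → transformation → diagnosis flow described above maps naturally onto a pipeline. A minimal sketch, assuming synthetic tabular data with missing entries (the stages are illustrative, not the authors' implementation):

```python
# Sketch of the pre-processing + diagnosis pipeline: impute missing
# values (cleaning), standardize features (transformation), classify.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))                    # synthetic feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic diagnosis label
X[rng.random(X.shape) < 0.05] = np.nan           # simulate missing entries

pipe = Pipeline([
    ("clean", SimpleImputer(strategy="mean")),   # data cleaning
    ("transform", StandardScaler()),             # data transformation
    ("diagnose", LogisticRegression()),          # diagnosis model
])
pipe.fit(X, y)
acc = pipe.score(X, y)
print(f"training accuracy: {acc:.2f}")
```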

If doctors do not use AI techniques, treatment of patients is delayed, as manually interpreting a scanned image is difficult and takes considerable time. An AI technique, on the other hand, helps both the patients and the doctors, saving patients’ lives by enabling treatment as early as possible (Luo et al. 2019 ).

Investigation 3 : What is the impact of AI in medical diagnosis?

Medical diagnosis has been transformed by advancements in computing power, learning algorithms, and the availability of massive datasets (big data) derived from medical records and wearable health monitors. The best part of implementing AI in healthcare is that it enhances various areas, including illness detection, disease classification, decision-making processes, provision of optimal treatment choices, and, ultimately, helping people live longer. In terms of disease diagnosis, AI has been used to enhance medical diagnosis (Chen et al. 2019a , b ). For example, technology currently in use in China can detect hazardous tumors and nodules in patients with lung cancer, allowing physicians to provide an early diagnosis rather than sending tissue samples to a lab for testing, and thus enabling earlier treatment (Keenan et al. 2020 ). Figure  9 illustrates the influence of artificial intelligence compared with other approaches.

Fig. 9 Comparison between AI and other techniques

Investigation 4 : Which AI-based algorithm is used in disease diagnosis?

AI-driven disease detection algorithms have demonstrated themselves to be effective tools for identifying undiagnosed patients with under-diagnosed, uncoded, and rare diseases. AI models for disease detection therefore have ample opportunity to drive earlier diagnosis for patients in need and to guide pharmaceutical companies with highly advanced, targeted diagnostics that help these patients get correctly diagnosed and treated earlier in their disease journey (Keenan et al. 2020 ). The research work mentioned in the literature covers both machine and deep learning models for diagnosing diseases such as cancer, diabetes, chronic disease, heart disease, Alzheimer's disease, stroke and cerebrovascular disease, hypertension, skin disease, and liver disease. Among machine learning models, random forest classifiers, logistic regression, fuzzy logic, gradient boosting machines, decision trees, k-nearest neighbours (KNN), and support vector machines (SVM) are primarily used in the literature. Among deep learning models, convolutional neural networks (CNN) have been used most commonly for disease diagnosis. In addition, faster recurrent convolutional neural networks, multilayer perceptrons, and long short-term memory (LSTM) networks have also been used extensively in the literature. Figure  10 displays the usage of AI-based prediction models in the literature.

Fig. 10 Artificial intelligence-based prediction models

Investigation 5 : What challenges do researchers face when using AI models in disease diagnosis?

Although AI-based techniques have marked their significance in disease diagnosis, researchers still face many challenges that need to be addressed.

  • i. Limited data size  The most common challenge faced in most studies was insufficient data to train the model. A small sample size implies a smaller training set, which cannot validate the efficiency of the proposed approaches, whereas a good sample size can train the model better than a limited one (Rajalakshmi et al. 2018 ).
  • ii. High dimensionality  Another data-related issue, faced particularly in cancer research, is high dimensionality, which refers to a vast number of features relative to the number of cases. Multiple dimensionality reduction techniques are available to deal with this issue (Bibault et al. 2020 ).
  • iii. Efficient feature selection technique  Many studies have achieved exceptional prediction outcomes. However, a computationally efficient feature selection method is required to reduce the burden of data cleaning procedures while generating high disease prediction accuracy (Koshimizu et al. 2020 ).
  • iv. Model Generalizability  A shift in research towards improving the generalizability of the model is required. Most of the studies have proposed a prediction model that is validated on a single site. There is a need to validate the models on multiple sites that can help improve the model’s generalizability (Fukuda et al. 2019 ).
  • v. Clinical Implementation  AI-based models have proved their dominance in medical research; still, the practical implementation of the models in the clinics is not incorporated. These models need to be validated in a clinical setting to assist the medical practitioner in affirming the diagnosis verdicts (Huang et al. 2020 ).
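Challenge (ii) above, far more features than cases, is commonly addressed with dimensionality reduction. A minimal sketch using principal component analysis on synthetic high-dimensional data (illustrative only, not tied to any cited study):

```python
# PCA sketch: reduce 500 features over only 50 cases to 10 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 500))        # 50 cases, 500 features
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)                # (50, 10)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2f}")
```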

Investigation 6 : How are artificial intelligence-based techniques helping doctors diagnose diseases?

AI improves the lives of patients, physicians, and hospital managers by performing activities usually done by people, but in a fraction of the time and at a fraction of the cost. For example, AI assists physicians in making recommendations by evaluating vast amounts of healthcare data, such as electronic health records, symptom data, and physician reports, to improve health outcomes and ultimately save patients’ lives (Kohlberger et al. 2019 ). Additionally, these data aid in improving and accelerating decision-making while diagnosing and treating patients’ illnesses using artificial-intelligence-based approaches. AI also assists physicians in detecting diseases by utilizing complicated algorithms, hundreds of biomarkers, imaging findings from millions of patients, aggregated published clinical studies, and thousands of physicians’ notes to improve the accuracy of diagnosis.

Conclusion and future scope

When it comes to disease diagnosis, accuracy is critical for planning effective treatment and ensuring the well-being of patients. AI is a vast and diverse realm of data, algorithms, analytics, deep learning, neural networks, and insights that is constantly expanding and adapting to the needs of the healthcare industry and its patients. According to the findings of this study, AI approaches in the healthcare system, particularly for illness detection, are essential. Aiming to illuminate how machine and deep learning techniques work in various disease diagnosis areas, the current study has been divided into several sections that cover the diagnosis of Alzheimer's disease, cancer, diabetes, chronic diseases, heart disease, stroke and cerebrovascular disease, hypertension, skin disease, and liver disease. The introduction and contribution were covered in the first section, followed by an evaluation of the quality of the work and an examination of AI approaches and applications. Later sections discussed various illness symptoms and diagnostic difficulties, a paradigm for AI in disease detection models, and various AI applications in healthcare. The reported work on multiple diseases was then reviewed, along with a comparative analysis of different techniques, the datasets used, and the results of applied machine and deep learning methods in terms of parameters such as accuracy, sensitivity, specificity, area under the curve, and F-score. Finally, the study closes by helping researchers determine the most effective methods for detecting illnesses and by outlining the future scope. In a nutshell, medical experts now better understand how AI may be used for illness diagnosis, leading to more appropriate proposals for the future development of AI-based techniques.

Despite considerable advancements over the past several years, the area of accurate clinical diagnostics faces numerous obstacles that must be resolved, and it must improve constantly to treat emerging illnesses and diseases effectively. Healthcare professionals recognize the barriers that must be overcome before diseases can be detected in conjunction with artificial intelligence. Doctors do not yet entirely rely on AI-based approaches, since they are unsure of their ability to anticipate illnesses and associated symptoms. Thus, much work is required to train AI-based systems and thereby increase the accuracy of disease diagnosis methods. Hence, future AI-based research should be conducted with the flaws mentioned earlier in mind, to foster a mutually beneficial relationship between AI and clinicians. In addition, a decentralized federated learning model should be applied to create a single training model over disease datasets held at remote places, enabling the early diagnosis of diseases.

Acknowledgements

This research work was supported by Sejong University research fund. Yogesh Kumar and Muhammad Fazal Ijaz contributed equally to this work and are first co-authors.

Declarations

The authors declare that they have no conflict of interest.

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

This article does not contain any studies with animals performed by any of the authors.

Informed consent was obtained from all individual participants included in the study.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Yogesh Kumar, Email: [email protected] .

Apeksha Koul, Email: apekshakoul9@gmail.com .

Ruchi Singla, Email: ruchisingla@yahoo.com .

Muhammad Fazal Ijaz, Email: fazal@sejong.ac.kr .

  • Abdar M, Yen N, Hung J. Improving the diagnosis of liver disease using multilayer perceptron neural network and boosted decision tree. J Med Biol Eng. 2018; 38 :953–965. doi: 10.1007/s40846-017-0360-z. [ CrossRef ] [ Google Scholar ]
  • Abedi V, Khan A, Chaudhary D, Misra D, Avula V, Mathrawala D, Kraus C, Marshall KA, Chaudhary N, Li X, Schirmer CM, Scalzo F, Li J, Zand R. Using artificial intelligence for improving stroke diagnosis in emergency departments: a practical framework. Ther Adv Neurol Disord. 2020 doi: 10.1177/1756286420938962. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Aggarwal Y, Das J, Mazumder PM, Kumar R, Sinha RK. Heart rate variability features from nonlinear cardiac dynamics in identification of diabetes using artificial neural network and support vector machine. Integr Med Res. 2020 doi: 10.1016/j.bbe.2020.05.001. [ CrossRef ] [ Google Scholar ]
  • Ahmed F. An Internet of Things (IoT) application for predicting the quantity of future heart attack patients. J Comput Appl. 2017; 164 :36–40. doi: 10.5120/ijca2017913773. [ CrossRef ] [ Google Scholar ]
  • Aldhyani THH, Alshebami AS, Alzahrani MY. Soft clustering for enhancing the diagnosis of chronic diseases over machine learning algorithms. J Healthc Eng. 2020 doi: 10.1155/2020/4984967. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Alfian G, Syafrudin M, Ijaz MF, Syaekhoni MA, Fitriyani NL, Rhee J. A personalized healthcare monitoring system for diabetic patients by utilizing BLE-based sensors and real-time data processing. Sensors. 2018; 18 (7):2183. doi: 10.3390/s18072183. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ali M, Tengnah J, Sooklall R. A predictive model for hypertension diagnosis using machine learning techniques. Telemed Technol. 2019 doi: 10.1016/B978-0-12-816948-3.00009-X. [ CrossRef ] [ Google Scholar ]
  • Ani R, Krishna S, Anju N, Aslam MS, Deepa OS (2017) IoT based patient monitoring and diagnostic prediction tool using ensemble classifier. In: 2017 International conference on advances in computing, communications and informatics (ICACCI), pp 1588–1593. 10.1109/ICACCI.2017.8126068
  • Ansari S, Shafi I, Ansari A, Ahmad J, Shah S. Diagnosis of liver disease induced by hepatitis virus using artificial neural network. IEEE Int Multitopic. 2011 doi: 10.1109/INMIC.2011.6151515. [ CrossRef ] [ Google Scholar ]
  • Arsalan M, Owasis M, Mahmood T, Cho S, Park K. Aiding the diagnosis of diabetic and hypertensive retinopathy using artificial intelligence based semantic segmentation. J Clin Med. 2019; 8 :1446. doi: 10.3390/jcm8091446. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Babu BS, Likhitha V, Narendra I, Harika G. Prediction and detection of heart attack using machine learning and internet of things. J Comput Sci. 2019; 4 :105–108. [ Google Scholar ]
  • Bahadur T, Verma K, Kumar B, Jain D, Singh S. Automatic detection of tuberculosis related abnormalities in chest X-ray images using hierarchical feature extraction scheme. Expert Syst Appl. 2020; 158 :113514. doi: 10.1016/j.eswa.2020.113514. [ CrossRef ] [ Google Scholar ]
  • Balaji E, Brindha D, Balakrishnan R. Supervised machine learning based gait classification system for early detection and stage classification of Parkinson’s disease. Appl Soft Comput J. 2020; 94 :106494. doi: 10.1016/j.asoc.2020.106494. [ CrossRef ] [ Google Scholar ]
  • Battineni G, Sagaro GG, Chinatalapudi N, Amenta F. Applications of machine learning predictive models in the chronic disease diagnosis. J Personal Med. 2020 doi: 10.3390/jpm10020021. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Behera R, Bala P, Dhir A. The emerging role of cognitive computing in healthcare: a systematic literature review. J Med Inform. 2019; 129 :154–166. doi: 10.1016/j.ijmedinf.2019.04.024. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Bhatt V, Pal V (2019) An intelligent system for diagnosing thyroid disease in pregnant ladies through artificial neural network. In: Conference on advances in engineering science management and technology, pp 1–10. 10.2139/ssrn.3382654
  • Bibault J, Xing L. Screening for chronic obstructive pulmonary disease with artificial intelligence. Lancet Digit Health. 2020; 2 :e216–e217. doi: 10.1016/S2589-7500(20)30076-5. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Biswas M, Saba L, Suri H, Lard J, Suri S, Miner M, et al. Two stage artificial intelligence model for jointly measurement of atherosclerotic wall thickness and plaque burden in carotid ultrasound. Comput Biol Med. 2020; 123 :103847. doi: 10.1016/j.compbiomed.2020.103847. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Carter JA, Long CS, Smith BP, Smith TL, Donati GL. PT US CR. Expert Syst Appl. 2018 doi: 10.1016/j.eswa.2018.08.002. [ CrossRef ] [ Google Scholar ]
  • Chaikijurajai T, Laffin L, Tang W. Artificial intelligence and hypertension: recent advances and future outlook. Am J Hypertens. 2020; 33 :967–974. doi: 10.1093/ajh/hpaa102. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chaki J, Ganesh ST, Cidham SK, Theertan SA. Machine learning and artificial intelligence based diabetes mellitus detection and self-management: a systematic review. J King Saud Univ Comput Inf Sci. 2020 doi: 10.1016/j.jksuci.2020.06.013. [ CrossRef ] [ Google Scholar ]
  • Chakraborty S, Mali K, Chatterjee S, Banerjee S, Roy K et al (2017) Detection of skin disease using metaheuristic supported artificial neural networks. In: Industrial automation and electromechanical engineering conference, pp 224–229. 10.1109/IEMECON.2017.8079594
  • Chambi R, Kut C, Jimenez J, Jo J. AI assisted in situ detection of human glioma infiltration using a novel computational method for optical coherence tomography. Clin Cancer Res. 2019; 25 :6329–6338. doi: 10.1158/1078-0432.CCR-19-0854. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chang W, Chen L, Wang W. Development and experimental evaluation of machine learning techniques for an intelligent hairy scalp detection system. Appl Sci. 2018; 8 :853. doi: 10.3390/app8060853. [ CrossRef ] [ Google Scholar ]
  • Chatterjee A, Parikh N, Diaz I, Merkler A. Modeling the impact of inter hospital transfer network design on stroke outcomes in a large city. Stroke. 2018; 49 :370–376. doi: 10.1161/STROKEAHA.117.018166. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chen Y, Sha M, Zhao X, Ma J, Ni H, Gao W, Ming D. Automated detection of pathologic white matter alterations in Alzheimer’s disease using combined diffusivity and kurtosis method. Psychiatry Res Neuroimaging. 2017; 264 :35–45. doi: 10.1016/j.pscychresns.2017.04.004. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chen J, Remulla D, Nguyen J, Aastha D, Liu Y, Dasgupta P. Current status of artificial intelligence applications in urology and their potential to influence clinical practice. BJU Int. 2019; 124 :567–577. doi: 10.1111/bju.14852. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chen P, Gadepalli K, MacDonald R, Liu Y, Dean J. An augmented reality microscope with real time artificial intelligence integration for cancer diagnosis. Nat Med. 2019; 25 :1453–1457. doi: 10.1038/s41591-019-0539-7. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chuang C. Case based reasoning support for liver disease diagnosis. Artif Intell. 2011; 53 :15–23. doi: 10.1016/j.artmed.2011.06.002. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Chui KT, Alhalabi W. Disease diagnosis in smart healthcare: innovation. Technol Appl. 2017 doi: 10.3390/su9122309. [ CrossRef ] [ Google Scholar ]
  • Chui CS, Lee NP, Adeoye J, Thomson P, Choi S-W. Machine learning and treatment outcome prediction for oral cancer. J Oral Pathol Med. 2020; 49 :977–985. doi: 10.1111/jop.13089. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Connell GCO, Chantler PD, Barr TL. Stroke-associated pattern of gene expression previously identified by machine-learning is diagnostically robust in an independent patient population. Genomics Data. 2017; 14 :47–52. doi: 10.1016/j.gdata.2017.08.006. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Dabowsa N, Amaitik N, Maatuk A, Shadi A (2017) A hybrid intelligent system for skin disease diagnosis. In: Conference on engineering and technology, pp 1–6. 10.1109/ICEngTechnol.2017.8308157
  • Damiani G, Grossi E, Berti E, Conic R, Radhakrishna U, Linder D, Bragazzi N, Pacifico A, Piccino R. Artificial neural network allow response prediction in squamous cell carcinoma of the scalp treated with radio therapy. J Eur Acad Dermatol Venerel. 2020; 34 :1369–1373. doi: 10.1111/jdv.16210. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Das A, Acharya UR, Panda SS, Sabut S. Deep learning based liver cancer detection using watershed transform and Gaussian mixture model techniques. Cogn Syst Res. 2019; 54 :165–175. doi: 10.1016/j.cogsys.2018.12.009. [ CrossRef ] [ Google Scholar ]
  • Escamilla G, Hassani A, Andres E. A comparison of machine learning techniques to predict the risk of heart failure. Mach Learn Paradig. 2019; 1 :9–26. doi: 10.1007/978-3-030-15628-2_2. [ CrossRef ] [ Google Scholar ]
  • Farokhzad M, Ebrahimi L. A novel adapter neuro fuzzy inference system for the diagnosis of liver disease. J Acad Res Comput Eng. 2016; 1 :61–66. [ Google Scholar ]
  • Fujita S, Hagiwara A, Otsuka Y, Hori M, Kumamaru K, Andica C, et al. Deep learning approach for generating MRA images from 3D quantitative synthetic MRI without additional scans. Invest Radiol. 2020; 55 :249–256. doi: 10.1097/RLI.0000000000000628. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Fukuda M, Inamoto K, Shibata N, Ariji Y, Kutsana S. Evaluation of an artificial system for detecting vertical root fracture on panoramic radiography. Oral Radiol. 2019; 36 :1–7. [ PubMed ] [ Google Scholar ]
  • Gao XW, James-Reynolds C, Currie E. Analysis of tuberculosis severity levels from CT pulmonary images based on enhanced residual deep learning architecture. Neurocomputing. 2019 doi: 10.1016/j.neucom.2018.12.086. [ CrossRef ] [ Google Scholar ]
  • George A, Badagabettu S, Berra K, George L, Kamath V, Thimmappa L. Prevention of cardiovascular disease in India. Clin Prev Cardiol. 2018; 7 :72–77. doi: 10.4013/JCPC.JCPC_31_17. [ CrossRef ] [ Google Scholar ]
  • Gonsalves AH, Singh G, Thabtah F, Mohammad R. Prediction of coronary heart disease using machine learning: an experimental analysis. ACM Digit Libr. 2019 doi: 10.1145/3342999.3343015. [ CrossRef ] [ Google Scholar ]
  • Gouda W, Yasin R. COVID-19 disease: CT pneumonia analysis prototype by using artificial intelligence, predicting the disease severity. J Radiol Nucl Med. 2020; 51 :196. doi: 10.1186/s43055-020-00309-9. [ CrossRef ] [ Google Scholar ]
  • Gupta N, Verma R, Belho E. Bone scan and SPEC/CT scan in SAPHO syndrome. J Soc Nucl Med. 2019; 34 :349. doi: 10.4103/ijnm.IJNM_139_19. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Han Y, Han Z, Wu J, Yu Y, Gao S, Hua D, Yang A. Artificial intelligence recommendation system of cancer rehabilitation scheme based on IoT technology. IEEE Access. 2020; 8 :44924–44935. doi: 10.1109/ACCESS.2020.2978078. [ CrossRef ] [ Google Scholar ]
  • Haq AU, Li JP, Memon MH, Nazir S, Sun R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mob Inf Syst. 2018; 8 :1–21. doi: 10.1155/2018/3860146. [ CrossRef ] [ Google Scholar ]
  • He K, Huang S, Qian X. Early detection and risk assessment for chronic disease with irregular longitudinal data analysis. J Biomed Inform. 2019; 96 :103231. doi: 10.1016/j.jbi.2019.103231. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Horvath L, Burchkhardt I, Mannsperger S, Last K, et al. Machine-assisted interpretation of auramine stains substantially increases throughput and sensitivity of microscopic tuberculosis diagnosis. Tuberculosis. 2020; 125 :101993. doi: 10.1016/j.tube.2020.101993. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hosseinzadeh M, Ahmed O, Ghafour M, Safara F, Ali S, Vo B, Chiang H. A multiple multi layer perceptron neural network with an adaptive learning algorithm for thyroid disease diagnosis in the internet of medical things. J Supercomput. 2020 doi: 10.1007/s11227-020-03404-w. [ CrossRef ] [ Google Scholar ]
  • Huang S, Yang J, Fong S, Zhao F. Artificial intelligence in cancer diagnosis and prognosis. Cancer Lett. 2020; 471 :61–71. doi: 10.1016/j.canlet.2019.12.007. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ijaz MF, Alfian G, Syafrudin M, Rhee J. Hybrid prediction model for type 2 diabetes and hypertension using DBSCAN-based outlier detection, synthetic minority over sampling technique (SMOTE), and random forest. Appl Sci. 2018; 8 (8):1325. doi: 10.3390/app8081325. [ CrossRef ] [ Google Scholar ]
  • Ijaz MF, Attique M, Son Y. Data-driven cervical cancer prediction model with outlier detection and over-sampling methods. Sensors. 2020; 20 (10):2809. doi: 10.3390/s20102809. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Isravel DP, Silas SVPD. Improved heart disease diagnostic IoT model using machine learning techniques. Neuroscience. 2020; 9 :4442–4446. [ Google Scholar ]
  • Jain D, Singh V. Feature selection and classification systems for chronic disease prediction: a review. Egypt Inform J. 2018; 19 :179–189. doi: 10.1016/j.eij.2018.03.002. [ CrossRef ] [ Google Scholar ]
  • Janghel RR, Rathore YK. Deep convolution neural network based system for early diagnosis of Alzheimer’s disease. Irbm. 2020; 1 :1–10. doi: 10.1016/j.irbm.2020.06.006. [ CrossRef ] [ Google Scholar ]
  • Jo T, Nho K, Saykin AJ. Deep learning in Alzheimer’s disease: diagnostic classification and prognostic prediction using neuroimaging data. Front Aging Neurosci. 2019 doi: 10.3389/fnagi.2019.00220. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kanegae H, Suzuki K, Fukatani K, Ito T, Kairo K, Beng N. Highly precise risk prediction model for new onset hypertension using artificial neural network techniques. J Clin Hypertens. 2020; 22 :445–450. doi: 10.1111/jch.13759. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kasasbeh A, Christensen S, Parsons M, Lansberg M, Albers G, Campbell B. Artificial neural network computed tomography perfusion prediction of ischemic core. Stroke. 2019; 50 :1578–1581. doi: 10.1161/STROKEAHA.118.022649. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Katharine E, Oikonomou E, Williams M, Desai M. A novel machine learning derived radiotranscriptomic signature of perivascular fat improves cardiac risk prediction using coronary CT angiography. Eur Heart J. 2019; 40 :3529–3543. doi: 10.1093/eurheartj/ehz592. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kather J, Pearson A, Halama N, Krause J, Boor P. Deep learning microsatellite instability directly from histology in gastrointestinal cancer. Nat Med. 2019; 25 :1054–1056. doi: 10.1038/s41591-019-0462-y. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kaur H, Kumari V. Predictive modelling and analytics for diabetes using a machine learning approach. Appl Comput Inform. 2018 doi: 10.1016/j.aci.2018.12.004. [ CrossRef ] [ Google Scholar ]
  • Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol. 2017; J15 :104–116. doi: 10.1016/j.csbj.2016.12.005. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Keenan T, Clemons T, Domalpally A, Elman M, Havilio M, Agron E, Chew E, Benyamini G. Intelligence detection versus artificial intelligence detection of retinal fluid from OCT: age-related eye disease study 2: 10 year follow on study. Ophthalmology. 2020 doi: 10.1016/j.ophtha.2020.06.038. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Khaled E, Naseer S, Metwally N. Diagnosis of hepatititus virus using arificial neural network. J Acad Pedagog Res. 2018; 2 :1–7. [ Google Scholar ]
  • Khan MA, Member S. An IoT framework for heart disease prediction based on MDCNN classifier. IEEE Access. 2020; 8 :34717–34727. doi: 10.1109/ACCESS.2020.2974687. [ CrossRef ] [ Google Scholar ]
  • Khan A, Zubair S. An improved multi-modal based machine learning approach for the prognosis of Alzheimer’s disease. J King Saud Univ Comput Inf Sci. 2020 doi: 10.1016/j.jksuci.2020.04.004. [ CrossRef ] [ Google Scholar ]
  • Khan A, Khan M, Ahmed F, Mittal M, Goyal L, Hemanth D, Satapathy S. Gastrointestinal diseases segmentation and classification based on duo-deep architectures. Pattern Recognit Lett. 2020; 131 :193–204. doi: 10.1016/j.patrec.2019.12.024. [ CrossRef ] [ Google Scholar ]
  • Kiely DG, Doyle O, Drage E, Jenner H, Salvatelli V, Daniels FA, Rigg J, Schmitt C, Samyshkin Y, Lawrie A, Bergemann R. Utilising artificial intelligence to determine patients at risk of a rare disease: idiopathic pulmonary arterial hypertension. Pulm Circ. 2019; 9 :1–9. doi: 10.1177/2045894019890549. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kim C, Son Y, Youm S. Chronic disease prediction using character-recurrent neural network in the presence of missing information. Appl Sci. 2019; 9 :2170. doi: 10.3390/app9102170. [ CrossRef ] [ Google Scholar ]
  • Kohlberger T, Norouzi M, Smith J, Peng L, Hipp J. Artificial intelligence based breast cancer nodal metastasis detection. Arch Pathol Lab Med. 2019; 143 :859–868. doi: 10.5858/arpa.2018-0147-OA. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kolkur MS, Kalbande DR, Kharkar V. Machine learning approaches to multi-class human skin disease Ddetection. Innov Healthc Tech. 2018; 14 :29–39. [ Google Scholar ]
  • Koshimizu H, Kojima H, Okuno Y. Future possibilities for artificial intelligence in the practical management of hypertension. Hypertens Res. 2020; 43 :1327–1337. doi: 10.1038/s41440-020-0498-x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Krittanawong C, Bomback A, Baber U, Bangalore S, Tang M, Messerli F. Future direction for using artificial intelligence to predict and manage hypertension. Curr Hypertens Rep. 2018; 20 :75. doi: 10.1007/s11906-018-0875-x. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kumar Y. Computational intelligence for machine learning and healthcare informatics. De Gruyter; 2020. Recent advancement of machine learning and deep learning in the field of healthcare system; pp. 7–98. [ Google Scholar ]
  • Kumar Y, Singla R. Federated learning systems for healthcare: perspective and recent progress. In: Rehman MH, Gaber MM, editors. Studies in computational intelligence, vol965. Cham: Springer; 2021. [ Google Scholar ]
  • Kumar A, Pal S, Kumar S. Classification of skin disease using ensemble data mining techniques. Asia Pac J Cancer Prev. 2019; 20 :1887–1894. doi: 10.31557/APJCP.2019.20.6.1887. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kumar Y, Sood K, Kaul S, Vasuja R. Big data analytics in healthcare. Cham: Springer; 2020. pp. 3–21. [ Google Scholar ]
  • Kwon J, Jeon H, Kim H, Lim S, Choi R. Comapring the performance of artificial intelligence and conventional diagnosis criteria for detetcting left ventricular hypertrophy using electropcardiography. EP Europace. 2020; 22 :412–419. doi: 10.1093/europace/euz324. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Labovitz D, Shafner L, Gil M, Hanina A, Virmani D. Using artificial intelligence reduce the risk of non adherence in patients on anticoagulation theraphy. Stroke. 2017; 48 :1416–1419. doi: 10.1161/STROKEAHA.116.016281. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lai N, Shen W, Lee C, Chang J, Hsu M, et al. Comparison of the predictive outcomes for anti-Alzheimer drug-induced hepatotoxicity by different machine learning techniques. Comput Methods Programs Biomed. 2020; 188 :105307. doi: 10.1016/j.cmpb.2019.105307. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lei B, Yang M, Yang P, Zhou F, Hou W, Zou W, Li X, Wang T, Xiao X, Wang S. Deep and joint learning of longitudinal data for Alzheimer’s disease prediction. Pattern Recognit. 2020; 102 :107247. doi: 10.1016/j.patcog.2020.107247. [ CrossRef ] [ Google Scholar ]
  • Lin L, Shenghui Z, Aiguo W, Chen H. A new machine learning method for Alzheimer’s disease. Simul Model Pract Theory. 2019 doi: 10.1016/j.simpat.2019.102023. [ CrossRef ] [ Google Scholar ]
  • Ljubic B, Roychoudhury S, Cao XH, Pavlovski M, Obradovic S, Nair R, Glass L, Obradovic Z. Influence of medical domain knowledge on deep learning for Alzheimer’s disease prediction. Comput Methods Programs Biomed. 2020 doi: 10.1016/j.cmpb.2020.105765. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Lodha P, Talele A, Degaonkar K (2018) Diagnosis of Alzheimer’s disease using machine learning. In: Proceedings—2018 4th international conference on computing, communication control and automation, ICCUBEA, pp 1–4
  • López-Úbeda P, Díaz-Galiano MC, Martín-Noguerol T, Ureña-López A, Martín-Valdivia M-T, Lunab A. Detection of unexpected findings in radiology reports: a comparative study of machine learning approaches. Expert Syst Appl. 2020 doi: 10.1016/j.eswa.2020.113647. [ CrossRef ] [ Google Scholar ]
  • Lukwanto R, Irwansyah E. The early detection of diabetes mellitus using fuzzy hierarchical model. Proc Comput Sci. 2015; 59 :312–319. doi: 10.1016/j.procs.2015.07.571. [ CrossRef ] [ Google Scholar ]
  • Luo H, Xu G, Li C, Wu Q, et al. Real time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case control, diagnostic study. Lancet Oncol. 2019; 20 :1645–1654. doi: 10.1016/S1470-2045(19)30637-0. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ma F, Sun T, Liu L, Jing H. Detection and diagnosis of chronic kidney disease using deep learning-based heterogeneous modified artificial neural network. Future Gener Comput Syst. 2020; 111 :17–26. doi: 10.1016/j.future.2020.04.036. [ CrossRef ] [ Google Scholar ]
  • Matusoka R, Akazawa H, Kodera S. The drawing of the digital era in the management of hypertension. Hypertens Res. 2020; 43 :1135–1140. doi: 10.1161/HYPERTENSIONAHA.120.14742. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Memon M, Li J, Haq A, Memon M. Breast cancer detection in the Iot health environment using modified recursive feature selection. Wirel Commun Mob. 2019; 2019 :19. [ Google Scholar ]
  • Mercaldo F, Nardone V, Santone A, Nardone V, Santone A. Diabetes mellitus affected patients classification diagnosis through machine learning techniques through learning through machine learning techniques. Proc Comput Sci. 2017; 112 :2519–2528. doi: 10.1016/j.procs.2017.08.193. [ CrossRef ] [ Google Scholar ]
  • Minaee S, Kafieh R, Sonka M, Yazdani S, Soufi G. Deep-COVID: predicting covid-19 from chest X-ray images using deep transfer learning. Comput Vis Pattern Recognit. 2020; 3 :1–9. doi: 10.1016/j.media.2020.101794. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Momin M, Bhagwat N, Dhiwar A, Devekar N. Smart body monitoring system using IoT and machine learning. J Adv Res Electr Electron Instrum Eng Smart Body Syst Using IoT Mach Learn. 2019; 1 :1–7. doi: 10.15662/IJAREEIE.2019.0805010. [ CrossRef ] [ Google Scholar ]
  • Morabito F, Campolo M, Leracitano C, Ebadi J, Bonanno L, Barmanti A, Desalvo S, Barmanti P, Ieracitano C. Deep Convolutional neural Network for classification of mild cognitive impaired and Alzheimer’s disease patients from scalp EEG recordings. Res Technol Soc Ind Levaraging Better Tomorrow. 2016 doi: 10.1109/RTSI.2016.7740576. [ CrossRef ] [ Google Scholar ]
  • Mueller FB. AI (Artificial Intelligence) and hypertension research. Telemed Technol. 2020; 70 :1–7. doi: 10.1007/s11906-020-01068-8. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Mujumdar A, Vaidehi V. Diabetes prediction using machine learning. Proc Comput Sci. 2019; 165 :292–299. doi: 10.1016/j.procs.2020.01.047. [ CrossRef ] [ Google Scholar ]
  • Musleh M, Alajrami E, Khalil A, Nasser B, Barhoom A, Naser S. Predicting liver patients using artificial neural network. J Acad Inf Syst Res. 2019; 3 :1–11. [ Google Scholar ]
  • Nahar N, Ara F. Liver disease detection by using different techniques. Elsevier. 2018; 8 :1–9. doi: 10.5121/ijdkp.2018.8201. [ CrossRef ] [ Google Scholar ]
  • Nam KH, Kim DH, Choi BK, Han IH. Internet of Things, digital biomarker, and artificial intelligence in spine: current and future perspectives. Neurospine. 2019; 16 :705–711. doi: 10.14245/ns.1938388.194. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Naser S, Naseer I. Lung cancer detection using artificial neural network. J Eng Inf Syst. 2019; 3 :17–23. [ Google Scholar ]
  • Nashif S, Raihan R, Islam R, Imam MH. Heart disease detection by using machine learning algorithms and a real-time cardiovascular health monitoring system. Healthc Technol. 2018; 6 :854–873. doi: 10.4236/wjet.2018.64057. [ CrossRef ] [ Google Scholar ]
  • Nasser I, Naser S, et al. Predicting tumor category using artificial neural network. Eng Inf Technol. 2019; 3 :1–7. [ Google Scholar ]
  • Nazir T, Irtaza A, Shabbir Z, Javed A, Akram U, Tariq M. Artificial intelligence in medicine diabetic retinopathy detection through novel tetragonal local octa patterns and extreme learning machines. Artif Intell Med. 2019; 99 :101695. doi: 10.1016/j.artmed.2019.07.003. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nensa F, Demircioglu A, Rischipler C. Artificial intelligence in nuclear medicine. J Nucl Med. 2019; 60 :1–10. doi: 10.2967/jnumed.118.220590. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Nithya A, Ahilan A, Venkatadri N, Ramji D, Palagan A. Kidney disease detection and segmentation using artificial neural network and multi kernel k-means clustering for ultrasound images. Measurement. 2020; 149 :106952. doi: 10.1016/j.measurement.2019.106952. [ CrossRef ] [ Google Scholar ]
  • Oh K, Chung YC, Kim KW, Kim WS, Oh IS. Classification and visualization of Alzheimer’s disease using volumetric convolutional neural network and transfer learning. Sci Rep. 2019; 9 :1–16. doi: 10.1038/s41598-019-54548-6. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Oomman R, Kalmady KS, Rajan J, Sabu MK. Automatic detection of alzheimer bacilli from microscopic sputum smear images using deep learning methods. Integr Med Res. 2018; 38 :691–699. doi: 10.1016/j.bbe.2018.05.007. [ CrossRef ] [ Google Scholar ]
  • Ostovar A, Chimeh E, Fakoorfard Z. The diagnostic value of CT scans in the process of diagnosing COVID-19 in medical centers. Health Technol Assess Act. 2020; 4 :1–7. [ Google Scholar ]
  • Owasis M, Arsalan M, Choi J, Mahmood T, Park K. Artificial intelligence based classification of multiple gastrointestinal diseases using endoscopy videos for clinical diagnosis. J Clin Med. 2019; 8 :786. doi: 10.3390/jcm8070986. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Panicker RO, Kalmady KS, Rajan J, Sabu MK. Automatic detection of tuberculosis bacilli from microscopic sputum smear images using deep learning methods. Biocybern Biomed Eng. 2018; 38 (3):691–699. doi: 10.1016/j.bbe.2018.05.007. [ CrossRef ] [ Google Scholar ]
  • Park JH, Cho HE, Kim JH, Wall MM, Stern Y, Lim H, Yoo S, Kim HS, Cha J. Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data. Npj Digit Med. 2020 doi: 10.1038/s41746-020-0256-0. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Patel SB. Heart disease using machine learning and data minig techniques. Health Technol. 2016; 10 :1137–1144. [ Google Scholar ]
  • Plawiak P, Ozal Y, Tan R, Acharya U. Arrhythmia detection using deep convolution neural network with long duration ECG signals. Comput Biol Med. 2018; 102 :411–420. doi: 10.1016/j.compbiomed.2018.09.009. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Pradhan K, Chawla P. Medical Internet of things using machine learning algorithms for lung cancer detection. J Manag Anal. 2020 doi: 10.1080/23270012.2020.1811789. [ CrossRef ] [ Google Scholar ]
  • Rajalakshmi R, Subashini R, Anjana R, Mohan V. Automated diabetic retinopathy detection in smartphone-based fundus photography using artificial intelligence. Eye. 2018; 32 :1138–1144. doi: 10.1038/s41433-018-0064-9. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rathod J, Wazhmode V, Sodha A, Bhavathankar P (2018) Diagnosis of skin diseases using convolutional neural network. In: Second international conference on electronics, communication and aerospace technology, pp 1048–1051. 10.1109/ICECA.2018.8474593
  • Raza M, Awais M, Ellahi W, Aslam N, Nguyen HX, Le-Minh H. Diagnosis and monitoring of Alzheimer’s patients using classical and deep learning techniques. Expert Syst Appl. 2019; 136 :353–364. doi: 10.1016/j.eswa.2019.06.038. [ CrossRef ] [ Google Scholar ]
  • Rodrigues J, Matteo A, Ghosh A, Szantho G, Paton J. Comprehensive characterisation of hypertensive heart disease left ventricular pehnotypes. Heart. 2016; 20 :1671–1679. doi: 10.1136/heartjnl-2016-309576. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rodrigues DA, Ivo RF, Satapathy SC, Wang S, Hemanth J, Filho PPR. A new approach for classification skin lesion based on transfer learning, deep learning, and IoT system. Pattern Recognit Lett. 2020; 136 :8–15. doi: 10.1016/j.patrec.2020.05.019. [ CrossRef ] [ Google Scholar ]
  • Romanini J, Barun L, Martins M, Carrard V. Continuing education activities improve dentists self efficacy to manage oral mucosal lesions and oral cancer. Eur J Dent Educ. 2020; 25 :28–34. [ PubMed ] [ Google Scholar ]
  • Romero MP, Chang Y, Brunton LA, Parry J, Prosser A, Upton P, Rees E, Tearne O, Arnold M, Stevens K, Drewe JA. Decision tree machine learning applied to bovine alzheimer risk factors to aid disease control decision making. Prev Vet Med. 2020; 175 :104860. doi: 10.1016/j.prevetmed.2019.104860. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sabottke C, Spieler B. The effect of image resolution on deep learning in radiography. Radiology. 2020; 2 :e190015. doi: 10.1148/ryai.2019190015. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sakr S, El Shawi R, Ahmed A, Blaha M, et al. Using machine learning on cardiorespiratory fitness data for predicting hypertension: the henry ford exercise testing project. PLoS One. 2018; 13 :1–18. doi: 10.1371/journal.pone.0195344. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Santroo A, Clemente F, Baioochi C, Bianchi C, Falciani F, Valente S, et al. From near-zero to zero fluoroscopy catheter ablation procedures. J Cardiovasc Electrophys. 2019; 30 :2397–2404. doi: 10.1111/jce.14121. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Saranya E, Maheswaran T. IOT based disease prediction and diagnosis system for healthcare. Healthc Technol. 2019; 7 :232–237. [ Google Scholar ]
  • Sarao V, Veritti D, Paolo L. Automated diabetic retinopathy detection with two different retinal imaging devices using artificial intelligence. Graefe’s Arch Clin Exp Opthamol. 2020 doi: 10.1007/s00417-020-04853-y. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sathitratanacheewin S, Sunanta P, Pongpirul K. Heliyon deep learning for automated classification of Alzheimer-related chest X-ray: dataset distribution shift limits diagnostic performance generalizability. Heliyon. 2020; 6 :e04614. doi: 10.1016/j.heliyon.2020.e04614. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Shabut AM, Hoque M, Lwin KT, Evans BA, Azah N, Abu-hassan KJ, Hossain MA. An intelligent mobile-enabled expert system for alzheimer disease diagnosis in real time. Expert Syst Appl. 2018; 114 :65–77. doi: 10.1016/j.eswa.2018.07.014. [ CrossRef ] [ Google Scholar ]
  • Shkolyar E, Jia X, Chnag T, Trivedi D. Augmented bladder tumor detection using deep learning. Eur Urol. 2019; 76 :714–718. doi: 10.1016/j.eururo.2019.08.032. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Singh N, Moody A, Leung G, Ravikumar R, Zhan J, Maggissano R, Gladstone D. Moderate carotid artery stenosis: MR imaging depicted intraplaque hemorrhage predicts risk of cerebovascular ischemic events in asymptomatic men. Radiology. 2009; 252 :502–508. doi: 10.1148/radiol.2522080792. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Singh J, Tripathy A, Garg P, Kumar A. Lung Alzheimer detection using anti-aliased convolutional networks networks. Proc Comput Sci. 2020; 173 :281–290. doi: 10.1016/j.eswa.2018.07.014. [ CrossRef ] [ Google Scholar ]
  • Skaane P, Bandos A, Gullien R, Eben E, Ekseth U, Izadi M, Jebsen I, Gur D. Comparison of digital mammography alone and digital mammography plus tomo-sysnthesis in a population based screening program. Radiology. 2013; 267 :47–56. doi: 10.1148/radiol.12121373. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sloun R, Cohen R, Eldar Y. Deep learning in ultrasound imaging. IEEE. 2019; 108 :11–29. doi: 10.1109/JPROC.2019.2932116. [ CrossRef ] [ Google Scholar ]
  • Soundarya S, Sruthi MS, Sathya BS, Kiruthika S, Dhiyaneswaran J. Early detection of Alzheimer disease using gadolinium material. Mater Today Proc. 2020 doi: 10.1016/j.matpr.2020.03.189. [ CrossRef ] [ Google Scholar ]
  • Spann A, Yasodhara A, Kang J, Watt K, Wang B, Bhat M, Goldenberg A. Applying machine learning in liver disease and transplantation: a survey. Hepatology. 2020; 71 :1093–1105. doi: 10.1002/hep.31103. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Srinivasu PN, SivaSai JG, Ijaz MF, Bhoi AK, Kim W, Kang JJ. Classification of skin disease using deep learning neural networks with MobileNet V2 and LSTM. Sensors. 2021; 21 (8):2852. doi: 10.3390/s21082852. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Srinivasu PN, Ahmed S, Alhumam A, Kumar AB, Ijaz MF. An AW-HARIS based automated segmentation of human liver using CT images. Comput Mater Contin. 2021; 69 (3):3303–3319. [ Google Scholar ]
  • Subasi A. Use of artificial intelligence in Alzheimer’s disease detection. AI Precis Health. 2020 doi: 10.1016/B978-0-12-817133-2.00011-2. [ CrossRef ] [ Google Scholar ]
  • Swapna G, Vinayakumar R, Soman KP. Diabetes detection using deep learning algorithms. ICT Express. 2018; 4 :243–246. doi: 10.1016/j.icte.2018.10.005. [ CrossRef ] [ Google Scholar ]
  • Tang LYW, Coxson HO, Lam S, Leipsic J, Tam RC, Sin DD. Articles towards large-scale case-finding: training and validation of residual networks for detection of chronic obstructive pulmonary disease using low-dose CT. Lancet Digit Health. 2020; 2 :e259–e267. doi: 10.1016/S2589-7500(20)30064-9. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Tegunov D, Cramer P. Real-time cryo-electron microscopy data preprocessing with warp. Nat Med. 2019; 16 :1146–1152. doi: 10.1038/s41592-019-0580-y. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Thai DT, Minh QT, Phung PH (2017) Toward an IoT-based expert system for heart disease diagnosis. In: Modern artificial intelligence and cognitive science conference, vol 1964, pp 157–164
  • Tigga NP, Garg S. Prediction of type 2 diabetes using machine learning prediction of type 2 diabetes using machine learning classification methods classification methods. Proc Comput Sci. 2020; 167 :706–716. doi: 10.1016/j.procs.2020.03.336. [ CrossRef ] [ Google Scholar ]
  • TranX B, Latkin A, Lan H, Ho R, Ho C, et al. The current research landscap of the application of artificial intelligence in managing cerebovasclar and heart disease. J Environ Res Public health. 2019; 16 :2699. doi: 10.3390/ijerph16152699. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Tschandl P, Nisa B, Cabo H, Kittler H, Zalaudek I. Expert level diagnosis of non pigmented skin cancer by combined convolution neural networks. Jama Dermatol. 2019; 155 :58–65. doi: 10.1001/jamadermatol.2018.4378. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Tuli S, Basumatary N, Gill SS, Kahani M, Arya RC, Wander GS. HealthFog: an ensemble deep learning based smart healthcare system for automatic diagnosis of heart diseases in integrated IoT and fog computing environments. Future Gener Comput Syst. 2019; 104 :187–200. doi: 10.1016/j.future.2019.10.043. [ CrossRef ] [ Google Scholar ]
  • Uehera D, Hayashi Y, Seki Y, Kakizaki S, Horiguchi N, Tojima H, Yamazaki Y, Sato K, Yasuda K, Yamada M, Uraoka T, Kasama K. Non invasive prediction of non alchlolic steatohepatitus in Japanses patiens with morbid obesity by artificial intelligence using rule extraction technology. World J Hepatol. 2018; 10 :934–943. doi: 10.4254/wjh.v10.i12.934. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ullah R, Khan S, Ishtiaq I, Shahzad S, Ali H, Bilal M. Cost effective and efficient screening of Alzheimer disease with Raman spectroscopy and machine learning algorithms. Photodiagn Photodyn Ther. 2020; 32 :101963. doi: 10.1016/j.pdpdt.2020.101963. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Uysal G, Ozturk M. Hippocampal atrophy based Alzheimer’s disease diagnosis via machine learning methods. J Neurosci Methods. 2020; 337 :1–9. doi: 10.1016/j.jneumeth.2020.108669. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Vasal S, Jain S, Verma A. COVID-AI: an artificial intelligence system to diagnose COVID 19 disease. J Eng Res Technol. 2020; 9 :1–6. [ Google Scholar ]
  • Wang Z, Zhang H, Kitai T. Artificial Intelligence in precision cardiovascular medicine. J Am Coll Cardiol. 2017; 69 :2657–2664. doi: 10.1016/j.jacc.2017.03.571. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wang Z, Chung JW, Jiang X, Cui Y, Wang M, Zheng A. Machine learning-based prediction system for chronic kidney disease using associative classification technique. Int J Eng Technol. 2018; 7 :1161–1167. doi: 10.14419/ijet.v7i4.36.25377. [ CrossRef ] [ Google Scholar ]
  • Woldargay A, Arsand E, Botsis T, Mamyinka L. Data driven glucose pattern classification and anomalies detection. J Med Internet Res. 2019; 21 :e11030. doi: 10.2196/11030. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Yadav D, Pal S. Prediction of thyroid disease using decision tree ensemble method. Hum Intell Syst Integr. 2020 doi: 10.1007/s42454-020-00006-y. [ CrossRef ] [ Google Scholar ]
  • Yang J, Min B, Kang J. A feasibilty study of LYSO-GAPD detector for DEXA applications. J Instrum. 2020 doi: 10.1088/1748-0221/15/05/P05017. [ CrossRef ] [ Google Scholar ]
  • Yue W, Wang Z, Chen H, Payne A, Liu X. Machine learning with applications in breast cancer diagnosis and prognosis. Designs. 2018; 2 :1–17. doi: 10.3390/designs2020013. [ CrossRef ] [ Google Scholar ]
  • Zaar O, Larson A, Polesie S, Saleh K, Olives A, et al. Evaluation of the diagnositic accuracy of an online artificial intelligence application for skin disease diagnosis. Acta Derm Venereol. 2020; 100 :1–6. doi: 10.2340/00015555-3624. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zebene A, Årsand E, Walderhaug S, Albers D, Mamykina L, Botsis T, Hartvigsen G. Data-driven modeling and prediction of blood glucose dynamics: Machine learning applications in type 1 diabetes. Artif Intell Med. 2019; 98 :109–134. doi: 10.1016/j.artmed.2019.07.007. [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zhang R, Simon G, Yu F. Advancing Alzheimer’s research: a review of big data promises. J Med Inform. 2017; 106 :48–56. doi: 10.1016/j.ijmedinf.2017.07.002. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zhang F, Zhang T, Tian C, Wu Y, Zhou W, Bi B, et al. Radiography of direct drive double shell targets with hard X-rays generated by a short pulse laser. Nucl Fusion. 2019 doi: 10.1088/1741-4326/aafe30. [ CrossRef ] [ Google Scholar ]
  • Zhou Z, Yang L, Gao J, Chen X. Structure–relaxivity relationships of magnetic nanoparticles for magnetic resonance imaging. Adv Mater. 2019; 31 :1804567. doi: 10.1002/adma.201804567. [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]

Searching for Systematic Reviews & Evidence Synthesis: AI tools in evidence synthesis

Introduction

A variety of AI tools can be used during the systematic review or evidence synthesis process. They may assist with developing a search strategy, locating relevant articles or resources, and with the screening, data extraction, and synthesis stages. They can also be used to draft plain language summaries.

The overall consensus is that AI tools can be very useful at different stages of a systematic or other evidence review, but that it is important to fully understand any biases and weaknesses they may bring to the process. New AI tools that previous research has not rigorously assessed should, in many cases, be used in conjunction with existing validated methods. It is also essential to consider ethical, copyright, and intellectual property issues, for example where the process involves uploading data or the full text of articles to an AI tool.

Below are some recently published articles on the topic:

  • Alshami, A.; Elsayed, M.; Ali, E.; Eltoukhy, A.E.E.; Zayed, T. Harnessing the Power of ChatGPT for Automating Systematic Review Process: Methodology, Case Study, Limitations, and Future Directions. Systems 2023, 11, 351. https://doi.org/10.3390/systems11070351 Explores the use of ChatGPT in (1) preparation of Boolean research terms and article collection, (2) abstract screening and article categorization, (3) full-text filtering and information extraction, and (4) content analysis to identify trends, challenges, gaps, and proposed solutions.
  • Blaizot, A.; Veettil, S.K.; Saidoung, P.; et al. Using artificial intelligence methods for systematic review in health sciences: A systematic review. Res Syn Meth. 2022; 13(3): 353-362. doi: 10.1002/jrsm.1553 The review delineated automated tools and platforms that employ artificial intelligence (AI) approaches and evaluated the reported benefits and challenges of using such methods. The authors report the use of Rayyan, Robot Reviewer, EPPI-Reviewer, K-means, SWIFT-Review, SWIFT-Active Screener, Abstrackr, WordStat, Qualitative Data Analysis (QDA) Miner, and NLP approaches, and assess the quality of the reviews that used them.
  • Janka H, Metzendorf M-I. High precision but variable recall – comparing the performance of five deduplication tools. JEAHIL [Internet]. 17 Mar. 2024 [cited 28 Mar. 2024];20(1):12-7. Available from: http://ojs.eahil.eu/ojs/index.php/JEAHIL/article/view/607
  • Kebede, MM, Le Cornet, C, Fortner, RT.  In-depth evaluation of machine learning methods for semi-automating article screening in a systematic review of mechanistic literature.   Res Syn Meth . 2023; 14(2): 156-172. doi: 10.1002/jrsm.1589 "We aimed to evaluate the performance of supervised machine learning algorithms in predicting articles relevant for full-text review in a systematic review." "Implementing machine learning approaches in title/abstract screening should be investigated further toward refining these tools and automating their implementation"  
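The supervised approach Kebede et al. evaluate amounts to training an ordinary text classifier on records the reviewers have already labelled, then using it to rank the rest. As a rough illustration only (a pure-Python naive Bayes over toy titles; the data and function names are invented, and a real pipeline would use an established library and full abstracts):

```python
from collections import Counter
from math import log

def tokenize(text):
    return text.lower().split()

def train(labelled):
    """Count words per class and class frequencies from screened records."""
    counts = {"include": Counter(), "exclude": Counter()}
    priors = Counter()
    for text, label in labelled:
        counts[label].update(tokenize(text))
        priors[label] += 1
    return counts, priors

def score(text, counts, priors):
    """Naive Bayes log-probabilities with add-one smoothing."""
    vocab = set(counts["include"]) | set(counts["exclude"])
    total = sum(priors.values())
    result = {}
    for label in counts:
        n = sum(counts[label].values())
        logp = log(priors[label] / total)
        for w in tokenize(text):
            logp += log((counts[label][w] + 1) / (n + len(vocab)))
        result[label] = logp
    return result

# Toy set of titles a human reviewer has already screened.
labelled = [
    ("machine learning for abstract screening in reviews", "include"),
    ("automated screening of titles with classifiers", "include"),
    ("surgical outcomes of knee replacement", "exclude"),
    ("dietary habits of adolescents survey", "exclude"),
]
counts, priors = train(labelled)
scores = score("classifiers for automated title screening", counts, priors)
predicted = max(scores, key=scores.get)   # -> "include"
```

The point of the sketch is the workflow, not the model: human labels go in, a ranking over the unscreened records comes out, and (as Kebede et al. stress) performance must be validated before relying on it.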

Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol 2022; 144: 22-42 https://www.jclinepi.com/article/S0895-4356(21)00402-9/fulltext "The current scoping review identified that LitSuggest, Rayyan, Abstractr, BIBOT, R software, RobotAnalyst, DistillerSR, ExaCT and NetMetaXL have potential to be used for the automation of systematic reviews. However, they are not without limitations. The review also identified other studies that employed algorithms that have not yet been developed into user friendly tools. Some of these algorithms showed high validity and reliability but their use is conditional on user knowledge of computer science and algorithms."

Khraisha Q, Put S, Kappenberg J, Warraitch A, Hadfield K.  Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages .  Res Syn Meth . 2024; 1-11. doi: 10.1002/jrsm.1715 "Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance."

Mahuli, S., Rai, A., Mahuli, A. et al. Application of ChatGPT in conducting systematic reviews and meta-analyses. Br Dent J 235, 90–92 (2023). https://doi.org/10.1038/s41415-023-6132-y Explores using ChatGPT for conducting risk of bias analysis and data extraction from a randomised controlled trial.

Ovelman, C., Kugley, S., Gartlehner, G., & Viswanathan, M. (2024). The use of a large language model to create plain language summaries of evidence reviews in healthcare: A feasibility study . Cochrane Evidence Synthesis and Methods, 2(2), e12041.  https://onlinelibrary.wiley.com/doi/abs/10.1002/cesm.12041 

Qureshi, R., Shaughnessy, D., Gill, K.A.R.  et al.   Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? .  Syst Rev   12 , 72 (2023). https://doi.org/10.1186/s13643-023-02243-z "Our experience from exploring the responses of ChatGPT suggest that while ChatGPT and LLMs show some promise for aiding in SR-related tasks, the technology is in its infancy and needs much development for such applications. Furthermore, we advise that great caution should be taken by non-content experts in using these tools due to much of the output appearing, at a high level, to be valid, while much is erroneous and in need of active vetting."

van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, et al. Artificial intelligence in systematic reviews: promising when appropriately used . BMJ Open 2023;13:e072254. doi: 10.1136/bmjopen-2023-072254  Suggests how to conduct a transparent and reliable systematic review using the AI tool ‘ASReview’ in the title and abstract screening.

An update on machine learning AI in systematic reviews

June 2023 webinar including a panel discussion exploring the use of machine learning AI in Covidence (screening & data extraction tool).

CLEAR Framework for Prompt Engineering

  • The CLEAR path: A framework for enhancing information literacy through prompt engineering. This article introduces the CLEAR Framework for Prompt Engineering, designed to optimize interactions with AI language models like ChatGPT. The framework encompasses five core principles—Concise, Logical, Explicit, Adaptive, and Reflective—that facilitate more effective AI-generated content evaluation and creation. Lo, L. S. (2023). The CLEAR path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship, 49(4), 102720.

Selection of AI tools used in Evidence Synthesis

  • Systematic Review Toolbox The Systematic Review Toolbox is an online catalogue of tools that support various tasks within the systematic review and wider evidence synthesis process.
  • Rayyan Free web-tool designed to speed up the process of screening and selecting studies
  • Abstrackr Aids in citation screening. Please note you will need to create a free account before accessing the tool.
  • DistillerSR An online application designed to automate all stages of systematic literature reviews. Priced packages available (please note we cannot offer support on using this system).
  • ExaCT Information Extraction system. The system is trained to find key information from scientific clinical trial publications, namely the descriptions of the trial's interventions, population, outcome measures, funding sources, and other critical characteristics. Please note you will need to request a free account.
  • RobotReviewer RobotReviewer is a machine learning system which aims to support evidence synthesis. The demonstration website allows users to upload RCT articles and see automatically determined information concerning the trial conduct (the 'PICO', study design, and whether there is a risk of bias).

Selection of tools to support the automation of systematic reviews (2022)

Khalil H, Ameen D, Zarnegar A. Tools to support the automation of systematic reviews: a scoping review. J Clin Epidemiol. 2022 Apr;144:22-42. doi: 10.1016/j.jclinepi.2021.12.005. Epub 2021 Dec 8. PMID: 34896236. https://www.sciencedirect.com/science/article/pii/S0895435621004029 [accessed 06-11-23].

Summary of validated tools available for each stage of the review

Screenshot of Table 4. Summary of validated tools available for each stage of the review

King’s guidance on generative AI for teaching, assessment and feedback

  • King’s guidance on generative AI for teaching, assessment and feedback This comprehensive guidance aims to support the adoption and integration of generative AI at different institutional levels: macro (university), meso (department, programme, module), and micro (individual lecturers, especially those with assessment roles).

Leveraging GPT-4 for Systematic Reviews

Recording of a 1-hour webinar exploring Artificial Intelligence (AI) and its potential impact on the process of systematic reviews (August 15th, 2023). Note: PICO Portal is a systematic review platform that leverages artificial intelligence to accelerate research and innovation.

Moderator: Dr Greg Martin. Presenters: Eitan Agai - PICO Portal Founder & AI Expert; Riaz Qureshi - U. of Colorado Anschutz Medical Campus; Kevin Kallmes - Chief Executive Officer, Cofounder; Jeff Johnson - Chief Design Officer.

PAIR (problem, AI, interaction, reflection) framework guidance

  • PAIR (problem, AI, interaction, reflection) framework guidance The framework is designed to be (i) simple, providing a straightforward structure to harness the potential of generative AI; (ii) customisable, allowing adaptation to align with specific learning objectives and student characteristics; and (iii) compatible, building on established pedagogical approaches such as problem/inquiry-based learning and active learning, making it suitable for different disciplines.

Artificial intelligence (AI) technologies in Cochrane

  • Web Clinic: Artificial intelligence (AI) technologies in Cochrane The session was delivered in May 2024 and you will find the videos from the webinar, together with the accompanying slides to download [PDF]. Recordings from other Methods Support Unit web clinics are available here.
    Part 1: How Cochrane currently uses machine learning: implementing innovative technology
    Part 2: What generative AI is, the opportunities it brings and the challenges regarding its safe use
    Part 3: Cochrane's focus on the responsible use of AI in systematic reviews
    Part 4: Questions and answers
  • Last Updated: May 14, 2024 1:15 PM
  • URL: https://libguides.kcl.ac.uk/systematicreview

© 2017 King's College London | Strand | London WC2R 2LS | England | United Kingdom | Tel +44 (0)20 7836 5454

  • Review article
  • Open access
  • Published: 31 October 2023

Role of AI chatbots in education: systematic literature review

  • Lasha Labadze ORCID: orcid.org/0000-0002-8884-2792,
  • Maya Grigolia ORCID: orcid.org/0000-0001-9043-7932 &
  • Lela Machaidze ORCID: orcid.org/0000-0001-5958-5662

International Journal of Educational Technology in Higher Education, volume 20, Article number: 56 (2023)


A Correction to this article was published on 15 April 2024


AI chatbots shook the world not long ago with their potential to revolutionize education systems in a myriad of ways. AI chatbots can provide immediate support by answering questions, offering explanations, and providing additional resources. Chatbots can also act as virtual teaching assistants, supporting educators through various means. In this paper, we try to understand the full benefits of AI chatbots in education, their opportunities, challenges, potential limitations, concerns, and prospects of using AI chatbots in educational settings. We conducted an extensive search across various academic databases, and after applying specific predefined criteria, we selected a final set of 67 relevant studies for review. The research findings emphasize the numerous benefits of integrating AI chatbots in education, as seen from both students' and educators' perspectives. We found that students primarily gain from AI-powered chatbots in three key areas: homework and study assistance, a personalized learning experience, and the development of various skills. For educators, the main advantages are the time-saving assistance and improved pedagogy. However, our research also emphasizes significant challenges and critical factors that educators need to handle diligently. These include concerns related to AI applications such as reliability, accuracy, and ethical considerations.

Introduction

The traditional education system faces several issues, including overcrowded classrooms, a lack of personalized attention for students, varying learning paces and styles, and the struggle to keep up with the fast-paced evolution of technology and information. As the educational landscape continues to evolve, the rise of AI-powered chatbots emerges as a promising solution to effectively address some of these issues. Some educational institutions are increasingly turning to AI-powered chatbots, recognizing their relevance, while others are more cautious and do not rush to adopt them in modern educational settings. Consequently, a substantial body of academic literature is dedicated to investigating the role of AI chatbots in education, their potential benefits, and threats.

AI-powered chatbots are designed to mimic human conversation using text or voice interaction, providing information in a conversational manner. Chatbots’ history dates back to the 1960s and over the decades chatbots have evolved significantly, driven by advancements in technology and the growing demand for automated communication systems. Created by Joseph Weizenbaum at MIT in 1966, ELIZA was one of the earliest chatbot programs (Weizenbaum, 1966 ). ELIZA could mimic human-like responses by reflecting user inputs as questions. Another early example of a chatbot was PARRY, implemented in 1972 by psychiatrist Kenneth Colby at Stanford University (Colby, 1981 ). PARRY was a chatbot designed to simulate a paranoid patient with schizophrenia. It engaged in text-based conversations and demonstrated the ability to exhibit delusional behavior, offering insights into natural language processing and AI. Developed by Richard Wallace in 1995, ALICE (Artificial Linguistic Internet Computer Entity) was an early example of a chatbot using natural language processing techniques that won the Loebner Prize Turing Test in 2000–2001 (Wallace, 1995 ), which challenged chatbots to convincingly simulate human-like conversation. Later in 2001 ActiveBuddy, Inc. developed the chatbot SmarterChild that operated on instant messaging platforms such as AOL Instant Messenger and MSN Messenger (Hoffer et al., 2001 ). SmarterChild was a chatbot that could carry on conversations with users about a variety of topics. It was also able to learn from its interactions with users, which made it more and more sophisticated over time. In 2011 Apple introduced Siri as a voice-activated personal assistant for its iPhone (Aron, 2011 ). Although not strictly a chatbot, Siri showcased the potential of conversational AI by understanding and responding to voice commands, performing tasks, and providing information. 
In the same year, IBM's Watson gained fame by defeating human champions in the quiz show Jeopardy (Lally & Fodor, 2011 ). It demonstrated the power of natural language processing and machine learning algorithms in understanding complex questions and providing accurate answers. More recently, in 2016, Facebook opened its Messenger platform for chatbot development, allowing businesses to create AI-powered conversational agents to interact with users. This led to an explosion of chatbots on the platform, enabling tasks like customer support, news delivery, and e-commerce (Holotescu, 2016 ). Google Duplex, introduced in May 2018, was able to make phone calls and carry out conversations on behalf of users. It showcased the potential of chatbots to handle complex, real-time interactions in a human-like manner (Dinh & Thai, 2018 ; Kietzmann et al., 2018 ).

More recently, more sophisticated and capable chatbots have amazed the world with their abilities. Among them, ChatGPT and Google Bard are two of the most prominent. ChatGPT is an artificial intelligence chatbot developed by OpenAI. It was first announced in November 2022 and is available to the general public. ChatGPT’s rival, the Google Bard chatbot developed by Google AI, was first announced in early 2023. Both Google Bard and ChatGPT are large language model chatbots that undergo training on extensive datasets of text and code. They possess the ability to generate text, create diverse creative content, and provide informative answers to questions, although their accuracy may not always be perfect. The key difference is that Google Bard is trained on a dataset that includes text from the internet, while ChatGPT is trained on a dataset that includes text from books and articles. This means that Google Bard is more likely to be up-to-date on current events, while ChatGPT is more likely to be accurate in its responses to factual questions (AlZubi et al., 2022; Rahaman et al., 2023; Rudolph et al., 2023).

Chatbots are now used across various sectors, including education. Most of the latest intelligent AI chatbots are web-based platforms that adapt to the behaviors of both instructors and learners, enhancing the educational experience (Chassignol et al., 2018 ; Devedzic, 2004 ; Kahraman et al., 2010 ; Peredo et al., 2011 ). AI chatbots have been applied in both instruction and learning within the education sector. Chatbots specialize in personalized tutoring, homework help, concept learning, standardized test preparation, discussion and collaboration, and mental health support. Some of the most popular AI-based tools /chatbots used in education are:

Bard, introduced in 2023, is a large language model chatbot created by Google AI. Its capabilities include generating text, language translation, producing various types of creative content, and providing informative responses to questions (Rudolph et al., 2023). Bard is still under development, but it has the potential to be a valuable tool for education.

ChatGPT, launched in 2022 by OpenAI, is a large language model chatbot that can generate text, produce diverse creative content, and deliver informative answers to questions (Dergaa et al., 2023 ; Khademi, 2023 ; Rudolph et al., 2023 ). However, as discussed in the results section of this paper, there are numerous concerns related to the use of ChatGPT in education, such as accuracy, reliability, ethical issues, etc.

Ada, launched in 2017, is a chatbot used to provide personalized tutoring to students. It can answer questions, provide feedback, and facilitate individualized learning (Kabiljo et al., 2020; Konecki et al., 2023). However, the Ada chatbot has limitations in understanding complex queries: it can misinterpret context and provide inaccurate responses.

Replika, launched in 2017, is an AI chatbot platform that is designed to be a friend and companion for students. It can listen to students' problems, offer advice, and help them feel less alone (Pentina et al., 2023 ; Xie & Pentina, 2022 ). However, given the personal nature of conversations with Replika, there are valid concerns regarding data privacy and security.

Socratic, launched in 2013, had the goal of creating a community that made learning accessible to all students. Currently, Socratic is an AI-powered educational platform that was acquired by Google in 2018. While not a chatbot per se, it has a chatbot-like interface and functionality designed to assist students in learning new concepts (Alsanousi et al., 2023; Moppel, 2018; St-Hilaire et al., 2022). As with other chatbots, there is a concern that students might rely excessively on Socratic for learning. This could lead to a diminished emphasis on critical thinking, as students may opt to use the platform to obtain answers without gaining a genuine understanding of the underlying concepts.

Habitica, launched in 2013, is used to help students develop good study habits. It gamifies the learning process, making it more fun and engaging for students. Students can use Habitica to manage their academic tasks, assignments, and study schedules. By turning their to-do list into a game-like experience, students are motivated to complete their tasks and build productive habits (Sales & Antunes, 2021 ; Zhang, 2023 ). However, the gamified nature of Habitica could inadvertently introduce distractions, especially for students who are easily drawn into the gaming aspect rather than focusing on their actual academic responsibilities.

Piazza, launched in 2009, is used to facilitate discussion and collaboration in educational settings, particularly in classrooms and academic institutions. It provides a space for students and instructors to engage in discussions, ask questions, and share information related to course content and assignments (Ruthotto et al., 2020; Wang et al., 2020). Because discussions on Piazza are user-generated, the quality and accuracy of responses can vary. This variability may result in situations where students do not receive accurate and helpful information.

We will likely see even more widespread adoption of chatbots in education in the years to come as technology advances further. Chatbots have enormous potential to improve teaching and learning. A large body of literature is devoted to exploring the role, challenges, and opportunities of chatbots in education. This paper gathers and synthesizes this vast amount of literature, providing a comprehensive understanding of the current research status concerning the influence of chatbots in education. By conducting a systematic review, we seek to identify common themes, trends, and patterns in the impact of chatbots on education and provide a holistic view of the research, enabling researchers, policymakers, and educators to make evidence-based decisions. One of the main objectives of this paper is to identify existing research gaps in the literature to pinpoint areas where further investigation is needed, enabling researchers to contribute to the knowledge base and guide future research efforts. Firstly, we aim to understand the primary advantages of incorporating AI chatbots in education, focusing on the perspectives of students. Secondly, we seek to explore the key advantages of integrating AI chatbots from the standpoint of educators. Lastly, we endeavor to comprehensively analyze the major concerns expressed by scholars regarding the integration of AI chatbots in educational settings. Corresponding research questions are formulated in the section below. Addressing these research questions, we aim to contribute valuable insights that shed light on the potential benefits and challenges associated with the utilization of AI chatbots in the field of education.

The paper follows a structured outline comprising several sections. Initially, we provide a summary of existing literature reviews. Subsequently, we delve into the methodology, encompassing aspects such as research questions, the search process, inclusion and exclusion criteria, as well as the data extraction strategy. Moving on, we present a comprehensive analysis of the results in the subsequent section. Finally, we conclude by addressing the limitations encountered during the study and offering insights into potential future research directions.

Summary of existing literature reviews

Drawing from extensive systematic literature reviews, as summarized in Table 1 , AI chatbots possess the potential to profoundly influence diverse aspects of education. They contribute to advancements in both teaching and learning processes. However, it is essential to address concerns regarding the irrational use of technology and the challenges that education systems encounter while striving to harness its capacity and make the best use of it.

It is evident that chatbot technology has a significant impact on overall learning outcomes. Specifically, chatbots have demonstrated significant enhancements in learning achievement, explicit reasoning, and knowledge retention. The integration of chatbots in education offers benefits such as immediate assistance, quick access to information, enhanced learning outcomes, and improved educational experiences. However, there have been contradictory findings related to critical thinking, learning engagement, and motivation. Deng and Yu (2023) found that chatbots had a significant and positive influence on numerous learning-related aspects, but that they do not significantly improve motivation among students. By contrast, Okonkwo and Ade-Ibijola (2021), as well as Wollny et al. (2021), find that using chatbots increases students’ motivation.

In terms of application, chatbots are primarily used in education to teach various subjects, including but not limited to mathematics, computer science, foreign languages, and engineering. While many chatbots follow predetermined conversational paths, some employ personalized learning approaches tailored to individual student needs, incorporating experiential and collaborative learning principles. Challenges in chatbot development include insufficient training datasets, a lack of emphasis on usability heuristics, ethical concerns, evaluation methods, user attitudes, programming complexities, and data integration issues.

Although existing systematic reviews have provided valuable insights into the impact of chatbot technology in education, it is essential to acknowledge that the field of chatbot development is continually evolving and requires timely, updated analysis to ensure that information and assessments reflect the most recent advancements, trends, and developments in chatbot technology. The latest chatbot models have showcased remarkable capabilities in natural language processing and generation. Additional research is required to investigate the role and potential of these newer chatbots in the field of education. Therefore, our paper focuses on reviewing and discussing the findings on these new-generation chatbots' use in education, including their benefits and challenges from the perspectives of both educators and students.

There are a few aspects that appear to be missing from the existing literature reviews: (a) The existing findings focus on the immediate impact of chatbot usage on learning outcomes. Further research may delve into the enduring impacts of integrating chatbots in education, aiming to assess their sustainability and the persistence of the observed advantages over the long term. (b) The studies primarily discuss the impact of chatbots on learning outcomes as a whole, without delving into the potential variations based on student characteristics. Investigating how different student groups, such as age, prior knowledge, and learning styles, interact with chatbot technology could provide valuable insights. (c) Although the studies highlight the enhancements in certain learning components, further investigation could explore the specific pedagogical strategies employed by chatbots to achieve these outcomes. Understanding the underlying mechanisms and instructional approaches utilized by chatbots can guide the development of more effective and targeted educational interventions. (d) While some studies touch upon user attitudes and acceptance, further research can delve deeper into the user experience of interacting with chatbots in educational settings. This includes exploring factors such as usability, perceived usefulness, satisfaction, and preferences of students and teachers when using chatbot technology.

Addressing these gaps in the existing literature would significantly benefit the field of education. Firstly, further research on the impacts of integrating chatbots can shed light on their long-term sustainability and how their advantages persist over time. This knowledge is crucial for educators and policymakers to make informed decisions about the continued integration of chatbots into educational systems. Secondly, understanding how different student characteristics interact with chatbot technology can help tailor educational interventions to individual needs, potentially optimizing the learning experience. Thirdly, exploring the specific pedagogical strategies employed by chatbots to enhance learning components can inform the development of more effective educational tools and methods. Lastly, a deeper exploration of the user experience with chatbots, encompassing usability, satisfaction, and preferences, can provide valuable insights into enhancing user engagement and overall satisfaction, thus guiding the future design and implementation of chatbot technology in education.

Methodology

A systematic review follows a rigorous methodology, including predefined search criteria and systematic screening processes, to ensure the inclusion of relevant studies. This comprehensive approach ensures that a wide range of research is considered, minimizing the risk of bias and providing a comprehensive overview of the impact of AI in education. Firstly, we define the research questions and corresponding search strategies, and then we filter the search results based on predefined inclusion and exclusion criteria. Secondly, we study the selected articles and synthesize results, and lastly, we report and discuss the findings. To improve the clarity of the discussion section, we employed a large language model (LLM) for stylistic suggestions.

Research questions

Considering the limitations observed in previous literature reviews, we have developed three research questions for further investigation:

What are the key advantages of incorporating AI chatbots in education from the viewpoint of students?

What are the key advantages of integrating AI chatbots in education from the viewpoint of educators?

What are the main concerns raised by scholars regarding the integration of AI chatbots in education?

Exploring the literature that focuses on these research questions, with specific attention to contemporary AI-powered chatbots, can provide a deeper understanding of the impact, effectiveness, and potential limitations of chatbot technology in education while guiding its future development and implementation. This paper will help to better understand how educational chatbots can be effectively utilized to enhance education and address the specific needs and challenges of students and educators.

Search process

The search for the relevant literature was conducted in the following databases: ACM Digital Library, Scopus, IEEE Xplore, and Google Scholar. The search string was created using Boolean operators, and it was structured as follows: (“Education” or “Learning” or “Teaching”) and (“Chatbot” or “Artificial intelligence” or “AI” or “ChatGPT”). Initially, the search yielded a total of 563 papers from all four databases. Search filters were applied based on predefined inclusion and exclusion criteria, followed by a rigorous data extraction strategy as explained below.
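For illustration only (this snippet is not part of the original study), a Boolean query of the shape described above can be assembled and sanity-checked programmatically before being adapted to each database's own advanced-search syntax; the helper function here is a hypothetical sketch:

```python
# Sketch: build the generic Boolean search string used in the review.
# Each database has its own field syntax (e.g., title/abstract/keyword
# tags), so this generic string is only a starting template.
education_terms = ["Education", "Learning", "Teaching"]
chatbot_terms = ["Chatbot", "Artificial intelligence", "AI", "ChatGPT"]

def or_group(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = f"{or_group(education_terms)} AND {or_group(chatbot_terms)}"
print(query)
# ("Education" OR "Learning" OR "Teaching") AND ("Chatbot" OR "Artificial intelligence" OR "AI" OR "ChatGPT")
```

In practice the same two OR-groups would be rewritten once per database, since field codes and operator casing differ between ACM Digital Library, Scopus, IEEE Xplore, and Google Scholar.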

Inclusion and exclusion criteria

In our review process, we carefully adhered to the inclusion and exclusion criteria specified in Table 2 . Criteria were determined to ensure the studies chosen are relevant to the research question (content, timeline) and maintain a certain level of quality (literature type) and consistency (language, subject area).
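As a hypothetical sketch of how such criteria translate into a screening filter (the field names and threshold values below are illustrative assumptions, not the actual criteria from Table 2):

```python
# Sketch: applying inclusion/exclusion criteria to candidate records.
# The year cutoff, record fields, and accepted types are assumptions
# for illustration; the review's real criteria are listed in Table 2.
def is_included(record):
    return (
        record["language"] == "English"                     # language criterion
        and record["year"] >= 2018                          # timeline criterion
        and record["type"] in {"journal article", "conference paper"}  # literature type
        and record["on_topic"]                              # content relevance
    )

records = [
    {"language": "English", "year": 2023, "type": "journal article", "on_topic": True},
    {"language": "English", "year": 2015, "type": "journal article", "on_topic": True},
    {"language": "German", "year": 2022, "type": "preprint", "on_topic": True},
]
included = [r for r in records if is_included(r)]
print(len(included))  # 1
```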

Data extraction strategy

All three authors collaborated to select the articles, ensuring consistency and reliability. Each article was reviewed by at least two co-authors. The article selection process involved the following stages: Initially, the authors reviewed the studies' metadata, titles, abstracts, and keywords and eliminated articles that were not relevant to the research questions. This reduced the number of studies to 139. Next, the authors evaluated the quality of the studies by assessing research methodology, sample size, research design, and clarity of objectives, further refining the selection to 85 articles. Finally, the authors thoroughly read the entire content of the articles. Studies offering limited empirical evidence related to our research questions were excluded. This final step reduced the number of papers to 67. Figure 1 presents the article selection process.
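The staged reduction described above (563 to 139 to 85 to 67) forms a simple screening funnel; this small sketch, added purely as an illustration, recomputes the per-stage exclusions and the overall retention rate from the counts reported in the text:

```python
# Sketch: the article-selection funnel reported above.
stages = [
    ("Database search", 563),
    ("Title/abstract/keyword screening", 139),
    ("Quality assessment", 85),
    ("Full-text review", 67),
]

# Compare each stage with the previous one to see how many records it removed.
for (prev_name, prev_n), (name, n) in zip(stages, stages[1:]):
    print(f"{name}: {prev_n} -> {n} (excluded {prev_n - n})")

retention = stages[-1][1] / stages[0][1]
print(f"Overall retention: {retention:.1%}")  # 11.9%
```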

Figure 1: Flow diagram of selecting studies

Results

In this section, we present the results of the reviewed articles, focusing on our research questions, particularly with regard to ChatGPT. ChatGPT, as one of the latest AI-powered chatbots, has gained significant attention for its potential applications in education. Within just eight months of its launch in 2022, it has already amassed over 100 million users, setting new records for user and traffic growth. ChatGPT stands out among AI-powered chatbots used in education due to its advanced natural language processing capabilities and sophisticated language generation, enabling more natural and human-like conversations. It excels at capturing and retaining contextual information throughout interactions, leading to more coherent and contextually relevant conversations. Unlike some educational chatbots that follow predetermined paths or rely on predefined scripts, ChatGPT is capable of engaging in open-ended dialogue and adapting to various user inputs. Its adaptability allows it to write articles, stories, and poems, provide summaries, accommodate different perspectives, and even write and debug computer code, making it a valuable tool in educational settings (Baidoo-Anu & Owusu Ansah, 2023; Tate et al., 2023; Williams, 2023).

Advantages for students

Research question 1: What are the key advantages of incorporating AI chatbots in education from the viewpoint of students?

The integration of chatbots and virtual assistants into educational settings has the potential to transform support services, improve accessibility, and contribute to more efficient and effective learning environments (Chen et al., 2023; Essel et al., 2022). AI tools have the potential to improve student success and engagement, particularly among those from disadvantaged backgrounds (Sullivan et al., 2023). However, the existing literature highlights an important gap in the discussion from a student’s standpoint. The few existing research studies addressing the student’s perspective on using ChatGPT in the learning process indicate that students have a positive view of ChatGPT, appreciate its capabilities, and find it helpful for their studies and work (Kasneci et al., 2023; Shoufan, 2023). Students acknowledge that ChatGPT's answers are not always accurate and emphasize the need for solid background knowledge to utilize it effectively, recognizing that it cannot replace human intelligence (Shoufan, 2023). The most important benefits commonly identified by scholars are:

Homework and Study Assistance. AI-powered chatbots can provide detailed feedback on student assignments, highlighting areas of improvement and offering suggestions for further learning (Celik et al., 2022). For example, ChatGPT can act as a helpful study companion, providing explanations and clarifications on various subjects. It can assist with homework questions, offering step-by-step solutions and guiding students through complex problems (Crawford et al., 2023; Fauzi et al., 2023; Lo, 2023; Qadir, 2023; Shidiq, 2023). In Sedaghat's (2023) experiment, ChatGPT performed similarly to third-year medical students on medical exams and could write quite impressive essays. Students can also use ChatGPT to quiz themselves on various subjects, reinforcing their knowledge and preparing for exams (Choi et al., 2023; Eysenbach, 2023; Sevgi et al., 2023; Thurzo et al., 2023).

Flexible personalized learning. AI-powered chatbots in general are now able to provide individualized guidance and feedback to students, helping them navigate through challenging concepts and improve their understanding. These systems can adapt their teaching strategies to suit each student's unique needs (Fariani et al., 2023; Kikalishvili, 2023; Schiff, 2021). Students can access ChatGPT anytime, making it convenient. According to Kasneci et al. (2023), ChatGPT's interactive and conversational nature can enhance students' engagement and motivation, making learning more enjoyable and personalized. Khan et al. (2023) examine the impact of ChatGPT on medical education and clinical management, highlighting its ability to offer students tailored learning opportunities.

Skills development. It can aid in the enhancement of writing skills (by offering suggestions for syntactic and grammatical corrections) (Kaharuddin, 2021 ), foster problem-solving abilities (by providing step-by-step solutions) (Benvenuti et al., 2023 ), and facilitate group discussions and debates (by furnishing discussion structures and providing real-time feedback) (Ruthotto et al., 2020 ; Wang et al., 2020 ).

It is important to note that some papers raise concerns about excessive reliance on AI-generated information, which could negatively affect students' critical thinking and problem-solving skills (Kasneci et al., 2023). For instance, if students consistently receive solutions or information effortlessly through AI assistance, they might not engage deeply enough to understand the topic.

Advantages for educators

Research question 2: What are the key advantages of integrating AI chatbots in education from the viewpoint of educators?

With the current capabilities of AI and its future potential, AI-powered chatbots, like ChatGPT, can have a significant impact on existing instructional practices. Major benefits from educators’ viewpoint identified in the literature are:

Time-saving assistance. The administrative support capabilities of AI chatbots can help educators save time on routine tasks, including scheduling, grading, and providing information to students, allowing them to allocate more time to instructional planning and student engagement. For example, ChatGPT can successfully generate various types of questions and answer keys across disciplines. However, educators should critically evaluate and customize the output to suit their unique teaching contexts. The expertise, experience, and comprehension of the teacher remain essential for making informed pedagogical choices, as AI is not yet capable of replacing the role of a science teacher (Cooper, 2023).
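The question-and-answer-key generation described above can be sketched as a simple prompt-construction step. The function name, parameters, and prompt wording below are illustrative assumptions rather than the authors' method, and the actual call to a chatbot API is deliberately omitted:

```python
# Hypothetical sketch: composing a prompt that asks an AI chatbot to
# generate quiz questions plus an answer key for a given topic.
# Only the prompt builder is shown; sending it to a model is omitted.

def build_quiz_prompt(topic: str, n_questions: int = 5,
                      style: str = "multiple-choice") -> str:
    """Compose an instruction requesting questions and an answer key."""
    return (
        f"Generate {n_questions} {style} questions on the topic "
        f"'{topic}' suitable for an undergraduate course. "
        "After the questions, provide an answer key with a "
        "one-sentence rationale for each answer."
    )

prompt = build_quiz_prompt("photosynthesis", n_questions=3)
print(prompt)
```

In practice, an educator would send the resulting prompt to a chatbot and then review and adapt the generated questions, in line with the critical evaluation the reviewed literature recommends.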

Improved pedagogy. Educators can leverage AI chatbots to augment their instruction and provide personalized support. According to Herft (2023), there are various ways in which teachers can use ChatGPT to enhance their pedagogical approaches and assessment methods. For instance, educators can have ChatGPT generate open-ended question prompts that align precisely with the targeted learning objectives and success criteria of an instructional unit. By doing so, teachers can tailor educational content to the distinct needs, interests, and learning preferences of each student, offering personalized learning materials and activities (Al Ka’bi, 2023; Fariani et al., 2023).

Concerns raised by scholars

Research question 3: What are the main concerns raised by scholars regarding the integration of AI chatbots in education?

Scholars' opinions on using AI in this regard are varied and diverse. Some see AI chatbots as the future of teaching and learning, while others perceive them as a potential threat. The main arguments of skeptical scholars are threefold:

Reliability and accuracy. AI chatbots may provide biased responses or inaccurate information (Kasneci et al., 2023; Sedaghat, 2023). If a chatbot provides incorrect information or guidance, it could mislead students and hinder their learning progress. According to Sevgi et al. (2023), although ChatGPT produced captivating and thought-provoking answers, it should not be regarded as a reliable information source. This point is especially important in medical education, where it is crucial to guarantee the reliability and accuracy of the information chatbots provide (Khan et al., 2023). If the training data used to develop an AI chatbot contains biases, the chatbot may inadvertently reproduce those biases in its responses, potentially including skewed perspectives, stereotypes, discriminatory language, or biased recommendations. This is of particular concern in an educational context.

Fair assessments. One of the challenges educators face with the integration of chatbots in education is the difficulty of assessing students' work, particularly written assignments or responses. AI-generated text detection, while continually improving, is not yet foolproof and can produce false negatives or positives. This creates uncertainty and can undermine the credibility of the assessment process. Educators may struggle to discern whether responses are genuinely student-generated or provided by an AI, affecting the accuracy of grading and feedback. This raises concerns about academic integrity and fair assessment practices (AlAfnan et al., 2023; Kung et al., 2023).

Ethical issues. The integration of AI chatbots in education raises several ethical implications, particularly concerning data privacy, security, and responsible AI use. AI chatbots interact with students and gather data during conversations, necessitating the establishment of clear guidelines and safeguards. For example, medical education frequently involves sensitive and intimate subjects, including patient confidentiality and ethical considerations within the medical field, so the ethical and proper use of chatbots holds particular importance (Masters, 2023; Miao & Ahn, 2023; Sedaghat, 2023; Thurzo et al., 2023).

For these and other reasons, ChatGPT is banned in countries with strict internet censorship policies, such as North Korea, Iran, Syria, Russia, and China. Several nations prohibited the application due to privacy concerns, while North Korea, China, and Russia in particular contended that the U.S. might employ ChatGPT to disseminate misinformation. Conversely, OpenAI restricts access to ChatGPT in certain countries, such as Afghanistan and Iran, citing geopolitical constraints, legal considerations, data protection regulations, and internet accessibility as the basis for this decision. Italy became the first Western country to ban ChatGPT (Browne, 2023) after the country's data protection authority called on OpenAI to stop processing Italian residents' data, claiming that ChatGPT did not comply with the European General Data Protection Regulation. However, after OpenAI clarified the data privacy issues with the Italian data protection authority, ChatGPT returned to Italy. To prevent cheating on school homework and assignments, ChatGPT was also blocked on all New York City school devices and networks so that students and teachers could no longer access it (Elsen-Rooney, 2023; Li et al., 2023). These examples highlight a lack of readiness to embrace recently developed AI tools and the numerous concerns that must be addressed before broader acceptance and understanding can be achieved.

To summarize, incorporating AI chatbots in education brings personalized learning for students and time efficiency for educators. Students benefit from flexible study aid and skill development. However, concerns arise regarding the accuracy of information, fair assessment practices, and ethical considerations. Striking a balance between these advantages and concerns is crucial for responsible integration in education.

The integration of artificial intelligence (AI) chatbots in education has the potential to revolutionize how students learn and interact with information. One significant advantage of AI chatbots in education is their ability to provide personalized and engaging learning experiences. By tailoring their interactions to individual students’ needs and preferences, chatbots offer customized feedback and instructional support, ultimately enhancing student engagement and information retention. However, there are potential difficulties in fully replicating the human educator experience with chatbots. While they can provide customized instruction, chatbots may not match human instructors' emotional support and mentorship. Understanding the importance of human engagement and expertise in education is crucial. A teacher's role encompasses more than just sharing knowledge. They offer students guidance, motivation, and emotional support—elements that AI cannot completely replicate.

We find that AI chatbots may benefit students as well as educators in various ways; however, significant concerns must be addressed in order to harness their capabilities effectively. Specifically, educational institutions should implement preventative measures. These include (a) creating awareness among students, focusing on topics such as digital inequality, the reliability and accuracy of AI chatbots, and associated ethical considerations; and (b) offering regular professional development training for educators. Such training should initially focus on enabling educators to integrate diverse in-class activities and assignments into the curriculum, aimed at nurturing students' critical thinking and problem-solving skills while ensuring fair performance evaluation. It should also cover the capabilities and potential educational uses of AI chatbots, along with best practices for effectively integrating these tools into teaching methods.

As technology continues to advance, AI-powered educational chatbots are expected to become more sophisticated, providing accurate information and offering even more individualized and engaging learning experiences. They are anticipated to engage with humans using voice recognition, comprehend human emotions, and navigate social interactions. Consequently, their potential impact on future education is substantial, extending to activities such as establishing educational objectives, developing teaching methods and curricula, and conducting assessments (Latif et al., 2023). Considering Microsoft's extensive efforts to integrate ChatGPT into its products (Rudolph et al., 2023; Warren, 2023), it is likely that ChatGPT will become widespread soon. Educational institutions may need to rapidly adapt their policies and practices to guide and support students in using educational chatbots in a safe and constructive manner (Baidoo-Anu & Owusu Ansah, 2023). Educators and researchers must continue to explore the potential benefits and limitations of this technology to fully realize its potential.

The widespread adoption of chatbots and their increasing accessibility has sparked contrasting reactions across different sectors, leading to considerable confusion in the field of education. Among educators and learners, there is a notable trend—while learners are excited about chatbot integration, educators’ perceptions are particularly critical. However, this situation presents a unique opportunity, accompanied by unprecedented challenges. Consequently, it has prompted a significant surge in research, aiming to explore the impact of chatbots on education.

In this article, we present a systematic review of the latest literature with the objective of identifying the potential advantages and challenges associated with integrating chatbots in education. Through this review, we have been able to highlight critical gaps in the existing research that warrant further in-depth investigation. Addressing these gaps will be instrumental in optimizing the implementation of chatbots and harnessing their full potential in the educational landscape, thereby benefiting both educators and students alike. Further research will play a vital role in comprehending the long-term impact, variations based on student characteristics, pedagogical strategies, and the user experience associated with integrating chatbots in education.

From the viewpoint of educators, integrating AI chatbots in education brings significant advantages. AI chatbots provide time-saving assistance by handling routine administrative tasks such as scheduling, grading, and providing information to students, allowing educators to focus more on instructional planning and student engagement. Educators can improve their pedagogy by leveraging AI chatbots to augment their instruction and offer personalized support to students. By customizing educational content and generating prompts for open-ended questions aligned with specific learning objectives, teachers can cater to individual student needs and enhance the learning experience. Additionally, educators can use AI chatbots to create tailored learning materials and activities to accommodate students' unique interests and learning styles.

Incorporating AI chatbots in education offers several key advantages from students' perspectives. AI-powered chatbots provide valuable homework and study assistance by offering detailed feedback on assignments, guiding students through complex problems, and providing step-by-step solutions. They also act as study companions, offering explanations and clarifications on various subjects. They can be used for self-quizzing to reinforce knowledge and prepare for exams. Furthermore, these chatbots facilitate flexible personalized learning, tailoring their teaching strategies to suit each student's unique needs. Their interactive and conversational nature enhances student engagement and motivation, making learning more enjoyable and personalized. Also, AI chatbots contribute to skills development by suggesting syntactic and grammatical corrections to enhance writing skills, providing problem-solving guidance, and facilitating group discussions and debates with real-time feedback. Overall, students appreciate the capabilities of AI chatbots and find them helpful for their studies and skill development, recognizing that they complement human intelligence rather than replace it.

The presence of AI chatbots has also brought considerable skepticism among scholars. While some see transformative potential, concerns loom over reliability, accuracy, fair assessment, and ethical dilemmas. Fears of misinformation, compromised academic integrity, and data privacy issues cast a shadow over the implementation of AI chatbots. Based on the findings of the reviewed papers, it is commonly concluded that some of the challenges related to the use of AI chatbots in education can be addressed by introducing preventative measures. More specifically, educational institutions must prioritize creating awareness among students about the risks associated with AI chatbots, focusing on essential aspects like digital inequality and ethical considerations. Simultaneously, investing in the continuous development of educators through targeted training is key. Empowering educators to effectively integrate AI chatbots into their teaching methods, fostering critical thinking and fair evaluation, will pave the way for a more effective and engaging educational experience.

The implications of the research findings for policymakers and researchers are extensive, shaping the future integration of chatbots in education. The findings emphasize the need to establish guidelines and regulations ensuring the ethical development and deployment of AI chatbots in education. Policies should specifically focus on data privacy, accuracy, and transparency to mitigate potential risks and build trust within the educational community. Additionally, investing in research and development to enhance AI chatbot capabilities and address identified concerns is crucial for a seamless integration into educational systems. Researchers are strongly encouraged to fill the identified research gaps through rigorous studies that delve deeper into the impact of chatbots on education. Exploring the long-term effects, optimal integration strategies, and addressing ethical considerations should take the forefront in research initiatives.

Availability of data and materials

The data and materials used in this paper, including the comprehensive list of included studies and the relevant data extracted from them, are available from the corresponding author upon request.

Change history

15 April 2024

A Correction to this paper has been published: https://doi.org/10.1186/s41239-024-00461-6

Al Ka’bi, A. (2023). Proposed artificial intelligence algorithm and deep learning techniques for development of higher education. International Journal of Intelligent Networks, 4 , 68–73.


AlAfnan, M. A., Dishari, S., Jovic, M., & Lomidze, K. (2023). Chatgpt as an educational tool: Opportunities, challenges, and recommendations for communication, business writing, and composition courses. Journal of Artificial Intelligence and Technology, 3 (2), 60–68.


Alsanousi, B., Albesher, A. S., Do, H., & Ludi, S. (2023). Investigating the user experience and evaluating usability issues in ai-enabled learning mobile apps: An analysis of user reviews. International Journal of Advanced Computer Science and Applications , 14(6).

AlZubi, S., Mughaid, A., Quiam, F., & Hendawi, S. (2022). Exploring the Capabilities and Limitations of ChatGPT and Alternative Big Language Models. Artificial Intelligence and Applications .

Aron, J. (2011). How innovative is Apple’s new voice assistant, Siri? New Scientist, 212 (2836), 24.

Baidoo-Anu, D., & Owusu Ansah, L. (2023). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Available at SSRN 4337484 .

Benvenuti, M., Cangelosi, A., Weinberger, A., Mazzoni, E., Benassi, M., Barbaresi, M., & Orsoni, M. (2023). Artificial intelligence and human behavioral development: A perspective on new skills and competencies acquisition for the educational context. Computers in Human Behavior, 148 , 107903.

Browne, R. (2023). Italy became the first Western country to ban ChatGPT. Here’s what other countries are doing . CNBC (Apr. 4, 2023).

Celik, I., Dindar, M., Muukkonen, H., & Järvelä, S. (2022). The promises and challenges of artificial intelligence for teachers: A systematic review of research. TechTrends, 66 (4), 616–630.

Chassignol, M., Khoroshavin, A., Klimova, A., & Bilyatdinova, A. (2018). Artificial Intelligence trends in education: A narrative overview. Procedia Computer Science, 136 , 16–24.

Chen, L., Chen, P., & Lin, Z. (2020). Artificial intelligence in education: A review. IEEE Access, 8 , 75264–75278.

Chen, Y., Jensen, S., Albert, L. J., Gupta, S., & Lee, T. (2023). Artificial intelligence (AI) student assistants in the classroom: Designing chatbots to support student success. Information Systems Frontiers, 25 (1), 161–182.

Choi, J. H., Hickman, K. E., Monahan, A., & Schwarcz, D. (2023). Chatgpt goes to law school. Available at SSRN .

Colby, K. M. (1981). PARRYing. Behavioral and Brain Sciences, 4 (4), 550–560.

Cooper, G. (2023). Examining science education in chatgpt: An exploratory study of generative artificial intelligence. Journal of Science Education and Technology, 32 (3), 444–452.

Crawford, J., Cowling, M., & Allen, K.-A. (2023). Leadership is needed for ethical ChatGPT: Character, assessment, and learning using artificial intelligence (AI). Journal of University Teaching and Learning Practice, 20 (3), 02.

Crompton, H., & Burke, D. (2023). Artificial intelligence in higher education: The state of the field. International Journal of Educational Technology in Higher Education, 20 (1), 1–22.

Deng, X., & Yu, Z. (2023). A meta-analysis and systematic review of the effect of chatbot technology use in sustainable education. Sustainability, 15 (4), 2940.

Dergaa, I., Chamari, K., Zmijewski, P., & Saad, H. B. (2023). From human writing to artificial intelligence generated text: Examining the prospects and potential threats of ChatGPT in academic writing. Biology of Sport, 40 (2), 615–622.

Devedzic, V. (2004). Web intelligence and artificial intelligence in education. Journal of Educational Technology and Society, 7 (4), 29–39.

Dinh, T. N., & Thai, M. T. (2018). AI and blockchain: A disruptive integration. Computer, 51 (9), 48–53.

Elsen-Rooney, M. (2023). NYC education department blocks ChatGPT on school devices, networks. Retrieved Jan , 25 , 2023.

Essel, H. B., Vlachopoulos, D., Tachie-Menson, A., Johnson, E. E., & Baah, P. K. (2022). The impact of a virtual teaching assistant (chatbot) on students’ learning in Ghanaian higher education. International Journal of Educational Technology in Higher Education, 19 (1), 1–19.

Eysenbach, G. (2023). The role of ChatGPT, generative language models, and artificial intelligence in medical education: A conversation with ChatGPT and a call for papers. JMIR Medical Education, 9 (1), e46885.

Fariani, R. I., Junus, K., & Santoso, H. B. (2023). A systematic literature review on personalised learning in the higher education context. Technology, Knowledge and Learning, 28 (2), 449–476.

Fauzi, F., Tuhuteru, L., Sampe, F., Ausat, A. M. A., & Hatta, H. R. (2023). Analysing the role of ChatGPT in improving student productivity in higher education. Journal on Education, 5 (4), 14886–14891.

Herft, A. (2023). A Teacher’s Prompt Guide to ChatGPT aligned with’What Works Best’ .

Hoffer, R., Kay, T., Levitan, P., & Klein, S. (2001). Smarterchild . ActiveBuddy.

Holotescu, C. (2016). MOOCBuddy: A Chatbot for personalized learning with MOOCs. RoCHI , 91–94.

Kabiljo, M., Vidas-Bubanja, M., Matic, R., & Zivkovic, M. (2020). Education system in the republic of serbia under COVID-19 conditions: Chatbot-acadimic digital assistant of the belgrade business and arts academy of applied studies. Knowledge-International Journal, 43 (1), 25–30.

Kaharuddin, A. (2021). Assessing the effect of using artificial intelligence on the writing skill of Indonesian learners of English. Linguistics and Culture Review, 5 (1), 288.

Kahraman, H. T., Sagiroglu, S., & Colak, I. (2010). Development of adaptive and intelligent web-based educational systems. In 2010 4th International Conference on Application of Information and Communication Technologies , 1–5.

Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., & Hüllermeier, E. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103 , 102274.

Khademi, A. (2023). Can ChatGPT and bard generate aligned assessment items? A reliability analysis against human performance. ArXiv Preprint ArXiv:2304.05372.

Khan, R. A., Jawaid, M., Khan, A. R., & Sajjad, M. (2023). ChatGPT-Reshaping medical education and clinical management. Pakistan Journal of Medical Sciences, 39 (2), 605.

Kietzmann, J., Paschen, J., & Treen, E. (2018). Artificial intelligence in advertising: How marketers can leverage artificial intelligence along the consumer journey. Journal of Advertising Research, 58 (3), 263–267.

Kikalishvili, S. (2023). Unlocking the potential of GPT-3 in education: Opportunities, limitations, and recommendations for effective integration. Interactive Learning Environments , 1–13.

Konecki, M., Konecki, M., & Biškupić, I. (2023). Using artificial intelligence in higher education. In Proceedings of the 15th International Conference on Computer Supported Education .

Krstić, L., Aleksić, V., & Krstić, M. (2022). Artificial intelligence in education: A review .

Kuhail, M. A., Alturki, N., Alramlawi, S., & Alhejori, K. (2023). Interacting with educational chatbots: A systematic review. Education and Information Technologies, 28 (1), 973–1018.

Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., et al. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digital Health, 2 (2), e0000198.

Lally, A., & Fodor, P. (2011). Natural language processing with Prolog in the IBM Watson system. The Association for Logic Programming (ALP) Newsletter, 9 , 2011.

Latif, E., Mai, G., Nyaaba, M., Wu, X., Liu, N., Lu, G., ... & Zhai, X. (2023). Artificial general intelligence (AGI) for education. arXiv preprint arXiv:2304.12479.

Li, L., Ma, Z., Fan, L., Lee, S., Yu, H., & Hemphill, L. (2023). ChatGPT in education: A discourse analysis of worries and concerns on social media. ArXiv Preprint ArXiv:2305.02201.

Lo, C. K. (2023). What is the impact of ChatGPT on education? A rapid review of the literature. Education Sciences, 13 (4), 410.

Masters, K. (2023). Ethical use of artificial intelligence in health professions education: AMEE Guide No. 158. Medical Teacher , 45 (6), 574–584.

Miao, H., & Ahn, H. (2023). Impact of ChatGPT on interdisciplinary nursing education and research. Asian/pacific Island Nursing Journal, 7 (1), e48136.

Moppel, J. (2018). Socratic chatbot . University Of Tartu, Institute of Computer Science, Bachelor’s Thesis.

Okonkwo, C. W., & Ade-Ibijola, A. (2021). Chatbots applications in education: A systematic review. Computers and Education: Artificial Intelligence, 2 , 100033.

Pentina, I., Hancock, T., & Xie, T. (2023). Exploring relationship development with social chatbots: A mixed-method study of replika. Computers in Human Behavior, 140 , 107600.

Peredo, R., Canales, A., Menchaca, A., & Peredo, I. (2011). Intelligent Web-based education system for adaptive learning. Expert Systems with Applications, 38 (12), 14690–14702.

Pérez, J. Q., Daradoumis, T., & Puig, J. M. M. (2020). Rediscovering the use of chatbots in education: A systematic literature review. Computer Applications in Engineering Education, 28 (6), 1549–1565.

Qadir, J. (2023). Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. IEEE Global Engineering Education Conference (EDUCON), 2023 , 1–9.

Rahaman, M. S., Ahsan, M. M., Anjum, N., Rahman, M. M., & Rahman, M. N. (2023). The AI race is on! Google’s Bard and OpenAI’s ChatGPT head to head: An opinion article. Mizanur and Rahman, Md Nafizur, The AI Race Is On .

Rudolph, J., Tan, S., & Tan, S. (2023). War of the chatbots: Bard, Bing Chat, ChatGPT, Ernie and beyond. The new AI gold rush and its impact on higher education. Journal of Applied Learning and Teaching, 6 (1).

Ruthotto, I., Kreth, Q., Stevens, J., Trively, C., & Melkers, J. (2020). Lurking and participation in the virtual classroom: The effects of gender, race, and age among graduate students in computer science. Computers & Education, 151 , 103854.

de Sales, A. B., & Antunes, J. G. (2021). Evaluation of educational games usage satisfaction. 2021 16th Iberian Conference on Information Systems and Technologies (CISTI) , 1–6.

Schiff, D. (2021). Out of the laboratory and into the classroom: the future of artificial intelligence in education. AI & Society, 36 (1), 331–348.

Sedaghat, S. (2023). Success through simplicity: What other artificial intelligence applications in medicine should learn from history and ChatGPT. Annals of Biomedical Engineering , 1–2.

Sevgi, U. T., Erol, G., Doğruel, Y., Sönmez, O. F., Tubbs, R. S., & Güngor, A. (2023). The role of an open artificial intelligence platform in modern neurosurgical education: A preliminary study. Neurosurgical Review, 46 (1), 86.

Shidiq, M. (2023). The use of artificial intelligence-based chat-gpt and its challenges for the world of education; from the viewpoint of the development of creative writing skills. Proceeding of International Conference on Education, Society and Humanity, 1 (1), 353–357.

Shoufan, A. (2023). Exploring Students’ Perceptions of CHATGPT: Thematic Analysis and Follow-Up Survey. IEEE Access .

St-Hilaire, F., Vu, D. D., Frau, A., Burns, N., Faraji, F., Potochny, J., Robert, S., Roussel, A., Zheng, S., & Glazier, T. (2022). A new era: Intelligent tutoring systems will transform online learning for millions. ArXiv Preprint ArXiv:2203.03724.

Sullivan, M., Kelly, A., & McLaughlan, P. (2023). ChatGPT in higher education: Considerations for academic integrity and student learning .

Tahiru, F. (2021). AI in education: A systematic literature review. Journal of Cases on Information Technology (JCIT), 23 (1), 1–20.

Tate, T., Doroudi, S., Ritchie, D., & Xu, Y. (2023). Educational research and AI-generated writing: Confronting the coming tsunami .

Thurzo, A., Strunga, M., Urban, R., Surovková, J., & Afrashtehfar, K. I. (2023). Impact of artificial intelligence on dental education: A review and guide for curriculum update. Education Sciences, 13 (2), 150.

Wallace, R. (1995). Artificial linguistic internet computer entity (alice). City .

Wang, Q., Jing, S., Camacho, I., Joyner, D., & Goel, A. (2020). Jill Watson SA: Design and evaluation of a virtual agent to build communities among online learners. Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems , 1–8.

Warren, T. (2023). Microsoft is looking at OpenAI’s GPT for Word, Outlook, and PowerPoint. The Verge .

Weizenbaum, J. (1966). ELIZA—A computer program for the study of natural language communication between man and machine. Communications of the ACM, 9 (1), 36–45.

Williams, C. (2023). Hype, or the future of learning and teaching? 3 Limits to AI’s ability to write student essays .

Wollny, S., Schneider, J., Di Mitri, D., Weidlich, J., Rittberger, M., & Drachsler, H. (2021). Are we there yet?—A systematic literature review on chatbots in education. Frontiers in Artificial Intelligence, 4 , 654924.

Xie, T., & Pentina, I. (2022). Attachment theory as a framework to understand relationships with social chatbots: A case study of Replika .

Zhang, Q. (2023). Investigating the effects of gamification and ludicization on learning achievement and motivation: An empirical study employing Kahoot! and Habitica. International Journal of Technology-Enhanced Education (IJTEE), 2 (1), 1–19.


Acknowledgements

Not applicable.

Funding

The authors declare that this research paper did not receive any funding from external organizations. The study was conducted independently and without financial support from any source. The authors have no financial interests or affiliations that could have influenced the design, execution, analysis, or reporting of the research.

Author information

Authors and affiliations

Finance Department, American University of the Middle East, Block 6, Building 1, Egaila, Kuwait

Lasha Labadze

Statistics Department, American University of the Middle East, Block 6, Building 1, Egaila, Kuwait

Maya Grigolia

Caucasus School of Business, Caucasus University, 1 Paata Saakadze St, 0102, Tbilisi, Georgia

Lela Machaidze


Contributions

LL provided a concise overview of the existing literature and formulated the methodology. MG initiated the initial search process. LM authored the discussion section. All three authors collaborated on the selection of the final paper collection and contributed to crafting the conclusion. The final version of the paper received approval from all authors.

Corresponding author

Correspondence to Lasha Labadze .

Ethics declarations

Competing interests

Authors have no competing interests to declare.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: “A sentence has been added to the Methodology section of the article to acknowledge use of LLM”

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Labadze, L., Grigolia, M. & Machaidze, L. Role of AI chatbots in education: systematic literature review. Int J Educ Technol High Educ 20 , 56 (2023). https://doi.org/10.1186/s41239-023-00426-1


Received : 22 August 2023

Accepted : 18 October 2023

Published : 31 October 2023

DOI : https://doi.org/10.1186/s41239-023-00426-1


  • Systematic literature review
  • Artificial intelligence
  • AI chatbots
  • Chatbots in education



Original research article

A systematic literature review on the impact of AI models on the security of code generation


  • 1 Security and Trust, University of Luxembourg, Luxembourg, Luxembourg
  • 2 École Normale Supérieure, Paris, France
  • 3 Faculty of Humanities, Education, and Social Sciences, University of Luxembourg, Luxembourg, Luxembourg

Introduction: Artificial Intelligence (AI) is increasingly used to help develop computer programs. While it can boost software development productivity and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software with vulnerabilities. This situation leads to the question: How serious and widespread are the security flaws in code generated using AI models?

Methods: Through a systematic literature review, this work reviews the state of the art on how AI models impact software security. It systematizes the knowledge about the risks of using AI in coding security-critical software.

Results: It reviews what security flaws of well-known vulnerabilities (e.g., the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works that discuss how vulnerabilities in AI-generated code can be exploited to compromise security and lists the attempts to improve the security of such AI-generated code.

Discussion: Overall, this work provides a comprehensive and systematic overview of the impact of AI in secure coding. This topic has sparked interest and concern within the software security engineering community. It highlights the importance of setting up security measures and processes, such as code verification, and that such practices could be customized for AI-aided code production.

1 Introduction

Despite initial concerns, many organizations increasingly rely on artificial intelligence (AI) to enhance the operational workflows in their software development life cycle and to support writing software artifacts. One of the most well-known tools is GitHub Copilot. Created by GitHub (a Microsoft subsidiary), it relies on OpenAI's Codex model and is trained on open-source code publicly available on GitHub ( Chen et al., 2021 ). Like many similar tools—such as CodeParrot, PolyCoder, and StarCoder—Copilot is built atop a large language model (LLM) that has been trained on programming languages. Using LLMs for such tasks predates the public release of OpenAI's ChatGPT.

However, using automation and AI in software development is a double-edged sword. While it can improve code proficiency, the quality of AI-generated code is problematic. Some models introduce well-known vulnerabilities, such as those documented in MITRE's Common Weakness Enumeration (CWE) list of the top 25 “most dangerous software weaknesses.” Others generate so-called “stupid bugs,” naïve single-line mistakes that developers would qualify as “stupid” upon review ( Karampatsis and Sutton, 2020 ).

This behavior was identified early on and is supported to a varying degree by academic research. Pearce et al. (2022) concluded that 40% of the code suggested by Copilot had vulnerabilities. Yet research also shows that users trust AI-generated code more than their own ( Perry et al., 2023 ). These situations imply that new processes, mitigation strategies, and methodologies should be implemented to reduce or control the risks associated with the participation of generative AI in the software development life cycle.

It is, however, difficult to clearly attribute the blame: the tooling landscape evolves, different training strategies and prompt engineering are used to alter LLMs' behavior, and there is conflicting, if anecdotal, evidence that human-generated code could be just as bad as AI-generated code.

This systematic literature review (SLR) aims to critically examine how the code generated by AI models impacts software and system security. Following the categorization of research questions provided by Kitchenham and Charters (2007) , this work has a twofold objective: analyzing the impact and systematizing the knowledge produced so far. Our main question is:

“ How does the code generation from AI models impact the cybersecurity of the software process? ”

This paper discusses the risks and reviews the current state-of-the-art research on this still actively-researched question.

Our analysis shows specific trends and gaps in the literature. Overall, there is a high-level agreement that AI models do not produce safe code and do introduce vulnerabilities , despite mitigations. Particular vulnerabilities appear more frequently and prove to be more problematic than others ( Pearce et al., 2022 ; He and Vechev, 2023 ). Some domains (e.g., hardware design) seem more at risk than others, and there is clearly an imbalance in the efforts deployed to address these risks.

This work stresses the importance of relying on dedicated security measures in current software production processes to mitigate the risks introduced by AI-generated code and highlights the limitations of AI-based tools to perform this mitigation themselves.

The article is divided as follows: Section 2 introduces the reader to AI models and code generation; Section 3 explains our research method; Section 4 presents our results; Section 5 discusses them, taking into consideration AI models, exploits, programming languages, mitigation strategies, and future research. We close the paper by addressing threats to validity in Section 6 and concluding in Section 7.

2 Background and previous work

2.1 AI models

The sub-branch of AI models relevant to our discussion is generative models, especially large language models (LLMs), which developed out of the attention-based transformer architecture ( Vaswani et al., 2017 ) and were made widely known and available through pre-trained models (such as OpenAI's GPT series and Codex, Google's PaLM, Meta's LLaMA, or Mistral's Mixtral).

In a transformer architecture, inputs (e.g., text) are converted to tokens 1 which are then mapped to an abstract latent space, a process known as encoding ( Vaswani et al., 2017 ). Mapping back from the latent space to tokens is accordingly called decoding , and the model's parameters are adjusted so that encoding and decoding work properly. This is achieved by feeding the model with human-generated input, from which it can learn latent space representations that match the input's distribution and identify correlations between tokens.

Pre-training amortizes the cost of training, which has become prohibitive for LLMs. It consists in determining a reasonable set of weights for the model, usually through autocompletion tasks, either autoregressive (ChatGPT) or masked (BERT) for natural language, during which the model is faced with an incomplete input and must correctly predict the missing parts or the next token. This training happens once, is based on public corpora, and results in an initial set of weights that serves as a baseline ( Tan et al., 2018 ). Most “open-source” models today follow this approach. 2

It is possible to fine-tune parameters to handle specific tasks from a pre-trained model, assuming they remain within a small perimeter of what the model was trained to do. This final training often requires human feedback and correction ( Tan et al., 2018 ).

The output of a decoder is not directly tokens, however, but a probability distribution over tokens. The temperature hyperparameter of LLMs controls how much the likelihood of less probable tokens is amplified: a high temperature allows less probable tokens to be selected more often, resulting in a less predictable output. This is often combined with nucleus sampling ( Holtzman et al., 2020 ), i.e., sampling only from the smallest set of tokens whose total probability is large enough, and with various penalty mechanisms to avoid repetition.
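As an illustration of this decoding step, temperature scaling and nucleus (top-p) sampling can be sketched in a few lines. This is a minimal pure-Python sketch over a toy logit list, not any particular model's implementation; real decoders operate on large tensors of logits.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    # Temperature scaling: dividing logits by T flattens (high T) or
    # sharpens (low T) the softmax distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then sample only from that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a very low temperature the most probable token is selected almost deterministically, while a high temperature spreads the choice across the nucleus.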

Finally, before being presented to the user, an output may undergo one or several rounds of (possibly non-LLM) filtering, including for instance the detection of foul language.

2.2 Code generation with AI models

With the rise of generative AI, there has also been a rise in the development of AI models for code generation. Multiple examples exist, such as Codex, PolyCoder, CodeGen, CodeBERT, and StarCoder, to name a few ( Xu et al., 2022 ). These new tools should help developers of different domains be more efficient when writing code, or at least are expected to ( Chen et al., 2021 ).

The use of LLMs for code generation is a domain-specific application of generative methods that greatly benefits from the narrower context. Contrary to natural language, programming languages follow a well-defined syntax using a reduced set of keywords, and multiple clues can be gathered (e.g., filenames, other parts of a code base) to help nudge the LLM in the right direction. Furthermore, so-called boilerplate code is not project-specific and can be readily reused across different code bases with minor adaptations, meaning that LLM-powered code assistants can already go a long way simply by providing commonly-used code snippets at the right time.

By design, LLMs generate code based on their training set ( Chen et al., 2021 ). 3 In doing so, there is a risk that sensitive, incorrect, or dangerous code is uncritically copied verbatim from the training set, or that the “minor adaptations” necessary to transfer code from one project to another introduce mistakes ( Chen et al., 2021 ; Pearce et al., 2022 ; Niu et al., 2023 ). Therefore, generated code may include security issues, such as well-documented bugs, malpractices, or legacy issues found in the training data. A parallel issue often brought up is the copyright status of works produced by such tools, a still-open problem that is not the topic of this paper.

Similarly, other challenges and concerns have been highlighted by academic research. From an educational point of view, one concern is that using AI code generation models may instill bad security habits in novice programmers or students ( Becker et al., 2023 ). However, the usage of such models can also help lower the entry barrier to the field ( Becker et al., 2023 ). Likewise, Pearce et al. (2022) , one of the first works to research this subject, suggested that AI code generation models do not output secure code all the time, as they are non-deterministic, and that future research on mitigation is required.

There are further claims that such models may be put to use by cybercriminals ( Chen et al., 2021 ; Natella et al., 2024 ). In popular media, there are affirmations that ChatGPT and other LLMs will be “useful” for criminal activities, for example Burgess (2023) . However, these tools can also be used defensively in cyber security, as in ethical hacking ( Chen et al., 2021 ; Natella et al., 2024 ).

3 Research method

This research aims to systematically gather and analyze publications that answer our main question: “ How does the code generation of AI models impact the cybersecurity of the software process? ” Following the classification of SLR questions by Kitchenham and Charters (2007) , our research falls under “identifying the impact of technologies” on security and “identifying cost and risk factors associated with a technology,” also with respect to security.

To carry out this research, we have followed different SLR guidelines, most notably Wieringa et al. (2006) , Kitchenham and Charters (2007) , Wohlin (2014) , and Petersen et al. (2015) . Each of these guidelines was used for different elements of the research. We list below, at a high level, which guidelines were used for each element; each is further discussed in the corresponding subsection of this article.

• For the general structure and guideline on how to carry out the SLR, we used Kitchenham and Charters (2007) . This included exclusion and inclusion criteria, explained in Section 3.2 ;

• The identification of the Population, Intervention, Comparison, and Outcome (PICO) is based on both Kitchenham and Charters (2007) and Petersen et al. (2015) , as a framework to create our search string. We present and discuss this framework in Section 3.1 ;

• For the questions and quality check of the sample, we used the research done by Kitchenham et al. (2010) , which we describe in further detail in Section 3.4 ;

• The taxonomy of research types is from Wieringa et al. (2006) , used as a strategy to identify whether a paper falls under our exclusion criteria. We present and discuss this taxonomy in Section 3.2. Although their taxonomy focuses on requirements engineering, it is broad enough to be used in other areas, as recognized by Wohlin et al. (2013) ;

• For the snowballing technique, we used the method presented in Wohlin (2014) , which we discuss in Section 3.3 ;

• Mitigation strategies from Wohlin et al. (2013) are used, aiming to increase the reliability and validity of this study. We further analyze the threats to validity of our research in Section 6.

In the following subsections, we explain our approach to the SLR in more detail. The results are presented in Section 4.

3.1 Search planning and string

To answer our question systematically, we need to create a search string that reflects the critical elements of our questions. To achieve this, we thus need to frame the question in a way that allows us to (1) identify keywords, (2) identify synonyms, (3) define exclusion and inclusion criteria, and (4) answer the research question. One common strategy is the PICO (population, intervention, comparison, outcome) approach ( Petersen et al., 2015 ). Originally from medical sciences, it has been adapted for computer science and software engineering ( Kitchenham and Charters, 2007 ; Petersen et al., 2015 ).

To frame our work with the PICO approach, we follow the methodologies outlined in Kitchenham and Charters (2007) and Petersen et al. (2015) . By identifying these four elements, we can identify the set of keywords and their synonyms, as explained in detail in the following bullet points.

• Population: Cybersecurity.

• Following Kitchenham and Charters (2007) , a population can be an area or domain of technology. Population can be very specific.

• Intervention: AI models.

• Following Kitchenham and Charters (2007) “The intervention is the software methodology/tool/technology, such as the requirement elicitation technique.”

• Comparison: we compare the security issues identified in the code generated in the research articles. In the words of Kitchenham and Charters (2007) , “This is the software engineering methodology/tool/technology/procedure with which the intervention is being compared. When the comparison technology is the conventional or commonly-used technology, it is often referred to as the ‘control' treatment.”

• Outcomes: A systematic list of security issues of using AI models for code generation and possible mitigation strategies.

• Context: although not mandatory (per Kitchenham and Charters, 2007 ), we consider code generation as the general context.

With the PICO elements defined, it is possible to determine specific keywords to generate our search string. We have identified three specific sets: security, AI, and code generation. Consequently, we need to include synonyms from these three sets when generating the search string, taking a similar approach to Petersen et al. (2015) . Including synonyms is important because different research papers refer to the same phenomena differently; if synonyms are not included, essential papers may be missed from the final sample. The three sets are explained in more detail below:

• Set 1: search elements related to security and insecurity due to our population of interest and comparison.

• Set 2: AI-related elements based on our intervention. This set should include LLMs, generative AI, and other approximations.

• Set 3: the research should focus on code generation.

From these three sets of critical elements, we constructed the search string by including synonyms based on the sets (as seen in Table 1 ). The string was built concurrently with the identification of synonyms. Through different iterations, we aimed at achieving the “golden” string, following the test-retest approach of Kitchenham et al. (2010) : in every iteration, we checked whether the vital papers of our study were in the sample, and a candidate synonym was kept only if it added meaningful results. For example, one of the iterations included “ hard* ,” which did not add any extra article; hence, it was excluded. Due to space constraints, the different iterations are available in the public repository of this research. The final string, with the unique query per database, is presented in Table 2 .
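The construction of such a boolean string (OR within each synonym set, AND across sets) can be sketched as follows. The synonym sets below are illustrative placeholders, not the exact terms of Table 1:

```python
# Illustrative synonym sets; the real ones are listed in Table 1.
keyword_sets = {
    "security": ["security", "insecur*", "vulnerab*"],
    "ai": ['"large language model*"', "LLM*", '"generative AI"'],
    "code generation": ['"code generation"', '"code completion"'],
}

def build_query(sets):
    # OR the synonyms within each set, then AND the sets together,
    # mirroring the test-retest refinement described above.
    groups = ["(" + " OR ".join(terms) + ")" for terms in sets.values()]
    return " AND ".join(groups)
```

Each database then receives a syntactic variant of this single logical query, as shown in Table 2.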

Table 1. Keywords and synonyms.

Table 2. Search string per database.

For this research, we selected the following databases to gather our sample: IEEE Xplore, ACM, and Scopus (which includes Springer and ScienceDirect). The databases were selected based on their relevance for computer science research, publication of peer-reviewed research, and alignment with this research's objective. Although databases from other domains could have been selected, the chosen ones are well established in computer science.

3.2 Exclusion and inclusion criteria

The exclusion and inclusion criteria were decided in alignment with our research objectives. Excluding unranked venues avoids literature that is not peer-reviewed and acts as a first quality check; this decision also applies to gray literature and book chapters. Finally, we excluded opinion and philosophical papers, as they do not carry out primary research. Table 3 shows our inclusion and exclusion criteria.

Table 3. Inclusion and exclusion criteria.

We have excluded articles that address AI models or AI technology in general, as our interest—based on PICO—is on the security issue of AI models in code generation. So although such research is interesting, it does not align with our main objective.

For identifying the secondary research, opinion, and philosophical papers—which are all part of our exclusion criteria in Table 3 —we follow the taxonomy provided by Wieringa et al. (2006) . Although this classification was written for the requirements engineering domain, it can be generalized to other domains ( Wieringa et al., 2006 ). In addition, apart from helping us identify if a paper falls under our exclusion criteria, this taxonomy also allows us to identify how complete the research might be. The classification is as follows:

• Solution proposal: Proposes a solution to a problem ( Wieringa et al., 2006 ). “The solution can be novel or a significant extension of an existing technique ( Petersen et al., 2015 ).”

• Evaluation research: “This is the investigation of a problem in RE practice or an implementation of an RE technique in practice [...] novelty of the knowledge claim made by the paper is a relevant criterion, as is the soundness of the research method used ( Petersen et al., 2015 ).”

• Validation research: “This paper investigates the properties of a solution proposal that has not yet been implemented... ( Wieringa et al., 2006 ).”

• Philosophical papers: “These papers sketch a new way of looking at things, a new conceptual framework ( Wieringa et al., 2006 ).”

• Experience papers: where the authors report their experience on a matter. “In these papers, the emphasis is on what and not on why ( Wieringa et al., 2006 ; Petersen et al., 2015 ).”

• Opinion papers: “These papers contain the author's opinion about what is wrong or good about something, how we should do something, etc. ( Wieringa et al., 2006 ).”

3.3 Snowballing

Furthermore, to increase the reliability and validity of this research, we applied a forward snowballing technique ( Wohlin et al., 2013 ; Wohlin, 2014 ). Once the first sample (the start set) had passed the exclusion and inclusion criteria based on title, abstract, and keywords, we forward snowballed the whole start set ( Wohlin et al., 2013 ). That is to say, we checked which papers cited the papers in our start set, as suggested by Wohlin (2014) . For this step, we used Google Scholar.

In the snowballing phase, we analyzed the title, abstract, and keywords of each candidate ( Wohlin, 2014 ) and performed an inclusion/exclusion analysis that also considered the publication venue. If there was insufficient information, we analyzed the full text to make a decision, following the recommendations of Wohlin (2014) .

Our objective with the snowballing is to increase reliability and validity. Furthermore, some articles found through snowballing had been accepted at peer-reviewed venues but had not yet been indexed in the corresponding database, a situation we address in Section 6.
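One round of the forward-snowballing procedure described above can be sketched as follows. The citation index and screening predicate are hypothetical stand-ins: in this study, the citation lookup was done manually via Google Scholar and the screening on title, abstract, and venue.

```python
def forward_snowball(start_set, cited_by, passes_screening):
    """One round of forward snowballing: collect the papers that cite
    any paper in the start set, then screen the new candidates."""
    candidates = set()
    for paper in start_set:
        candidates.update(cited_by.get(paper, []))   # who cites this paper?
    candidates -= set(start_set)                     # drop papers already included
    return [p for p in sorted(candidates) if passes_screening(p)]
```

In practice this round is repeated, or its output merged with the start set, before the full-text reading and quality check stages.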

3.4 Quality analysis

Once the final sample of papers was collected, we proceeded with the quality check, following the procedure of Kitchenham and Charters (2007) and Kitchenham et al. (2010) . The objective behind a quality checklist is twofold: “to provide still more detailed inclusion/exclusion criteria” and to act “as a means of weighting the importance of individual studies when results are being synthesized ( Kitchenham and Charters, 2007 ).” We followed the approach of Kitchenham et al. (2010) for the quality check, adopting their questions and categorization. In addition, to further adapt the questionnaire to our objectives, we added one question on security and adapted another. The questionnaire is described in Table 4 . Each question was scored according to the scoring scale defined in Table 5 .

Table 4. Quality criteria questionnaire.

Table 5. Quality criteria assessment.

The quality analysis was done by at least two authors of this research, for reliability and validity purposes ( Wohlin et al., 2013 ).

3.5 Data extraction

To extract the data, we subdivided the main question into sub-questions. This allows us to extract information and summarize it systematically; we created an extraction form in line with Kitchenham and Charters (2007) and Carrera-Rivera et al. (2022) . The data extraction form is presented in Table 6 .

Table 6. Data extraction form and type of answer.

The data extraction was done by at least two researchers per article. Afterward, the results were compared, and if there were “disagreements, [they must be] resolved either by consensus among researchers or arbitration by an additional independent researcher ( Kitchenham and Charters, 2007 ).”
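The comparison step between the two researchers' extraction forms can be sketched as follows. The field names are hypothetical; the actual fields are those of the extraction form in Table 6.

```python
def find_disagreements(form_a, form_b):
    # Compare two researchers' data extraction forms field by field;
    # every mismatching field must be resolved by consensus or, failing
    # that, by arbitration from an additional independent researcher.
    return sorted(k for k in form_a.keys() | form_b.keys()
                  if form_a.get(k) != form_b.get(k))
```

Fields missing from one form are treated as disagreements too, so incomplete extractions are also flagged for discussion.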

4 Results

4.1 Search results

The search and collection of papers was done during the last week of November 2023. Table 7 shows the total number of articles gathered per database. The selection process for our final sample is shown in Figure 1 .

Table 7. Search results per database.

Figure 1. Selection of sample papers for this SLR.

The total number of articles in our first round, among all the databases, was 95. We then identified duplicates and applied our inclusion and exclusion criteria for the first round of selected papers. This process left us with a sample of 21 articles.

These first 21 articles form our start set, from which we proceeded with forward snowballing. We snowballed each paper of the start set by searching Google Scholar to find where it had been cited. Selection at this phase was based on title and abstract, following Wohlin (2014) . This step added 22 more articles to the sample, for a total of 43. We then applied the inclusion and exclusion criteria to the newly snowballed papers, which left us with 35 papers. We discuss this high number of snowballed papers in Section 6.

At this point, we read all the articles to decide whether they should pass to the final phase. In this phase, we discarded 12 articles deemed out of scope for this research (for example, articles that did not focus on cybersecurity, code generation, or the usage of AI models for code generation), leaving 23 articles for the quality check.

At this phase, three articles among those previously discarded sparked discussion between the first and fourth authors regarding whether they fell within the scope of this research. We defined AI code generation as artifacts that suggest or produce code. Hence, artifacts that use AI to check and/or verify code, or for vulnerability detection without suggesting new code, are not within scope. In addition, an article's main focus should be code generation and not other areas, such as code verification; although an article might discuss code generation, it was not accepted if that was not its main topic. As a result, two of the three discussed articles were accepted, and one was rejected.

4.2 Quality evaluation

We carried out a quality check on our preliminary sample of papers ( N = 23) as detailed in Section 3.4. Based on the indicated scoring system, we discarded articles that did not reach 50% of the total possible score (four points). Disagreements in the scoring were discussed and resolved between authors. Each paper's scores are provided in Table 8 for transparency purposes ( Carrera-Rivera et al., 2022 ). The quality scores guide us on where to place more weight and on which articles to focus ( Kitchenham and Charters, 2007 ). The final sample is N = 19.
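The threshold rule can be sketched as follows, assuming a maximum total score of eight points (so that the 50% cut-off corresponds to the four points mentioned above); the per-question scale itself is the one defined in Table 5.

```python
def passes_quality_check(scores, max_total=8.0, threshold=0.5):
    # Sum the per-question scores (Table 4 questions, Table 5 scale)
    # and keep papers reaching at least `threshold` of the maximum,
    # i.e., four points out of an assumed eight.
    return sum(scores) >= threshold * max_total
```

Papers scoring below the cut-off are discarded; the rest carry their score forward as a weight during synthesis.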

Table 8. Quality scores of the final sample.

4.3 Final sample

The quality check left us with 19 papers as our final sample, as seen in Table 9 . The first article in this sample was published in 2022, and the number of publications has been increasing every year. This is not surprising, as generative AI rose in popularity around 2020 and became widespread knowledge with the public release of ChatGPT.

Table 9. Sample of papers, with the main information of interest († means no parameter or base model was specified in the article).

5 Discussion

5.1 About AI model comparisons and methods of investigation

The large majority of the papers (14 papers, 73%) research at least one OpenAI model, Codex being the most popular option. OpenAI also developed ChatGPT, which was adopted massively by the general public, so it is not surprising that most articles focus on OpenAI models. However, AI models from other organizations are also studied; Salesforce's CodeGen and CodeT5, both open-source, are prime examples. Similarly, PolyCoder ( Xu et al., 2022 ) was a popular selection in the sample. Finally, different authors benchmarked in-house AI models against popular ones, for example Tony et al. (2022) with DeepAPI-plusSec and DeepAPI-onlySec, and Pearce et al. (2023) with gpt2-csrc. Figure 3 shows the LLM instances researched by two or more articles, grouped by family.

As the different papers researched different vulnerabilities, it remains difficult to compare their results. Some articles researched specific CWEs, others the MITRE Top 25, the impact of AI on code, the quality of the code generated, or malware generation, among others. It was also challenging to find a common methodological approach for comparing results; therefore, we can only infer certain tendencies. For this reason, future research could focus on a standardized approach to analyzing vulnerabilities and assessing the security of generated code. Furthermore, it would be interesting to have more comparisons between open-source and proprietary models.

Having stated this, two articles with similar approaches, topics, and vulnerabilities are Pearce et al. (2022 , 2023) . Both papers share authors, which can help explain the similarity in approach. Both reach similar conclusions on the security of the output of different OpenAI models: they can generate functional and safe code, but the percentage varies between CWEs and programming languages ( Pearce et al., 2022 , 2023 ). In both studies, the security of the code generated in C was inferior to that in Python ( Pearce et al., 2022 , 2023 ). For example, Pearce et al. (2022) indicates that 39% of the suggested code in Python and 50% in C is vulnerable. Pearce et al. (2023) highlights that the models they studied struggled with fixes for certain CWEs, such as CWE-787 in C. So even though they compared different models of the OpenAI family, they produced similar results (albeit some models performed better than others).

Based on the work of Pearce et al. (2023) , when comparing OpenAI's models to others (such as the AI21 family, PolyCoder, and gpt2-csrc) on C and Python with CWE vulnerabilities, OpenAI's models perform better than the rest; in the majority of cases, code-davinci-002 outperforms the others. Furthermore, when applying the AI models to other programming languages, such as Verilog, not all models (namely PolyCoder and gpt2-csrc) supported it ( Pearce et al., 2023 ). We cannot fully compare these results with other research articles, as they focused on different CWEs, but we identified tendencies. To name some differences:

• He and Vechev (2023) study mainly CodeGen and mention that Copilot can help with CWE-089, CWE-022, and CWE-798. They do not compare the two AI models but compare CodeGen with SVEN. They use scenarios to evaluate CWEs, adopting the method from Pearce et al. (2022) . CodeGen does seem to show tendencies similar to Pearce et al. (2022) : certain CWEs appeared more recurrently than others. For example, comparing Pearce et al. (2022) and He and Vechev (2023) , CWE-787, -089, -079, and -125 in Python and C appeared in most scenarios at a similar rate. 4

• This data shows that OpenAI's and CodeGen models have similar outputs. When He and Vechev (2023) present the “overall security rate” of CodeGen at different temperatures, the rates are equivalent: 42% of the suggested code is vulnerable in He and Vechev (2023) vs. 39% in Python and 50% in C in Pearce et al. (2022) .

• Nair et al. (2023) also study CWE vulnerabilities for Verilog code. Pearce et al. (2022 , 2023) also analyze Verilog on OpenAI's models, but with very different research methods. Furthermore, their objectives differ: Nair et al. (2023) focus on prompting and how to modify prompts for a secure output. What can be compared is that both Nair et al. (2023) and Pearce et al. (2023) highlight the importance of prompting.

• Finally, Asare et al. (2023) also study OpenAI models, but from a very different perspective: human-computer interaction (HCI). Therefore, we cannot compare their study results with Pearce et al. (2022 , 2023) .

Regarding malware code generation, both Botacin (2023) and Pa Pa et al. (2023) study OpenAI's models, but different base models. Both conclude that AI models can help generate malware, though to different degrees. Botacin (2023) indicates that ChatGPT cannot create malware from scratch but can create snippets and help less-skilled malicious actors overcome the learning curve. Pa Pa et al. (2023) experiment with different jailbreaks and suggest that the different models can create malware of up to 400 lines of code. In contrast, Liguori et al. (2023) research Seq2Seq and CodeBERT and highlight that, for malicious actors, it is crucial that AI models output correct code, otherwise their attack fails; therefore, human review is still necessary to fulfill the goals of malicious actors ( Liguori et al., 2023 ). Future work could benefit from comparing these results with other AI code generation models to understand whether they have similar outputs and how they can be jailbroken.

The last element we can compare is the HCI aspects, specifically Asare et al. (2023) , Perry et al. (2023) , and Sandoval et al. (2023) , who all researched C. Both Asare et al. (2023) and Sandoval et al. (2023) agree that AI code generation models seem to be no worse than humans at generating insecure code and introducing vulnerabilities. In contrast, Perry et al. (2023) concludes that developers who used AI assistants generated more insecure code (although this is inconclusive for the C language) while believing they had written more secure code. Perry et al. (2023) suggest that there is a relationship between trust in the AI model and the security of the code. All three agree that AI assistant tools should be used carefully, particularly by non-experts ( Asare et al., 2023 ; Perry et al., 2023 ; Sandoval et al., 2023 ).

5.2 New exploits

Firstly, Niu et al. (2023) hand-crafted prompts that seemed likely to leak personal data, yielding 200 prompts. They then queried the model with each of these prompts, obtaining five responses per prompt, for 1,000 responses in total. Two authors then looked through the outputs to identify whether the prompts had leaked personal data. The authors then refined the identified prompts, tweaking elements such as context, prefixing, the natural language used (English or Chinese), and meta-variables such as the programming-language style of the prompt, to build the final data set.

With the final set of prompts, the model was queried for privacy leaks. Before querying the model, the authors also tuned specific parameters, such as temperature. “Using the BlindMI attack allowed filtering out 20% of the outputs, with the high recall ensuring that most of the leakages are classified correctly and not discarded ( Niu et al., 2023 ).” Once the outputs had been labeled as members, a human checked whether they contained “sensitive data” ( Niu et al., 2023 ). The human could categorize such information as a targeted leak, an indirect leak, or an uncategorized leak.

When applying the exploit to the Codex-based Copilot and verifying with GitHub, it shows there is indeed a leakage of information ( Niu et al., 2023 ). 2.82% of the outputs contained identifiable information such as addresses, emails, and dates of birth; 0.78% private information such as medical records or identities; and 0.64% secret information such as private keys, biometric authentication, or passwords ( Niu et al., 2023 ). The instances in which data was leaked varied; specific categories, such as bank statements, had much lower leak rates than passwords, for example ( Niu et al., 2023 ). Furthermore, most of the leaks tended to be indirect rather than direct. This finding implies that “the model has a tendency to generate information pertaining to individuals other than the subject of the prompt, thereby breaching privacy principles such as contextual agreement ( Niu et al., 2023 ).”
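Leak categories like these lend themselves to a pattern-based first pass. The sketch below is purely illustrative (Niu et al.'s actual pipeline combines the BlindMI membership-inference filter with human review; the patterns and category names here are our own assumptions):

```python
import re

# Hypothetical first-pass filters for candidate leaks in model outputs.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "date_of_birth": re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b"),
}

def flag_potential_leaks(output: str) -> list[str]:
    """Return the categories of potential personal data found in an output."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(output)]

sample = 'contact = "jane.doe@example.com"  # born 1990-04-01'
print(flag_potential_leaks(sample))  # ['email', 'date_of_birth']
```

Such a filter only surfaces candidates; deciding whether a hit is a targeted, indirect, or uncategorized leak still requires human judgment, as in the original study.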

Their research proposes a scalable, semi-automatic way to leak personal data from the training data of a code-generation AI model. The authors do note that the outputs are not verbatim or memorized data.

To achieve this, He and Vechev (2023) curated a dataset of vulnerabilities from CrossVul ( Nikitopoulos et al., 2021 ) and Big-Vul ( Fan et al., 2020 ), which focus on C/C++, and from VUDENC ( Wartschinski et al., 2022 ) for Python. In addition, they included data from GitHub commits, taking special care that these were true fixes, to avoid SVEN learning “undesirable behavior.” In the end, they target 9 CWEs from the MITRE Top 25.

Through benchmarking, they evaluate the security (and functional correctness) of SVEN's output against CodeGen (350M, 2.7B, and 6.1B). They follow a scenario-based approach “that reflect[s] real-world coding ( He and Vechev, 2023 ),” with each scenario targeting one CWE. They measure the security rate, defined as “the percentage of secure programs among valid programs ( He and Vechev, 2023 ).” They set the temperature to 0.4 for the samples.
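The security-rate metric is straightforward to compute. A minimal sketch of the quoted definition, assuming each sampled program carries validity and security labels (our illustration, not the authors' evaluation code):

```python
def security_rate(programs):
    """Security rate as defined by He and Vechev (2023): the percentage
    of secure programs among *valid* programs.
    Each program is an (is_valid, is_secure) pair."""
    valid = [p for p in programs if p[0]]
    if not valid:
        return 0.0
    secure = sum(1 for p in valid if p[1])
    return 100.0 * secure / len(valid)

# Toy sample: 4 valid programs, 3 of them secure, plus 1 invalid program.
sample = [(True, True), (True, True), (True, True), (True, False), (False, False)]
print(security_rate(sample))  # 75.0
```

Note that invalid (non-compiling or non-functional) samples are excluded from the denominator, which is why validity filtering matters when comparing rates across studies.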

Their results show that SVEN can significantly increase or decrease (depending on the controlled generation output) the code security score. “CodeGen LMs have a security rate of ≈60%, which matches the security level of other LMs [...] SVEN sec significantly improves the security rate to >85%. The best-performing case is 2.7B, where SVENsec increases the security rate from 59.1 to 92.3% ( He and Vechev, 2023 ).” Similar results are obtained for SVEN vul , which decreases the “security rate greatly by 23.5% for 350M, 22.3% for 2.7B, and 25.3% for 6.1B ( He and Vechev, 2023 )”. 5 When analyzed per CWE, in almost all cases (except CWE-416 in C) SVEN sec increases the security rate. Finally, even when tested on 4 CWEs that were not included in the original training set of 9, SVEN had positive results.

Although the authors aim at evaluating and validating SVEN as an artifact for cybersecurity, they also recognize its potential use as a malicious tool. They suggest that SVEN could be inserted into open-source projects and distributed ( He and Vechev, 2023 ). Future work could focus on how to integrate SVEN, or similar approaches, as plug-ins into AI code generation tools, to lower the security of the code generated. Furthermore, replication of this approach could raise security alarms. Other research could seek ways to lower the security score while keeping the functionality, and study how such a tool could be distributed across targeted actors.

Jha and Reddy (2023) benchmark CodeAttack against TextFooler and BERT-Attack, two other adversarial attacks, on three tasks: code translation (translating code between programming languages, in this case between C# and Java), code repair (fixing bugs in Java), and code summarization (summarizing code in natural language). The authors also applied the benchmark to different AI models (CodeT5, CodeBERT, GraphCode-BERT, and RoBERTa) and different programming languages (C#, Java, Python, and PHP). In the majority of the tests, CodeAttack had the best results.

5.3 Performance per programming language

Different programming languages are studied. Python and the C family are the most common, the latter including C, C++, and C# (as seen in Figure 2 ). To a lesser extent, Java and Verilog are tested. Finally, a few articles study more niche languages, such as Solidity, Go, or PHP. Figure 2 offers a graphical representation of the distribution of programming languages.


Figure 2 . Number of articles that research specific programming languages. An article may research 2 or more programming languages.


Figure 3 . Number of times each LLM instance was researched by two or more articles, grouped by family. One paper might study several instances of the same family (e.g., Code-davinci-001 and Code-davinci-002), therefore counting twice. Table 9 offers details on exactly which AI models are studied per article.

5.3.1 Python

Python is today the second most used programming language. 6 As a result, most publicly-available training corpora include Python, and it is therefore reasonable to assume that AI models can more easily be tuned to handle this language ( Pearce et al., 2022 , 2023 ; Niu et al., 2023 ; Perry et al., 2023 ). Being a rather high-level, interpreted language, Python should also expose a smaller attack surface. As a result, AI-generated Python code has fewer avenues to cause issues to begin with, and this is indeed backed up by evidence ( Pearce et al., 2022 , 2023 ; Perry et al., 2023 ).

In spite of this, issues still occur: Pearce et al. (2022) experimented with 29 scenarios, producing 571 Python programs. Out of these, 219 (38.35%) presented some kind of Top-25 MITRE (2021) vulnerability, with 11 (37.92%) scenarios having a top-vulnerable score. Unaccounted in these statistics are the situations where generated programs fail to achieve functional correctness ( Pearce et al., 2023 ), which could yield different conclusions. 7

Pearce et al. (2023) , building from Pearce et al. (2022) , study to what extent post-processing can automatically detect and fix bugs introduced during code generation. For instance, on CWE-089 (SQL injection) they found that “29.6% [3197] of the 10,796 valid programs for the CWE-089 scenario were repaired” by an appropriately-tuned LLM ( Pearce et al., 2023 ). In addition, they claim that AI models can generate bug-free programs without “additional context ( Pearce et al., 2023 ).”

It is however difficult to support such claims, which need to be nuanced. Depending on the class of vulnerability, AI models varied in their ability to produce secure Python code ( Pearce et al., 2022 ; He and Vechev, 2023 ; Perry et al., 2023 ; Tony et al., 2023 ). Tony et al. (2023) experimented with code generation from natural language prompts, finding that Codex output did indeed include vulnerabilities. In another study, Copilot produced only rare occurrences of CWE-079 or CWE-020, but common occurrences of CWE-798 and CWE-089 ( Pearce et al., 2022 ). Pearce et al. (2022) report a 75% vulnerable score for scenario 1, 48% for scenario 2, and 65% for scenario 3 with regards to the CWE-089 vulnerability. In February 2023, Copilot launched a prevention system for CWEs 089, 022, and 798 ( He and Vechev, 2023 ), the exact mechanism of which is unclear. At the time of writing it falls behind other approaches such as SVEN ( He and Vechev, 2023 ).

Perhaps surprisingly, there is not much variability across different AI models: CodeGen-2.7B has comparable vulnerability rates ( He and Vechev, 2023 ), with CWE-089 still on top. CodeGen-2.7B also produced code that exhibited CWE-078, 476, 079, or 787, which are considered more critical.

One may think that using AI as an assistant to a human programmer could alleviate some of these issues. Yet evidence points to the opposite: when using AI models as pair programmers, developers consistently deliver more insecure Python code ( Perry et al., 2023 ). Perry et al. (2023) led a user-oriented study on how the usage of AI models for programming affects the security and functionality of code, focusing on Python, C, and SQL. For Python, they asked participants to write functions that performed basic cryptographic operations (encryption, signature) and file manipulation. 8 They show a statistically significant difference between subjects who used AI models (experimental group) and those who did not (control group), with the experimental group consistently producing less secure code ( Perry et al., 2023 ). For instance, for task 1 (encryption and decryption), 21% of the responses of the experimental group were secure and correct vs. 43% of the control group ( Perry et al., 2023 ). Conversely, 36% of the experimental group provided insecure but correct code, compared to 14% of the control group.

Even if AI models on occasion produce bug-free and secure code, evidence indicates that this cannot be guaranteed. In this light, both Pearce et al. (2022 , 2023) recommend deploying additional security-aware tools and methodologies whenever using AI models. Moreover, Perry et al. (2023) suggest a relationship between security awareness and trust in AI models on the one hand, and the security of the AI-(co)generated code on the other.

Another point of agreement in our sample is that prompting plays a crucial role in producing vulnerabilities, which can be introduced or avoided depending on the prompt and the adjustment of parameters (such as temperature). Pearce et al. (2023) observe that AI models can generate code that repairs an issue when given a suitable repair prompt. Similarly, Pearce et al. (2022) analyzed how meta-type changes and comments (documentation) can have varying effects on security. An extreme example is the difference between SQL code generated with different prompts: the prompt that “adds a separate non-vulnerable SQL function above a task function” (identified as variation C-2, as it is a code change) would never produce vulnerable code, whereas the one that “adds a separate vulnerable SQL function above the task function” (identified as variation C-3) returns vulnerable code 94% of the time ( Pearce et al., 2022 ). Such results may not be surprising if we expect the AI model to closely follow instructions, but they suffice to show the effect that even minor prompt variations can have on security.
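To make the CWE-089 scenarios behind these figures concrete, the sketch below (our own illustration, not code from the studies) contrasts a string-interpolated query, which an attacker can subvert, with a parameterized one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
con.execute("INSERT INTO users VALUES ('alice', 0), ('bob', 1)")

def get_user_vulnerable(name):
    # CWE-089: string interpolation lets user input alter the query itself.
    return con.execute(f"SELECT name FROM users WHERE name = '{name}'").fetchall()

def get_user_safe(name):
    # Parameterized query: the input is bound as data, never parsed as SQL.
    return con.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()

payload = "' OR '1'='1"
print(get_user_vulnerable(payload))  # returns every row
print(get_user_safe(payload))        # returns no rows
```

A prompt that places one or the other variant above the task function nudges the model toward reproducing that pattern, which is consistent with the C-2 vs. C-3 results above.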

Lastly, Perry et al. (2023) observe in the experimental group a relationship between parameters of the AI model (such as temperature) and code quality. They also observe a relationship between education, security awareness, and trust ( Perry et al., 2023 ). Because of this, there could be spurious correlations in their analysis, for instance the variable measuring AI model parameters adjustments could be, in reality, measuring education or something else.

On another security topic, Siddiq et al. (2022) study code and security “smells.” Smells are hints, not necessarily actual vulnerabilities, but they can open the door for developers to make mistakes that lead to security flaws that attackers exploit. Siddiq et al. (2022) reported on the following CWE vulnerabilities: 078, 703, and 330. They conclude that bad code patterns can (and will) leak into the output of models, and that code generated with these tools should be taken with a “grain of salt” ( Siddiq et al., 2022 ). Furthermore, the identified vulnerabilities may be severe, not merely functional issues ( Siddiq et al., 2022 ). However, as they only researched OpenAI's AI models, their conclusions may lack external validity and generalization.
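To make the notion of a smell concrete, here is a hypothetical line-based checker for patterns loosely related to the CWEs Siddiq et al. (2022) report; real smell detectors are considerably more sophisticated, and the pattern-to-CWE mapping below is our own simplification:

```python
import re

# Illustrative smell patterns (not Siddiq et al.'s tooling).
SMELLS = [
    ("CWE-078 (OS command injection)", re.compile(r"os\.system\(")),
    ("CWE-330 (weak randomness)",      re.compile(r"\brandom\.(random|randint)\(")),
    ("CWE-703 (bare except)",          re.compile(r"except\s*:")),
]

def scan(code: str):
    """Return (line number, smell label) pairs for each suspicious line."""
    hits = []
    for lineno, line in enumerate(code.splitlines(), 1):
        for label, pat in SMELLS:
            if pat.search(line):
                hits.append((lineno, label))
    return hits

snippet = (
    "import os, random\n"
    "os.system('rm -rf ' + user_dir)\n"
    "token = random.randint(0, 2**32)\n"
)
for lineno, label in scan(snippet):
    print(lineno, label)
```

Each hit is only a hint: `os.system` with a constant string is harmless, but the pattern flags places where a developer mistake becomes a real CWE-078 or CWE-330 flaw.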

Finally, some authors explore the possibility of using AI models to deliberately produce malicious code ( He and Vechev, 2023 ; Jha and Reddy, 2023 ; Jia et al., 2023 ; Niu et al., 2023 ). This is interesting to the extent that it facilitates the work of attackers, and therefore affects cybersecurity as a whole, but it does not (in this form at least) affect the software development process or deployment per se, and is therefore outside the scope of our discussion.

5.3.2 C family

The C family of programming languages is considered in 10 (52%) papers of our final sample, with C being the most common, followed by C++ and C#. Unlike Python, C is a low-level, compiled language that puts the programmer in charge of many security-sensitive tasks (such as memory management). The vast majority of native code today is written in C. 9

The consensus is that AI generation of C programs yields insecure code ( Pearce et al., 2022 , 2023 ; He and Vechev, 2023 ; Perry et al., 2023 ; Tony et al., 2023 ), and that AI models can readily be used to develop malware ( Botacin, 2023 ; Liguori et al., 2023 ; Pa Pa et al., 2023 ). However, it is unclear whether AI code generation introduces more or new vulnerabilities compared to humans ( Asare et al., 2023 ; Sandoval et al., 2023 ), or to what extent it influences developers' trust in the security of the code ( Perry et al., 2023 ).

Multiple authors report that common and identified vulnerabilities are regularly found in AI-generated C code ( Pearce et al., 2022 , 2023 ; Asare et al., 2023 ; He and Vechev, 2023 ; Perry et al., 2023 ; Sandoval et al., 2023 ). Pearce et al. (2022) obtained 513 C programs, 258 of which (50.29%) had a top-scoring vulnerability. He and Vechev (2023) provides a similar conclusion.

Regarding automated code-fixing, Asare et al. (2023) and Pearce et al. (2023) report modest scores; for CWE-787, for instance, only 2.2% of C code was repaired.

On the question of human- vs. AI-generated code, Asare et al. (2023) used 152 scenarios to conclude that AI models in fact make fewer mistakes. Indeed, when prompted with the same scenario as a human, in 33% of cases the model suggested the original vulnerability, and in 25% it provided a bug-free output. Yet, based on tests of code replication and automated vulnerability fixing, the authors do not recommend the usage of such a model by non-experts. For example, in code replication, AI models would always replicate code regardless of whether it had a vulnerability, and CWE-20 would consistently be replicated ( Asare et al., 2023 ).

Sandoval et al. (2023) experimentally compared the security of code produced by AI-assisted students to code generated by Codex. They had 58 participants and studied memory-related CWEs, given that these are in the MITRE Top-25 list ( Sandoval et al., 2023 ). Although there were differences between groups, these were no larger than 10% and differed between metrics ( Sandoval et al., 2023 ). In other words, depending on the chosen metric, AI-assisted subjects sometimes performed better in security and vice versa ( Sandoval et al., 2023 ). For example, the rate of CWE-787 was almost the same for the control and experimental groups, whereas it was prevalent in code generated by Codex alone. Therefore, they conclude that the impact on “cybersecurity is less conclusive than the impact on functionality ( Sandoval et al., 2023 ).” Depending on the security metric, it may even be beneficial to use AI-assisted tools, which the authors recognize goes against standard literature ( Sandoval et al., 2023 ). They go so far as to conclude that there is “no conclusive evidence to support the claim LLM assistant increase CWE incidence in general, even when we looked only at severe CWEs ( Sandoval et al., 2023 ).”

Regarding AI-assisted malware generation, there seem to be fundamental limitations preventing current AI models from writing self-contained software from scratch ( Botacin, 2023 ; Liguori et al., 2023 ; Pa Pa et al., 2023 ), although they are adequate for creating smaller blocks of code which, strung together, produce a complete malware ( Botacin, 2023 ). It is also possible to bypass models' limitations by leveraging basic obfuscation techniques ( Botacin, 2023 ). Pa Pa et al. (2023) experiment with prompts and jailbreaks in ChatGPT to produce code (specifically, fileless malware for C++), which the model provided only under 2 of the jailbreaks they chose. Liguori et al. (2023) , meanwhile, reflect on how to best optimize AI code-generating tools to assist attackers, since failing or incorrect code means the attack fails.

Regarding CWEs, the MITRE Top 25 is a concern across multiple authors ( Pearce et al., 2022 , 2023 ; He and Vechev, 2023 ; Tony et al., 2023 ). CWE-787 is a common concern across articles, as it is the #1 vulnerability in the MITRE Top-25 list ( Pearce et al., 2022 ; Botacin, 2023 ; He and Vechev, 2023 ). Across the three scenarios tested by Pearce et al. (2022) , on average, ~34% of the output is vulnerable code. He and Vechev (2023) tested with two scenarios, the first receiving a security rate of 33.7% and the second 99.6%. Interestingly, in their experiment they were not able to obtain security rates for SVEN vul lower than the originals ( He and Vechev, 2023 ). Other vulnerabilities had varying results but with a similar trend. Overall, it seems that AI code generation models produce more vulnerable code in C than in other programming languages, possibly due to the quality and type of data in the training data set ( Pearce et al., 2022 , 2023 ).

Finally, regarding human-computer interaction, Perry et al. (2023) suggests that subjects “with access to an AI assistant often produced more security vulnerabilities than those without access [...] overall.” However, they highlight that the difference is not statistically significant and is inconclusive for the case they study in C. So even if the claim applies to Python, Perry et al. (2023) indicates this is not the case for the C language. Asare et al. (2023) and Sandoval et al. (2023) , as discussed previously, both conclude that AI models do not introduce more vulnerabilities than humans into code. “This means that in a substantial number of scenarios we studied where the human developer has written vulnerable code, Copilot can avoid the detected vulnerability ( Asare et al., 2023 ).”

5.3.3 Java

Java 10 is a high-level programming language that runs atop a virtual machine, and is today primarily used for the development of mobile applications. Vulnerabilities can therefore arise from the programs themselves, from calls to vulnerable (native) libraries, or from problems within the Java virtual machine. Only the first category of issues is discussed here.

In our sample, four articles ( Tony et al., 2022 ; Jesse et al., 2023 ; Jha and Reddy, 2023 ; Wu et al., 2023 ) analyzed code generation AI models for Java. Each focused on very different aspects of cybersecurity, and they did not analyze the same vulnerabilities. Tony et al. (2022) investigated the dangers of incorrect API calls for cryptographic protocols. Their conclusion is that generative AI might not be at all optimized for generating cryptographically secure code ( Tony et al., 2022 ). The accuracy of the generated code was significantly lower on cryptographic tasks than the accuracy the AI is advertised to have on regular code ( Tony et al., 2022 ).

Jesse et al. (2023) experiments with generating simple, stupid bugs (SStuBs) with different AI models. They provide six main findings, which can be summarized as: AI models propose twice as many SStuBs as correct code, yet they also seem to help with other SStuBs ( Jesse et al., 2023 ). 11 One of the issues with SStuBs is that “where Codex wrongly generates simple, stupid bugs, these may take developers significantly longer to fix than in cases where Codex does not ( Jesse et al., 2023 ).” In addition, different AI models behave differently with regard to the SStuBs generated ( Jesse et al., 2023 ). Finally, Jesse et al. (2023) found that commenting the code leads to fewer SStuBs and more patches, even when the comments are misleading.

Wu et al. (2023) (1) analyze and compare the capabilities of different LLMs, fine-tuned LLMs, and automated program repair (APR) techniques for repairing vulnerabilities in Java; (2) propose VJBench and VJBench-trans as a “new vulnerability repair benchmark;” and (3) evaluate the studied AI models on the proposed VJBench and VJBench-trans. VJBench aims to extend the work of Vul4J and proposes 42 vulnerabilities, including 12 new CWEs that were not included in Vul4J ( Wu et al., 2023 ). Their study thus assessed 35 vulnerabilities proposed by Vul4J and 15 by the authors ( Wu et al., 2023 ). VJBench-trans, on the other hand, is composed of “150 transformed Java vulnerabilities ( Wu et al., 2023 ).” Overall, they conclude that the AI models fix very few Java vulnerabilities, with Codex fixing 20.4% of them ( Wu et al., 2023 ). Indeed, “large language models and APR techniques, except Codex, only fix vulnerabilities that require simple changes, such as deleting statements or replacing variable/method names ( Wu et al., 2023 ).” On the other hand, fine-tuning seems to help the LLMs improve at the task of fixing vulnerabilities ( Wu et al., 2023 ).

However, four APR techniques and nine LLMs did not fix the new CWEs introduced by VJBench ( Wu et al., 2023 ). Some CWEs that are not tackled are “CWE-172 (Encoding error), CWE-325 (Missing cryptographic step), CWE-444 (HTTP request smuggling; Wu et al., 2023 ),” which can have considerable cybersecurity impacts. For example, CWE-325 can weaken a cryptographic protocol, thus lowering its security guarantees. Furthermore, apart from Codex, the other AI models and APR techniques studied did not apply complex vulnerability repairs but focused on “simple changes, such as deletion of a statement ( Wu et al., 2023 ).”

Jia et al. (2023) study the possibility that a code-generation AI model is manipulated by “adversarial inputs,” that is, user inputs designed to trick the model into either misunderstanding code or producing code that behaves in an adversarially-controlled way. They tested Claw, M1, and ContraCode in both Python and Java on the following tasks: code summarization, code completion, and code clone detection ( Jia et al., 2023 ).

Finally, Jha and Reddy (2023) propose CodeAttack , which is implemented in different programming languages, including Java. 12 When tested on Java, their results show that 60% of the adversarial code generated is syntactically correct ( Jha and Reddy, 2023 ).

5.3.4 Verilog

Verilog is a hardware-description language. Unlike the other programming languages discussed so far, its purpose is not to describe software but to support the design and verification of digital circuits (at the register-transfer level of abstraction).

The articles that researched Verilog generally conclude that the AI models they studied are less effective with this language than with Python or C ( Pearce et al., 2022 , 2023 ; Nair et al., 2023 ). Different articles researched different vulnerabilities, with two specific CWEs standing out: 1271 and 1234. Pearce et al. (2022) summarizes the difficulty of deciding which CWE vulnerabilities to study for Verilog, as there is no Top-25 CWE list for hardware. Hence, their research selected vulnerabilities that could be analyzed ( Pearce et al., 2022 ). This situation makes it difficult to compare research and results, as different authors can select different focuses. The different approaches to vulnerabilities in Verilog can be seen in Table 9 , where only two CWEs are common across all studies (1271 and 1234), while others, such as 1221 ( Nair et al., 2023 ) or 1294 ( Pearce et al., 2022 ), are researched by a single article.

Note that unlike software vulnerabilities, it is much harder to agree on a list of the most relevant hardware vulnerabilities, and to the best of our knowledge there is no current consensus on the matter today.

Regarding security, both Pearce et al. (2022 , 2023) , studying OpenAI's models, indicated that in general these models struggled to produce correct, functional, and meaningful Verilog code, being less effective at the task. For example, Pearce et al. (2022) generated “198 programs. Of these, 56 (28.28%) were vulnerable. Of the 18 scenarios, 7 (38.89 %) had vulnerable top-scoring options.” Pearce et al. (2023) observes that when using these AI models to generate repair code, they firstly had to experiment with the temperature of the AI model (compared to C and Python), as it produced different results. Secondly, they conclude that the models behaved differently with Verilog than with other languages and “seemed [to] perform better with less context provided in the prompt ( Pearce et al., 2023 ).” The hypothesis for why there is a difference between Verilog and other programming languages is that less training data is available ( Pearce et al., 2022 ).

5.4 Mitigation strategies

There have been several attempts, or suggestions, to mitigate the negative effects on security of using AI to code. Though reasonable, not all are necessarily effective, as we discuss in the remainder of this section. Overall, the attempts we have surveyed discuss how to modify the different elements that affect the quality of the AI models, or the quality of the user's control over the AI-generated code. Table 10 summarizes the suggested mitigation strategies.


Table 10 . Summary of the mitigation strategies.

5.4.1 Dataset

Part of the issue is that LLMs are trained on code that is itself rife with vulnerabilities and bad practice. As a number of the AI models are not open-source or their training corpora are not available, different researchers hypothesize that the security issues arise from the training dataset ( Pearce et al., 2022 ). Adding datasets that include different programming languages with different vulnerabilities may help reduce the vulnerabilities in the output ( Pearce et al., 2022 ). This is why, to mitigate problems with dataset security quality, He and Vechev (2023) manually curated the training data for fine-tuning, which improved the output performance against the studied CWEs.

By carefully selecting training corpora that are of higher quality, which can be partially automated, there is hope that fewer issues would arise ( He and Vechev, 2023 ). However, a consequence of such a mitigation is that the size of the training set would be much reduced, which weakens the LLM's ability to generate code and generalize ( Olson et al., 2018 ). Therefore one may expect that being too picky with the training set would result, paradoxically, in a reduction in code output quality. A fully fledged study of this trade-off remains to be done.

5.4.2 Training procedure

During the training process, LLMs are scored on their ability to autoencode, that is, to accurately reproduce their input (in the face of a partially occluded input). In the context of natural language, minor errors are often acceptable and almost always have little to no impact on the meaning or understanding of a sentence. Such is not the case for code, which can be particularly sensitive to minor variations, especially in low-level programming languages. A stricter training regimen could score an LLM based not only on syntactic correctness, but also on (some degree of) semantic correctness, to limit the extent to which the model wanders away from a valid program. Unfortunately, experimental data from Liguori et al. (2023) suggests that currently no single metric succeeds at that task.
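The distinction can be sketched as follows: a syntax-only score accepts any program that parses, while a (partial) semantic score additionally requires passing a behavioral test. This is our own illustration of the gap, not a metric from Liguori et al. (2023):

```python
def syntactically_valid(src: str) -> bool:
    """Syntax check only: the program parses, but may still be wrong."""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def semantically_valid(src: str, test: str) -> bool:
    """Stricter check: the candidate must also pass a behavioral test."""
    env: dict = {}
    try:
        exec(src, env)
        exec(test, env)  # raises AssertionError on semantic failure
        return True
    except Exception:
        return False

good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"  # one-token slip, still parses
test = "assert add(2, 3) == 5"
print(syntactically_valid(buggy), semantically_valid(buggy, test))  # True False
print(semantically_valid(good, test))  # True
```

The one-token slip is invisible to the syntactic score, which is precisely why scoring on semantic correctness is harder: it requires executable specifications for every training sample.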

Alternatively, since most LLMs today come pre-trained, a better fine-tuning step could reduce the risks associated with incorrect code generation. He and Vechev (2023) took this approach and had promising results for the CWEs they investigated. However, there is conflicting evidence: results from Wu et al. (2023) seem to indicate that this approach is inherently limited to fixing a very narrow and simple class of bugs. More studies analyzing the impact of fine-tuning models with curated security datasets are needed to assess this mitigation strategy.

5.4.3 Generation procedure

Code quality improves when the model is given more context than the user typically provides in their prompts ( Pearce et al., 2022 ; Jesse et al., 2023 ). The ability to use auxiliary data, such as other project files, file names, etc., seems to explain the significant difference in code acceptance between GitHub Copilot and its bare model, OpenAI Codex. Creating guidelines and best practices for writing effective prompts may be a promising direction. Nair et al. (2023) explored the possibility of creating prompt strategies and techniques for ChatGPT that would output secure code.
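As an illustration of such a prompting strategy, the sketch below builds a repair-style prompt in the spirit of those studied by Pearce et al. (2023); the template wording and helper names are hypothetical, and the studies show the exact phrasing and amount of context materially affect the result:

```python
# Hypothetical repair-prompt template (not the exact prompts from the studies).
REPAIR_TEMPLATE = (
    "# The following function contains a {cwe} vulnerability:\n"
    "{vulnerable_code}\n"
    "# Fixed version of the function, with the {cwe} vulnerability removed:\n"
)

def build_repair_prompt(vulnerable_code: str, cwe: str) -> str:
    """Assemble a prompt asking the model to regenerate the code securely."""
    return REPAIR_TEMPLATE.format(cwe=cwe, vulnerable_code=vulnerable_code)

prompt = build_repair_prompt(
    "def get_user(name):\n"
    "    return db.execute(f\"SELECT * FROM users WHERE name = '{name}'\")",
    "CWE-089 (SQL injection)",
)
print(prompt)
```

The completion requested after the final comment line is where the model is expected to emit the repaired function; naming the CWE in the prompt is one of the context cues these studies vary.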

From an adversarial point of view, Niu et al. (2023) provides evidence of the impact of context and prompts for exploiting AI models. There are ongoing efforts to limit which prompts are accepted by AI systems by safeguarding them ( Pa Pa et al., 2023 ). However, Pa Pa et al. (2023) showed, with mixed results, how to bypass these limitations, in what is called “jailbreaking.” Further work is needed on this area as a mitigation strategy and on its effectiveness.

Independently, post-processing the output (SVEN is one example; He and Vechev, 2023 ) has a measurable impact on code quality, and is LLM-agnostic, operating without the need for re-training or fine-tuning. Presumably, non-LLM static analyzers or linters could be integrated into the code generation procedure to provide checks along the way and avoid producing code that is visibly incorrect or dangerous.
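A minimal sketch of such a generation-time gate, using Python's standard ast module to reject outputs containing a denylist of visibly dangerous calls (the denylist itself is a hypothetical example; a real pipeline would delegate to a full static analyzer):

```python
import ast

# Illustrative denylist of calls a post-processing gate might reject.
DANGEROUS_CALLS = {("os", "system"), ("subprocess", "call"), (None, "eval"), (None, "exec")}

def visibly_dangerous(src: str) -> list[str]:
    """Return a description of each denylisted call found in the code."""
    findings = []
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Call):
            f = node.func
            if isinstance(f, ast.Attribute) and isinstance(f.value, ast.Name):
                key = (f.value.id, f.attr)      # e.g. os.system(...)
            elif isinstance(f, ast.Name):
                key = (None, f.id)              # e.g. eval(...)
            else:
                continue
            if key in DANGEROUS_CALLS:
                findings.append(f"line {node.lineno}: {ast.unparse(f)}")
    return findings

generated = "import os\nos.system('cat /etc/passwd')\nprint(eval(user_input))\n"
print(visibly_dangerous(generated))
```

A generation loop could resample or flag any candidate for which this check returns findings, before the code ever reaches the user.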

5.4.4 Integration of AI-generated code into software

Even after all technical countermeasures have been taken to avoid producing code that is obviously incorrect, there remain situations where AI-generated programs contain (non-obvious) vulnerabilities. To a degree, such vulnerabilities could also appear in human-written code, and there should in any case be procedures to try and catch them as early as possible, through unit, functional, and integration testing, fuzzing, or static analysis. Implementation of security policies and processes remains vital.
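As an illustration of catching such non-obvious defects early, even a tiny fuzzing harness can exercise a generated function on randomized inputs. The plausible-looking helper below is our own example of a generated function with a hidden edge-case bug:

```python
import random

# Suppose an AI assistant generated this helper; it looks plausible but
# mishandles the empty list, a non-obvious edge case.
def generated_average(xs):
    return sum(xs) / len(xs)

def fuzz(fn, trials=200):
    """Throw randomized inputs at a generated function and record crashes."""
    failures = []
    random.seed(0)  # deterministic for reproducibility
    for _ in range(trials):
        xs = [random.randint(-10, 10) for _ in range(random.randint(0, 5))]
        try:
            fn(xs)
        except Exception as exc:
            failures.append((xs, type(exc).__name__))
    return failures

crashes = fuzz(generated_average)
print(len(crashes), crashes[0])  # the empty-list inputs raise ZeroDivisionError
```

Such lightweight property testing slots naturally into the review pipeline for AI-generated code, alongside static analysis and conventional test suites.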

However AI models are specifically trained to produce code that looks correct, meaning that their mistakes may be of a different nature or appearance than those typically made by human software programmers, and may be harder to spot. At the same time, the very reason why code generation is appealing is that it increases productivity, hence the amount of code in question.

It is therefore essential that software developers who rely on AI code generation keep a level of mistrust with regard to these tools ( Perry et al., 2023 ). It is also likely that code review methodologies will need to be adjusted for AI-generated code, to look for the specific kinds of mistakes or vulnerabilities this approach produces.

5.4.5 End-user education

One straightforward suggestion is educating users to assess the quality of software generated with AI models. Among the works we reviewed, we found no studies that specifically discuss the quality and efficacy of this potential mitigation strategy, so we can only speculate based on related work. For instance, Moradi Dakhel et al. (2023) compare the code produced by human users with the code generated by GitHub Copilot. The study is not about security but about the correctness of implementations of well-known algorithms. Still, human users (students with an education in algorithms) performed better than their AI counterpart, but the buggy solutions generated by Copilot were easily fixed by the users. Notably, the AI-generated bugs were more easily recognizable and fixable than those produced by other human developers performing the same task.

This observation suggests that AI could help programmers skilled in debugging write code faster, and that debugging AI output should hold no particular complexity for them. As Chen et al. (2021) put it, "human oversight and vigilance is required for safe use of code generation systems like Codex." However, removing obvious errors from buggy implementations of well-known algorithms is not the same as spotting security vulnerabilities: the latter task is complex and error-prone, even for experts. We speculate that, if AI-generated flaws are naïve, programmers can still gain from using AI provided they back up coding with the other instruments of security engineering (e.g., property checking, code inspection, and static analysis). Design changes or decisions at the user interface may also have an impact. However, we have no evidence of whether this speculative idea works in practice; the question remains open and calls for future research.

6 Threats to validity and future work

Previous literature ( Wohlin et al., 2013 ; Petersen et al., 2015 ) has identified different reliability and validity issues in systematic literature reviews. The first element to note is the sample of papers. As explained by Petersen et al. (2015) , the difference between systematic mapping studies and systematic literature reviews lies in the sample's representativeness: mappings do not necessarily need to cover the whole universe of papers, whereas literature reviews do. Nevertheless, previous research has found that even two literature reviews on exactly the same subject do not end up with the same sample of papers, which affects reliability. Consequently, to increase reliability, we identified the PICO of our research and used gold-standard SLR methods, such as Kitchenham and Charters (2007) . This strategy helped us develop different search strings for the databases in order to obtain the best possible results. Furthermore, aiming for a complete sample, we performed forward snowballing on the whole sample obtained in the first round, as suggested by Wohlin et al. (2013) and Petersen et al. (2015) .
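The forward-snowballing step can be sketched as follows; `citations_of` is a stand-in for a lookup against a real citation index, not an actual API.

```python
# Illustrative sketch of forward snowballing: starting from the papers
# found via database searches, add every paper that cites them, then
# de-duplicate. `citations_of` is a hypothetical citation-index lookup.

def forward_snowball(seed_papers, citations_of):
    sample = set(seed_papers)
    for paper in seed_papers:
        sample.update(citations_of(paper))  # papers citing this one
    return sample

# toy citation graph for demonstration
graph = {"A": ["C", "D"], "B": ["D", "E"]}
final_sample = forward_snowball(["A", "B"], lambda p: graph.get(p, []))
```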

However, there may still be reliability issues with the sample. Firstly, the number of publications on these subjects increases daily, so the total would vary depending on the day the sample was obtained. Furthermore, some research on open repositories (such as arXiv) did not explicitly indicate whether it had been peer-reviewed, so the authors manually checked whether each such paper had been accepted at a peer-reviewed venue. We hypothesize that this is why the snowballing phase provided many more papers: they had yet to be indexed in the databases and were only available on open repositories. The final sample of this research may therefore grow and change depending on the day the data is gathered.

In addition, the sample may differ based on the definition of "code generation." For this research, and as explained in Section 4 , we worked with the idea that AI models should suggest code (working or not). In some cases papers fell under our scope even when their main topic was "verification and validation," because the AI tools they propose suggest code. Hence, we focus not only on the development phase of the SDLC but on any phase in which code is suggested. A different handling of "code generation" may yield different results.

On another note, the background and expertise of the researchers affect how papers are classified and how information is extracted ( Wohlin et al., 2013 ). For this reason, we used known taxonomies and definitions for our classification schemes, such as Wieringa et al. (2006) for the type of research, and MITRE's top vulnerabilities to identify the most commonly discussed vulnerabilities. The objective of using well-known classification schemes and methodologies is to reduce bias ( Petersen et al., 2015 ); however, a complete elimination of bias cannot be ruled out.

Moreover, to counter authors' bias, every article was reviewed and its data extracted by at least two authors, using a pairing strategy. If, due to time constraints, an article was reviewed by only one author, another author would review that work ( Wohlin et al., 2013 ). If disagreements appeared at any phase, such as inclusion/exclusion or data gathering, a meeting was held to discuss them ( Wohlin et al., 2013 ). For example, for a couple of papers, Author #1 was unsure whether they should be included or excluded based on the quality review, and this was discussed with Author #4. Our objective in using a pairing strategy was to diminish authors' bias throughout the SLR.

Regarding the analysis and comparison of the different articles, one threat to the validity of this SLR is that not all articles use the same taxonomy for vulnerabilities, so they could not be classified under a single scheme. Some articles research MITRE's CWE or the Top-25, while others tackle more specific issues (such as jailbreaking, malware creation, SSBs, or human programming). Therefore, comparing vulnerabilities across articles is, at best, complicated and, at worst, a threat to our conclusions. Given the lack of a classification scheme for the wide range of security issues tackled in our sample, we (1) classified the papers based on their own claims, and (2) compared papers by the programming language used and among papers researching similar subjects, such as MITRE's CWE. In this manner, we avoided comparing completely different subjects. As recognized by Petersen et al. (2015) , the lack of a classification scheme for specific subjects is a common challenge for systematic mapping studies and literature reviews. Nevertheless, future studies would benefit from a better classification approach if the sample permits.

We have provided the whole sample at https://doi.org/10.5281/zenodo.10666386 for replication and transparency, with the process explained in detail. Each paper is annotated with why it was included or excluded, at which phase, and with details and/or comments to help readers understand and replicate our research. Likewise, we have explained our research methods in as much detail as possible in the paper. Providing the data openly and in detail also helps mitigate validity issues that may be present in this study.

Nonetheless, even when using well-known strategies both for the SLR and for mitigating known issues, we cannot rule out the inherent validity and reliability limitations common to all SLRs. We made our best effort to mitigate them.

7 Conclusion

By systematically reviewing the state of the art, we aimed to provide insight into the question, "How does code generation from AI models impact the cybersecurity of the software process?" We can confirm that there is enough evidence to say, unsurprisingly, that code generated by AI is not necessarily secure and does contain security flaws. But, as often happens with AI, the real question is not whether AI is infallible but whether it performs better than humans doing the same task. Unfortunately, the conclusions we gathered from the literature diverge on whether AI-generated security flaws should be approached with particular caution, for instance because of their severity or because they are tricky to spot. Some works report them as naïve and easily detectable, but that result cannot be generalized. Overall, there is no clear support for one hypothesis over the other, because of incomparable differences among the papers' experimental setups, the data sets used for training, the programming languages considered, the types of flaws, and the experimental methodologies followed.

Generally speaking, and regardless of the code production activity (generating code from scratch, generating code repairs, or suggesting code), our analysis reveals that well-documented vulnerabilities have been found in AI-suggested code, and a non-negligible number of times. Among the many, specific vulnerabilities, such as those in the MITRE CWE Top-25, have received special attention in current research, and for good reason. For instance, CWE-787 and CWE-089 received particular attention, as they are in the top 3 of MITRE's CWE list. Furthermore, the CWE security scores of code suggested by AI models vary, with some CWEs being more prevalent than others.

Other works report having found naïve bugs that were easy to fix; others discovered malware code hidden among benign lines; and still others reported unjustified trust by humans in the quality of AI-generated code, an issue of a more socio-technical nature.

Similarly, different programming languages show different security performance when code is generated with AI support. AI-generated Python code seems to be more secure (i.e., to have fewer bugs) than AI-generated code of the C family. Different authors have hypothesized that this is a consequence of the training data set and its quality. Verilog seems to suffer from shortcomings similar to C's: when comparing the security of AI-generated Verilog to that of C or Python, the literature converges on reporting that Verilog fares worse. Once again, the suggested reason is that the available training data sets for Verilog are smaller and of lower quality than those available for training AI models to generate C or Python code. In addition, there is no identified Top-25 CWE list for Verilog. Java is another commonly studied programming language, with conclusions similar to those stated above. Other programming languages were studied to a lesser extent and deserve further investigation.

Looking at security exploits enabled by AI-generated code with security weaknesses, three are the most frequently reported: SVEN, CodeAttack, and Codex Leaks. These attacks are reported to decrease code security, create adversarial code, and leak personal data from automatically generated code, respectively.

What can be done to mitigate the severity of flaws introduced by AI? Does the literature suggest giving up on AI entirely? No: AI is considered an instrument that, despite being imperfect, has a clear advantage in speeding up code production. Instead, different mitigation strategies are suggested, although more research is required to assess their effectiveness and efficacy.

• Modifications to the dataset are a possibility, but the impacts and trade-offs of such an approach need to be studied;

• Raising awareness of the context of prompts, and of how to increase their quality, seems to positively affect the security of the generated code;

• Security processes, policies, and a degree of mistrust of AI-generated code could help with security. In other words, AI-generated code should pass specific processes, such as testing and security verification, before being accepted;

• Educating end-users about the limits of AI models for code generation could help. Future research is required in this area.

As a closing remark, we welcome the growing interest in studying the security impact of AI models, and the increased attention the community is dedicating to how insecure our systems may become as developers continue to rely on AI support for their work. However, it is still premature to draw conclusions on the impact of the flaws introduced by AI models and, in particular, on how those flaws compare with those introduced by human programmers. Although several mitigation techniques have been suggested, which combination of them is efficient or practical is a question that still needs experimental data.

Surely, we have to accept that AI will be used more and more to produce code, and that both the practice and the tools are still far from flawless. Until more evidence is available, the general agreement is to exert caution: AI models for code generation need to be approached with due care.

Data availability statement

The dataset of the sample of papers for this study can be found at: https://zenodo.org/records/11092334 .

Author contributions

CN-R: Conceptualization, Data curation, Investigation, Methodology, Project administration, Resources, Validation, Writing—original draft, Writing—review & editing. RG-S: Investigation, Visualization, Writing—original draft, Writing—review & editing. AS: Conceptualization, Investigation, Methodology, Writing—review & editing. GL: Conceptualization, Funding acquisition, Investigation, Writing—original draft, Writing—review & editing.

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant: NCER22/IS/16570468/NCER-FT.

Acknowledgments

The authors thank Marius Lombard-Platet for his feedback, comments, and for proof-reading the paper.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. ^ The number of different tokens that a model can handle, and their internal representation, is a design choice.

2. ^ This is the meaning of “GPT:” generative pre-trained transformer.

3. ^ Some authors claim that, because there is an encoding-decoding step, and the output is probabilistic, data is not directly copy-pasted. However seriously this argument can be taken, LLMs can and do reproduce parts of their training set ( Huang et al., 2023 ).

4. ^ We note that certain CWE prompting scenarios had dissimilar security rates when compared across authors.

5. ^ The authors do highlight that their proposal is not a poisoning attack.

6. ^ In reality, multiple (broadly incompatible) versions of Python coexist, but this is unimportant in the context of our discussion and we refer to them collectively as “Python.”

7. ^ One could argue for instance that the vulnerabilities occur in large proportions in generated code that fails basic functional testing, and would never make it into production because of this. Or, the other way around, that code without security vulnerabilities could still be functionally incorrect, which also causes issues. A full study of these effects remains to be done.

8. ^ They were tasked to write a program that “takes as input a string path representing a file path and returns a File object for the file at 'path' ( Perry et al., 2023 ).”

9. ^ Following the authors of our sample, we use “C” to refer to the various versions of the C standard, indiscriminately.

10. ^ Here again we conflate all versions of Java together.

11. ^ The authors define single stupid bugs as “...bugs that have single-statement fixes that match a small set of bug templates. They are called 'simple' because they are usually fixed by small changes and 'stupid' because, once located, a developer can usually fix them quickly with minor changes ( Jesse et al., 2023 ).”

12. ^ The attack is explained in detail in Section 5.2.

Ahmad, W., Chakraborty, S., Ray, B., and Chang, K.-W. (2021). “Unified pre-training for program understanding and generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , eds. K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, et al. (Association for Computational Linguistics), 2655–2668.


Asare, O., Nagappan, M., and Asokan, N. (2023). Is GitHub's Copilot as bad as humans at introducing vulnerabilities in code? Empir. Softw. Eng . 28:129. doi: 10.48550/arXiv.2204.04741


Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., and Santos, E. A. (2023). “Programming is hard-or at least it used to be: educational opportunities and challenges of ai code generation,” in Proceedings of the 54th ACM Technical Symposium on Computer Science Education V.1 (New York, NY), 500–506.

Botacin, M. (2023). “GPThreats-3: is automatic malware generation a threat?” in 2023 IEEE Security and Privacy Workshops (SPW) (San Francisco, CA: IEEE), 238–254.

Britz, D., Goldie, A., Luong, T., and Le, Q. (2017). Massive exploration of neural machine translation architectures. ArXiv e-prints . doi: 10.48550/arXiv.1703.03906

Burgess, M. (2023). Criminals Have Created Their Own ChatGPT Clones . Wired.

Carrera-Rivera, A., Ochoa, W., Larrinaga, F., and Lasa, G. (2022). How-to conduct a systematic literature review: a quick guide for computer science research. MethodsX 9:101895. doi: 10.1016/j.mex.2022.101895


Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., et al. (2021). Evaluating large language models trained on code. CoRR abs/2107.03374. doi: 10.48550/arXiv.2107.03374

Fan, J., Li, Y., Wang, S., and Nguyen, T. N. (2020). “A C/C + + code vulnerability dataset with code changes and CVE summaries,” in Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20 (New York, NY: Association for Computing Machinery), 508–512.

Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., et al. (2020). “CodeBERT: a pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020 , eds. T. Cohn, Y. He, and Y. Liu (Association for Computational Linguistics), 1536–1547.

Fried, D., Aghajanyan, A., Lin, J., Wang, S. I., Wallace, E., Shi, F., et al. (2022). InCoder: a generative model for code infilling and synthesis. ArXiv abs/2204.05999. doi: 10.48550/arXiv.2204.05999

Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., et al. (2021). “GraphCodeBERT: pre-training code representations with data flow,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021 . OpenReview.net .

He, J., and Vechev, M. (2023). “Large language models for code: Security hardening and adversarial testing,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (New York, NY), 1865–1879.

Henkel, J., Ramakrishnan, G., Wang, Z., Albarghouthi, A., Jha, S., and Reps, T. (2022). “Semantic robustness of models of source code,” in 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (Honolulu, HI), 526–537.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). “The curious case of neural text degeneration,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020 . OpenReview.net .

Huang, Y., Li, Y., Wu, W., Zhang, J., and Lyu, M. R. (2023). Do Not Give Away My Secrets: Uncovering the Privacy Issue of Neural Code Completion Tools .

HuggingFaces (2022). Codeparrot. Available online at: https://huggingface.co/codeparrot/codeparrot (accessed February, 2024).

Jain, P., Jain, A., Zhang, T., Abbeel, P., Gonzalez, J., and Stoica, I. (2021). “Contrastive code representation learning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , eds. M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (Punta Cana: Association for Computational Linguistics), 5954–5971.

Jesse, K., Ahmed, T., Devanbu, P. T., and Morgan, E. (2023). “Large language models and simple, stupid bugs,” in 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR) (Los Alamitos, CA: IEEE Computer Society), 563–575.

Jha, A., and Reddy, C. K. (2023). “CodeAttack: code-based adversarial attacks for pre-trained programming language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37 , 14892–14900.

Jia, J., Srikant, S., Mitrovska, T., Gan, C., Chang, S., Liu, S., et al. (2023). “CLAWSAT: towards both robust and accurate code models,” in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (Los Alamitos, CA: IEEE), 212–223.

Karampatsis, R.-M., and Sutton, C. (2020). “How often do single-statement bugs occur? The manySStuBs4J dataset,” in Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20 (Seoul: Association for Computing Machinery), 573–577.

Kitchenham, B., and Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Tech. Rep. Available online at: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=CQDOm2gAAAAJ&citation_for_view=CQDOm2gAAAAJ:d1gkVwhDpl0C

Kitchenham, B., Sjøberg, D. I., Brereton, O. P., Budgen, D., Dybå, T., Höst, M., et al. (2010). “Can we evaluate the quality of software engineering experiments?,” in Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY), 1–8.

Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., et al. (2023). StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 . doi: 10.48550/arXiv.2305.06161

Liguori, P., Improta, C., Natella, R., Cukic, B., and Cotroneo, D. (2023). Who evaluates the evaluators? On automatic metrics for assessing AI-based offensive code generators. Expert Syst. Appl. 225:120073. doi: 10.48550/arXiv.2212.06008

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692. doi: 10.48550/arXiv.1907.11692

Moradi Dakhel, A., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M. C., and Jiang, Z. M. J. (2023). GitHub Copilot AI pair programmer: asset or liability? J. Syst. Softw . 203:111734. doi: 10.48550/arXiv.2206.15331

Multiple authors (2021). GPT Code Clippy: The Open Source Version of GitHub Copilot .

Nair, M., Sadhukhan, R., and Mukhopadhyay, D. (2023). “How hardened is your hardware? Guiding ChatGPT to generate secure hardware resistant to CWEs,” in International Symposium on Cyber Security, Cryptology, and Machine Learning (Berlin: Springer), 320–336.

Natella, R., Liguori, P., Improta, C., Cukic, B., and Cotroneo, D. (2024). AI code generators for security: friend or foe? IEEE Secur. Priv . 2024:1219. doi: 10.48550/arXiv.2402.01219

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., et al. (2023). CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis . ICLR.

Nikitopoulos, G., Dritsa, K., Louridas, P., and Mitropoulos, D. (2021). “CrossVul: a cross-language vulnerability dataset with commit data,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2021 (New York, NY: Association for Computing Machinery), 1565–1569.

Niu, L., Mirza, S., Maradni, Z., and Pöpper, C. (2023). “CodexLeaks: privacy leaks from code generation language models in GitHub's Copilot,” in 32nd USENIX Security Symposium (USENIX Security 23) , 2133–2150.

Olson, M., Wyner, A., and Berk, R. (2018). Modern neural networks generalize on small data sets. Adv. Neural Inform. Process. Syst . 31, 3623–3632. Available online at: https://proceedings.neurips.cc/paper/2018/hash/fface8385abbf94b4593a0ed53a0c70f-Abstract.html

Pa Pa, Y. M., Tanizaki, S., Kou, T., Van Eeten, M., Yoshioka, K., and Matsumoto, T. (2023). “An attacker's dream? Exploring the capabilities of chatgpt for developing malware,” in Proceedings of the 16th Cyber Security Experimentation and Test Workshop (New York, NY), 10–18.

Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri, R. (2022). “Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions,” in 2022 IEEE Symposium on Security and Privacy (SP) (IEEE), 754–768.

Pearce, H., Tan, B., Ahmad, B., Karri, R., and Dolan-Gavitt, B. (2023). “Examining zero-shot vulnerability repair with large language models,” in 2023 IEEE Symposium on Security and Privacy (SP) (Los Alamitos, CA: IEEE), 2339–2356.

Perry, N., Srivastava, M., Kumar, D., and Boneh, D. (2023). “Do users write more insecure code with AI assistants?,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (New York, NY), 2785–2799.

Petersen, K., Vakkalanka, S., and Kuzniarz, L. (2015). Guidelines for conducting systematic mapping studies in software engineering: an update. Inform. Softw. Technol . 64, 1–18. doi: 10.1016/j.infsof.2015.03.007

Sandoval, G., Pearce, H., Nys, T., Karri, R., Garg, S., and Dolan-Gavitt, B. (2023). “Lost at C: a user study on the security implications of large language model code assistants,” in 32nd USENIX Security Symposium (USENIX Security 23) (Anaheim, CA: USENIX Association), 2205–2222.

Siddiq, M. L., Majumder, S. H., Mim, M. R., Jajodia, S., and Santos, J. C. (2022). “An empirical study of code smells in transformer-based code generation techniques,” in 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM) (Limassol: IEEE), 71–82.

Storhaug, A., Li, J., and Hu, T. (2023). “Efficient avoidance of vulnerabilities in auto-completed smart contract code using vulnerability-constrained decoding,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) (Los Alamitos, CA: IEEE), 683–693.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., and Liu, C. (2018). “A survey on deep transfer learning,” in Artificial Neural Networks and Machine Learning – ICANN 2018 , eds. V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Cham. Springer International Publishing), 270–279.

Tony, C., Ferreyra, N. E. D., and Scandariato, R. (2022). “GitHub considered harmful? Analyzing open-source projects for the automatic generation of cryptographic API call sequences,” in 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS) (Guangzhou: IEEE), 270–279.

Tony, C., Mutas, M., Ferreyra, N. E. D., and Scandariato, R. (2023). “LLMSecEval: a dataset of natural language prompts for security evaluations,” in 20th IEEE/ACM International Conference on Mining Software Repositories, MSR 2023, Melbourne, Australia, May 15-16, 2023 (Los Alamitos, CA: IEEE), 588–592.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017 , eds. I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, et al. (Long Beach, CA), 5998–6008.

Wang, Y., Wang, W., Joty, S. R., and Hoi, S. C. H. (2021). “CodeT5: identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , 8696–8708.

Wartschinski, L., Noller, Y., Vogel, T., Kehrer, T., and Grunske, L. (2022). VUDENC: vulnerability detection with deep learning on a natural codebase for Python. Inform. Softw. Technol . 144:106809. doi: 10.48550/arXiv.2201.08441

Wieringa, R., Maiden, N., Mead, N., and Rolland, C. (2006). Requirements engineering paper classification and evaluation criteria: a proposal and a discussion. Requir. Eng . 11, 102–107. doi: 10.1007/s00766-005-0021-6

Wohlin, C. (2014). “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (New York, NY), 1–10.

Wohlin, C., Runeson, P., Neto, P. A. d. M. S., Engström, E., do Carmo Machado, I., and De Almeida, E. S. (2013). On the reliability of mapping studies in software engineering. J. Syst. Softw . 86, 2594–2610. doi: 10.1016/j.jss.2013.04.076

Wu, Y., Jiang, N., Pham, H. V., Lutellier, T., Davis, J., Tan, L., et al. (2023). “How effective are neural networks for fixing security vulnerabilities,” in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023 (New York, NY: Association for Computing Machinery), 1282–1294.


Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). “A systematic evaluation of large language models of code,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (New York, NY), 1–10.

Keywords: artificial intelligence, security, software engineering, programming, code generation

Citation: Negri-Ribalta C, Geraud-Stewart R, Sergeeva A and Lenzini G (2024) A systematic literature review on the impact of AI models on the security of code generation. Front. Big Data 7:1386720. doi: 10.3389/fdata.2024.1386720

Received: 15 February 2024; Accepted: 22 April 2024; Published: 13 May 2024.


Copyright © 2024 Negri-Ribalta, Geraud-Stewart, Sergeeva and Lenzini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Claudia Negri-Ribalta, claudia.negriribalta@uni.lu

This article is part of the Research Topic

Cybersecurity and Artificial Intelligence: Advances, Challenges, Opportunities, Threats

Academia Insider

How To Use Elicit For Literature Review: AI Research Assistant 101

Navigating the vast sea of academic research can be daunting. Fortunately, Elicit, an advanced AI-driven tool, offers a streamlined solution for conducting comprehensive literature reviews.

 This article will guide you through the step-by-step process of using Elicit to efficiently locate, analyse, and organise relevant research papers.

Whether you’re a seasoned academic or a novice researcher, understanding how to leverage Elicit’s capabilities can significantly enhance your research efficiency and effectiveness. Let’s dive into how you can make the most of this powerful tool.

How To Use Elicit For Literature Review

What Is Elicit AI (elicit.org)?

Elicit is an AI research assistant revolutionising the way you perform literature review research. At its core, Elicit.org allows you to automate parts of the research process that traditionally consumed hours.

When you use Elicit, you start by entering a specific research question.

The system uses advanced AI to filter through millions of research articles, showing relevant papers and summaries of key information about those papers in an easy-to-use format. 

Elicit’s capabilities extend to refining search results by:

  • citation count, or
  • study type,

which is particularly useful for conducting a systematic review. You can even see the number of citations a paper has received, helping gauge its impact and relevance. 

Plus, the tool offers options to export data as a CSV or BibTeX file, integrating smoothly with reference managers like Zotero, which is a boon for maintaining academic integrity.

What sets Elicit apart is its mission to automate and scale your research workflow. Whether you’re behind academia’s paywalls or exploring open access, Elicit navigates the terrain, ensuring you don’t overlook any critical piece of literature.

This makes your literature review process not only faster but also more exhaustive, leaving you free to focus on synthesis and analysis rather than the mechanics of the literature search.

Steps To Use Elicit For Literature Review

If you are looking for a simple step-by-step guide to use Elicit for literature review, here’s a guide for you to start with: 

Step 1: Start with a Specific Research Question

When you log into Elicit.org, you’ll be prompted to enter a research question. This should be as specific as possible to ensure the results are directly relevant to your study.

If you’re interested in how virtual reality affects learning outcomes, your query could be “What are the impacts of virtual reality on student engagement and learning outcomes in higher education?”


Step 2: Review the Search Results

Once you submit your question, Elicit utilises AI to sift through vast databases, displaying research papers that align with your query.

The results are presented with:

  • abstract summaries, and
  • the number of citations,

helping you gauge the relevance and influence of each study at a glance.

Step 3: Use Filters to Refine Your Search

Elicit provides various filters to refine your results further. You can filter papers by:

  • publication date,
  • study type, or
  • the number of citations.

This functionality is particularly helpful if you’re conducting a systematic review and need to adhere to specific criteria.

Step 4: Analyse Abstract Summaries

The abstract summaries provided by Elicit are generated using AI, offering a concise overview of each paper. You can quickly scan through to see if there are any papers that may be relevant to your work.

While these summaries are useful for quick scans, it’s crucial to access the full papers for a thorough review, ensuring the AI’s interpretation aligns with the actual content.

Step 5: Dive Deeper into Selected Papers

For papers that seem particularly relevant, click on them to see more detailed information. Elicit allows you to:

  • read the full abstract,
  • check the paper’s citation history, and
  • view any available PDFs.

This step is vital for understanding the context and methodology of the research, ensuring it fits your review’s scope.

Step 6: Export Data for Easy Access

You can export the data you find useful directly from Elicit in formats like CSV or BibTeX, which can be imported into reference management tools like Zotero.

This feature supports maintaining an organised and accessible bibliography and references, essential for academic integrity.
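
Once exported, that file can also be screened programmatically before it ever reaches your reference manager. The sketch below is a minimal, hypothetical example: the column names (`Title`, `Year`, `Citation count`) are assumptions for illustration, so inspect your own Elicit export and adjust them before adapting this.

```python
import csv
import io

# Hypothetical sample mimicking an Elicit CSV export; real column
# names may differ -- check your own export before adapting this.
sample_export = """Title,Year,Citation count
VR and student engagement in higher education,2021,154
A pilot study of VR lectures,2023,3
Immersive learning outcomes meta-analysis,2019,310
"""

def screen_by_citations(csv_text, min_citations=10):
    """Keep rows whose citation count meets a screening threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Title"] for row in reader
            if int(row["Citation count"]) >= min_citations]

kept = screen_by_citations(sample_export, min_citations=10)
print(kept)  # only the well-cited papers survive this pass
```

A quick pass like this is no substitute for reading abstracts, but it can help enforce explicit inclusion criteria in a systematic review.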

Step 7: “Star” Relevant Papers

As you comb through the search results, you might find papers that warrant closer examination later. Elicit’s “star” feature allows you to bookmark these papers.


You can easily access your starred list at any point, which helps in structuring your literature review and ensuring no critical research is overlooked.

Step 8: Adjust Your Query as Needed

Based on the papers you find, you might discover new keywords or concepts to explore. Elicit allows you to modify your search terms in real-time, dynamically adjusting the displayed results.

This iterative process helps you home in on the most pertinent information. Feel free to use this feature until you have found what you need.

Step 9: Utilize Elicit’s Additional Features

Elicit also offers advanced features like suggesting related research questions or identifying methodological critiques within studies.

These insights can provide new directions for your review or highlight potential limitations in existing research.

Step 10: Continuously Update Your Review

Literature reviews are often ongoing projects, especially in fast-evolving fields. Fortunately, Elicit can help you keep up with this.

Elicit’s user-friendly interface and real-time data updates make it easy to add new research as it becomes available, ensuring your review remains current and comprehensive.


Use AI Tools For Academic Work

By leveraging Elicit’s capabilities, you can significantly reduce the time and effort typically required for literature searches.

This AI-driven tool not only streamlines finding relevant papers but also enhances your ability to analyse and synthesise key information effectively. 

With Elicit as your research assistant, you’re well-equipped to undertake even the most complex literature reviews, making your research process more systematic and efficient. 


Dr Andrew Stapleton has a Masters and PhD in Chemistry from the UK and Australia. He has many years of research experience and has worked as a Postdoctoral Fellow and Associate at a number of universities. Despite having secured funding for his own research, he left academia to help others with his YouTube channel all about the inner workings of academia and how to make it work for you.


Duke University Libraries

Literature Reviews

  • Artificial intelligence (AI) tools
  • Getting started
  • Types of reviews
  • 1. Define your research question
  • 2. Plan your search
  • 3. Search the literature
  • 4. Organize your results
  • 5. Synthesize your findings
  • 6. Write the review

Introduction to AI

Research Rabbit, Copilot (powered by ChatGPT-4)

  • Thompson Writing Studio
  • Need to write a systematic review?


Generative AI tools have been receiving a lot of attention lately because they can create content like text, images, and music. These tools employ machine learning algorithms that can produce unique and sometimes unexpected results. Generative AI has opened up exciting possibilities in different fields, such as language models like GPT and image generators.

However, students need to approach these tools with awareness and responsibility. Here are some key points to consider:

Novelty and Creativity : Generative AI tools can produce content that is both innovative and unexpected. They allow users to explore new ideas, generate unique artworks, and even compose original music. This novelty is one of their most exciting aspects.

Ethical Considerations : While generative AI offers creative potential, it also raises ethical questions. Students should be aware of potential biases, unintended consequences, and the impact of their generated content. Responsible use involves considering the broader implications.

Academic Integrity : When using generative AI tools for academic purposes, students should consult their instructors. Policies regarding the use of AI-generated content may vary across institutions. Always seek guidance to ensure compliance with academic integrity standards.

In summary, generative AI tools are powerful and fascinating, but students should approach them thoughtfully, seek guidance, and adhere to institutional policies. Please refer to the Duke Community Standard  for questions related to ethical AI use.



Research Rabbit is a literature mapping tool that takes one paper and performs backward- and forward citation searching in addition to recommending "similar work." It scans the Web for publicly available content to build its "database" of work.

Best suited for...

Disciplines whose literature is primarily published in academic journals.

Considerations

  • Integrates with Zotero
  • Works mostly with just journal articles
  • Potential for bias in citation searching/mapping

»   researchrabbit.ai   «
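
The backward- and forward-citation searching described above is easy to picture as operations on a citation graph. This is a toy sketch only, not Research Rabbit's implementation; the mini graph below is invented for illustration.

```python
# Toy citation graph: paper -> list of papers it cites (invented data).
cites = {
    "seed":     ["classic1", "classic2"],
    "newer1":   ["seed", "classic1"],
    "newer2":   ["seed"],
    "classic1": [],
    "classic2": [],
}

def backward(paper):
    """Backward search: the papers a given paper cites (its references)."""
    return set(cites.get(paper, []))

def forward(paper):
    """Forward search: the papers that cite the given paper."""
    return {p for p, refs in cites.items() if paper in refs}

print(sorted(backward("seed")))  # ['classic1', 'classic2']
print(sorted(forward("seed")))   # ['newer1', 'newer2']
```

Starting from one seed paper, repeating these two lookups over a real bibliographic database is what lets a mapping tool grow a neighborhood of related work in both directions.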


What is it?

Elicit is a tool that semi-automates time-intensive research processes, such as summarizing papers, extracting data, and synthesizing information. Elicit pulls academic literature from Semantic Scholar, an academic search engine that also uses machine learning to summarize information.

Empirical research (e.g., the sciences, especially biomedicine).

  • Both free and paid versions
  • Doesn't work well in identifying facts or in theoretical/non-empirical research (e.g., the humanities)
  • Potential biases in the natural language processing (NLP) algorithms
  • Summarized information and extracted data will still need to be critically analyzed and verified for accuracy by the user

»   elicit.com   «


Think of Consensus as ChatGPT for research! Consensus is "an AI-powered search engine designed to take in research questions, find relevant insights within research papers, and synthesize the results using the power of large language models" (Consensus.app). Consensus runs its language model over its entire body of scientific literature (which is sourced from Semantic Scholar) and extracts the “key takeaway” from every paper.

The social sciences and sciences (non-theoretical disciplines).

  • Free and paid versions
  • Similar to Elicit, Consensus should not be used to ask questions about basic facts
  • Consensus recommends that you ask questions related to research that has already been conducted by scientists
  • Potential for biases in the input data from participants

»   consensus.app   «


Dubbed the "AI-powered Swiss Army Knife for information discovery," Perplexity is used for answering questions (including basic facts, a function that many other AI tools are not adept at doing), exploring topics in depth utilizing Microsoft's Copilot, organizing your research into a library, and interacting with your data (including asking questions about your files).

Perplexity has wide-reaching applications and could be useful across disciplines.

  • Free and paid pro versions (the pro version utilizes Microsoft's Copilot AI tool)
  • Available in desktop, iOS, and Android apps
  • See  Perplexity's blog for more info
  • Your personal information and data on how you use the tool are stored for analytical purposes (however, this feature can be turned off in settings)
  • Features a browser plug-in, Perplexity Companion , that is essentially a blend of Google and ChatGPT

»   perplexity.ai   «

Did you know that as Duke faculty, staff, and students, we have free access to ChatGPT-4 via Microsoft Copilot?

Log in with your Duke credentials to start using it today.


The OG of generative AI tools, ChatGPT-4 is the latest iteration of the popular chatbot, answering questions and generating text that sounds like it was written by a human. While not a replacement for conducting research, it can be helpful when it comes to brainstorming topics or research questions and also as a writing tool (rewriting or paraphrasing content, assessing tone, etc.).

All users across all disciplines.

  • ChatGPT-3.5 is the default version for free and paid-tier chat users.
  • Since it can't verify its sources, be wary of hallucinations (or made-up citations) that can look very real.
  • It is not 100% accurate! While ChatGPT-4 is touted as being 40% more accurate than its predecessor, users are still expected to verify the information generated by it.
  • There is always the potential for bias since ChatGPT was trained on a massive dataset of websites, articles, books, etc. (much of which is inherently biased since it was created by humans).

For ChatGPT-4 (access provided by Duke and requires login) »   copilot.microsoft.com   «

For ChatGPT-3.5 (free) »   chat.openai.com   «

  • Last Updated: May 17, 2024
  • URL: https://guides.library.duke.edu/litreviews


AI-assisted writing is quietly booming in academic journals. Here’s why that’s OK


Lecturer in Bioethics, Monash University & Honorary fellow, Melbourne Law School, Monash University

Disclosure statement

Julian Koplin does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.

Monash University provides funding as a founding partner of The Conversation AU.


If you search Google Scholar for the phrase “as an AI language model”, you’ll find plenty of AI research literature and also some rather suspicious results. For example, one paper on agricultural technology says:

As an AI language model, I don’t have direct access to current research articles or studies. However, I can provide you with an overview of some recent trends and advancements …

Obvious gaffes like this aren’t the only signs that researchers are increasingly turning to generative AI tools when writing up their research. A recent study examined the frequency of certain words in academic writing (such as “commendable”, “meticulously” and “intricate”), and found they became far more common after the launch of ChatGPT – so much so that 1% of all journal articles published in 2023 may have contained AI-generated text.

(Why do AI models overuse these words? There is speculation it’s because they are more common in English as spoken in Nigeria, where key elements of model training often occur.)
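
The frequency analysis described in that study can be reproduced in miniature. This is a toy sketch, not the study's actual methodology: it simply counts how often a few assumed marker words occur per document, over two small invented corpora.

```python
import re
from collections import Counter

# Marker words reported as overused in AI-assisted writing.
MARKERS = {"commendable", "meticulously", "intricate"}

def marker_rate(abstracts):
    """Average number of marker-word occurrences per abstract."""
    total = 0
    for text in abstracts:
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(words)
        total += sum(counts[m] for m in MARKERS)
    return total / len(abstracts)

# Invented example texts, for illustration only.
pre_2023 = ["We measure soil moisture with low-cost sensors.",
            "Results were mixed across field sites."]
post_2023 = ["This commendable study meticulously explores an intricate topic.",
             "We meticulously evaluate an intricate trade-off."]

print(marker_rate(pre_2023), marker_rate(post_2023))
```

At the scale of millions of articles, a jump in rates like these between the two periods is the kind of signal the study relied on.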

The aforementioned study also looks at preliminary data from 2024, which indicates that AI writing assistance is only becoming more common. Is this a crisis for modern scholarship, or a boon for academic productivity?

Who should take credit for AI writing?

Many people are worried by the use of AI in academic papers. Indeed, the practice has been described as “contaminating” scholarly literature.

Some argue that using AI output amounts to plagiarism. If your ideas are copy-pasted from ChatGPT, it is questionable whether you really deserve credit for them.

But there are important differences between “plagiarising” text authored by humans and text authored by AI. Those who plagiarise humans’ work receive credit for ideas that ought to have gone to the original author.

By contrast, it is debatable whether AI systems like ChatGPT can have ideas, let alone deserve credit for them. An AI tool is more like your phone’s autocomplete function than a human researcher.

The question of bias

Another worry is that AI outputs might be biased in ways that could seep into the scholarly record. Infamously, older language models tended to portray people who are female, black and/or gay in distinctly unflattering ways, compared with people who are male, white and/or straight.

This kind of bias is less pronounced in the current version of ChatGPT.

However, other studies have found a different kind of bias in ChatGPT and other large language models: a tendency to reflect a left-liberal political ideology.

Any such bias could subtly distort scholarly writing produced using these tools.

The hallucination problem

The most serious worry relates to a well-known limitation of generative AI systems: that they often make serious mistakes.

For example, when I asked ChatGPT-4 to generate an ASCII image of a mushroom, it provided me with the following output.

It then confidently told me I could use this image of a “mushroom” for my own purposes.

These kinds of overconfident mistakes have been referred to as “AI hallucinations” and “AI bullshit”. While it is easy to spot that the above ASCII image looks nothing like a mushroom (and quite a bit like a snail), it may be much harder to identify any mistakes ChatGPT makes when surveying scientific literature or describing the state of a philosophical debate.

Unlike (most) humans, AI systems are fundamentally unconcerned with the truth of what they say. If used carelessly, their hallucinations could corrupt the scholarly record.

Should AI-produced text be banned?

One response to the rise of text generators has been to ban them outright. For example, Science – one of the world’s most influential academic journals – disallows any use of AI-generated text.

I see two problems with this approach.

The first problem is a practical one: current tools for detecting AI-generated text are highly unreliable. This includes the detector created by ChatGPT’s own developers, which was taken offline after it was found to have only a 26% accuracy rate (and a 9% false positive rate). Humans also make mistakes when assessing whether something was written by AI.
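
Those two figures alone make individual flags hard to trust. Here is a back-of-the-envelope Bayes' rule sketch: it treats the reported 26% as the rate at which genuinely AI-written text gets flagged, and it assumes, purely for illustration, that 10% of submissions are AI-written (that prevalence is my assumption, not a figure from the detector's evaluation).

```python
def p_ai_given_flag(tpr, fpr, prevalence):
    """Bayes' rule: P(text is AI-written | detector flagged it)."""
    flagged = tpr * prevalence + fpr * (1 - prevalence)
    return tpr * prevalence / flagged

# 26% flag rate on AI text, 9% false positive rate (reported figures);
# 10% prevalence of AI-written submissions is an assumed value.
print(round(p_ai_given_flag(0.26, 0.09, 0.10), 2))  # ~0.24
```

Under these assumptions, roughly three out of four flags would land on human-written text, which is why base rates matter as much as the detector's headline accuracy.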

It is also possible to circumvent AI text detectors. Online communities are actively exploring how to prompt ChatGPT in ways that allow the user to evade detection. Human users can also superficially rewrite AI outputs, effectively scrubbing away the traces of AI (like its overuse of the words “commendable”, “meticulously” and “intricate”).

The second problem is that banning generative AI outright prevents us from realising these technologies’ benefits. Used well, generative AI can boost academic productivity by streamlining the writing process. In this way, it could help further human knowledge. Ideally, we should try to reap these benefits while avoiding the problems.

The problem is poor quality control, not AI

The most serious problem with AI is the risk of introducing unnoticed errors, leading to sloppy scholarship. Instead of banning AI, we should try to ensure that mistaken, implausible or biased claims cannot make it onto the academic record.

After all, humans can also produce writing with serious errors, and mechanisms such as peer review often fail to prevent its publication.

We need to get better at ensuring academic papers are free from serious mistakes, regardless of whether these mistakes are caused by careless use of AI or sloppy human scholarship. Not only is this more achievable than policing AI usage, it will improve the standards of academic research as a whole.

This would be (as ChatGPT might say) a commendable and meticulously intricate solution.


Synthesizing three decades of digital servitization: a systematic literature review and conceptual framework proposal

  • Theoretical article
  • Open access
  • Published: 08 May 2024


  • Pedro E. Minaya, ORCID: orcid.org/0000-0002-1179-9378
  • Lucía Avella, ORCID: orcid.org/0000-0003-2598-7318
  • Juan A. Trespalacios, ORCID: orcid.org/0000-0003-0658-4038


This study, through a systematic literature review spanning 1990 to 2023, interrogates how servitization, and nowadays digital servitization, enhances manufacturing competitiveness. It introduces the DASOBI (Drivers, Actors, Strategies, Obstacles, Benefits, and Impact) framework for navigating the digital servitization transition, emphasizing strategic adaptability and technological alignment. Analysis of 157 articles reveals a significant increase in research, highlighting digital servitization’s role in competitive enhancement and customer engagement. The DASOBI framework offers manufacturers a novel approach for managing this transition, marking a unique contribution by distilling extensive literature into actionable insights for both theory and practice in the evolving field of digital servitization.


1 Introduction

1.1 Context, motivation, and research topic

In today’s dynamic manufacturing sector, companies are increasingly acknowledging the importance of complementing their product offerings with value-added services. This strategic shift, known as servitization—and more specifically digital servitization—marks a fundamental turn in the contemporary business paradigm. This transformation involves not only a shift from a product-centric to a service-centric focus but also a deep integration of advanced digital technologies. While considerable research has been conducted on individual aspects of servitization, a comprehensive analysis that encompasses all essential facets of this phenomenon, from its motivations to its final outcomes, remains relatively unexplored. This research proposal aims to develop a holistic conceptual framework that synthesizes and extends existing knowledge, thereby providing a more complete and nuanced understanding of digital servitization. This exhaustive review examines this evolving business model, highlighting its key benefits and challenges, its intersection with digital technologies, and its theoretical and practical implications.

The foundational premise, supported by Bustinza et al. ( 2015 ), suggests that manufacturing companies can achieve higher returns by offering services in conjunction with their products, a claim echoed in seminal works by Davies et al. ( 2007 ), Johnstone et al. ( 2009 ), Martín-Peña et al. ( 2017 ), and Leoni and Aria ( 2021 ). These services, ranging from maintenance and support to more sophisticated and customized solutions, expand the revenue streams of these firms. In this context, the contributions of Baines et al. ( 2007 ) and Neely et al. ( 2011 ) are pivotal, as they underscore how transitioning to a service-oriented market is driving strategic transformations in manufacturing firms, emphasizing value creation and differentiation in increasingly competitive markets (Brady et al. 2005 ).

The current market dynamics almost make this shift imperative. As noted by Sandström et al. ( 2008 ) and Tukker ( 2015 ), companies that limit their offerings to products alone face formidable challenges in maintaining profitability, driving them toward business model innovation that incorporates services into their product portfolios, as discussed in the literature by Gebauer and Fleisch ( 2007 ), Visnjic and Van Looy ( 2013 ), and Díaz-Garrido et al. ( 2018 ).

Servitization requires effective coordination among multiple stakeholders. Alghisi and Saccani ( 2015 ) address the critical importance of internal and external alignment, while Ayala et al. ( 2019 ) highlight the essential role of service providers in the successful adoption of servitization strategies. Moreover, Baines et al. ( 2011 ) and Lightfoot et al. ( 2013 ) explore how manufacturing firms can effectively integrate services into their product portfolio, emphasizing the importance of a strategically well-planned approach.

Beyond being a customer-facing strategy, the internal benefits are equally compelling. As delineated by Kamp and Alcalde ( 2014 ), servitization facilitates process optimization and extends the lifespan of machinery. These advantages are further enhanced with the incorporation of digital technologies, particularly in the era of Industry 4.0 (Kamp and Perry 2017 ). This digital servitization, explored in studies by Lee et al. ( 2014 ), Kans and Ingwald ( 2016 ), and Paiola and Gebauer ( 2020 ), offers an enhanced layer of value, encompassing innovative goods and services.

Researchers such as Favoretto et al. ( 2022 ) and Rabetino et al. ( 2023 ) have elucidated how technological advancements act as catalysts for developing differentiated products and services, thereby enhancing competitiveness (Müller et al. 2021 ). This leads to the formulation of hybrid business models, termed Product-Service Systems (PSS), which are economically, socially, and environmentally sustainable. This PSS model provides a more holistic solution, meeting specific customer needs beyond just providing functional products (Barquet et al. 2013 ).

In this process, a demand for specific organizational and technological capabilities is identified. Coreynen et al. ( 2017 ) and Schroeder et al. ( 2022 ) have pinpointed the importance of organizational structure and technological capabilities, particularly in the context of digitalization, as key factors for a successful transition to digital servitization (Parida et al. 2014 ; Kanninen et al. 2017 ).

Implementing servitization, as highlighted by Mathieu ( 2001 ) and Yu and Sung ( 2023 ), is not without its challenges, ranging from internal organizational resistance to external factors, such as customer reluctance. Brax ( 2005 ) and Benedettini et al. ( 2015 ) provide a comprehensive analysis of these risks, emphasizing the importance of effective management to navigate potential obstacles in achieving successful servitization (Windahl and Lakemond 2006 ; Pessôa and Becker 2017 ). The process demands a well-structured and strategically informed approach, incorporating both business and customer perspectives. Proper implementation of servitization can lead to substantial benefits, as demonstrated by Baines et al. ( 2009b , 2017 ) and Wang et al. ( 2018 ), highlighting its potential for long-term value creation (Brady et al. 2005 ).

The phenomenon of servitization, particularly in its digital form, has emerged as a prominent area of study, characterized by its complexity and multidimensionality. Academic literature has thoroughly explored this concept, from underlying motivations to implementation strategies, examining both inherent challenges and potential benefits (Raddats et al. 2016 ; Rabetino et al. 2021 ).

1.2 Research gap

Despite the extensive body of knowledge on servitization amassed by previous studies, there remains a discernible gap characterized by fragmented examinations rather than a consolidated analytical approach. This study pinpoints a need for a unified framework that can effectively guide servitization strategies, addressing this lacuna as a pivotal area for forthcoming research (Calabrese et al. 2019 ; Kohtamäki et al. 2020a ). The advent of the digital era has precipitated transformative shifts, underscoring the servitization concept—the transition from purely selling products to offering integrated product-service solutions. Nevertheless, the interaction between servitization and digital technologies, a realm referred to as digital servitization, remains a relatively uncharted territory. This area lacks a systematic and thorough review spanning the last three decades. This omission highlights the imperative need for an in-depth understanding of how servitization has evolved and the essential development of a framework to adeptly navigate the intricacies involved in implementing these strategies effectively.

1.3 Methodology proposed

To address the identified research gap, our study employs a comprehensive, multi-phased methodology structured as follows: Initially, we conduct an in-depth examination of the literature on servitization and digital servitization. This phase aims to develop an integrative theoretical framework that captures the evolution of servitization over the past three decades, emphasizing the shift toward digital service delivery within the manufacturing sector. Subsequently, the study undertakes a systematic literature review to classify the existing body of work. This review specifically focuses on selecting pertinent studies that encompass both traditional and digital servitization, aiming to identify trends, patterns, and existing research gaps. Following the review, we perform a detailed analysis of the selected articles to explore how various aspects of servitization and digital servitization interact and influence each other. In the final phase, we synthesize the findings from the study to deepen the conceptual understanding of the servitization phenomenon, including its digital components. This synthesis will provide valuable insights into effectively managing the transition toward servitization and digital servitization, highlighting its practical applicability in a business context.

1.4 Expected contributions

The primary goal of this research is to construct an integrative framework that captures the evolution, current state, and future trajectory of servitization and digital servitization. This framework will delineate both the theoretical underpinnings and practical ramifications of servitization, illuminating the challenges and opportunities that have surfaced. Particularly, it will explore the transformative influence of Industry 4.0 technologies—such as the Internet of Things, Big Data analytics, and Artificial Intelligence—on traditional servitization models, steering them toward more advanced digital practices. This examination is crucial for understanding how digital technologies can enhance the competitiveness and value proposition of manufacturing firms engaged in servitization.

The overarching aim of this study is to deepen the comprehension of servitization by exploring its interplay with digitalization, thus broadening its theoretical and managerial relevance. The research intends to offer an integrated perspective that not only advances the academic discourse in this field but also aids manufacturing companies in adeptly navigating the complexities of servitization and digital servitization. Furthermore, this review will articulate a roadmap for manufacturers considering this transition, conceptually enriching a domain that, despite its increasing importance, remains underexplored in scholarly research. By highlighting the enduring interest in adopting servitization correctly and underscoring the necessity for a unified theoretical framework, this study responds to calls for theoretical consolidation and a more comprehensive research agenda (Pettigrew 1988 ; Pye and Pettigrew 2005 ).

In summary, our proposed study aims to provide a detailed analysis that integrates insights from various studies into a cohesive narrative, with a particular focus on the servitization and digital servitization processes within the manufacturing sector. This synthesis will significantly contribute to both academic knowledge and practical applications, emphasizing the complex and evolving nature of servitization in manufacturing, and marking a key conclusion of this thorough examination.

2 Research aims

This study is dedicated to a comprehensive analysis of the servitization phenomenon and its progression toward digital servitization within the manufacturing sector, meticulously examining the most significant research from the past 30 years. The aim is to understand the development and various applications of servitization, along with the challenges and obstacles it entails. The study seeks to identify the motivations driving companies toward servitization, examine the various actors involved in the process and their interplay, and explore the strategies necessary for successful implementation. Furthermore, the organizational and technological capabilities required for transitioning to servitization will be analyzed, as well as the associated risks and challenges, including both internal and external hurdles that companies must overcome to reap the potential benefits of servitization. This analysis is guided by key research in the field (Zhang and Banerji 2017; Khanra et al. 2021), offering a comprehensive perspective on this significant shift in business dynamics within the manufacturing sector.

Essentially, this study seeks to answer the main research question: To what extent do servitization and digital servitization provide benefits that contribute to enhancing a company’s competitiveness? Alongside this primary question, the study intends to address the following aspects related to the development of servitization and digital servitization:

RQ1. Implementation of a digital servitization strategy. How should it be shaped by the company’s business environment? What form should the co-creation process take in an international context? Which new knowledge and skills need to be developed for a correct implementation? Which benefits can be obtained by implementing the digital enablers of Industry 4.0? Which changes could it involve in the internal structure of the business? Which changes could it involve in the company’s business environment (relations with suppliers or strategic partners)? How can companies face the challenges and obstacles that arise during the transition process?

RQ2. Benefits of developing an effective digital servitization strategy. How does it provide greater value to the customer? How can product customization be optimized? How does it encourage access to new markets? How does it promote gaining new customers? How does it enable innovation in ideas or business models? How does it allow the development of goods with novel services? How does it effectively allow greater returns to be achieved? How does it improve competitiveness?

The focus of this study is not only on analyzing servitization as a strategic shift for manufacturing companies but also on exploring how the integration of digital technologies can enrich and complicate this process. Additionally, the aim is to synthesize existing knowledge to provide a broader and more nuanced understanding of digital servitization, highlighting its key advantages, challenges, and intersection with digital technologies.

3 Methodology

Four stages were established for this systematic literature review (Tranfield et al. 2003), corresponding to the four phases outlined in the first section.

This collection focuses on four fields of research: business administration, marketing, operations management, and service administration. Studies were drawn from the two main databases, Web of Science and Scopus, as these are considered reference sources for the topic under analysis. Once the information was screened, the most-cited studies were selected, forming the basis of the present study.

3.1 Review process

In conducting a systematic literature review to gain a profound understanding of servitization and digital servitization within the manufacturing sector, our approach integrated multiple rigorous methodologies (Thomé et al. 2016 ). Initially, following the method proposed by Hertzberg and Rudner ( 1999 ), we conducted a meticulous keyword search in the Web of Science and Scopus databases, aiming to identify pertinent literature using terms like “servitization,” “digital servitization,” and their variants. This was instrumental in capturing the subject’s breadth and depth, allowing for the creation of search strings using the Boolean connector OR. The search strings were incorporated in titles, abstracts, and/or keywords, adhering to the time span of 1990 to 2023 in major databases, thus fulfilling the guidelines set by Tranfield et al. ( 2003 ) for inclusion criteria.
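The search-string construction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact query: the term list stands in for "servitization," "digital servitization," and their variants, and it uses the standard Web of Science `TS=` and Scopus `TITLE-ABS-KEY` field codes together with the Boolean connector OR and the 1990–2023 time span.

```python
# Illustrative term list; the paper mentions "servitization",
# "digital servitization", and their variants (e.g. British spelling).
TERMS = ['"servitization"', '"digital servitization"', '"servitisation"']

def wos_query(terms, start=1990, end=2023):
    # TS= searches titles, abstracts, and keywords in Web of Science;
    # PY= restricts the publication-year span.
    return f'TS=({" OR ".join(terms)}) AND PY={start}-{end}'

def scopus_query(terms, start=1990, end=2023):
    # TITLE-ABS-KEY searches the same fields in Scopus; the year span
    # is expressed with PUBYEAR bounds.
    return (f'TITLE-ABS-KEY({" OR ".join(terms)}) '
            f'AND PUBYEAR > {start - 1} AND PUBYEAR < {end + 1}')

print(wos_query(TERMS))
print(scopus_query(TERMS))
```

Joining the terms with OR, as the text describes, broadens recall so that any variant of the concept is captured in titles, abstracts, or keywords.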

To further refine the search and ensure a robust database, we applied additional parameters and restrictions after establishing the primary search strings for both databases. We limited our search to open access and hybrid gold journals, focusing on high-quality, readily available research outputs. Additionally, we set a citation threshold to include articles with significant field impact, thereby capturing both seminal works and recent influential studies. This strategy was pivotal in assembling a comprehensive, relevant collection of the most pertinent works in the field of digital servitization.

The approach was enhanced by strictly adhering to three key inclusion criteria: (a) considering publications from 1990 to 2023, to ensure a contemporary and comprehensive review, (b) prioritizing articles from prestigious academic journals within the relevant study areas, thus ensuring source quality and relevance, and (c) selecting articles focusing explicitly on key aspects of servitization and digital servitization. This approach, aligned with the study’s objectives and research questions, ensures a holistic and detailed understanding of the phenomenon, accurately reflecting the dynamics and transformations in the manufacturing sector.

The present study aimed to answer the research question and the various related questions. This was done via the PRISMA method (Preferred Reporting Items for Systematic Reviews and Meta-Analyses). The selection criteria produced 647 articles (from Web of Science) and 630 articles (from Scopus). Once identified, the abstracts of each article were read to screen and select only those in line with the fourth study phase: to help properly understand the concept, how it is managed, and how it is applied. A total of 157 articles ultimately met all of the inclusion criteria. Figure 1 outlines the PRISMA method used.

figure 1

Source: Authors’ own work from Web of Science and Scopus databases

Flow diagram, based on the PRISMA Method, for the selection of relevant documents for the systematic literature review.
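The identification and screening tally behind the PRISMA flow can be summarized in code. The counts (647 from Web of Science, 630 from Scopus, 157 finally included) come from the text; the screening predicate is a hypothetical stand-in for the manual abstract reading the authors describe.

```python
# Records identified per database, as reported in the text.
identified = {"Web of Science": 647, "Scopus": 630}

def screen(records, meets_inclusion_criteria):
    """Keep only the records that satisfy the inclusion criteria.

    In the actual study this step was manual abstract reading;
    the predicate here is purely illustrative.
    """
    return [r for r in records if meets_inclusion_criteria(r)]

total_identified = sum(identified.values())  # records before screening
final_included = 157                         # reported after abstract screening

print(f"Identified: {total_identified}, included after screening: {final_included}")
```

The gap between the 1,277 identified records and the 157 included ones reflects deduplication across databases and the abstract-level screening against the inclusion criteria.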

3.2 Descriptive analysis

Figure 2 offers an analytical synthesis of the publication trends within the realms of servitization and digital servitization over a span of more than three decades, utilizing data harvested from the Web of Science and Scopus databases. The blue bars across all three charts articulate the volume of literature pertaining to servitization, encompassing its theoretical underpinnings, industry applications, and cross-disciplinary studies. This scholarly corpus embodies the foundational and evolutionary aspects of servitization as a strategic paradigm shift in manufacturing and service industries.

figure 2

Source: Web of Science and Scopus databases and authors’ own work

Evolution of publications on Servitization and Digital Servitization (1990–2023).

In parallel, the orange bars specifically chart the trajectory of literature focused on digital servitization. This subset of research delves into the intricacies of embedding digital technologies within traditional servitization frameworks. It illuminates the burgeoning intersection of digital innovation and service strategies, reflecting a vibrant and rapidly advancing frontier of research.

The upward trend of both blue and orange bars in the separate charts for Web of Science and Scopus indicates a robust increase in scholarly output. This not only testifies to the growing academic and practical significance of servitization concepts but also their digital counterparts, which are pivotal in today’s technology-driven marketplaces.

The application of inclusion and exclusion criteria to the study of servitization and digital servitization clarifies the focus of academic research, emphasizing the most relevant and impactful studies in these areas. This refined approach highlights the critical and emerging conversations shaping the future of manufacturing industries through servitization and its digital augmentation. The graph reflects the scholarly community’s increasing investment in understanding these concepts and their application, suggesting a dual focus: the persistent importance of servitization in strengthening the interplay between manufacturing and services, and the transformative potential of digital technologies within this framework. Serving both as a retrospective and a forecast, the visualization indicates key areas for future research that promise to advance industrial practices and academic thought.

Regarding the countries in which the identified studies have been carried out, the visual data presented in Fig.  3 captures a comprehensive view of the global research output on servitization and digital servitization from 1990 to 2023, as indexed by the Web of Science and Scopus databases and further refined by the application of inclusion and exclusion criteria. The top section, shown in blue, delineates the Web of Science data, indicating a prominent concentration of scholarly activity within certain countries, possibly linked to their robust research infrastructures, funding provisions, or strong manufacturing sectors that are conducive to studies in servitization.

figure 3

Source: Web of Science and Scopus databases

Number of publications by country on Servitization and Digital Servitization (1990–2023).

The middle section, in orange, portrays the Scopus data, revealing a parallel distribution pattern to that of the Web of Science but with slight variances that may be indicative of the different regional research emphases or variations in the databases’ indexing methodologies. The countries with the highest volume of publications are recognized as potential centers of excellence and innovation in the field of servitization.

The bottom section of the graph, in green, represents the distilled essence of this academic output following the application of the inclusion and exclusion criteria. This section emphasizes the refined and concentrated scholarly work that aligns more closely with the specific nuances and requirements of servitization and digital servitization research as defined by the study. It presents a narrower but more focused spectrum of publications, suggesting a curated body of knowledge that serves as a critical resource for understanding the current state and future directions of servitization in the manufacturing sector.

Together, these three segments of Fig.  3 not only illustrate the quantitative aspects of the research output but also underscore the qualitative focus and depth of scholarly exploration achieved through rigorous selection. This tripartite analysis offers a lens through which to view the international dissemination and development of knowledge in servitization and digital servitization, highlighting established leaders in the field as well as regions with the potential for increased research activity, international collaboration, and contribution to the servitization discourse.

In Fig.  4 , the Web of Science data (represented by the blue graph) lists Oscar Bustinza as the author with the highest number of publications, closely followed by Marko Kohtamäki and Vinit Parida. In contrast, the Scopus data (illustrated by the orange graph) also positions Vinit Parida prominently, yet Marko Kohtamäki’s publication count is lower than that reported in the Web of Science, presenting a notable discrepancy.

figure 4

Number of publications by author on Servitization and Digital Servitization (1990–2023).

When the inclusion and exclusion criteria are applied (as shown in the green graph), there is a decrease in the number of publications, which aligns with expectations, given that these criteria aim to omit publications failing to meet the predetermined standards of quality and relevance. Following this filtration, Tim Baines emerges as the author with the most publications, indicating the significant relevance of his research work to the focused aims of this systematic literature review. Consequently, the filtration process underscores those authors whose contributions are particularly central or foundational to the field.

The comparison across the three graphs demonstrates the influence of database selection and methodological rigor on the perceived prominence of authors within the academic community. This analysis goes beyond merely highlighting the leading figures in servitization research; it underscores the importance of thorough evaluation in literature reviews to identify research of substantial impact.

Thus, the filtration process distinctly recognizes authors whose contributions are considered pivotal to the discipline.

Figure 5 provides a succinct overview of journal publication volumes on servitization and digital servitization from 1990 to 2023, based on data from the Web of Science and Scopus databases. Prior to applying the inclusion and exclusion criteria, the journals listed in the Web of Science (blue) and Scopus (orange) show a wide range of publication volumes.

figure 5

Publication volume in the journals with the highest frequency of articles on Servitization and Digital Servitization (1990–2023).

Post-application (green), the data are refined to highlight the top ten journals that are most aligned with the research criteria. It is noteworthy that the application of these criteria significantly alters the landscape of the considered literature. Some journals that initially (in the Web of Science or Scopus databases) had a high volume of publications appear to have fewer articles meeting the requirements, which may reflect on the specificity and relevance of their contributions to the field.

The graphic serves as an insightful metric of the research landscape, indicating not only the journals that are most prolific in the domain but also the robustness of articles surviving rigorous scholarly scrutiny. This visual representation is integral to the academic discourse, as it not only informs researchers of the core journals within the field but also reflects the evolving standards and focal areas within the literature on servitization and digital servitization.

The descriptive analyses included in this section provide a pivotal foundation for the discussion that follows, shedding light on the trajectory of academic inquiry into servitization and digital servitization. They encapsulate the dual analysis conducted using the Web of Science and Scopus databases and the meticulous selection process leading to the corpus of papers employed in the systematic literature review. The synthesis of these findings offers valuable insights into the progression of research in this domain, indicating a maturing yet dynamically expanding field of study.

3.3 Classification process

Upon identifying studies that met the established selection criteria, a thorough examination of each was conducted to categorize them according to specific themes. These encompassed the motivations driving companies toward servitization, namely the reasons why manufacturers transition from producing solely goods to combining these with services, including the anticipated benefits of such a transformation. The various actors involved in the servitization process and the nature of their interactions were scrutinized, as well as the strategies necessary for successful implementation, which entailed identifying potential needs for external partners, commonly service providers (Martínez et al. 2010 ; Bastl et al. 2012 ; Spring and Araujo 2013 ; Ziaee et al. 2018 ). The types of services commonly offered were analyzed, categorized as basic, intermediate, or advanced, along with the specific servitization strategies adopted by the companies. Furthermore, the study delved into the organizational and technological capabilities required for an effective transition to servitization (Momeni et al. 2023 ), as well as the potential risks and challenges arising in these transition processes, including both internal and external obstacles that must be overcome to fully capitalize on the potential benefits of servitization (Raddats et al. 2017 ; Reim et al. 2019 ; Minaya et al. 2023 ).
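The thematic classification described above can be sketched as a simple coding scheme. The theme names are taken from the text; the record structure and the keyword-matching rule are hypothetical, standing in for the authors' manual categorization.

```python
# Themes taken from the classification described in the text.
THEMES = [
    "motivations for servitization",
    "actors and their interactions",
    "implementation strategies and external partners",
    "service types (basic / intermediate / advanced)",
    "organizational and technological capabilities",
    "risks and challenges (internal and external)",
]

def classify(abstract: str, keyword_map: dict) -> list:
    """Assign every theme whose keywords appear in the abstract.

    A hypothetical automated rule; the study itself classified
    articles through a thorough manual examination.
    """
    text = abstract.lower()
    return [theme for theme, kws in keyword_map.items()
            if any(kw in text for kw in kws)]

# Illustrative keyword map for two of the themes.
keyword_map = {
    "motivations for servitization": ["motivation", "driver", "benefit"],
    "risks and challenges (internal and external)": ["risk", "challenge", "obstacle"],
}
print(classify("Drivers and risks of servitization in SMEs", keyword_map))
```

A multi-label scheme like this mirrors the text: a single article can simultaneously address motivations, capabilities, and risks, so themes are assigned independently rather than exclusively.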

4 Results: theoretical background

4.1 From servitization to digital servitization

The concept of servitization, which has significantly evolved over the years, has achieved solid recognition in both the academic and industrial spheres. Initially defined by Levitt ( 1972 ) and Vandermerwe and Rada ( 1988 ) as the process of adding value through services (Johnson and Mena 2008 ; Baines et al. 2011 ; Lindman et al. 2016 ; Ruiz-Martín and Díaz-Garrido 2021 ), servitization has expanded to encompass multiple strategic objectives, such as competitive advantage (Baines et al. 2009a ; Raddats et al. 2019 ), financial goals, and marketing benefits (Khanra et al. 2021 ).

The shift toward servitization entails a redefinition of traditional business models, focusing on innovation (Sandström et al. 2008 ; Martín-Peña et al. 2018 ; Qi et al. 2020 ; Xing et al. 2023 ), and transforming manufacturers into service-centric companies (Cusumano 2008 ; Santamaría et al. 2012 ; Mosch et al. 2021 ). In this regard, manufacturing companies are fundamentally reorienting their business models and operational strategies to include value-added services (Gebauer and Kowalkowski 2012 ; Hyun and Kim 2021 ). Baines and Lightfoot ( 2013 ) and Luoto et al. ( 2017 ) highlight the widespread changes this implies in management, marketing, and operations. The change is so substantial that over 50% of a company’s activities and personnel can be involved in providing these newly implemented services, as indicated by multiple studies cited by Martín-Peña and Ziaee ( 2016 ). This is because research has shown that servitization not only adds value but also increases profitability with relatively low asset investments (Davies et al. 2007 ; Kharlamov and Parry 2021 ).

The types of services offered range from basic to advanced (Gebauer et al. 2013 ; Kindström and Kowalkowski 2014 ; Sousa and Da Silveira 2017 ), with advanced services contributing to greater profitability (Eggert et al. 2014 ) and generating higher customer satisfaction (Mont 2002 ; Ostrom et al. 2010 ), leading to improved competitive positioning (Oliva and Kallenberg 2003 ; Durugbo 2014 ). Baines et al. ( 2011 ) argue that servitization involves creating distinctive and sustainable capabilities (Raddats 2011 ; Kimita et al. 2022 ), requiring not just the provision of goods, but also the innovation of value through added services (Tukker and Tischner 2006 ; García Martín et al. 2019 ; Zighan and Abualqumboz 2022 ), enabling companies to maintain their competitive edge (Tuli et al. 2007 ; Brax and Jonsson 2009 ; Nordin and Kowalkowski 2010 ).

While the goal of servitization is to enrich product offerings and drive competitiveness (Neely et al. 2011; Gaiardelli et al. 2014; Benedettini et al. 2015), companies must avoid the “service paradox,” where the focus on new services undermines existing production capabilities (Gebauer et al. 2005; Hyun and Kim 2021). To this end, various researchers advocate for a comprehensive analysis covering customer needs, pricing strategies, delivery infrastructure, and organizational change (Manzini and Vezzoli 2003; Kohtamäki and Partanen 2016; Ziaee et al. 2017). In summary, this means moving away from product-centric thinking and engaging in a combined product and service logic.

In this context, Santamaría et al. ( 2012 ) and Rabetino et al. ( 2017 ) underscore three fundamental considerations for a successful servitization strategy: the content, process, and context of organizational change. This involves determining what to change, how to change, and why the change is necessary (Kreye et al. 2015 ).

The complexity of servitization also demands internal and external alignments within companies (Gebauer 2008 ; Alghisi and Saccani 2015 ; Kohtamäki et al. 2019a ; Zhang et al. 2023 ). Internally, this involves harmonizing the organization’s strategy with the service portfolio and aligning this strategy throughout the organization (Oliva and Kallenberg 2003 ; Yan et al. 2020 ). Externally, alignment extends to the service provider network and customer expectations (Ceci and Masini 2011 ; Paiola et al. 2013 ). Similarly, servitization applies in B2B and B2C domains, serving as a differentiator and pathway to future alliances and customer loyalty (Baines et al. 2017 ; Pombo and Franco 2023 ).

On the other hand, technological advancements act as significant facilitators in the transition toward servitization, particularly the digital elements of Industry 4.0 (Dalenogare et al. 2018 ; Paschou et al. 2020 ; Opazo-Basáez et al. 2021 ; Tian et al. 2022 ; Le-Dain et al. 2023 ). This involves both internal and external organizational changes, focusing on disruptive innovations and addressing legal and financial challenges (Bustinza et al. 2018 ; Tronvoll et al. 2020 ; Kolagar et al. 2022 ), leading to what is known as digital servitization.

Digital servitization represents the integration of enabling technologies from Industry 4.0 into the servitization process, generating additional benefits and creating value for the customer (Ibarra et al. 2018 ; Grandinetti et al. 2020 ; Ciasullo et al. 2021 ; Bettiol et al. 2022 ). This digital transformation expands the scope of traditional services, allowing for greater customization and efficiency (Frank et al. 2019 ; Chen et al. 2021 ).

Digitalization facilitates data collection and analysis, improving decision-making, and enabling more predictive and proactive services (Lee et al. 2014 ; Chen et al. 2022a ; Rakic et al. 2022 ). Moreover, data-based digital capabilities are fundamental for the success of digital servitization, as they enhance both product support services and customer support services (Chen et al. 2023 ).

Digital servitization also promotes value co-creation and collaboration among manufacturers, suppliers, and customers, optimizing service delivery and strengthening relationships (Coreynen et al. 2017 ; Vendrell-Herrero et al. 2017 ; Kohtamäki et al. 2020b ; Sjödin et al. 2020 ). The business models of digital servitization are also influenced by Industry 4.0 technologies, such as Internet of Things and Big Data, enabling the development of more integrated and customer-centric solutions (Naik et al. 2020 ; Bortoluzzi et al. 2022 ; Minaya et al.  2023 ).

Furthermore, an integral aspect of the servitization landscape, especially in the digital era, is the evolution of Product-Service Systems (PSS). PSS represents a strategic approach that shifts the focus from selling products to offering a combination of products and services designed to fulfill specific customer needs more efficiently (Tukker and Tischner 2006; Baines et al. 2017). This transition to PSS reflects a broader industry movement toward sustainable and customer-centric business models, where the value proposition extends beyond the physical product to include personalized services. The advent of Industry 4.0 technologies has further propelled this evolution, leading to the development of Smart PSS. Smart PSS integrates digital technologies, such as the Internet of Things, Big Data, and Artificial Intelligence, to enhance service delivery, improve customer experience, and enable new forms of value creation (Chowdhury et al. 2018; Bortoluzzi et al. 2022). The adoption of these advanced technologies within PSS frameworks represents a significant leap in how companies approach servitization, allowing for greater customization, efficiency, and proactive engagement with customers. Therefore, understanding the role and impact of PSS, particularly Smart PSS, is crucial for comprehending the full scope of digital servitization and its implications for future business strategies.

4.2 Integrating smart product-service systems (smart PSS) into digital servitization: evolution, challenges, and opportunities

Product-Service Systems (PSS) epitomize an evolution in business models, integrating goods and services to fulfill customer needs sustainably and effectively (Galbraith 2002 ; Gebauer et al. 2011 ; Oliveira et al. 2015 ; Haase et al. 2017 ; Gaiardelli et al. 2021 ; Zhou and Song 2021 ). Tukker ( 2004 ) categorizes PSS into product oriented, use oriented, and result oriented, with each type offering distinct benefits, such as improved profit margins and differentiation from competitors (Tukker and Tischner 2006 ; Reim et al. 2015 ; Baines et al. 2017 ; Rabetino et al. 2017 ). Service-oriented PSS prioritize personalized customer experiences, requiring greater customer involvement (Matthyssens and Vandenbempt 2010 ; Cusumano et al. 2014 ; Zighan and Abualqumboz 2022 ).

The advent of Industry 4.0 technologies has given rise to Smart PSS, enhancing traditional PSS frameworks with digital capabilities and aligning with digital servitization’s goals to maximize customer value and competitive advantage (Chowdhury et al. 2018 ; Zheng et al. 2019 ; Wang et al. 2021 ; Bortoluzzi et al. 2022 ; Chen et al. 2023 ). Smart PSS incorporate Internet of Things, Big Data, and Artificial Intelligence to offer tailored services and predictive maintenance, thus improving product reliability and customer experience. However, transitioning to Smart PSS necessitates overcoming internal challenges, such as developing digital capabilities and adapting organizational culture, and external challenges like aligning strategies with customer and supplier expectations (Alghisi and Saccani 2015 ; Baines and Shi 2015 ; Ceci and Masini 2011 ; Mosch et al. 2021 ).

Business models in the context of Smart PSS vary from product centered to service oriented, depending on the company’s servitization maturity and technological capacity, leading to greater competitive differentiation and new market opportunities (Kowalkowski et al. 2017 ; Zheng et al. 2019 ; Baines et al. 2020 ; Chen et al. 2021 ). Implementing Smart PSS calls for a holistic approach, from strategic planning to system design and operational management, with a focus on how digital capabilities enhance PSS offerings and the overall value chain (Coreynen et al. 2017 ; Zheng et al. 2018 ).

In sum, the transition from traditional servitization to digital servitization, through the deployment of Smart PSS, marks a critical shift in value creation and sustaining customer loyalty, propelled by Industry 4.0 innovations (Vandermerwe and Rada 1988 ; Frank et al. 2019 ; Pinillos et al. 2022 ; Raddats et al. 2022 ; Schroeder et al. 2022 ; Chen et al. 2023 ; Martín-Peña et al. 2023 ). Realizing the potential of digital servitization demands an understanding of technological capabilities, fostering innovation, and market adaptability (Kohtamäki et al. 2019b ; Zhang et al. 2023 ). Successful digital servitization and Smart PSS rely on integrating technology with strategic vision and customer centricity, cultivating a business model focused on collaboration, innovation, and value co-creation (Naik et al. 2020 ; Chen et al. 2021 ; Zhou et al. 2021 ; Kolagar et al. 2022 ).

4.3 Digital servitization: crafting superior value in the modern era

As previously noted, servitization, as it evolves into digital servitization, catalyzes a profound and strategic transformation of business models and operational paradigms, emphasizing the importance of both internal and external strategic alignments. This process not only optimizes existing service offerings but also unlocks significant potential for service innovation and market competitiveness. Specifically, the integration of advanced technologies in digital servitization allows companies to create superior and customized value for their customers. This expanded value creation is achieved through a synergistic combination of technological resources and human capabilities, facilitating more predictive, personalized, and proactive services. Thus, digital servitization emerges as an essential and transformative step in business strategy, driving not only efficiency and strategic alignment but also fostering innovation and strengthening competitive positioning in the market.

Digital servitization, a contemporary evolution of traditional servitization, integrates Industry 4.0 technologies into the service domain, creating significant value for the customer. This value manifests in several key dimensions, all driven by digitalization and the emerging capabilities it offers.

Enhanced personalization and customer experience. The ability to collect and analyze large volumes of data using digital technologies enables companies to better understand the needs and preferences of their customers (Tao and Qi 2017 ; Chen et al. 2023 ). This leads to the creation of more personalized service offerings, tailored specifically to individual customer requirements. For instance, data analytics capabilities enhance servitization by enabling service personalization, which is fundamental for improving customer satisfaction and fostering long-term loyalty (Chen et al. 2022b ).

Efficiency and proactivity in service delivery. Digital servitization allows companies to be more efficient and proactive in delivering services. Technologies like the Internet of Things and Artificial Intelligence facilitate remote monitoring and predictive maintenance, anticipating problems before they occur and minimizing downtime (Lee et al. 2014 ; Tao and Qi 2017 ; Raddats et al. 2022 ). This not only improves product reliability but also reduces costs for the customer.

Creation of new opportunities and business models. The integration of digital services opens new avenues for innovative business models. For example, companies can offer usage-based solutions or subscriptions, where customers pay for performance or outcomes rather than the product itself (Vendrell-Herrero et al. 2017 ; Martín-Peña et al. 2020 ; Bortoluzzi et al. 2022 ). This can result in greater flexibility and more attractive cost options for the customer.

Enhanced customer–supplier relationships. Digital servitization fosters greater collaboration and value co-creation between suppliers and customers (Coreynen et al. 2017 ; Sjödin et al. 2020 ; Harrmann et al. 2023 ). This is because digital capabilities enable smoother communication and more transparent information exchange, resulting in stronger and more reliable relationships (Davies et al. 2023 ).

Continuous improvement of products and services. Ongoing feedback and data analysis enable continuous improvement of the products and services offered. Companies can quickly adjust their offerings in response to customer feedback or market changes, ensuring that their services remain relevant and of high quality (Chen et al. 2021 ).

Access to new markets. Digital servitization enables companies to access new markets and customer segments. By offering digital solutions, companies can overcome geographical and logistical barriers, reaching customers who were previously inaccessible (Münch et al. 2022 ; Rakic et al. 2022 ).

In summary, digital servitization not only enhances existing service offerings but also opens new opportunities for service innovation, strategic alignment, and market competitiveness. Its successful implementation is key to creating substantial value for the customer, highlighting the importance of a well-planned and executed strategy in the context of modern servitization.

5 Proposed conceptual framework: guiding the transition to digital servitization

Digital servitization represents a pivotal shift in the business landscape, where manufacturing companies evolve into providers of comprehensive solutions that seamlessly integrate products and services, augmented by digital technologies. This transformation is driven by the need for enhanced competitiveness, customer engagement, and value creation in a rapidly changing digital economy.

The development of our DASOBI conceptual framework, designed to guide the transition to digital servitization, is grounded in a rigorous methodological approach, underpinned by a comprehensive systematic literature review. This review meticulously synthesized three decades of academic research and industry insights, incorporating a total of 157 articles. Our comprehensive review process involved a deep analysis of the most influential and relevant publications in the field, among which notable contributions include Alghisi and Saccani ( 2015 ); Ayala et al. ( 2017 , 2019 ); Coreynen et al. ( 2017 ); Tao and Qi ( 2017 ); Vendrell-Herrero et al. ( 2017 ); Bustinza et al. ( 2018 ); Frank et al. ( 2019 ); Baines et al. ( 2020 ); Martín-Peña et al. ( 2020 ); Naik et al. ( 2020 ); Brax et al. ( 2021 ); Gaiardelli et al. ( 2021 ); Kohtamäki et al. ( 2021 ); Bettiol et al. ( 2022 ); Bortoluzzi et al. ( 2022 ); Marcon et al. ( 2022 ); Münch et al. ( 2022 ); Brekke et al. ( 2023 ); Chen et al. ( 2023 ); Chirumalla et al. ( 2023 ); Shen et al. ( 2023 ). These articles were particularly significant for identifying emerging trends, key challenges, and effective strategies in digital servitization. By systematically analyzing this extensive body of literature, we identified critical themes, challenges, strategies, and outcomes associated with the digital servitization journey. This analysis not only highlighted the multifaceted nature of digital servitization but also emphasized the critical importance of aligning strategic considerations, technological capabilities, and stakeholder roles to successfully navigate this complex transition. The structured framework presented herein not only reflects the evolution of the field but also provides clear guidance for manufacturing companies advancing toward more sophisticated and digitalized servitization practices.

The DASOBI framework, while empirically grounded in a comprehensive literature review, also draws extensively on classical and emerging theories to provide a robust theoretical foundation. For instance, diffusion of innovations theory (Rogers 2003 ) elucidates the “Drivers” and “Obstacles” in the adoption of digital servitization by explaining the rate and process through which new technological innovations spread within industries. Furthermore, the resource-based view (Barney 1991 ) is instrumental in understanding the “Strategies” component of the framework, emphasizing the importance of internal capabilities and resources in gaining a competitive advantage through digital transformation. These theoretical integrations not only enhance the academic rigor of our framework but also offer a deeper understanding of the multifaceted nature of digital servitization.

Therefore, the proposed DASOBI (Drivers, Actors, Strategies, Obstacles, Benefits, and Impact) model emerges as a synthesis of empirical evidence and theoretical insights, designed to offer a coherent and actionable guide for organizations seeking to embrace digital servitization.

This conceptual framework delineates a roadmap for organizations to navigate this complex transition. The framework identifies the core components essential for a successful journey toward digital servitization:

Underlying reasons for the shift (Drivers). Recognizing the strategic imperatives for transitioning toward a digital servitization model is critical. This includes understanding market dynamics, competitive pressures, and technological advancements driving this change.

Key actors involved (Actors). Successful digital servitization necessitates the involvement and alignment of various stakeholders, including internal teams, customers, technology partners, and suppliers. Their roles, expectations, and contributions are pivotal in shaping the servitization journey.

Strategic considerations and tools (Strategies). This encompasses adopting strategic frameworks, methodologies, and digital tools that are conducive to servitization. These tools and strategies should facilitate the integration of digital technologies with traditional product-service offerings, ensuring a seamless transition.

Potential challenges and obstacles (Obstacles). Identifying and addressing challenges such as cultural resistance, skill gaps, technological complexities, and integration issues with existing processes is crucial. Proactive strategies and contingency plans are essential to mitigate these barriers.

Anticipated benefits of the transition (Benefits). The transition to digital servitization should bring about significant benefits, including enhanced customer value, increased revenue streams, and improved competitive positioning. This component focuses on quantifying these benefits and aligning them with organizational goals.

Expected outcomes and impact (Impact). The final component of the framework revolves around the tangible outcomes and impacts of digital servitization. This includes enhanced customer satisfaction, increased market share, and improved operational efficiency.

In the digital servitization framework, the transition toward digital servitization, driven by market dynamics, competitive pressures, and technological advancements, is intrinsically linked to the roles and contributions of key stakeholders, such as internal teams, customers, and technology partners. Strategic considerations and tools must be selected in light of potential challenges, like cultural resistance and skill gaps, ensuring alignment with stakeholder capabilities and expectations for a seamless integration of digital technologies with traditional offerings. This strategic alignment is pivotal in overcoming obstacles and realizing anticipated benefits, such as enhanced customer value and competitive positioning. These benefits, in turn, lead to tangible outcomes, like improved customer satisfaction and operational efficiency, which feed back into the market, influencing ongoing strategic imperatives and shaping the evolution of digital servitization strategies. This dynamic interplay highlights a continuous feedback loop in which outcomes inform underlying reasons, reinforcing the need for adaptability and strategic foresight in the digital servitization journey.

The contribution of the DASOBI framework to the existing literature is manifold. By synthesizing empirical findings with theoretical insights from servitization and digital transformation research, this framework addresses identified gaps, such as the integration of digital technologies in traditional servitization models and the management of organizational changes associated with such transitions (Baines and Lightfoot 2013 ; Vargo and Lusch 2008 ). Specifically, the DASOBI framework aids in conceptualizing how companies can strategically navigate the complexities of digital servitization, providing a structured approach that is missing in previous studies. This not only extends the theoretical discourse around servitization but also sets a foundation for future research to explore the dynamic interactions between digital technologies and service strategies in manufacturing sectors.

In conclusion, this conceptual framework serves as a comprehensive guide for firms embarking on the digital servitization journey. It provides a structured approach to understanding and implementing the necessary changes, ensuring a smooth transition and realization of the potential benefits of digital servitization. Figure 6 summarizes this model (DASOBI), comprising the Drivers (underlying reasons for the shift), Actors (key actors involved), Strategies (strategic considerations and tools), Obstacles (potential challenges and obstacles), Benefits (anticipated benefits of the transition), and Impact (expected outcomes and impact) of a digital servitization strategy, which offers a robust framework for scholarly exploration grounded in an exhaustive review of the extant literature.

Figure 6: Conceptual theoretical model for the analysis of Digital Servitization. Source: Authors' own work

The DASOBI framework orchestrates the shift from traditional service strategies to digitally enhanced service offerings, underpinned by the alignment of its core elements: Drivers, Actors, Strategies, Obstacles, Benefits, and Impact. The model emphasizes a strategic approach, incorporating digital catalytic factors to augment adaptability, customer-centric analytics, and the pursuit of novel revenue streams through digital innovations.

Within this framework, the development of digital knowledge and capabilities is crucial. Firms must harness Big Data to distill customer insights, leverage Artificial Intelligence to identify opportunities, and increase the flexibility of their service offerings via digital platforms. The role of digital service providers is pivotal, offering expertise to mitigate transition risks, assure service quality, and bolster productivity with cutting-edge technological solutions.

However, the shift is not without its challenges. The resistance to digital transformation and the complexity of measuring profitability in the digital service landscape can impede progress. Moreover, the implications of Industry 4.0 are profound, necessitating organizational restructuring, workforce upskilling, and technological investments to realize the potential of digital servitization.

The anticipated benefits of this digital shift are manifold. Enhanced customer understanding through sophisticated data analytics, improved market positioning through digital innovation, and elevated creative capability with advanced technology are but a few of the advantages. Furthermore, embracing Industry 4.0 technologies within digital servitization amplifies these benefits, leading to superior product quality via smart manufacturing, greater adaptability in production, and increased operational efficiency that ensures timely delivery.

In summary, the DASOBI model meticulously integrates the transition to digital servitization with the digital economy’s imperatives, presenting a coherent roadmap for firms aspiring to harness the full spectrum of benefits offered by Industry 4.0 innovations.

6 Conclusions, limitations, and further research

This study embarked on an exhaustive journey through three decades of literature on servitization and its evolution toward digital servitization within the manufacturing sector. Through a systematic literature review, we explored the strategic transformation that involves integrating advanced services and digital technologies into product offerings, a change driven by the need to enhance competitiveness, customer engagement, and value creation in a rapidly evolving digital economy.

Our research findings have identified key drivers, actors, strategies, challenges, and benefits associated with the transition toward digital servitization. The DASOBI conceptual framework seeks to provide a structured guide for understanding and managing this complex transition. This framework emphasizes the importance of recognizing the underlying reasons for adopting digital servitization models, the necessity of aligning and collaborating with diverse stakeholders, and the use of specific strategies to overcome the inherent challenges of this process.

Despite this study’s contribution to the body of knowledge on digital servitization, we acknowledge several limitations. The geographical concentration of the research activity analyzed might limit the generalizability of our findings across diverse cultural and economic contexts. The rapid evolution of digital technologies and business models also suggests that the relevance of our findings could be challenged by future developments. Additionally, our research focused primarily on manufacturing firms, which limits the applicability of the findings to other sectors.

These limitations open several avenues for future research. It is imperative to validate and test the generalizability of the DASOBI framework across various organizational and industry contexts. Further research is also needed to develop specific metrics that can measure the impacts of digital servitization. Longitudinal studies could provide a deeper understanding of how servitization strategies influence business outcomes over time.

This study contributes to the academic discussion by clarifying and deepening the concept of servitization and its intersection with digitalization, offering an integrative view that can assist manufacturing firms in navigating the complex landscape of servitization and digital servitization. Although we have tried to establish a solid foundation for future research, it is evident that the field of digital servitization remains dynamic and evolving, requiring ongoing examination to fully comprehend its impact on business strategy and practice.

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Alghisi A, Saccani N (2015) Internal and external alignment in the servitization journey—overcoming the challenges. Prod Plann Control 26:1219–1232. https://doi.org/10.1080/09537287.2015.1033496

Ayala NF, Paslauski CA, Ghezzi A, Frank AG (2017) Knowledge sharing dynamics in service suppliers’ involvement for servitization of manufacturing companies. Int J Prod Econ 193:538–553. https://doi.org/10.1016/j.ijpe.2017.08.019

Ayala NF, Gerstlberger W, Frank AG (2019) Managing servitization in product companies: the moderating role of service suppliers. Int J Oper Prod Manag 39(1):43–74. https://doi.org/10.1108/IJOPM-08-2017-0484

Baines T, Lightfoot H (2013) Servitization of the manufacturing firm: exploring the operations practices and technologies that deliver advanced services. Int J Oper Prod Manag 34(1):2–35. https://doi.org/10.1108/IJOPM-02-2012-0086

Baines T, Shi VG (2015) A Delphi study to explore the adoption of servitization in UK companies. Prod Plann Control 26:1171–1187. https://doi.org/10.1080/09537287.2015.1033490

Baines T, Lightfoot HW, Evans S, Neely A et al (2007) State-of-the-art in product-service systems. J Eng Manuf 221(10):1543–1552. https://doi.org/10.1243/09544054JEM858

Baines T, Lightfoot H, Benedettini O, Kay JM (2009a) The servitization of manufacturing: a review of literature and reflection on future challenges. J Manuf Technol Manag 20(5):547–567. https://doi.org/10.1108/17410380910960984

Baines T, Lightfoot H, Peppard J, Johnson M et al (2009b) Towards an operations strategy for product-centric servitization. Int J Oper Prod Manag 29(5):494–519. https://doi.org/10.1108/01443570910953603

Baines T, Lightfoot H, Smart P (2011) Servitization within manufacturing: exploring the provision of advanced services and their impact on vertical integration. J Manuf Technol Manag 22(7):947–954. https://doi.org/10.1108/17410381111160988

Baines T, Ziaee Bigdeli A, Bustinza OF, Shi VG et al (2017) Servitization: revisiting the state-of-the-art and research priorities. Int J Oper Prod Manag 37(2):256–278. https://doi.org/10.1108/IJOPM-06-2015-0312

Baines T, Ziaee Bigdeli A, Sousa R, Schroeder A (2020) Framing the servitization transformation process: a model to understand and facilitate the servitization journey. Int J Prod Econ 221:1–44. https://doi.org/10.1016/j.ijpe.2019.07.036

Barney J (1991) Firm resources and sustained competitive advantage. J Manag 17(1):99–120

Barquet APB, De Oliveira MG, Amigo CR, Cunha VP, Rozenfeld H (2013) Employing the business model concept to support the adoption of product-service systems (PSS). Ind Mark Manag 42(5):693–704. https://doi.org/10.1016/j.indmarman.2013.05.003

Bastl M, Johnson M, Lightfoot H, Evans S (2012) Buyer-supplier relationships in a servitized environment. Int J Oper Prod Manag 32(6):650–675. https://doi.org/10.1108/01443571211230916

Benedettini O, Neely A, Swink M (2015) Why do servitized firms fail? A risk-based explanation. Int J Oper Prod Manag 35(6):946–979. https://doi.org/10.1108/IJOPM-02-2014-0052

Bettiol M, Capestro M, Di Maria E, Micelli S (2022) Overcoming pandemic challenges through product innovation: the role of digital technologies and servitization. Eur Manag J 40(5):707–717. https://doi.org/10.1016/j.emj.2022.05.003

Bortoluzzi G, Chiarvesio M, Romanello R, Tabacco R, Veglio V (2022) Servitisation and performance in the business-to-business context: the moderating role of Industry 4.0 technologies. J Manuf Technol Manag 33(9):108–128. https://doi.org/10.1108/JMTM-08-2021-0317

Brady T, Davies A, Gann D (2005) Creating value by delivering integrated solutions. Int J Proj Manag 23(5):360–365. https://doi.org/10.1016/j.ijproman.2005.01.001

Brax SA (2005) A manufacturer becoming service provider—challenges and a paradox. Manag Serv Qual 15(2):142–155. https://doi.org/10.1108/09604520510585334

Brax SA, Jonsson K (2009) Developing integrated solution offerings for remote diagnostics: a comparative case study of two manufacturers. Int J Oper Prod Manag 29(5):539–560. https://doi.org/10.1108/01443570910953621

Brax SA, Calabrese A, Levialdi Ghiron N, Tiburzi L, Gronroos C (2021) Explaining the servitization paradox: a configurational theory and a performance measurement framework. Int J Oper Prod Manag 41(5):517–546. https://doi.org/10.1108/IJOPM-08-2020-0535

Brekke T, Lenka S, Kohtamaki M, Parida V, Solem BAA (2023) Overcoming barriers to transformation in manufacturing firms. A path-dependence perspective of digital servitization. Rev Manag Sci. https://doi.org/10.1007/s11846-023-00641-0

Bustinza OF, Bigdeli AZ, Baines T, Elliot C (2015) Servitization and competitive advantage: the importance of organizational structure and value chain position. Res Technol Manag 58:53–60. https://doi.org/10.5437/08956308X5805354

Bustinza OF, Gomes E, Vendrell-Herrero F, Tarba SY (2018) An organizational change framework for digital servitization: evidence from the Veneto region. Strateg Change 27:111–119. https://doi.org/10.1002/jsc.2186

Calabrese A, Levialdi Ghiron N, Tiburzi L, Baines T, Ziaee Bigdeli A (2019) The measurement of degree of servitization: literature review and recommendations. Prod Plann Control 30:1118–1135. https://doi.org/10.1080/09537287.2019.1592260

Ceci F, Masini A (2011) Balancing specialized and generic capabilities in the provision of integrated solutions. Ind Corp Change 20(1):91–131. https://doi.org/10.1093/icc/dtq069

Chen Y, Visnjic I, Parida V, Zhang Z (2021) On the road to digital servitization—the (dis)continuous interplay between business model and digital technology. Int J Oper Prod Manag 41(5):694–722. https://doi.org/10.1108/IJOPM-08-2020-0544

Chen M, Pu X, Zhang M, Cai Z et al (2022a) Data analytics capability and servitization: the moderated mediation role of bricolage and innovation orientation. Int J Oper Prod Manag 42(4):440–470. https://doi.org/10.1108/IJOPM-10-2021-0663

Chen Y, Wu Z, Yi W, Wang B et al (2022b) Bibliometric method for manufacturing servitization: a review and future research directions. Sustainability 14:1–26. https://doi.org/10.3390/su14148743

Chen L, Dai Y, Ren F, Dong X (2023) Data-driven digital capabilities enable servitization strategy—from service supporting the product to service supporting the client. Technol Forecast Soc Change 197:1–15. https://doi.org/10.1016/j.techfore.2023.122901

Chirumalla K, Leoni L, Oghazi P (2023) Moving from servitization to digital servitization: identifying the required dynamic capabilities and related microfoundations to facilitate the transition. J Bus Res 158:1–23. https://doi.org/10.1016/j.jbusres.2023.113668

Chowdhury S, Haftor D, Pashkevich N (2018) Smart product-service systems (Smart PSS) in industrial firms: a literature review. Procedia CIRP 73:26–31. https://doi.org/10.1016/j.procir.2018.03.333

Ciasullo MV, Polese F, Montera R, Carrubbo L (2021) A digital servitization framework for viable manufacturing companies. J Bus Ind Mark 36(13):142–160. https://doi.org/10.1108/JBIM-07-2020-0349

Coreynen W, Matthyssens P, Van Bockhaven W (2017) Boosting servitization through digitization: pathways and dynamic resource configurations for manufacturers. Ind Mark Manag 60:42–53. https://doi.org/10.1016/j.indmarman.2016.04.012

Cusumano MA (2008) The changing software business: moving from products to services. Computer 41:20–27. https://doi.org/10.1109/MC.2008.29

Cusumano MA, Kahl SJ, Suárez FF (2014) Services, industry evolution, and the competitive strategies of product firms. Strateg Manag J 36:559–575. https://doi.org/10.2139/ssrn.2378868

Dalenogare LS, Benitez GB, Ayala NF, Frank AG (2018) The expected contribution of Industry 4.0 technologies for industrial performance. Int J Prod Econ 204:383–394. https://doi.org/10.1016/j.ijpe.2018.08.019

Davies A, Brady T, Hobday M (2007) Organizing for solutions: systems seller vs. systems integrator. Ind Mark Manag 36(2):183–193. https://doi.org/10.1016/j.indmarman.2006.04.009

Davies P, Bustinza OF, Parry G, Jovanovic M (2023) Unpacking the relationship between digital capabilities, services capabilities, and firm financial performance: a moderated mediation model. Ind Mark Manag 115:1–10. https://doi.org/10.1016/j.indmarman.2023.09.005

Díaz-Garrido E, Pinillos MJ, Soriano-Pinar I, García-Magro C (2018) Changes in the intellectual basis of servitization research: a dynamic analysis. J Eng Technol Manag JET M 48:1–14. https://doi.org/10.1016/j.jengtecman.2018.01.005

Durugbo C (2014) Strategic framework for industrial product-service co-design: findings from the microsystems industry. Int J Prod Res 52:2881–2900. https://doi.org/10.1080/00207543.2013.857054

Eggert A, Hogreve J, Ulaga W, Muenkhoff E (2014) Revenue and profit implications of industrial service strategies. J Serv Res 17:23–39. https://doi.org/10.1177/1094670513485823

Favoretto C, Mendes G, Oliveira M, Cauchick-Miguel P, Coreynen W (2022) From servitization to digital servitization: how digitalization transforms companies’ transition towards services. Ind Mark Manag 102:104–121. https://doi.org/10.1016/j.indmarman.2022.01.003

Frank AG, Mendes GHS, Ayala NF, Ghezzi A (2019) Servitization and Industry 4.0 convergence in the digital transformation of product firms: a business model innovation perspective. Technol Forecast Soc Change 141:341–351. https://doi.org/10.1016/j.techfore.2019.01.014

Gaiardelli P, Songini L, Saccani N (2014) The automotive industry: heading towards servitization in turbulent times. Servitization in Industry. Springer, Cham

Gaiardelli P, Pezzotta G, Rondini A, Romero D et al (2021) Product-service systems evolution in the era of Industry 4.0. Serv Bus 15:177–207. https://doi.org/10.1007/s11628-021-00438-9

Galbraith JR (2002) Organizing to deliver solutions. Organ Dyn 31(2):194–207. https://doi.org/10.1016/S0090-2616(02)00101-8

García Martin PC, Schroeder A, Bigdeli AZ (2019) The value architecture of servitization: expanding the research scope. J Bus Res 104:438–449. https://doi.org/10.1016/j.jbusres.2019.04.010

Gebauer H (2008) Identifying service strategies in product manufacturing companies by exploring environment—strategy configurations. Ind Mark Manage 37(3):278–291. https://doi.org/10.1016/j.indmarman.2007.05.018

Gebauer H, Fleisch E (2007) An investigation of the relationship between behavioral processes, motivation, investments in the service business and service revenue. Ind Mark Manag 36(3):337–348. https://doi.org/10.1016/j.indmarman.2005.09.005

Gebauer H, Kowalkowski C (2012) Customer-focused and service-focused orientation in organizational structures. J Bus Ind Mark 27(7):527–537. https://doi.org/10.1108/08858621211257293

Gebauer H, Elgar F, Thomas F (2005) Overcoming the service paradox in manufacturing companies. Eur Manag J 23:14–26. https://doi.org/10.1016/j.emj.2004.12.006

Gebauer H, Gustafsson A, Witell L (2011) Competitive advantage through service differentiation by manufacturing companies. J Bus Res 64(12):1270–1280. https://doi.org/10.1016/j.jbusres.2011.01.015

Gebauer H, Paiola M, Saccani N (2013) Characterizing service networks for moving from products to solutions. Ind Mark Manag 42:31–46. https://doi.org/10.1016/j.indmarman.2012.11.002

Grandinetti R, Ciasullo MV, Paiola M, Schiavone F (2020) Fourth industrial revolution, digital servitization and relationship quality in Italian B2B manufacturing firms. Explor Study TQM J 32(4):647–671. https://doi.org/10.1108/TQM-01-2020-0006

Haase RP, Pigosso DCA, McAloone TC (2017) Product/service-system origins and trajectories: a systematic literature review of PSS definitions and their characteristics. Procedia CIRP 64:157–162. https://doi.org/10.1016/j.procir.2017.03.053

Harrmann LK, Eggert A, Böhm E (2023) Digital technology usage as a driver of servitization paths in manufacturing industries. Eur J Mark 57(3):834–857. https://doi.org/10.1108/EJM-11-2021-0914

Hertzberg S, Rudner L (1999) Quality of researchers’ searches of the ERIC database. Educ Policy Anal Arch. https://doi.org/10.14507/epaa.v7n25.1999

Hyun M, Kim J (2021) Challenge or opportunity? A case of tire rental servitization from financial and channel perspectives. Serv Bus 15:1–17. https://doi.org/10.1007/s11628-020-00433-6

Ibarra D, Ganzarain J, Igartua JI (2018) Business model innovation through Industry 4.0: a review. Procedia Manuf 22:4–10. https://doi.org/10.1016/J.PROMFG.2018.03.002

Johnson M, Mena C (2008) Supply chain management for servitised products: a multi-industry case study. Int J Prod Econ 114:27–39. https://doi.org/10.1016/j.ijpe.2007.09.011

Johnstone S, Dainty A, Wilkinson A (2009) Integrating products and services through life: an aerospace experience. Int J Oper Prod Manag 29(5):520–538. https://doi.org/10.1108/01443570910953612

Kamp B, Alcalde H (2014) Servitization in the basque economy. Strateg Change 23:359–374. https://doi.org/10.1002/jsc.1982

Kamp B, Parry G (2017) Servitization and advanced business services as levers for competitiveness. Ind Mark Manag 60:11–16. https://doi.org/10.1016/j.indmarman.2016.12.008

Kanninen T, Penttinen E, Tinnilä M, Kaario K (2017) Exploring the dynamic capabilities required for servitization: the case process industry. Bus Process Manag J 23(2):226–247. https://doi.org/10.1108/BPMJ-03-2015-0036

Kans M, Ingwald A (2016) Business model development towards service management 4.0. Procedia CIRP 47:489–494. https://doi.org/10.1016/J.PROCIR.2016.03.228

Khanra S, Dhir A, Parida V, Kohtamäki M (2021) Servitization research: a review and bibliometric analysis of past achievements and future promises. J Bus Res 131:151–166. https://doi.org/10.1016/j.jbusres.2021.03.056

Kharlamov AA, Parry G (2021) The impact of servitization and digitization on productivity and profitability of the firm: a systematic approach. Prod Plann Control 32:185–197. https://doi.org/10.1080/09537287.2020.1718793

Kimita K, McAloone T, Ogata K, Pigosso D (2022) Servitization maturity model: developing distinctive capabilities for successful servitization in manufacturing companies. J Manuf Technol Manag 33(9):61–87. https://doi.org/10.1108/JMTM-07-2021-0248

Kindström D, Kowalkowski C (2014) Service innovation in product-centric firms: a multidimensional business model perspective. J Bus Ind Mark 29(2):96–111. https://doi.org/10.1108/JBIM-08-2013-0165

Kohtamaki M, Henneberg SC, Martinez V, Kimita K, Gebauer H (2019a) A configurational approach to servitization: review and research directions. Serv Sci 11(3):1–29. https://doi.org/10.1287/serv.2019.0245

Kohtamaki M, Rabetino R, Einola S, Parida V, Patel P (2021) Unfolding the digital servitization path from products to product-service-software systems: practicing change through intentional narratives. J Bus Res 137:379–392. https://doi.org/10.1016/j.jbusres.2021.08.027

Kohtamäki M, Partanen J (2016) Co-creating value from knowledge-intensive business services in manufacturing firms: the moderating role of relationship learning in supplier-customer interactions. J Bus Res 69(7):2498–2506. https://doi.org/10.1016/j.jbusres.2016.02.019

Kohtamäki M, Parida V, Oghazi P, Gebauer H, Baines T (2019b) Digital servitization business models in ecosystems: a theory of the firm. J Bus Res 104:380–392. https://doi.org/10.1016/j.jbusres.2019.06.027

Kohtamäki M, Einola S, Rabetino R (2020a) Exploring servitization through the paradox lens: coping practices in servitization. Int J Prod Econ 226:1–15. https://doi.org/10.1016/j.ijpe.2020.107619

Kohtamäki M, Parida V, Patel P, Gebauer H (2020b) The relationship between digitalization and servitization: the role of servitization in capturing the financial potential of digitalization. Technol Forecast Soc Change 151:1–35. https://doi.org/10.1016/j.techfore.2019.119804

Kolagar M, Parida V, Sjödin D (2022) Ecosystem transformation for digital servitization: a systematic review, integrative framework, and future research agenda. J Bus Res 146:176–200. https://doi.org/10.1016/j.jbusres.2022.03.067

Kowalkowski C, Gebauer H, Kamp B, Parry G (2017) Servitization and deservitization: overview, concepts, and definitions. Ind Mark Manag 60:4–10. https://doi.org/10.1016/j.indmarman.2016.12.007

Kreye ME, Roehrich JK, Lewis MA (2015) Servitizing manufacturers: the impact of service complexity and contractual and relational capabilities. Prod Plann Control 26:1233–1246. https://doi.org/10.1080/09537287.2015.1033489

Le-Dain MA, Benhayoun L, Matthews J, Liard M (2023) Barriers and opportunities of digital servitization for SMEs: the effect of smart product-service system business models. Serv Bus 17:359–393. https://doi.org/10.1007/s11628-023-00520-4

Lee J, Kao HA, Yang S (2014) Service innovation and smart analytics for Industry 4.0 and big data environment. Procedia CIRP 16:3–8. https://doi.org/10.1016/j.procir.2014.02.001

Leoni L, Aria M (2021) A thirty-year bibliometric analysis on servitization. Int J Serv Sci Manag Eng Technol 12(3):73–95. https://doi.org/10.4018/IJSSMET.2021050105

Levitt T (1972) Production-line approach to service. Harv Bus Rev 50:41–52

Lightfoot H, Baines T, Smart P (2013) The servitization of manufacturing: a systematic literature review of interdependent trends. Int J Oper Prod Manag 33(11/12):1408–1434. https://doi.org/10.1108/IJOPM-07-2010-0196

Lindman M, Pennanen K, Rothenstei J, Scozzi B, Vincze Z (2016) The value space: how firms facilitate value creation. Bus Process Manag J 22(4):736–762. https://doi.org/10.1108/BPMJ-09-2015-0126

Luoto S, Brax SA, Kohtamäki M (2017) Critical meta-analysis of servitization research: constructing a model-narrative to reveal paradigmatic assumptions. Ind Mark Manag 60:89–100. https://doi.org/10.1016/j.indmarman.2016.04.008

Manzini E, Vezzoli C (2003) A strategic design approach to develop sustainable product service systems: examples taken from the ‘environmentally friendly innovation’ Italian prize. J Clean Prod 11(8):851–857. https://doi.org/10.1016/S0959-6526(02)00153-1

Marcon É, Marcon A, Ayala NF, Frank AG et al (2022) Capabilities supporting digital servitization: a multi-actor perspective. Ind Mark Manag 103:97–116. https://doi.org/10.1016/j.indmarman.2022.03.003

Martínez V, Bastl M, Kingston J, Evans S (2010) Challenges in transforming manufacturing organizations into product-service providers. J Manuf Technol Manag 21(4):449–469. https://doi.org/10.1108/17410381011046571

Martín-Peña ML, Ziaee Bigdeli A (2016) Servitization: academic research and business practice. Univ Bus Rev 49:18–31

Martín-Peña ML, Pinillos MJ, Reyes LE (2017) The intellectual basis of servitization: a bibliometric analysis. J Eng Technol Manag 43:83–97. https://doi.org/10.1016/j.jengtecman.2017.01.005

Martín-Peña ML, Díaz-Garrido E, Sánchez-López JM (2018) The digitalization and servitization of manufacturing: a review on digital business models. Strateg Change 27:91–99. https://doi.org/10.1002/jsc.2184

Martín-Peña ML, Sánchez-López JM, Díaz-Garrido E (2020) Servitization and digitalization in manufacturing: the influence on firm performance. J Bus Ind Mark 35(3):564–574. https://doi.org/10.1108/JBIM-12-2018-0400

Martín-Peña ML, Sanchez-Lopez JM, Kamp B, Gimenez-Fernandez EM (2023) The innovation antecedents behind the servitization-performance relationship. R D Manag 53:1–23. https://doi.org/10.1111/radm.12586

Mathieu V (2001) Service strategies within the manufacturing sector: benefits, costs and partnership. Int J Serv Ind Manag 12(5):451–475. https://doi.org/10.1108/EUM0000000006093

Matthyssens P, Vandenbempt K (2010) Service addition as business market strategy: identification of transition trajectories. J Serv Manag 21(5):693–714. https://doi.org/10.1108/09564231011079101

Minaya PE, Avella L, Trespalacios JA (2023) The effects of digital servitization on business competitiveness: a case study of Spanish manufacturers. J Int Entrep 21:180–213. https://doi.org/10.1007/s10843-023-00333-6

Momeni K, Raddats C, Martinsuo M (2023) Mechanisms for developing operational capabilities in digital servitization. Int J Oper Prod Manag 43(13):101–127. https://doi.org/10.1108/IJOPM-04-2022-0259

Mont O (2002) Clarifying the concept of product-service system. J Clean Prod 10(3):237–245. https://doi.org/10.1016/S0959-6526(01)00039-7

Mosch P, Schweikl S, Obermaier R (2021) Trapped in the supply chain? Digital servitization strategies and power relations in the case of an industrial technology supplier. Int J Prod Econ 236:1–14. https://doi.org/10.1016/j.ijpe.2021.108141

Müller JM, Buliga O, Voigt KI (2021) The role of absorptive capacity and innovation strategy in the design of Industry 4.0 business models—a comparison between SMEs and large enterprises. Eur Manag J 39(3):333–343. https://doi.org/10.1016/j.emj.2020.01.002

Münch C, Marx E, Benz L, Hartmann E, Matzner M (2022) Capabilities of digital servitization: evidence from the socio-technical systems theory. Technol Forecast Soc Change 176:1–17. https://doi.org/10.1016/j.techfore.2021.121361

Naik P, Schroeder A, Kapoor K, Ziaee Bigdeli A (2020) Behind the scenes of digital servitization: actualising IoT-enabled affordances. Ind Mark Manag 89:232–244. https://doi.org/10.1016/j.indmarman.2020.03.010

Neely A, Benedettini O, Visnjic I (2011) The servitization of manufacturing: further evidence. University of Cambridge, Cambridge, pp 1–11

Nordin F, Kowalkowski C (2010) Solutions offerings: a critical review and reconceptualization. J Serv Manag 21(4):441–459. https://doi.org/10.1108/09564231011066105

Oliva R, Kallenberg R (2003) Managing the transition from products to services. Int J Serv Ind Manag 14(2):160–172. https://doi.org/10.1108/09564230310474138

Oliveira MG, Mendes GH, Rozenfeld H (2015) Bibliometric analysis of the product-service system research field. Procedia CIRP 30:114–119. https://doi.org/10.1016/j.procir.2015.02.139

Opazo-Basáez M, Vendrell-Herrero F, Bustinza OF (2021) Digital service innovation: a paradigm shift in technological innovation. J Serv Manag 33:97–120. https://doi.org/10.1108/JOSM-11-2020-0427

Ostrom AL, Bitner MJ, Brown SW, Burkhard KA et al (2010) Moving forward and making a difference: research priorities for the science of service. J Serv Res 13:4–36. https://doi.org/10.1177/1094670509357611

Paiola M, Gebauer H (2020) Internet of things technologies, digital servitization and business model innovation in BtoB manufacturing firms. Ind Mark Manag 89:245–264. https://doi.org/10.1016/j.indmarman.2020.03.009

Paiola M, Saccani N, Perona M, Gebauer H (2013) Moving from products to solutions: strategic approaches for developing capabilities. Eur Manag J 31(4):390–409. https://doi.org/10.1016/j.emj.2012.10.002

Parida V, Sjödin DR, Wincent J, Kohtamäki M (2014) Mastering the transition to product-service provision: insights into business models, learning activities, and capabilities. Res Technol Manag 57:44–52. https://doi.org/10.5437/08956308X5703227

Paschou T, Rapaccini M, Adrodegari F, Saccani N (2020) Digital servitization in manufacturing: a systematic literature review and research agenda. Ind Mark Manag 89:278–292. https://doi.org/10.1016/j.indmarman.2020.02.012

Pessôa MVP, Becker JMJ (2017) Overcoming the product-service model adoption obstacles. Procedia CIRP 64:163–168. https://doi.org/10.1016/j.procir.2017.03.062

Pettigrew AM (1988) The management of strategic change. B. Blackwell, Oxford

Pinillos MJ, Díaz-Garrido E, Martín-Peña ML (2022) The origin and evolution of the concept of servitization: a co-word and network analysis. J Bus Ind Mark 37(7):1497–1514. https://doi.org/10.1108/JBIM-02-2021-0120

Pombo D, Franco M (2023) A qualitative investigation of infusing products with service via strategic alliances among SMEs: a case of servitization. Serv Bus 17:529–555. https://doi.org/10.1007/s11628-023-00530-2

Pye A, Pettigrew A (2005) Studying board context, process and dynamics: some challenges for the future. Brit J Manag 16:27–38. https://doi.org/10.1111/j.1467-8551.2005.00445.x

Qi Y, Mao Z, Zhang M, Guo H (2020) Manufacturing practices and servitization: the role of mass customization and product innovation capabilities. Int J Prod Econ 228:1–10. https://doi.org/10.1016/j.ijpe.2020.107747

Rabetino R, Kohtamäki M, Gebauer H (2017) Strategy map of servitization. Int J Prod Econ 192:144–156. https://doi.org/10.1016/j.ijpe.2016.11.004

Rabetino R, Kohtamäki M, Brax SA, Sihvonen J (2021) The tribes in the field of servitization: discovering latent streams across 30 years of research. Ind Mark Manag 95:70–84. https://doi.org/10.1016/j.indmarman.2021.04.005

Rabetino R, Kohtamäki M, Huikkola T (2023) Digital service innovation (DSI): a multidisciplinary (re)view of its origins and progress using bibliometric and text mining methods. J Serv Manag. https://doi.org/10.1108/JOSM-12-2022-0375

Raddats C (2011) Aligning industrial services with strategies and sources of market differentiation. J Bus Ind Mark 26(5):332–343. https://doi.org/10.1108/08858621111144398

Raddats C, Baines T, Burton J, Story VM, Zolkiewski J (2016) Motivations for servitization: the impact of product complexity. Int J Oper Prod Manag 36(5):572–591. https://doi.org/10.1108/IJOPM-09-2014-0447

Raddats C, Zolkiewski J, Story VM, Burton J et al (2017) Interactively developed capabilities: evidence from dyadic servitization relationships. Int J Oper Prod Manag 37(3):382–400. https://doi.org/10.1108/IJOPM-08-2015-0512

Raddats C, Kowalkowski C, Benedettini O, Burton J, Gebauer H (2019) Servitization: a contemporary thematic review of four major research streams. Ind Mark Manag 83:207–223. https://doi.org/10.1016/j.indmarman.2019.03.015

Raddats C, Naik P, Ziaee Bigdeli A (2022) Creating value in servitization through digital service innovations. Ind Mark Manag 104:1–13. https://doi.org/10.1016/j.indmarman.2022.04.002

Rakic S, Pero M, Sianesi A, Marjanovic U (2022) Digital servitization and firm performance: technology intensity approach. Eng Econ 33(4):398–413. https://doi.org/10.5755/j01.ee.33.4.29649

Reim W, Parida V, Örtqvist D (2015) Product-Service Systems (PSS) business models and tactics—a systematic literature review. J Clean Prod 97:61–75. https://doi.org/10.1016/J.JCLEPRO.2014.07.003

Reim W, Sjödin DR, Parida V (2019) Servitization of global service network actors—a contingency framework for matching challenges and strategies in service transition. J Bus Res 104:461–471. https://doi.org/10.1016/j.jbusres.2019.01.032

Rogers EM (2003) Diffusion of innovations. Free Press, New York

Ruiz-Martín A, Díaz-Garrido E (2021) A review of servitization theoretical foundations. J Ind Eng Manag 14(3):496–519. https://doi.org/10.3926/jiem.3466

Sandström S, Edvardsson B, Kristensson P, Magnusson P (2008) Value in use through service experience. Manag Serv Qual 18(2):112–126. https://doi.org/10.1108/09604520810859184

Santamaría L, Jesús Nieto M, Miles I (2012) Service innovation in manufacturing firms: evidence from Spain. Technovation 32(2):144–155. https://doi.org/10.1016/j.technovation.2011.08.006

Schroeder A, Baines T, Sakao T (2022) Increasing value capture by enhancing manufacturer commitment-managing the servitization process. IEEE Eng Manag Rev 50(3):1–13. https://doi.org/10.1109/EMR.2022.3197075

Shen L, Sun W, Parida V (2023) Consolidating digital servitization research: a systematic review, integrative framework, and future research directions. Technol Forecast Soc Change 191:1–24. https://doi.org/10.1016/j.techfore.2023.122478

Sjödin D, Parida V, Kohtamaki M, Wincent J (2020) An agile co-creation process for digital servitization: a micro-service innovation approach. J Bus Res 112:478–491. https://doi.org/10.1016/j.jbusres.2020.01.009

Sousa R, Da Silveira G (2017) Capability antecedents and performance outcomes of servitization: differences between basic and advanced services. Int J Oper Prod Manag 37(4):444–467. https://doi.org/10.1108/IJOPM-11-2015-0696

Spring M, Araujo L (2013) Beyond the service factory: service innovation in manufacturing supply networks. Ind Mark Manag 42:59–70. https://doi.org/10.1016/j.indmarman.2012.11.006

Tao F, Qi Q (2017) New IT driven service-oriented smart manufacturing: framework and characteristics. IEEE Trans Syst Man Cybern Syst 49:81–91. https://doi.org/10.1109/TSMC.2017.2723764

Thomé AMT, Scavarda LF, Scavarda AJ (2016) Conducting systematic literature review in operations management. Prod Plann Control 27(5):408–420. https://doi.org/10.1080/09537287.2015.1129464

Tian J, Coreynen W, Matthyssens P, Shen L (2022) Platform-based servitization and business model adaptation by established manufacturers. Technovation 118:1–22. https://doi.org/10.1016/j.technovation.2021.102222

Tranfield D, Denyer D, Smart P (2003) Towards a methodology for developing evidence-informed management knowledge by means of systematic review. Brit J Manag 14:207–222. https://doi.org/10.1111/1467-8551.00375

Tronvoll B, Sklyar A, Sorhammar D, Kowalkowski C (2020) Transformational shifts through digital servitization. Ind Mark Manag 89:293–305. https://doi.org/10.1016/j.indmarman.2020.02.005

Tukker A (2004) Eight types of product-service system: eight ways to sustainability? Experience from SusProNet. Bus Strategy Environ 13:246–260. https://doi.org/10.1002/bse.414

Tukker A (2015) Product services for a resource-efficient and circular economy—a review. J Clean Prod 97:76–91. https://doi.org/10.1016/J.JCLEPRO.2013.11.049

Tukker A, Tischner U (2006) Product-services as a research field: past, present and future. Reflections from a decade of research. J Clean Prod 14(17):1552–1556. https://doi.org/10.1016/j.jclepro.2006.01.022

Tuli KR, Kohli AK, Bharadwaj SG (2007) Rethinking customer solutions: from product bundles to relational processes. J Mark 71(3):1–17. https://doi.org/10.1509/jmkg.71.3.1

Vandermerwe S, Rada J (1988) Servitization of business: adding value by adding services. Eur Manag J 6(4):314–324. https://doi.org/10.1016/0263-2373(88)90033-3

Vargo SL, Lusch RF (2008) Service-dominant logic: continuing the evolution. J Acad Mark Sci 36(1):1–10. https://doi.org/10.1007/s11747-007-0069-6

Vendrell-Herrero F, Bustinza OF, Parry G, Georgantzis N (2017) Servitization, digitization and supply chain interdependency. Ind Mark Manag 60:69–81. https://doi.org/10.1016/j.indmarman.2016.06.013

Visnjic I, Van Looy B (2013) Servitization: disentangling the impact of service business model innovation on manufacturing firm performance. J Oper Manag 31(4):169–180. https://doi.org/10.2139/ssrn.2407380

Wang W, Lai K, Shou Y (2018) The impact of servitization on firm performance: a meta-analysis. Int J Oper Prod Manag 38(7):1562–1588. https://doi.org/10.1108/IJOPM-04-2017-0204

Wang Z, Chen CH, Zheng P, Li X, Khoo LP (2021) A graph-based context-aware requirement elicitation approach in smart product-service systems. Int J Prod Res 59(2):635–651. https://doi.org/10.1080/00207543.2019.1702227

Windahl C, Lakemond N (2006) Developing integrated solutions: the importance of relationships within the network. Ind Mark Manag 35(7):806–818. https://doi.org/10.1016/J.INDMARMAN.2006.05.010

Xing Y, Liu Y, Davies P (2023) Servitization innovation: a systematic review, integrative framework, and future research directions. Technovation 122:1–15. https://doi.org/10.1016/j.technovation.2022.102641

Yan K, Li G, Cheng TCE (2020) The impact of service-oriented organizational design factors on firm performance: the moderating role of service-oriented corporate culture. Int J Prod Econ 228:1–13. https://doi.org/10.1016/j.ijpe.2020.107745

Yu Y, Sung TJ (2023) A value-based view of the smart PSS adoption: a study of smart kitchen appliances. Serv Bus 17:499–527. https://doi.org/10.1007/s11628-023-00529-9

Zhang W, Banerji S (2017) Challenges of servitization: a systematic literature review. Ind Mark Manag 65:217–227. https://doi.org/10.1016/j.indmarman.2017.06.003

Zhang K, Feng L, Wang J, Lin KY, Li Q (2023) Servitization in business ecosystem: a systematic review and implications for business-to-business servitization research. Technol Anal Strateg Manag 35(11):1480–1496. https://doi.org/10.1080/09537325.2021.2010698

Zheng P, Lin T, Chen C, Xu X (2018) A systematic design approach for service innovation of smart product-service systems. J Clean Prod 201:657–667. https://doi.org/10.1016/j.jclepro.2018.08.101

Zheng P, Liu Y, Tao F, Wang Z, Chen C (2019) Smart product-service systems solution design via hybrid crowd sensing approach. IEEE Access 7:1–12. https://doi.org/10.1109/ACCESS.2019.2939828

Zhou C, Song W (2021) Digitalization as a way forward: a bibliometric analysis of 20 years of servitization research. J Clean Prod 300:1–14. https://doi.org/10.1016/j.jclepro.2021.126943

Zhou D, Yan T, Dai W, Feng J (2021) Disentangling the interactions within and between servitization and digitalization strategies: a service-dominant logic. Int J Prod Econ 238:1–16. https://doi.org/10.1016/j.ijpe.2021.108175

Ziaee Bigdeli A, Baines T, Bustinza OF, Guang Shi V (2017) Organisational change towards servitization: a theoretical framework. Compet Rev 27(1):12–39. https://doi.org/10.1108/CR-03-2015-0015

Ziaee Bigdeli A, Baines T, Schroeder A, Brown S (2018) Measuring servitization progress and outcome: the case of ‘advanced services.’ Prod Plann Control 29(4):315–332. https://doi.org/10.1080/09537287.2018.1429029

Zighan S, Abualqumboz M (2022) Dual focus: service-product orientation to manage the change paradox following servitization strategy. Serv Bus 16:29–55. https://doi.org/10.1007/s11628-022-00483-y


Funding

Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Author information

Authors and affiliations

Management and Business Economics Department, University of Leon, Leon, Spain

Pedro E. Minaya

Business Administration Department, University of Oviedo, Oviedo, Spain

Lucía Avella & Juan A. Trespalacios


Corresponding author

Correspondence to Pedro E. Minaya.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Minaya, P.E., Avella, L. & Trespalacios, J.A. Synthesizing three decades of digital servitization: a systematic literature review and conceptual framework proposal. Serv Bus (2024). https://doi.org/10.1007/s11628-024-00559-x


Received: 28 September 2023

Accepted: 16 April 2024

Published: 08 May 2024

DOI: https://doi.org/10.1007/s11628-024-00559-x


Keywords

  • Digital servitization
  • Industry 4.0
  • Product-service system
  • Systematic literature review
  • Business competitiveness
