The Nature of Mathematics: Let’s Talk about It

Teachers can offer opportunities for K–12 students to reflect on the nature of mathematics (NOM) as they learn.

Mathematics Teacher: Learning and Teaching PK-12

Article Section Headings

  • VIEWS OF THE NATURE OF MATHEMATICS
  • The Five-Point View of the Nature of Mathematics
  • Counting in Kindergarten ARC Activity
  • Equivalent Fractions in Grade 3 ARC Activity
  • Discovering Area Relationships in Middle School ARC Activity
  • Absolute Value in High School ARC Activity

Article Information

  • Article by Lucy A. Watson
  • Article by Christopher T. Bonnesen
  • Article by Jeremy F. Strayer

© 2024 National Council of Teachers of Mathematics (NCTM)

“What is Mathematics?” and why we should ask, where one should experience and learn that, and how to teach it

  • Conference paper
  • Open Access
  • First Online: 02 November 2017

  • Günter M. Ziegler
  • Andreas Loos

Part of the book series: ICME-13 Monographs (ICME13Mo)

“What is Mathematics?” [with a question mark!] is the title of a famous book by Courant and Robbins, first published in 1941, which does not answer the question. The question is, however, essential: The public image of the subject (of the science, and of the profession) is not only relevant for the support and funding it can get, but it is also crucial for the talent it manages to attract—and thus ultimately determines what mathematics can achieve, as a science, as a part of human culture, but also as a substantial component of economy and technology. In this lecture we thus

  • discuss the image of mathematics (where “image” might be taken literally!),

  • sketch a multi-facetted answer to the question “What is Mathematics?,”

  • stress the importance of learning “What is Mathematics” in view of Klein’s “double discontinuity” in mathematics teacher education,

  • present the “Panorama project” as our response to this challenge,

  • stress the importance of telling stories in addition to teaching mathematics, and finally,

  • suggest that the mathematics curricula at schools and at universities should correspondingly have space and time for at least three different subjects called Mathematics.

This paper is a slightly updated reprint of: Günter M. Ziegler and Andreas Loos, Learning and Teaching “ What is Mathematics ”, Proc. International Congress of Mathematicians, Seoul 2014, pp. 1201–1215; reprinted with kind permission by Prof. Hyungju Park, the chairman of ICM 2014 Organizing Committee.

What Is Mathematics?

Defining mathematics. According to Wikipedia in English, in the March 2014 version, the answer to “What is Mathematics?” is

Mathematics is the abstract study of topics such as quantity (numbers), structure, space, and change. There is a range of views among mathematicians and philosophers as to the exact scope and definition of mathematics. Mathematicians seek out patterns (Highland & Highland, 1961 , 1963 ) and use them to formulate new conjectures. Mathematicians resolve the truth or falsity of conjectures by mathematical proof. When mathematical structures are good models of real phenomena, then mathematical reasoning can provide insight or predictions about nature. Through the use of abstraction and logic, mathematics developed from counting, calculation, measurement, and the systematic study of the shapes and motions of physical objects. Practical mathematics has been a human activity for as far back as written records exist. The research required to solve mathematical problems can take years or even centuries of sustained inquiry.

None of this is entirely wrong, but it is also not satisfactory. Let us just point out that the fact that there is no agreement about the definition of mathematics, given as part of a definition of mathematics, puts us into logical difficulties that might have made Gödel smile.
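One line of the Wikipedia passage quoted above does point at actual practice: mathematicians seek out patterns, formulate conjectures, and settle them by proof. As an illustration of our own (a sketch not taken from the quoted text), a few lines of Python separate "evidence from patterns" from "proof" for Goldbach's conjecture, which asserts that every even integer greater than 2 is a sum of two primes:

```python
# Checking many cases of Goldbach's conjecture builds confidence in the
# pattern, but only a proof (still unknown!) would resolve the conjecture.

def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def goldbach_witness(n: int):
    """Return primes (p, q) with p + q == n, or None if none exists."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

# Every even number from 4 to 1000 has a witness: evidence, not proof.
assert all(goldbach_witness(n) is not None for n in range(4, 1001, 2))
print(goldbach_witness(100))  # -> (3, 97)
```

The gap between "verified up to 1000" (or, in the literature, up to 4·10^18) and "true for all even numbers" is exactly the gap between pattern and theorem that the quoted definition glosses over.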

The answer given by the current German version of Wikipedia reads (in our translation):

Mathematics […] is a science that developed from the investigation of geometric figures and from computing with numbers. For mathematics , there is no commonly accepted definition; today it is usually described as a science that uses logic to investigate abstract structures, which it created itself by logical definitions, for their properties and patterns.

This is much worse, as it portrays mathematics as a subject without any contact with, or interest for, the real world.

The borders of mathematics. Is mathematics “stand-alone”? Could it be defined without reference to “neighboring” subjects, such as physics (which does appear in the English Wikipedia description)? Indeed, one way to characterize mathematics is to describe the borders/boundaries that separate it from its neighbors. Even humorous versions of such “distinguishing statements,” such as

“Mathematics is the part of physics where the experiments are cheap.”

“Mathematics is the part of philosophy where (some) statements are true—without debate or discussion.”

“Mathematics is computer science without electricity.” (So “Computer science is mathematics with electricity.”)

contain a lot of truth and possibly tell us a lot about the “characteristics” of our subject. None of these is, of course, completely true or completely false, but they present opportunities for discussion.

What we do in mathematics . We could also try to define mathematics by “what we do in mathematics”: this is much more diverse and much more interesting than the Wikipedia descriptions! Could/should we describe mathematics not only as a research discipline and as a subject taught and learned at school, but also as a playground for pupils, amateurs, and professionals; as a subject that presents challenges to all of them; as an arena for competitions; and as a source of problems, small and large, including some of the hardest problems that science has to offer, at all levels from elementary school to the millennium problems (Csicsery, 2008 ; Ziegler, 2011 )?

What we teach in mathematics classes . Education bureaucrats might (and probably should) believe that the question “What is Mathematics?” is answered by high school curricula. But what answers do these give?

This takes us back to the nineteenth-century controversies about what mathematics should be taught at school and at universities. In Germany this was a fierce debate. On one side stood the classical educational ideal as formulated by Wilhelm von Humboldt (who was involved in the conception and foundation in 1810 of the Berlin University, now named Humboldt-Universität, and to a certain extent shaped the modern concept of a university); here mathematics had a central role, but this was the classical “Greek” mathematics, starting from Euclid’s axiomatic development of geometry, the theory of conics, and the algebra of solving polynomial equations—not only as cultural heritage, but also as a training arena for logical thinking and problem solving. On the other side of the fight were the proponents of “Realbildung”: the Realgymnasien and the technical universities founded at that time tried to teach what was needed in commerce and industry—calculation and accounting, as well as the mathematics that could be useful for mechanical and electrical engineering—second-rate education in the view of the classical German Gymnasium.

This nineteenth-century debate rests on an unnatural separation into classical, pure mathematics and useful, applied mathematics—a division that should have been overcome long ago (perhaps since the times of Archimedes), as it is unnatural as a classification tool and a major obstacle to progress both in theory and in practice. Nevertheless, the division into “classical” and “current” material might be useful in discussing curriculum contents—and the question of for what purpose they should be taught; see our discussion in the Section “ Three Times Mathematics at School? ”.

The Courant–Robbins answer . The title of the present paper is, of course, borrowed from the famous and very successful book by Richard Courant and Herbert Robbins. However, this title is a question—what is Courant and Robbins’ answer? Indeed, the book does not give an explicit definition of “What is Mathematics,” but the reader is supposed to get an idea from the presentation of a diverse collection of mathematical investigations. Mathematics is much bigger and much more diverse than the picture given by the Courant–Robbins exposition. The presentation in this section was also meant to demonstrate that we need a multi-facetted picture of mathematics: One answer is not enough, we need many.

Why Should We Care?

The question “What is Mathematics?” probably does not need to be answered to motivate why mathematics should be taught, as long as we agree that mathematics is important.

However, a one-sided answer to the question leads to one-sided concepts of what mathematics should be taught.

At the same time, a one-dimensional picture of “What is Mathematics” will fail to motivate kids at school to do mathematics; it will fail to motivate enough pupils to study mathematics, or even to think about mathematics studies as a possible career choice; and it will fail to motivate the right students to go into mathematics studies, or into mathematics teaching. If the answer to the question “What is Mathematics”—or the implicit answer given by the public/prevailing image of the subject—is not attractive, then it will be very difficult to motivate why mathematics should be learned, and it will lead to the wrong offers and the wrong choices as to what mathematics should be learned.

Indeed, would anyone consider a science that studies “abstract” structures that it created itself (see the German Wikipedia definition quoted above) interesting? Could it be relevant? If this is what mathematics is, why would or should anyone want to study this, get into this for a career? Could it be interesting and meaningful and satisfying to teach this?

Also, in view of the diversity of the students’ expectations and talents, we believe that one answer is plainly not enough. Some students might be motivated to learn mathematics because it is beautiful, because it is so logical, because it is sometimes surprising. Or because it is part of our cultural heritage. Others might be motivated, and not deterred, by the fact that mathematics is difficult. Others might be motivated by the fact that mathematics is useful and needed—in everyday life, for technology and commerce, etc. But indeed, it is not true that “the same” mathematics is needed in everyday life, for university studies, or in commerce and industry. To other students, the motivation that “it is useful” or “it is needed” will not be sufficient. All these motivations are valid and good—and it is also totally valid and acceptable that no single one of these possible types of arguments will reach and motivate all these students.

Why do so many pupils and students fail in mathematics, both at school and at universities? There are certainly many reasons, but we believe that motivation is a key factor. Mathematics is hard. It is abstract (that is, most of it is not directly connected to everyday-life experiences). It is not considered worthwhile. But a lot of the insufficient motivation comes from the fact that students and their teachers do not know “What is Mathematics.”

Thus a multi-facetted image of mathematics as a coherent subject, all of whose many aspects are well connected, is important for a successful teaching of mathematics to students with diverse (possible) motivations.

This leads, in turn, to two crucial aspects, to be discussed here next: What image do students have of mathematics? And then, what should teachers answer when asked “What is Mathematics”? And where and how and when could they learn that?

The Image of Mathematics

A study by Mendick, Epstein, and Moreau ( 2008 ), based on an extensive survey among British students, was summarized as follows:

Many students and undergraduates seem to think of mathematicians as old, white, middle-class men who are obsessed with their subject, lack social skills and have no personal life outside maths. The students’ views of maths itself included narrow and inaccurate images that are often limited to numbers and basic arithmetic.

The students’ image of what mathematicians are like is very relevant and turns out to be a massive problem, as it defines possible (anti-)role models, which are crucial for any decision in the direction of “I want to be a mathematician.” If the typical mathematician is viewed as an “old, white, male, middle-class nerd,” then why should a gifted 16-year-old girl come to think “that’s what I want to be when I grow up”? Mathematics as a science, and as a profession, loses (or fails to attract) a lot of talent this way! However, this is not the topic of this presentation.

On the other hand, the first and the second diagnoses of the quote from Mendick et al. ( 2008 ) belong together: the mathematicians are part of “What is Mathematics”!

And indeed, looking at the second diagnosis, if for the key word “mathematics” the images that spring to mind don’t go beyond a per se meaningless “ \( a^{2} + b^{2} = c^{2} \) ” scribbled in chalk on a blackboard—then again, why should mathematics be attractive, as a subject, as a science, or as a profession?

We think that we have to look for, and work on, multi-facetted and attractive representations of mathematics by images. This could be many different, separate images, but this could also be images for “mathematics as a whole.”

Four Images for “What Is Mathematics?”

Striking pictorial representations of mathematics as a whole (as well as of other sciences!) and of their change over time can be seen on the covers of the German “Was ist was” books. The history of these books starts with the series of “How and why” Wonder Books published by Grosset & Dunlap, New York, since 1961, which was to present interesting subjects (starting with “Dinosaurs,” “Weather,” and “Electricity”) to children and younger teenagers. The series was published in the US and in Great Britain in the 1960s and 1970s, but it was and is much more successful in Germany, where it has been published (first in translation, then in volumes written in German) by Ragnar Tessloff since 1961. Volume 18 in the US/UK version and Volume 12 in the German version treat “Mathematics”, first published in 1963 (Highland & Highland, 1963 ), but then republished with the same title but a new author and new contents in 2001 (Blum, 2001 ). While it is worthwhile to study the contents and presentation of mathematics in these volumes, we here focus on the cover illustrations (see Fig.  1 ), which for the German edition exist in four entirely different versions, the first one being an adaptation of the original US cover of (Highland & Highland, 1961 ).

The four covers of “Was ist was. Band 12: Mathematik” (Highland & Highland, 1963 ; Blum, 2001 )

All four covers represent a view of “What is Mathematics” in a collage mode, where the first one represents mathematics as a mostly historical discipline (starting with the ancient Egyptians), while the others all contain a historical allusion (such as pyramids, Gauß, etc.) alongside objects of mathematics (such as prime numbers or \( \pi \) , dice to illustrate probability, geometric shapes). One notable object is the oddly “two-colored” Möbius band on the 1983 cover, which was changed to an entirely green version in a later reprint.

One can discuss these covers with respect to their contents and their styles, and in particular in terms of attractiveness to the intended buyers/readers. What is over-emphasized? What is missing? It seems more important to us to

  • think of our own images/representations for “What is Mathematics”,

  • think about how to present a multi-facetted image of “What is Mathematics” when we teach.

Indeed, the topics on the covers of the “Was ist was” volumes of course represent interesting (?) topics and items discussed in the books. But what do they add up to? We should compare this to the image of mathematics as represented by school curricula, or by the university curricula for teacher students.

In the context of mathematics images, let us mention two substantial initiatives to collect and provide images from current mathematics research, and make them available on internet platforms, thus providing fascinating, multi-facetted images of mathematics as a whole discipline:

Guy Métivier et al.: “Image des Maths. La recherche mathématique en mots et en images” [“Images of Maths. Mathematical research in words and images”], CNRS, France, at images.math.cnrs.fr (texts in French)

Andreas D. Matt, Gert-Martin Greuel et al.: “IMAGINARY. open mathematics,” Mathematisches Forschungsinstitut Oberwolfach, at imaginary.org (texts in German, English, and Spanish).

The latter has developed from a very successful travelling exhibition of mathematics images, “IMAGINARY—through the eyes of mathematics,” originally created on the occasion of the German national science year 2008 “Jahr der Mathematik. Alles was zählt” [“Year of Mathematics 2008. Everything that counts”], see www.jahr-der-mathematik.de . Initiatives such as the IMAGINARY exhibition had a great part in the success of that year in communicating a current, attractive image of mathematics to the German public.

Teaching “What Is Mathematics” to Teachers

More than 100 years ago, in 1908, Felix Klein analyzed the education of teachers. In the introduction to the first volume of his “Elementary Mathematics from a Higher Standpoint” he wrote (our translation):

At the beginning of his university studies, the young student is confronted with problems that do not remind him at all of what he has dealt with up to then, and of course, he forgets all these things immediately and thoroughly. When after graduation he becomes a teacher, he has to teach exactly this traditional elementary mathematics, and since he can hardly link it with his university mathematics, he soon readopts the former teaching tradition and his studies at the university become a more or less pleasant reminiscence which has no influence on his teaching (Klein, 1908 ).

This phenomenon—which Klein calls the double discontinuity —can still be observed. In effect, the teacher students “tunnel” through university: they study at university in order to get a degree, but nevertheless they afterwards teach the mathematics that they had learned in school, possibly with the didactics they remember from their own school education. This problem, observed and characterized by Klein, gets even worse in a situation (which we currently observe in Germany) where there is a grave shortage of mathematics teachers, so university students are invited to teach at high school long before graduating from university; they thus have much less university education to tunnel at the time when they start to teach in school. This may also strengthen their conviction that university mathematics is not needed in order to teach.

How to avoid the double discontinuity is, of course, a major challenge for the design of university curricula for mathematics teachers. One important aspect, however, is tied to the question of “What is Mathematics?”: a very common highschool image/concept of mathematics, as represented by curricula, is that mathematics consists of the subjects presented by highschool curricula, that is, (elementary) geometry, algebra (in the form of arithmetic, and perhaps polynomials), plus perhaps elementary probability and calculus (differentiation and integration) in one variable. That is the mathematics highschool students get to see, so they might think that this is all of it! Could their teachers present them a broader picture? After their highschool experience, the teachers studied at university, where they probably took courses in calculus/analysis, linear algebra, classical algebra, plus some discrete mathematics, stochastics/probability, and/or numerical analysis/differential equations, perhaps a programming or “computer-oriented mathematics” course. Altogether they have seen a scope of university mathematics in which no current research becomes visible, and in which most of the content is from the nineteenth century, at best. The ideal is, of course, that every teacher student at university has at least once experienced what “doing research on your own” feels like, but realistically this rarely happens. Indeed, teacher students would have to work and study and struggle a lot to see the fascination of mathematics on their own by doing mathematics; in reality they often do not even seriously start the tour, and certainly most of them never see the “glimpse of heaven.” So even if teacher students seriously immerse themselves in all the mathematics on the university curriculum, they will not get any broader image of “What is Mathematics?”. Thus, even if they do not tunnel their university studies due to the double discontinuity, they will not come back to school with a concept that is much broader than the one they originally gained in their highschool times.

Our experience is that many students (teacher students as well as classical mathematics majors) cannot name a single open problem in mathematics when graduating from university. They have no idea of what “doing mathematics” means—for example, that part of it is a struggle to find and shape the “right” concepts/definitions and to pose/develop the “right” questions and problems.

Moreover, even the impressions and experiences from university times will get old and outdated some day: a teacher might be active at a school for several decades—while mathematics changes! Whatever is proved in mathematics does stay true, of course, and indeed standards of rigor no longer change as much as they did in, say, the nineteenth century. However, styles of proof do change (see: computer-assisted proofs, computer-checkable proofs, etc.). Also, it would be good if a teacher could name “current research focus topics”: these do change over ten or twenty years. Moreover, the relevance of mathematics in “real life” has changed dramatically over the last thirty years.
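To make “computer-checkable proof” concrete: in a proof assistant such as Lean, both a statement and its proof are written in a formal language, and the system verifies the proof mechanically. A minimal illustration of our own (not from the text above), using the theorem `Nat.add_comm` from Lean 4’s core library:

```lean
-- A computer-checkable proof: Lean verifies mechanically that the
-- term `Nat.add_comm a b` really proves the stated equation.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If the proof term did not match the statement, the system would reject the file; this is the change in “style of proof” the paragraph alludes to.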

The Panorama Project

For several years, the present authors have been working on developing a course [and eventually a book (Loos & Ziegler, 2017 )] called “Panorama der Mathematik” [“Panorama of Mathematics”]. It primarily addresses mathematics teacher students and tries to give them a panoramic view of mathematics: we try to teach an overview of the subject, of how mathematics is done, and of who has been and is doing it, including a sketch of main developments over the last few centuries up to the present—altogether this is supposed to amount to a comprehensive (but not very detailed) outline of “What is Mathematics.” This, of course, turns out not to be an easy task, since it often tends to feel like reading/teaching poetry without mastering the language. However, the Panorama approach complements mathematics education in a direction orthogonal to the classic university courses: we do not teach the mathematics but present it (and encourage students to explore it); the responses we get from students suggest that they themselves feel this is valuable.

Our course has many different components and facets, which we here cast into questions about mathematics. All these questions (even the ones that “sound funny”) should and can be taken seriously, and answered as well as possible. For each of them, let us here just provide at most one line with key words for answers:

When did mathematics start?

Numbers and geometric figures start in the Stone Age; the science starts with Euclid?

How large is mathematics? How many Mathematicians are there?

The Mathematics Genealogy Project had 178,854 records as of 12 April 2014.

How is mathematics done, what is doing research like?

Collect (auto)biographical evidence! Recent examples: Frenkel ( 2013 ) , Villani ( 2012 ).

What does mathematics research do today? What are the Grand Challenges?

The Clay Millennium problems might serve as a starting point.

What and how many subjects and subdisciplines are there in mathematics?

See the Mathematics Subject Classification for an overview!

Why is there no “Mathematical Industry”, as there is e.g. Chemical Industry?

There is! See e.g. Telecommunications, Financial Industry, etc.

What are the “key concepts” in mathematics? Do they still “drive research”?

Numbers, shapes, dimensions, infinity, change, abstraction, …; they do.

What is mathematics “good for”?

It is a basis for understanding the world, but also for technological progress.

Where do we do mathematics in everyday life?

Not only where we compute, but also where we read maps, plan trips, etc.

Where do we see mathematics in everyday life?

There is more maths in every smartphone than anyone learns in school.

What are the greatest achievements of mathematics through history?

Make your own list!

An additional question is how to make university mathematics more “sticky” for the tunneling teacher students—how to encourage or force them to really connect to the subject as a science. Certainly there is no single, simple answer for this!

Telling Stories About Mathematics

How can mathematics be made more concrete? How can we help students to connect to the subject? How can mathematics be connected to the so-called real world?

Showing applications of mathematics is a good way (and a quite beaten path). Real applications can be very difficult to teach, since in most advanced, realistic situations a lot of different mathematical disciplines, theories, and types of expertise have to come together. Nevertheless, applications give the opportunity to demonstrate the relevance and importance of mathematics. Here we want to emphasize the difference between teaching a topic and telling about it. To name a few concrete topics: the mathematics behind weather reports and climate modelling is extremely difficult, complex, and advanced, but the “basic ideas” and simplified models can profitably be demonstrated in highschool and made plausible in highschool-level mathematical terms. Also, success stories like the PageRank formula from the Google patent (Page, 2001 ), see Langville and Meyer ( 2006 ), the race for the solution of larger and larger instances of the Travelling Salesman Problem (Cook, 2011 ), or the mathematics of chip design lend themselves to “telling the story” and “showing some of the maths” at a highschool level; these are among the topics presented in the first author’s recent book (Ziegler, 2013b ), where he takes 24 images as the starting points for telling stories—and thus develops a broader, multi-facetted picture of mathematics.
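The PageRank story mentioned above is a case where “showing some of the maths” really is feasible at highschool level: the ranking is the stationary distribution of a random surfer, computable by repeated averaging. A minimal sketch of our own in plain Python, with the standard damping factor 0.85 and an invented four-page toy web graph (not an example from the cited sources):

```python
# Toy PageRank by power iteration on an invented 4-page web graph.
# links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
d = 0.85  # damping factor from the standard PageRank formulation

rank = [1.0 / n] * n
for _ in range(100):
    # Each page keeps a base share (1-d)/n and receives d * rank/outdegree
    # from every page that links to it.
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        for target in outs:
            new[target] += d * rank[page] / len(outs)
    rank = new

# Page 2, which every other page links to, gets the highest rank;
# page 3, which nothing links to, gets only the base share.
print([round(r, 3) for r in rank])
```

The whole computation is additions and divisions, yet it already shows the key idea of the patent: a link from an important page counts for more than a link from an unimportant one.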

Another way to bring maths into contact with non-mathematicians is the human level. Telling stories about how maths is done and by whom is tricky, as can be seen from the sometimes harsh reactions on www.mathoverflow.net to postings that try to excavate the truth behind anecdotes and legends. Most mathematicians see mathematics as completely independent from the persons who explored it. History of mathematics has the tendency to become gossip , as Gian-Carlo Rota once put it (Rota, 1996 ). The idea seems to be: as mathematics stands for itself, it also has to be taught that way.

This may be true for higher mathematics. However, for pupils (and therefore also for teachers), transforming mathematicians into humans can make science more tangible, it can make research interesting as a process (and a job?), and it can be a starting/entry point for real mathematics. Stories can thus make mathematics more sticky. They cannot replace the classical approaches to teaching mathematics, but they can enhance them.

Stories are the way by which knowledge has been transferred between humans for thousands of years. (Even mathematical work can be seen as a very abstract form of storytelling from a structuralist point of view.) Why don’t we try to tell more stories about mathematics, both at university and in school—not legends, not fairy tales, but meta-information on mathematics—in order to convey mathematics itself? See (Ziegler, 2013a ) for an attempt by the first author in this direction.

By stories, we do not only mean something like biographies, but also accounts of how mathematics is created or discovered: Jack Edmonds’ account (Edmonds, 1991 ) of how he found the blossom shrink algorithm is a great story about how mathematics is actually done . Think of Thomas Harriot’s problem about stacking cannon balls into a storage space and what Kepler made out of it: the genesis of a mathematical problem. Sometimes scientists even wrap their work into stories themselves: see e.g. Leslie Lamport’s Byzantine Generals (Lamport, Shostak, & Pease, 1982 ).
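The cannon-ball story can even be turned into a computation pupils can run. In the spirit of Harriot’s stacking problem, the classical “cannonball problem” asks for which n the square-pyramidal stack of 1² + 2² + … + n² balls can be rearranged into a flat square; the known answer is n = 1 and n = 24 (giving 4900 = 70² balls). A brute-force sketch of our own:

```python
# For which n is the square-pyramidal number 1^2 + ... + n^2 a perfect
# square? Brute force illustrates how a story becomes a precise question.
from math import isqrt

def pyramidal(n: int) -> int:
    """Number of balls in a square pyramid with n layers."""
    return n * (n + 1) * (2 * n + 1) // 6

hits = [n for n in range(1, 10_000)
        if isqrt(pyramidal(n)) ** 2 == pyramidal(n)]
print(hits)            # -> [1, 24]
print(pyramidal(24))   # -> 4900, which is 70^2
```

The search finds nothing beyond n = 24; proving that nothing larger exists is a genuine theorem (due to Watson, 1918, via elliptic curves), which is exactly the gap between experiment and proof that such stories can make visible.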

Telling how research is done opens another issue. At school, mathematics is traditionally taught as a closed science. Even touching open questions from research is out of the question, for many good and mainly pedagogical reasons. However, this fosters the image of a perfect science where all results are available and all problems are solved—which is of course completely wrong (and moreover a source of a faulty image of mathematics among undergraduates).

Of course, working with open questions in school is a difficult task. None of the big open questions can be solved with an elementary mathematical toolbox; many of them are not even accessible as questions. So the big fear of discouraging pupils is well justified. On the other hand, why not explore mathematics by showing how questions often pop up on the way? Posing questions in and about mathematics could lead to interesting answers—in particular to the question of “What is Mathematics, Really?”

Three Times Mathematics at School?

So, what is mathematics? With school education in mind, the first author has argued in Ziegler (2012) that we are trying to cover three aspects at the same time, aspects that one should consider separately and, to a certain extent, also teach separately:

A collection of basic tools, part of everyone’s survival kit for modern-day life—this includes everything covered by, but actually not much more than, Adam Ries’ “Rechenbüchlein” [“Little Book on Computing”], first published in 1522, nearly 500 years ago;

A field of knowledge with a long history, which is a part of our culture and an art, but also a very productive basis (indeed a production factor) for all modern key technologies. This is a “story-telling” subject.

An introduction to mathematics as a science—an important, highly developed, active, huge research field.

Looking at current high-school instruction, there is still a huge emphasis on Mathematics I, with rather mechanical instruction on arithmetic (“how to compute correctly”) and basic problem solving, plus a rather formal way of teaching Mathematics III as a preparation for possible university studies in mathematics, the sciences, or engineering. Mathematics II, which should provide a major component of teaching “What is Mathematics,” is largely missing. However, this part could and must also provide the motivation for studying Mathematics I or III!

What Is Mathematics, Really?

There are many, and many different, valid answers to the Courant-Robbins question “What is Mathematics?”

A more philosophical one is given by Reuben Hersh’s book “What is Mathematics, Really?” (Hersh, 1997), and there are more psychological ones, on the working level. Classics include Jacques Hadamard’s “Essay on the Psychology of Invention in the Mathematical Field” and Henri Poincaré’s essays on methodology; more recent approaches are Devlin’s “Introduction to Mathematical Thinking” (Devlin, 2012) and Villani’s book (2012).

And there have been many attempts to describe mathematics in encyclopedic form over the last few centuries. Probably the most recent one is the gargantuan “Princeton Companion to Mathematics”, edited by Gowers et al. (2008), which indeed is a “Princeton Companion to Pure Mathematics.”

However, at a time when ZBMath counts more than 100,000 papers and books per year, and there were 29,953 submissions to the math and math-ph sections of arXiv.org in 2016, it is hopeless to give a compact and simple description of what mathematics really is, even if we had only the “current research discipline” in mind. The discussions about the classification of mathematics show how difficult it is to cut the science into slices, and it is even debatable whether there is any meaningful way to separate applied research from pure mathematics.

Probably the most diplomatic way is to acknowledge that there are “many mathematics.” Some years ago Tao (2007) gave an open list of mathematics that is/are good for different purposes—from “problem-solving mathematics” and “useful mathematics” to “definitive mathematics”, and wrote:

As the above list demonstrates, the concept of mathematical quality is a high-dimensional one, and lacks an obvious canonical total ordering. I believe this is because mathematics is itself complex and high-dimensional, and evolves in unexpected and adaptive ways; each of the above qualities represents a different way in which we as a community improve our understanding and usage of the subject.

In this sense, many answers to “What is Mathematics?” probably show as much about the persons who give the answers as they manage to characterize the subject.

According to Wikipedia (the same version), the answer to “Who is Mathematics?” should be:

Mathematics, also known as Allah Mathematics, (born: Ronald Maurice Bean) is a hip hop producer and DJ for the Wu-Tang Clan and its solo and affiliate projects. This is not the mathematics we deal with here.

Blum, W. (2001). Was ist was. Band 12: Mathematik , Tessloff Verlag, Nürnberg. Revised version, with new cover, 2010.


Cook, W. (2011). In pursuit of the traveling salesman: Mathematics at the limits of computation . Princeton NJ: Princeton University Press.

Courant, R., & Robbins, H. (1941). What is mathematics? An elementary approach to ideas and methods (2nd ed., revised by I. Stewart, 1996). Oxford: Oxford University Press.

Csicsery, G. (2008). Hard problems. The road to the world’s toughest math contest. Documentary film, 82 minutes (feature)/45 minutes (classroom version). Washington, DC: Mathematical Association of America.

Devlin, K. J. (2012). Introduction to mathematical thinking , published by Keith Devlin, Palo Alto CA.

Edmonds, J. (1991). A glimpse of heaven, In: J. K. Lenstra, A. Schrijver, & A. Rinnooy Kan (eds.) History of mathematical programming—A collection of personal reminiscences (pp. 32–54). Amsterdam: CWI and North-Holland.

Frenkel, E. (2013). Love & math. The heart of hidden reality . Philadelphia PA: Basic Books/Perseus Books.

Gowers, T., Leader, I., & Barrow-Green, J. (Eds.). (2008). The Princeton companion to mathematics. Princeton, NJ: Princeton University Press.

Highland, E. H., & Highland, H. J. (1961). The how and why wonder book of mathematics. New York: Grosset & Dunlap.

Highland, E. H., & Highland, H. J. (1963). Was ist was. Band 12: Mathematik , Neuer Tessloff Verlag, Hamburg, 1963. Revised edition 1969. New cover 1983.

Hersh, R. (1997). What is mathematics, really? . Oxford: Oxford University Press.

Klein, F. (1933). Elementarmathematik vom höheren Standpunkte aus. Teil I: Arithmetik, Algebra, Analysis. B. G. Teubner, Leipzig, 1908. Vierte Auflage. Heidelberg: Springer.

Lamport, L., Shostak, R., & Pease, M. (1982). The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4, 382–401.


Langville, A. N., & Meyer, C. D. (2006). Google’s pagerank and beyond. The science of search engine rankings . Princeton and Oxford: Princeton University Press.

Loos, A., & Ziegler, G. M. (2017). Panorama der Mathematik . Heidelberg: Springer Spectrum, to appear.

Mendick, H., Epstein, D., & Moreau, M.-P. (2008). Mathematical images and identities: Education, entertainment, social justice . London: Institute for Policy Studies in Education, London Metropolitan University.

Page, L. (2001) Method for node ranking in a linked database , United States Patent No. US 6,285,999 B1, (submitted: January 9, 1998), http://www.google.com/patents/US6285999

Rota, G.-C. (1996). Indiscrete thoughts . Basel: Birkhäuser.

Tao, T. (2007). What is good mathematics? Bulletin of the American Mathematical Society, 44 (4), 623–634.

Villani, C. (2012). Théorème vivant . Paris: Bernard Grasset. (in French).

Ziegler, G. M. (2011). Three competitions. In D. Schleicher & M. Lackmann (Eds.), Invitation to mathematics. From competition to research (pp. 195–205). Berlin: Springer.


Ziegler, G. M. (2012). Mathematics school education provides answers—To which questions? EMS Newsletter (84), 8–11.

Ziegler, G. M. (2013a). Do I count? Stories from mathematics. Boca Raton, FL: CRC Press/Taylor & Francis. English translation of “Darf ich Zahlen? Geschichten aus der Mathematik”, Piper, München, 2010.

Ziegler, G. M. (2013b). Mathematik—Das ist doch keine Kunst! . München: Knaus.


Acknowledgment

The authors’ work has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 247029, the DFG Research Center Matheon, and the DFG Collaborative Research Center TRR 109 “Discretization in Geometry and Dynamics”.

Author information

Authors and affiliations.

Institut Für Mathematik, FU Berlin, Arnimallee 2, 14195, Berlin, Germany

Günter M. Ziegler

Zeit Online, Askanischer Platz 1, 10963, Berlin, Germany

Andreas Loos


Corresponding author

Correspondence to Günter M. Ziegler .

Editor information

Editors and affiliations.

Faculty of Education, Universität Hamburg, Hamburg, Germany

Gabriele Kaiser

Rights and permissions

Open Access. Except where otherwise noted, this chapter is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


Copyright information

© 2017 The Author(s)

About this paper

Cite this paper.

Ziegler, G.M., Loos, A. (2017). “What is Mathematics?” and why we should ask, where one should experience and learn that, and how to teach it. In: Kaiser, G. (eds) Proceedings of the 13th International Congress on Mathematical Education. ICME-13 Monographs. Springer, Cham. https://doi.org/10.1007/978-3-319-62597-3_5


DOI: https://doi.org/10.1007/978-3-319-62597-3_5

Published: 02 November 2017

Publisher Name: Springer, Cham

Print ISBN: 978-3-319-62596-6

Online ISBN: 978-3-319-62597-3

eBook Packages: Education (R0)

Chapter 2: THE NATURE OF MATHEMATICS

PATTERNS AND RELATIONSHIPS

Mathematics is the science of patterns and relationships. As a theoretical discipline, mathematics explores the possible relationships among abstractions without concern for whether those abstractions have counterparts in the real world. The abstractions can be anything from strings of numbers to geometric figures to sets of equations. In addressing, say, "Does the interval between prime numbers form a pattern?" as a theoretical question, mathematicians are interested only in finding a pattern or proving that there is none, but not in what use such knowledge might have. In deriving, for instance, an expression for the change in the surface area of any regular solid as its volume approaches zero, mathematicians have no interest in any correspondence between geometric solids and physical objects in the real world.
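The prime-interval question above is easy to start exploring empirically. The sketch below (an illustration, not part of the original text) sieves the primes up to 50 and lists the gaps between consecutive ones; this irregular sequence is exactly the kind of raw material in which mathematicians hunt for a pattern.

```python
# Sketch: compute the gaps between consecutive primes up to a bound,
# the raw data for the pattern question posed above.

def primes_up_to(n):
    """Return all primes <= n via the sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, n + 1, p):
                sieve[multiple] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

primes = primes_up_to(50)
gaps = [q - p for p, q in zip(primes, primes[1:])]

print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
print(gaps)    # [1, 2, 2, 4, 2, 4, 2, 4, 6, 2, 6, 4, 2, 4]
```

Apart from the initial gap of 1, every gap is even, yet no simple formula for the sequence is known; whether the gap 2 occurs infinitely often is the famous twin prime conjecture.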

A central line of investigation in theoretical mathematics is identifying in each field of study a small set of basic ideas and rules from which all other interesting ideas and rules in that field can be logically deduced. Mathematicians, like other scientists, are particularly pleased when previously unrelated parts of mathematics are found to be derivable from one another, or from some more general theory. Part of the sense of beauty that many people have perceived in mathematics lies not in finding the greatest elaborateness or complexity but on the contrary, in finding the greatest economy and simplicity of representation and proof. As mathematics has progressed, more and more relationships have been found between parts of it that have been developed separately—for example, between the symbolic representations of algebra and the spatial representations of geometry. These cross-connections enable insights to be developed into the various parts; together, they strengthen belief in the correctness and underlying unity of the whole structure.

Mathematics is also an applied science. Many mathematicians focus their attention on solving problems that originate in the world of experience. They too search for patterns and relationships, and in the process they use techniques that are similar to those used in doing purely theoretical mathematics. The difference is largely one of intent. In contrast to theoretical mathematicians, applied mathematicians, in the examples given above, might study the interval pattern of prime numbers to develop a new system for coding numerical information, rather than as an abstract problem. Or they might tackle the area/volume problem as a step in producing a model for the study of crystal behavior.

The results of theoretical and applied mathematics often influence each other. The discoveries of theoretical mathematicians frequently turn out—sometimes decades later—to have unanticipated practical value. Studies on the mathematical properties of random events, for example, led to knowledge that later made it possible to improve the design of experiments in the social and natural sciences. Conversely, in trying to solve the problem of billing long-distance telephone users fairly, mathematicians made fundamental discoveries about the mathematics of complex networks. Theoretical mathematics, unlike the other sciences, is not constrained by the real world, but in the long run it contributes to a better understanding of that world.

 

MATHEMATICS, SCIENCE, AND TECHNOLOGY

Because of its abstractness, mathematics is universal in a sense that other fields of human thought are not. It finds useful applications in business, industry, music, historical scholarship, politics, sports, medicine, agriculture, engineering, and the social and natural sciences. The relationship between mathematics and the other fields of basic and applied science is especially strong. This is so for several reasons, including the following:

The equation F = ma, for example, is not simply a shorthand way of saying that the acceleration of an object depends on the force applied to it and its mass; rather, it is a precise statement of the quantitative relationship among those variables. More important, mathematics provides the grammar of science—the rules for analyzing scientific ideas and data rigorously.

 

MATHEMATICAL INQUIRY

Using mathematics to express ideas or to solve problems involves at least three phases: (1) representing some aspects of things abstractly, (2) manipulating the abstractions by rules of logic to find new relationships between them, and (3) seeing whether the new relationships say something useful about the original things.

Mathematical thinking often begins with the process of abstraction—that is, noticing a similarity between two or more objects or events. Aspects that they have in common, whether concrete or hypothetical, can be represented by symbols such as numbers, letters, other marks, diagrams, geometrical constructions, or even words. Whole numbers are abstractions that represent the size of sets of things and events or the order of things within a set. The circle as a concept is an abstraction derived from human faces, flowers, wheels, or spreading ripples; the letter A may be an abstraction for the surface area of objects of any shape, for the acceleration of all moving objects, or for all objects having some specified property; the symbol + represents a process of addition, whether one is adding apples or oranges, hours, or miles per hour. And abstractions are made not only from concrete objects or processes; they can also be made from other abstractions, such as kinds of numbers (the even numbers, for instance).

Such abstraction enables mathematicians to concentrate on some features of things and relieves them of the need to keep other features continually in mind. As far as mathematics is concerned, it does not matter whether a triangle represents the surface area of a sail or the convergence of two lines of sight on a star; mathematicians can work with either concept in the same way. The resulting economy of effort is very useful—provided that in making an abstraction, care is taken not to ignore features that play a significant role in determining the outcome of the events being studied.

After abstractions have been made and symbolic representations of them have been selected, those symbols can be combined and recombined in various ways according to precisely defined rules. Sometimes that is done with a fixed goal in mind; at other times it is done in the context of experiment or play to see what happens. Sometimes an appropriate manipulation can be identified easily from the intuitive meaning of the constituent words and symbols; at other times a useful series of manipulations has to be worked out by trial and error.

Typically, strings of symbols are combined into statements that express ideas or propositions. For example, the symbol A for the area of any square may be used with the symbol s for the length of the square's side to form the proposition A = s². This equation specifies how the area is related to the side—and also implies that it depends on nothing else. The rules of ordinary algebra can then be used to discover that if the length of the sides of a square is doubled, the square's area becomes four times as great. More generally, this knowledge makes it possible to find out what happens to the area of a square no matter how the length of its sides is changed, and conversely, how any change in the area affects the sides.
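The doubling claim can be checked symbolically in one line (writing s for the side and A for the area, as a worked illustration):

```latex
A = s^2, \qquad A_{\text{new}} = (2s)^2 = 4s^2 = 4A .
```

More generally, replacing s by ks scales the area by k², which is why any change in the side length determines the change in area, and conversely.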

Mathematical insights into abstract relationships have grown over thousands of years, and they are still being extended—and sometimes revised. Although they began in the concrete experience of counting and measuring, they have come through many layers of abstraction and now depend much more on internal logic than on mechanical demonstration. In a sense, then, the manipulation of abstractions is much like a game: Start with some basic rules, then make any moves that fit those rules—which includes inventing additional rules and finding new connections between old rules. The test for the validity of new ideas is whether they are consistent and whether they relate logically to the other rules.

Mathematical processes can lead to a kind of model of a thing, from which insights can be gained about the thing itself. Any mathematical relationships arrived at by manipulating abstract statements may or may not convey something truthful about the thing being modeled. For example, if 2 cups of water are added to 3 cups of water and the abstract mathematical operation 2+3 = 5 is used to calculate the total, the correct answer is 5 cups of water. However, if 2 cups of sugar are added to 3 cups of hot tea and the same operation is used, 5 is an incorrect answer, for such an addition actually results in only slightly more than 4 cups of very sweet tea. The simple addition of volumes is appropriate to the first situation but not to the second—something that could have been predicted only by knowing something of the physical differences in the two situations. To be able to use and interpret mathematics well, therefore, it is necessary to be concerned with more than the mathematical validity of abstract operations and to also take into account how well they correspond to the properties of the things represented.

Sometimes common sense is enough to enable one to decide whether the results of the mathematics are appropriate. For example, to estimate the height 20 years from now of a girl who is 5' 5" tall and growing at the rate of an inch per year, common sense suggests rejecting the simple "rate times time" answer of 7' 1" as highly unlikely, and turning instead to some other mathematical model, such as curves that approach limiting values. Sometimes, however, it may be difficult to know just how appropriate mathematical results are—for example, when trying to predict stock-market prices or earthquakes.
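The height example can be made concrete in a few lines. This sketch is purely illustrative: the limiting height of 68 inches and the exponential form of the bounded model are assumptions chosen for the example, not claims about real growth data.

```python
import math

height_now = 65.0   # 5'5" in inches
rate = 1.0          # current growth rate, inches per year
years = 20

# Naive "rate times time" extrapolation: 65 + 1 * 20 = 85 inches (7'1").
linear = height_now + rate * years

# A model that approaches a limiting value instead: the remaining gap to an
# assumed adult height decays exponentially, calibrated so the initial slope
# matches the current growth rate of 1 inch per year.
limit = 68.0                         # hypothetical adult height (assumption)
k = rate / (limit - height_now)      # initial slope = k * (limit - height_now)
bounded = limit - (limit - height_now) * math.exp(-k * years)

print(linear)             # 85.0 -- the implausible answer
print(round(bounded, 3))  # 67.996 -- approaches the limit, stays plausible
```

The point is not the particular curve but the modeling judgment: common sense rejects the linear answer, so one swaps in a model whose long-run behavior matches what is known about the thing being modeled.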

Often a single round of mathematical reasoning does not produce satisfactory conclusions, and changes are tried in how the representation is made or in the operations themselves. Indeed, jumps are commonly made back and forth between steps, and there are no rules that determine how to proceed. The process typically proceeds in fits and starts, with many wrong turns and dead ends. This process continues until the results are good enough.

But what degree of accuracy is good enough? The answer depends on how the result will be used, on the consequences of error, and on the likely cost of modeling and computing a more accurate answer. For example, an error of 1 percent in calculating the amount of sugar in a cake recipe could be unimportant, whereas a similar degree of error in computing the trajectory for a space probe could be disastrous. The importance of the "good enough" question has led, however, to the development of mathematical processes for estimating how far off results might be and how much computation would be required to obtain the desired degree of accuracy.

Copyright © 1989, 1990 by American Association for the Advancement of Science


Introduction to Research in the Classroom

What is mathematics research?

Mathematics research is the long-term, open-ended exploration of a set of related mathematics questions whose answers connect to and build upon each other. Problems are open-ended because students continually come up with new questions to ask based on their observations. Student research has several additional characteristics as well.

How do students benefit from doing mathematics research?

Mathematics research influences student learning in a number of ways:

Students develop mastery of mathematics topics. Philosopher and educator claimed that we don’t learn the basics by studying the basics but by engaging in rich activities which require them. Research experiences require the repeated application of technical skills in the service of looking for patterns and testing conjectures (e.g., factoring and graphing polynomials for the project). It is this repetition, in the context of motivating and meaningful problems, that leads to greater understanding and retention of mathematics skills. During an investigation, students make connections between ideas that further enhance retention.

For which students is research appropriate?

This question is usually more bluntly framed as "Can kids really do this?!" The experience of teachers in all types of school settings is that children can successfully engage in mathematics research. Making Mathematics teachers have undertaken research with urban, rural, and suburban students from grades 4 through 12. They have guided at-risk, honors, and English as a Second Language (ESL) classes through projects lasting from a few weeks up to a year. Students in math clubs, individual students, and home-schooled students have carried out successful investigations. One of our teachers first introduced research to her honors seventh graders. Once she was confident in her own experience, she tried the same project with two low-tracked eighth-grade sections. The quality of the questions, experimenting, reasoning, and writing was excellent in all three sections and indistinguishable between the honors and non-honors students. Research drew upon a richer array of student abilities than were assessed for tracking purposes.

Research can thrive in a heterogeneous class of students if you pick a project that does not require a lot of background to get started but which also inspires sophisticated questions. Students will pose problems at a level that is both challenging and appropriate for them.

How can I get my feet wet with research?

Making Mathematics teachers have been most comfortable trying research for the first time with one of their "stronger than average" sections. Some teachers have begun work with one or more interested students as part of a mathematics club or independent seminar. The purpose of these first excursions has been for the students to become familiar with the research process and for the teacher to see how students respond to lengthy, open-ended problem-solving.

Popular starting projects have been , , , and . These projects are good starting points for any secondary group because they quickly inspire observations, conjectures, and new questions ("What if we do this…?") and can get to informal reasoning to justify some of the conjectures within a day or two. This easy entry is due to the familiarity of the content (e.g., counting, arithmetic, shapes).

You should commit at least three consecutive class periods at the start of a first investigation in order to maintain the momentum of the experience. You want students to appreciate that the questions are not typical quick exercises, so it is important that they get to wade into the work. Interruptions also make it harder for them to maintain a line of thinking. After the initial burst, you can sustain a project through weekly discussions of work done at home. If a problem is working well, do not be afraid to let kids pursue it for a long period of time. All of these projects have proven to remain challenging and interesting during weeks of student exploration (except for the , which works best as a shorter introductory activity for older students).

What can I do once my feet are wet?

If you have tried research with just a few students, try it with a class. If you have begun research with one class, try it with others. Read more chapters of the and integrate some of the supporting activities that focus on particular research skills. The most fun and greatest benefits accrue when research becomes an ongoing strand within a course. One investigation gives us a taste of research. When we engage in research regularly, we hone our intuitions about what approaches to attempt at each juncture in the process. Additionally, students who do research periodically start to apply to all of their mathematics studies the habits of extending questions, conjecturing, looking for patterns, generating confirming examples and counterexamples, and checking their reasoning carefully.

When students become really excited about doing mathematics and want to try a long-term project, you can form a seminar or club to support them as they work on one topic for a semester or more. Meetings can alternate between discussing the students’ progress with their questions and studying specific research skills (e.g., , , etc.).

Problem posing is central to long projects. Once a student has solved an initial question, they should look for extensions of the question that build on their work. They will discover that research problems can last forever. Each new piece of work can spawn many more questions for research. However, students need to be thoughtful about the research agenda that they pursue. Endless generalizations and extensions of a problem may not yield a satisfyingly cohesive research product. For example, the many cow problems listed in the problem-posing chapter are all related by context and type, but they may not produce some larger vision that makes the solving of the next cow problem easier. There may be no interesting of cow problems, and ultimately one does not just want a bag of problems but a connected whole with overarching patterns and methods that recur throughout many of the questions and solutions.

What kind of support will I need?

Many teachers independently introduce research into a class. Your work will have greater impact on students if they encounter research in all of their mathematics classes. Both for that reason and in order to feel less isolated as you experiment, it is helpful to recruit one or more colleagues to try out research along with you. Share ideas and observations and even visit each other’s classes on days when the students are doing research. Talk with your department head or supervisor to garner support for your efforts.

If you want an advisor for yourself or an outside audience for the work that your students do, you can contact the mathematics or mathematics education department at a local college and ask if any of the professors would be willing to serve as a mentor (either via email, phone, or in person) for you and your class. We have also found good mentors by contacting corporations that employ scientists and mathematicians. Your mentor may just communicate with you or she may be willing to read updates or reports from the students and provide responses. You should make these exchanges via your email account—parental consent is required by law for direct internet communication. Be sure to let any prospective mentor know what your goals and expectations are for the students and for their involvement.

Mentors can help in a number of ways: for example, by responding to students' efforts and by checking students' mathematical statements.

What do I need to do before I begin?

Pick a project and start your own work on it: looking for patterns, trying to state clear conjectures, searching for proofs or disproofs, and studying new, related problems. Many teachers have found the summer a good time for professional growth via a research project. If you come to feel that research is a necessary outcome of studying mathematics, then your questions will shift from "Can I do this?" to "How can I do this?" Send information home to parents that helps them to understand what you will be doing and why. You or your department head can talk with your principal about your goals for your students.

How do I choose a project topic?

Choose projects that are at the right level of challenge for your students. For novice student researchers, it is preferable if the focus is on learning about the research process. Projects that involve familiar content allow for a gentle introduction and for the greatest possibility of multiple interpretations and avenues of exploration that draw upon well-developed student understandings. When students can jump in fast, they are more likely to work through the research process more than once and grasp the iterative and open-ended nature of research. We describe these projects as having a low threshold and a high ceiling—every student can participate and there is lots of room for the most advanced students to find challenging questions.

As students gain experience with research, they will be more confident and ready to tackle questions involving less familiar areas of mathematics. It is at this point that it will be easier to have students learn new mathematics topics in the context of research. This combination will allow you to give students practice developing important mathematical habits of mind while covering the content required of a given course (see below).

Certain projects are particularly inspiring for students because of their visual appeal. For example, the pictures that emerge during the or investigations can catch students’ attention and stimulate them to look for the underlying explanations of what they see. See Alan Schoenfeld’s discussion of criteria for good problems.

You need to consider your own comfort level when picking a project as well. You may want to spend some time working on and familiarizing yourself with the questions before you introduce them to the class. Do not feel that you have to have the entire project mastered. Once students get working, they invariably raise questions that none of us anticipate, so it is impossible to figure out all of the answers ahead of time (see below).

If you are working with a small number of students, you may want to have them pick the project. One advantage to giving students a choice is that they will feel more motivated having picked a question that most interests them. They will also see that you want them to develop their own personal mathematical tastes. It is better if at least two or three students work on a given project so that they can share ideas with each other. We have, however, seen many cases of individual students working productively on problems that they have chosen or posed themselves.

Finally, one or more students may come to you with an original question, or you can invite students to pose their own questions. Students who tackle their own questions are coming into their own as mathematicians, but a caveat accompanies such an endeavor. Since the problems are original, it may not be clear ahead of time whether they are too difficult for the student. Similarly, the examples may not turn out to follow any recognizable patterns or yield any conjectures. Original questions do not come with guarantees.

What if I am not familiar with a problem?

Perhaps the greatest anxiety that teachers express about doing research is that they themselves may not be able to answer the questions that students are exploring. As noted above, we cannot expect to know all of the answers to all questions, nor should we portray ourselves in that light. It is not our job to answer all of the questions that students might pose—it is our job to model for them the questions that they should be asking themselves when they are having difficulty making progress. We have, in fact, been unable to answer numerous problems posed by our researching students, in part because they have had much more time to think about each question than we have and in part because some have been quite hard (and remain unsolved). Consider the following note from a mentor to a teacher who had just finished a research unit with her class:

No one can validate your work because you’re the first one to try it! You have to figure it out, convince yourself, and then convince others.

For every project that a class investigates, the students should have a running list of conjectures that they have not yet proven or disproven. This will help them see that it is the natural state of mathematics to have open questions with which many researchers are grappling.

How do I help my students during research?

When students are engaged in research, our job is to teach them the stages of the process and to coach them to develop the habits that lead to success. The most common coaching maneuver is to ask a question. The purpose of an inquiry is to model the types of questions that the student should be asking herself and to help the student and her teacher understand what she is doing and why.

The other key to helping your students is to be enthusiastic about their ideas and questions and to be patient when they are stuck. Acknowledge both the satisfactions and the difficulties of research so that students can address the emotions that accompany learning. Because progress in research can take time and come sporadically, it is important that you remove any external stresses when students begin research (unless you are very careful, grading can be a distraction and hindrance for novice researchers). Here are some of the basic acts that teachers use when coaching students (note that many of these are just statements of good teaching in general):

See the appendix for mentor comments that exemplify the above list of responses.

How should I use the warm-up problems?

Each Making Mathematics project has associated warm-up problems. Which, if any, you use will depend on the background of your students. Students can start most research projects at an interesting level without work on any of the warm-up problems. In some cases, you may want to use the warm-ups after an initial exploration so that students are thinking about the problems within the context of the main project questions. Certain warm-up problems may turn out to be lengthy research challenges themselves (so gauge your available time accordingly or just use the warm-up as a research question).

The teaching notes accompanying each project and activity can serve as models that you can adapt to other projects. As noted above, it is best to introduce research in a concentrated burst that permits a coherent presentation of the research process, rather than separating those discussions with several days of non-research studies.

Once research is underway, each student or group of students may work on different, but related, questions. During whole-class discussion, classmates should describe the different problems that they are exploring. Students should report back on their progress (new questions, conjectures, proofs, etc.) periodically.

At the end of a class session devoted to research, each group should give themselves a homework assignment in their logbooks. You can check these recorded tasks to make sure that the assignments were meaningful, and check the subsequent entry in the logbook to make sure that the student made reasonable progress with the tasks. Typical homework challenges include:

Students can think about where they are in the research process in order to decide what step to attempt next. Their work should have some narrative explanations ("I did this because…"). Students can work on their homework for a few days, but groups will also need regular class time to catch up on each other’s thinking, to work together, and then to coordinate next steps before their next stretch of independent work.

Although the teaching notes for many of the Making Mathematics projects suggest what to do on the first day, the second day, and so forth, you will need to pace the phases of a particular investigation according to the length of your class periods and the timing of a given class’s particular questions and discoveries. Here are some other decisions that you should be alert to as work proceeds:

For example, you will need to decide when to introduce techniques for testing conjectures (generating test cases, remaining skeptical in the face of confirming examples, considering extreme and degenerate cases, and searching for counterexamples).

As a class works through its early research experiences, be sure to document for them as much of their work as possible. Posters listing the students’ conjectures, questions, and theorems help students grasp the cyclical nature of the research process. They see how their different questions connect and build upon each other and learn which research methods are most helpful at which stages of an investigation. After these beginning projects, students are ready to work more independently and should be encouraged to pose their own questions for research.

Stand-alone activities from the teacher handbook can be used during research explorations or in between them as a way to keep research thinking fresh when other topics are taking center stage in your class. When used in the midst of an investigation, they are a response to a "teachable moment" that makes them a timely interruption. You can also intersperse readings about present-day mathematicians and their work as a way to broaden students’ view of the field and to inspire them with the personal stories of persistence and discovery.

See Writing Math Research Papers by Robert Gerver for more advice on structuring individual research projects.

How does a research project end?

A project can end when a student or group has resolved some central question. Often, there are many questions and, after good progress with some of them, students’ enthusiasm for the others may wane. You may have established certain goals for students: to create a proof, to generate a few clear conjectures, to pose a new problem and make progress with it. Reaching any of these milestones is a reasonable time for work on a project to end. Students can come to a satisfying sense of closure even with a project that leaves many unanswered questions. That feeling can be enhanced if they write a final report that summarizes their main questions and work and that concludes with a list of possible extensions worth exploring. A formal write-up is especially worthwhile for students who have engaged in a lengthy examination of a research question.

How will doing research affect my workload?

Ultimately, research is no more demanding of your time than more traditional teaching. In some cases, it shifts the balance so that you spend less time preparing lessons and more time responding to student work. If you have not taught research before, there will be an initial need to think through the different issues that will arise in class. This work will prepare you to take advantage of any "teachable moments" (student comments that can lead the class to new understandings). These materials are a valuable resource as you develop experience doing research with students.

One strategy for managing the demands of teaching research is to keep good notes on your observations during class. Thorough, ongoing documentation will facilitate the comments that you need to make when you collect work because you will have a good sense of the entire research process that an individual or group has gone through. The more often you can read and respond to students’ entries in their logbooks, the better, but you do not have to collect everyone’s work all at once. You can sample a few each night. Lastly, having each group submit a single final report reduces the number of papers that you need to study to a manageable number.

How can I balance the development of research skills with the need to cover specific mathematics topics?

Mentor: I appreciate your frustration about the tension between covering technical content and giving your students the opportunity to learn about the process of doing mathematics. There is no question that teachers are being asked to whiz through too many topics. I try to remind teachers of what they already know: when we go too quickly, the material is not mastered well and so we are not being efficient.

The above exchange between a Making Mathematics teacher and her mentor is typical of the most common and emotional question with which teachers interested in research have grappled. Many have expressed stress at feeling trapped by competing demands. In some cases, the answer is simple: if there is a major state test next week and you need to cover five topics, it is definitely a bad time to start research. But, if you are months away and you consider how often students forget what they have studied, now is a good time to introduce your students to mathematics investigations.

As Schoenfeld and others remind us, the content-versus-research question reflects a false dichotomy. We know how fruitless it is to teach disconnected topics. If you do not use knowledge in active ways that allow you to make meaning of what you have learned, you do not retain that learning. Why do students seem to forget so much of what they study? Sometimes, they still have the skills but are only able to apply them when prompted (e.g., "I am doing a chapter four problem" or "I was told to use triangle trigonometry techniques"). Sometimes, the learning experience was not memorable (consider what you have remembered and forgotten from high school and try to identify why). The more research becomes a strand throughout a course and a school’s curriculum, the better the interconnections between, and mastery of, technical content will be.

The NCTM Standards include many important goals (e.g., being able to conjecture, show persistence in problem solving, develop mathematical models, etc.) that we are supposed to "cover" that do not fit well in the framework of timed tests.

So, how do we combine research and technical content goals, and what are some of the challenges that we face in our efforts? We can choose a research problem that will reinforce technical skills that a class has already studied. Alternatively, we can pick a problem that will introduce our students to a new topic and help them develop an understanding of it. For example, we could use a suitable research project in place of, or after, a textbook introduction to combinatorics.

One problem that arises when using a research experience as a way to develop or reinforce a particular technical skill is that students’ questions and methods may not head in the direction that you expected. One group of students, presented with the Raw Recruits project, wanted to be able to test the behavior of all starting positions. To do so, they had to know how many starting positions there were and so, unwittingly, began a combinatorics exploration of the possible arrangements involving recruits with 2 facing the wrong way. Another group created a circular version of the problem and learned about periodic behavior. If you tell students to use a particular technique, then you short-circuit the research process. You also risk turning the effort into a planned discovery activity, which usually lacks the motivational and intellectual power of true research.
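That accidental combinatorics question can be checked by brute force. A minimal sketch in Python (the encoding of a lineup as '<' and '>' characters, and the function name, are illustrative assumptions, not details from the project materials):

```python
from itertools import combinations
from math import comb

def lineups(n, wrong):
    """All lines of n recruits with `wrong` of them facing the wrong way
    ('>' = facing correctly, '<' = facing the wrong way)."""
    return [''.join('<' if i in spots else '>' for i in range(n))
            for spots in combinations(range(n), wrong)]

# the brute-force count matches "n choose 2" for every line length tried
for n in range(4, 9):
    assert len(lineups(n, 2)) == comb(n, 2)
print(len(lineups(6, 2)))  # 15 starting positions for a line of 6
```

Enumerating the cases this way also hands students a checklist against which to test a systematic counting argument.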

You can address this problem in a few ways. A careful choice of project or framing of the question can often make certain skills inevitable. For example, a high school class proving theorems about numbers would be hard pressed to avoid using algebraic expressions or thinking about factors. You can also add your own questions to the class’s list. This makes you a participant in the process and ensures that the class will spend some time on the issues that you want considered. Alternatively, you can let the students’ work take them where it will, knowing that some other important area of mathematics is being developed or reinforced that you will not have to spend as much time on in the future. Then, after the research is over, you can return to the topic that you originally had in mind.

When students do get to follow their own intellectual muse, they are more likely to experience a wide range of mathematics topics. For example, in a class of fifth graders working on one project, a student asked what would happen if each jump was chosen randomly. The shapes were no longer as attractive, but the question of whether they would ever close led to the idea of expected value. An independent research project on randomness in DNA led a student to study matrices and Markov processes. Students will teach themselves a chapter of content from a textbook if they think it will help them with a task about which they care.

How should students keep track of their work?

Students should maintain a logbook throughout a research experience. In this logbook, they will keep a record of everything they do and everything they read. Students should be encouraged to write down questions that they have when they are reading or working on their mathematics. This journal will become a record of the student’s entire mathematics research experience. It will be an invaluable tool during their investigation and as they produce their final write-up at the end of the project.

There are two common approaches to the organization of a mathematics logbook. You should decide which type of logbook better meets the needs of you and your students.

For lengthy research projects, some teachers prefer that students use a bound logbook. Science logbooks, filled with graph paper and pre-numbered pages, are ideal for this sort of journal. Since the page numbers come pre-printed, it is obvious that something is missing if a page is torn out. Logbooks of this type encourage students to keep all of their work, even work that they do not actually use in their final project. They demonstrate a clear progression of mathematical development and thought throughout the research experience. If students want to add copies of articles or diagrams, they can staple or tape them into place. A formal logbook of this type is often required for science fair projects. See "Advice for Keeping a Formal Mathematics Research Logbook" below for student instructions for this type of logbook.

In other cases, we recommend the use of loose-leaf binders for logbooks. Loose-leaf notebooks make it easier to keep material in sections and to move pages around. They also make it easier for teachers to ask students to hand in portions of their logbooks, because students can remove the pages and then put them back when the teacher is done looking at them. Students can insert computer printouts, pictures, copies of articles, etc., in an appropriate place (Gerver, pp. 91–92). See "Advice for Keeping a Loose-Leaf Mathematics Research Logbook" below for student instructions for this type of logbook.

No matter which format is used, we recommend that students:

Students should write what they are feeling and thinking in their logs. The log is a record of a student’s dialogue with herself and the mathematics ideas of her project. Dry, formal writing is an impediment at this stage of work. One of our students had the following observations and questions in his log:

Is an irrational fractional base like the others?

His comments served to provide a clear narrative of his reasoning and motivation.

Neatness and organization are not intrinsic virtues in a logbook, but they are important to the extent that the student must be able to make sense of her writing days later and will not want messiness to distract any reader of her log.

When and how should students work in groups?

Students benefit from group work in a number of different ways. Students can more readily adjust to the unfamiliar aspects of research with the support and exchange of ideas that a group can provide. Group efforts allow students to contribute their strengths to a research project without getting stuck because of an area of weakness. In other words, groups can be crucial to the early confidence-building stages of teaching research. As research continues in a class, group efforts allow students to discover the power of being part of a mathematical community that is building an interconnected set of mathematics ideas stimulated by each other’s thoughts and questions.

Although a whole class can work on a problem together, smaller groups are preferable inasmuch as they give more students the chance to participate. Multiple groups are also more likely to produce an interesting variety of ideas than will a whole-class discussion. Before starting students off in groups for an extended activity (doing research or anything else), it is worthwhile to present discussion questions about what makes group work effective.

We recommend giving each student the chance to spend some time individually making sense of a problem before putting groups together. This initial period allows students to figure out at their own pace what they know about a problem and what questions they have. After the class makes a list of their questions, you can form groups and ask each one to pick a question for their members to explore. Alternatively, you can invite students to join a group based on which question they would like to explore ("If you like problem A, please move over here."). Although there is no hard and fast rule for group size, groups of three or four students often provide a good critical mass of ideas while allowing for plenty of participation.

You should decide whether you want each group to appoint a daily recorder who writes down a full description of all of the group’s work in a log or whether each member is responsible for keeping a record. If students are going to be working at home on the problems, the latter arrangement may be best (although in some classes the teacher photocopies the notes at the end of class for each group member).

When groups work in class, your job is to visit each group, to observe and take notes, and to ask questions. Your goal is to assess where the students are heading (e.g., by asking "What are you all working on at this moment?" followed by "How does that relate to the main question that you are investigating?") and whether they can explain their own decision-making and reasoning (e.g., "Why do you think that that conjecture might be true?"). See the advice above on helping students during the research process.

Students also grow from doing research independently. Independent work allows them to follow their own muse, to make progress at their own pace, and to work through challenges and learn from that process in all of its richness and difficulty. The victories are all their own.

What role can technology play in research?

Advanced calculators and computer software can promote research because, in the exploration of functions, numbers, and shapes, they can change the nature and number of questions that students ask. It can be quite exciting when students take advantage of technology’s ability to automate rote work and expedite deeper conjecturing about patterns in mathematics.

For example, a student might look at how x^n - 1 factors for different whole numbers n using a computer algebra system (CAS) such as Mathematica or the TI-92. But students are unlikely to be willing to factor such expressions by hand for large n, any more than we would be likely to do long division of 6-digit numbers. The field of fractals and chaos would not have blossomed without the aid of computers that freed researchers to ask questions that would have been unanswerable in the past. Many of these questions only yielded to analysis after simulations and number crunching revealed patterns. Similarly, access to a spreadsheet or dynamic geometry program can free students to ask "What if…?" about mathematical objects that would be too daunting to study without a technological boost.
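Even a few lines of code can stand in for a CAS on a question like this one. The sketch below (assuming the x^n - 1 example above; the function name is invented for illustration) divides x^n - 1 by (x - 1) synthetically, confirming the familiar factor x^(n-1) + … + x + 1:

```python
def divide_xn_minus_1(n):
    """Synthetic division of x^n - 1 by (x - 1): returns the quotient's
    coefficients (highest degree first) and the remainder."""
    coeffs = [1] + [0] * (n - 1) + [-1]   # coefficients of x^n - 1
    quotient, carry = [], 0
    for c in coeffs[:-1]:
        carry = c + carry                  # synthetic division at the root x = 1
        quotient.append(carry)
    remainder = coeffs[-1] + carry
    return quotient, remainder

q, r = divide_xn_minus_1(6)
print(q, r)  # [1, 1, 1, 1, 1, 1] 0  ->  x^5 + x^4 + x^3 + x^2 + x + 1, remainder 0
```

Running it for several values of n lets students see the all-ones quotient pattern quickly, which is exactly the kind of rote work worth delegating to a machine.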

As with any tool, students need to learn the benefits and limitations associated with using a particular piece of software. For example, if a student working on a difficult combinatorics problem writes a program to "number crunch" an answer instead of patiently analyzing the structure of the situation, she will usually fail to develop a solution that she can generalize. She is likely to miss the insight that a pencil-and-paper route might have provided.

Although CAS programs can produce exact answers to many problems, most calculators and programs still display approximations, such as 1.7320508 instead of √3.
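The distinction matters in practice: the displayed decimal, squared, is not exactly 3, even though it looks right. A short illustration using Python's standard library (the decimal is the one quoted above):

```python
from fractions import Fraction

approx = 1.7320508                      # a typical calculator display for sqrt(3)
print(Fraction(approx) ** 2 == 3)       # False: the decimal is only an approximation
print(abs(approx ** 2 - 3) < 1e-6)      # True: close to 3, but not equal to it
```

Exact arithmetic with fractions exposes the gap that the rounded display hides.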

Schoenfeld, Alan. 1994. "What Do We Know about Mathematics Curricula?" Journal of Mathematical Behavior 13, no. 1: 55–80.

APPENDIX A

Sample Responses to Middle School Groups Working on the Raw Recruits Project (taken from email exchanges between students and a Making Mathematics mentor).

2) Identify and celebrate research skills

a)

b)

c) When students came up with an effective representation of the problem:

d)

e)

all the numbers are changing, but what doesn't change is the relationship between x and y: y is always one more than twice x. That is, y=2x+1. Finding what doesn't change "tames" the situation. So, you have tamed this problem! Yay. And if you want a fancy mathematical name for things that don’t vary, we call these things "invariants." The number of messed-up recruits is invariant, even though they are all wiggling back and forth, trying to figure out which way is right!
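The mentor's invariant can be watched in a toy simulation. The turning rule below—every face-to-face pair '><' reverses at once to become '<>'—is a simplifying assumption about the recruits' behavior, not a detail taken from the project materials:

```python
def step(line):
    """One round of corrections: every face-to-face pair '><' becomes '<>'."""
    line = list(line)
    i = 0
    while i < len(line) - 1:
        if line[i] == '>' and line[i + 1] == '<':
            line[i], line[i + 1] = '<', '>'
            i += 2                  # both members of this pair have now turned
        else:
            i += 1
    return ''.join(line)

line = '>><><<'
history = [line]
for _ in range(4):
    line = step(line)
    history.append(line)
print({s.count('<') for s in history})  # {3}: the count facing each way never changes
```

Whatever the recruits do under this rule, each swap exchanges one '>' with one '<', so the number facing each direction is invariant from step to step.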

3) Encourage generalizations

So, of course, the next question that comes to my mind is how to generalize what you’ve already discovered: there are 15 ways that 2 mistakes can be arranged in a line of 6 recruits. What about a different number of mistakes? Or a different number of recruits? Is there some way to predict? Or, alternatively, is there some way to predict how these 15 ways of making mistakes will play out as the recruits try to settle themselves down? Which direction interests you?

4) Inquire about reasoning and rigor

The students were looking at the number of ways the recruits could line up with 2 out of n facing the wrong way: Anyway, I had a question of my own. It looks like the number of possibilities increases pretty fast as the number of recruits increases. For example, I counted 15 possibilities in your last set (the line of six). What I wonder is this: when the numbers get that large, how can you possibly know that you've found all the possibilities? (For example, I noticed that >>>><< is missing.) The question "How do I know I've counted 'em all?" is actually quite a big deal in mathematics, as mathematicians are often called upon to find ways of counting things that nobody has ever listed (exactly like the example you are working on).

The students responded by finding a pattern for generating the lineups in a meaningful order: The way that we can prove that we have all the possibilities is that we can just add the number of places that the second wrong person could be in. For example, if 2 are wrong in a line of 6, then the first one doesn’t move and you count the space in which the second one can move in. So for the line of six, it would be 5+4+3+2+1=15. That is the way to make sure that we have all the ways. Thanks so much for giving challenges. We enjoyed thinking!
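The students' counting scheme—fix the first wrong-facing recruit, then count the spots available to the second—translates directly into nested loops. A sketch (the function name is illustrative):

```python
def count_two_wrong(n):
    """Count lineups of n recruits with exactly 2 facing the wrong way by
    fixing the first wrong position and counting spots for the second."""
    total = 0
    for first in range(n):
        for second in range(first + 1, n):   # the second is always to the right
            total += 1                        # one distinct lineup per pair
    return total

print(count_two_wrong(6), sum(range(1, 6)))  # both give 5+4+3+2+1 = 15
```

The inner loop runs n-1 times when the first wrong recruit is leftmost, then n-2 times, and so on, which is exactly the students' sum.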

5) Work towards proof

a) The group wrote the following: When we found out that 6 recruits had 15 different starting arrangements, we needed more information. We needed to figure out how many starting positions are there for a different number of recruits.

By drawing out the arrangements for 5 recruits and 7 recruits we found out that the number of starting arrangements for the recruit number before plus that recruit number before it would equal the number of starting arrangements for that number of recruits.

We also found out that if you divide the starting arrangements by the number of recruits there is a pattern.

To which the mentor replied: Wow! I don't think (in all the years I've been hanging around mathematics) I've ever seen anyone describe this particular pattern before! Really nice! If you already knew me, you'd be able to predict what I'm about to ask, but you don't, so I have to ask it: "But why?" That is, why is this pattern (the 6, 10, 15, 21, 28…) the pattern that you find for this circumstance (two recruits wrong in lines of length 4, 5, 6, 7, 8…)? Answering that—explaining why you should get those numbers and why the pattern must continue for longer lines—is doing the kind of thing that mathematics is really about.

b) Responding to students studying a circular variation of raw recruits that never settled down: This is a really interesting conclusion! How can you show that it will always continue forever and that it doesn’t matter what the original arrangement was? Have you got a reason or did you try all the cases or…? I look forward to hearing more from you.

6) Distinguish between examples and reasons

a) You have very thoroughly dealt with finding the answer to the problem you posed—it really does seem, as you put it, "safe to say" how many there will be. Is there a way that you can show that that pattern must continue? I guess I’d look for some reason why adding the new recruit adds exactly the number of additional cases that you predict. If you could say how the addition of one new recruit depends on how long the line already is, you’d have a complete proof. Want to give that a try?

b) A student, working on Amida Kuji and having provided an example, wrote the following as part of a proof: In like manner, to be given each relationship of objects in an arrangement, you can generate the arrangement itself, for no two different arrangements can have the same object relationships. The mentor response points out the gap and offers ways to structure the process of extrapolating from the specific to the general: This statement is the same as your conjecture, but this is not a proof. You repeat your claim and suggest that the example serves as a model for a proof. If that is so, it is up to you to make the connections explicit. How might you prove that a set of ordered pairs, one per pair of objects, forces a unique arrangement for the entire list? Try thinking about a given object (e.g., C) and what each of its ordered pairs tells us. Try to generalize from your example. What must be true for the set of ordered pairs? Are all sets of nC2 ordered pairs legal? How many sets of nC2 ordered pairs are there? Do they all lead to a particular arrangement? Your answers to these questions should help you work toward a proof of your conjecture.
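The mentor's questions can also be explored computationally. A sketch, under the assumption that the "ordered pairs" are simply all before/after relations in a lineup: each object precedes a distinct number of others, so the full set of pairs forces a unique arrangement:

```python
from itertools import permutations, combinations

def pairs_of(arrangement):
    """Every ordered pair (x, y) with x appearing before y."""
    return {(arrangement[i], arrangement[j])
            for i, j in combinations(range(len(arrangement)), 2)}

def reconstruct(pairs):
    """An object precedes exactly the objects it 'beats', so sorting by the
    number of wins, descending, restores the whole arrangement."""
    items = {x for pair in pairs for x in pair}
    wins = {x: sum(1 for a, _ in pairs if a == x) for x in items}
    return tuple(sorted(items, key=lambda x: -wins[x]))

for arr in permutations('ABCD'):
    assert reconstruct(pairs_of(arr)) == arr
print("each set of pairwise orders forces a unique arrangement")
```

Checking every permutation of four objects is not a proof, but it is exactly the kind of experiment that can suggest why the win-counting argument generalizes.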

9) Encourage extensions

What you’ve done—finding the pattern, but far more important, finding the explanation (and stating it so clearly)—is really great! (Perhaps I should say "finding and stating explanations like this is real mathematics"!) Yet it almost sounded as if you put it down at the very end, when you concluded "making our project mostly an interesting coincidence." This is a truly nice piece of work!

The question, now, is "What next?" You really have completely solved the problem you set out to solve: found the answer, and proved that you’re right!

I began looking back at the examples you gave, and noticed patterns in them that I had never seen before. At first, I started coloring parts red, because they just "stuck out" as noticeable and I wanted to see them better. Then, it occurred to me that I was coloring the recruits that were back-to-back, and that maybe I should be paying attention to the ones who were facing each other, as they were "where the action was," so I started coloring them pink. (In one case, I recopied your example to do the pinks.) To be honest, I’m not sure what I’m looking for, but there was such a clear pattern of the "action spot" moving around that I thought it might tell me something new. Anything come to your minds?

10) Build a Mathematical Community

I just went back to another paper and then came back to yours to look again. There's another pattern in the table. Add the recruits and the corresponding starting arrangements (for example, add 6 and 15) and you get the next number of starting arrangements. I don't know whether this, or your 1.5, 2, 2.5, 3, 3.5… pattern will help you find out why 6, 10, 15… make sense as answers, but they might. Maybe you can work with [your classmates] who made the other observation to try to develop a complete understanding of the problem.
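The mentor's additive pattern is a binomial-coefficient identity in disguise. A quick check, assuming (as the exchanges above suggest) that the starting-arrangement counts are the "n choose 2" numbers:

```python
from math import comb

# mentor's observation: recruits + starting arrangements = next count
# e.g., 6 + 15 = 21; in general n + C(n, 2) = C(n + 1, 2)
for n in range(4, 9):
    assert n + comb(n, 2) == comb(n + 1, 2)
    print(n, comb(n, 2))   # prints 4 6, 5 10, 6 15, 7 21, 8 28
```

The assertion holding for every n tried mirrors the table the students built, though explaining why the identity must hold is still the mathematical heart of the matter.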

11) Highlight Connections

Your rule—the (n-1)+(n-2)+(n-3)+… +3+2+1 part—is interesting all by itself, as it counts the number of dots in a triangle of dots. See how?
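The connection to triangles of dots rests on the classic pairing argument (a standard identity, added here for illustration, not part of the original exchange):

```latex
S = (n-1) + (n-2) + \cdots + 2 + 1,
\qquad
2S = \underbrace{\bigl[(n-1)+1\bigr] + \bigl[(n-2)+2\bigr] + \cdots}_{n-1 \text{ terms, each equal to } n} = n(n-1),
\qquad
S = \frac{n(n-1)}{2} = \binom{n}{2}.
```

So the sum simultaneously counts the dots in a triangle with n - 1 rows and the ways to choose which 2 of n recruits face the wrong way.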

12) Wrap Up

This is really a very nice and complete piece of work: you've stated a problem, found a solution, and given a proof (complete explanation of why that solution must be correct). To wrap it up and give it the polish of a good piece of mathematical research, I'd suggest two things.

The first thing is to extend the idea to account for all but two mistakes and the (slightly trivial) one mistake and all but one mistake. (If you felt like looking at 3 and all but 3, that'd be nice, too, but it's more work—though not a ton—and the ones that I suggested are really not more work.)

The second thing I'd suggest is to write it all up in a way that would be understandable by someone who did not know the problem or your class: clear statement of the problem, the solution, what you did to get the solution, and the proof.

I look forward to seeing your masterpiece!

Advice for Keeping a Formal Mathematics Research Logbook

As part of your mathematics research experience, you will keep a mathematics research logbook. In this logbook, keep a record of everything you do and everything you read that relates to this work. Write down questions that you have as you are reading or working on the project. Experiment. Make conjectures. Try to prove your conjectures. Your journal will become a record of your entire mathematics research experience. Don’t worry if your writing is not always perfect. Often journal pages look rough, with notes to yourself, false starts, and partial solutions. However, be sure that you can read your own notes later and try to organize your writing in ways that will facilitate your thinking. Your logbook will serve as a record of where you are in your work at any moment and will be an invaluable tool when you write reports about your research.

Ideally, your mathematics research logbook should have pre-numbered pages. You can often find numbered graph-paper science logs at office supply stores. If you cannot find a notebook that has the pages already numbered, then the first thing you should do is go through the entire book, numbering each page in pen.

• Date each entry.

• Work in pen.

• Don’t erase or white out mistakes. Instead, draw a single line through what you would like ignored. There are many reasons for using this approach:

– Your notebook will look a lot nicer if it doesn’t have scribbled messes in it.

– You can still see what you wrote at a later date if you decide that it wasn’t a mistake after all.

– It is sometimes useful to be able to go back and see where you ran into difficulties.

– You’ll be able to go back and see if you already tried something so you won’t spend time trying that same approach again if it didn’t work.

• When you do research using existing sources, be sure to list the bibliographic information at the start of each section of notes you take. It is a lot easier to write down the citation while it is in front of you than it is to try to find it at a later date.

• Never tear a page out of your notebook. The idea is to keep a record of everything you have done. One reason for pre-numbering the pages is to show that nothing has been removed.

• If you find an interesting article or picture that you would like to include in your notebook, you can staple or tape it onto a page.

Advice for Keeping a Loose-Leaf Mathematics Research Logbook

Get yourself a good loose-leaf binder, some lined paper for notes, some graph paper for graphs and some blank paper for pictures and diagrams. Be sure to keep everything that is related to your project in your binder.

• Be sure to keep everything related to your project. The idea is to keep a record of everything you have done.

• If you find an interesting article or picture that you would like to include in your notebook, punch holes in it and insert it in an appropriate section in your binder.

  • Open access
  • Published: 19 June 2024

Detecting hallucinations in large language models using semantic entropy

  • Sebastian Farquhar (ORCID: orcid.org/0000-0002-9185-6415),
  • Jannik Kossen,
  • Lorenz Kuhn &
  • Yarin Gal (ORCID: orcid.org/0000-0002-2733-2078)

Nature volume 630, pages 625–630 (2024)


Large language model (LLM) systems, such as ChatGPT 1 or Gemini 2 , can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers 3 , 4 . Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents 5 or untrue facts in news articles 6 and even posing a risk to human life in medical domains such as radiology 7 . Encouraging truthfulness through supervision or reinforcement has been only partially successful 8 . Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability.


‘Hallucinations’ are a critical problem 9 for natural language generation systems using large language models (LLMs), such as ChatGPT 1 or Gemini 2 , because users cannot trust that any given output is correct.

Hallucinations are often defined as LLMs generating “content that is nonsensical or unfaithful to the provided source content” 9 , 10 , 11 but they have come to include a vast array of failures of faithfulness and factuality. We focus on a subset of hallucinations which we call ‘confabulations’ 12 for which LLMs fluently make claims that are both wrong and arbitrary—by which we mean that the answer is sensitive to irrelevant details such as random seed. For example, when asked the medical question “What is the target of Sotorasib?” an LLM confabulates by sometimes answering KRASG12C (correct) and other times KRASG12D (incorrect) despite identical instructions. We distinguish this from cases in which a similar ‘symptom’ is caused by the following different mechanisms: when LLMs are consistently wrong as a result of being trained on erroneous data such as common misconceptions 13 ; when the LLM ‘lies’ in pursuit of a reward 14 ; or systematic failures of reasoning or generalization. We believe that combining these distinct mechanisms in the broad category of hallucination is unhelpful. Our method makes progress on a portion of the problem of providing scalable oversight 15 by detecting confabulations that people might otherwise find plausible. However, it does not guarantee factuality because it does not help when LLM outputs are systematically bad. Nevertheless, we significantly improve question-answering accuracy for state-of-the-art LLMs, revealing that confabulations are a major source of error at present.

We show how to detect confabulations by developing a quantitative measure of when an input is likely to cause an LLM to generate arbitrary and ungrounded answers. Detecting confabulations allows systems built on LLMs to avoid answering questions likely to cause confabulations, to make users aware of the unreliability of answers to a question or to supplement the LLM with more grounded search or retrieval. This is essential for the critical emerging field of free-form generation in which naive approaches, suited to closed vocabulary and multiple choice, fail. Past work on uncertainty for LLMs has focused on simpler settings, such as classifiers 16 , 17 and regressors 18 , 19 , whereas the most exciting applications of LLMs relate to free-form generations.

The term hallucination in the context of machine learning originally comes from filling in ungrounded details, either as a deliberate strategy 20 or as a reliability problem 4 . The appropriateness of the metaphor has been questioned as promoting undue anthropomorphism 21 . Although we agree that metaphor must be used carefully with LLMs 22 , the widespread adoption of the term hallucination reflects the fact that it points to an important phenomenon. This work represents a step towards making that phenomenon more precise.

To detect confabulations, we use probabilistic tools to define and then measure the ‘semantic’ entropy of the generations of an LLM—an entropy that is computed over meanings of sentences. High entropy corresponds to high uncertainty 23 , 24 , 25 —so semantic entropy is one way to estimate semantic uncertainties. Semantic uncertainty, the broader category of measures we introduce, could be operationalized with other measures of uncertainty, such as mutual information, instead. Entropy in free-form generation is normally hard to measure because answers might mean the same thing (be semantically equivalent) despite being expressed differently (being syntactically or lexically distinct). This causes naive estimates of entropy or other lexical variation scores 26 to be misleadingly high when the same correct answer might be written in many ways without changing its meaning.

By contrast, our semantic entropy moves towards estimating the entropy of the distribution of meanings of free-form answers to questions, insofar as that is possible, rather than the distribution over the ‘tokens’ (words or word-pieces) which LLMs natively represent. This can be seen as a kind of semantic consistency check 27 for random seed variation. An overview of our approach is provided in Fig. 1 and a worked example in Supplementary Table 1 .

figure 1

a , Naive entropy-based uncertainty measures variation in the exact answers, treating ‘Paris’, ‘It’s Paris’ and ‘France’s capital Paris’ as different. But this is unsuitable for language tasks for which sometimes different answers mean the same things. Our semantic entropy clusters answers which share meanings before computing the entropy. A low semantic entropy shows that the LLM is confident about the meaning. b , Semantic entropy can also detect confabulations in longer passages. We automatically decompose a long generated answer into factoids. For each factoid, an LLM generates questions to which that factoid might have been the answer. The original LLM then samples  M possible answers to these questions. Finally, we compute the semantic entropy over the answers to each specific question, including the original factoid. Confabulations are indicated by high average semantic entropy for questions associated with that factoid. Here, semantic entropy classifies Fact 1 as probably not a confabulation because generations often mean the same thing, despite very different wordings, which a naive entropy would have missed.

Intuitively, our method works by sampling several possible answers to each question and clustering them algorithmically into answers that have similar meanings, which we determine on the basis of whether answers in the same cluster entail each other bidirectionally 28 . That is, if sentence A entails that sentence B is true and vice versa, then we consider them to be in the same semantic cluster. We measure entailment using both general-purpose LLMs and natural language inference (NLI) tools developed specifically for detecting entailment for which we show direct evaluations in Supplementary Tables 2 and 3 and Supplementary Fig. 1 . Textual entailment has previously been shown to correlate with faithfulness 10 in the context of factual consistency 29 as well as being used to measure factuality in abstractive summarization 30 , especially when applied at the right granularity 31 .

Semantic entropy detects confabulations in free-form text generation across a range of language models and domains, without previous domain knowledge. Our evaluations cover question answering in trivia knowledge (TriviaQA 32 ), general knowledge (SQuAD 1.1; ref. 33 ), life sciences (BioASQ 34 ) and open-domain natural questions (NQ-Open 35 ) derived from actual queries to Google Search 36 . In addition, semantic entropy detects confabulations in mathematical word problems (SVAMP 37 ) and in a biography-generation dataset, FactualBio, accompanying this paper.

Our results for TriviaQA, SQuAD, BioASQ, NQ-Open and SVAMP are all evaluated context-free and involve sentence-length answers (96 ± 70 characters, mean ± s.d.) and use LLaMA 2 Chat (7B, 13B and 70B parameters) 38 , Falcon Instruct (7B and 40B) 39 and Mistral Instruct (7B) 40 . In the Supplementary Information , we further consider short-phrase-length answers. Results for FactualBio (442 ± 122 characters) use GPT-4 (ref. 1 ). At the time of writing, GPT-4 (ref. 1 ) did not expose output probabilities 41 or hidden states, although it does now. As a result, we propose a discrete approximation of our estimator for semantic entropy which allows us to run experiments without access to output probabilities, which we use for all GPT-4 results in this paper and which performs similarly well.

Our confabulation detection with semantic entropy is more robust to user inputs from previously unseen domains than methods which aim to ‘learn’ how to detect confabulations from a set of example demonstrations. Our method is unsupervised, meaning that we do not need labelled examples of confabulations. By contrast, supervised methods detect confabulations by learning patterns behind examples of confabulations, assuming that future questions preserve these patterns. But this assumption is often untrue in new situations or with confabulations that human overseers are unable to identify (compare Fig. 17 of ref. 24 ). As a strong supervised baseline, we compare to an embedding regression method inspired by ref. 24 which trains a logistic regression classifier to predict whether the model correctly answered a question on the basis of the final ‘embedding’ (hidden state) of the LLM. We also use the P (True) method 24 which looks at the probability with which an LLM predicts that the next token is ‘True’ when few-shot prompted to compare a main answer with ‘brainstormed’ alternatives.

Confabulations contribute substantially to incorrect answers given by language models. We show that semantic entropy can be used to predict many incorrect model answers and to improve question-answering accuracy by refusing to answer those questions the model is uncertain about. Corresponding to these two uses, we evaluate two main metrics. First, the widely used area under the receiver operating characteristic (AUROC) curve for the binary event that a given answer is incorrect. This measure captures both precision and recall and ranges from 0 to 1, with 1 representing a perfect classifier and 0.5 representing an un-informative classifier. We also show a new measure, the area under the ‘rejection accuracy’ curve (AURAC). This studies the case in which the confabulation detection score is used to refuse to answer the questions judged most likely to cause confabulations. Rejection accuracy is the accuracy of the answers of the model on the remaining questions and the area under this curve is a summary statistic over many thresholds (representative threshold accuracies are provided in Supplementary Material ). The AURAC captures the accuracy improvement which users would experience if semantic entropy was used to filter out questions causing the highest entropy.
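
Both metrics can be computed directly from per-question uncertainty scores and correctness labels. The sketch below is ours (the function names and the pairwise AUROC formulation are not from the paper's released code); the AURAC is then simply the mean of the rejection-accuracy curve across thresholds.

```python
import numpy as np

def auroc(scores, is_incorrect):
    """AUROC for the binary event 'answer is incorrect', given uncertainty scores.

    Equivalent to the probability that a randomly chosen incorrect answer
    receives a higher uncertainty score than a randomly chosen correct one.
    """
    scores = np.asarray(scores, dtype=float)
    is_incorrect = np.asarray(is_incorrect, dtype=bool)
    pos = scores[is_incorrect]   # incorrect answers (the 'positive' event)
    neg = scores[~is_incorrect]  # correct answers
    # Pairwise comparisons; ties count as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def rejection_accuracy_curve(scores, is_correct):
    """Accuracy on the retained questions as the most-uncertain ones are refused."""
    order = np.argsort(scores)  # answer low-uncertainty questions first
    correct_sorted = np.asarray(is_correct, dtype=float)[order]
    # curve[k] = accuracy when only the k+1 least-uncertain questions are answered
    return np.cumsum(correct_sorted) / np.arange(1, len(correct_sorted) + 1)
```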

Detecting confabulations in QA and math

In Fig. 2 , we show that both semantic entropy and its discrete approximation outperform our best baselines for sentence-length generations. These results are averaged across datasets and provide the actual scores on the held-out evaluation dataset. We report the raw average score across held-out evaluation datasets without standard error because the distributional characteristics are more a property of the models and datasets selected than the method. Consistency of relative results across different datasets is a stronger indicator of variation in this case.

figure 2

Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y -axes) measures how well methods predict LLM mistakes, which correlate with confabulations. AURAC (likewise scored on the y -axes) measures the performance improvement of a system that refuses to answer questions which are judged likely to cause confabulations. Results are an average over five datasets, with individual metrics provided in the Supplementary Information .

Semantic entropy greatly outperforms the naive estimation of uncertainty using entropy: computing the entropy of the length-normalized joint probability of the token sequences. Naive entropy estimation ignores the fact that token probabilities also express the uncertainty of the model over phrasings that do not change the meaning of an output.

Our methods also outperform the supervised embedding regression method both in- and out-of-distribution. In pale-yellow bars we show that embedding regression performance deteriorates when its training data do not match the deployment distribution—which mirrors the common real-world case in which there is a distribution shift between training and deployment 42 —the plotted value is the average metric for embedding regression trained on one of the four ‘off-distribution’ datasets for that evaluation. This is critical because reliable uncertainty is most important when the data distribution shifts. Semantic entropy also outperforms P (True) which is supervised ‘in-context’; that is, it is adapted to the deployment task with a few training examples provided in the LLM prompt itself. The discrete variant of semantic entropy performs similarly to our standard estimator, despite not requiring exact output probabilities.

Averaged across the 30 combinations of tasks and models we study, semantic entropy achieves the best AUROC value of 0.790 whereas naive entropy (0.691), P (True) (0.698) and the embedding regression baseline (0.687) lag behind it. Semantic entropy performs well consistently, with stable performance (between 0.78 and 0.81 AUROC) across the different model families (LLaMA, Falcon and Mistral) and scales (from 7B to 70B parameters) which we study (we report summary statistics for each dataset and model as before). Although semantic entropy outperforms the baselines across all model sizes, P (True) seems to improve with model size, suggesting that it might become more competitive for very capable honest models in settings that the model understands well (which are, however, not the most important cases to have good uncertainty). We use ten generations to compute entropy, selected using analysis in Supplementary Fig. 2 . Further results for short-phrase generations are described in Supplementary Figs. 7 – 10 .

The results in Fig. 2 offer a lower bound on the effectiveness of semantic entropy at detecting confabulations. These evaluations determine whether semantic entropy and baseline methods can detect when the answers of the model are incorrect (which we validate against human correctness evaluations in Supplementary Table 4 ). In addition to errors from confabulations (arbitrary incorrectness), this also includes other types of mistakes for which semantic entropy is not suited, such as consistent errors learned from the training data. The fact that methods such as embedding regression are able to spot other kinds of errors, not just confabulations, but still are outperformed by semantic entropy, suggests that confabulations are a principal category of errors for actual generations.

Examples of questions and answers from TriviaQA, SQuAD and BioASQ, for LLaMA 2 Chat 70B, are shown in Table 1 . These illustrate how only semantic entropy detects when the meaning is constant but the form varies (the first row of the table) whereas semantic entropy and naive entropy both correctly predict the presence of confabulations when the form and meaning vary together (second row) and predict the absence of confabulations when the form and meaning are both constant across several resampled generations (third row). In the final row, we give an example in which semantic entropy is erroneously high as a result of overly sensitive semantic clustering relative to the reference answer. Our clustering method distinguishes the answers which provide a precise date from those which only provide a year. For some contexts that would have been correct but in this context the distinction between the specific day and the year is probably irrelevant. This highlights the importance of context and judgement in clustering, especially in subtle cases, as well as the shortcomings of evaluating against fixed reference answers which do not capture the open-ended flexibility of conversational deployments of LLMs.

Detecting confabulations in biographies

Semantic entropy is most natural for sentences that express a single proposition but the idea of semantic equivalence is trickier to apply to longer passages which express many propositions which might only agree partially 43 . Nevertheless, we can use semantic entropy to detect confabulations in longer generations, such as entire paragraphs of text. To show this, we develop a dataset of biographical generations from GPT-4 (v.0613) for 21 individuals notable enough to have their own Wikipedia page but without extensive online biographies. From each biography generated by GPT-4, we automatically extract propositional factual claims about the individual (150 factual claims in total), which we manually label as true or false.

Applying semantic entropy to this problem is challenging. Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy (Fig. 1 ) by generating three new answers to each question (selected with analysis in Supplementary Figs. 3 and 4 ) and computing the semantic entropy over those generations plus the original factual claim. We aggregate these by averaging the semantic entropy over all the questions to get an uncertainty score for each proposition, which we use to detect confabulations. Unaggregated results are shown in Supplementary Figs. 5 and 6 .
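
The decomposition procedure above could be sketched as follows. All four callables (`extract_claims`, `questions_for`, `sample_answers`, `semantic_entropy`) are hypothetical stand-ins for the LLM-backed components, not names from the paper:

```python
def paragraph_confabulation_scores(paragraph, extract_claims, questions_for,
                                   sample_answers, semantic_entropy, m=3):
    """Score each factual claim in a paragraph by average semantic entropy.

    Hypothetical stand-ins for LLM-backed components:
    extract_claims(paragraph) -> list of factual claims
    questions_for(claim)      -> questions the claim might have answered
    sample_answers(q, m)      -> m sampled answers to question q
    semantic_entropy(answers) -> entropy over meaning clusters of the answers
    """
    scores = {}
    for claim in extract_claims(paragraph):
        entropies = []
        for q in questions_for(claim):
            answers = sample_answers(q, m) + [claim]  # include the original factoid
            entropies.append(semantic_entropy(answers))
        scores[claim] = sum(entropies) / len(entropies)  # average over questions
    return scores
```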

As GPT-4 did not allow access to the probability of the generation at the time of writing, we use a discrete variant of semantic entropy which makes the further approximation that we can infer a discrete empirical distribution over semantic meaning clusters from only the generations ( Methods ). This allows us to compute semantic entropy using only the black-box outputs of an LLM. However, we were unable to compute the naive entropy baseline, the standard semantic entropy estimator or the embedding regression baseline for GPT-4 without output probabilities and embeddings.
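
Under this approximation, the probability of each meaning cluster is estimated by its empirical frequency among the sampled generations, so the estimator needs only cluster assignments. A minimal sketch (the function name is ours):

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    """Semantic entropy from cluster assignments alone (no token probabilities).

    cluster_ids: one semantic-cluster label per sampled generation. Each
    cluster's probability is approximated by its empirical frequency.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```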

In Fig. 3 we show that the discrete variant of semantic entropy effectively detects confabulations on this dataset. Its AUROC and AURAC are higher than either a simple ‘self-check’ baseline—which just asks the LLM whether the factoid is likely to be true—or a variant of P (True) which has been adapted to work for the paragraph-length setting. Discrete semantic entropy has better rejection accuracy performance until 20% of the questions have been rejected at which point P (True) has a narrow edge. This indicates that the questions predicted to cause confabulations are indeed more likely to be wrong.

figure 3

The discrete variant of our semantic entropy estimator outperforms baselines both when measured by AUROC and AURAC metrics (scored on the y -axis). The AUROC and AURAC are substantially higher than for both baselines. At above 80% of questions being answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the answer accuracy on the remainder for the P (True) baseline exceed semantic entropy.

Our probabilistic approach, accounting for semantic equivalence, detects an important class of hallucinations: those that are caused by a lack of LLM knowledge. These are a substantial portion of the failures at present and will continue even as models grow in capabilities because situations and cases that humans cannot reliably supervise will persist. Confabulations are a particularly noteworthy failure mode for question answering but appear in other domains too. Semantic entropy needs no previous domain knowledge and we expect that algorithmic adaptations to other problems will allow similar advances in, for example, abstractive summarization. In addition, extensions to alternative input variations such as rephrasing or counterfactual scenarios would allow a similar method to act as a form of cross-examination 44 for scalable oversight through debate 45 .

The success of semantic entropy at detecting errors suggests that LLMs are even better at “knowing what they don’t know” than was argued by ref. 24 —they just don’t know they know what they don’t know. Our method explicitly does not directly address situations in which LLMs are confidently wrong because they have been trained with objectives that systematically produce dangerous behaviour, cause systematic reasoning errors or are systematically misleading the user. We believe that these represent different underlying mechanisms—despite similar ‘symptoms’—and need to be handled separately.

One exciting aspect of our approach is the way it makes use of classical probabilistic machine learning methods and adapts them to the unique properties of modern LLMs and free-form language generation. We hope to inspire a fruitful exchange of well-studied methods and emerging new problems by highlighting the importance of meaning when addressing language-based machine learning problems.

Semantic entropy as a strategy for overcoming confabulation builds on probabilistic tools for uncertainty estimation. It can be applied directly to any LLM or similar foundation model without requiring any modifications to the architecture. Our ‘discrete’ variant of semantic uncertainty can be applied even when the predicted probabilities for the generations are not available, for example, because access to the internals of the model is limited.

In this section we introduce background on probabilistic methods and uncertainty in machine learning, discuss how it applies to language models and then discuss our contribution, semantic entropy, in detail.

Uncertainty and machine learning

We aim to detect confabulations in LLMs, using the principle that the model will be uncertain about generations for which its output is going to be arbitrary.

One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input 25 . The predictive entropy (PE) for an input sentence x is the conditional entropy ( H ) of the output random variable Y with realization y given x , \(PE({\boldsymbol{x}})=H(Y| {\boldsymbol{x}})=-{\sum }_{y}P(\,y| {\boldsymbol{x}})\log P(\,y| {\boldsymbol{x}})\) .

A low predictive entropy indicates an output distribution which is heavily concentrated whereas a high predictive entropy indicates that many possible outputs are similarly likely.
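
As a toy numeric illustration of this point (the distributions below are invented):

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

concentrated = [0.97, 0.01, 0.01, 0.01]  # model almost certain of one output
spread = [0.25, 0.25, 0.25, 0.25]        # many outputs similarly likely
```

The uniform distribution attains the maximum entropy log 4, while the concentrated one is close to zero.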

Aleatoric and epistemic uncertainty

We do not distinguish between aleatoric and epistemic uncertainty in our analysis. Researchers sometimes separate aleatoric uncertainty (uncertainty in the underlying data distribution) from epistemic uncertainty (caused by having only limited information) 46 . Further advances in uncertainty estimation which separate these kinds of uncertainty would enhance the potential for our semantic uncertainty approach by allowing extensions beyond entropy.

Joint probabilities of sequences of tokens

Generative LLMs produce strings of text by selecting tokens in sequence. Each token is a wordpiece that often represents three or four characters (though especially common sequences and important words such as numbers typically get their own token). To compute entropies, we need access to the probabilities the LLM assigns to the generated sequence of tokens. The probability of the entire sequence, s , conditioned on the context, x , is the product of the conditional probabilities of new tokens given past tokens, whose resulting log-probability is \(\log P({\bf{s}}| {\boldsymbol{x}})={\sum }_{i}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , where s i is the i th output token and s < i denotes the set of previous tokens.

Length normalization

When comparing the log-probabilities of generated sequences, we use ‘length normalization’, that is, we use an arithmetic mean log-probability, \(\frac{1}{N}{\sum }_{i=1}^{N}\log P({s}_{i}| {{\bf{s}}}_{ < i},{\boldsymbol{x}})\) , instead of the sum. In expectation, longer sequences have lower joint likelihoods because of the conditional independence of the token probabilities 47 . The joint likelihood of a sequence of length N shrinks exponentially in N . Its negative log-probability therefore grows linearly in N , so longer sentences tend to contribute more to entropy. We therefore interpret length-normalizing the log-probabilities when estimating the entropy as asserting that the expected uncertainty of generations is independent of sentence length. Length normalization has some empirical success 48 , including in our own preliminary experiments, but little theoretical justification in the literature.
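
A minimal sketch of the comparison (per-token log-probabilities invented for illustration):

```python
import math

def sequence_logprob(token_logprobs, length_normalize=True):
    """Joint log-probability of a sampled sequence from per-token log-probs.

    With length_normalize=True, returns the arithmetic mean log-probability,
    so longer sequences are not penalized merely for their length.
    """
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalize else total

# Two invented generations with identical per-token confidence.
short = [math.log(0.5)] * 3
long = [math.log(0.5)] * 10
```

Without normalization the longer sequence has a much lower joint log-probability; with normalization the two are scored identically.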

Principles of semantic uncertainty

If we naively calculate the predictive entropy directly from the probabilities of the generated sequence of tokens, we conflate the uncertainty of the model over the meaning of its answer with the uncertainty over the exact tokens used to express that meaning. For example, even if the model is confident in the meaning of a generation, there are still usually many different ways for phrasing that generation without changing its meaning. For the purposes of detecting confabulations, the uncertainty of the LLM over meanings is more important than the uncertainty over the exact tokens used to express those meanings.

Our semantic uncertainty method therefore seeks to estimate only the uncertainty the LLM has over the meaning of its generation, not the choice of words. To do this, we introduce an algorithm that clusters model generations by meaning and subsequently calculates semantic uncertainty. At a high level this involves three steps:

Generation: sample output sequences of tokens from the predictive distribution of an LLM given a context x .

Clustering: cluster sequences by their meaning using our clustering algorithm based on bidirectional entailment.

Entropy estimation: estimate semantic entropy by summing probabilities of sequences that share a meaning following equation ( 2 ) and compute their entropy.
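
Putting the three steps together, a minimal sketch might look like the following; `entails` is a hypothetical stand-in for an NLI model or prompted LLM, and the greedy clustering mirrors the build-up of equivalence classes described later in this section:

```python
import math

def cluster_by_meaning(answers, entails):
    """Greedy clustering: an answer joins the first cluster whose representative
    it bidirectionally entails; otherwise it starts a new cluster.

    Returns lists of indices into `answers`.
    """
    clusters = []
    for i, ans in enumerate(answers):
        for cluster in clusters:
            rep = answers[cluster[0]]
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def semantic_entropy(answers, probs, entails):
    """Entropy over meanings: sum sequence probabilities within each cluster,
    renormalize, then take the entropy of the cluster distribution."""
    clusters = cluster_by_meaning(answers, entails)
    mass = [sum(probs[i] for i in c) for c in clusters]
    total = sum(mass)
    return -sum((m / total) * math.log(m / total) for m in mass if m > 0)
```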

Generating a set of answers from the model

Given some context x as input to the LLM, we sample M sequences, { s (1) , …,  s ( M ) } and record their token probabilities, { P ( s (1) ∣ x ), …,  P ( s ( M ) ∣ x )}. We sample all our generations from a single model, varying only the random seed used for sampling from the token probabilities. We do not observe the method to be particularly sensitive to details of the sampling scheme. In our implementation, we sample at temperature 1 using nucleus sampling ( P  = 0.9) (ref. 49 ) and top- K sampling ( K  = 50) (ref. 50 ). We also sample a single generation at low temperature (0.1) as an estimate of the ‘best generation’ of the model to the context, which we use to assess the accuracy of the model. (A lower sampling temperature increases the probability of sampling the most likely tokens.)
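
The sampling scheme described above (temperature scaling plus top-K and nucleus filtering) could be sketched as follows; this is an illustrative re-implementation, not the authors' code:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Sample one token id with temperature, top-K and nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))  # softmax, numerically stable
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # most likely tokens first
    keep = order[:top_k]                     # top-K filtering
    cum = np.cumsum(probs[keep])
    # Smallest prefix of kept tokens whose cumulative mass reaches top_p.
    nucleus = keep[: np.searchsorted(cum, top_p) + 1]
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))
```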

Clustering by semantic equivalence

To estimate semantic entropy we need to cluster generated outputs from the model into groups of outputs that mean the same thing as each other.

This can be described using ‘semantic equivalence’ which is the relation that holds between two sentences when they mean the same thing. We can formalize semantic equivalence mathematically. Let the space of tokens in a language be \({\mathcal{T}}\) . The space of all possible sequences of tokens of length N is then \({{\mathcal{S}}}_{N}\equiv {{\mathcal{T}}}^{N}\) . Note that N can be made arbitrarily large to accommodate whatever size of sentence one can imagine and one of the tokens can be a ‘padding’ token which occurs with certainty for each token after the end-of-sequence token. For some sentence \({\bf{s}}\in {{\mathcal{S}}}_{N}\) , composed of a sequence of tokens, \({s}_{i}\in {\mathcal{T}}\) , there is an associated meaning. Theories of meaning are contested 51 . However, for specific models and deployment contexts many considerations can be set aside. Care should be taken comparing very different models and contexts.

Let us introduce a semantic equivalence relation, E (  ⋅  ,  ⋅  ), which holds for any two sentences that mean the same thing—we will operationalize this presently. Recall that an equivalence relation is any reflexive, symmetric and transitive relation and that any equivalence relation on a set corresponds to a set of equivalence classes. Each semantic equivalence class captures outputs that can be considered to express the same meaning. That is, for the space of semantic equivalence classes \({\mathcal{C}}\) the sentences in the set \(c\in {\mathcal{C}}\) can be regarded in many settings as expressing a similar meaning such that \(\forall {\bf{s}},{{\bf{s}}}^{{\prime} }\in c:E({\bf{s}},{{\bf{s}}}^{{\prime} })\) . So we can build up these classes of semantically equivalent sentences by checking if new sentences share a meaning with any sentences we have already clustered and, if so, adding them into that class.

We operationalize E (  ⋅  ,  ⋅  ) using the idea of bidirectional entailment, which has a long history in linguistics 52 and natural language processing 28 , 53 , 54 . A sequence, s , means the same thing as a second sequence, s ′, only if the sequences entail (that is, logically imply) each other. For example, ‘The capital of France is Paris’ entails ‘Paris is the capital of France’ and vice versa because they mean the same thing. (See later for a discussion of soft equivalence and cases in which bidirectional entailment does not guarantee equivalent meanings).

Importantly, we require that the sequences mean the same thing with respect to the context—key meaning is sometimes contained in the context. For example, ‘Paris’ does not entail ‘The capital of France is Paris’ because ‘Paris’ is not a declarative sentence without context. But in the context of the question ‘What is the capital of France?’, the one-word answer does entail the longer answer.

Detecting entailment has been the object of a great deal of research in natural language inference (NLI) 55 . We rely on language models to predict entailment, such as DeBERTa-Large-MNLI 56 , which has been trained to predict entailment, or general-purpose LLMs such as GPT-3.5 (ref. 57 ), which can predict entailment given suitable prompts.

We then cluster sentences according to whether they bidirectionally entail each other using the algorithm presented in Extended Data Fig. 1 . Note that, to check if a sequence should be added to an existing cluster, it is sufficient to check if the sequence bidirectionally entails any of the existing sequences in that cluster (we arbitrarily pick the first one), given the transitivity of semantic equivalence. If a sequence does not share meaning with any existing cluster, we assign it its own cluster.
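The clustering procedure can be sketched as follows; here `entails` is a stand-in for any directional entailment judgement (for example, an NLI model applied in context), not the paper's exact implementation.

```python
def cluster_by_meaning(sequences, entails):
    """Greedily cluster sequences into semantic equivalence classes.

    A sequence joins the first cluster whose representative (its first
    member, which suffices given transitivity) it bidirectionally entails;
    otherwise it starts a new cluster. entails(a, b) should return True
    when a logically implies b in the given context.
    """
    clusters = []
    for s in sequences:
        for cluster in clusters:
            rep = cluster[0]
            if entails(s, rep) and entails(rep, s):
                cluster.append(s)
                break
        else:  # no existing cluster shares this meaning
            clusters.append([s])
    return clusters
```

With a real NLI model plugged in for `entails`, this reproduces the greedy algorithm described above.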

Computing the semantic entropy

Having determined the classes of generated sequences that mean the same thing, we can estimate the likelihood that a sequence generated by the LLM belongs to a given class by computing the sum of the probabilities of all the possible sequences of tokens which can be considered to express the same meaning, \(P(c| {\boldsymbol{x}})={\sum }_{{\bf{s}}\in c}P({\bf{s}}| {\boldsymbol{x}})\).

Formally, this treats the output as a random variable whose event-space is the space of all possible meaning-classes, C , a sub- σ -algebra of the standard event-space S . We can then estimate the semantic entropy (SE) as the entropy over the meaning-distribution, \({\rm{SE}}({\boldsymbol{x}})=-{\sum }_{c}P(c| {\boldsymbol{x}})\log P(c| {\boldsymbol{x}})\).

There is a complication which prevents direct computation: we do not have access to every possible meaning-class c . Instead, we can only sample c from the sequence-generating distribution induced by the model. To handle this, we estimate the expectation in equation ( 3 ) using a Rao–Blackwellized Monte Carlo integration over the sampled semantic equivalence classes C , \({\rm{SE}}({\boldsymbol{x}})\approx -{\sum }_{i=1}^{| C| }P({C}_{i}| {\boldsymbol{x}})\log P({C}_{i}| {\boldsymbol{x}})\),

where \(P({C}_{i}| {\boldsymbol{x}})=\frac{P({c}_{i}| {\boldsymbol{x}})}{{\sum }_{c}P(c| {\boldsymbol{x}})}\) estimates a categorical distribution over the cluster meanings, that is, ∑ i P ( C i ∣ x ) = 1. Without this normalization step cluster ‘probabilities’ could exceed one because of length normalization, resulting in degeneracies. Equation ( 5 ) is the estimator giving our main method that we refer to as semantic entropy throughout the text.
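Concretely, the estimator can be computed from the (length-normalized) sequence log-probabilities and each sampled sequence's cluster assignment; the helper below is our own illustration rather than the released code.

```python
import numpy as np

def semantic_entropy(log_probs, cluster_ids):
    """Rao-Blackwellized Monte Carlo estimate of semantic entropy.

    log_probs: log P(s | x) for each sampled sequence
    cluster_ids: semantic cluster index of each sampled sequence
    """
    log_probs = np.asarray(log_probs, dtype=float)
    cluster_ids = list(cluster_ids)
    labels = sorted(set(cluster_ids))
    # P(C_i | x): total sequence probability in each meaning cluster ...
    cluster_mass = np.array([
        np.exp(log_probs[[i for i, c in enumerate(cluster_ids) if c == k]]).sum()
        for k in labels
    ])
    # ... normalized so the cluster probabilities sum to one
    p = cluster_mass / cluster_mass.sum()
    return float(-(p * np.log(p)).sum())
```

Two clusters of equal mass give entropy log 2; a single cluster gives entropy 0.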

For scenarios in which the sequence probabilities are not available, we propose a variant of semantic entropy which we call ‘discrete’ semantic entropy. Discrete semantic entropy approximates P ( C i ∣ x ) directly from the number of generations in each cluster, disregarding the token probabilities. That is, we approximate P ( C i ∣ x ) as \(\frac{1}{M}{\sum }_{m=1}^{M}{I}_{{c}^{(m)}={C}_{i}}\), the proportion of all the sampled answers which belong to that cluster. Effectively, this assumes that each output that was actually generated was equally probable, estimating the underlying distribution as the empirical categorical distribution. In the limit of large M , the estimator converges to equation ( 5 ) by the law of large numbers. We find that discrete semantic entropy results in similar performance empirically.
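The discrete variant needs only the cluster assignments; a minimal sketch (ours, for illustration):

```python
from collections import Counter
import math

def discrete_semantic_entropy(cluster_ids):
    """Estimate P(C_i | x) as the fraction of sampled answers in each
    semantic cluster, ignoring token probabilities, then take the entropy."""
    m = len(cluster_ids)
    return -sum((n / m) * math.log(n / m) for n in Counter(cluster_ids).values())
```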

We provide a worked example of the computation of semantic entropy in Supplementary Note  1 .

Semantic entropy is designed to detect confabulations, that is, model outputs with arbitrary meaning. In our experiments, we use semantic uncertainty to predict model accuracy, demonstrating that confabulations make up a notable fraction of model mistakes. We further show that semantic uncertainty can be used to improve model accuracy by refusing to answer questions when semantic uncertainty is high. Last, semantic uncertainty can be used to give users a way to know when model generations are probably unreliable.

Datasets

We use the datasets BioASQ 34 , SQuAD 33 , TriviaQA 32 , SVAMP 37 and NQ-Open 35 . BioASQ is a life-sciences question-answering dataset based on the annual challenge of the same name; the specific dataset we use is based on the QA dataset from Task B of the 2023 BioASQ challenge (11B). SQuAD is a reading comprehension dataset whose context passages are drawn from Wikipedia and for which the answers to questions can be found in these passages. We use SQuAD 1.1, which excludes the unanswerable questions added in v.2.0; those questions are deliberately constructed to induce mistakes and so do not, in practice, cause confabulations. TriviaQA is a trivia question-answering dataset. SVAMP is a word-problem maths dataset containing elementary-school mathematical reasoning tasks. NQ-Open is a dataset of realistic questions aggregated from Google Search which have been chosen to be answerable without reference to a source text. For each dataset, we use 400 train examples and 400 test examples randomly sampled from the original larger dataset. Note that only some of the methods require training; semantic entropy, for example, does not use the training data. If the datasets themselves are already split into train and test (or validation) samples, we sample our examples from within the corresponding split.

All these datasets are free-form, rather than multiple choice, because this better captures the opportunities created by LLMs to produce free-form sentences as answers. We refer to this default scenario as our ‘sentence-length’ experiments. In Supplementary Note  7 , we also present results for confabulation detection in a ‘short-phrase’ scenario, in which we constrain model answers on these datasets to be as concise as possible.

To make the problems more difficult and induce confabulations, we do not provide the context passages for any of the datasets. When the context passages are provided, the accuracy rate is too high for these datasets for the latest generations of models to meaningfully study confabulations.

Models

For sentence-length generations we use: Falcon 39 Instruct (7B and 40B), LLaMA 2 Chat 38 (7B, 13B and 70B) and Mistral 40 Instruct (7B).

Baselines

In addition to reporting results for semantic entropy, discrete semantic entropy and naive entropy, we consider two strong baselines.

Embedding regression is a supervised baseline inspired by the P (IK) method 24 . In that paper, the authors fine-tune their proprietary LLM on a dataset of questions to predict whether the model would have been correct; this requires access to a dataset of ground-truth answers to the questions. Rather than fine-tuning the entire LLM in this way, we simply take the final hidden units and train a logistic regression classifier to make the same prediction. In contrast with their method, ours is much simpler because it does not require fine-tuning the entire language model, and more reproducible because the solution to the logistic regression optimization problem is not as seed-dependent as the fine-tuning procedure. As expected, this supervised approach performs well in-distribution but fails when the distribution of questions differs from the one on which the classifier was trained.
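The probe can be sketched with scikit-learn; the synthetic feature matrix standing in for the final hidden states and the function name are our own illustration of the baseline, not the released implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_embedding_regression(hidden_states, was_correct):
    """Train a logistic-regression probe on final hidden states to predict
    whether the model's low-temperature answer was correct."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(hidden_states, was_correct)
    return clf

# Synthetic stand-in for hidden states of incorrect (label 0) and
# correct (label 1) answers; real usage would extract these from the LLM.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 4)), rng.normal(2, 0.5, (50, 4))])
y = [0] * 50 + [1] * 50
probe = fit_embedding_regression(X, y)
```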

The second baseline we consider is the P (True) method 24 , in which the model first samples M answers (identically to our semantic entropy approach) and is then prompted with the list of all generated answers, followed by the highest-probability answer and a question asking whether this answer is “(a) True” or “(b) False”. The confidence score is then taken to be the probability with which the LLM responds with ‘a’ to the multiple-choice question. The performance of this method is boosted with a few-shot prompt, in which up to 20 examples from the training set are randomly chosen and filled in as above, but then provided with the actual ground truth of whether the proposed answer was true or false. In this way, the method can be considered supervised ‘in-context’ because it makes use of some ground-truth training labels but can be used without retraining the model. Because of context-size constraints, this method cannot fit a full 20 few-shot examples in the context when input questions are long or large numbers of generations are used. As a result, we sometimes have to reduce the number of few-shot examples to suit the context size, and we note this in the  Supplementary Material .

Entailment estimator

Any NLI classification system could be used for our bidirectional entailment clustering algorithm. We consider two different kinds of entailment detector.

One option is to use an instruction-tuned LLM such as LLaMA 2, GPT-3.5 (Turbo 1106) or GPT-4 to predict entailment between generations. We use the following prompt:

We are evaluating answers to the question {question}
Here are two possible answers:
Possible Answer 1: {text1}
Possible Answer 2: {text2}
Does Possible Answer 1 semantically entail Possible Answer 2? Respond with entailment, contradiction, or neutral.

Alternatively, we consider using a language model trained for entailment prediction, specifically the DeBERTa-large model 56 fine-tuned on the NLI dataset MNLI 58 . This builds on past work towards paraphrase identification based on embedding similarity 59 , 60 and BERT-style models 61 , 62 . We template more simply, checking if DeBERTa predicts entailment between the concatenation of the question and one answer and the concatenation of the question and another answer. Note that DeBERTa-large is a relatively lightweight model with only 1.5B parameters which is much less powerful than most of the LLMs under study.

In Supplementary Note 2 , we carefully evaluate the benefits and drawbacks of these methods for entailment prediction. We settle on using GPT-3.5 with the above prompt, as its entailment predictions agree well with human raters and lead to good confabulation detection performance.

In Supplementary Note  3 , we provide a discussion of the computational cost and choosing the number of generations for reliable clustering.

Prompting templates

We use a simple generation template for all sentence-length answer datasets:

Answer the following question in a single brief but complete sentence.
Question: {question}
Answer:

Metrics and accuracy measurements

We use three main metrics to evaluate our method: AUROC, rejection accuracy and AURAC. Each of these is grounded in an automated estimate of factuality relative to the reference answers provided by the datasets that we use.

AUROC, rejection accuracy and AURAC

First, we use the AUROC (area under the receiver operating characteristic curve), which measures the reliability of a classifier by trading off its true-positive and false-positive rates. The AUROC can be interpreted as the probability that a randomly chosen correct answer has been assigned a higher confidence score than a randomly chosen incorrect answer. For a perfect classifier, this is 1.

Second, we compute the ‘rejection accuracy at X %’, which is the question-answering accuracy of the model on the most-confident X % of the inputs as identified by the respective uncertainty method. If an uncertainty method works well, predictions on the confident subset should be more accurate than predictions on the excluded subset and the rejection accuracy should increase as we reject more inputs.

To summarize this statistic we compute the AURAC (area under the rejection-accuracy curve): the total area enclosed by the accuracies at all cut-off percentages X %. This should increase towards 1 as a given uncertainty method becomes better at detecting likely-inaccurate responses, but it is more sensitive than the AUROC metric to the overall accuracy of the model.
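Both statistics can be computed from per-question confidence scores and correctness labels; the sketch below is our own, and the 10% cut-off grid is an assumption for illustration.

```python
import numpy as np

def rejection_accuracy(confidence, correct, x):
    """Accuracy on the most-confident fraction x of inputs."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = max(1, int(round(x * len(confidence))))
    most_confident = np.argsort(confidence)[::-1][:n]
    return float(correct[most_confident].mean())

def aurac(confidence, correct, cutoffs=np.arange(0.1, 1.01, 0.1)):
    """Area under the rejection-accuracy curve: mean accuracy over cut-offs."""
    return float(np.mean([rejection_accuracy(confidence, correct, x)
                          for x in cutoffs]))
```

When confidence perfectly ranks correct above incorrect answers, rejection accuracy is 1 at small cut-offs and falls towards the overall accuracy as more inputs are kept.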

In Supplementary Note  5 , we provide the unaggregated rejection accuracies for sentence-length generations.

Assessing accuracy

For the short-phrase-length generation setting presented in Supplementary Note  7 , we simply assess the accuracy of the generations by checking if the F1 score of the commonly used SQuAD metric exceeds 0.5. There are limitations to such simple scoring rules 63 but this method is widely used in practice and its error is comparatively small on these standard datasets.

For our default scenario, the longer sentence-length generations, this measure fails, as the overlap between the short reference answer and our long model answer is invariably too small. For sentence-length generations, we therefore automatically determine whether an answer to the question is correct or incorrect by using GPT-4 to compare the given answer to the reference answer. We use the template:

We are assessing the quality of answers to the following question: {question}
The expected answer is: {reference answer}
The proposed answer is: {predicted answer}
Within the context of the question, does the proposed answer mean the same as the expected answer? Respond only with yes or no.

We make a small modification for datasets with several reference answers: line two becomes “The following are expected answers to this question:” and the final line asks “does the proposed answer mean the same as any of the expected answers?”.

In Supplementary Note 6 , we check the quality of our automated ground-truth evaluations against human judgement by hand. We find that GPT-4 gives the best results for determining model accuracy and thus use it in all our sentence-length experiments.

In this section we describe the application of semantic entropy to confabulation detection in longer model generations, specifically paragraph-length biographies.

We introduce a biography-generation dataset, FactualBio, available alongside this paper. FactualBio is a collection of GPT-4 (v.0613) generated biographies of individuals who are notable enough to have Wikipedia pages but not notable enough to have large amounts of detailed coverage. To generate the dataset, we randomly sampled 21 individuals from the WikiBio dataset 64 . For each biography, we used GPT-4 to generate a list of the factual claims it contained, giving 150 factual claims in total (the total is only coincidentally a round number). For each of these factual claims, we manually determined whether the claim was correct or incorrect; 45 of the 150 claims were incorrect. As before, we apply confabulation detection to detect incorrect model predictions, even though there may be model errors which are not confabulations.

Prompting and generation

Given a paragraph-length piece of LLM-generated text, we apply the following sequence of steps:

Automatically decompose the paragraph into specific factual claims using an LLM (not necessarily the same as the original).

For each factual claim, use an LLM to automatically construct Q questions which might have produced that claim.

For each question, prompt the original LLM to generate M answers.

For each question, compute the semantic entropy of the answers, including the original factual claim.

Average the semantic entropies over the questions to arrive at a score for the original factual claim.
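The five steps above can be sketched as a single scoring loop; all callables here (`decompose`, `gen_questions`, `answer`, `entropy_fn`) are stand-ins for the LLM calls and the semantic-entropy computation described in the text, not the released implementation.

```python
def claim_scores(paragraph, decompose, gen_questions, answer, entropy_fn, m=3):
    """Score each factual claim in a paragraph by the semantic entropy of
    resampled answers, averaged over reconstructed questions."""
    scores = {}
    for claim in decompose(paragraph):          # step 1: factual claims
        entropies = []
        for question in gen_questions(claim):   # step 2: questions per claim
            answers = [answer(question) for _ in range(m)]  # step 3: M answers
            answers.append(claim)               # step 4: include original claim
            entropies.append(entropy_fn(answers))
        scores[claim] = sum(entropies) / len(entropies)  # step 5: average
    return scores
```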

We pursue this slightly indirect way of generating answers because we find that simply resampling each sentence creates variation unrelated to the uncertainty of the model about the factual claim, such as differences in paragraph structure.

We decompose the paragraph into factual claims using the following prompt:

Please list the specific factual propositions included in the answer above. Be complete and do not leave any factual claims out. Provide each claim as a separate sentence in a separate bullet point.

We found that we agreed with the decompositions in all cases in the dataset.

We then generate six questions for each of the facts from the decomposition. We generate these questions by prompting the model twice with the following:

Following this text: {text so far}
You see the sentence: {proposition}
Generate a list of three questions, that might have generated the sentence in the context of the preceding original text, as well as their answers. Please do not use specific facts that appear in the follow-up sentence when formulating the question. Make the questions and answers diverse. Avoid yes-no questions. The answers should not be a full sentence and as short as possible, e.g. only a name, place, or thing. Use the format “1. {question} – {answer}”.

These questions are not necessarily well-targeted and the difficulty of this step is the main source of errors in the procedure. We generate three questions with each prompt, as this encourages diversity of the questions, each question targeting a different aspect of the fact. However, we observed that the generated questions will sometimes miss obvious aspects of the fact. Executing the above prompt twice (for a total of six questions) can improve coverage. We also ask for brief answers because the current version of GPT-4 tends to give long, convoluted and highly hedged answers unless explicitly told not to.

Then, for each question, we generate three new answers using the following prompt:

We are writing an answer to the question “{user question}”. So far we have written: {text so far}
The next sentence should be the answer to the following question: {question}
Please answer this question. Do not answer in a full sentence. Answer with as few words as possible, e.g. only a name, place, or thing.

We then compute the semantic entropy over these answers plus the original factual claim. Including the original fact ensures that the estimator remains grounded in the original claim and helps detect situations in which the question has been interpreted completely differently from the original context. We make a small modification to handle the fact that GPT-4 generations often include refusals to answer questions. These refusals were not something we commonly observed in our experiments with LLaMA 2, Falcon or Mistral models. If more than half of the answers include one of the strings ‘not available’, ‘not provided’, ‘unknown’ or ‘unclear’ then we treat the semantic uncertainty as maximal.

We then average the semantic entropies for each question corresponding to the factual claim to get an entropy for this factual claim.

Despite the extra assumptions and complexity, we find that this method greatly outperforms the baselines.

To compute semantic entailment between the original claim and regenerated answers, we rely on the DeBERTa entailment prediction model as we find empirically that DeBERTa predictions result in higher train-set AUROC than other methods. Because DeBERTa has slightly lower recall than GPT-3.5/4, we use a modified set-up for which we say the answers mean the same as each other if at least one of them entails the other and neither is seen to contradict the other—a kind of ‘non-defeating’ bidirectional entailment check rather than true bidirectional entailment. The good performance of DeBERTa in this scenario is not surprising as both factual claims and regenerated answers are relatively short. We refer to Supplementary Notes 2 and 3 for ablations and experiments regarding our choice of entailment estimator for paragraph-length generations.
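The ‘non-defeating’ check can be written as follows, with `classify(premise, hypothesis)` standing in for the NLI model's predicted label; the function itself is our sketch of the rule described above.

```python
def same_meaning(a, b, classify):
    """Non-defeating bidirectional entailment: a and b are treated as
    meaning the same thing if at least one entails the other and neither
    direction is classified as a contradiction.

    classify(premise, hypothesis) returns 'entailment', 'neutral'
    or 'contradiction'.
    """
    ab, ba = classify(a, b), classify(b, a)
    if "contradiction" in (ab, ba):
        return False
    return "entailment" in (ab, ba)
```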

We implement two baselines. First, we implement a variant of the P (True) method, which is adapted to the new setting. For each factoid, we generate a question with answers in the same way as for semantic entropy. We then use the following prompt:

Question: {question}
Here are some brainstormed ideas: {list of regenerated answers}
Possible answer: {original answer}
Is the possible answer true? Respond with “yes” or “no”.

As we cannot access the probabilities GPT-4 assigns to predicting ‘yes’ and ‘no’ as the next token, we approximate this using Monte Carlo samples. Concretely, we execute the above prompt ten times (at temperature 1) and then take the fraction of answers which were ‘yes’ as our unbiased Monte Carlo estimate of the token probability GPT-4 assigns to ‘yes’.
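A sketch of this Monte Carlo approximation, with `sample_response` standing in for one temperature-1 call to GPT-4 with the prompt above:

```python
def p_true_monte_carlo(sample_response, n=10):
    """Estimate P('yes') as the fraction of n sampled replies starting
    with 'yes' (an unbiased estimate of the next-token probability)."""
    yes = sum(sample_response().strip().lower().startswith("yes")
              for _ in range(n))
    return yes / n
```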

As a second, simpler, baseline we check if the model thinks the answer is true. We simply ask:

Following this text: {text so far}
You see this statement: {proposition}
Is it likely that the statement is true? Respond with ‘yes’ or ‘no’.

Interestingly, this method ought to perform very well if the model has good ‘self-knowledge’ (that is, if “models mostly know what they don’t know” 24 ), but in fact semantic entropy is much better at detecting confabulations.

Data availability

The data used for the short-phrase and sentence-length generations are publicly available and the released code details how to access them. We release a public version of the FactualBio dataset as part of the code base for reproducing the paragraph-length experiments.

Code availability

We release all code used to produce the main experiments. The code for short-phrase and sentence-length experiments can be found at github.com/jlko/semantic_uncertainty and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ). The code for paragraph-length experiments can be found at github.com/jlko/long_hallucinations and https://doi.org/10.5281/zenodo.10964366 (ref. 65 ).

OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

Gemini Team, Google. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

Xiao, Y. & Wang, W. Y. On hallucination and predictive uncertainty in conditional language generation. In Proc. 16th Conference of the European Chapter of the Association for Computational Linguistics 2734–2744 (Association for Computational Linguistics, 2021).

Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T. & Saenko, K. Object hallucination in image captioning. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (eds Riloff, E., Chiang, D., Hockenmaier, J. & Tsujii, J.) 4035–4045 (Association for Computational Linguistics, 2018).

Weiser, B. Lawyer who used ChatGPT faces penalty for made up citations. The New York Times (8 Jun 2023).

Opdahl, A. L. et al. Trustworthy journalism through AI. Data Knowl. Eng . 146 , 102182 (2023).

Shen, Y. et al. ChatGPT and other large language models are double-edged swords. Radiology 307 , e230163 (2023).


Schulman, J. Reinforcement learning from human feedback: progress and challenges. Presented at the Berkeley EECS Colloquium. YouTube www.youtube.com/watch?v=hhiLw5Q_UFg (2023).

Ji, Z. et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 55 , 248 (2023).

Maynez, J., Narayan, S., Bohnet, B. & McDonald, R. On faithfulness and factuality in abstractive summarization. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (eds Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J.) 1906–1919 (Association for Computational Linguistics, 2020).

Filippova, K. Controlled hallucinations: learning to generate faithfully from noisy data. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 864–870 (Association for Computational Linguistics, 2020).

Berrios, G. Confabulations: a conceptual history. J. Hist. Neurosci. 7 , 225–241 (1998).


Lin, S., Hilton, J. & Evans, O. Teaching models to express their uncertainty in words. Transact. Mach. Learn. Res. (2022).

Evans, O. et al. Truthful AI: developing and governing AI that does not lie. Preprint at https://arxiv.org/abs/2110.06674 (2021).

Amodei, D. et al. Concrete problems in AI safety. Preprint at https://arxiv.org/abs/1606.06565 (2016).

Jiang, Z., Araki, J., Ding, H. & Neubig, G. How can we know when language models know? On the calibration of language models for question answering. Transact. Assoc. Comput. Linguist. 9 , 962–977 (2021).


Desai, S. & Durrett, G. Calibration of pre-trained transformers. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 295–302 (Association for Computational Linguistics, 2020).

Glushkova, T., Zerva, C., Rei, R. & Martins, A. F. Uncertainty-aware machine translation evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (eds Moens, M-F., Huang, X., Specia, L. & Yih, S.) 3920–3938 (Association for Computational Linguistics, 2021).

Wang, Y., Beck, D., Baldwin, T. & Verspoor, K. Uncertainty estimation and reduction of pre-trained models for text regression. Transact. Assoc. Comput. Linguist. 10 , 680–696 (2022).

Baker, S. & Kanade, T. Hallucinating faces. In Proc. Fourth IEEE International Conference on Automatic Face and Gesture Recognition . 83–88 (IEEE, Catalogue no PR00580, 2002).

Eliot, L. AI ethics lucidly questioning this whole hallucinating AI popularized trend that has got to stop. Forbes Magazine (24 August 2022).

Shanahan, M. Talking about large language models. Commun. Assoc. Comp. Machinery 67 , 68–79 (2024).

MacKay, D. J. C. Information-based objective functions for active data selection. Neural Comput. 4 , 590–604 (1992).

Kadavath, S. et al. Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221 (2022).

Lindley, D. V. On a measure of the information provided by an experiment. Ann. Math. Stat. 27 , 986–1005 (1956).


Xiao, T. Z., Gomez, A. N. & Gal, Y. Wat zei je? Detecting out-of-distribution translations with variational transformers. In Workshop on Bayesian Deep Learning at the Conference on Neural Information Processing Systems (NeurIPS, Vancouver, 2019).

Christiano, P., Cotra, A. & Xu, M. Eliciting Latent Knowledge (Alignment Research Center, 2021); https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit .

Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D. & Marchetti, A. Divide and conquer: crowdsourcing the creation of cross-lingual textual entailment corpora. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 670–679 (Association for Computational Linguistics, 2011).

Honovich, O. et al. TRUE: Re-evaluating factual consistency evaluation. In Proc. Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering 161–175 (Association for Computational Linguistics, 2022).

Falke, T., Ribeiro, L. F. R., Utama, P. A., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 2214–2220 (Association for Computational Linguistics, 2019).

Laban, P., Schnabel, T., Bennett, P. N. & Hearst, M. A. SummaC: re-visiting NLI-based models for inconsistency detection in summarization. Trans. Assoc. Comput. Linguist. 10 , 163–177 (2022).

Joshi, M., Choi, E., Weld, D. S. & Zettlemoyer, L. TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Proc. 55th Annual Meeting of the Association for Computational Linguistics 1601–1611 (Association for Computational Linguistics. 2017).

Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J., Duh, K. & Carreras, X.) 2383–2392 (Association for Computational Linguistics, 2016).

Tsatsaronis, G. et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16 , 138 (2015).


Lee, K., Chang, M.-W. & Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proc. 57th Annual Meeting of the Association for Computational Linguistics 6086–6096 (Association for Computational Linguistics, 2019).

Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Transact. Assoc. Comput. Linguist. 7 , 452–466 (2019).

Patel, A., Bhattamishra, S. & Goyal, N. Are NLP models really able to solve simple math word problems? In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 2080–2094 (Assoc. Comp. Linguistics, 2021).

Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

Penedo, G. et al. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. In Proc. 36th Conference on Neural Information Processing Systems (eds Oh, A. et al.) 79155–79172 (Curran Associates, 2023)

Jiang, A. Q. et al. Mistral 7B. Preprint at https://arxiv.org/abs/2310.06825 (2023).

Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: Zero-Resource Black-Box hallucination detection for generative large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H., Pino, J. & Bali, K.) 9004–9017 (Assoc. Comp. Linguistics, 2023).

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. & Gal, Y. Deep deterministic uncertainty: a new simple baseline. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 24384–24394 (Computer Vision Foundation, 2023).

Schuster, T., Chen, S., Buthpitiya, S., Fabrikant, A. & Metzler, D. Stretching sentence-pair NLI models to reason over long documents and clusters. In Findings of the Association for Computational Linguistics: EMNLP 2022 (eds Goldberg, Y. et al.) 394–412 (Association for Computational Linguistics, 2022).

Barnes, B. & Christiano, P. Progress on AI Safety via Debate. AI Alignment Forum www.alignmentforum.org/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1 (2020).

Irving, G., Christiano, P. & Amodei, D. AI safety via debate. Preprint at https://arxiv.org/abs/1805.00899 (2018).

Der Kiureghian, A. & Ditlevsen, O. Aleatory or epistemic? Does it matter? Struct. Saf. 31 , 105–112 (2009).

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=jN5y-zb5Q7m (2021).

Murray, K. & Chiang, D. Correcting length bias in neural machine translation. In Proc. Third Conference on Machine Translation (eds Bojar, O. et al.) 212–223 (Assoc. Comp. Linguistics, 2018).

Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=rygGQyrFvH (2020).

Fan, A., Lewis, M. & Dauphin, Y. Hierarchical neural story generation. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 889–898 (Association for Computational Linguistics, 2018).

Speaks, J. in The Stanford Encyclopedia of Philosophy (ed. Zalta, E. N.) (Metaphysics Research Lab, Stanford Univ., 2021).

Culicover, P. W. Paraphrase generation and information retrieval from stored text. Mech. Transl. Comput. Linguist. 11, 78–88 (1968).


Padó, S., Cer, D., Galley, M., Jurafsky, D. & Manning, C. D. Measuring machine translation quality as semantic equivalence: a metric based on entailment features. Mach. Transl. 23, 181–193 (2009).

Androutsopoulos, I. & Malakasiotis, P. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38, 135–187 (2010).

MacCartney, B. Natural Language Inference (Stanford Univ., 2009).

He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: decoding-enhanced BERT with disentangled attention. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).

Williams, A., Nangia, N. & Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proc. 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M. et al.) 1112–1122 (Association for Computational Linguistics, 2018).

Yu, L., Hermann, K. M., Blunsom, P. & Pulman, S. Deep learning for answer sentence selection. Preprint at https://arxiv.org/abs/1412.1632 (2014).

Socher, R., Huang, E., Pennington, J., Manning, C. D. & Ng, A. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Proceedings of the 24th Conference on Neural Information Processing Systems (eds Shawe-Taylor, J. et al.) (2011).

He, R., Ravula, A., Kanagal, B. & Ainslie, J. RealFormer: transformer likes residual attention. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (eds Zong, C. et al.) 929–943 (Association for Computational Linguistics, 2021).

Tay, Y. et al. Charformer: fast character transformers via gradient-based subword tokenization. In Proceedings of the International Conference on Learning Representations https://openreview.net/forum?id=JtBRnrlOEFN (2022).

Kane, H., Kocyigit, Y., Abdalla, A., Ajanoh, P. & Coulibali, M. Towards neural similarity evaluators. In Workshop on Document Intelligence at the 32nd Conference on Neural Information Processing Systems (2019).

Lebret, R., Grangier, D. & Auli, M. Neural text generation from structured data with application to the biography domain. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (eds Su, J. et al.) 1203–1213 (Association for Computational Linguistics, 2016).

Kossen, J. jlko/semantic_uncertainty: initial release v.1.0.0. Zenodo https://doi.org/10.5281/zenodo.10964366 (2024).


Acknowledgements

We thank G. Irving, K. Perlin, J. Richens, L. Rimell and M. Turpin for their comments or discussion related to this work. We thank K. Handa for his help with the human evaluation of our automated accuracy assessment. We thank F. Bickford Smith and L. Melo for their code review. Y.G. is supported by a Turing AI Fellowship funded by the UK government’s Office for AI, through UK Research and Innovation (grant reference EP/V030302/1), and delivered by the Alan Turing Institute.

Author information

These authors contributed equally: Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn

Authors and Affiliations

OATML, Department of Computer Science, University of Oxford, Oxford, UK

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn & Yarin Gal


Contributions

S.F. led the work from conception to completion and proposed using bidirectional entailment to cluster generations as a way of computing entropy in LLMs. He wrote the main text, most of the Methods and Supplementary Information and prepared most of the figures. J.K. improved the mathematical formalization of semantic entropy; led the extension of semantic entropy to sentence- and paragraph-length generations; wrote the code for, and carried out, all the experiments and evaluations; wrote much of the Methods and Supplementary Information and prepared drafts of many figures; and gave critical feedback on the main text. L.K. developed the initial mathematical formalization of semantic entropy; wrote code for, and carried out, the initial experiments around semantic entropy and its variants which demonstrated the promise of the idea and helped narrow down possible research avenues to explore; and gave critical feedback on the main text. Y.G. ideated the project, proposing the idea to differentiate semantic and syntactic diversity as a tool for detecting hallucinations, provided high-level guidance on the research and gave critical feedback on the main text; he runs the research laboratory in which the work was carried out.

Corresponding author

Correspondence to Sebastian Farquhar.

Ethics declarations

Competing interests.

S.F. is currently employed by Google DeepMind and L.K. by OpenAI. For both, this paper was written under their University of Oxford affiliation. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Mirella Lapata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Algorithm outline for bidirectional entailment clustering.

Given a set of outputs generated in response to a context, the bidirectional entailment clustering algorithm returns a set of sets of outputs that have been classified as sharing a meaning.
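The clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `entails` predicate here is a toy token-subset check standing in for the NLI model (such as DeBERTa) that the paper uses to judge entailment, and the function and variable names are illustrative.

```python
def entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI entailment classifier: treats `premise` as
    entailing `hypothesis` when every token of the hypothesis already
    appears in the premise. Real use would query an NLI model instead."""
    tokens = lambda s: set(s.lower().replace("?", " ").replace(".", " ").split())
    return tokens(hypothesis) <= tokens(premise)


def bidirectional_entailment_clusters(context: str, outputs: list[str]) -> list[list[str]]:
    """Greedily group outputs into semantic clusters: an output joins an
    existing cluster iff it and the cluster's representative entail each
    other, both conditioned on the context; otherwise it seeds a new cluster."""
    clusters: list[list[str]] = []
    for out in outputs:
        for cluster in clusters:
            rep = cluster[0]  # compare against one representative per cluster
            if entails(f"{context} {rep}", out) and entails(f"{context} {out}", rep):
                cluster.append(out)
                break
        else:  # no cluster matched: start a new one
            clusters.append([out])
    return clusters


answers = ["Paris", "The capital is Paris", "Lyon"]
print(bidirectional_entailment_clusters("What is the capital of France?", answers))
# → [['Paris', 'The capital is Paris'], ['Lyon']]
```

With a real NLI model in place of the toy `entails`, the same greedy loop yields the semantic clusters over which semantic entropy is computed.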

Supplementary information

Supplementary information.

Supplementary Notes 1–7, Figs. 1–10, Tables 1–4 and references. Includes a worked example of the semantic entropy calculation, a discussion of the limitations and computational cost of entailment clustering, an ablation of entailment prediction and clustering methods, a discussion of the automated accuracy assessment, unaggregated results for sentence-length generations and further results for short-phrase generations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0


Received: 17 July 2023

Accepted: 12 April 2024

Published: 19 June 2024

Issue Date: 20 June 2024

