{"id":25790,"date":"2025-04-04T11:41:47","date_gmt":"2025-04-04T05:56:47","guid":{"rendered":"https:\/\/www.revoscience.com\/en\/?p=25790"},"modified":"2025-04-05T22:18:58","modified_gmt":"2025-04-05T16:33:58","slug":"new-method-assesses-and-improves-the-reliability-of-radiologists-diagnostic-reports","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/new-method-assesses-and-improves-the-reliability-of-radiologists-diagnostic-reports\/","title":{"rendered":"New method assesses and improves the reliability of radiologists\u2019 diagnostic reports"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong><em>The framework helps clinicians choose phrases that more accurately reflect the likelihood that certain conditions are present in X-rays.&nbsp;<\/em><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"675\" height=\"439\" src=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-675x439.jpg\" alt=\"\" class=\"wp-image-25803\" style=\"width:840px;height:auto\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-675x439.jpg 675w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-615x400.jpg 615w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-768x499.jpg 768w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg 1000w\" sizes=\"auto, (max-width: 675px) 100vw, 675px\" \/><figcaption class=\"wp-element-caption\"><em>A new calibration method developed by MIT researchers can improve the accuracy of clinical reports written by radiologists by helping them express their confidence more reliably.<\/em> IMAGE: MIT<\/figcaption><\/figure>\n\n\n<div class=\"wp-block-post-author\"><div 
class=\"wp-block-post-author__content\"><p class=\"wp-block-post-author__name\">Adam Zewe<\/p><\/div><\/div>\n\n\n<p class=\"wp-block-paragraph\">CAMBRIDGE, MA \u2013 Due to the inherent ambiguity in medical images like X-rays, radiologists often use words like \u201cmay\u201d or \u201clikely\u201d when describing the presence of a certain pathology, such as pneumonia.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But do the words radiologists use to express their confidence level accurately reflect how often a particular pathology occurs in patients? A new study shows that when radiologists express confidence about a certain pathology using a phrase like \u201cvery likely,\u201d they tend to be overconfident, and vice-versa when they express less confidence using a word like \u201cpossibly.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using clinical data, a multidisciplinary team of MIT researchers in collaboration with researchers and clinicians at hospitals affiliated with Harvard Medical School created a framework to quantify how reliable radiologists are when they express certainty using natural language terms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">They used this approach to provide clear suggestions that help radiologists choose certainty phrases that would improve the reliability of their clinical reporting. They also showed that the same technique can effectively measure and improve the calibration of large language models by better aligning the words models use to express confidence with the accuracy of their predictions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By helping radiologists more accurately describe the likelihood of certain pathologies in medical images, this new framework could improve the reliability of critical clinical information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cThe words radiologists use are important. They affect how doctors intervene, in terms of their decision-making for the patient. 
If these practitioners can be more reliable in their reporting, patients will be the ultimate beneficiaries,\u201d says Peiqi Wang, an MIT graduate student and lead author of a paper on this research.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">He is joined on the paper by senior author Polina Golland, a Sunlin and Priscilla Chou Professor of Electrical Engineering and Computer Science (EECS), a principal investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and the leader of the Medical Vision Group; as well as Barbara D. Lam, a clinical fellow at the Beth Israel Deaconess Medical Center; Yingcheng Liu, an MIT graduate student; Ameneh Asgari-Targhi, a research fellow at Massachusetts General Brigham (MGB); Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab; William M. Wells, a professor of radiology at MGB and a research scientist in CSAIL; and Tina Kapur, an assistant professor of radiology at MGB. The research will be presented at the International Conference on Learning Representations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Decoding uncertainty in words<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A radiologist writing a report about a chest X-ray might say the image shows a \u201cpossible\u201d pneumonia, which is an infection that inflames the air sacs in the lungs. 
In that case, a doctor could order a follow-up CT scan to confirm the diagnosis.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, if the radiologist writes that the X-ray shows a \u201clikely\u201d pneumonia, the doctor might begin treatment immediately, such as by prescribing antibiotics, while still ordering additional tests to assess severity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Trying to measure the calibration, or reliability, of ambiguous natural language terms like \u201cpossibly\u201d and \u201clikely\u201d presents many challenges, Wang says.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Existing calibration methods typically rely on the confidence score provided by an AI model, which represents the model\u2019s estimated likelihood that its prediction is correct.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, a weather app might predict an 83 percent chance of rain tomorrow. That model is well-calibrated if, across all instances where it predicts an 83 percent chance of rain, it rains approximately 83 percent of the time.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cBut humans use natural language, and if we map these phrases to a single number, it is not an accurate description of the real world. If a person says an event is \u2018likely,\u2019 they aren\u2019t necessarily thinking of the exact probability, such as 75 percent,\u201d Wang says.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Rather than trying to map certainty phrases to a single percentage, the researchers\u2019 approach treats them as probability distributions. 
A distribution describes the range of possible values and their likelihoods, as in the classic bell curve in statistics.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cThis captures more nuances of what each word means,\u201d Wang adds.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Assessing and improving calibration<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The researchers leveraged prior work that surveyed radiologists to obtain probability distributions that correspond to each diagnostic certainty phrase, ranging from \u201cvery likely\u201d to \u201cconsistent with.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, since more radiologists believe the phrase \u201cconsistent with\u201d means a pathology is present in a medical image, its probability distribution climbs sharply to a high peak, with most values clustered around the 90 to 100 percent range.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In contrast, the phrase \u201cmay represent\u201d conveys greater uncertainty, leading to a broader, bell-shaped distribution centered around 50 percent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Typical methods evaluate calibration by comparing how well a model\u2019s predicted probability scores align with the actual number of positive results.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The researchers\u2019 approach follows the same general framework but extends it to account for the fact that certainty phrases represent probability distributions rather than single probabilities.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To improve calibration, the researchers formulated and solved an optimization problem that adjusts how often certain phrases are used, to better align confidence with reality.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">They derived a calibration map that suggests certainty terms a radiologist should use to make the reports more accurate for a specific pathology.<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">\u201cPerhaps, for this dataset, if every time the radiologist said pneumonia was \u2018present,\u2019 they changed the phrase to \u2018likely present\u2019 instead, then they would become better calibrated,\u201d Wang explains.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When the researchers used their framework to evaluate clinical reports, they found that radiologists were generally underconfident when diagnosing common conditions like atelectasis, but overconfident with more ambiguous conditions like infection.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In addition, the researchers evaluated the reliability of language models using their method, providing a more nuanced representation of confidence than classical methods that rely on confidence scores.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cA lot of times, these models use phrases like \u2018certainly.\u2019 But because they are so confident in their answers, it does not encourage people to verify the correctness of the statements themselves,\u201d Wang adds.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the future, the researchers plan to continue collaborating with clinicians in the hopes of improving diagnoses and treatment. 
They are working to expand their study to include data from abdominal CT scans.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In addition, they are interested in studying how receptive radiologists are to calibration-improving suggestions and whether they can mentally adjust their use of certainty phrases effectively.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The work is funded, in part, by the Takeda Fellowship, the MIT-IBM Watson AI Lab, the MIT CSAIL Wistrom Program, and the MIT Jameel Clinic.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The framework helps clinicians choose phrases that more accurately reflect the likelihood that certain conditions are present in X-rays.\u00a0<\/p>\n","protected":false},"author":2,"featured_media":25803,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,26],"tags":[],"class_list":["post-25790","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news","category-medicine"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",1000,650,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-200x200.jpg",200,200,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-615x400.jpg",615,400,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-768x499.jpg",750,487,true],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-675x439.jpg",675,439,true],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",1000,650,false],"2048x2048":["https:\/\/www.revoscience.com\/e
n\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",1000,650,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",1000,650,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-870x570.jpg",870,570,true],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-600x650.jpg",600,650,true],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-600x600.jpg",600,600,true],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-760x490.jpg",760,490,true],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-550x360.jpg",550,360,true],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press-95x65.jpg",95,65,true],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",640,416,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",96,62,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/04\/MIT-Calibrating-Certainty-01-press.jpg",150,98,false]},"author_info":{"info":["Adam Zewe"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/\" rel=\"category tag\">News<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/health\/medicine\/\" rel=\"category 
tag\">Medicine<\/a>","tag_info":"Medicine","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/25790","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=25790"}],"version-history":[{"count":3,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/25790\/revisions"}],"predecessor-version":[{"id":25804,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/25790\/revisions\/25804"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/25803"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=25790"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=25790"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=25790"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}