{"id":26254,"date":"2025-05-15T13:04:33","date_gmt":"2025-05-15T07:19:33","guid":{"rendered":"https:\/\/www.revoscience.com\/en\/?p=26254"},"modified":"2025-05-15T13:04:55","modified_gmt":"2025-05-15T07:19:55","slug":"study-shows-vision-language-models-cant-handle-queries-with-negation-words","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/study-shows-vision-language-models-cant-handle-queries-with-negation-words\/","title":{"rendered":"Study shows vision-language models can\u2019t handle queries with negation words"},"content":{"rendered":"\n<p><em><strong>Words like \u201cno\u201d and \u201cnot\u201d can cause this popular class of AI models to fail unexpectedly in high-stakes settings, such as medical diagnosis.<\/strong><\/em><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"600\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" src=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/05\/MIT-LMNegation-01-press_0.jpg\" alt=\"\" class=\"wp-image-26255\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/05\/MIT-LMNegation-01-press_0.jpg 900w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/05\/MIT-LMNegation-01-press_0-675x450.jpg 675w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/05\/MIT-LMNegation-01-press_0-768x512.jpg 768w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/05\/MIT-LMNegation-01-press_0-150x100.jpg 150w\" \/><\/figure>\n\n\n<div class=\"wp-block-post-author\"><div class=\"wp-block-post-author__avatar\"><img loading=\"lazy\" decoding=\"async\" alt=\"\" src=\"https:\/\/secure.gravatar.com\/avatar\/f5d0b0abe156e88d51d7b6d568de8a9deb35ed7004fd88e0abc5c96894d14fd9?s=48&#038;d=mm&#038;r=g\" srcset=\"https:\/\/secure.gravatar.com\/avatar\/f5d0b0abe156e88d51d7b6d568de8a9deb35ed7004fd88e0abc5c96894d14fd9?s=96&#038;d=mm&#038;r=g 2x\" class=\"avatar avatar-48 photo\" 
height=\"48\" width=\"48\" title=\"\"><\/div><div class=\"wp-block-post-author__content\"><p class=\"wp-block-post-author__name\">Adam Zewe<\/p><\/div><\/div>\n\n\n<p>CAMBRIDGE, MA \u2013 Imagine a radiologist examining a chest X-ray from a new patient. She notices the patient has swelling in the tissue but does not have an enlarged heart. Looking to speed up diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.<\/p>\n\n\n\n<p>But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: If a patient has tissue swelling and an enlarged heart, the condition is very likely to be cardiac related, but with no enlarged heart there could be several underlying causes.<\/p>\n\n\n\n<p>In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they don\u2019t understand negation \u2014 words like \u201cno\u201d and \u201cdoesn\u2019t\u201d that specify what is false or absent.&nbsp;<\/p>\n\n\n\n<p>\u201cThose negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,\u201d says Kumail Alhamoud, an MIT graduate student and lead author of&nbsp;<a href=\"https:\/\/link.mediaoutreach.meltwater.com\/ls\/click?upn=u001.aGL2w8mpmadAd46sBDLfbJQfXi-2BgjtsRXhSuJl6mKAg3ZNB5BLBCPxH1w2PQGefa72NN_Gmh-2FjktplCfWo1o-2BFbkY3J9eYBJUJc-2BSUmMkHo42Dqe4Z0qTEKCmSFnQfWCe8-2B8jgXgQQcW-2Fb1rLKfKZRu-2BLLGScwMYc-2FOCX9RDmpXEBR4BY9i7y-2BNgpMuREG7n76alZV7dwhY4kQkuDwWejIPbLklaiQu8T5AGP0VNHcMTJ9nYynBO3K5CEBRpBkbTbIftGa0wa8FxRegnx5Uq-2FvA6U1Fb2FD7QJba7IQqU7tF5kJEF5aSXzmj1XbNySBq9EUipGd-2BvG7J-2F8riK9nlbQIZBTJxzaF-2FUA6Y0Etpp0NFv8JLeBQ5eP9A-2FI9i6pzh6rmDXYkqT-2ByvC5N6bftALDD3ZqWMLOmPCPx-2FvanfSS2oAsxpKZKDtCqb8CdHJEvGu2-2F7-2FvGCKzrFvkrUw9mNaLi-2F3cQ-3D-3D\" rel=\"noreferrer noopener\" target=\"_blank\">this 
study<\/a>.<\/p>\n\n\n\n<p>The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed as well as a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.<\/p>\n\n\n\n<p>They show that retraining a vision-language model with this dataset leads to performance improvements when a model is asked to retrieve images that do not contain certain objects. It&nbsp;also boosts accuracy on multiple choice question answering with negated captions.<\/p>\n\n\n\n<p>But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.<\/p>\n\n\n\n<p>\u201cThis is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn\u2019t be using large vision\/language models in many of the ways we are using them now \u2014 without intensive evaluation,\u201d says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems.<\/p>\n\n\n\n<p>Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, an MIT graduate student; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. 
The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Neglecting negation

Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers called vector representations. The models use these vectors to distinguish between different images.

A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.

“The captions express what is in the images — they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” Ghassemi says.

Because image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.

To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.

For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects not in an image and write them into the caption. Then they tested models by prompting them with negation words to retrieve images that contain certain objects, but not others.

For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that doesn’t appear in the image or negating an object that does appear in the image.

The models often failed at both tasks, with image retrieval performance dropping by nearly 25 percent with negated captions.
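To make the failure mode concrete, here is a deliberately simplified retrieval sketch. The bag-of-words “encoder” below is an illustrative stand-in, not the models or data from the study (the vocabulary, image names, and scoring are all assumptions). Because the toy encoder has no representation for “no”, the word “helicopter” in a negated query still pulls the ranking toward images that contain one:

```python
from math import sqrt

# Toy "text encoder": a bag-of-words vector over a tiny vocabulary.
# Like the shortcut described in the study, it has no notion of negation:
# "no" is not in the vocabulary, so it is silently dropped.
VOCAB = ["dog", "fence", "helicopter", "cat"]

def embed(text):
    words = [w.strip(",.").lower() for w in text.split()]
    return [float(words.count(v)) for v in VOCAB]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical image embeddings: stand-ins for an image encoder's output.
images = {
    "dog_only": embed("dog fence"),
    "dog_and_helicopter": embed("dog fence helicopter"),
}

# The user explicitly excludes helicopters, but the negation is lost.
query = embed("a dog jumping over a fence, with no helicopter")

scores = {name: cosine(query, vec) for name, vec in images.items()}
best = max(scores, key=scores.get)
# The top result is the image WITH a helicopter, the exact thing excluded.
```

A real VLM encoder is far more expressive than this word-counting toy, but the study’s finding is that the outcome is similar: negation words contribute little, while the negated object still matches.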
When it came to answering multiple-choice questions, the best models achieved only about 39 percent accuracy, with several models performing at or even below random chance.

One reason for this failure is a shortcut the researchers call affirmation bias — VLMs ignore negation words and focus on the objects in the images instead.

“This does not just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud says.

This was consistent across every VLM they tested.

“A solvable problem”

Since VLMs aren’t typically trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.

Using a dataset of 10 million image-caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.

They had to be especially careful that these synthetic captions still read naturally, or a VLM could fail in the real world when faced with more complex captions written by humans.

They found that fine-tuning VLMs with their dataset led to performance gains across the board. It improved models’ image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.

“But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation.
We haven’t even touched how these models work, but we hope this is a signal that this is a solvable problem, and others can take our solution and improve it,” Alhamoud says.

At the same time, he hopes their work encourages more users to think about the problem they want to use a VLM to solve, and to design some examples to test it on before deployment.

In the future, the researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.