{"id":31836,"date":"2025-11-19T10:05:00","date_gmt":"2025-11-19T04:20:00","guid":{"rendered":"https:\/\/www.revoscience.com\/en\/?p=31836"},"modified":"2025-11-18T22:09:40","modified_gmt":"2025-11-18T16:24:40","slug":"bigger-datasets-arent-always-better","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/bigger-datasets-arent-always-better\/","title":{"rendered":"Bigger datasets aren\u2019t always better"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong><em>MIT researchers developed a way to identify the smallest dataset that guarantees optimal solutions to complex problems.<\/em><\/strong><\/p>\n\n\n<div class=\"wp-block-post-author\"><div class=\"wp-block-post-author__content\"><p class=\"wp-block-post-author__name\">Adam Zewe<\/p><\/div><\/div>\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"900\" height=\"600\" src=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp\" alt=\"\" class=\"wp-image-31837\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp 900w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-675x450.webp 675w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-768x512.webp 768w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-150x100.webp 150w\" sizes=\"auto, (max-width: 900px) 100vw, 900px\" \/><figcaption class=\"wp-element-caption\"><em><sup>Credit: MIT<\/sup><\/em><\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Cambridge, MA \u2013 Determining the least expensive path for a new subway line underneath a metropolis like New York City is a colossal planning challenge \u2014 involving thousands of potential routes through hundreds of city blocks, each with uncertain construction costs. Conventional wisdom suggests extensive field studies across many locations would be needed to determine the costs associated with digging below certain city blocks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because these studies are costly to conduct, a city planner would want to perform as few as possible while still gathering the most useful data for making an optimal decision.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With almost countless possibilities, how would they know where to start?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A new algorithmic method developed by MIT researchers could help. Their mathematical framework provably identifies the smallest dataset that guarantees finding the optimal solution to a problem, often requiring fewer measurements than traditional approaches suggest.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the case of the subway route, this method considers the structure of the problem (the network of city blocks, construction constraints, and budget limits) and the uncertainty surrounding costs. The algorithm then identifies the minimum set of locations where field studies would guarantee finding the least expensive route. The method also identifies how to use this strategically collected data to find the optimal decision.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This framework applies to a broad class of structured decision-making problems under uncertainty, such as supply chain management or electricity network optimization.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cData are one of the most important aspects of the AI economy. Models are trained on more and more data, consuming enormous computational resources. But most real-world problems have structure that can be exploited. We\u2019ve shown that with careful selection, you can guarantee optimal solutions with a small dataset, and we provide a method to identify exactly which data you need,\u201d says Asu Ozdaglar, Mathworks Professor and head of the MIT Department of Electrical Engineering and Computer Science (EECS), deputy dean of the MIT Schwarzman College of Computing, and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Ozdaglar, co-senior author of a&nbsp;<a href=\"https:\/\/link.mediaoutreach.meltwater.com\/ls\/click?upn=u001.aGL2w8mpmadAd46sBDLfbJQfXi-2BgjtsRXhSuJl6mKAifVI6cDq4H9Mn5mSQhNl56OcQE_Gmh-2FjktplCfWo1o-2BFbkY3J9eYBJUJc-2BSUmMkHo42Dqe4Z0qTEKCmSFnQfWCe8-2B8jgXgQQcW-2Fb1rLKfKZRu-2BLLGScwMYc-2FOCX9RDmpXEBR4BY9i7y-2BNgpMuREG7n76alZ5DCnMfU4BpYahzFVJVDqr3gTfs4-2FUWzm2H05d30sjMLwtSgfh0v1Vh-2BQEguywy5uqpeaYBNbpiXCT-2Ffms-2FpOF5k3nVn0v-2B9rNosoVEU7YCmCD-2FyYpgREDpzzlO-2Bz794a-2FLzE0WwcKiy1-2FhHBHwdEXpuBTNSFkO9uU2NYVU1QKxYvDYKdVlt0G1woSVs-2FGpP45RVRXlVmZRyTl5V5HlUhcSNJEc1qxAcJpOHR2OJ1QpFPu7EY1pavi4zH7nDLKyh0MPMJZ5M-2BM75TXWKiLpq6sw-3D-3D\" rel=\"noreferrer noopener\" target=\"_blank\">paper on this research<\/a>, is joined by co-lead authors Omar Bennouna, an EECS graduate student, and his brother Amine Bennouna, a former MIT postdoc who is now an assistant professor at Northwestern University; and co-senior author Saurabh Amin, co-director of Operations Research Center, a professor in the MIT Department of Civil and Environmental Engineering, and a principal investigator in LIDS. The research will be presented at the Conference on Neural Information Processing Systems.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>An optimality guarantee<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Much of the recent work in operations research focuses on how to best use data to make decisions, but this assumes these data already exist.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The MIT researchers started by asking a different question \u2014 what are the minimum data needed to optimally solve a problem? With this knowledge, one could collect far fewer data to find the best solution, spending less time, money, and energy conducting experiments and training AI models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The researchers first developed a precise geometric and mathematical characterization of what it means for a dataset to be sufficient. Every possible set of costs (travel times, construction expenses, energy prices) makes some particular decision optimal. These \u201coptimality regions\u201d partition the decision space. A dataset is sufficient if it can determine which region contains the true cost.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This characterization offers the foundation of the practical algorithm they developed that identifies datasets that guarantee finding the optimal solution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Their theoretical exploration revealed that a small, carefully selected dataset is often all one needs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cWhen we say a dataset is sufficient, we mean that it contains exactly the information needed to solve the problem. You don\u2019t need to estimate all the parameters accurately; you just need data that can discriminate between competing optimal solutions,\u201d says Amine Bennouna.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Building on these mathematical foundations, the researchers developed an algorithm that finds the smallest sufficient dataset.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Capturing the right data<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To use this tool, one inputs the structure of the task, such as the objective and constraints, along with the information they know about the problem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, in supply chain management, the task might be to reduce operational costs across a network of dozens of potential routes. The company may already know that some shipment routes are especially costly, but lack complete information on others.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The researchers\u2019 iterative algorithm works by repeatedly asking, \u201cIs there any scenario that would change the optimal decision in a way my current data can&#8217;t detect?\u201d If yes, it adds a measurement that captures that difference. If no, the dataset is provably sufficient.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This algorithm pinpoints the subset of locations that need to be explored to guarantee finding the minimum-cost solution.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Then, after collecting those data, the user can feed them to another algorithm the researchers developed which finds that optimal solution. In this case, that would be the shipment routes to include in a cost-optimal supply chain.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cThe algorithm guarantees that, for whatever scenario could occur within your uncertainty, you\u2019ll identify the best decision,\u201d Omar Bennouna says.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The researchers\u2019 evaluations revealed that, using this method, it is possible to guarantee an optimal decision with a much smaller dataset than would typically be collected.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u201cWe challenge this misconception that small data means approximate solutions. These are exact sufficiency results with mathematical proofs. We\u2019ve identified when you\u2019re guaranteed to get the optimal solution with very little data \u2014 not probably, but with certainty,\u201d Amin says.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the future, the researchers want to extend their framework to other types of problems and more complex situations. They also want to study how noisy observations could affect dataset optimality.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cambridge, MA \u2013 Determining the least expensive path for a new subway line underneath a metropolis like New York City is a colossal planning challenge \u2014 involving thousands of potential routes through hundreds of city blocks, each with uncertain construction costs.<\/p>\n","protected":false},"author":2,"featured_media":31837,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17,47],"tags":[],"class_list":["post-31836","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research","category-it"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp",900,600,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-200x200.webp",200,200,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-675x450.webp",675,450,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-768x512.webp",750,500,true],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp",750,500,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp",900,600,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp",900,600,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0.webp",900,600,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-870x570.webp",870,570,true],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-600x600.webp",600,600,true],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-600x600.webp",600,600,true],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-760x490.webp",760,490,true],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-550x360.webp",550,360,true],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-95x65.webp",95,65,true],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-640x600.webp",640,600,true],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-96x96.webp",96,96,true],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2025\/11\/MIT-Optimal-Decisions-01-press_0-150x100.webp",150,100,true]},"author_info":{"info":["Adam Zewe"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category tag\">Research<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/it\/\" rel=\"category tag\">IT<\/a>","tag_info":"IT","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/31836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=31836"}],"version-history":[{"count":1,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/31836\/revisions"}],"predecessor-version":[{"id":31838,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/31836\/revisions\/31838"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/31837"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=31836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=31836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=31836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}