{"id":10390,"date":"2016-10-27T06:30:22","date_gmt":"2016-10-27T06:30:22","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=10390"},"modified":"2016-10-27T06:30:22","modified_gmt":"2016-10-27T06:30:22","slug":"finding-patterns-in-corrupted-data","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/finding-patterns-in-corrupted-data\/","title":{"rendered":"Finding patterns in corrupted data"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong style=\"color: #222222;\">New model-fitting technique is efficient even for data sets with hundreds of variables.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_10391\" aria-describedby=\"caption-attachment-10391\" style=\"width: 639px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-10391\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg\" alt=\"A team, including researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory, has created a new set of algorithms that can efficiently fit probability distributions to high-dimensional data. 
Image: MIT News\" width=\"639\" height=\"426\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg 639w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0-300x200.jpg 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><\/a><figcaption id=\"caption-attachment-10391\" class=\"wp-caption-text\">A team, including researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory, has created a new set of algorithms that can efficiently fit probability distributions to high-dimensional data.<br \/>Image: MIT News<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>CAMBRIDGE, Mass.<\/strong> &#8212; Data analysis \u2014 and particularly big-data analysis \u2014 is often a matter of fitting data to some sort of mathematical model. The most familiar example of this might be\u00a0<a style=\"color: #1155cc;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d8085%3e0-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=32576&amp;Action=Follow+Link\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #000000;\">linear regression<\/span><\/a>, which finds a line that approximates a distribution of data points. 
But fitting data to probability distributions, such as the familiar bell curve, is just as common.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">If, however, a data set has just a few corrupted entries \u2014 say, outlandishly improbable measurements \u2014 standard data-fitting techniques can break down. This problem becomes much more acute with high-dimensional data, or data with many variables, which is ubiquitous in the digital age.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Since the early 1960s, it\u2019s been known that there are algorithms for weeding corruptions out of high-dimensional data, but none of the algorithms proposed in the past 50 years are practical when the variable count gets above, say, 12.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">That\u2019s about to change. Earlier this month, at the IEEE Symposium on Foundations of Computer Science, a team of researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory, the University of Southern California, and the University of California at San Diego presented a new set of algorithms that can efficiently fit probability distributions to high-dimensional data.<\/span><\/p>\n<p style=\"text-align: justify;\">[pullquote]\u201cFrom the vantage point of theoretical computer science, it\u2019s much more apparent how rare it is for a problem to be efficiently solvable,\u201d says Ankur Moitra, the Rockwell International Career Development Assistant Professor of Mathematics at MIT and one of the leaders of the MIT-USC-UCSD project.\u00a0[\/pullquote]<\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Remarkably, at the same conference, researchers from Georgia Tech presented a very similar algorithm.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The pioneering work on \u201crobust statistics,\u201d or statistical methods that 
can tolerate corrupted data, was done by statisticians, but both new papers come from groups of computer scientists. That probably reflects a shift of attention within the field, toward the computational efficiency of model-fitting techniques.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cFrom the vantage point of theoretical computer science, it\u2019s much more apparent how rare it is for a problem to be efficiently solvable,\u201d says Ankur Moitra, the Rockwell International Career Development Assistant Professor of Mathematics at MIT and one of the leaders of the MIT-USC-UCSD project. \u201cIf you start off with some hypothetical thing \u2014 \u2018Man, I wish I could do this. If I could, it would be robust\u2019 \u2014 you\u2019re going to have a bad time, because it will be inefficient. You should start off with the things that you know that you can efficiently do, and figure out how to piece them together to get robustness.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Resisting corruption<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">To understand the principle behind robust statistics, Moitra explains, consider the normal distribution \u2014 the bell curve, or in mathematical parlance, the one-dimensional Gaussian distribution. The one-dimensional Gaussian is completely described by two parameters: the mean, or average, value of the data, and the variance, which is a measure of how quickly the data spreads out around the mean.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">If the data in a data set \u2014 say, people\u2019s heights in a given population \u2014 is well-described by a Gaussian distribution, then the mean is just the arithmetic average. 
But suppose you have a data set consisting of height measurements of 100 women, and while most of them cluster around 64 inches \u2014 some a little higher, some a little lower \u2014 one of them, for some reason, is 1,000 inches. Taking the arithmetic average will peg a woman\u2019s mean height at roughly 6 feet 1 inch (that single corrupted entry adds more than 9 inches to the average), not 5 feet 4 inches.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">One way to avoid such a nonsensical result is to estimate the mean, not by taking the numerical average of the data, but by finding its median value. This would involve listing all 100 measurements in order, from smallest to largest, and taking the 50th or 51st. An algorithm that uses the median to estimate the mean is thus more robust, meaning it\u2019s less responsive to corrupted data, than one that uses the average.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The median is just an approximation of the mean, however, and the accuracy of the approximation decreases rapidly with more variables. Big-data analysis might require examining thousands or even millions of variables; in such cases, approximating the mean with the median would often yield unusable results.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Identifying outliers<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">One way to weed corrupted data out of a high-dimensional data set is to take 2-D cross sections of the graph of the data and see whether they look like Gaussian distributions. 
If they don\u2019t, you may have located a cluster of spurious data points, such as that 83-foot-tall woman, which can simply be excised.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The problem is that, with all previously known algorithms that adopted this approach, the number of cross sections required to find corrupted data was an exponential function of the number of dimensions. By contrast, Moitra and his coauthors \u2014 Gautam Kamath and Jerry Li, both MIT graduate students in electrical engineering and computer science; Ilias Diakonikolas and Alistair Stewart of USC; and Daniel Kane of UCSD \u2014 found an algorithm whose running time increases with the number of data dimensions at a much more reasonable rate (polynomially, in computer science jargon).<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Their algorithm relies on two insights. The first is what\u00a0<a style=\"color: #1155cc;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d8085%3e0-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=32575&amp;Action=Follow+Link\" target=\"_blank\" rel=\"noopener\"><span style=\"color: #000000;\">metric<\/span><\/a>\u00a0to use when measuring how far away a data set is from a range of distributions with approximately the same shape. 
That allows them to tell when they\u2019ve winnowed out enough corrupted data to permit a good fit.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The other is how to identify the regions of data in which to begin taking cross sections. For that, the researchers rely on something called the kurtosis of a distribution, which measures the size of its tails, or the rate at which the concentration of data decreases far from the mean. Again, there are multiple ways to infer kurtosis from data samples, and selecting the right one is central to the algorithm\u2019s efficiency.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers\u2019 approach works with Gaussian distributions, certain combinations of Gaussian distributions, another common distribution called the product distribution, and certain combinations of product distributions. Although they believe their approach can be extended to other types of distributions, their chief focus in ongoing work is on applying their techniques to real-world data.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data analysis \u2014 and particularly big-data analysis \u2014 is often a matter of fitting data to some sort of mathematical model. The most familiar example of this might be linear regression, which finds a line that approximates a distribution of data points. 
But fitting data to probability distributions, such as the familiar bell curve, is just as common.<\/p>\n","protected":false},"author":6,"featured_media":10391,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43,22,17],"tags":[],"class_list":["post-10390","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-other","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",600,400,false],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",600,400,false],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-con
tent\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",540,360,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",639,426,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-High-Def-Learning_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/computer-science\/\" rel=\"category tag\">Computer Science<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/other\/\" rel=\"category tag\">Other<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category 
tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10390","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=10390"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10390\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/10391"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=10390"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=10390"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=10390"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}