{"id":10916,"date":"2016-12-18T06:31:09","date_gmt":"2016-12-18T06:31:09","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=10916"},"modified":"2016-12-18T06:31:09","modified_gmt":"2016-12-18T06:31:09","slug":"data-diversity","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/data-diversity\/","title":{"rendered":"Data diversity"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong>Preserving variety in subsets of unmanageably large data sets should aid machine learning.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_10917\" aria-describedby=\"caption-attachment-10917\" style=\"width: 621px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-10917\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg\" alt=\"Researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems have designed a new algorithm that makes it much more practical to select diverse subsets from a much larger dataset. Illustration: Christine Daniloff\/MIT\" width=\"621\" height=\"419\" title=\"\"><figcaption id=\"caption-attachment-10917\" class=\"wp-caption-text\">Researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems have designed a new algorithm that makes it much more practical to select diverse subsets from a much larger dataset.<br \/>Illustration: Christine Daniloff\/MIT<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">CAMBRIDGE, Mass. 
&#8212;\u00a0When data sets get too big, sometimes the only way to do anything useful with them is to extract much smaller subsets and analyze those instead.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Those subsets have to preserve certain properties of the full sets, however, and one property that\u2019s useful in a wide range of applications is diversity. If, for instance, you\u2019re using your data to train a machine-learning system, you want to make sure that the subset you select represents the full range of cases that the system will have to confront.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Last week at the Conference on Neural Information Processing Systems, researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory and its Laboratory for Information and Decision Systems <a style=\"color: #000000;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d80%3a2%405-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=33502&amp;Action=Follow+Link\" target=\"_blank\" rel=\"noopener\">presented<\/a> a new algorithm that makes the selection of diverse subsets much more practical.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Whereas the running times of earlier subset-selection algorithms depended on the number of data points in the complete data set, the running time of the new algorithm depends on the number of data points in the subset. 
That means that if the goal is to winnow a data set with 1 million points down to one with 1,000, the new algorithm is 1 billion times faster than its predecessors.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cWe want to pick sets that are diverse,\u201d says Stefanie Jegelka, the X-Window Consortium Career Development Assistant Professor in MIT\u2019s Department of Electrical Engineering and Computer Science and senior author on the new paper. \u201cWhy is this useful? One example is recommendation. If you recommend books or movies to someone, you maybe want to have a diverse set of items, rather than 10 little variations on the same thing. Or if you search for, say, the word \u2018Washington.\u2019 There\u2019s many different meanings that this word can have, and you maybe want to show a few different ones. Or if you have a large data set and you want to explore \u2014 say, a large collection of images or health records \u2014 and you want a brief synopsis of your data, you want something that is diverse, that captures all the directions of variation of the data.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cThe other application where we actually use this thing is in large-scale learning. 
You have a large data set again, and you want to pick a small part of it from which you can learn very well.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Joining Jegelka on the paper are first author Chengtao Li, a graduate student in electrical engineering and computer science; and Suvrit Sra, a principal research scientist at MIT\u2019s Laboratory for Information and Decision Systems.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Thinking small<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Traditionally, if you want to extract a diverse subset from a large data set, the first step is to create a similarity matrix \u2014 a huge table that maps every point in the data set against every other point. The intersection of the row representing one data item and the column representing another contains the points\u2019 similarity score on some standard measure.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">There are several standard methods to extract diverse subsets, but they all involve operations performed on the matrix as a whole. With a data set with a million data points \u2014 and a million-by-million similarity matrix \u2014 this is prohibitively time consuming.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The MIT researchers\u2019 algorithm begins, instead, with a small subset of the data, chosen at random. 
Then it picks one point inside the subset and one point outside it and randomly selects one of three simple operations: swapping the points, adding the point outside the subset to the subset, or deleting the point inside the subset.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The probability with which the algorithm selects one of those operations depends on both the size of the full data set and the size of the subset, so it changes slightly with every addition or deletion. But the algorithm doesn\u2019t necessarily perform the operation it selects.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Again, the decision to perform the operation or not is probabilistic, but here the probability depends on the improvement in diversity that the operation affords. For additions and deletions, the decision also depends on the size of the subset relative to that of the original data set. That is, as the subset grows, it becomes harder to add new points unless they improve diversity dramatically.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">This process repeats until the diversity of the subset reflects that of the full set. Since the diversity of the full set is never calculated, however, the question is how many repetitions are enough. The researchers\u2019 chief results are a way to answer that question and a proof that the answer will be reasonable.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Those subsets have to preserve certain properties of the full sets, however, and one property that\u2019s useful in a wide range of applications is diversity. 
If, for instance, you\u2019re using your data to train a machine-learning system, you want to make sure that the subset you select represents the full range of cases that the system will have to confront.<\/p>\n","protected":false},"author":6,"featured_media":10917,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43,17],"tags":[],"class_list":["post-10916","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"news
paper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",511,341,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SamplingSubsets_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/computer-science\/\" rel=\"category tag\">Computer Science<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category 
tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10916","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=10916"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10916\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/10917"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=10916"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=10916"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=10916"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}