{"id":5929,"date":"2015-08-27T05:30:56","date_gmt":"2015-08-27T05:30:56","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=5929"},"modified":"2015-08-27T05:30:56","modified_gmt":"2015-08-27T05:30:56","slug":"searching-big-data-faster","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/searching-big-data-faster\/","title":{"rendered":"Searching big data faster"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\"><em><strong style=\"color: #222222;\">Theoretical analysis could expand applications of accelerated searching in biology, other fields.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_5930\" aria-describedby=\"caption-attachment-5930\" style=\"width: 639px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-5930\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg\" alt=\"Illustration of points in an arbitrary high-dimensional space that live close to a one-dimensional, tree-like structure, as might arise from genomes generated by mutation and selection in evolution. Although high-dimensional at a fine scale, at the coarser scale of covering spheres, the data cloud looks nearly one-dimensional, which enables entropy scaling of similarity search. Courtesy of the researchers\" width=\"639\" height=\"426\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg 639w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0-300x200.jpg 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><\/a><figcaption id=\"caption-attachment-5930\" class=\"wp-caption-text\">Illustration of points in an arbitrary high-dimensional space that live close to a one-dimensional, tree-like structure, as might arise from genomes generated by mutation and selection in evolution. Although high-dimensional at a fine scale, at the coarser scale of covering spheres, the data cloud looks nearly one-dimensional, which enables entropy scaling of similarity search.<br \/>Courtesy of the researchers<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>CAMBRIDGE, Mass<\/strong>.&#8211;For more than a decade, gene sequencers have been improving more rapidly than the computers required to make sense of their outputs. Searching for DNA sequences in existing genomic databases can already take hours, and the problem is likely to get worse.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">Recently, Bonnie Berger\u2019s group at MIT\u2019s Computer Science and Artificial Intelligence Laboratory (CSAIL) has been investigating\u00a0techniquesto make biological and chemical data easier to analyze by, in some sense,\u00a0compressing\u00a0it.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">In the latest issue of the journal\u00a0<i>Cell Systems<\/i>, Berger and colleagues present a theoretical analysis that demonstrates why their previous compression schemes have been so successful. They identify properties of data sets that make them amenable to compression and present an algorithm for determining whether a given data set has those properties. They also show that several existing databases of chemical compounds and biological molecules do indeed exhibit them.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">Given measurements for those properties, the researchers can also calculate the improvements in search efficiency that their compression techniques afford. For the data sets they analyze, those efficiencies scale sublinearly, meaning that the larger the data set, the more efficient the search should be.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">\u201cThis paper provides a framework for how we can apply compressive algorithms to large-scale biological data,\u201d says Berger, a professor of applied mathematics at MIT. \u201cWe also have proofs for how much efficiency we can get.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">The key to the researchers\u2019 compression scheme is that evolution is stingy with good designs. There tends to be a lot of redundancy in the genomes of closely related \u2014 or even distantly related \u2014 organisms.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">That means that of all the possible sequences of the four DNA letters \u2014 A, T, C, and G \u2014 only a very small subset is represented by the genomes of real organisms. Moreover, within the space of possible genomes, those of real organisms are not distributed randomly. Instead, they trace out continuous patterns, which represent the relatively slow rate at which species diverge.<\/span><\/p>\n<p style=\"text-align: justify;\"><strong><span style=\"color: #000000;\">Birds of a feather<\/span><\/strong><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">To make searching more efficient, the Berger group\u2019s compression algorithms cluster together similar genomic sequences \u2014 those that diverge by only a few DNA letters \u2014then choose one sequence as representative of the cluster. A search can concentrate only on the likeliest clusters; most of the data never has to be examined.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">If genomic data is envisioned as tracing a continuous path through a much larger space of possibilities, then the clusters can be envisioned as spheres superimposed on the data. Data points that fall within a single sphere are closely related.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">Berger and her colleagues \u2014 first authors Noah Daniels, a postdoc in her group, and William Yu, a graduate student in applied mathematics, and David Danko, an undergraduate major in computational biology \u2014 show that data sets are amenable to their compressive search techniques if they meet two criteria. The first they refer to as metric entropy. This means that the data inhabits only a small part of the larger space of possibilities.\u00a0<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">The second is low fractal dimension. That means that the density of the data points doesn\u2019t vary greatly as you move through the data. If your search requires you to explore three spheres rather than one, it takes only three times as long \u2014 not 10 times, or 100 times.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">In their paper, the MIT researchers analyze three data sets. Two describe proteins \u2014 one according to their sequences of amino acids, the other according to their shape \u2014 and the third describes organic molecules. In a separate paper, now under submission, the researchers apply the same types of analysis to DNA segments between 32 and 63 letters in length.<\/span><\/p>\n<p style=\"text-align: justify;\"><strong><span style=\"color: #000000;\">Time\u2019s arrow<\/span><\/strong><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">The efficiency of their search algorithm scales sublinearly, not with the number of data points, but with the metric entropy of the data set, which is a formal measure of the continuity of the data and their sparseness, relative to the space of possibilities. Because evolution is conservative, the metric entropy of genomic data should increase as new genomes are sequenced. That is, the addition of new genomes will not, in all likelihood, add new branches to the pattern traced out in the space of possibilities; rather, it will fill in gaps in the existing pattern, increasing the metric entropy.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: rgb(0, 0, 0);\">Many other large data sets, however, could prove to be conservative in the same way. The range of behaviors exhibited by Web users, for instance, may, relative to the entire space of possibilities, be constrained by biology, by cultural history, or both. The MIT researchers\u2019 compression techniques could thus be applicable to a wide range of data outside biology.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Searching for DNA sequences in existing genomic databases can already take hours, and the problem is likely to get worse.<\/p>\n","protected":false},"author":6,"featured_media":5930,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17],"tags":[],"class_list":["post-5929","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",600,400,false],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",600,400,false],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",540,360,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",639,426,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2015\/08\/MIT-Compressive-Genomics-1_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/5929","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=5929"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/5929\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/5930"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=5929"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=5929"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=5929"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}