{"id":10283,"date":"2016-10-23T06:19:05","date_gmt":"2016-10-23T06:19:05","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=10283"},"modified":"2016-10-23T06:19:05","modified_gmt":"2016-10-23T06:19:05","slug":"big-data-algorithms-could-cut-analysis-times-from-months-to-days","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/big-data-algorithms-could-cut-analysis-times-from-months-to-days\/","title":{"rendered":"Big-data algorithms could cut analysis times from months to days"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong style=\"color: #222222;\">With new algorithms, data scientists could accomplish in days what has traditionally taken months.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_10284\" aria-describedby=\"caption-attachment-10284\" style=\"width: 639px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-10284\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg\" alt=\"\u201cThe goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,\u201d says Max Kanter MEng \u201915.\" width=\"639\" height=\"426\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg 639w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0-300x200.jpg 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><\/a><figcaption id=\"caption-attachment-10284\" class=\"wp-caption-text\">\u201cThe goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,\u201d says Max Kanter MEng 
\u201915.<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>CAMBRIDGE, Mass.<\/strong> &#8212; Last year, MIT researchers presented a system that automated a crucial step in big-data analysis: the selection of a \u201cfeature set,\u201d or aspects of the data that are useful for making predictions. The researchers entered the system in several data science contests, where it outperformed most of the human competitors and took only hours instead of months to perform its analyses.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">This week, in a pair of papers at the IEEE International Conference on Data Science and Advanced Analytics, the team described an approach to automating most of the rest of the process of big-data analysis \u2014 the preparation of the data for analysis and even the specification of problems that the analysis might be able to solve.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers believe that, again, their systems could perform in days tasks that used to take data scientists months.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cThe goal of all this is to present the interesting stuff to the data scientists so that they can more quickly address all these new data sets that are coming in,\u201d says Max Kanter MEng \u201915, who is first author on last year\u2019s paper and one of this year\u2019s papers. 
\u201c[Data scientists want to know], \u2018Why don\u2019t you show me the top 10 things that I can do the best, and then I\u2019ll dig down into those?\u2019 So [these methods are] shrinking the time between getting a data set and actually producing value out of it.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Both papers focus on time-varying data, which reflects observations made over time, and they assume that the goal of analysis is to produce a probabilistic model that will predict future events on the basis of current observations.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Real-world problems<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The first paper describes a general framework for analyzing time-varying data. It splits the analytic process into three stages: labeling the data, or categorizing salient data points so they can be fed to a machine-learning system; segmenting the data, or determining which time sequences of data points are relevant to which problems; and \u201cfeaturizing\u201d the data, the step performed by the system the researchers presented last year.<\/span><\/p>\n<p style=\"text-align: justify;\">[pullquote]The second paper describes a new language for describing data-analysis problems and a set of algorithms that automatically recombine data in different ways, to determine what types of prediction problems the data might be useful for solving.[\/pullquote]<\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The second paper describes a new language for describing data-analysis problems and a set of algorithms that automatically recombine data in different ways, to determine what types of prediction problems the data might be useful for solving.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">According to Kalyan Veeramachaneni, a principal research 
scientist at MIT\u2019s Laboratory for Information and Decision Systems and senior author on all three papers, the work grew out of his team\u2019s experience with real data-analysis problems brought to it by industry researchers.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cOur experience was, when we got the data, the domain experts and data scientists sat around the table for a couple months to define a prediction problem,\u201d he says. \u201cThe reason I think that people did that is they knew that the label-segment-featurize process takes six to eight months. So we better define a good prediction problem to even start that process.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In 2015, after completing his master\u2019s, Kanter joined Veeramachaneni\u2019s group as a researcher. Then, in the fall of 2015, Kanter and Veeramachaneni founded a company called\u00a0<a style=\"color: #1155cc;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d8083%405-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=32350&amp;Action=Follow+Link\" target=\"_blank\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d8083%25405-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D32350%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1477287218026000&amp;usg=AFQjCNHwAO3eq9tZskehx5cTumUMGQOtWA\" rel=\"noopener\"><span style=\"color: #000000;\">Feature Labs<\/span><\/a>\u00a0to commercialize their data-analysis technology. 
Kanter is now the company\u2019s CEO, and after receiving his master\u2019s in 2016, another master\u2019s student in Veeramachaneni\u2019s group, Benjamin Schreck, joined the company as chief data scientist.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Data preparation<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Developed by Schreck and Veeramachaneni, the new language, dubbed Trane, should reduce the time it takes data scientists to define good prediction problems, from months to days. Kanter, Veeramachaneni, and another Feature Labs employee, Owen Gillespie, have also devised a method that should do the same for the label-segment-featurize (LSF) process.\u00a0<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">To get a sense of what labeling and segmentation entail, suppose that a data scientist is presented with electroencephalogram (EEG) data for several patients with epilepsy and asked to identify patterns in the data that might signal the onset of seizures.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The first step is to identify the EEG spikes that indicate seizures. The next is to extract a segment of the EEG signal that precedes each seizure. For purposes of comparison, \u201cnormal\u201d segments of the signal \u2014 segments of similar length but far removed from seizures \u2014 should also be extracted. The segments are then labeled as either preceding a seizure or not, information that a machine-learning algorithm can use to identify patterns that indicate seizure onset.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In their LSF paper, Kanter, Veeramachaneni, and Gillespie define a general mathematical framework for describing such labeling and segmentation problems. 
Rather than EEG readings, for instance, the data might be the purchases by customers of a particular company, and the problem might be to determine from a customer\u2019s buying history whether he or she is likely to buy a new product.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">There, the pertinent data, for predictive purposes, may be not a customer\u2019s behavior over some time span, but information about his or her three most recent purchases, whenever they occurred. The framework is flexible enough to accommodate such different specifications. But once those specifications are made, the researchers\u2019 algorithm performs the corresponding segmentation and labeling automatically.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Finding problems<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">With Trane, time-series data is represented in tables, where the columns contain measurements and the times at which they were made. Schreck and Veeramachaneni defined a small set of operations that can be performed on either columns or rows. A row operation is something like determining whether a measurement in one row is greater than some threshold number, or raising it to a particular power. 
A column operation is something like taking the differences between successive measurements in a column, or summing all the measurements, or taking just the first or last one.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Fed a table of data, Trane exhaustively iterates through combinations of such operations, enumerating a huge number of potential questions that can be asked of the data \u2014 whether, for instance, the differences between measurements in successive rows ever exceed a particular value, or whether there are any rows for which it is true that the square of the data equals a particular number.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">To test Trane\u2019s utility, the researchers considered a suite of questions that data scientists had posed about roughly 60 real data sets. They limited the number of sequential operations that Trane could perform on the data to five, and those operations were drawn from a set of only six row operations and 11 column operations. Remarkably, that comparatively limited set was enough to reproduce every question that researchers had in fact posed \u2014 in addition to hundreds of others that they hadn\u2019t.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last year, MIT researchers presented a system that automated a crucial step in big-data analysis: the selection of a \u201cfeature set,\u201d or aspects of the data that are useful for making predictions. 
The researchers entered the system in several data science contests, where it outperformed most of the human competitors and took only hours instead of months to perform its analyses.<\/p>\n<p>This week, in a pair of papers at the IEEE International Conference on Data Science and Advanced Analytics, the team described an approach to automating most of the rest of the process of big-data analysis \u2014 the preparation of the data for analysis and even the specification of problems that the analysis might be able to solve.<\/p>\n","protected":false},"author":6,"featured_media":10284,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17],"tags":[],"class_list":["post-10283","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",6
39,426,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",600,400,false],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",600,400,false],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",540,360,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",639,426,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/10\/MIT-Automatic-Data_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category 
tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10283","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=10283"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10283\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/10284"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=10283"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=10283"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=10283"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}