{"id":13514,"date":"2017-11-02T06:31:48","date_gmt":"2017-11-02T06:31:48","guid":{"rendered":"https:\/\/www.revoscience.com\/en\/?p=13514"},"modified":"2017-11-02T06:31:48","modified_gmt":"2017-11-02T06:31:48","slug":"crowdsourcing-big-data-analysis","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/crowdsourcing-big-data-analysis\/","title":{"rendered":"Crowdsourcing big-data analysis"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong>Web-based system automatically evaluates proposals from far-flung data scientists<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_13515\" aria-describedby=\"caption-attachment-13515\" style=\"width: 639px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-13515\" src=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg\" alt=\"\" width=\"639\" height=\"426\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg 639w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0-300x200.jpg 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><figcaption id=\"caption-attachment-13515\" class=\"wp-caption-text\">\u201cI think that the concept of massive and open data science can be really leveraged for areas where there\u2019s a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses,\u201d MIT graduate student Micah Smith says about FeatureHub.<br \/>Image: MIT News<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">CAMBRIDGE, Mass. &#8212;\u00a0In the analysis of big data sets, the first step is usually the identification of \u201cfeatures\u201d \u2014 data points with particular predictive power or analytic utility. 
Choosing features usually requires some human intuition. For instance, a sales database might contain revenues and date ranges, but it might take a human to recognize that average revenues \u2014 revenues divided by the sizes of the ranges \u2014 is the really useful metric.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">MIT researchers have developed a new collaboration tool, dubbed FeatureHub, intended to make feature identification more efficient and effective. With FeatureHub, data scientists and experts on particular topics could log on to a central site and spend an hour or two reviewing a problem and proposing features. Software then tests myriad combinations of features against target data, to determine which are most useful for a given predictive task.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In tests, the researchers recruited 32 analysts with data science experience, who spent five hours each with the system, familiarizing themselves with it and using it to propose candidate features for each of two data-science problems.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The predictive models produced by the system were tested against those submitted to a data-science competition called Kaggle. The Kaggle entries had been scored on a 100-point scale, and the FeatureHub models were within three and five points of the winning entries for the two problems.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">But where the top-scoring entries were the result of weeks or even months of work, the FeatureHub entries were produced in a matter of days. 
And while 32 collaborators on a single data science project is a lot by today\u2019s standards, Micah Smith, an MIT graduate student in electrical engineering and computer science who helped lead the project, has much larger ambitions.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">FeatureHub \u2014 like its name \u2014 was inspired by GitHub, an online repository of open-source programming projects, some of which have drawn thousands of contributors. Smith hopes that FeatureHub might someday attain a similar scale.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cI do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention,\u201d he says. \u201cI think that the concept of massive and open data science can be really leveraged for areas where there\u2019s a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Smith and his colleagues presented a\u00a0<a href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d81%3c3%403-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=43140&amp;Action=Follow+Link\" target=\"_blank\" rel=\"noopener\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d81%253c3%25403-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D43140%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1509690359994000&amp;usg=AFQjCNEOgPVwmKtebmAbSK7PdhpnoEL3TQ\">paper<\/a>\u00a0describing FeatureHub at the IEEE International Conference on Data Science 
and Advanced Analytics. His coauthors on the paper are his thesis advisor, Kalyan Veeramachaneni, a principal research scientist at MIT\u2019s Laboratory for Information and Decision Systems, and Roy Wedge, who began working with Veeramachaneni\u2019s group as an MIT undergraduate and is now a software engineer at\u00a0<a href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d81%3c3%403-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=43139&amp;Action=Follow+Link\" target=\"_blank\" rel=\"noopener\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d81%253c3%25403-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D43139%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1509690359994000&amp;usg=AFQjCNFJ9WJEbUBiI9Dp1pA5-_Yoi9FNCw\">Feature Labs<\/a>, a data science company based on the group\u2019s work.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">FeatureHub\u2019s user interface is built on top of a common data-analysis software suite called the Jupyter Notebook, and the evaluation of feature sets is performed by standard machine-learning software packages. Features must be written in the Python programming language, but their design has to follow a template that intentionally keeps the syntax simple. 
A typical feature might require between five and 10 lines of code.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The MIT researchers wrote code that mediates between the other software packages and manages data, pooling features submitted by many different users and tracking those collections of features that perform best on particular data analysis tasks.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In the past, Veeramachaneni\u2019s group has developed software that\u00a0<a href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d81%3c3%403-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=43138&amp;Action=Follow+Link\" target=\"_blank\" rel=\"noopener\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d81%253c3%25403-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D43138%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1509690359994000&amp;usg=AFQjCNH0LMqbrzjvLhnGaPmFB1o-xIYUTQ\">automatically generates<\/a>\u00a0features by inferring relationships between data from the manner in which they\u2019re organized. When that organizational information is missing, however, the approach is less effective.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Still, Smith imagines, automatic feature synthesis could be used in conjunction with FeatureHub, getting projects started before volunteers have begun to contribute to them, saving the grunt work of enumerating the obvious features, and augmenting the best-performing sets of features contributed by humans.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Web-based system automatically evaluates proposals from far-flung data scientists CAMBRIDGE, Mass. 
&#8212;\u00a0In the analysis of big data sets, the first step is usually the identification of \u201cfeatures\u201d \u2014 data points with particular predictive power or analytic utility. Choosing features usually requires some human intuition. For instance, a sales database might contain revenues and date ranges, [&hellip;]<\/p>\n","protected":false},"author":6,"featured_media":13515,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[17],"tags":[],"class_list":["post-13514","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0-300x200.jpg",300,200,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",600,400,f
alse],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",600,400,false],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",540,360,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",639,426,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/11\/MIT-Crowdsource-Features_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category 
tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/13514","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=13514"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/13514\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/13515"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=13514"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=13514"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=13514"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}