{"id":10496,"date":"2016-11-12T07:21:57","date_gmt":"2016-11-12T07:21:57","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=10496"},"modified":"2016-11-11T07:38:52","modified_gmt":"2016-11-11T07:38:52","slug":"artificial-intelligence-system-surfs-web-to-improve-its-performance","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/artificial-intelligence-system-surfs-web-to-improve-its-performance\/","title":{"rendered":"Artificial-intelligence system surfs web to improve its performance"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong style=\"color: #222222;\">\u201cInformation extraction\u201d system helps turn plain text into data for statistical analysis.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_10497\" aria-describedby=\"caption-attachment-10497\" style=\"width: 575px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-10497\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg\" alt=\"Information extraction \u2014 or automatically classifying data items stored as plain text \u2014 is a major topic of artificial-intelligence research. 
Image: MIT News\" width=\"575\" height=\"383\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg 575w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0-300x199.jpg 300w\" sizes=\"auto, (max-width: 575px) 100vw, 575px\" \/><\/a><figcaption id=\"caption-attachment-10497\" class=\"wp-caption-text\">Information extraction \u2014 or automatically classifying data items stored as plain text \u2014 is a major topic of artificial-intelligence research.<br \/>Image: MIT News<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>CAMBRIDGE, Mass.<\/strong> &#8212;\u00a0Of the vast wealth of information unlocked by the Internet, most is plain text. The data necessary to answer myriad questions \u2014 about, say, the correlations between the industrial use of certain chemicals and incidents of disease, or between patterns of news coverage and voter-poll results \u2014 may all be online. But extracting it from plain text and organizing it for quantitative analysis may be prohibitively time consuming.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Information extraction \u2014 or automatically classifying data items stored as plain text \u2014 is thus a major topic of artificial-intelligence research. Last week, at the Association for Computational Linguistics\u2019 Conference on Empirical Methods in Natural Language Processing, researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory won a best-paper award for a new approach to information extraction that turns conventional machine learning on its head.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Most machine-learning systems work by combing through training examples and looking for patterns that correspond to classifications provided by human annotators. 
For instance, humans might label parts of speech in a set of texts, and the machine-learning system will try to identify patterns that resolve ambiguities \u2014 for instance, when \u201cher\u201d is a direct object and when it\u2019s an adjective.<\/span><\/p>\n<p style=\"text-align: justify;\">[pullquote]A machine-learning system will generally assign each of its classifications a confidence score, which is a measure of the statistical likelihood that the classification is correct, given the patterns discerned in the training data.[\/pullquote]<\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Typically, computer scientists will try to feed their machine-learning systems as much training data as possible. That generally increases the chances that a system will be able to handle difficult problems.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In their new\u00a0<a style=\"color: #1155cc;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d8091A0-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=32869&amp;Action=Follow+Link\" target=\"_blank\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d8091A0-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D32869%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1478935292997000&amp;usg=AFQjCNEm4AOGxARQMGXakyebTpT-5senyA\" rel=\"noopener\"><span style=\"color: #000000;\">paper<\/span><\/a>, by contrast, the MIT researchers train their system on scanty data \u2014 because in the scenario they\u2019re investigating, that\u2019s usually all that\u2019s available. 
But then they turn that limited-information problem into one that is easier to solve.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cIn information extraction, traditionally, in natural-language processing, you are given an article and you need to do whatever it takes to extract correctly from this article,\u201d says Regina Barzilay, the Delta Electronics Professor of Electrical Engineering and Computer Science and senior author on the new paper. \u201cThat\u2019s very different from what you or I would do. When you\u2019re reading an article that you can\u2019t understand, you\u2019re going to go on the web and find one that you can understand.\u201d<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Confidence boost<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Essentially, the researchers\u2019 new system does the same thing. A machine-learning system will generally assign each of its classifications a confidence score, which is a measure of the statistical likelihood that the classification is correct, given the patterns discerned in the training data. With the researchers\u2019 new system, if the confidence score is too low, the system automatically generates a web search query designed to pull up texts likely to contain the data it\u2019s trying to extract.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">It then attempts to extract the relevant data from one of the new texts and reconciles the results with those of its initial extraction. 
If the confidence score remains too low, it moves on to the next text pulled up by the search string, and so on.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">\u201cThe base extractor isn\u2019t changing,\u201d says Adam Yala, a graduate student in the MIT Department of Electrical Engineering and Computer Science (EECS) and one of the coauthors on the new paper. \u201cYou\u2019re going to find articles that are easier for that extractor to understand. So you have something that\u2019s a very weak extractor, and you just find data that fits it automatically from the web.\u201d Joining Yala and Barzilay on the paper is first author Karthik Narasimhan, also a graduate student in EECS.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Remarkably, every decision the system makes is the result of machine learning. The system learns how to generate search queries, gauge the likelihood that a new text is relevant to its extraction task, and determine the best strategy for fusing the results of multiple attempts at extraction.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><strong>Just the facts<\/strong><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In experiments, the researchers applied their system to two extraction tasks. One was the collection of data on mass shootings in the U.S., which is an essential resource for any epidemiological study of the effects of gun-control measures. The other was the collection of similar data on instances of food contamination. The system was trained separately for each task.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">In the first case \u2014 the database of mass shootings \u2014 the system was asked to extract the name of the shooter, the location of the shooting, the number of people wounded, and the number of people killed. 
In the food-contamination case, it extracted food type, type of contaminant, and location. In each case, the system was trained on about 300 documents.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">From those documents, it learned clusters of search terms that tended to be associated with the data items it was trying to extract. For instance, the names of mass shooters were correlated with terms like \u201cpolice,\u201d \u201cidentified,\u201d \u201carrested,\u201d and \u201ccharged.\u201d During training, for each article the system was asked to analyze, it pulled up, on average, another nine or 10 news articles from the web.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The researchers compared their system\u2019s performance to that of several extractors trained using more conventional machine-learning techniques. For every data item extracted in both tasks, the new system outperformed its predecessors, usually by about 10 percent.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Of the vast wealth of information unlocked by the Internet, most is plain text. The data necessary to answer myriad questions \u2014 about, say, the correlations between the industrial use of certain chemicals and incidents of disease, or between patterns of news coverage and voter-poll results \u2014 may all be online. 
But extracting it from plain text and organizing it for quantitative analysis may be prohibitively time consuming.<\/p>\n","protected":false},"author":6,"featured_media":10497,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43,17],"tags":[],"class_list":["post-10496","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-computer-science","category-research"],"featured_image_urls":{"full":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0-150x150.jpg",150,150,true],"medium":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0-300x199.jpg",300,199,true],"medium_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"1536x1536":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"2048x2048":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"ultp_layout_landscape_large":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"ultp_layout_landscape":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"ultp_layout_portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"ultp_layout_square":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"newspaper-x-single-post":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/M
IT-Webaid-Learning_0.jpg",575,383,false],"newspaper-x-recent-post-big":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",540,360,false],"newspaper-x-recent-post-list-image":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",95,63,false],"web-stories-poster-portrait":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",575,383,false],"web-stories-publisher-logo":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",96,64,false],"web-stories-thumbnail":["https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/11\/MIT-Webaid-Learning_0.jpg",150,100,false]},"author_info":{"info":["Amrita Tuladhar"]},"category_info":"<a href=\"https:\/\/www.revoscience.com\/en\/category\/computer-science\/\" rel=\"category tag\">Computer Science<\/a> <a href=\"https:\/\/www.revoscience.com\/en\/category\/news\/research\/\" rel=\"category 
tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10496","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=10496"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10496\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/10497"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=10496"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=10496"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=10496"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}