{"id":10807,"date":"2016-12-08T06:26:38","date_gmt":"2016-12-08T06:26:38","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=10807"},"modified":"2016-12-08T07:11:39","modified_gmt":"2016-12-08T07:11:39","slug":"learning-words-pictures","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/learning-words-pictures\/","title":{"rendered":"Learning words from pictures"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong>System correlates recorded speech with images, could lead to fully automated speech recognition.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_10808\" aria-describedby=\"caption-attachment-10808\" style=\"width: 639px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-10808\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-Speech-Pictures_0.jpg\" alt=\"MIT researchers have developed a new approach to training speech-recognition systems that doesn\u2019t depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. Image: MIT News\" width=\"639\" height=\"426\" title=\"\" srcset=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-Speech-Pictures_0.jpg 639w, https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-Speech-Pictures_0-300x200.jpg 300w\" sizes=\"auto, (max-width: 639px) 100vw, 639px\" \/><figcaption id=\"caption-attachment-10808\" class=\"wp-caption-text\">MIT researchers have developed a new approach to training speech-recognition systems that doesn\u2019t depend on transcription. 
Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings.<br \/>Image: MIT News<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">CAMBRIDGE, Mass. &#8212;\u00a0Speech recognition systems, such as those that convert speech to text on cellphones, are generally the result of machine learning. A computer pores through thousands or even millions of audio files and their transcriptions, and learns which acoustic features correspond to which typed words.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">But transcribing recordings is costly, time-consuming work, which has limited speech recognition to a small subset of languages spoken in wealthy nations.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">At the <a style=\"color: #000000;\" href=\"http:\/\/mit.pr-optout.com\/Tracking.aspx?Data=HHL%3d80%3a0%3a0-%3eLCE9%3b4%3b8%3f%26SDG%3c90%3a.&amp;RE=MC&amp;RI=4334046&amp;Preview=False&amp;DistributionActionID=33320&amp;Action=Follow+Link\" target=\"_blank\" data-saferedirecturl=\"https:\/\/www.google.com\/url?hl=en&amp;q=http:\/\/mit.pr-optout.com\/Tracking.aspx?Data%3DHHL%253d80%253a0%253a0-%253eLCE9%253b4%253b8%253f%2526SDG%253c90%253a.%26RE%3DMC%26RI%3D4334046%26Preview%3DFalse%26DistributionActionID%3D33320%26Action%3DFollow%2BLink&amp;source=gmail&amp;ust=1481261293826000&amp;usg=AFQjCNGork0_Aw6x81x6ru8wF6RjposvHA\" rel=\"noopener\">Neural Information Processing Systems conference<\/a> this week, researchers from MIT\u2019s Computer Science and Artificial Intelligence Laboratory (CSAIL) are presenting a new approach to training speech-recognition systems that doesn\u2019t depend on transcription. Instead, their system analyzes correspondences between images and spoken descriptions of those images, as captured in a large collection of audio recordings. 
The system then learns which acoustic features of the recordings correlate with which image characteristics.

“The goal of this work is to try to get the machine to learn language more like the way humans do,” says Jim Glass, a senior research scientist at CSAIL and a co-author on the paper describing the new system. “The current methods that people use to train up speech recognizers are very supervised. You get an utterance, and you’re told what’s said. And you do this for a large body of data.

“Big advances have been made — Siri, Google — but it’s expensive to get those annotations, and people have thus focused on, really, the major languages of the world. There are 7,000 languages, and I think less than 2 percent have ASR [automatic speech recognition] capability, and probably nothing is going to be done to address the others. So if you’re trying to think about how technology can be beneficial for society at large, it’s interesting to think about what we need to do to change the current situation.
And the approach we’ve been taking through the years is looking at what we can learn with less supervision.”

Joining Glass on the paper are first author David Harwath, a graduate student in electrical engineering and computer science (EECS) at MIT; and Antonio Torralba, an EECS professor.

Visual semantics

The version of the system reported in the new paper doesn’t correlate recorded speech with written text; instead, it correlates speech with groups of thematically related images. But that correlation could serve as the basis for others.

If, for instance, an utterance is associated with a particular class of images, and the images have text terms associated with them, it should be possible to find a likely transcription of the utterance, all without human intervention. Similarly, a class of images with associated text terms in different languages could provide a way to do automatic translation.

Conversely, text terms associated with similar clusters of images, such as, say, “storm” and “clouds,” could be inferred to have related meanings. Because the system in some sense learns words’ meanings — the images associated with them — and not just their sounds, it has a wider range of potential applications than a standard speech recognition system.

To test their system, the researchers used a database of 1,000 images, each of which had a recording of a free-form verbal description associated with it.
They would feed their system one of the recordings and ask it to retrieve the 10 images that best matched it. That set of 10 images would contain the correct one 31 percent of the time.

“I always emphasize that we’re just taking baby steps here and have a long way to go,” Glass says. “But it’s an encouraging start.”

The researchers trained their system on images from a huge database built by Torralba; Aude Oliva, a principal research scientist at CSAIL; and their students.
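That evaluation is a top-10 retrieval measurement, often called recall@10. Assuming a score matrix in which recording i’s correct image is image i, it can be computed with a short helper like the following (a hypothetical sketch, not the authors’ evaluation code):

```python
import numpy as np

def recall_at_k(scores, k=10):
    """scores[i, j] is the similarity between recording i and image j,
    with image i being the correct match for recording i. Returns the
    fraction of recordings whose correct image appears in the top k."""
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best-scoring k images per recording
    hits = [i in top_k[i] for i in range(len(scores))]  # was the correct image retrieved?
    return sum(hits) / len(hits)
```

On the researchers’ 1,000-image test set, this measure came out at 0.31 for k = 10.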
Through Amazon’s Mechanical Turk crowdsourcing site, they hired people to describe the images verbally, using whatever phrasing came to mind, for about 10 to 20 seconds.

For an initial demonstration of the researchers’ approach, that kind of tailored data was necessary to ensure good results. But the ultimate aim is to train the system using digital video, with minimal human involvement. “I think this will extrapolate naturally to video,” Glass says.

Merging modalities

To build their system, the researchers used neural networks, machine-learning systems that approximately mimic the structure of the brain. Neural networks are composed of processing nodes that, like individual neurons, are capable of only very simple computations but are connected to each other in dense networks. Data is fed to a network’s input nodes, which modify it and feed it to other nodes, which modify it and feed it to still other nodes, and so on.
When a neural network is being trained, it constantly modifies the operations executed by its nodes in order to improve its performance on a specified task.

The researchers’ network is, in effect, two separate networks: one that takes images as input and one that takes spectrograms, which represent audio signals as changes of amplitude, over time, in their component frequencies. The output of the top layer of each network is a 1,024-dimensional vector — a sequence of 1,024 numbers.

The final node in the network takes the dot product of the two vectors. That is, it multiplies the corresponding terms in the vectors together and adds them all up to produce a single number. During training, the networks had to try to maximize the dot product when the audio signal corresponded to an image and minimize it when it didn’t.

For every spectrogram that the researchers’ system analyzes, it can identify the points at which the dot-product peaks.
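The two-branch design and its dot-product objective can be sketched in a few lines. This is a deliberately minimal illustration, assuming single linear layers, made-up input sizes, and a standard margin ranking loss standing in for whatever training objective the authors actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 1024  # each branch outputs a 1,024-dimensional vector

# Stand-ins for the two networks: one maps a flattened image to an
# embedding, the other maps a flattened spectrogram to an embedding.
# The real branches are deep networks; single linear layers with
# invented input sizes suffice to show the shared output space.
W_image = rng.normal(scale=0.01, size=(EMBED_DIM, 4096))
W_audio = rng.normal(scale=0.01, size=(EMBED_DIM, 2048))

def embed_image(pixels):
    return W_image @ pixels        # -> (1024,)

def embed_audio(spectrogram):
    return W_audio @ spectrogram   # -> (1024,)

def similarity(pixels, spectrogram):
    # The final node: the dot product of the two embeddings, i.e.
    # multiply corresponding terms and add them all up.
    return float(embed_image(pixels) @ embed_audio(spectrogram))

def ranking_loss(pixels, matched, mismatched, margin=1.0):
    # Training pushes matched image/audio pairs to score higher than
    # mismatched ones by at least `margin`, one common way to express
    # the maximize/minimize objective described above.
    return max(0.0, margin
               - similarity(pixels, matched)
               + similarity(pixels, mismatched))
```

With branches trained this way, a recording and the image it describes end up with a large dot product, which is what makes retrieving the right image for a given recording possible.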
In experiments, those peaks reliably picked out words that provided accurate image labels — “baseball,” for instance, in a photo of a baseball pitcher in action, or “grassy” and “field” for an image of a grassy field.

In ongoing work, the researchers have refined the system so that it can pick out spectrograms of individual words and identify just those regions of an image that correspond to them.