{"id":10744,"date":"2016-12-02T06:47:28","date_gmt":"2016-12-02T06:47:28","guid":{"rendered":"http:\/\/revoscience.com\/en\/?p=10744"},"modified":"2016-12-02T06:47:28","modified_gmt":"2016-12-02T06:47:28","slug":"computer-learns-recognize-sounds-watching-video","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/computer-learns-recognize-sounds-watching-video\/","title":{"rendered":"Computer learns to recognize sounds by watching video"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong>Machine-learning system doesn\u2019t require costly hand-annotated data.<\/strong><\/em><\/span><\/p>\n<figure id=\"attachment_10745\" aria-describedby=\"caption-attachment-10745\" style=\"width: 628px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-10745\" src=\"http:\/\/revoscience.com\/en\/wp-content\/uploads\/2016\/12\/MIT-SoundRec-1_0.jpg\" alt=\"The researchers\u2019 neural network was fed video from 26 terabytes of video data downloaded from the photo-sharing site Flickr. Researchers found the network can interpret natural sounds in terms of image categories. For instance, the network might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders. Image: Jose-Luis Olivares\/MIT\" width=\"628\" height=\"424\" title=\"\"><figcaption id=\"caption-attachment-10745\" class=\"wp-caption-text\">The researchers\u2019 neural network was fed video from 26 terabytes of video data downloaded from the photo-sharing site Flickr. Researchers found the network can interpret natural sounds in terms of image categories. For instance, the network might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.<br \/>Image: Jose-Luis Olivares\/MIT<\/figcaption><\/figure>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">CAMBRIDGE, Mass. &#8212;\u00a0In recent years, computers have gotten remarkably good at recognizing speech and images: Think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">But recognition of natural sounds \u2014 such as crowds cheering or waves crashing \u2014 has lagged behind. That\u2019s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data has to be first annotated by hand, which is prohibitively expensive for all but the highest-demand applications.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Sound recognition may be catching up, however, thanks to researchers at MIT\u2019s Computer Science and Artificial Intelligence Laboratory (CSAIL). 
Sound recognition may be catching up, however, thanks to researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a sound-recognition system that outperforms its predecessors but didn't require hand-annotated data during training.

Instead, the researchers trained the system on video. First, existing computer vision systems that recognize scenes and objects categorized the images in the video. The new system then found correlations between those visual categories and natural sounds.

"Computer vision has gotten so good that we can transfer it to other domains," says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and one of the paper's two first authors. "We're capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to learn to understand sound."
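The article describes this transfer at a high level; the sketch below is one minimal, illustrative way to set it up, not the authors' implementation. A frozen, pretrained image classifier (torchvision's ResNet-18 here is an assumption, the article doesn't name the vision networks) tags video frames, and a small audio network is trained to reproduce those tags from the soundtrack alone. The dataset shapes, the audio architecture, and all hyperparameters are placeholders.

```python
# A minimal sketch of the training idea (not the authors' code): a pretrained
# image classifier "teaches" an audio network by labeling video frames, and
# the audio network learns to predict those labels from the audio alone.
# Assumes paired (frame, waveform) tensors drawn from the same videos.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

NUM_CLASSES = 1000  # e.g., object categories; the paper also uses scene tags

# Teacher: a frozen, pretrained vision network that tags video frames.
teacher = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

# Student: a small 1-D convolutional network over raw audio waveforms.
student = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=32, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, NUM_CLASSES),
)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def train_step(frames, waveforms):
    """frames: (B, 3, 224, 224) video frames; waveforms: (B, 1, T) audio."""
    with torch.no_grad():
        target = F.softmax(teacher(frames), dim=1)   # soft visual labels
    pred = F.log_softmax(student(waveforms), dim=1)  # audio network's guess
    loss = F.kl_div(pred, target, reduction="batchmean")  # match the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For brevity the sketch uses a single teacher; as described later in the article, the second network was trained to predict both the object and the scene tags produced by the vision side.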
The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.

"Even humans are ambiguous," says Yusuf Aytar, the paper's other first author and a postdoc in the lab of Antonio Torralba, an MIT professor of electrical engineering and computer science and the paper's final co-author.

"We did an experiment with Carl," Aytar says. "Carl was looking at the computer monitor, and I couldn't see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details — 'Is it a restaurant?' — those details are missing. Even for annotation purposes, the task is really hard."

**Complementary modalities**

Because it takes far less power to collect and process audio data than it does to collect and process visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.

When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started, and the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.

"For instance, think of a self-driving car," Aytar says. "There's an ambulance coming, and the car doesn't see it. If it hears it, it can make future predictions for the ambulance — which path it's going to take — just purely based on sound."

**Visual language**

The researchers' machine-learning system is a neural network, so called because its architecture loosely resembles that of the human brain. A neural net consists of processing nodes that, like individual neurons, can perform only rudimentary computations but are densely interconnected. Information — say, the pixel values of a digital image — is fed to the bottom layer of nodes, which processes it and feeds the result to the next layer, which does the same, and so on up through the network. The training process continually modifies the settings of the individual nodes, until the output of the final layer reliably performs some classification of the data — say, identifying the objects in the image.
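As a concrete illustration of that description, here is a toy feedforward network and training loop (PyTorch is a convenient choice, not something the article specifies; the layer sizes and data are made up):

```python
# Toy version of the description above: pixel values enter the bottom layer,
# each layer transforms its input and passes it upward, and training nudges
# the nodes' settings (weights) until the top layer classifies the input.
import torch
import torch.nn as nn

net = nn.Sequential(            # layers of densely interconnected nodes
    nn.Linear(28 * 28, 128),    # bottom layer: takes the image's pixel values
    nn.ReLU(),                  # each node performs a rudimentary computation
    nn.Linear(128, 10),         # final layer: one output per object class
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

pixels = torch.rand(32, 28 * 28)       # a batch of fake 28x28 images
labels = torch.randint(0, 10, (32,))   # fake object labels

for _ in range(100):                   # "continually modifies the settings"
    loss = loss_fn(net(pixels), labels)
    optimizer.zero_grad()
    loss.backward()                    # how should each weight change?
    optimizer.step()                   # adjust the nodes' settings
```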
Vondrick, Aytar, and Torralba first trained a neural net on two large, annotated sets of images: one, the ImageNet data set, contains labeled examples of images of 1,000 different objects; the other, the Places data set created by Torralba's group, contains labeled images of 401 different scene types, such as a playground, bedroom, or conference room.

Once the network was trained, the researchers fed it 26 terabytes of video data downloaded from the photo-sharing site Flickr. "It's about 2 million unique videos," Vondrick says. "If you were to watch all of them back to back, it would take you about two years." Then they trained a second neural network on the audio from the same videos. The second network's goal was to correctly predict the object and scene tags produced by the first network.

The result was a network that could interpret natural sounds in terms of image categories. For instance, it might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.

**Benchmarking**

To compare the sound-recognition network's performance to that of its predecessors, however, the researchers needed a way to translate its language of images into the familiar language of sound names. So they trained a simple machine-learning system to associate the outputs of the sound-recognition network with a set of standard sound labels.

For that, the researchers did use a database of annotated audio — one with 50 categories of sound and about 2,000 examples. Those annotations had been supplied by humans. But it's much easier to label 2,000 examples than to label 2 million. And the MIT researchers' network, trained first on unlabeled video, significantly outperformed all previous networks trained solely on the 2,000 labeled examples.
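This benchmarking step can be sketched too. Assuming the trained audio network from the earlier sketch (`student`), its outputs on labeled clips become feature vectors, and a simple supervised classifier maps them to conventional sound labels. The linear SVM below is an illustrative stand-in for the "simple machine-learning system" the article mentions, and the data is synthetic; the real evaluation used a human-annotated set of about 2,000 clips in 50 categories.

```python
# Sketch of the benchmarking step: the unsupervised audio network's outputs
# (its image-category predictions) serve as features for a small supervised
# classifier trained on the modest hand-labeled audio set.
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def audio_features(student, waveforms):
    """Run labeled audio clips through the trained (frozen) audio network."""
    with torch.no_grad():
        return student(waveforms).numpy()  # (N, NUM_CLASSES) category scores

# Synthetic stand-in for the 50-category annotated set (short clips).
waveforms = torch.randn(2000, 1, 8000)         # 2,000 labeled audio clips
sound_labels = np.random.randint(0, 50, 2000)  # human-supplied labels

X = audio_features(student, waveforms)  # "student" from the earlier sketch
X_train, X_test, y_train, y_test = train_test_split(
    X, sound_labels, test_size=0.25)

clf = LinearSVC()             # the "simple machine-learning system"
clf.fit(X_train, y_train)     # only this step needs hand-labeled audio
print("accuracy:", clf.score(X_test, y_test))
```

The point of the design is that the expensive part of learning, extracting useful structure from audio, happens on the 2 million unlabeled videos; the 2,000 labels are needed only to name what the network already distinguishes.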
tag\">Research<\/a>","tag_info":"Research","comment_count":"0","_links":{"self":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10744","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/comments?post=10744"}],"version-history":[{"count":0,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/posts\/10744\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media\/10745"}],"wp:attachment":[{"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/media?parent=10744"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/categories?post=10744"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.revoscience.com\/en\/wp-json\/wp\/v2\/tags?post=10744"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}