{"id":13412,"date":"2017-10-24T09:12:59","date_gmt":"2017-10-24T09:12:59","guid":{"rendered":"https:\/\/www.revoscience.com\/en\/?p=13412"},"modified":"2017-10-24T09:12:59","modified_gmt":"2017-10-24T09:12:59","slug":"selective-memory","status":"publish","type":"post","link":"https:\/\/www.revoscience.com\/en\/selective-memory\/","title":{"rendered":"Selective memory"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><em><strong>Scheme would make new high-capacity data caches 33 to 50 percent more efficient.<\/strong><\/em><\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone  wp-image-13413\" src=\"https:\/\/www.revoscience.com\/en\/wp-content\/uploads\/2017\/10\/MIT-Fast-Cache_0.jpg\" alt=\"\" width=\"623\" height=\"420\" title=\"\">CAMBRIDGE, Mass. &#8212; In a traditional computer, a microprocessor is mounted on a \u201cpackage,\u201d a small circuit board with a grid of electrical leads on its bottom. The package snaps into the computer\u2019s motherboard, and data travels between the processor and the computer\u2019s main memory bank through the leads.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">As processors\u2019 transistor counts have gone up, the relatively slow connection between the processor and main memory has become the chief impediment to improving computers\u2019 performance. So, in the past few years, chip manufacturers have started putting dynamic random-access memory \u2014 or DRAM, the type of memory traditionally used for main memory \u2014 right on the chip package.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">The natural way to use that memory is as a high-capacity cache, a fast, local store of frequently used data. 
But DRAM is fundamentally different from the type of memory typically used for on-chip caches, and existing cache-management schemes don't use it efficiently.

At the recent IEEE/ACM International Symposium on Microarchitecture, researchers from MIT, Intel, and ETH Zurich presented a new cache-management scheme that improves the data rate of in-package DRAM caches by 33 to 50 percent.

"The bandwidth in this in-package DRAM can be five times higher than off-package DRAM," says Xiangyao Yu, a postdoc in MIT's Computer Science and Artificial Intelligence Laboratory and first author on the new paper. "But it turns out that previous schemes spend too much traffic accessing metadata or moving data between in- and off-package DRAM, not really accessing data, and they waste a lot of bandwidth. The performance is not the best you can get from this new technology."

Cache hash

By "metadata," Yu means data that describe where the data in the cache come from. In a modern computer chip, when a processor needs a particular chunk of data, it will check its local caches to see if the data is already there. Data in the caches are "tagged" with the addresses in main memory from which they are drawn; the tags are the metadata.

A typical on-chip cache might have room enough for 64,000 data items with 64,000 tags. Obviously, a processor doesn't want to search all 64,000 entries for the one it's interested in.
So cache systems usually organize data using something called a "hash table." When a processor seeks data with a particular tag, it first feeds the tag to a hash function, which processes it in a prescribed way to produce a new number. That number designates a slot in a table of data, and that slot is where the processor looks for the item it's interested in.

The point of a hash function is that very similar inputs produce very different outputs. That way, if a processor is relying heavily on data from a narrow range of addresses — if, for instance, it's performing a complicated operation on one section of a large image — that data is spread out across the cache so as not to cause a logjam at a single location.

Hash functions can, however, produce the same output for different inputs, which is all the more likely if they have to handle a wide range of possible inputs, as caching schemes do. So a cache's hash table will often store two or three data items under the same hash index. Searching two or three items for a given tag, however, is much better than searching 64,000.

Dumb memory

Here's where the difference between DRAM and SRAM, the technology used in standard caches, comes in. For every bit of data it stores, SRAM uses six transistors; DRAM uses one, which makes it much more space-efficient. But SRAM has some built-in processing capacity, and DRAM doesn't. If a processor wants to search an SRAM cache for a data item, it sends the tag to the cache.
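The hash-and-compare lookup described above can be sketched as a small simulation. The hash function, bucket count, and three-way bucket size here are illustrative assumptions, not the parameters of any real cache:

```python
# Toy hashed cache: each bucket holds a few (tag, data) pairs, mirroring
# how a hash index narrows the search to two or three entries.
NUM_BUCKETS = 1024
WAYS = 3  # entries that may share one hash index

def bucket_index(tag):
    # Illustrative multiply-shift hash: nearby tags land in very different
    # buckets, so a narrow address range doesn't pile up in one slot.
    return ((tag * 2654435761) >> 16) % NUM_BUCKETS

cache = [[] for _ in range(NUM_BUCKETS)]

def insert(tag, data):
    bucket = cache[bucket_index(tag)]
    if len(bucket) == WAYS:        # bucket full: evict the oldest entry
        bucket.pop(0)
    bucket.append((tag, data))

def lookup(tag):
    # Search only the handful of entries in the bucket, not all 64,000.
    for stored_tag, data in cache[bucket_index(tag)]:
        if stored_tag == tag:
            return data            # hit
    return None                    # miss: go to main memory
```

The key property is in `lookup`: the hash confines the tag comparison to one small bucket.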
The SRAM circuit itself compares the tag to those of the items stored at the corresponding hash location and, if it gets a match, returns the associated data.

DRAM, by contrast, can't do anything but transmit requested data. So the processor would request the first tag stored at a given hash location and, if it's a match, send a second request for the associated data. If it's not a match, it will request the second stored tag, and if that's not a match, the third, and so on, until it either finds the data it wants or gives up and goes to main memory.

In-package DRAM may have a lot of bandwidth, but this process squanders it. Yu and his colleagues — Srinivas Devadas, the Edwin Sibley Webster Professor of Electrical Engineering and Computer Science at MIT; Christopher Hughes and Nadathur Satish of Intel; and Onur Mutlu of ETH Zurich — avoid all that metadata transfer with a slight modification of a memory-management system found in most modern chips.

Any program running on a computer chip has to manage its own memory use, and it's generally handy to let the program act as if it has its own dedicated memory store. But in fact, multiple programs are usually running on the same chip at once, and they're all sending data to main memory at the same time.
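The serialized DRAM probe sequence just described can be sketched by counting transfers; the three-entry bucket and the one-transfer-per-tag cost model are illustrative assumptions:

```python
def dram_lookup(bucket, tag):
    """Serialized probing of a 'dumb' DRAM cache: each stored tag must be
    shipped back to the processor as a separate transfer before the
    processor can decide what to request next."""
    transfers = 0
    for stored_tag, data in bucket:
        transfers += 1                 # fetch the next tag
        if stored_tag == tag:
            transfers += 1             # second request: fetch the data
            return data, transfers
    return None, transfers             # miss: fall back to main memory

bucket = [(0xA, "x"), (0xB, "y"), (0xC, "z")]
print(dram_lookup(bucket, 0xC))   # ('z', 4): three tag reads plus one data read
print(dram_lookup(bucket, 0xF))   # (None, 3): every tag read, no match
```

An SRAM cache would resolve the same lookup in a single request-and-reply, because the comparison happens inside the cache; here every probe burns bandwidth on metadata.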
So each core, or processing unit, in a chip usually has a table that maps the virtual addresses used by individual programs to the actual addresses of data stored in main memory.

Look here

Yu and his colleagues' new system, dubbed Banshee, adds three bits of data to each entry in the table. One bit indicates whether the data at that virtual address can be found in the DRAM cache, and the other two indicate its location relative to any other data items with the same hash index.

"In the entry, you need to have the physical address, you need to have the virtual address, and you have some other data," Yu says. "That's already almost 100 bits. So three extra bits is a pretty small overhead."

There's one problem with this approach that Banshee also has to address: if one of a chip's cores pulls a data item into the DRAM cache, the other cores won't know about it. Sending messages to all of a chip's cores every time any one of them updates the cache would consume a good deal of time and bandwidth. So Banshee introduces another small circuit, called a tag buffer, where any given core can record the new location of a data item it caches.

Any request sent to either the DRAM cache or main memory by any core first passes through the tag buffer, which checks to see whether the requested tag is one whose location has been remapped. Only when the buffer fills up does Banshee notify all of a chip's cores that they need to update their virtual-memory tables.
Then it clears the buffer and starts over.

The buffer is small, only 5 kilobytes, so its addition would not use up too much valuable on-chip real estate. And the researchers' simulations show that the time required for one additional address lookup per memory access is trivial compared to the bandwidth savings Banshee affords.

By Amrita Tuladhar | Computer Science, Research