Title: RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems
Speaker: Song Jiang, University of Texas at Arlington, USA
Time: July 15, 2019, 10:00 a.m. to 12:00 noon
Venue: Lecture Hall 210, 2nd Floor, East Building 5
Abstract:
I/O deduplication is a key technique for improving the space and I/O efficiency of storage systems. Among the various deduplication techniques, content-defined chunking (CDC) based deduplication is the most desirable because of its high deduplication ratio. However, its chunking operation is slow and may become a performance bottleneck. Currently, a choice has to be made between a high deduplication ratio and high speed.
In this talk, I will show how to leverage locality in the duplicate chunks to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The proposed deduplication method, named RapidCDC, has two salient features. The first is that its efficiency is positively correlated with the deduplication ratio. The second is that its high efficiency does not heavily depend on the existence of the locality. Our experimental results with synthetic and real-world datasets show that RapidCDC can improve chunking speed by up to 33× over regular CDC while maintaining (almost) the same deduplication ratio.
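The abstract does not spell out the mechanism, so the following Python sketch only illustrates the general idea of exploiting duplicate locality, under loudly stated assumptions: a hypothetical `hints` table remembers, for each chunk fingerprint, the size of the chunk that followed it last time, and when a chunk is recognized as a duplicate the chunker jumps to the predicted boundary and verifies it with one test instead of scanning byte by byte. The names `is_cut`, `chunk_stream`, and `hints`, the CRC32 window hash, and all constants are illustrative choices, not details taken from RapidCDC.

```python
import hashlib
import random
import zlib

WINDOW = 16    # bytes of context that decide a cut point (assumed value)
MIN_SIZE = 16  # minimum chunk size, kept >= WINDOW (assumed value)
MASK = 0x3F    # cut when the low 6 hash bits are zero (assumed value)


def is_cut(data, end):
    """Check the boundary condition for a chunk ending just before `end`."""
    if end >= len(data):
        return True  # end of input always closes the chunk
    window = data[max(0, end - WINDOW):end]
    return zlib.crc32(window) & MASK == 0


def chunk_stream(data, hints=None):
    """Split `data` into chunks; return (chunks, number of boundary tests).

    `hints` maps a chunk fingerprint to the size of the chunk that followed
    it last time (a hypothetical structure). With hints=None this is plain
    byte-by-byte CDC.
    """
    chunks, tests, start, prev_fp = [], 0, 0, None
    while start < len(data):
        end = None
        if hints is not None and prev_fp in hints:
            # Previous chunk is a known duplicate: jump to the predicted
            # boundary and verify it with a single test.
            guess = start + hints[prev_fp]
            tests += 1
            if guess <= len(data) and is_cut(data, guess):
                end = guess
        if end is None:
            # No hint, or the hint failed: scan one byte at a time.
            end = min(start + MIN_SIZE, len(data))
            while end < len(data):
                tests += 1
                if is_cut(data, end):
                    break
                end += 1
        if hints is not None and prev_fp is not None:
            hints[prev_fp] = end - start  # remember what followed prev chunk
        prev_fp = hashlib.sha1(data[start:end]).digest()
        chunks.append(data[start:end])
        start = end
    return chunks, tests


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))

hints = {}
first_chunks, first_tests = chunk_stream(data, hints)    # cold: no hints yet
second_chunks, second_tests = chunk_stream(data, hints)  # warm: all duplicates
print(len(first_chunks), first_tests, second_tests)
```

In this toy setup the second pass over identical data needs roughly one boundary test per chunk, which mirrors the claim that the savings grow with the amount of duplicate data. A real system would use a rolling hash, enforce a maximum chunk size, and handle failed predictions more carefully than this sketch does.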
About the speaker:
Dr. Song Jiang is an associate professor in the CSE department at the University of Texas at Arlington. His research interests include system infrastructure for big data processing, such as file and storage systems and data management systems, as well as I/O systems for high-performance computing. He received a US National Science Foundation (NSF) CAREER award in 2009, and his research activities have been continuously supported by the NSF. He has served on many conference program committees and proposal review panels. He has collaborated on projects at Facebook and Baidu on providing high-quality Internet-wide services based on big data, resulting in many significant publications at top-tier conferences. Dr. Jiang’s research has had substantial impact in industry, where several of his proposed algorithms for memory and storage management have been officially adopted into mainstream systems, including the Linux kernel, the NetBSD kernel, and the storage engine of MySQL.