Academic Talk Announcement: Sheng Di - Key Laboratory of Services Computing Technology and System, Ministry of Education

Academic Talk Announcement: Sheng Di

Posted: 2014-02-10 11:19:10

Title: Optimization of Fault Tolerance in Cloud Computing and High Performance Computing
Speaker: Dr. Sheng Di
Venue: Academic Lecture Hall 210, 2nd Floor, East Building No. 5
Time: February 11, 3:00 PM

Abstract:
　　This talk will tackle some new fault tolerance issues in the context of cloud computing and high performance computing. A recent characterization based on the Google trace indicates that cloud tasks with different priorities exhibit different failure probability distributions. The existing formula for computing the optimal checkpoint interval (Young's formula) assumes that failures follow a Poisson distribution. In contrast, our proposed formula is not only generic, because it makes no assumption about the failure probability distribution, but also attractively simple to apply in practice. Experiments using a real cluster with hundreds of virtual machines and the Google trace show significant performance gains with the new formula. Fault tolerance is also a key technology in current High Performance Computing (HPC). The new problem there is how to handle failures in exascale HPC executions running on millions of cores. Execution failures may occur due to multiple factors at different scales, from transient uncorrectable memory errors localized in individual processes to massive system outages. Multi-level checkpoint/restart is a promising model that provides an elastic response for tolerating different types of failures. It stores checkpoints at different levels, e.g., in local memory, in remote memory, using a software RAID, on a local SSD, or on a remote file system. How to optimize the checkpoint intervals for the different checkpoint levels will also be presented in this talk.
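
　　For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of Young's formula, which estimates the checkpoint interval as the square root of twice the checkpoint cost times the mean time between failures (MTBF). It is not the speaker's new formula, which is not given in this announcement. The function name `young_interval` and the sample numbers are hypothetical, chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    interval = sqrt(2 * C * MTBF), where C is the time to write one
    checkpoint and MTBF is the mean time between failures (both in seconds).
    The formula assumes failures arrive as a Poisson process.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical example: a 2 s checkpoint cost and a 1 hour MTBF.
interval = young_interval(2.0, 3600.0)
print(f"checkpoint roughly every {interval:.0f} s")
```

　　Because the estimate depends only on a single MTBF value, it cannot reflect the priority-dependent failure distributions described above, which is the limitation the talk sets out to address.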

Speaker Bio:
　　Dr. Di obtained his Ph.D. degree from the Department of Computer Science of The University of Hong Kong in November 2011, supervised by Professor Cho-Li Wang. Thereafter, he worked at INRIA (France) as a postdoctoral researcher for one and a half years, and he is now working at Argonne National Laboratory (USA) as a postdoctoral researcher. He has broad research interests, including optimization of resource allocation in cloud computing, fault tolerance in high performance computing, Grid computing, and P2P resource discovery. He has published more than 30 papers in various peer-reviewed journals and conference proceedings, including IEEE/ACM SC, IEEE IPDPS, IEEE TPDS, IEEE TCC, JPDC, etc. Dr. Di has served as a PC member for 6 conferences and as an invited external reviewer for 30+ conferences/journals. More details can be found at http://www.cs.hku.hk/~sdi. The talk is based on his recent papers published in SC'13 and IPDPS'14.