Posted: 2023-07-05 09:54:59

Topic: Natural Language Processing for Materials Science

Speaker: Bang Liu, Assistant Professor, Department of Computer Science and Operations Research, University of Montreal, Canada

Time: July 10, 2023, 10:00 – 11:00 AM



In materials science, large amounts of heterogeneous data are produced every day, including scientific publications, lab reports, manuals, and tables. Natural language processing (NLP) therefore plays a key role in understanding and unlocking the rich datasets in materials science, especially for understanding scientific literature and extracting useful information from it. Capturing unstructured information from the vast and ever-growing number of scientific publications has substantial promise to enable the creation of experiment-based databases that are currently lacking and to meet the various needs of the materials domain. However, directly applying NLP techniques developed for the general domain to materials science does not yield satisfactory performance across tasks. The reasons include, but are not limited to, the following: i) the content and style of materials science literature differ from general-domain texts such as news articles, which degrades performance on NLP tasks; ii) understanding the literature requires significant in-domain expert knowledge; and iii) high-quality, large-scale labeled training datasets for NLP tasks in the materials science domain are lacking. In this talk, we will introduce our recent work on NLP for materials science. We first present a natural language benchmark (MatSci-NLP) and use it to study various BERT-based models in order to understand the impact of pretraining strategies on understanding materials science text. We then introduce an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct), which we apply to fine-tune a LLaMA-based language model (HoneyBee). MatSci-Instruct helps alleviate the scarcity of relevant, high-quality materials science textual data available in the open literature, and HoneyBee is the first billion-parameter language model specialized to materials science.


Bang Liu is an Assistant Professor in the Department of Computer Science and Operations Research (DIRO) at the University of Montreal. He is a core member of the RALI laboratory (Applied Research in Computer Linguistics) of DIRO, an associate member of Mila – Quebec Artificial Intelligence Institute, and a Canada CIFAR AI (CCAI) Chair. He received his B.Engr. degree in 2013 from the University of Science and Technology of China (USTC), and his M.S. and Ph.D. degrees from the University of Alberta in 2015 and 2020, respectively. His research interests lie primarily in natural language processing, multimodal and embodied learning, theory and techniques for AGI (e.g., understanding and improving large language models), and AI for science (e.g., health, materials science, XR).