Date: Dec 13, 2023
Time: 3:00 PM - 4:00 PM
Bio: Yang Chen is a Ph.D. student at Georgia Tech, supervised by Professors Alan Ritter and Wei Xu. His research interests lie in grounding the visual knowledge of multimodal large language models through retrieval-augmented generation. He has interned at Google DeepMind, working with Hexiang Hu and Ming-Wei Chang on visual entity reasoning using multimodal LLMs. Prior to that, he received an M.S. in Computer Science from the University of Chicago and a Bachelor's degree from the University of Melbourne.
Description: Multimodal Large Language Models (MLLMs) have demonstrated state-of-the-art capabilities in various tasks involving both images and text, including visual question answering. However, it remains unclear whether these MLLMs can answer information-seeking queries about an image, such as "When was this church built?"
In this talk, I will first introduce InfoSeek, a dataset tailored for visual information-seeking questions that cannot be answered using only common-sense knowledge. I will then present insights into the generalization and instruction-tuning of MLLMs using InfoSeek. Finally, I will discuss what the future holds for multimodal retrieval models and how MLLM-powered generative search engines could transform the existing search experience.
Project page at https://open-vision-language.github.io/infoseek/