This repository provides the data and code for our NeurIPS 2024 paper:
Yinuo Jing, Ruxu Zhang, Kongming Liang*, Yongxiang Li, Zhongjiang He, Zhanyu Ma and Jun Guo, "Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding", in Proceedings of Neural Information Processing Systems (NeurIPS), 2024.
![image](https://private-user-images.githubusercontent.com/45665026/381807796-ea30d59d-32f2-4fda-b563-d87df05d8fba.png)
Previous benchmarks (left) relied on a limited set of agents, and the scenarios in editing-based benchmarks are unrealistic. Our proposed Animal-Bench (right) includes diverse animal agents and various realistic scenarios, and encompasses 13 different tasks.
Task Demonstration
![image](https://private-user-images.githubusercontent.com/45665026/381808771-2eccd62f-02a4-4d5a-a248-cea40d37d06c.png)
Effectiveness evaluation results:
Robustness evaluation results:
Data: You can access and download the MammalNet, Animal Kingdom, LoTE-Animal, MSRVTT-QA, TGIF-QA, and NExT-QA datasets to obtain the data used in the paper, or you can use your own data.
Annotations: You can find our question-answer pair annotation files in /data.
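As a rough illustration, the snippet below sketches how one of the annotation files might be loaded and iterated. The file name and the field names (`video`, `question`, `candidates`, `answer`) are assumptions for illustration only; check the actual files in /data for the exact schema.

```python
import json
from pathlib import Path

# Hypothetical annotation file name; see /data for the actual files and schema.
ann_path = Path("data/action_recognition.json")

with ann_path.open("r", encoding="utf-8") as f:
    annotations = json.load(f)

for item in annotations:
    # Assumed fields: adjust the keys to match the real annotation format.
    video_id = item.get("video")
    question = item.get("question")
    options = item.get("candidates", [])
    answer = item.get("answer")
    print(video_id, question, options, answer)
```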
Models: Our test code is mainly based on MVBench. To evaluate your own model, you can follow the structure of the model files in the /model folder.
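As a minimal sketch of what such a model file could look like (the class name, method signature, and prompt formatting below are assumptions, not the repository's actual interface), an MVBench-style multiple-choice evaluation typically wraps the model behind a single question-answering call and measures accuracy against the ground-truth option:

```python
class MyVideoModel:
    """Hypothetical wrapper; adapt it to the interface used by the files in /model."""

    def __init__(self, checkpoint_path: str):
        self.checkpoint_path = checkpoint_path
        # Load your multimodal video model here.

    def answer(self, video_path: str, question: str, options: list[str]) -> str:
        # Build a multiple-choice prompt and return the model's chosen option text.
        prompt = question + "\nOptions:\n" + "\n".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)
        )
        # ... run inference on the video and prompt here ...
        return options[0]  # placeholder prediction


def evaluate(model: MyVideoModel, annotations: list[dict]) -> float:
    # Accuracy over multiple-choice question-answer pairs (assumed annotation keys).
    correct = 0
    for item in annotations:
        pred = model.answer(item["video"], item["question"], item["candidates"])
        correct += int(pred.strip() == item["answer"].strip())
    return correct / max(len(annotations), 1)
```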
Acknowledgements: We thank the following open-source projects: Chat-UniVi, mPLUG-Owl, Valley, VideoChat, VideoChat2, Video-ChatGPT, Video-LLaMA, and Video-LLaVA.