We develop ECVA, a new benchmark for causation understanding of video anomalies. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, ours is more comprehensive and more challenging, with much higher-quality annotations. This work is an extension of "Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly" (CVPR 2024).
Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
- Installation
- Benchmark and Evaluation Metric
- Train Dataset Preparation
- Model Training
- Inference
- Acknowledgement
- License
To install and set up the environment, follow these steps:
git clone git@github.com:Dulpy/ECVA.git
cd ECVA
pip install -r requirements.txt
- We develop ECVA, a new benchmark for causation understanding of video anomaly. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, our dataset is more comprehensive and more challenging, with much higher-quality annotations.
- Our ECVA dataset contains 2240 video clips and 6720 question-answer pairs. The total length of these videos is 88.16 hours, and the average number of frames per video is 8460. Frames are extracted from the original videos at a rate of 60 FPS. The videos encompass a wide range of domains.
- You can download the original video data from this link: Download Original Video Data
- The ECVA video data, along with its annotations, can be found at https://www.modelscope.cn/datasets/gouchenyi/ECVA/files; a minimal programmatic download sketch follows this list.
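If you prefer to fetch the data programmatically, the snippet below is a minimal sketch using the ModelScope Python SDK (`pip install modelscope`); the dataset namespace and name are taken from the link above, but the exact `MsDataset.load` arguments may need adjusting to your ModelScope version, and downloading the files manually from the dataset page works just as well.

```python
# Minimal sketch: pull the ECVA data hosted on ModelScope.
# Assumes `pip install modelscope`; the namespace/name are taken from the
# dataset URL above, and the call may need adjusting to your modelscope version.
from modelscope.msdatasets import MsDataset

ecva = MsDataset.load('ECVA', namespace='gouchenyi')
print(ecva)
```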
The proposed evaluation metric measures the performance of the model comprehensively through the following three aspects (a minimal GPT-scoring sketch follows the list):
- In the Basic Reasoning part, we use GPT to assess whether the candidate answers comprehensively cover all key phrases and to rate the answers based on their logical coherence.
- For the Consistency evaluation, we leverage GPT's binary judgments to score the candidate answers.
- As for the Hallucination part, we remove key frames from the video, feed the edited video to the VLM, and observe how consistent the model's responses are with and without the key frames.
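To make the GPT-assisted part of the metric concrete, here is a minimal scoring sketch for the Basic Reasoning aspect, assuming the `openai` Python package (v1 interface); the judging prompt, model name, and 0-10 scale are illustrative assumptions, and the evaluation script in this repository defines the actual protocol.

```python
# Minimal sketch of GPT-assisted scoring (Basic Reasoning aspect).
# The prompt wording, model name, and 0-10 scale are assumptions for
# illustration; the official evaluation script defines the real protocol.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # fill in your own key

def judge_answer(question: str, reference: str, candidate: str) -> str:
    prompt = (
        "You are grading an answer about a video anomaly.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate cover the key phrases of the reference, and is it "
        "logically coherent? Reply with a score from 0 to 10 and a one-sentence reason."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```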
Organize each of your model's video responses into the following format:
[
    {
        "video_file": "00001.mp4",
        "prompt": "Give a detailed description of the anomalous segment in the video. Please remember to describe the details of the incident",
        "output": "your model's response to this prompt",
        "task_type": "Description",
        "human_expert_answer": "The standard answer for the task"
    }
]
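Before running the scoring scripts below, you can collect your model's responses into this format with a small helper like the following sketch; `run_model`, the task list, and the output file name are placeholders for your own inference code and data.

```python
# Minimal sketch: dump model responses into the evaluation format above.
# `run_model`, `tasks`, and the output path are placeholders; only the keys
# in each result dict need to match the required format.
import json

def run_model(video_file: str, prompt: str) -> str:
    # Placeholder for your own model's inference call.
    return "your model's response to this prompt"

tasks = [
    {
        "video_file": "00001.mp4",
        "prompt": "Give a detailed description of the anomalous segment in the video. "
                  "Please remember to describe the details of the incident",
        "task_type": "Description",
        "human_expert_answer": "The standard answer for the task",
    },
]

results = []
for task in tasks:
    results.append({
        "video_file": task["video_file"],
        "prompt": task["prompt"],
        "output": run_model(task["video_file"], task["prompt"]),
        "task_type": task["task_type"],
        "human_expert_answer": task["human_expert_answer"],
    })

with open("model_responses.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```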
Prepare the model's answers and our benchmark answers, then use the script here to score them with the GPT assistant. Because GPT is used to assist in the evaluation, you will need to fill in your own API key in the relevant configuration file.
Prepare the model's answers and our benchmark answers, then use the script here to evaluate them with BLEU, ROUGE, BLEURT, and UniEval.
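For a rough sense of the reference-based metrics, the sketch below computes BLEU and ROUGE with the `nltk` and `rouge-score` packages on a single prediction/reference pair; BLEURT and UniEval need their own checkpoints and are omitted here, and the repository's evaluation script remains the authoritative implementation.

```python
# Minimal sketch: reference-based scores for one prediction/reference pair.
# Requires `pip install nltk rouge-score`; BLEURT and UniEval need their own
# packages and model checkpoints, so they are not shown here.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "A worker slips on the wet floor and knocks over a shelf."
prediction = "A man slips on a wet floor and a shelf falls over."

# BLEU over whitespace tokens, smoothed so short sentences do not score 0.
bleu = sentence_bleu(
    [reference.split()], prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```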
We introduce a novel video large language model named Anomaly Shield (AnomShield), which is designed to address the three challenges presented by ECVA. You can re-organize the annotated video/image SFT data according to the following format and place the image/video data under ECVA/datasets/pretraining/ and ECVA/datasets/videosft/; a small format-check sketch follows the example.
[
{
"id": 0,
"video": "images/xxx.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
...
]
},
{
"id": 1,
"video": "videos/xxx.mp4",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat are the main activities that take place in the video?"
},
{
"from": "gpt",
"value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
},
...
]
},
...
]
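The snippet below is a small sanity-check sketch for an annotation file in this format; the file path is a placeholder, and the assumption that conversations alternate human/gpt turns is taken from the example above.

```python
# Minimal sketch: sanity-check an SFT annotation file against the format above.
# The path is a placeholder; the alternating human/gpt order is an assumption
# based on the example in this README.
import json

ANNOTATION_FILE = "ECVA/datasets/videosft/annotations.json"  # placeholder path

with open(ANNOTATION_FILE) as f:
    samples = json.load(f)

for sample in samples:
    assert {"id", "video", "conversations"} <= sample.keys(), sample
    for i, turn in enumerate(sample["conversations"]):
        expected = "human" if i % 2 == 0 else "gpt"
        assert turn["from"] == expected, (sample["id"], i, turn["from"])
        assert isinstance(turn["value"], str)

print(f"Checked {len(samples)} samples.")
```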
- For the vision encoder, AnomShield, like most multi-modal large models, uses the CLIP series as its visual encoder. You can download the related pre-trained weights from openai/clip-vit-large-patch14.
- For the base model, we utilize the powerful Mistral series to help analyze the video content and provide reliable, accurate answers. You can download the related pre-trained weights from mistralai/Mistral-7B-Instruct-v0.2. A sketch for downloading both checkpoints follows.
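Both checkpoints are available on the Hugging Face Hub; the sketch below fetches them with `huggingface_hub.snapshot_download`, where the local directory names are illustrative choices rather than paths required by the training scripts (gated repos such as the Mistral weights may also require `huggingface-cli login`).

```python
# Minimal sketch: download the vision encoder and base LLM checkpoints from the
# Hugging Face Hub. The local_dir values are illustrative; point the training
# scripts at wherever you actually keep the weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/clip-vit-large-patch14",
    local_dir="checkpoints/clip-vit-large-patch14",
)
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="checkpoints/Mistral-7B-Instruct-v0.2",
)
```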
cd ECVA/scripts/vllava/mamba/
./pretrain.sh
cd ECVA/scripts/vllava/mamba/
./finetune.sh
Video/Image Inference. We have inherited the inference code from VideoLLaMA2. You can refer to inference.ipynb to run model inference; you need to prepare the relevant model weights according to the instructions in the notebook.
cd ECVA/
jupyter notebook inference.ipynb
The codebase of ECVA is adapted from VideoLLaMA2. We are grateful for the foundational work done by the VideoLLaMA2 team, which has significantly contributed to the development of this project.
If you find our work useful for your research, please consider citing:
@article{du2024exploring,
  title={Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly},
  author={Du, Hang and Nan, Guoshun and Qian, Jiawen and Wu, Wangchenhui and Deng, Wendi and Mu, Hanqing and Chen, Zhenyan and Mao, Pengxuan and Tao, Xiaofeng and Liu, Jun},
  journal={arXiv preprint arXiv:2412.07183},
  year={2024}
}