We develop ECVA, a new benchmark for causation understanding of video anomalies. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, ours is more comprehensive and more challenging, with much higher-quality annotations. This work is an extension of "Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly" (CVPR 2024).
Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly
- Installation
- Benchmark and Evaluation Metric
- Train Dataset Preparation
- Model Training
- Inference
- Acknowledgement
- License
To install and set up the environment, follow these steps:
git clone git@github.com:Dulpy/ECVA.git
cd ECVA
pip install -r requirements.txt
- We develop ECVA, a new benchmark for causation understanding of video anomaly. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, our dataset is more comprehensive and more challenging, with much higher-quality annotations.
- Our ECVA dataset contains 2240 video clips and 6720 question-answer pairs. The total length of these videos is 88.16 hours, and the average number of frames per video is 8460. Frames are extracted from the original videos at a rate of 60 FPS. The videos encompass a wide range of domains.
- You can download the original video data from this link: Download Original Video Data
- The ECVA video data, along with its annotations, can be found at https://www.modelscope.cn/datasets/gouchenyi/ECVA/files; a minimal programmatic download sketch follows this list.
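If you prefer to fetch the data programmatically, the snippet below is a minimal sketch using the ModelScope Python SDK (`pip install modelscope`); the dataset namespace and name are taken from the link above, but the exact `MsDataset.load` arguments may need adjusting to your ModelScope version, and downloading the files manually from the dataset page works just as well.

```python
# Minimal sketch: pull the ECVA data hosted on ModelScope.
# Assumes `pip install modelscope`; the namespace/name are taken from the
# dataset URL above, and the call may need adjusting to your modelscope version.
from modelscope.msdatasets import MsDataset

ecva = MsDataset.load('ECVA', namespace='gouchenyi')
print(ecva)
```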
The proposed evaluation metric measures the performance of the model comprehensively through the following three aspects (a minimal GPT-scoring sketch follows the list):
- In the Basic Reasoning part, we use GPT to assess whether the candidate answers comprehensively cover all key phrases and to rate the answers based on their logical coherence.
- For the Consistency evaluation, we leverage GPT's binary judgments to score the candidate answers.
- As for the Hallucination part, we remove key frames from the video, feed the edited video to the VLM, and observe how consistent the model's responses are with and without the key frames.
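To make the GPT-assisted part of the metric concrete, here is a minimal scoring sketch for the Basic Reasoning aspect, assuming the `openai` Python package (v1 interface); the judging prompt, model name, and 0-10 scale are illustrative assumptions, and the evaluation script in this repository defines the actual protocol.

```python
# Minimal sketch of GPT-assisted scoring (Basic Reasoning aspect).
# The prompt wording, model name, and 0-10 scale are assumptions for
# illustration; the official evaluation script defines the real protocol.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")  # fill in your own key

def judge_answer(question: str, reference: str, candidate: str) -> str:
    prompt = (
        "You are grading an answer about a video anomaly.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate cover the key phrases of the reference, and is it "
        "logically coherent? Reply with a score from 0 to 10 and a one-sentence reason."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```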
Organize each of your model's video responses into the following format:
[
    {
        "video_file": "00001.mp4",
        "prompt": "Give a detailed description of the anomalous segment in the video. Please remember to describe the details of the incident",
        "output": "your model's response to this prompt",
        "task_type": "Description",
        "human_expert_answer": "The standard answer for the task"
    }
]
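Before running the scoring scripts below, you can collect your model's responses into this format with a small helper like the following sketch; `run_model`, the task list, and the output file name are placeholders for your own inference code and data.

```python
# Minimal sketch: dump model responses into the evaluation format above.
# `run_model`, `tasks`, and the output path are placeholders; only the keys
# in each result dict need to match the required format.
import json

def run_model(video_file: str, prompt: str) -> str:
    # Placeholder for your own model's inference call.
    return "your model's response to this prompt"

tasks = [
    {
        "video_file": "00001.mp4",
        "prompt": "Give a detailed description of the anomalous segment in the video. "
                  "Please remember to describe the details of the incident",
        "task_type": "Description",
        "human_expert_answer": "The standard answer for the task",
    },
]

results = []
for task in tasks:
    results.append({
        "video_file": task["video_file"],
        "prompt": task["prompt"],
        "output": run_model(task["video_file"], task["prompt"]),
        "task_type": task["task_type"],
        "human_expert_answer": task["human_expert_answer"],
    })

with open("model_responses.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```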
Prepare the model's answers and our benchmark answers, then use the script here to score them with the GPT assistant. Because GPT is used to assist in the evaluation, you will need to fill in your own API key in the relevant configuration file.
Prepare the model's answers and our benchmark answers, then use the script here to evaluate them with BLEU, ROUGE, BLEURT, and UniEval.
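For a rough sense of the reference-based metrics, the sketch below computes BLEU and ROUGE with the `nltk` and `rouge-score` packages on a single prediction/reference pair; BLEURT and UniEval need their own checkpoints and are omitted here, and the repository's evaluation script remains the authoritative implementation.

```python
# Minimal sketch: reference-based scores for one prediction/reference pair.
# Requires `pip install nltk rouge-score`; BLEURT and UniEval need their own
# packages and model checkpoints, so they are not shown here.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "A worker slips on the wet floor and knocks over a shelf."
prediction = "A man slips on a wet floor and a shelf falls over."

# BLEU over whitespace tokens, smoothed so short sentences do not score 0.
bleu = sentence_bleu(
    [reference.split()], prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```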
We introduce a novel video large language model named Anomaly Shield (AnomShield), which is designed to address the three challenges presented by ECVA. You can re-organize the annotated video/image SFT data according to the following format and place the image/video data under ECVA/datasets/pretraining/ and ECVA/datasets/videosft/; a small format-check sketch follows the example.
[
{
"id": 0,
"video": "images/xxx.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
...
]
},
{
"id": 1,
"video": "videos/xxx.mp4",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat are the main activities that take place in the video?"
},
{
"from": "gpt",
"value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
},
...
]
},
...
]
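The snippet below is a small sanity-check sketch for an annotation file in this format; the file path is a placeholder, and the assumption that conversations alternate human/gpt turns is taken from the example above.

```python
# Minimal sketch: sanity-check an SFT annotation file against the format above.
# The path is a placeholder; the alternating human/gpt order is an assumption
# based on the example in this README.
import json

ANNOTATION_FILE = "ECVA/datasets/videosft/annotations.json"  # placeholder path

with open(ANNOTATION_FILE) as f:
    samples = json.load(f)

for sample in samples:
    assert {"id", "video", "conversations"} <= sample.keys(), sample
    for i, turn in enumerate(sample["conversations"]):
        expected = "human" if i % 2 == 0 else "gpt"
        assert turn["from"] == expected, (sample["id"], i, turn["from"])
        assert isinstance(turn["value"], str)

print(f"Checked {len(samples)} samples.")
```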
- For the vision encoder, AnomShield, like most multi-modal large models, uses the CLIP series as its visual encoder. You can download the related pre-trained weights from openai/clip-vit-large-patch14.
- For the base model, we utilize the powerful Mistral series to help analyze the video content and provide reliable, accurate answers. You can download the related pre-trained weights from mistralai/Mistral-7B-Instruct-v0.2. A sketch for downloading both checkpoints follows.
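Both checkpoints are available on the Hugging Face Hub; the sketch below fetches them with `huggingface_hub.snapshot_download`, where the local directory names are illustrative choices rather than paths required by the training scripts (gated repos such as the Mistral weights may also require `huggingface-cli login`).

```python
# Minimal sketch: download the vision encoder and base LLM checkpoints from the
# Hugging Face Hub. The local_dir values are illustrative; point the training
# scripts at wherever you actually keep the weights.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openai/clip-vit-large-patch14",
    local_dir="checkpoints/clip-vit-large-patch14",
)
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="checkpoints/Mistral-7B-Instruct-v0.2",
)
```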
cd ECVA/scripts/vllava/mamba/
./pretrain.sh
cd ECVA/scripts/vllava/mamba/
./finetune.sh
Video/Image Inference. We have inherited the inference code from VideoLLaMA2. You can refer to inference.ipynb to run model inference; you need to prepare the relevant model weights according to the instructions in the notebook.
cd ECVA/
jupyter notebook inference.ipynb
The codebase of ECVA is adapted from VideoLLaMA2. We are grateful for the foundational work done by the VideoLLaMA2 team, which has significantly contributed to the development of this project.
If you find our work useful for your research, please consider citing:
@article{du2024exploring,
  title={Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly},
  author={Du, Hang and Nan, Guoshun and Qian, Jiawen and Wu, Wangchenhui and Deng, Wendi and Mu, Hanqing and Chen, Zhenyan and Mao, Pengxuan and Tao, Xiaofeng and Liu, Jun},
  journal={arXiv preprint arXiv:2412.07183},
  year={2024}
}