[Technical Article] A Survey of Video Captioning (with code and dataset links)

Author
n2n1 · posted 2022-11-27 00:25:22 from Hebei
The video captioning problem
Video captioning is the task of automatically generating a natural-language description of the semantics of a given video. It can be viewed as a seq2seq task mapping a sequence of video frames to a sequence of words.

It roughly decomposes into two steps:
1. Understand the visual content, including the people, objects, human actions, scenes, and human-object interactions in the video;
2. Use NLP techniques to generate a description that is semantically faithful and grammatically correct.
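The two steps above can be sketched as a minimal encoder-decoder pipeline. This is a toy illustration with random weights, not any published model: all names, shapes, and the tiny vocabulary are made-up assumptions, and the untrained "decoder" emits arbitrary words.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<bos>", "<eos>", "a", "man", "is", "cooking"]  # toy vocabulary
FEAT_DIM, HID_DIM = 16, 8

# Random stand-ins for learned parameters.
W_enc = rng.normal(size=(FEAT_DIM, HID_DIM))      # frame features -> hidden
W_emb = rng.normal(size=(len(VOCAB), HID_DIM))    # word embeddings
W_rec = rng.normal(size=(HID_DIM, HID_DIM))       # recurrent mixing
W_out = rng.normal(size=(HID_DIM, len(VOCAB)))    # hidden -> vocab logits

def encode(frames):
    """Step 1: summarize visual content (here: mean-pool per-frame features)."""
    return frames.mean(axis=0) @ W_enc            # (HID_DIM,)

def decode(ctx, max_len=10):
    """Step 2: greedily emit words conditioned on the video vector."""
    words, prev = [], W_emb[VOCAB.index("<bos>")]
    for _ in range(max_len):
        h = np.tanh(ctx + prev @ W_rec)           # fuse context and last word
        word = VOCAB[int(np.argmax(h @ W_out))]
        if word == "<eos>":
            break
        words.append(word)
        prev = W_emb[VOCAB.index(word)]
    return words

video = rng.normal(size=(20, FEAT_DIM))           # 20 frames of fake CNN features
print(decode(encode(video)))                      # untrained -> arbitrary words
```

Real systems replace the mean-pooling with 2D/3D CNN or attention-based encoders and the greedy loop with a trained LSTM/Transformer decoder plus beam search, but the data flow is the same.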
Related concepts
1. Visual Description
Converting a still image or a video clip into one or more natural-language sentences.

2. Video Captioning
Converting one video clip into one natural-language sentence, based on the assumption that a short video clip usually contains a single main event.

3. Video Description
Converting a relatively long video clip into multiple natural-language sentences, i.e., a narrative paragraph. Because the output is a paragraph, the description is more fine-grained.

4. Dense Video Captioning
Splitting a video into overlapping or non-overlapping segments of varying lengths and generating one sentence per segment. The emphasis is on describing every event that occurs; unlike video description, whose sentences are related to one another, the sentences generated by dense video captioning may be mutually unrelated.
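To make the distinctions concrete, here is roughly what the outputs of these settings look like as data. The captions and timestamps are made up for illustration:

```python
# One short clip -> one sentence (video captioning; visual description of an
# image looks the same but takes a still image as input).
video_captioning = "a man is cooking in a kitchen"

# A longer clip -> a paragraph of mutually related sentences (video description).
video_description = [
    "A man walks into the kitchen.",
    "He chops vegetables on a cutting board.",
    "He stirs them in a pan.",
]

# Segments (possibly overlapping), one sentence each; the sentences need not
# be related to one another (dense video captioning).
dense_video_captioning = [
    {"start": 0.0, "end": 12.5, "caption": "a man enters the kitchen"},
    {"start": 10.0, "end": 45.0, "caption": "a man chops vegetables"},
]
```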

Dataset download links
Current datasets fall into four broad categories: cooking, movies, videos in the wild, and social media.

In most datasets one video corresponds to a single sentence; only a few pair each video with multiple sentences or paragraphs. (Some of the links below may require a proxy to access.)

Cooking (cooking scenes):

YouCook
Jason J. Corso; EECS @ U of Michigan
web.eecs.umich.edu/~jjcorso/r/youcook/

TACoS
SMILE project home
www.coli.uni-saarland.de/projects/smile/page.php?id=tacos

TACoS-MultiLevel
TACoS Multi-Level Corpus
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/

YouCook II
Large-scale Cooking Video Dataset for Procedure Understanding and Description Generation
youcook2.eecs.umich.edu/

Movies (videos from films):

MPII-MD
MPII Movie Description dataset
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/mpii-movie-description-dataset/

M-VAD
Mila: M-VAD
mila.umontreal.ca/en/publications/public-datasets/m-vad/

Social Media (videos from social media):

ActivityNet Entities
Dense-Captioning Events in Videos
cs.stanford.edu/people/ranjaykrishna/densevid/

Videos in the Wild (open-domain videos):

MSVD (YouTube2Text dataset)
Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk
www.cs.utexas.edu/users/ml/clamp/videoDescription/

MSR-VTT
Microsoft Multimedia Challenge
ms-multimedia-challenge.com/2017/dataset

Charades
Allen Institute for Artificial Intelligence
allenai.org/plato/charades/

ActivityNet Captions
cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip
Source code links
Consensus-based Sequence Training for Video Captioning (official implementation)
github.com/mynlp/cst_captioning

End-to-End Dense Video Captioning with Masked Transformer, CVPR 2018 (official implementation)
github.com/salesforce/densecap

Saliency-based Spatio-Temporal Attention for Video Captioning (official implementation)
github.com/Yugnaynehc/ssta-captioning

Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning (third-party implementation)
github.com/chitwansaharia/HACAModel

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling, ACL 2018 (official implementation)
github.com/eric-xw/AREL

Reconstruction Network for Video Captioning, CVPR 2018 (third-party implementation)
github.com/sususushi/reconstruction-network-for-video-captioning

Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning, AAAI 2019 (official implementation)
github.com/eric-xw/Zero-Shot-Video-Captioning

Competitions
LSMDC
sites.google.com/site/describingmovies/lsmdc-2017
The Large Scale Movie Description Challenge (LSMDC) was held in conjunction with ICCV 2015 and an ECCV 2016 workshop. It comprises three main tasks: movie description, annotation/retrieval, and fill-in-the-blank. Since 2017, MovieQA has also been a main task.

MSR-VTT
Organized by Microsoft Research in 2016 to bring CV and NLP researchers together (the MSR-VTT dataset used in the challenge appears in the dataset section above). Participants build a video-to-text model on MSR-VTT and may use additional data.

TRECVID
The Text REtrieval Conference (TREC) is a workshop series focused on information retrieval (IR). The TREC Video Retrieval Evaluation (TRECVID) started in 2001; its early tasks included semantic indexing, video summarization, video copy detection, and multimedia event detection. Since TRECVID 2016, Video to Text Description (VTT) has been included as well.

ActivityNet Challenge
ActivityNet Dense-Captioning Events in Videos became a task of the ActivityNet Large Scale Activity Recognition Challenge in 2017, run as a CVPR workshop competition. The task is to detect and describe all the events in a video; the videos in the ActivityNet Captions dataset carry timestamp annotations, and each video clip is paired with multiple sentences.

Researcher homepages
Anna Rohrbach: creator of the TACoS dataset series and co-organizer of LSMDC.
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/people/anna-rohrbach/
Xin Wang (UCSB NLP group): author of "Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning", "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling", "Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning", among others.
www.cs.ucsb.edu/~xwang/


Paper download links
NAACL-HLT 2015: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
cn.arxiv.org/pdf/1412.4729.pdf
ICCV 2015: Sequence to Sequence – Video to Text
cn.arxiv.org/pdf/1505.00487.pdf
ICCV 2015: Learning Spatiotemporal Features with 3D Convolutional Networks
cn.arxiv.org/pdf/1412.0767.pdf
ICCV 2015: Describing Videos by Exploiting Temporal Structure
www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yao_Describing_Videos_by_ICCV_2015_paper.pdf
HRNE, CVPR 2016: Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
cn.arxiv.org/pdf/1511.03476.pdf
EMNLP 2016: Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
cn.arxiv.org/pdf/1604.01729.pdf
CVPR 2016: Jointly Modeling Embedding and Translation to Bridge Video and Language
cn.arxiv.org/pdf/1505.01861.pdf
h-RNN, CVPR 2016: Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
cn.arxiv.org/pdf/1510.07712.pdf
CVPR 2016: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
www.microsoft.com/en-us/research/wp-content/uploads/2016/06/cvpr16.msr-vtt.tmei_-1.pdf
ICCV 2017: Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
cn.arxiv.org/pdf/1711.10305.pdf
CVPR 2017: Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description
openaccess.thecvf.com/content_cvpr_2017/papers/Zhang_Task-Driven_Dynamic_Fusion_CVPR_2017_paper.pdf
ACL 2017: Multi-Task Video Captioning with Video and Entailment Generation
cn.arxiv.org/pdf/1704.07489.pdf
EMNLP 2017: Reinforced Video Captioning with Entailment Rewards
cn.arxiv.org/pdf/1708.02300.pdf
RecNet, CVPR 2018: Reconstruction Network for Video Captioning
cn.arxiv.org/pdf/1803.11438.pdf
CVPR 2018: Video Captioning via Hierarchical Reinforcement Learning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Video_Captioning_via_CVPR_2018_paper.pdf
CVPR 2018: Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Bidirectional_Attentive_Fusion_CVPR_2018_paper.pdf
CVPR 2018: M3: Multimodal Memory Modelling for Video Captioning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_M3_Multimodal_Memory_CVPR_2018_paper.pdf
CVPR 2018: End-to-End Dense Video Captioning with Masked Transformer
openaccess.thecvf.com/content_cvpr_2018/papers/Zhou_End-to-End_Dense_Video_CVPR_2018_paper.pdf
ECCV 2018: Less Is More: Picking Informative Frames for Video Captioning
openaccess.thecvf.com/content_ECCV_2018/papers/Yangyu_Chen_Less_is_More_ECCV_2018_paper.pdf
NAACL-HLT 2018: Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning
cn.arxiv.org/pdf/1804.05448.pdf
ACL 2018: No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
cn.arxiv.org/pdf/1804.09160.pdf
AAAI 2019: Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
cn.arxiv.org/pdf/1811.02765.pdf