Author: n2n1 · Posted 2022-11-27 00:25:22
Video captioning: the video description problem
Video captioning is the task of automatically generating a natural-language description of the content of a given video. It can be viewed as a seq2seq task mapping a sequence of video frames to a sequence of words.
Roughly speaking, it involves two steps:
1. Understand the visual content, including the people, objects, human actions, scenes, and human-object interactions in the video;
2. Use NLP techniques to generate a description that is semantically faithful and grammatically correct (a minimal model sketch follows this list).
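To make the seq2seq framing concrete, below is a minimal encoder-decoder sketch in PyTorch. It assumes per-frame features have already been extracted by a CNN; the class name, dimensions, and toy tensors are illustrative only and are not taken from any specific paper listed later in this post.

```python
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    """Minimal video-captioning sketch: frame features -> caption tokens."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        # Step 1: "understand the visual content" -- here simply an LSTM
        # over pre-extracted per-frame CNN features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Step 2: generate the sentence -- an LSTM decoder over word embeddings,
        # initialized with the encoder's final state.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len)
        _, state = self.encoder(frame_feats)      # summarize the whole clip
        dec_in = self.embed(captions[:, :-1])     # teacher forcing: shift right
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)                  # (batch, seq_len - 1, vocab_size)

# Toy usage with random tensors
model = Seq2SeqCaptioner()
feats = torch.randn(2, 30, 2048)          # 2 videos, 30 frames of CNN features each
caps = torch.randint(0, 10000, (2, 12))   # 2 captions of 12 token ids
logits = model(feats, caps)               # (2, 11, 10000)
```

At training time the logits would be compared against captions[:, 1:] with cross-entropy; the papers listed below build on this basic scheme with attention, 3D/motion features, reconstruction losses, reinforcement rewards, and so on.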
Related concepts
1. Visual Description
Converting a still image or a video clip into one or more natural-language sentences.
2. Video Captioning
Based on the assumption that a short video clip usually contains a single main event, converting a video clip into one natural-language sentence.
3. Video Description
Converting a relatively long video clip into multiple natural-language sentences, i.e. a narrative paragraph. The output is a paragraph, and the description is more fine-grained.
4. Dense Video Captioning
Splitting a video into overlapping or non-overlapping segments of varying lengths and generating one sentence for each segment. The emphasis is on describing every event that occurs; unlike video description, whose sentences are related to one another, the sentences generated by dense video captioning may be unrelated.
Dataset download links
Current datasets fall into four broad categories: cooking, movies, videos in the wild, and social media.
In most datasets each video is paired with a single sentence; only a few pair each video with multiple sentences or paragraphs. (Some of the sites below may not be reachable from every region.)
Cooking (cooking scenes):
YouCook
Jason J. Corso; EECS @ U of Michigan
web.eecs.umich.edu/~jjcorso/r/youcook/
TACoS
SMILE project home
www.coli.uni-saarland.de/projects/smile/page.php?id=tacos
TACoS-MultiLevel
TACoS Multi-Level Corpus
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/
YouCook II
Large-scale Cooking Video Dataset for Procedure Understanding and Description Generation
youcook2.eecs.umich.edu/
Movies (videos taken from films):
MPII-MD
MPII Movie Description dataset
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/mpii-movie-description-dataset/
M-VAD
Mila: M-VAD
mila.umontreal.ca/en/publications/public-datasets/m-vad/
Social media (videos from social media):
ActivityNet Entities
Dense-Captioning Events in Videos
cs.stanford.edu/people/ranjaykrishna/densevid/
Videos in the wild (open-domain videos):
MSVD(YouTube2Text dataset)
Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk
www.cs.utexas.edu/users/ml/clamp/videoDescription/
MSR-VTT
Microsoft Multimedia Challenge
ms-multimedia-challenge.com/2017/dataset
Charades
Allen Institute for Artificial Intelligence
allenai.org/plato/charades/
ActivityNet Captions
cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip
Source code links
Consensus-based Sequence Training for Video Captioning (official code by the original authors)
mynlp/cst_captioning
github.com/mynlp/cst_captioning
End-to-End Dense Video Captioning with Masked Transformer (official code by the original authors, CVPR 2018)
salesforce/densecap
github.com/salesforce/densecap
Saliency-based Spatio-Temporal Attention for Video Captioning (official code by the original authors)
Yugnaynehc/ssta-captioning
github.com/Yugnaynehc/ssta-captioning
Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning (unofficial implementation)
chitwansaharia/HACAModel
github.com/chitwansaharia/HACAModel
No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling (official code by the original authors, ACL 2018)
eric-xw/AREL
github.com/eric-xw/AREL
Reconstruction Network for Video Captioning (unofficial implementation, CVPR 2018)
sususushi/reconstruction-network-for-video-captioning
github.com/sususushi/reconstruction-network-for-video-captioning
Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning (official code by the original authors, AAAI 2019)
eric-xw/Zero-Shot-Video-Captioning
github.com/eric-xw/Zero-Shot-Video-Captioning
Related challenges
LSMDC
sites.google.com/site/describingmovies/lsmdc-2017
The Large Scale Movie Description Challenge (LSMDC) was held in conjunction with workshops at ICCV 2015 and ECCV 2016. The challenge includes three main tasks: Movie Description, Annotation/Retrieval, and Fill-in-the-Blank. Since 2017, MovieQA has also been added as a main task.
MSR-VTT
Organized by Microsoft Research in 2016 to bring CV and NLP researchers together (the MSR-VTT dataset used in the challenge is listed in the dataset section above). Participants must build video-to-text models with the MSR-VTT dataset, and may additionally use other data.
TRECVID
The Text Retrieval Conference (TREC) is a workshop series focused on information retrieval (IR) research. The TREC Video Retrieval Evaluation (TRECVID) began in 2001; its early tasks included semantic indexing, video summarization, video copy detection, and multimedia event detection. Starting in 2016, a Video to Text Description (VTT) task was also included.
ActivityNet Challenge
Dense-Captioning Events in Videos was introduced in 2017 as a task of the ActivityNet Large Scale Activity Recognition Challenge, held as a CVPR workshop competition. The task is to detect and describe the multiple events in a video; in the ActivityNet Captions dataset it uses, videos carry timestamp annotations and each video clip is paired with several descriptive sentences.
Researcher homepages
Anna Rohrbach's homepage: creator of the TACoS family of datasets and co-organizer of the LSMDC challenge.
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/people/anna-rohrbach/
Xin Wang's homepage (UCSB NLP group): author of "Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning", "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling", "Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning", and other papers.
Xin Wang
www.cs.ucsb.edu/~xwang/
Paper download links
NAACL-HLT 2015: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
cn.arxiv.org/pdf/1412.4729.pdf
ICCV 2015: Sequence to Sequence – Video to Text
cn.arxiv.org/pdf/1505.00487.pdf
ICCV 2015: Learning Spatiotemporal Features with 3D Convolutional Networks
cn.arxiv.org/pdf/1412.0767.pdf
ICCV 2015: Describing Videos by Exploiting Temporal Structure
www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yao_Describing_Videos_by_ICCV_2015_paper.pdf
HRNE, CVPR 2016: Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
cn.arxiv.org/pdf/1511.03476.pdf
EMNLP 2016: Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
cn.arxiv.org/pdf/1604.01729.pdf
CVPR 2016: Jointly Modeling Embedding and Translation to Bridge Video and Language
cn.arxiv.org/pdf/1505.01861.pdf
h-RNN, CVPR 2016: Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
cn.arxiv.org/pdf/1510.07712.pdf
CVPR 2016: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
www.microsoft.com/en-us/research/wp-content/uploads/2016/06/cvpr16.msr-vtt.tmei_-1.pdf
ICCV 2017: Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
cn.arxiv.org/pdf/1711.10305.pdf
CVPR 2017: Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description
openaccess.thecvf.com/content_cvpr_2017/papers/Zhang_Task-Driven_Dynamic_Fusion_CVPR_2017_paper.pdf
ACL 2017: Multi-Task Video Captioning with Video and Entailment Generation
cn.arxiv.org/pdf/1704.07489.pdf
EMNLP 2017: Reinforced Video Captioning with Entailment Rewards
cn.arxiv.org/pdf/1708.02300.pdf
RecNet, CVPR 2018: Reconstruction Network for Video Captioning
cn.arxiv.org/pdf/1803.11438.pdf
CVPR 2018: Video Captioning via Hierarchical Reinforcement Learning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Video_Captioning_via_CVPR_2018_paper.pdf
CVPR 2018: Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Bidirectional_Attentive_Fusion_CVPR_2018_paper.pdf
CVPR 2018: M3: Multimodal Memory Modelling for Video Captioning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_M3_Multimodal_Memory_CVPR_2018_paper.pdf
CVPR 2018: End-to-End Dense Video Captioning with Masked Transformer
openaccess.thecvf.com/content_cvpr_2018/papers/Zhou_End-to-End_Dense_Video_CVPR_2018_paper.pdf
ECCV 2018: Less Is More: Picking Informative Frames for Video Captioning
openaccess.thecvf.com/content_ECCV_2018/papers/Yangyu_Chen_Less_is_More_ECCV_2018_paper.pdf
NAACL-HLT 2018: Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning
cn.arxiv.org/pdf/1804.05448.pdf
ACL 2018: No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
cn.arxiv.org/pdf/1804.09160.pdf
AAAI 2019: Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
cn.arxiv.org/pdf/1811.02765.pdf