[Technical Article] Video Captioning Survey: Code and Datasets Included

Author: n2n1, posted 2022-11-27 00:25:22
The Video Captioning Problem
Video captioning is the task of automatically generating a natural-language description of a given video. It can be viewed as a seq2seq task that maps a sequence of video frames to a sequence of words.

The task roughly breaks down into two steps (a minimal model sketch follows the list):
1. Understand the visual content: the people, objects, human actions, scenes, and human-object interactions in the video;
2. Use NLP techniques to produce a description that is semantically faithful and grammatically correct.
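To make the two-step split concrete, here is a minimal encoder-decoder sketch in PyTorch. It illustrates the generic formulation only, not the method of any paper listed below; all names (VideoCaptioner, the 2048-d frame features, the GRU sizes) are assumptions, and real systems add pretrained 2D/3D CNN backbones and attention.

```python
# Minimal encoder-decoder sketch for video captioning (illustrative only).
# Assumes frame features were pre-extracted by a CNN (e.g. 2048-d per frame).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        # Step 1: "understand the visual content" -- encode the frame sequence.
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # Step 2: "generate language" -- decode words conditioned on the video.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len)
        _, h = self.encoder(frame_feats)      # final state summarizes the video
        dec_out, _ = self.decoder(self.embed(captions), h)  # teacher forcing
        return self.out(dec_out)              # (batch, seq_len, vocab_size)

model = VideoCaptioner()
feats = torch.randn(2, 30, 2048)              # 2 videos, 30 frames each
caps = torch.randint(0, 10000, (2, 12))       # 12-token reference captions
logits = model(feats, caps)                   # (2, 12, 10000)
```

Training minimizes cross-entropy between the logits and the next tokens of the reference caption; at test time the decoder is unrolled greedily or with beam search.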
Related Concepts
1. Visual Description
Converting a still image or a video clip into one or more natural-language sentences.

2. Video Captioning
Converting a video clip into a single natural-language sentence, based on the assumption that a short clip usually contains one main event.

3. Video Description
Converting a relatively long video into multiple natural-language sentences, i.e. a narrative paragraph. Because the output is a paragraph, the description is more detailed.

4. Dense Video Captioning
Splitting a video into overlapping or non-overlapping segments of varying lengths and producing one sentence per segment. The emphasis is on describing every event that occurs. Unlike video description, where the generated sentences relate to one another, the sentences produced by dense video captioning may be mutually unrelated. (A sketch of the output structure follows.)
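The difference between these settings shows up most clearly in the output structure. The sketch below is illustrative only (the field names are my own, not from any dataset): dense video captioning returns timestamped, possibly overlapping events with independent sentences, while plain video captioning returns a single sentence.

```python
# Illustrative output structures (field names are my own, not a dataset's).
# Dense video captioning: timestamped, possibly overlapping events,
# each with an independent sentence.
dense_result = [
    {"segment": (0.0, 12.5), "sentence": "A man walks into the kitchen."},
    {"segment": (8.0, 30.2), "sentence": "He chops vegetables on a board."},
    {"segment": (28.0, 45.0), "sentence": "A woman sets the table."},
]

# Plain video captioning: the whole clip collapses to one sentence.
caption_result = "A man prepares a meal in the kitchen."
```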

Dataset Download Links
Current datasets fall into four broad categories: cooking, movies, videos in the wild, and social media.

In most datasets each video is paired with a single sentence; only a few pair a video with multiple sentences or whole paragraphs (a sketch of such an annotation format follows). Note that some of the links below may require a VPN to access.
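As an example of the one-video-many-sentences case, the sketch below groups MSR-VTT-style annotations by video. The file name and key names ("sentences", "video_id", "caption") follow the released MSR-VTT JSON as I recall it; treat them as assumptions and verify against the downloaded file.

```python
# Sketch: group captions by video in an MSR-VTT-style annotation file.
# Key names ("sentences", "video_id", "caption") are assumed -- verify
# against the actual download.
import json
from collections import defaultdict

with open("train_val_videodatainfo.json") as f:   # assumed local filename
    anno = json.load(f)

captions = defaultdict(list)
for sent in anno["sentences"]:
    captions[sent["video_id"]].append(sent["caption"])

# Each MSR-VTT video carries multiple reference sentences.
some_vid = next(iter(captions))
print(some_vid, len(captions[some_vid]), captions[some_vid][:2])
```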



Cooking (kitchen scenes):

YouCook
web.eecs.umich.edu/~jjcorso/r/youcook/

TACoS
www.coli.uni-saarland.de/projects/smile/page.php?id=tacos

TACoS-MultiLevel
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/

YouCook II
Large-scale cooking video dataset for procedure understanding and description generation
youcook2.eecs.umich.edu/
Movies (clips from films):

MPII-MD
MPII Movie Description dataset
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/mpii-movie-description-dataset/

M-VAD
mila.umontreal.ca/en/publications/public-datasets/m-vad/

Social Media (videos from social platforms):

ActivityNet Entities
Dense-Captioning Events in Videos
cs.stanford.edu/people/ranjaykrishna/densevid/

Videos in the Wild (open-domain videos):

MSVD (YouTube2Text dataset)
Collecting Multilingual Parallel Video Descriptions Using Mechanical Turk
www.cs.utexas.edu/users/ml/clamp/videoDescription/

MSR-VTT
Microsoft Multimedia Challenge
ms-multimedia-challenge.com/2017/dataset

Charades
allenai.org/plato/charades/

ActivityNet Captions
cs.stanford.edu/people/ranjaykrishna/densevid/captions.zip
Source Code Links
Consensus-based Sequence Training for Video Captioning (official implementation)
github.com/mynlp/cst_captioning

End-to-End Dense Video Captioning with Masked Transformer (official implementation, CVPR 2018)
github.com/salesforce/densecap

Saliency-based Spatio-Temporal Attention for Video Captioning (official implementation)
github.com/Yugnaynehc/ssta-captioning

Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning (third-party implementation)
github.com/chitwansaharia/HACAModel

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling (official implementation, ACL 2018)
github.com/eric-xw/AREL

Reconstruction Network for Video Captioning (third-party implementation, CVPR 2018)
github.com/sususushi/reconstruction-network-for-video-captioning

Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning (official implementation, AAAI 2019)
github.com/eric-xw/Zero-Shot-Video-Captioning

Competitions
LSMDC
sites.google.com/site/describingmovies/lsmdc-2017
The Large Scale Movie Description Challenge (LSMDC) was held in conjunction with ICCV 2015 and an ECCV 2016 workshop. It comprises three main tasks: Movie Description, Annotation/Retrieval, and Fill-in-the-Blank. Since 2017, MovieQA has also been included as a main task.

MSR-VTT
Organized by Microsoft Research in 2016 to bring CV and NLP researchers together (the MSR-VTT dataset used in the challenge is listed in the dataset section above). Participants build video-to-text models on MSR-VTT and may also use additional data.

TRECVID
The Text Retrieval Conference (TREC) is a workshop series focused on Information Retrieval (IR) research. The TREC Video Retrieval Evaluation (TRECVID) started in 2001, with initial tasks including semantic indexing, video summarization, video copy detection, and multimedia event detection. Since 2016, a Video to Text Description (VTT) task has been included as well.

ActivityNet Challenge
Dense-Captioning Events in Videos was added in 2017 as a task of the ActivityNet Large Scale Activity Recognition Challenge, run as a CVPR workshop competition. The task is to detect and describe all the events in a video. In the ActivityNet Captions dataset used for the task, events carry timestamp annotations and each video clip is paired with multiple descriptive sentences (see the annotation sketch below).
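For reference, ActivityNet Captions ships its annotations as one JSON object per split, keyed by video id, with parallel timestamp and sentence lists. The field names below match the released files as I recall them; treat them as assumptions and verify against the captions.zip download above.

```python
# Sketch: read ActivityNet Captions-style annotations. Each split is one
# JSON object keyed by video id, with parallel "timestamps" and
# "sentences" lists -- field names assumed, verify against captions.zip.
import json

with open("train.json") as f:   # assumed file inside captions.zip
    anno = json.load(f)

vid, entry = next(iter(anno.items()))
for (start, end), sent in zip(entry["timestamps"], entry["sentences"]):
    print(f"{vid} [{start:.1f}s-{end:.1f}s]: {sent}")
```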

Researcher Homepages
Anna Rohrbach: creator of the TACoS dataset series and co-organizer of the LSMDC challenge.
www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/people/anna-rohrbach/

Xin Wang (UCSB NLP group): author of "Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning", "No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling", and "Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning", among others.
www.cs.ucsb.edu/~xwang/


Paper Download Links
NAACL-HLT 2015: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
cn.arxiv.org/pdf/1412.4729.pdf
ICCV 2015: Sequence to Sequence – Video to Text
cn.arxiv.org/pdf/1505.00487.pdf
ICCV 2015: Learning Spatiotemporal Features with 3D Convolutional Networks
cn.arxiv.org/pdf/1412.0767.pdf
ICCV 2015: Describing Videos by Exploiting Temporal Structure
www.cv-foundation.org/openaccess/content_iccv_2015/papers/Yao_Describing_Videos_by_ICCV_2015_paper.pdf
HRNE, CVPR 2016: Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
cn.arxiv.org/pdf/1511.03476.pdf
EMNLP 2016: Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text
cn.arxiv.org/pdf/1604.01729.pdf
CVPR 2016: Jointly Modeling Embedding and Translation to Bridge Video and Language
cn.arxiv.org/pdf/1505.01861.pdf
h-RNN, CVPR 2016: Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
cn.arxiv.org/pdf/1510.07712.pdf
CVPR 2016: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
www.microsoft.com/en-us/research/wp-content/uploads/2016/06/cvpr16.msr-vtt.tmei_-1.pdf
ICCV 2017: Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
cn.arxiv.org/pdf/1711.10305.pdf

CVPR 2017: Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description
openaccess.thecvf.com/content_cvpr_2017/papers/Zhang_Task-Driven_Dynamic_Fusion_CVPR_2017_paper.pdf
ACL 2017: Multi-Task Video Captioning with Video and Entailment Generation
cn.arxiv.org/pdf/1704.07489.pdf
EMNLP 2017: Reinforced Video Captioning with Entailment Rewards
cn.arxiv.org/pdf/1708.02300.pdf
RecNet, CVPR 2018: Reconstruction Network for Video Captioning
cn.arxiv.org/pdf/1803.11438.pdf
CVPR 2018: Video Captioning via Hierarchical Reinforcement Learning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Video_Captioning_via_CVPR_2018_paper.pdf
CVPR 2018: Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Bidirectional_Attentive_Fusion_CVPR_2018_paper.pdf
CVPR 2018: M3: Multimodal Memory Modelling for Video Captioning
openaccess.thecvf.com/content_cvpr_2018/papers/Wang_M3_Multimodal_Memory_CVPR_2018_paper.pdf
CVPR 2018: End-to-End Dense Video Captioning with Masked Transformer
openaccess.thecvf.com/content_cvpr_2018/papers/Zhou_End-to-End_Dense_Video_CVPR_2018_paper.pdf
ECCV 2018: Less Is More: Picking Informative Frames for Video Captioning
openaccess.thecvf.com/content_ECCV_2018/papers/Yangyu_Chen_Less_is_More_ECCV_2018_paper.pdf
NAACL-HLT 2018: Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning
cn.arxiv.org/pdf/1804.05448.pdf
ACL 2018: No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling
cn.arxiv.org/pdf/1804.09160.pdf
AAAI 2019: Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
cn.arxiv.org/pdf/1811.02765.pdf