[Multimodal]Everything at Once
Multi-modal Fusion Transformer for Video Retrieval Abstract: Multimodal learning from video data has recently attracted growing attention because, without human annotation, it allows...
[Skim]ObjectBox
From Centers to Boxes for Anchor-Free Object Detection Main Contributions | Keypoints Label Assignment: On the three feature-map levels, ...
[Close Reading]Table Question Answering: TAPAS
Paper: TAPAS: Weakly Supervised Table Parsing via Pre-training Abstract: Answering natural language questions over tables is usually treated as a semantic parsing task. To alleviate the full logical...
[Skim]Align before Fuse
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation Background: VLP (Vis...
[Translation] UNITER: Universal Image-Text Representation Learning
UNiversal Image-TExt Representation Learning Abstract: Joint image-text embedding is the foundation of most vision-and-language tasks (V+L tasks), in which multimodal inputs are...
[Skim]Twins Series
Twins: Revisiting the Design of Spatial Attention in Vision Transformers Conditional Positional Enco...
[Skim]Swin-Transformer
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Main contributions: Patch Merging Layer Sh...
[Translation]Pyramid Vision Transformer
A Versatile Backbone for Dense Prediction without Convolutions Abstract: Although architectures that use CNNs as the backbone have achieved great success in computer vision, ...