Slack Tips Archive


Tip

Python

AI Math

Pytorch

NLP

CV

Other Task

Korean speech datasets

Papers

AI is biased toward certain races
2021 State of AI Reports
GIRAFFE: 3D view rendering
- Paper full name: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
- Existing deep generative models produce very realistic images even at high resolution, but most of them are 2D-based, so they offer little controllability over the object being generated.
- Since NeRF, 3D view rendering has been drawing attention.
- GIRAFFE disentangles individual objects and the background through a compositional 3D scene representation.
Labels4Free: building segmentation datasets in an unsupervised manner
- Paper full name: Labels4Free: Unsupervised Segmentation using StyleGAN
- Extends a pre-trained StyleGAN to separate the foreground and background of generated images without supervision.
- Several studies have shown that the features StyleGAN generates carry enough information to be extended to other tasks; this paper applies that idea to unsupervised segmentation.
- With it, a high-quality dataset for segmentation can be generated in an unsupervised manner.
Dataset distillation: from a large dataset to a small one
- If knowledge distillation transfers the information of a large model to a small model, dataset distillation transfers the information of a large dataset to a small dataset.
- The paper reaches 64% test accuracy using only 10 samples, 0.02% of CIFAR-10.
- It achieves this by proposing a new distributed kernel-based meta-learning framework.
- The paper's figure of images transformed by KIP, the proposed method, shows that they are hard to identify by eye yet greatly improve model performance.
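To make the kernel-based idea concrete, here is a minimal sketch of the kind of kernel ridge-regression loss a small support set can be scored with: fit the support set, then measure the error on the target data. The RBF kernel, the 1-D toy data, the `reg` value, and every function name are illustrative assumptions rather than the paper's code, and the sketch only evaluates the loss; it does not perform the gradient-based optimization of the support set.

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF kernel between two scalars (an assumed stand-in kernel)."""
    return math.exp(-gamma * (a - b) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def krr_loss(support_x, support_y, target_x, target_y, reg=1e-3):
    """0.5 * ||y_t - K_ts (K_ss + reg*I)^-1 y_s||^2 for a given support set."""
    k_ss = [[rbf(a, b) + (reg if i == j else 0.0)
             for j, b in enumerate(support_x)]
            for i, a in enumerate(support_x)]
    alpha = solve(k_ss, support_y)
    preds = [sum(rbf(x, s) * a for s, a in zip(support_x, alpha))
             for x in target_x]
    return 0.5 * sum((p - y) ** 2 for p, y in zip(preds, target_y))

# Toy "large" dataset: 41 samples of sin(x) on [-2, 2].
target_x = [i / 10 for i in range(-20, 21)]
target_y = [math.sin(x) for x in target_x]

# Two candidate "small" datasets: one informative, one not.
informative_x = [-1.5, -0.5, 0.5, 1.5]
informative_y = [math.sin(x) for x in informative_x]
loss_informative = krr_loss(informative_x, informative_y, target_x, target_y)
loss_uninformative = krr_loss([-0.1, 0.1], [0.0, 0.0], target_x, target_y)
```

A lower loss means the small set preserves more of what the large set teaches the kernel model, which is the quantity a distillation procedure would try to drive down.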
Swin Transformer
An Empirical Study of Training Self-Supervised Vision Transformers
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Group-Free 3D Object Detection via Transformers
Spatial-Temporal Transformer for Dynamic Scene Graph Generation
Rethinking and Improving Relative Position Encoding for Vision Transformer
Emerging Properties in Self-Supervised Vision Transformers
Learning Spatio-Temporal Transformer for Visual Tracking
Fast Convergence of DETR with Spatially Modulated Co-Attention
Vision Transformer with Progressive Sampling
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Rethinking Spatial Dimensions of Vision Transformers
The Right to Talk: An Audio-Visual Transformer Approach
Joint Inductive and Transductive Learning for Video Object Segmentation
Conformer: Local Features Coupling Global Representations for Visual Recognition
Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer
Paint Transformer: Feed Forward Neural Painting with Stroke Prediction
Conditional DETR for Fast Training Convergence
MUSIQ: Multi-scale Image Quality Transformer
SOTR: Segmenting Objects with Transformers
Complementary Patch for Weakly Supervised Semantic Segmentation
- A weakly-supervised approach that performs semantic segmentation using only image-level tags instead of pixel-wise masks.
- Existing CAM-based methods capture only the most discriminative part of an object, which hurts accuracy.
- With this paper's method, summing the CAM results for hidden patched parts of one image that are complementary to each other yields a much more accurate mask.
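The complementary-patch idea can be sketched roughly as follows: hide a random set of patches, hide exactly the opposite set in a second copy, and add the two CAMs. The function names, the patch size, and the `cam_fn` argument (a hypothetical callable that maps an image to a CAM of the same size) are assumptions for illustration; the paper's actual network and losses are not reproduced here.

```python
import random

def complementary_patch_masks(height, width, patch, seed=0):
    """Return a random patch-level 0/1 mask and its exact complement."""
    rng = random.Random(seed)
    mask = [[0] * width for _ in range(height)]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            keep = rng.randint(0, 1)  # one decision per patch
            for y in range(top, min(top + patch, height)):
                for x in range(left, min(left + patch, width)):
                    mask[y][x] = keep
    complement = [[1 - v for v in row] for row in mask]
    return mask, complement

def apply_mask(image, mask):
    """Zero out (hide) every pixel where the mask is 0."""
    return [[p * m for p, m in zip(irow, mrow)]
            for irow, mrow in zip(image, mask)]

def fused_cam(image, cam_fn, patch=2, seed=0):
    """Sum the CAMs of two complementary patched views of one image."""
    h, w = len(image), len(image[0])
    mask, complement = complementary_patch_masks(h, w, patch, seed)
    cam_a = cam_fn(apply_mask(image, mask))
    cam_b = cam_fn(apply_mask(image, complement))
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(cam_a, cam_b)]
```

Because the two masks partition the image, every pixel is visible in exactly one of the two views, so the fused map can cover regions that a single CAM would miss.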
YOLOF
- Shows that FPN, the core of detectors, performs well not because of multi-scale feature fusion but because of divide-and-conquer, which handles the features of each level separately.
- Divide-and-conquer does lead to good performance, but its large memory burden makes it inefficient.
- Using two techniques, a dilated encoder and uniform matching, the paper proposes YOLOF, which is about 2.5x faster than RetinaNet and more accurate even though it uses only a single-level feature map.
- Idea: YOLOF reaches good detection performance with only a single-level feature map, so combining it with a transformer might yield a much more efficient attention-based detector.
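As a rough illustration of uniform matching, the sketch below gives every ground-truth box the same number of positive anchors by picking its k nearest anchors. This reflects my reading of the technique; the use of plain center distance, the value of `k`, and the function name are simplifying assumptions, and the IoU-based filtering a real detector would add is omitted.

```python
def uniform_match(anchor_centers, gt_centers, k=4):
    """Assign the k nearest anchors (by center distance) to each ground truth.

    Returns a dict: ground-truth index -> list of k anchor indices.
    Every ground truth gets the same number of positives, whatever its size.
    """
    positives = {}
    for g, (gx, gy) in enumerate(gt_centers):
        order = sorted(
            range(len(anchor_centers)),
            key=lambda i: (anchor_centers[i][0] - gx) ** 2
                          + (anchor_centers[i][1] - gy) ** 2,
        )
        positives[g] = order[:k]
    return positives

# Anchors on a 4x4 single-level feature map with stride 8 (toy values).
anchors = [(x * 8 + 4, y * 8 + 4) for y in range(4) for x in range(4)]
matches = uniform_match(anchors, [(5, 5), (28, 27)], k=4)
```

With a single feature level, a fixed count per box is one way to keep small and large objects equally represented among the positives.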
Few-Shot Object Detection via Classification Refinement and Distractor Retreatment
- The evaluation metric of few-shot object detection, Average Precision (AP), considers class and box quality together, and the paper shows that most of the quantitative performance drop comes from classification error (that is, box quality is good but misclassification is the main cause).
- Architecture-level enhancement: a new few-shot correction network reduces category confusion.
- Defines data samples whose incomplete annotations sharply reduce performance as distractors, and proposes removing those distractors and exploiting them through a semi-supervised loss.
Points as Queries: Weakly Semi-supervised Object Detection by Points
- Proposes weakly semi-supervised object detection: to improve object detection, semi-supervised learning is run with weakly-supervised extra data that carries only point annotations, which are coarser than bounding boxes.
- The student network is trained on pseudo-labels produced by running inference with a fully-supervised teacher model.
- Proposes Point DETR, which extends DETR so that image information is extracted by the encoder and point information is encoded and fed in as decoder queries.
How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers
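- Analyzes how to train ViT efficiently and reach good performance on a given dataset.
- Releases as many as 50,000 ViT models in total, trained on a variety of datasets.
- Summary 1) When data is scarce, augmentation and regularization matter a great deal.
- Summary 2) When data is plentiful, good performance is reached without much attention to aug./reg.
- Summary 3) Fine-tuning from pre-trained weights helps performance.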
End-to-End Semi-Supervised Object Detection with Soft Teacher
- Proposes a semi-supervised method that improves performance by additionally using an unlabeled dataset.
- Unlike earlier methods that need multi-stage training, it trains end to end.
- Pseudo-labels become progressively more accurate as training proceeds.
- For unlabeled data, the classification loss and the regression loss are handled separately.
- 1) Classification head: the loss is computed only on boxes that survive score filtering of the soft teacher's predictions.
- 2) Regression head: only boxes with low regression variance are kept, and the loss is computed only on those boxes.
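The two filtering rules above can be sketched as follows. The thresholds (`score_thr=0.9`, `var_thr=0.02`), the dict layout of a box, and the way the spread is measured (per-coordinate standard deviation of several teacher refinements, normalized by box size) are assumptions made for illustration, not a reproduction of the paper's implementation.

```python
from statistics import pstdev

def filter_by_score(boxes, score_thr=0.9):
    """Classification branch: keep teacher boxes with a high foreground score."""
    return [b for b in boxes if b["score"] >= score_thr]

def regression_variance(refined):
    """Spread of repeated teacher refinements of one box.

    `refined` is a list of (x1, y1, x2, y2) tuples. The per-coordinate
    standard deviation is normalized by the mean box size and averaged.
    """
    n = len(refined)
    mean_w = sum(x2 - x1 for x1, _, x2, _ in refined) / n
    mean_h = sum(y2 - y1 for _, y1, _, y2 in refined) / n
    scale = 0.5 * (mean_w + mean_h)
    stds = [pstdev([box[i] for box in refined]) / scale for i in range(4)]
    return sum(stds) / 4

def filter_by_variance(boxes, var_thr=0.02):
    """Regression branch: keep boxes whose refinements agree with each other."""
    return [b for b in boxes if regression_variance(b["refined"]) < var_thr]

# Toy teacher predictions: one stable box, one unstable box.
stable = {"score": 0.95,
          "refined": [(10, 10, 50, 50), (10.2, 10, 50, 50.2), (9.8, 10, 50, 49.8)]}
unstable = {"score": 0.60,
            "refined": [(10, 10, 50, 50), (20, 10, 60, 50), (0, 10, 40, 50)]}
cls_targets = filter_by_score([stable, unstable])
reg_targets = filter_by_variance([stable, unstable])
```

Keeping the two selections separate lets a box with a confident class but a shaky location contribute to one loss without polluting the other.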
Revisiting Mask-Head Architectures for Novel Class Instance Segmentation
- Instance segmentation struggles to predict novel classes (because it is hard to draw their masks).
- Narrows the gap to supervised learning by changing the protocol and the mask-head architecture.
- Its novelty is that it works well on previously unseen objects when given custom crops.
- State-of-the-art mask mAP, up 4.7% on VOC (no auxiliary loss functions, offline trained priors, or weight transfer functions).
- Released as open source under the name Deep-MAC.
3DETR
- Facebook has now started doing even 3D object detection with transformers!
- A transformer encoder extracts features from the input point cloud, and the decoder predicts boxes.
- The decoder receives query embeddings for the given reference points and attends to the related points, improving detection performance.
Reconcile Prediction Consistency for Balanced Object Detection
- Existing detectors train the classification loss and the regression loss completely independently, which causes many inconsistent predictions (for example, a high classification score but low localization accuracy).
- Proposes a new loss, Harmonic loss, for prediction consistency.
- It harmonizes the optimization of the classification branch and the localization branch.
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
- As Figure 2 of the paper shows, a very simple augmentation that copies an object from one image, transforms it (random scale jittering), and pastes it into another image improves instance segmentation performance.
- Feels like CutMix for instance segmentation.
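A minimal sketch of the paste step, assuming both images have the same size and are stored as nested lists. Random scale jittering is left out, and the function name and data layout are illustrative assumptions.

```python
def copy_paste(src_image, src_mask, dst_image, dst_masks):
    """Paste the masked object of src_image onto dst_image.

    src_mask is a 0/1 mask of the object to copy. dst_masks is a list of
    0/1 instance masks of dst_image; pixels covered by the pasted object
    are removed from them, and the pasted object becomes a new instance.
    """
    out_image = [
        [s if m else d for s, m, d in zip(srow, mrow, drow)]
        for srow, mrow, drow in zip(src_image, src_mask, dst_image)
    ]
    out_masks = [
        [[d * (1 - m) for d, m in zip(drow, mrow)]
         for drow, mrow in zip(dmask, src_mask)]
        for dmask in dst_masks
    ]
    out_masks.append([row[:] for row in src_mask])
    return out_image, out_masks
```

Updating the destination masks matters as much as blending the pixels: an occluded instance must lose the covered pixels, otherwise the labels no longer match the image.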
StyleNeRF: A Style-based 3D Aware Generator for High-resolution Image Synthesis
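- Since NeRF (ECCV 2020), the task of generating images seen from multiple viewpoints with neural radiance fields has been drawing a great deal of attention.
- Proposes StyleNeRF, which brings the StyleGAN concept into NeRF so that 3D imagery can be generated while style attributes are also controllable.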
Patches Are All You Need?
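- Personally I found this paper really interesting, and I think those interested in ViT will enjoy it even more :)
- ViT's characteristics fall into two broad parts: '(1) the use of the Transformer, which is already powerful on its own' and '(2) splitting the input into patches'.
- The authors wanted to know which of (1) and (2) really contributes to ViT's high performance, and found, surprisingly, that (2) feeding the input in as patches has a large effect on performance.
- Building on this finding, they propose ConvMixer, a model that takes patch-wise input but is much lighter, more efficient, and simpler than ViT, and that uses only convolutions.
- ConvMixer has fewer parameters than ViT and ResNet yet reaches higher performance.
- Conclusion) Using patch embeddings matters a great deal in CV, just as tokenization does in NLP!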
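What "splitting the input into patches" amounts to can be shown with a small sketch: cut a single-channel image into non-overlapping patches and apply one shared linear projection to each, which is the same computation as a convolution whose kernel size and stride both equal the patch size. The function names and the toy weight are assumptions for illustration.

```python
def patchify(image, patch):
    """Split a 2-D image into non-overlapping, flattened patch vectors."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[y][x]
                            for y in range(top, top + patch)
                            for x in range(left, left + patch)])
    return patches

def patch_embed(image, patch, weight):
    """Project every patch with the same weight matrix (dim x patch*patch)."""
    return [[sum(w * p for w, p in zip(row, vec)) for row in weight]
            for vec in patchify(image, patch)]

image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patch_embed(image, 2, weight=[[1, 1, 1, 1]])  # 1-dim embedding
```

Each patch plays the role a token plays in NLP: the model downstream only ever sees the sequence of embedded patches.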
Audio-Guided Image Manipulation for Artistic Paintings
- Manipulates images based on audio.
- Aligns audio to the CLIP embedding space.
- Guides StyleGAN's latent code with audio so that the generated image matches the meaning of the sound.
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
- CNNs (pro) get good representations with few parameters thanks to inductive bias, but (con) are spatially local.
- ViTs (pro) obtain global representations, but (con) are heavy-weight.
- Idea: could the strengths of CNN and ViT be combined into a light and fast model?
- Of the three operations of a standard conv. (unfolding, local processing, folding), the local processing step is replaced with global processing by a transformer, so that only the strengths of CNN and ViT are used.
- Far higher performance with fewer parameters than the conv.-based MobileNet family, and even higher performance than ResNet-101 with a model about one ninth of its size.
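The unfold, global processing, fold pipeline can be sketched on a single-channel feature map. Here `global_mix` is a deliberately trivial stand-in (a mean across patches) for the transformer, and grouping tokens by their position inside the patch reflects my reading of the design; all names are assumptions for illustration.

```python
def unfold(fmap, ph, pw):
    """Group pixels by their position inside each ph x pw patch.

    Returns tokens[p][n]: the pixel at in-patch position p of patch n.
    """
    h, w = len(fmap), len(fmap[0])
    tokens = [[] for _ in range(ph * pw)]
    for top in range(0, h, ph):
        for left in range(0, w, pw):
            for dy in range(ph):
                for dx in range(pw):
                    tokens[dy * pw + dx].append(fmap[top + dy][left + dx])
    return tokens

def global_mix(tokens):
    """Stand-in for the transformer: mix information across all patches."""
    return [[sum(row) / len(row)] * len(row) for row in tokens]

def fold(tokens, h, w, ph, pw):
    """Inverse of unfold: put every token back at its pixel location."""
    fmap = [[0] * w for _ in range(h)]
    n = 0
    for top in range(0, h, ph):
        for left in range(0, w, pw):
            for dy in range(ph):
                for dx in range(pw):
                    fmap[top + dy][left + dx] = tokens[dy * pw + dx][n]
            n += 1
    return fmap

x = [[r * 4 + c for c in range(4)] for r in range(4)]
mixed = fold(global_mix(unfold(x, 2, 2)), 4, 4, 2, 2)
```

Since fold undoes unfold exactly, the spatial layout is preserved, and whatever is placed between them decides how far information travels across the map.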
BEiT: BERT Pre-Training of Image Transformers
- Follows BERT in NLP and presents a pre-training method for image transformers.
- BEiT (Bidirectional Encoder representation from Image Transformers)
- The image is tokenized into visual tokens, some patches are masked and fed to the transformer, and the pre-training objective is to recover the original visual tokens from the corrupted image patches.
- The day may not be far off when pre-trained models are used by default for image transformers too, as with BERT.
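The masking step of this kind of objective can be sketched as below. For brevity the model input is represented by the token ids themselves, whereas the real input would be patch embeddings; the uniform random masking, the `mask_ratio` value, the `MASK` sentinel, and the function names are all assumptions for illustration.

```python
import random

MASK = -1  # hypothetical sentinel standing in for a learned mask embedding

def mask_patches(token_ids, mask_ratio=0.4, seed=0):
    """Hide a random subset of patches and record what must be recovered.

    Returns (corrupted, targets): the corrupted sequence fed to the model,
    and a dict position -> original visual token id for the masked positions.
    """
    rng = random.Random(seed)
    n = len(token_ids)
    num_masked = max(1, round(n * mask_ratio))
    positions = sorted(rng.sample(range(n), num_masked))
    corrupted = list(token_ids)
    for i in positions:
        corrupted[i] = MASK
    targets = {i: token_ids[i] for i in positions}
    return corrupted, targets

def masked_accuracy(predictions, targets):
    """Fraction of masked positions whose visual token was recovered."""
    hits = sum(1 for i, t in targets.items() if predictions[i] == t)
    return hits / len(targets)
```

Only the masked positions are scored, so the model cannot succeed by copying its input and has to infer the missing content from the visible patches.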
Towards Real-World Blind Face Restoration with Generative Facial Prior
- The blind face restoration task: enhancing a low-quality face image to high quality.
- Low-quality faces do not carry accurate geometric information; this is solved with the rich priors held by a pretrained face GAN.
- Network: (1) a U-Net for degradation removal / (2) a pretrained GAN as prior that fills in facial details.
- Take-home message: rather than simply building a new or better GAN, more and more work uses an already trained GAN as a pre-trained model and extends it to various tasks; this also works well for blind face restoration, and a pre-trained GAN holds many useful priors.
instance segmentation
- Analyzes why class imbalance hinders training and solves the problem in a data-agnostic way (without using statistics of the data).
Bag of Tricks for Image Classification with Convolutional Neural Networks
- Effect of batch size, learning rate scheduling (warm-up, cosine lr decay), batch normalization initialization, no bias decay, low-precision training, model tweaks, label smoothing, knowledge distillation, mix-up training, transfer learning
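Of the tricks listed, the learning rate schedule is easy to show in a few lines: a linear warm-up followed by cosine decay. The function name, the step-based indexing, and the example values are assumptions for illustration.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a 0-indexed step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warm-up steps.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100, warmup_steps=5, base_lr=0.1)
            for s in range(100)]
```

Warm-up avoids large, unstable updates while the weights are still random, and the cosine curve lowers the rate slowly at first and faster in the middle of training.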
Repo of someone who summarizes papers

CastleJo's Dev Log
