# Model Configurations
dhansmair edited this page Sep 16, 2022 · 14 revisions
A comparison of model sizes: the smallest Flamingo model, Flamingo-3B, next to the two architectures in this repository that were trained on Conceptual Captions, flamingo-mini and flamingo-tiny.
The parameters for Flamingo-3B were extracted from the paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf
| | Flamingo-3B | flamingo-mini (ours) | flamingo-tiny (ours) | |
|---|---|---|---|---|
| params (trainable/total) | | 529M / 835M | 180M / 267M | (!) for ours, the vision encoder parameters are not included here |
| **language model** | Chinchilla | OPT-350m | OPT-125m | |
| # params | 1.4B | 350M | 125M | |
| # layers | 24 | 24 | 12 | |
| # heads | 16 | 16 | 12 | |
| embedding size | 2048 | 1024 | 768 | |
| vocabulary size | 32000 | 50256 | 50256 | |
| **vision encoder** | NFNet-F6 | CLIP ViT-L/14 | CLIP ViT-L/14 | |
| # params | 435M | 303M | 303M | |
| output shape | ? x 1536 | 257 x 1024 | 257 x 1024 | |
| **perceiver resampler** | | | | |
| # params | 194M | 101M | 63M | |
| # heads | 16 | 16 | 8 | |
| # layers | 6 | 6 | 6 | |
| hidden size | 1536 | 1024 | 1024 | = vision encoder hidden size |
| KV size | 128 | 128 | 64 | |
| # latents | 64 | 64 | 64 | |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU | |
| **gated cross-attention** | | | | |
| # params | 1.2B | | | |
| # heads | 16 | 16 | 8 | |
| # layers (freq) | 24 (every) | 24 (every) | 12 (every) | |
| hidden size | 2048 | 1024 | 768 | = LM embedding size |
| KV size | 128 | 128 | 64 | |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU | |
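For quick programmatic comparison, the table above can be summarized as plain Python. This is only an illustrative sketch: the class and field names below are made up for this page and are not the repository's actual configuration API.

```python
# Illustrative summary of the configurations in the table above.
# FlamingoConfig and its field names are hypothetical, not the repo's API.
from dataclasses import dataclass

@dataclass
class FlamingoConfig:
    language_model: str
    lm_layers: int
    lm_heads: int
    embedding_size: int
    vision_encoder: str
    resampler_heads: int
    resampler_layers: int
    num_latents: int
    xattn_heads: int
    xattn_kv_size: int

FLAMINGO_3B   = FlamingoConfig("Chinchilla", 24, 16, 2048, "NFNet-F6",      16, 6, 64, 16, 128)
FLAMINGO_MINI = FlamingoConfig("OPT-350m",   24, 16, 1024, "CLIP ViT-L/14", 16, 6, 64, 16, 128)
FLAMINGO_TINY = FlamingoConfig("OPT-125m",   12, 12,  768, "CLIP ViT-L/14",  8, 6, 64,  8,  64)

# The 257 x 1024 output shape of CLIP ViT-L/14 follows from a 224x224 input:
# (224 / 14)^2 = 256 patch tokens, plus one [CLS] token = 257 tokens of width 1024.
assert (224 // 14) ** 2 + 1 == 257
```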
From the Flamingo paper:
