Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training).

The goal of this repository is to enable training models with contrastive image-text supervision, and to investigate their properties such as robustness to distribution shift. Our starting point is an implementation of CLIP that matches the accuracy of the original CLIP models when trained on the same dataset. Specifically, a ResNet-50 model trained with our codebase on OpenAI's 15 million image subset of YFCC achieves 32.7% top-1 accuracy on ImageNet. OpenAI's CLIP model reaches 31.3% when trained on the same subset of YFCC. For ease of experimentation, we also provide code for training on the 3 million images in the Conceptual Captions dataset, where a ResNet-50x4 trained with our codebase reaches 22.2% top-1 ImageNet accuracy.

We further this with a replication study on a dataset of comparable size to OpenAI's, LAION-400M, and with larger datasets such as LAION-2B and DataComp-1B. In addition, we study scaling behavior in a paper on reproducible scaling laws for contrastive language-image learning.

We have trained the following ViT CLIP models:

* ViT-B/32 on LAION-400M with an accuracy of 62.9%, comparable to OpenAI's 63.2%, zero-shot top-1 on ImageNet-1k.
* ViT-B/32 on LAION-2B with an accuracy of 66.6%.
* ViT-B/16 on LAION-400M achieving an accuracy of 67.1%, lower than OpenAI's 68.3% (as measured here, 68.6% in paper).
* ViT-B/16 on LAION-2B with an accuracy of 70.2%.
* ViT-L/14 on LAION-400M with an accuracy of 72.77%, vs OpenAI's 75.5% (as measured here, 75.3% in paper).
* ViT-L/14 on LAION-2B with an accuracy of 75.3%, vs OpenAI's 75.5% (as measured here, 75.3% in paper).
* ViT-L/14 on DataComp-1B with an accuracy of 79.2%. Our best ViT-L/14 so far, trained with a 13B samples seen schedule.
* CoCa ViT-L/14 on LAION-2B with an accuracy of 75.5% (currently only 13B samples seen) vs. CLIP ViT-L/14 at 73.1% (on the same dataset and samples seen).
* ViT-H/14 on LAION-2B with an accuracy of 78.0%.
* ViT-g/14 on LAION-2B with an accuracy of 76.6%. This was trained on a reduced 12B samples seen schedule, the same samples seen as the 400M models.
* ViT-g/14 on LAION-2B with an accuracy of 78.5%.
* ViT-G/14 on LAION-2B with an accuracy of 80.1%. The best in1k zero-shot for released, open-source weights thus far.

We have also trained the following ConvNext CLIP models:

* ConvNext-Base 224x224 on LAION-400M with an ImageNet-1k zero-shot top-1 of 66.3%.
* ConvNext-Base (W) 256x256 on LAION-2B with an ImageNet-1k zero-shot top-1 of 70.8%.
* ConvNext-Base (W) 256x256 /w augreg (extra augmentation + regularization) on LAION-2B with a top-1 of 71.5%.
* ConvNext-Base (W) 256x256 on LAION-A (a 900M sample aesthetic subset of LAION-2B) with a top-1 of 71.0%.
* ConvNext-Large (D) 256x256 /w augreg on LAION-2B with a top-1 of 75.9%.
* ConvNext-Large (D) 320x320 fine-tune of the 256x256 weights above for ~2.5B more samples on LAION-2B, with a top-1 of 76.6%.
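As a quick illustration of how pretrained weights like those above can be used, the minimal sketch below loads a ViT-B/32 checkpoint through the `open_clip` Python API and scores an image against a few text prompts (zero-shot classification). The pretrained tag `laion2b_s34b_b79k`, the image path, and the candidate captions are assumptions for illustration; the model/pretrained pairs that are actually available can be listed with `open_clip.list_pretrained()`.

```python
import torch
from PIL import Image
import open_clip

# Load a pretrained ViT-B/32. The pretrained tag is an assumed example;
# check open_clip.list_pretrained() for the tags available in your install.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='laion2b_s34b_b79k')
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Hypothetical input image and candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each caption, softmaxed into probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(text_probs)  # probabilities over the three candidate captions
```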