Long context CLIP #876

nicolas-dufour · 2024-04-08T10:50:35Z

nicolas-dufour
Apr 8, 2024

Hi,
Is there any CLIP model that manages longer context length than 77 (ideally >256)

Is there a reason why the context length is set to 77? LAION has too short alt-texts overall?

Thanks!

rom1504 · 2024-04-08T11:09:58Z

rom1504
Apr 8, 2024
Maintainer

You need data that has meaningful long captions
You need eval tasks where meaningful long captions are important (so not zero shot classification and not most caption retrieval evals)

0 replies

BIGBALLON · 2024-04-08T11:31:09Z

BIGBALLON
Apr 8, 2024

@nicolas-dufour @rom1504

BTW： maybe Long-CLIP is your need?

🔥 Long Input length Increase the maximum input length of CLIP from 77 to 248.
🔥 Strong performance Improve the R@5 of long-caption text-image retrieval by 20% and traditional text-image retrieval by 6%.
🔥 Plug-in and play Can be directly applied in any work that requires long-text capability.

0 replies

rwightman · 2024-05-09T15:49:06Z

rwightman
May 9, 2024
Maintainer

Moving this to discussions now for reference.

0 replies

PixelChen24 · 2024-10-08T13:45:57Z

PixelChen24
Oct 8, 2024

I wonder whether replacing the default tokenizer with others will work?

0 replies

sachinruk · 2024-10-22T11:00:39Z

sachinruk
Oct 22, 2024

Just wanted to double check. Does this mean that all models in this library have a context length of 77 at most?

1 reply

rwightman Oct 22, 2024
Maintainer

@sachinruk yes, they were trained from scratch with noisy internet image-text web data (openai wit, laion, datacomp, dfn, webli) that typically has fairly short captions so 32-77 tokens is the range here.

Having quality text beyond that range requires either adapting an existing longer context LLM as part of a VLM, or if training from scratch, a LOT of image-text data with higher quality, longer captions (which would be a challenge an billion scale).

This one might be the one exception https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k .. it was an experiment using an existing text encoder.

Could also fine-tune one of these existing models on a decent size image-text data with longer captions and increase the context length...

rom1504 · 2024-10-22T15:00:06Z

rom1504
Oct 22, 2024
Maintainer

I think the first thing to solve on this topic is producing a good long captions image text *open* large scale dataset. Is there anything yet? If not, probably it would mean running one of the small VLM on 1B images

…

On Tue, Oct 22, 2024, 16:13 Ross Wightman ***@***.***> wrote: @sachinruk <https://github.com/sachinruk> yes, they were trained from scratch with noisy internet image-text web data (openai wit, laion, datacomp, dfn, webli) that typically has fairly short captions so 32-77 tokens is the range here. Having quality text beyond that range requires either adapting an existing longer context LLM as part of a VLM, or if training from scratch, a LOT of image-text data with higher quality, longer captions (which would be a challenge an billion scale). This one might be the one exception https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k .. it was an experiment using an existing text encoder. Could also fine-tune one of these existing models on a decent size image-text data with longer captions and increase the context length... — Reply to this email directly, view it on GitHub <#876 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAR437V4WGY7XLBLJRNV5S3Z4ZMQXAVCNFSM6AAAAABQMGZSDKVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCMBRHAYDAOA> . You are receiving this because you were mentioned.Message ID: ***@***.*** com>

0 replies

rom1504 · 2024-10-22T15:04:39Z

rom1504
Oct 22, 2024
Maintainer

The second very important thing is having tasks to evaluate the meaning of long text alignment with images. I am guessing the interest here is about using CLIP as an evaluator or conditioner for GenAI. In that case you may want to consider whether you really need full attention understanding of the text and the link of it with images. If you actually don't... Then what about cutting your text in N pieces of size 77 and pooling the resulting embeddings? You can get infinite context length by doing that.

…

On Tue, Oct 22, 2024, 16:59 Romain Beaumont ***@***.***> wrote: I think the first thing to solve on this topic is producing a good long captions image text *open* large scale dataset. Is there anything yet? If not, probably it would mean running one of the small VLM on 1B images On Tue, Oct 22, 2024, 16:13 Ross Wightman ***@***.***> wrote: > @sachinruk <https://github.com/sachinruk> yes, they were trained from > scratch with noisy internet image-text web data (openai wit, laion, > datacomp, dfn, webli) that typically has fairly short captions so 32-77 > tokens is the range here. > > Having quality text beyond that range requires either adapting an > existing longer context LLM as part of a VLM, or if training from scratch, > a LOT of image-text data with higher quality, longer captions (which would > be a challenge an billion scale). > > This one might be the one exception > https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k > .. it was an experiment using an existing text encoder. > > Could also fine-tune one of these existing models on a decent size > image-text data with longer captions and increase the context length... > > — > Reply to this email directly, view it on GitHub > <#876 (reply in thread)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AAR437V4WGY7XLBLJRNV5S3Z4ZMQXAVCNFSM6AAAAABQMGZSDKVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCMBRHAYDAOA> > . > You are receiving this because you were mentioned.Message ID: > ***@***.*** > com> >

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Long context CLIP #876

{{title}}

Replies: 7 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Long context CLIP #876

nicolas-dufour Apr 8, 2024

Replies: 7 comments · 1 reply

rom1504 Apr 8, 2024 Maintainer

BIGBALLON Apr 8, 2024

rwightman May 9, 2024 Maintainer

PixelChen24 Oct 8, 2024

sachinruk Oct 22, 2024

rwightman Oct 22, 2024 Maintainer

rom1504 Oct 22, 2024 Maintainer

rom1504 Oct 22, 2024 Maintainer

nicolas-dufour
Apr 8, 2024

Replies: 7 comments 1 reply

rom1504
Apr 8, 2024
Maintainer

BIGBALLON
Apr 8, 2024

rwightman
May 9, 2024
Maintainer

PixelChen24
Oct 8, 2024

sachinruk
Oct 22, 2024

rwightman Oct 22, 2024
Maintainer

rom1504
Oct 22, 2024
Maintainer

rom1504
Oct 22, 2024
Maintainer