Meta CLIP 2: A Worldwide Scaling Recipe

Meta FAIR · MIT CSAIL · Princeton University · New York University
To appear at NeurIPS 2025

CLIP has become a cornerstone of modern AI, powering everything from zero-shot image classification to serving as the vision backbone for multimodal LLMs. We've successfully trained CLIP on billions of English image-text pairs from the web. But here's the problem: what about the rest of the world that makes up the other 60% of the web?

When we try to scale CLIP to learn from worldwide web data, we hit two major roadblocks. First, there are no existing methods for curating and balancing data from non-English languages—the methods that work for English don't trivially transfer. Second, existing multilingual CLIP models tend to perform worse than their English-only counterparts on English benchmarks. It's the classic "curse of multilinguality": trying to do everything means you end up doing nothing particularly well.

MetaCLIP 2

We're excited to introduce MetaCLIP 2—the first practical recipe for training CLIP from scratch on worldwide web-scale data. Our approach is surprisingly simple: through careful ablations, we identified the minimal set of changes needed to make English and non-English data work together. The result? A recipe that creates mutual benefits between languages rather than forcing them to compete.

And it works. Our ViT-H/14 model beats its English-only counterpart by 0.8% on ImageNet zero-shot classification and outperforms mSigLIP by 0.7%. More importantly, without any translation tricks or architectural gymnastics, we're setting new state-of-the-art results on multilingual benchmarks: 57.4% on CVQA, 50.2% on Babel-ImageNet, and 64.3% on XM3600 image-to-text retrieval. The best part? English performance doesn't suffer—it actually improves.

Why Directly Training on Web Data Fails

Let's start with the fundamental problem. If you just scrape the web and train CLIP on whatever you find, you'll quickly run into trouble. Why? Because the Internet is wildly imbalanced.

A handful of common concepts—think "cat," "dog," "car"—appear millions of times and completely dominate your training data. Meanwhile, more specific concepts like "Taipei 101" or "MIT Dome" show up rarely, making them nearly impossible for the model to learn. The model ends up great at recognizing cats, but terrible at everything else. Not exactly what we're going for.

Figure: Web data imbalance (head vs. long-tail concepts).

MetaCLIP 1: Substring Matching + Balancing

So how do we fix this imbalance? The original MetaCLIP introduced a clever metadata-driven approach with three simple steps:

Step 1: Build a concept dictionary. We start by creating a list of real-world concepts using Wikipedia and WordNet. This gives us everything from common words like "dog" and "cat" to specific landmarks like "Taipei 101" and "MIT Dome."

Step 2: Match captions to concepts. For each image-text pair, we check if the caption mentions any concept in our dictionary. If it doesn't match anything? We throw it out. This helps filter out low-quality or irrelevant data.

Step 3: Balance the dataset. Here's the key insight: we cap the maximum number of images per concept at some threshold t. This prevents "cat" from appearing a million times while "Taipei 101" only shows up once. For example:

  • To create 400M balanced pairs, we start with ~1.6B scraped pairs and set t = 20k
  • To create 2.5B balanced pairs, we start with ~10.7B scraped pairs and set t = 170k
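
To make the capping concrete, here is a minimal sketch of the keep-probability rule. The concept counts below are made up for illustration; the real pipeline (shown in the pseudo-code later in this post) applies this per matched metadata entry.

# Minimal sketch of concept balancing: pairs matched to a head concept are
# downsampled with probability t / count; tail concepts are kept entirely.
# (Counts are hypothetical, for illustration only.)
t = 20_000  # per-concept cap for the 400M-pair setting

concept_counts = {"cat": 2_000_000, "dog": 1_500_000, "taipei 101": 800}

keep_prob = {c: min(1.0, t / n) for c, n in concept_counts.items()}
print(keep_prob)       # {'cat': 0.01, 'dog': 0.0133..., 'taipei 101': 1.0}

expected_kept = {c: round(n * keep_prob[c]) for c, n in concept_counts.items()}
print(expected_kept)   # ~20k cats, ~20k dogs, and all 800 "taipei 101" pairs survive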
Figure: Balancing results. English filtering → substring matching → balancing progressively improves ImageNet zero-shot performance.

The results speak for themselves: this pipeline dramatically improves performance on ImageNet and other benchmarks.

What's the Bottleneck?

MetaCLIP 1 works great for English, but there's a huge problem: the English filtering step throws away 60% of the web.

Think about what we're losing. The original pipeline only keeps English captions, discarding a massive amount of high-quality data in Chinese, Spanish, Arabic, and hundreds of other languages. This isn't just about missing out on multilingual capabilities—it's about missing out on diverse visual concepts that simply don't appear as often in English data.

Our goal with MetaCLIP 2 isn't just to add multilingual support (though that's nice). We want to improve the model's overall visual understanding by tapping into 2.5× more training data. That's why we evaluate on both English-only benchmarks AND multilingual datasets—to prove that going worldwide helps everyone, not just non-English users.

MetaCLIP 2: Worldwide Curation Pipeline

Here's where things get interesting. We extend the entire pipeline to work with 329 Wikipedia languages.


First challenge: substring matching across languages. We build multilingual metadata from Wikipedia and OpenWordNet, extracting unigrams, bigrams, and article titles. For languages whose scripts don't separate words with spaces (like Chinese and Japanese), we use language-specific tokenizers to segment the text properly. Nothing revolutionary here—just careful engineering.
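
Here is a minimal sketch of what language-aware matching looks like. The match_concepts() and segment() helpers and the tiny metadata dictionaries are illustrative placeholders; the real pipeline matches against full per-language metadata and returns entry IDs rather than strings.

# Minimal sketch of language-aware substring matching (illustrative only).
def segment(text, lang):
    # Space-separated scripts: a whitespace split is enough.
    # zh/ja would need a proper word segmenter plugged in here.
    if lang in {"zh", "ja"}:
        raise NotImplementedError("plug in a language-specific word segmenter")
    return text.lower().split()

# Per-language metadata: unigrams, bigrams, and Wikipedia article titles.
metadata = {
    "en": {"cat", "taipei 101", "mit dome"},
    "es": {"gato", "sagrada familia"},
}

def match_concepts(text, lang):
    words = segment(text, lang)
    ngrams = set(words) | {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}
    return ngrams & metadata[lang]

print(match_concepts("A photo of Taipei 101 at night", "en"))  # {'taipei 101'}
print(match_concepts("Un gato durmiendo al sol", "es"))        # {'gato'}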


Second challenge: balancing 329 languages. Remember that threshold t we used for English? That number doesn't work for other languages. English has far more web data than, say, Spanish, so reusing the English threshold everywhere would label almost every non-English concept as "tail" and leave that data essentially unbalanced. Manually tuning t for 300+ languages? Infeasible.


Our solution: keep the tail ratio constant. In our English MetaCLIP scaling experiments, the least frequent concepts (the "tail") consistently account for about 6% of all image-text pairs, regardless of data scale. So instead of fixing t, we find, for each language, the threshold that puts 6% of its pairs in the tail, giving non-English data the same tail proportion as English. It's a simple heuristic that works beautifully across all languages.

Figure: Language-specific thresholds. We set t on English data to achieve a 6% tail; for each new language (e.g., Spanish), we find the threshold at which the rarest 6% of its image-text pairs begin.

The Algorithm: A Closer Look

Our curation algorithm extends MetaCLIP to handle 329 languages. Here's the complete pseudo-code:

"""
Input: 
  D (list): raw (image, text) pairs, each text.lang assigned by LID
  M (dict): worldwide metadata, key=language code, value=metadata for that language
  t_en (int): English threshold (OpenAI CLIP=20k, MetaCLIP=170k)

Output: 
  D_star (list): curated image-text pairs
"""

import random
import numpy as np

# Helper functions to compute t for each language
def t_to_p(t, entry_count):
    # Convert threshold t to tail proportion p
    return entry_count[entry_count < t].sum() / entry_count.sum()

def p_to_t(p, entry_count):
    # Convert tail proportion p to threshold t
    sorted_count = np.sort(entry_count)
    cumsum_count = np.cumsum(sorted_count)
    cumsum_prob = cumsum_count / sorted_count.sum()
    return sorted_count[(np.abs(cumsum_prob - p)).argmin()]

# Stage 1: Substring matching
entry_counts = {lang: np.zeros(len(M[lang])) for lang in M}
for image, text in D:
    # Match text with language-specific metadata
    text.matched_entry_ids = substr_match(text, M[text.lang])
    entry_counts[text.lang][text.matched_entry_ids] += 1

# Stage 2: Compute t for each language
p = t_to_p(t_en, entry_counts["en"])  # Get tail proportion from English
t = {}
for lang in entry_counts:
    t[lang] = p_to_t(p, entry_counts[lang])  # Compute language-specific t

# Stage 3: Balancing via sampling
entry_probs = {}
for lang in entry_counts:
    # Clamp tail counts up to t[lang] so tail entries get probability 1.0;
    # head entries are downsampled with probability t[lang] / count.
    entry_counts[lang][entry_counts[lang] < t[lang]] = t[lang]
    entry_probs[lang] = t[lang] / entry_counts[lang]

D_star = []
for image, text in D:
    # A pair is kept if any one of its matched entries accepts it.
    for entry_id in text.matched_entry_ids:
        if random.random() < entry_probs[text.lang][entry_id]:
            D_star.append((image, text))
            break
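
To see the tail-ratio heuristic in action, here is a toy run of the two helpers defined above (appended to the script; the counts are synthetic and chosen only to make the numbers easy to follow):

# Toy demonstration of t_to_p / p_to_t with synthetic per-concept match counts
# for a large language (English) and a much smaller one (Spanish).
en_counts = np.array([100_000, 80_000, 50_000, 6_000, 5_000, 4_000])
es_counts = np.array([40_000, 9_000, 800, 500, 400])

t_en_toy = 10_000                    # pretend English threshold
p_toy = t_to_p(t_en_toy, en_counts)  # tail ratio implied by t_en_toy -> ~0.06
t_es = p_to_t(p_toy, es_counts)      # Spanish threshold with the same tail ratio -> 800

print(f"tail ratio p = {p_toy:.3f}, t_es = {t_es}")
# The Spanish threshold comes out far below the English one, as expected
# for a language with much less web data.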

In short, the three stages are:

  • Stage 1 (Matching): Use language ID to match each text with language-specific metadata; count matches per concept.
  • Stage 2 (Thresholds): Compute a language-specific threshold t_lang by maintaining the 6% tail ratio from English.
  • Stage 3 (Balancing): Sample pairs based on concept rarity; tail concepts are always included, head concepts are downsampled.

Result: From ~10.7B raw worldwide pairs → 2.5B curated pairs with balanced concept distribution across all 329 languages.

Training at Scale: What Changes?

Beyond data curation, we need to rethink how we train the model. When you're adding 2.5× more data from non-English sources, you can't just plug it in and hope for the best. Here's what we changed:

Scaling seen pairs proportionally. Since English constitutes 44% of our curated data, we scale the number of seen training pairs by 2.3× (from 13B to 29B). This ensures English data gets the same exposure as before, while non-English data gets its fair share. We achieve this by increasing the global batch size from 32,768 to 75,366—keeping everything else (learning rate, warmup, etc.) unchanged.
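
The scaling itself is simple arithmetic; here is a quick sanity check of the numbers above (the rounding is ours):

# Sanity check of the scaling numbers quoted in this section.
english_share = 0.44               # English fraction of the curated data
scale = 1 / english_share          # ~2.27, reported as 2.3x in the text

seen_pairs = 13e9 * 2.3            # ~29.9B, reported as 29B
batch_size = round(32_768 * 2.3)   # 75,366, the new global batch size
print(scale, seen_pairs, batch_size)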

Using a multilingual tokenizer. English tokenizers don't cut it for 329 languages. After testing four popular options (mT5, Gemma, XLM-Roberta, XLM-V), we found that XLM-V with its 900k vocabulary delivers the best performance on both English and multilingual benchmarks. It's the only architecture change we make—no custom layers, no translation modules, no architectural gymnastics.
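
For reference, here is a rough sketch of what using XLM-V's tokenizer looks like in practice. It assumes the Hugging Face transformers library and the facebook/xlm-v-base checkpoint; the exact tokenizer build used for MetaCLIP 2 training may differ.

from transformers import AutoTokenizer

# Assumption: XLM-V's tokenizer is published as "facebook/xlm-v-base" on the
# Hugging Face Hub; the checkpoint used in our training runs may differ.
tok = AutoTokenizer.from_pretrained("facebook/xlm-v-base")
print(tok.vocab_size)  # ~900k entries, vs ~250k for mT5 / XLM-Roberta

# The larger vocabulary tends to split non-Latin scripts into fewer,
# more meaningful pieces.
for caption in ["a photo of Taipei 101", "台北101的夜景"]:
    print(caption, "->", tok.tokenize(caption))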

Breaking the Curse of Multilinguality

Here's the million-dollar question: why does adding multilingual data usually hurt English performance? We call this the "curse of multilinguality," and it's plagued every multilingual CLIP attempt until now.

The answer turns out to be surprisingly simple: insufficient model capacity.

When we train ViT-L/14 (the largest model OpenAI used) on worldwide data, the curse persists—English performance drops compared to English-only training. But when we scale up to ViT-H/14, something magical happens: both English and multilingual performance improve simultaneously. The model finally has enough capacity to learn from all that diverse data without forgetting English.

It's not just about memorizing more concepts. The larger model can maintain strong English understanding while building multilingual capabilities. This is the inflection point where worldwide data transforms from a liability into an asset.

Figure: Breaking the curse of multilinguality. Left: ViT-L/14 suffers from the curse (worldwide data hurts English performance), while ViT-H/14 breaks it: non-English data helps English. Right: English data also helps non-English performance. Mutual benefits achieved!

Key Insight: ViT-H/14 is the minimal viable model capacity for breaking the curse. With this architecture and our scaled training recipe, we achieve mutual benefits—English data helps multilingual performance, and multilingual data helps English performance. Win-win.

Does It Actually Work?

Let's cut to the chase. Here are the numbers that matter:

  • 81.3% ImageNet zero-shot accuracy (+0.8% vs English-only MetaCLIP)
  • 64.3% XM3600 image-to-text retrieval (new SOTA, +1.5% vs mSigLIP)
  • 57.4% CVQA local questions (+7.6% vs mSigLIP)
  • 50.2% Babel-ImageNet across 280 languages (+3.8% vs mSigLIP)

All achieved with roughly 72% of mSigLIP's seen pairs (29B vs 40B) and a lower image resolution.

How Does MetaCLIP 2 Stack Up?

Let's compare MetaCLIP 2 against the current multilingual champions: mSigLIP, SigLIP 2, and XLM-CLIP. Remember, these are industrial-strength systems with lots of engineering tricks. We're keeping it simple—just better data curation and smart scaling.

Model              | ViT Size | Seen Pairs | ImageNet | SLIP 26 | DC 37 | Babel-IN | XM3600 (T→I / I→T) | CVQA (EN / Local)
XLM-CLIP           | H/14     | 32B        | 77.0     | 69.4    | 65.5  | 34.0     | 50.4 / 60.5        | 56.1 / 48.2
mSigLIP            | SO400M   | 40B        | 80.6     | 69.1    | 65.5  | 46.4     | 50.0 / 62.8        | 56.8 / 49.8
SigLIP 2           | SO400M   | 40B        | 83.2     | 73.7    | 69.4  | 40.8     | 48.2 / 59.7        | 58.5 / 49.0
MetaCLIP (English) | H/14     | 13B        | 80.5     | 72.4    | 66.5  | —        | —                  | —
MetaCLIP 2 (Ours)  | H/14     | 29B        | 81.3     | 74.5    | 69.6  | 50.2     | 51.5 / 64.3        | 61.5 / 57.4

(ImageNet, SLIP 26, and DC 37 are English benchmarks; Babel-IN, XM3600, and CVQA are multilingual.)

What stands out? MetaCLIP 2 leads on nearly every benchmark while using significantly fewer training pairs (29B vs 40B for mSigLIP and SigLIP 2). We beat mSigLIP by +0.7% on ImageNet and +5.4% on SLIP 26, with large gains on multilingual tasks: +3.8% on Babel-ImageNet, +1.5% / +1.5% on XM3600 (T→I / I→T), and +4.7% / +7.6% on CVQA (EN / Local).

Notice how SigLIP 2 prioritizes English (90% English data) at the expense of multilingual performance—it's actually worse than mSigLIP on multilingual tasks. We don't have to make that trade-off. Our worldwide data helps everything.

The Power of Mixing: English + Non-English

Here's where it gets interesting. We ran a series of ablations to understand exactly how English and non-English data interact. Does adding non-English hurt English performance? Does English data help multilingual capabilities? Let's find out.

Configuration        | Model    | Seen Pairs | ImageNet | Babel-IN | XM3600 (T→I / I→T) | CVQA (EN / Local)
English only         | ViT-L/14 | 13B (1.0×) | 79.5     | —        | —                  | —
Worldwide            | ViT-L/14 | 29B (2.3×) | 78.8     | 44.2     | 45.3 / 58.2        | 59.2 / 55.1
English only         | ViT-H/14 | 13B (1.0×) | 80.4     | —        | —                  | —
Non-English only     | ViT-H/14 | 17B (1.3×) | 71.4     | 49.9     | 46.9 / 59.9        | 59.8 / 56.8
Worldwide (no scale) | ViT-H/14 | 13B (1.0×) | 79.5     | 47.1     | 49.6 / 62.6        | 59.9 / 56.0
Worldwide (scaled)   | ViT-H/14 | 29B (2.3×) | 81.3     | 50.2     | 51.5 / 64.3        | 61.5 / 57.4

Key observations:

  • ViT-L/14 suffers from the curse — worldwide data at 2.3× seen pairs actually hurts English performance (78.8 vs 79.5). The model doesn't have enough capacity.
  • ViT-H/14 needs scaling — at 1.0× seen pairs, we still see degradation (79.5 vs 80.4). Simply mixing data isn't enough.
  • The magic recipe: ViT-H/14 + scaled pairs — English performance improves to 81.3 (+0.9%), while multilingual performance soars. This is the sweet spot.
  • Mutual benefits are real — Compare "Non-English only" (71.4 ImageNet) vs "Worldwide scaled" (81.3). English data dramatically helps multilingual models on English tasks. Similarly, non-English data pushes English performance higher than English-only training.

Getting the Details Right: Curation Matters

We didn't get to the final recipe overnight. It took careful ablations to understand which design choices actually matter. Let's walk through the journey from English-only CLIP to full MetaCLIP 2:

Step                         | Configuration                               | ImageNet | Babel-IN | XM3600 (T→I / I→T) | CVQA (EN / Local)
1. Baseline                  | English CLIP with English filter            | 67.5     | —        | —                  | —
2. Remove filter             | Use all alt-texts, English metadata         | 66.9     | —        | —                  | —
3. Add multilingual metadata | All metadata in one set, no isolation       | 62.1     | 31.2     | 37.8 / 49.7        | 49.8 / 45.8
4. Language isolation        | Separate metadata/texts by language, same t | 61.1     | 31.5     | 37.9 / 49.4        | 49.0 / 46.5
5. Language-specific t       | Compute t_lang per language (6% tail)       | 64.7     | 31.5     | 38.1 / 50.0        | 50.3 / 46.6

What did we learn?

  • Step 2: Simply removing the English filter hurts performance (−0.6%). Language identification and isolation are crucial.
  • Step 3: Merging all metadata without language separation tanks English performance (−5.4%). Different languages need different treatment.
  • Step 4: Language isolation helps, but using the same threshold t for all languages is still suboptimal. It lets head concepts dominate in smaller languages.
  • Step 5: Language-specific thresholds t_lang based on the 6% tail ratio recover most of the lost performance (+3.6%). This is our final recipe.

The moral of the story? Details matter. Naive multilingual scaling fails. But with careful, language-specific balancing, we can maintain strong performance across all languages.

Beyond Benchmarks: Cultural Diversity

Standard benchmarks are great, but they often miss something crucial: cultural and geographic diversity. Does our model understand concepts from around the world, not just North America and Western Europe?

We evaluated on benchmarks specifically designed to test cultural understanding: Dollar Street (objects from different income levels globally), GeoDE (geographically diverse objects), and GLDv2 (landmark recognition). Here's how training data affects cultural understanding:

Training Data        | Seen Pairs | Dollar Street (Top-1 / Top-5) | GLDv2 | GeoDE
English only         | 13B (1.0×) | 37.2 / 63.3                   | 52.8  | 93.4
Non-English only     | 17B (1.3×) | 35.7 / 61.3                   | 68.6  | 91.7
Worldwide (no scale) | 13B (1.0×) | 37.2 / 63.7                   | 65.8  | 94.3
Worldwide (scaled)   | 29B (2.3×) | 37.9 / 64.0                   | 69.0  | 93.4

The pattern is clear:

  • Worldwide data dramatically improves landmark recognition — GLDv2 jumps from 52.8 to 69.0 (+16.2 points). English-only data has a strong Western bias; worldwide data fixes this.
  • Cultural understanding improves consistently — Dollar Street top-1 accuracy improves from 37.2 to 37.9, but more importantly, the model maintains strong performance while gaining massive multilingual capabilities.
  • Geographic coverage matters — Non-English data alone gives great GLDv2 results (68.6), but combining with English data pushes even higher (69.0). Diversity wins.

We also tested few-shot geo-localization on Dollar Street, GeoDE, and XM3600. The trend is consistent: worldwide training data leads to better geographic and cultural understanding. When you train on data from the actual world instead of just the English-speaking world, you get models that understand the actual world.

Choosing the Right Tokenizer

One critical choice for multilingual models: which tokenizer to use? We tested four popular multilingual tokenizers on our ViT-B/32 model to find the winner:

Tokenizer                | Vocab Size | ImageNet | Babel-IN | XM3600 (T→I / I→T) | CVQA (EN / Local)
mT5 (used by mSigLIP)    | 250k       | 64.7     | 31.5     | 38.1 / 50.0        | 50.3 / 46.6
Gemma (used by SigLIP 2) | 256k       | 63.7     | 26.1     | 36.1 / 47.8        | 48.3 / 44.0
XLM-Roberta              | 250k       | 64.0     | 31.1     | 38.0 / 49.8        | 49.8 / 46.1
XLM-V (our choice)       | 900k       | 64.7     | 32.7     | 40.0 / 51.4        | 50.4 / 47.4

XLM-V wins decisively on multilingual benchmarks (+1.2% on Babel-IN, +1.9% / +1.4% on XM3600 T→I / I→T, +0.1% / +0.8% on CVQA EN / Local) while matching English performance. Its massive 900k vocabulary provides better coverage for 329 languages, especially for languages with non-Latin scripts.

Interestingly, Gemma—used by the recent SigLIP 2—performs significantly worse on multilingual tasks despite having a comparable vocabulary size. This shows that vocabulary design matters just as much as vocabulary size.

Why MetaCLIP 2 Matters

Let's step back and look at the bigger picture. What makes MetaCLIP 2 special? It's not just about hitting new benchmarks—it's about fundamentally rethinking how we build vision-language models.

🤝 Mutual Benefits

For the first time, English and non-English data help each other. English performance improves when we add multilingual data (81.3% vs 80.5% on ImageNet), and multilingual performance soars with English data included. No more trade-offs.

🌍 Native Language Supervision

Our model learns from alt-texts written by native speakers in 329 languages—not machine translations. This means authentic linguistic and cultural knowledge, not synthetic approximations.

🎭 Cultural Diversity

By training on worldwide web data, we capture concepts, landmarks, and visual patterns from every corner of the globe. The result? A model that understands "Taipei 101" as well as it understands "Empire State Building."

🚫 No-Filter Philosophy

We removed the last major filter in CLIP training: language. By embracing all languages instead of discarding 60% of the web, we maximize diversity and minimize biases. More data, better models.

🔬 Scientific Rigor

Our recipe maximizes overlap with OpenAI CLIP and MetaCLIP—no proprietary data, no translation, no architectural tricks. Every change is carefully ablated and justified. The findings generalize.

🌐 Broader Impact

This isn't just about CLIP. Our curation algorithm provides foundational worldwide data that benefits multimodal LLMs, self-supervised learning, and image generation models. It's a rising tide that lifts all boats.

Wrapping Up

So what have we learned? Training CLIP on worldwide web data doesn't have to be complicated. With a few carefully chosen changes to data curation—extending substring matching to 329 languages and using a constant tail ratio for balancing—we get something remarkable: English and non-English data actually help each other.

The results prove it. We achieve new state-of-the-art on multilingual benchmarks while also improving English-only performance. No translation hacks, no custom architectures, no curse of multilinguality. Just a simple, scalable recipe that works.

All our code and models are open source. We hope this makes it easier for the community to build even better multilingual vision-language models. Because at the end of the day, AI should work for everyone, not just English speakers.

Citation

To cite this work, please use the following BibTeX entry:

@inproceedings{chuang2025metaclip2,
    title={MetaCLIP 2: A Worldwide Scaling Recipe},
    author={Yung-Sung Chuang and Yang Li and Dong Wang and Ching-Feng Yeh and Kehan Lyu and Ramya Raghavendra and James Glass and Lifei Huang and Jason Weston and Luke Zettlemoyer and Xinlei Chen and Zhuang Liu and Saining Xie and Wen-tau Yih and Shang-Wen Li and Hu Xu},
    booktitle={Advances in Neural Information Processing Systems},
    note={to appear},
    year={2025},
    url={https://arxiv.org/abs/2507.22062},
    eprint={2507.22062},
    archivePrefix={arXiv}
}