An open-source training framework to advance multimodal AI

EPFL researchers have developed 4M, a next-generation, open-source framework for training versatile and scalable multimodal foundation models that go beyond language.
A couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand this scene - 2025 EPFL - CC-BY-SA 4.0

Large Language Models such as OpenAI’s ChatGPT have already transformed the way many of us go about some of our daily tasks. These generative artificial intelligence chatbots are built on models with billions of parameters, trained on language: hundreds of terabytes of text ‘scraped’ from across the Internet.

Looking ahead, many believe the ‘engines’ that drive generative artificial intelligence will be multimodal models that are not only trained on text but can also process various other modalities of information, including images, video, sound, and modalities from other domains such as biological or atmospheric data.

Yet, until recently, training a single model to handle a wide range of modalities (inputs) and tasks (outputs) faced significant challenges. For example, such training often reduced performance compared to single-task models and typically required careful strategies to limit quality losses and maximize accuracy. In addition, training one network on modalities as different as language, images and video added further complexity, and the model often wrongly ignored essential information in certain modalities.

Multimodal Modeling

In a multi-year project undertaken with support from Apple in California, EPFL researchers from the Visual Intelligence and Learning Laboratory (VILAB) in the School of Computer and Communication Sciences (IC) have developed 4M, for Massively Masked Multimodal Modeling, one of the world’s most advanced single neural networks to handle a wide and varied range of tasks and modalities.

In their latest research paper on 4M, presented in December at NeurIPS 2024, the Annual Conference on Neural Information Processing Systems, the researchers describe how it expands the capabilities of existing models in multiple ways (see box below for more technical details).

“With 4M, we now have a rich model that can interpret more than just language. But why does this matter? One common criticism of LLMs is that their knowledge is not grounded because the training data is limited to only language,” explained Assistant Professor Amir Zamir, Head of VILAB.

“When we advance to multimodal modeling, we don’t have to limit ourselves to language. We bring in other modalities, including sensors. For example, we can communicate an orange through the word ‘orange,’ just like in language models, but also through a collection of pixels, meaning how the orange looks, or through the sense of touch, capturing how touching an orange feels. If you assemble various modalities, you have a more complete encapsulation of the physical reality that we are trying to model,” he continued.

Trying to model the physical reality by assembling various modalities

The image shows a couple of oranges seen through the lens of multiple modalities, with each slice showing a different way one might perceive and understand this scene.

The modalities from left to right represent surface normals (the color represents surface orientation), depth (distance to the camera, red=near, blue=far), RGB (the original image), segmentation (distinct objects and image regions), and edges (object or texture boundaries).
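
For readers who want to see how such visualizations can be produced, here is a minimal Python sketch. The color conventions (red = near, blue = far; color encodes surface orientation) come from the description above, but the exact formulas, function names and thresholds are assumptions made for illustration and are not taken from 4M itself.

```python
# Illustrative only: toy encodings of three modalities named in the caption.
# The colour conventions follow the caption; the formulas are assumptions.

def depth_to_rgb(depth, near, far):
    """Map a depth value to a colour: red = near, blue = far."""
    t = (depth - near) / (far - near)   # 0.0 at `near`, 1.0 at `far`
    t = min(1.0, max(0.0, t))           # clamp values outside the range
    return (round(255 * (1 - t)), 0, round(255 * t))

def normal_to_rgb(normal):
    """Map a unit surface normal (x, y, z each in [-1, 1]) to RGB."""
    return tuple(round(255 * (c + 1) / 2) for c in normal)

def horizontal_edges(row, threshold):
    """Flag positions where neighbouring pixel intensities differ sharply."""
    return [abs(b - a) > threshold for a, b in zip(row, row[1:])]

# Hypothetical values for a single pixel and a single row of grey levels.
near_colour = depth_to_rgb(1.0, near=1.0, far=5.0)
facing_camera = normal_to_rgb((0.0, 0.0, 1.0))
edges = horizontal_edges([10, 12, 200, 198], threshold=50)
```

Each function turns one kind of measurement into something that can be displayed or compared pixel by pixel, which is what allows the same scene to be shown as several aligned "slices".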

Towards an open-source, generic model for wide use

Despite these impressive advances, Zamir says the development of 4M has presented some intriguing challenges, including the model not developing a truly unified representation across the modalities, and he has his own theory as to why.

“We think that secretly, under the hood, the models cheat and create a little ensemble of independent models. One set of parameters solves one problem, another set of parameters solves another, and collectively, they appear to solve the overall problem. But they’re not truly unifying their knowledge in a way that enables a compact joint representation of the environment that would be a good portal to the world.”

The VILAB team is continuing to work on building more structure and unification into 4M, with the goal of developing an open-source, generic architecture that experts in other domains can adapt to their specific needs, such as climate modeling or biomedical research. The team is also working on other important aspects, such as further improving scalability and developing methods for specializing models to the contexts in which they are deployed.

“The whole point of open sourcing is that people can tailor the model for themselves with their own data and their own specifications. 4M is coming at the right moment in time, and we are especially enthusiastic about other domains adopting this line of modeling for their specific use cases. We are excited to see where that leads. But there are still a lot of challenges, and there is still a lot to do,” said Oguzhan Fatih Kar and Roman Bachmann, Doctoral Assistants in VILAB and co-authors of the paper.

Based on the team’s experience developing 4M and the intriguing problems that they continue to work on, Zamir believes there are some interesting questions around the future development of foundation models.

“As humans, we have five key senses, and on top of that, we efficiently learn language, which adds labels and structure to the knowledge that was already grounded in these other senses. It’s the opposite with the current AI – we have language models without sensory access to the world but that are trained using colossal data and compute resources. Our goal is to study the role of multimodality and efficiently develop a grounded world model that can be effectively utilized for downstream uses.”

4M expands the capabilities of existing models across several key axes

  • Modalities: 4M enables new capabilities like predicting tens of modalities from tens of others, cross-modal retrieval, controllable generation, and strong out-of-the-box performance. It has convincingly shown that a single model can solve tens of diverse tasks without any loss in performance compared to dedicated single-task models and the state of the art.
  • Diversity: 4M supports varied modalities and more structured data, such as human poses, SAM instances, and metadata for controllable generation.
  • Tokenization: 4M investigates discrete tokenization of diverse modalities such as global image embeddings, human poses, and semantics.
  • Scale: The public model has been scaled to 3 billion parameters and trained on over 500 billion tokens.
  • Co-Training: 4M demonstrates co-training on vision and language modeling simultaneously.
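
The idea behind the name, Massively Masked Multimodal Modeling, can be illustrated with a toy Python sketch: every modality is first turned into discrete tokens, and a training example is then formed by showing the model a random subset of those tokens and asking it to predict another random subset. The article does not describe 4M's actual tokenizers or sampling scheme, so the `quantize` and `sample_masked_example` helpers, the token values and the subset sizes below are all hypothetical.

```python
import random

def quantize(values, levels=4):
    """Turn floats in [0, 1] into discrete token ids in [0, levels - 1]."""
    return [min(levels - 1, int(v * levels)) for v in values]

def sample_masked_example(tokens_by_modality, n_input, n_target, rng):
    """Randomly split tokens into visible inputs and masked targets."""
    # Flatten to (modality, position, token) triples so that tokens from
    # any modality can land on either side of the split.
    flat = [(name, pos, tok)
            for name, toks in tokens_by_modality.items()
            for pos, tok in enumerate(toks)]
    picked = rng.sample(flat, n_input + n_target)  # drawn without replacement
    return picked[:n_input], picked[n_input:]

# A made-up scene with three modalities (assumed values, not real data).
rng = random.Random(0)
scene = {
    "rgb": quantize([0.1, 0.5, 0.9, 0.3]),
    "depth": quantize([0.2, 0.2, 0.8, 0.7]),
    "caption": [17, 4, 42],  # hypothetical word ids
}
inputs, targets = sample_masked_example(scene, n_input=4, n_target=3, rng=rng)
```

Because the split is random, the same scene can yield many different prediction problems, for example depth from RGB in one draw and caption tokens from depth in another. This is one plausible reading of how a single network can learn to predict "tens of modalities from tens of others".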