Software preservation, open science, responsible AI: a conversation with Roberto Di Cosmo

Date: changed on 23/05/2025
Following his visit to Chile, Roberto Di Cosmo reflects on the origins, mission, and achievements of Software Heritage, the initiative launched by Inria in collaboration with UNESCO, nearly a decade after its creation. In this conversation, he also shares his perspective on how the initiative could influence the development of transparent, traceable, responsible, and ethical AI.
© Inria Chile / Foto JM Rojas

 

From April 7 to 10, Roberto Di Cosmo, Director of Software Heritage and researcher at Inria, visited Chile as part of the newly established Franco-Chilean Binational Center on Artificial Intelligence. The program, organized by Inria Chile, included the seminar "Open Source for a Responsible Artificial Intelligence in Chile", hosted by the Economic Commission for Latin America and the Caribbean (ECLAC), where Di Cosmo gave a keynote and joined a panel discussion. It also included a lecture at the Faculty of Physical and Mathematical Sciences of the University of Chile and meetings at the National Agency for Research and Development (ANID).

We took the opportunity to talk with the open source and software expert in a dialogue that covered the history of Software Heritage, its major milestones, and his vision for the future of the world’s largest source code archive.

Could you tell us what Software Heritage is? How did it start, what is its mission, and what are some of its main achievements nearly 10 years after its launch?

Software Heritage was launched in 2016 as a collaboration led by Inria with the support of UNESCO to address a key challenge: preserving humanity’s source code and making it accessible to everyone, today and in the future. The idea stems from the same motivation that drives us to preserve books, films, or manuscripts: software is a critical component of our cultural, scientific, and industrial heritage and underpins much of today's technological innovation.

Our mission is to collect, organize, preserve, and share all publicly available source code in a single, universal archive.


Over nearly a decade, we’ve made significant progress: Software Heritage now contains over 23 billion unique source files from around 260 million archived projects, resulting in a deduplicated Merkle DAG (graph) with tens of billions of nodes and hundreds of billions of edges. This makes it the largest source code archive ever built, and we continue to expand it daily so that future generations can access it.

Roberto Di Cosmo, Director of Software Heritage, Inria

How can Software Heritage contribute to sustainable development and the digital transformation of countries, particularly those in the Global South?

Software Heritage fulfills three essential roles that make it a strategic ally for sustainable digital transformation, especially in the Global South.

First, it serves as a universal catalog of source code, regardless of where the code is hosted online. Whether a project is on GitHub, GitLab, or any other platform, Software Heritage (SWH) indexes it and makes it discoverable through a unified interface.

Second, SWH is also an archive that safeguards this content against deletions or alterations. Platforms often shut down for economic or strategic reasons, leading to the loss of countless software projects. With SWH, that code is preserved and remains available for future use and reference.

Lastly, the mission of SWH goes beyond archiving: it is also about enabling observation and analysis. It is the first major attempt to build a "telescope" for the software development galaxy. Its deduplicated graph linking files, commits, contributions, and metadata forms a true map of the "stars" of the software ecosystem. This opens up new possibilities for research and continuous improvement, enabling governments, universities, and developer communities anywhere in the world to analyze the archive and share insights collaboratively. For the Global South, this means verified and guaranteed access to critical software resources, which helps prevent duplication of effort, reduce costs, and promote technological sovereignty.
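To make the idea of a deduplicated graph more concrete, here is a minimal sketch of a toy Merkle DAG in Python. It is not the Software Heritage data model or API: the `ToyStore` class, its method names, and the use of SHA-256 are assumptions chosen for illustration only. What it shows is the general principle: every node is identified by a hash of its own content, so identical files and identical directories collapse into a single node no matter how many projects contain them.

```python
import hashlib
from typing import Dict


class ToyStore:
    """Hypothetical in-memory Merkle DAG, for illustration only."""

    def __init__(self) -> None:
        # Maps a node identifier (hex digest) to the node's payload.
        self.nodes: Dict[str, bytes] = {}

    def add_file(self, data: bytes) -> str:
        # A file's identifier depends only on its bytes.
        node_id = hashlib.sha256(b"file\0" + data).hexdigest()
        self.nodes.setdefault(node_id, data)
        return node_id

    def add_dir(self, entries: Dict[str, str]) -> str:
        # A directory's identifier depends on the sorted (name, child id)
        # pairs, so it changes whenever anything beneath it changes.
        payload = "\n".join(
            f"{name}\t{child}" for name, child in sorted(entries.items())
        ).encode("utf-8")
        node_id = hashlib.sha256(b"dir\0" + payload).hexdigest()
        self.nodes.setdefault(node_id, payload)
        return node_id


store = ToyStore()
license_id = store.add_file(b"MIT License\n")

# Two projects that share a license file but differ in one source file.
root_a = store.add_dir(
    {"LICENSE": license_id, "main.py": store.add_file(b"print('a')\n")}
)
root_b = store.add_dir(
    {"LICENSE": license_id, "main.py": store.add_file(b"print('b')\n")}
)

# An exact copy of project A hosted elsewhere adds no new nodes.
fork_root = store.add_dir(
    {"main.py": store.add_file(b"print('a')\n"),
     "LICENSE": store.add_file(b"MIT License\n")}
)

print(fork_root == root_a)  # True: same content, same identifier
print(len(store.nodes))     # 5: three distinct files and two distinct directories
```

In this toy version, three projects are stored with only five nodes, and the shared license file exists once. The same principle, applied at the scale of the archive, is what keeps the graph deduplicated.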

What role do you see Software Heritage playing in the development of responsible artificial intelligence?

The development of AI models, especially large language models (LLMs), critically depends on the availability of appropriate training datasets. When training AI with large datasets of source code, it is essential to ensure traceability, and to respect copyright and licensing terms. Software Heritage provides not only the code, but also essential metadata such as detailed provenance, thanks to our graph structure and intrinsic persistent identifiers (SWHIDs), which ensure reproducibility.

This means that in the AI domain, we can offer full transparency and traceability of the datasets used for training: researchers, companies, and developers can track exactly where every code snippet came from and how it has evolved over time. This strengthens the foundation for ethical and responsible AI development, and enables serious consideration of questions related to code legitimacy and originality.
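As an illustration of what "intrinsic" means here, the sketch below computes the SWHID of a single file's content in Python. It assumes the published SWHID scheme for content objects, in which the identifier is `swh:1:cnt:` followed by the SHA-1 that Git would compute for the same bytes as a blob. The function name `swhid_for_content` is hypothetical, and the sketch does not cover directories, revisions, releases, or snapshots.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return the content SWHID for raw file bytes (hypothetical helper)."""
    # Git hashes a blob as: "blob <length in bytes>\0" followed by the bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


# The identifier is derived from the bytes alone, so anyone holding the
# same file can recompute it without asking a registry.
print(swhid_for_content(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the identifier is computed from the content itself, a dataset that lists SWHIDs can be checked independently: recomputing the hash of a file either matches the recorded identifier or reveals that the file has changed.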

Can you tell us more about the principles Software Heritage released in 2023 for LLMs and their importance? Also, what is the Code Commons initiative about?

In 2023, Software Heritage published a set of principles highlighting the importance of transparency, traceability, and license compliance when using its data to train large language models. These principles promote open collaboration while protecting the intellectual property of authors and projects. They stress the need for LLM developers to document and disclose their data sources, contribute to improving dataset quality, and ensure their usage aligns with the original authors’ intentions.

Code Commons is a two-year project funded by the French government, led by Inria in collaboration with partners including CEA and several Italian universities, with Software Heritage at its core.

The goal of Code Commons is to elevate the Software Heritage archive to a new level of coverage, scalability, and quality by incorporating richer metadata (tickets, discussions, pull requests) and building infrastructure to support its efficient use in AI training, especially on national supercomputers. This makes it possible for research teams, particularly in developing countries, to accurately select and extract the most relevant datasets for training their models. Moreover, Code Commons emphasizes sustainability, ethics, and digital sovereignty, promoting a collaborative approach where knowledge is treated as a shared public good.

Together, the 2023 LLM principles and the Code Commons initiative reinforce Software Heritage’s vision: to provide a robust, responsible, and open infrastructure that enables inclusive, sustainable technological innovation aligned with values of transparency and global collaboration.


The recent visit of Roberto Di Cosmo, Director of Software Heritage, to Chile has been highly significant for the development of responsible Artificial Intelligence in the country. Software Heritage provides essential tools to ensure transparency and traceability in AI development. Its universal source code archive enables researchers and developers in Chile to access training data while understanding its origin and evolution, an essential factor for building ethical and trustworthy models. Initiatives like Code Commons, which aim to improve the quality and accessibility of code, also open up new opportunities for Chile to take a leading role in the creation of sustainable AI aligned with the values of global collaboration. Within the framework of the Franco-Chilean Binational Center on AI, we will fully leverage these capabilities for the benefit of Chile.

Nayat Sánchez Pi, Director of Inria Chile and Director of the Franco-Chilean Binational Center on Artificial Intelligence