The ONTOLISST Project – One Year Update

The ONTOLISST project, launched at the end of 2024 under the coordination of RDC CSS within an international consortium and funded by the European Union, has now completed its first year. The project aims to improve the discoverability and interoperability of social science research data through multilingual, cost-efficient digital tools.

During this first year, survey metadata were collected from ten major European research infrastructures, and interviews with professionals responsible for metadata workflows provided insights into institutional practices, thematic classification systems, and the challenges archives face when adapting to new requirements. These contributions have guided the development of our methodology.

The metadata received were processed in several phases to construct a coherent two-level thesaurus. First, XML datasets were consolidated into a unified structure, extracting question texts, variable descriptions, and conceptual categories. A comprehensive data cleaning strategy was applied, including the removal of non-thematic content, generic phrases, duplicates, and incomplete texts. Thematic categorization was then carried out using a combination of unsupervised topic modeling and anchored clustering. BERTopic was applied to identify high-level clusters, which were refined into ten overarching thematic categories inspired by established conceptual frameworks. Concepts from multiple archives were manually mapped to these categories, creating crosswalks and ensuring consistency. All questions were labeled according to the current top-level version of the LiSST framework, supported by iterative semi-supervised clustering and expert validation.
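To make the consolidation and cleaning phases more concrete, the following is a minimal Python sketch and not the project's actual pipeline. The XML layout (`variable`, `question`, `concept` elements), the generic-phrase patterns, and the three-word minimum are all assumptions chosen for illustration; real archive exports will differ.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout; element and attribute names are assumptions.
SAMPLE_XML = """
<codebook>
  <variable name="v1"><question>How much do you trust the national parliament?</question><concept>Politics</concept></variable>
  <variable name="v2"><question>how much do you trust   the national parliament? </question><concept>Politics</concept></variable>
  <variable name="v3"><question>Interviewer: show card 5</question><concept></concept></variable>
  <variable name="v4"><question>Age</question><concept>Demography</concept></variable>
</codebook>
"""

# Illustrative patterns for non-thematic or generic content (assumed, not from the project).
GENERIC_PATTERNS = [
    re.compile(r"^interviewer\b", re.IGNORECASE),
    re.compile(r"\bshow card\b", re.IGNORECASE),
]
MIN_WORDS = 3  # assumed threshold below which a text counts as incomplete


def extract_records(xml_text):
    """Consolidate variables into a flat list of dicts with a unified structure."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for var in root.iter("variable"):
        records.append({
            "id": var.get("name"),
            "question": (var.findtext("question") or "").strip(),
            "concept": (var.findtext("concept") or "").strip(),
        })
    return records


def normalize(text):
    """Collapse whitespace and lowercase so near-identical texts compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_records(records):
    """Drop incomplete texts, generic phrases, and duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for rec in records:
        key = normalize(rec["question"])
        if len(key.split()) < MIN_WORDS:
            continue  # incomplete or too short to carry a theme
        if any(p.search(rec["question"]) for p in GENERIC_PATTERNS):
            continue  # interviewer instructions and similar non-thematic content
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(rec)
    return kept


cleaned = clean_records(extract_records(SAMPLE_XML))
```

In this sketch, cleaned texts would then be passed on to topic modeling. Doing deduplication after normalization but before clustering keeps repeated question wordings from inflating the size of any one cluster.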

Progress will be shared at the EDDI 2025 Conference in a dedicated ONTOLISST session, which will include two talks and a roundtable discussion with selected interviewees. The session will provide an opportunity to present results and engage in dialogue on future directions for metadata standardization and thematic annotation.

The next phase focuses on defining approximately one hundred minor categories representing lower-level concepts. This process combines clustering techniques with an analysis of keywords from leading social science journals to align categories with current research trends. Proportions are adjusted to eliminate sample bias and ensure balanced representation. Validation draws on mixed domain expertise from social sciences, linguistics, and archival practices. These steps complete the LiSST thesaurus and support the development of a machine-assisted annotation tool and a gold standard corpus.
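One possible way to adjust proportions so that no single source dominates is sketched below. This is an assumption about how such balancing could work, not a description of the project's method: each source (for example, a journal) is given the same total weight, and keyword weights are summed across sources. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def balanced_keyword_weights(keywords_by_source):
    """Give every source a total weight of 1.0, then sum the shares per keyword.

    A source that contributes many keywords therefore cannot outweigh
    a source that contributes few.
    """
    totals = defaultdict(float)
    for source, keywords in keywords_by_source.items():
        counts = Counter(keywords)
        n = sum(counts.values())
        if n == 0:
            continue  # skip sources with no keywords
        for keyword, count in counts.items():
            totals[keyword] += count / n
    return dict(totals)


# Hypothetical sample: journal_a is much larger than journal_b.
sample = {
    "journal_a": ["migration", "migration", "migration", "trust"],
    "journal_b": ["trust"],
}
weights = balanced_keyword_weights(sample)
```

With raw counts, "migration" (3) would outrank "trust" (2) in this sample. After balancing, "trust" ranks higher because it appears in both sources.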

More information about the ONTOLISST project is available on the project website: https://oscars-project.eu/projects/ontolisst-thematic-ontologies-social-science-research-data