About

Abstract

Using modern scientific data and metadata ontologies, existing and newly acquired data from the natural and biomedical sciences will be made accessible in accordance with the FAIR principles and utilised to create models by implementing machine learning (ML) and quantum computing (QC) methods. The project objectives include adapting appropriate ontologies for the data, providing open access to (meta)data, preparing a distributed system for large-scale data exchange, and creating, implementing, and testing ML and QC models to ensure compliance with FAIR principles. The established centre of excellence enables the reduction of gaps in the dissemination of research in the natural and biomedical sciences, opens up new opportunities for more effective interdisciplinary collaboration, and ensures the country's technological progress and international recognition, extending beyond research in the aforementioned areas.

The Idea of the Project

The idea behind the project is to achieve an international breakthrough in interdisciplinary modelling research by combining existing and newly generated data sets from VU scientists and making them openly accessible in accordance with FAIR principles. The project covers scientific areas in which VU scientists have achieved exceptional results. System models will be created using classical and quantum machine learning, signal analysis, and statistical and analytical methods, which are actively developed not only at the Faculty of Mathematics and Informatics (MIF) but also in other departments of the university. Valuable data obtained in signal analysis, life sciences, materials science, and other scientific fields will form the basis for new data processing methods and models. New models will be created using ML and QC methods to analyse one-dimensional or two-dimensional signals and data sets from chemoinformatics, structural biology, genomics, and ecology.

The platform developed during the project will enable researchers to combine their expertise in complementary fields of the natural sciences and so stimulate interaction between them. The platform will combine existing scientific data resources (e.g. MIDAS, COD, Bioinformatics Department Web Services), global open resources (such as PDB or NCBI open databases), data-generating devices from the IT Open Access Centre at MIF, and new scientific equipment purchased during the project. The expertise and resources of the centre's scientists will be used to create the models.

The data will be supplemented with all the metadata necessary for the subject areas, adapting it both to the uses planned during the project and to unforeseen future cases of reuse. For example, the Life Sciences Center (LSC) has accumulated and continues to accumulate large, valuable data sets: approximately 1,000 high-resolution crystal diffraction data sets; raw images from the electron microscopy platform, which can generate approximately 1 TB per day; and deep sequencing data, which reaches about 1 TB per month. In the fields of ecology and zoology, data on species distribution, environmental pollution, and the interactions of wild animals with humans in Lithuania have been accumulated over several decades. Chemistry and materials science experiments continuously generate large amounts of data on structural analysis, electron microscopy, and (electro)catalytic processes. The volume and diversity of this data will enable the creation of new models and the improvement of existing ones.
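To give a feel for the storage these rates imply, the following is a rough back-of-envelope sketch rather than a capacity plan. The two rates are the approximate figures quoted above; the function name `yearly_volume_tb` and the `em_duty_cycle` parameter (the fraction of days on which the microscope actually produces data at its peak rate) are assumptions introduced only for illustration.

```python
# Back-of-envelope estimate of yearly raw data volume.
# The two rates are the approximate figures from the text; everything else
# (names, duty-cycle parameter) is a hypothetical illustration.
EM_TB_PER_DAY = 1.0      # electron microscopy: approx. 1 TB of raw images per day
SEQ_TB_PER_MONTH = 1.0   # deep sequencing: approx. 1 TB per month


def yearly_volume_tb(em_duty_cycle: float = 1.0) -> float:
    """Return the estimated raw data volume in TB per year.

    em_duty_cycle is the assumed fraction of days on which the microscope
    generates data at the quoted peak rate (1.0 = every day of the year).
    """
    if not 0.0 <= em_duty_cycle <= 1.0:
        raise ValueError("em_duty_cycle must be between 0 and 1")
    em_tb = EM_TB_PER_DAY * 365 * em_duty_cycle
    seq_tb = SEQ_TB_PER_MONTH * 12
    return em_tb + seq_tb


if __name__ == "__main__":
    # Compare a few assumed utilisation levels of the microscope.
    for duty in (1.0, 0.5, 0.25):
        print(f"duty cycle {duty:.2f}: about {yearly_volume_tb(duty):.1f} TB/year")
```

Even at a modest assumed duty cycle, these two sources alone would add on the order of a hundred terabytes per year, before the crystallography, ecology, and materials science data are counted.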

As modern data and high-performance computing (HPC) centers develop, quantum computing has become a key technological requirement in Europe and ranks among the top three technologies worldwide for the 500 largest HPC data centers. The agreement signed by MIF with the Poznan Supercomputing and Networking Center (PSNC) in the field of quantum technologies allows MIF not only to draw on the experience of PSNC scientists in quantum computing, but also to gain exclusive access to quantum computers, which are still very expensive. Furthermore, the HPC resources available at MIF are intensively used in scientific research on speech signals, image analysis, and blockchains, enabling modelling and calculations not only in a conventional HPC environment but also in quantum simulations. Given the center's priority to develop ML methods, special attention will be paid to the analysis and creation of new quantum ML methods that take the specificity of the data into account. Thus, the project plans to develop both the classical and the new quantum methods and tools necessary for ML. The developed methods will be able to analyse large amounts of complex scientific data and extract new patterns, trends, and relationships, revealing hidden correlations that may be difficult to detect using traditional statistical or ML methods.

In addition to opening up big data to science and society, developing methods for processing it, and developing classical ML algorithms, the center will initiate completely new research topics, including quantum computing, in Lithuania. It will investigate ways to ensure the interaction of HPC and quantum computing or simulations with ML, addressing scientific uncertainties in speech signals, images (including medical images), chemoinformatics, structural biology, etc., using both classical ML methods and quantum research. All this will provide new opportunities for interdisciplinary collaboration and research development, making more effective use of existing expertise and valuable data obtained in various fields, thus creating a long-term strategy to become a leading center of excellence in the region.

The Goal of the Project

To establish a globally recognized center of excellence dedicated to the accumulation and disclosure of life and biomedical science data, and the development and validation of classical and quantum machine learning methods.

Objectives

  1. Create infrastructure that enables the efficient and secure collection and sharing of large-scale data; form teams of researchers from various fields who collaborate systematically.

  2. Prepare new data sets in the fields of life, physical, and biomedical sciences; prepare descriptions of metadata ontologies and make them available to science and society in accordance with FAIR principles.

  3. Develop effective data processing methods and analysis tools, considering the volume and specificity of the data collected.

  4. Develop machine learning-based classical and quantum methods that allow the prediction, classification, clustering, or other description of the behaviour of one-dimensional or multidimensional signals, and other complex phenomena or systems.
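As an illustration of what making data available "in accordance with FAIR principles" (objective 2) can mean in practice, the sketch below models a minimal metadata record and reports which fields are still empty. It is a hypothetical example only: the class `DatasetRecord`, its field names, and the helper `missing_fields` are assumptions made for illustration and do not describe the project's actual metadata schema or ontologies.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record; field names are illustrative."""
    identifier: str = ""      # persistent identifier such as a DOI (Findable)
    title: str = ""           # human-readable description (Findable)
    access_url: str = ""      # where the data can be retrieved (Accessible)
    data_format: str = ""     # open, documented file format (Interoperable)
    ontology_terms: list[str] = field(default_factory=list)  # shared vocabulary (Interoperable)
    license: str = ""         # conditions of reuse (Reusable)
    provenance: str = ""      # how the data was produced (Reusable)


def missing_fields(record: DatasetRecord) -> list[str]:
    """Return the names of metadata fields that are still empty."""
    return [name for name, value in vars(record).items() if not value]


if __name__ == "__main__":
    # Placeholder values, not a real data set.
    record = DatasetRecord(
        identifier="doi:10.0000/example",
        title="Example diffraction data set",
    )
    print("Fields still to be filled in:", missing_fields(record))
```

A check of this kind only confirms that a record is complete; whether the values follow an agreed ontology would need to be validated separately.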

Impact of the Project

The creation of a data centre for natural and biomedical sciences will not only contribute to the direct improvement of the quality of science and studies by creating a universal and open platform, but will also make various data available to the general public. By integrating the expertise of scientists from different fields, the center will promote interdisciplinary cooperation, leading to innovative solutions and breakthroughs not only in science but also in the digital space, where data is the main tool for ensuring the creation of innovative scientific solutions and the development of services. Unfortunately, in such a competitive environment, data is often closed, which limits scientific progress. Open, effective, and secure access to data enables specialists in various fields, researchers, doctoral students, and students to enhance the study process, contribute to higher-quality scientific work, attract new talent, and increase the added value and quality of R&D activities at both national and international levels.

The creation of the necessary technical and software infrastructure, along with data analysis methods, will standardise the various processes of data collection, storage, and disclosure across different fields, ensuring their consistency and reliability. The application of advanced data science models and methods, such as machine learning, simulation, signal analysis, and quantum computing, will enable new and valuable insights in various scientific fields, from understanding biological processes to predicting natural phenomena and creating new materials. Open data and models based on it will enable the public and private sectors to formulate more informed policies, make more optimal and reasoned decisions, increase transparency, and strengthen public confidence in these decisions. This, in turn, can contribute to economic development, public health, and climate policy.

This project will directly contribute to Vilnius University's strategy, "High-level international science," as the integration and reuse of data will lead to the implementation of new large-scale projects. It will also contribute to the guidelines of the VU Open Science Policy, whose main principle states that research results and data should be "as open as possible, as closed as necessary."

Team Leads

Jurgita Markevičiūtė (MIF) has 10 years of experience in fundamental and applied mathematics and statistics. She has published more than 18 papers in Clarivate Analytics Web of Science (CA WOS) journals with an impact factor (IF), in which methods for data analysis are developed and historical economic data are analysed. During the DOTSUT-31 project, she and her colleagues prepared and published an electronic research database containing 200 data sets. Currently, as part of the ongoing international project Baltic100 (EEA-RESEARCH-174) and a state-commissioned study (No. S-VIS-23-15), she and her colleagues are preparing publicly available data sets.

Povilas Treigys (MIF) has been leading the Image and Signal Analysis Research Group since 2017. He has published more than 70 papers, at least 25 of them in CA WOS journals with an IF, of which 16 are ranked no lower than Q2. He has participated in 9 project activities and led a postdoctoral project funded by LMT. He supervises 4 doctoral students (2 have successfully defended their doctoral degrees) and also leads the VU MIF ITAPC center, distributing HPC resources to the academic and business communities.

Saulius Gražulis (LSC) has published 60 scientific publications. His scientific interests include crystallographic and scientific databases, data science methods, and ontologies. He supervises two doctoral students, and four doctoral students have successfully defended their doctoral degrees. He created the Crystallography Open Database (COD), an open-access database of crystal structures of organic, inorganic, and metal-organic compounds and minerals, excluding biopolymers.

Arvydas Laurinavičius (MF) works in the fields of pathology informatics, medical semantics standards, digital image analysis, and statistical disease modelling. He participated in the establishment of the European Society for Digital Integrative Pathology and leads the implementation of the National Biobank project at VUL Santaros Clinics. His work is related to the development of the PathIS pathology information system. He has written 125 scientific articles and is the co-author of two patent applications.

Linas Vilčiauskas (ChGF) has more than 15 years of work experience in the fields of molecular modeling, computational chemistry, and materials informatics. He has published over 30 articles in high-level journals such as Nature Chemistry, Journal of the American Chemical Society, Chemistry of Materials, etc., 1 book chapter (Royal Society of Chemistry) and has 1 patent application with the US Patent and Trademark Office. He was the leader of the LMT Brain Gain Program project "Understanding and Applications of Aqueous Na-ion Technologies for Energy Storage" and the principal investigator of several other projects.

Remigijus Paulavičius (MIF) has more than 15 years of experience in the development of mathematical optimisation algorithms and open access tools based on them, and more than 5 years of experience in high-performance computing and distributed data technologies. He has recently been working in the field of quantum computing. He has published over 40 articles in highly rated international journals and is the co-author of two monographs. He has been or is the principal investigator of more than 10 projects (including at the prestigious Imperial College London), three of which he has successfully led.