Alessandra Bagnato, Fadwa REKIK

Abstract

This article presents the systematic adaptation of the FOODITY Data Lake, a production-ready open-access platform for food-related datasets, into a secure, project-bound Data Lake Management System (DLMS) for the SOSFood project, reaching TRL 7 within Deliverable D4.8. Key adaptations include a dual-portal architecture separating data providers from reviewers and JWT-based authentication enforcing partner-only access. The platform was validated following ISO/IEC/IEEE 29119 standards by nine testers across three European use cases (Galicia, Greece, Lithuania), representing diverse professional profiles. All mandatory requirements were met. Usability scores averaged 4.38/5 for visual appeal and 4.50/5 for ease of understanding. This work demonstrates a replicable model for repurposing open data lake infrastructure under differing access and governance requirements.

Introduction

The SOSFood project advances sustainability in European food systems through data-driven insights and multi-stakeholder collaboration across four partner regions. Although a data lake was not explicitly required within the original description of work, the consortium identified the existing FOODITY Data Lake as an opportunity to accelerate data exploration. FOODITY, originally designed for open citizen access to food-related datasets, offered interactive dashboards and intelligent search capabilities.

The project spans four years, with the first deliverable of WP4 Tools Testing (D4.8) expected by the end of the second year. This timeline created a natural checkpoint to validate the data lake approach early and establish which analyses the project’s datasets could support before committing to more costly bespoke development.

This article details the technical adaptations, validation methodology, and test results of this transformation. Section 2 describes the architectural evolution from FOODITY to SOSFood across three dimensions: portal architecture, access control, and operational isolation. Section 3 presents the technology stack and security implementation. Section 4 describes the validation methodology aligned with ISO/IEC/IEEE 29119 standards, and Section 5 reports the results obtained from structured testing with nine end users across three European regions. We conclude with a discussion of limitations and future work.

From FOODITY to SOSFood: Architectural Evolution

The SOSFood Data Lake Management System (DLMS) represents a controlled-access evolution of the FOODITY platform. Where FOODITY prioritised open access for citizens and researchers, the DLMS reconfigures the system for restricted dissemination, enforcing authenticated access for project partners and reviewers while preserving the proven ingestion, indexing, and visualisation pipeline. This section describes the key adaptations across three dimensions: portal architecture, access control, and operational isolation.

Dual-Portal Architecture

The DLMS employs a dual-portal architecture serving distinct user categories. The Data Provider Portal enables case study partners to upload and manage their datasets with secure, credential-protected access. The Reviewer Portal, adapted from FOODITY’s original Citizen Portal, provides project partners and reviewers with protected access to explore datasets and interactive dashboards. Both portals now operate under mandatory authentication, eliminating the anonymous browsing path that was central to FOODITY’s citizen-facing design.

Access Control and Identity Management

The front-end evolution includes SOSFood-specific navigation, legal and policy content, and user guidance emphasising partner-only access. Login is mandatory for all portals, and all sensitive routes are gated to authenticated users. The back-end configuration is separated, with SOSFood using distinct endpoints, database identifiers, and service credentials, ensuring operational isolation while inheriting proven mechanisms such as the Personal Data Checker, file validation, and dashboard integration.
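The gating of sensitive routes can be illustrated with a minimal sketch of token-based access checks. It assumes an HS256-signed JWT with an `exp` claim and a small set of public routes; the route names, the helper functions, and the choice of HS256 are illustrative assumptions, not the platform's actual configuration, and a production system would normally rely on a maintained JWT library rather than hand-written verification.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical routes that stay reachable without a token (assumption).
PUBLIC_ROUTES = {"/login", "/legal"}


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the compact JWT form strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()


def issue_token(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        _b64url_encode(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
        for obj in (header, claims)
    ]
    signing_input = ".".join(parts)
    return signing_input + "." + _b64url_encode(_sign(signing_input, secret))


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature and expiry are valid, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:
        return None  # malformed structure, base64, or JSON
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None  # reject unexpected algorithms such as "none"
    if not isinstance(claims, dict):
        return None
    expected = _sign(header_b64 + "." + payload_b64, secret)
    if not hmac.compare_digest(signature, expected):
        return None
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or current >= exp:
        return None  # missing or expired expiry claim
    return claims


def is_request_allowed(path: str, token: Optional[str], secret: bytes,
                       now: Optional[float] = None) -> bool:
    """Public routes pass; every other route needs a valid token."""
    if path in PUBLIC_ROUTES:
        return True
    return token is not None and verify_token(token, secret, now) is not None
```

In this sketch, a request for a dataset route without a token is refused, while the login page remains reachable so that partners can authenticate in the first place.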

Architecture & Technology Stack

Data providers interact through the dedicated Data Provider Portal, which supports dataset submission with detailed metadata (descriptions, keywords, license information), dataset management and update, dashboard creation for data visualisation, and profile management including credential updates. Reviewers access the platform through the Reviewer Portal, which provides dataset browsing with associated metadata, data visualisation through configurable dashboards, and detailed analysis via integrated Kibana dashboards.

Technology Stack

The frontend applications are built with Angular 15, selected for its mature component model and support for dynamic single-page applications. The backend employs Spring Boot 3.1 (Spring Framework 6) to expose RESTful APIs for dataset lifecycle management, user authentication, and file processing. Data storage and retrieval rely on the ELK stack (Elasticsearch, Logstash, Kibana). Elasticsearch was chosen because it enables fast search and easy filtering across large datasets, which directly matches how the platform is used. Two distributed Elasticsearch instances provide fault tolerance and horizontal read scalability. Kibana powers interactive dashboards with role-based access controls. PostgreSQL manages relational metadata (users, dataset registrations, audit logs), with Liquibase ensuring repeatable schema migrations. An Apache HTTP Server acts as a reverse proxy, routing requests to the appropriate frontend application based on the requested domain name. The platform now supports CSV datasets up to 1 GB per file, significantly increasing its data handling capacity.
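To make the search-and-filter usage pattern concrete, the following sketch builds an Elasticsearch request body that combines full-text search with exact-match filters, using the standard Query DSL clauses `bool`, `multi_match`, `term`, and `match_all`. The field names (`title`, `description`, `keywords`, `license`) and the function itself are hypothetical, since the actual index mapping is not described here; the code only constructs the JSON body and does not contact a cluster.

```python
import json
from typing import Iterable, Optional


def build_dataset_query(text: str,
                        keywords: Iterable[str] = (),
                        license_id: Optional[str] = None,
                        size: int = 10) -> dict:
    """Build a search body: free-text relevance plus non-scoring filters."""
    # Filters narrow the result set without affecting relevance scores.
    # Assumes `keywords` and `license` are mapped as keyword fields.
    filters = [{"term": {"keywords": kw}} for kw in keywords]
    if license_id is not None:
        filters.append({"term": {"license": license_id}})

    if text:
        must = [{"multi_match": {"query": text,
                                 "fields": ["title", "description"]}}]
    else:
        # No search text: match everything and rely on the filters alone.
        must = [{"match_all": {}}]

    return {"size": size, "query": {"bool": {"must": must, "filter": filters}}}


if __name__ == "__main__":
    body = build_dataset_query("food waste", keywords=["retail"])
    print(json.dumps(body, indent=2))
```

Placing the exact-match conditions in the `filter` clause rather than in `must` keeps them out of relevance scoring, which suits browsing by keyword or license.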

Security Implementation

Security measures align with OWASP recommendations:

  • File validation: CSV-only whitelist, filename sanitisation, 1 GB size limit, and ClamAV antivirus scanning
  • Authentication: JWT-based token authentication ensuring case studies maintain exclusive control over their datasets
  • Personal data protection: Automated Personal Data Checker using pattern matching to detect sensitive information, complemented by a mandatory questionnaire confirming PII exclusion
  • Data sovereignty: On-premises hosting in France keeps all data under European jurisdiction and supports GDPR compliance
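The file validation checks listed above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names and error messages are hypothetical, the 1 GB limit is interpreted as 1024³ bytes by assumption, and the antivirus step is omitted because it requires a running ClamAV service.

```python
import os
import re
from typing import List, Tuple

MAX_SIZE_BYTES = 1024 ** 3        # "1 GB" read as 1 GiB (assumption)
ALLOWED_EXTENSIONS = {".csv"}     # CSV-only whitelist


def sanitise_filename(name: str) -> str:
    """Strip directory components and replace unexpected characters."""
    # Keep only the last path component, whichever separator was used.
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Allow a conservative character set; everything else becomes "_".
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    # Avoid hidden files and names made only of dots.
    return base.lstrip(".")


def validate_upload(filename: str, size_bytes: int) -> Tuple[str, List[str]]:
    """Return the sanitised name and a list of validation errors."""
    safe_name = sanitise_filename(filename)
    errors: List[str] = []
    if not safe_name:
        errors.append("filename is empty after sanitisation")
    extension = os.path.splitext(safe_name)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append("only CSV files are accepted")
    if size_bytes <= 0:
        errors.append("file is empty")
    elif size_bytes > MAX_SIZE_BYTES:
        errors.append("file exceeds the 1 GB limit")
    return safe_name, errors
```

Returning every failed check, rather than stopping at the first one, lets the interface show a data provider all problems with an upload at once.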

Privacy by Design

Privacy is embedded throughout the architecture through mandatory data anonymisation, explicit consent management via DataU integration, strict access controls between user roles, and proactive privacy risk assessments. An ethics presentation outlining principles of data ethics and guidelines for ethical data practices was shared with all project partners.

Methodology: Testing the existing Data Lake Features

Methodology

Testing activities were structured around ISO/IEC/IEEE 29119 standards and mapped to SOSFood requirements. Fifteen requirements were defined across four categories: core functionality (7), user experience (3), security (3), and compliance (2). Ten test scenarios were designed and executed by nine testers across three European use cases (Galicia, Greece, and Lithuania), representing diverse professional profiles including IT managers, financial analysts, project managers, and CFOs. We acknowledge that nine testers is a limited sample; however, it covers all three case studies and the primary user roles. We present below the five features most representative of the platform’s capabilities.

Key Features Tested

Feature 1, Data Provider Registration and Authentication: The DLMS provides a secure registration and authentication mechanism enabling project pilots to create accounts, receive email confirmation, and access the platform through credential-protected login.

Feature 2, Dataset Creation and Metadata Management: The DLMS offers a comprehensive module for creating new datasets accompanied by detailed metadata (descriptions, keywords, license information), with support for update and modification to maintain accurate and current dataset documentation.

Feature 3, File Upload and Download Management: The platform facilitates efficient upload and download operations through an intuitive interface, including file deletion capabilities within datasets.

Feature 4, Personal Data Checker (PDC): An automated GDPR compliance function that scans uploaded files to detect personal data (such as columns containing names or identifiable information) and prevents the upload when such data is identified, displaying a warning message.

Feature 5, Usability and Visual Design: The DLMS interface supports inclusive usability across major web browsers (Chrome, Firefox, and Edge), encompassing user-friendly navigation, intuitive data access patterns, and a clean interface.
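The behaviour described for Feature 4 can be illustrated with a small pattern-matching sketch. The actual detection rules of the Personal Data Checker are not detailed here, so the header vocabulary, the email pattern, and the function names below are assumptions chosen only to show the principle: inspect column headers and cell values, and block the upload when anything suspicious is found.

```python
import csv
import io
import re
from typing import List

# Hypothetical header tokens that suggest personal data (assumption).
PERSONAL_HEADER_TOKENS = {
    "name", "surname", "firstname", "lastname",
    "email", "phone", "address", "birthdate",
}
# Simple email-like pattern; real checkers would cover more value types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_personal_data(csv_text: str, max_rows: int = 1000) -> List[str]:
    """Return human-readable warnings for suspicious headers and values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    findings: List[str] = []

    # Header check: split "farmer_name" or "Full Name" into tokens.
    for column in header:
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        if tokens & PERSONAL_HEADER_TOKENS:
            findings.append(f"column '{column}' looks like personal data")

    # Value check: scan at most max_rows data rows for email-like cells.
    for row_number, row in enumerate(reader, start=2):
        if row_number - 1 > max_rows:
            break
        for column, cell in zip(header, row):
            if EMAIL_RE.search(cell):
                findings.append(
                    f"row {row_number}, column '{column}': email-like value"
                )
    return findings


def can_upload(csv_text: str) -> bool:
    """The upload is allowed only when no warning was raised."""
    return not find_personal_data(csv_text)
```

Checking values as well as headers matters because a neutrally named column can still contain identifiable information; the warnings returned by `find_personal_data` correspond to the message shown to the data provider when the upload is blocked.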

Results

All nine testers successfully completed the core functional scenarios (registration, authentication, dataset creation, update, deletion, and file upload/download) with a 100% pass rate and no blocking defects. The Personal Data Checker was validated by eight of the nine testers; the remaining tester did not have a dataset containing personal data columns and therefore could not trigger the detection mechanism. Only one minor issue was observed: one tester reported confusion with file naming conventions during upload, which was resolved in a subsequent patch by placing guidance text directly next to the input field.

Usability scores were collected on a 5-point Likert scale. Visual appeal averaged 4.38 and ease of understanding averaged 4.50. Testers consistently highlighted the simplicity and intuitive flow of the platform, a result we consider particularly meaningful given the diversity of their professional backgrounds (none were UX specialists).

Impact

By reusing the FOODITY codebase, the SOSFood consortium avoided an estimated 70–80% of the development effort that a greenfield implementation would have required, based on a comparison of the FOODITY feature set retained versus the SOSFood-specific features developed. The inherited security pipeline (ClamAV scanning, CSV whitelisting, JWT authentication) and the Personal Data Checker, combined with mandatory PII exclusion questionnaires, ensured that all three case studies met GDPR requirements.

The dual-portal architecture proved effective for multi-stakeholder collaboration: case study partners uploaded and managed their datasets independently, while reviewers explored shared resources through a protected, read-only interface. Validation conducted between February and June 2024 confirmed that all tested features met their acceptance criteria, and usability metrics exceeded target thresholds across all tester profiles.

Conclusions and Future Work

This article presented the adaptation of the FOODITY Data Lake into a secure, project-specific DLMS for the SOSFood project. The main contribution consists of a dual-portal architecture separating data provider and reviewer workflows under mandatory authentication. Validation conducted in accordance with ISO/IEC/IEEE 29119 confirmed that all 15 requirements were satisfied, with usability scores of 4.38/5 for visual appeal and 4.50/5 for ease of understanding across nine testers. User experience feedback collected during testing has been incorporated into platform refinements, ensuring the Data Lake meets the practical needs of case study partners and reviewers.

In alignment with real-world data requirements, the platform is designed to support CSV datasets of up to 1 GB per file, reflecting the typical scale and complexity of datasets handled by project partners. While this capacity enables more representative use-case validation, very large-scale or streaming scenarios still require further evolution of the ingestion and storage architecture. Future work will focus on supporting additional data formats, optimising large-scale data processing, and integrating advanced analytics capabilities.