Publications - Erkan Karabulut

2026

E. Karabulut, D. Daza, P. Groth, M. C. Schut and V. Degeler. "Tabular Foundation Models Can Learn Association Rules." arXiv preprint arXiv:2602.14622 (2026).

Association Rule Mining (ARM) is a fundamental task for knowledge discovery in tabular data and is widely used in high-stakes decision-making. Classical ARM methods rely on frequent itemset mining, leading to rule explosion and poor scalability, while recent neural approaches mitigate these issues but suffer from degraded performance in low-data regimes. Tabular foundation models (TFMs), pretrained on diverse tabular data with strong in-context generalization, provide a basis for addressing these limitations. We introduce a model-agnostic association rule learning framework that extracts association rules from any conditional probabilistic model over tabular data, enabling us to leverage TFMs. We then introduce TabProbe, an instantiation of our framework that utilizes TFMs as conditional probability estimators to learn association rules out-of-the-box without frequent itemset mining. We evaluate our approach on tabular datasets of varying sizes based on standard ARM rule quality metrics and downstream classification performance. The results show that TFMs consistently produce concise, high-quality association rules with strong predictive performance and remain robust in low-data settings without task-specific training. Source code is available at https://github.com/DiTEC-project/tabprobe.
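To illustrate the idea in the abstract above, here is a minimal Python sketch of extracting rules from a conditional probability estimator. It is not TabProbe's actual procedure. The toy table `ROWS`, the counting-based stand-in `empirical_cond_prob`, the function `extract_rules`, and the `max_antecedent` and `threshold` parameters are all hypothetical names chosen for this example. In the setting the abstract describes, a tabular foundation model would presumably take the place of the counting estimator.

```python
from itertools import combinations, product

# Toy table (made-up data): each row maps a feature name to a categorical value.
ROWS = [
    {"weather": "rain", "traffic": "heavy", "late": "yes"},
    {"weather": "rain", "traffic": "heavy", "late": "yes"},
    {"weather": "rain", "traffic": "light", "late": "no"},
    {"weather": "sun", "traffic": "light", "late": "no"},
    {"weather": "sun", "traffic": "light", "late": "no"},
    {"weather": "sun", "traffic": "heavy", "late": "yes"},
]


def empirical_cond_prob(rows, antecedent, feature, value):
    """Hypothetical stand-in model: P(feature=value | antecedent) by counting."""
    matching = [r for r in rows if all(r[f] == v for f, v in antecedent.items())]
    if not matching:
        return 0.0
    return sum(r[feature] == value for r in matching) / len(matching)


def extract_rules(rows, cond_prob, max_antecedent=2, threshold=0.9):
    """Emit (antecedent, consequent, probability) for confident conditionals."""
    features = sorted(rows[0])
    domains = {f: sorted({r[f] for r in rows}) for f in features}
    rules = []
    for size in range(1, max_antecedent + 1):
        for ante_feats in combinations(features, size):
            for ante_vals in product(*(domains[f] for f in ante_feats)):
                antecedent = dict(zip(ante_feats, ante_vals))
                for target in features:
                    if target in antecedent:
                        continue  # a feature cannot appear on both sides
                    for value in domains[target]:
                        p = cond_prob(rows, antecedent, target, value)
                        if p >= threshold:
                            rules.append((antecedent, (target, value), p))
    return rules


rules = extract_rules(ROWS, empirical_cond_prob)
for antecedent, consequent, p in rules:
    print(antecedent, "->", consequent, round(p, 2))
```

Because `extract_rules` only calls `cond_prob`, any estimator with the same signature can be swapped in. That separation is the only part of the sketch that mirrors the model-agnostic framing in the abstract.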

2025

E. Karabulut, D. Daza, P. Groth and V. Degeler. "Discovering Association Rules in High-Dimensional Small Tabular Data". In ANSyA'25: 1st International Workshop on Advanced Neuro-Symbolic Applications, co-located with 28th European Conference on Artificial Intelligence (ECAI 2025).

Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.

E. Karabulut, P. Groth, and V. Degeler. "Pyaerial: Scalable association rule mining from tabular data". SoftwareX, 31:102341, 2025. ISSN 2352-7110.

Association Rule Mining (ARM) is a knowledge discovery technique that identifies frequent patterns as logical implications within transaction datasets and has been applied across domains such as e-commerce, healthcare, and cyber–physical systems. However, many state-of-the-art ARM methods, typically algorithmic or nature-inspired, suffer from rule explosion and long execution times. Aerial is a novel neurosymbolic ARM algorithm for tabular datasets that mitigates rule explosion using neural networks, while remaining compatible with existing approaches. Aerial transforms tables into transactions, uses an autoencoder to learn compact neural representations, and extracts logical rules from the neural representations. This paper presents PyAerial, a Python library that makes Aerial accessible and easy to use on generic tabular datasets for end users in a domain-independent way. Besides association rules, PyAerial can also be used to extract frequent itemsets, learn classification rules, apply item constraints to learn rules over the features of interest rather than all features, pre-discretize numerical data for ARM, and can be run on a GPU.
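The abstract above mentions transforming tables into transactions and pre-discretizing numerical data. The sketch below shows one simple way to do both in plain Python. It does not use or reproduce PyAerial's API. The helper names `equal_width_bins` and `table_to_transactions`, the `feature=value` item format, and the choice of equal-width binning are assumptions made for this example.

```python
def equal_width_bins(values, n_bins):
    """Map numeric values to bin labels using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant columns
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum value
        labels.append(f"bin{idx}")
    return labels


def table_to_transactions(rows, numeric=(), n_bins=2):
    """Turn each table row into a set of 'feature=value' items."""
    discretized = {
        f: equal_width_bins([r[f] for r in rows], n_bins) for f in numeric
    }
    transactions = []
    for i, row in enumerate(rows):
        items = set()
        for feature, value in row.items():
            if feature in discretized:
                value = discretized[feature][i]  # use the bin label instead
            items.add(f"{feature}={value}")
        transactions.append(frozenset(items))
    return transactions


# Made-up example table with one categorical and one numeric column.
table = [
    {"color": "red", "temp": 10.0},
    {"color": "blue", "temp": 30.0},
]
transactions = table_to_transactions(table, numeric=("temp",), n_bins=2)
print(transactions)
```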

E. Karabulut, P. Groth, V. Degeler, Neurosymbolic association rule mining from tabular data, in: Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning, volume 284 of Proceedings of Machine Learning Research, PMLR, 2025, pp. 565–588.

Association Rule Mining (ARM) is the task of mining patterns among data features in the form of logical rules, with applications across a myriad of domains. However, high-dimensional datasets often result in an excessive number of rules, increasing execution time and negatively impacting downstream task performance. Managing this rule explosion remains a central challenge in ARM research. To address this, we introduce Aerial+, a novel neurosymbolic ARM method. Aerial+ leverages an under-complete autoencoder to create a neural representation of the data, capturing associations between features. It extracts rules from this neural representation by exploiting the model's reconstruction mechanism. Extensive evaluations on five datasets against seven baselines demonstrate that Aerial+ achieves state-of-the-art results by learning more concise, high-quality rule sets with full data coverage. When integrated into rule-based interpretable machine learning models, Aerial+ significantly reduces execution time while maintaining or improving accuracy.
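For readers unfamiliar with how rule quality is usually measured, the following sketch computes support and confidence in their textbook form, plus a simple data-coverage measure. The definition of coverage used here (the fraction of transactions matched by at least one rule antecedent) is an assumption made for illustration and may differ from the exact metric used in the paper. The function names and the toy transactions are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / ante


def coverage(transactions, rules):
    """Fraction of transactions matched by at least one rule antecedent."""
    covered = sum(any(a <= t for a, _ in rules) for t in transactions)
    return covered / len(transactions)


# Made-up market-basket style transactions.
T = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
rule_set = [({"bread"}, {"butter"})]  # a single rule: bread -> butter

print(support(T, {"bread"}))
print(confidence(T, {"bread"}, {"butter"}))
print(coverage(T, rule_set))
```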

Erkan Karabulut, Paul Groth, and Victoria Degeler. "Learning Semantic Association Rules from Internet of Things Data". Neurosymbolic Artificial Intelligence, 2025:1. doi:10.1177/29498732251377518.

Association rule mining (ARM) is the task of discovering commonalities in data in the form of logical implications. ARM is used in the Internet of Things (IoT) for different tasks, including monitoring and decision-making. However, existing methods give limited consideration to IoT-specific requirements such as heterogeneity and volume. Furthermore, they do not utilize important static domain-specific description data about IoT systems, which is increasingly represented as knowledge graphs. In this paper, we propose a novel ARM pipeline for IoT data that utilizes both dynamic sensor data and static IoT system metadata. Furthermore, we propose an autoencoder-based neurosymbolic ARM method (Aerial) as part of the pipeline to address the high volume of IoT data and reduce the total number of rules that are resource-intensive to process. Aerial learns a neural representation of a given dataset and extracts association rules from this representation by exploiting the reconstruction (decoding) mechanism of an autoencoder. Extensive evaluations on three IoT datasets from two domains show that ARM on both static and dynamic IoT data results in more generically applicable rules while Aerial can learn a more concise set of high-quality association rules than the state-of-the-art, with full coverage over the datasets.

2024

Erkan Karabulut, Paul Groth, and Victoria Degeler. "3K: Knowledge-Enriched Digital Twin Framework." (2024). In LongevIoT'24: 1st International Workshop on Longevity in IoT Systems, co-located with 14th International Conference on Internet of Things, November 19–22, 2024, Oulu, Finland.

Digital Twins (DTs) are the digital equivalent of physical entities that facilitate, among others, monitoring and decision-making, thus helping extend the longevity of the twinned entity. DTs with automated decision-making capabilities require explainable inference mechanisms, especially for critical infrastructures such as water networks. Here we introduce 3K, a DT framework that aims for knowledge-enriched inference that is explainable and fast, by synthesizing knowledge representation (semantics) and knowledge discovery methods. 3K constructs a knowledge graph, which is becoming a mainstream way of metadata storage in DTs, and proposes a new method that can run on both sensor data and knowledge graphs to learn semantic association rules. The rules represent the expected working conditions of the DT and we argue that when combined with domain knowledge in the form of ontological axioms, semantic association rules can help perform downstream tasks in DTs, including extending the longevity of the twinned entities such as an Internet of Things (IoT) system. Furthermore, we demonstrate the 3K framework in a water distribution network use case and show how it can be used for downstream tasks.

Degeler, Victoria, et al. "DiTEC: Digital twin for evolutionary changes in water distribution networks." International Symposium on Leveraging Applications of Formal Methods. Cham: Springer Nature Switzerland, 2024.

Conventional digital twins (DT) for critical infrastructures are widely used to model and simulate the system's state. However, fundamental changes in the environment make it difficult for a DT to adapt to new conditions, leading to a progressively decreasing correspondence between the DT and its physical counterpart. This paper introduces the DiTEC system, a Digital Twin for Evolutionary Changes in Water Distribution Networks (WDN). This framework combines novel techniques, including semantic rule learning, graph neural network-based state estimation, and adaptive model selection, to ensure that changes are adequately detected and processed and that the DT is updated to the new state. The DiTEC system is tested on the Dutch Oosterbeek region WDN, with results showing the superiority of the approach compared to traditional methods.

Huang, Yiwen, Erkan Karabulut, and Victoria Degeler. "Large Language Model for Ontology Learning In Drinking Water Distribution Network Domain." ELMKE'24: Evaluation of Language Models in Knowledge Engineering, co-located with 24th International Conference on Knowledge Engineering and Knowledge Management, 26-28 November, 2024, Amsterdam, The Netherlands.

Currently, most ontologies are created manually, which is time-consuming and labour-intensive. Meanwhile, the advanced capabilities of Large Language Models (LLMs) have proven beneficial in various domains, significantly improving the efficiency of text processing and text generation. Therefore, this paper focuses on the use of LLMs for ontology learning. It uses a manual ontology construction method as a basis to guide the LLMs in ontology learning. The proposed approach is based on Retrieval Augmented Generation (RAG), and the queries passed to the LLMs are based on the manual ontology method, UPON Lite. Two different LLM variants were tested, and both demonstrate the capability of ontology learning to varying degrees. This approach shows promising initial results in the direction of (semi-)automated ontology learning using LLMs and makes the ontology construction process easier for people without prior domain expertise. The final ontology was evaluated by the domain expert and ranked according to the defined criteria. Based on the evaluation results, the final ontology could be used as a base version, but it requires further fine-tuning by domain experts to ensure its accuracy and completeness.

Erkan Karabulut, Victoria Degeler, and Paul Groth. AE SemRL: Learning Semantic Association Rules with Autoencoders, 2024. (Earlier version of Aerial)

Association Rule Mining (ARM) is the task of learning associations among data features in the form of logical rules. Mining association rules from high-dimensional numerical data, for example, time series data from a large number of sensors in a smart environment, is a computationally intensive task. In this study, we propose an Autoencoder-based approach to learn and extract association rules from time series data (AE SemRL). Moreover, we argue that in the presence of semantic information related to time series data sources, semantics can facilitate learning generalizable and explainable association rules. Even when time series data is enriched with additional semantic features, AE SemRL makes learning association rules from high-dimensional data feasible. Our experiments show that semantic association rules can be extracted from a latent representation created by an Autoencoder and that this method executes on the order of hundreds of times faster than state-of-the-art ARM approaches in many scenarios. We believe that this study advances a new way of extracting associations from representations and has the potential to inspire more research in this field.

2023

Karabulut, Erkan, Degeler, Victoria, and Groth, Paul. "Semantic Association Rule Learning from Time Series Data and Knowledge Graphs." In SemIIM'23: 2nd International Workshop on Semantic Industrial Information Modelling co-located with 22nd International Semantic Web Conference (ISWC 2023).

Digital Twins (DT) are a promising concept in cyber-physical systems research due to their advanced features including monitoring and automated reasoning. Semantic technologies such as Knowledge Graphs (KG) have recently been utilized in DTs, especially for information modelling. Building on this trend, this paper proposes a pipeline for semantic association rule learning in DTs using KGs and time series data. In addition to this initial pipeline, we also propose a new semantic association rule criterion. The approach is evaluated on an industrial water network scenario. Initial evaluation shows that the proposed approach is able to learn a high number of association rules with semantic information, which are more generalizable. The paper aims to set a foundation for further work on using semantic association rule learning, especially in the context of industrial applications.

Karabulut, Erkan, Salvatore F. Pileggi, Paul Groth, and Victoria Degeler. "Ontologies in digital twins: A systematic literature review." Future Generation Computer Systems (2023).

Digital Twins (DT) facilitate monitoring and reasoning processes in cyber–physical systems. They have progressively gained popularity over the past years because of intense research activity and industrial advancements. Cognitive Twins is a novel concept, recently coined to refer to the involvement of Semantic Web technology in DTs. Recent studies address the relevance of ontologies and knowledge graphs in the context of DTs, in terms of knowledge representation, interoperability and automatic reasoning. However, there is no comprehensive analysis of how semantic technologies, and specifically ontologies, are utilized within DTs. This Systematic Literature Review (SLR) is based on the analysis of 82 research articles that either propose or benefit from ontologies with respect to DTs. The paper uses different analysis perspectives, including a structural analysis based on a reference DT architecture, and an application-specific analysis that addresses the different domains, such as Manufacturing and Infrastructure. The review also identifies open issues and possible research directions on the usage of ontologies and knowledge graphs in DTs.

E. Karabulut and R. C. Sofia, "An Analysis of Machine Learning-based Semantic Matchmaking," in IEEE Access, doi: 10.1109/ACCESS.2023.3259360. 2023.

Interoperability remains one of the main challenges in the Internet of Things. The increasing number of IoT data sources from various vendors adds to the complexity of integrating different sensors and actuators on existing platforms, a process that requires human involvement and is error-prone. To improve this situation, devices are usually coupled with a semantic description of their attributes. Such semantic descriptions, Thing Descriptions (TDs), are therefore an abstraction of devices that helps achieve a smoother integration of devices into IoT platforms. However, TDs are usually vendor-based, so for large-scale IoT infrastructures the integration complexity increases, as different vendors provide different descriptions of similar sensors that must be interconnected into IoT platforms. In this context, the paper assesses different ML-based semantic matchmaking approaches against a sentence-based statistical similarity approach. For the ML approaches, the paper focuses on clustering and Natural Language Processing. The three approaches have been implemented on a realistic testbed, and the experiments show that the NLP-based approach achieves the best performance in terms of accuracy, time to completion of a matchmaking request, and memory usage.

2022

Bnouhanna, N., Karabulut, E., Sofia, R. C., Seder, E. E., Scivoletto, G., & Insolvibile, G. (2022, March). An Evaluation of a Semantic Thing To Service Matching Approach in Industrial IoT Environments. In 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops) (pp. 433-438). IEEE.

Industrial Internet of Things Platforms enable the use of available data to improve production and business processes. However, data exchange and provisioning between the data sources and platform services remain a challenge, as such platforms are usually vendor-specific, proprietary, and associated with specific IoT hardware. Therefore, we provide a detailed description of an open-source, software-based solution, Thing to Service Matching (TSMatch), which performs fine-grained semantic matching between available IoT data and services. Moreover, the paper presents the actual implementation of the proposed solution in two different aerospace production environments, and a performance evaluation in a testbed environment.

2021

Karabulut, Erkan, Nisrine Bnouhanna, and Rute C. Sofia. "ML-based data classification and data aggregation on the edge." Proceedings of the CoNEXT Student Workshop. 2021.

This study focuses on sensor classification using machine learning algorithms to improve data aggregation on the Edge. This aspect is particularly important in large-scale Internet of Things environments, where aggregating data from sensors made by different vendors often requires human intervention. The proposed research relies on machine learning to classify sensors based on their semantic descriptions.

2018

Yazici, Ilkay Melek, Erkan Karabulut, and Mehmet S. Aktas. "A data provenance visualization approach." 2018 14th International Conference on Semantics, Knowledge and Grids (SKG). IEEE, 2018.

In recent years, data provenance has created an emerging requirement for technologies that enable end users to access, evaluate, and act on the provenance of data. In the era of Big Data, the amount of data created by corporations around the world has grown each year. For example, in both the Social Media and e-Science domains, data is growing at an unprecedented rate. As the data has grown rapidly, information on the origin and lifecycle of the data has also grown. In turn, this requires technologies that enable the clarification and interpretation of data through the use of data provenance. This study proposes methodologies for visualizing provenance data compatible with the W3C PROV-O specification. The visualizations are produced by summarizing and comparing the data provenance. We facilitated the testing of these methodologies by providing a prototype that extends an existing open source visualization tool. We discuss the usability of the proposed methodologies with an experimental study; our initial results show that the proposed approach is usable, and its processing overhead is negligible.