A Methodology to Extract Knowledge from Datasets Using ML

Sánchez-de-Madariaga, Ricardo; Pascual-Carrasco, Mario; Muñoz Carrero, Adolfo


Files

MethodologyExtractKnowledgeDatasets_2025.pdf (1.02 MB)

Supplementary1_MethodologyExtractKnowledgeDatasets_2025.pdf (248.63 KB)

Supplementary2_MethodologyExtractKnowledgeDatasets_2025.pdf (120.46 KB)

Identifiers

URI: https://hdl.handle.net/20.500.12105/26783

ISSN: 2227-7390

DOI: 10.3390/math13111807

Publication date

2025-05-28

Authors

Sánchez-de-Madariaga, Ricardo

ISCIII

Pascual-Carrasco, Mario

ISCIII

Muñoz Carrero, Adolfo

ISCIII

Publishers

Multidisciplinary Digital Publishing Institute (MDPI)

Abstract

This study examines whether the classification outputs produced by different ML algorithms are related to the relevance of the data they classify, with the goal of addressing the problem of knowledge extraction (KE) from datasets. If such a relationship exists, the main objective is to exploit it to improve KE performance. A new dataset generation methodology and a new ML classification measurement methodology were developed to determine whether the feature subsets (FSs) best classified by a specific ML algorithm correspond to the most KE-relevant combinations of features. Medical expertise for judging knowledge relevance was obtained from two LLMs, namely ChatGPT-4o and Google Gemini 2.5. For a working dataset drawn from a given probability distribution, some ML algorithms fit much better than others, and they best classify FSs that contain particularly knowledge-relevant combinations of features. This implies that a specific ML algorithm can indeed be used to extract useful scientific knowledge. The best-fitting ML algorithm is not known a priori; however, its identity can be bootstrapped using a small amount of medical expertise, which yields a powerful tool for extracting (medical) knowledge from datasets using ML.
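The idea of scoring feature subsets per ML algorithm can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the two simple classifiers, leave-one-out accuracy as the score, and all function names are assumptions chosen only for illustration. Comparing the resulting rankings against expert relevance judgements is left out.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy data: (feature values, class label). Only f0 separates the classes.
FEATURES = ["f0", "f1", "f2"]
DATA = [
    ((1.0, 5.0, 0.2), 0), ((1.2, 1.0, 0.9), 0),
    ((0.9, 4.0, 0.5), 0), ((1.1, 2.0, 0.1), 0),
    ((3.0, 4.5, 0.3), 1), ((3.2, 1.5, 0.8), 1),
    ((2.9, 5.0, 0.6), 1), ((3.1, 2.5, 0.4), 1),
]

def nearest_centroid_predict(train, x, cols):
    """Predict the label whose class centroid (on columns `cols`) is closest to x."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [feats for feats, y in train if y == label]
        centroids[label] = [mean(r[c] for r in rows) for c in cols]

    def dist(label):
        return sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[label]))

    return min(sorted(centroids), key=dist)

def one_nn_predict(train, x, cols):
    """Predict the label of the single nearest training row (on columns `cols`)."""
    def dist(row):
        return sum((x[c] - row[0][c]) ** 2 for c in cols)

    return min(train, key=dist)[1]

def loo_accuracy(data, cols, predict):
    """Leave-one-out accuracy of `predict` using only the feature columns in `cols`."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, x, cols) == y
    return hits / len(data)

def rank_feature_subsets(data, n_features, predict):
    """Score every non-empty feature subset and sort best first (ties: smaller subset)."""
    scores = {}
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            scores[cols] = loo_accuracy(data, cols, predict)
    return sorted(scores.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))

# Stand-ins for the "distinct ML algorithms" being compared.
ALGORITHMS = {"nearest_centroid": nearest_centroid_predict, "one_nn": one_nn_predict}
rankings = {
    name: rank_feature_subsets(DATA, len(FEATURES), predict)
    for name, predict in ALGORITHMS.items()
}

for name, ranking in rankings.items():
    best_cols, best_score = ranking[0]
    print(name, [FEATURES[c] for c in best_cols], best_score)
```

Under these assumptions, each algorithm yields its own ranking of feature subsets. The question raised in the abstract is whether the top-ranked subsets of some algorithm coincide with the combinations that domain experts consider most relevant.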

Description

The original data presented in the study are openly available at https://archive.ics.uci.edu/ (accessed on 1 March 2025).

Keywords

Knowledge relevance; Knowledge extraction; Feature subset; Large language models; Machine learning algorithms; Statistics

Bibliographic citation

Sánchez-de-Madariaga, R.; Pascual-Carrasco, M.; Muñoz Carrero, A. A Methodology to Extract Knowledge from Datasets Using ML. Mathematics. 2025. 13(11):1807.

Collections

Unidad de Investigación en Salud Digital (UITeS)

Publisher version

https://doi.org/10.3390/math13111807

Document type

research article
