A Methodology to Extract Knowledge from Datasets Using ML

Sánchez-de-Madariaga, Ricardo; Pascual-Carrasco, Mario; Muñoz Carrero, Adolfo

dc.contributor.author	Sánchez-de-Madariaga, Ricardo
dc.contributor.author	Pascual-Carrasco, Mario
dc.contributor.author	Muñoz Carrero, Adolfo
dc.date.accessioned	2025-07-01T06:32:56Z
dc.date.available	2025-07-01T06:32:56Z
dc.date.issued	2025-05-28
dc.description	The original data presented in the study are openly available at https://archive.ics.uci.edu/ (accessed on 1 March 2025).
dc.description.abstract	This study examines whether the classification outputs produced by different ML algorithms are related to the relevance of the data they classify, in order to address the problem of knowledge extraction (KE) from datasets. If such a relationship exists, the main objective is to exploit it to improve KE performance. A new dataset generation methodology and a new ML classification measurement methodology were developed to determine whether the feature subsets (FSs) best classified by a specific ML algorithm correspond to the most KE-relevant combinations of features. Medical expertise for judging knowledge relevance was obtained from two LLMs, namely ChatGPT-4o and Google Gemini 2.5. For a working dataset drawn from a given probability distribution, some ML algorithms fit much better than others: they best classify the FSs that contain particularly knowledge-relevant combinations of features. This implies that a specific ML algorithm can indeed be used to extract useful scientific knowledge. The best-fitting ML algorithm is not known a priori, but its identity can be bootstrapped using a small amount of medical expertise, yielding a powerful tool for extracting (medical) knowledge from datasets using ML.
dc.description.peerreviewed	Yes
dc.format.number	11
dc.format.page	1807
dc.format.volume	13
dc.identifier.citation	Sánchez-de-Madariaga, R.; Pascual Carrasco, M.; Muñoz Carrero, A. A Methodology to Extract Knowledge from Datasets Using ML. Mathematics. 2025. 13(11):1807.
dc.identifier.doi	10.3390/math13111807
dc.identifier.issn	2227-7390
dc.identifier.journal	Mathematics
dc.identifier.uri	https://hdl.handle.net/20.500.12105/26783
dc.language.iso	eng
dc.publisher	Multidisciplinary Digital Publishing Institute (MDPI)
dc.relation.publisherversion	https://doi.org/10.3390/math13111807
dc.repisalud.centro	ISCIII::Unidad de Investigación en Salud Digital (UITeS)
dc.repisalud.institucion	ISCIII
dc.rights.accessRights	open access
dc.rights.license	Attribution 4.0 International
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Knowledge relevance
dc.subject	Knowledge extraction
dc.subject	Feature subset
dc.subject	Large language models
dc.subject	Machine learning algorithms
dc.subject	Statistics
dc.title	A Methodology to Extract Knowledge from Datasets Using ML
dc.type	research article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	ef1f86bf-f242-486b-824a-f72079a729b2
relation.isAuthorOfPublication	28c618fc-d588-423d-803e-f66a36399b42
relation.isAuthorOfPublication	c62651ac-034c-4271-b51e-d82a428af13e
relation.isAuthorOfPublication.latestForDiscovery	ef1f86bf-f242-486b-824a-f72079a729b2
relation.isPublisherOfPublication	30293a55-0e53-431f-ae8c-14ab01127be9
relation.isPublisherOfPublication.latestForDiscovery	30293a55-0e53-431f-ae8c-14ab01127be9