Publication:
A Methodology to Extract Knowledge from Datasets Using ML

dc.contributor.author: Sánchez-de-Madariaga, Ricardo
dc.contributor.author: Pascual-Carrasco, Mario
dc.contributor.author: Muñoz Carrero, Adolfo
dc.date.accessioned: 2025-07-01T06:32:56Z
dc.date.available: 2025-07-01T06:32:56Z
dc.date.issued: 2025-05-28
dc.description: The original data presented in the study are openly available at https://archive.ics.uci.edu/ (accessed on 1 March 2025).
dc.description.abstract: This study aims to verify whether there is any relationship between the different classification outputs produced by distinct ML algorithms and the relevance of the data they classify, to address the problem of knowledge extraction (KE) from datasets. If such a relationship exists, the main objective of this research is to use it in order to improve performance in the important task of KE from datasets. A new dataset generation and a new ML classification measurement methodology were developed to determine whether the feature subsets (FSs) best classified by a specific ML algorithm corresponded to the most KE-relevant combinations of features. Medical expertise was extracted to determine the knowledge relevance using two LLMs, namely, chat GPT-4o and Google Gemini 2.5. Some specific ML algorithms fit much better than others for a working dataset extracted from a given probability distribution. They best classify FSs that contain combinations of features that are particularly knowledge-relevant. This implies that, by using a specific ML algorithm, we can indeed extract useful scientific knowledge. The best-fitting ML algorithm is not known a priori. However, we can bootstrap its identity using a small amount of medical expertise, and we have a powerful tool for extracting (medical) knowledge from datasets using ML.
dc.description.peerreviewed:
dc.format.number: 11
dc.format.page: 1807
dc.format.volume: 13
dc.identifier.citation: Sánchez-de-Madariaga, R.; Pascual Carrasco, M.; Muñoz Carrero, A. A Methodology to Extract Knowledge from Datasets Using ML. Mathematics. 2025. 13(11):1807.
dc.identifier.doi: 10.3390/math13111807
dc.identifier.issn: 2227-7390
dc.identifier.journal: Mathematics
dc.identifier.uri: https://hdl.handle.net/20.500.12105/26783
dc.language.iso: eng
dc.publisher: Multidisciplinary Digital Publishing Institute (MDPI)
dc.relation.publisherversion: https://doi.org/10.3390/math13111807
dc.repisalud.centro: ISCIII::Unidad de Investigación en Salud Digital (UITeS)
dc.repisalud.institucion: ISCIII
dc.rights.accessRights: open access
dc.rights.license: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Knowledge relevance
dc.subject: Knowledge extraction
dc.subject: Feature subset
dc.subject: Large language models
dc.subject: Machine learning algorithms
dc.subject: Statistics
dc.title: A Methodology to Extract Knowledge from Datasets Using ML
dc.type: research article
dc.type.hasVersion: VoR
dspace.entity.type: Publication
relation.isAuthorOfPublication: ef1f86bf-f242-486b-824a-f72079a729b2
relation.isAuthorOfPublication: 28c618fc-d588-423d-803e-f66a36399b42
relation.isAuthorOfPublication: c62651ac-034c-4271-b51e-d82a428af13e
relation.isAuthorOfPublication.latestForDiscovery: ef1f86bf-f242-486b-824a-f72079a729b2
relation.isPublisherOfPublication: 30293a55-0e53-431f-ae8c-14ab01127be9
relation.isPublisherOfPublication.latestForDiscovery: 30293a55-0e53-431f-ae8c-14ab01127be9

Files

Original bundle

Name: MethodologyExtractKnowledgeDatasets_2025.pdf
Size: 1.02 MB
Format: Adobe Portable Document Format

Name: Supplementary1_MethodologyExtractKnowledgeDatasets_2025.pdf
Size: 248.63 KB
Format: Adobe Portable Document Format

Name: Supplementary2_MethodologyExtractKnowledgeDatasets_2025.pdf
Size: 120.46 KB
Format: Adobe Portable Document Format