More than just a machine readable way, the MEX Vocabulary has been designed to define a lightweight and flexible schema for publishing machine learning outputs metadata. We aim at providing a free-format to export and share machine learning metadata, indifferent to existing workflow systems or frameworks. As an immediate advantage, users can benefit of the mex format for further analysis and integrations more easily.
The definition of an ontology can be a complex and never-ending task, even more for a highly complex environment such as machine learning. The MEX vocabulary has been designed to serve as an ally in the metadata exporting process, focusing in practical and important aspects concerning the publication of the achieved results, i.e.: the needed input parameters for an model which produces measures as output. More sophisticated variables and procedures (e.g.: optimizations and feature selection) are not covered, simply because they go beyond a logical threshold of simplicity we aim to achieve. At first, people are usualy more interested in finding out, comparing and sharing methodologies and their achieved performances than to have deep understanding of performed sub tasks. MEX is a PROV-O based vocabulary.
@inproceedings{esteves2015mexvoc,
title={MEX vocabulary: a lightweight interchange format for machine learning experiments},
author={Esteves, Diego and Moussallem, Diego and Neto, Ciro Baron and Soru, Tommaso and Usbeck, Ricardo and Ackermann, Markus and Lehmann, Jens},
booktitle={Proceedings of the 11th International Conference on Semantic Systems},
pages={169--176},
year={2015},
organization={ACM}
}The current version of the vocabulaty is described (per layer) as following. We've omitted obvious information for brevity.
-
Basic Informations: stores the basic authoring information and a brief introduction for your experiment setup
:ApplicationContext:Context: the context for the experiment, such as:Bioinformatics,:ComputationalAdversiting,:ComputationalFinance,:DetectingCreditCardFrauds,:FactPredictionand:NaturalLanguageProcessing.:Experiment: core information for the experiment
-
Configurations: the class
:ExperimentConfigurationplays a important role, creating logical groups for your set of executions. In other words, if you decide run your algorithms over different hardware environments (:HardwareConfiguration) or datasets (:Dataset), for instance, you should create more than one instance of this class. The set of classes which define the logical group are listed below.:FeatureCollection: the set of:Featurerepresent the dataset features (also defined as attribute):SamplingMethod:HardwareConfiguration:Dataset
Besides, each configuration is related to one specific software implementation (mexalgo:Implementation)
- The runs: here the
:Executionentity (super class) is the central class of thecorelayer and receives the spotlight on our schema definition. It is defined as a hub for linking the three layers (mexcore,mexalgoandmexperf). Concisely one instance of:Executionis performed on behalf of an:Algorithmand produces one or more:PerformanceMeasure, represented by the:ExecutionPerformanceclass (e.g.::ex1 prov:used :algAand:execperf :prov:wasGeneratedBy :exec1). It is worth noting the abstraction level for modeling the execution process: for representing the set of runs the algorithm has performed over a dataset (or a subset) (i.e.: an overall execution), the appropriete class is the:ExecutionOverall [subClassOf :Execution], whereas for representing the run over eachdataset example(i.e.: a single execution), the relevant class is the:ExecutionSingle [subClassOf :Execution], although less appropriate for the abstraction level required for presenting experimental results. Below are listed additional related entities::Model: basic information regarding the used/generated model:Phase: train, validation or test:ExampleCollection: the set of:Exampleinstances represent a singledataset example(also defined as instance, or, more technically, a dataset row). It is useful when you're describing each execution for each dataset example (e.g.: a test dataset which contains 100 examples [100 rows] is represented by 100:ExampleCollectioninstances, each of these representing a row) and want to store the values for some reason.
:Tool: the software tool used by the experiment:ToolParameter(:ToolParameterCollection): parameters specific defined for the tool:Algorithm: the algorithm:HyperParameter(:HyperParameterCollection): the set of hyperparamters for a given algorithm:AlgorithmClass: the class of an algorithm (e.g.:Decision TreesorSupport Vector Machines)
Additionally, two classes are defined into this layer in order to provide more formalism for the algorithm, although not crucial (and are instantiated in a transparent way to the end user) for the execution description :LearningMethod and :LearningProblem
-
:ExecutionPerformance: this class links the:Executionand its results, i.e., it's an entity for interlinking themexcoreandmexperflayers. For each execution you have to define at least one:PerformanceMeasureor:ExamplePerformanceor:UserDefinedMeasure(see details below) -
:PerformanceMeasure (super class): the root for machine learning general measures, grouped into classes. Each existing:OverallExecutionyields one or more measures::ClassificationMeasure: classification problem measures (e.g.:accuracy,fpp,f1, ...):RegressionMeasure: regression problem measures (e.g:residual,MSE, ...):ClusteringMeasure: clustering problem measures (e.g.:hammingDistance,manhattanDistance, ...):StatisticalMeasure: generical statistical measures (e.g.:pearsonCorrelation,chiSquare, ...)
-
:ExamplePerformanceCollection: associates each:SingleExecutionand its predicted value and the real value (:ExamplePerformance) -
:UserDefinedMeasureCollection: We know reusing is a key factor for semantic web, but sometimes the number of people would reuse your metric is quite low, of even zero! Besides, extending a vocabulary might be complicated for non-semantic web users (e.g.: how people would useMEXfor describing a machine learning experiment without semantic web background?). Therefore, if you have a very specific measure formula for your business and want to describe use this class! Each:UserDefinedMeasurestores an unknown measure (prov:value) value and its formula (mexperf:formula). Of course we will keep an eye on it for thevocabulary measuresupdating process! ;-)