
Throw error when tables are presented with new column orders? #1144

@ablaom

Over at MLJFlux, @tiemvanderdeure has pointed out the following issue, which is actually generic to MLJ.

As the example below shows, a user who trains a model on a table cannot safely present new data for prediction with the table columns in a different order. Depending on the reordering, prediction either errors or returns different results without any warning:

using MLJ, MLJFlux
using Random  # for bitrand

N = 1000
X = (x1 = rand(Float32, N), x2 = randn(Float32, N), x3 = categorical(rand('a':'c', N)))
y = categorical(bitrand(N))

model = MLJFlux.NeuralNetworkBinaryClassifier(epochs = 10, builder = MLJFlux.MLP(; hidden = (5, 4)), batch_size = 100)
mach = machine(model, X, y)
fit!(mach)

# this errors (x3 is presented first instead of last)
predict(mach, (x3 = X.x3, x1 = X.x1, x2 = X.x2))

# this is false! (x1 and x2 are swapped; no error is thrown)
all(predict(mach, (x2 = X.x2, x1 = X.x1, x3 = X.x3)) .≈ predict(mach, X))

Here is my response from the original post:

Mmm. I think this kind of implicit assumption (that the columns of tables are ordered, and that they are presented in a consistent order) is everywhere in MLJ, and probably elsewhere. [Transferring this issue to MLJ].

One could either try to allow tables to be presented in any column order, or throw a warning when the original order is violated. Personally, I think the latter would be sufficient. If MLJ had a generic data front end for dealing with tables, apart from Tables.matrix, which discards the feature names, this could be an easy fix either way. But a lot of interfaces just don't save the feature names.
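
To make the two options concrete, here is a rough sketch of what such a check might look like. It is not MLJ API: `align_columns` and `trained_names` are hypothetical names, the sketch only handles column tables given as NamedTuples, and it assumes the model interface has saved the feature names at fit time (which, as noted, many interfaces currently do not).

```julia
# Hypothetical helper, not part of MLJ: compare the columns of a NamedTuple
# column table with the names recorded at training time.
function align_columns(X::NamedTuple, trained_names::Tuple{Vararg{Symbol}})
    names = keys(X)
    # Same names in the same order: nothing to do.
    names == trained_names && return X
    # Different set of names: this cannot be repaired by reordering.
    if Set(names) != Set(trained_names)
        throw(ArgumentError("expected columns $(trained_names), got $(names)"))
    end
    # Same names, different order: warn, then restore the training order.
    @warn "Columns are in a different order than in training; reordering."
    return NamedTuple{trained_names}(X)
end

X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0], x3 = ['a', 'b'])
trained_names = keys(X)  # assumption: saved by the model interface at fit time

Xnew = (x3 = X.x3, x1 = X.x1, x2 = X.x2)
aligned = align_columns(Xnew, trained_names)
```

Dropping the `NamedTuple{trained_names}(X)` line and returning `X` unchanged gives the warning-only variant; keeping it gives the "any column order" variant.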

I'd support some kind of resolution, but it's a big ask to adapt across the ecosystem.
