This repository was archived by the owner on Oct 19, 2025. It is now read-only.

Function to identify variable that can be best predicted from a set of base variables #38

@MaxGhenis

This would help define the order in which variables are imputed or synthesized. A function like this could be called from other functions:

def most_predictable(df, base_cols, candidate_cols, algorithm):
    """Identify the most predictable column from a set of base columns.

    Args:
        df: DataFrame with base and candidate columns.
        base_cols: List of column names to predict from.
        candidate_cols: List of column names to compare on predictability given base_cols.
        algorithm: Algorithm for determining predictability.

    Returns:
        Column name from candidate_cols which is most predictable from base_cols.
    """

Predictability could be measured with something simple like correlations, or with algorithms like random forests (after standardizing the data; the standardization technique might be another arg).
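As a rough starting point, here is a minimal sketch of the correlation-style option. It scores each candidate by the in-sample R² of an ordinary least squares fit on the base columns, which is the squared multiple correlation. Everything beyond the signature above is an assumption for illustration: the helper name `r_squared_ols`, treating `algorithm` as a callable that takes `(X, y)` and returns a score, and giving it a default. No standardization step is included because OLS R² with an intercept does not change when the inputs are rescaled; a random forest or other scorer could be passed in as `algorithm` instead.

```python
import numpy as np
import pandas as pd


def r_squared_ols(X, y):
    """In-sample R^2 of an OLS fit of y on X with an intercept.

    Hypothetical helper for this sketch, not part of any existing API.
    """
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_tot = np.sum((y - y.mean()) ** 2)
    if ss_tot == 0:
        # Constant column: scored 0.0 here as an arbitrary convention.
        return 0.0
    return 1.0 - np.sum(resid ** 2) / ss_tot


def most_predictable(df, base_cols, candidate_cols, algorithm=r_squared_ols):
    """Return the candidate column with the highest predictability score.

    Assumes `algorithm` is a callable taking (X, y) arrays and returning
    a score where higher means more predictable.
    """
    X = df[base_cols].to_numpy(dtype=float)
    scores = {
        col: algorithm(X, df[col].to_numpy(dtype=float))
        for col in candidate_cols
    }
    return max(scores, key=scores.get)


# Toy data: "c" depends strongly on the base columns, "d" is pure noise.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": b,
    "c": 2 * a - b + 0.1 * rng.normal(size=n),
    "d": rng.normal(size=n),
})
best = most_predictable(df, ["a", "b"], ["c", "d"])
```

In this toy example `best` should be `"c"`. To define a full imputation sequence, the function could be called repeatedly, moving the chosen column from `candidate_cols` into `base_cols` each time. In-sample R² will favor overfit-prone scorers, so a cross-validated score may be a better choice for flexible algorithms.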

cc @rickecon, per our chat if you can take a stab at this that'd be awesome.
