Usage

mihifepe is invoked as follows:

python -m mihifepe -data_filename <data_filename.hdf5>
                   -model_generator_filename <model_generator_filename.py>
                   -hierarchy_filename <hierarchy_filename.csv>
                   -output_dir <output_dir>
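When the run is launched from another Python script, the same command line can be assembled programmatically. The sketch below uses only the four flags shown above; the helper name build_mihifepe_command and the file names are hypothetical placeholders, and the command is only built, not executed.

```python
import sys


def build_mihifepe_command(data_filename, model_generator_filename,
                           hierarchy_filename, output_dir):
    """Assemble the argument list for `python -m mihifepe` (hypothetical helper)."""
    return [
        sys.executable, "-m", "mihifepe",
        "-data_filename", data_filename,
        "-model_generator_filename", model_generator_filename,
        "-hierarchy_filename", hierarchy_filename,
        "-output_dir", output_dir,
    ]


# Placeholder file names, for illustration only
cmd = build_mihifepe_command("data.hdf5", "gen_model.py", "hierarchy.csv", "out")
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would start the analysis, assuming mihifepe is installed in the current environment.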

Performance can be greatly improved by running on a distributed system. Currently, mihifepe supports only HTCondor running on a shared filesystem, invoked as follows:

python -m mihifepe -condor ...

To see a complete list of options, run:

python -m mihifepe -h

Inputs

  • Test data: <data_filename.hdf5> Test data in HDF5 format
  • Trained model: <model_generator_filename.py> Python script that generates the model object used for subsequent calls to model.predict and model.loss
  • Hierarchy over features: <hierarchy_filename.csv> CSV specifying hierarchy over features

See Input specification for a detailed description of the input data.

Outputs

  • <output_dir>/p_values.csv: CSV listing, for every node in the hierarchy:

    • Accuracy of model with given node perturbed
    • p-values of paired statistical test comparing perturbed model loss to baseline (unperturbed) model loss
  • <output_dir>/hierarchical_fdr_control/tree.png: PNG showing the subtree of the hierarchy formed by the nodes rejected after hierarchical FDR control

Input specification

Data types

The library deals with models that accept one or both of the following types of input:

  • Static input: Represented as a single vector of length L per instance; the vectors of all instances together form a data matrix
  • Temporal input: Represented as an input sequence of variable length V, each element of which is a vector of fixed length W.

Models commonly take only static input, but models such as Recurrent Neural Networks (RNNs) work with input sequences. Models comprising bigger networks with RNN sub-networks may take both kinds of inputs.

Data representation

The data must be in HDF5 format, which (among other things) allows easy and scalable storage and access of a combination of static and variable-length temporal data. The recommended method of generating HDF5 inputs is via h5py.

HDF5 data is organized into a hierarchy comprising groups (containers) and datasets (data collections). The input data for mihifepe must be organized as follows (see https://mihifepe.readthedocs.io/en/latest/examples.html for examples).

Groups:

/temporal               (Group containing temporal data)

Datasets:

/record_ids             (List of M record identifiers (strings), where M = number of records/instances)
/targets                (vector of target values (regression/classification outputs) of length M)
/static                 (matrix of static data of size M x L)
/temporal/<record_id>   (One dataset per record_id) (List (of variable length V) of vectors (of fixed length W))
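The layout above can be produced with h5py along the following lines. This is a minimal sketch with random data, not mihifepe's own tooling: the function name write_example_data, the record identifiers, and the sizes are made up for illustration, and storing the identifiers as fixed-length byte strings is an assumption about the accepted string encoding.

```python
import os
import tempfile

import h5py
import numpy as np


def write_example_data(path, num_records=4, static_len=3, temporal_width=2, seed=0):
    """Write random data using the group/dataset layout described above."""
    rng = np.random.default_rng(seed)
    record_ids = [f"record_{i}" for i in range(num_records)]  # hypothetical identifiers
    with h5py.File(path, "w") as f:
        # /record_ids: M identifiers, stored here as byte strings (assumption)
        f.create_dataset("record_ids", data=np.array(record_ids, dtype="S"))
        # /targets: vector of length M (binary labels in this sketch)
        f.create_dataset("targets", data=rng.integers(0, 2, size=num_records))
        # /static: matrix of size M x L
        f.create_dataset("static", data=rng.random((num_records, static_len)))
        # /temporal/<record_id>: one V x W dataset per record, with V varying per record
        temporal = f.create_group("temporal")
        for record_id in record_ids:
            seq_len = int(rng.integers(1, 6))
            temporal.create_dataset(record_id, data=rng.random((seq_len, temporal_width)))
    return record_ids


path = os.path.join(tempfile.mkdtemp(), "example_data.hdf5")
ids = write_example_data(path)

# Read the file back to confirm the structure
with h5py.File(path, "r") as f:
    static_shape = f["static"].shape
    temporal_keys = sorted(f["temporal"].keys())
print(static_shape, temporal_keys)
```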

[TODO]: Sparse representations may be used for both temporal and static data, in which case the attribute ‘sparse’ must be specified.

Trained model

The caller must create a model object, corresponding to the trained model, that implements the following methods:

model.predict(target, static_data, temporal_data)
"""
Predicts the model's output (loss, prediction) for the given target and instance.

Args:
    target:         classification label or regression output (scalar value)
    static_data:    static data (vector)
    temporal_data:  temporal data (matrix whose number of rows varies across instances)

Returns:
    loss:           model's output loss
    prediction:     model's output prediction, only used for classifiers
"""

model.loss(prediction, target)
"""
The model's loss function applied to an output/target pair for a single input instance.
For instance, if the loss function is RMSE, the function would return
sqrt(mean((prediction - target)**2)) = abs(prediction - target)

Args:
    prediction:     model's output prediction on a single input instance
    target:         corresponding target label for instance

Returns:
    loss:           model's output loss
"""

This object must be generated by a standalone Python script that is passed to mihifepe. This allows mihifepe to distribute the feature perturbations across multiple worker nodes, each with its own copy of the model. For instance, if the script path is /a/b/c/d/gen_model.py, then mihifepe will access the model as follows:

sys.path.insert(0, "/a/b/c/d/") # Makes python search this folder for modules
from gen_model import model
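A generator script compatible with this import could look like the sketch below. The class LinearModel, its weights, and the averaging of temporal features are invented for illustration; a real script would typically load a trained model from disk instead. The only parts taken from the description above are the module-level name model and the predict/loss signatures.

```python
# gen_model.py: hypothetical model generator script (illustration only)


class LinearModel:
    """Toy regressor: weighted sum of static features plus time-averaged temporal features."""

    def __init__(self, static_weights, temporal_weights, bias=0.0):
        self.static_weights = static_weights
        self.temporal_weights = temporal_weights
        self.bias = bias

    def predict(self, target, static_data, temporal_data):
        """Return (loss, prediction) for a single instance."""
        prediction = self.bias
        prediction += sum(w * x for w, x in zip(self.static_weights, static_data))
        if len(temporal_data) > 0:
            # Average each temporal feature over the V time steps of this instance
            num_steps = len(temporal_data)
            means = [sum(step[j] for step in temporal_data) / num_steps
                     for j in range(len(self.temporal_weights))]
            prediction += sum(w * m for w, m in zip(self.temporal_weights, means))
        return self.loss(prediction, target), prediction

    def loss(self, prediction, target):
        """Absolute error, i.e. RMSE evaluated on a single instance."""
        return abs(prediction - target)


# Module-level object picked up by `from gen_model import model`
model = LinearModel(static_weights=[0.5, -0.25], temporal_weights=[1.0, 0.0])

# Quick sanity check on one made-up instance (remove in a real script)
loss, prediction = model.predict(1.0, [1.0, 2.0], [[1.0, 1.0], [3.0, 3.0]])
print(loss, prediction)
```

Keeping the model construction at module level means that each worker node rebuilds its own copy simply by importing the script.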

The test data type must match the data type of the predict function (e.g. if the model requires both static and temporal input, the input test data must provide both for every instance).

Hierarchy over features

The caller must provide a hierarchy over features as a CSV file. Each node (including leaf nodes) may correspond to a single feature or a group of features. Two sets of indices (static and temporal) must be specified for each leaf node, at least one of which must be non-empty. Indices of the same data type must not overlap across leaf nodes. The CSV must contain the following columns:

name:             feature name, unique across features
parent_name:      name of parent if it exists, else '' (root node)
description:      node description
static_indices:   [only required for leaf nodes] tab-separated list of the indices of these features
                    in the static data
temporal_indices: [only required for leaf nodes] tab-separated list of the indices of these features
                    in the temporal data