Managing Machine Learning Dependencies In Distributed Systems

Recently, I was struggling with dependency issues while building NBoost. One of the important features of NBoost is the portability of the models; we wanted the platform to be agnostic between many different types of finetuned models: Tensorflow, Pytorch, Transformers, whatever. But I really wanted NBoost to be packageable and pip-installable as well! This meant that all of the python classes needed to be added to the package. But machine learning dependencies are HEAVY; you shouldn’t have to import pytorch (which is a heavy dependency) if you want to use a tensorflow model, and vice versa. Users should be able to cherry-pick the dependencies they need. The solution wasn’t very complicated, but it worked well for our use case, so I thought I’d share it. Here it is:

1. Add the dependencies in `extras_require`

Nboost comes in four flavors: nboost (the base package), nboost[tf] (w/ tensorflow), nboost[torch] (w/ pytorch), and nboost[all] (with both). Our setup.py is set up so that users can pip install any of these options depending on their model:

from setuptools import setup

setup(
    # ...
    extras_require={
        'torch': ['torch', 'transformers'],
        'tf': ['tensorflow==1.15', 'sentencepiece'],
        'all': ['torch', 'tensorflow==1.15', 'transformers'],
    },
    # ...
)

2. Create a model mapping in the `init.py`

In the base of our package I added a mapping that reveals which module each model class can be found.

# component => class => module
CLASS_MAP = {
    'protocol': {
        'TestProtocol': 'test',
        'ESProtocol': 'es'
    },
    'model': {
        'TestModel': 'test',
        'TransformersModel': 'transformers',
        'BertModel': 'bert_model',
        'AlbertModel': 'albert_model'
    }
}

This way, when a user types in nboost --model_dir BertModel, NBoost knows that it should only import the bert_model module.

3. Import the classes dynamically from the cli

In our command line entrypoint, I added an import_class function that takes “model” (the module) and “BertModel” (the class) as arguments and returns that class (but doesn’t import anything else!).

import importlib, CLASS_MAP

def import_class(module: str, name: str):
    """Dynamically import class from a module in the CLASS_MAP. This is used
    to manage dependencies within nboost. For example, you don't necessarily
    want to import pytorch models everytime you boot up tensorflow..."""
    if name not in CLASS_MAP[module]:
        raise ImportError('Cannot locate %s with name "%s"' % (module, name))

    file = 'nboost.%s.%s' % (module, CLASS_MAP[module][name])
    return getattr(importlib.import_module(file), name)

And that’s it! In NBoost, this model class gets handed to the proxy so that the proxy can initialize the model. Good luck!

Written on November 23, 2019

Managing Machine Learning Dependencies In Distributed Systems

1. Add the dependencies in extras_require

2. Create a model mapping in the __init__.py

3. Import the classes dynamically from the cli

1. Add the dependencies in `extras_require`

2. Create a model mapping in the `init.py`