Introducing mle-monitor: A Lightweight Experiment & Resource Monitoring Tool πΊ
Published:
βDid I already run this experiment before? How many resources are currently available on my cluster?β If these are common questions you encounter during your daily life as a researcher, then mle-monitor is made for you. It provides a lightweight API for tracking your experiments using a pickle protocol database (e.g. for hyperparameter searches and/or multi-configuration/multi-seed runs). Furthermore, it comes with built-in resource monitoring on Slurm/Grid Engine clusters and local machines/servers. Finally, it leverages rich in order to provide a terminal dashboard that is updated online with new protocolled experiments and the current state of resource utilization. Here is an example of a dashboard running on a Grid Engine cluster:

%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'
try:
import mle_monitor
except:
!pip install -q mle-monitor
import mle_monitor
mle-monitor comes with three core functionalities:
MLEProtocol: A composable protocol database API for ML experiments.MLEResource: A tool for obtaining server/cluster usage statistics.MLEDashboard: A dashboard visualizing resource usage & experiment protocol.

Finally, mle-monitor is part of the mle-infrastructure and comes with a set of handy built-in synergies. We will wrap-up by outlining a full workflow of using the protocol together with a random search experiment using mle-hyperopt, mle-scheduler and mle-logging.
# Check if code is run in Colab: If so -- download configs from repo
try:
import google.colab
IN_COLAB = True
!wget -q https://raw.githubusercontent.com/mle-infrastructure/mle-monitor/main/examples/train.py
!wget -q https://raw.githubusercontent.com/mle-infrastructure/mle-monitor/main/examples/base_config.json
except:
IN_COLAB = False
Experiment Management with MLEProtocol π
from mle_monitor import MLEProtocol
# Load the protocol from a local file (create new if it doesn't exist yet)
protocol = MLEProtocol(protocol_fname="mle_protocol.db", verbose=True)
In order to add a new experiment to the protocol database you have to provide a dictionary containing the experiment meta data:
| Search Type | Description | Default |
|---|---|---|
purpose |
Purpose of experiment | 'None provided' |
project_name |
Project name of experiment | 'default' |
exec_resource |
Resource jobs are run on | 'local' |
experiment_dir |
Experiment log storage directory | 'experiments' |
experiment_type |
Type of experiment to run | 'single' |
base_fname |
Main code script to execute | 'main.py' |
config_fname |
Config file path of experiment | 'base_config.yaml' |
num_seeds |
Number of evaluations seeds | 1 |
num_total_jobs |
Number of total jobs to run | 1 |
num_job_batches |
Number of jobs in single batch | 1 |
num_jobs_per_batch |
Number of sequential job batches | 1 |
time_per_job |
Expected duration: days-hours-minutes | '00:01:00' |
num_cpus |
Number of CPUs used in job | 1 |
num_gpus |
Number of GPUs used in job | 0 |
meta_data = {
"purpose": "Test Protocol", # Purpose of experiment
"project_name": "MNIST", # Project name of experiment
"exec_resource": "local", # Resource jobs are run on
"experiment_dir": "log_dir", # Experiment log storage directory
"experiment_type": "hyperparameter-search", # Type of experiment to run
"base_fname": "train.py", # Main code script to execute
"config_fname": "base_config.json", # Config file path of experiment
"num_seeds": 5, # Number of evaluations seeds
"num_total_jobs": 10, # Number of total jobs to run
"num_jobs_per_batch": 5, # Number of jobs in single batch
"num_job_batches": 2, # Number of sequential job batches
"time_per_job": "00:05:00", # Expected duration: days-hours-minutes
"num_cpus": 2, # Number of CPUs used in job
"num_gpus": 1, # Number of GPUs used in job
}
e_id = protocol.add(meta_data, save=False)
[15:37:51] INFO Added experiment 1 to protocol. mle_protocol.py:162
Adding the experiment will load in the configuration file (either .json or .yaml) and set the experiment status to βrunningβ. You can then always retrieve the provided information using protocol.get(e_id):
protocol.get(e_id)
{'purpose': 'Test Protocol',
'project_name': 'MNIST',
'exec_resource': 'local',
'experiment_dir': 'log_dir',
'experiment_type': 'hyperparameter-search',
'base_fname': 'train.py',
'config_fname': 'base_config.json',
'num_seeds': 5,
'num_total_jobs': 10,
'num_jobs_per_batch': 5,
'num_job_batches': 2,
'time_per_job': '00:05:00',
'num_cpus': 2,
'num_gpus': 1,
'git_hash': '60cb3e3883da3888865b47abf0d5b6257e6d91e5',
'loaded_config': [{'train_config': {'lrate': 0.1},
'model_config': {'num_layers': 5},
'log_config': {'time_to_track': ['step_counter'],
'what_to_track': ['loss'],
'time_to_print': ['step_counter'],
'what_to_print': ['loss'],
'print_every_k_updates': 10,
'overwrite_experiment_dir': 1}}],
'e-hash': '897b974332747d81e84b3ed688f7862d',
'retrieved_results': False,
'stored_in_cloud': False,
'report_generated': False,
'job_status': 'running',
'completed_jobs': 0,
'start_time': '12/09/21 15:37',
'duration': '0:10:00',
'stop_time': '12/10/21 01:37'}
You can also always print a summary snapshot of the last experiments using protocol.summary(). By providing the boolean option full, you also print the resources used in an experiment:
# Print a summary of the last experiments
sub_df = protocol.summary()
# ... and a more detailed version
sub_df = protocol.summary(full=True)
π π π Project Purpose Type βΆ β» CPU GPU ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β Έ 1 12/09 MNIST Test Protocol search Local 5 2 1
π π π Project Purpose Type βΆ β» CPU GPU β³ Completed Jobs β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β Ό 1 12/09 MNIST Test Protocol search Local 5 2 1 0 /10 0%
If you want to ad-hoc change any of the stored attributes of an experiment, you can do so using the update method. Furthermore, you can change the experiment status using abort or complete:
# Update some element in the database
protocol.update(e_id, "exec_resource", "slurm-cluster", save=False)
# Abort the experiment - changes status
protocol.abort(e_id, save=False)
sub_df = protocol.summary()
# Get the status of the experiment
protocol.status(e_id)
π π π Project Purpose Type βΆ β» CPU GPU ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β 1 12/09 MNIST Test Protocol search Slurm 5 2 1
'aborted'
If you would like to get a summary of all reported experiments, the last experiment and its resource requirements, protocol.monitor() does so:
# Get the monitoring data - used later in dashboard
protocol_data = protocol.monitor()
protocol_data["total_data"]
{'total': '1',
'run': '0',
'done': '0',
'aborted': '1',
'sge': '0',
'slurm': '1',
'gcp': '0',
'local': '0',
'report_gen': '0',
'gcs_stored': '0',
'retrieved': '0'}
Finally, you can also store other data specific to an experiment using an additional dictionary of data as follows:
extra_data = {"extra_config": {"lrate": 3e-04}}
e_id = protocol.add(meta_data, extra_data, save=False)
protocol.get(e_id)["extra_config"]
[15:37:58] INFO Added experiment 2 to protocol. mle_protocol.py:162
{'lrate': 0.0003}
Syncing your Protocol DB with a GCS Bucket
If you would like to keep a remote copy of your protocol, you can also automatically sync your protocol database with a Google Cloud Storage (GCS) bucket. This is especially useful when running experiments on multiple resource and will require you to have created a GCP project and a GCS bucket. Furthermore you will have to provide you .json authentication key path. If you donβt have one yet, have a look here. Alternatively, just make sure that the environment variable GOOGLE_APPLICATION_CREDENTIALS is set to the right path.
# Sync your protocol with a GCS bucket
cloud_settings = {
"project_name": "mle-toolbox", # Name of your GCP project
"bucket_name": "mle-protocol", # Name of your GCS bucket
"protocol_fname": "mle_protocol.db", # Name of DB file in GCS bucket
"use_protocol_sync": True, # Whether to sync the protocol
"use_results_storage": False # Whether to upload zipped dir at completion
}
protocol = MLEProtocol(protocol_fname="mle_protocol.db",
cloud_settings=cloud_settings,
verbose=True)
[15:38:01] INFO No DB found in GCloud Storage - mle_protocol.db gcs_sync.py:39
INFO New DB will be created - mle-toolbox/mle-protocol gcs_sync.py:40
INFO Pulled protocol from GCS bucket: mle-protocol. mle_protocol.py:379
e_id = protocol.add(meta_data)
[15:38:06] INFO Added experiment 1 to protocol. mle_protocol.py:162
INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
[15:38:07] INFO Send to GCloud Storage - mle_protocol.db gcs_sync.py:70
INFO Send protocol to GCS bucket: mle-protocol. mle_protocol.py:364
INFO GCS synced protocol: mle_protocol.db mle_protocol.py:97
Finally, you can also choose to store the results of an experiment in the GCS bucket. In this case the protocol will upload a zipped version of your created experiment_dir to the bucket whenever you call protocol.complete().
Resource Monitoring with MLEResource π
You can monitor your local machine, server or clusters using the MLEResource. If you are running this on a Google Colab, make sure to add a GPU accelerator!
from mle_monitor import MLEResource
resource = MLEResource(resource_name="local")
resource_data = resource.monitor()
resource_data["user_data"].keys()
dict_keys(['pid', 'p_name', 'mem_util', 'cpu_util', 'cmdline', 'total_cpu_util', 'total_mem_util'])
You can also monitor slurm or grid engine clusters by providing the queues/partitions to monitor in monitor_config:
resource = MLEResource(
resource_name="slurm-cluster",
monitor_config={"partitions": ["<partition-1>", "<partition-2>"]},
)
resource = MLEResource(
resource_name="sge-cluster",
monitor_config={"queues": ["<queue-1>", "<queue-2>"]}
)
Dashboard Visualization with MLEDashboard ποΈ
from mle_monitor import MLEDashboard
dashboard = MLEDashboard(protocol, resource)
# Get a static snapshot of the protocol & resource utilisation
# Note: This will look a lot nicer in your terminal!
dashboard.snapshot()
# Run monitoring in while loop - dashboard
dashboard.live()
- Add widget animation!/screenshot
Integration with the MLE-Infrastructure Ecosystem πΊ
Running a Hyperparameter Search for Multiple Random Seeds
try:
from mle_hyperopt import RandomSearch
from mle_scheduler import MLEQueue
from mle_logging import load_meta_log
except:
!pip install -q mle-hyperopt mle-scheduler mle-logging
!pip install --upgrade rich
from mle_hyperopt import RandomSearch
from mle_scheduler import MLEQueue
from mle_logging import load_meta_log
We again start by adding an experiment to the protocol at launch time.
# Load (existing) protocol database and add experiment data
protocol_db = MLEProtocol("mle_protocol.db", verbose=True)
meta_data = {
"purpose": "random search", # Purpose of experiment
"project_name": "surrogate", # Project name of experiment
"exec_resource": "local", # Resource jobs are run on
"experiment_dir": "logs_search", # Experiment log storage directory
"experiment_type": "hyperparameter-search", # Type of experiment to run
"base_fname": "train.py", # Main code script to execute
"config_fname": "base_config.json", # Config file path of experiment
"num_seeds": 2, # Number of evaluations seeds
"num_total_jobs": 4, # Number of total jobs to run
"num_jobs_per_batch": 4, # Number of jobs in single batch
"num_job_batches": 1, # Number of sequential job batches
"time_per_job": "00:00:02", # Expected duration: days-hours-minutes
}
new_experiment_id = protocol_db.add(meta_data)
[15:38:26] INFO Added experiment 2 to protocol. mle_protocol.py:162
INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
Afterwards, we leverage mle-hyperopt to instantiate a random search strategy with its parameter space. We then ask for two configurations and store them as .yaml files in our working directory:
# Instantiate random search class
strategy = RandomSearch(
real={"lrate": {"begin": 0.1, "end": 0.5, "prior": "log-uniform"}},
integer={"batch_size": {"begin": 1, "end": 5, "prior": "uniform"}},
categorical={"arch": ["mlp", "cnn"]},
verbose=True,
)
# Ask for configurations to evaluate & run parallel eval of seeds * configs
configs, config_fnames = strategy.ask(2, store=True)
configs
MLE-Hyperopt Random Search Hyperspace π π» Variable Type Search Range β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ arch categorical ['mlp', 'cnn'] lrate real Begin: 0.1, End: 0.5, Prior: log-uniform batch_size integer Begin: 1, End: 5, Prior: uniform
[{'arch': 'mlp', 'lrate': 0.360379148648584, 'batch_size': 3},
{'arch': 'cnn', 'lrate': 0.26208630215377515, 'batch_size': 2}]
Next, we can use a MLEQueue from mle-scheduler to run our training script train.py for our two configurations and two different random seeds. train.py implements a simple surrogate training loop, which logs some statistics with the help of mle-logging. Afterwards, we merge the resulting logs into a single meta_log.hdf5 and retrieve the mean (over seeds) test loss score for both configurations.
queue = MLEQueue(
resource_to_run="local",
job_filename="train.py",
config_filenames=config_fnames,
random_seeds=[1, 2],
experiment_dir="logs_search",
protocol_db=protocol_db,
)
queue.run()
# Merge logs of random seeds & configs -> load & get final scores
queue.merge_configs(merge_seeds=True)
meta_log = load_meta_log("logs_search/meta_log.hdf5")
test_scores = [meta_log[r].stats.test_loss.mean[-1] for r in queue.mle_run_ids]
Output()
[15:38:35] INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
[15:38:36] INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
MLEQueue - local β’ 4/4 Jobs βββββββββββββββββββββββββββ 100% β’ 0:00:01 β
Finally, we update the random search strategy and tell the protocol that the experiment has been completed:
# Update the hyperparameter search strategy
strategy.tell(configs, test_scores)
# Wrap up experiment (store completion time, etc.)
protocol_db.complete(new_experiment_id)
βββββββββββββββββ³βββββ³ββββββββββ³ββββββββββββββββββββββββββββββββββββββββββββββββ β π₯ Total: 2 β ID β Obj. π β Configuration π - 12/09/2021 15:38:43 β β‘βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ© β Best Overall β 0 β 1.193 β 'arch': 'mlp', 'lrate': 0.360379148648584, β β β β β 'batch_size': 3 β β Best in Batch β 0 β 1.193 β 'arch': 'mlp', 'lrate': 0.360379148648584, β β β β β 'batch_size': 3 β βββββββββββββββββ΄βββββ΄ββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββ
[15:38:43] INFO Locally stored protocol: mle_protocol.db mle_protocol.py:91
INFO Updated protocol - COMPLETED: 2 mle_protocol.py:253
Give it a try and let me know what you think! If you find a bug or are missing your favourite feature, feel free to contact me @RobertTLange or create an issue!