Version: Next

Ray Launcher plugin


The Ray Launcher plugin provides two launchers: ray_aws and ray. ray_aws launches jobs remotely on AWS and is built on top of the Ray autoscaler SDK. ray launches jobs on your local machine or on an existing Ray cluster.

Installation

$ pip install hydra-ray-launcher --upgrade

Usage

Once installed, add hydra/launcher=ray_aws or hydra/launcher=ray to your command line. Alternatively, override hydra/launcher in your config:

defaults:
  - override hydra/launcher: ray_aws

There are several standard approaches for configuring plugins. Check this page for more information.
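
As a rough illustration of the command-line form described above, the sketch below assembles the argument list for a multirun sweep that selects one of the two launchers. The helper name `build_multirun_command` and the `task` sweep key are illustrative assumptions; the helper only formats strings and does not call Hydra.

```python
import sys
from typing import Dict, List, Sequence


def build_multirun_command(
    script: str, launcher: str, sweep: Dict[str, Sequence[object]]
) -> List[str]:
    """Assemble argv for a Hydra multirun that selects a Ray launcher.

    Hypothetical helper, not part of the plugin.
    """
    if launcher not in ("ray", "ray_aws"):
        raise ValueError(f"unknown launcher: {launcher!r}")
    args = [sys.executable, script, "--multirun", f"hydra/launcher={launcher}"]
    for key, values in sweep.items():
        # Comma-separated values ask Hydra to sweep over each one.
        args.append(f"{key}=" + ",".join(str(v) for v in values))
    return args


cmd = build_multirun_command("my_app.py", "ray_aws", {"task": [1, 2, 3]})
# Print everything after the interpreter path.
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run` if you want to start the sweep from another script.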

`ray_aws` launcher

important

The ray_aws launcher is built on top of Ray's autoscaler SDK. To get started, you need to configure your AWS credentials. The Ray autoscaler SDK expects your AWS credentials to have certain permissions for EC2 and IAM. Read this for more information.

The Ray autoscaler SDK expects a configuration for the EC2 cluster; we've schematized the configs here.

Discover ray_aws launcher's config

$ python my_app.py hydra/launcher=ray_aws --cfg hydra -p hydra.launcher
# @package hydra.launcher
_target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
env_setup:
  pip_packages:
    omegaconf: ${ray_pkg_version:omegaconf}
    hydra_core: ${ray_pkg_version:hydra}
    ray: ${ray_pkg_version:ray}
    cloudpickle: ${ray_pkg_version:cloudpickle}
    pickle5: 0.0.11
    hydra_ray_launcher: 1.2.0.dev1
  commands:
  - conda create -n hydra_${python_version:micro} python=${python_version:micro} -y
  - echo 'export PATH="$HOME/anaconda3/envs/hydra_${python_version:micro}/bin:$PATH"'
    >> ~/.bashrc
ray:
  init:
    address: null
  remote: {}
  cluster:
    cluster_name: default
    min_workers: 0
    upscaling_speed: 1.0
    max_workers: 1
    initial_workers: 0
    autoscaling_mode: default
    target_utilization_fraction: 0.8
    idle_timeout_minutes: 5
    docker:
      image: ''
      container_name: ''
      pull_before_run: true
      run_options: []
    provider:
      type: aws
      region: us-west-2
      availability_zone: us-west-2a,us-west-2b
      cache_stopped_nodes: false
      key_pair:
        key_name: hydra-${oc.env:USER,user}
    auth:
      ssh_user: ubuntu
    available_node_types:
      ray.head.default:
        resources: {}
        node_config:
          InstanceType: m5.large
          ImageId: ami-0a2363a9cff180a64
      ray.worker.default:
        min_workers: 0
        max_workers: 2
        resources: {}
        node_config:
          InstanceType: m5.large
          ImageId: ami-0a2363a9cff180a64
          InstanceMarketOptions:
            MarketType: spot
    head_node_type: ray.head.default
    file_mounts: {}
    initialization_commands: []
    cluster_synced_files: []
    setup_commands: []
    head_setup_commands: []
    worker_setup_commands: []
    head_start_ray_commands:
    - ray stop
    - ulimit -n 65536;ray start --head --port=6379 --object-manager-port=8076             --autoscaling-config=~/ray_bootstrap_config.yaml
    worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
  run_env: auto
stop_cluster: true
sync_up:
  source_dir: null
  target_dir: null
  include: []
  exclude: []
sync_down:
  source_dir: null
  target_dir: null
  include: []
  exclude: []
logging:
  log_style: auto
  color_mode: auto
  verbosity: 0
create_update_cluster:
  no_restart: false
  restart_only: false
  no_config_cache: false
teardown_cluster:
  workers_only: false
  keep_min_workers: false

Examples

The following examples can be found here.

Simple app

$ python my_app.py --multirun task=1,2,3
[HYDRA] Ray Launcher is launching 3 jobs,
[HYDRA]        #0 : task=1
[HYDRA]        #1 : task=2
[HYDRA]        #2 : task=3
[HYDRA] Pickle for jobs: /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpqqg4v4i7/job_spec.pkl
Cluster: default
...
INFO services.py:1172 -- View the Ray dashboard at http://localhost:8265
(pid=3374) [__main__][INFO] - Executing task 1
(pid=3374) [__main__][INFO] - Executing task 2
(pid=3374) [__main__][INFO] - Executing task 3
...
[HYDRA] Stopping cluster now. (stop_cluster=true)
[HYDRA] Deleted the cluster (provider.cache_stopped_nodes=false)
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
...
No nodes remaining.

Upload & Download from remote cluster

If your application depends on multiple modules, you can configure hydra.launcher.sync_up to upload the dependency modules to the remote cluster. You can also configure hydra.launcher.sync_down to download output from the remote cluster if needed. This functionality is built on top of rsync, and the include and exclude options behave the same way they do in rsync.

$  python train.py --multirun random_seed=1,2,3
[HYDRA] Ray Launcher is launching 3 jobs,
[HYDRA]        #0 : random_seed=1
[HYDRA]        #1 : random_seed=2
[HYDRA]        #2 : random_seed=3
[HYDRA] Pickle for jobs: /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmptdkye9of/job_spec.pkl
Cluster: default
...
INFO services.py:1172 -- View the Ray dashboard at http://localhost:8265
(pid=1772) [__main__][INFO] - Start training...
(pid=1772) [INFO] - Init my model
(pid=1772) [INFO] - Created dir for checkpoints. dir=checkpoint
(pid=1772) [__main__][INFO] - Start training...
(pid=1772) [INFO] - Init my model
(pid=1772) [INFO] - Created dir for checkpoints. dir=checkpoint
(pid=1772) [__main__][INFO] - Start training...
(pid=1772) [INFO] - Init my model
(pid=1772) [INFO] - Created dir for checkpoints. dir=checkpoint
Loaded cached provider configuration
...
[HYDRA] Output: receiving file list ... done
16-32-25/
16-32-25/0/
16-32-25/0/checkpoint/
16-32-25/0/checkpoint/checkpoint_1.pt
16-32-25/1/
16-32-25/1/checkpoint/
16-32-25/1/checkpoint/checkpoint_2.pt
16-32-25/2/
16-32-25/2/checkpoint/
16-32-25/2/checkpoint/checkpoint_3.pt
...
[HYDRA] Stopping cluster now. (stop_cluster=true)
[HYDRA] Deleted the cluster (provider.cache_stopped_nodes=false)
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
...
No nodes remaining.
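
The sync settings described in this section can also be passed as command-line overrides. The sketch below formats such overrides for either direction. The helper name `sync_overrides`, the directory paths, and the patterns are hypothetical; only the `hydra.launcher.sync_up` and `hydra.launcher.sync_down` keys come from the launcher config shown earlier.

```python
from typing import List, Sequence


def sync_overrides(
    direction: str,
    source_dir: str,
    target_dir: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
) -> List[str]:
    """Format Hydra overrides for hydra.launcher.sync_up or sync_down.

    Hypothetical helper: it only builds strings and does not run rsync.
    """
    if direction not in ("sync_up", "sync_down"):
        raise ValueError(f"direction must be sync_up or sync_down, got {direction!r}")

    def as_list(patterns: Sequence[str]) -> str:
        # Quote each pattern so characters such as '*' stay literal strings.
        return "[" + ",".join(f"'{p}'" for p in patterns) + "]"

    prefix = f"hydra.launcher.{direction}"
    return [
        f"{prefix}.source_dir={source_dir}",
        f"{prefix}.target_dir={target_dir}",
        f"{prefix}.include={as_list(include)}",
        f"{prefix}.exclude={as_list(exclude)}",
    ]


# Hypothetical values: a local "model" package plus top-level .py files.
for override in sync_overrides("sync_up", ".", "/tmp/app", ["model", "*.py"], ["*"]):
    print(override)
```

When typing these overrides in a shell, wrap each one in quotes so the shell does not expand the brackets or wildcards.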

Manage Cluster LifeCycle

You can manage the Ray EC2 cluster lifecycle by configuring the flags provided by the plugin:

Default setting (no need to specify on the command line): delete the cluster after the jobs finish remotely:

hydra.launcher.stop_cluster=true
hydra.launcher.ray.cluster.provider.cache_stopped_nodes=false
hydra.launcher.teardown_cluster.workers_only=false
hydra.launcher.teardown_cluster.keep_min_workers=false

Keep the cluster running after the jobs finish remotely
```
hydra.launcher.stop_cluster=false
```

Power off EC2 instances and control node termination using hydra.launcher.ray.cluster.provider.cache_stopped_nodes and hydra.launcher.teardown_cluster.workers_only

| cache_stopped_nodes | workers_only | behavior |
| --- | --- | --- |
| false | false | All nodes are terminated |
| false | true | Keeps the head node running and terminates only the worker nodes |
| true | false | Keeps both the head node and the worker nodes and stops all of them |
| true | true | Keeps both the head node and the worker nodes and stops only the worker nodes |
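
To make the interaction of the two flags easier to check, the table above can be encoded as a small function. This is only a restatement of the four rows, not the plugin's own code; `TeardownOutcome` and `teardown_outcome` are illustrative names.

```python
from typing import NamedTuple


class TeardownOutcome(NamedTuple):
    head: str     # what happens to the head node
    workers: str  # what happens to the worker nodes


def teardown_outcome(cache_stopped_nodes: bool, workers_only: bool) -> TeardownOutcome:
    """Encode the four rows of the table above (illustrative only)."""
    # cache_stopped_nodes picks stop (instance kept) versus terminate (instance deleted).
    action = "stopped" if cache_stopped_nodes else "terminated"
    # workers_only leaves the head node running.
    head = "running" if workers_only else action
    return TeardownOutcome(head=head, workers=action)


for cache in (False, True):
    for only_workers in (False, True):
        print(cache, only_workers, teardown_outcome(cache, only_workers))
```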

Keep hydra.launcher.ray.cluster.min_workers worker nodes and delete the rest of the worker nodes
```
hydra.launcher.teardown_cluster.keep_min_workers=true
```

Additionally, you can configure how to create or update the cluster:

Default config: run setup commands, restart Ray and use the config cache if available

hydra.launcher.create_update_cluster.no_restart=false
hydra.launcher.create_update_cluster.restart_only=false
hydra.launcher.create_update_cluster.no_config_cache=false

Skip restarting Ray services when updating the cluster config
```
hydra.launcher.create_update_cluster.no_restart=true
```
Skip running setup commands and only restart Ray (cannot be used with hydra.launcher.create_update_cluster.no_restart)
```
hydra.launcher.create_update_cluster.restart_only=true
```
Fully resolve all environment settings from the cloud provider again
```
hydra.launcher.create_update_cluster.no_config_cache=true
```
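
Because restart_only and no_restart cannot be combined, it can help to validate the flags before launching. The following is a minimal sketch under that assumption; `create_update_overrides` is a hypothetical helper that formats the three overrides listed above and rejects the conflicting combination.

```python
from typing import List


def create_update_overrides(
    no_restart: bool = False,
    restart_only: bool = False,
    no_config_cache: bool = False,
) -> List[str]:
    """Format hydra.launcher.create_update_cluster overrides.

    Hypothetical helper: rejects the combination the text above rules out.
    """
    if no_restart and restart_only:
        raise ValueError("restart_only cannot be combined with no_restart")
    flags = {
        "no_restart": no_restart,
        "restart_only": restart_only,
        "no_config_cache": no_config_cache,
    }
    prefix = "hydra.launcher.create_update_cluster"
    # Lowercase booleans match the true/false spelling used in the overrides above.
    return [f"{prefix}.{name}={str(value).lower()}" for name, value in flags.items()]


print(create_update_overrides(restart_only=True))
```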

Configure Ray Logging

You can manage Ray-specific logging by configuring the flags provided by the plugin:

Default config: use minimal verbosity and automatically detect whether to use pretty-print and color mode

hydra.launcher.logging.log_style="auto"
hydra.launcher.logging.color_mode="auto"
hydra.launcher.logging.verbosity=0

Disable pretty-print

hydra.launcher.logging.log_style="record"

Disable color mode

hydra.launcher.logging.color_mode="false"

Increase Ray logging verbosity
```
hydra.launcher.logging.verbosity=3
```

`ray` launcher

The ray launcher lets you launch an application on your Ray cluster or local machine. You can configure how your jobs are executed by changing the ray launcher's configuration here: ~/hydra/plugins/hydra_ray_launcher/hydra_plugins/hydra_ray_launcher/conf/hydra/launcher/ray.yaml

The example application starts a new ray cluster.

$ python my_app.py  --multirun hydra/launcher=ray
[HYDRA] Ray Launcher is launching 1 jobs, sweep output dir: multirun/2020-11-10/15-16-28
[HYDRA] Initializing ray with config: {}
INFO services.py:1164 -- View the Ray dashboard at http://127.0.0.1:8266
[HYDRA]        #0 :
(pid=97801) [__main__][INFO] - Executing task 1

You can run the example application on your existing ray cluster as well by overriding hydra.launcher.ray.init.address:

$ python my_app.py  --multirun hydra/launcher=ray hydra.launcher.ray.init.address=localhost:6379
[HYDRA] Ray Launcher is launching 1 jobs, sweep output dir: multirun/2020-11-10/15-13-32
[HYDRA] Initializing ray with config: {'num_cpus': None, 'num_gpus': None, 'address': 'localhost:6379'}
INFO worker.py:633 -- Connecting to existing Ray cluster at address: 10.30.99.17:6379
[HYDRA]        #0 :
(pid=93358) [__main__][INFO] - Executing task 1

Configure `ray.init()` and `ray.remote()`

Ray launcher is built on top of ray.init() and ray.remote(). You can configure ray by overriding hydra.launcher.ray.init and hydra.launcher.ray.remote. Check out an example config.
