Python Multiprocessing with discomp

Discomp (the Dis.co multiprocessing Python package) distributes computing jobs on the Dis.co service through an API similar to that of the multiprocessing Python package. In many cases it can act as a drop-in replacement.

Installing discomp

  1. Install the CLI. For more details, see Using the CLI.

  2. Sign up for an account. See details at Quick Start Guide.

  3. Install the discomp package:

    pip install discomp
    # For Python 3
    pip3 install discomp

Usage

Write your Python script as if you were using the standard multiprocessing Python package, but instead of importing the Process and Pool classes from multiprocessing, import them from discomp.

Set up the environment variables using credentials for your Dis.co account (see the examples below).
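
For example, a script that already uses multiprocessing usually only needs its import line changed and the credentials set. The snippet below is a minimal sketch of that swap; the username and password shown are placeholders for your own account details:

import os

# Credentials for your Dis.co account (placeholders shown here).
os.environ['DISCO_LOGIN_USER'] = 'username@mail.com'
os.environ['DISCO_LOGIN_PASSWORD'] = 'password'

# from multiprocessing import Process, Pool   # before
from discomp import Process, Pool             # after: same API, jobs run on Dis.co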

Classes

discomp.Process

class discomp.Process(name, target, args=())

Instantiating the Process class creates a new job with a single task in a waiting state and submits it to the Dis.co computing service.

Arguments

name

The job's name. The name is a string used for identification purposes only. It does not have to be unique.

target

The callable object to be invoked by the running job.

args

The argument tuple for the target invocation. By default, no arguments are passed to target.

Methods

start()

Starts running the job on one of the machines.

This must be called at most once per process object.

join(timeout=None)

Blocks the calling thread until the job is done.

Upon successful completion of the job, results files are downloaded to a new directory, named <job>, within the working directory.

Currently, timeout must always be None. A process should be joined at most once. A job may already be done by the time join() is called; however, the results are downloaded only upon calling join().

discomp.Pool

class discomp.Pool(processes=None)

Instantiating the Pool class creates an object that is later used to run a job with one or more tasks executed on multiple machines, by invoking its map() method. The Pool class does not use its arguments and has no control over the number of machines used to run the job's tasks; the number of machines is determined separately.

Methods

map(func, iterable, chunksize=None)

Pool.map applies the same function, func, to many sets of arguments.

It creates a job that runs each set of arguments as a separate task on one of the machines in the "pool". It blocks until the result is ready (meaning, until all the tasks in the job are done). The results are returned in the original order (corresponding to the order of the arguments).

Job-related files (in addition to the script, input, and config files that were used to run the task) are downloaded automatically when the job is done, into a directory named <func> within the working directory.

The function's arguments should be provided as an iterable.

starmap(func, iterable, chunksize=None)

Pool.starmap is similar to Pool.map, but it applies the same function to many sets of multiple arguments.

The function's arguments should be provided as an iterable. Elements of the iterable are themselves expected to be iterables, which are unpacked as the arguments.

Hence an iterable of [(1, 2), (3, 4)] results in [func(1, 2), func(3, 4)].
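
As a minimal sketch (following the same pattern as the Pool example below, with placeholder credentials and a placeholder add function):

import os
from discomp import Pool

os.environ['DISCO_LOGIN_USER'] = 'username@mail.com'
os.environ['DISCO_LOGIN_PASSWORD'] = 'password'

def add(x, y):
    return x + y

p = Pool()
# Each tuple is unpacked into add's arguments, so this computes
# [add(1, 2), add(3, 4)] as separate tasks in the job.
results = p.starmap(add, [(1, 2), (3, 4)])
print(results)  # expected: [3, 7]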

Process Example

A basic example using the Process class:

import os
from discomp import Process

os.environ['DISCO_LOGIN_USER'] = 'username@mail.com'
os.environ['DISCO_LOGIN_PASSWORD'] = 'password'

def func(name):
    print('Hello', name)

p = Process(
    name='MyFirstJobExample',
    target=func,
    args=('Bob',))

p.start()
p.join()

Output:

Pool Example

A basic example using the Pool class:

import os
from discomp import Pool

os.environ['DISCO_LOGIN_USER'] = 'username@mail.com'
os.environ['DISCO_LOGIN_PASSWORD'] = 'password'

def pow3(x):
    print(x**3)
    return x**3

p = Pool()
results = p.map(pow3, range(10))
print(results)

Output: