1.12. Dislib tutorial

This tutorial will show the basics of using dislib.

Setup

First, we need to start an interactive PyCOMPSs session:

[1]:
import pycompss.interactive as ipycompss
import os
if 'BINDER_SERVICE_HOST' in os.environ:
    ipycompss.start(graph=True,
                    project_xml='../xml/project.xml',
                    resources_xml='../xml/resources.xml')
else:
    ipycompss.start(graph=True, monitor=1000)
******************************************************
*************** PyCOMPSs Interactive *****************
******************************************************
*          .-~~-.--.           _____       ________  *
*         :         )         |____ \     |____   /  *
*   .~ ~ -.\       /.- ~~ .     ___) |        /  /   *
*   >       `.   .'       <    / ___/        /  /    *
*  (         .- -.         )  | |___   _    /  /     *
*   `- -.-~  `- -'  ~-.- -'   |_____| |_|  /__/      *
*     (        :        )           _ _ .-:          *
*      ~--.    :    .--~        .-~  .-~  }          *
*          ~-.-^-.-~ \_      .~  .-~   .~            *
*                   \ \ '     \ '_ _ -~              *
*                    \`.\`.    //                    *
*           . - ~ ~-.__\`.\`-.//                     *
*       .-~   . - ~  }~ ~ ~-.~-.                     *
*     .' .-~      .-~       :/~-.~-./:               *
*    /_~_ _ . - ~                 ~-.~-._            *
*                                     ~-.<           *
******************************************************
* - Starting COMPSs runtime...                       *
* - Log path : /home/user/.COMPSs/Interactive_01/
* - PyCOMPSs Runtime started... Have fun!            *
******************************************************

Next, we import dislib and we are all set to start working!

[2]:
import dislib as ds

Distributed arrays

The main data structure in dislib is the distributed array (or ds-array). These arrays are a distributed representation of a 2-dimensional array that can be operated on like a regular Python object. Usually, rows in the array represent samples, while columns represent features.

To create a random array we can run the following NumPy-like command:

[3]:
x = ds.random_array(shape=(500, 500), block_size=(100, 100))
print(x.shape)
x
(500, 500)
[3]:
ds-array(blocks=(...), top_left_shape=(100, 100), reg_shape=(100, 100), shape=(500, 500), sparse=False)

Now x is a 500x500 ds-array of random numbers stored in blocks of 100x100 elements. Note that x is not stored in local memory. Instead, random_array generates the contents of the array in tasks that are usually executed remotely. This makes it possible to create very large arrays.

The content of x is a list of Futures that represent the actual data (wherever it is stored).

To see this, we can access the _blocks field of x:

[4]:
x._blocks[0][0]
[4]:
<pycompss.runtime.binding.Future at 0x7fb45006c4a8>

block_size is useful for controlling the granularity of dislib algorithms.
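
For example, smaller blocks produce more (and smaller) tasks. A minimal sketch of how block_size determines the block grid (not part of the original session; it uses the internal _blocks field shown above):

# a 500x500 array in 100x100 blocks is stored as a 5x5 grid of blocks
fine = ds.random_array(shape=(500, 500), block_size=(100, 100))
print(len(fine._blocks), len(fine._blocks[0]))      # 5 5

# the same array in 250x250 blocks is stored as a 2x2 grid
coarse = ds.random_array(shape=(500, 500), block_size=(250, 250))
print(len(coarse._blocks), len(coarse._blocks[0]))  # 2 2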

To retrieve the actual contents of x, we use collect, which synchronizes the data and returns the equivalent NumPy array:

[5]:
x.collect()
[5]:
array([[0.64524889, 0.69250058, 0.97374899, ..., 0.36928149, 0.75461806,
        0.98784805],
       [0.61753454, 0.092557  , 0.4967433 , ..., 0.87482708, 0.44454572,
        0.0022951 ],
       [0.78473344, 0.15288068, 0.70022708, ..., 0.82488172, 0.16980005,
        0.30608108],
       ...,
       [0.16257112, 0.94326181, 0.26206143, ..., 0.49725598, 0.80564738,
        0.69616444],
       [0.25089352, 0.10652958, 0.79657793, ..., 0.86936011, 0.67382938,
        0.78140887],
       [0.17716041, 0.24354163, 0.52866266, ..., 0.12053584, 0.9071268 ,
        0.55058659]])

Another way of creating ds-arrays is from array-like structures, such as NumPy arrays or lists:

[6]:
x1 = ds.array([[1, 2, 3], [4, 5, 6]], block_size=(1, 3))
x1
[6]:
ds-array(blocks=(...), top_left_shape=(1, 3), reg_shape=(1, 3), shape=(2, 3), sparse=False)
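
Collecting x1 gives back the original 2x3 array; a quick check (not part of the original session):

x1.collect()
# array([[1, 2, 3],
#        [4, 5, 6]])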

Distributed arrays can also store sparse data in CSR format:

[7]:
from scipy.sparse import csr_matrix

sp = csr_matrix([[0, 0, 1], [1, 0, 1]])
x_sp = ds.array(sp, block_size=(1, 3))
x_sp
[7]:
ds-array(blocks=(...), top_left_shape=(1, 3), reg_shape=(1, 3), shape=(2, 3), sparse=True)

In this case, collect returns a CSR matrix as well:

[8]:
x_sp.collect()
[8]:
<2x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
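
If a dense view is needed, the collected CSR matrix can be converted with SciPy's toarray(); a minimal sketch (not part of the original session):

x_sp.collect().toarray()
# array([[0, 0, 1],
#        [1, 0, 1]])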

Loading data

A typical way of creating ds-arrays is to load data from disk. Dislib currently supports reading data in CSV and SVMLight formats like this:

[9]:
x, y = ds.load_svmlight_file("./files/libsvm/1", block_size=(20, 100), n_features=780, store_sparse=True)

print(x)

csv = ds.load_txt_file("./files/csv/1", block_size=(500, 122))

print(csv)
ds-array(blocks=(...), top_left_shape=(20, 100), reg_shape=(20, 100), shape=(61, 780), sparse=True)
ds-array(blocks=(...), top_left_shape=(500, 122), reg_shape=(500, 122), shape=(4235, 122), sparse=False)
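
The labels returned by load_svmlight_file are also a ds-array; a quick check (not part of the original session):

print(y)  # expected: a one-column ds-array with one label row per sample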

Slicing

Similar to NumPy, ds-arrays support the following types of slicing:

(Note that slicing a ds-array creates a new ds-array)

[10]:
x = ds.random_array((50, 50), (10, 10))

Get a single row:

[11]:
x[4]
[11]:
ds-array(blocks=(...), top_left_shape=(10, 10), reg_shape=(10, 10), shape=(1, 50), sparse=False)

Get a single element:

[12]:
x[2, 3]
[12]:
ds-array(blocks=(...), top_left_shape=(1, 1), reg_shape=(1, 1), shape=(1, 1), sparse=False)

Get a set of rows or a set of columns:

[13]:
# Consecutive rows
print(x[10:20])

# Consecutive columns
print(x[:, 10:20])

# Non-consecutive rows
print(x[[3, 7, 22]])

# Non-consecutive columns
print(x[:, [5, 9, 48]])
ds-array(blocks=(...), top_left_shape=(10, 10), reg_shape=(10, 10), shape=(10, 50), sparse=False)
ds-array(blocks=(...), top_left_shape=(10, 10), reg_shape=(10, 10), shape=(50, 10), sparse=False)
ds-array(blocks=(...), top_left_shape=(10, 10), reg_shape=(10, 10), shape=(3, 50), sparse=False)
ds-array(blocks=(...), top_left_shape=(10, 10), reg_shape=(10, 10), shape=(50, 3), sparse=False)

Get any set of elements:

[14]:
x[0:5, 40:45]
[14]:
ds-array(blocks=(...), top_left_shape=(10, 10), reg_shape=(10, 10), shape=(5, 5), sparse=False)
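
Since slices are regular ds-arrays, their contents can be retrieved with collect like any other ds-array (a quick sketch, not part of the original session):

x[0:5, 40:45].collect()  # returns the corresponding 5x5 NumPy array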

Other functions

Apart from slicing, ds-arrays also provide other useful operations, such as transpose and mean:

[15]:
x.mean(axis=0).collect()
[15]:
array([0.551327  , 0.49044018, 0.47707326, 0.505077  , 0.48609203,
       0.5046837 , 0.48576743, 0.50085461, 0.59721606, 0.52625186,
       0.54670041, 0.51213254, 0.4790549 , 0.56172745, 0.54560812,
       0.52904012, 0.54524971, 0.52370167, 0.44268389, 0.58325103,
       0.4754059 , 0.4990062 , 0.55135695, 0.53576298, 0.44728801,
       0.57918722, 0.46289081, 0.48786927, 0.46635653, 0.53229369,
       0.50648353, 0.53486796, 0.5159688 , 0.50703502, 0.46818427,
       0.59886016, 0.48181209, 0.51001161, 0.46914881, 0.57437929,
       0.52491673, 0.49711231, 0.50817903, 0.50102322, 0.42250693,
       0.51456321, 0.53393824, 0.47924624, 0.49860827, 0.49424366])
[16]:
x.transpose().collect()
[16]:
array([[0.75899854, 0.83108977, 0.28083382, ..., 0.65348721, 0.10747938,
        0.13453309],
       [0.70515755, 0.22656129, 0.60863163, ..., 0.10640133, 0.3311688 ,
        0.50884584],
       [0.71224037, 0.95907871, 0.6010006 , ..., 0.41099068, 0.5671029 ,
        0.06170055],
       ...,
       [0.39789773, 0.69988175, 0.93784369, ..., 0.24439267, 0.45685381,
        0.93017544],
       [0.22410234, 0.13491992, 0.75906239, ..., 0.96917569, 0.96204333,
        0.14629864],
       [0.81496796, 0.96925576, 0.58510411, ..., 0.65520011, 0.05744591,
        0.78974985]])

Machine learning with dislib

Dislib provides an estimator-based API very similar to scikit-learn's. To run an algorithm, we first create an estimator. For example, a K-means estimator:

[17]:
from dislib.cluster import KMeans

km = KMeans(n_clusters=3)

Now, we create a ds-array with some blob data, and fit the estimator:

[18]:
from sklearn.datasets import make_blobs

# create ds-array
x, y = make_blobs(n_samples=1500)
x_ds = ds.array(x, block_size=(500, 2))

km.fit(x_ds)
[18]:
KMeans(arity=50, init='random', max_iter=10, n_clusters=3,
       random_state=<mtrand.RandomState object at 0x7fb51401f9d8>, tol=0.0001,
       verbose=False)

Finally, we can make predictions on new (or the same) data:

[19]:
y_pred = km.predict(x_ds)
y_pred
[19]:
ds-array(blocks=(...), top_left_shape=(500, 1), reg_shape=(500, 1), shape=(1500, 1), sparse=False)

y_pred is a ds-array of predicted labels for x_ds.
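
Since the labels live in a ds-array, a quick way to inspect the cluster sizes is to collect them and use NumPy (a sketch, not part of the original session):

import numpy as np

labels = y_pred.collect().flatten()           # synchronize and flatten to 1D
print(np.unique(labels, return_counts=True))  # cluster ids and their sizes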

Let’s plot the results:

[20]:
%matplotlib inline
import matplotlib.pyplot as plt


centers = km.centers

# set the color of each sample to the predicted label
plt.scatter(x[:, 0], x[:, 1], c=y_pred.collect())

# plot the computed centers in red
plt.scatter(centers[:, 0], centers[:, 1], c='red')
[20]:
<matplotlib.collections.PathCollection at 0x7fb42b4fc518>
[Figure: scatter plot of the samples colored by predicted cluster label, with the computed centers in red]

Note that we need to call y_pred.collect() to retrieve the actual labels and plot them. The rest is the same as if we were using scikit-learn.

Now let’s try a more complex example that uses some preprocessing tools.

First, we load a classification data set from scikit-learn into ds-arrays.

Note that this step is only necessary for demonstration purposes. Ideally, your data should already be loaded in ds-arrays.

[21]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(x, y)

x_train = ds.array(x_train, block_size=(100, 10))
y_train = ds.array(y_train.reshape(-1, 1), block_size=(100, 1))

x_test = ds.array(x_test, block_size=(100, 10))
y_test = ds.array(y_test.reshape(-1, 1), block_size=(100, 1))

Next, we can see how support vector machines perform in classifying the data. We first fit the model (ignore any warnings in this step):

[22]:
from dislib.classification import CascadeSVM

csvm = CascadeSVM()

csvm.fit(x_train, y_train)
/usr/lib/python3.6/site-packages/dislib-0.4.0-py3.6.egg/dislib/classification/csvm/base.py:374: RuntimeWarning: overflow encountered in exp
  k = np.exp(k)
/usr/lib/python3.6/site-packages/dislib-0.4.0-py3.6.egg/dislib/classification/csvm/base.py:342: RuntimeWarning: invalid value encountered in double_scalars
  delta = np.abs((w - self._last_w) / self._last_w)
[22]:
CascadeSVM(c=1, cascade_arity=2, check_convergence=True, gamma='auto',
           kernel='rbf', max_iter=5, random_state=None, tol=0.001,
           verbose=False)

Now we can make predictions on new data using csvm.predict(), or compute the model's accuracy on the test set with:

[23]:
score = csvm.score(x_test, y_test)

score represents the classifier's accuracy. However, it is returned as a Future, so we need to synchronize to get the actual value:

[24]:
from pycompss.api.api import compss_wait_on

print(compss_wait_on(score))
0.5944055944055944

The accuracy should be around 0.6, which is not very good. We can scale the data before classification to improve accuracy. This can be achieved using dislib’s StandardScaler.

The StandardScaler provides the same API as other estimators. In this case, however, instead of making predictions on new data, we transform it:

[25]:
from dislib.preprocessing import StandardScaler

sc = StandardScaler()

# fit the scaler with train data and transform it
scaled_train = sc.fit_transform(x_train)

# transform test data
scaled_test = sc.transform(x_test)

scaled_train and scaled_test now contain the scaled samples. Let’s see how the SVM performs on the scaled data.

[26]:
csvm.fit(scaled_train, y_train)
score = csvm.score(scaled_test, y_test)
print(compss_wait_on(score))
0.972027972027972

The new accuracy should be around 0.9, which is a great improvement!

Close the session

To finish the session, we need to stop PyCOMPSs:

[27]:
ipycompss.stop()
****************************************************
*************** STOPPING PyCOMPSs ******************
****************************************************
Warning: some of the variables used with PyCOMPSs may
         have not been brought to the master.
****************************************************