
GridSearchCV implementation #214

Merged
merged 17 commits into from
Nov 4, 2019
Conversation

salvisolamartinell
Collaborator

@salvisolamartinell salvisolamartinell commented Aug 12, 2019

Description

New module dislib.model_selection with classes GridSearchCV and KFold.

GridSearchCV is a grid search implementation with cross-validation included. The functionality is similar to sklearn.model_selection.GridSearchCV, but without predefined scorers and only 1 predefined splitter (KFold) available. It extends BaseSearchCV, which is an abstract class that could be extended in the future by BinarySearchCV or RandomSearchCV.
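Grid search enumerates every combination of the supplied parameter values. A rough sketch of that expansion step (not dislib's actual implementation; `expand_param_grid` is a hypothetical helper):

```python
from itertools import product

def expand_param_grid(param_grid):
    """Yield one parameter dict per combination of the grid values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))

# The same kind of grid passed to GridSearchCV in the example below.
grid = {'c': (10000, 5000), 'gamma': (0.01, 0.02)}
candidates = list(expand_param_grid(grid))
# 2 values of c x 2 values of gamma = 4 candidate settings
```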

KFold(k) is a splitter class with a split() method that takes a dataset and returns k partitions of it into a train dataset and a validation dataset. In each partition, the validation dataset is one of the k folds of the original dataset. Splitters used by the GridSearchCV are not equivalent to sklearn splitters: they must return partitions of the dataset instead of partitions of the indices.
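To illustrate the splitter semantics described above (partitions of the data itself, not of indices), here is a minimal sketch using plain Python lists as a stand-in for dislib datasets; `kfold_split` is a hypothetical function, not dislib's KFold:

```python
def kfold_split(samples, k=5):
    """Yield k (train, validation) pairs; validation is one fold each time."""
    fold_size, remainder = divmod(len(samples), k)
    start = 0
    for i in range(k):
        # Spread any remainder over the first folds.
        stop = start + fold_size + (1 if i < remainder else 0)
        validation = samples[start:stop]
        train = samples[:start] + samples[stop:]
        yield train, validation
        start = stop

data = list(range(10))
splits = list(kfold_split(data, k=3))
# 3 partitions; in each one, train + validation covers the whole dataset
```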

There are no predefined scorers in dislib. GridSearchCV uses the estimator's score() function by default if it exists; otherwise, the user must provide a scorer. Currently, only CascadeSVM and RandomForestClassifier have a score() function, which now returns an unsynchronized future object.
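A sketch of the scorer contract this implies, under the assumption that a scorer is a callable taking the fitted estimator and the validation data and returning a number (similar to sklearn's scorer signature). `DummyEstimator` and `zero_fraction_scorer` are hypothetical names for illustration only:

```python
class DummyEstimator:
    """Toy estimator that predicts class 0 for every sample."""
    def fit(self, data):
        return self

    def predict(self, data):
        return [0 for _ in data]

def zero_fraction_scorer(estimator, data):
    """Example user-provided scorer: fraction of samples predicted as 0."""
    predictions = estimator.predict(data)
    return sum(1 for p in predictions if p == 0) / len(predictions)

est = DummyEstimator().fit([1, 2, 3])
score = zero_fraction_scorer(est, [1, 2, 3, 4])
# score == 1.0: every sample is predicted as class 0
```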

To use dislib estimators with GridSearchCV, they have to extend BaseEstimator (or provide get_params() and set_params() methods). This is tested. I've also modified all the estimators so that the init arguments become publicly available as parameters with the same names, and moved any logic from the init method to the fit() method. This part is not tested: I tried, but failed because of an sklearn dependency that requires nose for these validations. Testing it ourselves would mean duplicating a lot of code from sklearn, with the maintenance cost that implies, so maybe we should just document guidelines for estimators.
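A minimal sketch of the estimator contract described above: init arguments stored verbatim as public attributes with the same names, no logic in __init__, and all work deferred to fit(). `TinyEstimator` is an illustrative name, not a dislib class:

```python
class TinyEstimator:
    def __init__(self, c=1.0, gamma=0.1):
        # Store parameters as-is so get_params()/set_params() round-trip.
        self.c = c
        self.gamma = gamma

    def get_params(self, deep=True):
        return {'c': self.c, 'gamma': self.gamma}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, data):
        # Validation and derived state belong here, not in __init__.
        self._model = {'c': self.c, 'gamma': self.gamma}
        return self

est = TinyEstimator().set_params(c=100)
# est.get_params() -> {'c': 100, 'gamma': 0.1}
```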

This GridSearchCV implementation doesn't use nesting, so execution will only be parallel for estimators without synchronizations in the fit() and score()/scorer methods. For some algorithms like CascadeSVM, the user should set check_convergence=False to get parallel execution.

Fixes #176

Type of change

  • New algorithm or support class.
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • I have added new test files: test_kfold.py, test_estimator_interface.py, test_gridsearch.py
  • I have modified the existing estimator tests (no single new test case to highlight).
  • I have tested it manually in a local environment. See examples/gridsearchcv.py.
  • I have tested it manually in a supercomputer.

Reproduce instructions:
Example in MN4: enqueue_compss -t --qos=debug --exec_time=120 --num_nodes=9 --worker_in_master_cpus=0 example.py

example.py:

from sklearn import datasets
import pandas as pd

from dislib.classification import CascadeSVM
from dislib.data import load_data, load_txt_files
from dislib.model_selection import GridSearchCV


def main():
    ds = load_txt_files(
        "/gpfs/projects/bsc19/COMPSs_DATASETS/dislib/kdd99/partitions/384",
        n_features=121, label_col="last")
    # x, y = datasets.load_iris(return_X_y=True)
    # ds = load_data(x, 30, y)
    parameters = {'c': (10000, 5000),
                  'gamma': (0.01, 0.02)}
    csvm = CascadeSVM(check_convergence=False)
    searcher = GridSearchCV(csvm, parameters, cv=3)
    searcher.fit(ds)
    print(searcher.cv_results_['params'])
    print(searcher.cv_results_['mean_test_score'])
    pd_df = pd.DataFrame.from_dict(searcher.cv_results_)
    print(pd_df[['params', 'mean_test_score']])
    with pd.option_context('display.max_rows', None,
                           'display.max_columns', None):
        print(pd_df)
    print(searcher.best_estimator_)
    print(searcher.best_score_)
    print(searcher.best_params_)
    print(searcher.best_index_)
    print(searcher.scorer_)
    print(searcher.n_splits_)


if __name__ == "__main__":
    main()

[Screenshot of the example's output]

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas.
  • I have documented the public methods of any public class according to the guide styles.
  • I have made corresponding changes to the documentation
  • New and existing unit tests pass locally with my changes
  • I have rebased my branch before trying to merge.

@codecov

codecov bot commented Aug 12, 2019

Codecov Report

Merging #214 into master will increase coverage by 0.16%.
The diff coverage is 97.32%.


@@            Coverage Diff             @@
##           master     #214      +/-   ##
==========================================
+ Coverage   95.05%   95.21%   +0.16%     
==========================================
  Files          26       30       +4     
  Lines        2546     2861     +315     
==========================================
+ Hits         2420     2724     +304     
- Misses        126      137      +11
Impacted Files Coverage Δ
dislib/classification/rf/_data.py 61.36% <ø> (ø) ⬆️
dislib/model_selection/__init__.py 100% <100%> (ø)
dislib/cluster/dbscan/base.py 100% <100%> (ø) ⬆️
dislib/classification/rf/forest.py 96.94% <100%> (+0.04%) ⬆️
dislib/cluster/gm/base.py 98.67% <100%> (ø) ⬆️
dislib/recommendation/als/base.py 94.91% <100%> (+0.04%) ⬆️
dislib/neighbors/base.py 98.24% <100%> (+0.06%) ⬆️
dislib/cluster/kmeans/base.py 97.7% <100%> (+0.02%) ⬆️
dislib/decomposition/pca/base.py 100% <100%> (ø) ⬆️
dislib/regression/linear/base.py 100% <100%> (ø) ⬆️
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fd76d17...cc9f0f9.

@javicid javicid changed the title GridSearchCV implementation [WIP] GridSearchCV implementation Aug 13, 2019
Collaborator

@javicid javicid left a comment


Great job overall! I have just some minor comments.

Comment on lines 328 to 338
if r_start >= r_stop or c_start >= c_stop:
    shape = [0, 0]
    if r_start < r_stop:
        shape[0] = r_stop - r_start
    if c_start < c_stop:
        shape[1] = c_stop - c_start
    res = Array(blocks=[[np.empty(shape)]], top_left_shape=shape,
                reg_shape=self._reg_shape, shape=shape,
                sparse=self._sparse)
    return res

Collaborator


Why is this block necessary? Can you add a comment?

Collaborator Author


The previous implementation returned arrays with empty blocks at the end. I corrected that because it was unnecessary and caused difficulties when working with array blocks. As a result of the correction, empty slices no longer returned empty blocks, and thus failed Array._validate_blocks() (raise AttributeError('Blocks must a list of lists, with at least an empty numpy/scipy matrix.')). I don't agree with this validation, but that's a matter for discussion, so I added this code block to handle this special case in the most natural way possible (having the empty numpy array keep the number of rows/cols when possible). However, I didn't consider keeping the sparsity of the array, and this is getting too cumbersome.

Anyway, I'm going to add a short comment to explain what this block does.

Collaborator


Ok, I understand. Just a couple of comments:

  1. Not sure if it matters, but shape should probably be a tuple instead of a list
  2. If the array is empty maybe it is better to set sparse to False in all cases
  3. Are top_left_shape and reg_shape defined in consistency with other slicing cases?

Collaborator Author


I have modified this block to simplify it. Now the empty block will always have a shape of (0, 0) (if this causes any problem, we probably have to be more explicit when defining what a valid Array is).

  1. Solved.
  2. I think it's better to keep it sparse if the original array is sparse, for consistency with other slices.
  3. reg_shape is the same as in the original array, and it is consistent with other slices. top_left_shape is not consistent with other slices, but I think it doesn't need to be, we can reset it to top_left_shape = reg_shape, which is consistent with the default behavior of the array constructor when creating a new Array.

Collaborator


I don't know what happens in other cases, but if the content of the array is a numpy array, sparse should be False for consistency. An alternative solution is to create the array with an empty CSR matrix if sparse is True.

Also, now you can do the following to avoid the if statements:

nrows = max(0, nrows)
ncols = max(0, ncols)
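The suggestion clamps negative sizes to zero with max() instead of conditional assignments. A small sketch of the equivalence, using hypothetical slice bounds:

```python
# Hypothetical slice bounds: an empty row range, a non-empty column range.
r_start, r_stop = 5, 3
c_start, c_stop = 2, 7

# Original style with if statements:
shape = [0, 0]
if r_start < r_stop:
    shape[0] = r_stop - r_start
if c_start < c_stop:
    shape[1] = c_stop - c_start

# Suggested style: clamp each dimension to zero.
shape2 = [max(0, r_stop - r_start), max(0, c_stop - c_start)]

assert shape == shape2  # both give [0, 5]
```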

Collaborator Author


I added the creation of an empty CSR matrix if sparse is True, but I hope in the future we remove the empty_block requirement and we don't need to do this.

I also applied your proposed change for n_rows and n_cols.

@javicid javicid changed the title [WIP] GridSearchCV implementation GridSearchCV implementation Oct 31, 2019
@javicid javicid merged commit bf37dc6 into master Nov 4, 2019
@javicid javicid deleted the grid_search branch November 4, 2019 13:17