
GridSearchCV implementation #214

Merged
merged 17 commits into from
Nov 4, 2019
Conversation

salvisolamartinell
Collaborator

@salvisolamartinell salvisolamartinell commented Aug 12, 2019

Description

New module dislib.model_selection with classes GridSearchCV and KFold.

GridSearchCV is a grid search implementation with cross-validation included. The functionality is similar to sklearn.model_selection.GridSearchCV, but without predefined scorers and only 1 predefined splitter (KFold) available. It extends BaseSearchCV, which is an abstract class that could be extended in the future by BinarySearchCV or RandomSearchCV.
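Grid search enumerates every combination of the supplied parameter values. A rough sketch of that expansion step (not dislib's actual implementation; `expand_param_grid` is a hypothetical helper):

```python
from itertools import product

def expand_param_grid(param_grid):
    """Yield one parameter dict per combination of the grid values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))

# The same kind of grid passed to GridSearchCV in the example below.
grid = {'c': (10000, 5000), 'gamma': (0.01, 0.02)}
candidates = list(expand_param_grid(grid))
# 2 values of c x 2 values of gamma = 4 candidate settings
```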

KFold(k) is a splitter class with a split() method that takes a dataset and returns k partitions of it into a train dataset and a validation dataset. In each partition, the validation dataset is one of the k folds of the original dataset. Splitters used by the GridSearchCV are not equivalent to sklearn splitters: they must return partitions of the dataset instead of partitions of the indices.
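To illustrate the splitter semantics described above (partitions of the data itself, not of indices), here is a minimal sketch using plain Python lists as a stand-in for dislib datasets; `kfold_split` is a hypothetical function, not dislib's KFold:

```python
def kfold_split(samples, k=5):
    """Yield k (train, validation) pairs; validation is one fold each time."""
    fold_size, remainder = divmod(len(samples), k)
    start = 0
    for i in range(k):
        # Spread any remainder over the first folds.
        stop = start + fold_size + (1 if i < remainder else 0)
        validation = samples[start:stop]
        train = samples[:start] + samples[stop:]
        yield train, validation
        start = stop

data = list(range(10))
splits = list(kfold_split(data, k=3))
# 3 partitions; in each one, train + validation covers the whole dataset
```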

There are no predefined scorers in dislib. GridSearchCV uses the estimator's score() function by default if it exists; otherwise, the user must provide a scorer. Currently, only CascadeSVM and RandomForestClassifier have a score() function, which now returns an unsynchronized future object.
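A sketch of the scorer contract this implies, under the assumption that a scorer is a callable taking the fitted estimator and the validation data and returning a number (similar to sklearn's scorer signature). `DummyEstimator` and `zero_fraction_scorer` are hypothetical names for illustration only:

```python
class DummyEstimator:
    """Toy estimator that predicts class 0 for every sample."""
    def fit(self, data):
        return self

    def predict(self, data):
        return [0 for _ in data]

def zero_fraction_scorer(estimator, data):
    """Example user-provided scorer: fraction of samples predicted as 0."""
    predictions = estimator.predict(data)
    return sum(1 for p in predictions if p == 0) / len(predictions)

est = DummyEstimator().fit([1, 2, 3])
score = zero_fraction_scorer(est, [1, 2, 3, 4])
# score == 1.0: every sample is predicted as class 0
```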

To use dislib estimators with GridSearchCV, they have to extend BaseEstimator (or provide get_params() and set_params() methods). This is tested. I've also modified all the estimators so that the init arguments become publicly available as parameters with the same names, and moved any logic from the init method to the fit() method. This part is not tested: I tried, but failed because of an sklearn dependency that requires nose for these validations. Testing it ourselves would mean duplicating a lot of code from sklearn, with the maintenance cost that implies, so maybe we should just document guidelines for estimators.
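A minimal sketch of the estimator contract described above: init arguments stored verbatim as public attributes with the same names, no logic in __init__, and all work deferred to fit(). `TinyEstimator` is an illustrative name, not a dislib class:

```python
class TinyEstimator:
    def __init__(self, c=1.0, gamma=0.1):
        # Store parameters as-is so get_params()/set_params() round-trip.
        self.c = c
        self.gamma = gamma

    def get_params(self, deep=True):
        return {'c': self.c, 'gamma': self.gamma}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, data):
        # Validation and derived state belong here, not in __init__.
        self._model = {'c': self.c, 'gamma': self.gamma}
        return self

est = TinyEstimator().set_params(c=100)
# est.get_params() -> {'c': 100, 'gamma': 0.1}
```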

This GridSearchCV implementation doesn't use nesting, so execution will only be parallel for estimators without synchronizations in the fit() and score()/scorer methods. For some algorithms like CascadeSVM, the user should set check_convergence=False to get parallel execution.

Fixes #176

Type of change

  • New algorithm or support class.
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • I have added new test files: test_kfold.py, test_estimator_interface.py, test_gridsearch.py
  • I have modified the existing estimator tests (no single new test case to highlight).
  • I have tested it manually in a local environment. See examples/gridsearchcv.py.
  • I have tested it manually in a supercomputer.

Reproduce instructions:
Example in MN4: enqueue_compss -t --qos=debug --exec_time=120 --num_nodes=9 --worker_in_master_cpus=0 example.py

example.py:

from sklearn import datasets
import pandas as pd

from dislib.classification import CascadeSVM
from dislib.data import load_data, load_txt_files
from dislib.model_selection import GridSearchCV


def main():
    ds = load_txt_files(
        "/gpfs/projects/bsc19/COMPSs_DATASETS/dislib/kdd99/partitions/384",
        n_features=121, label_col="last")
    # x, y = datasets.load_iris(return_X_y=True)
    # ds = load_data(x, 30, y)
    parameters = {'c': (10000, 5000),
                  'gamma': (0.01, 0.02)}
    csvm = CascadeSVM(check_convergence=False)
    searcher = GridSearchCV(csvm, parameters, cv=3)
    searcher.fit(ds)
    print(searcher.cv_results_['params'])
    print(searcher.cv_results_['mean_test_score'])
    pd_df = pd.DataFrame.from_dict(searcher.cv_results_)
    print(pd_df[['params', 'mean_test_score']])
    with pd.option_context('display.max_rows', None,
                           'display.max_columns', None):
        print(pd_df)
    print(searcher.best_estimator_)
    print(searcher.best_score_)
    print(searcher.best_params_)
    print(searcher.best_index_)
    print(searcher.scorer_)
    print(searcher.n_splits_)


if __name__ == "__main__":
    main()

[Screenshot of the example's output]

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas.
  • I have documented the public methods of any public class according to the guide styles.
  • I have made corresponding changes to the documentation
  • New and existing unit tests pass locally with my changes
  • I have rebased my branch before trying to merge.

@codecov

codecov bot commented Aug 12, 2019

Codecov Report

Merging #214 into master will increase coverage by 0.16%.
The diff coverage is 97.32%.


@@            Coverage Diff             @@
##           master     #214      +/-   ##
==========================================
+ Coverage   95.05%   95.21%   +0.16%     
==========================================
  Files          26       30       +4     
  Lines        2546     2861     +315     
==========================================
+ Hits         2420     2724     +304     
- Misses        126      137      +11
Impacted Files Coverage Δ
dislib/classification/rf/_data.py 61.36% <ø> (ø) ⬆️
dislib/model_selection/__init__.py 100% <100%> (ø)
dislib/cluster/dbscan/base.py 100% <100%> (ø) ⬆️
dislib/classification/rf/forest.py 96.94% <100%> (+0.04%) ⬆️
dislib/cluster/gm/base.py 98.67% <100%> (ø) ⬆️
dislib/recommendation/als/base.py 94.91% <100%> (+0.04%) ⬆️
dislib/neighbors/base.py 98.24% <100%> (+0.06%) ⬆️
dislib/cluster/kmeans/base.py 97.7% <100%> (+0.02%) ⬆️
dislib/decomposition/pca/base.py 100% <100%> (ø) ⬆️
dislib/regression/linear/base.py 100% <100%> (ø) ⬆️
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fd76d17...cc9f0f9.

@javicid javicid changed the title GridSearchCV implementation [WIP] GridSearchCV implementation Aug 13, 2019
Collaborator

@javicid javicid left a comment


Great job overall! I have just some minor comments.

Comment on lines 328 to 338
if r_start >= r_stop or c_start >= c_stop:
    shape = [0, 0]
    if r_start < r_stop:
        shape[0] = r_stop - r_start
    if c_start < c_stop:
        shape[1] = c_stop - c_start
    res = Array(blocks=[[np.empty(shape)]], top_left_shape=shape,
                reg_shape=self._reg_shape, shape=shape,
                sparse=self._sparse)
    return res

Collaborator


Why is this block necessary? Can you add a comment?

Collaborator Author


The previous implementation returned arrays with empty blocks at the end. I corrected that because it was unnecessary and caused difficulties when working with array blocks. As a result of the correction, empty slices no longer returned empty blocks, and thus failed Array._validate_blocks() (raise AttributeError('Blocks must a list of lists, with at least an empty numpy/scipy matrix.')). I don't agree with this validation, but that's a matter for discussion, so I added this code block to handle this special case in the most natural way possible (having the empty numpy array keep the number of rows/cols when possible). However, I didn't consider keeping the sparsity of the array, and this is getting too cumbersome.

Anyway, I'm going to add a short comment to explain what this block does.

Collaborator


Ok, I understand. Just a couple of comments:

  1. Not sure if it matters, but shape should probably be a tuple instead of a list
  2. If the array is empty maybe it is better to set sparse to False in all cases
  3. Are top_left_shape and reg_shape defined in consistency with other slicing cases?

Collaborator Author


I have modified this block to simplify it. Now the empty block will always have a shape of (0, 0) (if this causes any problem, we probably have to be more explicit when defining what a valid Array is).

  1. Solved.
  2. I think it's better to keep it sparse if the original array is sparse, for consistency with other slices.
  3. reg_shape is the same as in the original array, and it is consistent with other slices. top_left_shape is not consistent with other slices, but I think it doesn't need to be, we can reset it to top_left_shape = reg_shape, which is consistent with the default behavior of the array constructor when creating a new Array.

Collaborator


I don't know what happens in other cases, but if the content of the array is a numpy array, sparse should be False for consistency. An alternative solution is to create the array with an empty CSR matrix if sparse is True.

Also, now you can do the following to avoid the if statements:

nrows = max(0, nrows)
ncols = max(0, ncols)
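The suggestion clamps negative sizes to zero with max() instead of conditional assignments. A small sketch of the equivalence, using hypothetical slice bounds:

```python
# Hypothetical slice bounds: an empty row range, a non-empty column range.
r_start, r_stop = 5, 3
c_start, c_stop = 2, 7

# Original style with if statements:
shape = [0, 0]
if r_start < r_stop:
    shape[0] = r_stop - r_start
if c_start < c_stop:
    shape[1] = c_stop - c_start

# Suggested style: clamp each dimension to zero.
shape2 = [max(0, r_stop - r_start), max(0, c_stop - c_start)]

assert shape == shape2  # both give [0, 5]
```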

Collaborator Author


I added the creation of an empty CSR matrix if sparse is True, but I hope in the future we remove the empty_block requirement and we don't need to do this.

I also applied your proposed change for n_rows and n_cols.

@javicid javicid changed the title [WIP] GridSearchCV implementation GridSearchCV implementation Oct 31, 2019
@javicid javicid merged commit bf37dc6 into master Nov 4, 2019
@javicid javicid deleted the grid_search branch November 4, 2019 13:17