GridSearchCV implementation #214
Conversation
Codecov Report
@@ Coverage Diff @@
## master #214 +/- ##
==========================================
+ Coverage 95.05% 95.21% +0.16%
==========================================
Files 26 30 +4
Lines 2546 2861 +315
==========================================
+ Hits 2420 2724 +304
- Misses 126 137 +11
Force-pushed from ffbb717 to f3e6efb
Force-pushed from f3e6efb to 0bed11b
Force-pushed from ef5fcac to f050e59
Force-pushed from be7d203 to 362f9b4
Great job overall! I have just some minor comments.
dislib/data/array.py
Outdated
if r_start >= r_stop or c_start >= c_stop:
    shape = [0, 0]
    if r_start < r_stop:
        shape[0] = r_stop - r_start
    if c_start < c_stop:
        shape[1] = c_stop - c_start
    res = Array(blocks=[[np.empty(shape)]], top_left_shape=shape,
                reg_shape=self._reg_shape, shape=shape,
                sparse=self._sparse)
    return res
Why is this block necessary? Can you add a comment?
The previous implementation returned arrays with empty blocks at the end. I corrected that because it was unnecessary and caused difficulties when working with array blocks. As a result of the correction, empty slices no longer returned empty blocks, and thus failed Array._validate_blocks() (raise AttributeError('Blocks must a list of lists, with at least an empty numpy/scipy matrix.')). Although I don't agree with this validation, that is a matter for discussion, so I added this code block to handle this special case in the most natural way possible (having the empty numpy array keep the number of rows/cols when possible). However, I didn't consider keeping the sparsity of the array, and this is starting to get too annoying.
Anyway, I'm going to add a short comment to explain what this block does.
Ok, I understand. Just a couple of comments:
- Not sure if it matters, but shape should probably be a tuple instead of a list
- If the array is empty, maybe it is better to set sparse to False in all cases
- Are top_left_shape and reg_shape defined consistently with other slicing cases?
I have modified this block to simplify it. Now the empty block will always have a shape of (0, 0) (if this causes any problem, we probably have to be more explicit when defining what a valid Array is).
- Solved.
- I think it's better to keep it sparse if the original array is sparse, for consistency with other slices.
- reg_shape is the same as in the original array, and it is consistent with other slices. top_left_shape is not consistent with other slices, but I think it doesn't need to be; we can reset it to top_left_shape = reg_shape, which is consistent with the default behavior of the array constructor when creating a new Array.
I don't know what happens in other cases, but if the content of the array is a numpy array, sparse should be False for consistency. An alternative solution is to create the array with an empty CSR matrix if sparse is True.
Also, you can now do the following to avoid the if statements:
nrows = max(0, nrows)
ncols = max(0, ncols)
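For illustration, a minimal sketch of this clamping suggestion (variable names taken from the diff above; the concrete values are made up):

```python
# Sketch of the suggested clamping: an inverted range yields extent 0,
# so no if statements are needed to build the empty-slice shape.
r_start, r_stop = 5, 3   # inverted row range -> empty slice
c_start, c_stop = 0, 4

nrows = max(0, r_stop - r_start)  # clamps -2 to 0
ncols = max(0, c_stop - c_start)  # 4 stays 4
shape = (nrows, ncols)
```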
I added the creation of an empty CSR matrix if sparse is True, but I hope that in the future we remove the empty_block requirement and won't need to do this.
I also applied your proposed change for n_rows and n_cols.
Description
New module dislib.model_selection with classes GridSearchCV and KFold.
GridSearchCV is a grid search implementation with cross-validation included. The functionality is similar to sklearn.model_selection.GridSearchCV, but without predefined scorers and with only one predefined splitter (KFold) available. It extends BaseSearchCV, an abstract class that could be extended in the future by BinarySearchCV or RandomSearchCV.
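To make the search concrete, here is a hedged sketch (not the actual dislib code; the function name is illustrative) of how a parameter grid expands into the candidate settings a grid search runs cross-validation over:

```python
from itertools import product

def expand_param_grid(param_grid):
    """Yield one dict per combination of parameter values
    (cartesian product over the grid)."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))

# 2 values of 'c' x 2 values of 'gamma' -> 4 candidate settings,
# each of which would be fitted and scored on every fold.
candidates = list(expand_param_grid({"c": [1, 10], "gamma": [0.1, 0.2]}))
```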
KFold(k) is a splitter class with a split() method that takes a dataset and returns k partitions of it into a train dataset and a validation dataset. In each partition, the validation dataset is one of the k folds of the original dataset. Splitters used by the GridSearchCV are not equivalent to sklearn splitters: they must return partitions of the dataset instead of partitions of the indices.
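As a rough illustration of the splitter contract described above (returning dataset partitions rather than index partitions, as sklearn splitters do), a pure-Python sketch over a plain list, with fold sizes balanced when the length is not divisible by k:

```python
def k_fold_partitions(samples, k):
    """Return k (train, validation) pairs; in each pair the validation
    part is one of the k folds and train is everything else."""
    fold_size, extra = divmod(len(samples), k)
    partitions = []
    start = 0
    for i in range(k):
        # the first `extra` folds get one additional sample
        stop = start + fold_size + (1 if i < extra else 0)
        validation = samples[start:stop]
        train = samples[:start] + samples[stop:]
        partitions.append((train, validation))
        start = stop
    return partitions
```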
There are no predefined scorers in dislib. GridSearchCV will use the estimator's score() function by default if it exists; otherwise, the user should provide a scorer. Currently, only CascadeSVM and RandomForestClassifier contain a score() function, which now returns an unsynchronized future object.
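The fallback described here can be sketched as follows (the function name, its signature, and the toy estimator are illustrative assumptions, not the dislib API):

```python
def resolve_scorer(estimator, scoring=None):
    """Prefer a user-provided scoring callable; otherwise fall back to
    the estimator's own score() method, failing early if neither exists."""
    if scoring is not None:
        return scoring
    if not hasattr(estimator, "score"):
        raise TypeError("estimator has no score() method; "
                        "pass a scoring callable instead")
    return lambda est, x, y: est.score(x, y)

class DummyEstimator:
    def score(self, x, y):
        # toy accuracy: fraction of matching labels
        return sum(a == b for a, b in zip(x, y)) / len(y)

scorer = resolve_scorer(DummyEstimator())
```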
To use dislib estimators with GridSearchCV, they have to extend BaseEstimator (or have get_params() and set_params() methods). This is tested. I've also modified all the estimators so that the init arguments become publicly available as parameters with the same name, and any logic is removed from the init method and moved to the fit() method. This is not tested; I tried, but failed because of an awkward sklearn dependency that requires nose for these validations. To test it we would need to duplicate a lot of code from sklearn, which has a maintenance cost, so maybe we should just document guidelines for estimators.
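The get_params()/set_params() contract can be sketched like this (a simplified stand-in for sklearn's BaseEstimator, assuming every init argument is stored under an attribute of the same name, as the modified estimators now do; class names are hypothetical):

```python
import inspect

class ParamsMixin:
    """Minimal get_params()/set_params(), derived from the __init__
    signature; assumes each argument is stored under the same name."""
    def get_params(self):
        names = [p for p in inspect.signature(type(self).__init__).parameters
                 if p != "self"]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

class ToyEstimator(ParamsMixin):
    def __init__(self, c=1.0, max_iter=100):
        # no logic here: init only stores the arguments, as described above
        self.c = c
        self.max_iter = max_iter
```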
This GridSearchCV implementation doesn't use nesting, so execution will only be parallel for estimators without synchronizations in the fit() and score()/scorer methods. For some algorithms like CascadeSVM, the user should set check_convergence=False to get parallel execution.
Fixes #176
Type of change
How Has This Been Tested?
RFTest.test_make_classification (but I modified estimator tests)
Reproduce instructions:
Example in MN4:
enqueue_compss -t --qos=debug --exec_time=120 --num_nodes=9 --worker_in_master_cpus=0 example.py
example.py:
Checklist: