
[ML] Move ML put_job dependencies to X-Pack protocol library #32167

Closed

Conversation

droberts195
Contributor

This change is in preparation for adding the ML put_job action
to the high level REST client.

This change does NOT add any ML functionality to the high level
REST client, but merely moves existing code into the X-Pack
protocol library, so that the classes can be returned in the
put_job response. (If this change were done in the same PR as
adding the endpoint then the code rearrangement would make it
hard to see and review the changes related to adding the
endpoint.)

Unit tests are also moved to the same module as the class they
test.

A handful of constants have also been moved around to avoid moving
some dependencies.

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Member

@nik9000 nik9000 left a comment

At first I thought it was way too much but on second look I think it is maybe just a little too much. Moving all of the config classes should make it simple to configure ML from the high level rest client which is totally the point.

I'm concerned that we're already adding subpackages to the protocol.xpack.ml package. I'd sort of hoped for a flat structure but seeing how many classes are involved I can see why. I might look to flatten job.config into job and I think everything in util should be triple checked to see if we can get away without moving it. And if we do move it then we should move it into the protocol.xpack.ml package. Probably. Maybe.

import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.search.SearchShardTarget;
import org.elasticsearch.xpack.core.ml.job.messages.Messages;

public class ExceptionsHelper {
Member

I hadn't intended something like this to move over. I have no problem open sourcing it, but I don't feel like it is something the client should have to know about.

Member

I see that this is used by the Requests for validation. I wonder if we can split this into server side and client side code. Or build the exceptions directly in the validation methods if they aren't reused.


import org.elasticsearch.common.unit.TimeValue;

public final class MachineLearningConstants {
Member

This too feels like it is mostly devoted to working with ML's C++ process and shouldn't move.

Member

I see that the request and response objects do indeed need these. Could they be made to not need them by moving code out of those objects or would that be a nightmare?

@@ -27,3 +27,16 @@ dependencies {

testCompile "org.elasticsearch.test:framework:${version}"
}

configurations {
testArtifacts.extendsFrom testRuntime
Member

We've really been trying to limit the number of testArtifacts that we do. I'm not sure of the right way to work around this, but it does set off alarm bells in my head.

import org.elasticsearch.xpack.core.ml.job.messages.Messages;
import org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper;
import org.elasticsearch.xpack.core.ml.utils.time.TimeUtils;
import org.elasticsearch.protocol.xpack.ml.messages.Messages;
Member

I hadn't envisaged there being much in the way of complex package structure in the protocol. I expect we'll end up with some subpackages but this doesn't feel right as it is.

Member

@davidkyle davidkyle left a comment

LGTM

I share @nik9000's concerns about the amount of code moved to the protocol package, especially given that this facilitates only a subset of the ML APIs. I'd prefer it if some of the code in the ml.utils package was not moved, but unfortunately I cannot see a way to prune the change effectively.

Once the client is complete a large portion of the ML code will live in the protocol package. I assume the protocol package was created to break a dependency on xpack.core. Maybe this is a naming problem: I expect protocol to be small and lightweight, but it's more like everything-i-need-for-the-x-packs-apis.

@nik9000
Member

nik9000 commented Jul 18, 2018

Yeah, it is supposed to be the protocol spoken by the APIs. I think we should endeavour for it to be as small as possible while still being usable.

@hub-cap
Contributor

hub-cap commented Jul 18, 2018

I'm very ++ on making this small. Is it not possible to break up some of these classes?

@javanna
Member

javanna commented Jul 19, 2018

Given all the needed changes, I wonder if we should consider creating POJOs that match the REST response and are independent from the objects that we already have, something along the lines of what we did with synced flush.

@droberts195
Contributor Author

droberts195 commented Jul 19, 2018

I have pondered this a bit more and here are my thoughts.

Regarding size, there is already a high level client for ML, namely the .NET client. I think it's interesting to look at the size and structure of that as a guide to what could be possible for the Java high level REST client. The .NET client's ML section consists of classes for each request and response, plus classes to represent config and results objects. These classes to represent config and results objects are split into:

  1. Datafeed config
  2. Job config
  3. Job stats
  4. Job results

(The distinction between job config and job results is why I had a job.config package in this PR. This PR doesn't move any of the results classes, as they're not required to implement the put_job endpoint.)

The size of this portion of the .NET client is just over 7300 lines of code:

cd elasticsearch-net/src/Nest/XPack/MachineLearning
find . -type f | xargs wc -l
    7312 total

But if you look at the .NET code it's clear that it's been laid out very compactly: braces and content on single lines, no copyright headers, /// doc comments versus /** doc comments. See for example PutJobRequest.cs. I reckon the equivalent in our Java style would be at least twice as many lines.

The other thing about the .NET client's config classes is that they're mutable. Our config classes are immutable and have corresponding builders. So there's even more code in those builders.

So realistically, to have a high level client for ML we're going to need somewhere in the region of 20000 lines of code in the protocol library. Anything significantly below that is not going to be "high level" regardless of whether it's new code or rearranged existing code.

What's in this PR currently is in line with that expectation. I've moved somewhere around 40% of the config and results classes and the number of lines is around 40% of 20000:

cd elasticsearch/x-pack/protocol/src/main/java/org/elasticsearch/protocol/xpack/ml
find . -type f | xargs wc -l
    7627 total

On the question of "what is high level" I looked back to the original high level REST client blog and it says:

It soon became clear that in order to progress and to ease migration at the same time, we would still need the current search and query builders and support the current response objects

I agree with this. For anyone currently using classes like Job and Job.Builder from the transport client, a high level REST client that doesn't have these classes is not going to be easy to migrate to.

I had a look at what's been done elsewhere. It seems that a pattern that's used quite a lot is that requests take a BytesReference of config rather than a config object. Then there can be a separate builder that can build the BytesReference in a more friendly way. It's possible to move the request object into the protocol library without also moving the friendly builder, but I'm not convinced this is really in the spirit of the high level REST client being a drop-in replacement for the transport client - see also
#32191 (comment).

Going back to the comments on this PR, the reason some of the current config classes are heavier than you might imagine and have dependencies on utility classes is that the config builders do validation and refuse to create config objects that are invalid. One option for making the config classes smaller, with fewer dependencies, would be to move the validation out of them.
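To illustrate (with a made-up `JobConfig` class, not the real ML code) why a validating `build()` is exactly where the utility dependencies creep in:

```java
// Hypothetical, heavily simplified config class. The real Job class has many
// more fields, and its build() reaches into Messages/ExceptionsHelper to
// construct error messages, which is what drags those utilities along.
public final class JobConfig {
    private final String jobId;

    private JobConfig(String jobId) {
        this.jobId = jobId;
    }

    public String getJobId() {
        return jobId;
    }

    public static final class Builder {
        private String jobId;

        public Builder setJobId(String jobId) {
            this.jobId = jobId;
            return this;
        }

        // Validating build(): refuses to create an invalid config object.
        // In the real classes this check calls into shared utility code.
        public JobConfig build() {
            if (jobId == null || jobId.isEmpty()) {
                throw new IllegalArgumentException("job_id must be specified");
            }
            return new JobConfig(jobId);
        }
    }
}
```

Moving that check into a separate server-side validator would leave `JobConfig` free of utility dependencies, at the cost of allowing invalid instances to exist.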

As @javanna said, another option would be to create a whole mirror class hierarchy that doesn't do any validation.

So I think the options for making ML's presence in the protocol library lighter come down to these three:

  1. Remove validation from config builders, and add separate validation classes in x-pack-core. So our config and results classes would live in the protocol library, but not depend on utilities and constants required to do validation. Then we'd have a set of corresponding validation classes in x-pack-core that would call the build() method of config builders and only return the built config if it was valid, thus achieving the same end result as the current validating build() methods.
     • Pros: No duplication of code.
     • Cons: (a) Possible future doubts about whether config objects are valid at a particular point in the code - at the moment it's impossible to have an invalid config object as the builder won't let you. (b) Classes in the protocol library will contain both X-Content and transport layer serialisation code.
  2. Create an entirely separate hierarchy of classes for config and results - use exactly the same class names as the x-pack-core classes, and have getter methods, to/from X-Content, and non-validating builders for config.
     • Pros: (a) Separates potentially invalid configs (protocol library) from known valid ones (x-pack-core library). (b) Protocol library classes don't have transport layer serialisation code.
     • Cons: Lots of code duplication - how do we prevent the class hierarchies getting out of sync? (Maybe in an "evil" test that uses reflection to confirm they have exactly the same getter methods?)
  3. Change request objects to take configs as BytesReference instead of config objects, effectively decoupling requests from builders. Similar to the Watcher approach, although assuming users value builders I think it only gives an illusion of breaking the need to depend on server code. Plus in ML's case we return config objects too, so this doesn't really save us from writing much code if we have to add POJOs to represent these returned configs.
     • I think given where ML is now this is inferior to options 1 and 2 and we shouldn't do it.
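As a sketch of what the "evil" reflection test from option 2's cons could look like (class names are placeholders; this only demonstrates the getter-comparison idea, not a real test in the codebase):

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// Sketch: check that two parallel class hierarchies expose the same public
// getters, so a field added to the server class but forgotten on the
// protocol class would fail the test.
public class GetterParityCheck {

    // Collect names of all public no-arg getX()/isX() methods on a class,
    // excluding the ones every Object has.
    static Set<String> getterNames(Class<?> clazz) {
        Set<String> names = new TreeSet<>();
        for (Method m : clazz.getMethods()) {
            String name = m.getName();
            if (m.getParameterCount() == 0
                    && (name.startsWith("get") || name.startsWith("is"))
                    && m.getDeclaringClass() != Object.class) {
                names.add(name);
            }
        }
        return names;
    }

    // A real test would compare e.g. the x-pack-core Job class against the
    // protocol-library Job class and fail on any mismatch.
    static boolean sameGetters(Class<?> serverClass, Class<?> clientClass) {
        return getterNames(serverClass).equals(getterNames(clientClass));
    }
}
```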

Any thoughts?

@hub-cap
Contributor

hub-cap commented Jul 19, 2018

WRT #2, can you not use inheritance? The protocol classes are the base classes without any of the server side logic, and there are server side classes that contain transport layer functionality and validation. Server side can use the server side classes (or, in cases where it does not need to, the client side objects), and client side only knows about client side code. Then we don't have code drift. If there is some new field that needs to be added to the class, it's added to the protocol class, and then the transport class takes advantage of that in its ser/deser. Or if it's forever only needed server side, make a constructor in the transport class. What do you think about that as a 2a option?
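A rough sketch of that 2a idea (all class names hypothetical, not taken from the codebase): the protocol class holds the fields and accessors the client needs, and a server-side subclass layers validation and transport serialisation on top without duplicating the fields.

```java
// Hypothetical base class that would live in the protocol library:
// just the state and getters the high level REST client needs.
public class DetectorConfig {
    protected final String function;

    public DetectorConfig(String function) {
        this.function = function;
    }

    public String getFunction() {
        return function;
    }
}

// Hypothetical server-side subclass: adds validation, and would also add
// the transport layer ser/deser, referencing only server-side types.
class ServerDetectorConfig extends DetectorConfig {
    ServerDetectorConfig(String function) {
        super(function);
        if (function == null || function.isEmpty()) {
            throw new IllegalArgumentException("function is required");
        }
    }

    // writeTo(StreamOutput) / a StreamInput constructor would go here,
    // keeping Writeable out of the protocol library entirely.
}
```

A new field would then be added once, on the base class, and the server subclass would pick it up in its serialisation, which is what avoids the drift between the two hierarchies.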

WRT the whole validation thing, there are definitely 2 schools of thought. Do some very light validation (null checks) client side and let the server side do full validation, or do all the things client side. I'm not sure what doing all the things client side buys us, since it will happen again server side anyway unless we're crazy (I've been known to be crazy).

@javanna
Member

javanna commented Jul 19, 2018

I think the main problem of option 2 is not necessarily code duplication but rather that migrating may not be straight-forward compared to reusing transport client classes, but that's because I try to look at it from the user's perspective. I dream of a client that's independent from the server code, one day, and think that on the (very) long run we may end up with separate client classes that diverge from server classes, and that's natural. Ideally we would be able to generate client classes, but we are not ready for any of this at the moment.

On the other hand, the statement on query builders etc. was around the search API, which we think is one of the most used and complex APIs, while we have later used different approaches for other APIs that are less used from a Java client, so we could simplify things for ourselves; something to keep in mind when making these decisions.

I guess the take-away here is that we need to make some compromise :)

@davidkyle
Member

The results classes don't have builders and they alone come to 3657 lines of code. Whatever compromise we choose, we need to get comfortable with the notion that a lot of code will be moved.

I agree with @javanna: it isn't in the spirit of the high level client not to provide the builders. The validation does not need to take place on the client side and should remain in x-pack.core/ml. Yes, we lose the guarantee that a config object is properly formed and validated, but we know the REST entry points, will validate at the edge, and can safely assume any persisted job config is valid.

My compromise would be to move the POJOs and builders without the validation and transport layer serialisation code, which will be handled in dedicated classes via inheritance/composition.

@hub-cap
Contributor

hub-cap commented Jul 25, 2018

Also, I think it would be more meaningful to make these changes as we need them (in the PR, or in a PR before the REST PR), so that we don't get thousands of lines of code changes in a review. I'm assuming we will have to split things up less mechanically than making changes in a script/IDE, so we would want to review them. I'm not sure I see the benefit of moving them over wholesale in one PR.

@droberts195
Contributor Author

Also, I think it would be more meaningful to make these changes as we need them (in the PR, or in a PR before the REST PR), so that we don't get thousands of lines of code changes in a review

The config classes I moved in this PR are the minimum required to implement the put job endpoint. (In this PR I didn't move any results classes or datafeed config classes.) I could have moved the job config classes in the same PR that implements the put job endpoint, but that would have resulted in even more changes in one PR, with moves obscuring the new code for put job.

I was on holiday yesterday and Monday, but now I've caught up I'll move the validation out of the config classes which will reduce the number of dependencies of the config classes and allow some of the files moved to the protocol library in this PR to be moved back to x-pack-core.

@droberts195
Contributor Author

I spoke to @davidkyle and @dimitris-athanasiou about this, as we're the ones who will live most closely with the consequences. We came to the conclusion that having a completely separate hierarchy of client classes will cause least pain in the long term. For example, we won't have to refactor again when there's a desire to remove the wire transport methods from REST client classes.

For the benefit of anyone using the current transport client with ML I'll be careful to keep the class names and method names the same on the client side as on the server side. Anyone switching from the Transport client to the high level REST client will have to change imports, but they would have to do that with either approach.

Also, from a PR size perspective, creating completely separate classes has the advantage that I can do just a few classes per PR, and there will be no knock-on effects on imports in hundreds of other files.

So I'll close this and create some new PRs.

@droberts195 droberts195 deleted the ml_hlrc_put_job_code_move branch September 26, 2019 13:56
8 participants