
[ML] Move ML put_job dependencies to X-Pack protocol library #32167

Closed

Conversation

droberts195
Contributor

This change is in preparation for adding the ML put_job action
to the high level REST client.

This change does NOT add any ML functionality to the high level
REST client, but merely moves existing code into the X-Pack
protocol library, so that the classes can be returned in the
put_job response. (If this change were done in the same PR as
adding the endpoint then the code rearrangement would make it
hard to see and review the changes related to adding the
endpoint.)

Unit tests are also moved to the same module as the class they
test.

A handful of constants have also been moved around to avoid moving
some dependencies.

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Member

@nik9000 nik9000 left a comment

At first I thought it was way too much but on second look I think it is maybe just a little too much. Moving all of the config classes should make it simple to configure ML from the high level rest client which is totally the point.

I'm concerned that we're already adding subpackages to the protocol.xpack.ml package. I'd sort of hoped for a flat structure but seeing how many classes are involved I can see why. I might look to flatten job.config into job and I think everything in util should be triple checked to see if we can get away without moving it. And if we do move it then we should move it into the protocol.xpack.ml package. Probably. Maybe.

import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.search.SearchShardTarget;
import org.elasticsearch.xpack.core.ml.job.messages.Messages;

public class ExceptionsHelper {
Member

I hadn't intended something like this to move over. I have no problem open sourcing it, but I don't feel like it is something the client should have to know about.

Member

I see that this is used by the Requests for validation. I wonder if we can split this into server side and client side code. Or build the exceptions directly in the validation methods if they aren't reused.


import org.elasticsearch.common.unit.TimeValue;

public final class MachineLearningConstants {
Member

This too feels like it is mostly devoted to working with ML's C++ process and shouldn't move.

Member

I see that the request and response objects do indeed need these. Could they be made to not need them by moving code out of those objects or would that be a nightmare?

@@ -27,3 +27,16 @@ dependencies {

testCompile "org.elasticsearch.test:framework:${version}"
}

configurations {
testArtifacts.extendsFrom testRuntime
Member

We've really been trying to limit the number of testArtifacts that we do. I'm not sure of the right way to work around this, but it does set off alarm bells in my head.

import org.elasticsearch.xpack.core.ml.job.messages.Messages;
import org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper;
import org.elasticsearch.xpack.core.ml.utils.time.TimeUtils;
import org.elasticsearch.protocol.xpack.ml.messages.Messages;
Member

I hadn't envisaged there being much in the way of complex package structure in the protocol. I expect we'll end up with some subpackages but this doesn't feel right as it is.

Member

@davidkyle davidkyle left a comment

LGTM

I share @nik9000's concerns about the amount of code moved to the protocol package, especially given that this facilitates only a subset of the ML APIs. I'd prefer it if some of the code in the ml.utils package was not moved, but unfortunately I cannot see a way to prune the change effectively.

Once the client is complete a large portion of the ML code will live in the protocol package. I assume the protocol package was created to break a dependency on xpack.core. Maybe this is a naming problem: I expect protocol to be small and lightweight, but it's more like everything-i-need-for-the-x-packs-apis.

@nik9000
Member

nik9000 commented Jul 18, 2018

Yeah, it is supposed to be the protocol spoken by the APIs. I think we should endeavour for it to be as small as possible while still being usable.

@hub-cap
Contributor

hub-cap commented Jul 18, 2018

I'm very ++ on making this small. Is it not possible to break up some of these classes?

@javanna
Member

javanna commented Jul 19, 2018

Given all the needed changes, I wonder if we should consider creating POJOs that match the REST response and are independent from the objects that we already have, something along the lines of what we did with synced flush.

@droberts195
Contributor Author

droberts195 commented Jul 19, 2018

I have pondered this a bit more and here are my thoughts.

Regarding size, there is already a high level client for ML, namely the .NET client. I think it's interesting to look at the size and structure of that as a guide to what could be possible for the Java high level REST client. The .NET client's ML section consists of classes for each request and response, plus classes to represent config and results objects. These classes to represent config and results objects are split into:

  1. Datafeed config
  2. Job config
  3. Job stats
  4. Job results

(The distinction between job config and job results is why I had a job.config package in this PR. This PR doesn't move any of the results classes, as they're not required to implement the put_job endpoint.)

The size of this portion of the .NET client is just over 7300 lines of code:

cd elasticsearch-net/src/Nest/XPack/MachineLearning
find . -type f | xargs wc -l
    7312 total

But if you look at the .NET code it's clear that it's been laid out very compactly: braces and content on single lines, no copyright headers, /// doc comments versus /** doc comments. See for example PutJobRequest.cs. I reckon the equivalent in our Java style would be at least twice as many lines.

The other thing about the .NET client's config classes is that they're mutable. Our config classes are immutable and have corresponding builders. So there's even more code in those builders.

So realistically, to have a high level client for ML we're going to need somewhere in the region of 20000 lines of code in the protocol library. Anything significantly below that is not going to be "high level" regardless of whether it's new code or rearranged existing code.

What's in this PR currently is in line with that expectation. I've moved somewhere around 40% of the config and results classes and the number of lines is around 40% of 20000:

cd elasticsearch/x-pack/protocol/src/main/java/org/elasticsearch/protocol/xpack/ml
find . -type f | xargs wc -l
    7627 total

On the question of "what is high level" I looked back to the original high level REST client blog and it says:

It soon became clear that in order to progress and to ease migration at the same time, we would still need the current search and query builders and support the current response objects

I agree with this. For anyone currently using classes like Job and Job.Builder from the transport client, a high level REST client that doesn't have these classes is not going to be easy to migrate to.

I had a look at what's been done elsewhere. It seems that a pattern that's used quite a lot is that requests take a BytesReference of config rather than a config object. Then there can be a separate builder that can build the BytesReference in a more friendly way. It's possible to move the request object into the protocol library without also moving the friendly builder, but I'm not convinced this is really in the spirit of the high level REST client being a drop-in replacement for the transport client - see also
#32191 (comment).

Going back to the comments on this PR, the reason some of the current config classes are heavier than you might imagine and have dependencies on utility classes is that the config builders do validation and refuse to create config objects that are invalid. One option for making the config classes smaller, with fewer dependencies, would be to move the validation out of them.
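To illustrate (with a made-up `JobConfig` class, not the real ML code) why a validating `build()` is exactly where the utility dependencies creep in:

```java
// Hypothetical, heavily simplified config class. The real Job class has many
// more fields, and its build() reaches into Messages/ExceptionsHelper to
// construct error messages, which is what drags those utilities along.
public final class JobConfig {
    private final String jobId;

    private JobConfig(String jobId) {
        this.jobId = jobId;
    }

    public String getJobId() {
        return jobId;
    }

    public static final class Builder {
        private String jobId;

        public Builder setJobId(String jobId) {
            this.jobId = jobId;
            return this;
        }

        // Validating build(): refuses to create an invalid config object.
        // In the real classes this check calls into shared utility code.
        public JobConfig build() {
            if (jobId == null || jobId.isEmpty()) {
                throw new IllegalArgumentException("job_id must be specified");
            }
            return new JobConfig(jobId);
        }
    }
}
```

Moving that check into a separate server-side validator would leave `JobConfig` free of utility dependencies, at the cost of allowing invalid instances to exist.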

As @javanna said, another option would be to create a whole mirror class hierarchy that doesn't do any validation.

So I think the options for making ML's presence in the protocol library lighter come down to these three:

  1. Remove validation from config builders, and add separate validation classes in x-pack-core. So our config and results classes would live in the protocol library, but not depend on utilities and constants required to do validation. Then we'd have a set of corresponding validation classes in x-pack-core that would call the build() method of config builders and only return the built config if it was valid, thus achieving the same end result as the current validating build() methods.
     • Pros: No duplication of code.
     • Cons: (a) Possible future doubts about whether config objects are valid at a particular point in the code - at the moment it's impossible to have an invalid config object as the builder won't let you. (b) Classes in the protocol library will contain both X-Content and transport layer serialisation code.
  2. Create an entirely separate hierarchy of classes for config and results - use exactly the same class names as the x-pack-core classes, and have getter methods, to/from X-Content, and non-validating builders for config.
     • Pros: (a) Separates potentially invalid configs (protocol library) from known valid ones (x-pack-core library). (b) Protocol library classes don't have transport layer serialisation code.
     • Cons: Lots of code duplication - how do we prevent the class hierarchies getting out of sync? (Maybe in an "evil" test that uses reflection to confirm they have exactly the same getter methods?)
  3. Change request objects to take configs as BytesReference instead of config objects, effectively decoupling requests from builders. Similar to the Watcher approach, although assuming users value builders I think it only gives an illusion of breaking the need to depend on server code. Plus in ML's case we return config objects too, so this doesn't really save us from writing much code if we have to add POJOs to represent these returned configs.
     • I think given where ML is now this is inferior to options 1 and 2 and we shouldn't do it.
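As a sketch of what the "evil" reflection test from option 2's cons could look like (class names are placeholders; this only demonstrates the getter-comparison idea, not a real test in the codebase):

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// Sketch: check that two parallel class hierarchies expose the same public
// getters, so a field added to the server class but forgotten on the
// protocol class would fail the test.
public class GetterParityCheck {

    // Collect names of all public no-arg getX()/isX() methods on a class,
    // excluding the ones every Object has.
    static Set<String> getterNames(Class<?> clazz) {
        Set<String> names = new TreeSet<>();
        for (Method m : clazz.getMethods()) {
            String name = m.getName();
            if (m.getParameterCount() == 0
                    && (name.startsWith("get") || name.startsWith("is"))
                    && m.getDeclaringClass() != Object.class) {
                names.add(name);
            }
        }
        return names;
    }

    // A real test would compare e.g. the x-pack-core Job class against the
    // protocol-library Job class and fail on any mismatch.
    static boolean sameGetters(Class<?> serverClass, Class<?> clientClass) {
        return getterNames(serverClass).equals(getterNames(clientClass));
    }
}
```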

Any thoughts?

@hub-cap
Contributor

hub-cap commented Jul 19, 2018

WRT #2, can you not use inheritance? The protocol classes are the base classes without any of the server side logic, and there are server side classes that contain transport layer functionality and validation. Server side can use the server side classes (or, in cases where it does not need to, the client side objects), and client side only knows about client side code. Then we don't have code drift. If there is some new field that needs to be added to the class, it's added to the protocol class, and then the transport class takes advantage of that in its ser/deser. Or if it's forever only needed server side, make a constructor in the transport class. What do you think about that as a 2a option?
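A rough sketch of that 2a idea (all class names hypothetical, not taken from the codebase): the protocol class holds the fields and accessors the client needs, and a server-side subclass layers validation and transport serialisation on top without duplicating the fields.

```java
// Hypothetical base class that would live in the protocol library:
// just the state and getters the high level REST client needs.
public class DetectorConfig {
    protected final String function;

    public DetectorConfig(String function) {
        this.function = function;
    }

    public String getFunction() {
        return function;
    }
}

// Hypothetical server-side subclass: adds validation, and would also add
// the transport layer ser/deser, referencing only server-side types.
class ServerDetectorConfig extends DetectorConfig {
    ServerDetectorConfig(String function) {
        super(function);
        if (function == null || function.isEmpty()) {
            throw new IllegalArgumentException("function is required");
        }
    }

    // writeTo(StreamOutput) / a StreamInput constructor would go here,
    // keeping Writeable out of the protocol library entirely.
}
```

A new field would then be added once, on the base class, and the server subclass would pick it up in its serialisation, which is what avoids the drift between the two hierarchies.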

WRT the whole validation thing, there are definitely 2 schools of thought. Do some very light validation (null checks) client side and let the server side do full validation, or do all the things client side. I'm not sure what doing all the things client side buys us, since it will happen again server side anyway unless we're crazy (I've been known to be crazy).

@javanna
Member

javanna commented Jul 19, 2018

I think the main problem of option 2 is not necessarily code duplication but rather that migrating may not be straight-forward compared to reusing transport client classes, but that's because I try to look at it from the user's perspective. I dream of a client that's independent from the server code, one day, and think that on the (very) long run we may end up with separate client classes that diverge from server classes, and that's natural. Ideally we would be able to generate client classes, but we are not ready for any of this at the moment.

On the other hand, the statement on query builders etc. was around the search API, which we think is one of the most used and complex APIs, while we have later used different approaches for other APIs that are less used from a Java client, so we could simplify things for ourselves; something to keep in mind when making these decisions.

I guess the take-away here is that we need to make some compromise :)

@davidkyle
Member

The results classes don't have builders and they alone come to 3657 lines of code. Whatever compromise we choose, we need to get comfortable with the notion that a lot of code will be moved.

I agree with @javanna: it isn't in the spirit of the high level client not to provide the builders. The validation does not need to take place on the client side and should remain in x-pack.core/ml. Yes, we lose the guarantee that a config object is properly formed and validated, but we know the REST entry points, will validate at the edge, and can safely assume any persisted job config is valid.

My compromise would be to move the POJOs and builders without the validation and transport layer serialisation code, which will be handled in dedicated classes via inheritance/composition.

@hub-cap
Contributor

hub-cap commented Jul 25, 2018

Also, I think it would be more meaningful to make these changes as we need them (in the PR, or in a PR before the REST PR), so that we don't get thousands of lines of code changes in a review. I'm assuming we will have to split things up less mechanically than making changes in a script/IDE, so we would want to review them. I'm not sure I see the benefit of moving them over wholesale in one PR.

@droberts195
Contributor Author

Also, I think it would be more meaningful to make these changes as we need them (in the PR, or in a PR before the REST PR), so that we don't get thousands of lines of code changes in a review

The config classes I moved in this PR are the minimum required to implement the put job endpoint. (In this PR I didn't move any results classes or datafeed config classes.) I could have moved the job config classes in the same PR that implements the put job endpoint, but that would have resulted in even more changes in one PR, with moves obscuring the new code for put job.

I was on holiday yesterday and Monday, but now I've caught up I'll move the validation out of the config classes which will reduce the number of dependencies of the config classes and allow some of the files moved to the protocol library in this PR to be moved back to x-pack-core.

@droberts195
Contributor Author

I spoke to @davidkyle and @dimitris-athanasiou about this, as we're the ones who will live most closely with the consequences. We came to the conclusion that having a completely separate hierarchy of client classes will cause least pain in the long term. For example, we won't have to refactor again when there's a desire to remove the wire transport methods from REST client classes.

For the benefit of anyone using the current transport client with ML I'll be careful to keep the class names and method names the same on the client side as on the server side. Anyone switching from the Transport client to the high level REST client will have to change imports, but they would have to do that with either approach.

Also, from a PR size perspective, creating completely separate classes has the advantage that I can do just a few classes per PR, and there will be no knock-on effects on imports in hundreds of other files.

So I'll close this and create some new PRs.

@droberts195 droberts195 deleted the ml_hlrc_put_job_code_move branch September 26, 2019 13:56
8 participants