This repository was archived by the owner on Mar 4, 2020. It is now read-only.

Content Type Filtering for Migration. #16

Merged (25 commits) on Feb 21, 2018

Conversation

@nihadness (Contributor) commented Feb 13, 2018

Current Behavior
While fetching data, it is not possible to filter the objects from the source instance. It would be nice if users could choose content types they want to import into the destination.

Expected Behavior after this PR
Users can enter content types separated by commas to filter the objects that will be imported.


It is possible that the sync will have to create some 'unwanted' objects when they are dependencies of the requested objects. For such objects, however, dependencies will not be created recursively, which might cause an error while 'reindexing' them. A try/except block was added specifically for that case.

Some log levels and messages were also updated to reduce noise in the logs and make them more understandable.

@nihadness nihadness self-assigned this Feb 13, 2018
@nihadness nihadness requested a review from ramonski February 13, 2018 11:03
@ramonski (Contributor) left a comment

Phew, a lot changed in here that I obviously missed... Please take care that you do not increase complexity unnecessarily. However, only minor changes are needed, Nihad, so it would be cool if you could take care of them ;)

if ct:
content_types = [t.strip() for t in ct.split(",") if t]
else:
content_types = None
Contributor:

That's already the default value, so you can skip it.

content_types = form.get("content_types", None)
if content_types is not None:
    content_types = content_types.split(",")

@@ -56,6 +56,7 @@ def verify(self):
storage["credentials"]["url"] = self.url
storage["credentials"]["username"] = self.username
storage["credentials"]["password"] = self.password
storage["credentials"]["content_types"] = self.content_types
Contributor:

Not sure why the selected content_types need to be in the credentials storage... Please choose a better location to increase clarity

Contributor Author:

I had to move it to the storage after I saw that those 'content types' are needed in both the fetch and the import steps. It cannot be just a variable kept in memory, because currently a user can run the fetch, stop the instance, and then run the import...
Actually, @juangallostra is currently working on making settings and registry records optional, so we will have to save those somewhere as well.
So, any better idea where to save all these import options?

Contributor:

I meant that you should better put it into the settings storage;) The credentials should hold only the credentials for the sync.
https://github.com/senaite/senaite.sync/blob/master/src/senaite/sync/browser/views.py#L142

Contributor Author:

Ah true, that's where they should be : )

return self.get_items("registry/{}".format(key))
Contributor:

better use "/".join(["registry", key])

class="form-control"
id="content_types"
tal:attributes="value view/content_types|nothing"
name="content_types"/>
Contributor:

I would render a multi-checkbox field here instead of requiring the user to type in the content types by hand.
Note: You also don't handle possible spaces in between in your code, e.g. Sample, Worksheed , FooType
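A hypothetical parsing helper (not part of the PR) that handles both the surrounding spaces and the empty entries such an input can produce:

```python
def parse_content_types(value):
    """Split a comma-separated user input into clean type names.

    Illustrative sketch only: strip() removes surrounding whitespace
    (e.g. " Worksheed " -> "Worksheed") and the trailing condition drops
    empty tokens produced by inputs like "Sample,,FooType".
    """
    if not value:
        # Empty or missing input means "no filtering"
        return None
    return [t.strip() for t in value.split(",") if t.strip()]
```

With this in place, an input like `Sample, Worksheed , FooType` yields a clean list regardless of how the user spaces the commas.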

Contributor Author:

You are right that using checkboxes would be a way better idea, and we discussed with @xispa how to generate that list of checkboxes.
Should it be hardcoded? Not a good solution, plus what about the different content types of each add-on? Should we retrieve them from the portal types registry? Then the list is too long...
So we ended up letting the user enter them, considering that the person who manages the sync will have some knowledge of the types.

And spaces are removed here: https://github.com/senaite/senaite.sync/pull/16/files#diff-32423e22290aaa3db2cdebe0c85fa752R106

Contributor:

Ok got it. Then maybe you should do something like this afterwards:

portal_types = api.get_tool("portal_types")
content_types = filter(lambda ct: ct in portal_types, content_types)

Just a scribble, so you need to find out how to retrieve the current registered types...
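Based on that scribble, the validation step could look like the sketch below; the `portal_types` argument here is a plain mapping standing in for the real portal_types tool returned by `api.get_tool("portal_types")`:

```python
def filter_known_types(content_types, portal_types):
    """Keep only content types actually registered in the portal.

    Sketch of the reviewer's suggestion: unknown names (e.g. the typo
    "Worksheed") are silently dropped instead of causing failures later.
    """
    return [ct for ct in content_types if ct in portal_types]
```

A list comprehension is used instead of `filter()` so the result is a list under both Python 2 and 3.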

Contributor Author:

That is a good idea, I added it to form submission in the view.

# Knowing the catalog length compute the number of pages we will need
# with the desired window size and overlap
effective_window = window-overlap
number_of_pages = (catalog_data["count"]/effective_window) + 1
number_of_pages = (cd["count"]/effective_window) + 1
Contributor:

Why this? Wasn't there once a yield_items method that went exactly through the pages?

Contributor Author:

Yes there was and there is, thanks to you : )
But since we retrieve the objects ordered by modification time (the 'uid_catalog' actually does this), and some objects can be modified in the source while the destination is fetching, we query some records twice. E.g. the first query retrieves records 0-1000, the second query 990-1990, and so on. So even if 10 objects are modified during each fetch query, we are still safe.
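The overlapping-window scheme described above can be sketched as follows (the function name and default values are illustrative, not from the PR):

```python
def window_ranges(total, window=1000, overlap=10):
    """Yield (start, end) query slices that overlap by `overlap` records.

    Sketch of the fetch pagination: each page re-reads the last `overlap`
    records of the previous one, so up to `overlap` objects may be
    modified between queries without any record being skipped.
    """
    effective = window - overlap
    # Same page-count formula as in the snippet above
    pages = (total // effective) + 1
    for i in range(pages):
        start = i * effective
        yield start, start + window
```

For 2,000 records this produces exactly the slices mentioned in the comment: 0-1000, 990-1990, and so on.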

Contributor:

Ah, tricky :) But what if 11 objects were modified 🙈 That was the reason I suggested that we fetch just the UID, portal_type and physical path in step 1, to be able to create the object slugs. Then in step 2 I would fetch just in time to update the objects and probably re-fetch any missing objects...

Contributor Author:

We already fetch just the UID, portal_type and path in the first step... Do you mean fetching all the objects at once? That can be done here by increasing the window size, but with a source of 800,000 objects it was filling the memory completely, especially when the source and the destination are on the same machine.
Please let me know if I misunderstood what you meant.

@juangallostra juangallostra mentioned this pull request Feb 15, 2018
@nihadness (Contributor Author):
Hi @ramonski, could you please tell me what the state of this PR is? : )

@ramonski (Contributor) left a comment

Hi Nihad, thanks for your efforts!
Only some minor changes are needed, and then we can merge it in ;)

content_types = form.get("content_types", None)
if content_types is not None:
    content_types = [t.strip() for t in content_types.split(",") if t]
Contributor:

I think you can just do a [t.strip() for t in content_types.split(",")] since you filter out invalids below

if not all([self.domain_name, self.url, self.username,
self.password]):
if not all([domain_name, url, username,
password]):
Contributor:

make this a single line

p_local_uid = existing.UID()
self.sh.update_by_path(p_path, local_uid=p_local_uid)

existing = self.portal.unrestrictedTraverse(str(local_path), None)
Contributor:

Better ensure in translate_path that it returns a string than to typecast it here
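A sketch of that suggestion, with a plain dict standing in for whatever lookup translate_path really performs (the `path_mapping` argument is hypothetical):

```python
def translate_path(path_mapping, remote_path):
    """Look up the local counterpart of a remote path.

    Illustrative only: normalizing to a plain str happens here, inside
    the translation, so callers can drop the str() typecast before
    unrestrictedTraverse seen in the snippet above.
    """
    local_path = path_mapping.get(remote_path, remote_path)
    return str(local_path)
```

Moving the cast into the one method that produces the path keeps every call site free of type juggling.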

if existing:
    # Skip if its the portal object.
    if len(p_path.split("/")) < 3:
        return
Contributor:

check for the portal path instead of this. E.g.

portal = api.get_portal()
portal_path = api.get_path(portal)
if ppath == portal_path:
    return

parent_path = item.get("parent_path")
# Skip if the parent is portal object
if len(parent_path.split("/")) < 3:
    return
Contributor:

DRY;) Make a method that returns you the portal path and use it here as well for your check
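A minimal sketch of such a helper. In the real code the portal path would come from `api.get_path(api.get_portal())` as suggested earlier; the `portal_path` default used here is purely illustrative:

```python
def is_portal_path(path, portal_path="/senaite"):
    """Return True if `path` points at the portal object itself.

    Sketch of the DRY helper the review asks for, replacing the repeated
    len(path.split("/")) < 3 checks with one explicit comparison.
    """
    return path == portal_path
```

Both call sites above would then reduce to `if is_portal_path(p_path): return`.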

# TODO: Make sure reindexing won't fail!
try:
    api.get_object_by_uid(uid).reindexObject()
except Exception, e:
Contributor:

Your exception catch is too generic here.
E.g. it swallows both the case when api.get_object_by_uid fails and the case when reindexObject fails.
Try to divide this into two steps and find out which errors occur. The more specific we are, the better we can debug it later.
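The two-step split might look like the sketch below, with the lookup injected as a callable so the example stays self-contained (`safe_reindex` and its signature are illustrative, not from the PR):

```python
def safe_reindex(get_object_by_uid, uid, logger):
    """Reindex the object behind `uid`, logging which step failed.

    Sketch only: `get_object_by_uid` stands in for api.get_object_by_uid
    and `logger` for the module logger.
    """
    try:
        obj = get_object_by_uid(uid)
    except Exception as e:
        # Lookup failures (stale UID, missing object) are logged separately
        logger.error("Lookup failed for UID %s: %s" % (uid, e))
        return False
    try:
        obj.reindexObject()
    except Exception as e:
        # Reindex failures now clearly point at the catalog, not the lookup
        logger.error("Reindex failed for UID %s: %s" % (uid, e))
        return False
    return True
```

Even with both handlers still catching Exception, the log message now says which of the two steps broke.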

Contributor Author:

You are right that a bare Exception is not good at all, but 'reindexing' can fail with too many error types. I added that 'TODO' there to find out which method of the object (one that is not a field in the Schema) can fail, so that we actually do not need this error handling here... For now, to be safe, I think it is better to keep the generic Exception case, if you don't mind.

@ramonski ramonski merged commit 61c096c into senaite:master Feb 21, 2018