Content Type Filtering for Migration. #16

Conversation
Phew, a lot changed in here that I obviously missed... Please take care that you do not increase complexity unnecessarily. However, only minor changes are needed, Nihad, so it would be cool if you could take care of them;)
src/senaite/sync/browser/views.py (Outdated)

if ct:
    content_types = [t.strip() for t in ct.split(",") if t]
else:
    content_types = None
That's already the default value, so you can skip it:

content_types = form.get("content_types", None)
if content_types is not None:
    content_types = content_types.split(",")
src/senaite/sync/fetchstep.py (Outdated)

@@ -56,6 +56,7 @@ def verify(self):
     storage["credentials"]["url"] = self.url
     storage["credentials"]["username"] = self.username
     storage["credentials"]["password"] = self.password
+    storage["credentials"]["content_types"] = self.content_types
Not sure why the selected content_types need to be in the credentials storage... Please choose a better location to increase clarity.
I had to move it to the storage after I saw that those 'content types' are needed in both the fetch and import steps. It cannot be just a variable kept in memory, because currently a user can do the fetch, then stop the instance, and then run the import...
Actually, @juangallostra is currently working on making settings and registry records optional, so we will have to save those somewhere as well. So, any better idea where to save all these Import Options?
I meant that you should rather put it into the settings storage;) The credentials storage should hold only the credentials for the sync.
https://github.com/senaite/senaite.sync/blob/master/src/senaite/sync/browser/views.py#L142
Ah true, that's where they should be : )
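For illustration, the move could look roughly like this. A minimal sketch only; it assumes the sync storage exposes a "settings" mapping next to "credentials", as the linked views.py code suggests:

# Sketch: keep connection data and import options separate.
# The "settings" key here is an assumption based on the linked views.py code.
storage["credentials"]["url"] = self.url
storage["credentials"]["username"] = self.username
storage["credentials"]["password"] = self.password

storage["settings"]["content_types"] = self.content_types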
src/senaite/sync/fetchstep.py (Outdated)

return self.get_items("registry/{}".format(key))
better use "/".join(["registry", key])
class="form-control" | ||
id="content_types" | ||
tal:attributes="value view/content_types|nothing" | ||
name="content_types"/> |
I would render a multi-checkbox field here instead of requiring the user to type in the content types by hand.
Note: You also don't handle possible spaces in between in your code, e.g. Sample, Worksheed , FooType
You are right that using checkboxes is a way better idea. But we discussed with @xispa how to generate that list of checkboxes. Should it be hardcoded?! - Not a good solution, plus what about the different content types of each add-on?! Should we retrieve them from the Portal Types registry? Then the list is too long...
So we ended up letting the user enter them, considering that the person who manages the sync will have some knowledge of it.
And spaces are removed here: https://github.com/senaite/senaite.sync/pull/16/files#diff-32423e22290aaa3db2cdebe0c85fa752R106
Ok, got it. Then maybe you should do something like this afterwards:

portal_types = api.get_tool("portal_types")
content_types = filter(lambda ct: ct in portal_types, content_types)

Just a scribble, so you need to find out how to retrieve the currently registered types...
That is a good idea, I added it to form submission in the view.
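Put together, the form handling in the view could then look roughly like this. A sketch only; it assumes form and the api helper are available in the view and simply combines the split/strip snippet with the portal_types filter from the comments above:

content_types = form.get("content_types", None)
if content_types is not None:
    # Split the comma-separated user input and strip surrounding spaces
    content_types = [t.strip() for t in content_types.split(",") if t]
    # Keep only the types that are actually registered in the portal
    portal_types = api.get_tool("portal_types")
    content_types = filter(lambda ct: ct in portal_types, content_types)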
  # Knowing the catalog length compute the number of pages we will need
  # with the desired window size and overlap
  effective_window = window-overlap
- number_of_pages = (catalog_data["count"]/effective_window) + 1
+ number_of_pages = (cd["count"]/effective_window) + 1
Why this? Wasn't there once a method yield_items that went through the pages exactly like this?
Yes, there was and there is, thanks to you : )
But since we retrieve the objects in order of modification time ('uid_catalog' actually does that), and some objects can be modified in the source while the fetch process runs in the destination, we query some records twice. I.e. the first query retrieves records '0-1000', the second query '990-1990', and so on. So if up to 10 objects are modified during each fetch query, we are still safe.
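To make the overlap idea concrete, here is a small self-contained sketch of the paging arithmetic. The window, overlap, and count values as well as the print are illustrative; only the window/overlap formula mirrors the diff above:

window = 1000    # records fetched per query
overlap = 10     # records re-queried from the previous window
effective_window = window - overlap

count = 2500     # e.g. cd["count"] in the real code
number_of_pages = count // effective_window + 1  # integer division

for page in range(number_of_pages):
    start = page * effective_window
    end = start + window
    # The first 'overlap' records of each window were already covered by the
    # previous query, which tolerates up to 'overlap' objects being re-ordered
    # by modification time in between.
    print("query records {}-{}".format(start, end))

With these values the queries cover 0-1000, 990-1990, 1980-2980, matching the example above.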
Ah, tricky :) But what if 11 objects were modified 🙈 That was the reason I suggested that we just fetch the UID, portal_type and physical path in step 1, just to be able to create the object slugs. Then in step 2 I would fetch just in time to update the objects and probably re-fetch any missing objects...
We already fetch just UID, portal_type and path in the first step... Do you mean fetching all the objects at once? It can be done here by increasing the window size. But then, with a source of 800,000 objects, it filled up the memory, especially if the source and the destination are on the same machine.
Please let me know if I didn't clearly understand what you meant.
Hi @ramonski, could you please tell me what the state of this PR is? : )
Hi Nihad, thanks for your efforts!
Only some minor changes needed and then we can merge that in;)
content_types = form.get("content_types", None)
if content_types is not None:
    content_types = [t.strip() for t in content_types.split(",")
                     if t]
I think you can just do a [t.strip() for t in content_types.split(",")] since you filter out invalids below.
- if not all([self.domain_name, self.url, self.username,
-             self.password]):
+ if not all([domain_name, url, username,
+             password]):
make this a single line
p_local_uid = existing.UID()
self.sh.update_by_path(p_path, local_uid=p_local_uid)

existing = self.portal.unrestrictedTraverse(str(local_path), None)
Better ensure in translate_path that it returns a string than to typecast it here.
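A hedged sketch of what that could mean inside translate_path. The body is invented for illustration (the lookup helper is hypothetical); only the idea of returning a plain string instead of casting at the call site comes from the comment:

def translate_path(self, path):
    """Return the local counterpart of a remote path.

    Sketch only: the actual lookup logic is omitted. The point is that
    the return value is already a plain string, so callers can pass it
    straight to unrestrictedTraverse without wrapping it in str().
    """
    local_path = self.sh.get_local_path(path)  # hypothetical lookup
    return str(local_path)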
if existing:
    # Skip if its the portal object.
    if len(p_path.split("/")) < 3:
        return
check for the portal path instead of this. E.g.

portal = api.get_portal()
portal_path = api.get_path(portal)
if ppath == portal_path:
    return
parent_path = item.get("parent_path")
# Skip if the parent is portal object
if len(parent_path.split("/")) < 3:
    return
DRY;) Make a method that returns you the portal path and use it here as well for your check
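A minimal sketch of such a helper. The method name is made up; the body just wraps the portal-path comparison already suggested above so both call sites can reuse it:

def is_portal_path(self, path):
    """Return True if the given path points to the portal object itself.

    Sketch only: wraps the check suggested in the earlier comment.
    """
    portal = api.get_portal()
    portal_path = api.get_path(portal)
    return path == portal_path

Both checks could then read if self.is_portal_path(p_path): return and if self.is_portal_path(parent_path): return.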
# TODO: Make sure reindexing won't fail!
try:
    api.get_object_by_uid(uid).reindexObject()
except Exception, e:
Your exception catch is too generic here. E.g. it swallows both the case when api.get_object_by_uid fails and the case when your reindexObject call fails.
Try to divide this into two steps and try to find out which errors occur. The more specific we are, the better we can debug it later.
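A sketch of that two-step split. The logger calls and messages are illustrative (it assumes a module-level logger is available), and the broad except is kept from the original code; only the separation of the lookup from the reindexing is the point here:

# Step 1: resolve the object; a failure here means the UID cannot be resolved
try:
    obj = api.get_object_by_uid(uid)
except Exception, e:
    logger.error("Cannot resolve object for UID {}: {}".format(uid, repr(e)))
    return

# Step 2: reindex separately, so the two failure modes stay distinguishable
try:
    obj.reindexObject()
except Exception, e:
    logger.error("Reindexing {} failed: {}".format(uid, repr(e)))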
You are right that just Exception is not good at all, but 'reindexing' can fail with too many error types. I added that 'TODO' there to find out which method of the object (one that is not a field in the Schema) can fail, so that we eventually do not need this error handling here... For now, to be safe, I think it is better to keep the generic Exception case, if you don't mind.
Current Behavior

While fetching data, it is not possible to filter the objects from the source instance. It would be nice if users could choose the content types they want to import into the destination.

Expected Behavior after this PR

Users can enter content types separated by comma ( , ) to filter the objects that will be imported.

It is possible that sync will have to create some 'unwanted' objects if they are dependencies of the main objects. However, for such objects, dependencies will not be created recursively, which might cause an error while 'reindexing' them. A try/except was added especially for that case. Also, some log levels and messages were updated to avoid noise in the logs and make them more understandable.