Fix real-time, incremental harvesting from Odum archive #3268
Comments
We should verify that these failures are reasonable. @landreev - please take 30 minutes to do this. Thanks! |
I'd like to suggest that we drop this from the 4.5 milestone as well. It's not just because it's a "good success ratio" - it's that investigating why their DVN3-produced DDIs are not getting imported would be largely wasted effort, because they are upgrading to Dataverse 4 pretty much as soon as we give it to them. They'll then re-export their datasets, and those new DDIs will be the ones we harvest, with their own, different problems... So instead I'd like to focus on investigating the failures we're getting while harvesting our production datasets, produced by our Dataverse 4 export. Figuring out and fixing whatever's causing those failures will most likely address any import issues with Odum's future Dataverse 4 exports as well. |
Leonid, in case this information is helpful, here's what we exported locally during our migration test run (August 10):
[dls@irss-dvn3-patchtest ~]$ ./versions_source_ http://localhost/dvn/ddi 1902.29
[dls@irss-dvn3-patchtest ~]$ ./versions_source_ http://localhost/dvn/ddi 10.15139/S3 |
I agree with Leonid but we have a much bigger issue. We need to get our datasets out of 3.x and into 4.x and the issues here might help diagnose the migration problems we are having since it uses the same export set you are reading here. |
OK - thanks all for the comments. I'll remove this from the 4.5 Milestone, but it sounds like we'll need to have some further discussion. |
@jonc1438 @donsizemore - as I mentioned over email, please send me a note when you're ready to bring us in to help on this one - thanks! |
Danny, we are getting closer to figuring it out. We do have one question. It seems that the migration process is finding a list of files that SEEM to belong to studies where someone REMOVED the files during the DRAFT phase, before they were released. These files were REMOVED from the file system and in reality do not exist, BUT the 3.x database does still track the metadata. During migration we see an error, and these files cause the file counts not to match between the new and old systems. We think this can be ignored, but it would be good to double check with Leonid that the files on our list truly are the ones deleted before publication/release, which never existed except in the database. This is over 1000 files for us. We are certain 100 of them are from a recent study where our staff deleted the files during the draft phase, and we have checked a couple of others that seem to be the same case, but it might be prudent to check the code to verify our observations. Does this make sense? Jon |
Hey @landreev - any thoughts on the zombie metadata? I looked for issues (both closed and open) that would be helpful to reference here, but nothing came up. |
@jonc1438 For the purposes of migration, we could easily modify the script to either a) skip the files no longer on the filesystem; or b) Add a description note to the file saying something like "This file was removed from the current version of the study and is no longer available" (so that if a user goes back to an older version, they'll see it and know that they cannot download it...) But this sounds pretty serious... Assuming this is really what it looks like - that some files were physically deleted when they were supposed to be kept around - ? |
L, we are finishing our investigation and Akio/Don can get into specifics, but from our testing with 3.x: if you have an unpublished draft with data files in the study and then delete the files prior to publication, they are correctly removed from the file system BUT their records are NOT removed from the database. We had some cases where our archivists worked with customers where this happened and we knew files were deleted prior to publication; those pointed us to the issue. We are 99% sure, but Akio is working on a final check and we should know for sure in the AM. Sounds like a bug in 3.6, but it could possibly be ignored during the migration. We will keep you in the loop. Jon |
@jonc1438 So yes, we should simply drop any StudyFiles that have no FileMetadatas. |
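For illustration only, here is a minimal sketch of how a migration helper might list those orphaned records before dropping them. The table and column names (studyfile, filemetadata, studyfile_id) and the connection details are assumptions about the DVN 3.x schema, not verified against the actual code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OrphanedStudyFileCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical DVN 3.x database; adjust URL/credentials for the real installation.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dvnDb", "dvnApp", "secret");
             Statement st = conn.createStatement();
             // Assumed table/column names: a StudyFile with no FileMetadata rows
             // is exactly the "zombie" case described above.
             ResultSet rs = st.executeQuery(
                 "SELECT sf.id, sf.filesystemname FROM studyfile sf " +
                 "LEFT JOIN filemetadata fm ON fm.studyfile_id = sf.id " +
                 "WHERE fm.id IS NULL")) {
            while (rs.next()) {
                // Candidates to drop (or annotate) before migration.
                System.out.println("Orphaned StudyFile id=" + rs.getLong(1)
                        + " name=" + rs.getString(2));
            }
        }
    }
}
```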
@kcondon now that Odum is on 4.5 and the dust has settled from our migration - just curious how harvesting looks now? |
@donsizemore thanks for the reminder - we actually haven't yet started harvesting from you, since your upgrade to 4.5... Yes, we were waiting for the dust to settle... but then kind of forgot about it. I'm on it - I'm going to run a test harvest on one of our test boxes first, before running it on our production server. I'll let you know if I run into any problems on your side in the process. |
hey @djbrooke we're on 4.9.4 and curious if you all have tried to harvest us and how it went? |
Hey @donsizemore, thanks for checking. It looks like there was a harvest in Nov 2016, after the above correspondence. I see 3,893 records here: https://dataverse.harvard.edu/dataverse/odum On the dashboard, our most recent attempt at Harvesting on June 30 shows IN PROGRESS. There are 5 records added from that day. The rest of the records are showing Nov 20, 2016. I'm guessing we need to revisit this. How many records are in the set you're making available to us? |
We offer about 30 smaller harvesting sets but I'm guessing you're pulling one of those two. |
Our harvesting client is configured to harvest the "odum_all" set. |
@landreev according to Thu-Mai "odum_all" is the one you want for now. |
@landreev though Jon suggests we start with a smaller set, like "carolinapoll" (which contains 54 datasets) before firing at a larger batch such as odum_all. |
I'll be running some experimental harvests from our test servers here. |
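For reference, a test harvest of a named OAI set can also be probed directly at the protocol level. The sketch below issues a standard OAI-PMH ListIdentifiers request against a hypothetical endpoint; the endpoint URL and the oai_ddi metadata prefix are assumptions, not the exact configuration used in this harvest:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class OaiSetProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical OAI-PMH endpoint; a Dataverse installation typically serves one under /oai.
        String endpoint = "https://dataverse.example.edu/oai";
        // Standard OAI-PMH verb and arguments; "carolinapoll" is one of the set names
        // mentioned above, "oai_ddi" is an assumed metadata prefix.
        String query = "?verb=ListIdentifiers&metadataPrefix=oai_ddi&set=carolinapoll";
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint + query).openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // The response is OAI-PMH XML; each <identifier> element is one record in the set.
                System.out.println(line);
            }
        }
    }
}
```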
It is this dataset that we can't process and import: Haven't figured out what it is yet - will finish tomorrow. |
We now have 4,191 Odum datasets harvested in production. |
fixes the issue with transaction scope when deleting harvested records, per OAI instructions; plus some cosmetic issues and minor optimizations. (#3268)
This is what's in the PR now:
Unless there are some checks further down in the IdServiceBean implementations, it really looks like we were attempting to delete remote DOIs whenever the corresponding harvested datasets were updated. (DestroyDatasetCommand is called whenever our harvester updates a harvested dataset - we delete and recreate it from scratch.) |
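A minimal sketch of the kind of guard this implies, assuming a destroy/update path that would otherwise deregister the DOI; the class, field, and method names here are illustrative, not the actual IdServiceBean or DestroyDatasetCommand code:

```java
// Illustrative only: the names below are assumptions, not the real Dataverse API.
// The point is the guard on harvested datasets, not the surrounding plumbing.
public class DoiCleanupSketch {

    static class Dataset {
        boolean harvested;
        String globalId;
        boolean isHarvested() { return harvested; }
        String getGlobalId() { return globalId; }
    }

    // Called from a destroy/update path. Harvested datasets only mirror a remote
    // record, so their DOIs are owned by the originating archive and must not be
    // deregistered here.
    void deleteIdentifierIfOwned(Dataset dataset) {
        if (dataset.isHarvested()) {
            // Skip: the DOI belongs to the remote installation, not to us.
            return;
        }
        deregisterRemoteDoi(dataset.getGlobalId());
    }

    void deregisterRemoteDoi(String globalId) {
        // Placeholder for the actual call to the DOI registration service.
        System.out.println("Deregistering " + globalId);
    }
}
```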
For test/QA: harvesting some real-life remote set (Odum, UVA, etc.) from scratch would be a good test, to verify that the resulting search cards are NOT all showing the local harvest date. |
Of the 3000+ datasets harvested from Odum, 50 failed.
For example, 1902.29/10497. The failures can be counted in the server log with:
grep -c " Exception processing getRecord()" server.log_2016-08-01T15*