
Fix real-time, incremental harvesting from Odum archive #3268

Closed
kcondon opened this issue Aug 11, 2016 · 24 comments · Fixed by #6087
Labels
Feature: Harvesting · Type: Bug (a defect) · User Role: Superuser (has access to the superuser dashboard and cares about how the system is configured)

Comments

@kcondon
Contributor

kcondon commented Aug 11, 2016

Of the 3000+ datasets harvested from Odum, 50 failed.

For example, 1902.29/10497. The failures were counted with:
grep -c " Exception processing getRecord()" server.log_2016-08-01T15*

@djbrooke
Contributor

We should verify that these failures are reasonable. @landreev - please take 30 minutes to do this. Thanks!

@landreev
Contributor

I'd like to suggest that we drop this off the 4.5 milestone as well.

It's not just because 50 failures out of 3000+ is a "good success ratio".

It's mostly that investigating why their DVN3-produced DDIs are not getting imported would be a wasted effort: they are upgrading to Dataverse 4 pretty much as soon as we give it to them. They will then re-export their datasets, and those new DDIs will be the ones we harvest - with their own, different problems.

So, instead, I'd like to focus on investigating the failures we're getting while harvesting our production datasets, produced by our Dataverse 4 export. Figuring out and fixing whatever is causing those failures will most likely also address any import issues with Odum's future Dataverse 4 installation.

@landreev landreev assigned kcondon and djbrooke and unassigned landreev Aug 17, 2016
@donsizemore
Contributor

Leonid, in case this information is helpful, here's what we exported locally during our migration test run (August 10):

[dls@irss-dvn3-patchtest ~]$ ./versions_source_ http://localhost/dvn/ddi 1902.29
3989 studies processed.
3730 released versions;
635 had versions other than released;
Total 4639 versions processed.

[dls@irss-dvn3-patchtest ~]$ ./versions_source_ http://localhost/dvn/ddi 10.15139/S3
278 studies processed.
230 released versions;
105 had versions other than released;
Total 360 versions processed.

@jonc1438

I agree with Leonid, but we have a much bigger issue: we need to get our datasets out of 3.x and into 4.x, and the failures here might help diagnose the migration problems we are having, since the migration uses the same export set you are reading here.
We have a long list of mismatches between the datasets and files on our 3.x box and our new 4.x box.
We are working to reconcile them, but if you have any tips, please let us know.

@djbrooke
Contributor

OK - thanks all for the comments. I'll remove this from the 4.5 Milestone, but it sounds like we'll need to have some further discussion.

@djbrooke djbrooke assigned landreev and djbrooke and unassigned kcondon, djbrooke and landreev Aug 17, 2016
@djbrooke
Contributor

@jonc1438 @donsizemore - as I mentioned over email, please send me a note when you're ready to bring us in to help on this one - thanks!

@jonc1438

Danny

We are getting closer to figuring it out. We do have one question.

The migration process is finding a list of files that SEEM to belong to studies where someone REMOVED the files during the DRAFT phase, before they were released.

These files were REMOVED from the file system and in reality no longer exist, BUT the 3.x database still tracks their metadata.

During migration we see an error for each of these, and they cause the file counts not to match between the new and old systems.

We think this can be ignored, but it would be good to double-check with Leonid that the files on our list truly are the ones deleted before publication/release, which never existed anywhere except in the database.

This is over 1,000 files for us.

We are certain that 100 of them are from a recent study where our staff deleted the files during the draft phase, and we have checked a couple of others that appear to fit the same pattern, but it might be prudent to check the code to verify our observations.

Does this make sense?

Jon


@djbrooke djbrooke assigned landreev and unassigned djbrooke Aug 18, 2016
@djbrooke
Contributor

Hey @landreev - any thoughts on the zombie metadata? I looked for issues (both closed and open) that would be helpful to reference here, but nothing came up.

@landreev
Contributor

@jonc1438
OK, I am confused. If the files were deleted from a draft, before they were released... - as in, if it was a new file, added to the draft version, and not available in any previous versions... then there shouldn't be any trace of it left in the database.
So we must be talking about files that already existed in a previously released version... But then, if such a file were deleted from a draft version, it was NOT supposed to be deleted from the filesystem! The whole idea was to only remove the filemetadata associated with this file in the current version, but to keep it in the previous versions, and on the filesystem...
So, is this a bug in DVN3?? @scolapasta Gustavo, does this ring a bell? Was that a known problem at any point during the DVN3 lifespan?

For the purposes of migration, we could easily modify the script to either a) skip the files no longer on the filesystem; or b) Add a description note to the file saying something like "This file was removed from the current version of the study and is no longer available" (so that if a user goes back to an older version, they'll see it and know that they cannot download it...)
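(Purely for illustration, here is a minimal Java-style sketch of what options (a) and (b) could look like; the class, record, and field names are hypothetical, and the actual migration script may well be written in a different language.)

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class MissingFileHandler {

        // Hypothetical stand-in for a file record read from the DVN3 export;
        // the real migration tooling's data structures will differ.
        record ExportedFile(String label, Path pathOnDisk, StringBuilder description) {}

        // Option (a): silently skip files that are no longer on the filesystem.
        static List<ExportedFile> skipMissing(List<ExportedFile> files) {
            List<ExportedFile> present = new ArrayList<>();
            for (ExportedFile f : files) {
                if (Files.exists(f.pathOnDisk())) {
                    present.add(f);
                }
            }
            return present;
        }

        // Option (b): keep the metadata, but append a note explaining that the
        // file was removed and can no longer be downloaded.
        static void annotateMissing(List<ExportedFile> files) {
            for (ExportedFile f : files) {
                if (!Files.exists(f.pathOnDisk())) {
                    f.description().append(" [This file was removed from the current"
                            + " version of the study and is no longer available.]");
                }
            }
        }
    }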

But this sounds pretty serious... Assuming this is really what it looks like - that some files were physically deleted when they were supposed to be kept around - ?

@jonc1438

L

We are finishing our investigation, and Akio/Don can get into specifics, but here is what we found from our testing with 3.x:

If you have an unpublished draft with data files in the study and then delete the files prior to publication, they are correctly removed from the file system BUT NOT removed from the file records in the database.

We had some cases where our archivists worked with customers and knew files had been deleted prior to publication; those cases pointed us to the issue.

We are 99% sure, but Akio is working on a final check, so we should know for sure in the AM.

It sounds like a bug in 3.6, but it could possibly be ignored during the migration.

We will keep you in the loop

Jon


@landreev
Contributor

@jonc1438
Jon, I spoke to Gustavo about this; he also showed me an email thread that was going on in parallel between him and Akio.
I believe I now understand what's going on, and no, it doesn't look as scary as I thought initially. So yes, it looks like those files were meant to be deleted - new files in unreleased versions. It's just that, due to some bug in DVN3, the StudyFile objects for these files were left behind, even though there were no longer any FileMetadata objects linking them to any StudyVersions.

So yes, we should simply drop any StudyFiles that have no FileMetadatas.
And if we need a GitHub issue for this, let's open a new one.
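(Purely for illustration - a minimal sketch of that cleanup, assuming direct SQL access to the DVN3 database from Java. The table and column names (studyfile, filemetadata, studyfile_id), the connection details, and the availability of the PostgreSQL JDBC driver are all assumptions, not verified against the actual DVN3 schema; counting the orphaned rows first and comparing against Odum's ~1,000 files would be a prudent sanity check before deleting anything.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DropOrphanedStudyFiles {
        public static void main(String[] args) throws Exception {
            // Assumed connection details for the DVN3 (PostgreSQL) database.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/dvnDb", "dvnapp", "secret");
                 Statement st = conn.createStatement()) {
                // Delete StudyFile rows that no FileMetadata row references,
                // i.e. the records left behind by the DVN3 draft-deletion bug.
                int removed = st.executeUpdate(
                        "DELETE FROM studyfile sf "
                      + "WHERE NOT EXISTS (SELECT 1 FROM filemetadata fm "
                      + "                  WHERE fm.studyfile_id = sf.id)");
                System.out.println("Removed " + removed + " orphaned StudyFile rows.");
            }
        }
    }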

@donsizemore
Contributor

@kcondon now that Odum is on 4.5 and the dust has settled from our migration - just curious how harvesting looks now?

@landreev
Contributor

@donsizemore thanks for the reminder - we actually haven't yet started harvesting from you, since your upgrade to 4.5... Yes, we were waiting for the dust to settle... but then kind of forgot about it. I'm on it - I'm going to run a test harvest on one of our test boxes first, before running it on our production server. I'll let you know if I run into any problems on your side in the process.

@pdurbin pdurbin added User Role: Superuser Has access to the superuser dashboard and cares about how the system is configured and removed zPriority 2: Moderate labels Jul 12, 2017
@donsizemore
Contributor

hey @djbrooke we're on 4.9.4 and curious if you all have tried to harvest us and how it went?

@djbrooke
Contributor

Hey @donsizemore, thanks for checking. It looks like there was a harvest in Nov 2016, after the above correspondence. I see 3,893 records here:

https://dataverse.harvard.edu/dataverse/odum

On the dashboard, our most recent harvesting attempt, on June 30, shows IN PROGRESS. There are 5 records added from that day; the rest of the records show Nov 20, 2016. I'm guessing we need to revisit this. How many records are in the set you're making available to us?

@donsizemore
Contributor

  • default: 4,206 datasets
  • odum_all: 4,184 datasets

We offer about 30 smaller harvesting sets but I'm guessing you're pulling one of those two.

@landreev
Contributor

Our harvesting client is configured to harvest the "odum_all" set.
Is this correct? Or should we be harvesting the default/"no name" set?

@donsizemore
Contributor

@landreev according to Thu-Mai "odum_all" is the one you want for now.

@donsizemore
Contributor

@landreev though Jon suggests we start with a smaller set, like "carolinapoll" (which contains 54 datasets) before firing at a larger batch such as odum_all.

@landreev
Contributor

I'll be running some experimental harvests from our test servers here.
We already have most of your catalog harvested; it's the increments that we keep having trouble with. I'll run some smaller harvests to figure out what's going on; but in the end we may have to resort to dropping what we have now and re-harvesting everything from scratch.
I suspect the problems we are having are with the harvested versions of some of your datasets that are very old; those may now be somehow incompatible with the current version of Dataverse, which would explain why we are having trouble updating them.
Please stay tuned, I'll report more as I figure out what's going on.

@djbrooke djbrooke added the Medium label Aug 6, 2019
@landreev landreev changed the title Harvest: Harvesting all available datasets from Odum works but failed to harvest 50 datasets. Fix real time, incremental harvesting from Odum archive Aug 8, 2019
@landreev landreev changed the title Fix real time, incremental harvesting from Odum archive Fix real-time, incremental harvesting from Odum archive Aug 8, 2019
@landreev
Contributor

landreev commented Aug 8, 2019

It is this dataset that we can't process and import:
doi:10.15139/S3/11917
- it doesn't fail outright; the harvest just gets stuck on this record. (It's not on the Odum side, it's on ours: whatever it is, it happens after we successfully download the DDI record, as we try to parse and import it.)

Haven't figured out what it is yet - will finish tomorrow.

@landreev
Contributor

We now have 4,191 Odum datasets harvested in production.
(this is odum_all; with a few import failures that I'll summarize).
This is more than the total reported for the set two weeks ago, so I'm assuming more datasets have been added and published since then.

landreev added a commit that referenced this issue Aug 12, 2019
fixes the issue with transaction scope when deleting harvested records, per OAI instructions;
plus some cosmetic issues and minor optimizations. (#3268)
@landreev landreev mentioned this issue Aug 12, 2019
@landreev
Contributor

This is what's in the PR now:

  • Fix for the main issue at hand - a deleted record specified by the server in ListIdentifiers locks up the harvest on the client under certain conditions. (That was the reason why some recent Odum and UVA harvests would not complete).
    plus some smaller improvements:
  • Many harvested datasets were lacking the "release date". The cosmetic manifestation is that the search card for such a dataset shows the date on which it was harvested, not the date on which it was published on the original Dataverse. Added a fix to populate the date from the OAI datestamp when it is not otherwise available.
  • Added some harvesting-specific checks to DestroyDatasetCommand.
    I should've thought about this when the "Destroying dataset should unregister the DOI" issue was going through dev. recently; the following code is in the command:
   if (idServiceBean.alreadyExists(doomed)) {
       idServiceBean.deleteIdentifier(doomed);
       for (DataFile df : doomed.getFiles()) {
           idServiceBean.deleteIdentifier(df);
       }
   }

Unless there are some checks further down in the IdServiceBean implementations, it really looks like we were attempting to delete remote DOIs whenever the corresponding harvested datasets were updated. (DestroyDatasetCommand is called whenever our harvester updates a harvested dataset - we delete and recreate it from scratch).
Whatever it was doing, this was not resulting in any observable errors or log warnings (and we cannot delete other people's DOIs, of course). But I added an if (!doomed.isHarvested()) check around it (see the sketch after this list).

  • Similarly the Destroy command was attempting to delete saved logos for harvested datasets; this resulted in error messages in server.log for each updated or deleted dataset. Also fixed.
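
For reference, here is a sketch of the quoted fragment with that guard applied (simplified; not the exact diff in the PR, and variable names follow the quote above):

    // Skip PID deletion entirely for harvested datasets: their DOIs are
    // registered and owned by the remote archive, not by us.
    if (!doomed.isHarvested()) {
        if (idServiceBean.alreadyExists(doomed)) {
            idServiceBean.deleteIdentifier(doomed);
            for (DataFile df : doomed.getFiles()) {
                idServiceBean.deleteIdentifier(df);
            }
        }
    }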

@landreev
Contributor

landreev commented Aug 13, 2019

For test/QA:
Verifying that harvesting is still working should probably be the main test.
If you want to reproduce the actual lockup condition, it'll require creating a small set (of 2+ records); harvesting it; then deleting one dataset and updating the remaining ones before re-exporting... Please ask if you have any questions. It's not hard to reproduce, but may require a few steps in just the right order, to emulate the condition we were seeing.

Harvesting a real-life remote set (Odum, UVA, etc.) from scratch could be a good test, to verify that the resulting search cards are NOT all showing the local harvest date.
Generally, wiping an entire harvested dataverse and re-harvesting it from scratch should be a standard procedure whenever problems with harvested content are observed. (It's pretty fast too, even for a large remote archive like Odum; deleting a large dataverse like that may take longer than the actual re-harvesting.) But an annoying side effect of this was often the misleading date on the search card.
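
As one way to sanity-check the reproduction setup, the sketch below fetches an OAI-PMH ListIdentifiers response and prints which record headers the server marks as deleted - the condition that used to lock up the client. The host name and set name are placeholders (substitute the test server and set created for the steps above), and the crude regex scan is for illustration only, not how the harvester actually parses the response.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ListDeletedRecords {
        public static void main(String[] args) throws Exception {
            // Placeholder server and set name.
            String url = "https://dataverse.example.edu/oai"
                    + "?verb=ListIdentifiers&metadataPrefix=oai_dc&set=testset";

            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());

            // Headers marked status="deleted" are the records the harvesting
            // client has to tolerate without getting stuck.
            Pattern deleted = Pattern.compile(
                    "<header status=\"deleted\">.*?<identifier>(.*?)</identifier>",
                    Pattern.DOTALL);
            Matcher m = deleted.matcher(resp.body());
            while (m.find()) {
                System.out.println("deleted record: " + m.group(1));
            }
        }
    }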

landreev added a commit that referenced this issue Aug 13, 2019