
Fix real-time, incremental harvesting from Odum archive #3268

Closed
kcondon opened this issue Aug 11, 2016 · 24 comments · Fixed by #6087
Labels
Feature: Harvesting · Type: Bug (a defect) · User Role: Superuser (has access to the superuser dashboard and cares about how the system is configured)

Comments

@kcondon
Contributor

kcondon commented Aug 11, 2016

Of the 3000+ datasets harvested from Odum, 50 failed.

For example, 1902.29/10497. The failures were counted with:
grep -c " Exception processing getRecord()" server.log_2016-08-01T15*

@djbrooke
Contributor

We should verify that these failures are reasonable. @landreev - please take 30 minutes to do this. Thanks!

@landreev
Contributor

I'd like to suggest that we drop this off the 4.5 milestone as well.

It's not just because 50 failures out of 3000+ is a "good success ratio".

It's mostly that investigating why their DVN3-produced DDIs are not getting imported would be a wasted effort: they are upgrading to Dataverse 4 pretty much as soon as we give it to them. They will then re-export their datasets, and those new DDIs will be the ones we harvest - with their own, different problems.

So, instead, I'd like to focus on investigating the failures we're getting while harvesting our production datasets, produced by our Dataverse 4 export. Figuring out and fixing whatever is causing those failures will most likely also address any import issues with Odum's future Dataverse 4 installation.

@landreev landreev assigned kcondon and djbrooke and unassigned landreev Aug 17, 2016
@donsizemore
Contributor

Leonid, in case this information is helpful, here's what we exported locally during our migration test run (August 10):

[dls@irss-dvn3-patchtest ~]$ ./versions_source_ http://localhost/dvn/ddi 1902.29
3989 studies processed.
3730 released versions;
635 had versions other than released;
Total 4639 versions processed.

[dls@irss-dvn3-patchtest ~]$ ./versions_source_ http://localhost/dvn/ddi 10.15139/S3
278 studies processed.
230 released versions;
105 had versions other than released;
Total 360 versions processed.

@jonc1438

I agree with Leonid, but we have a much bigger issue: we need to get our datasets out of 3.x and into 4.x, and the failures here might help diagnose the migration problems we are having, since the migration uses the same export set you are reading here.
We have a long list of mismatches between the datasets and files on our 3.x box and our new 4.x box.
We are working to reconcile them, but if you have any tips, please let us know.

@djbrooke
Contributor

OK - thanks all for the comments. I'll remove this from the 4.5 Milestone, but it sounds like we'll need to have some further discussion.

@djbrooke djbrooke assigned landreev and djbrooke and unassigned kcondon, djbrooke and landreev Aug 17, 2016
@djbrooke
Contributor

@jonc1438 @donsizemore - as I mentioned over email, please send me a note when you're ready to bring us in to help on this one - thanks!

@jonc1438

Danny

We are getting closer to figuring it out. We do have one question.

The migration process is finding a list of files that SEEM to belong to studies where someone REMOVED the files during the DRAFT phase, before they were released.

These files were REMOVED from the file system and in reality no longer exist, BUT the 3.x database still tracks their metadata.

During migration we see an error for each of these, and they cause the file counts not to match between the new and old systems.

We think this can be ignored, but it would be good to double-check with Leonid that the files on our list truly are the ones deleted before publication/release, which never existed anywhere except in the database.

This is over 1,000 files for us.

We are certain that 100 of them are from a recent study where our staff deleted the files during the draft phase, and we have checked a couple of others that appear to fit the same pattern, but it might be prudent to check the code to verify our observations.

Does this make sense?

Jon


@djbrooke djbrooke assigned landreev and unassigned djbrooke Aug 18, 2016
@djbrooke
Contributor

Hey @landreev - any thoughts on the zombie metadata? I looked for issues (both closed and open) that would be helpful to reference here, but nothing came up.

@landreev
Contributor

@jonc1438
OK, I am confused. If the files were deleted from a draft, before they were released... - as in, if it was a new file, added to the draft version, and not available in any previous versions... then there shouldn't be any trace of it left in the database.
So we must be talking about files that already existed in a previously released version... But then, if such a file were deleted from a draft version, it was NOT supposed to be deleted from the filesystem! The whole idea was to only remove the filemetadata associated with this file in the current version, but to keep it in the previous versions, and on the filesystem...
So, is this a bug in DVN3?? @scolapasta Gustavo, does this ring a bell? Was that a known problem at any point during the DVN3 lifespan?

For the purposes of migration, we could easily modify the script to either a) skip the files no longer on the filesystem; or b) Add a description note to the file saying something like "This file was removed from the current version of the study and is no longer available" (so that if a user goes back to an older version, they'll see it and know that they cannot download it...)
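(Purely for illustration, here is a minimal Java-style sketch of what options (a) and (b) could look like; the class, record, and field names are hypothetical, and the actual migration script may well be written in a different language.)

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class MissingFileHandler {

        // Hypothetical stand-in for a file record read from the DVN3 export;
        // the real migration tooling's data structures will differ.
        record ExportedFile(String label, Path pathOnDisk, StringBuilder description) {}

        // Option (a): silently skip files that are no longer on the filesystem.
        static List<ExportedFile> skipMissing(List<ExportedFile> files) {
            List<ExportedFile> present = new ArrayList<>();
            for (ExportedFile f : files) {
                if (Files.exists(f.pathOnDisk())) {
                    present.add(f);
                }
            }
            return present;
        }

        // Option (b): keep the metadata, but append a note explaining that the
        // file was removed and can no longer be downloaded.
        static void annotateMissing(List<ExportedFile> files) {
            for (ExportedFile f : files) {
                if (!Files.exists(f.pathOnDisk())) {
                    f.description().append(" [This file was removed from the current"
                            + " version of the study and is no longer available.]");
                }
            }
        }
    }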

But this sounds pretty serious... Assuming this is really what it looks like - that some files were physically deleted when they were supposed to be kept around - ?

@jonc1438

L

We are finishing our investigation, and Akio/Don can get into specifics, but here is what we found from our testing with 3.x:

If you have an unpublished draft with data files in the study and then delete the files prior to publication, they are correctly removed from the file system BUT NOT removed from the file records in the database.

We had some cases where our archivists worked with customers and knew files had been deleted prior to publication; those cases pointed us to the issue.

We are 99% sure, but Akio is working on a final check, so we should know for sure in the AM.

It sounds like a bug in 3.6, but it could possibly be ignored during the migration.

We will keep you in the loop

Jon


@landreev
Contributor

@jonc1438
Jon, I spoke to Gustavo about this; he also showed me an email thread that was going on in parallel between him and Akio.
I believe I now understand what's going on, and no, it doesn't look as scary as I thought initially. So yes, it looks like those files were meant to be deleted - new files in unreleased versions. It's just that, due to some bug in DVN3, the StudyFile objects for these files were left behind, even though there were no longer any FileMetadata objects linking them to any StudyVersions.

So yes, we should simply drop any StudyFiles that have no FileMetadatas.
And if we need a GitHub issue for this, let's open a new one.
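(Purely for illustration - a minimal sketch of that cleanup, assuming direct SQL access to the DVN3 database from Java. The table and column names (studyfile, filemetadata, studyfile_id), the connection details, and the availability of the PostgreSQL JDBC driver are all assumptions, not verified against the actual DVN3 schema; counting the orphaned rows first and comparing against Odum's ~1,000 files would be a prudent sanity check before deleting anything.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DropOrphanedStudyFiles {
        public static void main(String[] args) throws Exception {
            // Assumed connection details for the DVN3 (PostgreSQL) database.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/dvnDb", "dvnapp", "secret");
                 Statement st = conn.createStatement()) {
                // Delete StudyFile rows that no FileMetadata row references,
                // i.e. the records left behind by the DVN3 draft-deletion bug.
                int removed = st.executeUpdate(
                        "DELETE FROM studyfile sf "
                      + "WHERE NOT EXISTS (SELECT 1 FROM filemetadata fm "
                      + "                  WHERE fm.studyfile_id = sf.id)");
                System.out.println("Removed " + removed + " orphaned StudyFile rows.");
            }
        }
    }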

@donsizemore
Contributor

@kcondon now that Odum is on 4.5 and the dust has settled from our migration - just curious how harvesting looks now?

@landreev
Contributor

@donsizemore thanks for the reminder - we actually haven't yet started harvesting from you, since your upgrade to 4.5... Yes, we were waiting for the dust to settle... but then kind of forgot about it. I'm on it - I'm going to run a test harvest on one of our test boxes first, before running it on our production server. I'll let you know if I run into any problems on your side in the process.

@pdurbin pdurbin added User Role: Superuser Has access to the superuser dashboard and cares about how the system is configured and removed zPriority 2: Moderate labels Jul 12, 2017
@donsizemore
Contributor

hey @djbrooke we're on 4.9.4 and curious if you all have tried to harvest us and how it went?

@djbrooke
Contributor

Hey @donsizemore, thanks for checking. It looks like there was a harvest in Nov 2016, after the above correspondence. I see 3,893 records here:

https://dataverse.harvard.edu/dataverse/odum

On the dashboard, our most recent harvesting attempt, on June 30, shows IN PROGRESS. There are 5 records added from that day; the rest of the records show Nov 20, 2016. I'm guessing we need to revisit this. How many records are in the set you're making available to us?

@donsizemore
Contributor

  • default: 4,206 datasets
  • odum_all: 4,184 datasets

We offer about 30 smaller harvesting sets but I'm guessing you're pulling one of those two.

@landreev
Contributor

Our harvesting client is configured to harvest the "odum_all" set.
Is this correct? Or should we be harvesting the default/"no name" set?

@donsizemore
Contributor

@landreev according to Thu-Mai "odum_all" is the one you want for now.

@donsizemore
Contributor

@landreev though Jon suggests we start with a smaller set, like "carolinapoll" (which contains 54 datasets) before firing at a larger batch such as odum_all.

@landreev
Contributor

I'll be running some experimental harvests from our test servers here.
We already have most of your catalog harvested; it's the increments that we keep having trouble with. I'll run some smaller harvests to figure out what's going on; but in the end we may have to resort to dropping what we have now and re-harvesting everything from scratch.
I suspect the problems we are having are with the harvested versions of some of your datasets that are very old; those may now be somehow incompatible with the current version of Dataverse, which would explain why we are having trouble updating them.
Please stay tuned, I'll report more as I figure out what's going on.

@djbrooke djbrooke added the Medium label Aug 6, 2019
@landreev landreev changed the title Harvest: Harvesting all available datasets from Odum works but failed to harvest 50 datasets. Fix real time, incremental harvesting from Odum archive Aug 8, 2019
@landreev landreev changed the title Fix real time, incremental harvesting from Odum archive Fix real-time, incremental harvesting from Odum archive Aug 8, 2019
@landreev
Contributor

landreev commented Aug 8, 2019

It is this dataset that we can't process and import:
doi:10.15139/S3/11917
- it doesn't fail outright; the harvest just gets stuck on this record. (It's not on the Odum side, it's on ours: whatever it is, it happens after we successfully download the DDI record, as we try to parse and import it.)

Haven't figured out what it is yet - will finish tomorrow.

@landreev
Contributor

We now have 4,191 Odum datasets harvested in production.
(this is odum_all; with a few import failures that I'll summarize).
This is more than the total reported for the set two weeks ago, so I'm assuming more datasets have been added and published since then.

landreev added a commit that referenced this issue Aug 12, 2019
fixes the issue with transaction scope when deleting harvested records, per OAI instructions;
plus some cosmetic issues and minor optimizations. (#3268)
@landreev landreev mentioned this issue Aug 12, 2019
@landreev
Contributor

This is what's in the PR now:

  • Fix for the main issue at hand - a deleted record specified by the server in ListIdentifiers locks up the harvest on the client under certain conditions. (That was the reason why some recent Odum and UVA harvests would not complete).
    plus some smaller improvements:
  • Many harvested datasets were lacking the "release date". The cosmetic manifestation is that the search card for such a dataset shows the date on which it was harvested, not the date on which it was published on the original Dataverse. Added a fix to populate the date from the OAI datestamp when it is not otherwise available.
  • Added some harvesting-specific checks to DestroyDatasetCommand.
    I should've thought about this when the "Destroying dataset should unregister the DOI" issue was going through dev. recently; the following code is in the command:
   if (idServiceBean.alreadyExists(doomed)) {
       idServiceBean.deleteIdentifier(doomed);
       for (DataFile df : doomed.getFiles()) {
           idServiceBean.deleteIdentifier(df);
       }
   }

Unless there are some checks further down in the IdServiceBean implementations, it really looks like we were attempting to delete remote DOIs whenever the corresponding harvested datasets were updated. (DestroyDatasetCommand is called whenever our harvester updates a harvested dataset - we delete and recreate it from scratch).
Whatever it was doing, this was not resulting in any observable errors or log warnings (and we cannot delete other people's DOIs, of course). But I added an if (!doomed.isHarvested()) check around it (see the sketch after this list).

  • Similarly the Destroy command was attempting to delete saved logos for harvested datasets; this resulted in error messages in server.log for each updated or deleted dataset. Also fixed.
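
For reference, here is a sketch of the quoted fragment with that guard applied (simplified; not the exact diff in the PR, and variable names follow the quote above):

    // Skip PID deletion entirely for harvested datasets: their DOIs are
    // registered and owned by the remote archive, not by us.
    if (!doomed.isHarvested()) {
        if (idServiceBean.alreadyExists(doomed)) {
            idServiceBean.deleteIdentifier(doomed);
            for (DataFile df : doomed.getFiles()) {
                idServiceBean.deleteIdentifier(df);
            }
        }
    }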

@landreev
Contributor

landreev commented Aug 13, 2019

For test/QA:
Verifying that harvesting is still working should probably be the main test.
If you want to reproduce the actual lockup condition, it'll require creating a small set (of 2+ records); harvesting it; then deleting one dataset and updating the remaining ones before re-exporting... Please ask if you have any questions. It's not hard to reproduce, but may require a few steps in just the right order, to emulate the condition we were seeing.

Harvesting a real-life remote set (Odum, UVA, etc.) from scratch could be a good test, to verify that the resulting search cards are NOT all showing the local harvest date.
Generally, wiping an entire harvested dataverse and re-harvesting it from scratch should be a standard procedure whenever problems with harvested content are observed. (It's pretty fast too, even for a large remote archive like Odum; deleting a large dataverse like that may take longer than the actual re-harvesting.) But an annoying side effect of this was often the misleading date on the search card.
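
As one way to sanity-check the reproduction setup, the sketch below fetches an OAI-PMH ListIdentifiers response and prints which record headers the server marks as deleted - the condition that used to lock up the client. The host name and set name are placeholders (substitute the test server and set created for the steps above), and the crude regex scan is for illustration only, not how the harvester actually parses the response.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ListDeletedRecords {
        public static void main(String[] args) throws Exception {
            // Placeholder server and set name.
            String url = "https://dataverse.example.edu/oai"
                    + "?verb=ListIdentifiers&metadataPrefix=oai_dc&set=testset";

            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());

            // Headers marked status="deleted" are the records the harvesting
            // client has to tolerate without getting stuck.
            Pattern deleted = Pattern.compile(
                    "<header status=\"deleted\">.*?<identifier>(.*?)</identifier>",
                    Pattern.DOTALL);
            Matcher m = deleted.matcher(resp.body());
            while (m.find()) {
                System.out.println("deleted record: " + m.group(1));
            }
        }
    }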

landreev added a commit that referenced this issue Aug 13, 2019