Harvesting - Broken dataset title links for non-Dataverse/OAI-PMH repositories #4964
Comments
Stumbled upon this issue again today. We are up to 2,470 datasets that result in 404s when you click on the dataset title link in the search card. The resulting 404 URL is formatted with the assumption that there is a "dataset.xhtml" page on the DataCite site.
Looking at the code, it appears there is search-include-fragment.xhtml
SolrSearchResult.java
@landreev at standup I mentioned that I was chatting with @jggautier about this issue this morning. One thing we noticed is that when I tried to set up a harvesting client from https://oai.datacite.org/oai it was taking FOREVER after I clicked "Next". Actually, the same thing happens when I click "Next" under "Edit Harvesting Client" like this. It just spins and spins: But then Julian clued me in to the fact that https://oai.datacite.org/oai has over 2000 sets. In order to know how long I should wait (10 minutes or so?) I hacked in a counter like this:
Obviously, the code above is a hack but I guess I'd suggest adding in some more logging ( Also, I can easily reproduce the bug with the client above. Here are all the parameters I used:
This seems to be the case. I'm helping resolve a different harvesting issue and I noticed that when setting up a harvesting client, the last of 4 steps is to choose the "Archive Type." When I set up harvesting of non-Dataverse repositories, I left the Archive Type as Dataverse v4+, and it seems like I was expected to choose "Generic OAI resource (DC)". ICPSR datasets are in the middle of being harvested into this dataverse on Demo Dataverse, and the dataset title links I've clicked take me to the records on ICPSR's website. (I chose "Generic OAI resource (DC)" since I didn't know why there was an option specifically for ICPSR.) I think that needing to choose "Generic OAI resource (DC)" in step 4 also implies that in step 2, I was expected to choose oai_dc as the metadata format.
The non-Dataverse repositories that Harvard Dataverse should re-harvest are:
The docs (http://guides.dataverse.org/en/latest/admin/harvestclients.html) don't mention each step. I'm not sure if there's a need for them to. In that screenshot, the modal window describes the importance of choosing the right Archive Type (I think I just overlooked it since I don't normally set up harvesting from non-Dataverse repositories): Maybe not making Dataverse v4.x the default will force the user to think about the Archive Type.
@jggautier, kudos for the comment with screenshots, proposing UX/UI improvements to the create harvesting client workflow in order to avoid this issue going forward. It should be easy enough to change the dropdown menu in Step 4 to be "Select...", forcing the user to make a selection. Are you also suggesting that we could combine Step 2 and Step 4 because of the relation of Metadata Format and Archive Type fields?
Thanks for reading this so quickly and so closely!
Not really. A Dataverse repository that wants to harvest from a non-Dataverse repository might want to choose a metadata format that's richer than Dublin Core. E.g. for harvesting ICPSR, I'm testing harvesting using DDI 2.5. I'm wondering what was meant by DC in the option "Generic OAI resource (DC)", and if DC should be removed.
To summarize:
I'd like to clarify, and redefine if needed, the scope of this issue. It was originally opened to reconfigure any existing harvesting clients to make the redirect links work. But it sounds like we are talking about changing the configuration dialogs (which are of course confusing in their current form).
Yes, probably.
The only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse. That may have been the rationale - ?
If I'm reading @pdurbin's report correctly, this issue - a very long list of sets - should be making configuring a new client (or reconfiguring an existing one) very slow, or impossible. I don't think it should affect harvesting from an already configured client though (during a harvesting run we never issue a "list sets" command). So if this archive cannot be harvested, it's probably something else.
Just to clarify, it is not necessary to re-harvest a remote archive for the "archive type" change to take effect.
It's not impossible, I just had to wait 10 minutes or so. I forget exactly how long. Not a great user experience, obviously. 😄 In practice, I put in some logging so I could watch server.log and not get frustrated by not knowing how long I'd have to wait. When it got to 1800 of 2000 or whatever I knew I was getting close to the end. 😄 So at minimum I'd suggest a logger.fine line that a sysadmin can bump up in the case of a long list of sets. Basically, a cleaned up version of the hack I mentioned at #4964 (comment) 😄
Made a PR.
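For illustration, the kind of progress logging suggested above might look roughly like the sketch below. It is not the code from the PR or from Dataverse itself: the class name `SetListingProgress`, the method `processSets`, and the `interval` parameter are all hypothetical, and the per-set work is left as a placeholder. It only shows a `java.util.logging` counter at `FINE` level that stays quiet by default.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch, not Dataverse code: logs progress while walking a long list of OAI sets.
final class SetListingProgress {
    private static final Logger logger = Logger.getLogger(SetListingProgress.class.getName());

    // Walks the set list and logs "n of total" every `interval` sets and once at the end.
    // Returns the number of sets processed.
    static int processSets(List<String> setSpecs, int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        int total = setSpecs.size();
        int processed = 0;
        for (String setSpec : setSpecs) {
            // Placeholder: the real per-set work (e.g. building a dropdown entry) would go here.
            processed++;
            if (processed % interval == 0 || processed == total) {
                // FINE is hidden at the default log level; a sysadmin can raise verbosity when needed.
                logger.log(Level.FINE, "Processed {0} of {1} OAI sets", new Object[]{processed, total});
            }
        }
        return processed;
    }
}
```

With an interval of 100, a server with about 2000 sets would produce roughly 20 log lines, enough to tell whether the dialog is still making progress.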
When Dataverse harvests from some non-Dataverse sources (e.g. ICPSR, DataCite), clicking on the dataset link doesn't take users to the source's dataset page.
You can see an example in this dataverse on Harvard Dataverse, where a set of metadata records from DataCite was harvested. Clicking on the dataset title link (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.6141/tw-srda-af010014-1) takes you to DataCite's 404 Not Found page (https://oai.datacite.org/dataset.xhtml?persistentId=doi:10.6141/tw-srda-af010014-1). Clicking on the DOI link (https://doi.org/10.6141/tw-srda-af010014-1) in the citation box takes you to the source's dataset page.
(On Harvard Dataverse's "ICPSR Harvested Dataverse", clicking on the dataset titles correctly takes you to the sources' dataset pages. I'm not sure how those dataset records were collected into that ICPSR Harvested Dataverse. But they're out of sync with the records that ICPSR makes available over OAI-PMH, and this bug is preventing Harvard Dataverse from updating the datasets it's harvested from ICPSR.)
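To make the failure mode concrete, here is a minimal sketch of how a title link could end up pointing at a non-existent page. It is not the actual Dataverse implementation: the class `HarvestedLinkSketch`, the enum `ArchiveType`, and the method `titleLink` are hypothetical names, and the DOI-resolver fallback for generic archives is an assumption chosen only because the DOI link is the one that works in the example above.

```java
// Hypothetical sketch, not Dataverse code: how the archive type could decide the title link.
final class HarvestedLinkSketch {
    enum ArchiveType { DATAVERSE, GENERIC_OAI }

    static String titleLink(ArchiveType type, String archiveBaseUrl, String persistentId) {
        if (type == ArchiveType.DATAVERSE) {
            // Assumes the remote host is a Dataverse installation that serves dataset.xhtml.
            // For a non-Dataverse host such as oai.datacite.org this page does not exist (404).
            return archiveBaseUrl + "/dataset.xhtml?persistentId=" + persistentId;
        }
        // Assumption: for a generic OAI archive, fall back to the DOI resolver.
        if (persistentId.startsWith("doi:")) {
            return "https://doi.org/" + persistentId.substring("doi:".length());
        }
        // No better information available: link to the archive itself.
        return archiveBaseUrl;
    }
}
```

Calling `titleLink(ArchiveType.DATAVERSE, "https://oai.datacite.org", "doi:10.6141/tw-srda-af010014-1")` reproduces the broken URL quoted above, while the `GENERIC_OAI` branch yields the working `https://doi.org/...` link.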
Could we investigate what's going on?
These are two github issues that might be related:
#4831
#4707