Harvesting - Broken dataset title links for non-Dataverse/OAI-PMH repositories #4964

Closed
jggautier opened this issue Aug 15, 2018 · 11 comments · Fixed by #6686

@jggautier (Contributor) commented Aug 15, 2018

When Dataverse harvests from some non-Dataverse sources (e.g. ICPSR, DataCite), clicking on the dataset link doesn't take users to the source's dataset page.

You can see an example in this dataverse on Harvard Dataverse, where a set of metadata records from DataCite was harvested. Clicking on the dataset title link (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.6141/tw-srda-af010014-1) takes you to DataCite's 404 Not Found page (https://oai.datacite.org/dataset.xhtml?persistentId=doi:10.6141/tw-srda-af010014-1). Clicking on the DOI link (https://doi.org/10.6141/tw-srda-af010014-1) in the citation box takes you to the source's dataset page.

(On Harvard Dataverse's "ICPSR Harvested Dataverse", clicking on the dataset titles correctly takes you to the sources' dataset pages. I'm not sure how those dataset records were collected into that ICPSR Harvested Dataverse, but they're out of sync with the records that ICPSR makes available over OAI-PMH, and this bug is preventing Harvard Dataverse from updating the datasets it has harvested from ICPSR.)

Could we investigate what's going on?

These two GitHub issues might be related:
#4831
#4707

@jggautier jggautier changed the title Investigating broken links to harvested ICPSR datasets when using OAI-PMH to harvest Investigating broken links to harvested datasets when using OAI-PMH to harvest May 17, 2019
@jggautier jggautier changed the title Investigating broken links to harvested datasets when using OAI-PMH to harvest Investigating broken links to harvested datasets when using OAI-PMH to harvest from non-Dataverse repositories May 17, 2019
@mheppler (Contributor) commented

Stumbled upon this issue again today. We are up to 2,470 datasets that result in 404s when you click on the dataset title link in the search card.

The resulting 404 URL is formatted with the assumption that there is a "dataset.xhtml" page on the DataCite site.

https://oai.datacite.org/dataset.xhtml?persistentId=doi:10.6141/TW-SRDA-E89101-1

Looking at the code, there is return remoteArchiveUrl; logic, but it appears that it isn't being applied properly.

search-include-fragment.xhtml

<!--DATASET CARDS-->
                    <div class="datasetResult clearfix" jsf:rendered="#{result.type == 'datasets'}">
                        <div class="card-title-icon-block">
                            ...
                            <a href="#{!SearchIncludeFragment.rootDv and !result.isInTree ? result.datasetUrl : widgetWrapper.wrapURL(result.datasetUrl)}" target="#{(!SearchIncludeFragment.rootDv and !result.isInTree and widgetWrapper.widgetView) or result.harvested ? '_blank' : ''}">
                                <h:outputText value="#{result.title}" style="padding:4px 0;" rendered="#{result.titleHighlightSnippet == null}"/>
                                ...

SolrSearchResult.java

    public String getDatasetUrl() {
        String failSafeUrl = "/dataset.xhtml?id=" + entityId + "&versionId=" + datasetVersionId;
        if (identifier != null) {
            /**
             * Unfortunately, colons in the globalId (doi:10...) are converted
             * to %3A (doi%3A10...). To prevent this we switched many JSF tags
             * to a plain "a" tag with an href as suggested at
             * http://stackoverflow.com/questions/24733959/houtputlink-value-escaped
             */
            String badString = "null";
            if (!identifier.contains(badString)) {
                if (entity != null && entity instanceof Dataset) {
                    if (this.isHarvested() && ((Dataset)entity).getHarvestedFrom() != null) {
                        String remoteArchiveUrl = ((Dataset) entity).getRemoteArchiveURL();
                        if (remoteArchiveUrl != null) {
                            return remoteArchiveUrl;
                        }
                        return null;
                    }
                }
                if (isDraftState()) {
                    return "/dataset.xhtml?persistentId=" + identifier + "&version=DRAFT";
                }
                return "/dataset.xhtml?persistentId=" + identifier;
            } else {
                logger.info("Dataset identifier/globalId contains \"" + badString + "\" perhaps due to https://github.com/IQSS/dataverse/issues/1147 . Fix data in database and reindex. Returning failsafe URL: " + failSafeUrl);
                return failSafeUrl;
            }
        } else {
            logger.info("Dataset identifier/globalId was null. Returning failsafe URL: " + failSafeUrl);
            return failSafeUrl;
        }
    }

@mheppler mheppler changed the title Investigating broken links to harvested datasets when using OAI-PMH to harvest from non-Dataverse repositories Harvesting - Broken dataset title links for non-Dataverse/OAI-PMH repositories Jul 18, 2019
@djbrooke (Contributor) commented Feb 5, 2020

  • The URLs assume a dataset link for each harvested record, which is probably the issue
  • This should be addressed through a configuration change (not a code change, hopefully) of the harvesting client, and verified
  • We should identify what clients we expect to change in Harvard Dataverse
  • We should add/verify docs

@djbrooke djbrooke added the Medium label Feb 5, 2020
@landreev landreev self-assigned this Feb 10, 2020
@pdurbin (Member) commented Feb 10, 2020

@landreev at standup I mentioned that I was chatting with @jggautier about this issue this morning.

One thing we noticed is that when I tried to set up a harvesting client from https://oai.datacite.org/oai it was taking FOREVER after I clicked "Next".

Actually, the same thing happens when I click "Next" under "Edit Harvesting Client" like this. It just spins and spins:

[Screenshot: the "Edit Harvesting Client" dialog spinning after clicking "Next" (Screen Shot 2020-02-10 at 2 32 15 PM)]

But then Julian clued me in to the fact that https://oai.datacite.org/oai has over 2000 sets. In order to know how long I should wait (10 minutes or so?), I hacked in a counter like this:

$ git diff
diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java b/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
index e4642fe0a..bd805bef2 100644
--- a/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
+++ b/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
@@ -157,7 +157,10 @@ public class OaiHandler implements Serializable {
         
         List<String> sets = new ArrayList<>();
 
+        int count = 0;
         while ( setIter.hasNext()) {
+            count++;
+            System.out.println("on set " + count);
             Set set = setIter.next();
             String setSpec = set.getSpec();
             /*

Obviously, the code above is a hack but I guess I'd suggest adding in some more logging (logger.fine, probably) if you feel like it, while you're in this code.

Also, I can easily reproduce the bug with the client above. Here are all the parameters I used:

@jggautier (Contributor, Author) commented

> This should be addressed through a configuration change (not a code change, hopefully) of the harvesting client, and verified

This seems to be the case. I'm helping resolve a different harvesting issue and I noticed that when setting up a harvesting client, the last of 4 steps is to choose the "Archive Type." When I set up harvesting of non-Dataverse repositories, I left the Archive Type as Dataverse v4+, and it seems like I was expected to choose "Generic OAI resource (DC)".

[Screenshot: step 4 of the harvesting client setup, showing the "Archive Type" dropdown (Screen Shot 2020-02-12 at 12 24 11 PM)]

ICPSR datasets are in the middle of being harvested into this dataverse on Demo Dataverse, and the dataset title links I've clicked take me to the records on ICPSR's website. (I chose "Generic OAI resource (DC)" since I didn't know why there was an option specifically for ICPSR.)

I think that needing to choose "Generic OAI resource (DC)" in step 4 also implies that in step 2 I was expected to choose oai_dc as the metadata format.

> We should identify what clients we expect to change in Harvard Dataverse

The non-Dataverse repositories that Harvard Dataverse should re-harvest are:

  • SRDA (which is a set in DataCite's OAI-PMH feed, and Harvard Dataverse might be having a problem getting the large list of sets in DataCite's feed, as @pdurbin wrote earlier)
  • ICPSR (since the existing records in Harvard Dataverse are stale, and hopefully using OAI-PMH can keep them up to date)

> We should add/verify docs

The docs (http://guides.dataverse.org/en/latest/admin/harvestclients.html) don't mention each step. I'm not sure if there's a need for them to.

In that screenshot, the modal window describes the importance of choosing the right Archive Type (I think I just overlooked it since I don't normally set up harvesting from non-Dataverse repositories):

[Screenshot: the harvesting client modal describing the importance of choosing the right Archive Type (Screen Shot 2020-02-12 at 2 23 46 PM)]

Maybe not making Dataverse v4.x the default would force the user to think about the Archive Type.

@mheppler (Contributor) commented

@jggautier, kudos for the comment with screenshots, proposing UX/UI improvements to the create harvesting client workflow in order to avoid this issue going forward.

It should be easy enough to change the dropdown menu in Step 4 to be "Select...", forcing the user to make a selection. Are you also suggesting that we could combine Step 2 and Step 4 because of the relation of Metadata Format and Archive Type fields?

@jggautier (Contributor, Author) commented

Thanks for reading this so quickly and so closely!

> Are you also suggesting that we could combine Step 2 and Step 4 because of the relation of Metadata Format and Archive Type fields?

Not really. A Dataverse repository that wants to harvest from a non-Dataverse repository might want to choose a metadata format that's richer than Dublin Core. E.g. for harvesting ICPSR, I'm testing harvesting using DDI 2.5. I'm wondering what was meant by DC in the option "Generic OAI resource (DC)", and if DC should be removed.

@djbrooke djbrooke assigned jggautier and unassigned landreev Feb 13, 2020
@jggautier (Contributor, Author) commented Feb 13, 2020

To summarize:

  • I overlooked the "Archive Type", which I should have changed to "Generic OAI resource (DC)" when harvesting from non-Dataverse repositories. I think it's because I've never had to think about changing it. Is it important that the default is "Dataverse v4+"? Should there be no default (or default is "Select...") so that the user is forced to make a selection?
  • Why does "Generic OAI resource (DC)" include that "(DC)", which I take to mean Dublin Core, and can the "(DC)" be removed? It's possible to harvest from non-Dataverse repositories using metadata formats other than Dublin Core.
  • Fixing the dataset title links for SRDA datasets is blocked by the bug that @pdurbin reported, where the large number of sets in DataCite's OAI-PMH feed might somehow be preventing Dataverse from re-harvesting records in the SRDA set (GESIS.SRDA). Should this be its own GitHub issue?
  • In an issue in the Harvard Dataverse repo, I'll add info about re-harvesting ICPSR datasets.

@landreev (Contributor) commented Feb 18, 2020

I'd like to clarify, and redefine if needed, the scope of this issue. It was originally opened to reconfigure existing harvesting clients so that the redirect links work. But it sounds like we are now talking about changing the configuration dialogs. (It is, of course, confusing in its current form.)

To summarize:

> Is it important that the default is "Dataverse v4+"? Should there be no default (or default is "Select...") so that the user is forced to make a selection?

Yes, probably.

> Why does "Generic OAI resource (DC)" include that "(DC)", which I take to mean Dublin Core, and can the "(DC)" be removed? It's possible to harvest from non-Dataverse repositories using metadata formats other than Dublin Core.

The only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse. That may have been the rationale?

> Fixing the dataset title links for SRDA datasets is blocked by the bug that @pdurbin reported, where the large number of sets in DataCite's OAI-PMH feed might somehow be preventing Dataverse from re-harvesting records in the SRDA set (GESIS.SRDA). Should this be its own GitHub issue?

If I'm reading @pdurbin's report correctly, this issue - a very long list of sets - should be making configuring a new client (or reconfiguring an existing one) very slow, or impossible. I don't think it should affect harvesting from an already configured client, though. (During a harvesting run we never issue a ListSets command.) So if this archive cannot be harvested, it's probably something else.
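
For reference, a harvesting run issues OAI-PMH requests such as ListRecords (or ListIdentifiers plus GetRecord) against the configured set, never ListSets. For the SRDA set mentioned above, such a request would look like:

https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=GESIS.SRDA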

@landreev (Contributor) commented Feb 24, 2020

Just to clarify, it is not necessary to re-harvest a remote archive for the "archive type" change to take effect.
The redirect URLs are generated in real time, so the change takes effect immediately.
For the SRDA archive, it does appear to be impossible to make the change through the UI (because of the bug with the set lists described above). But it is possible to do it in the database directly:
UPDATE harvestingclient SET harveststyle='default' WHERE name='srda';
This doesn't really fix it for the archive though; there's no 404 anymore - which is a step up - but the redirect is now showing a bland/generic OAI page on their side. That is because we can't really deduce the remote URL from what they are giving us in the DC metadata (for example: view-source:https://oai.datacite.org/oai?verb=GetRecord&identifier=doi:10.6141/tw-srda-aa000001-1&metadataPrefix=oai_dc).
What we should be doing instead is redirecting to the doi: resolver (for the dataset above - https://doi.org/10.6141/TW-SRDA-AA000001-1).
But for this we'll need a code change (will make a PR shortly).
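
For illustration only, a minimal sketch of that kind of fallback. The helper below is hypothetical (the actual change is in PR #6686), the "dataverse" style string is an assumption, and 'default' is the harvest style from the UPDATE statement above:

    // Sketch: choose the redirect target for a harvested dataset's title link.
    // harvestStyle and archiveUrl come from the harvesting client config;
    // persistentId is the dataset's global ID, e.g. "doi:10.6141/TW-SRDA-AA000001-1".
    String remoteUrlFor(String harvestStyle, String archiveUrl, String persistentId) {
        if ("dataverse".equals(harvestStyle)) {
            // Dataverse-to-Dataverse harvests really do have a dataset.xhtml page:
            return archiveUrl + "/dataset.xhtml?persistentId=" + persistentId;
        }
        // Generic OAI archives (harveststyle 'default'): the remote landing page
        // can't be deduced from the DC metadata, so redirect to the resolver,
        // e.g. https://doi.org/10.6141/TW-SRDA-AA000001-1
        if (persistentId.startsWith("doi:")) {
            return "https://doi.org/" + persistentId.substring(4);
        }
        if (persistentId.startsWith("hdl:")) {
            return "https://hdl.handle.net/" + persistentId.substring(4);
        }
        return null; // no usable redirect target
    }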

@pdurbin (Member) commented Feb 25, 2020

> If I'm reading @pdurbin's report correctly, this issue - a very long list of sets - should be making configuring a new client (or reconfiguring an existing one) very slow, or impossible.

It's not impossible, I just had to wait 10 minutes or so. I forget exactly how long. Not a great user experience, obviously. 😄 In practice, I put in some logging so I could watch server.log and not get frustrated by not knowing how long I'd have to wait. When it got to 1800 of 2000 or whatever, I knew I was getting close to the end. 😄

So at minimum I'd suggest a logger.fine line that a sysadmin can bump up in the case of a long list of sets. Basically, a cleaned-up version of the hack I mentioned at #4964 (comment), along the lines of the sketch below. 😄
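
For illustration, a sketch of what that cleaned-up version could look like in OaiHandler, assuming the class has (or gets) a logger field; only the logging line differs from the diff above:

        List<String> sets = new ArrayList<>();

        int count = 0;
        while (setIter.hasNext()) {
            count++;
            // Off by default; a sysadmin can raise the log level to watch
            // progress on archives with thousands of sets (e.g. DataCite):
            logger.fine("processing OAI set " + count);
            Set set = setIter.next();
            String setSpec = set.getSpec();
            // ... (rest unchanged)
        }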

@landreev (Contributor) commented

Made a PR.
Checked in a simple/crude solution for the "too many sets" issue. It is definitely an improvement over the current situation (which is that you cannot set up harvesting from datacite.org, nor edit any already-created clients harvesting from datacite.org). I cannot justify spending any more time on this issue; it's already a bit outside the original scope, plus I'm not aware of any other OAI archive with the same problem.
