6650 export import mismatch #6669
Conversation
Thanks @lubitchv for the PR (and for including/updating unit tests!). We'll review this. @jggautier - do you want to take a quick look at the implementation here, either before a developer takes a look or at the same time?
Thanks, yes, taking a quick look while a developer takes a look would be helpful. I looked at the xml export example and have some questions, but I could use help getting this PR into a working instance of Dataverse to see how Dataverse maps the DDI metadata on import. (When I tried spinning up this branch on AWS it failed. I think I got it running locally with Vagrant, but I need to publish datasets to see the DDI exports, and I can't publish datasets without requesting and configuring a test DataCite account in my local instance.) I also have questions about the approach (i.e. using DDI rather than resolving the issues with using the Dataverse_JSON), but it sounds like that wouldn't be quick.
@jggautier I'm happy to spin up the branch if you have trouble. Will the default FAKE DOI provider be sufficient for your testing? Odum has "test" credentials we can use temporarily.
@donsizemore That would be great! The default fake DOI would be fine. Thanks!
@jggautier http://ec2-52-87-250-239.compute-1.amazonaws.com/ =) Sample data going in at the moment. Credentials coming in Slack.
@lubitchv I was worried at first about the amount of work required if you needed to re-add the metadata that was lost by using DDI to migrate dataset metadata, e.g. from fields in other metadatablocks, but @djbrooke let me know that you're only really concerned with the metadata that can be mapped to DDI from the citation, social science, and geospatial blocks. Your changes bring the DDI export into closer alignment with the codebook schema, so I'm very excited. I couldn't think of any integrations or applications outside of Dataverse that would be affected by the changes to the DDI export (which also affects the DDI metadata available over OAI-PMH), since the integrations/applications I know of are all using the dataverse_json export. As you pointed out, there will always be some information loss, but for the migration you're working on, would the loss be lessened if we considered adding more metadata to the DDI export? These are the fields I think it's possible to add, especially if the effort would be worth it for your use case:
Lastly, this PR properly organizes the distributor elements in the exported XML, but it also removes the Dataverse-based repository itself as a distributor of the study. To show what I mean using this XML as an example: currently Dataverse adds the repository as both a distributor of the metadata document (line 9) and of the study (line 47; ignore the use of ExtLink, which as you've probably seen isn't used now). This PR removes the second instance under the stdyDscr element. Was this intentional? Should the repository not be declared as a distributor of the study? Should it only be considered a distributor of the study's metadata? For reference, the dataset created while I tested this import and export code is at http://ec2-52-87-250-239.compute-1.amazonaws.com/dataset.xhtml?persistentId=doi:10.5072/FK2/ZNDIWD
Thank you @jggautier for the detailed review and suggestions. Yes, it would be useful for us to have the export align more closely with the DDI standard. Let me write out the fields that should be added, to make sure that I understood you correctly.
<stdyDscr>
<citation>
<titlStmt>
<titl>Replication Data for: Title</titl>
<IDNo agency="DOI">doi:10.5072/FK2/WKUKGV</IDNo>
<IDNo agency="OtherIDAgency1">OtherIDIdentifier1</IDNo>
<IDNo agency="OtherIDAgency2">OtherIDIdentifier2</IDNo>
</titlStmt>
....
</citation>
...
</stdyDscr>
<stdyDscr>
<rspStmt>
<AuthEnty affiliation="AuthorAffiliation1">LastAuthor1, FirstAuthor1</AuthEnty>
<AuthEnty affiliation="AuthorAffiliation2">LastAuthor2, FirstAuthor2</AuthEnty>
<othId role="Data Collector">LastContributor1, FirstContributor1</othId>
<othId role="Data Curator">LastContributor2, FirstContributor2</othId>
</rspStmt>
...
</stdyDscr>
<dataAccs>
<notes type="DVN:TOU" level="dv">CC0 Waiver</notes>
<notes type="DVN:TOA" level="dv">Terms of Access</notes>
...
</dataAccs>
<useStmt>
<citReq>Citation Requirements</citReq>
...
</useStmt>
<relPubl>RelatedPublicationCitation1, ark, RelatedPublicationIDNumber1, http://RelatedPublicationURL1.org</relPubl>
<relPubl>RelatedPublicationCitation2, arXiv, RelatedPublicationIDNumber2, http://RelatedPublicationURL2.org</relPubl>
should be:
<relPubl>
<citation>
<titlStmt>
<titl/>
<IDNo agency="ark">RelatedPublicationIDNumber1</IDNo>
</titlStmt>
<biblCit>RelatedPublicationCitation1</biblCit>
</citation>
<ExtLink URI="http://RelatedPublicationURL1.org"></ExtLink>
</relPubl>
<relPubl>
<citation>
<titlStmt>
<titl/>
<IDNo agency="arXiv">RelatedPublicationIDNumber2</IDNo>
</titlStmt>
<biblCit>RelatedPublicationCitation2</biblCit>
</citation>
<ExtLink URI="http://RelatedPublicationURL2.org"></ExtLink>
</relPubl>
Let me know if I made a mistake or misunderstood. I can and will add these fields to the export and import for this PR. Regarding the dataverse distributor, it is exported. You can see it in the export:
<distStmt>
<distrbtr>Root</distrbtr>
<distDate>2020-02-25</distDate>
</distStmt>
I do not know how one could import it, since the distributor is the name of the Dataverse installation and the date is the date of publishing in Dataverse. Regarding Geographic Coverage, it is not critical for us. It would be nice to know if it is possible at all, but it is not urgent.
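As an aside, the corrected relPubl nesting above can be sanity-checked with a short script (hypothetical, not part of this PR; the sample values are taken from the example above):

```python
import xml.etree.ElementTree as ET

# The nested relPubl structure proposed above (first related publication).
rel_publ = """
<relPubl>
  <citation>
    <titlStmt>
      <titl/>
      <IDNo agency="ark">RelatedPublicationIDNumber1</IDNo>
    </titlStmt>
    <biblCit>RelatedPublicationCitation1</biblCit>
  </citation>
  <ExtLink URI="http://RelatedPublicationURL1.org"/>
</relPubl>
"""

root = ET.fromstring(rel_publ)
id_no = root.find("./citation/titlStmt/IDNo")
print(id_no.get("agency"))               # ark
print(id_no.text)                        # RelatedPublicationIDNumber1
print(root.find("ExtLink").get("URI"))   # http://RelatedPublicationURL1.org
```

Parsing it this way also confirms the XML is well-formed, which the old flattened single-element form made harder to guarantee.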
Hi @lubitchv. This is great. I'd like to clarify what I meant in a few places, but might not be able to until this Friday or over the weekend. Is that okay?
Hi @jggautier. Yes, it is fine.
I agree with your points in 1-5. I didn't even know about the ExtLink element you're using for the related publication URL! That's great! About the Terms metadata, you caught what I meant to write about Data Access. My sentence construction was a little awkward. =) So would all of the Terms metadata look like this, including the logic for Terms of Use described in the comment next to the first "Notes" element?:
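A sketch of how the combined Terms metadata could look (the dataAccs/notes lines are taken from the example earlier in this thread; the useStmt children are the standard DDI 2.5 terms-of-use elements, so treat the specific field-to-element mappings as an assumption):

```xml
<dataAccs>
  <notes type="DVN:TOU" level="dv">CC0 Waiver</notes>
  <notes type="DVN:TOA" level="dv">Terms of Access</notes>
  <useStmt>
    <confDec>Confidentiality Declaration</confDec>
    <specPerm>Special Permissions</specPerm>
    <restrctn>Restrictions</restrctn>
    <citReq>Citation Requirements</citReq>
    <deposReq>Depositor Requirements</deposReq>
    <conditions>Conditions</conditions>
    <disclaimer>Disclaimer</disclaimer>
  </useStmt>
</dataAccs>
```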
More incoming about the distributor issue. It's just taking a while to write this clearly :) |
Yes, you are right. The Terms metadata should look like you described.
I just realized that in Dataverse 4.19.1 (and maybe from the start of Dataverse 4) the date entered in Dataverse's distribution date field (in the citation block) isn't mapped to anything in the DDI exports. In both the docDscr and stdyDscr sections of the DDI exports, the distDate is the date when the dataset was first published in the Dataverse installation. I can't find any GitHub issues or email threads about this (clarifying this in the current crosswalk shortly). I think your PR fixes it, so that this metadata, under the docDscr section, will look like this:
The metadata under the stdyDscr section will look like this:
But it also removes the repository's distrbtr line from the stdyDscr section. I've always assumed that line was included in the first two sections of the DDI export (docDscr and stdyDscr) because it was important to state that the repository (at the Dataverse installation level) is the distributor of both the DDI document and the study itself. I'd like to know if that assumption is right, but I haven't found any years-old metadata design documentation with that level of detail, and it's in the weeds, so I wouldn't expect anyone to remember. That's why I asked whether it's always appropriate for the repository to be declared as a distributor of the study, or whether it should only be considered a distributor of the study's DDI metadata document. I hope I've written this clearly (I'd be happy to have a call) and that it doesn't continue to unnecessarily hold up your migration. I think the data sharing community's different interpretations of fields like "distributor" and "producer" are behind discussions around "more flexible" dataset citations (#2297), and they won't be resolved soon. So I'm wondering if we could keep this metadata designed as is for now, so that the distStmt under stdyDscr looks like this:
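For illustration (values borrowed from the export example earlier in this thread; treat "Root" and the date as placeholders), keeping the repository as a distributor of the study would mean the stdyDscr citation contains something like:

```xml
<distStmt>
  ...
  <distrbtr>Root</distrbtr>
  <distDate>2020-02-25</distDate>
</distStmt>
```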
And when importing DDI, the repository's value in that distStmt could be handled separately. Discussion about whether or not it's always appropriate for the repository to be declared as a distributor of the study as well as of the DDI document can happen outside of this issue (and your migration). Thanks again for this! It's contributing to resolving the issue about DDI exports not being valid against the schema (#3648), which I plan to update after your great work.
@jggautier I do not know the answer to your question of whether dataverse should be in the distrbtr section at the study level. I asked librarians; maybe they will come up with the answer. The problem I have with putting dataverse as a distributor at the study level is that I do not know how to differentiate between distributors on import, especially from different dataverse instances with different names. I guess I can remember the dataverse distributor name from the docDscr section and compare it to the distributor names in the stdyDscr section. If a name is the same, ignore it, and only map the distributors with different names. I can do that.
Thanks! :) Hopefully getting an answer isn't difficult and doesn't hold up your migration.
The "source" attribute could be used to distinguish metadata added by the archive/repository versus metadata added by the producer/depositor. Would using the source attribute be a simpler method?:
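For example (a sketch; "Distributor1", "Root", and the date are placeholders, and source="archive" is the DDI attribute value indicating archive-supplied metadata):

```xml
<distStmt>
  <distrbtr>Distributor1</distrbtr>
  <distrbtr source="archive">Root</distrbtr>
  <distDate>2020-02-25</distDate>
</distStmt>
```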
Thanks @jggautier, that helped. I added source="archive". I also made all the changes that you suggested above and committed them to this branch. Please look at the updated dataset-create-new-all-ddi-fields.json and exportfull.xml. I think it is ready. Please let me know if you notice that I missed something.
@jggautier I got the answer from a librarian. You were right. Dataverse should be mentioned in both docDscr and stdyDscr. |
Thanks! About adding the repository as a distributor in both docDscr and stdyDscr, it was more of a question for me, too. I looked at the diffs for the dataset-create-new-all-ddi-fields.json and exportfull.xml files and they look good to me, too! I didn't notice anything missing from our convo :) |
Since @jggautier has reviewed and approved the metadata mapping, the code changes look solid to me, so I'm moving this along.
@djbrooke Since the creation of the DDI xml file happens on publishing, datasets that were already published will need to have the old cached DDI xml cleaned out and the export rerun. So yes, I think reExportAll will be needed.
@lubitchv ok, thanks, I'll add it |
Added in 770194d |
@lubitchv Would you refresh this branch from develop? I'm getting a flyway db error on deployment that I think is due to some outdated code that was removed. |
Update from develop IQSS
@kcondon I updated the branch from develop. |
Update from IQSS develop
@lubitchv Hi Victoria, hope things are well with you. Would you mind syncing this branch once more with develop? There were some export/multistore changes that would help. Also, apologies for this taking so long, have been a bit distracted lately :( |
Update from IQSS develop
No problem @kcondon I just updated the branch from IQSS develop. |
@lubitchv Thanks! Testing now |
@lubitchv I was able to create, export, and import according to your instructions, with some minor difficulty due to a bug in the api. When I compared the metadata of the exported and imported datasets, I saw that all the metadata was preserved, with just a few differences that I hope you can comment on. Update: I realized you mentioned the export limitation of Subject above, so we can ignore that one. The Notes field value in the exported dataset became "field name: value" in the Notes field of the imported dataset, i.e. Notes1 became Notes: Notes1. What do you think? I'll also check with @jggautier to see what he thinks.
@kcondon Yes, I think that is how it is supposed to work. There is no subject in DDI xml, so it is moved to keywords. It is also how it worked before.
@lubitchv Thanks, have merged this pr. |
What this PR does / why we need it: Fixes the existing mismatch between import and export DDI functions.
Which issue(s) this PR closes: #6650
Special notes for your reviewer: "Astronomy and Astrophysics", "Life Sciences" and "Journal" metadata do not have DDI-compliant fields and therefore cannot be exported/imported using DDI export/import, so they were not included in this PR.
Some fields that exist in dataverse (json) do not exist in DDI 2.5 and cannot be exported/imported. These are the fields:
Alternative URL
otherId fields
authorIdentifierScheme
authorIdentifier
subject does not exist in the DDI standard and is transformed into keywords.
contributor (does not exist in 2.5 but exists in 3.1)
For geospatial:
DDI does not have state and country; they go under geogCover.
For social sciences:
datasetLevelErrorNotes went into stdyDscr notes
Suggestions on how to test this: There is a file in
src/test/java/edu/harvard/iq/dataverse/export/ddi/dataset-create-new-all-ddi-fields.json
that has all the fields that are supposed to be exported/imported. One may import it using
curl -H "X-Dataverse-key: $API_TOKEN" --upload-file dataset-create-new-all-ddi-fields.json -X POST $SERVER_URL/api/dataverses/$DV_ALIAS/datasets
Then one will need to publish it, and then export the metadata using either the UI ("Export Metadata" -> DDI) or curl.
Then one can import the exported xml back using
curl -H "X-Dataverse-key: $API_TOKEN" -X POST --upload-file export.xml "$SERVER_URL/api/dataverses/$DV_ALIAS/datasets/:importddi?pid=new_pid&release=no"
(quoting the URL so the shell does not treat the "&" as a background operator)
Then one can compare metadata fields in UI.
The example of proper xml export with all the fields is in
src/test/java/edu/harvard/iq/dataverse/export/ddi/exportfull.xml
In DdiExportUtilTest.java there is a unit test,
testExportDDI(),
that converts the JSON to DDI and compares the result to exportfull.xml.
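For ad-hoc checks outside the unit test, a small script (hypothetical, not part of this PR) can diff the leaf values of two DDI exports, e.g. the original export and the export of the re-imported dataset:

```python
import xml.etree.ElementTree as ET

def leaf_values(xml_text):
    """Collect (tag, text) pairs for every leaf element, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    leaves = set()
    for el in root.iter():
        if len(el) == 0 and el.text and el.text.strip():
            tag = el.tag.split('}')[-1]  # drop any "{namespace}" prefix
            leaves.add((tag, el.text.strip()))
    return leaves

# Tiny stand-ins for the two exports being compared.
before = "<codeBook><stdyDscr><citation><titlStmt><titl>My Title</titl></titlStmt></citation></stdyDscr></codeBook>"
after = "<codeBook><stdyDscr><citation><titlStmt><titl>My Title</titl></titlStmt></citation></stdyDscr></codeBook>"

print(leaf_values(before) == leaf_values(after))  # True when no metadata was lost
```

A symmetric set difference (`leaf_values(before) ^ leaf_values(after)`) would show exactly which field values changed, such as the Notes1 vs "Notes: Notes1" difference discussed above.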