Heracles reports do not insert final results when using Spark #2112
Comments
Thanks for the suggestion. We're just completing the 2.12.0 release, but this is something we can push into 2.12.1 afterwards if we don't have time to squeeze it in.
Thanks. There is also some discussion on the OHDSI forums: https://forums.ohdsi.org/t/databricks-spark-coming-to-ohdsi-stack/14545/13
Per @alondhe:
Shouldn't this be handled by the core SqlRender function? I'm not supportive of putting dialect-specific rules in WebAPI when dialect-specific rules should be handled by these external libraries (so that you get the handling without having to make a special case every place you use it). Is this a simple matter of SQL rendering, or is it something external to SQL rendering that has to be done directly on the connection?
SqlRender's sparkHandleInsert() can be used, but it has to be invoked during any SQL command fired off by WebAPI. This is because we need the active connection in order to run a "describe table" to get the full column list and rewrite the insert. So unfortunately, dialect-specific rules are needed in WebAPI -- and we have done this already. Refactoring all inserts to name all columns explicitly is the cleanest approach, but also a heavy lift. If we did that, we could simply skip any Spark-specific patterns in WebAPI SQL executions.
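For illustration, here is a minimal sketch of the kind of rewrite sparkHandleInsert() has to perform, assuming only a plain JDBC connection. The class and method names are hypothetical and this is not the actual SqlRender implementation -- it only shows why a live connection is needed:

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class DescribeTableSketch {

    // Full, ordered column list of the target table, obtained from the live
    // connection -- the "describe table" step referred to above. JDBC returns
    // getColumns() rows ordered by ordinal position, so the list preserves
    // the table's column order.
    static List<String> describeColumns(Connection conn, String schema, String table)
            throws SQLException {
        List<String> columns = new ArrayList<>();
        DatabaseMetaData meta = conn.getMetaData();
        try (ResultSet rs = meta.getColumns(null, schema, table, null)) {
            while (rs.next()) {
                columns.add(rs.getString("COLUMN_NAME"));
            }
        }
        return columns;
    }

    // Builds an INSERT header that names every column of the target table.
    // The caller must then pad the SELECT with placeholder values (e.g. NULL,
    // '' or now()) for any column the original statement omitted.
    static String fullColumnInsert(Connection conn, String schema, String table)
            throws SQLException {
        List<String> columns = describeColumns(conn, schema, table);
        return "INSERT INTO " + schema + "." + table
                + " (" + String.join(", ", columns) + ")";
    }
}
```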
OK, it's messy, but I think we have to commit to writing cross-platform SQL (with respect to SqlRender). There's talk about creating an OHDSI-SQL specification, and part of that would include naming your columns and even matching your INSERTs to your SELECTs. I'll take this one on over time if necessary.
@chrisknoll, sounds like you are suggesting refactoring the relevant SQL. If so, I'm willing to help if I can get some guidance. I've confirmed that all of the queries prior to the insert statement run correctly. So, specifically for this Heracles function, it appears that only four things need to change: (1) change the contents of https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesResults.sql to …
Any chance that can get incorporated into the 2.12.1 release?
Yes, absolutely: if we can get a PR put together with the changes, we can incorporate it into a hotfix. If you can make the above changes, all the better.
@chrisknoll, thanks. I added pull request #2138. I'm not able to compile and test WebAPI myself, but I was able to test the SQL statements it should generate.
@TomWhite-MedStar: Hi, I need a few days to recover from all the activity, but I'm not ignoring you. Let me get back to you with final thoughts, and we can push this forward then.
@chrisknoll, any chance this can get into the 2.12.1 release?
I haven't had a chance to dig into this, and I won't have time any time soon, so I left a comment on the PR saying that I can approve it if it works for Spark and other dialects. Once it's merged to master, Odysseus can pull together a hotfix with this commit.
@ssuvorov-fls, please cherry-pick to 'master-2.12'.
Expected behavior
After defining and generating a cohort, it should be possible to use the Reporting tab to generate Quick or Full Analyses.
Actual behavior
This works fine on SQL Server, but not on Spark (e.g., Databricks).
Steps to reproduce behavior
1. Use an OHDSI instance that has an OMOP data source connected via Spark.
2. Define and generate a cohort, then use the Reporting tab to run a Quick or Full Analysis.
3. After several minutes, the job will show a Failed status.
Root Cause
After all of the initial processing, the final query looks like this:
```sql
INSERT INTO result_schema.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, count_value)
SELECT cohort_definition_id, analysis_id, CAST(stratum_1 AS STRING), CAST(stratum_2 AS STRING), CAST(stratum_3 AS STRING), CAST(stratum_4 AS STRING), count_value
FROM tmp.xrfc9243results_1
UNION ALL ...
```
However, Spark SQL does not support INSERT INTO with a column list that does not exactly match the full set of columns in the target table.
Modifying the query does work. Either this:
```sql
INSERT INTO omop_160101_to_220731_v813_results.heracles_results (cohort_definition_id, analysis_id, stratum_1, stratum_2, stratum_3, stratum_4, stratum_5, count_value, last_update_time)
SELECT cohort_definition_id, analysis_id, CAST(stratum_1 AS STRING), CAST(stratum_2 AS STRING), CAST(stratum_3 AS STRING), CAST(stratum_4 AS STRING), '' AS stratum_5, count_value, now()
FROM tmp.xrfc9243results_1
```
or this:
```sql
INSERT INTO omop_160101_to_220731_v813_results.heracles_results
SELECT cohort_definition_id, analysis_id, CAST(stratum_1 AS STRING), CAST(stratum_2 AS STRING), CAST(stratum_3 AS STRING), CAST(stratum_4 AS STRING), '' AS stratum_5, count_value, now()
FROM tmp.xrfc9243results_1
```
Can the relevant SQL fragments be modified so that this works for Spark?
One of the relevant files appears to be:
https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/cohortanalysis/sql/selectHeraclesResults.sql
But it looks as though a Java file is used to orchestrate the sub-queries:
https://github.com/OHDSI/WebAPI/blob/fd108571086fceea8513389fab426a6ea8101888/src/main/java/org/ohdsi/webapi/cohortanalysis/HeraclesQueryBuilder.java
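As a point of reference, below is a hedged sketch of how the rendered SQL could be post-processed before execution, assuming SqlRender's Java API is on the classpath. SqlTranslate.translateSql is SqlRender's standard translation entry point; the BigQuerySparkTranslate.sparkHandleInsert signature is an assumption based on the discussion above and should be verified against the SqlRender version in use:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

import org.ohdsi.sql.BigQuerySparkTranslate;
import org.ohdsi.sql.SqlTranslate;

public class SparkAwareExecutor {

    // Translates rendered OHDSI SQL to the target dialect and, for Spark only,
    // rewrites partial-column INSERTs against the live connection before
    // executing the statement.
    static void execute(Connection conn, String renderedSql, String dialect)
            throws SQLException {
        String translated = SqlTranslate.translateSql(renderedSql, dialect);
        if ("spark".equalsIgnoreCase(dialect)) {
            // Assumed API: needs the active connection so it can describe the
            // target table and expand the INSERT to the full column list.
            translated = BigQuerySparkTranslate.sparkHandleInsert(translated, conn);
        }
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(translated);
        }
    }
}
```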