# jupancon

Database connectors and SQL magics for Jupyter. jupancon = Jupyter + Pandas + Connectors.
- Connector to Redshift
    - Using user/pass
    - Using an IAM profile
- Connector to BigQuery (using a Google profile)
- Connector to Databricks
- Optional automatic tunneling through an SSH bastion
- Querying capabilities
- IPython kernel magics for querying
- Always returns Pandas DataFrames
- Some hidden stuff I'd rather not document just yet so you don't nuke your warehouse :) Will document it when it's safer to use.
## Install

```bash
pip install jupancon
```
## Configure

Write a `~/.jupancon/config.yml` YAML file that looks similar to the following copy-paste from my actual config file (heavily censored for obvious reasons):
```yaml
default: my-redshift-cluster

my-redshift-cluster:
    type: redshift
    host: XXXXXX.XXXXXX.XXXXXXX.redshift.amazonaws.com
    # explicitly setting the Redshift port (optional)
    port: 5439
    user: XXXXXXXX
    pass: XXXXXXXX
    dbname: XXXXXX

my-redshift-using-iamprofile:
    type: redshift
    host: XXXXXX.XXXXXX.XXXXXXX.redshift.amazonaws.com
    profile: XXXXXXXXX
    dbname: XXXXXX
    # NOTE: you can choose dbuser and it will be auto-created if it doesn't exist
    dbuser: XXXXXX
    cluster: XXXXXX

my-gcp:
    type: bigquery
    project: XXXXX-XXXXX
    location: EU

my-databricks:
    type: databricks
    hostname: XXXXXX.cloud.databricks.com
    http_path: /sql/XXX/XXXX/XXXXXXXXXX
    # catalog is optional
    catalog: XXXXXXX
    token: XXXXXXXXX

my-redshift-behind-sshbastion:
    type: redshift
    use_bastion: true
    bastion_server: censored.bastion.server.com
    bastion_user: XXXXXX
    bastion_host: XXXXXX.XXXXXX.XXXXXX.redshift.amazonaws.com
    host: censored.main.server.com
    user: XXXXXXXX
    pass: XXXXXXXX
    dbname: XXXXXX
```
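The `default` key at the top of the file names the connection block jupancon uses when you don't ask for one explicitly. A minimal sketch of that resolution logic, using plain dicts to stand in for the parsed YAML (illustrative only — jupancon reads the real file itself, and `get_connection` is a hypothetical helper, not part of its API):

```python
# Plain dicts standing in for the parsed ~/.jupancon/config.yml above.
config = {
    "default": "my-redshift-cluster",
    "my-redshift-cluster": {"type": "redshift", "host": "XXXXXX.redshift.amazonaws.com"},
    "my-gcp": {"type": "bigquery", "project": "XXXXX-XXXXX", "location": "EU"},
}

def get_connection(config, name=None):
    """Resolve a named connection block, falling back to the 'default' entry."""
    name = name or config["default"]
    return config[name]

print(get_connection(config)["type"])            # redshift (the default)
print(get_connection(config, "my-gcp")["type"])  # bigquery
```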
Jupancon will also pick up environment variables, which take precedence over `config.yml`:

- `JPC_DB_TYPE`: `redshift` or `bigquery`
- `JPC_HOST`: for example, `XXXXXX.XXXXXX.XXXXXX.redshift.amazonaws.com`
- `JPC_USER`: user name
- `JPC_DB`: database name
- `JPC_PASS`: password
- `JPC_USE_BASTION`: `true`, or leave blank
- `JPC_BASTION_SERVER`
- `JPC_BASTION_HOST`
- `JPC_PROFILE`: IAM profile (for IAM connection only)
- `JPC_CLUSTER`: Redshift cluster (for IAM connection only)
- `JPC_DBUSER`: Redshift user (for IAM connection only)
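That precedence rule can be sketched as follows. This is a hypothetical illustration, not jupancon's internal code — the `resolve` helper and the dict names are made up for the example:

```python
import os

# Values as they would come from ~/.jupancon/config.yml (stand-in dict).
config_from_yaml = {"host": "from-config.example.com", "user": "config_user"}
env_names = {"host": "JPC_HOST", "user": "JPC_USER"}

def resolve(key):
    """Prefer the JPC_* environment variable; fall back to the config file."""
    return os.environ.get(env_names[key]) or config_from_yaml.get(key)

os.environ["JPC_HOST"] = "from-env.example.com"
os.environ.pop("JPC_USER", None)  # ensure JPC_USER is unset for the demo

print(resolve("host"))  # env var wins: from-env.example.com
print(resolve("user"))  # no env var, falls back to config: config_user
```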
## How to use

This library is developed primarily for use within JupyterLab. It is likely to work in Jupyter Notebook and IPython, but that is untested and unsupported at this stage. It also works, and is being used, in regular scripts, although it obviously loses its magic there.
```python
from jupancon import query, list_schemas, list_tables

list_schemas()

list_tables()

query("select * from foo")
```
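Since `query` always returns a regular pandas DataFrame, results plug straight into the usual pandas workflow. A small sketch — the hand-built DataFrame below stands in for the result of `query("select * from foo")`, since there is no warehouse to hit here:

```python
import pandas as pd

# Stand-in for the DataFrame that query("select * from foo") would return.
df = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "a"]})

# Typical follow-up analysis on a query result:
counts = df.groupby("label")["id"].count()
print(counts.to_dict())  # {'a': 2, 'b': 1}
```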
```python
from jupancon import load_magics

load_magics()
```
```python
%select * from foo
```

```python
df = %select * from foo
```

```sql
%%sql
select *
from foo
where cond = 1
and label = 'my nice label'
```
## Development

Install a virtual environment, activate it, then install the library and its dev dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install .
pip install -r requirements-dev.txt
```
The rest of our test flow consists of opening notebooks and checking that you can connect. This flow is not publishable because of client confidentiality. There are, however, some very basic unit tests that can be used to quickly check whether you broke something fundamental:

```bash
scripts/test.py
```
## Current status

Jupancon has enough basic features to be worth open sourcing, but the documentation is still lacking.
- Because of the current architecture of JupyterLab, SQL syntax highlighting is not feasible to add there, but this is now possible with Notebook 7.
- Fix `list_table("schema")` to detect when a schema doesn't exist and return an error
- Add query monitoring and cancelling functionality
- Complete docs (low level stuff, exhaustive features, in Sphinx)
- Add animated gifs to docs
- Autocomplete and autodiscovery of databases is possible, but not trivial at all. In addition, I'd like to find a way of not adding any extra configuration. Regardless, it's not worth tackling until the TODO list above is done. See this project for a successful example.
I would like to publish decent unit testing, but this library is hard to test because all the databases currently queried for its development are either tests that cost me money or private (my clients') databases. Any ideas on how to write an open-source, non-exploitable set of unit tests for Redshift/BigQuery/Databricks/etc. are very welcome.