# jupancon

Database connectors and SQL magics for Jupyter. jupancon = Jupyter + Pandas + Connectors.
- Connector to Redshift
    - Using user/pass
    - Using an IAM profile
- Connector to BigQuery (using a Google profile)
- Connector to Databricks
- Optional automatic tunneling through an SSH bastion
- Querying capabilities
- IPython kernel magics for querying
- Always returns Pandas DataFrames
- Some hidden stuff I'd rather not document just yet so you don't nuke your warehouse :) Will document it when it's safer to use.
## Install

```bash
pip install jupancon
```
## Configure

Write a `~/.jupancon/config.yml` YAML file that looks similar to the following copy-paste from my actual config file (heavily censored for obvious reasons):
```yaml
default: my-redshift-cluster

my-redshift-cluster:
    type: redshift
    host: XXXXXX.XXXXXX.XXXXXXX.redshift.amazonaws.com
    # explicitly setting the Redshift port (optional)
    port: 5439
    user: XXXXXXXX
    pass: XXXXXXXX
    dbname: XXXXXX

my-redshift-using-iamprofile:
    type: redshift
    host: XXXXXX.XXXXXX.XXXXXXX.redshift.amazonaws.com
    profile: XXXXXXXXX
    dbname: XXXXXX
    # NOTE: you can choose dbuser and it will be auto-created if it doesn't exist
    dbuser: XXXXXX
    cluster: XXXXXX

my-gcp:
    type: bigquery
    project: XXXXX-XXXXX
    location: EU

my-databricks:
    type: databricks
    hostname: XXXXXX.cloud.databricks.com
    http_path: /sql/XXX/XXXX/XXXXXXXXXX
    # catalog is optional
    catalog: XXXXXXX
    token: XXXXXXXXX

my-redshift-behind-sshbastion:
    type: redshift
    use_bastion: true
    bastion_server: censored.bastion.server.com
    bastion_user: XXXXXX
    bastion_host: XXXXXX.XXXXXX.XXXXXX.redshift.amazonaws.com
    host: censored.main.server.com
    user: XXXXXXXX
    pass: XXXXXXXX
    dbname: XXXXXX
```
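The `default` key at the top of the file names the connection block jupancon uses when you don't ask for one explicitly. A minimal sketch of that resolution logic, using plain dicts to stand in for the parsed YAML (illustrative only — jupancon reads the real file itself, and `get_connection` is a hypothetical helper, not part of its API):

```python
# Plain dicts standing in for the parsed ~/.jupancon/config.yml above.
config = {
    "default": "my-redshift-cluster",
    "my-redshift-cluster": {"type": "redshift", "host": "XXXXXX.redshift.amazonaws.com"},
    "my-gcp": {"type": "bigquery", "project": "XXXXX-XXXXX", "location": "EU"},
}

def get_connection(config, name=None):
    """Resolve a named connection block, falling back to the 'default' entry."""
    name = name or config["default"]
    return config[name]

print(get_connection(config)["type"])            # redshift (the default)
print(get_connection(config, "my-gcp")["type"])  # bigquery
```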
Jupancon will also pick up environment variables, which take precedence over `config.yml`:

- `JPC_DB_TYPE`: `redshift` or `bigquery`
- `JPC_HOST`: for example, `XXXXXX.XXXXXX.XXXXXX.redshift.amazonaws.com`
- `JPC_USER`: user name
- `JPC_DB`: database name
- `JPC_PASS`: password
- `JPC_USE_BASTION`: `true`, or leave blank
- `JPC_BASTION_SERVER`
- `JPC_BASTION_HOST`
- `JPC_PROFILE`: IAM profile (for IAM connection only)
- `JPC_CLUSTER`: Redshift cluster (for IAM connection only)
- `JPC_DBUSER`: Redshift user (for IAM connection only)
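That precedence rule can be sketched as follows. This is a hypothetical illustration, not jupancon's internal code — the `resolve` helper and the dict names are made up for the example:

```python
import os

# Values as they would come from ~/.jupancon/config.yml (stand-in dict).
config_from_yaml = {"host": "from-config.example.com", "user": "config_user"}
env_names = {"host": "JPC_HOST", "user": "JPC_USER"}

def resolve(key):
    """Prefer the JPC_* environment variable; fall back to the config file."""
    return os.environ.get(env_names[key]) or config_from_yaml.get(key)

os.environ["JPC_HOST"] = "from-env.example.com"
os.environ.pop("JPC_USER", None)  # ensure JPC_USER is unset for the demo

print(resolve("host"))  # env var wins: from-env.example.com
print(resolve("user"))  # no env var, falls back to config: config_user
```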
## How to use

This library is developed primarily for use within JupyterLab. It is likely to work in Jupyter Notebook and IPython, but that is untested and unsupported at this stage. It also works, and is being used, in regular scripts, although it obviously loses its magic there.
```python
from jupancon import query, list_schemas, list_tables

list_schemas()

list_tables()

query("select * from foo")
```
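Since `query` always returns a regular pandas DataFrame, results plug straight into the usual pandas workflow. A small sketch — the hand-built DataFrame below stands in for the result of `query("select * from foo")`, since there is no warehouse to hit here:

```python
import pandas as pd

# Stand-in for the DataFrame that query("select * from foo") would return.
df = pd.DataFrame({"id": [1, 2, 3], "label": ["a", "b", "a"]})

# Typical follow-up analysis on a query result:
counts = df.groupby("label")["id"].count()
print(counts.to_dict())  # {'a': 2, 'b': 1}
```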
```python
from jupancon import load_magics

load_magics()
```
```python
%select * from foo
```

```python
df = %select * from foo
```

```sql
%%sql
select *
from foo
where cond = 1
and label = 'my nice label'
```
## Development

Install a virtual environment, activate it, then install the library and its dev dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install .
pip install -r requirements-dev.txt
```
The rest of our test flow consists of opening notebooks and checking that you can connect. This flow is not publishable because of client confidentiality. There are, however, some very basic unit tests that can be used to quickly check whether you broke something fundamental:

```bash
scripts/test.py
```
## Current status

Jupancon has enough basic features to be worth open sourcing, but the documentation is still lacking.
- Because of the current architecture of JupyterLab, SQL syntax highlighting is not feasible to add there, but this is now possible with Notebook 7.
- Fix `list_table("schema")` to detect when a schema doesn't exist and return an error
- Add query monitoring and cancelling functionality
- Complete docs (low level stuff, exhaustive features, in Sphinx)
- Add animated gifs to docs
- Autocomplete and autodiscovery of databases is possible, but not trivial at all. In addition, I'd like to find a way of not adding any extra configuration. Regardless, it's not worth tackling until the TODO list above is done. See this project for a successful example.
I would like to publish decent unit testing, but this library is hard to test because all the databases currently queried for its development are either tests that cost me money or private (my clients') databases. Any ideas on how to write an open-source, non-exploitable set of unit tests for Redshift/BigQuery/Databricks/etc. are very welcome.