project-bank-credit


Showcase for modern data engineering with Spark and Polars

Purpose

This project is a personal practice and showcase of how pipelines built with Apache Spark or Polars can perform ETL/ELT. The applications can run locally or, with some modifications, be hosted on public cloud services. With appropriate settings, the pipelines can handle data volumes from a few GB up to TB scale (Spark, specifically).

Content

A few things that the pipelines cover:

  • Accepting monthly data from a local filesystem (or a data lake on a cloud)
  • Cleaning and transforming the monthly data with business logic and geographic information
  • Writing the results to Parquet for further use (analysis or model training)
  • Generating a monthly report in CSV at the end of each monthly job (a minimal sketch of one monthly run appears after this list)
  • Launching a self-service reporting application (built with Streamlit) that shows a choropleth map with various measures and an interactive dataframe viewer
    • For more details, please check README.md in streamlit_app/
  • Running jobs as a monthly process, a full migration (by year), or in test mode (with limited records)
  • Basic CI via GitHub Actions (running the whole transformation against sample data); the same tests can be run locally with pytest
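
To make the list above concrete, here is a minimal sketch of one monthly run in the style of the Polars pipeline. The file paths follow the layout in the Folder Structure section, but the column names (`zip9`, `balance`) and the aggregation are illustrative assumptions, not the repo's actual business logic.

```python
# Minimal sketch of a monthly run (illustrative only; the column names
# "zip9" and "balance" and the aggregation are assumptions, not the
# repo's actual business logic).
import polars as pl

MONTH = "201904"

# Read one monthly credit file and the demographic reference data.
monthly = pl.read_csv(f"data/source/zip9_coded_{MONTH}_pv.csv")
demo = pl.read_csv("data/source/demographic_data/zip9_demographics_coded_pv.csv")

# Join on the coded zip9 key and apply a simple aggregation.
summary = (
    monthly.join(demo, on="zip9", how="left")
    .group_by("zip9")
    .agg(pl.col("balance").sum().alias("total_balance"))
)

# Write Parquet for downstream analysis and a CSV report for the month.
summary.write_parquet(f"data/processed_polars/{MONTH}_processed.parquet")
summary.write_csv(f"data/processed_polars/report/{MONTH}_report.csv")
```

The Spark pipeline follows the same read–join–transform–write shape, expressed with PySpark DataFrames instead.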

Data

The project is built with US bank credit data.

  1. The main dataset is the monthly homebuyer data at the zip code 9 (zip9) level, containing information about US consumers' credit activities relevant to home buying. It covers April through September 2019 and is split into several CSVs by month.
  2. The second is demographic data with basic information about consumers at the zip9 level.

The total data size is about 10 GB. Nonetheless, the pipelines are designed for an ongoing monthly process and should be able to handle 10x-100x that size with some tweaks to the settings.
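
Those "tweaks" are mostly on the Spark side. The sketch below shows the kind of knobs involved when scaling up; the specific values are placeholders, not tuned recommendations from this repo.

```python
# Illustrative Spark settings for larger inputs (values are examples only,
# not tuned recommendations from this repo).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bank-credit-monthly")
    # More shuffle partitions help larger joins/aggregations.
    .config("spark.sql.shuffle.partitions", "400")
    # Executor sizing depends on the cluster; placeholder values below.
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```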

All datasets are from the Brown Datathon 2020 event, and the actual zip9 values have been coded. The datasets are for educational purposes only and are not intended for any other use.

Instructions

For anyone interested in trying this repo, the Google Drive folder has sample data. It is the same data but with roughly 0.05% of the records (zip codes within New York City only). Please place the files following the Folder Structure section. Although the small data volume does not fully exercise these frameworks, it is a good starting point, and the setup can be replicated for other big data projects.

Please use requirements.txt to create a virtual environment with Python 3.9 and set up Spark on your machine. The Spark application is best run in standalone mode (which requires setting up a Spark driver/worker and submitting jobs to launch the application). To launch it, please check the examples in cli_examples.sh (which also covers the Streamlit app).

P.S. Setting up Spark in any environment can be a fuzzy task. You can refer to this simple guide or other resources to set it up correctly.
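
If you just want to poke at the sample data before wiring up a standalone cluster, Spark can also run in local mode. A minimal sketch is below (the standalone setup above remains what cli_examples.sh assumes):

```python
# Quick local-mode session for trying the sample data without a cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")            # use all local cores
    .appName("bank-credit-local")
    .getOrCreate()
)

# Read one monthly source file and inspect its schema.
df = spark.read.csv(
    "data/source/zip9_coded_201904_pv.csv",
    header=True,
    inferSchema=True,
)
df.printSchema()
```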

Folder Structure

.
├── LICENSE
├── README.md
├── cli_examples.sh
├── data
│   ├── processed
│   ├── processed_polars
│   │   └── report
│   └── source
│       ├── demographic_data
│       │   └── zip9_demographics_coded_pv.csv
│       ├── zip9_coded_201904_pv.csv
│       ├── zip9_coded_201905_pv.csv
│       ├── zip9_coded_201906_pv.csv
│       ├── zip9_coded_201907_pv.csv
│       ├── zip9_coded_201908_pv.csv
│       └── zip9_coded_201909_pv.csv
├── general_utils.py
├── polars_pipeline
│   ├── README.md
│   ├── ingestion_script.py
│   ├── monthly_transformation_script.py
│   └── raw_ingestion.py
├── requirements.txt
├── spark_pipeline
│   ├── README.md
│   ├── ingestion_script.py
│   ├── monthly_transformation_script.py
│   └── raw_ingestion.py
├── streamlit_app
│   ├── README.md
│   ├── app.py
│   ├── list_zipcode_nyc.pkl
│   ├── streamlit_script.py
│   └── zipcode_2020_shape_simplified.geojson
├── tests
│   ├── polars
│   │   └── test_polars.py
│   ├── spark
│   │   ├── conftest.py
│   │   └── test_spark.py
│   └── test_data
│       ├── 201904_monthly_processed_data_expected.csv
│       ├── zip9_coded_201904_pv_test.csv
│       └── zip9_demographics_coded_pv_test.csv

Future development

Some ideas that could expand the project further:

  • Setting up an orchestration tool (e.g., Airflow, Prefect) to run the process as a monthly scheduled task
  • Converting the batch job into a streaming process (Structured Streaming) that runs whenever new or updated data arrives
  • Setting up another process to re-train machine learning models with the latest data
  • Setting up JDBC drivers to read/write data from/to external database systems (e.g., Postgres); see the sketch after this list
  • Exporting wide tables in the final stage and connecting them to BI platforms on cloud services (e.g., BigQuery with Looker)
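
For the JDBC idea specifically, here is a hedged sketch of what writing processed output to Postgres could look like. The connection URL, credentials, and table name are placeholders, and the Postgres JDBC driver jar would need to be supplied (e.g., via spark.jars.packages or --jars).

```python
# Illustrative JDBC write to Postgres (URL, credentials, and table name are
# placeholders; the Postgres JDBC driver must be on the classpath).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-export").getOrCreate()

processed = spark.read.parquet("data/processed/")

(
    processed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/bank_credit")  # placeholder
    .option("dbtable", "monthly_summary")                           # placeholder
    .option("user", "postgres")
    .option("password", "change-me")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)
```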

Contacts

Please let me know if you have any thoughts on this project.

Jeff
