A data engineering project.
Suppose we want to do some analysis of readers' preferences for books at a library. The data we can access are as follows:
- Metadata of books in the library
- Profiles of library members
- Loan history of books
- Every time a book is loaned, the library database is updated via the library server's API (a sketch of such an endpoint follows this list).
- On a daily basis, an Airflow DAG is triggered to extract the day's loans from the library database and load them into MinIO (see the DAG sketch after this list).
- The books' metadata is also extracted from a CSV file and loaded into MinIO.
- All of these data are then loaded into ClickHouse.
- Metabase is then used to visualize the data in ClickHouse.
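
The library server itself is not part of this repo, but to make the first step concrete, here is a minimal sketch of what a loan-recording endpoint could look like. The framework (Flask), the `loans` table, and the connection settings are all illustrative assumptions, not this project's actual server code.

```python
# Hypothetical sketch of the library server's loan endpoint.
# Framework, table, and connection settings are illustrative assumptions.
from datetime import date

import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.post("/loans")
def record_loan():
    payload = request.get_json()
    conn = psycopg2.connect(host="localhost", dbname="library",
                            user="library", password="library")
    # `with conn` commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO loans (book_id, member_id, loan_date) VALUES (%s, %s, %s)",
            (payload["book_id"], payload["member_id"], str(date.today())),
        )
    conn.close()
    return jsonify({"status": "ok"}), 201
```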
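
To make the daily batch step concrete, below is a minimal sketch of what such a DAG could look like, assuming a Postgres library database and MinIO reachable over its S3-compatible API. Hosts, credentials, bucket and table names are illustrative assumptions; the project's actual DAGs live in the repo and may differ.

```python
# A sketch of the daily extract-and-load DAG. Hosts, credentials,
# bucket and table names are illustrative assumptions.
import csv
import io
from datetime import datetime

import boto3
import psycopg2
from airflow import DAG
from airflow.operators.python import PythonOperator


def minio_client():
    # MinIO speaks the S3 protocol, so boto3 works against it directly.
    return boto3.client(
        "s3",
        endpoint_url="http://minio:9000",
        aws_access_key_id="minio",
        aws_secret_access_key="minio123",
    )


def extract_loans(ds, **_):
    # Pull the execution date's loans out of the library database...
    conn = psycopg2.connect(host="postgres", dbname="library",
                            user="airflow", password="airflow")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM loans WHERE loan_date = %s", (ds,))
        header = [col.name for col in cur.description]
        rows = cur.fetchall()
    conn.close()

    # ...and land them in MinIO as a dated CSV.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    minio_client().put_object(Bucket="loans", Key=f"loans_{ds}.csv",
                              Body=buf.getvalue().encode("utf-8"))


def upload_books_metadata(**_):
    # The books metadata is a static CSV, so it is uploaded as-is.
    minio_client().upload_file("data/books_metadata.csv",
                               "books", "books_metadata.csv")


with DAG(
    dag_id="daily_library_loads",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_loans", python_callable=extract_loans)
    PythonOperator(task_id="upload_books_metadata",
                   python_callable=upload_books_metadata)
```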
Versions used at the time of implementation:
- Docker version 24.0.6, build ed223bc
- Docker Compose version v2.22.0-desktop.2
- Python 3.10.1
- Create a Python virtual environment at the root of this project and install the requirements:

  ```sh
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
  ```

- Run `python scripts/library_db_data.py` to create the seed data for the library database.
- Run `make up` to build and start the Docker containers. Use `make run` on subsequent runs, and `make down` to tear down the containers.
- Once the containers are up, run `make setup` to set up Airflow and the seed data.
- Go to `localhost:8080` to access the Airflow UI. The username and password are both `airflow`, as in the `.env` file. Under Admin >> Connections, add the Postgres connection with the variables specified in the `.env` file.
- Trigger the Airflow DAGs to populate ClickHouse (a sketch of the ClickHouse load step follows at the end of this section).
- Metabase can be accessed at `localhost:3000`. ClickHouse variables are specified in the `.env` file, but leave the database as `default`. The ClickHouse schema can be found here.