This repo contains the project file, which does the following for the selected column of the dataset:
- returns the descriptive statistics (mean, median and standard deviation) as a list
- prints these results to the summary.md file
- generates a histogram of the selected column and saves it as the output.png file
The code reads the data from the csv and stores it as a polars DataFrame for the analysis.
This repo has been created using the IDS-706_rg361_week-2 template which was created as week-2 mini-project.
Date Created: 2023-09-14
Create a Codespace on main, which will initialize the environment with the required packages and settings to execute the code.
The descriptive_stats function in main.py returns a list containing the [mean, median, standard deviation] of the selected column in the data.
The code also writes these results to a summary.md file in the resources folder for future reference.
The code stores the histogram as an image in the resources folder as output.png.
Note: The code saves summary.md and output.png in the resources folder by default; change this file path within main.py if required.
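As a rough illustration of the summary step described above, the following sketch writes a [mean, median, standard deviation] list to summary.md in a target folder. The function name write_summary, the out_dir parameter and the layout of the file are assumptions made for this example; the actual code in main.py may format the file differently.

```python
from pathlib import Path


def write_summary(stats, out_dir="resources"):
    """Write [mean, median, std] to summary.md inside out_dir and return the path.

    Hypothetical helper for illustration; not the function used in main.py.
    """
    mean, median, std = stats
    out_path = Path(out_dir) / "summary.md"
    # Create the output folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    lines = [
        "# Summary",
        "",
        f"- Mean: {mean}",
        f"- Median: {median}",
        f"- Standard deviation: {std}",
    ]
    # write_text replaces any existing summary.md, so only the latest run is kept
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```

Passing a different out_dir is one way to keep the output location configurable instead of editing the path inside the function.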
The function takes in the following 2 parameters:
- fname (required) - path or link to the csv file with the desired data
- col (optional) - number of the column for which the statistics are computed. If no input is given, the last column in the data is used
Notes
- Count the column numbers starting at 1
- The code assumes that the data has a header row, which is the default behaviour of the read_csv function from polars that is used to read the data and create a DataFrame
README.md contains the information about the repository and instructions for using it
requirements.txt contains the list of packages and libraries which are required for running the project.
GitHub Actions is used to automate the following 4 actions whenever a change is made to the files in the repository:
- install: installs the packages and libraries mentioned in requirements.txt
- format: uses black to format the python files
- lint: uses pylint to lint the python files
- test: uses pytest to test the python code, using the test_* files to test the main files
Note: If all the processes run successfully, the following output will be visible in GitHub Actions:
contains the instructions for the processes used in github actions and .devcontainer for creating the virtual environment
.devcontainer contains the dockerfile and devcontainer.json files, which are used to build and define the settings of the virtual environment (codespaces - python) for running the code.
- main.py: contains the descriptive_stats function, which returns the descriptive statistics and writes the summary.md and output.png files
- test_main: a test file that verifies main.py; it contains a sample DataFrame and the expected results when testing with the descriptive_stats function
A sample blood-pressure dataset from GitHub has been loaded into the resources folder and is used for testing the code.
Two test cases are run to check the proper functioning of the code:
- We specify the column number (in this test, column 4 is passed as the argument to the function)
- We do not specify a column number (in this test, no column argument is passed to the function)
The code runs as expected and the graph and summary are saved in the resources folder:
Note: Only the last graph and summary are stored, since the test file calls the function twice and the function clears the previous output before saving a new one.