This project works with a real-world dataset and explores how to prepare it for machine learning. The dataset is the Howell dataset, which records height, weight, age, and gender. The goal is to perform an initial exploration, prepare the data for modeling, and carry out several analyses.
The project consists of the following files:
- lab2_starter.ipynb: The Jupyter Notebook containing the code and analyses.
- Howell.csv: The dataset used in this project.
- README.md: This README file.
- dgraves-ml-lab2-checkpoints.docx: The document containing screenshots and checkpoint analyses.
To get started with this project, you need to clone the repository to your local machine:
git clone https://github.com/yourusername/lab2-working-with-dataset.git
cd lab2-working-with-dataset
- pandas
- matplotlib
- scikit-learn
- numpy
- Create and activate a virtual environment
# Create environment
python -m venv .venv
# Activate environment (Windows Git Bash; on macOS/Linux use: source .venv/bin/activate)
source .venv/Scripts/activate
- Install required packages
pip install pandas matplotlib scikit-learn numpy jupyterlab
- Navigate to the project directory
- Launch JupyterLab with the command:
jupyter lab
- Follow the instructions in the starter notebook, alongside the course reference PDF, to run the code cells and perform the analyses.
- Data Overview: Displaying basic information about the dataset, including the number of instances, features, and missing values.
- Data Distributions: Visualizing the distributions of height, weight, and age.
- Correlation Analysis: Identifying the pair of features with the highest correlation.
- Age vs. Weight Analysis: Exploring the relationship between age and weight.
- Age Histogram: Comparing the age distribution in the dataset with that of modern populations.
- BMI Calculation: Adding a new feature for BMI and categorizing it.
- Stratified Data Split: Splitting the data into training and test sets while maintaining the ratio of males to females.
- Male-to-Female Ratios: Computing the male-to-female ratios for the entire dataset, training set, and test set.
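Several of the analyses listed above can be sketched in a few lines of pandas and scikit-learn. The example below is only an illustration, not the notebook's actual code: it uses a small made-up table in place of Howell.csv, and the column names (`height` in cm, `weight` in kg, `age`, `male` as 0/1), the BMI category cutoffs, the 25% test size, and the helper `male_to_female_ratio` are all assumptions made for this sketch.

```python
from itertools import combinations

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in rows; the real notebook would read Howell.csv instead.
# Column names and units (height in cm, weight in kg, male as 0/1) are assumed.
df = pd.DataFrame({
    "height": [151.8, 139.7, 136.5, 156.8, 145.4, 163.8,
               149.2, 168.9, 147.9, 157.5, 121.9, 105.4],
    "weight": [47.8, 36.5, 31.9, 53.0, 41.3, 62.9,
               38.2, 55.5, 34.9, 54.5, 19.6, 13.9],
    "age": [63, 63, 65, 41, 51, 35, 32, 27, 19, 54, 12, 8],
    "male": [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
})

# Data overview: number of instances, number of features, missing values.
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()

# Correlation analysis: pick the feature pair with the largest |correlation|.
features = ["height", "weight", "age"]
corr = df[features].corr()
top_pair = max(combinations(features, 2),
               key=lambda pair: abs(corr.loc[pair[0], pair[1]]))

# BMI = weight (kg) / height (m)^2, so height is converted from cm to m.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Categorize BMI with commonly used cutoffs (an assumption for this sketch).
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,  # intervals are [low, high)
)

# Stratified split: stratify on the male column so both subsets keep
# the same proportion of males and females as the full dataset.
train_set, test_set = train_test_split(
    df, test_size=0.25, stratify=df["male"], random_state=42
)


def male_to_female_ratio(frame):
    """Return (number of males) / (number of females) in a DataFrame."""
    males = (frame["male"] == 1).sum()
    females = (frame["male"] == 0).sum()
    return males / females


ratio_all = male_to_female_ratio(df)
ratio_train = male_to_female_ratio(train_set)
ratio_test = male_to_female_ratio(test_set)

print(f"{n_rows} instances, {n_cols} features")
print("Most correlated pair:", top_pair)
print("Male-to-female ratios:", ratio_all, ratio_train, ratio_test)
```

Passing the `male` column to `stratify` is what keeps the three ratios aligned. Without it, a random split of a small dataset can easily over- or under-represent one group in the test set.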