Azure Kinect is used for capturing with both RGB and depth camera. User is wearing specially colored glove such that it's possible to easily extract ground-truth semantic segmentation. DNN is trained to perform semantic segmentation on depth images. After DNN takes input data, simple classifier of hand gastures is trained on its' outputs.
Data is collected using Azure Kinect with RGB and depth camera recording simultaneously. RGB image is then projected in depth POV using official SDK and automatically segmented using hand-crafted method based on HSV ranges. DNN for semantic segmentation is trained on depth images having previously segmented images as targets. Gesture classifier is trained on DNN's outputs of depth images with hand extracted using bounding box and downsampled.
There is total of 5 gestures that are taken from ASL. Gestures are recored in takes of 2-3 seconds length. Since gesture is actually happening in the middle of a take, first and last portion of take are rejected. Total of two users are used for data collection.
Every take of a single gesture belongs to either training or validation set but only one take per gesture is used for validation.
For semantic segmentation DeepLab v3+ is used implemented in PyTorch. For gesture classification, LeNet is used implemented in PyTorch as well.
Every take of a single gesture belongs to either training or validation set but only one take per gesture is used for validation.
There were total of ~2800 frames. Epoch for DeepLab takes about 30 minutes on GTX 1060 with 6GB of video memory. It was trained for 5 epochs. LeNet was trained until validation loss started increasing. Whole LeNet training takes about 5 minutes on Tesla k80.
- Mean intersection over union (mIOU): 0.859
- Class acuraccy: 93.5%
- Gesture classification: 97%
- Collect more data in different environment
- Use more gestures with unkown class
- Use optical flow to determine static portion of a take
- Increase variance in gestures
- Make pipeline real-time
Model files that are provided are only that are different. Everything else is same as in original implementation. Pretrained models are not provided since they converge extremely fast. Dataset is not provided.
- Kosta Grujčić author
- Mihailo Grbić author
- Aleksa Gordić mentor