27.04.2024 Training model using HaGRID dataset

This week, I tried to test my model, which was trained on the dataset collected with the Leap Motion, on unseen data. I borrowed Leap Motion versions 1 and 2 from the CCI lab and set them up at home.

While developing my gesture recognition model, I trained it on static images captured by the Leap Motion, where each image represents a single moment of a gesture. My goal was to deploy this model in real time using the Leap Motion Controller.

Initially, I attempted to capture frames from the Leap Motion’s infrared (IR) cameras using the legacy SDK. These images could be directly processed by the model. However, with the newer Ultraleap Gemini SDK, this functionality is no longer supported. The SDK has shifted away from image-based input entirely — instead of providing raw camera images, it now delivers hand landmark data (3D positions of joints, bones, palm, etc.).

Then I moved to the HaGRID dataset (https://github.com/hukenovs/hagrid). HaGRIDv2 is 1.5 TB in size and contains 1,086,158 Full HD RGB images divided into 33 gesture classes. It does not require any physical device apart from a camera to capture gestures.

The full dataset was too heavy to work with, so I decided to clean it and keep only the classes I need for my project. On top of that, I kept only 2,000 images per class.
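The filtering itself was a simple copy script, roughly like the sketch below (the class list and the "one sub-folder per gesture class" layout are assumptions and may differ from the exact structure of the download):

import random
import shutil
from pathlib import Path

# Hypothetical paths and class list -- adjust to the actual HaGRID download.
SRC = Path("hagrid_full")      # assumed: one sub-folder per gesture class
DST = Path("hagrid_subset")
KEEP_CLASSES = ["fist", "palm", "point", "three2", "thumb_index", "two_up", "grabbing"]
IMAGES_PER_CLASS = 2000

random.seed(42)
for cls in KEEP_CLASSES:
    images = sorted((SRC / cls).glob("*.jpg"))
    sample = random.sample(images, min(IMAGES_PER_CLASS, len(images)))
    out_dir = DST / cls
    out_dir.mkdir(parents=True, exist_ok=True)
    for img in sample:
        shutil.copy(img, out_dir / img.name)
    print(f"{cls}: kept {len(sample)} of {len(images)} images")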

To preprocess the images correctly, I first needed to crop the gestures out of the photos, as their backgrounds were too noisy. I tried two ways of cropping: a pre-trained YOLO model that detects hands specifically, and MediaPipe.

Dataset cropped with YOLO:

Dataset cropped with MediaPipe:

Just from observing and comparing the two datasets, it is evident that the one produced with MediaPipe is far more accurate and clean.
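For reference, the MediaPipe-based crop works roughly as in the sketch below: detect the hand landmarks, take their bounding box, and pad it by a margin so fingertips are not clipped (the 20% margin is an illustrative value, not necessarily what I used):

import cv2
import mediapipe as mp

# One detector instance for the whole dataset (static images, single hand).
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1,
                                 min_detection_confidence=0.5)

def crop_hand(image_bgr, margin=0.2):
    """Return a padded crop around the detected hand, or None if no hand is found."""
    h, w = image_bgr.shape[:2]
    results = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    lm = results.multi_hand_landmarks[0].landmark
    xs = [p.x * w for p in lm]
    ys = [p.y * h for p in lm]
    # Expand the landmark bounding box by the margin before clipping to the image.
    dx = (max(xs) - min(xs)) * margin
    dy = (max(ys) - min(ys)) * margin
    x0, x1 = int(max(min(xs) - dx, 0)), int(min(max(xs) + dx, w))
    y0, y1 = int(max(min(ys) - dy, 0)), int(min(max(ys) + dy, h))
    return image_bgr[y0:y1, x0:x1]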

After creating a preprocessing pipeline (bringing all crops to one size), I moved on to my model.
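The "one size" step itself is just a resize plus normalization before the crops are stacked into sequences (a sketch; 64x64 is an assumed target resolution, not necessarily the one I trained with):

import cv2
import numpy as np

IMG_SIZE = 64  # assumed model input resolution

def preprocess_crop(crop_bgr):
    """Resize a hand crop to a fixed size and scale pixel values to [0, 1]."""
    resized = cv2.resize(crop_bgr, (IMG_SIZE, IMG_SIZE))
    return resized.astype(np.float32) / 255.0  # shape (IMG_SIZE, IMG_SIZE, 3)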

This is a 3D CNN model designed to recognize gestures from short sequences of image frames (e.g. 5 frames stacked over time). It processes spatial and temporal patterns simultaneously, which makes it well suited to dynamic gestures.
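The architecture is along the lines of the PyTorch sketch below; the layer sizes are illustrative rather than my exact configuration, but it shows the Conv3d blocks together with the batch normalization and dropout mentioned later:

import torch
import torch.nn as nn

class Gesture3DCNN(nn.Module):
    """3D CNN over short frame sequences: input shape (batch, channels, time, height, width)."""
    def __init__(self, num_classes, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool spatially, keep all 5 frames
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),           # now pool over time as well
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: a batch of 8 clips, each 5 RGB frames at 64x64 -> logits of shape (8, 7).
model = Gesture3DCNN(num_classes=7)
logits = model(torch.randn(8, 3, 5, 64, 64))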

Dynamic gestures that I declared:

DYNAMIC_GESTURES = {
    "pinch_in": ["fist", "three2"],
    "pinch_out": ["three2", "fist"],
    "swipe_left": ["palm", "fist"],
    "swipe_right": ["fist", "palm"],
}
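At inference time, a dynamic gesture fires when the static predictions pass through its start and end poses in order. A minimal sketch of that mapping (the two-element buffer and matching logic are a simplification, not my exact pipeline):

from collections import deque

# Assumes the DYNAMIC_GESTURES dictionary defined above.
recent = deque(maxlen=2)  # last two distinct static predictions

def update(static_label):
    """Push a new static prediction; return a dynamic gesture name if one just completed."""
    if not recent or recent[-1] != static_label:
        recent.append(static_label)
    for name, sequence in DYNAMIC_GESTURES.items():
        if list(recent) == sequence:
            recent.clear()
            return name
    return None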

After training, the model reached:

Train Accuracy: 97.2%

Validation Accuracy: 90.0%

Validation Loss: 0.79

The model demonstrated strong learning capabilities, achieving a high final training accuracy of 97.2% and a validation accuracy of 90.0%, indicating that it effectively generalized to unseen data. While early epochs showed unstable validation performance and signs of underfitting, the model quickly improved and stabilized, likely due to the use of regularization techniques such as dropout and batch normalization. Although some fluctuations in validation loss and accuracy were observed in later epochs—potentially due to class imbalance or sample variability—the model ultimately recovered and achieved consistent results, making it suitable for real-time gesture recognition tasks.

The confusion matrix confirms that the model performs very well overall, with strong classification accuracy across all gesture classes. Most predictions fall cleanly along the diagonal, indicating correct classifications—particularly for gestures like “fist,” “palm,” “point,” “thumb_index,” and “two_up,” which show near-perfect precision. The “grabbing” and “three2” gestures show slightly more confusion, occasionally being misclassified as “point” or “thumb_index,” likely due to visual similarity in hand posture. Nonetheless, the matrix reveals high model reliability and balanced performance across categories, making it a solid candidate for deployment in real-time gesture recognition.

This is the result of testing the model with a webcam:

full video: https://drive.google.com/file/d/1LhtxcjpSOO-c0jDdglJovK1v8Uv73rdD/view?usp=sharing
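For completeness, the webcam test loop looks roughly like the sketch below; it reuses the crop_hand and preprocess_crop helpers and the Gesture3DCNN model from the earlier sketches, and the class list and its order are placeholders rather than my exact label mapping:

import cv2
import numpy as np
import torch

CLASSES = ["fist", "grabbing", "palm", "point", "three2", "thumb_index", "two_up"]
SEQ_LEN = 5

model.eval()          # trained Gesture3DCNN from the sketch above
frames = []           # rolling buffer of preprocessed crops

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    crop = crop_hand(frame)
    if crop is not None:
        frames.append(preprocess_crop(crop))
        frames = frames[-SEQ_LEN:]
    if len(frames) == SEQ_LEN:
        # (T, H, W, C) -> (1, C, T, H, W), the layout the 3D CNN expects
        clip = torch.from_numpy(np.stack(frames)).permute(3, 0, 1, 2).unsqueeze(0)
        with torch.no_grad():
            label = CLASSES[model(clip).argmax(dim=1).item()]
        cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("gesture", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()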

