Real-time hand tracking and gesture-based interaction using OpenCV, MediaPipe, and ModernGL
This project focuses on building an Augmented Reality (AR) system for real-time hand tracking and gesture-based interaction. Leveraging OpenCV, MediaPipe, and ModernGL, the application captures live video, detects hands in 3D space, and enables intuitive manipulation of virtual 3D objects. The system supports grabbing and moving a virtual cube with natural gestures such as pinching, creating an immersive AR experience with only a standard webcam.
The AR pipeline is built in Python and integrates several key technologies.
The system starts by capturing webcam frames and passing them through MediaPipe's HandLandmarker to extract both 2D image-space and relative 3D model-space landmarks. OpenCV visualizes the initial detections, and the program then computes world-space coordinates by solving the Perspective-n-Point (PnP) problem with OpenCV's solvePnP. This transformation enables proper alignment between the physical hand and the virtual overlays.
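A minimal sketch of this detection and pose-recovery step is shown below. The hand_landmarker.task model path and the approximate pinhole intrinsics (focal length taken as the frame width) are illustrative assumptions, not the project's exact configuration.

```python
import cv2
import numpy as np
import mediapipe as mp
from mediapipe.tasks.python import BaseOptions
from mediapipe.tasks.python.vision import HandLandmarker, HandLandmarkerOptions

# Assumed model file; the hand landmarker model must be downloaded separately.
landmarker = HandLandmarker.create_from_options(
    HandLandmarkerOptions(
        base_options=BaseOptions(model_asset_path="hand_landmarker.task"),
        num_hands=1,
    )
)

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
h, w = frame.shape[:2]

# Approximate pinhole intrinsics: focal length ~ frame width, principal point at center.
camera_matrix = np.array([[w, 0, w / 2],
                          [0, w, h / 2],
                          [0, 0, 1]], dtype=np.float64)
dist_coeffs = np.zeros(4)

rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
result = landmarker.detect(mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb))

if result.hand_landmarks:
    # 2D image-space landmarks, converted from normalized coordinates to pixels.
    image_pts = np.array([[lm.x * w, lm.y * h] for lm in result.hand_landmarks[0]],
                         dtype=np.float64)
    # Relative 3D model-space landmarks (meters, centered on the hand).
    model_pts = np.array([[lm.x, lm.y, lm.z] for lm in result.hand_world_landmarks[0]],
                         dtype=np.float64)

    # Recover the hand's rotation and translation in camera space via PnP.
    ok, rvec, tvec = cv2.solvePnP(model_pts, image_pts, camera_matrix, dist_coeffs)
    if ok:
        R, _ = cv2.Rodrigues(rvec)
        world_pts = (R @ model_pts.T + tvec).T  # world-space landmark positions

cap.release()
```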
ModernGL renders a virtual cube, textured via shaders and overlaid on the camera feed. Users interact with the cube through pinch gestures: when the index finger comes close enough to the cube, a “grab” state is triggered, letting the user reposition the cube within the 3D space. Fingertips and important joints are highlighted with marker meshes for visual feedback, and the system supports gesture recognition and real-time updates with smooth animation.
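The sketch below illustrates the general overlay idea with ModernGL: the camera frame is uploaded as a texture and drawn on a full-screen quad, after which the cube would be rendered on top. The offscreen-framebuffer route, the shader source, and the fixed 640x480 resolution are assumptions for illustration; the project itself may render directly to a window.

```python
import cv2
import moderngl
import numpy as np

W, H = 640, 480
ctx = moderngl.create_standalone_context()
fbo = ctx.simple_framebuffer((W, H))
fbo.use()

# Simple pass-through shader: draws a textured full-screen quad.
prog = ctx.program(
    vertex_shader="""
        #version 330
        in vec2 in_pos;
        in vec2 in_uv;
        out vec2 uv;
        void main() { gl_Position = vec4(in_pos, 0.0, 1.0); uv = in_uv; }
    """,
    fragment_shader="""
        #version 330
        uniform sampler2D tex;
        in vec2 uv;
        out vec4 color;
        void main() { color = texture(tex, uv); }
    """,
)

# Two triangles covering the screen; each vertex is (x, y, u, v).
quad = np.array([
    -1, -1, 0, 1,    1, -1, 1, 1,   -1, 1, 0, 0,
     1, -1, 1, 1,    1,  1, 1, 0,   -1, 1, 0, 0,
], dtype="f4")
vbo = ctx.buffer(quad.tobytes())
vao = ctx.vertex_array(prog, [(vbo, "2f 2f", "in_pos", "in_uv")])

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
rgb = cv2.cvtColor(cv2.resize(frame, (W, H)), cv2.COLOR_BGR2RGB)

tex = ctx.texture((W, H), 3, rgb.tobytes())
tex.use()

fbo.clear(0.0, 0.0, 0.0, 1.0)
vao.render(moderngl.TRIANGLES)
# The textured cube would be drawn here with its own shader program and a
# model-view-projection matrix derived from the recovered camera pose (omitted).

out = np.frombuffer(fbo.read(components=3), dtype=np.uint8).reshape(H, W, 3)
out = cv2.flip(cv2.cvtColor(out, cv2.COLOR_RGB2BGR), 0)  # OpenGL reads bottom-up
cv2.imshow("AR overlay", out)
cv2.waitKey(0)
cap.release()
```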
The application defines a pinch gesture as the thumb and index fingertips coming within a small distance of each other. Combined with a proximity check against the virtual cube, this triggers interactive actions such as dragging. These mechanics allow intuitive and physically consistent manipulation in the AR environment.
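A minimal sketch of this grab logic is shown below. The thresholds and the helper name update_grab are hypothetical, chosen only for illustration; the thumb and index fingertip indices (4 and 8) follow MediaPipe's standard hand landmark model.

```python
import numpy as np

# Hypothetical thresholds; the project's actual values may differ.
PINCH_THRESHOLD = 0.03   # meters between thumb tip and index tip
GRAB_RADIUS = 0.08       # meters from the pinch point to the cube center

THUMB_TIP, INDEX_TIP = 4, 8  # MediaPipe hand landmark indices

def update_grab(world_landmarks, cube_center, grabbed):
    """Return the (possibly moved) cube center and the new grab state.

    world_landmarks: (21, 3) array of hand landmarks in world space.
    cube_center:     (3,) array, current cube position.
    grabbed:         bool, whether the cube was held last frame.
    """
    thumb = world_landmarks[THUMB_TIP]
    index = world_landmarks[INDEX_TIP]

    pinching = np.linalg.norm(thumb - index) < PINCH_THRESHOLD
    pinch_point = (thumb + index) / 2.0
    near_cube = np.linalg.norm(pinch_point - cube_center) < GRAB_RADIUS

    if grabbed and pinching:
        # Keep dragging: the cube follows the pinch point.
        return pinch_point, True
    if pinching and near_cube:
        # Start a new grab.
        return pinch_point, True
    # Released, or pinch happened too far from the cube.
    return cube_center, False
```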
The AR system runs in real time, consistently achieving over 10 FPS on a typical laptop. Visual re-projection of 3D landmarks ensures accurate alignment with the user's real hand. Color and saturation adjustments in shaders enhance visual clarity, and fallback logic ensures usability under varying lighting and movement speeds.
Watch a demonstration of the hand tracking and AR interaction capabilities below:
Developed by Aryan Singh. Explore the full implementation on GitHub.