- High-Level Architecture
- Detailed Components
- Calibration Process
- Backend Processing Pipeline
- Frame Processing and Gaze Time Update
- Metric Calculation
- Performance Calculation
- Notification Alert System
- Statistics Calculation
- Data Management
```mermaid
graph TD
subgraph Client
UI[User Interface]
GVD[Gaze Vector Display]
GCW[Calibration Window]
STW[Screen Time Widget]
STS[Statistics Window]
CPR[Camera Permission Request]
RCK[Run-Time Control Keys]
end
subgraph Backend
CL[Core Logic]
GDM[Gaze Detection Engine]
GVC[Gaze Vector Calibration]
EGT[Eye Gaze Time Tracker]
BNS[Break Notification System]
SC[Statistics Calculator]
MC[Metric Calculator]
PC[Performance Calculator]
end
subgraph Data
UM[Usage Metrics]
end
UI <-->|Input/Output| CL
CPR -->|Permission Status| CL
RCK -->|Control Commands| CL
CL <--> UM
CL <--> GDM
CL <--> GVC
CL <--> EGT
CL --> BNS
CL <--> SC
CL <--> MC
CL <--> PC
BNS --> UI
CL --> GVD
CL --> GCW
CL --> STW
SC --> STS
PC --> UI
style UI fill:#f0f9ff,stroke:#0275d8,stroke-width:2px
style GVD fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
style GCW fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
style STW fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
style STS fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
style CPR fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
style RCK fill:#f0f9ff,stroke:#0275d8,stroke-width:1px
style CL fill:#fff3cd,stroke:#ffb22b,stroke-width:2px
style GDM fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style GVC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style EGT fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style BNS fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style SC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style MC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style PC fill:#fff3cd,stroke:#ffb22b,stroke-width:1px
style UM fill:#f2dede,stroke:#d9534f,stroke-width:1px
```
The client consists of two main components:
- Main Window Application: Runs in the foreground and provides the primary user interface.
- System Tray Application: Runs in the background within the OS system tray.
The backend consists of the following components:
- Core Logic
- Gaze Detection Engine
- Gaze Vector Calibration
- Eye Gaze Time Tracker
- Break Notification System
- Metric Calculator
- Performance Calculator
- Statistics Calculator
The data layer consists of:
- Usage Metrics: Stores data on the user's screen time.
For a detailed architectural overview of each component, please refer to the Detailed Component Architecture document.
```mermaid
graph TD
A[Start Calibration] --> B[Four-Point Gaze Capture]
B --> C{Capture Successful?}
C -->|Yes| D[Combine Gaze Points]
C -->|No| B
D --> E[Calculate Convex Hull]
E --> F[Apply Error Margin]
F --> G[Intersect with Screen Boundaries]
G --> H[Determine Final Calibration Points]
H --> I[End Calibration]
style A fill:#98FB98,stroke:#333,stroke-width:2px
style B fill:#87CEFA,stroke:#333,stroke-width:2px
style C fill:#FFA07A,stroke:#333,stroke-width:2px
style D fill:#87CEFA,stroke:#333,stroke-width:2px
style E fill:#87CEFA,stroke:#333,stroke-width:2px
style F fill:#87CEFA,stroke:#333,stroke-width:2px
style G fill:#87CEFA,stroke:#333,stroke-width:2px
style H fill:#87CEFA,stroke:#333,stroke-width:2px
style I fill:#98FB98,stroke:#333,stroke-width:2px
```
[Diagram: Four-Point Calibration Screen] Description: A full-screen view with four numbered green dots in the corners and center text guiding the user.
Four-Point Gaze Capture:
- The user looks at each green dot as it appears for 1.2 seconds.
- Multiple gaze points are captured for each corner.
Combine Gaze Points and Calculate Convex Hull:
- All captured gaze points are combined.
- A convex hull algorithm finds the smallest polygon enclosing all points.
Apply Error Margin:
- The convex hull is extended by the specified error margin (default: 150 pixels).
- This accounts for potential gaze tracking inaccuracies.
Intersect with Screen Boundaries:
- The extended convex hull is intersected with screen boundaries.
- The four corners of this intersection become the final calibration points.
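As a rough sketch of these four steps, assuming OpenCV is available: the function name, the outward-from-centroid expansion, and the bounding-box corner selection are illustrative assumptions, not the actual VisionGuard code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <opencv2/imgproc.hpp>

// Illustrative post-processing of the captured gaze samples.
std::vector<cv::Point2f> computeCalibrationPoints(
    const std::vector<cv::Point2f>& gazePoints,  // all captured gaze samples
    cv::Size screen, float errorMargin = 150.0f) {
    // 1. Smallest convex polygon enclosing every captured gaze point.
    std::vector<cv::Point2f> hull;
    cv::convexHull(gazePoints, hull);

    // 2. Extend the hull by the error margin: push each vertex outward from
    //    the centroid (one simple way to apply the margin).
    cv::Point2f centroid(0.f, 0.f);
    for (const auto& p : hull) centroid += p;
    centroid *= 1.0f / static_cast<float>(hull.size());
    for (auto& p : hull) {
        cv::Point2f d = p - centroid;
        float len = std::sqrt(d.x * d.x + d.y * d.y);
        if (len > 1e-6f) p += d * (errorMargin / len);
    }

    // 3. Intersect with the screen by clamping vertices to its boundaries.
    for (auto& p : hull) {
        p.x = std::clamp(p.x, 0.f, static_cast<float>(screen.width));
        p.y = std::clamp(p.y, 0.f, static_cast<float>(screen.height));
    }

    // 4. The corners of the clipped region become the calibration points.
    cv::Rect2f box = cv::boundingRect(hull);
    return {box.tl(), {box.x + box.width, box.y},
            box.br(), {box.x, box.y + box.height}};
}
```

Offsetting each hull edge outward would handle elongated hulls more precisely than the radial expansion used here; the sketch favors brevity.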
The backend processing pipeline consists of the following stages:
- Image Input: Raw frame from the camera.
- Face Detection: Locates faces in the image.
- Facial Landmark Detection: Identifies key facial points.
- Head Pose Estimation: Determines the orientation of the head.
- Eye State Estimation: Checks if eyes are open or closed.
- Gaze Estimation: Calculates the gaze vector.
- Gaze Time Estimation: Determines if the gaze is on the screen.
- Screen Gaze Time Accumulation: Updates the total screen time.
- Usage Metrics Update: Records the latest usage data.
- Break Notification Trigger: Initiates break alerts if necessary.
The VisionGuard Core is powered by models from OpenVINO's Open Model Zoo to estimate the user's gaze and calculate the accumulated screen gaze time. The following models are integral to the backend, each playing a crucial role in the processing pipeline:
- Face Detection Model: Identifies the locations of faces within an image, serving as the first step in the gaze estimation process.
- Head Pose Estimation Model: Estimates the head pose in Tait-Bryan angles, outputting yaw, pitch, and roll angles in degrees, which are crucial inputs for the gaze estimation model.
- Facial Landmark Detection Model: Determines the coordinates of key facial landmarks, particularly around the eyes, which are necessary for precise gaze estimation.
- Eye State Estimation Model: Assesses whether the eyes are open or closed, an important factor in determining gaze direction.
- Gaze Estimation Model: Utilizes inputs from both eyes and the head pose angles to output a 3D vector representing the direction of the person's gaze in Cartesian coordinates.
For a detailed overview of the low-level architecture, refer to the wiki page.
The following diagram illustrates the processing pipeline in VisionGuard, demonstrating how different models interact to produce accurate gaze estimation:
```mermaid
graph TD
A[Image Input] --> B[Face Detection]
B --> |Face Image| C[Facial Landmark Detection]
B --> |Face Image| D[Head Pose Estimation]
C --> E[Eye State Estimation]
D --> |Head Pose Angles| F[Gaze Estimation]
C --> |Eye Image| F
E --> |Eye State| F
F -->|Gaze Vector| G[Gaze Time Estimation]
G --> H[Accumulate Screen Gaze Time]
%% Styling
style B fill:#FFDDC1,stroke:#333,stroke-width:2px
style C fill:#FFDDC1,stroke:#333,stroke-width:2px
style D fill:#FFDDC1,stroke:#333,stroke-width:2px
style E fill:#FFDDC1,stroke:#333,stroke-width:2px
style F fill:#FFDDC1,stroke:#333,stroke-width:2px
style G fill:#FFDDC1,stroke:#333,stroke-width:2px
style A fill:#C1E1FF,stroke:#333,stroke-width:2px
style H fill:#C1E1FF,stroke:#333,stroke-width:2px
```
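The dataflow in this diagram can be summarized in code. The types and stage functions below are illustrative stand-ins for VisionGuard's model wrappers (stubbed out so the sketch compiles); only the ordering and data dependencies are taken from the diagram.

```cpp
#include <optional>
#include <opencv2/core.hpp>

// Illustrative types; the real classes wrap OpenVINO inference requests.
struct HeadPose { float yaw = 0, pitch = 0, roll = 0; };      // Tait-Bryan angles, degrees
struct Eyes     { cv::Mat left, right; bool open = false; };

std::optional<cv::Rect> detectFace(const cv::Mat&) { return std::nullopt; }  // face detection model
HeadPose estimateHeadPose(const cv::Mat&)          { return {}; }            // head pose model
Eyes     detectEyes(const cv::Mat&)                { return {}; }            // landmarks + eye state models
cv::Point3f estimateGaze(const Eyes&, const HeadPose&) { return {}; }        // gaze estimation model

// Dataflow mirrors the diagram: face -> {landmarks/eyes, head pose} -> gaze.
std::optional<cv::Point3f> gazeFromFrame(const cv::Mat& frame) {
    auto face = detectFace(frame);
    if (!face) return std::nullopt;       // no face found in this frame
    cv::Mat faceImg = frame(*face);       // crop the detected face region
    HeadPose pose = estimateHeadPose(faceImg);
    Eyes eyes     = detectEyes(faceImg);
    if (!eyes.open) return std::nullopt;  // simplification: treat closed eyes as "no gaze"
    return estimateGaze(eyes, pose);      // 3D gaze vector in Cartesian coordinates
}
```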
The VisionGuard system processes each video frame to determine the user's gaze direction and update screen time metrics. Here's a high-level overview of the algorithm:
- Face and Gaze Detection: Detect faces and estimate gaze direction.
- Gaze Screen Intersection: Convert 3D gaze vector to 2D screen point.
- Gaze Time Update: Update screen time or gaze lost duration.
- Visual Feedback: Display detection results and metrics.
- Performance Tracking: Update resource utilization data.
```mermaid
flowchart TD
A[Process Frame] --> B[Detect Face & Estimate Gaze]
B --> C{Gaze on Screen?}
C -->|Yes| D{Eyes Open?}
C -->|No| E[Update Gaze Lost Time]
D -->|Yes| F[Accumulate Screen Time]
D -->|No| E
E --> G{Gaze Lost > Threshold?}
G -->|Yes| H[Reset Screen Time]
G -->|No| I[Update Metrics]
F --> I
H --> I
I --> J[Display Results]
style A fill:#f9f,stroke:#333,stroke-width:2px
style I fill:#bbf,stroke:#333,stroke-width:2px
style J fill:#bfb,stroke:#333,stroke-width:2px
```
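A compact sketch of the branching logic in this flowchart follows. The 30-second gaze-lost threshold is an illustrative default, not a documented VisionGuard value.

```cpp
#include <chrono>

// Per-frame gaze time update mirroring the flowchart above.
struct GazeTimeTracker {
    std::chrono::milliseconds screenTime{0};  // accumulated on-screen gaze time
    std::chrono::milliseconds gazeLost{0};    // continuous time without on-screen gaze
    std::chrono::milliseconds lostThreshold{std::chrono::seconds(30)};  // assumed default

    void onFrame(bool gazeOnScreen, bool eyesOpen, std::chrono::milliseconds dt) {
        if (gazeOnScreen && eyesOpen) {
            screenTime += dt;                               // Accumulate Screen Time
            gazeLost = std::chrono::milliseconds{0};
        } else {
            gazeLost += dt;                                 // Update Gaze Lost Time
            if (gazeLost > lostThreshold)
                screenTime = std::chrono::milliseconds{0};  // Reset Screen Time
        }
    }
};
```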
The following image illustrates how the gaze vector intersects with the screen:
This visualization helps in understanding how the 3D gaze vector is projected onto the 2D screen space.
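As a rough sketch, if the screen is modeled as the plane z = 0 in the calibrated coordinate frame, the projection reduces to a ray-plane intersection. The coordinate convention and function name here are assumptions for illustration.

```cpp
#include <cmath>
#include <optional>
#include <opencv2/core.hpp>

// Intersect the gaze ray (origin: eye position, direction: gaze vector)
// with the assumed screen plane z = 0.
std::optional<cv::Point2f> gazePointOnScreen(const cv::Point3f& eye,
                                             const cv::Point3f& gaze) {
    if (std::abs(gaze.z) < 1e-6f) return std::nullopt;  // gaze parallel to the screen
    float t = -eye.z / gaze.z;                          // solve eye.z + t * gaze.z = 0
    if (t < 0) return std::nullopt;                     // gaze points away from the screen
    return cv::Point2f(eye.x + t * gaze.x, eye.y + t * gaze.y);
}
```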
To determine if the gaze point is within the screen boundaries, VisionGuard uses a ray-casting algorithm, which is a common method for solving the point-in-polygon problem. Here's a visual representation of how this algorithm works:
Image credit: Wikimedia Commons
The algorithm works as follows:
- Cast a ray from the gaze point to infinity (usually along the x-axis).
- Count the number of intersections with the polygon's edges.
- If the count is odd, the point is inside the polygon; if even, it is outside.
This method is efficient and works for both convex and concave polygons, making it suitable for various screen shapes and calibration setups. The test also works in three dimensions, which is particularly useful for VisionGuard's 3D gaze estimation.
For more detailed information about this algorithm, please refer to the Point in polygon article on Wikipedia.
This algorithm provides a robust way to determine if the user's gaze is directed at the screen, allowing VisionGuard to accurately track screen time and manage break notifications.
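A standard implementation of the even-odd ray-casting test (the classic PNPOLY formulation) looks like this; VisionGuard's actual code may differ in details such as edge handling.

```cpp
#include <vector>
#include <opencv2/core.hpp>

// Cast a horizontal ray from p toward +x and toggle on each edge crossing;
// an odd number of crossings means the point is inside the polygon.
bool pointInPolygon(const cv::Point2f& p, const std::vector<cv::Point2f>& poly) {
    bool inside = false;
    for (size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
        const cv::Point2f& a = poly[i];
        const cv::Point2f& b = poly[j];
        // Edge (a, b) straddles the ray's y-level, and the crossing lies to
        // the right of p: toggle the inside flag.
        if ((a.y > p.y) != (b.y > p.y) &&
            p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x)
            inside = !inside;
    }
    return inside;
}
```

Calling pointInPolygon with the projected gaze point and the four calibration corners yields the "Gaze on Screen?" decision in the frame-processing flowchart above.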
The Metric Calculator computes usage metrics:
- Total screen time
- Continuous gaze away durations
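A minimal accumulator for these two metrics might look as follows; the structure and field names are assumptions, not VisionGuard's actual types.

```cpp
#include <chrono>
#include <vector>

// Tracks total screen time and the durations of continuous gaze-away spans.
struct MetricCalculator {
    std::chrono::milliseconds totalScreenTime{0};
    std::vector<std::chrono::milliseconds> gazeAwayDurations;  // finished away spans
    std::chrono::milliseconds currentAway{0};                  // span still in progress

    void update(bool gazeOnScreen, std::chrono::milliseconds dt) {
        if (gazeOnScreen) {
            totalScreenTime += dt;
            if (currentAway.count() > 0) {                 // an away span just ended
                gazeAwayDurations.push_back(currentAway);
                currentAway = std::chrono::milliseconds{0};
            }
        } else {
            currentAway += dt;                             // extend the current away span
        }
    }
};
```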
The Performance Calculator analyzes system performance and resource usage:
- CPU Utilization: Tracks CPU usage of the VisionGuard application.
- Memory Usage: Monitors RAM consumption.
- Frame Processing Speed: Calculates frames per second for video processing.
- Inference Latency: Measures per-frame model inference latency in milliseconds.
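CPU and memory sampling are OS-specific, but the FPS and latency bookkeeping can be sketched portably. The exponential moving average below is an assumed smoothing choice, not necessarily what VisionGuard uses.

```cpp
#include <chrono>

// Smoothed frame-rate and inference-latency tracking.
struct PerformanceCalculator {
    double fps = 0.0;        // smoothed frames per second
    double latencyMs = 0.0;  // smoothed per-frame inference latency
    static constexpr double kAlpha = 0.1;  // EMA weight for the newest sample (assumed)

    void onFrame(std::chrono::steady_clock::duration frameInterval,
                 std::chrono::steady_clock::duration inferenceTime) {
        using ms = std::chrono::duration<double, std::milli>;
        double intervalMs = ms(frameInterval).count();
        if (intervalMs > 0.0)
            fps = (1.0 - kAlpha) * fps + kAlpha * (1000.0 / intervalMs);
        latencyMs = (1.0 - kAlpha) * latencyMs + kAlpha * ms(inferenceTime).count();
    }
};
```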
The Break Notification System manages alerts based on user settings and gaze behavior:
- Break Reminders: Triggered after prolonged screen time.
- Custom Alerts: User-defined notifications based on specific conditions.
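A threshold-based trigger for break reminders can be as simple as the sketch below; the 20-minute default interval and the notify callback are illustrative assumptions, not VisionGuard's actual settings.

```cpp
#include <chrono>
#include <functional>

// Fires a break reminder once continuous screen time crosses the threshold.
struct BreakNotifier {
    std::chrono::minutes breakInterval{20};  // user-configurable threshold (assumed default)
    std::function<void()> notify;            // e.g. a system tray notification

    void check(std::chrono::milliseconds accumulatedScreenTime) {
        if (notify && accumulatedScreenTime >= breakInterval)
            notify();
    }
};
```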
The Statistics Calculator generates comprehensive reports on usage patterns:
- Daily Usage Summary: Screen time, break frequency, and duration per day.
- Weekly Trends: Week-over-week comparisons of usage patterns.
- Mean Screen Time: Average screen time computed on a daily and weekly basis.
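As an illustration, computing the mean daily screen time over the retained records could look like this; the per-day record layout and date-string keys are assumptions, not the actual storage schema.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-day usage record.
struct DailyUsage {
    std::chrono::minutes screenTime{0};
    int breakCount = 0;
};

// Mean daily screen time across whichever days are present (at most a week,
// given the retention policy described below).
std::chrono::minutes meanDailyScreenTime(const std::map<std::string, DailyUsage>& days) {
    if (days.empty()) return std::chrono::minutes{0};
    std::chrono::minutes total{0};
    for (const auto& [date, usage] : days) total += usage.screenTime;
    return total / static_cast<int>(days.size());
}
```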
The only data persisted and stored locally is the screen-time statistics. Weekly statistics are maintained; once data becomes older than a week, the stale entries are automatically cleared.
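A sketch of that retention rule, assuming entries are keyed by ISO "YYYY-MM-DD" date strings (the actual on-disk format is not specified in this document):

```cpp
#include <map>
#include <string>

// Drops all entries older than the cutoff (today minus seven days,
// computed by the caller).
void pruneOldEntries(std::map<std::string, double>& screenTimeByDate,
                     const std::string& cutoffDate) {
    // ISO date strings sort lexicographically, so everything strictly before
    // the cutoff can be erased in one range operation.
    screenTimeByDate.erase(screenTimeByDate.begin(),
                           screenTimeByDate.lower_bound(cutoffDate));
}
```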