An app that allows skipping commercial breaks on recorded TV shows using computer vision.
I first thought of it about two years ago, but ditched it too quickly for it to go anywhere. Back then I made a little presentation of my blueprints in order to consult with friends about it; it is still available here
The app relies on prior knowledge about the TV channels, which change their icons and/or present timers during commercial breaks. The app can capture the TV's content in several ways (detailed below), then uses YOLO and classic CV methods to isolate the TV and the ROIs. The ROIs in our case are the top-left and top-right corners, which are compared against the channels' "ad icons". There is optional OCR functionality to determine the exact skip duration, but it is currently too slow and therefore commented out.
It includes a testing GUI to examine the process, which also presents the stages:
I am a true believer in the ability to innovate using existing building blocks: the modern developer's ability to take existing pieces of code and methods and connect them in a never-seen-before way to create value. I find it fascinating when I run into such a plug&play, well-built, well-explained repo, and therefore I tried to make this one of them.
This can be generalized to a template for "vision triggered apps":
- Frames are captured from various sources.
- Enhancing and normalizing frames using CV techniques or algebraic transformations.
- Feeding frames into a first stage that uses fast methods (e.g., YOLO, classic CV techniques) for initial detection, reducing frames to ROIs. This stage requires contextual planning and trade-offs, plus some prior knowledge (color spaces, for example).
- Applying advanced, usually slower, methods on the minimized ROIs (like OCR). Minimizing ROIs serves both noise reduction (ignoring false detections) and performance (less area to scan).
- Based on the results, applying relatively simple logic to trigger actions.
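The staged template above can be sketched as follows (a minimal illustration; the names and the placeholder "detection" are mine, not the repo's actual classes):

```python
from collections import deque

# Bounded "cyclic" queues between stages: when full, the oldest frame is dropped.
frame_q = deque(maxlen=2)
roi_q = deque(maxlen=2)

def capture(source):
    """Stage 1: push raw frames from any source into frame_q."""
    for frame in source:
        frame_q.append(frame)

def detect_roi():
    """Stage 2: fast first pass reduces each frame to its ROIs."""
    while frame_q:
        frame = frame_q.popleft()
        roi_q.append(frame[:10])  # placeholder for YOLO/CV cropping

def act():
    """Stage 3: simple trigger logic on the reduced ROIs."""
    triggered = [roi for roi in roi_q if sum(roi) > 0]
    return len(triggered)

capture([[1] * 20, [0] * 20])
detect_roi()
print(act())  # -> 1
```

Each stage only touches its input and output queues, which is what makes the stages swappable.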
This time it is skipping ads, but it could easily be converted to detecting bus line numbers, Wolt bikers hiding their IDs, or opening a paid parking gate for members' cars only.
The app is designed to be highly modular, and each class was built to allow easy integration of components. I used the producer-consumer design pattern: stages communicate via input and output cyclic queues.
That architecture offers several benefits:
- Users can easily integrate my classes into other apps thanks to the simple I/O.
- Other modules can consume the input in parallel (the queues are readable, and the input source is not held preemptively). For example, another app that performs parental control can use the input simultaneously.
- The cyclic queues' size can be adjusted; they bound memory consumption even when stages run at different paces. That is obviously a context-dependent trade-off I made; other needs might require zero frame drops, which would call for different solutions.
- It makes parallelism relatively easy, as there is little need for shared resources.
- Using the queues has the by-product of ZOH (zero-order hold), meaning we can keep the last valid input until a new update arrives.
- REAL OOP: each class has its own `main()`, which allows running and testing it separately and independently.
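The bounded-queue behavior described above can be illustrated with `collections.deque` (a stand-in here for the repo's actual queue class):

```python
from collections import deque

q = deque(maxlen=2)  # cyclic queue: fixed size caps memory even if stages drift apart

# A fast producer outpaces the consumer: the oldest frames are silently dropped.
for frame_id in range(5):
    q.append(frame_id)
print(list(q))  # -> [3, 4]

# "ZOH" (zero-order hold): peek without popping to keep the last valid input
last_valid = q[-1] if q else None
print(last_valid)  # -> 4
```

The `maxlen` parameter is the knob mentioned above: it trades frame drops for a hard memory bound.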
Attached is an explained scheme:
Link to PDF for better view
Given the nature of the challenge I am trying to solve, there is not much value in a true real-time process. Sampling at a frequency of once a second (or even once every few seconds) is sufficient for these needs. I am using a very small queue (N=1,2), as I couldn't justify piling up frames if the next module can't process them in a timely manner. The RPI5 has its limitations, and TV detection plus LPR currently take about 2-3 seconds.
- My app's modules were built to be part of a larger app, so hogging resources would not be acceptable. Keeping it a single process allows lighter HW to run it, and lets other developers use multiprocessing in their larger app.
- Further down the roadmap I will port it to the Hailo kit, especially to improve performance and timing. Their documentation refers to multithreading rather than multiprocessing, so I expect the upgrade and transition to be smoother this way. In any case, future improvements will most likely require rethinking the workers' distribution.
Please refer to the `set_up` dir and create your virtual env. I'd rather use conda, but a venv should do as well. I have also added my device's package list (`apt install ...`), as I ran into some errors while creating the env. If your env isn't working, consider installing my packages as well.
- Go to the scripts dir: `cd <repo root>/set_up`
- Set permissions and run:
  `chmod +x install_packages_<with or without>_versions.sh`
  `./install_packages_<with or without>_versions.sh`
- Install miniforge or whichever conda distribution suits your device.
- Create the virtual env with `conda env create -f set_up/environment.yml`.
Please note that each module contains a `main()` function to allow testing its basic functionality independently, meaning you can run `(<conda_env>) python tv_detector.py`, for example.
This module's responsibility is to take the input source and push its frames into the frame_queue. Input sources can vary: USB camera, IP camera, pre-recorded video file, HDMI stream, or ADB snapshots. The goal is to let users pick whatever is easiest for them.
USB camera, IP camera, and pre-recorded video files are treated as "raw", and have to be processed by the TV detector to be "normalized" into a perfectly segmented TV rectangle. HDMI stream and ADB snapshots are the "normalized" inputs, and they (obviously) offer much easier detection due to their superior quality; for them, most of the tv_detector process is skipped.
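A producer loop for the "raw" sources could look like the sketch below. The `read_fn` callable is my stand-in for `cv2.VideoCapture.read()` (which returns an `(ok, frame)` tuple for USB/IP cameras and video files); the simulated source is only there to make the sketch self-contained:

```python
from collections import deque

def grab_frames(read_fn, frame_q, max_frames):
    """Producer loop: read_fn() returns (ok, frame), mirroring
    cv2.VideoCapture.read(). Stops on source failure/EOF."""
    grabbed = 0
    while grabbed < max_frames:
        ok, frame = read_fn()
        if not ok:                 # source closed / end of file
            break
        frame_q.append(frame)      # deque(maxlen=N) drops the oldest when full
        grabbed += 1
    return grabbed

# Simulated source standing in for a camera: three frames, then EOF
frames = iter([(True, "f0"), (True, "f1"), (True, "f2"), (False, None)])
q = deque(maxlen=2)
n = grab_frames(lambda: next(frames), q, 10)
print(n)        # -> 3
print(list(q))  # -> ['f1', 'f2']
```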
Sending snapshot commands via adb gives astonishing frames as input, with crisp images and even a high enough frame rate. However, many Android TV content apps use FLAG_SECURE to prevent users from taking screenshots of the content. The result is usually a black screen with only system icons on it (a volume indicator, for example). That block makes the ADB method unreliable for most people, but a GREAT option for some (in case the app developers missed it, or if you are using a rooted Android).
Given a camera frame containing the TV, we'd like to detect the screen's corners in order to normalize and transform it. That's not so simple, as even YOLOv8 segmentation models tend to miss parts of the TV, even under relatively good conditions. Unfortunately, the misses are usually at the top and bottom edges, where most of our data is. Bigger YOLOv8 models had the same problem, so I went back to the nano model.
The solution I used is a multi-stage approach: I use the YOLO segmentation but refine it with basic CV corner-detection methods. Once corners are detected, we can use a perspective transform to turn the side view into a straight rectangle, cropping the TV and passing it on as an ROI while ignoring the rest.
- Note: to prevent false detections of objects in the frame (including TVs shown on the TV itself), I pick the largest TV among all TV detections (by area).
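Two small pieces of this stage can be sketched without OpenCV (the helper names are mine; the actual warp would go through `cv2.getPerspectiveTransform` / `cv2.warpPerspective`): ordering the detected corners consistently, and the largest-area rule from the note above.

```python
def order_corners(pts):
    """Order 4 (x, y) corners as [top-left, top-right, bottom-right,
    bottom-left], the order cv2.getPerspectiveTransform expects."""
    by_sum = sorted(pts, key=lambda p: p[0] + p[1])   # tl has min x+y, br has max
    tl, br = by_sum[0], by_sum[-1]
    by_diff = sorted(pts, key=lambda p: p[1] - p[0])  # tr has min y-x, bl has max
    tr, bl = by_diff[0], by_diff[-1]
    return [tl, tr, br, bl]

def largest_box(boxes):
    """Keep only the biggest TV detection (by area) to ignore
    TVs that appear inside the on-screen content."""
    return max(boxes, key=lambda b: b[2] * b[3])  # boxes are (x, y, w, h)

print(order_corners([(90, 10), (10, 12), (95, 80), (8, 78)]))
print(largest_box([(0, 0, 640, 360), (100, 100, 50, 30)]))  # -> (0, 0, 640, 360)
```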
Assuming enough frames, we can segment the TV by averaging frames and looking for diffs. The problem is that in many cases, news for example, much of the frame is static, so the diffs don't show. It can still be relevant if we'd like to trade time for compute resources, as it can run on much lighter HW without YOLO or similar ML/DL methods: diffing > averaging > masking or watershedding.
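The diffing > averaging > masking chain could be sketched like this (a toy illustration with NumPy; `motion_mask` is my name, and the threshold is arbitrary):

```python
import numpy as np

def motion_mask(frames, thresh=10):
    """Average absolute frame-to-frame diffs, then threshold: pixels that
    change often are likely the TV area (fails on mostly-static content)."""
    stack = np.stack([f.astype(np.int16) for f in frames])
    diffs = np.abs(np.diff(stack, axis=0))  # per-pair |frame_t - frame_{t-1}|
    avg = diffs.mean(axis=0)
    return avg > thresh

# Toy 4x4 "frames": only the top-left 2x2 block flickers
a = np.zeros((4, 4)); b = a.copy(); b[:2, :2] = 255
mask = motion_mask([a, b, a, b])
print(int(mask.sum()))  # -> 4 changing pixels detected
```

The resulting boolean mask would then be cleaned up (morphology or watershed) before extracting the TV rectangle.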
This module indicates whether we are currently watching ads or not. It relies on the different icons in the top corners and the timer that is usually presented, counting down to the show's return.
Initially I tried to implement this module using License Plate Recognition, hence the name. LPR has much in common with the need to "understand" what the timer in the top-left corner is showing. I struggled to get it to perform well, so it is currently commented out, meaning the indication is just binary: whether we're on ads or not. (For example, MNIST-based models, EasyOCR, and Tesseract are all too slow.)
There are more "old school" techniques I am considering that might resolve it, for example:
- KNN for the digits, which are usually in a standard font.
- OCR + refinement using time properties, e.g. MM=[0,59].
- OCR with a "clock obeys time rules" constraint: if it currently shows :15 and the next sample is X seconds later, we should expect something around :15-X seconds.
- Old school: detecting ":", or sampling at a frequency that finds areas changing EXACTLY every 1 second, and treating those as the font color; applying a threshold/colorspace mask for that color should leave the text contrasted and clearer.
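The "clock obeys time rules" idea can be made concrete with a small sanity check on consecutive OCR readings (the function name and slack value are mine):

```python
def plausible_next(curr, nxt, gap_s, slack_s=2):
    """If the countdown read `curr` and we sample `gap_s` seconds later,
    a trustworthy new reading should be about curr - gap_s."""
    def to_s(t):
        m, s = t.split(":")
        return int(m) * 60 + int(s)
    expected = to_s(curr) - gap_s
    return abs(to_s(nxt) - expected) <= slack_s

print(plausible_next("1:15", "1:05", 10))  # -> True (65s is exactly 75s - 10s)
print(plausible_next("1:15", "0:03", 10))  # -> False (likely an OCR misread)
```

Readings that fail the check could be discarded as OCR noise instead of triggering a skip.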
Detection is done for each corner independently; each frame's corners are compared to a collection of references I collected in advance. The comparison is done by feature extraction with Meta's DINO, then comparing the cosine similarity of the resulting vectors.
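The similarity step could look like the following sketch. The toy 3-d vectors stand in for real DINO embeddings, and the threshold value is an assumption of mine:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_ad_corner(corner_vec, ref_vecs, thresh=0.8):
    """Compare a corner's embedding against the pre-collected ad-icon
    references; any reference above the similarity threshold is a match."""
    return max(cosine(corner_vec, r) for r in ref_vecs) >= thresh

# Toy 3-d stand-ins for DINO feature vectors of the reference ad icons
refs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(is_ad_corner(np.array([0.9, 0.1, 0.0]), refs))  # -> True
print(is_ad_corner(np.array([0.0, 0.0, 1.0]), refs))  # -> False
```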
For noise reduction, I designed it to toggle between the ad and non-ad states only after N consecutive frames of the same state.
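That debounce rule can be sketched as a small state machine (the class name and API are illustrative, not the repo's actual code):

```python
class StateDebouncer:
    """Toggle between 'ad' / 'show' only after N consecutive frames agree,
    filtering out single-frame false detections."""
    def __init__(self, n=3, initial="show"):
        self.n = n
        self.state = initial
        self._candidate = initial
        self._streak = 0

    def update(self, observed):
        if observed == self._candidate:
            self._streak += 1
        else:                       # detection changed: restart the streak
            self._candidate, self._streak = observed, 1
        if self._streak >= self.n:  # N agreeing frames: commit the toggle
            self.state = self._candidate
        return self.state

d = StateDebouncer(n=3)
print([d.update(x) for x in ["ad", "ad", "show", "ad", "ad", "ad"]])
# -> ['show', 'show', 'show', 'show', 'show', 'ad']
```

Note how the lone "show" frame in the middle resets the streak, so the state flips only after three uninterrupted "ad" frames.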
Once we detect a commercial break, we'd like to skip it or notify about it (e.g., playing a sound alerting that the commercial break has passed). We can transmit the command to the TV in several ways, each simulating the remote control action differently:
- IR (infrared) - the good old remote control method. Record the TV's signal once, then reuse it.
- Pros: 99% of TVs have an IR receiver.
- Cons: requires extra HW (even if cheap), and setup for each specific TV model.
- ADB command - simulating a key press is a basic adb command.
- Pros: such commands are easy to use and are usually standard.
- Cons: Requires being on the same local network as the TV.
- Bluetooth - given that many remote controls are BT devices, we can mimic their actions.
- Pros: Very standard protocol allows easy setup.
- Cons: Requires pairing it like a new remote.
- Virtual keyboard - simulating a keyboard connected to the TV's USB, then relying on the standard protocols to send the "fast forward" button.
- Pros: Robust way to communicate, should be a plug&play thing.
- Cons: requires a direct USB connection from the RPI to the TV, which might limit us, especially when a USB camera is also used, since we must consider where to place the camera and the RPI.
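For the ADB option, the key press is a one-liner. A minimal sketch (the function name is mine; `KEYCODE_MEDIA_FAST_FORWARD` is the standard Android keycode for fast-forward, and the device must first be reachable via `adb connect <tv-ip>`):

```python
import subprocess

def build_skip_cmd(keycode="KEYCODE_MEDIA_FAST_FORWARD"):
    """Build the adb key-press command that simulates the remote's
    fast-forward button on an Android TV."""
    return ["adb", "shell", "input", "keyevent", keycode]

cmd = build_skip_cmd()
print(" ".join(cmd))  # -> adb shell input keyevent KEYCODE_MEDIA_FAST_FORWARD
# subprocess.run(cmd, check=True)  # uncomment with a connected/paired device
```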
- Improving timing using faster CV elements and simplified OCR (techniques like KNN might offer an interesting approach).
- Enabling OCR of the top-left corner in relevant timing (<2 secs/frame).
- Finding a way to bypass the adb FLAG_SECURE block so input can be captured via adb. Android accessibility features or a privileged Android app might resolve it.
- Refactoring stages to use Hailo's LPR demo, harnessing the RPI's AI Hat's abilities.