Running artificial intelligence on a pocket-sized microcontroller like the ESP32 sounds ambitious, but you can teach one to distinguish between everyday objects in a matter of minutes. Here's how this tech makes a complex task accessible to just about anyone.
Adam Conway is XDA's Lead Technical Editor, based in Ireland, with a BSc in Computer Science and a thesis on evaluating the hidden performance aspects of Android apps and devices. He has covered tech since 2017, and when he isn't writing code or articles, you'll find him playing Counter-Strike or VALORANT. You can reach him at adam@xda-developers.com, on Twitter as @AdamConwayIE (https://twitter.com/AdamConwayIE), on Instagram at adamc.99 (https://www.instagram.com/adamc.99/), or on Reddit as u/AdamConwayIE.
The ESP32 (https://www.xda-developers.com/more-than-one-esp32-differences-between-all/) is a compact yet surprisingly capable microcontroller, and one of its best tricks is TinyML (https://www.xda-developers.com/tinyml-impressive-software-esp32/), machine learning slimmed down to run on microcontrollers. It's well suited to tasks like anomaly detection, audio recognition, and image classification. Image classification in particular is remarkably straightforward; I built a binary classifier in under five minutes.
Binary classification means training a model to differentiate between two labeled classes. Once deployed, the ESP32 evaluates an image and outputs a probability indicating whether it matches class A or class B. If you have the XIAO ESP32-S3 Sense, setup is a breeze, but I'll also show you how to adapt this for any ESP32-S3 CAM (https://www.xda-developers.com/built-local-first-ring-doorbell-esp32/), provided you can gather your own training data.
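Interpreting a binary classifier's output is trivial: with only two classes, the probability of class B is just one minus the probability of class A. A minimal sketch (the helper name and labels here are illustrative, not from any specific library):

```cpp
#include <string>

// Hypothetical helper: turn the model's probability for class A into a
// labeled decision. With two classes, P(B) = 1 - P(A), so a single
// probability and a threshold are all that's needed.
std::string classify(float p_class_a,
                     const std::string &label_a,
                     const std::string &label_b,
                     float threshold = 0.5f) {
    if (p_class_a >= threshold)
        return label_a;   // e.g. "ESP32" at 87% confidence
    return label_b;       // e.g. "Watch" at 1 - P(A) confidence
}
```
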
Leveraging SenseCraft AI
A user-friendly, code-free approach
This approach requires the XIAO ESP32-S3 Sense from Seeed Studio, so if you don't have one, skip ahead to the next section. Building a custom image classifier with SenseCraft AI (https://sensecraft.seeed.cc/ai/) is delightfully simple, and it genuinely impressed me. It's a full web-based interface for deploying various models, including your own image classifier. Head to the Training section at the top, where you'll collect the images used for training.
Here, connect your camera to the SenseCraft AI training portal and pick the first class you want to train. With the ESP32's camera active, press and hold the capture button to take snapshot images for each class. Capturing images directly from the ESP32 ensures the training data matches what the deployed model will actually see. For my demo, I trained it on an ESP32 device and my Google Pixel Watch 2.
After gathering your snapshots, hit Start training. This initiates the creation of a detection model from your provided images, ready for deployment.
As evidenced above, it pinpointed the ESP32 flawlessly! Granted, this is a basic setup and might not achieve perfect precision in real-world scenarios. Yet, for demonstrating potential, it's effective, showcasing feasible uses with just an ESP32 and camera module.
What if you don't have the XIAO ESP32-S3 Sense? No problem: you can run this on any ESP32-S3, and I put together a basic TensorFlow Lite implementation to show how.
Crafting Binary Image Classification on the ESP32-S3
Rolling up our sleeves for a DIY solution
I built a custom version for the ESP32-S3, though it comes with a few compromises. It relies on Wi-Fi and a web server for viewing results, and it has one significant drawback: it's slow. Each inference takes roughly 10 seconds, even after tweaking, which included disabling the ESP32's watchdog timer and pinning inference to one of the two cores. Disabling the watchdog isn't advisable, since it's a safety net, but it was necessary here. My attempts to yield to other tasks didn't help, likely because the yield in my inference loop wasn't reached often enough. Some of this lag stems from the camera's inherent demands; adding a web server and ML computation on top simply overwhelms the system.
The architecture divides tasks across two cores to isolate intensive calculations from web duties. Core 0 manages a dedicated analysis routine that repeatedly grabs camera frames, executes TensorFlow Lite processing, and refreshes a stored JPEG image. Core 1 oversees Wi-Fi and web server functions via AsyncTCP and ESPAsyncWebServer, keeping the interface functional during heavy ML workloads. This division stops the web server from freezing under computational strain.
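On the ESP32 itself, this split uses FreeRTOS's `xTaskCreatePinnedToCore()` to pin the inference loop to core 0. The same structure can be sketched in portable C++ with `std::thread` standing in for the pinned task (the function names and timing here are illustrative, not from the repo):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Shared flags: the web-serving thread reads results while the
// inference thread keeps churning through frames.
std::atomic<int> frames_processed{0};
std::atomic<bool> running{true};

// Stand-in for the task pinned to core 0 on the ESP32 via
// xTaskCreatePinnedToCore(inference_loop, "infer", ..., /*core=*/0).
void inference_loop() {
    while (running.load()) {
        // Placeholder for: grab frame -> preprocess -> invoke TFLite
        frames_processed.fetch_add(1);
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```

Because the heavy work lives on its own thread (core 0 on the ESP32), the web server thread (core 1) stays responsive no matter how long a single inference takes.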
The inference routine runs in an endless loop. It first grabs a 96x96 grayscale frame while holding the camera mutex. The preprocess_frame_to_input() routine converts raw pixel values (0-255) to a 0-1 scale, then quantizes them to int8 using the model's scale and zero-point parameters. Next, the TensorFlow Lite Micro interpreter runs inference with an AllOpsResolver and a 250KB tensor arena in PSRAM. The output tensor's quantization parameters convert the results back to floating-point probabilities, yielding scores for "ESP32" versus "Watch".
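The quantization round-trip described above can be sketched in a few lines. The scale and zero-point values below are placeholders; a real model supplies its own (read from the input and output tensors at runtime, e.g. `input->params.scale` in TensorFlow Lite Micro):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Map raw 8-bit grayscale pixels (0-255) into the model's int8 input
// domain: normalize to 0-1, then quantize with the input tensor's
// scale and zero point, clamping to the int8 range.
std::vector<int8_t> quantize_frame(const std::vector<uint8_t> &pixels,
                                   float scale, int zero_point) {
    std::vector<int8_t> out;
    out.reserve(pixels.size());
    for (uint8_t p : pixels) {
        float normalized = p / 255.0f;                    // 0-255 -> 0.0-1.0
        long q = std::lround(normalized / scale) + zero_point;
        if (q < -128) q = -128;                           // clamp to int8
        if (q > 127) q = 127;
        out.push_back(static_cast<int8_t>(q));
    }
    return out;
}

// Reverse mapping for the output tensor: convert an int8 score back
// into a floating-point probability.
float dequantize(int8_t q, float scale, int zero_point) {
    return (static_cast<int>(q) - zero_point) * scale;
}
```

With the common full-range parameters (scale 1/255, zero point -128), pixel 0 maps to -128 and pixel 255 maps to 127, spanning the entire int8 range.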
The web portal offers three paths: / for the main page, /predict for probability data in JSON, and /stream for live images. To prevent server delays, /stream serves a pre-processed JPEG from a continuously updated buffer, rather than capturing on the fly. The analysis task, on Core 0, transforms each frame to JPEG with fmt2jpg() and stores it securely with a mutex. Meanwhile, the HTML page checks /predict every 500ms for fresh probabilities and reloads the image every second upon buffer changes.
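The /predict response is just a small JSON document with the two class probabilities. A sketch of what building that body could look like (the field names and format here are my own, not necessarily the repo's exact output):

```cpp
#include <cstdio>
#include <string>

// Format the two class probabilities as a JSON body for a
// /predict-style endpoint. snprintf is enough here; a fixed,
// flat schema doesn't need a JSON library.
std::string predict_json(float p_esp32, float p_watch) {
    char buf[64];
    std::snprintf(buf, sizeof(buf),
                  "{\"esp32\":%.3f,\"watch\":%.3f}", p_esp32, p_watch);
    return std::string(buf);
}
```

The page's JavaScript can then poll this endpoint on a timer and update the displayed percentages from the parsed JSON.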
Two mutexes manage access: one guards the camera to avoid conflicts, the other shields the JPEG cache. The watchdog is deliberately turned off on both cores using disableCore0WDT() and disableCore1WDT(), as ML analysis surpasses the typical five-second limit.
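The JPEG-cache half of that pattern can be sketched in portable C++, with std::mutex standing in for the FreeRTOS mutexes (xSemaphoreCreateMutex / xSemaphoreTake) the ESP32 build would use. The class and method names are illustrative:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Shared JPEG cache: the inference task calls update() after each
// fmt2jpg() conversion; the /stream handler calls snapshot() to serve
// the latest frame without ever touching the camera itself.
class JpegCache {
public:
    void update(const std::vector<uint8_t> &jpeg) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffer_ = jpeg;   // replace the cached frame under the lock
        ++version_;       // lets clients detect that a new frame landed
    }
    std::vector<uint8_t> snapshot(uint64_t *version = nullptr) const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (version) *version = version_;
        return buffer_;   // copy out under the lock
    }
private:
    mutable std::mutex mutex_;
    std::vector<uint8_t> buffer_;
    uint64_t version_ = 0;
};
```

Copying the buffer out under the lock keeps the critical section short, so the inference task is never blocked for long by a slow HTTP client.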
There's plenty of room for optimization, but it works as a starting point. I used the same images from the XIAO experiment and wrote a training script that converts labeled folders of photos to grayscale, trains an FP32 binary model, converts it to an int8 TFLite model suited to the ESP32, and emits a C header you can drop into your code. It looks for a "datasets" folder and treats each subfolder as one class of images.
To try it out, implement this on an ESP32-S3 following the code in my GitHub repo (https://github.com/Incipiens/ESP32S3CamBinaryClassification)! It includes step-by-step guidance for custom model training and deployment. I must emphasize, this isn't the most efficient setup, but it highlights image classification possibilities. With targeted classifiers, imagine applications like monitoring for alerts, detecting flashing lights, identifying individuals, and endless innovations.
It's all a trade-off: on-device inference keeps your images private but runs slowly on hardware like this, and disabling safety features like the watchdog is nothing you'd want in a production project. Still, as a demonstration of what a few dollars of hardware can do with machine learning, it's hard to beat.