In this tutorial you will learn how to detect and classify objects using the ESP32-CAM module and YOLO, a deep-learning system for object detection.
I’ll guide you through building a project where the ESP32-CAM captures images, operates as a web server, and sends the images to a computer for analysis. The computer will use YOLO to detect and classify objects. You’ll learn to assemble the hardware, configure the camera, and serve JPEG images via HTTP.
By the end, you’ll have a working web interface to view snapshots captured by your ESP32-CAM and an object detection system that recognizes 80 different types of objects.
Required Parts
Below you will find the components required to build the project. Instead of the FTDI Programmer you could also use a Programming Shield for the ESP32-CAM, but I recommend the FTDI Programmer.
ESP32-CAM
FTDI USB-TTL Adapter
USB Data Cable
Arduino IDE
System Architecture
The system we are going to build is composed of two main components: (1) an ESP32-CAM module that captures images and runs a web server that delivers them via Wi-Fi, and (2) a PC running the YOLO object detection system, which analyzes the images and annotates the detected objects. The diagram below gives you an overview of the system architecture:
The following picture shows you an example detection of the system. On my cluttered desk, YOLO could detect a cup with 88% confidence, a pair of scissors with 68%, and a laptop with 59%. You can see the bounding boxes around the objects with their names and detection confidence annotated in the upper left corner:
In the next sections you will learn how to program the web server on the ESP32-CAM and how to set up the YOLO detection system on a PC.
The ESP32-CAM Development Board
The ESP32-CAM Development Board is a compact module that combines an ESP32-S chip, a camera, a built-in flash, and a microSD card slot. The board has integrated Wi-Fi and Bluetooth and supports an OV2640 or OV7670 camera with up to 2 megapixels of resolution.
In this tutorial we refer to the original AI-Thinker model of the ESP32-CAM board, but there are many clones with exactly the same specifications. They are programmed and used in the same way – including the one listed under Required Parts.
Connecting the FTDI programmer
You can program the ESP32-CAM via a Programming Shield or via an FTDI Programmer. The latter is easier to use and more flexible. It converts USB signals to serial signals and allows you to program microcontrollers such as the Arduino and the ESP32 via the UART interface. The following picture shows you how to connect the FTDI Programmer to the ESP32-CAM module.
The connections are simple. Start by connecting GND of the Programmer to GND of the ESP32-CAM module (blue wire). Then do the same with the 5V power supply (red wire). Note that some FTDI Programmers have jumpers or switches to select between 3.3V and 5V. Watch out for that and use 5V, if possible.
Next we connect the U0T (U0TXD) pin of the ESP32-CAM to the RXD pin of the Programmer (yellow wire). Similarly, U0R gets connected to TXD (green wire). With that, serial communication is established.
To switch the ESP32-CAM into programming mode, the IO0 pin needs to be connected to ground (GND). But if you want to run the program, the IO0 pin needs to be left unconnected. I therefore added a switch between IO0 and GND (purple wire) that lets me switch between programming and running mode. The photo below shows my wiring of the ESP32-CAM with the switch and the FTDI Programmer:
Installing the ESP32 Core
If this is your first project with any board of the ESP32 series, you need to do the board installation first. If ESP32 boards are already installed in your Arduino IDE, you can skip this installation section.
Start by opening the Preferences dialog by selecting “Preferences…” from the “File” menu. This will open the Preferences dialog shown below.
Under the Settings tab you will find an edit box at the bottom of the dialog that is labelled “Additional boards manager URLs“:
In this input field, copy the following URL: https://espressif.github.io/arduino-esp32/package_esp32_dev_index.json
This lets the Arduino IDE know where to find the ESP32 core libraries. Next we will actually install the ESP32 core libraries using the Boards Manager.
Open the Boards Manager via “Tools -> Board -> Boards Manager”. You will see the Boards Manager appearing in the left sidebar. Enter “ESP32” in the search field at the top and you should see two types of ESP32 boards: the “Arduino ESP32 Boards” and the “esp32 by Espressif” boards. We want the esp32 libraries by Espressif. Click on the INSTALL button and wait until the download and installation are complete.
Selecting ESP32-CAM Board
Click on the drop-down menu and then on “Select other board and port…”:
This will open a dialog where you enter “ESP32-CAM” in the search bar. You will see the “AI Thinker ESP32-CAM” board under Boards. Click on it and on the COM port to activate it, then click OK:
If you cannot select a PORT despite the ESP32-CAM being plugged into a USB port via the FTDI programmer, then the CP210x driver is missing. Go to SILICON LABS Software Downloads and download the CP210x driver for your operating system, e.g. for Windows it is “CP210x VCP Windows”:
This will download a ZIP file. Unpack it and run the installer. After that, your ESP32-CAM should appear as connected to a USB port. If you still have issues, you may have to install an FTDI driver as well.
Installing the ESP32-CAM library
For our web server we are going to use the esp32cam library. Go to the GitHub repo, click on the green CODE button and then “Download ZIP” to download the library:
Then click on “Sketch->Include Library->Add .Zip Library”:
and select the path to the ZIP file you just downloaded to install the library. In the next section we write and explain the code for running a Web Server on the ESP32-CAM.
Code for ESP32-CAM Web Server
The following code sets up an ESP32-CAM module to send images over a Wi-Fi network. It captures images and serves them as JPEG files to clients that request them. The server runs on port 80, which is the default HTTP port.
```cpp
#include "WebServer.h"
#include "WiFi.h"
#include "esp32cam.h"

const char* WIFI_SSID = "SSID";
const char* WIFI_PASS = "PASSWORD";
const char* URL = "/cam.jpg";

static auto RES = esp32cam::Resolution::find(800, 600);
WebServer server(80);

void serveJpg() {
  auto frame = esp32cam::capture();
  if (frame == nullptr) {
    Serial.println("CAPTURE FAILED!");
    server.send(503, "", "");
    return;
  }
  Serial.printf("CAPTURE OK %dx%d %db\n", frame->getWidth(), frame->getHeight(),
                static_cast<int>(frame->size()));
  server.setContentLength(frame->size());
  server.send(200, "image/jpeg");
  WiFiClient client = server.client();
  frame->writeTo(client);
}

void handleJpg() {
  if (!esp32cam::Camera.changeResolution(RES)) {
    Serial.println("CAN'T SET RESOLUTION!");
  }
  serveJpg();
}

void initCamera() {
  {
    using namespace esp32cam;
    Config cfg;
    cfg.setPins(pins::AiThinker);
    cfg.setResolution(RES);
    cfg.setBufferCount(2);
    cfg.setJpeg(80);
    bool ok = Camera.begin(cfg);
    Serial.println(ok ? "CAMERA OK" : "CAMERA FAIL");
  }
}

void initWifi() {
  WiFi.persistent(false);
  WiFi.mode(WIFI_STA);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  while (WiFi.status() != WL_CONNECTED)
    ;
  Serial.printf("http://%s%s\n", WiFi.localIP().toString().c_str(), URL);
}

void initServer() {
  server.on(URL, handleJpg);
  server.begin();
}

void setup() {
  Serial.begin(115200);
  initWifi();
  initCamera();
  initServer();
}

void loop() {
  server.handleClient();
}
```
Let’s break down the code into its components to understand how it works.
Libraries and Constants
At the beginning of the code, we include the necessary libraries for the web server, Wi-Fi functionality, and camera control.
```cpp
#include "WebServer.h"
#include "WiFi.h"
#include "esp32cam.h"
```
We also define constants for the Wi-Fi credentials and the URL endpoint for accessing the camera image.
```cpp
const char* WIFI_SSID = "SSID";
const char* WIFI_PASS = "PASSWORD";
const char* URL = "/cam.jpg";
```
Obviously, you will have to replace the credentials with the SSID and password for your Wi-Fi network.
Camera Resolution
We set the desired resolution for the camera. In this case, we are looking for a resolution of 800×600 pixels.
static auto RES = esp32cam::Resolution::find(800, 600);
Web Server Initialization
Next we create an instance of the web server that listens on port 80.
WebServer server(80);
Serve JPEG Function
The `serveJpg()` function captures an image from the camera and sends it to the client as a JPEG file. If the capture fails, it sends a “503 Service Unavailable” response.
```cpp
void serveJpg() {
  auto frame = esp32cam::capture();
  if (frame == nullptr) {
    Serial.println("CAPTURE FAILED!");
    server.send(503, "", "");
    return;
  }
  Serial.printf("CAPTURE OK %dx%d %db\n", frame->getWidth(), frame->getHeight(),
                static_cast<int>(frame->size()));
  server.setContentLength(frame->size());
  server.send(200, "image/jpeg");
  WiFiClient client = server.client();
  frame->writeTo(client);
}
```
Here, we first attempt to capture a frame. If successful, we log the dimensions and size of the image, set the content length, and send the image back to the client.
Handle JPEG Function
The `handleJpg()` function changes the camera resolution and calls `serveJpg()` to serve the image.
```cpp
void handleJpg() {
  if (!esp32cam::Camera.changeResolution(RES)) {
    Serial.println("CAN'T SET RESOLUTION!");
  }
  serveJpg();
}
```
This function ensures that the camera is set to the desired resolution before serving the image.
Camera Initialization
The `initCamera()` function configures the camera settings, including pin assignments, resolution, buffer count, and JPEG quality.
```cpp
void initCamera() {
  {
    using namespace esp32cam;
    Config cfg;
    cfg.setPins(pins::AiThinker);
    cfg.setResolution(RES);
    cfg.setBufferCount(2);
    cfg.setJpeg(80);
    bool ok = Camera.begin(cfg);
    Serial.println(ok ? "CAMERA OK" : "CAMERA FAIL");
  }
}
```
We create a configuration object, set the necessary parameters, and initialize the camera. A message is printed to the serial monitor indicating whether the camera initialization was successful.
Wi-Fi Initialization
The `initWifi()` function connects the ESP32 to the specified Wi-Fi network.
```cpp
void initWifi() {
  WiFi.persistent(false);
  WiFi.mode(WIFI_STA);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  while (WiFi.status() != WL_CONNECTED)
    ;
  Serial.printf("http://%s%s\n", WiFi.localIP().toString().c_str(), URL);
}
```
We disable persistent Wi-Fi connections, set the mode to station, and attempt to connect to the Wi-Fi network. Once connected, we print the URL to access the camera image.
Server Initialization
The `initServer()` function sets up the server to handle requests for the camera image.
```cpp
void initServer() {
  server.on(URL, handleJpg);
  server.begin();
}
```
We define the URL endpoint and associate it with the handleJpg()
function, then start the server.
Setup Function
The `setup()` function initializes the serial communication, Wi-Fi, camera, and server.
```cpp
void setup() {
  Serial.begin(115200);
  initWifi();
  initCamera();
  initServer();
}
```
Loop Function
Finally, the `loop()` function continuously handles incoming client requests.
```cpp
void loop() {
  server.handleClient();
}
```
This function ensures that the server is responsive to client requests, allowing clients to retrieve images captured by the camera.
Test the ESP32-CAM Web Server
Now, let us test the web server. Compile and upload the code above. To upload code to the ESP32-CAM, switch the board into programming mode by flipping the switch, briefly press the Reset button on the board, and then click the Upload button in the Arduino IDE.
If you need more help with uploading code to the ESP32-CAM, have a look at the Programming the ESP32-CAM tutorial, which provides more details.
After a successful upload you will see the URL for the camera pictures printed to the Serial Monitor and you should also see the text “CAMERA OK”:
```
http://192.168.1.146/cam.jpg
CAMERA OK
```
Copy this URL into the address bar of your web browser and you should see the picture the camera has taken:
Every time you press the reload button in your web browser, the web server takes this request, asks the ESP32-CAM to take a new picture, and sends this new picture to your browser. Below is a picture of my desk, taken this way:
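You can also fetch snapshots from the web server programmatically instead of using a browser. Here is a minimal, hedged sketch in Python that requests one JPEG and saves it to disk; the IP address is just an assumption from my network (use the URL printed to your Serial Monitor), and `snapshot_name()` is a hypothetical helper for numbering the files:

```python
import urllib.request

# Assumption: replace with the URL printed to your Serial Monitor;
# 192.168.1.146 is just the address my router happened to assign.
CAM_URL = "http://192.168.1.146/cam.jpg"

def snapshot_name(index):
    """Zero-padded filename for the n-th snapshot, e.g. snapshot_0001.jpg."""
    return f"snapshot_{index:04d}.jpg"

def fetch_snapshot(url, path, timeout=5):
    """Request one JPEG from the ESP32-CAM and save it to disk."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except OSError as err:  # camera unreachable or capture failed (HTTP 503)
        print(f"Could not reach camera: {err}")
        return False
    with open(path, "wb") as f:
        f.write(data)
    return True

# Uncomment once your ESP32-CAM is online:
# fetch_snapshot(CAM_URL, snapshot_name(1))
```

Each request triggers a fresh capture on the ESP32-CAM, exactly like pressing reload in the browser.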
In the next section we send the pictures to the YOLO Object Detection Model to recognize the objects in the scene.
YOLO Object Detection
YOLO (You Only Look Once) is a deep-learning model for object detection known for its speed and accuracy. It was first introduced by Joseph Redmon et al. in 2016. Since then there have been many improved versions with YOLO11 by Ultralytics being the latest one (as of Feb 2025).
However, we are going to use the older YOLOv3 model, since it is smaller and easier to use, although its accuracy is not as high as that of the more recent models.
The original YOLO model is a deep convolutional network that takes an RGB image with dimensions 448×448×3 as input and outputs the bounding boxes and confidence scores for the detected objects in a 7×7×30 tensor; YOLOv3, which we use here, works on 416×416 inputs and predicts at three different scales. We are going to use a version of the model that is trained to detect 80 different objects, such as:
- person
- bicycle
- car
- motorbike
- …
- scissors
- teddy bear
- hair drier
- toothbrush
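These class labels come from the COCO dataset and are stored one per line in the coco.names file that we download in a later section. A minimal sketch of how such a label list is parsed (the four-line sample is just an excerpt of the real 80-entry file):

```python
# Parse class labels as stored in coco.names: one label per line,
# blank lines ignored.
def load_class_names(text):
    return [line.strip() for line in text.splitlines() if line.strip()]

# Excerpt of the real file; the full list has 80 entries.
sample = "person\nbicycle\ncar\nmotorbike\n"
labels = load_class_names(sample)
print(len(labels), labels[0])  # 4 person
```

The index of each label in this list corresponds to the class ID that the model outputs, which is how the detection code later maps a numeric prediction back to a human-readable name.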
We won’t go into the details of the model here, but if you want to learn more, here are links to the original YOLO publication, a description of the improvements in version 3 of YOLO, and an application paper with useful information:
- YOLOv3: An Incremental Improvement
- You Only Look Once: Unified, Real-Time Object Detection
- YOLO v3: Visual and Real-Time Object Detection Model for Smart Surveillance Systems (3s)
Project Folder Structure
For running the YOLO object detection system on a PC we need to create a project folder, let’s say “esp32-cam-object-detection”. Within this folder, create a subfolder named “YOLO” and a Python file named “detect.py”. Your folder structure should look as follows:
Download YOLO files
Next you have to download the required YOLO files (weights, architecture config, class names) and place them in the “YOLO” folder of the project. Here are the links to these files:
- https://pjreddie.com/media/files/yolov3.weights
- https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg
- https://github.com/pjreddie/darknet/blob/master/data/coco.names
The contents of your “YOLO” folder should then look like this:
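If you prefer to script the download, here is a hedged sketch. Note that the GitHub links above point to “blob” pages that serve HTML, not the files themselves, so the helper below rewrites them to GitHub’s raw-file URLs; `github_raw()` and `download_yolo_files()` are hypothetical helpers, not part of any library:

```python
import os
import urllib.request

# Hypothetical helper: turn a GitHub "blob" page link into the raw-file
# URL, since the blob page serves HTML instead of the file itself.
def github_raw(url):
    return url.replace("github.com", "raw.githubusercontent.com").replace("/blob/", "/")

YOLO_FILES = [
    "https://pjreddie.com/media/files/yolov3.weights",
    github_raw("https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg"),
    github_raw("https://github.com/pjreddie/darknet/blob/master/data/coco.names"),
]

def download_yolo_files(folder="YOLO"):
    """Download the three YOLO files into the given folder."""
    os.makedirs(folder, exist_ok=True)
    for url in YOLO_FILES:
        target = os.path.join(folder, url.rsplit("/", 1)[-1])
        print(f"Downloading {url} -> {target}")
        urllib.request.urlretrieve(url, target)

# Call download_yolo_files() when you are online; the weights file
# is over 200 MB, so this can take a while.
```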
Creating Virtual Environment
We also have to install some Python libraries, and we will install them in a virtual environment created with venv. Open a command shell and execute the following commands:
```shell
cd esp32-cam-object-detection
python -m venv venv
venv\Scripts\activate.bat
pip install opencv-python opencv-python-headless numpy torch torchvision
```
The `cd` command moves us into the project folder. The `venv` command creates the virtual environment, and within this environment we install the required libraries via `pip install`. This creates a “venv” folder in the project folder that contains the libraries:
Object Detection Code
Finally, as a last step, copy the following code into the `detect.py` file in your project folder.
```python
import cv2
import numpy as np
import urllib.request

# Camera URL
url = "http://192.168.1.146/cam.jpg"

# YOLO model files
weights_path = r"./YOLO/yolov3.weights"
config_path = r"./YOLO/yolov3.cfg"
names_path = r"./YOLO/coco.names"

# Load the YOLO model and COCO class names
net = cv2.dnn.readNet(weights_path, config_path)
with open(names_path, "r") as f:
    classes = [line.strip() for line in f.readlines()]

layer_names = net.getLayerNames()

# Handling the return value of getUnconnectedOutLayers()
out_layers = net.getUnconnectedOutLayers()
if isinstance(out_layers[0], list):
    output_layers = [layer_names[i[0] - 1] for i in out_layers]
else:
    output_layers = [layer_names[i - 1] for i in out_layers]

# Generate random colors for each class
colors = np.random.uniform(0, 255, size=(len(classes), 3))


def detect_objects(frame):
    height, width, _ = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    layer_outputs = net.forward(output_layers)

    boxes = []
    confidences = []
    class_ids = []

    for output in layer_outputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.3:
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.3, 0.4)

    # Draw detections on the frame
    if len(indexes) > 0 and isinstance(indexes, np.ndarray):
        indexes = indexes.flatten()
        for i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = confidences[i]
            color = colors[class_ids[i]]
            print(f"Detected: {label} with confidence {confidence:.2f}")
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(
                frame,
                f"{label} {confidence:.2f}",
                (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                color,
                2,
            )

    return frame


def main():
    cv2.namedWindow("Object Detection", cv2.WINDOW_AUTOSIZE)
    while True:
        try:
            img_resp = urllib.request.urlopen(url)
            imgnp = np.array(bytearray(img_resp.read()), dtype=np.uint8)
            frame = cv2.imdecode(imgnp, -1)
            frame = detect_objects(frame)
            cv2.imshow("Object Detection", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        except Exception as e:
            print(f"Error occurred: {e}")
            break
    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()
```
The code above implements an object detection system using the YOLO model with OpenCV. It captures images from a camera feed and detects objects in real-time, displaying the results on the screen.
Importing Libraries
We start by importing the necessary libraries: `cv2` for computer vision tasks, `numpy` for numerical operations, and `urllib.request` for handling URL requests.
```python
import cv2
import numpy as np
import urllib.request
```
Camera URL
Here, we define the URL of the camera feed from which we will capture images. You will have to replace this constant with the URL under which your web server delivers its images:
url = "http://192.168.1.146/cam.jpg"
YOLO Model Files
Next, we specify the paths to the YOLO model files: the weights file, the configuration file, and the names of the objects that the model can detect.
```python
weights_path = r"./YOLO/yolov3.weights"
config_path = r"./YOLO/yolov3.cfg"
names_path = r"./YOLO/coco.names"
```
Loading the YOLO Model
We load the YOLO model using OpenCV’s `dnn` module and read the class names from the specified file. The layer names are also retrieved for later use.
```python
net = cv2.dnn.readNet(weights_path, config_path)
with open(names_path, "r") as f:
    classes = [line.strip() for line in f.readlines()]

layer_names = net.getLayerNames()
```
Output Layers
We determine the output layers of the network. This is crucial for understanding which layers provide the final detections.
```python
out_layers = net.getUnconnectedOutLayers()
if isinstance(out_layers[0], list):
    output_layers = [layer_names[i[0] - 1] for i in out_layers]
else:
    output_layers = [layer_names[i - 1] for i in out_layers]
```
Generating Colors for Classes
To visualize and distinguish the detected objects, we generate random colors for the bounding boxes.
colors = np.random.uniform(0, 255, size=(len(classes), 3))
Object Detection Function
The `detect_objects()` function takes an image frame as input, processes it, and detects objects using the YOLO model. It returns the frame with bounding boxes and labels drawn on it.
```python
def detect_objects(frame):
    height, width, _ = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    layer_outputs = net.forward(output_layers)

    boxes = []
    confidences = []
    class_ids = []
```
In this function, we first create a blob from the input frame, which is a preprocessed version of the image suitable for the model. We then perform a forward pass to get the output from the model.
Processing Detections
We loop through the outputs to extract bounding boxes, confidence scores, and class IDs for detected objects. Only detections with a confidence greater than 0.3 are considered valid. Feel free to change this threshold (0…1) to show more or fewer detections.
```python
    for output in layer_outputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.3:
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)
```
Non-Maximum Suppression
To eliminate redundant overlapping boxes, we apply Non-Maximum Suppression (NMS) to keep only the best bounding boxes.
indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.3, 0.4)
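To illustrate what NMS does under the hood, here is a simplified pure-Python sketch of the idea — compute the intersection-over-union (IoU) between boxes and greedily keep the highest-scoring box while suppressing overlapping ones. This is not the exact OpenCV implementation, just a minimal illustration with the same `[x, y, w, h]` box format:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

For example, two nearly identical boxes around the same cup collapse to one, while a distant box survives: `nms([[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 10, 10]], [0.9, 0.8, 0.7])` returns the indices `[0, 2]`.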
Drawing Detections
We draw the bounding boxes and labels on the frame for each detected object. The detected class/object name and confidence score are displayed.
```python
    if len(indexes) > 0 and isinstance(indexes, np.ndarray):
        indexes = indexes.flatten()
        for i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = confidences[i]
            color = colors[class_ids[i]]
            print(f"Detected: {label} with confidence {confidence:.2f}")
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(
                frame,
                f"{label} {confidence:.2f}",
                (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                color,
                2,
            )
```
Main Function
The `main()` function sets up a window for displaying the detections and continuously captures frames from the camera feed. It processes each frame through the `detect_objects()` function and displays the result.
```python
def main():
    cv2.namedWindow("Object Detection", cv2.WINDOW_AUTOSIZE)
    while True:
        try:
            img_resp = urllib.request.urlopen(url)
            imgnp = np.array(bytearray(img_resp.read()), dtype=np.uint8)
            frame = cv2.imdecode(imgnp, -1)
            frame = detect_objects(frame)
            cv2.imshow("Object Detection", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        except Exception as e:
            print(f"Error occurred: {e}")
            break
    cv2.destroyAllWindows()
```
It opens a window, and pressing “q” while the window has focus ends the application.
Execution Entry Point
Finally, we check whether the script is being run directly and call the `main()` function to start the program.
```python
if __name__ == "__main__":
    main()
```
In the next section we put everything together and run our object detection system.
Running the Object Detector
First, fire up your ESP32-CAM module with the web server code and make sure that the ESP32-CAM captures images and shows them in a web browser under the URL printed to the Serial Monitor. Also make sure that this URL is used in detect.py; in my case this URL is:
url = "http://192.168.1.146/cam.jpg"
Next we start the YOLO object detector. Go to your project folder (“esp32-cam-object-detection”), activate the virtual environment, and run the detector code `detect.py`:
```shell
cd esp32-cam-object-detection
venv\Scripts\activate.bat
python detect.py
```
Note that you can deactivate the virtual environment by calling:
venv\Scripts\deactivate.bat
If the code is running you should see names of the detected objects with the confidence score printed to the console:
```
Detected: cup with confidence 0.76
Detected: laptop with confidence 0.39
Detected: cup with confidence 0.51
Detected: laptop with confidence 0.33
Detected: cup with confidence 0.44
Detected: cup with confidence 0.65
Detected: cup with confidence 0.63
```
Also, a window named “Object Detection” will open, showing the current picture the camera sees with bounding boxes around the objects the system could detect. Below is an example where the system correctly detects a cup, a remote, and a laptop:
If you want to see more examples of the detection capabilities of the YOLO model go to the following YOLO Demo Video.
Conclusions
In this tutorial you learned how to build an object detection system. The ESP32-CAM module was used to capture images and to run a web server for those images. The images were then sent via Wi-Fi to a PC running object detection software based on the YOLO deep-learning model.
Compiling code for and uploading code to the ESP32-CAM can be quite tricky. If you run into issues have a look at the Programming the ESP32-CAM tutorial, which provides more detailed instructions.
Note that our little object detection system is limited to 80 predefined objects (or classes). However, you can train the YOLO model with your own objects. The How to Train YOLOv3 to Detect Custom Objects? tutorial might help if you want to do this.
If you have any further questions, feel free to leave them in the comment section.
Have fun ; )
Alice is a Mechatronics and Control Engineer and a part-time hobbyist. She has extensive research experience in the field of AI-based Embedded systems and believes that only research & technology can make this world a better place.