Object Detection with ESP32-CAM and YOLO

In this tutorial you will learn how to detect and classify objects using the ESP32-CAM module and YOLO, a deep-learning system for object detection.

I’ll guide you through building a project where the ESP32-CAM captures images, operates as a web server, and sends the images to a computer for analysis. The computer will use YOLO to detect and classify objects. You’ll learn to assemble the hardware, configure the camera, and serve JPEG images via HTTP.

By the end, you’ll have a working web interface to view snapshots captured by your ESP32-CAM and an object detection system that recognizes 80 different types of objects.

Required Parts

Below you will find the components required to build the project. Instead of the FTDI Programmer you could also use a Programming Shield for the ESP32-CAM, but I recommend the former.

ESP32-CAM

FTDI USB-TTL Adapter

USB Data Cable

System Architecture

The system we are going to build is composed of two main components: (1) an ESP32-CAM module that captures images and operates as a web server, sending the images via WiFi, and (2) a PC running the YOLO object detection system, which analyzes the images and annotates the detected objects. The diagram below gives you an overview of the system architecture:

System Architecture Object Detection System

The following picture shows an example detection from the system. On my cluttered desk, YOLO detected a cup with 88% confidence, a pair of scissors with 68%, and a laptop with 59%. You can see the bounding boxes around the objects, with the object names and detection confidences annotated in their upper left corners:

Objects detected with YOLO

In the next sections you will learn how to program the web server on the ESP32-CAM and how to set up the YOLO detection system on a PC.

The ESP32-CAM Development Board

The ESP32-CAM Development Board is a compact module that combines an ESP32-S chip, a camera, a built-in flash, and a microSD card slot. The board has integrated Wi-Fi and Bluetooth and supports an OV2640 or OV7670 camera with up to 2 megapixels resolution.

Front and Back of ESP32-CAM

In this tutorial we refer to the original AI-Thinker model of the ESP32-CAM board, but there are many clones with exactly the same specifications. They are programmed and used in the same way – including the one we listed under Required Parts.

Connecting the FTDI programmer

You can program the ESP32-CAM via a Programming Shield or via an FTDI Programmer. The latter is easier to use and more flexible. It converts USB signals to serial signals and allows you to program microcontrollers such as the Arduino and the ESP32 via the UART interface. The following picture shows you how to connect the FTDI Programmer to the ESP32-CAM module.

Wiring of FTDI Programmer with ESP32-CAM

The connections are simple. Start by connecting GND of the Programmer to GND of the ESP32-CAM module (blue wire). Then do the same with the 5V power supply (red wire). Note that some FTDI Programmers have jumpers or switches to change between 3.3V and 5V. Watch out for that and use 5V if possible.

Next we connect the U0T (U0TXD) pin of the ESP32-CAM to the RXD pin of the Programmer (yellow wire). Similarly, U0R gets connected to TXD (green wire). With that the serial communication is established.

To switch the ESP32-CAM into programming mode, the IO0 pin needs to be connected to ground (GND). To run the program, however, the IO0 pin must be left unconnected. I therefore added a switch between IO0 and GND (purple wire) that lets me switch between programming and running mode. The photo below shows my wiring of the ESP32-CAM with the switch and the FTDI Programmer:

Switch to enable programming mode of ESP32-CAM

Installing the ESP32 Core

If this is your first project with any board of the ESP32 series, you need to do the board installation first. If ESP32 boards are already installed in your Arduino IDE, you can skip this installation section.

Start by selecting “Preferences…” from the “File” menu. This opens the Preferences dialog shown below.

Under the Settings tab you will find an edit box at the bottom of the dialog that is labelled “Additional boards manager URLs”:

Copy the following URL into this input field: https://espressif.github.io/arduino-esp32/package_esp32_dev_index.json

This lets the Arduino IDE know where to find the ESP32 core libraries. Next, we will actually install the ESP32 core libraries using the Boards Manager.

Open the Boards Manager via “Tools -> Board -> Boards Manager”. The Boards Manager appears in the left sidebar. Enter “ESP32” in the search field at the top and you should see two types of ESP32 boards: the “Arduino ESP32 Boards” and the “esp32 by Espressif” boards. We want the esp32 libraries by Espressif. Click on the INSTALL button and wait until the download and installation are complete.

Install ESP32 Core libraries

Selecting ESP32-CAM Board

Click on the drop-down menu and then on “Select other board and port…”:

Drop-down Menu for Board Selection

This will open a dialog where you enter “ESP32-CAM” in the search bar. You will see the “AI Thinker ESP32-CAM” board under Boards. Click on it, select the COM port to activate it, and then click OK:

Board Selection Dialog with AI Thinker ESP32-CAM

If you cannot select a port even though the ESP32-CAM is plugged into a USB port via the FTDI programmer, then the CP210x driver is missing. Go to the Silicon Labs Software Downloads page and download the CP210x driver for your operating system; for Windows it is “CP210x VCP Windows”:

Download CP210x Driver

This will download a ZIP file. Unpack it and run the installer. After that, your ESP32-CAM should appear as connected to a USB port. If you still have issues, you may have to install an FTDI driver as well.

Installing the ESP32-CAM library

For our Web Server we are going to use the esp32cam library. Go to the GitHub repo, click on the green CODE button, and then “Download ZIP” to download the library:

Download esp32cam library

Then click on “Sketch -> Include Library -> Add .ZIP Library…”:

and select the path to the ZIP file you just downloaded to install the library. In the next section we write and explain the code for running a Web Server on the ESP32-CAM.

Code for ESP32-CAM Web Server

The following code sets up an ESP32-CAM module to send images over a Wi-Fi network. It captures images and serves them as JPEG files to clients that request them. The server runs on port 80, which is the default HTTP port.

#include "WebServer.h"
#include "WiFi.h"
#include "esp32cam.h"

const char* WIFI_SSID = "SSID";
const char* WIFI_PASS = "PASSWORD";
const char* URL = "/cam.jpg";

static auto RES = esp32cam::Resolution::find(800, 600);

WebServer server(80);

void serveJpg() {
  auto frame = esp32cam::capture();
  if (frame == nullptr) {
    Serial.println("CAPTURE FAILED!");
    server.send(503, "", "");
    return;
  }
  Serial.printf("CAPTURE OK %dx%d %db\n",
                frame->getWidth(), frame->getHeight(),
                static_cast<int>(frame->size()));

  server.setContentLength(frame->size());
  server.send(200, "image/jpeg");

  WiFiClient client = server.client();
  frame->writeTo(client);
}

void handleJpg() {
  if (!esp32cam::Camera.changeResolution(RES)) {
    Serial.println("CAN'T SET RESOLUTION!");
  }
  serveJpg();
}

void initCamera() {
  {
    using namespace esp32cam;
    Config cfg;
    cfg.setPins(pins::AiThinker);
    cfg.setResolution(RES);
    cfg.setBufferCount(2);
    cfg.setJpeg(80);

    bool ok = Camera.begin(cfg);
    Serial.println(ok ? "CAMERA OK" : "CAMERA FAIL");
  }
}

void initWifi() {
  WiFi.persistent(false);
  WiFi.mode(WIFI_STA);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  while (WiFi.status() != WL_CONNECTED)
    ;
  Serial.printf("http://%s%s\n",
                WiFi.localIP().toString().c_str(), URL);
}

void initServer() {
  server.on(URL, handleJpg);
  server.begin();
}

void setup() {
  Serial.begin(115200);
  initWifi();
  initCamera();
  initServer();
}

void loop() {
  server.handleClient();
}

Let’s break down the code into its components to understand how it works.

Libraries and Constants

At the beginning of the code, we include the necessary libraries for the web server, Wi-Fi functionality, and camera control.

#include "WebServer.h"
#include "WiFi.h"
#include "esp32cam.h"

We also define constants for the Wi-Fi credentials and the URL endpoint for accessing the camera image.

const char* WIFI_SSID = "SSID";
const char* WIFI_PASS = "PASSWORD";
const char* URL = "/cam.jpg";

Obviously, you will have to replace the credentials with the SSID and password for your Wi-Fi network.

Camera Resolution

We set the desired resolution for the camera. In this case, we are looking for a resolution of 800×600 pixels.

static auto RES = esp32cam::Resolution::find(800, 600);

Web Server Initialization

Next we create an instance of the web server that listens on port 80.

WebServer server(80);

Serve JPEG Function

The serveJpg() function captures an image from the camera and sends it to the client as a JPEG file. If the capture fails, it sends a “503 Service Unavailable” response.

void serveJpg() {
  auto frame = esp32cam::capture();
  if (frame == nullptr) {
    Serial.println("CAPTURE FAILED!");
    server.send(503, "", "");
    return;
  }
  Serial.printf("CAPTURE OK %dx%d %db\n",
                frame->getWidth(), frame->getHeight(),
                static_cast<int>(frame->size()));

  server.setContentLength(frame->size());
  server.send(200, "image/jpeg");

  WiFiClient client = server.client();
  frame->writeTo(client);
}

Here, we first attempt to capture a frame. If successful, we log the dimensions and size of the image, set the Content-Length header, and stream the JPEG data directly to the client.

Handle JPEG Function

The handleJpg() function changes the camera resolution and calls serveJpg() to serve the image.

void handleJpg() {
  if (!esp32cam::Camera.changeResolution(RES)) {
    Serial.println("CAN'T SET RESOLUTION!");
  }
  serveJpg();
}

This function ensures that the camera is set to the desired resolution before serving the image.

Camera Initialization

The initCamera() function configures the camera settings, including pin assignments, resolution, buffer count, and JPEG quality.

void initCamera() {
  {
    using namespace esp32cam;
    Config cfg;
    cfg.setPins(pins::AiThinker);
    cfg.setResolution(RES);
    cfg.setBufferCount(2);
    cfg.setJpeg(80);

    bool ok = Camera.begin(cfg);
    Serial.println(ok ? "CAMERA OK" : "CAMERA FAIL");
  }
}

We create a configuration object, set the AI-Thinker pin mapping, the resolution, a buffer count of two frames, and a JPEG quality of 80 (on a scale of 0–100), and then initialize the camera. A message is printed to the serial monitor indicating whether the camera initialization was successful.

Wi-Fi Initialization

The initWifi() function connects the ESP32 to the specified Wi-Fi network.

void initWifi() {
  WiFi.persistent(false);
  WiFi.mode(WIFI_STA);
  WiFi.begin(WIFI_SSID, WIFI_PASS);
  while (WiFi.status() != WL_CONNECTED)
    ;
  Serial.printf("http://%s%s\n",
                WiFi.localIP().toString().c_str(), URL);
}

We disable storing the Wi-Fi credentials in flash, set the mode to station, and wait in a busy loop until the connection is established. Once connected, we print the URL to access the camera image.

Server Initialization

The initServer() function sets up the server to handle requests for the camera image.

void initServer() {
  server.on(URL, handleJpg);
  server.begin();
}

We define the URL endpoint and associate it with the handleJpg() function, then start the server.

Setup Function

The setup() function initializes the serial communication, Wi-Fi, camera, and server.

void setup() {
  Serial.begin(115200);
  initWifi();
  initCamera();
  initServer();
}

Loop Function

Finally, the loop() function continuously handles incoming client requests.

void loop() {
  server.handleClient();
}

This function ensures that the server is responsive to client requests, allowing clients to retrieve images captured by the camera.

Test the ESP32-CAM Web Server

Now, let us test the Web Server. Compile and upload the code above. To upload code to the ESP32-CAM, switch the board into programming mode by flipping the switch, briefly press the Reset button on the board, and then click the Upload button in the Arduino IDE.

If you need more help with uploading code to the ESP32-CAM, have a look at the Programming the ESP32-CAM tutorial, which provides more details.

After a successful upload you will see the URL for the camera pictures printed to the Serial Monitor and you should also see the text “CAMERA OK”:

http://192.168.1.146/cam.jpg
CAMERA OK

Copy this URL into the address bar of your web browser and you should see the picture the camera has taken:

Address Bar with URL

Every time you press the reload button in your web browser, the Web Server receives the request, has the ESP32-CAM take a new picture, and sends this new picture to your browser. Below is a picture of my desk, taken in this way:

Picture taken by ESP32-CAM and served via Web Server in Browser
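
You do not need a browser to test the endpoint, by the way. The following minimal Python sketch fetches one snapshot and saves it to disk (just replace the IP address with the one printed to your Serial Monitor):

import urllib.request

# URL of the ESP32-CAM Web Server; the IP address will differ on your network.
url = "http://192.168.1.146/cam.jpg"

# Fetch one snapshot and write it to a JPEG file.
with urllib.request.urlopen(url) as response:
    data = response.read()

with open("snapshot.jpg", "wb") as f:
    f.write(data)

print(f"Saved {len(data)} bytes to snapshot.jpg")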

In the next section we send the pictures to the YOLO Object Detection Model to recognize the objects in the scene.

YOLO Object Detection

YOLO (You Only Look Once) is a deep-learning model for object detection known for its speed and accuracy. It was first introduced by Joseph Redmon et al. in 2016. Since then there have been many improved versions, with YOLO11 by Ultralytics being the latest one (as of February 2025).

However, we are going to use an older model, YOLOv3, since it is smaller and easier to use, though its accuracy is not as high as that of the more recent models.

Architecture of YOLO Model (source)

The original YOLO model is a deep convolutional network that takes an RGB image with dimensions 448x448x3 as input and outputs the bounding boxes and confidence scores for the detected objects in a 7×7×30 tensor. YOLOv3, which we use here, works the same way in principle but takes 416×416 inputs and predicts boxes at three different scales. We are going to use a version of the model that is trained to detect 80 different objects (the snippet after the list shows how to print them all), such as:

  • person
  • bicycle
  • car
  • motorbike
  • scissors
  • teddy bear
  • hair drier
  • toothbrush
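
If you are curious about the full set of classes, a few lines of Python will print them, assuming you have already downloaded coco.names into the YOLO folder (see the Download YOLO files section below):

# Print the classes YOLOv3 was trained on.
# Assumes coco.names has been downloaded into the YOLO folder.
with open("./YOLO/coco.names") as f:
    classes = [line.strip() for line in f if line.strip()]

print(len(classes))  # should print 80
print(classes[:5])   # e.g. ['person', 'bicycle', 'car', 'motorbike', 'aeroplane']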

We don’t go into the details of the model here, but if you want to learn more, here are the links to the original YOLO publication, a description of the improvements in Version 3 of YOLO, and an application paper with useful information.

Project Folder Structure

For running the YOLO object detection system on a PC we need to create a project folder, let’s say “esp32-cam-object-detection”. Within this folder create a subfolder named “YOLO” and a Python file named “detect.py”. Your folder structure should look as follows:

Project Folder Structure
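
In plain text, the layout is simply:

esp32-cam-object-detection/
├── YOLO/
└── detect.py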

Download YOLO files

Next you have to download the required YOLO files (weights, architecture config, class names) and place them in the “YOLO” folder of the project. They are available from the official Darknet sources:

  • yolov3.weights: https://pjreddie.com/media/files/yolov3.weights
  • yolov3.cfg: https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg
  • coco.names: https://github.com/pjreddie/darknet/blob/master/data/coco.names

The contents of your “YOLO” folder should then look like this:

YOLO Folder

Creating Virtual Environment

We also have to install some Python libraries, and we will do so in a virtual environment using venv. Open a command shell and execute the following commands:

cd esp32-cam-object-detection
python -m venv venv
venv\Scripts\activate.bat    
pip install opencv-python opencv-python-headless numpy torch torchvision

The cd command moves us into the project folder, and the venv command creates the virtual environment, within which we install the required libraries via pip install. (On macOS and Linux, activate the environment with source venv/bin/activate instead of the activate.bat script.) This creates a “venv” folder in the project folder that contains the libraries:

venv folder within Project Folder

Object Detection Code

Finally, copy the following code into the detect.py file in your project folder.

import cv2
import numpy as np
import urllib.request

# Camera URL
url = "http://192.168.1.146/cam.jpg"

# YOLO model files
weights_path = r"./YOLO/yolov3.weights"
config_path = r"./YOLO/yolov3.cfg"
names_path = r"./YOLO/coco.names"

# Load the YOLO model and COCO class names
net = cv2.dnn.readNet(weights_path, config_path)
with open(names_path, "r") as f:
    classes = [line.strip() for line in f.readlines()]

layer_names = net.getLayerNames()

# getUnconnectedOutLayers() returns a flat array of indices in newer
# OpenCV versions but nested sequences in older ones; handle both cases.
out_layers = net.getUnconnectedOutLayers()
if isinstance(out_layers[0], (list, np.ndarray)):
    output_layers = [layer_names[i[0] - 1] for i in out_layers]
else:
    output_layers = [layer_names[i - 1] for i in out_layers]

# Generate random colors for each class
colors = np.random.uniform(0, 255, size=(len(classes), 3))


def detect_objects(frame):
    height, width, _ = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    layer_outputs = net.forward(output_layers)

    boxes = []
    confidences = []
    class_ids = []

    for output in layer_outputs:
        for detection in output:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.3:
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)

                x = int(center_x - w / 2)
                y = int(center_y - h / 2)

                boxes.append([x, y, w, h])
                confidences.append(float(confidence))
                class_ids.append(class_id)

    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.3, 0.4)

    # Draw detections on the frame
    if len(indexes) > 0 and isinstance(indexes, np.ndarray):
        indexes = indexes.flatten()
        for i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            confidence = confidences[i]
            color = colors[class_ids[i]]
            print(f"Detected: {label} with confidence {confidence:.2f}")

            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(
                frame,
                f"{label} {confidence:.2f}",
                (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                color,
                2,
            )

    return frame


def main():
    cv2.namedWindow("Object Detection", cv2.WINDOW_AUTOSIZE)

    while True:
        try:
            img_resp = urllib.request.urlopen(url)
            imgnp = np.array(bytearray(img_resp.read()), dtype=np.uint8)
            frame = cv2.imdecode(imgnp, -1)
            frame = detect_objects(frame)

            cv2.imshow("Object Detection", frame)

            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        except Exception as e:
            print(f"Error occurred: {e}")
            break

    cv2.destroyAllWindows()


if __name__ == "__main__":
    main()

The code above implements an object detection system using the YOLO model with OpenCV. It captures images from a camera feed, detects objects in real time, and displays the results on the screen.

Importing Libraries

We start by importing the necessary libraries: cv2 for computer vision tasks, numpy for numerical operations, and urllib.request for handling URL requests.

import cv2
import numpy as np
import urllib.request

Camera URL

Here, we define the URL of the camera feed from which we will capture images. You will have to replace this constant with the URL your Web Server prints to the Serial Monitor:

url = "http://192.168.1.146/cam.jpg"

YOLO Model Files

Next, we specify the paths to the YOLO model files: the weights file, the configuration file, and the names of the objects that the model can detect.

weights_path = r"./YOLO/yolov3.weights"
config_path = r"./YOLO/yolov3.cfg"
names_path = r"./YOLO/coco.names"

Loading the YOLO Model

We load the YOLO model using OpenCV’s dnn module and read the class names from the specified file. The layer names are also retrieved for later use.

net = cv2.dnn.readNet(weights_path, config_path)
with open(names_path, "r") as f:
    classes = [line.strip() for line in f.readlines()]

layer_names = net.getLayerNames()

Output Layers

We determine the output layers of the network, i.e. the layers that provide the final detections. Depending on the OpenCV version, getUnconnectedOutLayers() returns either a flat array of indices or nested sequences, so the code handles both cases.

out_layers = net.getUnconnectedOutLayers()
if isinstance(out_layers[0], (list, np.ndarray)):
    output_layers = [layer_names[i[0] - 1] for i in out_layers]
else:
    output_layers = [layer_names[i - 1] for i in out_layers]

Generating Colors for Classes

To visualize and distinguish the detected objects, we generate random colors for the bounding boxes.

colors = np.random.uniform(0, 255, size=(len(classes), 3))

Object Detection Function

The detect_objects() function takes an image frame as input, processes it, and detects objects using the YOLO model. It returns the frame with bounding boxes and labels drawn on it.

def detect_objects(frame):
    height, width, _ = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)

    layer_outputs = net.forward(output_layers)

    boxes = []
    confidences = []
    class_ids = []

In this function, we first create a blob from the input frame: the image is scaled by 1/255, resized to 416×416 pixels, and converted from BGR to RGB (swapRB=True), which is the input format the network expects. We then perform a forward pass to get the output from the model.

Processing Detections

We loop through the outputs to extract bounding boxes, confidence scores, and class IDs for detected objects. Only detections with a confidence greater than 0.3 are considered valid. Feel free to change this threshold (0…1) to show more or fewer detections.

for output in layer_outputs:
    for detection in output:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.3:
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)

            x = int(center_x - w / 2)
            y = int(center_y - h / 2)

            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)
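
As a quick sanity check of the coordinate math: in an 800×600 frame, a detection of (0.5, 0.5, 0.2, 0.3) yields a center of (400, 300) and a box of 160×180 pixels, so the top-left corner ends up at x = 400 − 80 = 320 and y = 300 − 90 = 210.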

Non-Maximum Suppression

To eliminate redundant overlapping boxes, we apply Non-Maximum Suppression (NMS) and keep only the best bounding boxes. The third argument (0.3) is the score threshold and the fourth (0.4) is the IoU threshold that controls how much overlap is tolerated before a box is suppressed.

indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.3, 0.4)

Drawing Detections

We draw the bounding boxes and labels on the frame for each detected object. The detected class/object name and confidence score are displayed.

if len(indexes) > 0 and isinstance(indexes, np.ndarray):
    indexes = indexes.flatten()
    for i in indexes:
        x, y, w, h = boxes[i]
        label = str(classes[class_ids[i]])
        confidence = confidences[i]
        color = colors[class_ids[i]]
        print(f"Detected: {label} with confidence {confidence:.2f}")

        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.putText(
            frame,
            f"{label} {confidence:.2f}",
            (x, y - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.5,
            color,
            2,
        )

Main Function

The main() function sets up a window for displaying the detections and continuously captures frames from the camera feed. It processes each frame through the detect_objects() function and displays the result.

def main():
    cv2.namedWindow("Object Detection", cv2.WINDOW_AUTOSIZE)

    while True:
        try:
            img_resp = urllib.request.urlopen(url)
            imgnp = np.array(bytearray(img_resp.read()), dtype=np.uint8)
            frame = cv2.imdecode(imgnp, -1)
            frame = detect_objects(frame)

            cv2.imshow("Object Detection", frame)

            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        except Exception as e:
            print(f"Error occurred: {e}")
            break

    cv2.destroyAllWindows()

This opens a window; pressing “q” while the window has focus ends the application.
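
Note that the loop stops at the first error. If your camera occasionally fails to deliver a frame, you might prefer a more forgiving variant of main() that skips bad frames instead of quitting. Here is a sketch, assuming the module-level url, detect_objects(), and imports stay as above:

def main():
    cv2.namedWindow("Object Detection", cv2.WINDOW_AUTOSIZE)

    while True:
        try:
            # Time out after 5 seconds so a dropped connection cannot hang the loop.
            img_resp = urllib.request.urlopen(url, timeout=5)
            imgnp = np.array(bytearray(img_resp.read()), dtype=np.uint8)
            frame = cv2.imdecode(imgnp, -1)
            if frame is None:
                print("Could not decode frame, skipping")
                continue
            cv2.imshow("Object Detection", detect_objects(frame))
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
        except Exception as e:
            # Skip this frame and try again instead of ending the application.
            print(f"Error occurred: {e}")

    cv2.destroyAllWindows()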

Execution Entry Point

Finally, we check if the script is being run directly and call the main() function to start the program.

if __name__ == "__main__":
    main()

In the next section we put everything together and run our object detection system.

Running the Object Detector

First, fire up your ESP32-CAM module with the Web Server code and make sure that the ESP32-CAM captures images and shows them in a web browser under the URL printed to the Serial Monitor. Also make sure that this URL is used in detect.py; in my case this URL is:

url = "http://192.168.1.146/cam.jpg"

Next we start the YOLO object detector. Go to your project folder (“esp32-cam-object-detection”), activate the virtual environment, and run the detector code detect.py:

cd esp32-cam-object-detection
venv\Scripts\activate.bat  
python detect.py

Note that you can deactivate the virtual environment by calling:

venv\Scripts\deactivate.bat

While the code is running, you should see the names of the detected objects with their confidence scores printed to the console:

Detected: cup with confidence 0.76
Detected: laptop with confidence 0.39
Detected: cup with confidence 0.51
Detected: laptop with confidence 0.33
Detected: cup with confidence 0.44
Detected: cup with confidence 0.65
Detected: cup with confidence 0.63     

A window named “Object Detection” will also open, showing the current picture the camera sees, with bounding boxes around the objects the system could detect. Below is an example where the system correctly detects a cup, a remote, and a laptop:

Window of Object Detection Application

If you want to see more examples of the detection capabilities of the YOLO model, have a look at the following YOLO Demo Video.

Conclusions

In this tutorial you learned how to build an object detection system. The ESP32-CAM module was used to capture images and to run a Web Server for those images. The images were then sent via Wi-Fi to a PC that runs an object detection software based on the YOLO deep-learning model.

Compiling code for and uploading code to the ESP32-CAM can be quite tricky. If you run into issues have a look at the Programming the ESP32-CAM tutorial, which provides more detailed instructions.

Note that our little object detection system is limited to 80 predefined objects (or classes). However, you can train the YOLO model with your own objects. The How to Train YOLOv3 to Detect Custom Objects? tutorial might help if you want to do this.

If you have any further questions, feel free to leave them in the comment section.

Have fun ; )