Vision Chatbot with DFRobot ESP32-S3 AI Camera and OpenAI

The DFRobot ESP32-S3 AI Camera (DFR1154) is a development board built for AI and vision projects. It features an ESP32-S3 microcontroller and integrates a camera module, microphone, and speaker on a single board.

In this project, we will create a Vision Chatbot using the ESP32-S3 AI Camera board and OpenAI. The board captures images and listens to spoken questions through the microphone. It then converts the speech to text using OpenAI. Next, it sends both the image and the text question to another OpenAI model for analysis. This model processes the input, generates an answer about the image, and returns the response as text. Finally, we use OpenAI a third time to convert the answer text into speech and play it through the speaker.

The short video clip below shows the Vision Chatbot in action. I am holding the board in my left hand and its camera points at a small model skull on my desk. I press the BOOT button on the board to start the Vision Chatbot, and you will hear my question and the bot's answer.

Adjust the volume if you can’t hear the audio and note there is a few seconds delay between the question and the answer due to the data transfer and processing at OpenAI:

Vision Chatbot in action

Audio recording, audio playback, and image capture are handled locally by the ESP32-S3 AI Camera board, but image analysis, Text-to-Speech (TTS), and Speech-to-Text (STT) utilize OpenAI's AI models. A WiFi connection and an OpenAI API key are therefore required.

Required Parts

For this project I use the ESP32-S3 AI Camera Module (DFR1154) by DFRobot. You can get it from DFRobot using the links below. Make sure that you get Version 1.1 and not the older Version 1.0. You may also need a USB-C cable.

DFRobot ESP32-S3 AI Camera

USB C Cable

Makerguides is a participant in affiliate advertising programs designed to provide a means for sites to earn advertising fees by linking to Amazon, AliExpress, Elecrow, and other sites. As an Affiliate we may earn from qualifying purchases.

Hardware of the DFRobot AI Camera

The DFRobot ESP32-S3 AI Camera (DFR1154) is a compact embedded vision and AI development board built around Espressif’s ESP32-S3 microcontroller. It integrates wireless connectivity, camera imaging, audio I/O, and AI acceleration capabilities into a single module. This board is designed for edge computing tasks such as object recognition, voice interaction, and real-time visual analysis.

Its physical footprint is a square printed circuit board roughly 42 mm on each side, which makes it suitable for integration into robotics, smart sensors, and IoT monitoring systems. The picture below shows the front and back of the board:

Front and Back of DFRobot ESP32-S3 AI Camera (source)

Microcontroller and Memory Architecture

At the core of the board, the ESP32-S3 microcontroller executes application code and handles communications. This microcontroller features a Tensilica Xtensa® dual-core 32-bit LX7 CPU running at up to 240 MHz. It includes embedded SRAM for fast access to data and instructions as well as dedicated RTC SRAM for low-power clock operations.

On-die memory is complemented by external storage: the board provides 16 MB of flash memory for firmware and data, and an 8 MB PSRAM chip to support larger run-time heaps required for image buffering or AI model execution. The USB interface conforms to USB 2.0 OTG full-speed, enabling both power delivery and data transfer.

Camera Subsystem

The imaging subsystem centers on an OmniVision OV3660 camera sensor. This sensor captures up to 2 megapixel images and includes sensitivity to both visible light and 940 nm infrared, which expands the operational range into low-light or night-vision conditions.

The optics deliver an approximately 160-degree field of view, and the fixed focal system has a focal length of about 0.95 mm with an aperture set near f/2.0. There are also four IR LEDs for illumination.

Wireless and Connectivity

The board supports dual-mode wireless communication. For local network connectivity, it implements IEEE 802.11b/g/n Wi-Fi in the 2.4 GHz band, with support for both 20 MHz and 40 MHz channels and multiple operational modes including station, soft access point, and combined station+AP modes. Bluetooth is available per Bluetooth 5 and Bluetooth Mesh protocols, enabling low-energy peer communication or sensor network participation.

Audio and Sensor Interfaces

In addition to imagery, the board has an onboard PDM microphone (read via I2S) for audio capture, and a MAX98357 I2S amplifier that drives a dedicated speaker interface. There is also an SD card slot that enables the storage of audio and video data.

The LTR-308 ambient light sensor allows adaptive imaging or power scaling based on environmental illumination. This is especially useful in combination with the four IR LEDs that are present for illumination.

The board offers a Gravity 4-pin interface (3.3V, GND, GPIO44/RX, GPIO43/TX) that provides simple UART/I2C connectivity to external peripherals or sensors. Note that in the earlier version V1.0 of the board, pin 1 of the Gravity interface was a 3.3–5 V input. In the current version V1.1, pin 1 is a 3.3 V output!

GPIO Pins

The table below lists the GPIO pins and their assignment to the different hardware components such as camera (CAM), microphone (MIC), audio amplifier (MAX98357), ambient light sensor (ALS), SD card (SD), IR LEDs, BOOT button and status LED:

Pinout of DFRobot ESP32-S3 AI Camera (source)

Power Management and Physical Design

The DFR1154 accepts multiple supply configurations. It operates nominally at 3.3 V, and it can receive input power via a USB-C connector at 5 V DC or via a VIN connector with a wider range of 3.7 V to 15 V DC. A dedicated power management IC (HM6245) regulates these inputs down to required core voltages.

The overall temperature range is specified from roughly -10 °C up to 60 °C, which is intended for standard indoor or protected outdoor environments.

Technical Specifications

The following table summarizes the key hardware specifications of the DFRobot ESP32-S3 AI Camera (DFR1154) module (source):

Microcontroller: ESP32-S3R8 with dual-core Tensilica Xtensa LX7 CPU, 240 MHz
SRAM: 512 KB
ROM: 384 KB
External Flash: 16 MB
External PSRAM: 8 MB
RTC SRAM: 16 KB
Camera Sensor: OV3660, 2 MP, 160° FoV, infrared support
Optics: 0.95 mm fixed focal, f/2.0 aperture, <8 % distortion
Wireless: Wi-Fi 802.11b/g/n (2.4 GHz); Bluetooth 5 & Mesh
Audio: I2S PDM microphone; speaker interface via MAX98357 amplifier
Sensors: LTR-308 ambient light sensor
USB: USB 2.0 OTG Full Speed (Type-C)
Power: Operating voltage 3.3 V; USB-C 5 V; VIN 3.7–15 V
Buttons: Reset and Boot
Dimensions: 42 mm × 42 mm
Operating Temp: -10 °C to 60 °C

Get OpenAI API Key

The Vision Chatbot uses AI models provided by OpenAI, so you will need an OpenAI account. Go to https://platform.openai.com and sign up with an email address or an existing Google or Microsoft account.

After verifying your email and completing the initial setup, log in to the OpenAI dashboard at platform.openai.com/api-keys and find or create your API key (= SECRET KEY) as shown below:

OpenAI API keys

The API key is a unique, long string starting with "sk-proj-" that is needed to authenticate your API requests (see below). Later you will copy this entire string into the code for the Vision Chatbot.

sk-proj-xcA.......................OtDu0U

That is all you really need, but I recommend setting a usage limit for your account as well. This ensures that you don't accidentally end up with an expensive bill due to a bug in your code (e.g. sending hundreds of images).

You can set Usage Limits and also find out the Pricing (Cost) for the different AI models under the Billing tab (platform.openai.com/settings/organization/billing).

Billing Overview

I myself have set a usage limit of 20 USD and have not enabled Auto Recharge. That small budget lasts a long time, as long as the cheap AI models are used. As you can see, my balance is still 14 USD even though I have been playing with the OpenAI models for several months.

Install ESP32 Core

If this is your first project with a board of the ESP32 series, you will also need to install the ESP32 core. If ESP32 boards are already installed in your Arduino IDE, you can skip this section.

Start by opening the Preferences dialog by selecting “Preferences…” from the “File” menu. This will open the Preferences dialog shown below.

Under the Settings tab you will find an edit box at the bottom of the dialog that is labelled "Additional boards manager URLs":

Additional boards manager URLs in Preferences

Copy the following URL into this input field:

https://espressif.github.io/arduino-esp32/package_esp32_dev_index.json

This lets the Arduino IDE know where to find the ESP32 core libraries. Next we will install the ESP32 boards using the Boards Manager.

Open the Boards Manager via "Tools -> Board -> Boards Manager". The Boards Manager appears in the left sidebar. Enter "ESP32" in the search field at the top and you should see two types of ESP32 boards: the "Arduino ESP32 Boards" and the "esp32 by Espressif" boards. We want the "esp32 by Espressif" boards. Click the INSTALL button and wait until the download and installation are complete.

Install ESP32 Core libraries

I am using the current version 3.3.5 here, but any other 3.x version should work for this project as well.

Selecting Board

You also need to select an ESP32 board. In the case of the DFRobot ESP32-S3 AI Camera, you can pick the generic "ESP32S3 Dev Module". To do so, click on the drop-down menu and then on "Select other board and port…":

Drop-down Menu for Board Selection

This opens a dialog where you can enter "esp32s3 dev" in the search bar. You will see the "ESP32S3 Dev Module" board under Boards. Click on it, select the COM port to activate it, and then click OK:

Board Selection Dialog "ESP32S3 Dev Module" board

Note that you need to connect the board via the USB cable to your computer, before you can select a COM port.

Tool Settings

Below are the settings you need to use with the board. You can find them under the Tools menu in the Arduino IDE.

Tool settings for DFRobot ESP32-S3 AI Camera

The most important settings are "16MB Flash Size", "Huge APP partition", and "OPI PSRAM". To see text output on the Serial Monitor, make sure that "USB CDC On Boot" is "Enabled". The other settings are typically the defaults and are fine as they are.

Install Libraries

The code for the Vision Chatbot uses two libraries: ArduinoJson and ESP8266Audio. Open the LIBRARY MANAGER, search for "ArduinoJson" and "ESP8266Audio", and press the INSTALL button to install both libraries:

Installing "ArduinoJson" and "ESP8266Audio" libraries

As you can see, I installed Version 2.4.1 of the ESP8266Audio library and Version 7.4.2 of the ArduinoJson library. However, the exact versions should not matter much. Also note that despite its name, the ESP8266Audio library works fine with an ESP32 as well.

Code for Vision Chatbot

In this section I show you the code for the Vision Chatbot. It is a reimplementation of the OpenAI image recognition example from the DFRobot GitHub repo, which unfortunately doesn't work reliably. I therefore rewrote it entirely.

The code starts by waiting for the BOOT button to be pressed and records audio until the BOOT button is released or more than 3 seconds have passed. The BOOT button is in the lower right corner on the back of the board, above the reset (RST) button:

BOOT Button

Next the code sends the recorded audio to OpenAI's speech-to-text model for transcription, captures an image with the camera, and then queries OpenAI's vision-capable GPT model to answer the question about the image. The answer is converted to audio by OpenAI's text-to-speech model and played through the little speaker of the ESP32-S3 AI Camera module.

Have a quick look at the code first before we discuss the details. If you don’t want to copy and paste the code you can download the vision_chatbot.zip file that contains the complete code. You still will have to set WiFi credentials and the OpenAI API Key in the code, however.

#include "WiFi.h"
#include "WiFiClientSecure.h"
#include "HTTPClient.h"
#include "ArduinoJson.h"
#include "esp_heap_caps.h"
#include "ESP_I2S.h"
#include "base64.h"
#include "camera.h"
#include "wav_header.h"
#include "Audio.h"

/* ===================== Pins ===================== */
#define BUTTON_PIN 0
#define LED_PIN 3

#define MIC_DATA_PIN 39
#define MIC_CLK_PIN 38

#define I2S_DOUT 42
#define I2S_BCLK 45
#define I2S_LRC 46

/* ===================== Audio ===================== */
#define SAMPLE_RATE 16000
#define MAX_RECORD_TIME_MS 3000
#define WAV_HEADER_SZ PCM_WAV_HEADER_SIZE

/* ===================== Models ===================== */
#define TTS_MODEL "tts-1"
#define TTS_VOICE "shimmer"
#define TTS_VOLUME 16
#define STT_MODEL "whisper-1"
#define VISION_MODEL "gpt-4o-mini"

/* ===================== Network ===================== */
const char* ssid = "ssid";
const char* password = "pwd";
const char* apiKey = "api-key";

/* ===================== Globals ===================== */
WiFiClientSecure secureClient;

I2SClass I2S;
Audio audio;

uint8_t* wavBuf = nullptr;
size_t wavSize = 0;
size_t wavMax = 0;

bool recording = false;
bool busy = false;
unsigned long recordStartMs = 0;


void initMic() {
  I2S.setPinsPdmRx(MIC_CLK_PIN, MIC_DATA_PIN);
  I2S.begin(I2S_MODE_PDM_RX, SAMPLE_RATE,
            I2S_DATA_BIT_WIDTH_16BIT,
            I2S_SLOT_MODE_MONO);      
}

void startRecording() {
  digitalWrite(LED_PIN, HIGH);

  wavMax = WAV_HEADER_SZ + SAMPLE_RATE * 2 * MAX_RECORD_TIME_MS / 1000;
  wavBuf = (uint8_t*)heap_caps_malloc(wavMax, MALLOC_CAP_SPIRAM);
  Serial.printf("[REC] WAV buffer allocated in PSRAM (%u bytes)\n", wavMax);

  if (!wavBuf) {
    Serial.println("[ERR] WAV buffer allocation failed");
    return;
  }

  pcm_wav_header_t hdr = PCM_WAV_HEADER_DEFAULT(0, 16, SAMPLE_RATE, 1);
  memcpy(wavBuf, &hdr, WAV_HEADER_SZ);

  wavSize = WAV_HEADER_SZ;
  recordStartMs = millis();
  recording = true;

  Serial.println("[REC] Recording started");
}

void pollRecording() {
  size_t avail = I2S.available();
  if (!avail) return;

  if (wavSize + avail > wavMax) {
    stopRecording();
    return;
  }

  wavSize += I2S.readBytes((char*)(wavBuf + wavSize), avail);
}

void stopRecording() {
  recording = false;

  if (!wavBuf || wavSize <= WAV_HEADER_SZ) {
    Serial.println("[ERR] No audio recorded");
    return;
  }

  pcm_wav_header_t* h = (pcm_wav_header_t*)wavBuf;
  h->descriptor_chunk.chunk_size = wavSize - 8;
  h->data_chunk.subchunk_size = wavSize - WAV_HEADER_SZ;

  Serial.printf("[REC] Recording stopped, %u bytes total\n", wavSize);
  digitalWrite(LED_PIN, LOW);
}

bool isValidWavBuffer() {
  if (!wavBuf || wavSize < WAV_HEADER_SZ) {
    Serial.println("[STT] Invalid WAV buffer");
    return false;
  }
  return true;
}

void releaseWavBuffer() {
  free(wavBuf);
  wavBuf = nullptr;
  wavSize = 0;
}

String buildMultipartHead(const char* boundary) {
  return String("--") + boundary + "\r\n"
       + "Content-Disposition: form-data; name=\"model\"\r\n\r\n"
       + STT_MODEL + "\r\n--"
       + boundary + "\r\n"
       + "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n"
       + "Content-Type: audio/wav\r\n\r\n";
}

String buildMultipartTail(const char* boundary) {
  return "\r\n--" + String(boundary) + "--\r\n";
}

uint8_t* buildMultipartBody(
  const String& head,
  const String& tail,
  size_t& outLen
) {
  outLen = head.length() + wavSize + tail.length();

  uint8_t* body =
    (uint8_t*)heap_caps_malloc(outLen, MALLOC_CAP_SPIRAM);

  if (!body) {
    Serial.println("[STT] Out of memory (multipart)");
    return nullptr;
  }

  memcpy(body, head.c_str(), head.length());
  memcpy(body + head.length(), wavBuf, wavSize);
  memcpy(body + head.length() + wavSize, tail.c_str(), tail.length());

  Serial.printf("[STT] Multipart body in PSRAM (%u bytes)\n", outLen);
  return body;
}

int postMultipart(
  uint8_t* body,
  size_t bodyLen,
  const char* boundary,
  String& response
) {
  HTTPClient http;

  http.begin(secureClient,
             "https://api.openai.com/v1/audio/transcriptions");
  http.addHeader("Authorization", String("Bearer ") + apiKey);
  http.addHeader("Content-Type",
                 String("multipart/form-data; boundary=") + boundary);

  Serial.printf("[STT] Uploading %u bytes\n", bodyLen);

  int code = http.POST(body, bodyLen);
  Serial.printf("[STT] HTTP code: %d\n", code);

  if (code == 200) {
    response = http.getString();
  } else {
    Serial.println("[STT] Error response:");
    Serial.println(http.getString());
  }

  http.end();
  return code;
}

String parseSttResponse(const String& resp) {
  StaticJsonDocument<512> doc;

  if (deserializeJson(doc, resp)) {
    Serial.println("[STT] JSON parse failed");
    return "";
  }

  return doc["text"] | "";
}

String speechToText() {
  Serial.println("[STT] Building multipart body");
  Serial.printf(
    "[HEAP] Internal heap free before TLS: %u\n",
    heap_caps_get_free_size(MALLOC_CAP_INTERNAL));

  if (!isValidWavBuffer()) {
    return "";
  }

  const char* boundary = "----ESP32Boundary";

  String head = buildMultipartHead(boundary);
  String tail = buildMultipartTail(boundary);

  size_t bodyLen = 0;
  uint8_t* body = buildMultipartBody(head, tail, bodyLen);
  if (!body) {
    return "";
  }

  releaseWavBuffer();

  String response;
  int code = postMultipart(body, bodyLen, boundary, response);
  free(body);

  if (code != 200) {
    return "";
  }

  return parseSttResponse(response);
}

void textToSpeech(String text) {
  busy = true;
  audio.openai_speech(apiKey, TTS_MODEL,
                      text.c_str(), "",
                      TTS_VOICE, "mp3", "1");
}

String visionAnswer(String question, camera_fb_t* fb) {
  Serial.println("[VISION] Encoding image...");
  String imageBase64 = base64::encode(fb->buf, fb->len);

  HTTPClient http;
  http.begin(secureClient, "https://api.openai.com/v1/chat/completions");
  http.addHeader("Authorization", String("Bearer ") + apiKey);
  http.addHeader("Content-Type", "application/json");

  StaticJsonDocument<3072> req;
  req["model"] = VISION_MODEL;

  JsonArray msgs = req.createNestedArray("messages");

  JsonObject system = msgs.createNestedObject();
  system["role"] = "system";
  system["content"] =
    "You are a helpful vision assistant. Analyze images and answer questions concisely";

  JsonObject user = msgs.createNestedObject();
  user["role"] = "user";

  JsonArray content = user.createNestedArray("content");

  JsonObject txt = content.createNestedObject();
  txt["type"] = "text";
  txt["text"] = question;

  JsonObject img = content.createNestedObject();
  img["type"] = "image_url";
  img["image_url"]["url"] =
    "data:image/jpeg;base64," + imageBase64;

  String body;
  serializeJson(req, body);

  Serial.println("[VISION] Sending request...");
  int code = http.POST(body);

  esp_camera_fb_return(fb);

  Serial.printf("[VISION] HTTP code: %d\n", code);

  if (code <= 0) {
    http.end();
    return "";
  }

  String payload = http.getString();
  http.end();

  // Serial.println("[VISION] Response:");
  // Serial.println(payload);

  StaticJsonDocument<1024> resp;
  if (deserializeJson(resp, payload)) {
    Serial.println("[ERR] Vision JSON parse failed");
    return "";
  }

  return resp["choices"][0]["message"]["content"] | "";
}

camera_fb_t* captureImage() {
  return esp_camera_fb_get();
}

void initSpeaker() {
  audio.setPinout(I2S_BCLK, I2S_LRC, I2S_DOUT);
  audio.setVolume(TTS_VOLUME);
}

void initPins() {
  pinMode(BUTTON_PIN, INPUT_PULLUP);
  pinMode(LED_PIN, OUTPUT);
}

void initTime() {
  configTime(0, 0, "pool.ntp.org", "time.nist.gov");
  time_t now;
  while (time(&now) < 100000) delay(100);
}

void initWiFi() {
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) delay(200);
}

void initSerial() {
  Serial.begin(115200);
  delay(100);
}

void setup() {
  secureClient.setInsecure();

  initSerial();
  initPins();
  initWiFi();
  initTime();
  initMic();
  initSpeaker();
  initCamera();
  
  Serial.println("Ready");
}

void loop() {
  audio.loop();

  if (busy && !audio.isRunning()) {
    busy = false;
  }
  if (busy) return;

  static bool lastBtn = HIGH;
  bool btn = digitalRead(BUTTON_PIN);

  if (btn == LOW && lastBtn == HIGH && !recording) {
    startRecording();
  }

  if (recording) {
    pollRecording();

    bool released = (btn == HIGH && lastBtn == LOW);
    bool timeout = (millis() - recordStartMs >= MAX_RECORD_TIME_MS);

    if (released || timeout) {
      stopRecording();
      String q = speechToText();
      if (q.length()) {
        Serial.printf("[STT] %s\n", q.c_str());
        camera_fb_t* fb = captureImage();
        if (fb) {
          String a = visionAnswer(q, fb);
          Serial.printf("[VISION] %s\n", a.c_str());
          if (a.length()) {
            textToSpeech(a);
          }
        }
      }
    }
  }

  lastBtn = btn;
  delay(5);
}

Imports

The code begins by including necessary libraries for WiFi connectivity, secure HTTP communication, JSON parsing, camera control, audio processing, and base64 encoding. These libraries enable the ESP32-S3 to interact with hardware peripherals and communicate with OpenAI’s cloud APIs.

#include "WiFi.h"
#include "WiFiClientSecure.h"
#include "HTTPClient.h"
#include "ArduinoJson.h"
#include "esp_heap_caps.h"
#include "ESP_I2S.h"
#include "base64.h"
#include "camera.h"
#include "wav_header.h"
#include "Audio.h"

Pins and Audio Configuration

Several constants define the GPIO pins used for the button, LED, microphone data and clock, and I2S audio output pins. Audio parameters such as sample rate, maximum recording time, and WAV header size are also defined to configure audio capture and playback.

#define BUTTON_PIN 0
#define LED_PIN 3

#define MIC_DATA_PIN 39
#define MIC_CLK_PIN 38

#define I2S_DOUT 42
#define I2S_BCLK 45
#define I2S_LRC 46

#define SAMPLE_RATE 16000
#define MAX_RECORD_TIME_MS 3000
#define WAV_HEADER_SZ PCM_WAV_HEADER_SIZE

Model and Network Credentials

The code specifies the OpenAI models used for text-to-speech (TTS), speech-to-text (STT), and vision question answering. It also stores the WiFi SSID, password, and OpenAI API key as constants for network connection and API authentication.

#define TTS_MODEL "tts-1"
#define TTS_VOICE "shimmer"
#define TTS_VOLUME 16
#define STT_MODEL "whisper-1"
#define VISION_MODEL "gpt-4o-mini"

const char* ssid = "ssid";
const char* password = "pwd";
const char* apiKey = "apikey";

You need to replace the dummy values for ssid, password, and apiKey with the correct values for your WiFi network and your OpenAI API key. Otherwise the Vision Chatbot cannot communicate with the AI models at OpenAI and will not work.

Global Variables and Objects

The code declares a secure WiFi client for HTTPS communication, an I2S audio interface object, and an Audio object for playback. It also manages a buffer for storing recorded WAV audio data, flags for recording and busy states, and timing variables to control recording duration.

WiFiClientSecure secureClient;

I2SClass I2S;
Audio audio;

uint8_t* wavBuf = nullptr;
size_t wavSize = 0;
size_t wavMax = 0;

bool recording = false;
bool busy = false;
unsigned long recordStartMs = 0;

Microphone Initialization

The initMic() function configures the I2S peripheral to receive PDM microphone data using the specified clock and data pins. It sets the sample rate, data bit width, and mono channel mode to prepare for audio recording.

void initMic() {
  I2S.setPinsPdmRx(MIC_CLK_PIN, MIC_DATA_PIN);
  I2S.begin(I2S_MODE_PDM_RX, SAMPLE_RATE,
            I2S_DATA_BIT_WIDTH_16BIT,
            I2S_SLOT_MODE_MONO);      
}

Audio Recording Control

The startRecording() function begins audio capture by turning on the LED indicator and allocating a buffer in PSRAM to hold the WAV data. It writes a default WAV header to the buffer, records the start time, and sets the recording flag.

void startRecording() {
  digitalWrite(LED_PIN, HIGH);

  wavMax = WAV_HEADER_SZ + SAMPLE_RATE * 2 * MAX_RECORD_TIME_MS / 1000;
  wavBuf = (uint8_t*)heap_caps_malloc(wavMax, MALLOC_CAP_SPIRAM);
  Serial.printf("[REC] WAV buffer allocated in PSRAM (%u bytes)\n", wavMax);

  if (!wavBuf) {
    Serial.println("[ERR] WAV buffer allocation failed");
    return;
  }

  pcm_wav_header_t hdr = PCM_WAV_HEADER_DEFAULT(0, 16, SAMPLE_RATE, 1);
  memcpy(wavBuf, &hdr, WAV_HEADER_SZ);

  wavSize = WAV_HEADER_SZ;
  recordStartMs = millis();
  recording = true;

  Serial.println("[REC] Recording started");
}

The pollRecording() function reads available audio data from the I2S peripheral and appends it to the WAV buffer. If the buffer is full, it stops recording automatically.

void pollRecording() {
  size_t avail = I2S.available();
  if (!avail) return;

  if (wavSize + avail > wavMax) {
    stopRecording();
    return;
  }

  wavSize += I2S.readBytes((char*)(wavBuf + wavSize), avail);
}

The stopRecording() function finalizes the WAV data by updating the header with the correct sizes, stops the recording flag, and turns off the LED indicator.

void stopRecording() {
  recording = false;

  if (!wavBuf || wavSize <= WAV_HEADER_SZ) {
    Serial.println("[ERR] No audio recorded");
    return;
  }

  pcm_wav_header_t* h = (pcm_wav_header_t*)wavBuf;
  h->descriptor_chunk.chunk_size = wavSize - 8;
  h->data_chunk.subchunk_size = wavSize - WAV_HEADER_SZ;

  Serial.printf("[REC] Recording stopped, %u bytes total\n", wavSize);
  digitalWrite(LED_PIN, LOW);
}

Speech-to-Text Multipart Request Construction

To send the recorded audio to OpenAI’s Speech-to-Text model, the code builds a multipart/form-data HTTP request body. The buildMultipartHead() and buildMultipartTail() functions create the multipart boundaries and headers, while buildMultipartBody() assembles the full request body in PSRAM by concatenating the head, WAV data, and tail.

String buildMultipartHead(const char* boundary) {
  return String("--") + boundary + "\r\n"
       + "Content-Disposition: form-data; name=\"model\"\r\n\r\n"
       + STT_MODEL + "\r\n--"
       + boundary + "\r\n"
       + "Content-Disposition: form-data; name=\"file\"; filename=\"audio.wav\"\r\n"
       + "Content-Type: audio/wav\r\n\r\n";
}

String buildMultipartTail(const char* boundary) {
  return "\r\n--" + String(boundary) + "--\r\n";
}

uint8_t* buildMultipartBody(
  const String& head,
  const String& tail,
  size_t& outLen
) {
  outLen = head.length() + wavSize + tail.length();

  uint8_t* body =
    (uint8_t*)heap_caps_malloc(outLen, MALLOC_CAP_SPIRAM);

  if (!body) {
    Serial.println("[STT] Out of memory (multipart)");
    return nullptr;
  }

  memcpy(body, head.c_str(), head.length());
  memcpy(body + head.length(), wavBuf, wavSize);
  memcpy(body + head.length() + wavSize, tail.c_str(), tail.length());

  Serial.printf("[STT] Multipart body in PSRAM (%u bytes)\n", outLen);
  return body;
}

The STT_MODEL constant specifies the Speech-to-Text model that is used. I am using "whisper-1" here, but OpenAI has other Speech-to-Text models such as "gpt-4o-mini-transcribe", "gpt-4o-transcribe" or "gpt-4o-transcribe-diarize" that you could try out.

HTTP POST for Speech-to-Text

The postMultipart() function performs the HTTPS POST request to OpenAI’s audio transcription endpoint. It sets the authorization and content-type headers, uploads the multipart body, and retrieves the response. The function returns the HTTP status code and stores the response string.

int postMultipart(
  uint8_t* body,
  size_t bodyLen,
  const char* boundary,
  String& response
) {
  HTTPClient http;

  http.begin(secureClient,
             "https://api.openai.com/v1/audio/transcriptions");
  http.addHeader("Authorization", String("Bearer ") + apiKey);
  http.addHeader("Content-Type",
                 String("multipart/form-data; boundary=") + boundary);

  Serial.printf("[STT] Uploading %u bytes\n", bodyLen);

  int code = http.POST(body, bodyLen);
  Serial.printf("[STT] HTTP code: %d\n", code);

  if (code == 200) {
    response = http.getString();
  } else {
    Serial.println("[STT] Error response:");
    Serial.println(http.getString());
  }

  http.end();
  return code;
}

Speech-to-Text Processing

The speechToText() function orchestrates the process of building the multipart request, sending it, and parsing the JSON response to extract the transcribed text. It also handles memory management by releasing the WAV buffer after sending.

String speechToText() {
  Serial.println("[STT] Building multipart body");
  Serial.printf(
    "[HEAP] Internal heap free before TLS: %u\n",
    heap_caps_get_free_size(MALLOC_CAP_INTERNAL));

  if (!isValidWavBuffer()) {
    return "";
  }

  const char* boundary = "----ESP32Boundary";

  String head = buildMultipartHead(boundary);
  String tail = buildMultipartTail(boundary);

  size_t bodyLen = 0;
  uint8_t* body = buildMultipartBody(head, tail, bodyLen);
  if (!body) {
    return "";
  }

  releaseWavBuffer();

  String response;
  int code = postMultipart(body, bodyLen, boundary, response);
  free(body);

  if (code != 200) {
    return "";
  }

  return parseSttResponse(response);
}

Vision Question Answering

The visionAnswer() function sends a question along with a captured image to OpenAI’s GPT-4o-mini model for vision-based question answering. It encodes the image in base64, constructs a JSON chat completion request with system and user messages, and parses the response to extract the assistant’s answer.

String visionAnswer(String question, camera_fb_t* fb) {
  Serial.println("[VISION] Encoding image...");
  String imageBase64 = base64::encode(fb->buf, fb->len);

  HTTPClient http;
  http.begin(secureClient, "https://api.openai.com/v1/chat/completions");
  http.addHeader("Authorization", String("Bearer ") + apiKey);
  http.addHeader("Content-Type", "application/json");

  StaticJsonDocument<3072> req;
  req["model"] = VISION_MODEL;

  JsonArray msgs = req.createNestedArray("messages");

  JsonObject system = msgs.createNestedObject();
  system["role"] = "system";
  system["content"] =
    "You are a helpful vision assistant. Analyze images and answer questions concisely";

  JsonObject user = msgs.createNestedObject();
  user["role"] = "user";

  JsonArray content = user.createNestedArray("content");

  JsonObject txt = content.createNestedObject();
  txt["type"] = "text";
  txt["text"] = question;

  JsonObject img = content.createNestedObject();
  img["type"] = "image_url";
  img["image_url"]["url"] =
    "data:image/jpeg;base64," + imageBase64;

  String body;
  serializeJson(req, body);

  Serial.println("[VISION] Sending request...");
  int code = http.POST(body);

  esp_camera_fb_return(fb);

  Serial.printf("[VISION] HTTP code: %d\n", code);

  if (code <= 0) {
    http.end();
    return "";
  }

  String payload = http.getString();
  http.end();

  StaticJsonDocument<1024> resp;
  if (deserializeJson(resp, payload)) {
    Serial.println("[ERR] Vision JSON parse failed");
    return "";
  }

  return resp["choices"][0]["message"]["content"] | "";
}

The VISION_MODEL constant specifies the OpenAI vision model used. I am using “gpt-4o-mini” but there are others such as “gpt-image-1”, “gpt-5-mini”, “gpt-5-nano” or “gpt-4.1-nano” that you could try. They have different capabilities, speeds and costs.

Text-to-Speech Playback

The textToSpeech() function calls the audio library’s openai_speech() method to request speech synthesis from OpenAI’s TTS model. It sets the busy flag to prevent overlapping operations while audio is playing.

void textToSpeech(String text) {
  busy = true;
  audio.openai_speech(apiKey, TTS_MODEL,
                      text.c_str(), "",
                      TTS_VOICE, "mp3", "1");
}

The TTS_MODEL constant specifies the Text-to-Speech model. I am using “tts-1”, but you could also use “tts-1-hd”. The “tts-1” model provides lower latency but lower quality than “tts-1-hd”. A more capable but also more expensive option is the “gpt-4o-mini-tts” model, which you could also use.

The voice for the audio output is specified by the TTS_VOICE constant, which is set to “shimmer”. You can try other voices but note that voice availability depends on the model. The tts-1 and tts-1-hd models support a smaller set of voices: “alloy”, “ash”, “coral”, “echo”, “fable”, “onyx”, “nova”, “sage”, and “shimmer” (platform.openai.com/docs/guides/text-to-speech).

Camera Capture

The captureImage() function captures a frame from the ESP32 camera and returns a pointer to the frame buffer for further processing.

camera_fb_t* captureImage() {
  return esp_camera_fb_get();
}

Hardware Initialization

Several helper functions initialize hardware components and system services. initSpeaker() configures the audio output pins and volume. initPins() sets up the button and LED GPIO modes. initTime() synchronizes the system time using NTP servers. initWiFi() connects to the specified WiFi network. initSerial() starts the serial communication for debugging.

void initSpeaker() {
  audio.setPinout(I2S_BCLK, I2S_LRC, I2S_DOUT);
  audio.setVolume(TTS_VOLUME);
}

void initPins() {
  pinMode(BUTTON_PIN, INPUT_PULLUP);
  pinMode(LED_PIN, OUTPUT);
}

void initTime() {
  // Request UTC time from two public NTP servers
  configTime(0, 0, "pool.ntp.org", "time.nist.gov");
  // Wait until the clock has moved away from the 1970 epoch,
  // which indicates that NTP synchronization succeeded
  time_t now;
  while (time(&now) < 100000) delay(100);
}

void initWiFi() {
  WiFi.begin(ssid, password);
  while (WiFi.status() != WL_CONNECTED) delay(200);
}

void initSerial() {
  Serial.begin(115200);
  delay(100);
}

Setup Function

The setup() function is called once at startup. It configures the secure client to skip certificate verification, initializes serial communication, GPIO pins, WiFi, system time, microphone, speaker, and camera. Finally, it prints “Ready” to indicate the system is prepared.

void setup() {
  secureClient.setInsecure();

  initSerial();
  initPins();
  initWiFi();
  initTime();
  initMic();
  initSpeaker();
  initCamera();

  Serial.println("Ready");
}

Loop Function

The loop() function runs repeatedly. It processes audio playback in the background. If the system is busy playing audio, it waits. Otherwise, it reads the button state to detect presses and releases.

When the button is pressed, recording starts. While recording, audio data is polled and appended to the buffer. Recording stops either when the button is released or the maximum recording time is reached.

After stopping, the recorded audio is sent for transcription. If transcription succeeds, the camera captures an image, and the question is sent to the vision model. The answer is then converted to speech and played back. The loop includes a small delay to debounce the button.

void loop() {
  audio.loop();

  if (busy && !audio.isRunning()) {
    busy = false;
  }
  if (busy) return;

  static bool lastBtn = HIGH;
  bool btn = digitalRead(BUTTON_PIN);

  if (btn == LOW && lastBtn == HIGH && !recording) {
    startRecording();
  }

  if (recording) {
    pollRecording();

    bool released = (btn == HIGH && lastBtn == LOW);
    bool timeout = (millis() - recordStartMs >= MAX_RECORD_TIME_MS);

    if (released || timeout) {
      stopRecording();
      String q = speechToText();
      if (q.length()) {
        Serial.printf("[STT] %s\n", q.c_str());
        camera_fb_t* fb = captureImage();
        if (fb) {
          String a = visionAnswer(q, fb);
          Serial.printf("[VISION] %s\n", a.c_str());
          if (a.length()) {
            textToSpeech(a);
          }
        }
      }
    }
  }

  lastBtn = btn;
  delay(5);
}

And this is the complete code for the Vision Chatbot itself. However, we also need some code for the camera, which is described in the next section.

Camera.h code

The camera code is essentially a copy of the “camera.h” file from the OpenAI image recognition example that you can find in the DFRobot GitHub repo.

#include "esp_camera.h"
#include "soc/soc.h"           // Disable brownout problems
#include "soc/rtc_cntl_reg.h"  // Disable brownout problems
#include <Arduino.h>

// OV2640 camera module pins (CAMERA_MODEL_AI_THINKER)
#define PWDN_GPIO_NUM -1
#define RESET_GPIO_NUM -1
#define XCLK_GPIO_NUM 5
#define SIOD_GPIO_NUM 8
#define SIOC_GPIO_NUM 9
#define Y9_GPIO_NUM 4
#define Y8_GPIO_NUM 6
#define Y7_GPIO_NUM 7
#define Y6_GPIO_NUM 14
#define Y5_GPIO_NUM 17
#define Y4_GPIO_NUM 21
#define Y3_GPIO_NUM 18
#define Y2_GPIO_NUM 16
#define VSYNC_GPIO_NUM 1
#define HREF_GPIO_NUM 2
#define PCLK_GPIO_NUM 15

void initCamera() {
  camera_config_t config;
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO_NUM;
  config.pin_d1 = Y3_GPIO_NUM;
  config.pin_d2 = Y4_GPIO_NUM;
  config.pin_d3 = Y5_GPIO_NUM;
  config.pin_d4 = Y6_GPIO_NUM;
  config.pin_d5 = Y7_GPIO_NUM;
  config.pin_d6 = Y8_GPIO_NUM;
  config.pin_d7 = Y9_GPIO_NUM;
  config.pin_xclk = XCLK_GPIO_NUM;
  config.pin_pclk = PCLK_GPIO_NUM;
  config.pin_vsync = VSYNC_GPIO_NUM;
  config.pin_href = HREF_GPIO_NUM;
  config.pin_sccb_sda = SIOD_GPIO_NUM;
  config.pin_sccb_scl = SIOC_GPIO_NUM;
  config.pin_pwdn = PWDN_GPIO_NUM;
  config.pin_reset = RESET_GPIO_NUM;
  config.xclk_freq_hz = 8000000;
  config.frame_size = FRAMESIZE_240X240;
  config.pixel_format = PIXFORMAT_JPEG;  // for streaming
  config.grab_mode = CAMERA_GRAB_WHEN_EMPTY;
  config.fb_location = CAMERA_FB_IN_PSRAM;
  config.jpeg_quality = 12;
  config.fb_count = 2;

  // if PSRAM IC present, init with UXGA resolution and higher JPEG quality
  //                      for larger pre-allocated frame buffer.
  if (config.pixel_format == PIXFORMAT_JPEG) {
    if (psramFound()) {
      config.jpeg_quality = 10;
      config.fb_count = 2;
      config.grab_mode = CAMERA_GRAB_LATEST;
    } else {
      // Limit the frame size when PSRAM is not available
      config.frame_size = FRAMESIZE_SVGA;
      config.fb_location = CAMERA_FB_IN_DRAM;
    }
  } else {
    // Best option for face detection/recognition
    config.frame_size = FRAMESIZE_240X240;
#if CONFIG_IDF_TARGET_ESP32S3
    config.fb_count = 2;
#endif
  }

  // camera init
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed with error 0x%x", err);
    return;
  }

  sensor_t *s = esp_camera_sensor_get();
  // initial sensors are flipped vertically and colors are a bit saturated
  if (s->id.PID == OV3660_PID) {
    s->set_vflip(s, 1);        // flip it back
    s->set_brightness(s, 1);   // up the brightness just a bit
    s->set_saturation(s, -2);  // lower the saturation
  }
  // drop down frame size for higher initial frame rate
  if (config.pixel_format == PIXFORMAT_JPEG) {
    s->set_framesize(s, FRAMESIZE_QVGA);
  }
}

However, I made one important change. I noticed that “cam_hal: FB-OVF” errors were printed to the Serial Monitor while the Vision Chatbot was running.

“FB-OVF” stands for Frame Buffer Overflow. It essentially means the camera is sending data faster than the ESP32 can process it or store it in memory.

The recommended way to avoid this error is to reduce the frame rate. I therefore changed the original camera code and lowered the camera clock (XCLK) frequency to 8 MHz, which in turn reduces the frame rate:

config.xclk_freq_hz = 8000000;

That eliminated the “cam_hal: FB-OVF” errors.

Project Folder

You can download the entire project as vision_chatbot.zip or create the Arduino project for the Vision Chatbot yourself. For that, create a folder “vision_chatbot” with two files (“camera.h” and “vision_chatbot.ino”) in it:

The “camera.h” file contains the code for the camera, and “vision_chatbot.ino” contains the main code for the Vision Chatbot. Once you have selected the correct board (“ESP32S3 Dev Module”) and the correct tool settings (PSRAM, Huge APP, …), you can flash the code to the board and enjoy your Vision Chatbot in action. The next two sections show two examples with the answers of the Vision Chatbot.
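The resulting folder layout looks like this:

```
vision_chatbot/
├── vision_chatbot.ino   (main sketch with the Vision Chatbot code)
└── camera.h             (camera pin definitions and initCamera())
```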

Example: What do you see?

In this first example, I showed the ESP32-S3 AI Camera a small model skull on my desk with some other stuff in the background.

And here is the output of the Vision Chatbot on the Serial Monitor when presented with this image:

13:22:02.672 -> E (1763) i2s_common: i2s_channel_disable(1217): the channel has not been enabled yet
13:22:03.049 -> Ready
13:25:49.615 -> [REC] WAV buffer allocated in PSRAM (96044 bytes)
13:25:49.615 -> [REC] Recording started
13:25:51.288 -> [REC] Recording stopped, 55724 bytes total
13:25:51.288 -> [STT] Building multipart body
13:25:51.288 -> [HEAP] Internal heap free before TLS: 181256
13:25:51.288 -> [STT] Multipart body in PSRAM (55944 bytes)
13:25:51.288 -> [STT] Uploading 55944 bytes
13:25:52.847 -> [STT] HTTP code: 200
13:25:52.847 -> [STT] What do you see?
13:25:52.847 -> [VISION] Encoding image...
13:25:52.847 -> [VISION] Sending request...
13:25:54.995 -> [VISION] HTTP code: 200
13:25:55.032 -> [VISION] I see a small skull placed on a light-colored surface, likely a table. In the background, there are various objects and possibly some clutter. The setting appears to be an indoor space.

You can see the question asked “What do you see?” and the answer of the Chatbot “I see a small skull placed on a light-colored surface, likely a table. In the background, there are various objects and possibly some clutter. The setting appears to be an indoor space.”

Note that there is an error message “E (1763) i2s_common: i2s_channel_disable(1217): the channel has not been enabled yet” at the beginning that you can ignore. It seems to be related to an issue with the current ESP32 core but doesn’t affect the function of the Chatbot.

Example: How many clocks?

You can also ask the Chatbot about specific objects in an image. That doesn’t always work, and the bot had issues recognizing chairs, for instance. However, in the rather complex scene shown below, which contains a wall clock, the Chatbot correctly reported that there is one clock:

Here is the output on the Serial Monitor. I asked “How many clocks are in this image?” and the answer was “There is one clock visible in the image.”:

15:25:00.656 -> Ready
15:28:28.984 -> [REC] WAV buffer allocated in PSRAM (96044 bytes)
15:28:28.984 -> [REC] Recording started
15:28:31.817 -> [REC] Recording stopped, 92204 bytes total
15:28:31.817 -> [STT] Building multipart body
15:28:31.817 -> [HEAP] Internal heap free before TLS: 181256
15:28:31.817 -> [STT] Multipart body in PSRAM (92424 bytes)
15:28:31.817 -> [STT] Uploading 92424 bytes
15:28:33.445 -> [STT] HTTP code: 200
15:28:33.445 -> [STT] How many clocks are in this image?
15:28:33.445 -> [VISION] Encoding image...
15:28:33.445 -> [VISION] Sending request...
15:28:34.953 -> [VISION] HTTP code: 200
15:28:34.989 -> [VISION] There is one clock visible in the image.

Considering the low quality and small size of the image, and the many objects in it, the Vision Chatbot did remarkably well at finding the clock.

Have fun playing with your Vision Chatbot but keep an eye on the OpenAI costs and your budget, especially if you start using more capable but also more expensive models.

Conclusions

In this tutorial you learned how to build a Vision Chatbot using the DFRobot ESP32-S3 AI Camera Module and OpenAI. Audio recording, playback, and image capture were performed locally by the ESP32, while Text-to-Speech, Speech-to-Text, and image analysis were performed remotely by OpenAI services.

Remote processing of image and audio data allows us to use much more powerful AI models than what could be run locally on the ESP32. However, this requires a stable internet connection to communicate with OpenAI’s cloud services and adds latency and cost. Additionally, privacy is a concern when sending images to external servers.

If you want to perform simple speech recognition locally, have a look at the Getting started with Gravity Voice Recognition Module, Voice control with XIAO-ESP32-S3-Sense and Edge Impulse and the Using the Voice Recognition Module V3 with Arduino tutorials.

For local text-to-speech, see the Gravity Text-to-Speech Module Tutorial; the module can generate speech locally, but with limited quality.

For further code examples for the DFRobot ESP32-S3 AI Camera, have a look at the DFRobot Wiki and their GitHub repo.

If you have any questions feel free to leave them in the comment section.

Happy Tinkering ; )