
Voice control with XIAO-ESP32-S3-Sense and Edge Impulse

In this tutorial you will learn how to build a voice control application with the XIAO-ESP32-S3-Sense board and the Edge Impulse platform. First we will collect audio data with the microphone of the XIAO-ESP32-S3-Sense. Then we will upload this audio data to the Edge Impulse platform and train a keyword-spotting model. Finally, we will deploy this TinyML model on the XIAO-ESP32-S3-Sense, which will allow us to control LEDs with our voice.

The TinyML model will be trained to recognize the words “red”, “yellow” and “green”. We will connect three LEDs (red, yellow, green) to the XIAO-ESP32-S3-Sense board, which you can then switch on via voice commands. See the system in action below:

Voice Control of LEDs with XIAO-ESP32-S3-Sense

If you haven’t used the XIAO-ESP32-S3 Sense before, have a look at the Getting started with XIAO-ESP32-S3-Sense tutorial first, otherwise let’s begin …

Required Parts

You will need a XIAO ESP32 S3 Sense board to try out the code examples. Note that the board can get hot, so you may want to attach a small heatsink to the back (see the listed part below).

Next, you will need an SD Card to store the collected audio samples. I listed a 32 GB card but a smaller one (8GB) will be just fine as well. And, finally, if your computer does not have a built-in SD card reader, you will need one as well.

Seeed Studio XIAO ESP32 S3 Sense

USB C Cable

Small Heatsink 9×9 mm

SD Card 32GB

SD Card Reader

Makerguides is a participant in affiliate advertising programs designed to provide a means for sites to earn advertising fees by linking to Amazon, AliExpress, Elecrow, and other sites. As an Affiliate we may earn from qualifying purchases.

The XIAO-ESP32-S3-Sense board

The XIAO-ESP32-S3-Sense board is based on the ESP32-S3, a chip from Espressif tailored for AI and edge computing tasks. The board has four parts: the Main board, the Sense Hat, the Camera and an external Wi-Fi Antenna:

Parts of the XIAO-ESP32-S3 Sense

The Sense Hat can be plugged into the main board and has an SD card socket, a socket for the camera and a microphone. The picture below shows the completely assembled XIAO-ESP32-S3-Sense board with an SD Card inserted:

Assembled XIAO-ESP32-S3-Sense board with Camera, Antenna and SD Card

However, for this project we will not need the Wi-Fi antenna or the camera. So the configuration shown below is sufficient, and you won't even need the SD card once you have collected the audio samples.

XIAO-ESP32-S3-Sense with SD Card

Note that power to the board can be supplied either by the Type-C USB port or by the battery charging interface that can be connected to a 3.7V LiPo battery.

Microphone of XIAO-ESP32-S3-Sense

The XIAO-ESP32-S3-Sense comes with a built-in digital MEMS Microphone of the type MSM261D3526H1CPM that is located on the Sense Hat. See the picture below:

Microphone on Sense Hat

The microphone is connected via two signal lines (PDM_CLK, PDM_DATA) that you access through the I2S peripheral in PDM mode. They are wired to GPIO42 and GPIO41, as shown in the schematic below:

Schematics for Microphone on Sense Hat

For more detailed information see our Record Audio with XIAO-ESP32-S3-Sense tutorial.

Pinout of the XIAO-ESP32-S3-Sense board

Finally, let’s have a look at the Pinout of the XIAO-ESP32-S3-Sense board:

Pinout of XIAO-ESP32-S3-Sense board

The 5V pin carries the 5V from the USB port. The 3V3 pin provides the output of the onboard regulator and can deliver up to 700 mA. And the GND pin provides ground.

As for GPIOs: the board offers 11 digital/analog GPIOs, but GPIO0, GPIO3, GPIO43 and GPIO44 are strapping pins that need to be in a specific state during startup – so be careful with these. Once the microcontroller is running, the strapping pins operate as regular IO pins.

If you need more help see the Getting started with XIAO-ESP32-S3-Sense tutorial.

Format SD Card for Audio Collection

Before we can start collecting audio samples to train our TinyML model, we need to make sure that the SD Card is formatted correctly.

Insert the SD Card into an SD Card Reader, either the one built into your computer or the external SD Card reader (listed under Required Parts).

Next open the Explorer (under Windows), look for a new USB drive and right-click it to open the drive's menu. There select “Show more options” and then “Format …” to open the Format dialog:

Menu to Format SD Card

Then check that the File system is set to “FAT32” and press “Start”. Make sure that you selected the correct USB drive, since formatting will delete all existing data on that drive!

I usually rename the drive first, e.g. to “SAMPLES”, to make sure that I have the correct drive and don't accidentally format a different one.

Collecting Audio Samples with XIAO-ESP32-S3-Sense

Next we write the code to collect the audio samples needed to train our TinyML model for voice control. Make sure you have formatted and inserted the SD Card correctly in the XIAO-ESP32-S3-Sense, as shown below:

SD Card inserted into XIAO-ESP32-S3-Sense

We want to use our voice to switch on a red, yellow or green LED. The control words therefore will be “red”, “yellow” and “green”, though you could pick any other words in any language you like.

Red, yellow, green LEDs

You could collect those audio samples using your mobile phone or the microphone in your computer, but the sound characteristics of these microphones will differ from the one in the XIAO-ESP32-S3-Sense. It is therefore best to record with the microphone of the XIAO-ESP32-S3-Sense itself, but that requires some code to record audio samples and store them on the SD card.

The following code collects audio samples. You have to enter the label (“red”, “yellow” or “green”) in the Serial Monitor and hit enter.

This will start a 1 second long recording, where you speak the control word (e.g. “red”). The recording is then saved on the SD Card as an audio sample. Have a quick look at the complete code first and then we will discuss the details:

#include "ESP_I2S.h"
#include "FS.h"
#include "SD.h"

const int header_size = 44;  // size of the WAV file header in bytes

I2SClass i2s;


void scaleVolume(int16_t* audioData, size_t sampleCount) {
  const float gain = 16.0;
  for (size_t i = 0; i < sampleCount; i++) {
    audioData[i] = (int16_t)constrain(audioData[i] * gain, INT16_MIN, INT16_MAX);
  }
}


void setup() {
  Serial.begin(115200);
  while (!Serial) {
    delay(10);
  }

  i2s.setPinsPdmRx(42, 41);
  if (!i2s.begin(I2S_MODE_PDM_RX, 16000, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("Can't find microphone!");
  }

  if (!SD.begin(21)) {
    Serial.println("Failed to mount SD Card!");
  }

  Serial.println("Enter a label and press Enter to record 1 second of audio:");
}


void loop() {
  static int cnt = 1;
  static char filename[64];
  static String label = "audio";

  if (Serial.available() > 0) {
    String entered = Serial.readStringUntil('\n');
    entered.trim();

    if (entered.length() > 0) {
      label = entered;
      cnt = 1;
      String folder = "/" + label;
      if (!SD.exists(folder)) {
        SD.mkdir(folder);
      }
    }

    sprintf(filename, "/%s/%s_%d.wav", label.c_str(), label.c_str(), cnt++);
    Serial.printf("Recording to: %s\n", filename);

    uint8_t* wav_buffer;
    size_t wav_size;

    wav_buffer = i2s.recordWAV(1, &wav_size);
    if (wav_size > header_size) {
      size_t sampleCount = (wav_size - header_size) / 2;  // 16-bit samples = 2 bytes per sample
      scaleVolume((int16_t*)(wav_buffer + header_size), sampleCount);
    }

    File file = SD.open(filename, FILE_WRITE);
    if (!file) {
      Serial.println("Failed to open file for writing!");
      return;
    }

    if (file.write(wav_buffer, wav_size) != wav_size) {
      Serial.println("Failed to write audio data to file!");
      return;
    }

    file.close();
    Serial.println("done.");
  }

  delay(100);
}

The first step of this program is to include the required libraries. These libraries allow the ESP32 to use the I2S interface for microphone input and to store files on an SD card. Without these, the ESP32 would not be able to record audio or save it.

#include "ESP_I2S.h"
#include "FS.h"
#include "SD.h"

Constants

The code defines a constant for the WAV file header size. Every WAV file begins with a 44-byte header that contains information such as sample rate, bit depth, and channel count. By defining header_size, the code knows where the actual audio data starts.

const int header_size = 44;
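For context, those 44 bytes are the standard RIFF/WAV header. The host-side C++ sketch below (not part of the Arduino program – recordWAV() generates the real header for us) shows its layout for our 16 kHz, mono, 16-bit recordings:

```cpp
#include <cstdint>
#include <cstring>

// Canonical 44-byte header of a 16-bit PCM WAV file.
// All multi-byte fields are little-endian.
struct WavHeader {
  char     riff[4];        // "RIFF"
  uint32_t chunkSize;      // 36 + dataSize
  char     wave[4];        // "WAVE"
  char     fmt[4];         // "fmt "
  uint32_t fmtSize;        // 16 for PCM
  uint16_t audioFormat;    // 1 = PCM
  uint16_t numChannels;    // 1 = mono
  uint32_t sampleRate;     // e.g. 16000
  uint32_t byteRate;       // sampleRate * numChannels * bitsPerSample / 8
  uint16_t blockAlign;     // numChannels * bitsPerSample / 8
  uint16_t bitsPerSample;  // 16
  char     data[4];        // "data"
  uint32_t dataSize;       // number of audio bytes that follow
};

// Build a header for our recording format: 16 kHz, mono, 16-bit PCM.
WavHeader makeHeader(uint32_t dataSize) {
  WavHeader h;
  memcpy(h.riff, "RIFF", 4);
  h.chunkSize = 36 + dataSize;
  memcpy(h.wave, "WAVE", 4);
  memcpy(h.fmt, "fmt ", 4);
  h.fmtSize = 16;
  h.audioFormat = 1;
  h.numChannels = 1;
  h.sampleRate = 16000;
  h.bitsPerSample = 16;
  h.blockAlign = h.numChannels * h.bitsPerSample / 8;
  h.byteRate = h.sampleRate * h.blockAlign;
  memcpy(h.data, "data", 4);
  h.dataSize = dataSize;
  return h;
}
```

This is why the recording code skips the first header_size bytes before touching the samples: everything after byte 44 is raw 16-bit audio data.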

Objects

Next, the program creates an I2SClass object. This object handles all communication with the microphone using the I2S protocol. By creating this instance, the ESP32 can initialize the microphone, set its parameters, and record audio.

I2SClass i2s;

scaleVolume

The function scaleVolume() adjusts the loudness of the recorded audio. It takes in raw audio samples, multiplies them by a gain factor, and then constrains the values to fit within the 16-bit audio range. This ensures the sound does not distort due to overflow.

void scaleVolume(int16_t* audioData, size_t sampleCount) {
  const float gain = 16.0;
  for (size_t i = 0; i < sampleCount; i++) {
    audioData[i] = (int16_t)constrain(audioData[i] * gain, INT16_MIN, INT16_MAX);
  }
}

You could also normalize the volume instead (the typical scale factor I observed was about 20x), but normalization won't work with streaming audio and it amplifies noise. It is still useful for getting an idea of a suitable fixed gain factor, e.g. 8x or 16x:

void normalizeVolume(int16_t* audioData, size_t sampleCount) {
  float maxAmplitude = 0;
  for (size_t i = 0; i < sampleCount; i++) {
    int16_t absValue = abs(audioData[i]);
    if (absValue > maxAmplitude) {
      maxAmplitude = absValue;
    }
  }

  float scaleFactor = (float)INT16_MAX / (maxAmplitude + 1);
  Serial.printf("scaleFactor %f\n", scaleFactor);
  for (size_t i = 0; i < sampleCount; i++) {
    float scaledSample = (float)audioData[i] * scaleFactor;
    audioData[i] = (int16_t)constrain(scaledSample, INT16_MIN, INT16_MAX);
  }
}

setup

The setup() function prepares everything before the main loop runs. First, it initializes the serial monitor for debugging. It then configures the microphone pins with i2s.setPinsPdmRx(42, 41) and starts the I2S interface at a sample rate of 16 kHz in mono mode. If no microphone is detected, it prints an error message. After that, the SD card is initialized on pin 21. If the SD card fails to mount, another error message is printed. Finally, the ESP32 prompts the user to enter a label through the serial monitor.

void setup() {
  Serial.begin(115200);
  while (!Serial) {
    delay(10);
  }

  i2s.setPinsPdmRx(42, 41);
  if (!i2s.begin(I2S_MODE_PDM_RX, 16000, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    Serial.println("Can't find microphone!");
  }

  if (!SD.begin(21)) {
    Serial.println("Failed to mount SD Card!");
  }

  Serial.println("Enter a label and press Enter to record 1 second of audio:");
}

loop

Inside the loop(), the program waits for user input from the serial monitor. The user can type in a label, such as “red” or “noise”. When a label is entered, the code creates a folder on the SD card with that name. This organization makes it easy to separate recordings and will simplify the upload to the Edge Impulse platform.

if (Serial.available() > 0) {
  String entered = Serial.readStringUntil('\n');
  entered.trim();

  if (entered.length() > 0) {
    label = entered;
    cnt = 1;
    String folder = "/" + label;
    if (!SD.exists(folder)) {
      SD.mkdir(folder);
    }
  }

After setting the label, the program generates a filename for the new recording. It uses the format label/label_number.wav. For example, if the label is “red,” the first file will be red/red_1.wav. Each new recording increments the counter, so recordings are saved in order without overwriting.

sprintf(filename, "/%s/%s_%d.wav", label.c_str(), label.c_str(), cnt++);
Serial.printf("Recording to: %s\n", filename);

To record audio, the code calls i2s.recordWAV(1, &wav_size). This function captures one second of audio and stores it in memory as a WAV file. If the recording contains valid audio data, the program applies the scaleVolume() function to increase the gain of the audio samples.

wav_buffer = i2s.recordWAV(1, &wav_size);
if (wav_size > header_size) {
  size_t sampleCount = (wav_size - header_size) / 2;
  scaleVolume((int16_t*)(wav_buffer + header_size), sampleCount);
}

Once the audio is processed, the program opens a file on the SD card and writes the WAV data into it. If the file cannot be opened or written, it prints an error message. After saving the recording, the file is closed, and a confirmation message is printed.

File file = SD.open(filename, FILE_WRITE);
if (!file) {
  Serial.println("Failed to open file for writing!");
  return;
}

if (file.write(wav_buffer, wav_size) != wav_size) {
  Serial.println("Failed to write audio data to file!");
  return;
}

file.close();
Serial.println("done.");

Finally, the loop waits for 100 milliseconds before checking again for new user input. This small delay prevents the ESP32 from overloading the serial input processing.

delay(100);

This project transforms your ESP32 into a handy audio recorder that can capture one-second clips from an I2S microphone and save them as WAV files on an SD card. With labeled folders, it becomes especially useful for building datasets for machine learning projects, such as training a speech recognition model.

Running the Code to collect Samples

Flash your XIAO-ESP32-S3-Sense with the code and open the Serial monitor. Enter the label for the class you want to collect samples for in the message box, e.g. “yellow” and then press Enter.

Recording Audio Samples and Serial Monitor Output

The Serial Monitor will show the filename for the audio sample, e.g. “/yellow/yellow_1.wav”, and you now have one second to speak the word “yellow”. Once done, the text “done.” will be printed to the Serial Monitor. Below is an example of a recorded audio sample for the word “yellow”:

For subsequent recordings of the same class, e.g. “yellow”, just press Enter. Don’t enter the same class label again. If you are ready to record the samples for the next class, e.g. “red”, enter “red”, press Enter and keep recording in the same way.

The code will create a new file folder for each class (“red”, “green”, “yellow”, “noise”) and will store the audio samples within this folder:

Folder Structure for Audio Samples

I recorded 50 samples per color and 100 samples for noise. For the “noise” class: record silence, environmental noise, music, speech, everything but the words “red”, “green” or “yellow”. The more samples you have for each class the better, but with 50 samples you start to get a decent detection accuracy – not great, but decent ; )

Training Voice Control model using Edge Impulse

In this section we will go through the necessary steps to upload our training data, create the data preprocessing and model pipeline (impulse) and train our TinyML model.

Create a new Edge Impulse project, for instance, with the name “Voice Control” and then we are ready to upload our training data.

Upload Training Data

First, remove the SD Card from the XIAO-ESP32-S3-Sense and connect it to your computer, so that we can upload data from it. It will appear on your computer as a USB drive, if you use an external SD Card reader.

Then click on “Data acquisition” -> “+ Add data” -> “Upload data” to open the dialog for uploading data:

In the dialog tick “Select a folder” and then enter the label for the class you want to upload data for, e.g. “red”, at the bottom:

Now click on “Choose files” and select the folder “red” with your audio samples. Then click on “Upload data” and you should see the data being uploaded:

Repeat this process for the other colors and the noise class. Make sure you enter the correct label under “Enter label” before uploading. However, you can later delete samples you accidentally uploaded with the wrong label.

When you are done, close the dialog by pressing the x in the upper right corner:

Explore Data

You then should see the data set in the Data explorer. In the upper left corner you will see a pie chart representing the distribution of your classes and the Train/Test split. Below it is the Dataset, where you can click on individual samples to listen to them:

If you find something is wrong with your data, there are also functions to delete samples. But if you are happy with it, the next step is to create an Impulse (Data preprocessing + Model).

Create Impulse

Click on “Create impulse” under “Impulse design”:

Then add three blocks: “Time series data”, “Audio (MFE)” and “Transfer Learning (Keyword Spotting)”:

Keep all the default settings for the blocks as they are. In the next sections we configure and train the individual blocks.

MFE

Click on “MFE” under “Impulse design”:

This will open the settings dialog for the MFE block:

MFE stands for Mel Filterbank Energy and is a digital signal processing method for extracting features from an audio signal. We keep the settings for the MFE block as they are. Just press “Save parameters” to save them.
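The mel scale behind this filterbank spaces the frequency bins so that they roughly match human pitch perception, compressing the high frequencies. For reference, the standard conversion formulas (the general definition, not code taken from Edge Impulse) look like this:

```cpp
#include <cmath>

// Standard Hz-to-mel conversion: m = 2595 * log10(1 + f / 700)
double hzToMel(double hz) {
  return 2595.0 * std::log10(1.0 + hz / 700.0);
}

// And the inverse, mel back to Hz:
double melToHz(double mel) {
  return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
}
```

For example, the gap between 100 Hz and 200 Hz spans far more mels than the gap between 7000 Hz and 7100 Hz, which is why the filterbank dedicates more resolution to the low end, where most speech information lives.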

In the Feature explorer you will then be able to see how well these extracted features perform:

Each dot in the plot above represents one of our audio samples (a spoken word or noise). Ideally you want dots from the same class to be grouped tightly together and the groups clearly separated.

The plot above shows some grouping but it is not great. You can see that the groups for the color words (“red”, “green”, “yellow”) are stretched out. This is probably because I varied my distance to the microphone when speaking them.

The “noise” class, on the other hand, is circular in shape, since most of the noise samples are relatively similar regarding their loudness but different in content.

Based on this clustering, I would not expect a fantastic recognition accuracy. You can play a bit with the settings for the MFE block and collect more samples, which should make the model more robust.

Transfer Learning

To train the model click on “Transfer learning” under “Impulse design”:

This will open the Neural Network settings dialog shown below:

You can choose between two models: MobileNetV1 0.1 or MobileNetV2 0.35. The second one is most likely more accurate but also a lot bigger and I could not get it working on the XIAO-ESP32-S3-Sense. I therefore picked the MobileNetV1 0.1 model.

Keyword Spotting Models

I increased the number of training cycles to 60, set the learning rate to 0.01 and kept the CPU as training processor. However, none of these parameters is generally very sensitive or important for the final accuracy of the model and the default settings should work nicely as well.

Next click on “Save & Train”, which will train the model. At the end of the training process you will see a confusion matrix and some other metrics for network performance printed out:

As you can see, in my case the overall accuracy was 95%, and the model tended to confuse color words with noise, which is to be expected. The accuracy could probably be improved with more training samples, but it is a very small model and its recognition capacity is therefore limited.

Deploy Model

Finally, we need to deploy the model to run it on the XIAO-ESP32-S3-Sense. Click on “Deployment” under “Impulse design”:

In the upper right corner of the screen you can select the Target processor for the deployment. As of August 2025, Edge Impulse does not directly support the XIAO-ESP32-S3-Sense. Instead, we select the Espressif ESP-EYE:

If you click on it it will open the configuration dialog for the target device:

Later we will fix the deployed code to make it work with the XIAO-ESP32-S3-Sense. For the type of deployment we select “Arduino library” and “TensorFlow Lite”:

Next click on the build button to build and deploy the model:

It will download the model as a ZIP file, in my case: “ei-voice-control-arduino-1.0.8.zip”. The name of the ZIP file will depend on the name you chose for the Edge Impulse project (“Voice Control”).

To use the library, create a new, empty Sketch and then install the downloaded library (ei-voice-...zip) as usual in the Arduino IDE via Sketch -> Include Library -> Add .ZIP library.

Connecting LEDs to XIAO-ESP32-S3-Sense

Before we write the code to control the LEDs, let me quickly show you how to connect the three LEDs to the XIAO-ESP32-S3-Sense:

Connecting LEDs to XIAO-ESP32-S3-Sense

The cathode of all three LEDs is connected to ground (GND). The anodes of the LEDs are connected to GPIO pins 1, 2 and 3 via a 220Ω resistor as shown above. You can pick other GPIO pins as well but if you do, make sure to adjust the following code.

Code for Voice Control of LEDs

The library ei-voice-control-arduino-1.0.8.zip we deployed comes with some example code, but we can't use it directly, since the examples are written for the ESP-EYE, which has a different microphone interface than the XIAO-ESP32-S3-Sense.

I therefore took the example code and changed it to work with the XIAO-ESP32-S3-Sense. And I added the code to switch the LEDs.

The following code listens to the microphone, sends the recorded audio signal to the classifier model, retrieves the classification result and shortly flashes the red, yellow or green LED, if a color word was recognized with sufficient confidence.

Have a quick look at the complete code first, and then we discuss its details. Make sure to replace the import #include "Voice_control_inferencing.h", with the name of your deployed library!

#include "Voice_control_inferencing.h"
#include "ESP_I2S.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

typedef struct {
  int16_t *buffer;
  uint8_t buf_ready;
  uint32_t buf_count;
  uint32_t n_samples;
} inference_t;

static inference_t inference;
static const uint32_t sample_buffer_size = 2048;
static signed short sampleBuffer[sample_buffer_size];
static bool debug_nn = false;
static bool record_status = true;

static const int8_t I2S_CLK = 42;
static const int8_t I2S_DIN = 41;
static const uint32_t SAMPLERATE = 16000;

static const int8_t redPin = 1;
static const int8_t yellowPin = 2;
static const int8_t greenPin = 3;

I2SClass I2S;

static void audio_inference_callback(uint32_t n_samples) {
  for (uint32_t i = 0; i < n_samples; i++) {
    inference.buffer[inference.buf_count++] = sampleBuffer[i];

    if (inference.buf_count >= inference.n_samples) {
      inference.buf_count = 0;
      inference.buf_ready = 1;
    }
  }
}

static void capture_samples(void *arg) {
  const float gain = 16.0;
  const uint32_t n_samples_to_read = (uint32_t)arg;

  while (record_status) {
    for (uint32_t i = 0; i < n_samples_to_read; i++) {
      int16_t sample = I2S.read();
      sampleBuffer[i] = (int16_t)constrain(sample * gain, INT16_MIN, INT16_MAX);
    }

    if (record_status) {
      audio_inference_callback(n_samples_to_read);
    }
  }
  vTaskDelete(NULL);
}

static bool microphone_inference_start(uint32_t n_samples) {
  inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));
  if (!inference.buffer) return false;

  inference.buf_count = 0;
  inference.n_samples = n_samples;
  inference.buf_ready = 0;

  I2S.setPinsPdmRx(I2S_CLK, I2S_DIN);
  if (!I2S.begin(I2S_MODE_PDM_RX, SAMPLERATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    ei_printf("Can't find microphone!\r\n");
    return false;
  }

  record_status = true;
  xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void *)sample_buffer_size, 10, NULL);
  return true;
}

static bool microphone_inference_record(void) {
  while (inference.buf_ready == 0) delay(10);
  inference.buf_ready = 0;
  return true;
}

static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr) {
  numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);
  return 0;
}

static void microphone_inference_end(void) {
  I2S.end();
  free(inference.buffer);
}

void initLEDs() {
  pinMode(redPin, OUTPUT);
  pinMode(yellowPin, OUTPUT);
  pinMode(greenPin, OUTPUT);
}

void switchOffLEDs() {
  digitalWrite(redPin, LOW);
  digitalWrite(yellowPin, LOW);
  digitalWrite(greenPin, LOW);
}

void flashLED(const char* label) {
  switchOffLEDs();
  if(!strcmp("red", label)) {
    digitalWrite(redPin, HIGH);
  }
  if(!strcmp("yellow", label)) {
    digitalWrite(yellowPin, HIGH);
  }
  if(!strcmp("green", label)) {
    digitalWrite(greenPin, HIGH);
  }  
  delay(500);  
  switchOffLEDs();
}

void setup() {
  Serial.begin(115200);
  
  initLEDs();

  if (!microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT)) {
    ei_printf("ERR: Could not allocate audio buffer\r\n");
    return;
  }
  ei_printf("Listening...\n");
}

void loop() {
  if (!microphone_inference_record()) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
  }

  signal_t signal;
  signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
  signal.get_data = &microphone_audio_signal_get_data;

  ei_impulse_result_t result = { 0 };
  EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);

  if (r != EI_IMPULSE_OK) {
    ei_printf("ERR: Failed to run classifier (%d)\n", r);
    return;
  }

  float max_val = -1.0;
  int max_idx = -1;
  auto cl = result.classification;
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (cl[ix].value > max_val && strcmp("noise", cl[ix].label)) {
      max_val = cl[ix].value;
      max_idx = ix;
    }
  }

  if (max_idx >= 0 && max_val > 0.3) {
    ei_printf("Predicted label: %s (%.3f)\n", cl[max_idx].label, max_val);
    flashLED(cl[max_idx].label);
  }
}

Imports

The first lines bring in the libraries we need. Each one plays an important role in voice recognition and hardware control.

#include "Voice_control_inferencing.h"
#include "ESP_I2S.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

The Voice_control_inferencing.h file is generated by Edge Impulse and contains the trained neural network. As mentioned before, this filename depends on the name of your Edge Impulse project and the deployed library, so you may have to adjust it.

The ESP_I2S.h library provides access to the I2S microphone interface, and the FreeRTOS headers allow us to run the audio capture as a background task.

Constants and Buffers

Next we define structures and constants to manage audio recording and classification.

typedef struct {
  int16_t *buffer;
  uint8_t buf_ready;
  uint32_t buf_count;
  uint32_t n_samples;
} inference_t;

static inference_t inference;
static const uint32_t sample_buffer_size = 2048;
static signed short sampleBuffer[sample_buffer_size];
static bool debug_nn = false;
static bool record_status = true;

The inference_t structure holds the audio buffer and keeps track of how many samples we have. We allocate sampleBuffer as temporary storage for microphone input, while the flag record_status controls whether we keep recording.

Microphone Pins and Sampling Rate

We then define which ESP32 pins connect to the microphone and how fast we sample.

static const int8_t I2S_CLK = 42;
static const int8_t I2S_DIN = 41;
static const uint32_t SAMPLERATE = 16000;

The microphone connects through I2S. Pin 42 provides the clock signal, and pin 41 is the data input. We sample at 16 kHz, which is perfect for voice recognition.

LED Pins

We assign three pins for visual feedback.

static const int8_t redPin = 1;
static const int8_t yellowPin = 2;
static const int8_t greenPin = 3;

Each pin drives an LED. They will blink when the classifier detects the corresponding color word.

Audio Inference Callback

This function moves samples from the temporary buffer into the main inference buffer.

static void audio_inference_callback(uint32_t n_samples) {
  for (uint32_t i = 0; i < n_samples; i++) {
    inference.buffer[inference.buf_count++] = sampleBuffer[i];

    if (inference.buf_count >= inference.n_samples) {
      inference.buf_count = 0;
      inference.buf_ready = 1;
    }
  }
}

Once the buffer fills up, we mark it as ready so the classifier can use it.

Capturing Microphone Samples

Instead of recording audio in the main loop, we run this task in the background.

static void capture_samples(void *arg) {
  const float gain = 16.0;
  const uint32_t n_samples_to_read = (uint32_t)arg;

  while (record_status) {
    for (uint32_t i = 0; i < n_samples_to_read; i++) {
      int16_t sample = I2S.read();
      sampleBuffer[i] = (int16_t)constrain(sample * gain, INT16_MIN, INT16_MAX);
    }
    if (record_status) {
      audio_inference_callback(n_samples_to_read);
    }
  }
  vTaskDelete(NULL);
}

We continuously read samples from the microphone using I2S.read(). Each sample gets amplified with a gain factor before being stored. When enough samples are collected, we pass them on for inference.

Starting Microphone Inference

This function initializes the microphone and starts the capture task.

static bool microphone_inference_start(uint32_t n_samples) {
  inference.buffer = (int16_t *)malloc(n_samples * sizeof(int16_t));
  if (!inference.buffer) return false;

  inference.buf_count = 0;
  inference.n_samples = n_samples;
  inference.buf_ready = 0;

  I2S.setPinsPdmRx(I2S_CLK, I2S_DIN);
  if (!I2S.begin(I2S_MODE_PDM_RX, SAMPLERATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    ei_printf("Can't find microphone!\r\n");
    return false;
  }

  record_status = true;
  xTaskCreate(capture_samples, "CaptureSamples", 1024 * 32, (void *)sample_buffer_size, 10, NULL);
  return true;
}

We allocate memory for the inference buffer and configure I2S for PDM microphones. If everything works, a FreeRTOS task begins capturing samples in the background.
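To get a feel for the size of that malloc: at 16 kHz, mono, 16-bit, a one-second inference window holds 16000 samples, i.e. roughly 31 KB. The exact sample count comes from EI_CLASSIFIER_RAW_SAMPLE_COUNT in your deployed library; the numbers below assume a 1-second window, as used in this project:

```cpp
#include <cstdint>
#include <cstddef>

// Back-of-the-envelope memory budget for the audio buffers.
// (EI_CLASSIFIER_RAW_SAMPLE_COUNT would be 16000 for a 1 s window
// at 16 kHz; the real value comes from the deployed library.)
constexpr uint32_t kSampleRate     = 16000;  // Hz
constexpr uint32_t kWindowSeconds  = 1;
constexpr uint32_t kWindowSamples  = kSampleRate * kWindowSeconds;
constexpr size_t   kInferenceBytes = kWindowSamples * sizeof(int16_t);
constexpr size_t   kCaptureBytes   = 2048 * sizeof(int16_t);  // sampleBuffer

static_assert(kWindowSamples == 16000, "1 s of audio at 16 kHz");
static_assert(kInferenceBytes == 32000, "about 31 KB for the inference buffer");
```

This fits comfortably in the ESP32-S3's RAM, but it is worth keeping in mind if you ever enlarge the model's input window.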

Waiting for New Audio

This function waits until the buffer is ready with new data.

static bool microphone_inference_record(void) {
  while (inference.buf_ready == 0) delay(10);
  inference.buf_ready = 0;
  return true;
}

The code pauses until enough audio is collected for classification.

Converting Audio Data

The Edge Impulse classifier expects floating-point audio samples. This function converts raw 16-bit integers into floats.

static int microphone_audio_signal_get_data(size_t offset, size_t length, float *out_ptr) {
  numpy::int16_to_float(&inference.buffer[offset], out_ptr, length);
  return 0;
}
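Conceptually, this conversion is just a rescaling of the int16 range into floats in [-1.0, 1.0). A host-side equivalent might look like the following (the exact scaling convention used by numpy::int16_to_float is my assumption here, not taken from the SDK source):

```cpp
#include <cstdint>
#include <cstddef>

// Convert raw 16-bit samples to floats in roughly [-1.0, 1.0)
// by dividing by 32768 (the magnitude of INT16_MIN).
void int16ToFloat(const int16_t* in, float* out, size_t n) {
  for (size_t i = 0; i < n; i++) {
    out[i] = static_cast<float>(in[i]) / 32768.0f;
  }
}
```

Keeping the samples as int16 until this point saves half the memory compared to storing floats throughout.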

Stopping the Microphone

When we no longer need audio input, we stop I2S and free memory.

static void microphone_inference_end(void) {
  I2S.end();
  free(inference.buffer);
}

LED Control Functions

Next we prepare helper functions to control the LEDs.

void initLEDs() {
  pinMode(redPin, OUTPUT);
  pinMode(yellowPin, OUTPUT);
  pinMode(greenPin, OUTPUT);
}

void switchOffLEDs() {
  digitalWrite(redPin, LOW);
  digitalWrite(yellowPin, LOW);
  digitalWrite(greenPin, LOW);
}

void flashLED(const char* label) {
  switchOffLEDs();
  if(!strcmp("red", label)) {
    digitalWrite(redPin, HIGH);
  }
  if(!strcmp("yellow", label)) {
    digitalWrite(yellowPin, HIGH);
  }
  if(!strcmp("green", label)) {
    digitalWrite(greenPin, HIGH);
  }  
  delay(500);  
  switchOffLEDs();
}

initLEDs() configures the pins as outputs. switchOffLEDs() ensures no LED stays on. flashLED() turns on the LED that matches the recognized label and then switches it off again after half a second.

Setup

In the setup function, we initialize the serial monitor, LEDs, and microphone.

void setup() {
  Serial.begin(115200);
  
  initLEDs();

  if (!microphone_inference_start(EI_CLASSIFIER_RAW_SAMPLE_COUNT)) {
    ei_printf("ERR: Could not allocate audio buffer\r\n");
    return;
  }
  ei_printf("Listening...\n");
}

If the microphone starts successfully, we are ready to capture voice input.

Loop

The loop runs continuously to process audio and classify spoken commands.

void loop() {
  if (!microphone_inference_record()) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
  }

  signal_t signal;
  signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
  signal.get_data = &microphone_audio_signal_get_data;

  ei_impulse_result_t result = { 0 };
  EI_IMPULSE_ERROR r = run_classifier(&signal, &result, debug_nn);

  if (r != EI_IMPULSE_OK) {
    ei_printf("ERR: Failed to run classifier (%d)\n", r);
    return;
  }

  float max_val = -1.0;
  int max_idx = -1;
  auto cl = result.classification;
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (cl[ix].value > max_val && strcmp("noise", cl[ix].label)) {
      max_val = cl[ix].value;
      max_idx = ix;
    }
  }

  if (max_idx >= 0 && max_val > 0.3) {
    ei_printf("Predicted label: %s (%.3f)\n", cl[max_idx].label, max_val);
    flashLED(cl[max_idx].label);
  }
}

We wait for a new audio buffer, prepare a signal object, and run the classifier. The classifier outputs labels with probabilities. We ignore the label "noise" and choose the class with the highest confidence. If the probability is greater than 0.3, we flash the LED that matches the prediction.
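The label-selection step can be tested in isolation. Below is a portable sketch of the same argmax-with-exclusion logic (the Prediction struct and the best_label name are ours; on the device the entries come from result.classification):

```cpp
#include <cstring>
#include <cstddef>

struct Prediction { const char *label; float value; };

// Return the index of the highest-scoring class while skipping "noise";
// returns -1 if only "noise" is present.
static int best_label(const Prediction *cl, size_t count) {
  float max_val = -1.0f;
  int max_idx = -1;
  for (size_t ix = 0; ix < count; ix++) {
    // strcmp returns 0 on a match, so nonzero means "not noise"
    if (cl[ix].value > max_val && strcmp("noise", cl[ix].label) != 0) {
      max_val = cl[ix].value;
      max_idx = static_cast<int>(ix);
    }
  }
  return max_idx;
}
```

Note that "noise" is excluded before the maximum is taken, so a loud but unrecognized sound never wins, even if "noise" has the highest raw probability.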

This code works well, but it reacts a bit slowly to voice commands, since we have to wait until a full inference window is filled. The following streaming approach is faster, but you will notice that the classification accuracy is slightly lower.

Streaming Code for Voice Control

The continuous classifier in the following code is designed to handle sliding windows of audio. It processes overlapping slices of input, so we don’t need to stop and restart inference each time. This makes the system respond faster to keywords (but the accuracy suffers a bit). See the following demo:

Have a quick look at the complete code and then we dive into the differences between this and the previous code.

#define EIDSP_QUANTIZE_FILTERBANK 0

#include "Voice_control_inferencing.h"
#include "ESP_I2S.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

I2SClass I2S;

static const uint32_t slice_size = EI_CLASSIFIER_SLICE_SIZE;
static signed short sampleBuffer[slice_size];
static bool debug_nn = false;
static int slice_ctn = 0;
static bool record_status = true;

static const int8_t I2S_CLK = 42;
static const int8_t I2S_DIN = 41;
static const uint32_t SAMPLERATE = 16000;

static const int8_t redPin = 1;
static const int8_t yellowPin = 2;
static const int8_t greenPin = 3;


static void capture_samples(void* arg) {
  const float gain = 16.0;
  while (record_status) {
    for (uint32_t i = 0; i < slice_size; i++) {
      int16_t sample = I2S.read();
      sampleBuffer[i] = (int16_t)constrain(sample * gain, INT16_MIN, INT16_MAX);
    }
    vTaskDelay(1);  // give some breathing room for the scheduler
  }
  vTaskDelete(NULL);
}

static bool microphone_inference_start(uint32_t n_samples) {
  I2S.setPinsPdmRx(I2S_CLK, I2S_DIN);
  if (!I2S.begin(I2S_MODE_PDM_RX, SAMPLERATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    ei_printf("Can't find microphone!\r\n");
    return false;
  }

  record_status = true;
  xTaskCreate(capture_samples, "CaptureSamples", 1024 * 16, NULL, 10, NULL);

  return true;
}

static bool microphone_inference_record(void) {
  delay(1);
  return true;
}

static int microphone_audio_signal_get_data(size_t offset, size_t length, float* out_ptr) {
  numpy::int16_to_float(&sampleBuffer[offset], out_ptr, length);
  return 0;
}

static void microphone_inference_end(void) {
  I2S.end();
}

void initLEDs() {
  pinMode(redPin, OUTPUT);
  pinMode(yellowPin, OUTPUT);
  pinMode(greenPin, OUTPUT);
}

void switchOffLEDs() {
  digitalWrite(redPin, LOW);
  digitalWrite(yellowPin, LOW);
  digitalWrite(greenPin, LOW);
}

void flashLED(const char* label) {
  switchOffLEDs();
  if (!strcmp("red", label)) {
    digitalWrite(redPin, HIGH);
  }
  if (!strcmp("yellow", label)) {
    digitalWrite(yellowPin, HIGH);
  }
  if (!strcmp("green", label)) {
    digitalWrite(greenPin, HIGH);
  }
  delay(500);
  switchOffLEDs();
}

void setup() {
  Serial.begin(115200);

  initLEDs();

  run_classifier_init();
  ei_sleep(1000);

  if (!microphone_inference_start(slice_size)) {
    ei_printf("ERR: Could not start microphone\r\n");
    return;
  }

  ei_printf("Listening continuously...\n");
}

void loop() {
  if (!microphone_inference_record()) {
    ei_printf("ERR: Failed to record audio...\n");
    return;
  }

  signal_t signal;
  signal.total_length = slice_size;
  signal.get_data = &microphone_audio_signal_get_data;

  ei_impulse_result_t result = { 0 };
  EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);

  if (r != EI_IMPULSE_OK) {
    ei_printf("ERR: Failed to run classifier (%d)\n", r);
    return;
  }

  float max_val = -1.0;
  int max_idx = -1;
  auto cl = result.classification;
  for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
    if (cl[ix].value > max_val && strcmp("noise", cl[ix].label)) {
      max_val = cl[ix].value;
      max_idx = ix;
    }
  }

  slice_ctn = max(slice_ctn - 1, 0);
  if (max_idx >= 0 && max_val > 0.5 && slice_ctn <= 0) {
    ei_printf("Predicted label: %s (%.3f)\n", cl[max_idx].label, max_val);
    flashLED(cl[max_idx].label);
    slice_ctn = EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW;
  }
}

Preprocessor Directive

Right at the top, this code introduces a new definition:

#define EIDSP_QUANTIZE_FILTERBANK 0

This disables filterbank quantization inside the Edge Impulse DSP pipeline. The model uses full precision instead of compressed values. The first version did not include this line, meaning it kept the default quantization.

Imports

The imports are the same as before: Edge Impulse inference code, I2S support, and FreeRTOS for multitasking. As before, the name of the Voice_control_inferencing.h header depends on the name of your Edge Impulse project and the deployed library, so you may need to adjust it.

#include "Voice_control_inferencing.h"
#include "ESP_I2S.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

Buffers and Variables

Instead of creating a large inference buffer, this version works with slices of audio.

static const uint32_t slice_size = EI_CLASSIFIER_SLICE_SIZE;
static signed short sampleBuffer[slice_size];
static int slice_ctn = 0;

In the previous code, the buffer held an entire sample window (EI_CLASSIFIER_RAW_SAMPLE_COUNT). Here, only one slice is processed at a time. The variable slice_ctn helps throttle predictions so they don’t trigger too frequently.
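In the Edge Impulse SDK, the slice size is derived from the model metadata: EI_CLASSIFIER_SLICE_SIZE is EI_CLASSIFIER_RAW_SAMPLE_COUNT divided by EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW. A sketch with assumed example values (a one-second window at 16 kHz split into four slices; your model's metadata may differ) shows how much less audio each inference call has to wait for:

```cpp
#include <cstdint>
#include <cassert>

// Example values only; the real ones come from the deployed model's metadata.
constexpr uint32_t raw_sample_count  = 16000;  // EI_CLASSIFIER_RAW_SAMPLE_COUNT (1 s at 16 kHz)
constexpr uint32_t slices_per_window = 4;      // EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW
constexpr uint32_t slice_size = raw_sample_count / slices_per_window;  // EI_CLASSIFIER_SLICE_SIZE

// Each call to run_classifier_continuous() then only needs 4000 new
// samples (250 ms of audio) instead of a full 16000-sample second.
```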

Audio Capture Task

The capture loop looks familiar but has one big change:

static void capture_samples(void* arg) {
  const float gain = 16.0;
  while (record_status) {
    for (uint32_t i = 0; i < slice_size; i++) {
      int16_t sample = I2S.read();
      sampleBuffer[i] = (int16_t)constrain(sample * gain, INT16_MIN, INT16_MAX);
    }
    vTaskDelay(1);  // give some breathing room for the scheduler
  }
  vTaskDelete(NULL);
}

Here we continuously fill just one slice instead of a full window. The vTaskDelay(1) call lets FreeRTOS schedule other tasks more smoothly. The first version fed samples into an inference buffer with callbacks, while this one just updates the slice buffer.
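The gain multiplication can push a sample outside the 16-bit range, which is exactly why the constrain() call clamps the result. The same saturation step as a portable sketch (apply_gain is our name):

```cpp
#include <cstdint>

// Amplify a sample and saturate it to the signed 16-bit range,
// mirroring the constrain() call in capture_samples().
static int16_t apply_gain(int16_t sample, float gain) {
  float v = sample * gain;
  if (v > INT16_MAX) v = INT16_MAX;
  if (v < INT16_MIN) v = INT16_MIN;
  return static_cast<int16_t>(v);
}
```

Without the clamp, a loud sample like 30000 multiplied by 16 would wrap around when cast back to int16_t and produce a badly distorted value.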

Microphone Setup

Both versions configure the microphone with I2S, but this one is simpler.

static bool microphone_inference_start(uint32_t n_samples) {
  I2S.setPinsPdmRx(I2S_CLK, I2S_DIN);
  if (!I2S.begin(I2S_MODE_PDM_RX, SAMPLERATE, I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO)) {
    ei_printf("Can't find microphone!\r\n");
    return false;
  }

  record_status = true;
  xTaskCreate(capture_samples, "CaptureSamples", 1024 * 16, NULL, 10, NULL);

  return true;
}

Notice there’s no dynamic memory allocation (malloc) and no inference buffer setup. This reduces complexity and avoids memory fragmentation.

Recording Audio

The recording function is now a placeholder:

static bool microphone_inference_record(void) {
  delay(1);
  return true;
}

In the first version, this function waited until the buffer was ready. Here, we don’t need that. The classification works slice by slice.

Classifier Input

The function to convert raw samples into floating point values is the same in both versions:

static int microphone_audio_signal_get_data(size_t offset, size_t length, float* out_ptr) {
  numpy::int16_to_float(&sampleBuffer[offset], out_ptr, length);
  return 0;
}

LEDs

The LED functions are identical: light up the recognized label’s LED and turn it off after half a second.

void flashLED(const char* label) { ... }

Setup Function

The setup function has a few new steps:

run_classifier_init();
ei_sleep(1000);

The classifier gets initialized once before use, which is required for continuous classification. The earlier version did not need this because it only ran single-shot inferences.

The microphone also starts with slice_size instead of the full raw sample count, enabling continuous streaming.

Loop Function

This is where the biggest difference lies. Instead of using run_classifier(), this version calls:

EI_IMPULSE_ERROR r = run_classifier_continuous(&signal, &result, debug_nn);

The continuous classifier handles sliding windows of audio. It processes overlapping slices of input, so you don’t need to stop and restart inference each time. This makes the system respond faster to keywords.

Another key change is how predictions are throttled:

slice_ctn = max(slice_ctn - 1, 0);
if (max_idx >= 0 && max_val > 0.5 && slice_ctn <= 0) {
  ei_printf("Predicted label: %s (%.3f)\n", cl[max_idx].label, max_val);
  flashLED(cl[max_idx].label);
  slice_ctn = EI_CLASSIFIER_SLICES_PER_MODEL_WINDOW;
}

After detecting a label, the program waits for a full model window before allowing another trigger. This prevents the LED from blinking multiple times for the same word. The earlier code had no such mechanism, so it could retrigger immediately.
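This throttling logic is easy to verify off-device. The sketch below (the Cooldown struct is ours) reproduces the counter behaviour: after a trigger, further detections are suppressed for a full window's worth of slices:

```cpp
#include <algorithm>

// Cooldown: after a trigger, ignore detections for `window` slices,
// mirroring the slice_ctn logic in the loop.
struct Cooldown {
  int window;
  int ctn = 0;

  // Returns true if a confident detection is allowed to fire now.
  bool fire(bool confident) {
    ctn = std::max(ctn - 1, 0);  // count down one slice per call
    if (confident && ctn <= 0) {
      ctn = window;              // arm the cooldown
      return true;
    }
    return false;
  }
};
```

With window = 4, a confident detection fires once, the next three slices are ignored even if the same word is still ringing in the buffer, and the fourth slice may trigger again.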

Summary of Differences

The first version uses a buffered, one-shot approach:

  • Collects a full inference window.
  • Runs classification once per buffer.
  • Requires waiting until the buffer is ready.
  • Uses run_classifier().

This second version uses a streaming, continuous approach:

  • Works with small slices of audio.
  • Runs classification continuously with overlap.
  • No waiting for buffer readiness.
  • Uses run_classifier_continuous().
  • Adds a cooldown (slice_ctn) to avoid repeated triggers.

Conclusions and Comments

In this tutorial you learned how to build a voice control application with the XIAO-ESP32-S3-Sense board and the Edge Impulse platform. Edge Impulse offers a lot more than was covered in this tutorial, and I recommend reading the Edge Impulse Documentation.

The toy example in this tutorial allowed you to control three LEDs with your voice. Note that you can build more powerful control applications by using a sequence of words and a state machine. For instance, a wake word like “Jarvis”, followed by a location “Desk”, “Corner” or “Shelf”, followed by a device “Light” or “Fan”, followed by an action “On” or “Off”:

[Jarvis] -> [Desk, Corner, Shelf] -> [Light, Fan] -> [On, Off]

With eight different words you can already control 1 × 3 × 2 × 2 = 12 different actions, such as “Jarvis Desk Light On”.
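The arithmetic can be checked by enumerating the hypothetical grammar directly:

```cpp
#include <string>
#include <vector>

// Enumerate all commands of the hypothetical wake-word grammar:
// [Jarvis] -> location -> device -> action.
static std::vector<std::string> enumerate_commands() {
  const char *locations[] = {"Desk", "Corner", "Shelf"};
  const char *devices[]   = {"Light", "Fan"};
  const char *actions[]   = {"On", "Off"};
  std::vector<std::string> cmds;
  for (auto loc : locations)
    for (auto dev : devices)
      for (auto act : actions)
        cmds.push_back(std::string("Jarvis ") + loc + " " + dev + " " + act);
  return cmds;
}
```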

Apart from home automation, there are many other interesting applications of sound classification. Think about environmental sound classification, vibration or sound anomaly detection, bird song classification, and so on.

Finally, if you want to detect faces or people, have a look at our Face Detection with XIAO ESP32-S3-Sense and SenseCraft AI and Edge AI Room Occupancy Sensor with ESP32 and Person Detection tutorials.

If you have any questions feel free to leave them in the comment section.

Happy Tinkering 😉

Chan Choon Earn

Tuesday 3rd of February 2026

First of all, thank you Stefan for publishing the code and the detailed explanation. I tried out the code in my Xiao ESP32 Sense microcontroller and it works perfect. I added internet (Wifi) connectivity into this code but the program crashed (some kind of memory error) after it runs for half a minute or so. The Xiao microcontroller repeatedly resets itself. Is there a way to add internet connectivity into this code or I need to use a separate ESP to do the IoT job? Please help.

Stefan Maetschke

Wednesday 4th of February 2026

Memory bugs are hard to find/fix. A separate ESP32 for WiFi/communication is definitely an option.