Skip to content

Latest commit

 

History

History
530 lines (395 loc) · 21.1 KB

wio-terminal-speech-to-text.md

File metadata and controls

530 lines (395 loc) · 21.1 KB

Speech to text - Wio Terminal

In this part of the lesson, you will write code to convert speech in the captured audio to text using the speech service.

Send the audio to the speech service

The audio can be sent to the speech service using the REST API. To use the speech service, first you need to request an access token, then use that token to access the REST API. These access tokens expire after 10 minutes, so your code should request them on a regular basis to ensure they are always up to date.

Task - get an access token

  1. Open the smart-timer project if it's not already open.

  2. Add the following library dependencies to the platformio.ini file to access WiFi and handle JSON:

    seeed-studio/Seeed Arduino rpcWiFi @ 1.0.5
    seeed-studio/Seeed Arduino rpcUnified @ 2.1.3
    seeed-studio/Seeed_Arduino_mbedtls @ 3.0.1
    seeed-studio/Seeed Arduino RTC @ 2.0.0
    bblanchon/ArduinoJson @ 6.17.3
  3. Add the following code to the config.h header file:

    const char *SSID = "<SSID>";
    const char *PASSWORD = "<PASSWORD>";
    
    const char *SPEECH_API_KEY = "<API_KEY>";
    const char *SPEECH_LOCATION = "<LOCATION>";
    const char *LANGUAGE = "<LANGUAGE>";
    
    const char *TOKEN_URL = "https://%s.api.cognitive.microsoft.com/sts/v1.0/issuetoken";

    Replace <SSID> and <PASSWORD> with the relevant values for your WiFi.

    Replace <API_KEY> with the API key for your speech service resource. Replace <LOCATION> with the location you used when you created the speech service resource.

    Replace <LANGUAGE> with the locale name for language you will be speaking in, for example en-GB for English, or zn-HK for Cantonese. You can find a list of the supported languages and their locale names in the Language and voice support documentation on Microsoft docs.

    The TOKEN_URL constant is the URL of the token issuer without the location. This will be combined with the location later to get the full URL.

  4. Just like connecting to Custom Vision, you will need to use an HTTPS connection to connect to the token issuing service. To the end of config.h, add the following code:

    const char *TOKEN_CERTIFICATE =
        "-----BEGIN CERTIFICATE-----\r\n"
        "MIIF8zCCBNugAwIBAgIQAueRcfuAIek/4tmDg0xQwDANBgkqhkiG9w0BAQwFADBh\r\n"
        "MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n"
        "d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n"
        "MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n"
        "MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n"
        "c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwNjCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n"
        "ggIPADCCAgoCggIBALVGARl56bx3KBUSGuPc4H5uoNFkFH4e7pvTCxRi4j/+z+Xb\r\n"
        "wjEz+5CipDOqjx9/jWjskL5dk7PaQkzItidsAAnDCW1leZBOIi68Lff1bjTeZgMY\r\n"
        "iwdRd3Y39b/lcGpiuP2d23W95YHkMMT8IlWosYIX0f4kYb62rphyfnAjYb/4Od99\r\n"
        "ThnhlAxGtfvSbXcBVIKCYfZgqRvV+5lReUnd1aNjRYVzPOoifgSx2fRyy1+pO1Uz\r\n"
        "aMMNnIOE71bVYW0A1hr19w7kOb0KkJXoALTDDj1ukUEDqQuBfBxReL5mXiu1O7WG\r\n"
        "0vltg0VZ/SZzctBsdBlx1BkmWYBW261KZgBivrql5ELTKKd8qgtHcLQA5fl6JB0Q\r\n"
        "gs5XDaWehN86Gps5JW8ArjGtjcWAIP+X8CQaWfaCnuRm6Bk/03PQWhgdi84qwA0s\r\n"
        "sRfFJwHUPTNSnE8EiGVk2frt0u8PG1pwSQsFuNJfcYIHEv1vOzP7uEOuDydsmCjh\r\n"
        "lxuoK2n5/2aVR3BMTu+p4+gl8alXoBycyLmj3J/PUgqD8SL5fTCUegGsdia/Sa60\r\n"
        "N2oV7vQ17wjMN+LXa2rjj/b4ZlZgXVojDmAjDwIRdDUujQu0RVsJqFLMzSIHpp2C\r\n"
        "Zp7mIoLrySay2YYBu7SiNwL95X6He2kS8eefBBHjzwW/9FxGqry57i71c2cDAgMB\r\n"
        "AAGjggGtMIIBqTAdBgNVHQ4EFgQU1cFnOsKjnfR3UltZEjgp5lVou6UwHwYDVR0j\r\n"
        "BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n"
        "JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n"
        "CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n"
        "Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n"
        "aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n"
        "cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n"
        "MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n"
        "cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n"
        "AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQB2oWc93fB8esci/8esixj++N22meiGDjgF\r\n"
        "+rA2LUK5IOQOgcUSTGKSqF9lYfAxPjrqPjDCUPHCURv+26ad5P/BYtXtbmtxJWu+\r\n"
        "cS5BhMDPPeG3oPZwXRHBJFAkY4O4AF7RIAAUW6EzDflUoDHKv83zOiPfYGcpHc9s\r\n"
        "kxAInCedk7QSgXvMARjjOqdakor21DTmNIUotxo8kHv5hwRlGhBJwps6fEVi1Bt0\r\n"
        "trpM/3wYxlr473WSPUFZPgP1j519kLpWOJ8z09wxay+Br29irPcBYv0GMXlHqThy\r\n"
        "8y4m/HyTQeI2IMvMrQnwqPpY+rLIXyviI2vLoI+4xKE4Rn38ZZ8m\r\n"
        "-----END CERTIFICATE-----\r\n";

    This is the same certificate you used when connecting to Custom Vision.

  5. Add an include for the WiFi header file and the config header file to the top of the main.cpp file:

    #include <rpcWiFi.h>
    
    #include "config.h"
  6. Add code to connect to WiFi in main.cpp above the setup function:

    void connectWiFi()
    {
        while (WiFi.status() != WL_CONNECTED)
        {
            Serial.println("Connecting to WiFi..");
            WiFi.begin(SSID, PASSWORD);
            delay(500);
        }
    
        Serial.println("Connected!");
    }
  7. Call this function from the setup function after the serial connection has been established:

    connectWiFi();
  8. Create a new header file in the src folder called speech_to_text.h. In this header file, add the following code:

    #pragma once
    
    #include <Arduino.h>
    #include <ArduinoJson.h>
    #include <HTTPClient.h>
    #include <WiFiClientSecure.h>
    
    #include "config.h"
    #include "mic.h"
    
    class SpeechToText
    {
    public:
    
    private:
    
    };
    
    SpeechToText speechToText;

    This includes some necessary header files for an HTTP connection, configuration and the mic.h header file, and defines a class called SpeechToText, before declaring an instance of that class that can be used later.

  9. Add the following 2 fields to the private section of this class:

    WiFiClientSecure _token_client;
    String _access_token;

    The _token_client is a WiFi Client that uses HTTPS and will be used to get the access token. This token will then be stored in _access_token.

  10. Add the following method to the private section:

    String getAccessToken()
    {
        char url[128];
        sprintf(url, TOKEN_URL, SPEECH_LOCATION);
    
        HTTPClient httpClient;
        httpClient.begin(_token_client, url);
    
        httpClient.addHeader("Ocp-Apim-Subscription-Key", SPEECH_API_KEY);
        int httpResultCode = httpClient.POST("{}");
    
        if (httpResultCode != 200)
        {
            Serial.println("Error getting access token, trying again...");
            delay(10000);
            return getAccessToken();
        }
    
        Serial.println("Got access token.");
        String result = httpClient.getString();
    
        httpClient.end();
    
        return result;
    }

    This code builds the URL for the token issuer API using the location of the speech resource. It then creates an HTTPClient to make the web request, setting it up to use the WiFi client configured with the token endpoints certificate. It sets the API key as a header for the call. It then makes a POST request to get the certificate, retrying if it gets any errors. Finally the access token is returned.

  11. To the public section, add a method to get the access token. This will be needed in later lessons to convert text to speech.

    String AccessToken()
    {
        return _access_token;
    }
  12. To the public section, add an init method that sets up the token client:

    void init()
    {
        _token_client.setCACert(TOKEN_CERTIFICATE);
        _access_token = getAccessToken();
    }

    This sets the certificate on the WiFi client, then gets the access token.

  13. In main.cpp, add this new header file to the include directives:

    #include "speech_to_text.h"
  14. Initialize the SpeechToText class at the end of the setup function, after the mic.init call but before Ready is written to the serial monitor:

    speechToText.init();

Task - read audio from flash memory

  1. In an earlier part of this lesson, the audio was recorded to the flash memory. This audio will need to be sent to the Speech Services REST API, so it needs to be read from the flash memory. It can't be loaded into an in-memory buffer as it would be too large. The HTTPClient class that makes REST calls can stream data using an Arduino Stream - a class that can load data in small chunks, sending the chunks one at a time as part of the request. Every time you call read on a stream it returns the next block of data. An Arduino stream can be created that can read from the flash memory. Create a new file called flash_stream.h in the src folder, and add the following code to it:

    #pragma once
    
    #include <Arduino.h>
    #include <HTTPClient.h>
    #include <sfud.h>
    
    #include "config.h"
    
    class FlashStream : public Stream
    {
    public:
        virtual size_t write(uint8_t val)
        {    
        }
    
        virtual int available()
        {
        }
    
        virtual int read()
        {
        }
    
        virtual int peek()
        {
        }
    private:
    
    };

    This declares the FlashStream class, deriving from the Arduino Stream class. This is an abstract class - derived classes have to implement a few methods before the class can be instantiated, and these methods are defined in this class.

    ✅ Read more on Arduino Streams in the Arduino Stream documentation

  2. Add the following fields to the private section:

    size_t _pos;
    size_t _flash_address;
    const sfud_flash *_flash;
    
    byte _buffer[HTTP_TCP_BUFFER_SIZE];

    This defines a temporary buffer to store data read from the flash memory, along with fields to store the current position when reading from the buffer, the current address to read from the flash memory, and the flash memory device.

  3. In the private section, add the following method:

    void populateBuffer()
    {
        sfud_read(_flash, _flash_address, HTTP_TCP_BUFFER_SIZE, _buffer);
        _flash_address += HTTP_TCP_BUFFER_SIZE;
        _pos = 0;
    }

    This code reads from the flash memory at the current address and stores the data in a buffer. It then increments the address, so the next call reads the next block of memory. The buffer is sized based on the largest chunk that the HTTPClient will send to the REST API at one time.

    💁 Erasing flash memory has to be done using the grain size, reading on the other hand does not.

  4. In the public section of this class, add a constructor:

    FlashStream()
    {
        _pos = 0;
        _flash_address = 0;
        _flash = sfud_get_device_table() + 0;
        
        populateBuffer();
    }

    This constructor sets up all the fields to start reading from the start of the flash memory block, and loads the first chunk of data into the buffer.

  5. Implement the write method. This stream will only read data, so this can do nothing and return 0:

    virtual size_t write(uint8_t val)
    {
        return 0;
    }
  6. Implement the peek method. This returns the data at the current position without moving the stream along. Calling peek multiple times will always return the same data as long as no data is read from the stream.

    virtual int peek()
    {
        return _buffer[_pos];
    }
  7. Implement the available function. This returns how many bytes can be read from the stream, or -1 if the stream is complete. For this class, the maximum available will be no more than the HTTPClient's chunk size. When this stream is used in the HTTP client it calls this function to see how much data is available, then requests that much data to send to the REST API. We don't want each chunk to be more than the HTTP clients chunk size, so if more than that is available, the chunk size is returned. If less, then what is available is returned. Once all the data has been streamed, -1 is returned.

    virtual int available()
    {
        int remaining = BUFFER_SIZE - ((_flash_address - HTTP_TCP_BUFFER_SIZE) + _pos);
        int bytes_available = min(HTTP_TCP_BUFFER_SIZE, remaining);
    
        if (bytes_available == 0)
        {
            bytes_available = -1;
        }
    
        return bytes_available;
    }
  8. Implement the read method to return the next byte from the buffer, incrementing the position. If the position exceeds the size of the buffer, it populates the buffer with the next block from the flash memory and resets the position.

    virtual int read()
    {
        int retVal = _buffer[_pos++];
    
        if (_pos == HTTP_TCP_BUFFER_SIZE)
        {
            populateBuffer();
        }
    
        return retVal;
    }
  9. In the speech_to_text.h header file, add an include directive for this new header file:

    #include "flash_stream.h"

Task - convert the speech to text

  1. The speech can be converted to text by sending the audio to the Speech Service via a REST API. This REST API has a different certificate to the token issuer, so add the following code to the config.h header file to define this certificate:

    const char *SPEECH_CERTIFICATE =
        "-----BEGIN CERTIFICATE-----\r\n"
        "MIIF8zCCBNugAwIBAgIQCq+mxcpjxFFB6jvh98dTFzANBgkqhkiG9w0BAQwFADBh\r\n"
        "MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n"
        "d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n"
        "MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n"
        "MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n"
        "c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwMTCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n"
        "ggIPADCCAgoCggIBAMedcDrkXufP7pxVm1FHLDNA9IjwHaMoaY8arqqZ4Gff4xyr\r\n"
        "RygnavXL7g12MPAx8Q6Dd9hfBzrfWxkF0Br2wIvlvkzW01naNVSkHp+OS3hL3W6n\r\n"
        "l/jYvZnVeJXjtsKYcXIf/6WtspcF5awlQ9LZJcjwaH7KoZuK+THpXCMtzD8XNVdm\r\n"
        "GW/JI0C/7U/E7evXn9XDio8SYkGSM63aLO5BtLCv092+1d4GGBSQYolRq+7Pd1kR\r\n"
        "EkWBPm0ywZ2Vb8GIS5DLrjelEkBnKCyy3B0yQud9dpVsiUeE7F5sY8Me96WVxQcb\r\n"
        "OyYdEY/j/9UpDlOG+vA+YgOvBhkKEjiqygVpP8EZoMMijephzg43b5Qi9r5UrvYo\r\n"
        "o19oR/8pf4HJNDPF0/FJwFVMW8PmCBLGstin3NE1+NeWTkGt0TzpHjgKyfaDP2tO\r\n"
        "4bCk1G7pP2kDFT7SYfc8xbgCkFQ2UCEXsaH/f5YmpLn4YPiNFCeeIida7xnfTvc4\r\n"
        "7IxyVccHHq1FzGygOqemrxEETKh8hvDR6eBdrBwmCHVgZrnAqnn93JtGyPLi6+cj\r\n"
        "WGVGtMZHwzVvX1HvSFG771sskcEjJxiQNQDQRWHEh3NxvNb7kFlAXnVdRkkvhjpR\r\n"
        "GchFhTAzqmwltdWhWDEyCMKC2x/mSZvZtlZGY+g37Y72qHzidwtyW7rBetZJAgMB\r\n"
        "AAGjggGtMIIBqTAdBgNVHQ4EFgQUDyBd16FXlduSzyvQx8J3BM5ygHYwHwYDVR0j\r\n"
        "BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n"
        "JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n"
        "CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n"
        "Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n"
        "aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n"
        "cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n"
        "MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n"
        "cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n"
        "AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQAlFvNh7QgXVLAZSsNR2XRmIn9iS8OHFCBA\r\n"
        "WxKJoi8YYQafpMTkMqeuzoL3HWb1pYEipsDkhiMnrpfeYZEA7Lz7yqEEtfgHcEBs\r\n"
        "K9KcStQGGZRfmWU07hPXHnFz+5gTXqzCE2PBMlRgVUYJiA25mJPXfB00gDvGhtYa\r\n"
        "+mENwM9Bq1B9YYLyLjRtUz8cyGsdyTIG/bBM/Q9jcV8JGqMU/UjAdh1pFyTnnHEl\r\n"
        "Y59Npi7F87ZqYYJEHJM2LGD+le8VsHjgeWX2CJQko7klXvcizuZvUEDTjHaQcs2J\r\n"
        "+kPgfyMIOY1DMJ21NxOJ2xPRC/wAh/hzSBRVtoAnyuxtkZ4VjIOh\r\n"
        "-----END CERTIFICATE-----\r\n";
  2. Add a constant to this file for the speech URL without the location. This will be combined with the location and language later to get the full URL.

    const char *SPEECH_URL = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s";
  3. In the speech_to_text.h header file, in the private section of the SpeechToText class, define a field for a WiFi Client using the speech certificate:

    WiFiClientSecure _speech_client;
  4. In the init method, set the certificate on this WiFi Client:

    _speech_client.setCACert(SPEECH_CERTIFICATE);
  5. Add the following code to the public section of the SpeechToText class to define a method to convert speech to text:

    String convertSpeechToText()
    {
        
    }
  6. Add the following code to this method to create an HTTP client using the WiFi client configured with the speech certificate, and using the speech URL set with the location and language:

    char url[128];
    sprintf(url, SPEECH_URL, SPEECH_LOCATION, LANGUAGE);
    
    HTTPClient httpClient;
    httpClient.begin(_speech_client, url);
  7. Some headers need to be set on the connection:

    httpClient.addHeader("Authorization", String("Bearer ") + _access_token);
    httpClient.addHeader("Content-Type", String("audio/wav; codecs=audio/pcm; samplerate=") + String(RATE));
    httpClient.addHeader("Accept", "application/json;text/xml");

    This sets headers for the authorization using the access token, the audio format using the sample rate, and sets that the client expects the result as JSON.

  8. After this, add the following code to make the REST API call:

    Serial.println("Sending speech...");
    
    FlashStream stream;
    int httpResponseCode = httpClient.sendRequest("POST", &stream, BUFFER_SIZE);
    
    Serial.println("Speech sent!");

    This creates a FlashStream and uses it to stream data to the REST API.

  9. Below this, add the following code:

    String text = "";
    
    if (httpResponseCode == 200)
    {
        String result = httpClient.getString();
        Serial.println(result);
    
        DynamicJsonDocument doc(1024);
        deserializeJson(doc, result.c_str());
    
        JsonObject obj = doc.as<JsonObject>();
        text = obj["DisplayText"].as<String>();
    }
    else if (httpResponseCode == 401)
    {
        Serial.println("Access token expired, trying again with a new token");
        _access_token = getAccessToken();
        return convertSpeechToText();
    }
    else
    {
        Serial.print("Failed to convert text to speech - error ");
        Serial.println(httpResponseCode);
    }

    This code checks the response code.

    If it is 200, the code for success, then the result is retrieved, decoded from JSON, and the DisplayText property is set into the text variable. This is the property that the text version of the speech is returned in.

    If the response code is 401, then the access token has expired (these tokens only last 10 minutes). A new access token is requested, and the call is made again.

    Otherwise, an error is sent to the serial monitor, and the text is left blank.

  10. Add the following code to the end of this method to close the HTTP client and return the text:

    httpClient.end();
    
    return text;
  11. In main.cpp call this new convertSpeechToText method in the processAudio function, then log out the speech to the serial monitor:

    String text = speechToText.convertSpeechToText();
    Serial.println(text);
  12. Build this code, upload it to your Wio Terminal and test it out through the serial monitor. Once you see Ready in the serial monitor, press the C button (the one on the left-hand side, closest to the power switch), and speak. 4 seconds of audio will be captured, then converted to text.

    --- Available filters and text transformations: colorize, debug, default, direct, hexlify, log2file, nocontrol, printable, send_on_enter, time
    --- More details at http://bit.ly/pio-monitor-filters
    --- Miniterm on /dev/cu.usbmodem1101  9600,8,N,1 ---
    --- Quit: Ctrl+C | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
    Connecting to WiFi..
    Connected!
    Got access token.
    Ready.
    Starting recording...
    Finished recording
    Sending speech...
    Speech sent!
    {"RecognitionStatus":"Success","DisplayText":"Set a 2 minute and 27 second timer.","Offset":4700000,"Duration":35300000}
    Set a 2 minute and 27 second timer.
    

💁 You can find this code in the code-speech-to-text/wio-terminal folder.

😀 Your speech to text program was a success!