
    Coupling GPT-3 with speech recognition and synthesis to achieve a fully talking chatbot that runs…

    How I created this web app with which you can talk naturally with GPT-3 about any topic you want, all web-based in your laptop, smartphone or tablet.

    I just created a web page-based chatbot that understands what you say and then replies back quite intelligently. I have a couple of articles showing it in action, and here you have a 49-second preview (check my recent article for more examples).

    You can try (i.e. talk naturally with) this chatbot at:

    https://lucianoabriata.altervista.org/tests/gpt-3/TDSgpt3-speechrec-speechsynth.html

    But the point of this article is to explain how this app works: a nice example of how different libraries, browser APIs, and external services accessible through API calls can work together to build cool applications.

    I mentioned briefly in my previous articles that this web chatbot builds on three great tools available for web browsers. Let’s look at these components in more detail.

    1. Annyang for speech recognition

    (Or you could go straight with the Web Speech API’s SpeechRecognition service)

    Annyang is a small speech recognition JavaScript library that easily allows programmers to incorporate voice commands and voice dictation into their web apps. It has no dependencies, it is very light, and is free to use and modify under the MIT license.

    To achieve speech recognition, Annyang uses the SpeechRecognition service of the Web Speech API -which also includes the SpeechSynthesis service described in point 3 below, to allow the chatbot to speak up.

    The Annyang library runs very well in Chrome on desktop computers, laptops, tablets, and smartphones. It only requires that the web app be hosted under https and that the user allows the use of the microphone (by default this is requested only the first time the library is loaded under a given domain). In my tests Annyang also worked very well on Firefox on a Windows computer.
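
    For comparison, going straight to the SpeechRecognition service without a wrapper could look roughly like the sketch below. It is a minimal illustration, not code from the app: the helper name transcriptFromResultEvent is hypothetical, and the check for a webkit-prefixed constructor is an assumption based on how Chrome exposes the service.

```javascript
// Pull the most recent transcript out of a SpeechRecognition "result" event.
// (Hypothetical helper name, introduced only for this sketch.)
function transcriptFromResultEvent(event) {
  const latest = event.results[event.results.length - 1];
  // Each result holds one or more alternatives; the first one is the best guess.
  return latest[0].transcript.trim();
}

// Browser-only wiring. Outside a supporting browser nothing happens.
const Recognition =
  globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
if (Recognition) {
  const recognition = new Recognition();
  recognition.lang = "en-US";
  recognition.onresult = (event) => {
    console.log("Heard:", transcriptFromResultEvent(event));
  };
  recognition.start();
}
```

    Keeping the transcript extraction in its own function makes it easy to try out with a hand-made event object, without needing a microphone.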

    Annyang is so easy to use that you as a programmer can insert spoken commands with just these lines of code:


    if (annyang) {
      const commands = {
        'hello': () => { alert('Hello world!'); }
      };

      // Add our commands to annyang
      annyang.addCommands(commands);

      // Start listening.
      annyang.start();
    }
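
    A chatbot needs more than fixed commands: it needs whatever the user said as free text. A minimal sketch of how that could be done, assuming Annyang's splat syntax (a command pattern starting with *), which captures the spoken words and passes them to the callback. The handler name handleSpokenText is hypothetical and is not taken from the app's code.

```javascript
// Hypothetical handler: receives whatever the user said as a single string.
function handleSpokenText(text) {
  const cleaned = text.trim();
  console.log("Heard:", cleaned);
  return cleaned;
}

// Only wire things up when Annyang is loaded, i.e. in a supporting browser.
if (typeof annyang !== "undefined" && annyang) {
  // "*phrase" is a splat: it matches any spoken text and hands it to the handler.
  annyang.addCommands({ "*phrase": handleSpokenText });
  annyang.start();
}
```

    In a chatbot, the handler is the place where the recognized text would be appended to the prompt and sent on to GPT-3.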

    2. GPT-3 as the brain that “understands” what you say and produces a likely reply

    GPT-3 is a state-of-the-art language model: a deep neural network that reads in some text and produces new text from a generative statistical model. The output can be an answer to a question asked in the input, a summary of the input text, or an expansion of it; it can also be the text form of a table, or a piece of code written after a few indications in the input, etc.

    For real applications, you probably won’t use GPT-3 exactly as it is but rather adapted to your needs. In particular, you can fine-tune the model with specific input/output pairs, and you can also provide few-shot learning in the form of a prompt that contains a basic layout of the kinds of text inputs and outputs you expect from it.

    I have written several articles about GPT-3 that illustrate the above-mentioned features. Here are the most important ones:

    Of most relevance to this article, note that I use a PHP-based call to OpenAI’s API. I explain how to achieve this in the first article of the three listed above.

    Besides, I rely heavily on few-shot learning when calling GPT-3, to make it “understand” that it has to deal with an exchange of information mimicking two people having a conversation (I’ll show you more later on).
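
    From the browser, a PHP relay like this can be reached with a plain HTTP request. The sketch below shows one way to build that request. The script name and the prompt and apikey parameter names are the ones that appear in the request code further down in this article; the function names are hypothetical, and URL-encoding the values through URLSearchParams is an assumption of this sketch rather than something the app is shown doing.

```javascript
// Build the URL for the PHP relay script, URL-encoding both values so that
// characters such as "?", "&" and ";" in the prompt survive the trip.
function buildRequestUrl(prompt, apiKey, endpoint = "TDSgpt3withyourapikey.php") {
  const params = new URLSearchParams({ prompt: prompt, apikey: apiKey });
  return endpoint + "?" + params.toString();
}

// Send the prompt and return the raw text produced by the relay.
// (Uses fetch; jQuery's $.ajax would work just as well.)
async function requestCompletion(prompt, apiKey) {
  const response = await fetch(buildRequestUrl(prompt, apiKey));
  return response.text();
}
```

    Encoding matters here because the prompt is full of punctuation that would otherwise be read as part of the URL syntax.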

    3. The Speech Synthesis API to speak up the text produced by GPT-3

    The SpeechSynthesis service of the Web Speech API (which also includes the SpeechRecognition service wrapped by Annyang for simpler use) allows you to rather easily synthesize audio from text prompts.

    Instead of using a wrapper (like Annyang for speech recognition), I used the SpeechSynthesis service directly, in “vanilla mode”.

    For example, I can make a web page say “Hello world, speaking up here!” with this simple piece of code:

    var to_speak = new SpeechSynthesisUtterance("Hello world, speaking up here!");
    to_speak.lang = 'en-US';
    window.speechSynthesis.speak(to_speak);

    Done with the components,

    Now it’s time to see how this all really works together to create a natural speech chatbot.

    It might sound quite trivial that putting the above elements together should lead to a chatbot like the one you’ve seen in my videos. However, when you simply put together all three pieces you start getting problems.

    Avoiding that the chatbot ended up “talking with itself”

    In my first attempts, the most important problem was that Annyang kept listening while the speech API was talking, which created an infinite loop where the bot was simply talking with itself. And not in an interesting way, but rather repeating phrases of the form “Yes, I said that” over and over! It was funny at first, but it eventually became a pain to solve.

    Luckily, it turns out that you can switch Annyang on and off. The commands for this were supposed to be pause and resume, but they didn’t really work. And it was not just me: I found some notes online about this issue.

    So, I had to use another method: not pause but actually abort right when a phrase is sent to GPT-3 for analysis; and then start rather than resume when the output produced by GPT-3 was ready to be read.

    But even that wasn’t perfect: sometimes Annyang started listening again slightly before the speech API had finished talking, so it grabbed the last words and tried to reply to them. I solved this very simply with a timeout that waits for 1 second after the end of the speech before activating Annyang:

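    // Remove a line break from the reply before speaking it
    var to_speak = new SpeechSynthesisUtterance(textupdate.replace("\n", ""));
    to_speak.addEventListener('end', (event) => {
      // Pass a function to setTimeout so that Annyang restarts only after the 1-second wait
      setTimeout(() => annyang.start(), 1000);
      console.log("End event done");
    });
    to_speak.lang = 'en-US';
    window.speechSynthesis.speak(to_speak);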
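
    The whole turn-taking logic (abort before sending, restart after the speech has ended plus a delay) can be collected in one small object. This is only a sketch under assumptions: createTurnTaker and its method names are hypothetical, the recognizer can be any object with abort() and start() such as the annyang global, and the scheduling function is passed in so that the logic can be exercised without a real timer.

```javascript
// recognizer: any object with abort() and start(), such as the annyang global.
// schedule: a setTimeout-compatible function (injectable for testing).
function createTurnTaker(recognizer, restartDelayMs = 1000, schedule = setTimeout) {
  return {
    // Call right before the recognized phrase is sent to GPT-3.
    stopListening() {
      recognizer.abort();
    },
    // Call from the utterance's "end" event handler.
    resumeListeningAfterSpeech() {
      // A function is passed to the scheduler; calling recognizer.start()
      // directly here would restart listening immediately, with no delay.
      schedule(() => recognizer.start(), restartDelayMs);
    },
  };
}
```

    In the page this could be used as const turns = createTurnTaker(annyang), with turns.stopListening() called when a phrase is recognized and turns.resumeListeningAfterSpeech() called from the 'end' event listener of the utterance.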

    “Explaining” to GPT-3 that it has to follow a conversation, providing answers to questions or reasonable continuations to what the human said, and remembering context

    The key to this lies in GPT-3’s capability for few-shot learning. This essentially means shaping the prompt sent to GPT-3 in a way that instructs it to follow the conversation.

    When my program begins, the prompt is simply a block of text with some background information that the chatbot is supposed to know but doesn’t, such as its own nature, how it works, my name as its developer, etc. The starting prompt also models how inputs and answers should follow each other, here flagged with the letters Q and A because they resemble questions and answers (although they do not have to be actual questions and answers, for example when you say “Hello, good morning.” and the bot replies something like “Hello there.”).

    Here’s the exact prompt used by my app:

    var prompt = "I am a highly intelligent bot with whom you can talk. I try to reply saying the truth, but sometimes I might make up some stuff. I work by integrating GPT-3 as a brain with web APIs for speech recognition and speech synthesis.;Q: What is human life expectancy in the United States?;A: Human life expectancy in the United States is 78 years.;Q: Who was president of the United States in 1955?;A: Dwight D. Eisenhower was president of the United States in 1955.;Q: Which party did he belong to?;A: He belonged to the Republican Party.;Q: How does a telescope work?;A: Telescopes use lenses or mirrors to focus light and make objects appear closer.;Q: Where were the 1992 Olympics held?;A: The 1992 Olympics were held in Barcelona, Spain.;Q: Who developed you?;A: I was developed by Luciano Abriata.";

    Now, when Annyang understands for the first time some text that was spoken to it, the app appends this text to the prompt and the whole new prompt is sent to GPT-3. For example, if the first thing heard by Annyang is “Hello, how are you doing?” then the prompt as sent to GPT-3 will be something like:

    I am a highly intelligent bot with whom you can talk. I try to reply saying the truth, but sometimes I might make up some stuff. I work by integrating GPT-3 as a brain with web APIs for speech recognition and speech synthesis.;Q: What is human life expectancy in the United States?;A: Human life expectancy in the United States is 78 years.;Q: Who was president of the United States in 1955?;A: Dwight D. Eisenhower was president of the United States in 1955.;Q: Which party did he belong to?;A: He belonged to the Republican Party.;Q: How does a telescope work?;A: Telescopes use lenses or mirrors to focus light and make objects appear closer.;Q: Where were the 1992 Olympics held?;A: The 1992 Olympics were held in Barcelona, Spain.;Q: Who developed you?;A: I was developed by Luciano Abriata.;Q: Hello, how are you doing?"

    (The text added to the starting prompt is the final part, “;Q: Hello, how are you doing?”.)

    The output from GPT-3 will consist of this same prompt plus its addition. For example:

    I am a highly intelligent bot with whom you can talk. I try to reply saying the truth, but sometimes I might make up some stuff. I work by integrating GPT-3 as a brain with web APIs for speech recognition and speech synthesis.;Q: What is human life expectancy in the United States?;A: Human life expectancy in the United States is 78 years.;Q: Who was president of the United States in 1955?;A: Dwight D. Eisenhower was president of the United States in 1955.;Q: Which party did he belong to?;A: He belonged to the Republican Party.;Q: How does a telescope work?;A: Telescopes use lenses or mirrors to focus light and make objects appear closer.;Q: Where were the 1992 Olympics held?;A: The 1992 Olympics were held in Barcelona, Spain.;Q: Who developed you?;A: I was developed by Luciano Abriata.;Q: Hello, how are you doing?; A: I am doing fine, thanks."

    When the next spoken text is heard, it is appended to this last prompt and sent again to GPT-3. The prompt then comes back with the new response added. And so on.
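
    The append-and-extract cycle described here can be sketched with two small functions. This is an illustration under assumptions, not the app's code: the function names are hypothetical, and the only things taken from the article are the ";Q: " and "A:" markers and the fact that the returned text starts with the prompt that was sent.

```javascript
// Append what the user said as a new "Q" turn, using the prompt's ";Q: " separator.
function appendUserTurn(prompt, spokenText) {
  return prompt + ";Q: " + spokenText.trim();
}

// Given the prompt that was sent and the text that came back,
// keep only the newly generated part, without its leading "A:" marker.
function extractNewReply(sentPrompt, returnedText) {
  const reply = returnedText.startsWith(sentPrompt)
    ? returnedText.slice(sentPrompt.length)
    : returnedText;
  return reply.replace(/^\s*;?\s*A:\s*/, "").trim();
}
```

    The full returned text (prompt plus reply) becomes the prompt for the next turn, while only the extracted reply is handed to the speech synthesis.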

    This not only provides information that the chatbot cannot otherwise know (in the first chunk of the prompt) but also gives continuity to the chat, as the previous exchanges are still there. This is why the few-shot learning capability of GPT-3 is so great. Unfortunately it has a limit: the prompt is capped at 2048 tokens, so at some point the API will throw an error because the prompt is too long. There are probably ways to work around this, which you could try out as an exercise.
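
    One possible approach to that exercise is to always keep the starting prompt and drop the oldest exchanges once the conversation grows too long. The sketch below is only an illustration: the function names are hypothetical, the conversation is assumed to be stored as an array of strings (one per exchange), and tokens are estimated with the rough rule of thumb of about 4 characters per token rather than counted exactly.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// This is an approximation, not the tokenizer used by the API.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the preamble (background plus example turns) and drop the oldest
// conversation turns until the estimated size fits within maxTokens.
function trimConversation(preamble, turns, maxTokens) {
  const kept = turns.slice(); // copy, so the caller's array is left untouched
  while (kept.length > 0 && estimateTokens(preamble + kept.join("")) > maxTokens) {
    kept.shift(); // remove the oldest turn first
  }
  return preamble + kept.join("");
}
```

    The budget passed as maxTokens should be set somewhat below the cap, since the reply generated by the model also needs room. The price of this approach is that the bot forgets the earliest parts of a long conversation.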

    Cleaning up characters in phrases to feed into and obtained from GPT-3

    As you’ve seen in the piece of code above showing how the speech synthesis works, there’s some cleaning to do on the strings. For example, we don’t want the bot to speak the whole prompt, so we have to remove the input. We also want to remove those Qs and As as well as any HTML tags, weird question marks, etc. For this I use a lot of replace() calls on the strings (a regular expression with the g flag, as in replace(/.../g, ...), means “replace all instances” in JavaScript).

    $.ajax({
      url: "TDSgpt3withyourapikey.php?prompt=" + theprompt + "&apikey=" + apikey
    }).done(function(data) {
      // Remove the echoed prompt and the API URL from the returned text
      var textupdate = data.replace(theprompt, "")
        .replace("https://api.openai.com/v1/engines/text-davinci-002/completions", "")
        .trim();
      // Drop a stray leading question mark
      if (textupdate.substring(0, 1) == "?") {
        textupdate = textupdate.substring(1);
      }
      // Strip the Q:/A: markers wherever they appear
      textupdate = textupdate.replace(/;Q:/g, "").replace(/;A:/g, "").replace(/Q:/g, "").replace(/A:/g, "");
      textupdate = textupdate.trim();
      console.log(textupdate);
      // (the cleaned text in textupdate is then passed to the speech synthesis)
    });
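
    The same cleanup can be gathered into one function that is easy to try out on its own. This is a sketch under assumptions rather than the app's code: the name cleanReply is hypothetical, the HTML-tag pattern is a simple approximation, and the word boundary in the marker pattern is an addition so that a word ending in "A:" (such as "NASA:") is left alone.

```javascript
// Turn the raw text returned by the relay into the sentence to be spoken.
function cleanReply(rawText, sentPrompt) {
  let text = rawText.replace(sentPrompt, "");   // drop the echoed prompt (first occurrence)
  text = text.replace(/<[^>]*>/g, "");          // strip anything that looks like an HTML tag
  text = text.replace(/;?\s*\b[QA]:/g, "");     // strip Q: and A: markers, with optional ";"
  text = text.trim();
  if (text.startsWith("?")) {
    text = text.slice(1);                       // drop a stray leading question mark
  }
  return text.replace(/\s+/g, " ").trim();      // collapse line breaks and repeated spaces
}
```

    Having the cleanup in one place also makes it simpler to add further rules later, for example for characters that the speech synthesis reads aloud in odd ways.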

    One global look at the whole code

    Here’s all the code, with comments that describe briefly what is being done (summary of all the details given above)

    Screenshot by author, of his own code.
    Figure created by the author.

    Republished from source: https://towardsdatascience.com/coupling-gpt-3-with-speech-recognition-and-synthesis-to-achieve-a-fully-talking-chatbot-that-runs-abfcb7bf580?source=rss----7f60cf5620c9---4 via https://towardsdatascience.com/feed
