Conversational Sentiment Analysis on Audio Data

    Analyzing sentiment in Speech

    Photo by Towfiqu barbhuiya on Unsplash

    Sentiment Analysis, also known as opinion mining, is a popular task in Natural Language Processing (NLP) due to its diverse industrial applications. When applied to textual data, the primary objective is to train a model that can classify a given piece of text into one of several sentiment classes. A high-level overview of a sentiment classifier is shown in the image below.

    An overview of the Sentiment Analysis model (Image by author)

    For instance, the classes for a three-class classification problem can be Positive, Negative and Neutral. An example of the three-class sentiment analysis problem is the popular Twitter Sentiment Analysis dataset, which is an entity-level sentiment analysis task on multilingual tweets posted by various users on Twitter.

    While most prior research and development in NLP has focused on applying sentiment analysis to text, speech-based interaction tools have recently seen massive adoption among users, steering researchers and organizations towards building sentiment classifiers for speech.

    Therefore, this post demonstrates how to build a sentiment analysis system for conversational data using the AssemblyAI API and Python. Such an end-to-end system is widely applicable wherever customer support calls and feedback need to be evaluated, which makes it a valuable problem to solve in the speech domain. Towards the end, I’ll also analyze the results in more detail to make them easier to interpret and to draw insights from the data.

    You can find the code for this article here. The highlights of the article are as follows:

    Sentiment Analysis on Conversational Audio Data
    Sentiment Analysis Results
    Sentiment Analysis Insights

    In this section, I will demonstrate how to use the AssemblyAI API to classify individual sentences in a pre-recorded voice conversation into three sentiment classes: Positive, Negative and Neutral.

    An overview of the Sentiment Analysis model through an API (Image by author)

    Step 1: Installing Requirements

    There are very few requirements to build the sentiment classifier. In terms of Python libraries, we only need the requests package. It can be installed as follows:

    pip install requests

    Step 2: Generating your API Token

    The next step is to create an account on the AssemblyAI website, which you can do for free. Once done, you will get your private API access key, which we will use to access the speech-to-text models.

    Step 3: Uploading Audio File

    For the purpose of this tutorial, I’ll use a pre-recorded audio conversation between two people to perform sentiment analysis on. Once you have obtained the API Key, you can proceed with the sentiment classification task on the pre-recorded audio file.

    However, before doing that, you will need to upload the audio file so that it can be accessed via a URL. Options include uploading to an AWS S3 bucket, audio hosting services like SoundCloud or AssemblyAI’s self-hosting services, etc. I have uploaded the audio file to SoundCloud, which can be accessed below.

    If you wish to upload the audio file directly to AssemblyAI’s hosting services, you can do that too. I have demonstrated this step-by-step procedure in the code blocks below.

    Step 3.1: Import requirements

    We start with importing the requirements for our project.

    Step 3.2: Specify file location and API_Key

    Next, we need to specify the location of the audio file on our local machine and the API key obtained after signing up.
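
    A minimal sketch of this step is shown here. The file name and the environment variable name are assumptions made for illustration; substitute your own path and key.

```python
import os

# Hypothetical local path to the recorded conversation; replace with your own file.
AUDIO_FILE = "conversation.mp3"

# Reading the key from an environment variable keeps the secret out of the
# source code. The variable name ASSEMBLYAI_API_KEY is an assumption; the
# fallback string is only a placeholder and will not authenticate.
API_KEY = os.environ.get("ASSEMBLYAI_API_KEY", "<your-api-key>")
```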

    Step 3.3: Specify Upload Endpoint

    • endpoint: This specifies the service to be invoked, which in this case is the “upload” service.
    • headers: This holds the API key and the content-type.
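
    The two values described above might look as follows. The URL reflects my understanding of AssemblyAI's v2 REST API, so treat it as an assumption and check it against the current documentation; the key is a placeholder.

```python
API_KEY = "<your-api-key>"  # placeholder, not a real key

# Assumed URL of the "upload" service in AssemblyAI's v2 REST API.
UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"

# The API key travels in the "authorization" header, next to the content type.
headers = {
    "authorization": API_KEY,
    "content-type": "application/json",
}
```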

    Step 3.4: Define the upload function

    Audio can only be uploaded in pieces of up to 5 MB (5,242,880 bytes) at a time. Therefore, we need to read and send the file in chunks. The service merges the chunks back into a single file on its side, so you don’t need to worry about handling multiple URLs.
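
    One way to read a file in chunks is a generator such as the sketch below. The function name read_file and the constant CHUNK_SIZE are illustrative choices, not names taken from the API.

```python
CHUNK_SIZE = 5_242_880  # 5 MB, the per-chunk limit mentioned above


def read_file(filename, chunk_size=CHUNK_SIZE):
    """Yield the bytes of `filename` in pieces of at most `chunk_size` bytes."""
    with open(filename, "rb") as audio:
        while True:
            data = audio.read(chunk_size)
            if not data:  # an empty read means the end of the file
                break
            yield data
```

    Because the function is a generator, only one chunk is held in memory at a time, however large the audio file is.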

    Step 3.5: Upload

    The last step is to invoke the POST request. The response to the POST request is a JSON object that holds the upload_url of the audio file. I will use this URL in the next steps to run sentiment classification on the audio.
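
    A sketch of the upload call, assuming the v2 upload URL below and a hypothetical helper named upload_audio. The chunks argument can be any iterable of byte strings, for example a generator that reads the file 5 MB at a time.

```python
import requests

# Assumed URL of the upload service; verify against the current API docs.
UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"


def upload_audio(chunks, headers):
    """POST the audio bytes and return the URL under which they are hosted.

    When `data` is a generator or other iterable of bytes, requests streams
    the body piece by piece instead of loading it all into memory.
    """
    response = requests.post(UPLOAD_ENDPOINT, headers=headers, data=chunks)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()["upload_url"]
```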

    Step 4: Sentiment Analysis

    At this step, we have fulfilled all the necessary prerequisites to perform the task of sentiment analysis on the audio file. Now, we can proceed with calling the API to fetch the desired results. This is a two-step process which is demonstrated in the subsections below.

    Step 4.1: Submitting Files for Transcription

    The first step is to invoke an HTTP POST request. This sends your audio file to the AI models running in the background for transcription and instructs them to perform sentiment analysis on the transcribed text.

    The arguments passed to the POST request are:

    1. endpoint: It specifies the transcription service to be invoked.
    2. json: This contains the URL to your audio file under the audio_url key. As we wish to perform sentiment analysis on conversational data, the sentiment_analysis and speaker_labels flags are both set to True.
    3. headers: This holds the authorization key and the content-type.
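
    Put together, the three arguments above could be passed as in this sketch. The endpoint URL is my assumption about the v2 API, and submit_for_transcription is a hypothetical helper name.

```python
import requests

# Assumed URL of the transcription service; verify against the current docs.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def submit_for_transcription(audio_url, api_key):
    """Queue an audio file for transcription with sentiment analysis enabled."""
    payload = {
        "audio_url": audio_url,
        "sentiment_analysis": True,  # classify each sentence
        "speaker_labels": True,      # tell the speakers apart
    }
    headers = {"authorization": api_key, "content-type": "application/json"}
    response = requests.post(TRANSCRIPT_ENDPOINT, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()  # includes the "id" and "status" keys
```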

    The current status of the POST request, as received in the JSON response, is queued. This indicates that the audio is waiting to be transcribed.

    Moreover, the sentiment_analysis flag is also True in the JSON response. However, the value corresponding to the sentiment_analysis_results key is None, as the status is currently queued.

    Step 4.2: Fetching the Transcription Result

    To check the status of our POST request, we need to make a GET request using the id key in the JSON response received above.

    Next, we can proceed with a GET request, as shown in the code block below.

    The arguments passed to the GET request are:

    1. endpoint: This specifies the service invoked, together with the API call identifier taken from the id key.
    2. headers: This holds your unique API key.

    Here, you should know that the transcription result won’t be ready until the status key changes to completed. The time it takes for transcription depends upon how long your input audio file is. Therefore, you must make repeated GET requests at regular intervals to check transcription status. A simple way of doing this is implemented below:
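
    A simple polling loop along these lines is sketched here. The helper name wait_for_transcript, the five-second interval and the cap on the number of checks are my own choices; the "error" status and its "error" message field reflect my understanding of the API and should be verified.

```python
import time

import requests

# Assumed URL of the transcription service; verify against the current docs.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def wait_for_transcript(transcript_id, api_key, interval=5.0, max_checks=120):
    """Poll the transcript until it is completed, then return the JSON result."""
    url = f"{TRANSCRIPT_ENDPOINT}/{transcript_id}"
    headers = {"authorization": api_key}
    for _ in range(max_checks):
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        result = response.json()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":  # assumed failure status
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(interval)  # still queued or processing; try again later
    raise TimeoutError("transcript was not ready after max_checks polls")
```

    Capping the number of checks avoids an endless loop if the job never finishes.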

    Once the status changes to completed, you will receive a response similar to the one mentioned below.

    1. The status in the JSON response is marked as completed. This indicates that there were no errors in transcribing the audio.
    2. The text key contains the entire transcription of the input audio conversation, and it includes 22 sentences.
    3. As the audio file contains multiple speakers, every entry under the words key has a non-null speaker key. The speaker key is either “A” or “B.”
    4. We can see a confidence score for each individual word and for the entire transcription text. The score ranges from 0 to 1, with 0 being the lowest and 1 being the highest.
    5. The results for sentiment analysis on each of the 22 individual sentences in the audio are accessible using the sentiment_analysis_results key of the JSON response.
    6. Corresponding to each sentence, we get a confidence score similar to that in point 4 above.
    7. The sentiment of each sentence can be retrieved using the sentiment key of a sentence’s dictionary. The sentiment analysis result for the second sentence is shown below:

    JSONs are usually hard to read and interpret. Therefore, to make the data visually appealing and to conduct further analysis, let’s convert the sentiment analysis results above to a DataFrame. We will store the text, the duration of the sentence, its speaker, and the sentiment of the sentence. This is implemented below:
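
    The conversion could be sketched as follows. I am assuming that each entry of sentiment_analysis_results carries text, start, end, speaker and sentiment keys, and that start and end are in milliseconds. The two sample rows are made up for illustration and are not taken from the article's audio.

```python
import pandas as pd


def sentiments_to_frame(results):
    """Turn a list of per-sentence result dicts into a tidy DataFrame."""
    rows = [
        {
            "text": item["text"],
            # assumed: timestamps are in milliseconds, so divide to get seconds
            "duration": (item["end"] - item["start"]) / 1000,
            "speaker": item["speaker"],
            "sentiment": item["sentiment"],
        }
        for item in results
    ]
    return pd.DataFrame(rows, columns=["text", "duration", "speaker", "sentiment"])


# Made-up sample entries, shaped like the assumed API output.
sample = [
    {"text": "Hello.", "start": 0, "end": 1500,
     "speaker": "A", "sentiment": "NEUTRAL"},
    {"text": "That is not right.", "start": 1500, "end": 4000,
     "speaker": "B", "sentiment": "NEGATIVE"},
]
df = sentiments_to_frame(sample)
```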

    The DataFrame generated with the code snippet above is shown in the image below. Here, we have the 22 sentences that were spoken during the conversation along with the corresponding speaker labels (“A” and “B”), their duration in seconds, and the sentiment of the sentence as predicted by the model.

    Sentences in the audio file (Image by author)

    #1 Speaker distribution

    The number of sentences spoken by each of the speakers can be calculated using the value_counts() method as shown below:

    To view the percentage distribution of the speakers, we can pass normalize = True to the value_counts() method as follows:
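
    Both calls are sketched here on a toy DataFrame. The four rows are invented for illustration; in practice the speaker column comes from the results DataFrame.

```python
import pandas as pd

# Toy data standing in for the real results (not the article's sentences).
df = pd.DataFrame({"speaker": ["A", "B", "A", "B"]})

# Absolute number of sentences per speaker.
counts = df["speaker"].value_counts()

# Share of sentences per speaker, as fractions that sum to 1.
shares = df["speaker"].value_counts(normalize=True)
```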

    Both the speakers “A” and “B” contributed equally to the conversation in terms of the number of sentences.

    #2 Speaker Duration distribution

    Next, let’s compute how much of the conversation each speaker accounts for in terms of speaking time. This is shown below:
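
    A sketch of the aggregation on a toy DataFrame, with column names assumed to match the results DataFrame (speaker, duration) and rows invented for illustration:

```python
import pandas as pd

# Toy data (not the article's sentences); durations are in seconds.
df = pd.DataFrame({
    "speaker": ["A", "B", "A"],
    "duration": [2.0, 1.0, 3.5],
})

# Total speaking time per speaker.
total_duration = df.groupby("speaker")["duration"].sum()
```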

    We use the groupby() method and compute the total duration of their speech. Speaker A is the dominant speaker in terms of duration.

    #3 Sentiment Distribution

    Out of the 22 sentences spoken during the conversation, only three were tagged as negative sentiment. Moreover, none of the sentences were predicted as positive sentiment.

    The normalized distribution can be calculated as follows:
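
    One possible way to compute it is sketched here on invented rows. The uppercase class labels are my assumption about how the API spells them. Since value_counts() omits classes that never occur, the sketch reindexes the result so that a class with no sentences still appears with a share of 0.

```python
import pandas as pd

# Toy data (not the article's sentences); label spelling is assumed.
df = pd.DataFrame({"sentiment": ["NEUTRAL", "NEUTRAL", "NEGATIVE", "NEUTRAL"]})

classes = ["POSITIVE", "NEUTRAL", "NEGATIVE"]

# Fraction of sentences per class; absent classes are filled in with 0.
distribution = (
    df["sentiment"]
    .value_counts(normalize=True)
    .reindex(classes, fill_value=0.0)
)
```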

    #4 Sentiment Distribution on Speaker-level

    Finally, let’s compute the distribution of sentiment across individual speakers. Here, instead of using the groupby() method, we will use crosstab() for better visualization. This is demonstrated below:
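
    A sketch of the cross-tabulation on invented rows. Passing normalize="index" is my choice here: it turns each speaker's row into fractions, which makes speakers with different sentence counts comparable.

```python
import pandas as pd

# Toy data (not the article's sentences).
df = pd.DataFrame({
    "speaker": ["A", "A", "B", "B"],
    "sentiment": ["NEGATIVE", "NEUTRAL", "NEUTRAL", "NEUTRAL"],
})

# Rows are speakers, columns are sentiment classes; each row sums to 1.
table = pd.crosstab(df["speaker"], df["sentiment"], normalize="index")
```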

    The fraction of negative sentences spoken by Speaker “A” was higher than that of Speaker “B”.

    #5 Average Sentence Duration on Sentiment-level

    Lastly, we shall compute the average duration of the sentences belonging to the individual sentiment classes. This is implemented below using the groupby() method:
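
    A sketch on a toy DataFrame, with rows invented for illustration and column names assumed to match the results DataFrame:

```python
import pandas as pd

# Toy data (not the article's sentences); durations are in seconds.
df = pd.DataFrame({
    "sentiment": ["NEGATIVE", "NEGATIVE", "NEUTRAL"],
    "duration": [1.0, 2.0, 4.0],
})

# Mean sentence duration within each sentiment class.
average_duration = df.groupby("sentiment")["duration"].mean()
```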

    The average duration of negative sentences is shorter than that of neutral sentences.

    To conclude, in this post, we discussed a particular NLP use case of the AssemblyAI API. Specifically, we saw how to build a sentiment classification module on a pre-recorded audio file comprising multiple speakers. Finally, we did an extensive analysis on the sentiment analysis results. The obtained results from the API highlighted the sentiments of the 22 individual sentences in the input audio file.

    You can find the code for this article here.

    In the upcoming posts, I will discuss more use-cases of the AssemblyAI API, such as Entity Detection, Content Moderation, and more, from both the technical and practical perspectives.

    See you next time. Thanks for reading.
