Analyzing Sentiment in Speech
Sentiment analysis, also known as opinion mining, is a popular task in Natural Language Processing (NLP) due to its diverse industrial applications. When NLP techniques are applied to textual data, the primary objective is to train a model that can classify a given piece of text into one of several sentiment classes. A high-level overview of a sentiment classifier is shown in the image below.
For instance, the classes for a three-class classification problem can be Positive, Negative, and Neutral. An example of the three-class sentiment analysis problem is the popular Twitter Sentiment Analysis dataset, which is an entity-level sentiment analysis task on multilingual tweets posted by various users on Twitter.
While most prior research and development in NLP has focused on applying sentiment analysis to text, speech-based interaction tools have recently seen massive adoption among users, steering researchers and organizations toward building sentiment classifiers in the speech space.
Therefore, this post will demonstrate how to build a sentiment analysis system on conversational data using the AssemblyAI API and Python. The end-to-end system holds extensive applicability in areas involving rigorous customer support and feedback evaluation — making it an important and valuable problem to solve, especially in the speech domain. Towards the end, I’ll also demonstrate an extensive analysis to enhance the interpretability of the obtained results and draw appropriate insights from the data.
You can find the code for this article here. Moreover, the highlight of the article is as follows:
In this section, I am going to demonstrate the use of the AssemblyAI API to classify individual sentences in a pre-recorded voice conversation into three sentiment classes: positive, negative, and neutral.
Step 1: Installing Requirements
There are very few requirements to build the sentiment classifier. In terms of Python libraries, we only need the requests package. It can be installed as follows:
pip install requests
Step 2: Generating your API Token
The next step is to create an account on the AssemblyAI website, which you can do for free. Once done, you will get your private API access key, which we will use to access the speech-to-text models.
Step 3: Uploading Audio File
For the purpose of this tutorial, I’ll use a pre-recorded audio conversation between two people to perform sentiment analysis on. Once you have obtained the API Key, you can proceed with the sentiment classification task on the pre-recorded audio file.
However, before doing that, you will need to upload the audio file so that it can be accessed via a URL. Options include uploading to an AWS S3 bucket, audio hosting services like SoundCloud or AssemblyAI’s self-hosting services, etc. I have uploaded the audio file to SoundCloud, which can be accessed below.
If you wish to upload the audio file directly to AssemblyAI’s hosting services, you can do that too. I have demonstrated this step-by-step procedure in the code blocks below.
Step 3.1: Import requirements
We start with importing the requirements for our project.
Step 3.2: Specify file location and API_Key
Next, we need to specify the location of the audio file on our local machine and the API key obtained after signing up.
Step 3.3: Specify Upload Endpoint
endpoint: This specifies the service to be invoked, which in this case is the “upload” service.
headers: This holds the API key and the content-type.
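A minimal sketch of this setup could look as follows. The URL is AssemblyAI's v2 upload endpoint; the file name, the placeholder key, and the helper name build_headers are assumptions made for illustration, not the original article's code.

```python
# Hypothetical values: replace with your own file path and API key.
FILENAME = "conversation.mp3"
API_KEY = "YOUR_ASSEMBLYAI_API_KEY"

# endpoint: the "upload" service of the AssemblyAI v2 API.
UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"


def build_headers(api_key):
    """Return the headers holding the API key and the content-type."""
    return {"authorization": api_key, "content-type": "application/json"}


headers = build_headers(API_KEY)
```

Keeping the key in one place (or reading it from an environment variable) avoids repeating it in every request.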
Step 3.4: Define the upload function
Audio files can only be uploaded in pieces of up to 5 MB (5,242,880 bytes) at a time. Therefore, we need to upload the data in chunks. The chunks are merged back together on the service side, so you don’t need to worry about handling numerous URLs.
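The chunked read can be sketched as a small generator. The function name read_audio_file and the chunk_size parameter are illustrative choices; only the 5,242,880-byte figure comes from the text above.

```python
CHUNK_SIZE = 5_242_880  # 5 MB, the per-chunk size mentioned above


def read_audio_file(path, chunk_size=CHUNK_SIZE):
    """Yield the file's bytes in chunks of at most `chunk_size` bytes."""
    with open(path, "rb") as audio:
        while True:
            chunk = audio.read(chunk_size)
            if not chunk:  # an empty bytes object signals end of file
                break
            yield chunk
```

Because this is a generator, the whole file is never held in memory at once; each chunk is read only when the consumer asks for it.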
Step 3.5: Upload
The last step is to invoke the POST request. The response of the POST request is a JSON that holds the upload_url of the audio file. I will use this URL in the next steps to run sentiment classification on the audio.
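One way to issue this request is sketched below, assuming the requests package and AssemblyAI's v2 upload endpoint. The function name upload_audio is hypothetical, and passing a generator as data relies on requests' support for streaming uploads.

```python
import requests

UPLOAD_ENDPOINT = "https://api.assemblyai.com/v2/upload"


def upload_audio(path, api_key, chunk_size=5_242_880):
    """Stream a local audio file to the upload endpoint and return its upload_url."""

    def chunks():
        # Read the file lazily so large recordings are not loaded at once.
        with open(path, "rb") as audio:
            while chunk := audio.read(chunk_size):
                yield chunk

    response = requests.post(
        UPLOAD_ENDPOINT,
        headers={"authorization": api_key, "content-type": "application/json"},
        data=chunks(),
    )
    response.raise_for_status()  # surface HTTP errors early
    return response.json()["upload_url"]
```

A call such as upload_audio("conversation.mp3", API_KEY) would then return the URL to pass on to the transcription step.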
Step 4: Sentiment Analysis
At this step, we have fulfilled all the necessary prerequisites to perform the task of sentiment analysis on the audio file. Now, we can proceed with calling the API to fetch the desired results. This is a two-step process which is demonstrated in the subsections below.
Step 4.1: Submitting Files for Transcription
The first step is to invoke an HTTP POST request. This sends your audio file to the AI models running in the background for transcription and instructs them to perform sentiment analysis on the transcribed text.
The arguments passed to the POST request are:
endpoint: It specifies the transcription service to be invoked.
json: This contains the URL to your audio file as the audio_url key. As we wish to perform sentiment analysis on conversational data, the sentiment_analysis flag and speaker_labels are set to True.
headers: This holds the authorization key and the content-type.
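The request could be assembled roughly as follows. This is a sketch against AssemblyAI's v2 transcript endpoint; the helper names build_transcript_request and submit_for_transcription are assumptions introduced here.

```python
import requests

TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url):
    """Request body: the audio URL plus the sentiment_analysis and speaker_labels flags."""
    return {
        "audio_url": audio_url,
        "sentiment_analysis": True,
        "speaker_labels": True,
    }


def submit_for_transcription(audio_url, api_key):
    """POST the transcription job and return the parsed JSON response."""
    response = requests.post(
        TRANSCRIPT_ENDPOINT,
        json=build_transcript_request(audio_url),
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    response.raise_for_status()
    return response.json()
```

The returned dictionary carries the id and status keys that the next step relies on.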
The current status of the POST request, as received in the JSON response, is queued. This indicates that the audio has been queued for transcription and the result is not ready yet. The sentiment_analysis flag is also True in the JSON response. However, the value corresponding to the sentiment_analysis_results key is None, as the status is currently queued.
Step 4.2: Fetching the Transcription Result
To check the status of our POST request, we need to make a GET request using the id key in the JSON response received above.
Next, we can proceed with a GET request, as shown in the code block below.
The arguments passed to the GET request are:
endpoint: This specifies the service invoked and the API call identifier determined using the id key.
headers: This holds your unique API key.
Note that the transcription result won’t be ready until the status key changes to completed. The time transcription takes depends on the length of your input audio file. Therefore, you must make repeated GET requests at regular intervals to check the transcription status. A simple way of doing this is implemented below:
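A minimal polling sketch is given here, under a few assumptions: the function names, the 5-second default interval, and the injectable fetch parameter are illustrative, and the handling of an "error" status reflects my understanding of the API's status values rather than anything stated in this article.

```python
import time

import requests

TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def fetch_transcript(transcript_id, api_key):
    """GET the current state of one transcription job."""
    response = requests.get(
        f"{TRANSCRIPT_ENDPOINT}/{transcript_id}",
        headers={"authorization": api_key},
    )
    response.raise_for_status()
    return response.json()


def wait_for_transcript(transcript_id, api_key, interval=5.0, fetch=fetch_transcript):
    """Poll until the status is "completed"; raise if it becomes "error"."""
    while True:
        result = fetch(transcript_id, api_key)
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(interval)  # wait before asking again
```

Passing the fetch function as a parameter keeps the loop easy to exercise without network access.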
Once the status changes to completed, you will receive a response similar to the one shown below.
- The status in the JSON response is marked as completed. This indicates that there were no errors in transcribing the audio.
- The text key contains the entire transcription of the input audio conversation, which includes 22 sentences.
- As the audio file is composed of multiple speakers, every speaker key within the words key is non-null. The speaker key is either “A” or “B”.
- We can see a confidence score for each individual word and for the entire transcription text. The score ranges from 0 to 1, with 0 being the lowest and 1 being the highest.
- The results for sentiment analysis on each of the 22 individual sentences in the audio are accessible using the sentiment_analysis_results key of the JSON response.
- Corresponding to each sentence, we get a confidence score similar to that in point 4 above.
- The sentiment of each sentence can be retrieved using the sentiment key of a sentence’s dictionary. The sentiment analysis result for the second sentence is shown below:
Raw JSON is usually hard to read and interpret. Therefore, to make the data easier to inspect and to conduct further analysis, let’s convert the sentiment analysis results above to a DataFrame. We will store the text of the sentence, its duration, its speaker, and its sentiment. This is implemented below:
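A hedged sketch of this conversion with pandas follows. It assumes each entry of sentiment_analysis_results carries text, start, end, speaker, and sentiment fields, with start and end in milliseconds; the function name and the sample entries are made up for illustration.

```python
import pandas as pd


def sentiment_results_to_frame(results):
    """Turn a list of per-sentence result dicts into a DataFrame."""
    rows = [
        {
            "text": item["text"],
            # Assumption: start/end are in milliseconds; convert to seconds.
            "duration": (item["end"] - item["start"]) / 1000,
            "speaker": item["speaker"],
            "sentiment": item["sentiment"],
        }
        for item in results
    ]
    return pd.DataFrame(rows, columns=["text", "duration", "speaker", "sentiment"])


# Made-up sample entries, not taken from the article's audio.
sample = [
    {"text": "Hello.", "start": 0, "end": 1500, "speaker": "A", "sentiment": "NEUTRAL"},
    {"text": "This is late.", "start": 2000, "end": 4500, "speaker": "B", "sentiment": "NEGATIVE"},
]
df = sentiment_results_to_frame(sample)
```

With a real response, the call would be sentiment_results_to_frame(result["sentiment_analysis_results"]).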
The DataFrame generated with the code snippet above is shown in the image below. Here, we have the 22 sentences that were spoken during the conversation along with the corresponding speaker labels (“A” and “B”), their duration in seconds, and the sentiment of the sentence as predicted by the model.
#1 Speaker Distribution
The number of sentences spoken by each of the speakers can be calculated using the value_counts() method as shown below:
To view the relative distribution of the speakers (as fractions of the total), we can pass normalize = True to the value_counts() method as follows:
Both speakers, “A” and “B”, contributed equally to the conversation in terms of the number of sentences.
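Both calls can be sketched on a toy DataFrame. The speaker labels here are invented for illustration and are not the article's data.

```python
import pandas as pd

# Made-up speaker labels for illustration only.
df = pd.DataFrame({"speaker": ["A", "B", "A", "B"]})

counts = df["speaker"].value_counts()                # sentences per speaker
shares = df["speaker"].value_counts(normalize=True)  # fraction per speaker
```

The same pattern applies to any categorical column, including the sentiment column.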
#2 Speaker Duration Distribution
Next, let’s compute the individual contribution of each of the speakers in the conversation. This is shown below:
We use the groupby() method to compute the total duration of each speaker's speech. Speaker A is the dominant speaker in terms of duration.
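As a small illustration of the groupby() step, the sketch below sums durations per speaker. The durations are invented, and the column names follow the DataFrame described earlier.

```python
import pandas as pd

# Made-up durations (in seconds) for illustration only.
df = pd.DataFrame({"speaker": ["A", "B", "A"], "duration": [4.0, 1.5, 2.5]})

# Total speaking time per speaker.
total_duration = df.groupby("speaker")["duration"].sum()
```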
#3 Sentiment Distribution
Out of the 22 sentences spoken during the conversation, only three were tagged with negative sentiment.
Moreover, none of the sentences were predicted as
The normalized distribution can be calculated as follows:
#4 Sentiment Distribution on Speaker-level
Finally, let’s compute the distribution of sentiment across individual speakers. Here, instead of using the groupby() method, we will use crosstab() for better visualization. This is demonstrated below:
The fraction of negative sentences spoken by Speaker “A” was higher than that of Speaker “B”.
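A crosstab of this kind can be sketched as follows. The rows are invented, and normalize="index" is my choice for expressing each speaker's sentiments as fractions; the original may have shown raw counts instead.

```python
import pandas as pd

# Made-up labels for illustration only.
df = pd.DataFrame({
    "speaker": ["A", "A", "B", "B"],
    "sentiment": ["NEGATIVE", "NEUTRAL", "NEUTRAL", "NEUTRAL"],
})

# Rows are speakers, columns are sentiments; each row sums to 1.
table = pd.crosstab(df["speaker"], df["sentiment"], normalize="index")
```

Dropping the normalize argument yields the raw sentence counts per speaker and sentiment.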
#5 Average Sentence Duration on Sentiment-level
Lastly, we shall compute the average duration of the sentences belonging to the individual sentiment classes. This is implemented below using the
The average duration of negative sentences is smaller than that of the
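One way to compute such per-class averages is with groupby() followed by mean(), sketched here on invented values. Whether the original used exactly this call is an assumption.

```python
import pandas as pd

# Made-up durations (in seconds) for illustration only.
df = pd.DataFrame({
    "sentiment": ["NEGATIVE", "NEUTRAL", "NEUTRAL"],
    "duration": [2.0, 3.0, 5.0],
})

# Mean sentence duration for each sentiment class.
average_duration = df.groupby("sentiment")["duration"].mean()
```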
To conclude, in this post, we discussed a particular NLP use case of the AssemblyAI API. Specifically, we saw how to build a sentiment classification module on a pre-recorded audio file comprising multiple speakers. Finally, we analyzed the sentiment analysis results in detail. The results obtained from the API highlighted the sentiments of the 22 individual sentences in the input audio file.
You can find the code for this article here.
In the upcoming posts, I will discuss more use-cases of the AssemblyAI API, such as Entity Detection, Content Moderation, and more, from both the technical and practical perspectives.
See you next time. Thanks for reading.