April 14, 2022 12:19 am GMT

Let's Make Python Listen - Part 1.

Hello, fellow human being. In this series of articles, we are going to unravel the mysterious world of speech recognition systems and utilize Deepgram's services in this context. Many people may be interested in this subject because voice assistants such as Amazon's Alexa, Google's Assistant, and Apple's Siri are competing to become the dominant smart speaker, and all of them make use of different types of deep neural networks (feedforward and feedback networks). Deep neural networks were introduced in 2006 [0] by the godfather himself: Geoffrey Hinton [1].

Table of Contents (TOC).

  • What is Speech?
  • What is Speech Recognition?
  • Speech Recognition History
  • What is Deepgram?
  • Deepgram's Unique Features
  • Speech Recognition from a Live Microphone
    • Install deepgram-sdk, pyaudio
    • Input and Output Devices
      • Audio Recording & Wave Files
      • Experimentations
      • Putting it All Together
        • Functional Programming
        • Object Oriented Approach
  • Deepgram Python SDK
  • Connecting Pyaudio and Deepgram
  • Handle Exceptions
  • Wrapping Up

What is Speech?

Go To TOC.

The human voice is a physical phenomenon that we cannot see. The shape of the back of the throat and its vibration are used to make a speech sound [2]. When a microphone picks up sounds, it converts them into an electrical signal that can be transmitted over a wired or wireless connection to software on your computer, speakers, or a voice-recognition device. The brain initiates speech by triggering your mouth muscles to produce sound [3]. For example, when someone speaks the word "Hello," they articulate it with lips and tongue while their vocal cords vibrate and air passes between them.

What is Speech Recognition?

Go To TOC.

Speech recognition is the process of converting spoken words into text. In some cases, it can be used in conjunction with other technologies to provide computer input or replace the keyboard and mouse. It's a technology that has been around since the 1950s, but we have seen significant advancements in recent years. Speech recognition utilizes DSP (digital signal processing) techniques to process and analyze audio signals [4].

Speech recognition is often used as a stand-alone application or as part of a larger software package that includes other features, such as dictation. It allows the user to control a computer or other device by speaking. It is also known by a variety of terms, such as voice recognition, voice-to-text, or speech-to-text.

With that brief introduction to speech recognition out of the way, let's take a look at its exciting history, which, surprisingly enough, spans around 72 years, starting from the 1950s, as mentioned above.

Speech Recognition History

Go To TOC.


Bell-Laboratories-invented-Audrey [5]

In the early 1950s, scientists at Bell Laboratories created the Audrey (Automatic Digit Recognizer) machine, which had three main components:

  • A microphone that captures human speech.
  • A piece of hardware that was programmed to do the actual transcription.
  • A display that shows the number being spoken into the microphone (right-hand side of the image).

As the name suggests, this machine could recognize spoken digits (0-9).


The Shoebox [6]

In 1962, IBM released the first device, called Shoebox [7], to recognize spoken words; it could recognize ten digits and six arithmetic command words (e.g., plus, minus, etc.). For example, if someone said "2 plus 2" into the microphone, Shoebox would trigger an adding function to calculate and display the result.

These technologies worked back then by transforming voice signals into electrical impulses, and then each word was split into small phonetic units. For example, the word "hello" would be divided into 'he l oh' or something along those lines.

In the 1970s, the US Department of Defense stepped in to financially support research. DARPA (the Defense Advanced Research Projects Agency) funded one of the most significant speech recognition projects of the era. The resulting system could recognize more than a thousand words.

In 1982, the SAM synthesizer [9] became the first commercial speech synthesis software, giving a voice to the Commodore 64 computer.

A significant milestone was achieved in the late 1980s when statistical models were introduced (e.g., the Hidden Markov Model), which could recognize approximately five thousand words.


A hidden Markov model for speech recognition. [10]

It works by assigning each sound unit to a node, with each edge carrying the probability of the next unit in the word. As you can see in the example below, the word 'potato' can be pronounced in various ways, such as 'p oh t ah t oh', 'p ah t ay t oh', and others.


Speech Recognition and Statistical Modeling. [11]

The downside of these algorithms is that they only recognize discrete speech, so you cannot speak naturally; you need to pause between words, which is unfortunate.
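To make the pronunciation-graph idea concrete, here is a toy sketch in Python. The transition probabilities below are invented purely for illustration (a real HMM learns them from transcribed speech corpora), and the scoring is a bare product of edge probabilities, not a full HMM decoder:

```python
# Toy pronunciation graph for 'potato': each key is an edge between two sound
# units, each value a made-up probability of that transition.
transitions = {
    ("p", "oh"): 0.4, ("p", "ah"): 0.6,
    ("oh", "t"): 1.0, ("ah", "t"): 1.0, ("ay", "t"): 1.0,
    ("t", "ah"): 0.3, ("t", "ay"): 0.2, ("t", "oh"): 0.5,
}

def path_probability(path):
    """Multiply the edge probabilities along one pronunciation path."""
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= transitions.get((prev, cur), 0.0)  # unknown edge -> probability 0
    return prob

# Two of the 'potato' pronunciations mentioned above:
print(path_probability(["p", "oh", "t", "ah", "t", "oh"]))
print(path_probability(["p", "ah", "t", "ay", "t", "oh"]))
```

A decoder would compare such path scores (weighted by acoustic evidence) and keep the most probable one.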

In the 1990s, the first commercial product became available to the masses when Dragon launched Dragon Dictate, which was capable of recognizing approximately 60k words.

Entering the 2000s, Google released the voice search app for the iPhone [12]. The app processes voice requests in Google's cloud data centers, matching them against a large pool of human-speech recordings and learning from queries collected from users (230 billion words), powered by the neural networks introduced in 2006, as mentioned at the beginning of the article.

I think that is enough history for today, which presumably will be continued in future posts about speech recognition. Now let's move on to the next section exploring Deepgram transcription services.

What is Deepgram?

Go To TOC.

Deepgram is a promising new AI-powered transcription tool that utilizes deep learning and machine learning algorithms to transcribe audio recordings by detecting words and phrases that occur within the recording. In simple terms, it is a voice recognition service that takes recordings and converts them into text. But it is much more than that.

Apparently, Deepgram has many use cases. For example, it can be used as a transcription service for meetings and phone calls, as a speech-to-text service for videos, or as an automated transcript generator for audio files. Detailed information is available on their website [13].

Deepgram's Unique Features

Go To TOC.


High Accuracy for Better Speech Analysis. [14]

Deepgram has been shown [15] to offer significantly higher accuracy rates (90%+ accuracy) than other transcription systems out there. In addition, it also provides a much higher transcription speed (3 seconds to transcribe hour-long recordings) and lower costs ($0.78/hour), which makes it an attractive option for businesses that need to transcribe large quantities of content regularly.

At the time of writing, the service can identify and transcribe audio across 16 languages [16], covering a large variety of accents and dialects.

The cool part about Deepgram is that it offers a free trial which anyone can use. Moreover, Deepgram provides open-source SDKs and free speech recognition tools that can be integrated into any application or system.

With the help of Deepgram, we don't have to reinvent the wheel and build a machine learning model from the bottom up (that would be a fantastic project to work on in the future). Instead, we will use the Python SDK, which allows us to interact with various Deepgram API endpoints that utilize state-of-the-art machine learning models to perform speech transcription.

In essence, Deepgram's transcription services are easy to use, accurate, and fast. They can help you save time, money, and resources while still providing high-quality content.

Now, let's jump into the technical stuff.

Speech Recognition from a Live Microphone

Go To TOC.

In this section, we will learn how to convert real-time speech into human-readable text. To accomplish this, we will use the deepgram-sdk package along with the PyAudio package.

Install deepgram-sdk, pyaudio

Go To TOC.

Python has a handy built-in module called wave, but it does not support recording, just processing audio files on the fly. To record audio data, we can consult a third-party package called PyAudio. The official website is a good starting point on how to install and use this library on various platforms.

However, PyAudio depends on another library called portaudio, which is not part of the default Linux dependencies. To install it on your machine, you need to issue the following command on your terminal:

$ sudo apt-get install portaudio19-dev

If the above command runs successfully, you can download and install PyAudio on your system. Because we previously used poetry instead of pip for dependency management, we can run the following command to add PyAudio to our project:

$ poetry add pyaudio

If the installation was successful, you can look up the portaudio version by running:

$ python3 -c 'import pyaudio as p; print(p.get_portaudio_version())'
1246720

To install deepgram on your machine, you can follow along with their GitHub repo(https://github.com/deepgram/python-sdk). Likewise, to import deepgram into our project with poetry, simply run:

$ poetry add deepgram-sdk

If the installation was successful, you can look up the deepgram version by running:

$ python3 -c 'import deepgram; print(deepgram._version.__version__)'
0.2.5

Now, it is time to play with these modules. To do so, make sure your microphone is on by default and not muted.

Input and Output Devices

Go To TOC.

Now, let's open up a REPL and test things out.

We will begin by importing the pyaudio module and then instantiating the PyAudio class.

>>> import pyaudio
>>> py_audio = pyaudio.PyAudio()

If you are on Linux, you may run into the following warnings:

ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm_route.c:869:(find_matching_chmap) Found no matching channel map
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_oss.c:377:(_snd_pcm_oss_open) Unknown field port
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_usb_stream.c:486:(_snd_pcm_usb_stream_open) Invalid type for card
ALSA lib pcm_dmix.c:1089:(snd_pcm_dmix_open) unable to open slave

Let's ignore these warnings for now.

py_audio has a lot of valuable attributes that you can use to get information about your input and output devices.

>>> for attr in dir(py_audio):
...   if not attr.startswith("_"):
...     print(attr)
...
close
get_default_host_api_info
get_default_input_device_info
get_default_output_device_info
get_device_count
get_device_info_by_host_api_device_index
get_device_info_by_index
get_format_from_width
get_host_api_count
get_host_api_info_by_index
get_host_api_info_by_type
get_sample_size
is_format_supported
open
terminate

For instance, to look up details about the default input device, you can call the following method:

>>> py_audio.get_default_input_device_info()
{
    'index': 9,
    'structVersion': 2,
    'name': 'default',
    'hostApi': 0,
    'maxInputChannels': 32,
    'maxOutputChannels': 32,
    'defaultLowInputLatency': 0.008684807256235827,
    'defaultLowOutputLatency': 0.008684807256235827,
    'defaultHighInputLatency': 0.034807256235827665,
    'defaultHighOutputLatency': 0.034807256235827665,
    'defaultSampleRate': 44100.0
}

Keep in mind the value of the defaultSampleRate key. We are going to use it when recording audio from the microphone.

Similarly, to get information about your default output device, you can call the following method:

>>> py_audio.get_default_output_device_info()
{
    'index': 9,
    'structVersion': 2,
    'name': 'default',
    'hostApi': 0,
    'maxInputChannels': 32,
    'maxOutputChannels': 32,
    'defaultLowInputLatency': 0.008684807256235827,
    'defaultLowOutputLatency': 0.008684807256235827,
    'defaultHighInputLatency': 0.034807256235827665,
    'defaultHighOutputLatency': 0.034807256235827665,
    'defaultSampleRate': 44100.0
}

If you want to check the details of every I/O device on your machine, you can execute the following code:

>>> for index in range(py_audio.get_device_count()):
...   device_info = py_audio.get_device_info_by_index(index)
...   for key, value in device_info.items():
...     print(key, value, sep=": ")

Audio Recording & Wave Files

Go To TOC.

Experimentations

To record audio data from the microphone, you need to call the open method:

>>> from rich import inspect
>>> inspect(py_audio.open)
<bound method PyAudio.open of <pyaudio.PyAudio object at 0x7f6c8bed5180>>
def PyAudio.open(*args, **kwargs):
Open a new stream. See constructor for :py:func:`Stream.__init__` for parameter details.
27 attribute(s) not shown. Run inspect(inspect) for options.

We are going to use rich for proper message display. Now, let's create a stream object for recording purposes:

>>> # open stream object as input
>>> audio_stream = py_audio.open(
...     rate=44100,             # frames per second
...     channels=1,             # mono, change to 2 if you want stereo
...     format=pyaudio.paInt16, # 16-bit int sample format (the constant's value is 8)
...     input=True,             # input device flag
...     output=False,           # output device flag; if True, you can play back the audio
...     frames_per_buffer=1024  # 1024 samples per frame
... )

Now, you can take a look at the available attributes of this stream object.

>>> for attr in dir(audio_stream):
...   if not attr.startswith("_"):
...     print(attr)
...
close
get_cpu_load
get_input_latency
get_output_latency
get_read_available
get_time
get_write_available
is_active
is_stopped
read
start_stream
stop_stream
write

The read and write functions are the most useful functions for this tutorial. We can call the read function to record audio samples in terms of frames.

>>> inspect(audio_stream.read)
<bound method Stream.read of <pyaudio.Stream object at 0x7f8310a41180>>
def Stream.read(num_frames, exception_on_overflow=True):
Read samples from the stream. Do not call when using *non-blocking* mode.
27 attribute(s) not shown. Run inspect(inspect) for options.

Apparently, the read method accepts a number of frames instead of a duration. Therefore, we need to convert a duration (the period of time we want to record) into a number of frames. To do so, we need to find how many frames there are in a given duration. The following formula will do the trick:

num_frames = int(rate / samples_per_frame * duration)

We can make sure that the above formula is correct using dimensional analysis:

The unit of measurement for:

  • rate: samples/second
  • samples_per_frame: samples/frames
  • duration: second

The left-hand side of the equation, num_frames, should have units of frames, which is indeed the case for our formula if you do the math. Now we can iterate through all the frames and read 1024 samples per frame. The int function truncates the result down to the nearest integer.
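As a quick sanity check, the formula can be wrapped in a small helper (the name num_frames_to_read is my own, not from the original):

```python
def num_frames_to_read(rate: int, samples_per_frame: int, duration: float) -> int:
    """Number of read() calls needed to cover `duration` seconds of audio."""
    # (samples/second) / (samples/frame) * seconds -> frames, truncated by int()
    return int(rate / samples_per_frame * duration)

print(num_frames_to_read(44100, 1024, 3))  # → 129
```

With the values used throughout this tutorial (44100 Hz, 1024 samples per frame, 3 seconds), this yields 129 reads.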

>>> frames = []
>>> for _ in range(int(44100 / 1024 * 3)):
...   data = audio_stream.read(1024)
...   frames.append(data)
...
>>> len(frames)
129

Each frame being added is a stream of bytes:

>>> type(frames[0])
<class 'bytes'>
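Those bytes are raw PCM: with pyaudio.paInt16, every two bytes encode one signed 16-bit sample. As a sketch, the stdlib array module can decode such a buffer; here I use a hand-made byte string rather than real microphone data, and assume a little-endian machine:

```python
from array import array

# Three hand-crafted little-endian 16-bit samples: 0, 32767 (max), -32768 (min).
buf = b"\x00\x00\xff\x7f\x00\x80"

samples = array("h", buf)  # "h" = signed 16-bit integers, matching paInt16
print(list(samples))       # → [0, 32767, -32768]
```

Decoding like this is handy if you ever want to compute volume levels or do silence detection on the recorded frames.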

Now, let's store this object in a wav file to confirm that it indeed holds 3 seconds' worth of recording. To do so, let's import the built-in wave module:

>>> import wave

Let's see what the available attributes for this object are:

>>> for attr in dir(wave):
...   if not attr.startswith("_"):
...     print(attr)
...
ChunkError
WAVE_FORMAT_PCM
Wave_read
Wave_write
audioop
builtins
namedtuple
open
struct
sys

As you may guess, we are going to use the open function to open a file in write mode.

>>> wave_file = wave.open("sound.wav", "wb")

Similarly, let's see all the attributes for this object:

>>> for attr in dir(wave_file):
...   if not attr.startswith("_"):
...     print(attr)
...
close
getcompname
getcomptype
getframerate
getmark
getmarkers
getnchannels
getnframes
getparams
getsampwidth
initfp
setcomptype
setframerate
setmark
setnchannels
setnframes
setparams
setsampwidth
tell
writeframes
writeframesraw

Since we are going to write into a file, we have to use either writeframes or writeframesraw. If you go to the official documentation, you will see that the writeframes function has more logic involved than writeframesraw: it makes sure the number of frames recorded in the file header matches what was actually written. Thus, we will use this function for this tutorial.

But first, we need to set some parameters for the wave_file object:

>>> wave_file.setnchannels(1)
>>> wave_file.setsampwidth(py_audio.get_sample_size(pyaudio.paInt16))
>>> wave_file.setframerate(44100)

Now, everything is set up; you can write the stream of data into the file:

>>> wave_file.writeframes(b"".join(frames))
>>> wave_file.close()
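To double-check that a file written this way really holds the expected amount of audio, you can reopen it and recompute the duration from its header. Here is a minimal sketch that uses synthetic silence instead of microphone data (the file name silence.wav is arbitrary):

```python
import wave

RATE = 44100

# Write 3 seconds of silence with the same parameters used above.
with wave.open("silence.wav", "wb") as wf:
    wf.setnchannels(1)    # mono, matching the recording stream
    wf.setsampwidth(2)    # 2 bytes per sample (16-bit PCM)
    wf.setframerate(RATE)
    wf.writeframes(b"\x00\x00" * RATE * 3)  # one zero sample per frame

# Reopen the file and derive the duration from the header.
with wave.open("silence.wav", "rb") as wf:
    duration = wf.getnframes() / wf.getframerate()

print(duration)  # → 3.0
```

The same two lines at the end work on the sound.wav recorded from the microphone.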

Having experimented with the wave and pyaudio modules, let's put it all together.

Putting it All Together

Go To TOC.

There are two approaches for bundling the previous code together: functional programming or object-oriented programming.

Functional Programming
Go To TOC.
import wave
from typing import List, Optional, TypeVar, Union, IO

import pyaudio  # type: ignore

WaveWrite = TypeVar("WaveWrite", bound=wave.Wave_write)


def init_recording(
    file_name: Union[str, IO[bytes]] = "sound.wav", mode: Optional[str] = "wb"
) -> WaveWrite:
    wave_file = wave.open(file_name, mode)
    wave_file.setnchannels(2)
    wave_file.setsampwidth(2)
    wave_file.setframerate(44100)
    return wave_file


def record(wave_file: WaveWrite, duration: Optional[int] = 3) -> None:
    py_audio = pyaudio.PyAudio()
    audio_stream = py_audio.open(
        rate=44100,              # frames per second
        channels=2,              # stereo, change to 1 if you want mono
        format=pyaudio.paInt16,  # 16-bit int sample format (the constant's value is 8)
        input=True,              # input device flag
        frames_per_buffer=1024,  # 1024 samples per frame
    )
    frames = []
    for _ in range(int(44100 / 1024 * duration)):
        data = audio_stream.read(1024)
        frames.append(data)
    wave_file.writeframes(b"".join(frames))
    audio_stream.close()


if __name__ == "__main__":
    wave_file = init_recording()  # type: ignore
    record(wave_file)
    wave_file.close()
Object Oriented Approach
Go To TOC.

As described in the docstrings below, I assumed that each field of the AudioRecorder class is private by default and only accessible through getters and setters. In Python, it is not mandatory to use getters and setters, but I like this approach because I used to code in statically typed languages, mainly C# and Java.

Notice the use of the magic __attrs_post_init__ method, which sets the wave_file attribute at instantiation time, right after __init__ runs. I also used type hinting, as you can tell. In Python, none of this is required, yet it is still an option. The __init__ is automatically generated by the attrs module (notice the @define decorator on the class and the field assignment on each attribute).

This snippet of code was adapted from the audio_record module of the deepwordle project.

import os
import wave
from os import PathLike
from typing import IO, List, Optional, TypeVar, Union

import pyaudio  # type: ignore
from attrs import define, field

WaveWrite = TypeVar("WaveWrite", bound=wave.Wave_write)

BASE_DIR = os.path.dirname(os.path.abspath(__file__))


@define
class AudioRecorder:
    """
    A brief encapsulation of an audio recorder object's attributes and methods.

    All fields are assumed to be private by default, and only accessible through
    getters/setters, but someone could still hack his/her way around it!

    Attrs:
        frames_per_buffer: An integer indicating the number of frames per buffer;
            1024 frames/buffer by default.
        audio_format: An integer that represents the sample format: 16-bit
            signed ints by default.
        channels: An integer indicating how many channels a microphone has.
        rate: An integer indicating how many samples per second: frequency.
        py_audio: pyaudio instance.
        data_stream: stream object to get data from the microphone.
        wave_file: wave class instance.
        mode: file object mode.
        file_name: file name to store audio data in.
    """

    _frames_per_buffer: int = field(init=True, default=1024)
    _audio_format: int = field(init=True, default=pyaudio.paInt16)
    _channels: int = field(init=True, default=1)
    _rate: int = field(init=True, default=44100)
    _py_audio: pyaudio.PyAudio = field(init=False, factory=pyaudio.PyAudio)
    _data_stream: IO[bytes] = field(init=False, default=None)
    _wave_file: wave.Wave_write = field(init=False, default=None)
    _mode: str = field(init=True, default="wb")
    _file_name: Union[str, PathLike[str]] = field(init=True, default="sound.wav")

    @property
    def frames_per_buffer(self) -> int:
        """
        A getter method that returns the value of the `frames_per_buffer` attribute.

        :param self: Instance of the class.
        :return: An integer that represents the value of the `frames_per_buffer` attribute.
        """
        if not hasattr(self, "_frames_per_buffer"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named frames_per_buffer."
            )
        return self._frames_per_buffer

    @frames_per_buffer.setter
    def frames_per_buffer(self, value: int) -> None:
        """
        A setter method that changes the value of the `frames_per_buffer` attribute.

        :param value: An integer that represents the value of the `frames_per_buffer` attribute.
        :return: NoReturn.
        """
        setattr(self, "_frames_per_buffer", value)

    @property
    def audio_format(self) -> int:
        """
        A getter method that returns the value of the `audio_format` attribute.

        :param self: Instance of the class.
        :return: An integer that represents the value of the `audio_format` attribute.
        """
        if not hasattr(self, "_audio_format"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named audio_format."
            )
        return self._audio_format

    @audio_format.setter
    def audio_format(self, value: int) -> None:
        """
        A setter method that changes the value of the `audio_format` attribute.

        :param value: An integer that represents the value of the `audio_format` attribute.
        :return: NoReturn.
        """
        setattr(self, "_audio_format", value)

    @property
    def channels(self) -> int:
        """
        A getter method that returns the value of the `channels` attribute.

        :param self: Instance of the class.
        :return: An integer that represents the value of the `channels` attribute.
        """
        if not hasattr(self, "_channels"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named channels."
            )
        return self._channels

    @channels.setter
    def channels(self, value: int) -> None:
        """
        A setter method that changes the value of the `channels` attribute.

        :param value: An integer that represents the value of the `channels` attribute.
        :return: NoReturn.
        """
        setattr(self, "_channels", value)

    @property
    def rate(self) -> int:
        """
        A getter method that returns the value of the `rate` attribute.

        :param self: Instance of the class.
        :return: An integer that represents the value of the `rate` attribute.
        """
        if not hasattr(self, "_rate"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named rate."
            )
        return self._rate

    @rate.setter
    def rate(self, value: int) -> None:
        """
        A setter method that changes the value of the `rate` attribute.

        :param value: An integer that represents the value of the `rate` attribute.
        :return: NoReturn.
        """
        setattr(self, "_rate", value)

    @property
    def py_audio(self) -> pyaudio.PyAudio:
        """
        A getter method that returns the value of the `py_audio` attribute.

        :param self: Instance of the class.
        :return: A PyAudio object that represents the value of the `py_audio` attribute.
        """
        if not hasattr(self, "_py_audio"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named py_audio."
            )
        return self._py_audio

    @py_audio.setter
    def py_audio(self, value: pyaudio.PyAudio) -> None:
        """
        A setter method that changes the value of the `py_audio` attribute.

        :param value: A PyAudio object that represents the value of the `py_audio` attribute.
        :return: NoReturn.
        """
        setattr(self, "_py_audio", value)

    @property
    def data_stream(self) -> IO[bytes]:
        """
        A getter method that returns the value of the `data_stream` attribute.

        :param self: Instance of the class.
        :return: A stream object that represents the value of the `data_stream` attribute.
        """
        if not hasattr(self, "_data_stream"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named data_stream."
            )
        return self._data_stream

    @data_stream.setter
    def data_stream(self, value: IO[bytes]) -> None:
        """
        A setter method that changes the value of the `data_stream` attribute.

        :param value: A stream object that represents the value of the `data_stream` attribute.
        :return: NoReturn.
        """
        setattr(self, "_data_stream", value)

    @property
    def wave_file(self) -> wave.Wave_write:
        """
        A getter method that returns the value of the `wave_file` attribute.

        :param self: Instance of the class.
        :return: A Wave_write object that represents the value of the `wave_file` attribute.
        """
        if not hasattr(self, "_wave_file"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named wave_file."
            )
        return self._wave_file

    @wave_file.setter
    def wave_file(self, value: wave.Wave_write) -> None:
        """
        A setter method that changes the value of the `wave_file` attribute.

        :param value: A Wave_write object that represents the value of the `wave_file` attribute.
        :return: NoReturn.
        """
        setattr(self, "_wave_file", value)

    @property
    def file_name(self) -> Union[str, PathLike[str]]:
        """
        A getter method that returns the value of the `file_name` attribute.

        :param self: Instance of the class.
        :return: A string that represents the value of the `file_name` attribute.
        """
        if not hasattr(self, "_file_name"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named file_name."
            )
        return self._file_name

    @file_name.setter
    def file_name(self, value: Union[str, PathLike[str]]) -> None:
        """
        A setter method that changes the value of the `file_name` attribute.

        :param value: A string that represents the value of the `file_name` attribute.
        :return: NoReturn.
        """
        setattr(self, "_file_name", value)

    @property
    def mode(self) -> str:
        """
        A getter method that returns the value of the `mode` attribute.

        :param self: Instance of the class.
        :return: A string that represents the value of the `mode` attribute.
        """
        if not hasattr(self, "_mode"):
            raise AttributeError(
                f"Your {self.__class__.__name__!r} instance has no attribute named mode."
            )
        return self._mode

    @mode.setter
    def mode(self, value: str) -> None:
        """
        A setter method that changes the value of the `mode` attribute.

        :param value: A string that represents the value of the `mode` attribute.
        :return: NoReturn.
        """
        setattr(self, "_mode", value)

    def __repr__(self) -> str:
        attrs: dict = {
            "frames_per_buffer": self.frames_per_buffer,
            "audio_format": self.audio_format,
            "channels": self.channels,
            "rate": self.rate,
            "py_audio": repr(self.py_audio),
            "data_stream": self.data_stream,
            "wave_file": repr(self.wave_file),
            "mode": self.mode,
            "file_name": self.file_name,
        }
        return f"{self.__class__.__name__}({attrs})"

    def __attrs_post_init__(self) -> None:
        wave_file = wave.open(os.path.join(BASE_DIR, self.file_name), self.mode)
        wave_file.setnchannels(self.channels)
        wave_file.setsampwidth(self.py_audio.get_sample_size(self.audio_format))
        wave_file.setframerate(self.rate)
        self.wave_file = wave_file
        del wave_file

    def record(self, duration: int = 3) -> None:
        self.data_stream = self.py_audio.open(
            format=self.audio_format,
            channels=self.channels,
            rate=self.rate,
            input=True,
            output=True,
            frames_per_buffer=self.frames_per_buffer,
        )
        frames: List[bytes] = []
        num_frames: int = int(self.rate / self.frames_per_buffer * duration)
        for _ in range(num_frames):
            data = self.data_stream.read(self.frames_per_buffer)
            frames.append(data)
        self.wave_file.writeframes(b"".join(frames))

    def stop_recording(self) -> None:
        if self.data_stream:
            self.data_stream.close()
            self.py_audio.terminate()
            self.wave_file.close()


if __name__ == "__main__":
    rec = AudioRecorder()
    print(rec)
    rec.record()
    rec.stop_recording()

Deepgram Python SDK

Go To TOC.

Let's go back to our REPL and start playing with the deepgram SDK.

We will start by importing the deepgram module and then instantiating a Deepgram instance.

>>> from deepgram import Deepgram
>>> for attr in dir(Deepgram):
...   if not attr.startswith("_"):
...     print(attr)
...
keys
projects
transcription
usage

As you can see, there are four main attributes in the Deepgram class. Using Deepgram, you can transcribe pre-recorded audio or live audio streams like BBC radio. You can follow along with the README file for information on setting up a Deepgram account and getting things started. Once you have an API key, you can interact with the API to do the transcription; you need to store the key in an environment variable for the following code to run successfully:

$ export DEEPGRAM_API_KEY="XXXXXXXXX"
from deepgram import Deepgram  # type: ignore
import asyncio
import os
from os import PathLike
from typing import Union, IO


async def transcribe(
    file_name: Union[Union[str, bytes, PathLike[str], PathLike[bytes]], int]
):
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/wav"}
        response = await deepgram.transcription.prerecorded(source)
        return response["results"]["channels"][0]["alternatives"][0]["words"]


if __name__ == "__main__":
    try:
        deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        words = loop.run_until_complete(transcribe("sound.wav"))
        string_words = " ".join(
            word_dict.get("word") for word_dict in words if "word" in word_dict
        )
        print(f"You said: {string_words}!")
        loop.close()
    except AttributeError:
        print("Please provide a valid `DEEPGRAM_API_KEY`.")

The above script will generate the following if the audio file contains only the words "hello" and "world":

You said: hello world!
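To see why transcribe drills into the response with that chain of keys, here is a simplified, hypothetical sketch of the nested payload it navigates. Only the nesting mirrors the access path used above; the timing and confidence values are made up for illustration, so treat this as a rough shape rather than the exact API contract:

```python
# Hypothetical, simplified response shape -- only the nesting mirrors the
# access path used in `transcribe` above; the values are invented.
sample_response = {
    "results": {
        "channels": [  # one entry per audio channel
            {
                "alternatives": [  # candidate transcriptions, best first
                    {
                        "transcript": "hello world",
                        "words": [
                            {"word": "hello", "start": 0.31, "end": 0.58, "confidence": 0.99},
                            {"word": "world", "start": 0.62, "end": 0.94, "confidence": 0.98},
                        ],
                    }
                ]
            }
        ]
    }
}

# Same extraction logic as the script above: channel 0, best alternative.
words = sample_response["results"]["channels"][0]["alternatives"][0]["words"]
print("You said: " + " ".join(w["word"] for w in words if "word" in w) + "!")  # You said: hello world!
```

Each word dictionary carries more than the text itself, which is why the script filters with `if "word" in word_dict` before joining.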

Connecting Pyaudio and Deepgram

Go To TOC.

import wave
from typing import List, Optional, TypeVar, Union, IO
import pyaudio  # type: ignore
from deepgram import Deepgram  # type: ignore
import asyncio
import os
from os import PathLike

WaveWrite = TypeVar("WaveWrite", bound=wave.Wave_write)


def init_recording(
    file_name: Union[str, IO[bytes]] = "sound.wav", mode: Optional[str] = "wb"
) -> WaveWrite:
    wave_file = wave.open(file_name, mode)
    wave_file.setnchannels(2)
    wave_file.setsampwidth(2)
    wave_file.setframerate(44100)
    return wave_file


def record(wave_file: WaveWrite, duration: Optional[int] = 3) -> None:
    py_audio = pyaudio.PyAudio()
    audio_stream = py_audio.open(
        rate=44100,  # frames per second
        channels=2,  # stereo, change to 1 if you want mono
        format=8,  # sample format, 8 == pyaudio.paInt16 (16-bit samples)
        input=True,  # input device flag
        frames_per_buffer=1024,  # 1024 samples per frame
    )
    frames = []
    for _ in range(int(44100 / 1024 * duration)):
        data = audio_stream.read(1024)
        frames.append(data)
    wave_file.writeframes(b"".join(frames))
    audio_stream.close()


async def transcribe(file_name: Union[Union[str, bytes, PathLike[str], PathLike[bytes]], int]):
    with open(file_name, "rb") as audio:
        source = {"buffer": audio, "mimetype": "audio/wav"}
        response = await deepgram.transcription.prerecorded(source)
        return response["results"]["channels"][0]["alternatives"][0]["words"]


if __name__ == "__main__":
    # start recording
    print("Python is listening...")
    wave_file = init_recording()  # type: ignore
    record(wave_file)
    wave_file.close()
    # start transcribing
    deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    words = loop.run_until_complete(transcribe("sound.wav"))
    string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
    print(f"You said: {string_words}!")
    loop.close()

Handle Exceptions

Go To TOC.

Now, we need to handle errors to make our app more user-friendly, using try-except blocks to catch expected exceptions instead of letting our program crash. The first error occurs when your DEEPGRAM_API_KEY is invalid, which causes the program to raise an Unauthorized exception.

    try:
        # start recording
        print("Python is listening...")
        wave_file = init_recording()  # type: ignore
        record(wave_file)
        wave_file.close()
        # start transcribing
        deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        words = loop.run_until_complete(transcribe("sound.wav"))
        string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
        print(f"You said: {string_words}!")
        loop.close()
    except Exception:
        print("Unauthorized user. Please provide a valid `DEEPGRAM_API_KEY` value")

We can build a loop to record and transcribe speech indefinitely until a condition is satisfied, in this case until you say "stop".

    try:
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        while True:
            wave_file = init_recording()  # type: ignore
            print("Python is listening...")
            record(wave_file)
            wave_file.close()
            # start transcribing
            deepgram = Deepgram(os.environ.get("DEEPGRAM_API_KEY"))
            words = loop.run_until_complete(transcribe("sound.wav"))
            string_words = " ".join(word_dict.get("word") for word_dict in words if "word" in word_dict)
            print(f"You said: {string_words}!")
            if string_words == "stop":
                print("Goodbye!")
                break
        loop.close()
    except Exception:
        print("Unauthorized user. Please provide a valid `DEEPGRAM_API_KEY` value")

The performance of this program is bound by I/O: recording blocks while reading from the audio stream, and transcription waits on the network round trip. We will improve this version in the upcoming articles of this speech recognition series.
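As a rough sketch of one possible improvement: since record blocks on the audio stream while transcribe awaits the network, the two waits could be overlapped by running the blocking recording in a worker thread with asyncio.to_thread while the previous clip is being transcribed. The functions below are stand-ins that simulate the waits with sleeps, not the real PyAudio or Deepgram calls:

```python
import asyncio
import time


def record_clip(duration: float) -> bytes:
    # Stand-in for the blocking PyAudio recording loop (stream.read calls).
    time.sleep(duration)
    return b"fake-audio-bytes"


async def transcribe_clip(clip: bytes, latency: float) -> str:
    # Stand-in for the awaitable Deepgram network call.
    await asyncio.sleep(latency)
    return "hello world"


async def main() -> str:
    # Record the next clip in a worker thread while the previous clip is
    # being transcribed, so the two I/O waits overlap instead of running
    # back to back (~0.2s total here instead of ~0.4s sequentially).
    next_clip, text = await asyncio.gather(
        asyncio.to_thread(record_clip, 0.2),
        transcribe_clip(b"previous-clip", 0.2),
    )
    return text


print(asyncio.run(main()))  # hello world
```

asyncio.to_thread requires Python 3.9+; on older versions, loop.run_in_executor achieves the same effect.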

Wrapping Up

Go To TOC.

In this article, we explored the history of speech recognition, and we learned how to use the Deepgram Python SDK for speech recognition and PyAudio for audio recording. There is a lot more you can do with these libraries, which is beyond the scope of this article. Keep in mind that we can improve our project to stream audio from the microphone directly, without writing to a wave file, with the help of WebSockets; that is the subject of future articles. We can also build a voice-controlled search engine on top of this. I suggest playing around with the webbrowser module to find even more exciting implementation ideas. We will be working on these kinds of projects throughout the upcoming articles in this series.

As always, this article is a gift to you, and you can share it with whomever you like or use it in any way that would be beneficial to your personal and professional development. Thank you in advance for your ultimate support!

Happy Coding, folks; see you in the next one.

[0] Wikipedia. Deep learning.

[1] Wikipedia. Geoffrey Hinton.

[2] William F. Katz, 2016. What Produces Speech: Your Speech Anatomy, Phonetics For Dummies.

[3] Jacquelyn Cafasso, 2019. What Part of the Brain Controls Speech?, healthline.

[4] Steven W. Smith, in Digital Signal Processing: A Practical Guide for Engineers and Scientists, 2003.

[5] Sam Lawson, 2018, Bell-Laboratories-invented-Audrey, ClickZ.

[6] Pioneering Speech Recognition, IBM.

[7] IBM Cloud Education, 2020, What is Speech Recognition.

[8] Project Veritas, 2022, Military Documents about Gain of Function contradict Fauci testimony under oath, Youtube.

[9] Sebastian Macke, Software Automatic Mouth - Tiny Speech Synthesizer, Github.

[10] Dimitrakakis, Christos & Bengio, Samy. (2011). Phoneme and Sentence-Level Ensembles for Speech Recognition EURASIP J. Audio, Speech and Music Processing. 2011. 10.1155/2011/426792.

[11] Ed Grabianowski, How Speech Recognition Works.

[12] News from Google, 2008, New Version of Google Mobile App for iPhone, now with Voice Search.

[13] Deepgram, Different Environments Call for Different Speech Recognition Models.

[14] Deepgram, High Accuracy for Better Speech Analysis.

[15] Deepgram, WHY DEEPGRAM: Enterprise audio is complex Your ASR doesnt have to be.

[16] Deepgram, Every customer. Heard and understood.


Original Link: https://dev.to/wiseai/lets-make-python-listen-with-pyaudio-and-deepgram-sdk-part-1-5d6l
