Playing around with PolyglotDB


Overview

The past few blog posts have focused on some of the updates I've done recently to the Montreal Forced Aligner, a tool I made while doing a postdoc at McGill with Morgan Sonderegger and Michael Wagner. The other primary tool from that work is PolyglotDB (and the web interface ISCAN), a tool for analyzing force-aligned corpora and extracting automated measures, particularly from large corpora. I've recently gotten it working with the latest versions of its backend databases, Neo4j and InfluxDB, and thought it might be neat to do some super basic demonstrations. I also have a collection of headsets and microphones now, so what I've done this time is record a couple of different passages with the different microphones to see whether their quality affects some acoustic measurements.

Dataset

The text materials come from two sources. The first is the North Wind and the Sun Aesop fable that has been used in some phonetic research (I remember Molly Babel and Phoebe Wong doing a project that used it), and the second is a reading passage from Fridland and Macrae (2008).

The microphone setups that I have are:

With each of those, I recorded both passages in two recording conditions: one with no background noise (though I live on the second floor, so there's likely some traffic noise), and one with my AC on in the background. For the recording setup, I just used the Windows voice recorder app, or the stock Android voice recorder on my phone. So that's 24 recordings all told (6 microphones, 2 passages, 2 noise conditions). The North Wind recordings are around 35 seconds, and the Fridland passage ones are around a minute and a half. If there was a major speech error, I would start over (mostly because I would accidentally add additional swear words), but minor speech errors or restarts were kept.

The recordings were force-aligned with MFA using the english_ipa acoustic model and corresponding dictionary. I didn't touch the transcripts or add words for any errors or restarts. I did some very basic spot-checking, and the TextGrids looked good enough to me. Anecdotally, the microphone in the Bose headset is the worst, since the product's primary purpose is noise-cancelling headphones rather than a quality headset, and a number of friends have reacted negatively to its quality compared to the HyperX headsets. The HyperX microphones tend to be on the quieter side, so I have some preamp software set up to boost the gain.

You can download the sound files and transcriptions here if you want to play around with them.

Analysis

The scripts I'm using are based on the SPADE scripts for analyzing dialects of English. I'm going to talk about vowel measures, some sibilant measures, and pitch tracking measures.

Vowel space

The vowel formant measurement script uses the refinement algorithm based on FAVE-extract and modified by Jeff Mielke during the SPADE project. The basic idea is that you get some distributional information about each vowel category overall, and use it to make a more informed selection among formant track candidates on a per-token basis, minimizing the distance to the vowel prototypes (there's a rough sketch of this step after the list below). These prototypes don't have to be generated on the fly, and there are a number available in the SPADE repository. For the initial analysis, I'm working with 4 versions of the algorithm:

  • base: generate prototypes on a first pass and use them to generate the final measures
  • spade-Buckeye: Use prototypes from the Buckeye Corpus (relabelling the Buckeye phoneset into IPA)
  • spade-SantaBarbara: Use prototypes from the Santa Barbara corpus (relabelling the phoneset into IPA)
  • spade-SOTC: Use prototypes from the Sounds of the City corpus (relabelling the phoneset into IPA)
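
To give a sense of what that per-token refinement step is doing, here's a rough numpy sketch of the candidate selection. This isn't PolyglotDB's actual code; the function name, feature layout, and prototype format are just placeholders for illustration.

```python
import numpy as np

def pick_best_candidate(candidates, prototype_mean, prototype_cov):
    """Choose the formant measurement closest to a vowel prototype.

    candidates: array of shape (n_candidates, n_features), e.g. one row of
        [F1, F2, F3, B1, B2, B3] per analysis setting (different numbers of
        formants / ceiling frequencies).
    prototype_mean / prototype_cov: mean vector and covariance matrix for
        this vowel category, either from a first pass over the corpus
        (the "base" condition) or from an external prototype file
        (the spade-* conditions).
    """
    inv_cov = np.linalg.inv(prototype_cov)
    diffs = candidates - prototype_mean
    # Mahalanobis-style distance from each candidate to the vowel prototype
    dists = np.sqrt(np.einsum('ij,jk,ik->i', diffs, inv_cov, diffs))
    return candidates[np.argmin(dists)]
```

Each vowel token gets measured several times under different analysis settings, and the candidate closest to the category's distribution wins.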

The reason behind these selections is their range of dialectal similarity to my own speech. My dialect is Western American: early years split between Boulder, Colorado and Seattle, Washington, grad school in Vancouver, BC, and a postdoc in Montreal, QC. So my dialect is West Coast, with some Canadian exemplars sprinkled in. Of the three external sources above, the Santa Barbara corpus should be the closest to my dialect (though not exactly, since California has its own peculiarities), with Buckeye a little less similar, but still an American English dialect. The Sounds of the City corpus is made up of Glaswegian speakers, so their dialect should be quite different from mine.

My initial pass for generating the formants had each microphone treated as an individual speaker, as shown below. This disadvantages the base condition the most, because the number of examples available to form mean and standard deviation estimates is much lower. In particular, it struggles with /u/ tokens from the Bose headset, primarily due to consistent issues in first formant tracking. Note as well that spade-SOTC struggles with the /ʊ/ tokens, often placing them very fronted. The two American dialect sources show pretty good clustering.

Vowel spaces generated using per-microphone speakers in refinement

As a fairer comparison, I grouped all the microphones together and re-ran the analysis. They were all recorded by me, after all! The plot below shows much better performance for the base condition. Across the board, the phones cluster more tightly in these cases, since all the conditions benefit from having more data to create better per-speaker distributions following the initial fitting based on the prototypes.

Vowel spaces generated using a single speaker in refinement

Astute readers (looking at you, Kaylynn!) might have noticed that there are /ɔ/ vowels distinct from /ɑ/, which doesn't make sense since my dialect is a merged one. In general, these /ɔ/ vowels are from NORTH and FORCE words, where the distinction remains. MFA did select the /ɔ/ pronunciation variant for some words that would be merged, and you can see those words in the plot below mixed in with the /ɑ/ words.

Distribution of individual tokens with their word label from the "no noise" recording condition for both passages on the HyperX Cloud 2 wired headset

So generally speaking, formant tracking is not too bad here, and this is with a relatively low count of tokens (about 400-450 total vowel tokens per microphone setup). Even using completely different dialect prototypes to seed the initial formant estimation lets it recover somewhat from very sparse data.

Sibilants

In addition to the vowel scripts from SPADE, I also ran the sibilant analysis. Somewhat surprisingly, the various microphones did pretty well on center of gravity, except for the Predator laptop, where the measure for /ʃ/ approaches that of /s/.

Sibilant center of gravity estimation across recording setups
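
For reference, spectral center of gravity here is just the power-weighted mean frequency of the spectrum over a sibilant interval. Below is a minimal sketch of the measure (not the SPADE script itself); it assumes a mono wav file and interval times taken from the alignment.

```python
import numpy as np
from scipy.io import wavfile

def center_of_gravity(wav_path, start, end):
    """Spectral center of gravity (Hz) for one sibilant interval.

    Power-weighted mean frequency over the interval; start and end are in
    seconds. Assumes a mono recording.
    """
    sr, signal = wavfile.read(wav_path)
    segment = signal[int(start * sr):int(end * sr)].astype(float)
    windowed = segment * np.hamming(len(segment))

    power = np.abs(np.fft.rfft(windowed)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / sr)   # bin frequencies (Hz)
    return np.sum(freqs * power) / np.sum(power)
```

Since the measure is weighted by the power spectrum, it's sensitive to how much high-frequency energy a given microphone actually passes through, which is presumably what separates the setups here.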

I was tempted to look into more sophisticated measures, like multitaper spectral analysis, for analyzing the sibilants, but it's a bit outside the scope for the current post.

Pitch

The pitch measures here aren't in SPADE, but they come from some work on intonational contours with Michael Wagner. The setup for this section uses the speaker-adjusted pitch algorithm with two sources, Praat and Reaper. The speaker adjustment algorithm is similar to the formant refinement above: we do an initial pass with a broad baseline pitch range (which will likely produce some errors) to estimate a good range for the speaker, and then generate the actual pitch tracks with that better range. See the appendix for a more detailed look at the performance gains that speaker adaptation brings. As with the formant refinement algorithm above, the speaker adaptation benefits from larger amounts of data on the same speaker, so all the pitch analyses use the single-speaker setup.

Pitch tracks across the recordings of the North Wind and the Sun passage

In general, pitch measurements from Reaper are more consistent than those from Praat. Praat gives occasional spurious high-frequency pitch estimates, though the HyperX microphones perform pretty well with both pitch programs. This is consistent with anecdotal reports where friends have commented on poor quality from the Bose microphone compared to the HyperX ones. A similar story shows up below for the Fridland passages.

Pitch tracks across the recordings of the Fridland passage

Taking a closer look at one of the recordings below, we can see generally pretty good results. The Reaper algorithm tends to be a bit more conservative in its voiced/unvoiced decisions, so there are several places where Praat estimates pitch without a corresponding Reaper estimate. Conversely, there are also places where Reaper reports pitch and Praat does not, particularly in creaky voice situations. For example, around second 23, the glottal stop for "at" causes creaky voice: Reaper reports a dramatic dip in pitch, but Praat just gives up.

Pitch track detail with word alignments from a single North Wind and the Sun recording (hyperx_wired_noise)

In general, Reaper has better performance across the full dataset and recording setups. It's a bit harder to set up and compile than Praat, but I don't think the choice between the two will make a meaningful difference at larger scales of data.

Summing up

I wanted to take a look at PolyglotDB for a couple of reasons. The first was to motivate the update in database versions and make sure it's still a functional tool and not abandonware. The second was to see what it brings to the table for smaller datasets. The recordings here are a tiny fraction of the size of the data that Polyglot was built for in the SPADE project, and I did worry that the benefits that Polyglot brings wouldn't be apparent. Using each microphone as a different speaker does reveal that to a certain extent: with less than 10 minutes of speech per "speaker" and only about 25 minutes of speech overall, we do run into some issues. The formant refinement and speaker-adapted pitch algorithms do help combat this and give some very reasonable results for our tiny speech corpus.

Appendix

The plot below shows one particularly bad recording (using the Bose headset) where Praat has a lot of errors. This plot doesn't use the speaker adaptation algorithm, as a way to showcase the performance gains from it. The default settings in PolyglotDB (similar to Praat itself) are pretty forgiving, with a minimum pitch of 50 Hz and a maximum pitch of 500 Hz.

Pitch track for a single recording (bose_no_noise) using the base algorithm rather than the speaker adaptation

We can see some pretty horrendous pitch tracking errors (no way my voice is making it up to 300 Hz, let alone 500 Hz). Most likely there's a fair bit of creaky voice that's messing up the tracking. The speaker adaptation algorithm generates a first pass using the base settings above, calculates the speaker's mean pitch, and then uses a two-octave range around it (one octave below to one above) as a more likely range for that speaker. The results for this run are below.

Pitch track for a single recording (bose_no_noise) using the speaker adaptation algorithm

You can see that we've cleaned up a lot of the big errors, and the ones that remain are smaller (I'm much more likely to get close to 250 Hz than to 500 Hz). In general, Reaper performs better than Praat just out of the box, but the speaker adaptation brings Praat pretty close.
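
For a rough illustration of what that adaptation step looks like, here's a small sketch using parselmouth to drive Praat. This isn't PolyglotDB's actual implementation, just the two-pass idea with the same numbers as above.

```python
import parselmouth

def speaker_adapted_pitch(wav_path, floor=50.0, ceiling=500.0):
    """Two-pass pitch tracking: estimate the speaker's range, then re-track."""
    sound = parselmouth.Sound(wav_path)

    # First pass with the permissive default range; this is the error-prone one.
    first_pass = sound.to_pitch(pitch_floor=floor, pitch_ceiling=ceiling)
    f0 = first_pass.selected_array['frequency']
    mean_f0 = f0[f0 > 0].mean()  # unvoiced frames come back as 0 Hz

    # Second pass restricted to a two-octave window around the speaker's mean
    # (one octave below to one octave above).
    return sound.to_pitch(pitch_floor=mean_f0 / 2, pitch_ceiling=mean_f0 * 2)
```

In practice you'd estimate the mean over all of a speaker's recordings rather than a single file, which is exactly why the single-speaker grouping helps here.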