Update on Montreal Forced Aligner performance


Overview

Version 1.0 of the Montreal Forced Aligner was released on May 18, 2017, with the Interspeech paper appearing that year as well. In the four years since, it has changed considerably, and with the eventual release of 2.0 coming once it feels a bit more stable, I thought I'd take a moment to analyze some performance experiments that I've been playing around with.

In addition to features internal to MFA, I also want to take a look at how its performance compares to FAVE and MAUS, the other two standards in the field. At some point I'd like to get LaBB-CAT's aligner set up as well, but that's a bit more involved than the other two, and it leverages the same pretrained acoustic model as FAVE. If there are other aligner systems out there that I should benchmark, let me know.

Methodology

Dataset

The benchmark dataset is derived from the Buckeye corpus, the same as in the 2017 Interspeech paper: utterances are extracted, along with their text transcriptions, into small chunks of audio that are fed into the respective aligners.

  • 33,370 extracted chunks
  • 40 speakers
  • 16.5 hours total duration

I'm expanding the Buckeye benchmark here. The word boundary and CVC datasets are the same as before, but I'm also interested in phone accuracy across the larger corpus, outside the more controlled CVC dataset. I'll cover the measures pulled from the corpus in more depth as they come up in the results section.

Procedure

Aligner ID               Phone set  Alignment procedure
FAVE                     ARPABET    align
MAUS                     X-Sampa    align
MFA English              ARPABET    align
MFA English adapted      ARPABET    adapt
MFA default train        ARPABET    train_and_align
MFA English IPA          IPA        align
MFA English IPA adapted  IPA        adapt
MFA English IPA train    IPA        train_and_align

The MFA runs represent a cross between training procedure and pronunciation dictionary. Training procedures are as follows:

  1. align: Performs an initial alignment and then computes speaker-adapted features for a second-pass alignment
  2. adapt: Performs an initial alignment with an acoustic model, followed by 3 rounds of speaker adaptation, alignment, and re-estimation of the model
  3. train_and_align: Performs the default training regime in MFA 2.0, which starts with a subset of utterances to train a monophone model, expands the subset to train a triphone model, and then performs LDA calculations and two rounds of speaker adaptation over increasingly large subsets to refine the input features.

For pronunciation dictionaries, the ARPABET dictionary is the standard one provided via MFA's download command. The IPA dictionary was created recently (see the blog post about it for more details on its creation). Additionally, I used the newly introduced multilingual_ipa mode, which applies some normalizations to the IPA pronunciations to remove variation that is unnecessary for the aligner (blog post coming soon! In the meantime, refer to the docs). In all cases of pretraining, the audio data is from LibriSpeech; only the pronunciation dictionaries differ.
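
To give a sense of what that kind of normalization can look like, here's a rough, hypothetical sketch of stripping length marks and a couple of diacritics from IPA pronunciations. This is my illustration only; the set of marks here is an assumption, and the actual multilingual_ipa behavior is whatever is in the MFA docs.

# Illustrative only: strip length marks and a couple of diacritics from IPA
# phones; which marks MFA's multilingual_ipa mode actually normalizes is
# documented in the MFA docs, so treat this set as an assumption.
STRIP_MARKS = {'ː', 'ˑ', '̆', '̯'}


def normalize_ipa(pronunciation):
    # pronunciation: a list of IPA phone strings, e.g. ['iː', 'z', 'i']
    normalized = []
    for phone in pronunciation:
        cleaned = ''.join(c for c in phone if c not in STRIP_MARKS)
        if cleaned:
            normalized.append(cleaned)
    return normalized


print(normalize_ipa(['iː', 'z', 'i']))  # ['i', 'z', 'i']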

Accuracy Results

Let's begin with the replication of the Interspeech paper. The first step is looking at word boundary errors: for every word boundary, we find the absolute distance between the manual annotation in the Buckeye corpus and the aligner-generated annotation. The plot below shows these by aligner, split by whether the boundary is next to a silence or not, as that heavily influences accuracy.

[Figure: Word boundary errors in the Buckeye corpus]
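
As a minimal sketch of how these per-boundary errors can be computed, assuming the reference and aligner word intervals have already been matched one-to-one (the Interval class is a hypothetical stand-in for TextGrid intervals, using the same attribute convention as the evaluation code later in this post, and the times are made up):

from dataclasses import dataclass


# Hypothetical stand-in for a TextGrid interval.
@dataclass
class Interval:
    minTime: float
    maxTime: float
    mark: str


def word_boundary_errors(reference_words, aligned_words):
    # Absolute differences (in seconds) at the start and end of each matched
    # word; word-internal boundaries shared by adjacent words get counted
    # from both sides here, which is a simplification of the actual analysis.
    errors = []
    for ref, test in zip(reference_words, aligned_words):
        errors.append(abs(ref.minTime - test.minTime))
        errors.append(abs(ref.maxTime - test.maxTime))
    return errors


ref = [Interval(0.00, 0.35, 'the'), Interval(0.35, 0.80, 'dog')]
hyp = [Interval(0.00, 0.33, 'the'), Interval(0.33, 0.84, 'dog')]
print([round(e, 3) for e in word_boundary_errors(ref, hyp)])  # [0.0, 0.02, 0.02, 0.04]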

MFA generally looks to be performing well! One interesting thing to highlight is that aligners trained on the Buckeye corpus itself handle boundaries next to pauses better (though still not to the same degree as within continuous speech). This benefit is likely due to the silence models being trained on the actual audio, whereas the silence models for the other MFA models are trained on LibriSpeech, which is overall less noisy.

Next up we have the CVC phone boundaries calculated in the Interspeech paper. To generate these, we took a list of 534 CVC word types in the Buckeye corpus (I believe originally from Yao Yao's dissertation), words like base, choose, gone, young, etc. For every token that matched, we first check whether the surface forms from the manual annotation and the aligner annotation both have three phones, and skip the token if they don't. This avoids instances where a consonant is deleted. Following the check, we extract 4 boundaries (see the sketch after this list):

  1. beginning of the initial consonant
  2. between the initial consonant and the vowel
  3. between the vowel and the final consonant
  4. at the end of the final consonant
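
Here's a minimal sketch of that boundary extraction for a single matched token, using hypothetical stand-in phone intervals with the same minTime/maxTime/mark attributes as the evaluation code later in the post (the times and labels below are made up):

from collections import namedtuple

# Hypothetical stand-in for a phone interval.
Phone = namedtuple('Phone', ['minTime', 'maxTime', 'mark'])


def cvc_boundaries(phones):
    # The four boundaries of a three-phone (CVC) token: onset of C1, the
    # C1-V transition, the V-C2 transition, and the offset of C2.
    if len(phones) != 3:
        raise ValueError('expected exactly three phones for a CVC token')
    c1, v, c2 = phones
    return [c1.minTime, v.minTime, c2.minTime, c2.maxTime]


def cvc_boundary_errors(reference_phones, aligned_phones):
    # Absolute differences between the four boundaries of two matched tokens.
    return [abs(r - t) for r, t in zip(cvc_boundaries(reference_phones),
                                       cvc_boundaries(aligned_phones))]


ref = [Phone(0.10, 0.18, 'b'), Phone(0.18, 0.30, 'ey'), Phone(0.30, 0.41, 's')]
hyp = [Phone(0.11, 0.20, 'B'), Phone(0.20, 0.29, 'EY1'), Phone(0.29, 0.40, 'S')]
print([round(e, 3) for e in cvc_boundary_errors(ref, hyp)])  # [0.01, 0.02, 0.01, 0.01]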

[Figure: Phone boundary error rate in selected CVC words]

We can see that the word boundaries have the highest error (also note that we're not controlling for pause boundaries here, so that's going to make these a bit messier). Another general trend is that CV transitions are more accurate than VC transitions, across aligners. Somewhat surprisingly, the pretrained IPA English model is doing as well as or better than the adapted and trained versions.

For completeness, here are summary statistics for the data in the above plots.

Aligner                  Word boundaries          Phone boundaries
                         mean (ms)   median (ms)  mean (ms)   median (ms)
FAVE                     24.8        16.6         19.2        12
MAUS                     29.6        15           20.5        10.4
MFA English              23.6        15.4         16.3        11
MFA English adapted      23.4        15           16.6        11.1
MFA default train        21.1        14.1         16.9        11.3
MFA English IPA          21.4        14.1         15.8        9.9
MFA English IPA adapted  21.6        14.3         16.4        10.3
MFA English IPA train    20.2        14           16.5        10.7
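
These summary statistics are just the mean and median over the per-boundary absolute errors, converted to milliseconds; a minimal computation with made-up error values might look like the sketch below. The fact that the means sit well above the medians reflects a long tail of large boundary errors.

import statistics

# Per-boundary absolute errors in seconds (made-up values for illustration),
# converted to milliseconds for the summary.
errors_s = [0.012, 0.008, 0.031, 0.004, 0.150, 0.022]
errors_ms = [e * 1000 for e in errors_s]

print(round(statistics.mean(errors_ms), 1))    # 37.8
print(round(statistics.median(errors_ms), 1))  # 17.0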

So that replicates the basic findings of the Interspeech paper. Note that the numbers for MFA English (labelled MFA-LS in the Interspeech paper) and MFA default train (MFA-Retrained in the Interspeech paper) have actually improved slightly from MFA 1.0 to 2.0. MFA English went from 24.1 ms to 23.6 ms mean word boundary distance, and from 17 ms to 16.3 ms mean phone boundary distance. MFA default train went from 22.6 ms to 21.1 ms for words and from 17.3 ms to 16.9 ms for phones.

So that's all cool, but the main reason I really wanted to re-explore this space is to come up with some better measures of alignment accuracy. The phone boundary measures, after all, cover just CVC words, which are a very small subset of the words in the Buckeye corpus, and they pointedly ignore pronunciation variation in order to keep things as controlled as possible. I think we can do better.

The algorithm I've put together here is still a work in progress, but I think it's pretty promising. The basic idea is that we want to align intervals in different transcripts (haha, aligning the alignments) based on how close those intervals begin and end (i.e., if they occupy the same space in the audio file, they're likely the same segment). We also want to incorporate some very basic phone-to-phone mapping between the different phone sets so that the alignment isn't based entirely on timing, because the identity of phones also matters for evaluating whether an alignment is good.

The Python code is something like this (I'll describe it more in words after the code if you want to skip that):

import re
from Bio import pairwise2
import functools

# Label comparison and gap scores; overlap_scoring negates its sum, so a
# label mismatch effectively costs 2, while opening or extending a gap
# (an inserted or deleted phone) costs 5.
MATCH_SCORE = 0
MISMATCH_SCORE = 2
GAP_START_SCORE = -5
GAP_CONTINUE_SCORE = -5


def silence_check(phone):
    # Labels treated as silence/pauses across the different aligners
    return phone in {'sp', '<p:>', '', None}


def compare_labels(ref, test, aligner_name):
    # Return MATCH_SCORE (0) if the test label corresponds to the reference
    # Buckeye label (directly, via the relevant phone set mapping, or after
    # stripping a trailing stress digit), otherwise MISMATCH_SCORE (2).
    if 'ipa' in aligner_name:
        mapping = ipa_mapping
    elif 'maus' in aligner_name:
        mapping = maus_mapping
    else:
        mapping = arpa_mapping
        if re.match(r'[a-z]+[0-9]', test):
            if ref == test[:-1].lower():
                return MATCH_SCORE
            else:
                return MISMATCH_SCORE
    if ref == test:
        return MATCH_SCORE
    if test in mapping and mapping[test] == ref:
        return MATCH_SCORE
    ref = ref.lower()
    test = test.lower()
    if ref == test:
        return MATCH_SCORE
    return MISMATCH_SCORE


def overlap_scoring(firstElement, secondElement, aligner_name):
    # Similarity score for two intervals: 0 is a perfect match, and the score
    # decreases with boundary distance and with a label mismatch.
    begin_diff = abs(firstElement.minTime - secondElement.minTime)
    end_diff = abs(firstElement.maxTime - secondElement.maxTime)
    label_diff = compare_labels(firstElement.mark, secondElement.mark, aligner_name)
    return -1 * (begin_diff + end_diff + label_diff)


def align_phones(ref, test, aligner_name):
    # Globally align the reference and test phone sequences, then return the
    # mean boundary error over matched phones along with counts of inserted
    # and deleted phones (gaps paired with silence intervals are ignored).
    ref = [x for x in ref]
    test = [x for x in test]
    score_func = functools.partial(overlap_scoring, aligner_name=aligner_name)
    alignments = pairwise2.align.globalcs(ref, test,
                                          score_func, GAP_START_SCORE, GAP_CONTINUE_SCORE,
                                          gap_char=['-'], one_alignment_only=True)
    overlap_count = 0
    overlap_sum = 0
    num_insertions = 0
    num_deletions = 0
    for a in alignments:
        for i, sa in enumerate(a.seqA):
            sb = a.seqB[i]
            if sa == '-':
                if not silence_check(sb.mark):
                    num_insertions += 1
                else:
                    continue
            elif sb == '-':
                if not silence_check(sa.mark):
                    num_deletions += 1
                else:
                    continue
            else:
                overlap_sum += abs(sa.minTime - sb.minTime) + abs(sa.maxTime - sb.maxTime)
                overlap_count += 1
    return overlap_sum / overlap_count, num_insertions, num_deletions

So what it does is use Biopython's pairwise sequence alignment algorithm with a custom match/mismatch function. The score we give to a match/mismatch is not based solely on the phone label (though a label mismatch adds a base penalty of -2, versus 0 for a match). There are three mappings referenced above (not defined there just because they're so long; I provide tables in the appendix if you're curious). These basically transform "known" or "unambiguous" phones from X-Sampa, IPA, and ARPABET into Buckeye's specific phone set. In addition to the label match, we also calculate the distance from the reference boundaries to the corresponding aligner-generated boundaries, in order to locally constrain the alignment. The last parameters are the gap scores, which specify the penalty for inserting a skip into one of the sequences, covering both insertions and deletions. A value of -5 for both seems to work pretty well for this particular case. To illustrate, here is the output from running it on a FAVE alignment:

[Figure: Example phone-to-phone alignment]

One final thing to note is that we're basically ignoring any mismatches involving silence, so failing to insert a silence is not penalized per se.
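
To make the interface concrete, here's a hypothetical usage sketch of align_phones from the code above. It assumes Biopython is installed and that the phone mappings from the appendix are defined; the intervals just need minTime/maxTime/mark attributes, the lowercase reference labels follow Buckeye conventions, the aligner name string is an arbitrary example, and the times are made up.

from collections import namedtuple

# Hypothetical stand-in for the TextGrid intervals consumed by align_phones.
Phone = namedtuple('Phone', ['minTime', 'maxTime', 'mark'])

# Reference surface phones from a manual annotation of "the cat"...
reference = [
    Phone(0.00, 0.12, 'dh'),
    Phone(0.12, 0.19, 'ah'),
    Phone(0.19, 0.31, 'k'),
    Phone(0.31, 0.47, 'ae'),
    Phone(0.47, 0.58, 't'),
]
# ...and a made-up aligner output (ARPABET labels, trailing silence).
test = [
    Phone(0.00, 0.13, 'DH'),
    Phone(0.13, 0.20, 'AH0'),
    Phone(0.20, 0.33, 'K'),
    Phone(0.33, 0.45, 'AE1'),
    Phone(0.45, 0.58, 'T'),
    Phone(0.58, 0.70, 'sp'),
]

mean_error, insertions, deletions = align_phones(reference, test, 'mfa_english')
# The trailing 'sp' pairs with a gap but is skipped by silence_check, so this
# comes out to roughly 0.024 s mean boundary error, 0 insertions, 0 deletions.
print(round(mean_error, 3), insertions, deletions)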

Onto the actual results! Well, ok, before that, a note about some removals. In analyzing the data, I found that MAUS for some reason phonemicized yknow as /i:keIkeIkeIkeI/. I'm not really sure where this is coming from, but yknow is fairly common in the Buckeye transcriptions, and this resulted in an inflated number of insertions. So for the following analysis, any utterance with yknow in it has been removed (2,575 utterances out of 33,166 total).

To begin with, let's take a look at the phone boundary errors.

[Figure: Phone error rates across the whole Buckeye corpus]

Even with the yknow utterances removed, MAUS is still having some issues here, which is a little odd because these disparities didn't show up nearly as much in the word boundary or CVC datasets. To dig a little deeper, if we look at the same data faceted by speaker, we can see that a lot of these issues are driven by a few speakers.

[Figure: Phone boundary error analysis by speaker]

In particular, s01, s13, s15, s20, s23, s24, s27, and s36 are contributing more than the others to the increased error rate. I don't know for sure what's causing the issues; it might be something about vocal quality, recording quality, or pronunciation generation that affects these speakers more than others.

Looking at the other measures of quality, we can see that MAUS performs best when it comes to insertions (when yknow isn't counted, that is). Additionally, you can see that the IPA versions of MFA insert less, which makes sense because the ARPABET versions use multiple phones for syllabic consonants (e.g., /l̩/ as in bottle is transcribed as AH0 L).

[Figure: Number of phone insertions]

Finally, in terms of deletions, these are less common than insertions overall. I think the primary reason for this is that our reference is the surface form annotations in the Buckeye corpus. These surface forms are generally quite reduced and close to the spoken form, so the ground truth is not made up of citation forms. Of note, FAVE tends to insert much more than it deletes, due to its reliance on the CMU dictionary's citation forms. MAUS, on the flip side, tends to delete more than it inserts, likely due to the post-lexical pronunciation modelling that it does. The IPA versions of MFA tend to delete more than the ARPABET versions; I haven't fully analyzed this, but my guess is that some phones still present in the Buckeye surface transcriptions get deleted in the IPA versions because they're shorter than the minimum duration for a phone (30 ms), so the cost of deleting them is lower.

[Figure: Number of phone deletions]

Summing up

MFA's not looking too shabby! It's nice to see the changes made in the transition to version 2.0 paying off in concrete improvements. Not surprisingly, the IPA dictionary shows good performance specifically on the Buckeye dataset, due to similarities in how syllabic consonants are handled. It's not as phonetic as the Buckeye phone set, but with the limitation of a 30 ms minimum on phones, it can't really quite get there anyway.

Now, of course, the huge caveat to all this is that it's English. I'd be super curious to see how performance differs across languages, but I'm not aware of many manually corrected datasets out there that could serve as additional benchmarks. If you know of any, I'd be curious to see how the methodology here stacks up.

Appendix

Benchmark performance metrics

The accuracy metrics above are the primary point of this post, but I do want to touch on the timing benchmarks done as part of this, as speed can be a consideration as well. The graph below shows how long each of the MFA runs took. Note that I didn't benchmark the FAVE or MAUS cases, since they were not really optimized for this use case, but they each took several hours to generate: FAVE about a day's worth of run time, and MAUS about 6 hours of manual upload and download via WebMAUS.

For the MFA runs, the following system was used:

  • 12 cores
  • i9-10900KF at 3.7 GHz
  • 32 GB of RAM
  • Windows installation of MFA

[Figure: MFA timing results]

The align runs are the quickest, coming in at under 15 minutes to generate all the TextGrids. The adapt runs are the next fastest, at a little over half an hour to do a couple of adaptation iterations and then align. Training unsurprisingly takes the longest at just around an hour and a half.

The next metric is the final average per-frame log-likelihood of the final alignment. While I haven't had success using this kind of metric on a per-utterance basis to figure out which files have issues like missing words, it does seem generally useful for gauging how well an alignment system is doing. Compare the figure below with the phone boundary error rates across the full corpus above. The log-likelihoods for MFA English and MFA English adapted are much lower than for the other runs, and these are exactly the runs with higher phone boundary error rates above. So this is at least a sanity check that log-likelihood can be used as a proxy measure when human annotations aren't available to compare against.

[Figure: MFA log-likelihood results]

Phone mappings

maus_mapping = {
    'dZ': 'jh',
    'Z': 'zh',
    '6': 'er',
    'tS': 'ch',
    'S': 'sh',
    'Q': 'aa',
    'A': 'aa',
    'A:': 'aa',
    'Q:': 'aa',
    '{': 'ae',
    '{:': 'ae',
    'E': 'eh',
    '@': 'ah',
    'V': 'ah',
    'V:': 'ah',
    'D': 'dh',
    'T': 'th',
    'e': 'ey',
    'e:': 'ey',
    'aU': 'aw',
    '@U': 'ow',
    'oU': 'ow',
    'eI': 'ey',
    'i:': 'iy',
    'u:': 'uw',
    'u': 'uw',
    'U': 'uh',
    'O:': 'ao',
    'O': 'ao',
    'o': 'ow',
    'o:': 'ow',
    'aI': 'ay',
    'OI': 'oy',
    'oI': 'oy',
    'n': 'en',
    'm': 'em',
    'l': 'el',
    'i': 'iy',
    'j': 'y',
    'I': 'ih',
    '?': 'tq',
    'N': 'ng',
    'h': 'hh',
}

ipa_mapping = {
    'ʔ': 'tq',
    'i': 'iy',
    'h': 'hh',
    'iː': 'iy',
    'ɡ': 'g',
    'ɚ': 'er',
    'ɝ': 'er',
    'ɝː': 'er',
    '3`': 'er',
    'dʒ': 'jh',
    'tʃ': 'ch',
    'ʒ': 'zh',
    'ɑ': 'aa',
    'ɑː': 'aa',
    'ʊ': 'uh',
    'ɛ': 'eh',
    'oʊ': 'ow',
    'aʊ': 'aw',
    'aɪ': 'ay',
    'ɔ': 'ao',
    'ɔː': 'ao',
    'ɔɪ': 'oy',
    'u': 'uw',
    'uː': 'uw',
    'æ': 'ae',
    'eɪ': 'ey',
    'ð': 'dh',
    'ʃ': 'sh',
    'ɹ': 'r',
    'j': 'y',
    'θ': 'th',
    'ə': 'ah',
    'ŋ': 'ng',
    'ʌ': 'ah',
    'n̩': 'en',
    'm̩': 'em',
    'l̩': 'el',
}

arpa_mapping = {
    'N': 'en',
    'M': 'em',
    'L': 'el',
}