One of the most frequent questions I get is how much training data you need to get a "good" alignment. My usual answer is just to throw the Montreal Forced Aligner at the corpus and see how well it does on a spot check. My other glib answer is "more is better". The more nuanced answer is that when there's a model pretrained on 20+ hours of speech similar to the variety you're trying to align, it will likely generate pretty good alignments. You can also try the recently released command for adapting pretrained models to new datasets. These solutions work for overresourced languages like English, but for fieldwork situations, they're not going to be the best.
So say you've collected some audio data of some language, transcribed it with orthography, have a pronunciation dictionary, and want to generate alignments. At what point does it make more sense to automatically align it with MFA versus a more manual process? Let's take a look.
The dataset that I'm using to evaluate the quality of alignments based on the amount of training data is the Buckeye Corpus (as in my previous post about MFA performance). So there is a large caveat here that this is North American English, collected in a sociolinguistic interview context, with generally good recording quality. Aspects like background noise, recording quality of the mic, etc. are all going to factor into the quantity of training data necessary to generate good models. Additionally, the number of hours reported below represents the duration of speech, and not necessarily the total duration, as interviewer utterances and silences are removed. The total size of the Buckeye corpus is generally reported as about 40 hours, but when including just the speakers' utterances, that drops to about 16.5 hours of speech.
The procedure I used created 195 "corpora". Between 1 and 39 speakers were included in a given subset, with 5 subsets per number of speakers. The 5 subsets per size are there to try to control for some of the variability introduced by particular speakers, since, especially for the lower numbers of speakers, which speaker is included is going to affect the quality a lot. For each of these 195 corpora, the inverse subset was generated as held-out data to test how well the model could generalize to new speakers. The durations for each of these are below.
Some speakers certainly have more data than others and including one over the other is going to cause some variation, but it's not so much that it obscures the very clear relationship between increasing the number of speakers and increasing the duration of the data. This might not be fully representative of your data, since the setup is based on speakers, so if you have the same duration of data, you might have more or fewer speakers, which could affect the quality of the alignments.
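The subset procedure described above can be sketched roughly as follows. This is a minimal illustration and not the actual script used: the function name `make_subsets`, the random sampling strategy, the seed, and the `s01`..`s40` speaker labels are all assumptions made for the example.

```python
import random

def make_subsets(speakers, per_size=5, seed=0):
    """Return (n_speakers, train, held_out) triples for every subset size.

    Hypothetical helper: random sampling is an assumption, and nothing here
    prevents the same subset from being drawn twice for a given size.
    """
    rng = random.Random(seed)
    subsets = []
    # Sizes run from 1 to len(speakers) - 1 so the held-out set is never empty.
    for n in range(1, len(speakers)):
        for _ in range(per_size):
            train = sorted(rng.sample(speakers, n))
            # The "inverse subset": every speaker not used for training.
            held_out = sorted(set(speakers) - set(train))
            subsets.append((n, train, held_out))
    return subsets

# Placeholder speaker labels for a 40-speaker corpus.
speakers = [f"s{i:02d}" for i in range(1, 41)]
subsets = make_subsets(speakers)
print(len(subsets))  # 39 sizes x 5 subsets = 195
```

With 40 speakers, sizes 1 through 39 times 5 subsets each gives the 195 corpora, and every training subset has a non-empty held-out complement.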
Some miscellaneous details about the training regime:
- Used the same procedure as MFA English IPA train in the post about MFA performance
- Used Linux installation of MFA to allow for easier use of symbolic links in creating the subset corpora
- The timings below happened over a long period of time and often I was doing other stuff while it ran in the background, so take them with a grain of salt (shout out to Roguebook which provided me with many hours of entertainment during the training and alignment runs over the past few weeks)
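The symbolic link approach mentioned in the list above might look something like this. It is a sketch under assumptions: the corpus is taken to be laid out with one directory per speaker, and the helper name `link_subset` and its arguments are made up for illustration rather than taken from the original setup.

```python
import os
from pathlib import Path

def link_subset(corpus_root, subset_root, speakers):
    """Create subset_root/<speaker> symlinks pointing at corpus_root/<speaker>.

    Hypothetical helper; assumes one directory per speaker in corpus_root.
    """
    corpus_root = Path(corpus_root).resolve()
    subset_root = Path(subset_root)
    subset_root.mkdir(parents=True, exist_ok=True)
    for speaker in speakers:
        link = subset_root / speaker
        # Skip anything already present so the function can be re-run safely.
        if link.is_symlink() or link.exists():
            continue
        os.symlink(corpus_root / speaker, link, target_is_directory=True)
    return subset_root
```

Linking instead of copying means each of the subset corpora costs almost no extra disk space, since the audio and transcripts live only in the full corpus directory.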
In general, and unsurprisingly, the more speakers in the training set, the more time MFA takes to train. There are some deviations during points when I was using my computer for other purposes in parallel, but in general each training took under an hour (with 12 parallel cores). The parallelization in MFA is based on speakers, so for the smaller subsets, it can't take as much advantage of multiprocessing.
In addition to looking at how the alignments perform on the training data, I also used the held out data for each subset to give a sense of how good the resulting model is overall, and of how generalizable the acoustic models are to other speakers of the same variety.
In general most alignments took less than 10 minutes, but for the runs with lower number of speakers in the training data, there was a pretty substantial explosion of time required to generate alignments for held out data. The reason for this can be seen below in the log-likelihoods. The log-likelihoods for training are not super useful for comparing quality because the dataset is completely different in each run, but they do give a sense of how overfitted a given model is. You can see below that log-likelihood of training peaks around 5-6 speakers, which correlates with the peak in training time above.
When looking at the log-likelihood over the held out data, we don't really see a trough in log-likelihood corresponding to the peak in overfitting. The alignments in the 5-6 speaker range aren't suffering as much as we might first think. The alignment beam in MFA is a pretty forgiving 100 with 400 as a retry beam, whereas for training this beam is much more restrictive at 10 with 40 for retry, so alignment has the ability to keep a larger range of suboptimal candidates at the cost of larger lattices. The overfitted models take advantage of the larger beam and generate larger lattices, leading to the super long processing times. The Viterbi algorithm has a time complexity of \(O(K^2T)\), where \(K\) is the number of states (constrained by the beam width) and \(T\) is the number of frames (if you're curious about the beam search algorithm, take a look here).
The overfitting is largely an issue of temporal performance, and it looks like the log-likelihood stabilizes around 10-ish speakers. Overfitting isn't really the evil here that it normally is in model training, since the goal is usually just to generate the best alignments possible on the training data. If the goal is to generate a model for ASR transcription or more widespread use, then this overfitting becomes a larger concern. To take a more in depth look, we'll look at actual alignment errors in the training data and in held out data. These are largely distinct cases: when you have a large (enough) set of target utterances to align, you care more about performance on training data, and when you distribute models, you care more about the generalizability of the model.
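To make the beam width point concrete, here is a toy beam-pruned Viterbi in pure Python. It is only a sketch of the general technique and not the decoder that MFA actually runs; the function `beam_viterbi`, the uniform transitions, and the made-up emission scores are all assumptions chosen for illustration. The beam is treated as a log-score margin: any state scoring more than `beam` below the best state in a frame is dropped.

```python
import math

def beam_viterbi(log_emit, log_trans, beam):
    """Beam-pruned Viterbi over a toy HMM (illustrative only).

    log_emit[t][k]  : log-likelihood of frame t under state k
    log_trans[j][k] : log-probability of moving from state j to state k
    beam            : states more than `beam` below the frame's best are pruned
    Returns (best final log score, active states summed over all frames).
    """
    n_states = len(log_trans)
    active = {0: log_emit[0][0]}  # assumption: decoding starts in state 0
    total_active = len(active)
    for t in range(1, len(log_emit)):
        scores = {}
        # Work per frame is (surviving states) x (all states), hence O(K^2 T).
        for j, prev in active.items():
            for k in range(n_states):
                cand = prev + log_trans[j][k] + log_emit[t][k]
                if cand > scores.get(k, -math.inf):
                    scores[k] = cand
        best = max(scores.values())
        active = {k: s for k, s in scores.items() if s >= best - beam}
        total_active += len(active)
    return max(active.values()), total_active

# Toy model: 4 states, 6 frames, uniform transitions, made-up emissions.
n_states, n_frames = 4, 6
log_trans = [[math.log(1 / n_states)] * n_states for _ in range(n_states)]
log_emit = [[-3.0 * abs(k - t % n_states) for k in range(n_states)]
            for t in range(n_frames)]

narrow_score, narrow_active = beam_viterbi(log_emit, log_trans, beam=1.0)
wide_score, wide_active = beam_viterbi(log_emit, log_trans, beam=100.0)
print(narrow_active, wide_active)
```

In this toy case both beams find the same best path, but the wide beam carries every state through every frame while the narrow one keeps only the best. That is the same trade-off as above: a forgiving beam rarely loses the right answer, but the amount of work grows with the number of surviving candidates.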
Results on training data alignment
First we'll take a look at how the alignments stack up on the training data. This will be the usual use case for fieldwork data, where you just want to generate alignments for phonetic analysis. To begin with, let's take a look at the word boundaries.
Word boundary errors show some variability towards the lower end of training speakers and stabilize around 20 speakers, but in general they oscillate close to the values for the full training set across all the subsets. Honestly though, the scale of errors is very comparable between reduced training speakers and the full corpus.
There's always quite a bit of variation in selected CVC boundaries. The boundary errors stabilize fairly quickly, with only the smallest numbers of speakers showing increased errors. The VC transition does stabilize much later, around the 27 speaker mark.
For phone boundaries across the training data, we can see that errors are generally above the full run until about 27 speakers again. However, one thing to note is that the scale of this plot is much smaller than the above, so the increase in error rate isn't too bad.
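As a rough illustration of the kind of number being plotted, a boundary error can be computed as the mean absolute difference between aligned and hand-annotated boundary times. The sketch below assumes the two boundary lists have already been paired one-to-one, which in practice takes some work (mapping phone sets, handling deletions); the function name and the example timestamps are invented for the example and are not the exact metric code behind the plots.

```python
def mean_boundary_error_ms(aligned, gold):
    """Mean absolute difference in ms between paired boundary times in seconds.

    Hypothetical helper; assumes aligned[i] and gold[i] mark the same boundary.
    """
    if len(aligned) != len(gold):
        raise ValueError("boundary lists must be the same length")
    if not aligned:
        raise ValueError("need at least one boundary")
    diffs = [abs(a - g) for a, g in zip(aligned, gold)]
    return 1000 * sum(diffs) / len(diffs)

# Made-up boundaries, in seconds.
aligned = [0.10, 0.25, 0.42]
gold = [0.12, 0.24, 0.45]
error = mean_boundary_error_ms(aligned, gold)
print(f"{error:.1f} ms")  # roughly 20 ms for these made-up values
```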
In general, the performance across all training runs is not too bad. For reference, the overall mean phone error for FAVE and MAUS on the Buckeye corpus is 22-23 ms (see the previous metrics post for more comparison to other aligners; I've only included the MFA English IPA train results in the above plots for visual clarity). So performance is generally better and pretty consistent once you hit the 5-10 speaker range, I would say, which corresponds to about 3-5 hours of speech. One outstanding question from this is whether increased speakers (and less data per speaker) might hurt performance, but I think 3-5 hours is probably a pretty good lower bound on good performance on the training data itself.
Results on held out data alignment
Another use case is training a model (or having a pretrained model like the ones distributed on MFA's pretrained model page) and applying this model to new data. So the question here is basically how well do the models we trained above generalize to new speakers within the same dialect as the training data?
Starting with word boundary distances, we can see some pretty stark errors for the lower numbers of speakers. At about 20 speakers, the errors reach roughly the level of the full training data, or at least a level similar to how those models performed on their own training data.
For the phone boundary errors in selected CVC words, the results are not great, particularly at the lower end of speakers, which is to be expected. Impressionistically, it again looks like 20 speakers is the point where it gets in the same ballpark as the full training run.
For phone errors across all words, performance gets close to the best performance around that 20 speaker mark again. At that point we're in the 19-21 ms range that we saw above for where the models generally fell on their training data.
So in general, I think the recommendation here is that around 20 speakers (corresponding to about 8-10 hours of training audio) generates a model that's pretty good at generalizing to new speakers in the same variety. That caveat's pretty important, as the farther away from the specific variety, the worse the generalization is going to be. Another note is that I didn't do any balancing or control of speaker gender in the training data, so that would definitely affect generalization.
So hopefully this was a useful look at training data and how it impacts MFA alignments. Based on the above, I would recommend the following rules of thumb:
- If you care only about alignments of the training data, 3-5 hours should be enough.
  - Caveat: increasing the number of speakers/varieties in the training data will likely need more training data
- If you care about generating models for more widespread use, 8-10 hours should be enough for generalizing to the same variety
  - The more speakers the better, but also more speakers should need more data
  - I usually recommend about 20 hours for a decently performant model