my Deus Ex Lipsyncing Fix Mod: making of

Back in 2021 I made a mod for Deus Ex 1 that fixes the lipsyncing and blinking, which, I betcha didn't know, had been broken since ship. Everything I wrote about it is on Twitter, and it oughta be somewhere else, so here's a post about it.

I guess I was playing DX1 and thinking, geez, was this lipsync always this bad? In a weird way? It's insta-snapping mouth shapes, but they're not always the same mouth shapes. Is this broken? I couldn't find anything online about it, but I did find this article: an interview with Chris Norden, a coder on DX, where he goes into the lipsyncing and how it was, at one point, super elaborate and amazing, and how they had to pare it back for performance reasons. I thought I'd check how much of this was done in Unrealscript (since the C++ source for DX is nowhere) and whether I could un-pare it. It turns out an extremely simple fix gets it as good as I got it, and I think that's as good as you can get it until someone leaks the source code.

I'd messed around with lipsyncing stuff before and was familiar with the broad strokes of how it tends to work via my intense familiarity with Half-Life 2: you figure out, hopefully automatically, the sounds (phonemes) present in a sound file ("oo", "ah", whatever) and map those to mouth shapes (visemes), then when the audio plays, move the mouth into the right shape for the phoneme we're in at this moment. The figuring-out process is called "phoneme extraction", at least by Valve, and Valve do this offline, because it takes a sec. In Valve's case they append this phoneme information to the end of the .wav file, and it looks like this:

PLAINTEXT
{
Okay, I don't blame you for hesitating, but if we're gonna do this thing, then let's just get through it. 
}
WORDS
{
WORD Okay 0.064 0.224
{
111 ow 0.014 0.096 1.000
107 k 0.096 0.142 1.000
101 ey 0.142 0.220 1.000
}
WORD I 0.224 0.352
{
593 ay 0.220 0.310 1.000
105 iy 0.310 0.364 1.000
}
WORD don't 0.352 0.496
{
100 d 0.364 0.396 1.000
111 ow 0.396 0.456 1.000
110 n 0.456 0.496 1.000
}

, etc. Phonemes, start times, end times. Easy!
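With data like that, the per-frame work at playback time is trivial: find the entry whose time range contains the current audio time, and that's your phoneme. Here's a minimal sketch of that lookup, written Unrealscript-style to match the rest of this post - every name in it (PhonemeStart, PhonemeChar, PhonemeAtTime, all of it) is made up for illustration, it's not Valve's or Ion Storm's actual code:

// Hypothetical: a phoneme track stored alongside the sound, Valve-style.
// Fixed-size arrays, since this era of Unrealscript has no dynamic ones.
var float  PhonemeStart[64];	// seconds into the wav
var float  PhonemeEnd[64];
var string PhonemeChar[64];	// which viseme bucket: "A", "E", "M"...
var int    NumPhonemes;

function string PhonemeAtTime(float audioTime)
{
	local int i;

	for (i = 0; i < NumPhonemes; i++)
		if (audioTime >= PhonemeStart[i] && audioTime < PhonemeEnd[i])
			return PhonemeChar[i];

	return "X";	// nothing at this moment: close the mouth
}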

My assumption about why Deus Ex's super cool lipsyncing was too expensive to ship: they don't seem to save this information anywhere, so I guess they were figuring out the phonemes in realtime, every time. If that's right, it's sort of a bummer - doing what Valve did and extracting the phonemes offline would have scooped the whole cost out. Maybe there was more to it.

Anyway, the Unrealscript. Deus Ex predates Unreal getting skeletal animation; it's all vertex animation. The character heads have a few anim sequences: relevant here, 7 visemes and a blink. nextPhoneme is set from somewhere outside this code (probably a C++ audio system I can't access) to A, E, F, M, O, T or U - it doesn't matter which is which, and I don't remember anyway - or X, which is nothing (close the mouth). Then this Unrealscript on the character sets the head's anim sequence to the appropriate pose. This all happens on tick, but only if bIsSpeaking. We have a tweentime we're using to blend between these poses, so we should be seeing nice smooth blending, the lack of which is why I'm here in the first place! So what's the problem?

The main thing is a dodgy frame rate check:

	// update the animation timers that we are using
	animTimer[0] += deltaTime;
	animTimer[1] += deltaTime;
	animTimer[2] += deltaTime;

	if (bIsSpeaking)
	{
		// if our framerate is high enough (>20fps), tween the lips smoothly
		if (Level.TimeSeconds - animTimer[3]  < 0.05)
			tweentime = 0;
		else
			tweentime = 0.1;

"tweentime" is how long it takes to blend to the next viseme in seconds; if 0, it's an instant snap. The intent here is to skip blending entirely if our framerate is so low that it looks better snapping the lips around than showing any in-between poses, only it doesn't work. The code is keeping Level.TimeSeconds from the previous frame and subtracting that from the current Level.TimeSeconds to get deltatime, which if it's less than 0.05, we're assumed to be getting less than 20fps. So it's flipped.

Also, 0.1 is just way too fast a value, and I suspect a reason for that, which I'll come back to*. I increased it to 0.36 to make the blends take long enough to really see.

With that fixed, the lipsync is smooth! Hooray! But it's not perfect: at the end of a line, when the audio finishes, we don't smoothly close the mouth; we snap it shut instantly. This is because we only do any blending if bIsSpeaking=true, which it suddenly isn't. The perf hit of this function no longer matters at all, so I just skip that check too: every character always gets to run lipsync. tweentime is also local to this function and initialises to 0, so in the mouth-closing branch I had to set it to 0.3 to get blending even when we have no phoneme.

Blinking was also way too fast, so fast as to be invisible, so I slowed it down a ton. Now you can see 'em blinkin'.

So now we have nice blinking and smooth mouth movement, but there's one thing that still sucks: presumably as part of the optimisation that made this ship at all, nextphoneme does not update every tick, or anywhere near every tick. It doesn't even update at a fixed rate - sometimes you'll get a good amount of updates in a sentence, sometimes one or two. This means that all the smooth blending in the world won't get you a correct result unless you happen to get lucky: JC can be speaking the M in "a bomb" and you're still back on the "a". As far as I can tell there's no way to fix this right now - the code that updates the phonemes just needs to do it every tick, and it don't, and it's not Unrealscript so I can't touch it. If the time between phoneme updates was at least consistent, you could set tweentime to that duration and make your blend take as long as it takes for a new phoneme to show up, but it ain't. So close!
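For what it's worth, if the interval were stable, the variables are already sitting right there in the function: measure how long the last phoneme took to arrive and blend over that duration, so the mouth lands on each pose just as the next phoneme shows up. A sketch of the idea (which doesn't work in practice, because the interval is all over the place):

		if (lastPhoneme != nextPhoneme)
		{
			// time since the previous phoneme arrived; if updates were
			// regular, it's also how long until the next one is due
			tweentime = Level.TimeSeconds - TimeLastPhoneme;
			TweenBlendAnim(animseq, tweentime);
			TimeLastPhoneme = Level.TimeSeconds;
			lastPhoneme = nextPhoneme;
		}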

*In the interview where Norden alludes to this amazing lipsync demo they had going on before they optimised it down, I assume it was initially getting a new phoneme every tick, and that is probably when they set 0.1 seconds as a blend duration. If you're getting constant new phonemes, blending super fast to the next one makes sense; it's only when you're not that a slower blend time looks good.

There's a lot of jank to this code. The silliest thing about it might be that it lives in ScriptedPawn, Deus Ex's NPC class, which does not share an immediate parent with the player character, so this whole function is just duplicated between the two classes.

Anyway, here's the whole function after I futzed with it.

// lip synching support - DEUS_EX CNN
//
function LipSynch(float deltaTime)
{
	local name animseq;
	local float rnd;
	local float tweentime;

	// update the animation timers that we are using
	animTimer[0] += deltaTime;
	animTimer[1] += deltaTime;
	animTimer[2] += deltaTime;

	if (bIsSpeaking)
	{
		// if our framerate is high enough (>20fps), tween the lips smoothly
		
//JOE CHANGE: 
//This used to set tweentime to 0 (no blend) if it thought FPS was low, else 0.1. It was 
//backwards though, the result was the opposite. 
//Even 0.1 is too fast to look good though. Anyway, skip the check, we don't care
//
//		if (Level.TimeSeconds - animTimer[3]  < 0.05)
//			tweentime = 0.4;
//		else
			tweentime = 0.36;

//Also, ideally tweentime would be the duration until the next time we get a phoneme update?
//But I don't know where that update comes from at the moment

		// the last animTimer slot is used to check framerate
		animTimer[3] = Level.TimeSeconds;

		if (nextPhoneme == "A")
			animseq = 'MouthA';
		else if (nextPhoneme == "E")
			animseq = 'MouthE';
		else if (nextPhoneme == "F")
			animseq = 'MouthF';
		else if (nextPhoneme == "M")
			animseq = 'MouthM';
		else if (nextPhoneme == "O")
			animseq = 'MouthO';
		else if (nextPhoneme == "T")
			animseq = 'MouthT';
		else if (nextPhoneme == "U")
			animseq = 'MouthU';
		else if (nextPhoneme == "X")
			animseq = 'MouthClosed';

		if (animseq != '')
		{
			if (lastPhoneme != nextPhoneme)
			{
				lastPhoneme = nextPhoneme;
				TweenBlendAnim(animseq, tweentime);
				TimeLastPhoneme = Level.TimeSeconds;
			}
		}

//		if ((Level.TimeSeconds - TimeLastPhoneme) >= tweentime*0.8 && TimeLastPhoneme != 0)
//		{
//		TweenBlendAnim('MouthClosed', 0.2);
//		nextPhoneme = "X";
//		lastPhoneme = "A";
//		TimeLastPhoneme = Level.TimeSeconds;
//		}
	}
	else
	if (bWasSpeaking)
	{
		bWasSpeaking = false;
		
//JOE: I added this tweentime set. Without it it was 0 as initialised, so the jaw snapped shut

		tweentime = 0.3;
		TweenBlendAnim('MouthClosed', tweentime);
	}

	// blink randomly
	if (animTimer[0] > 0.5)
	{
		animTimer[0] = 0;
		if (FRand() < 0.4)
			PlayBlendAnim('Blink', 0.2, 0.1, 1);
	}

	LoopHeadConvoAnim();
	LoopBaseConvoAnim();
}