It is well known that the shape and motion geometry in SfM problems such as this are subject to arbitrary scaling and that this scale factor cannot be recovered. (The imaging geometry and the rotation are recoverable and not subject to this scaling.) In two-frame problems with no information about true lengths in the scene, the scale factor is usually set by fixing the length of the ``baseline'' between the two cameras, which corresponds to the magnitude of the translational motion.
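The scale ambiguity described above can be checked numerically. The following is a minimal sketch, assuming a simple pinhole projection with identity rotation; this is not necessarily the camera model used in this paper. The function names, point coordinates, translation, and scale factor `s` are all hypothetical and chosen only for illustration.

```python
def project(point, translation, focal=1.0):
    """Pinhole projection of a 3-D point seen by a camera displaced by `translation`.

    Assumption: identity rotation and focal length `focal`; illustrative only.
    """
    X, Y, Z = (p + t for p, t in zip(point, translation))
    return (focal * X / Z, focal * Y / Z)


def scale(vector, s):
    """Multiply every component of a 3-vector by the scalar s."""
    return tuple(s * c for c in vector)


# Hypothetical scene points and camera translation.
points = [(0.5, -0.2, 4.0), (-1.0, 0.3, 5.0), (0.1, 0.8, 3.0)]
translation = (0.2, 0.0, 0.1)
s = 3.0  # arbitrary global scale applied to both structure and translation

original = [project(p, translation) for p in points]
scaled = [project(scale(p, s), scale(translation, s)) for p in points]

# Largest difference between the two sets of image coordinates.
max_diff = max(
    abs(a - b) for o, c in zip(original, scaled) for a, b in zip(o, c)
)
print(max_diff)
```

Because the factor `s` cancels in the ratios `X / Z` and `Y / Z`, the two sets of image coordinates agree up to floating-point rounding, so no image measurement can distinguish the scaled scene from the original.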
It is equally acceptable to fix any other single length associated with the motion or the structure. In many previous formulations, including [10,42], some component of the translational motion is fixed at a finite value. This is not good practice for two reasons. First, if the fixed component, e.g. the magnitude of translation, is actually zero (or small), the estimation becomes numerically ill-conditioned. Second, every component of motion is generally dynamic, which means the scale changes at every frame. This is disastrous for stability and also requires a post-processing step to rectify the scale.
A better approach to setting the scale is to fix a static parameter. Since we are dealing with rigid objects, all of the shape parameters are static. Thus, fixing any one of these establishes a uniform scale for all motion and structure parameters over the entire sequence. The result is a well-conditioned, stable representation. Setting scale is simple and elegant in the EKF: the initial variance on one chosen shape parameter is set to zero, which fixes that parameter at its initial value. All other parameters then automatically scale themselves to accommodate this constraint. This behavior can be observed in the experimental results.
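The effect of a zero initial variance can be illustrated with a toy filter. The sketch below is not the estimator of this paper; it assumes a hypothetical two-element state `[Z, t]` (one depth and one translation component) and a scalar measurement `h = t / Z`, which is invariant to a common scaling of `Z` and `t`. For simplicity both states are treated as static and no process noise is added. All names and numerical values are assumptions made for the example.

```python
def ekf_update(x, P, z, R):
    """One EKF measurement update for h(x) = t / Z with state x = [Z, t]."""
    Z, t = x
    h = t / Z
    H = [-t / (Z * Z), 1.0 / Z]  # Jacobian of h with respect to [Z, t]
    # Row vector H P and column vector P H^T (P is 2x2).
    HP = [H[0] * P[0][j] + H[1] * P[1][j] for j in range(2)]
    PHt = [P[i][0] * H[0] + P[i][1] * H[1] for i in range(2)]
    S = H[0] * PHt[0] + H[1] * PHt[1] + R  # innovation variance
    K = [PHt[i] / S for i in range(2)]     # Kalman gain
    innovation = z - h
    x_new = [x[i] + K[i] * innovation for i in range(2)]
    P_new = [[P[i][j] - K[i] * HP[j] for j in range(2)] for i in range(2)]
    return x_new, P_new


def run_filter(z_measured=0.25, n_steps=20, R=1e-4):
    """Run repeated updates with Z pinned by a zero initial variance."""
    x = [1.0, 0.0]                 # initial guess: Z = 1 sets the scale, t unknown
    P = [[0.0, 0.0], [0.0, 1.0]]   # zero variance on Z, unit variance on t
    for _ in range(n_steps):
        x, P = ekf_update(x, P, z_measured, R)
    return x, P


x_final, P_final = run_filter()
print(x_final)
```

With zero variance in the first row and column of `P`, the Kalman gain for `Z` is zero at every step, so `Z` stays at its initial value of 1 while `t` converges toward `0.25`. A scene with, say, `Z = 2` and `t = 0.5` produces the same measurement, and the filter simply reports it in the units fixed by `Z = 1`.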