Change UnitVector transform to use normalization #138


Closed
sethaxen wants to merge 12 commits

Conversation

sethaxen (Contributor)

This fixes #66 and relates to #86.

Length-1 unit vectors are no longer supported. While they are technically possible, they would likely suffer from numerical issues, since the transform is undefined at x = 0, and for a Markov chain to travel from y = [-1] to y = [1], it would have to leap over the origin, which is only possible at all due to discretization and will often fail.

Note that the current implementation is bare-bones and has the limitation that it's not appropriate for optimization when the target distribution may include a unit vector whose distribution is uniform on the sphere. I think it would be useful to accept a logpr function corresponding to log(r^(1-n) p(r)), where p(r) is an (unnormalized) prior on the discarded norm r. As noted in #86 (comment) and https://discourse.mc-stan.org/t/a-better-unit-vector/26989/30, this would support providing an alternative prior in these rare cases, one whose logpr is not maximized at r = 0. If this is welcome, I'll add it to the PR.

tpapp (Owner) left a comment

Thanks! The docs also need to be updated, but I can take care of that myself.

If you have the time, I would appreciate an explanation of why we need to correct by the Chi distribution.

```julia
lj_transform = logdet(J' * J) / 2
# un-normalized Chi distribution prior on r
lp_prior = (K - 1) * log(r) - r^2 / 2
```
tpapp (Owner):

Can you please explain why this correction is needed?

```julia
x = randn(dimension(tt))
y = transform(tt, x)
x′ = inverse(tt, y)
@test x ≈ x′
m = sum(2:N)
@test x[1:end-m] ≈ x′[1:end-m]
```
tpapp (Owner):

We might as well remove these two lines; the one below is sufficient.

tpapp (Owner) commented Apr 29, 2025

Apologies if the question is stupid, but could this be a bijection by adding the norm? I.e., x ∈ ℝⁿ would be mapped to the (Named)Tuple (y, r), where r = norm(x, 2) and y = x ./ r.

The user would need to put a prior on r explicitly.

sethaxen (Contributor, Author)

> If you have the time, I would appreciate an explanation of why we need to correct by the Chi distribution.

> Apologies if the question is stupid, but could this be a bijection by adding the norm? I.e., x ∈ ℝⁿ would be mapped to the (Named)Tuple (y, r), where r = norm(x, 2) and y = x ./ r.

Yes, that's exactly what's implicitly going on here. There are two equivalent perspectives we can take, but I'll focus on the one I find most intuitive. To make the transform bijective almost everywhere, we expand the parameter space so the transform is x ↦ (y = normalize(x), r = norm(x)). The only point where this is non-bijective is x = 0. Its Jacobian is non-square, but we have some tricks to compute the Jacobian determinant: as noted in the test comments, we can use the generalized Jacobian determinant that appears in the area formula, logJ = logdet(J'J)/2. The Jacobian here is J = [(I - y*y')/r; y'], so logdet(J'J)/2 == (1 - n) * log(r).
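To make this concrete, here's a small numerical check of that identity (a sketch; `polar_jacobian` is just an illustrative helper, not part of this package):

```julia
using LinearAlgebra

# Jacobian of x ↦ (y = x / norm(x), r = norm(x)), stacked as an (n+1)×n matrix.
function polar_jacobian(x)
    r = norm(x)
    y = x / r
    [(I - y * y') / r; y']
end

x = randn(5)
J = polar_jacobian(x)
# Generalized log-Jacobian-determinant from the area formula:
logdet(J' * J) / 2 ≈ (1 - length(x)) * log(norm(x))  # true
```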

After applying the transform we have an extra parameter r, but if we don't specify a prior, then by default r follows the uniform measure on the positive reals, which has infinite mass. As a result, the distribution we would get on x would also have infinite mass, and MCMC will fail. So we need to place a prior on r. The Chi prior used here is chosen because if we have n IID standard normal parameters x, then y is uniform on the sphere and r has this Chi distribution; so if the user's target distribution is uniform on the sphere, the distribution on x is the nice standard normal.
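For reference, the unnormalized form used in the review snippet above matches `Distributions.Chi` up to its normalizing constant (a sketch, assuming Distributions and SpecialFunctions are loaded):

```julia
using Distributions, SpecialFunctions

n, r = 5, 1.3
lp_prior = (n - 1) * log(r) - r^2 / 2             # unnormalized Chi(n) log-density
lognorm = (n / 2 - 1) * log(2) + loggamma(n / 2)  # its log normalizing constant
lp_prior - lognorm ≈ logpdf(Chi(n), r)            # true
```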

That choice of prior on r is fine in probably 99% of cases. It only really becomes an issue 1) when optimizing a uniform spherical distribution (highly unlikely), or 2) when the target distribution on y is highly concentrated. For n = 2, in the worst case, the direction of x is almost completely fixed but the length is not, so the distribution of x is almost confined to a linear subspace, curvature is high, and HMC can diverge. In these cases, choosing a prior on r that is very tight around r = 1 can drastically improve sampling. In practice I've only seen this be an issue when n = 2.

> The user would need to put a prior on r explicitly.

Yes, that's right. Cases like this were my motivation for the syntax product_distribution((y=VonMisesFisher(...), r=Chi(n))) in Distributions. But the downside is that it requires the user to 1) know that they need to pick a prior, 2) know why there's a floating parameter they don't care about, and 3) know what a good choice of prior is. To me this seems burdensome when the default Chi prior is fine for almost every user.

A practical issue with the user picking a prior themselves is that logJ has a (1-n)*log(r) term, which will be Inf at r = 0. If you then add the log-density of the Chi prior, it has an (n-1)*log(r) term, which will be -Inf. The two terms should cancel, but by computing logJ and the prior separately, you'll get a NaN. So having the user explicitly provide a prior for r would not solve the problem of optimization failing for uniform distributions on the sphere. That's why in the OP I proposed logpr(r) = log(r^(1-n) p(r)). Again, this failure mode is probably rare.
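A minimal illustration of the failure mode, and of how the fused logpr avoids it (with logpr as defined above, applied to the unnormalized Chi prior):

```julia
n, r = 3, 0.0
logJ = (1 - n) * log(r)              # +Inf
lp_chi = (n - 1) * log(r) - r^2 / 2  # -Inf
logJ + lp_chi                        # NaN, though the log(r) terms cancel symbolically

# Fusing them as logpr(r) = log(r^(1-n) * p(r)) with the unnormalized Chi prior
# p(r) ∝ r^(n-1) * exp(-r^2 / 2) leaves only the finite part:
logpr(r) = -r^2 / 2
logpr(0.0)                           # 0.0
```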

sethaxen (Contributor, Author)

One more comment: if the approach in this PR is taken, then not every transform here will be bijective/invertible, and the log-density correction is not just a logdetjac, so some documentation would need to be updated. It would still be the case that every transform should be right-invertible.

tpapp (Owner) commented Apr 29, 2025

Thanks for the intuitive explanation; this now makes sense.

> so logdet(J'J)/2 == (1 - n) * log(r)

Isn't there an extra -r^2/2 term, like in the code?

> But the downside is that it requires the user to 1) know that they need to pick a prior, 2) know why there's a floating parameter they don't care about, and 3) know what a good choice of prior is. To me this seems burdensome when the default Chi prior is fine for almost every user.

Yes, I see your point. Also, in the future, we might consider adding similar mappings where an unconstrained x vector is transformed to a (y, z) tuple, where z may require similar regularization. Instead of special-casing this, I would like to make it generic.

> then not every transform here will be bijective/invertible

But returning r too would make it (almost always) invertible, which is why I am considering it. I am really reluctant to give up invertibility.

> The two terms should cancel, but by computing the logJ and the prior separately, you'll get a NaN.

I am thinking about the following API:

  1. the UnitVector transformation maps x to (y, r) as described above, returning a NamedTuple. This is documented.

  2. it takes an optional argument for the prior on r, which defaults to the Chi distribution. The user can modify this. We document that this only affects the logjac and explain the rationale.

  3. the log Jacobian determinant is adjusted by this; we dispatch on Chi in an internal method that avoids the NaN issue. If the user prefers another prior, they should handle it themselves; there will be a default fallback.

Comments welcome.
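A rough sketch of what this could look like (the struct layout, `log_r_correction`, and the method names here are hypothetical, not the package's actual implementation):

```julia
using LinearAlgebra, Distributions

struct UnitVector{D}
    n::Int
    prior::D                                   # prior on the norm r
end
UnitVector(n::Int) = UnitVector(n, Chi(n))     # Chi(n) is the default

# Chi special case (assuming the default Chi(n)): the (n - 1) * log(r) prior
# term cancels the (1 - n) * log(r) Jacobian term analytically, so no NaN at
# r = 0 (up to an additive constant, which MCMC ignores).
log_r_correction(t::UnitVector{<:Chi}, r) = -r^2 / 2
# Fallback for user-supplied priors (may still produce NaN at r = 0):
log_r_correction(t::UnitVector, r) = (1 - t.n) * log(r) + logpdf(t.prior, r)

function transform_and_logjac(t::UnitVector, x)
    r = norm(x)
    (y = x ./ r, r = r), log_r_correction(t, r)
end
```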

sethaxen (Contributor, Author) commented Apr 29, 2025

> Isn't there an extra -r^2/2 term, like in the code?

No, the -r^2/2 term comes from the Chi prior.

> But returning r too would make it (almost always) invertible, which is why I am considering it. I am really reluctant to give up invertibility.

Yeah, that makes sense. I also like its explicitness.

> • it takes an optional argument for the prior on r, which defaults to the Chi distribution. The user can modify this. We document that this only affects the logjac and explain the rationale.
>
> • the log Jacobian determinant is adjusted by this; we dispatch on Chi in an internal method that avoids the NaN issue. If the user prefers another prior, they should handle it themselves; there will be a default fallback.

A simpler alternative would be a boolean flag, defaulting to true, that indicates whether the Chi prior is applied. Then we document the prior and note that the user can set the flag to false, but that they must then explicitly increment the log-density with their own prior (i.e., logJ is really just the logdetjac). This doesn't avoid the NaN issue, but I do think that issue is rare, and it's also avoidable by choosing a suitable prior on r.
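In code, this alternative might look something like the following (hypothetical names, for illustration only):

```julia
# `apply_chi_prior` is a hypothetical flag name for the alternative above.
function log_correction(n, r; apply_chi_prior::Bool = true)
    if apply_chi_prior
        -r^2 / 2              # logdetjac fused with the (unnormalized) Chi(n) prior
    else
        (1 - n) * log(r)      # bare logdetjac; the user must add their own prior on r
    end
end
```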

sethaxen (Contributor, Author)

Suggestion: if we also return r, rename the transform to UnitVectorNorm. An alternative name could be PolarVector, since the polar decomposition of a vector interpreted as an n×1 matrix would return the unit vector and a 1×1 matrix containing the norm.

tpapp (Owner) commented May 2, 2025

@sethaxen, I decided to start over and copied your code over to #139.

tpapp closed this May 2, 2025.

Successfully merging this pull request may close these issues:

• UnitVector only transforms to positive hemisphere