-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is there some way to specify that certain interactions never occur? #19
Comments
I found a better description of what's going on: import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic
index_vals = tuple("abc")
level_names = list("AB")
n_samples = 2
def describe_design_matrix(design_matrix):
print("Shape:", design_matrix.shape)
print("Rank: ", la.matrix_rank(design_matrix))
print("Approximate condition number:",
np.divide(*la.svd(design_matrix)[1][[0, -1]]))
ds_simple = pd.DataFrame(
index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]),
columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
).reset_index()
print("All sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple)[1]
describe_design_matrix(simple_X)
print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B) ** 2").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
describe_design_matrix(simple_X)
print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns if not col.startswith("A[T.b]:")]]
describe_design_matrix(reduced_X) produces
I think I can turn this into something I can use for the original problem. |
Just realized I should probably mention: |
A slightly more complicated case: import numpy as np
import numpy.linalg as la
import pandas as pd
import formulaic
index_vals = tuple("abc")
level_names = list("ABC")
n_samples = 2
def describe_design_matrix(design_matrix):
print("Shape:", design_matrix.shape)
print("Rank: ", la.matrix_rank(design_matrix))
print("Approximate condition number:",
np.divide(*la.svd(design_matrix)[1][[0, -1]]))
ds_simple = pd.DataFrame(
index=pd.MultiIndex.from_product([index_vals] * len(level_names) + [range(n_samples)], names=level_names + ["sample"]),
columns=["y"], data=np.random.randn(len(index_vals) ** len(level_names) * n_samples)
).reset_index()
print("Only some sampled")
simple_X = formulaic.Formula("y ~ (A + B + C) ** 3").get_model_matrix(ds_simple.query("A != 'a' or B == 'a'"))[1]
describe_design_matrix(simple_X)
print("Reduced X")
reduced_X = simple_X.loc[:, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]]
describe_design_matrix(reduced_X) Since formulaic appears to sort the column names before forming the interactions, a general solution would need the test to be similar to: reduced_X = simple_X.loc[
:,
[
col for col in simple_X.columns
if (
# Check for interaction terms involving A == 'b' ...
(not col.startswith("A[T.b]:") and ":A[T.b]" not in col)
# ... and any value of B
or "B" not in col
)
# Equivalently,
# if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and "B" in col
# I may need to expand the test for B to avoid problems if, for example, A.endswith(B)
# (":B[" in col or col.startswith("B[T."))
# should work
]
] I'm not entirely sure how to tell formulaic that This looks like most of an implementation and a start at a test suite for this functionality if an API gets decided. |
Hi @DWesl ! Sorry for the delay on this. I usually only work on Formulaic on the weekends, in and around family stuff. This is a very interesting problem. The I'm a little reluctant to add functionality along these lines to the formula grammar itself, since I think that will complicate things a lot. However I do see two alternative paths to supporting something like this downstream (or even integrating it directly into Formulaic). (1) As a post-model-matrix-generation transform that looks at the structure and patches it to be full rank. For (2), it should be "trivial" to add this as a transform, but making it a stateful transform (something that can be reused on a different dataset) would be more tricky. I'm imagining a new transform that look something like:
Obviously as the number of interactions grow, this looks more and more ugly. What are your thoughts? |
Oh... And you could explicitly write out the interactions:
This syntax should work in the latest master, but with some tweaks to the parser perhaps we could enable doing:
If we did that, would that be sufficient for your use-case? You'd have to expand your products manually. |
There's certainly the option of writing out the interactions explicitly, but I've worked around this for now by using: reduced_X = simple_X.loc[
:,
[
col for col in simple_X.columns
if (
# Check for interaction terms involving A == 'b' ...
if not ((col.startswith("A[T.b]:") or ":A[T.b]" in col) and
# I need to expand the test for B to avoid problems if, for example, A.endswith(B)
(":B[" in col or col.startswith("B[T."))
)
]
] (The problem that sparked this issue has the second column named essentially I would not add it to the grammar, I was thinking a keyword option to the Possibly a better (more consistent) way to do this would be to use the grammar to set ... that's not currently implemented in formulaic, but the idea should be sound. I have no idea which of these better conforms to the expectations of people looking at linear regression coefficients or ANOVA tables. |
Sorry for the delay here again @DWesl ! I've had a lot on my plate the last few weeks. I think I am persuaded that adding a keyword argument along the lines of "drop_columns_with_no_support" or "drop_all_zero_columns" could be useful, and that having be part of the model matrix generation process is valuable because that way new data will be forced to respect the limited interactions even if the new data has support for these interactions. Would that meet your use-case? |
That gets most of the way there; I'd need to reorder the variables in the Categorical to get a column with all zeros, but that's not that much effort. |
I asked about this in pydata/patsy#161, and puzzled out what formula would be needed to ensure the all-zero columns happen: |
Hi @DWesl ! As I prepare formulaic for the v1.0 release, I wanted to go back through all the old issues and make sure things were dealt with properly. I know it has been a really long time (sorry). Formulaic has come a long way during this time, but I don't think this specific issue has been "solved". Formulaic does now support customizing the reference level in contrasts/etc, as well as explicitly ordering the levels and respecting the order of factors in the formula, so you could use what you wrote above... but, there's still room for improvement. Has your thinking in this space evolved? |
I would expect the condition numbers to be somewhat closer to each other.
Is this just a bad expectation?
In the case this is derived from,
A == "a"
sets something to zero, and then multiplies that by the stuffB
controls.Is the standard practice for this situation to duplicate data expected to be identical so the tests don't crash?
The text was updated successfully, but these errors were encountered: