Error if attributes have the same value in training set. #14
Comments
Well, in either case, that's not much of a decision tree. For (a), same attributes resolving to different labels, the current code just picks the last value, so all of the examples get resolved to one case. For (b), I guess it would make sense to handle this, even if it's an edge case, but for (a), it seems like we should just throw an exception during training.
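The exception-during-training idea above could be sketched as a pre-training validation pass. This is purely illustrative and not part of the decisiontree gem's API; the method and error class names are hypothetical. It groups examples by their attribute values and raises if any group maps to more than one label, which is case (a) from the discussion:

```ruby
# Hypothetical pre-training check: raise if identical attribute rows
# map to conflicting labels (case (a) above). Not the gem's actual API.
class ConflictingExamplesError < StandardError; end

# `examples` is an array of rows where the last element is the label.
def check_for_conflicts!(examples)
  examples.group_by { |row| row[0..-2] }.each do |attrs, rows|
    labels = rows.map(&:last).uniq
    if labels.size > 1
      raise ConflictingExamplesError,
            "attributes #{attrs.inspect} map to multiple labels: #{labels.inspect}"
    end
  end
  examples
end
```

A caller could run this over the training set before building the tree, so the failure happens up front with a descriptive message rather than deep inside tree construction.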
As mentioned in my first post, I get an error for the values posted. I would have imagined that a decision tree would pick the target feature that appears the most times, or, if no such case exists, pick the last one. But if you think an exception during training is a better solution, then that's fine, I guess. In my training set I have decided to add a fake entry to bypass both cases discussed.
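The majority-vote resolution described above could look something like the following sketch. The method name is illustrative, not part of the gem; it collapses duplicate attribute rows into a single example carrying the most frequent label (ties resolve to the label seen first):

```ruby
# Illustrative sketch of the resolution described above: for duplicate
# attribute rows, keep one example whose label is the most frequent
# among the duplicates. Not how the gem currently behaves.
def resolve_duplicates(examples)
  examples.group_by { |row| row[0..-2] }.map do |attrs, rows|
    # tally labels and pick the most common one for this attribute group
    label, _count = rows.map(&:last).tally.max_by { |_, count| count }
    attrs + [label]
  end
end
```

This trades information for consistency: the tree sees a clean training set, at the cost of silently discarding minority labels.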
For case (a), how about just ignoring those examples? For case (b), as goibon said, it would make sense to just resolve to the only target feature available.
Although this is quite an old ticket, I must say that this issue causes me some headaches as well. As I see it:
This is a very old thread, but I think the issue is still here. I am a proponent of leaving it how it is. To address @DannyBen's list:
I believe this makes the assumption that FIFO and LIFO data observation models are exactly the same. In other words, if I have a set of observations where the same occurrence of features yields different outputs, can I be certain that there weren't exigent circumstances or variables causing the variation in output? (As a matter of fact, I can probably assume that there was an exigent variable.)

Put another way, if we are training a model with data that has non-changing inputs, then what are we really doing? As @igrigorik said, a check could be made to throw an error when something like that happens, but it would be remiss of an objective modeling algorithm to choose a FIFO or LIFO method to select the "correct" observation when we might not even understand what exigent variable is causing the discrepancies.
I think this means you already understand that the above limitations of assumptions are something we shouldn't control for. Is that correct?
I can see how this would be frustrating, but maybe a second look at the data is in order. If you have a multivariate system of continuous inputs that are constant, maybe one solution is to convert them to discrete variables instead? In other words, have you tried something like this:

```ruby
[
  ["cat5", "option1", "9.990941"],
  ["cat5", "option1", "9.990926"],
  ["cat5", "option1", "9.991411"],
  ["cat5", "option1", "9.991286"],
  ...
  ["cat5", "option1", "9.9907190386056"]
]
```

I don't know your data, but if I saw a set of continuous inputs like this, I can safely assume that one of two things is true: either (a) there are continuous variables that should be discretized, or (b) the wrong data is being measured to yield an observation.

Think about the nature of a decision tree, and then think about how it would play out on a model that had the exact same inputs but different outputs. For example, pretend we have a model trained on the data from the OP. If we asked the model to predict the outcome for that same input, one could argue that it should return the mean value, but again, that assumption would also depend heavily on the environment and the type of data we're collecting.
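The discretization idea above can be sketched with a minimal equal-width binning helper. This is an illustration under my own assumptions, not part of the gem: it maps a continuous value into one of `bins` buckets over a known `[min, max]` range, so near-identical floats like the ones in the sample collapse into the same discrete category:

```ruby
# Minimal equal-width binning sketch (illustrative, not the gem's API):
# map `value` into one of `bins` buckets spanning [min, max].
def discretize(value, min, max, bins)
  return 0 if max == min             # degenerate range: one bucket
  idx = ((value - min) / (max - min) * bins).floor
  idx = bins - 1 if idx >= bins      # clamp the upper edge into the last bin
  idx
end
```

Replacing the raw floats with bucket indices before training would make the attribute columns genuinely discrete, which is what a decision tree splits on.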
Psychologists call this "deflection", no? 😏 Setting aside all the different ways to handle - or not handle - multiple values, I think there should be no disagreement that the library should at least raise a specialized error and not let Ruby fail with an unrelated one. But I understand if this is not on anybody's priority list. To be honest, I also moved on.
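The specialized-error suggestion above could be implemented by wrapping training in a rescue that re-raises low-level failures as a domain-specific error, so callers see a library error instead of an unrelated Ruby one. The class and method names here are hypothetical, not the gem's actual API:

```ruby
# Hypothetical wrapper: convert low-level Ruby errors raised during
# training into a domain-specific error with a clearer message.
class TrainingError < StandardError; end

def train_safely
  yield
rescue NoMethodError, TypeError => e
  raise TrainingError, "training failed, possibly due to degenerate data: #{e.message}"
end
```

Usage would be something like `train_safely { tree.train }`, so a `NoMethodError` deep inside tree construction surfaces as a `TrainingError` the caller can rescue by name.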
2018 update: all it needs is a simple random modifier for the default values. That might cause some false correlations in the beginning, but it fixes itself as data grows.
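One way to read that suggestion: when several labels tie for the default value, break the tie randomly instead of always taking the last one. This is an illustrative sketch of that idea, not the gem's behaviour, and the method name is invented:

```ruby
# Illustrative tie-breaking sketch: pick the default label by frequency,
# breaking ties at random rather than by insertion order.
def default_label(labels, rng: Random.new)
  counts = labels.tally
  top = counts.max_by { |_, c| c }.last     # highest frequency
  candidates = counts.select { |_, c| c == top }.keys
  candidates.sample(random: rng)            # random choice among tied labels
end
```

Passing a seeded `Random` makes the choice reproducible for testing, while the default is nondeterministic, which is what spreads out the early false correlations the comment mentions.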
I've encountered a problem where, if you have a training set like:
Where the attributes do not vary and only the target feature does, I get an error:
I encounter the same issue if the attributes vary but the target feature does not.