Probability problem: chance somebody will pick a number in a range randomly

This is a problem I just made up off the top of my head. Just to eliminate any ambiguity, the point of the problem is dealing with the issues that discrete vs. continuous probability can present – specifically, when you want a distribution over a continuous space, but the actual selections are bound to act a lot like draws from a discrete one.

So let’s say we’re making a model of the predicted response of a person when you ask them for a number between 1 and 10. Now, a lot of people might pick a culturally lucky number like 7, a few will just pick 5 or 2 just because, but a lot of math geeks will pick pi, the square root of 2, or e, just to be cute – so we’re dealing with a continuous space. Yet, basically nobody will pick 1.2937542268907651423, or 8/7, even though both are valid selections.

So we’ve run into a bit of a snag. The chances of somebody picking 1, or 7, or pi are relatively high. Definitely non-zero if you were to give a poll. Not around 1 or 7 or pi, not in the range of 1 and 1.00000001. Exactly 1. The chance of somebody picking 1.00000001 is basically nothing (as you’d expect with a continuous space).

Yet, we’re dealing with a continuous distribution, so the chance of anybody picking any single number, whether it be 1 or 2.379225677778923495821345, should be the same: zero. You can only talk about ranges. Yet, I can almost guarantee that if you polled a large enough number of people you would not see this reflected.

Now, you could say that this reflects the belief people have that the question is only asking about integers, or whatever, but the reasons for picking those numbers don’t matter – you’re just trying to make a prediction about what people will pick (let’s pretend we’re doing some good ol’ Bayesian inference, if it matters to you). Some arbitrary non-whole number is a valid response, and if you poll enough people I’m certain some asshole will pick one just to be funny (hell, I’ve done it before). So you have a problem: a continuous distribution that is almost certainly going to act a hell of a lot like a discrete distribution for some numbers, but like a continuous one for others.

So is there a way to deal with this? I assume there has to be, I’m sure I’m not the first person to notice this problem (if it’s even a problem). Yet, the only answer I’m coming up with is to treat it exactly like a discrete distribution and throw out statistically insignificant responses like 5.678955463, which I don’t like. Maybe project the responses into some feature space where the common responses fit some standard distribution? I’m not sure how you’ll come up with a function for that, though.

If you needed to write your distribution out in functional form, for some reason, you could use the Dirac delta function in your representation.

Dirac_delta_function#Applications_to_probability_theory
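
For what it’s worth, here is roughly what that functional form might look like written out. The symbols are just labels chosen for illustration, not anything taken from the linked page: the x_i are the “special” numbers (1, 7, pi, and so on), the p_i are their probabilities, and g is some ordinary continuous density on [1, 10] that integrates to 1.

```latex
f(x) = \sum_{i=1}^{n} p_i \, \delta(x - x_i) + \Bigl(1 - \sum_{i=1}^{n} p_i\Bigr) g(x),
\qquad
P(a \le X \le b) = \sum_{i \,:\, a \le x_i \le b} p_i + \Bigl(1 - \sum_{i=1}^{n} p_i\Bigr) \int_a^b g(x)\, dx .
```

Taking a = b = x_i gives P(X = x_i) = p_i, while any single non-special number still gets probability zero.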

I wouldn’t be writing it out because, y’know, pure thought experiment. But that’s as good an answer as any. I figured there was something that dealt with that. Thanks. It looks pretty flexible too judging by that page – it could be used in a function to even reflect things like “when people don’t pick one of the standard numbers, it tends to be a number within this continuous distribution”.

You could, I suppose, limit your space of numbers to “Numbers which can be unambiguously expressed in the English language in a finite amount of time”, which is a countable set. Or set some specific upper bound on the amount of time, like “Numbers which can be unambiguously expressed in the English language in ten seconds or less”, in which case you’ve got a nice finite (and discrete) set. Or maybe don’t put a hard upper bound on the amount of time, but assume some sort of distribution on the times: The occasional nonconformist might pick a number that takes 11 seconds to express, but very, very few folks are going to pick a number that takes a million seconds to express.

If this is the crux of the query: it’s straightforward to have a probability distribution that is a mix of discrete and continuous. For example, you could say that a number is drawn randomly from the following distribution: “The number is either -1 (with probability 30%) or it is any number between 0 and 1 (uniformly distributed).” In this case, you just normalize the continuous part of the distribution to have integral 0.7 instead of 1. In your scenario, you just have a longer list of discrete possibilities (and their corresponding probabilities) plus some continuous distribution describing the rest, normalized such that its integral makes up the balance of the unit probability.

To be sure, the practical aspects of actually assigning the probabilities are daunting, but the underlying math of a discrete-and-continuous distribution is well-defined.
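
The “-1 with probability 30%, otherwise uniform on 0 to 1” example above can be sketched in a few lines of Python. This is only an illustration: the names `sample_mixed` and `mixed_cdf` are made up for this post, and the sampler assumes the atom probabilities sum to at most 1.

```python
import random

def sample_mixed(rng, atoms, continuous_sampler):
    """Draw one value from a discrete-plus-continuous mixture.

    atoms: dict mapping value -> probability (total assumed <= 1).
    continuous_sampler: callable taking rng and returning a draw from
    the continuous part, which gets the leftover probability.
    """
    u = rng.random()
    cumulative = 0.0
    for value, prob in atoms.items():
        cumulative += prob
        if u < cumulative:
            return value          # landed on one of the point masses
    return continuous_sampler(rng)  # landed in the continuous remainder

def mixed_cdf(x):
    """CDF of the example: P(X = -1) = 0.3, the other 0.7 uniform on [0, 1]."""
    if x < -1:
        return 0.0
    if x < 0:
        return 0.3                # the jump at -1 is the point mass
    if x < 1:
        return 0.3 + 0.7 * x      # continuous part normalized to 0.7
    return 1.0

rng = random.Random(0)
draws = [sample_mixed(rng, {-1: 0.3}, lambda r: r.uniform(0.0, 1.0))
         for _ in range(100_000)]
frac_atom = sum(1 for d in draws if d == -1) / len(draws)
print(frac_atom)  # should land close to 0.3
```

The CDF is the cleanest way to see the mix: it jumps by 0.3 at -1 (the discrete part) and climbs smoothly from 0.3 to 1 across the interval (the continuous part).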

I suppose my problem wasn’t just whether you can do it. I was pretty sure that a function existed for it and was trivial. It was partially the fact that I was trying to think of a way to represent such a thing on the fly. Like, you have a bunch of data and your domain – how do you generate a probability function for this? Even if you know it’s going to be a mix of discrete and continuous (hell, even if you know, say, the type of continuous function but not the exact parameters) before you start, how are you going to fit the function to the data?

You need a model, and you need at least as many data points as you have free parameters in your model. However, from how you phrased your last post, I suspect you already know all this, so I think I’m still not sure what you’re asking.
Are you asking about practical methods for optimizing the parameters of the model to describe the data? If so, then likelihood maximization is probably what you want.

In your actual scenario, you might be looking for a model that can somehow dynamically determine which items are drawn from the discrete or the continuous parts of the underlying distribution. As Chronos implies, though, everything is actually discrete in the human-based scenario, so you just need a large enough data set to determine all the probabilities empirically. If you’re looking to approximate parts of the distribution as continuous (since they will behave an awful lot like it in practice), then you just need to have a rule for what you want to treat as discrete (any integer; any rational number with small numerators and denominators; things like pi and e; anything that appears in the data twice; … ) and then you need a model to fit the rest of the numbers. Find the parameters that maximize the likelihood (including the discrete probabilities and the continuous model’s parameters), and you’re done – you’ve got a full distribution with which you can predict future numbers.
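
Here is a minimal sketch of that recipe, under some loudly stated assumptions. The discrete rule is “anything that appears in the data at least twice”. The continuous part is modeled as a plain normal distribution, which is purely an illustrative choice and ignores the fact that answers are confined to 1 through 10. Because each data point is unambiguously either an atom or not, the likelihood splits into two pieces, so the maximum-likelihood atom probabilities are just the observed frequencies and the normal’s parameters are fit to the leftover points on their own. The function name `fit_mixed` and the sample responses are made up.

```python
import math
from collections import Counter

def fit_mixed(data, min_repeats=2):
    """Fit a discrete-plus-continuous model by maximum likelihood.

    Assumed rule: a value seen at least min_repeats times is an atom.
    Assumed continuous model: a normal distribution (truncation ignored).
    Returns (atoms, weight, mu, sigma), where atoms maps value -> probability
    and weight is the total probability of the continuous part.
    """
    n = len(data)
    counts = Counter(data)
    # MLE for each atom's probability is its relative frequency.
    atoms = {v: c / n for v, c in counts.items() if c >= min_repeats}
    rest = [x for x in data if x not in atoms]
    weight = len(rest) / n
    if rest:
        # MLE for a normal: sample mean and the 1/n standard deviation.
        mu = sum(rest) / len(rest)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in rest) / len(rest))
    else:
        mu = sigma = float("nan")
    return atoms, weight, mu, sigma

# Made-up poll responses, for illustration only.
responses = [7, 7, 7, 7, 3, 3, 5, 5, math.pi, math.pi, 1.2937, 8.41, 4.6, 6.02]
atoms, weight, mu, sigma = fit_mixed(responses)
print(atoms, weight, mu, sigma)
```

With real data you would likely want a continuous model that respects the 1-to-10 bounds, and a smarter atom rule than raw repeat counts, but the structure of the fit stays the same.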

Alternatively, you can just fit the empirical distribution function.
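
For anyone who hasn’t met it, the empirical distribution function is just “the fraction of responses at or below x”. A rough sketch, with made-up data and a hypothetical helper name `make_ecdf`:

```python
from bisect import bisect_right

def make_ecdf(data):
    """Return F where F(x) is the fraction of data points <= x."""
    xs = sorted(data)
    n = len(xs)
    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(xs, x) / n
    return ecdf

# Made-up poll responses, for illustration only.
responses = [7, 7, 3, 5, 3.14159, 7, 1.2937, 8.41]
F = make_ecdf(responses)
jump_at_7 = F(7) - F(6.999)  # size of the step at 7 = share of people who said 7
print(F(7), jump_at_7)
```

Popular answers show up as big steps and one-off answers as tiny ones, so there’s no need to decide ahead of time which numbers count as discrete.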

I think that “dynamically determine” was more or less the crux of what was running through my head. Some way to dynamically determine which pieces of data are drawn from which parts. Some way to know that Very Important™ numbers (like 7 and pi) come from the more or less discrete distribution that will be selected from most of the time, and that other numbers, like the rationals 6/5 and .012, will come with varying degrees of likelihood from various continuous distributions selected from less often (distributions such as the ones you and Chronos mentioned, based on the size of the denominator, the length of the decimal, etc.).

That works too.

This is either a nitpick or a question to Chronos (or someone else if they can answer). I’d think that “Numbers which can be unambiguously expressed in the English language in a finite amount of time” would be a finite not a countable set. (I realize the former are also the latter.) Did you just mean if I don’t tell you how much finite time, I can’t tell you how big the finite is?

The OP needs to work harder to pin down a mathematics question that has an answer.

The tack that Chronos has taken leads to a very interesting place, the theory of Kolmogorov complexity, or Chaitin’s discoveries in algorithmic information theory. Chronos’ concept of the time it takes to specify a number is essentially the same as Chaitin’s concept of the number of bits in the minimum algorithm that can be used to specify a number. It leads to very deep questions in mathematical logic, randomness, etc.

For any integer (or rational number, or a good many irrational numbers), I can specify that integer using English in a finite amount of time. Thus, all integers are elements of the set “numbers that can be unambiguously expressed in English in a finite amount of time”, and so, since there are an infinite number of integers, that set must be infinite.

Some of the integers will take longer to express than others, and for any given amount of time, I can come up with integers that will take longer than that amount of time. This is why it matters if you specify an upper bound, as I went into in my next sentence: If you do specify an upper bound, then the set becomes finite.