That’s certainly true for real-time data. But if you record the sensor data, you can play it back exactly offline, and potentially get deterministic results that way.
It’s probably not that important, though. As you say, the real data is never going to be the same twice. What you want is a high degree of consistency. If you add a tiny bit of noise to the images, the end result will be different, but it should be close. Something is probably wrong if it’s drastically different (though one can imagine edge cases where this isn’t true).
Agreed about the temperature (which is intended to introduce variability), but trying to drive an inference process to use exactly the same path every time would come with an enormous performance hit, and even then small computation errors can result in variation in the response. Real LLMs are definitely not deterministic, and there is a lot of research focused on getting them to converge on a ‘correct’ solution with permissible variation (i.e. semantically identical even with variable wording) without trying to brute-force the process.
But regardless, even if the inference process is forced to be strictly deterministic in inference, the complexity of a neural network of a useful size means that you can’t just look at the weighting parameters of each node and predict a response, nor predict what substantial variations in those parameters will produce. Which means that we can’t just introduce explicit directives into the model; anything you want the model to do within its own processes has to be trained via reinforcement, and even with vigorous reinforcement you don’t know what kind of conditions may result in the system ignoring that training and doing something completely unexpected.
Indeed, those edge cases must exist. If there exist two routes to the same destination that take exactly the same amount of time (and the Intermediate Value Theorem guarantees that such must exist, for some starting point), then Buridan’s AI must nonetheless somehow choose one of those two. Fine-tune the problem enough, and any arbitrarily-small difference in inputs could be the deciding factor in which one to take. And that’s even assuming that the AIs are behaving perfectly rationally, which of course they won’t be.
(and before you nitpick that the AI would break ties in travel time using distance or fuel usage or something as a tiebreaker, there still has to be some overall figure of merit that it’s optimizing for, and “in case of a tie do this” doesn’t actually resolve Buridan’s Ass decisions, because then you have the same edge-case problem in deciding whether something counts as a tie)
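(To make that concrete, here is a toy sketch with made-up numbers: a tolerance-based tie rule doesn’t remove the knife edge, it just relocates it to the tolerance boundary.)

```python
# Hypothetical route costs in seconds; EPS is an arbitrary "close enough" margin.
EPS = 0.5

def pick_route(cost_a: float, cost_b: float) -> str:
    """Pick the cheaper route, with a fixed tiebreaker for near-ties."""
    if abs(cost_a - cost_b) < EPS:
        return "A"                      # tiebreaker rule: always take A
    return "A" if cost_a < cost_b else "B"

# An arbitrarily small change in inputs still flips the decision; the
# discontinuity has just moved from cost_a == cost_b to |cost_a - cost_b| == EPS.
print(pick_route(100.51, 100.0))        # "B": not a tie, B is strictly cheaper
print(pick_route(100.49, 100.0))        # "A": now it counts as a tie, so the tiebreaker fires
```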
Pretty much the same example that came to mind. Still, these cases should be rare. If they aren’t, it means the AI isn’t being selective enough (like if it evaluates all routes to have the same cost regardless of length). Sure, you can always search for edge cases, but ideally, a randomly chosen route’s travel time will change by a negligible amount when a bit of noise is added.
Time in a development environment allows changes to the hardware. That tilts the control surface, allowing the mechanism of interest to jump out of its trap, but also freeing other trapped mechanisms, some of which are demonic.
I’m not so sure… More relevantly, in such a situation taken in isolation, it doesn’t matter which option the AI chooses (as long as it does in fact choose one, rather than starving from indecision). And in fact, in the aggregate, some amount of random decision-making is probably good, and should be actively encouraged: If there are two major bridges connecting the east side and west side of town, for instance, and one is slightly better than the other (for whatever reason), then if every AI deterministically chooses the “best” choice, then that bridge will end up overcrowded, while the other one goes unused. But if every AI instead gives it a biased coin flip, with the “better” bridge having (say) a 55% chance and the “worse” one a 45% chance, then both bridges get well-utilized, and the overall traffic experience for everyone is optimized.
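Something like this (the bridge names and split are purely illustrative, not from any real routing system):

```python
import random

# Each car flips a weighted coin instead of always taking the "best" bridge,
# so both bridges stay utilized in roughly a 55/45 ratio.
BRIDGES = ["east_bridge", "west_bridge"]   # hypothetical names
WEIGHTS = [0.55, 0.45]                     # slight bias toward the better bridge

def choose_bridge(rng: random.Random) -> str:
    return rng.choices(BRIDGES, weights=WEIGHTS, k=1)[0]

rng = random.Random(42)
counts = {b: 0 for b in BRIDGES}
for _ in range(10_000):
    counts[choose_bridge(rng)] += 1

print(counts)  # roughly {'east_bridge': 5500, 'west_bridge': 4500}
```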
But of course, smart routing systems (may) use traffic load as an included factor, so the load should theoretically level out - and also make equally balanced routes a greater rarity. Who knows - route evaluation may happen from east to west, or by “closest to straight line”, or any other algorithm; the system might then consider alternate routes that are close to this one and select the next one as a replacement only if it is better in some way according to its criteria, etc. I suppose it’s the same dilemma a chess program faces when it computes two equally scored options for its next move.
Yes, but that’s reactive. It’s nice if the AI eventually notices the congestion and starts routing the other way, but it’d be better if it never caused the congestion to begin with. Though I suppose that there’s no reason why you couldn’t have all of the computerized cars networked together, so they could see that there are a bunch of folks all going to the same destination before they even pulled out of the driveway, and agree in advance who would take which route.
Yes, eventually, it could take milliseconds. But seriously, I assume the feedback is from other cars etc. that report they are going much less than the speed limit, waiting longer at lights, etc. Or from cellphone data. I have no idea where all that traffic congestion data comes from, but I assume a central source provides it, since Google Maps, Waze, Apple Maps, GPS devices, Tesla, etc. all seem to have the data.
A more interesting problem is feedback speed. Everyone’s headed to the Taylor Swift concert (or leaving it) over one of two bridges. The more optimal one fills up, at which point the rest of the horde routes via the second bridge, which fills up in turn, leaving the first one emptier, so further traffic now takes the first one. It becomes a feedback loop with a frequency determined by driving time to the bridges.
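A toy simulation of that loop (all numbers invented) makes the oscillation visible: cars route on congestion data that is DELAY steps stale, so the load square-waves between the bridges with a period set by the lag.

```python
# Toy delayed-feedback routing: everyone takes whichever bridge *looked*
# emptier DELAY steps ago, so the load ping-pongs instead of settling.
DELAY = 5            # steps of feedback lag (stand-in for driving time)
STEPS = 30
CARS_PER_STEP = 100

history = []         # observed loads, consumed with a lag

for step in range(STEPS):
    if len(history) >= DELAY:
        seen = history[-DELAY]   # stale congestion report
        target = "bridge_1" if seen["bridge_1"] <= seen["bridge_2"] else "bridge_2"
    else:
        target = "bridge_1"      # no data yet; take the "optimal" bridge
    load = {"bridge_1": 0, "bridge_2": 0}
    load[target] = CARS_PER_STEP # the whole horde piles onto one bridge
    history.append(load)
    print(step, target)          # flips every DELAY steps
```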
I see this in single-GPU training. Using the same seed for multiple sessions gives almost the same results, but not bit-exact. Something in the GPU libs/driver is conditional. To be fair, though, in my case the variation is dwarfed by the precision decisions made for inference.
Network and bus performance would make this more pronounced with multiple GPU training, but as you said it’s not hard to accumulate the results in order. At worst you stall some GPUs with each update.
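For what it’s worth, if this is PyTorch (an assumption on my part), these are the knobs I’d set to chase bit-exactness; even then some CUDA kernels simply have no deterministic variant.

```python
import os
import random

import numpy as np
import torch

# Seeding alone isn't enough; you also have to ask for deterministic kernels.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required for some CUDA ops

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # also seeds all CUDA devices
    torch.backends.cudnn.benchmark = False     # don't auto-tune to different kernels
    torch.backends.cudnn.deterministic = True
    # Error out if a nondeterministic kernel would otherwise be used:
    torch.use_deterministic_algorithms(True)

seed_everything(0)
```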
Can you elaborate? In my experience they are; given the same inputs they produce the same outputs. But there is a lot I don’t know especially on this topic.
A chat application built around an LLM may not appear deterministic but that is by design. It feeds additional state on subsequent inferences to improve the results. The fact that the LLM is ML/AI based is immaterial. The same would be true of a traditional LM.
This is a hot area of research now because everyone wants to release ML apps (LLM, diffusion, etc.), but they are afraid they might produce something inappropriate. The size of the datasets makes it unrealistic to retrain, while transfer learning does not constrain the outputs enough and runs the risk of impacting the underlying performance. One technique is to add a parallel model that learns to constrain the unadulterated original model. ControlNet does this with Stable Diffusion.
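If I recall the Hugging Face diffusers API correctly, using it looks roughly like this; the base Stable Diffusion weights stay untouched while the parallel ControlNet branch steers the output toward a conditioning image (the file path and prompt here are placeholders):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Placeholder: a precomputed edge map (e.g. from a Canny filter) used as the constraint.
canny_edges = Image.open("edges.png")

# The parallel "constraint" model...
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
# ...attached to an unmodified Stable Diffusion base model.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe("a red sports car, studio lighting",
              image=canny_edges, num_inference_steps=20).images[0]
result.save("constrained_output.png")
```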
I think this will become the norm. There will be unconstrained, high performing foundational models that feed into app-specific models that adapt and constrain the final results.
The GPU itself is a parallel processor, and different cores (really groups of cores) can execute in a non-deterministic order. Something as small as the RAM hitting a refresh cycle might delay a memory access, causing the GPU to select a different thread bundle to execute first. Or the fact that the GPU is probably rendering your desktop screen at the same time: this doesn’t take too many resources, but the assignment of some cores to handle desktop rendering rather than your training will alter things.
You’re right, though, in that these effects are generally dwarfed by other ones. After all, the only reason this is an issue at all is because FP math gets rounded differently depending on the order. But if the rounding behavior had a significant effect, it would mean you didn’t have enough precision in the first place.
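A quick demonstration of that order dependence (the shuffled-sum difference will vary run to run, but it stays down in the least-significant bits):

```python
import random

# Floating-point addition isn't associative, so summation order matters.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0: the 1.0 is lost when rounded against 1e16

# Summing the same numbers in a different order gives a slightly different total.
xs = [random.uniform(-1, 1) for _ in range(100_000)]
s1 = sum(xs)
random.shuffle(xs)
s2 = sum(xs)
print(s1 - s2)       # tiny, but usually not exactly zero
```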
And, interestingly, inference often does quite well when the weights are heavily quantized, down to 8, 4, or even fewer bits. The output gets slightly worse, but it doesn’t break completely the way we expect computer programs to behave. It degrades gracefully.
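A toy version of that graceful degradation, using made-up weights and a naive uniform quantizer rather than anything a real inference stack does:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 1000)).astype(np.float32)  # pretend layer weights
x = rng.normal(size=(1000,)).astype(np.float32)      # pretend activations

def quantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization, then back to float for comparison."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / levels
    return np.round(weights / scale) * scale

for bits in (8, 4, 2):
    Wq = quantize(W, bits)
    err = np.linalg.norm(W @ x - Wq @ x) / np.linalg.norm(W @ x)
    print(f"{bits}-bit weights: relative output error ~ {err:.1%}")
# The error grows as bits shrink, but the layer keeps producing "close" answers
# rather than failing outright.
```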
It isn’t just an issue of the precision of floating-point math; the complexity of a large neural network with a feed-forward transformer architecture is such that differences in handling floating-point calculations will have some effect. You can posit an artificial neural network with weights “…heavily quantized, down to 8, 4, or even fewer bits,” but no real-world large language model would do this.
I’m not a user of ChatGPT or the other publicly available LLMs, but from what I’ve seen, while they should be essentially deterministic in processing exactly the same input text, early releases were definitely not reliable in interpreting prompts that are semantically equivalent but textually different. That is obviously a problem because you can’t rely upon casual users to understand how to formulate prompts just-so. More recent releases seem to be generally more consistent in interpretation, but I’ve seen instances of the same prompt (of some reasonable degree of complexity) generating inconsistent or flatly contradictory responses.
Furthermore, you shouldn’t have to rely upon an LLM being strictly deterministic. While you want the content of the response to be factual and semantically equivalent when addressing essentially the same question, the desire to use LLMs in casual human-interface roles means that you actually don’t want them to be robotically consistent; they should be ‘conversational’, with the normal degree of variation that a human representative would demonstrate, while still interpreting inputs correctly and providing accurate and safe outputs. Human brains are certainly not deterministic state machines, and an LLM or other generative AI model complex enough to do useful ‘creative’ work wouldn’t be expected to be, either.
Infinite-precision math is a thing. Many LLM architectures don’t even use any transcendental functions: it’s just adds, multiplies, and max(0, x) (called ReLU). That can all be handled with infinite-precision rational numbers.
My point is just that any indeterminism from order dependencies arises from rounding to finite precision. The math would be associative if it weren’t for that. So the error generally lives in the least-significant bits (LSBs). And, based on observations of how well things behave under quantization, those LSBs really aren’t all that important. Or at the least, the system degrades gracefully with less precision (or more noise).
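For example, a single ReLU neuron evaluated with Python’s exact Fraction type gives identical answers no matter how the sum is ordered, which is exactly what finite-precision floats can’t promise (the weights here are arbitrary):

```python
from fractions import Fraction

def relu(v: Fraction) -> Fraction:
    return max(Fraction(0), v)

def neuron(weights, inputs, bias):
    # Adds and multiplies on exact rationals never round, so grouping is irrelevant.
    return relu(sum(w * x for w, x in zip(weights, inputs)) + bias)

w = [Fraction(1, 3), Fraction(-2, 7), Fraction(5, 11)]
x = [Fraction(9, 2), Fraction(1, 5), Fraction(-3, 4)]
b = Fraction(1, 13)

out = neuron(w, x, b)
out_reordered = neuron(list(reversed(w)), list(reversed(x)), b)
print(out, out == out_reordered)   # exact Fraction, and True: order can't matter
```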
That’s certainly true. In fact, the encoding of the input is itself somewhat ambiguous. The first step is to change a sequence of text into a series of tokens. Short words often get their own token, while longer ones might be broken up into something like syllables. But since there are enough tokens to cover any arrangement of letters (including individual letters), there are multiple ways of expressing any word. A tweak to the token encoder could result in different output even given the same neural net weights.
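As a concrete illustration (using OpenAI’s tiktoken tokenizer, just because it’s easy to poke at), the same word can be spelled as a few subword tokens or letter by letter:

```python
import tiktoken  # OpenAI's BPE tokenizer, used here only as a handy example

enc = tiktoken.get_encoding("cl100k_base")

subword_ids = enc.encode("determinism")
print(subword_ids)                      # a short list of subword token IDs

# The same surface string can also be spelled out character by character:
letter_ids = [t for ch in "determinism" for t in enc.encode(ch)]
print(letter_ids)                       # a longer, different list of IDs
print(enc.decode(letter_ids) == enc.decode(subword_ids))  # True: same text either way
```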
Also true. Hence the “temperature” factor that enables it to generate tokens non-deterministically.
That said, there are different degrees of this. The random number could be purely random, so that the same prompt never gave the same results twice. But it could also be pseudorandom, with the same starting seed each time. It would have the same effect of seeming less robotic, but given the exact same prompt it would still output the same thing. Or perhaps it would be useful to control the starting seed: keep that random most of the time, but if desired it could be set to something specific so that you could get deterministic output.
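A sketch of that last idea, with invented logits: temperature reshapes the distribution, and the seed decides whether the draw is reproducible.

```python
import numpy as np

def sample_token(logits, temperature, seed=None):
    """Softmax sampling; seed=None gives fresh randomness, a fixed seed replays it."""
    rng = np.random.default_rng(seed)
    scaled = logits / max(temperature, 1e-8)   # low temperature approaches argmax
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])       # made-up scores for 4 tokens

print(sample_token(logits, temperature=0.8, seed=123))  # reproducible
print(sample_token(logits, temperature=0.8, seed=123))  # same token again
print(sample_token(logits, temperature=0.8))            # may differ run to run
```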
Indeed. Some models can actually perform better, since the quantization acts as a regularizer that helps them generalize. It sort of forces them to ‘forget’ the minute details of the training data.