# Timing probability question

This is based on real life, and I’m curious about the answer.

We have a tale of two programs, A and B, and a programmer who likes to live dangerously and doesn’t care about race conditions.

Program A is a generation program: over time it will generate a set of N files, sequentially. It will not generate file n until it has generated file n-1, but you can assume that the moment it starts generating file n, it will take t_n ~ T_1 seconds, where T_1 is some distribution. (We can assume it’s uniform between x seconds minimum and y seconds maximum, to make it simpler.)

Program B is an evaluation program. It randomly samples a file from the set of N with replacement. Overall, it will evaluate M (not necessarily unique) files over the course of its run. Each evaluation ends in t_m ~ T_2 seconds, where T_2 is a distribution distinct from T_1 but similarly shaped (i.e. uniform between some min and max time).

The trick, of course, is that Program A is generating while Program B is evaluating. So a new file may appear while B is evaluating a file. No files will ever be deleted, but it IS a problem if B selects file n – the one A is currently generating.

Assuming program A gets a head start of a few files (we’ll call that k), what’s the probability that program B will finish all M evaluations without picking the file A is currently working on?

There are two obvious “win situations” here:

• Program A finishes first, in which case the probability from then on of picking the file A is working on is 0.
• Program B finishes first, in which case even if A is still working, we win.

You can assume that there’s no discontinuity where A and B finish an iteration simultaneously, just assume A has created n+1 by that point.

Obviously at the moment B is selecting a file (which we’ll assume takes no time for simplicity), it has a probability (n-1)/n of selecting a “good” file if A is still running (and n/n if A has terminated), but I really can’t think of how to solve this. It seems like it should have a closed-form a-priori solution, even if it’s kinda gnarly.
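If it helps anyone poke at it, here’s a quick Monte Carlo sketch of the setup as described (all parameter values below are placeholders, and it assumes B’s directory listing also sees the file A is mid-write):

```python
import random

def simulate(n, m, k, gen_range, eval_range, trials=20_000):
    """Estimate P(B finishes all m picks without selecting A's in-progress file).

    n: total files A generates; m: evaluations B performs;
    k: head start (complete files before B starts);
    gen_range / eval_range: (min, max) seconds for uniform T_1 / T_2.
    """
    wins = 0
    for _ in range(trials):
        # Completion time of each of A's n files.
        done_at, t = [], 0.0
        for _ in range(n):
            t += random.uniform(*gen_range)
            done_at.append(t)
        clock = done_at[k - 1]  # B starts once k files are complete
        ok = True
        for _ in range(m):
            complete = sum(1 for d in done_at if d <= clock)
            if complete < n:
                # A is mid-write: the listing shows complete+1 files,
                # and picking index `complete` is the bad one.
                if random.randrange(complete + 1) == complete:
                    ok = False
                    break
            clock += random.uniform(*eval_range)  # evaluation time t_m ~ T_2
        wins += ok
    return wins / trials

print(simulate(n=50, m=30, k=5, gen_range=(1, 3), eval_range=(1, 3)))
```

With a big enough trial count this should converge to whatever the closed-form answer is, so it at least gives a sanity check for any proposed formula.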

Your pseudo-math makes it too hard for me to decide whether I understand the problem statement or not.

But who cares what the probabilities are, anyway? The bad situation WILL happen at some point. Period.

I wonder what happens when the second program opens the file the first program is still writing. I always learned that when a program has a file open for writing, other programs can’t read the same file at the same time. But that doesn’t seem to be entirely true these days.

Back in the DOS days, files had an archive bit for exactly this purpose. When a file was created or modified, the archive bit was set, so the backup program knew to pick it up on its next run; once the file was backed up, the bit was cleared and the backup program knew it didn’t have to revisit it.

A simple solution would be for the first program to create files in a different place or with a different name and then rename/move them when the files are complete so the second program will only see files that are ready for processing.
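For the record, that pattern is only a few lines in Python (names here are illustrative, not from the actual programs). On POSIX the rename is atomic, and os.replace also works cross-platform:

```python
import os

def write_then_publish(directory, name, rows):
    """Write CSV rows to a temporary name, then rename into place.

    A reader listing the directory for .csv files never sees a
    half-written file, because the .part name doesn't match.
    """
    tmp_path = os.path.join(directory, name + ".part")
    final_path = os.path.join(directory, name)
    with open(tmp_path, "w") as f:
        for row in rows:
            f.write(",".join(map(str, row)) + "\n")
    os.replace(tmp_path, final_path)  # the final (atomic) step
```

The reader then just ignores anything ending in .part, or only matches *.csv.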

**iljitsch** nailed it. You’re not going to generate a useful probability distro T_1 or T_2. Neither are you going to be able to characterize your ns & ks very well.

What you’re going to end up with is a probability distro of achieving a probability distro of having a collision.

Seems easier to have A gen a file and as a final atomic step rename it to a name B recognizes than it does to do all this math to determine how likely a completely predictable and totally avoidable failure is.

You didn’t explain how B discovers n. If B determines n at run start the scenario is different than if B (re-)determines n before each file selection. You also didn’t explain the relationship between n and B’s file selection process. e.g. if B knows n=10 at some moment, does it only select files named/numbered 0-9? Or does B not know or use n at all? These implementation details would matter hugely if you were actually trying to determine the collision probability distro.

A bit of a hijack:

• Modern operating systems do offer the (optional) ability for B to read a file while A is writing it. There are also (optional) mechanisms to prevent it from happening.

• Why on Earth doesn’t A create each file with name XXYYFHEOW.txt, then rename it to an expected name (for B’s use) only after it’s completely ready? (ETA: Oops, LSLGuy already said this.)

Is it possible for B to choose file #10, while A is still creating file #5 (i.e., long before it’ll even start file #10)? If so, then you’re very likely to get a lose condition on the very first sample. And if not, then that must mean that B somehow has access to a list of what files are available, in which case you shouldn’t put files on that list until they actually are available.

Guys, I know how to handle interprocess/filesystem race conditions. I didn’t in this case because I was doing a one-time conversion of a bunch of replays into CSV files and got impatient due to deadlines. I serialized the model I was training at each step, so if it panicked I could just restart the process (I Ctrl+C it all the time anyway). And even if it could read a partial file, since it was a variable-length-sequence training problem the data was still good even if it ended early, so it didn’t matter.

I just thought it was an interesting, if complex, probability problem I wasn’t sure how to tackle.

No, it only selects from the set of files that have already been generated. (And the list of files was literally just os.listdir on the directory my csv writer was spitting out files into).

And yes, it rereads the list at each selection.
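So B’s selection loop presumably amounts to something like this (my guess at the shape from the description, not the actual code):

```python
import os
import random

def pick_file(directory):
    files = os.listdir(directory)  # re-read the listing at every selection
    return random.choice(files)    # uniform, with replacement across calls;
                                   # may grab the file A is mid-write
```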

OK, just for fun …

Let’s make the massively simplifying assumption that A generates files at a constant rate. And B consumes them at the exact same constant rate. We let A get a head start of k files and then fire up B. The odds on a collision the first time are 1/(k+1). For the second it will be 1/(k+2).

Process A eventually ends after generating n unique files, and B eventually ends after consuming m (possibly non-unique) files. Depending on the relation between k+n & m either A or B will finish first.

ISTM SUM(1/(k+i)) over range i=1 to min(m, n-k) gives an upper bound (the union bound) on the odds of an overlap failure on at least one file read by B. Intuitively this says that after i gets big, the incremental risk from A continuing to run is negligible. And the bigger k is, the safer overall.

And finally 1 - SUM(1/(k+i)) over that same range gives a lower bound on the likelihood of B completing with zero read overlaps with A.

I’m not seeing much hope of a closed-form solution to this stuff. And that’s for the huge simplifying assumption that both A & B process all files in the same constant time. Clearly if B is faster, it has relatively more file open events.
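ETA: actually, under that same lockstep assumption there is a closed form. Pick i collides with probability 1/(k+i) and the picks are independent, so the survival probability telescopes: PROD over i=1..j of (k+i-1)/(k+i) = k/(k+j), where j = min(m, n-k) is how many picks B makes while A is still running. Quick check (made-up numbers):

```python
from fractions import Fraction

def p_no_collision(n, m, k):
    """Exact P(B never hits A's in-progress file) under the lockstep
    assumption: both programs tick at the same constant rate, and
    pick i collides with probability 1/(k+i)."""
    j = min(m, n - k)  # picks made while A is still running
    if j <= 0:
        return Fraction(1)
    # prod_{i=1}^{j} (k+i-1)/(k+i) telescopes to k/(k+j)
    return Fraction(k, k + j)

print(p_no_collision(n=50, m=30, k=5))  # -> 1/7
```

So in this toy model the head start k is everything: the success probability is just k/(k+j).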