What’s the appropriate way to use these statistics to create a prediction model/tool?

Rather than take the time to describe the problem in the abstract, I’ll go straight to the real-world situation I’m trying to apply this to. There is a process called the “match” for graduating medical students going into internship and residency training. Basically, after applying and interviewing for different slots, programs and applicants both electronically submit a rank-ordered list of where they’d most like to go or whom they’d most like to have at their program. Blinkenlights ensue, and at the other end about 90% of graduating US medical students find out where they’re going to train for the next 2-7 years when the computer mass e-mails the results of the match on the same day in March.

This year, the people who carry out this process released a broad set of tables describing a single main outcome: matched vs. didn’t match in the first-ranked specialty, with the assumption that the first specialty you rank is the one you really want. For instance, you could rank a Family Medicine program at Harvard first and a Neurology program at Podunk University second; if you matched to Neurology at Podunk U, you would not have matched to your first-ranked specialty. These tables break out a ton of different factors, such as how many programs you ranked in your first specialty, how good your board scores were, etc., and report the number of applicants who matched vs. didn’t match in their preferred specialty.

http://www.nrmp.org/data/chartingoutcomes2009v3.pdf

I want to create a mathematical model that will take as many factors as possible, be tolerant of factors being left out (for instance, you may not yet know how many programs you’ll rank), and spit out a single number: your probability of matching in your preferred specialty.

Once I have the basic model, is there some appropriate way to weight each factor? For instance, maybe the tables break it out by board scores and hair color. It’s a competitive specialty: Diagnostic RadiationDermatoPlastic Surgery. 100/220 applicants with board scores around the median national board score matched, but 3/3 red-haired applicants matched. I don’t want to tell a red-haired applicant with median board scores they have a 75% chance of matching; presumably, with a larger sample size, only 10/20 red-haired applicants would have matched. Can I use the number n in each category to somehow weight my model?
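For what it’s worth, one standard trick here (my own suggestion, not anything from the NRMP tables) is empirical-Bayes-style shrinkage: blend each category’s observed rate with the specialty’s overall rate, pulling small samples hardest toward the overall rate. A minimal Python sketch, where the pseudo-count m is an arbitrary tuning knob:

```python
def shrunken_rate(matched, total, base_rate, m=20):
    """Blend a category's observed match rate with the base rate.

    matched, total -- counts for the category (e.g. 3 of 3 red-haired applicants)
    base_rate      -- the specialty's overall match probability
    m              -- pseudo-count: how many "imaginary" applicants at the
                      base rate to mix in (a tuning choice, not from the data)
    """
    return (matched + m * base_rate) / (total + m)

base = 100 / 220                      # median-board-score group as a stand-in
print(shrunken_rate(3, 3, base))      # ~0.53, not the raw 100%
print(shrunken_rate(100, 220, base))  # ~0.45, essentially unchanged (large n)
```

The larger a category’s n, the less the base rate matters, which is exactly the sample-size weighting you’re asking about.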

My first guess is risk ratios, but I’m most familiar with applying them to situations where there’s a small or very small a priori probability, whereas here the a priori probability is in the 60-95% range. Is RR still a valid measure? I know the difference between odds ratios and risk ratios, but is sticking to RR really all I have to worry about? I’m also not certain about the validity of multiplying a large number of risk ratios (I know there are ways to do it with a moderate number, but 15 factors?), and that still doesn’t address the issue of a weighted model that favors the risk ratios derived from larger sample sizes. Is my best approach really to work my way through the Wikipedia article on relative risk and just do it that way?
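To make my worry about stacking risk ratios concrete (with made-up numbers): with a baseline of 0.75, two factors each carrying RR = 1.2 multiply out to 0.75 × 1.2 × 1.2 = 1.08, which isn’t a valid probability. I gather that combining effects on the odds scale instead can’t escape [0, 1], something like this sketch:

```python
import math

base = 0.75
rrs = [1.2, 1.2]              # hypothetical risk ratios
print(base * math.prod(rrs))  # 1.08 -- not a probability!

def combine_odds_ratios(base, odds_ratios):
    """Apply hypothetical odds ratios multiplicatively on the odds scale."""
    odds = base / (1 - base)
    for or_ in odds_ratios:
        odds *= or_
    return odds / (1 + odds)

print(combine_odds_ratios(base, [1.5, 1.5]))  # ~0.87, still a probability
```

Is that the right instinct, or is there a cleaner standard approach?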

Linear modeling?

As a tacked-on question, let’s say you were going to make a website that lets people run this simulation with several hypothetical scenarios, for instance, “What are my chances if I get certain grades, a certain number of publications before graduation, etc.?” What would be the hands-down easiest, most robust way to go about programming this website for someone with moderate technological aptitude? (I’ve made websites before with Flash, etc. (see my profile), but never with much advanced programming.)

It’s going to be a bit difficult to come up with a good model for this. The matching is performed using some variant of a stable matching algorithm, so where you end up depends on the nature of the candidate pool, the preferences of individual candidates, and the preferences of the institutions. Most of that, if not all of it, is information that you just don’t have, and even if you do know it for a given year, it’s not guaranteed that it will be at all similar in any other year.
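If you’re curious what the core of such an algorithm looks like, here’s a toy applicant-proposing deferred-acceptance (Gale-Shapley) matcher in Python; the real NRMP algorithm also has to handle couples, multi-slot program quotas, and other wrinkles that this sketch ignores:

```python
def gale_shapley(applicant_prefs, program_prefs):
    """applicant_prefs: {applicant: [programs, most preferred first]}
       program_prefs:   {program: [applicants, most preferred first]}
       Returns a stable matching {program: applicant}, one slot per program."""
    # Precompute each program's ranking of applicants for O(1) comparisons.
    rank = {p: {a: i for i, a in enumerate(prefs)}
            for p, prefs in program_prefs.items()}
    free = list(applicant_prefs)        # applicants still seeking a slot
    next_choice = {a: 0 for a in free}  # index of each applicant's next try
    match = {}                          # program -> tentatively held applicant

    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                    # list exhausted: applicant goes unmatched
        p = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if a not in rank[p]:
            free.append(a)              # program didn't rank this applicant
        elif p not in match:
            match[p] = a                # open slot: tentatively accept
        elif rank[p][a] < rank[p][match[p]]:
            free.append(match[p])       # bump the less-preferred tentative match
            match[p] = a
        else:
            free.append(a)              # rejected; propose to next choice

    return match

print(gale_shapley(
    {"Alice": ["Harvard FM", "Podunk Neuro"], "Bob": ["Harvard FM"]},
    {"Harvard FM": ["Bob", "Alice"], "Podunk Neuro": ["Alice"]}))
# -> {'Harvard FM': 'Bob', 'Podunk Neuro': 'Alice'}
```

The “deferred” part is the key: no program finally commits until every applicant has run out of proposals, which is what makes the result stable.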

Mumble chemometrics… mumble, mumble discriminant analysis… mumble Mahalanobis distance… principal component analysis mumble mumble.
Maybe those terms would help, but as you can tell, I could be mumbling out my ass.

Thank you, I appreciate the detailed information on the algorithm used to make the matches in the first place; I was always curious about how they did that. Also, yours was the first name that popped into my head when I was hoping someone would take a stab at this. However, I don’t really feel that I need the level of precision suggested by your post. The match characteristics for a given specialty are fairly stable from year to year, and I don’t need to develop a comprehensive view of every candidate and every program. I just need a prediction model that says, okay, 75% of the people who want to match into neurology are able to do so.

So, our a priori probability of a “successful match” is 0.75. For subjects with a board score 1 standard deviation above the national mean, the “successful match” probability is 0.90. For subjects without a completed research project prior to graduation, the probability of a “successful match” is 0.60.

Now we have a given applicant with a board score 1 SD above the national mean and no completed research projects at graduation. What is the probability that this individual will match, based solely upon the publicly available information? Certainly these conditional probabilities have something to contribute to our estimate.

There is something you can do, but it requires some strong and probably not justifiable assumptions. Let M denote the event that an applicant is matched, and let F1, F2, …, Fn respectively denote the events that an applicant has the first, second, …, nth factor relevant to matching. Then P(M | F1 & F2 & … & Fn) = [P(M|F1) P(M|F2) … P(M|Fn) · P(F1) P(F2) … P(Fn)] / [P(M)^(n-1) · P(F1 & F2 & … & Fn)].

I’ll walk through the derivation for the two factor case in the hopes that this becomes clearer. You will need to be familiar with Bayes’ theorem and conditional probabilities in order to completely understand what’s going on.

Let F1 be the event that an applicant has a board score 1 SD above the national mean, and let F2 denote the event that an applicant has no completed research projects by graduation. We know P(M|F1), and by Bayes’ theorem we can calculate P(F1|M) as P(M|F1) P(F1) / P(M). Likewise, P(F2|M) = P(M|F2) P(F2) / P(M).

Now we make the assumption that F1 and F2 are conditionally independent given M, which implies that P(F1 & F2 | M) = P(F1|M) P(F2|M). This is the strong assumption that I mentioned earlier. Unfortunately, there’s not a whole lot you can do without making it, so we’re stuck with it.

Simple substitution then lets us conclude that P(F1 & F2 | M) = P(M|F1) P(M|F2) P(F1) P(F2) / P(M)^2. We then apply Bayes’ theorem one more time to compute P(M | F1 & F2) = P(F1 & F2 | M) P(M) / P(F1 & F2). This gives as the final answer P(M | F1 & F2) = P(M|F1) P(M|F2) P(F1) P(F2) / [P(M) · P(F1 & F2)].

If you don’t have the data necessary to compute P(F1), P(F2), …, P(Fn) and P(F1 & F2 & … & Fn), you can assume that these events are independent, in which case P(F1 & F2 & … & Fn) = P(F1) P(F2) … P(Fn). This is a second very strong and probably not justifiable assumption, but if you have to make it, you have to make it. In this case, the final formula simplifies to P(M | F1 & F2 & … & Fn) = P(M|F1) P(M|F2) … P(M|Fn) / P(M)^(n-1).
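With the numbers from your earlier post (P(M) = 0.75, P(M|F1) = 0.90 for the board score, P(M|F2) = 0.60 for no research), that simplified formula gives 0.90 × 0.60 / 0.75 = 0.72. As a sketch in Python:

```python
def match_probability(base_rate, conditionals):
    """P(M | F1 & ... & Fn) = P(M|F1) ... P(M|Fn) / P(M)^(n-1),
    valid only under both independence assumptions described above."""
    p = 1.0
    for c in conditionals:
        p *= c
    return p / base_rate ** (len(conditionals) - 1)

# Board score 1 SD above the mean (0.90), no completed research (0.60):
print(match_probability(0.75, [0.90, 0.60]))  # 0.72
```

Note that nothing in this formula renormalizes, so with enough favorable factors the product can exceed 1; treat that as one more warning about how strained the independence assumptions are.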

If you’re going to present this model, you absolutely must include in the presentation the assumptions that went into deriving it.

Excellent, thank you for those equations, pointers, links, and warnings on assumptions. I’m still trying to think about the weighting, but it seems fairly obvious to me that there won’t be any completely straightforward way of going about this in an entirely valid manner.

As always, you’re a virtually unequaled resource of statistical knowledge on this board. I appreciate your help.