Rather than try to describe the problem in the abstract, I’ll go straight to the real-world situation I’m trying to apply this to. There is a process called the “match” for graduating medical students going into internship and residency training. Basically, after applying and interviewing for different slots, programs and applicants both electronically submit a rank-ordered list of where they’d most like to go or whom they’d most like to have at their program. Blinkenlights ensue, and at the other end about 90% of graduating US medical students find out where they’re going to train for the next 2-7 years when the computer mass e-mails the match results on the same day in March.
This year, the people who carry out this process released a broad set of tables describing a single main outcome: matched vs. didn’t match in the first-ranked specialty, with the assumption that the first specialty you rank is the one you really want. For instance, you could rank a program at Harvard in Family Medicine first and a program in Neurology at Podunk University second; if you matched to Neurology at Podunk U, you would not have matched to your first-ranked specialty. These tables break out a ton of different factors (how many programs you ranked in your first specialty, how good your board scores were, etc.) and give the number of applicants who matched vs. didn’t match to their preferred specialty.
http://www.nrmp.org/data/chartingoutcomes2009v3.pdf
I want to create a mathematical model that will take as many factors as possible, will tolerate factors being left out (for instance, you may not yet know how many programs you’ll rank), and will spit out a single number: your probability of matching in your preferred specialty.
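For concreteness, here’s a rough sketch (in Python, just because that’s what I’d reach for) of the kind of thing I’m imagining: a logistic-style model where each factor nudges a log-odds score and anything you leave out simply contributes nothing. The factor names and coefficients here are entirely made up for illustration, not taken from the NRMP tables.

```python
import math

# Hypothetical per-factor log-odds contributions. These numbers are invented
# purely for illustration; the real ones would have to come from the data.
INTERCEPT = 1.0  # baseline log-odds of matching
COEFFICIENTS = {
    "board_score_above_median": 0.6,
    "programs_ranked_in_specialty": 0.15,  # per program ranked
    "publications": 0.05,                  # per publication
}

def match_probability(factors):
    """Estimate P(match) from whatever factors are known.

    `factors` is a dict; any factor left out contributes nothing,
    which is how I'd like 'I don't know that yet' to behave.
    """
    log_odds = INTERCEPT
    for name, value in factors.items():
        if name in COEFFICIENTS:
            log_odds += COEFFICIENTS[name] * value
    return 1.0 / (1.0 + math.exp(-log_odds))  # logistic transform

# Example: median board scores, 6 programs ranked, publications unknown
print(match_probability({"board_score_above_median": 0,
                         "programs_ranked_in_specialty": 6}))
```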
Once I have the basic model, is there some appropriate way to weight each factor? For instance, maybe the tables break it out by board scores and hair color for a competitive specialty: Diagnostic-Radiation-Dermato-Plastic Surgery. 100/220 applicants with board scores around the national median matched, but 3/3 red-haired applicants matched. I don’t want a tiny sample like that to dominate the estimate and tell a red-haired applicant with median board scores that they have a 75% chance of matching; presumably, with a larger sample size, something like 10/20 red-haired applicants would have matched. Can I use the n in each category to somehow weight my model? (See the sketch below for the kind of “weighting by n” I have in mind.)
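One version of “weighting by n” I’ve wondered about is simply shrinking each small-sample rate toward the overall rate, with the amount of shrinkage controlled by how many applicants are in the cell (essentially an empirical-Bayes-style pseudo-count estimate). A sketch, with the prior strength chosen arbitrarily:

```python
def shrunken_rate(successes, n, overall_rate, prior_strength=20):
    """Pull a small-sample match rate toward the overall rate.

    prior_strength acts like pseudo-observations at the overall rate;
    20 is an arbitrary choice for illustration.
    """
    return (successes + prior_strength * overall_rate) / (n + prior_strength)

overall = 100 / 220                      # ~45% among median-board-score applicants
print(shrunken_rate(3, 3, overall))      # 3/3 red-haired: ~0.52, not 1.00
print(shrunken_rate(100, 220, overall))  # large n barely moves: ~0.45
```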
My first guess is risk ratios, but I’m most familiar with applying those in situations where the a priori probability is small or very small, whereas here it’s in the 60-95% range. Is RR still a valid measure? I know the difference between odds ratios and risk ratios, but is sticking to RR really all I have to worry about? I’m also not certain about the validity of multiplying a large number of risk ratios together (I know there are ways to do it with a moderate number, but 15 factors?), and that still doesn’t address the issue of a weighted model that favors risk ratios derived from larger sample sizes. Is my best approach really to work my way through the Wikipedia article on relative risk and just do it that way?
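Part of what worries me about multiplying risk ratios at a 60-95% baseline is that the product can blow right past 1. A toy calculation (all numbers invented) showing the failure mode, and how the same multiplication on the odds scale stays inside (0, 1):

```python
baseline = 0.80                  # a priori probability of matching
rr_factors = [1.2, 1.15, 1.1]    # invented risk ratios for three factors

naive = baseline
for rr in rr_factors:
    naive *= rr
print(naive)  # 1.2144 -- an impossible "probability" above 1

# The same numbers applied as odds ratios instead:
odds = baseline / (1 - baseline)
for or_ in rr_factors:
    odds *= or_
print(odds / (1 + odds))  # ~0.86, still a legitimate probability
```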
Linear modeling?
As a tack-on question: let’s say you were going to make a website that lets people run this simulation for several hypothetical scenarios, for instance, “what are my chances if I… get certain grades, get a certain number of publications before graduation, etc.?” What would be the hands-down easiest, most robust way to go about programming that website for someone with moderate technological aptitude? (I’ve made websites before with Flash, etc. (see my profile), but never with much advanced programming.)
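For what it’s worth, the kind of thing I picture is just a form that feeds numbers into the model function sketched above. A minimal sketch using a Python micro-framework such as Flask (my assumption about tooling, not something I’ve used; the model here is a stand-in):

```python
# Minimal "what are my chances?" page -- a sketch only.
import math
from flask import Flask, request

app = Flask(__name__)

def match_probability(factors):
    # Stand-in for the real model; coefficients are invented.
    log_odds = 1.0 + 0.15 * factors.get("programs", 0) + 0.05 * factors.get("pubs", 0)
    return 1.0 / (1.0 + math.exp(-log_odds))

FORM = """
<form method="post">
  Programs ranked in your specialty: <input name="programs" value="0"><br>
  Publications before graduation: <input name="pubs" value="0"><br>
  <input type="submit" value="Estimate my chances">
</form>
"""

@app.route("/", methods=["GET", "POST"])
def estimate():
    page = FORM
    if request.method == "POST":
        factors = {k: float(request.form[k]) for k in ("programs", "pubs")}
        page += "<p>Estimated chance: %.0f%%</p>" % (100 * match_probability(factors))
    return page

if __name__ == "__main__":
    app.run(debug=True)
```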