Filling in missing crossword data

September 12, 2011

I have a question that will help inform a Web page about crossword puzzles, which in case you’re interested starts at Crossword Monograph (and also the answer will satisfy my curiosity).

This is not an emergency. For example, if Ben Bernanke also has a question for you, do him first.

My grade of C in Business Stats at the University of Kansas back in 1975 hasn’t helped. I’ve tried to noodle out this problem several times over the last few years, and I’ve asked people who thought they knew, but I still don’t have a defensible answer so now I’m turning to experts for help. If you don’t have an answer either, can you recommend where I should turn next?

========

Consider a list of 100 data points (they happen to be my finishing times for playing crossword puzzles) sorted from earliest to latest. Assume the trend is obviously if not steeply upward, i.e., on average I’m getting a little slower over time.

Now consider that a contiguous block of 20 of the data points is in fact missing. For example, I have the values for 1 - 35 and I have the values for 56 through 100, but I do not have the values for 36 through 55.

What is the most mathematically sensible way to fill in those 20 missing data?

========

It occurs to me that there are two possibilities. In the simpler one, a single value is calculated and used for all 20 of the missing data points. If that’s the better possibility, how do I calculate that value?

The other possibility, unlike the one above, would somehow account for the obvious upward trend by filling in the missing 20 values with a series of different values based on the known values before and after the gap. If this is the better possibility, how do I calculate those missing values?

If it’s neither of these, what is it? To repeat, what is the most defensible way of somehow filling in those 20 missing data, and what is the rationale?

(The real-world example of this problem, in case you’re interested, can be seen in Excel 2000 format at http://barelybad.com/xwd_times_194.xls. The gap starts at Row 173 and continues to Row 204. You’ll see it’s not one list of times but seven, one column for each day of the week. You’ll also find a second gap starting at Row 360, but I assume whatever answer you give for the first gap will apply to the second one.)

Thanks for any ideas you can offer.

–Johnny
barelybad.com (Laugh Think)

First you would need to decide how to model the data. you say the trend is obviously upward but is it better represented as a line, or as some type of quadratic curve, or some other curve? Once you know that, creating the model interpolating additional values is relatively easy. Excel can be used for a variety of such models. If I can find the time I’ll take a look at the file you referenced.

Absent any other reasoning, I’d assume the trend is linear across time (obviously that falls apart at some point, because it implies at some point in the past you were able to do crossword puzzles in negative time, but whatever, the whole exercise is more to satisfy your eccentricities than anything else), and plot it as a least-squares linear regression.

Then interpolate the points.

It might be easier to do if you reformatted your data into just two columns:
(Day #, Time in seconds)
so that your independent variable has values like 1, 2, 3, etc. and your dependent variable has values (in seconds) like 637, 632, 628, etc.

Then I’d ask Excel to do a scatter plot, and ask it to try various trendlines (linear, polynomial, logarithmic, etc.) until I found the one that looked like it did the best job of fitting your data points.

Note that you’re never going to be able to recover the data points themselves. The best you can do is assume that all of the missing points fall right on the curve predicted by your model. They almost certainly don’t: Some were actually higher, and some lower, by varying amounts. But you have no way of knowing which ones were higher and which were lower, and which ones were extreme and which ones were close to the average. So if you make a graph of all of the data points, real and estimated, the estimated portion of the graph is going to look much smoother than the rest.

The most intellectually honest way of handling this is just to graph the data points you have, and leave the gap in there, with an explanation in the caption that you lost some data points.