Statistical experimental design question

Grey · September 16, 2019, 5:42pm

Question about an experimental design which I feel is a GQ but it might wind up in IMHO. Let me give a brief high level view of the environment I’m trying to examine.

I have S systems and each system has a label (l, m, n) and might have an object (O)
Each system is associated with a single Region (R), but Regions do not necessarily hold the same number of systems
There are ~7000 systems, ~60 Regions and ~1000 Objects
There’s no expectation that a system’s object count (1 or 0) should ever change.

I think I’ve noticed instances where System X should have an Object O but doesn’t. So there are 2 possible things going on – either the original data list is flawed, or there’s an underlying chance that an Object may not be present in a System that previously held it.

I’m assuming this is a data integrity issue but I’d like to test it. However, at best 30 systems take over an hour to visually inspect. So SDMB what’s the best way to disprove the Null Hypothesis here?

Model 1 – Assumes no difference between Systems regardless of label (l, m, n) or Region (R)
Model 2 – Assumes a difference between Systems based on label (l, m, n)

am77494 · September 16, 2019, 7:27pm

Not an expert on Data Integrity, but we have similar situation in a project where many (100k plus) valves were suspect and needed a statistical method.

Here’s the general outline of the method used: Acceptance sampling - Wikipedia

And we settled on using ISO-2859-1 https://www.iso.org/obp/ui/#iso:std:iso:2859:-1:ed-2:v1:en
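As a rough illustration of the acceptance-sampling idea, here is a minimal sketch of the operating characteristic of a single sampling plan under a simple binomial model. The plan values (inspect 80, accept on at most 2 defectives) are hypothetical and are not taken from the ISO 2859-1 tables.

```python
from math import comb

def prob_accept(n: int, c: int, p: float) -> float:
    """Probability of seeing at most c defectives in a sample of n,
    assuming each inspected item is defective independently with rate p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

# Hypothetical plan: inspect 80 items, accept the lot if at most 2 are bad.
for rate in (0.01, 0.05, 0.10):
    print(f"defect rate {rate:.0%}: P(accept) = {prob_accept(80, 2, rate):.3f}")
```

A plan is judged by how quickly the acceptance probability falls as the true defect rate rises.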

EdelweissPirate · September 16, 2019, 8:27pm

Stating this in general terms is all well and good, but the description is so general that I can’t tell whether we’re talking about physical objects or representations of those objects on a computer (like a bill of materials). Unless I’m missing something:

References to visual inspections would apply to physical objects. But “data integrity” makes me think we’re talking about computer-based representations of physical objects.

If it’s the latter and you have two lists to compare, this is trivial: the UNIX “diff” command will show you only the changed systems.

From a data integrity perspective, checksums are one option. A checksum for each system will show changes compared to previous sets of checksums, and a checksum for the collection of systems would make it easy to tell at a glance whether any system object counts have changed.
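A minimal sketch of the checksum idea, assuming each snapshot of the list is just a mapping from system name to a has-object flag. The system names and the record layout are made up for illustration.

```python
import hashlib

def system_checksum(name: str, has_object: bool) -> str:
    """Checksum of one system's record (name plus 0/1 object flag)."""
    record = f"{name}\t{int(has_object)}".encode("utf-8")
    return hashlib.sha256(record).hexdigest()

def collection_checksum(systems: dict[str, bool]) -> str:
    """One checksum for the whole list; sorted so ordering doesn't matter."""
    digest = hashlib.sha256()
    for name in sorted(systems):
        digest.update(system_checksum(name, systems[name]).encode("ascii"))
    return digest.hexdigest()

def changed_systems(old: dict[str, bool], new: dict[str, bool]) -> list[str]:
    """Systems whose per-system checksum differs between two snapshots."""
    def checksum(snapshot, name):
        return system_checksum(name, snapshot[name]) if name in snapshot else None
    names = old.keys() | new.keys()
    return sorted(n for n in names if checksum(old, n) != checksum(new, n))

# Hypothetical snapshots of the master list taken a week apart.
last_week = {"Alpha": True, "Beta": False, "Gamma": True}
this_week = {"Alpha": True, "Beta": False, "Gamma": False}

print(collection_checksum(last_week) == collection_checksum(this_week))  # False
print(changed_systems(last_week, this_week))  # ['Gamma']
```

The collection checksum answers "did anything change?" at a glance; the per-system comparison then says where.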

It would help a lot to get a little specificity about the domain you’re operating in.

EdelweissPirate · September 16, 2019, 8:53pm

This is an ambiguous statement (to me, anyway). Do you mean that there’s an expectation that a given system’s object count is assumed to be immutable?

Or do you mean that a changing object count wasn’t part of the original set of assumptions?

Quercus · September 16, 2019, 9:18pm

Do I understand this right?

You suspect (but do not yet have any proof) that some Systems that once had an object now do not (and this is not expected behavior).

You have a method for examining a System to see (with no possibility of error) both whether it currently has an object and whether it previously had an object.

Examining Systems is resource-intensive and you do not have the resources to examine them all to conclusively prove whether any objects have been lost.

You want to examine a sample set of Systems, which will enable you to either a) conclusively prove that an object has been lost; or b) state with X% certainty that no objects have been lost (X being less than 100%, since you won’t have examined every System).

You are asking whether the sample design should account for attributes of the Systems such as Region and label.

Is that correct?
If so, then, unless you have some reason to suspect (based on your knowledge of the actual things) that the label or Region affects whether an object could be lost, you should ignore the Region and label. You will get better statistical power that way.
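A simple random sample that ignores Region and label could be sketched as below. The list of 1000 system names and the assumed numbers of lost objects are hypothetical; the probability is the standard hypergeometric chance that a sample misses every lost object.

```python
import random
from math import comb

def draw_sample(system_ids, n, seed=None):
    """Simple random sample of n systems, ignoring Region and label."""
    rng = random.Random(seed)
    return rng.sample(list(system_ids), n)

def prob_all_clean(total: int, lost: int, n: int) -> float:
    """Chance that a sample of n contains none of the `lost` systems
    (hypergeometric: sampling without replacement from `total`)."""
    return comb(total - lost, n) / comb(total, n)

# Hypothetical: ~1000 systems are listed as holding an object.
listed = [f"system-{i}" for i in range(1000)]
sample = draw_sample(listed, 60, seed=1)

for lost in (10, 50):
    miss = prob_all_clean(len(listed), lost, len(sample))
    print(f"if {lost} objects were lost, a clean sample of {len(sample)} "
          f"happens with probability {miss:.3f}")
```

If every sampled system checks out, one minus that probability is the "X% certainty" that fewer than the assumed number of objects are missing.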
(this helps explain, maybe)

Grey · September 17, 2019, 3:05pm

Quercus you’ve got it exactly right.

EdelweissPirate I can see that my generalization actually gets in the way. So thanks, and let me restate the problem.

I play EveOnline. You have star Systems, inside Regions. I have a master list put together by players that denotes if a system (one of the thousands) holds a particular object. I recently stumbled across 2 systems missing these objects in the data table.

Obviously a data entry error is my default assumption, BUT there have been anecdotal reports of these objects being missing and returning. Surveying each system is not a viable method, so the question is: "What number of systems do I need to survey to reject the null hypothesis that these objects do not move (i.e. that any discrepancy is a data entry error in the master list)?"
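One hedged way to put a number on this: assume some fraction p of listed systems is currently missing its object, that systems are picked at random, and that each inspection is error-free. The chance of seeing at least one discrepancy in n inspections is then roughly 1 - (1 - p)^n, which can be inverted for n. The values of p below are assumptions, not anything measured in the game.

```python
from math import ceil, log

def systems_to_survey(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - p)**n >= confidence, i.e. the number of
    random inspections needed to see at least one discrepancy with the given
    probability, if a fraction p of listed systems is actually discrepant."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return ceil(log(1 - confidence) / log(1 - p))

for p in (0.01, 0.05, 0.10):
    print(f"assumed discrepancy rate {p:.0%}: survey {systems_to_survey(p)} systems")
```

For example, a 5% rate needs 59 systems and a 1% rate needs 299. At the roughly 30 systems per hour mentioned in the opening post, that is about two hours versus about ten. Note that this only says how many systems to look at before a discrepancy should show up; it does not by itself separate list errors from objects that really move.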

EdelweissPirate · September 17, 2019, 9:06pm

Thank you for clarifying.

I understand that you’re looking for a statistical approach, but I’d propose a brute-force method; as the cliché says, quantity has a quality all its own.

EveOnline seems to have a robust API available:

I haven’t dug into the hooks provided in the API, but skimming the article above, it seems that you can query the game’s database directly.

Even if the API won’t give you the object status of every system directly (avoiding the player-typo hazard you mentioned) it seems to me that you could likely script inspections of all the systems. If a player truly has to be present in a system to inspect it, one imagines that you could use the API to inspect the system and report back to a central server. With a few hundred players using this script, you’d inspect every system pretty quickly.

By combining both methods, you could validate the global information from the database query (if the API allows such a query).

By inspecting the same systems repeatedly over time, you could pick up on the object disappearance thing in a systematic way (or at least a non-anecdotal way).
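A minimal sketch of how repeated inspections could be turned into a list of disappearances and reappearances. It assumes each inspection is logged as a (system, day, present) tuple; the log below is invented, and no game API calls are shown.

```python
from collections import defaultdict

def find_flips(observations):
    """Group inspections by system, order them by day, and report every
    pair of consecutive inspections where the object's presence changed.
    Returns {system: [(day_before, day_after, was_present, is_present)]}."""
    by_system = defaultdict(list)
    for system, day, present in observations:
        by_system[system].append((day, present))

    flips = {}
    for system, seen in by_system.items():
        seen.sort()
        changes = [
            (d0, d1, p0, p1)
            for (d0, p0), (d1, p1) in zip(seen, seen[1:])
            if p0 != p1
        ]
        if changes:
            flips[system] = changes
    return flips

# Hypothetical inspection log: (system name, day number, object present?)
log = [
    ("Alpha", 1, True), ("Alpha", 2, True),
    ("Beta", 1, True), ("Beta", 2, False), ("Beta", 3, True),
]
print(find_flips(log))
```

A system that shows up here has changed state between two direct observations, which is something a one-off comparison against the master list cannot show.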

Maybe the API is unworkable for the ideas I mentioned above, but I’d be surprised if it was completely useless.
P.S. The linked article says that the database that you can query with the API is distributed and, under some circumstances, may be stale in some areas. You say that object status should be immutable, but maybe they’re doing something goofy with object statuses in stale portions of the database. That might explain why objects seem to occasionally appear or disappear.

Aspidistra · September 17, 2019, 9:14pm

When you say “missing these objects in the data table” do you mean that the master list says that there should be an Object in that star system, but when you look at the real EveOnline instance you see that there is not an Object in that star system?

I think I’m missing something here - what’s your method of telling the difference between “there’s not an object here right now even though the master list says there should be because the list-maker made a mistake” and “there’s not an object here right now even though the master list says there should be because there used to be one and it disappeared”?

Quercus · September 17, 2019, 9:42pm

That is my question, too. After all, if I understand right, you’ve already found one System without an Object where the list says there should be one, right? So you’ve proven that at least one System is incorrect in the list. What’s the point of statistically sampling more Systems if you can’t tell whether it’s a mistake in the list or a truly disappearing object?

Grey · September 19, 2019, 1:27pm

Thanks folks. I hadn’t considered trying to use the API but I suppose I can use it as a good excuse to learn how.

As for the persistence of these objects in a system: there have been times where players have reported the object is gone, only to have another report it present later on. Again, most likely a data validation issue, but nothing says the designers couldn’t have added a timer triggering a disappearance/reappearance.

At this point I’ve done a 7 day survey in a single region with 40 systems and 10 reported objects with zero issues. Feeling the problem is between the keyboard and chair.
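For a sense of what a clean survey does and does not rule out, here is a rough sketch. It assumes each of the 10 objects was checked once a day for 7 days (70 checks) and that checks are independent, which is optimistic if an object's state persists from day to day.

```python
def upper_bound_zero_events(checks: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-check event probability when
    `checks` independent checks all came back clean: solves
    (1 - p)**checks == 1 - confidence for p."""
    if checks <= 0:
        raise ValueError("checks must be positive")
    return 1 - (1 - confidence) ** (1 / checks)

checks = 7 * 10  # assumed: 10 objects, each checked once a day for 7 days
bound = upper_bound_zero_events(checks)
print(f"{checks} clean checks: per-check disappearance chance is below "
      f"{bound:.1%} at 95% confidence")
```

Under those assumptions the bound comes out at about 4% per check, close to the "rule of three" estimate of 3/70. A clean week is therefore consistent with no movement at all, but also with a small per-day chance of an object going missing.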

Buck_Godot · September 20, 2019, 3:49am

It’s a hard problem because there are so many unknowns.

How likely is a data error?
How likely is an object to be capable of movement?
If you search a system and an object was gone but then you come back after time t, how likely is it to have come back?
If an object has the ability to move but you see it here, how likely is it to move after time t?

With large data sets it might be possible to start nailing down these probabilities, but when you can only scan a few dozen systems, and you add in the possibility that these vary with qualities of the system, the problem becomes nigh impossible.

Your best bet is to search systems until you find that an object isn’t where it should be and then try checking again to see if it shows up, which is pretty much what anybody not attempting to solve the problem analytically would naturally do.

Grey · September 20, 2019, 12:31pm

True. If there is a mean time to “disappear” then it’s possible everything could be seen as fine and then, unknown to us, it flips. A broad continual survey would be the way to go I suppose. Or a periodic query into the database, assuming the object is available to players to check.

Still fun to think about, and a good excuse to figure out how to use the OpenAPI infrastructure available.

Reply · September 23, 2019, 8:22am

Is this a fair restatement of the question? "Given a list of 1000 objects that could either be red or blue (or on/off, 1/0, whatever), how many must be sampled to know the accuracy of the list to a confidence level of 95%?"

But yeah, that assumes the objects never change, that they are randomly distributed (and not affected by regions, etc.), that sampling doesn’t affect their status (maybe EVE only puts the objects there if no player has visited in X weeks?), etc.
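Under those same assumptions, the restated question has a textbook answer once a margin of error is chosen. The sketch below uses the usual normal-approximation sample size for a proportion with a finite population correction. The ±5% margin and the worst-case p = 0.5 are assumptions; the restatement only fixes the 95% confidence level.

```python
from math import ceil
from statistics import NormalDist

def sample_size(population: int, margin: float,
                confidence: float = 0.95, p: float = 0.5) -> int:
    """Sample size to estimate a proportion to within +/- margin,
    with a finite population correction for a list of known length."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z value
    n0 = z**2 * p * (1 - p) / margin**2                 # infinite-population size
    return ceil(n0 / (1 + (n0 - 1) / population))       # corrected for finite list

for margin in (0.05, 0.03):
    print(f"margin ±{margin:.0%}: sample {sample_size(1000, margin)} of 1000")
```

With a list of 1000 and a ±5% margin this gives 278 objects. Tightening the margin to ±3% pushes it past half the list.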

Is this just a rare loot drop from some sort of space pirate?
