# Algorithms for Data Classification

I have an algorithm that identifies an object by checking one parameter. Occasionally this value is a little out, but that’s OK; there is a tolerance of a couple of percent.

Most of the time the algorithm is actually performed on a batch of objects. The batch size is unknown, and it is what I want to calculate.

So everything works OK, but sometimes two groups form. Objects in group one return a certain value, and objects in group two can return a value up to 6% out. Critically, the algorithm must recognize when a batch contains a mixture of objects, so that the batch size calculation is still accurate.

I’m looking for some smart ideas so this last condition is met.

I’ve tried a Monte Carlo based simulation that calculates the probability of a batch containing mostly objects from group one or group two, based on the running average. However, this only works if there is a smooth transition from group one to group two. In some circumstances a batch may suddenly shift from containing mostly objects from group one to having a high percentage of objects from group two.

If you only have one parameter, you’re looking at a threshold of some kind (i.e., if the parameter is below some critical value, guess group 1, and group 2 otherwise). There are any number of ways you can estimate the cutoff, but it sounds like you already have a pretty good idea of what it should be.
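To make the threshold idea concrete, here is a minimal sketch. The nominal value of 100.0, the assumption that group 2 reads about 6% high rather than low, and the midpoint cutoff are all made up for illustration; swap in whatever your real data suggests.

```python
def classify(value, nominal=100.0, offset_fraction=0.06):
    """Guess group 1 or group 2 from a single parameter.

    nominal and offset_fraction are hypothetical: group 1 is assumed to
    sit near `nominal`, group 2 about 6% above it. The cutoff is placed
    halfway between the two.
    """
    cutoff = nominal * (1.0 + offset_fraction / 2.0)
    return 1 if value < cutoff else 2


def count_groups(values, **kwargs):
    """Count how many values in a batch fall on each side of the cutoff."""
    counts = {1: 0, 2: 0}
    for v in values:
        counts[classify(v, **kwargs)] += 1
    return counts


batch = [99.1, 100.8, 105.7, 101.2, 106.3]
print(count_groups(batch))  # {1: 3, 2: 2}
```

A batch with non-zero counts on both sides is a mixed batch, which is the condition you said you need to detect.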

I’m not entirely clear on what you mean by calculating the batch size. Can you give a little more detail?

I would suggest some kind of chemometric model, but I would need to know a bit more about the problem. I work regularly with near-infrared analysis, which relies on chemometrics to classify or quantify materials.

A few months ago I was asking around here for a chemical reaction I could monitor, but I didn’t get very far (it’s in the archives somewhere). Maybe your question is something I would take a particular interest in and could solve both of our problems. Send me a personal message if you want.

I’m guessing you’ll probably need a FIFO “memory cache” to keep track of the trailing samples. The cache may need to hold the detail data (raw parameters – the attribute you checked on). The FIFO queue/cache needs to be whatever size makes sense; it could be 100 items, 1000 items, 1 million items, etc. This lets you “peek back into the past” – this is something you can’t do with a summary or running average.
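Something like the following is what I have in mind, as a rough sketch only. The class name, the window size of 4, and the cutoff of 103.0 are placeholders I picked for the example, not anything from your setup.

```python
from collections import deque


class TrailingWindow:
    """Fixed-size FIFO of the most recent raw parameter values."""

    def __init__(self, size=100):
        # deque with maxlen drops the oldest item automatically when full
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def fraction_above(self, cutoff):
        """Share of the trailing samples at or above a (hypothetical) cutoff."""
        if not self.samples:
            return 0.0
        return sum(1 for v in self.samples if v >= cutoff) / len(self.samples)


window = TrailingWindow(size=4)
for v in [100.2, 99.8, 100.5, 106.1, 105.9, 106.4]:
    window.add(v)

print(list(window.samples))          # [100.5, 106.1, 105.9, 106.4]
print(window.fraction_above(103.0))  # 0.75
```

Because the window holds the raw values, a sudden jump shows up within a few samples, whereas a running average over the whole history would only drift slowly.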

A lot of software that monitors or filters dangerous network traffic uses algorithms that require FIFO queues. For certain types of TCP/IP attacks, you can’t just look at a single packet in isolation; you have to look at a bunch of them (all the thousands of packets in the past few seconds or minutes).

Your OP is somewhat vague so my answer may be totally off the mark.

I agree with Ruminator that your OP is pretty vague, but with the information presented, I would most likely use a log likelihood ratio test to determine which of the two groups each individual sample belongs in. These tests are especially easy to derive and are effective if you expect the values within a group to be normally or near-normally distributed. If you had more parameters (or depending on your data), a Bayesian test might be more appropriate…
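For what it's worth, a bare-bones version of that test might look like this. I am assuming both groups are Gaussian with the same spread; the means of 100.0 and 106.0 and the sigma of 1.0 are invented numbers, so estimate them from your own data.

```python
import math


def gaussian_log_pdf(x, mean, sigma):
    """Log density of a normal distribution at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mean) ** 2 / (2.0 * sigma ** 2))


def log_likelihood_ratio(x, mean1=100.0, mean2=106.0, sigma=1.0):
    """log p(x | group 2) - log p(x | group 1); parameters are hypothetical."""
    return gaussian_log_pdf(x, mean2, sigma) - gaussian_log_pdf(x, mean1, sigma)


def assign_group(x, threshold=0.0, **params):
    """Pick group 2 when the ratio exceeds the threshold, else group 1."""
    return 2 if log_likelihood_ratio(x, **params) > threshold else 1


samples = [99.5, 101.0, 105.2, 106.8]
print([assign_group(x) for x in samples])  # [1, 1, 2, 2]
```

With equal variances and a threshold of zero this reduces to a midpoint cutoff, but the ratio also tells you how confident each assignment is, and the threshold can be moved if one kind of mistake costs more than the other.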