Querying temporal data

I’ve got some data that I’d like to gather summary statistics on, but there’s a temporal component to it that makes it a little bit interesting. Let’s says that I’ve got a bunch of TV’s that are either getting clear reception, some interference, or complete static. I have some reason to believe that the amount of time that a TV spends in any of these states is influenced by the previous state. Obviously, this isn’t the real problem, but it gets the idea across.

So here’s my table structure:



TV.ID: integer
Event.Number: integer
State: {clear, static, mixed}
Duration: integer


TV.ID is the TV identifier, Event.Number tells me order in which things happen, State tells me what was happening, and Duration tells me how many seconds it lasted. This is maybe a little complicated, so let me give some sample data:



1    1    clear    52
1    2    mixed    36
1    3    clear    58
1    4    static   70


What this tells me is that TV 1 started off with clear reception, and that lasted for 52 seconds. Then there was some interference for the next 36 seconds, but it went back to clear for another 58 seconds. The last 70 seconds that I have records for were static.

The sort of thing that I want to do here is to select all the records where the state was clear and where the previous state was mixed. Once I have that, I can run whatever statistics I need to get on the event durations. Right now I’m iterating over everything to get the data I need, but I’m interested in writing a query to do it instead. What’s the proper way to do this in SQL?

I think what you want can be achieved by the equivalent to Oracle’s LAG analytical function.

It’s not ANSI sql, and you’d have to google up some advanced sql trick to do the same in ANSI. I imagine it would require a self-join somewhere, which is often problematic for performance. So if your database engine doesn’t have analytical functions like that, iterating might still be your best bet.

I don’t know how to do anything in SQL. But, isn’t this a question of autocorrelation? Do you have any autocorrelation metrics available?

It’s not straightforward to do this in standard SQL. If you were truly restricted to a pure SQL approach it would involve writing correlated subqueries or self-joins. Both are ugly and non-performant for large datasets. You can get the right answer with subqueries or self-joins and while it’s algebraically correct, the database engine has to make redundant passes through the data to deliver the rows. (When I say “pure SQL”, I’m talking about a single SQL query treating the data as a set. )

The generalization of your problem is basically the the analysis of a “sliding window of history.” It comes up in analyzing a history (a stream) of network packets for denial attacks or spam, or consumer financial transactions to predict fraud. It basically involves storing “state” somewhere in your program. (The “state” is the messy part that brings in the complex subqueries or self-joins if using pure SQL.) If you use procedural SQL (e.g. Transact-SQL), it could done with temporary tables or cursors and looping. If you iterate (loop) outside of SQL using a general purpose language such as C or Python, you could hold the sliding history in a queue data structure. (I assume from your OP that you’re doing some variation of this technique.) As someone already mentioned, some of the SQL dialects have special analytical functions such as LAG() or RANK() but that’s beyond the standard SQL.

To mention another approach…
If you copy the data into a “business intelligence” type of database and import the data into hypercubes, you’ll have more analytical functions to study sliding history data.

I agree that writing SQL that looks at row to row relationships can get messy, and therefore are best solved using procedural or data warehouse analytical tools.

But…

This one is a quite simple self-join. The following will work:


SELECT TVE1.*, TVE2.*
FROM TV_EVENTS TVE1, TV_EVENTS TVE2
WHERE (TVE1.TV_ID = TVE2.TV_ID) AND
      (TVE1.EVENT_NO = TVE2.EVENT_NO - 1) AND
      (TVE1.STATE = 'MIXED') AND
      (TVE2.STATE = 'CLEAR');

Actually, now that I think about it a little more, the example as stated is actually pretty simple even in straight SQL:



select * from events e1
where e1.eventNumber > 1 and 
           e1.state = 'clear' and 
           exists (
                select 1 from events e2
                where e2.tvID = e1.tvID and
                    e2.eventNumber = e1.eventNumber - 1 and
                    e2.state = 'mixed');


Furthermore, if there is an index on [tvID, evenNumber, state], I think this has a shot of being reasonably performant.

On preview, I see that K364 also gave the equally simple self-join version.

I may have misinterpreted the OP. I was thinking he’d want to analyze any arbitrary sliding time slice duration for any arbitrary pattern within that time slice. Therefore, his example of just comparing the current row to the previous row was just the simplest form of that generalized analysis.

No, just the previous state. I think the queries given will work fine. Thanks, everybody.