Big Data - Why does a jetliner produce so much data?

I was looking for examples of Big Data, and I found this web page, which mentions that a jet engine produces 10+ terabytes of data in 30 minutes of flight time.

I don’t know anything about aviation software systems, so I’m trying to imagine what all that data is. Whatever it is doing, it sounds like a very verbose software setting to produce that much data. I’d suspect it is only enabled for use as a diagnostic tool?

I started to wonder about this after I read about the AWS service/product Snowmobile. They literally send a container on a semi to your company’s data center to transport your data and have it copied to AWS S3. It can handle up to 100PB per Snowmobile.

I’m not an aerospace engineer or mechanic, but I am familiar with diesel engines, on which a 20-minute test collecting data from dozens of sensors at 10 Hz produces about 30 MB of data.

10 terabytes in 30 minutes is a data rate that’s about 222,222 times higher. That’s a lot. I can only imagine a few dozen parameters that you might want to monitor on a big turbofan engine - pressures/temperatures/flows for air/oil/hydraulic fluid at multiple points throughout the engine, plus spindle RPMs (there are two or three spindles on a big turbofan), and maybe some accelerometers to watch for vibration. Would they be gathering data at kHz rates?
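
Just to put the comparison in one place, here is the arithmetic, using only the figures above:

```python
# Back-of-the-envelope check of the two data rates quoted above.
diesel_bytes = 30e6          # ~30 MB from the 20-minute diesel test
diesel_seconds = 20 * 60

claimed_bytes = 10e12        # the "10+ TB in 30 minutes" claim
claimed_seconds = 30 * 60

diesel_rate = diesel_bytes / diesel_seconds       # bytes per second
claimed_rate = claimed_bytes / claimed_seconds

print(f"diesel test : {diesel_rate / 1e3:,.0f} kB/s")      # ~25 kB/s
print(f"jet claim   : {claimed_rate / 1e9:,.1f} GB/s")      # ~5.6 GB/s
print(f"ratio       : {claimed_rate / diesel_rate:,.0f}x")  # ~222,222x
```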

Hoping there’s another website out there that can confirm or refute the immense data rate being claimed by the website you linked to.

I’m also skeptical, as that would be (double check my math please) nearly 900 4K video cameras pointed at the engine for that half hour.

This random website quotes a GE engineer as saying that a GE engine produces 25MB per flight hour, and that the whole aircraft on a typical flight produces 1TB of data. I’m assuming there’s some audio/video recording in there to get to 1TB, as that still seems like a lot, especially if only a few hundred MB is from the engine data logs.

I have no experience with the electronics in a typical aircraft engine, but a number like 20 TB/hour would seem to only make sense as the total raw data generated by every analog-to-digital converter in the engine. The vast majority of that data would be used in embedded control loops and not logged.

25 MB per engine per hour is plausible based on my experience (see post #2), but 1 TB per flight - even for the entire aircraft, even for a multi-hour flight - strains credulity. An hour of HD video is about 1.5 GB, so even during a four-hour flight, a video feed will only generate 6 GB of video files. Add in 400 MB of engine sensor data, and you’re still under 7 GB for a four-hour flight. There are certainly a lot of other parameters that could conceivably be monitored on a plane - but 993 GB worth? :dubious:
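
Adding up my own figures (same assumptions as above, nothing extra):

```python
# Adding up the figures above for a four-hour flight.
video_gb_per_hour = 1.5       # rough size of an hour of HD video
flight_hours = 4
engine_data_gb = 0.4          # ~400 MB of engine sensor logs

accounted_gb = video_gb_per_hour * flight_hours + engine_data_gb   # ~6.4 GB
claimed_gb = 1000                                                   # "1 TB per flight"

print(f"video + engine logs : {accounted_gb:.1f} GB")
print(f"left to explain     : {claimed_gb - accounted_gb:.0f} GB")  # the ~993 GB above
```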

This is more plausible. A/D converters in those engine systems could well be operating at kHz rates and producing terabytes of info that never gets recorded. This would mean the multi-terabyte figure mentioned at the OP’s link is not relevant to a discussion of big data.

I have worked on a few Digital projects involving Gas Turbines (both aeroderivative and power generation) and I’ll try to explain why this big data comes about.

The basic lesson here is that the mechanical/controls/electrical engineers who have worked on these engines their whole careers and the new data scientists or digital experts don’t speak the same language. Many times they don’t get what the other is trying to do.

Let me try an example:

Say you have a sine wave with an amplitude of 1 unit and a frequency of 1 Hz. The conventional engineers have learnt to characterize it by those two numbers. But the digital analytics guy has a system that is set up to read time-versus-position values, and the sampling rate is very high, let’s say 10 kHz - so the digital guy gets 10,000 data points over a second where the conventional guys need only 2.
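
To make that concrete, here is a tiny sketch of the same one-second signal described both ways (pure illustration, nothing engine-specific):

```python
import numpy as np

# The "conventional" description of the signal: two numbers.
amplitude = 1.0     # units
frequency = 1.0     # Hz

# The "digital" description: the same one-second signal sampled at 10 kHz.
sample_rate = 10_000                                  # samples per second
t = np.arange(sample_rate) / sample_rate              # 10,000 time stamps
x = amplitude * np.sin(2 * np.pi * frequency * t)     # 10,000 readings

print("conventional engineer stores:", 2, "numbers")
print("digital system stores       :", x.size, "samples")        # 10000
print("bytes for time/value arrays :", t.nbytes + x.nbytes)      # 160000 (float64)
```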

A key component for safety and diagnostics in any modern rotating equipment, including turbines and engines, is the vibration monitoring system. When the turbine rotates at 20,000 rpm, there are parts (due to gears etc.) which rotate at multiples or fractions of that speed. Also, each of these rotating components has its own vibration signature, which the vibration monitoring system keeps track of, giving warnings when something is failing or malfunctioning.

So for example: let’s see how this vibration system would work in your car. Say you are doing 70 mph on the highway, the engine is turning at 3000 rpm and your wheels at 1000 rpm. Now the vibration monitoring system notices a vibration with a frequency of about 1000 rpm and can also see that it scales with the speed of the wheels. It immediately alarms the operator that something is wrong with the wheels or, depending upon severity, stores the information for the maintenance personnel. For such a vibration system to work, it needs to sample at least twice as fast as the maximum rpm possible for any part of the system. So if the maximum possible rpm in a car is 7000 rpm, this system needs to sample 14,000 times per minute, or usually, with a factor of safety, 28,000 samples per minute. And it will sample at each of the four wheels, the transmission, maybe the axle/joints, the engine, etc.

A gas turbine does 10,000 to 25,000 rpm and there are many rotating components inside. So you can see how the data quickly adds up. There already is analytics built into the turbine/airplane that protects the system in real time and alarms the operator / pilot. For example a surge or stall is detected by the vibration system. But if you want to do analytics and you want “all” the available data, then you will end up with massive files.
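
Here is a rough sketch of how that adds up; every number in it (channel count, word size, safety factor) is an illustrative assumption, not a figure for any real engine:

```python
# Rough illustration of how vibration-monitoring data adds up on a turbine.
# All numbers here are illustrative assumptions, not real engine figures.
max_rpm = 25_000
max_shaft_hz = max_rpm / 60                    # ~417 Hz fundamental
safety_factor = 4                              # 2x Nyquist, then 2x margin (as in the car example)
sample_rate_hz = max_shaft_hz * safety_factor  # ~1.7 kHz per channel

channels = 200                                 # accelerometers, pressures, temps, ...
bytes_per_sample = 4                           # one 32-bit reading

bytes_per_second = channels * sample_rate_hz * bytes_per_sample
flight_seconds = 4 * 3600
total_gb = bytes_per_second * flight_seconds / 1e9

print(f"per-channel sample rate : {sample_rate_hz:,.0f} Hz")
print(f"raw data rate           : {bytes_per_second / 1e6:.1f} MB/s")
print(f"raw data, 4-hour flight : {total_gb:.0f} GB")
```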

Is all this data necessary or useful? Well, these are the early days and the traditional engineers and data scientists are working it out. My own opinion is that only a fraction of this data is useful.

I am (or was) an engineer in the broad aerospace field, but not a field directly relevant to the OP.

Nevertheless, there could be many reasons why the data is so huge. Unlike data we are most familiar with, such as ZIP files, DVDs, JPGs, and the like, raw data may not be compressed. This means it may contain copious quantities of irrelevant data, like spaces, or redundant and repetitive data that occupies unreasonable space.

It’s also possible that the raw data includes much that will eventually be considered unimportant. But prudence suggests it is better to record first and ask questions later. You can always discard data later; you can’t recreate it. So store everything, just in case.
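
As a toy illustration of how compressible that kind of verbose raw data can be (made-up sensor name, timestamps and values):

```python
import zlib

# A verbose, padded, repetitive text log of a nearly constant reading -
# the kind of raw record format that wastes space if stored uncompressed.
lines = [
    f"2018-06-01T12:00:{i % 60:02d}Z  sensor=EGT_07  status=OK  value={650.0 + (i % 3) * 0.1:8.3f}"
    for i in range(100_000)
]
raw = "\n".join(lines).encode()

compressed = zlib.compress(raw, level=9)
print(f"raw        : {len(raw) / 1e6:.1f} MB")
print(f"compressed : {len(compressed) / 1e6:.2f} MB")
print(f"ratio      : {len(raw) / len(compressed):.0f}x smaller")
```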

Exactly.

If you want another example of big data, every day semi-hysterical salespeople, journalists and PR people spew out over 813.2 Exabytes of total bullshit about Big Data and why you should give them lots and lots of money to do Big Data for you. You are reading this on the internet so you KNOW it’s true.

I lived through the web and web 2.0 bubbles, now Big Data/AI have been dragooned into playing the role of web 3.0 (Or is it 4.0? I may have missed a bubble somewhere), complete with obligatory wild-eyed exaggerations and only a tenuous link between what is claimed and what is correct or relevant. Treat articles about Big Data the way you would treat an article about an amazing new “eat whatever you want and lose weight” diet and assume they are fact-free unless you have good reason to trust the source.

Those numbers are way out of whack. Speaking from my experience (Airline Performance Engineering), with the various Engine Condition Monitors (ECM) and other data recording, you are looking at about 1TB of data for an entire flight, or about 30MB of data per flight hour per engine.

They could plausibly be talking about debug data generated by the engine control computer.

I work on debuggers for hardware, and many modern chips generate trace data as they run. This is highly compressed data that, if you know the code that’s running on the chip, you can use to completely reconstruct what the chip did. This is invaluable for debugging hard-to-reproduce failures. Note that since this logging is generated by hardware, it doesn’t affect the timing of the system (which is important for things like control loops on jet engines)

A pretty good estimate for the compressed control flow data is 1 bit per instruction. A modern multi-core chip can execute (waves hands furiously) on the order of 10 billion instructions a second, so you’re looking at roughly 10 Gbps, which ends up being about 2TB of data in 30 minutes. But that’s just for control flow. If you want to store data transactions (sometimes necessary to figure out where things went wrong), it’s several times that. In fact data trace is usually 4-5x instruction trace. Bam. There’s your 10TB in 30min.
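
Putting that arithmetic in one place (the instruction rate and the data-trace multiplier are my hand-waved assumptions, not measurements from any real part):

```python
# Rough trace-bandwidth arithmetic. The instruction rate and the data-trace
# multiplier are hand-waved assumptions, not figures from any real chip.
instructions_per_second = 10e9   # a modern multi-core chip, order of magnitude
bits_per_instruction = 1         # compressed control-flow trace
data_trace_multiplier = 4        # data trace is ~4-5x the instruction trace

control_flow_bits_per_s = instructions_per_second * bits_per_instruction
total_bits_per_s = control_flow_bits_per_s * (1 + data_trace_multiplier)

seconds = 30 * 60
control_flow_tb = control_flow_bits_per_s * seconds / 8 / 1e12
total_tb = total_bits_per_s * seconds / 8 / 1e12
print(f"control flow only : {control_flow_tb:.1f} TB in 30 min")   # ~2.2 TB
print(f"plus data trace   : {total_tb:.1f} TB in 30 min")          # ~11 TB
```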

This is consistent with A/D converters operating at tens of kHz, which is consistent with very large amounts of raw data being generated. But on commercial aircraft engaged in a normal revenue-generating flight, is all of this raw data actually saved on an ongoing basis? Or does the system boil that raw data down to just the key parameters (e.g. vibration magnitude and frequency) so as to reduce storage requirements?

I was kind of thinking this, too (despite having zero aerospace experience or really much mechanical engineering at all). A single sensor will probably only have a few bytes’ worth of accuracy, so a single data point is only, say, 5 bytes of information. But what will probably be recorded for each data point is the actual reading, the ID of the sensor, the precise time and date, some information about the status of the sensor, maybe even plane and flight data, etc. And if it’s stored as ASCII text with one (or two!) bytes per decimal digit, with lots of cautious padding, you could easily use a couple of orders of magnitude more disk space than actual data.

I just made up some numbers: a 40-minute total trip (5 minutes warm up and cool down also recorded), 1000 sensors, sampling twice per rotation at 25,000 rpm (=833 Hz), and recording 500 bytes per sample. That happens to give exactly 1 terabyte. Looking at it, there’s probably not 1,000 physical sensors, but it’s also undoubtedly recording lots of secondary data – what the engine management software has calculated the state is and what it’s trying to do about it – so 1,000 entries seems reasonable (on preview, the last bit was put better by mr walrus).
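
Running those made-up numbers through, they do land on 1 TB almost exactly:

```python
# Checking the made-up numbers above: they come out to 1 TB almost exactly.
channels = 1000
rpm = 25_000
samples_per_rev = 2
sample_rate_hz = rpm / 60 * samples_per_rev    # ~833 Hz
bytes_per_record = 500                         # verbose record: value, sensor ID,
                                               # timestamp, status, flight info, padding...
minutes = 40

total_bytes = channels * sample_rate_hz * bytes_per_record * minutes * 60
print(f"{total_bytes / 1e12:.2f} TB")          # 1.00 TB
```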

There is a lot of hype for sure, but regarding AI: deep neural networks and the relatively new-ish ability to efficiently train them has significantly increased the set of problems we can solve or partially solve.

Here is an exaggerated example of something I’ve gone through in real life. Say an organization wants to run “analytics”, “digital optimization” or “AI analysis” on a fleet of vehicles, and there’s this BIG Data that gets selected to do that.

Here’s how the conversation goes:

BIG Data to the Traditional Engineer: Give me the fuel gauge readings and the odometer readings for all 100 trucks over the last 5 years. Oh, and make sure the data is for every single minute.

Traditional Engineer: Okay, I will try that. But you know the fuel gauge …

BIG Data: Stop right there. We have Artificial Intelligence to figure out correlations and we don’t want your bias to enter the picture. Just give us the raw data.

A year passes:

BIG Data: We have analyzed the data and believe your trucks give the best mileage when they are 3/4 empty on fuel. So our recommendation is to run them 3/4 empty all the time.

Traditional Engineer: You know the fuel gauge is not very accurate, and you can’t trust the data except when it shows full or almost empty.

BIG Data: Oh, let’s do another iteration…

The point is that BIG Data relies too much on computer “intelligence” and doesn’t see the need for rationalizing data before feeding the algorithms. The Traditional Engineer relies on heuristics and experience gained over time, which can benefit from AI, but in a collaborative atmosphere.

Successful Digital or BIG data projects are where both the BIG data and traditional engineers come to the table and understand each other. Artificial Intelligence is not a replacement of Human Intelligence but merely an addition to it.

Read carefully, the original article doesn’t say that any setting produces that much data. Rather, that much data would be produced to provide a (complete) description of what is happening.

Perhaps they got the idea from a description of how much data is produced in a simulation of the engine. Which it is easy to imagine is far, far, far more data than that produced by any available test setup.

I wasn’t thinking each jet would be equipped with a bunch of AWS Snowball Edges to store all this. Perhaps it would be enabled at specific times for data analysis and perhaps troubleshooting. Since I don’t know aviation and engines, I was trying to understand why anything like that would generate a large volume of data, and what would they do with it.

The growth in hard drives and physical space that AWS data centers must be experiencing sounds insane. The fact that they offer to store 100PB per Snowmobile for customers makes it sound like at some point, and fairly soon, they would run out. If my quick math is right, at 16 TB a hard drive, wouldn’t that be over 6,000 hard drives to store 100PB?

I have a setup on my desk that produces that much data from a 4-core processor. This isn’t a crazy hypothetical data rate for hardware logging.

I don’t know about jet engines, but generally you store the data to a circular buffer that’s as big as is useful and no larger. It’s exceedingly rare to need 30 minutes of logs to solve a software bug. 99+% of the time you can solve it with something like 1MB, because it’s pretty hard to build a software system that starts to go wrong at time t and then only finally exhibits some symptom of being broken at t plus several minutes. We’re usually talking milliseconds here.
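
For what it’s worth, the idea is just a fixed-size ring buffer; here’s a minimal sketch (generic Python, not anything engine- or debugger-specific):

```python
from collections import deque

# Minimal sketch of a fixed-size circular log buffer: once full, each new
# record silently evicts the oldest one, so storage stays bounded.
MAX_RECORDS = 4096
trace_log = deque(maxlen=MAX_RECORDS)

def log(record: bytes) -> None:
    trace_log.append(record)        # oldest entry is dropped automatically when full

# On a fault, dump whatever history is left - typically the last few
# milliseconds/seconds of activity, which is usually all you need.
def dump_on_fault() -> list[bytes]:
    return list(trace_log)
```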

It’s not uncommon to have a failure after an extended amount of time, frequently due to:
1 - consuming some resource (e.g. disk, memory, etc.)
2 - frequency of the data condition causing the problem is low, so the system has to churn for a while until it randomly hits the condition
3 - a complex system that takes a long time to reach the state that has a problem

Those are all true, but it’s still quite rare that you actually need detailed hardware logging of the sort that I’m talking about that spans the full length of time.

Like, you really don’t need to know the exact sequence of instructions executed 20 minutes ago to find a memory leak. And even if it takes a long time to get to the state where there’s a problem, there’s almost always enough context very close to arrival at the problem to solve it. A few seconds is a really long time for a chip.

Ya, agreed.