Can big data, ever-present sensors, and AI replace the scientific method?

Observe, form a hypothesis, construct controlled experiments, draw conclusions.

That is the scientific method in a nutshell. But we are on the cusp of an age where there will be billions, probably trillions, of sensors all over the world. There will also be artificial intelligence constantly combing through all this data.

So my question is, would this possibly be a replacement for the scientific method? Could you use big data to determine causality and patterns without having to conduct isolated, controlled experiments?

Or is this mostly the same thing as the scientific method, just done more loosely?

Example: people have sensors all over their bodies and all over the world. When some people eat a certain food, they develop insomnia. The AI combing through all the data notices this, finds it is unlikely to be due to chance, and then compares data both within the patients and among controls to determine the potential cause. It then examines existing data in place of an experiment, since there is so much of it. The end result is that the AI determines whether food X causes insomnia, why, and why certain people are susceptible, without having to set up any controlled studies. Basically, having sensors everywhere and artificial intelligence examining the results allows for the creation of new knowledge without conducting scientific tests or experiments, because real-world events take the place of controlled experiments (I assume the AI could find a way to tamp down the noise from an uncontrolled experiment).
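To make the statistics part concrete, here is a rough sketch (in Python, with entirely made-up numbers) of the kind of "unlikely due to chance" check I imagine the AI running automatically over the sensor data:

```python
# Hypothetical counts pulled from wearable-sensor data: rows are
# ate food X / didn't, columns are developed insomnia / didn't.
from scipy.stats import chi2_contingency

observed = [[130, 870],
            [60, 940]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")

# A tiny p-value only flags an association; the within-patient and
# control comparisons are what would have to stand in for the
# controlled experiment.
```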

Are you assuming a hypothetical AI or the current state of the art?

If the latter, no, probably not. What you describe sounds like a generalized automated proof machine; such an AI would be a holy grail of computer science. But if somebody creates a general problem solver, maybe. Although there are applications of AI to big data, it isn’t normally data of that size. Big data analytics is still an emerging research area with a lot of work to go.


The main problem you’d run into is being able to understand the predictions the AI makes. The AI might be able to infer physical laws, but something like a deep neural net is an extremely complex nonlinear function. Distilling that into human understanding is a challenge, and so extrapolating to situations the sensors haven’t seen is unlikely to work.

Of course, AI algorithms may be developed in the meantime to address that sort of thing.
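For what it’s worth, some such techniques already exist. Here is a minimal sketch (assuming scikit-learn, on synthetic data) of permutation feature importance, one common way to get a partial view into an otherwise opaque model: shuffle one input at a time and see how much the accuracy drops.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sensor data: 8 features, only 3 informative.
X, y = make_classification(n_samples=2000, n_features=8,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

# Features whose shuffling hurts accuracy most are the ones the
# black box actually relies on, even if its internals stay opaque.
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```

This tells you *which* inputs matter, not the physical law connecting them, so it only chips away at the problem.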

IIRC, one big point Jeff Hawkins (not Stephen Hawking, the physicist) makes is that it will indeed be science: not a replacement for it, but an accelerator. Where and why anomalies take place will still have to be interpreted, acted upon, and investigated scientifically. It seems to me that no matter how advanced the AI gets, we will still have to check whether a particular new AI geared for science is getting the real world right.

[QUOTE] Today, we're getting huge numbers of sources of data. It's growing exponentially. We stick them in databases. **The vast majority of the data in the world is never looked at, ever.**

It just sits there. We have two ways of getting value out of it. One is the visualization tools. And the other is creating models. And then if we use those models, we can act on them.

There are challenges here. One of the biggest challenges is that this whole system is not very automated. And it takes data scientists-- people like you-- to do this stuff. And we want to get to a world where there’s not just hundreds or thousands or millions of models.

We want to get to one where there’s billions of models -- the Internet of Things means everything in the world is going to be creating data, and we need to be able to model all this stuff. So today it takes lots of people; it’s not automated. The other problem is the models can get obsolete. If you’re not doing online learning, and most techniques today are not online learning, you have to rebuild your models all the time because the patterns in the world change.

And people just aren’t really looking at temporal data very much. Many of the patterns in data, especially high-velocity data, are temporal patterns, and very rarely do people take advantage of that. They actually try to get rid of it.

So my view of the world tomorrow: it’s not like the current world’s going to go away. But this is where I think the growth is going to be. We’re going to go to a world where there’s literally -- I’m not joking -- billions of machine learning models out there.

The data’s going to stream right into the models. There’s no storage required; you’re not going to save this stuff. The models are going to build and continually update themselves.

And you’re going to immediately take action. And if you look at it, it looks just like what brains do -- going back to what I said earlier. I said, whoa, look at that. So let’s try to apply our techniques to this. And so that’s what we’ve been doing. The key criteria here are that you need automated model creation for billions of models, you need continuous learning, and you need to be able to find the temporal, as well as the spatial, patterns in the data.
[/QUOTE]
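As a concrete illustration of the online-learning point in that quote, here is a minimal sketch (scikit-learn assumed, purely simulated data) of a model that updates itself batch by batch on a stream whose underlying pattern drifts, with nothing stored:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])

def stream_batches(n_batches=200, batch_size=32, drift=0.005):
    """Simulate a sensor stream whose underlying pattern slowly changes."""
    rng = np.random.default_rng(0)
    boundary = 0.0
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 3))
        y = (X[:, 0] > boundary).astype(int)
        boundary += drift            # the world changes under the model
        yield X, y

for X, y in stream_batches():
    model.partial_fit(X, y, classes=classes)  # update, then discard the batch
```

The shape of the loop is the point: data streams in, the model adapts, and the batch is thrown away.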

Although it is true that neural networks (deep or otherwise) are somewhat of a black box (it is always possible to unravel the function, but it is very complex), there are other AI algorithms that can reveal information in a human-usable way, so that isn’t necessarily an absolute blocker.
I was in a comedy club waiting for the show to start when I wrote the first reply. I will try to clarify a bit more now that I’m back home.

As I mentioned earlier, there are two main issues. One is that to really make this work requires either an automated theorem-proving algorithm or a general problem solver; for this to work in a general way would be a holy grail of computer science. There do exist AI algorithms that can help with proving mathematical theorems, and there are AI algorithms that can help solve problems (my current research falls into this category), but nobody is really certain how to do either in a general way. Most likely, such an algorithm would require a lot of expert knowledge, and codifying expert knowledge is one of the biggest limitations of such systems. It can be very difficult to extract and represent such knowledge in a usable way.

On the data side, algorithms like neural networks and other classifiers can be effective at finding relationships. Most of these algorithms need to be trained to be effective (this isn’t strictly true, but I don’t want to get too deep here). How to optimally train such algorithms in the Big Data world is an open research question. As above, classification algorithms are specialized once trained: they’re very good at doing what they’ve been trained to do, and usually pretty bad at everything else.

Ultimately, the best way to think of AI, as it exists right now, is as a computational aid to humans. So, is it possible? Sure. Everything I’ve mentioned is an open research problem, and somebody somewhere is working on it. So all the pieces might come together someday, but as AI algorithms exist right now, they’re really more about doing the types of computations that are hard or time-consuming for humans. The heavy thinking is still a human activity.

beep For now. beep

A couple of years back, there was a bit of hoopla around an AI inferring Newton’s laws from observations, but I’m not qualified to comment on how much this was guided by expectations and how much it was genuine automated discovery. In the end, I don’t think it was much more than a sophisticated fitting algorithm you could point at a set of data. I also haven’t really noticed any follow-up work, but I may just not have paid close attention.

And of course, AI researcher Jürgen Schmidhuber claims his main motivation is basically to build a better scientist:

[QUOTE=Jürgen Schmidhuber]
First Schmidhuber will build a scientist better than himself (his colleagues claim that should be easy) who will then do the remaining work.
[/QUOTE]

So the idea is out there, at least. And I could see that computation, and AI more narrowly, may come to be of similar importance to science as its mathematization was. But in the end, science is somewhat by definition a human project; it’s not entirely clear to me that AIs probing the world, and reacting to it, should properly be considered to be ‘doing science’.

There’s also the question of the channels via which we probe the world. To us, the world is defined by the way we’re hooked up to it: via sight, sound, smell, and so on. But for AIs, this hooking up becomes a design variable, and what they might discover, and how they prioritize it, is in some sense dependent thereon. For instance, you’ll probably not discover the Standard Model using wearable sensors; that requires large, special-purpose machinery.

FWIW, there was just an article in Science addressing this issue that may be of interest.

My take in the past was that a breakthrough would occur when we understood how creativity works, which to me was about analogy-making: taking known patterns of connections, applying them through something akin to geometric translations and transformations to novel domains, and finding that they predict unsuspected new fits. Yet the black boxes of deep neural learning seem to be figuring out their own paths to creative hypothesis-making. And we don’t yet understand how they do it.
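As a toy illustration of analogy-making as geometric translation, here is a sketch in the spirit of the word-embedding result "king - man + woman ≈ queen". The vectors are tiny hand-made stand-ins, not learned embeddings:

```python
import numpy as np

# Hand-made stand-in vectors; real embeddings would be learned from text.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

# Translate the king/man relationship into the woman domain.
target = vocab["king"] - vocab["man"] + vocab["woman"]

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

best = max((w for w in vocab if w != "king"),
           key=lambda w: cosine(vocab[w], target))
print(best)   # queen
```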

More in the same issue on how these systems are already actually doing some science of the sort the OP refers to… in, for now, limited ways. The psychology application seems most pertinent to the OP:

Wesley, what you are describing is not a replacement of the scientific method. The core idea is still the same: gain knowledge about the effect of something by either artificially changing a single variable, or finding a ‘natural experiment’ where this has happened independently of other variables.

The posters above me talk about ‘proof’. Science doesn’t prove anything. A quality experiment that produces clean data and a clear outcome showing causation can be thought of as performing a measurement that shifts a number in a gigantic table of probabilities about the world.

Well, some of the more advanced AI techniques do calculate similar probability tables. So you can make an AI agent that learns from experimental data and then uses that learned information to model what it expects to happen.
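To make the bookkeeping concrete, here is a toy sketch of one clean experimental result shifting a single entry in that probability table, via Bayes’ rule (all numbers illustrative):

```python
prior = 0.50            # P(food X causes insomnia) before the experiment
p_data_if_true = 0.80   # P(observed result | hypothesis true)
p_data_if_false = 0.10  # P(observed result | hypothesis false)

# Bayes' rule: the experiment shifts our credence in the hypothesis.
posterior = (p_data_if_true * prior) / (
    p_data_if_true * prior + p_data_if_false * (1 - prior))
print(f"credence after experiment: {posterior:.2f}")   # about 0.89
```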

Anyway, nothing about the scientific method prohibits changing more than one variable at once, or finding natural experiments where multiple variables were changed. The problem is that when you do that, it creates doubt as to the cause of the effect you are seeing, sometimes so much doubt that other scientists will consider the results of the experiment garbage.

However, this does not mean you can’t extract information from multivariable experiments, natural or otherwise. You absolutely can change several variables, then change several others, and create a systematic data set from which you can mathematically isolate the effect of any one variable (a sketch of that step follows below). It’s just a far more complex method, and it makes it harder for other human scientists to determine, by reading the paper, whether they should pay any attention to your conclusions.
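Here is a minimal sketch of that isolation step (numpy only, synthetic data): ordinary least squares recovers each variable’s individual effect from observations where all three were changed at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))               # three variables varied together
true_effects = np.array([2.0, -1.0, 0.5])
y = X @ true_effects + rng.normal(scale=0.3, size=n)   # noisy outcome

# Least squares disentangles the individual effects -- no controlled
# one-variable-at-a-time experiment required.
estimated, *_ = np.linalg.lstsq(X, y, rcond=None)
print(estimated)   # close to [2.0, -1.0, 0.5]
```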

Obviously, some day (probably in the next 10 years) we will have some kind of “probability extractor.py” that can automatically evaluate the results of experiments and update a model that can simulate the outcomes of physical systems. It won’t matter whether the experiment changes 1 variable or 10, or isn’t even a controlled experiment: the math will check out, and it will extract the information present in the data. Gradually, human scientists will begin to rely on and trust tools like these, or they will stand aside and let the tools do most of the thinking.

What you’re really asking is whether you can build an AI doctor/scientist that can eventually be given a massive array of robotic systems able to autonomously craft drug molecules and gene-therapy probes, and a building full of terminal patients. The AI’s utility function would be to advance medicine by keeping these terminally ill people alive as long as possible. It wouldn’t wait for papers to be published when it intervenes in a patient and gets outcomes; it would learn what happens within hours and try something else.

I could see such a system becoming advanced enough that the process for luckier patients is:

  1. Patient has a disease currently untreatable by modern medicine
  2. AI system, in conjunction with human helpers and overseers, samples the patient with robotic needles, determines based on an internal model of physiology what the problem is, models a genetic edit that should stop the patient’s death, and crafts a new genetic patch within hours.

There’s no waiting for the FDA, or 5 stages of clinical trials, or years of debate. A small room full of robots just makes the treatment, and the AI’s internal model of human physiology is more complex than the model any human expert in the world knows.

Well, the treatment partially works. The patient’s velocity of dying slows, but now they are dying from a side effect of the treatment.

In current clinical practice, doctors would basically just stand there and watch the patient die, recording the death in the data for that clinical trial. They might attempt some intervention known to work, maybe give a drug or try a little CPR, but they won’t invent something new and try it.

Obviously, an AI system could sample the patient again, go back to step 1, and maybe intervene successfully before the patient is dead.

People would still die, but medicine would be advanced faster than at any previous point in human history. Eventually the number of deaths would slow as the AI converges towards solutions.
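For illustration only, here is a deliberately toy, runnable sketch of that closed loop. Every quantity in it is hypothetical and stands in for enormously complex real systems; the structure is the point: intervene, observe within hours, update the model, try again.

```python
import random

random.seed(0)

health = 0.2          # hypothetical near-terminal patient, 0..1 scale
model_quality = 0.0   # the AI's physiology model, crudely a scalar

step = 0
while 0.0 < health < 1.0 and step < 50:
    step += 1
    # Sample the patient and craft a therapy; a better model helps more.
    effect = random.gauss(model_quality, 0.05)
    health = min(health + effect, 1.0)   # intervene and observe the outcome
    model_quality += 0.02                # learn immediately, no waiting
    print(f"step {step}: health = {health:.2f}")

print("cured" if health >= 1.0 else "lost" if health <= 0.0 else "still trying")
```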

This is sort of what medical science has become today: we can’t discover simple laws and rules (e=mc^2 or p^2=a^3); instead, we have to contend with a vast collection of anecdotal data and try to find correlations which are sometimes very weak.

It’s still science, just “dirty” science with lots and lots of uncontrolled variables.

Humans are very good at perceiving patterns (sometimes even when the pattern isn’t really there!). If we can devise an AI system with that gift, then there aren’t any limits to the discoveries that will be made.

I disagree. Controlled experiments are one tool in the scientific method toolbox, but one can still employ the scientific method without using controlled experiments.

I think the middle steps should be something like: make testable predictions, then gather data to test those predictions. Controlled experiments are one way to gather that data, but not the only way.

In your AI insomnia example, the AI is employing the scientific method.

It’s a fascinating idea, but I think there are currently two main issues with this:

  1. As the folk in this thread who deal with big data and machine learning can tell you, it’s very much a case of GIGO (garbage in, garbage out), and data input, validation, exploration, cleaning, and formatting are the vast majority of the work you do. This is no joke: if you write several pages of code, the vast majority is all that stuff, with the actual algorithm that gives you your predictions / insights taking up one line out of hundreds or thousands.

There have been a lot of smart folk, and even collections of smart folk (companies), who have tried to automate that stuff, and none of them have really gotten there. There’s a combination of creativity, domain-specific expertise, and past experience that goes into the whole series of decisions you make for your model to make sense, and that can’t be automated right now. In most cases, there are also additional decisions you’ll make around interpretability, legality, and governance that will influence some of your choices and variables.

This is probably amenable to partial or total automation at some point as the algorithms and processes get better, probably with some degree of training on an existing body of models you’ve produced to get that domain-specific expertise lens.

But that won’t help at all with the second point.

  2. The real problem with big data is your confidence / statistical validity. Say you go with a p-value < .05, as many scientists do; earlier there was a quote mentioning moving to a world with billions of models. Well, if you’ve got 2 billion models with that threshold, there are 100 million models that actually aren’t valid and were just picking up statistical noise or had a skewed sample in their training data. (A quick simulation at the end of this post makes this concrete.)

The problem with big data is that it’s BIG. When you have billions or trillions of records / variables, it is a certainty that a good chunk of them will appear correlated to a high degree of confidence when they’re not. Sure, you can raise your confidence threshold or move to stricter AUC / ROC requirements, but that just makes the false correlations a smaller, yet still large, number. And generally, these big data streams are continuous, constantly generating huge amounts of data, which both produces new false correlations in the recent data and makes it more difficult to audit your past models (because, much like documentation versus code production, the temptation to do new smart/sexy things with the incoming data is generally much greater for a business or individual than the desire to do exhaustive audits of past models).

So expand that to quadrillions of sensors and variables and models, and you’ll need some pretty robust processes around continuous existing model auditing and validation to weed out the trillions of false models, which is typically much harder and more time-consuming than model creation.

Then we have the additional problem of human interpretability and belief: for folk to use these models intelligently or in high-value decisions, there needs to be a degree of smart human buy-in, auditing, and belief, and that’s your REAL bottleneck.

All over the world, there are already fewer data scientists than businesses would like to hire, and it’s a technical enough skillset that the absolute number of people who can do it is probably capped at a relatively small percentage of the population. Let’s call it 15%, and let’s assume that this field pays so well that literally everyone who could do it is doing it. So now we have quadrillions of models, with more being generated daily, and a billion folk worldwide who can audit and evaluate them. If we assume a typical model takes a month to audit and evaluate, you can see that we’ll only be able to do 12 billion a year, when there are quadrillions already and more being generated at a rate that far outstrips our human evaluation capacity.

We would need some very robust and smart prioritization methods and queues to do any good with this, as well as a massive amount of humans capable of evaluating them to actually get a good amount of use out of these quadrillions of big-data computer-generated models.

I’m not saying it’s impossible, but I think this would actually be a significantly harder problem to solve than the first one.
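To put numbers on that false-correlation point, here is a quick simulation: test enough pure-noise "models" at p < .05 and a predictable ~5% of them pass anyway (scipy assumed, no real data involved):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_models, n_samples, false_hits = 10_000, 100, 0

for _ in range(n_models):
    x = rng.normal(size=n_samples)   # two completely unrelated streams
    y = rng.normal(size=n_samples)
    _, p = pearsonr(x, y)
    false_hits += p < 0.05

print(f"{false_hits} of {n_models} noise-only models passed")  # roughly 500
```

Scale that up to billions of models and the absolute number of false positives gets enormous, exactly as described above.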

You are confusing observation and correlation with science, which tries to find mechanisms. Sensors all over the world noted that the sun came up every day in a regular pattern, but that did not tell anyone why. That the food is correlated with insomnia does not tell us what chemical in the food causes it, or what mechanism in our bodies is affected by it.
I collected big data on semiconductor manufacturing. I often got indicators that something was wrong, but it never told me what was wrong, unfortunately.

If such a thing happens, it would eliminate the “we know this pill alleviates ailment X, but we don’t know why or how, just that it does” thing that goes on now…

The example presented by the OP isn’t a replacement for the scientific method, it’s an AI sophisticated enough to use the scientific method.

The AI looks at the giant pile of data, figures out a pattern, tests the pattern, and does something with the resulting information. That’s the scientific method. If you can invent an AI system that can do that, congratulations, because you’ve solved a very hard problem.

The point is, this isn’t just “throw more processors at the problem” and watch the problem solve itself. You have to figure out a way, somehow, for the expert system to decide what patterns to look for, find them, and communicate those patterns back to the humans. That’s, like, you know, not trivial. That’s strong AI. That’s HAL 9000. Which seemed pretty simple back in 1968; computers were advancing so quickly back then that a general problem-solving computer like HAL 9000 seemed like just a straightforward engineering slog ahead.

Except that never happened; we aren’t anywhere close to anything like a general problem-solving computer. It’s not a straightforward “throw more processors at it” type of problem. And once we invent HAL 9000 somehow, that’s not a replacement for the scientific method; that’s inventing a computer able to use the scientific method. It will be a great day for humanity. Skynet begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug. Guess how well that’s going to work out for them?

Expert systems are so 1980s. I think they are kind of dead, in the sense of what “expert system” meant back then.

Here is what machine learning can do today. You can give it a bunch of pictures, say some of cats and some of dogs, and then let it train itself to recognize them, as demonstrated by giving it a whole bunch more pictures beyond the training set. The big difference between machine learning today and the AI of the past is that you do not give it explicit instructions on how to distinguish cats from dogs. (Expert systems had to be programmed with rules for this.) So it does not figure out what patterns to look for (that is in the training set), but it does come up with a way of recognizing patterns.
Throwing more processors - or more GPUs - at this definitely helps. And none of it has anything to do with HAL 9000.
Scientific method? No, not really, but a lot of scientists work on problems that have been defined by other people, so I’m not sure the fact that the machine learning system does not come up with its own problems counts against it.
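To make the cats-and-dogs point concrete, here is a toy sketch using scikit-learn’s bundled digits dataset as a stand-in (real pet photos would require a download). Note that no recognition rules appear anywhere: the system gets only labeled examples and comes up with its own way of telling the classes apart.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "training set": labeled examples, no hand-written rules.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# Judged on a whole bunch more pictures beyond the training set.
print(f"held-out accuracy: {clf.score(X_test, y_test):.2%}")
```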

I forgot that “expert system” is a term of art. I was just talking more generally.

The point is, if you want a system that can tell the difference between cats and dogs, you have to show it 10,000,000 pictures labeled “cat” and 10,000,000 pictures labeled “dog”. You can’t just set up a system and have it casually inform you a couple days later, “Oh, by the way, I just figured out the difference between cats and dogs”. You have to define the problem for it, and define the data set, and define what success looks like.

Yes, a current system that can tell the difference between cats and dogs doesn’t do so because the programmer gave it a list of rules about what a cat is and what a dog is, and the system follows the rules. The method the system uses is a black box, and the system might easily use methods that no human being would think to use, and might even use methods that human brains can’t understand.

That still doesn’t mean a general-purpose AI that can make human scientists obsolete. Such a system is no closer to reality than it was in 1968. Yeah, it’s easy to imagine, and it has been imagined since the first computers. But it’s not happening any time soon, and all our progress on artificial intelligence since 1968 hasn’t brought us any closer to HAL 9000. Yeah, we’ve got Alexa. But Alexa doesn’t work the way they imagined HAL 9000 would work. HAL 9000 works by understanding things. Alexa works in a completely different way. And in many ways, an AI that works nothing like a human being is actually better than a HAL 9000. Who needs a system that works like a human brain when we’ve got 7 billion spare human brains running around the planet, barely being used?

IMO a few robots and a well-trained computer with a credit card probably could have churned through the entirety of my PhD work in less than a year and done a better job.

Actually, you’ve gotten it exactly backwards.

As the expert posts just above yours explain, the one thing Big Data lacks completely is the “why.” Big Data is entirely about the idea that we don’t care about “why.” All BD can do is discover what *is* happening, plus or minus the GIGO problems so eloquently explained.

I think there are a few steps leading up to the reality that the OP is describing.

Firstly, doing big-data analysis on scientific research itself would be very beneficial, and I’m sure many people are working on this now. It might help us spot when research confirms or refutes other research, flag experiments that may not have been rigorous, and, most excitingly, combine observations: right now, papers published in journals in different fields might never get associated, even if they’re describing facets of the same phenomenon. Big data might help us chain together human knowledge.
(that’s my understanding anyway, I’m not a scientist)

Then the vast growth in the amount of data available and the automatic analysis being performed might help draw our attention to what we don’t know. Even if AI is not smart enough to do the next step of drawing hypotheses, just that first step of expanding our known-unknowns will be very useful.

Then finally, yeah, you might have AI doing science.

Isn’t distilling science into human understanding already something of a problem?
ISTM that large chunks of physics and maths moved away from human intuition, at least, a long time ago. Calculations are performed, and we try not to worry about how to put them into visualizations or analogies, which are frequently misleading.
I think it’s entirely feasible to imagine a reality where we have enough information to verify the *correctness* of a program / machine, but are unable to distill what it’s doing into a neat set of equations. And since such a machine would still be useful to us, we’d still want to use it. Will we just shrug and try not to worry about being one more level abstracted away?

I think we are going to find a lot of things that work but can’t be explained. The bigger factor, that it works, will override the need to understand how and why. Finding stuff that works should come at a faster pace, perhaps exponentially surpassing our ability to know why through the scientific method.