Will someone please explain differentially expressed genes to me?

I am trying to understand, at an informed layperson level, what the study of differentially expressed genes entails. In statistics, analysis of microarray experiments is very hot now, so I need to bone up on this. I am not a biologist, and so trying to read the scholarly work on the subject has me jumping to Wikipedia every few sentences to see what a word means. But what I’ve gathered so far is this:

The goal of gene expression research is to determine whether gene i, i = 1, 2, …, 10,000 (or more), being “on” has some effect on some response. Exactly what the response is can vary, but it can be, e.g., survival, developing a tumor, having Marfan syndrome, etc. Where statistics comes in is that the effect is quantified by a hypothesis test (my research area is multiple hypothesis testing).

If the above is true, what is a gene expression level? Is it not just “on” or “off”?

Help me fight my ignorance :)

Gene expression quantitation can be either absolute or relative. Often, expression of a particular gene is normalized to that of a housekeeping gene (like actin) that is thought to be expressed at uniform levels (although this is often not tested for!).
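To make the normalization idea concrete, here is a minimal sketch in Python. The intensity values, the gene name gene_X, and the normalize helper are all made up for illustration; the only thing taken from the description above is dividing by a housekeeping gene such as actin.

```python
# Hypothetical raw signal values for two samples (illustrative numbers only).
samples = {
    "control": {"actin": 1000.0, "gene_X": 250.0},
    "treated": {"actin": 2000.0, "gene_X": 1500.0},
}


def normalize(sample, reference="actin"):
    """Express every other gene relative to the housekeeping (reference) gene."""
    ref = sample[reference]
    return {gene: value / ref for gene, value in sample.items() if gene != reference}


relative = {name: normalize(sample) for name, sample in samples.items()}

# The raw gene_X signal rises 6-fold (250 -> 1500), but the treated sample also
# has twice as much actin signal, so the normalized change is 3-fold.
for name, values in relative.items():
    print(name, values)
```

Note that this only works if the housekeeping gene really is expressed at the same level in both samples, which is exactly the assumption that often goes untested.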

Usually for your question, you would be comparing, say in a microarray experiment, expression profiles for 2 or more treatments - for example, a control group and a group treated with a drug. You would have at minimum 3 independent biological replicates per group. You would compare the expression level of each gene (as assessed by fluorescence from the array) between control and treatment.
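As a rough illustration of the per-gene comparison, here is a sketch that computes a Welch t statistic for each gene from 3 replicates per group. The gene names and intensities are invented, and Welch's test is just one common choice that I am assuming for the example, not necessarily what any particular microarray package does.

```python
import math
from statistics import mean, variance


def welch_t(control, treated):
    """Return (t statistic, approximate degrees of freedom) for two groups."""
    n1, n2 = len(control), len(treated)
    # Squared standard error of each group mean (sample variance / n).
    v1 = variance(control) / n1
    v2 = variance(treated) / n2
    t = (mean(treated) - mean(control)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical log2 intensities: (control replicates, treated replicates).
genes = {
    "gene_A": ([8.1, 7.9, 8.0], [10.1, 9.9, 10.0]),  # looks induced by treatment
    "gene_B": ([6.0, 6.4, 5.8], [6.1, 5.9, 6.3]),    # looks unchanged
}

results = {name: welch_t(ctrl, trt) for name, (ctrl, trt) in genes.items()}
for name, (t, df) in results.items():
    print(f"{name}: t = {t:.2f}, df = {df:.1f}")
```

Each t statistic would then be converted to a p-value, giving one hypothesis test per gene, which is where the multiple testing problem comes from.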

The kicker is that there are thousands of spots on each array, so the multiple comparisons problem becomes paramount here - obviously, with so many tests, some are going to come up significant by chance.

There are multiple methods currently in use for correcting for this. None of them are ideal in my opinion. Hopefully you can come up with something better.
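For a sense of what such corrections look like, here is a sketch of two well-known ones: Bonferroni (controls the chance of any false positive) and Benjamini-Hochberg (controls the false discovery rate). I am picking these two as examples, and the ten p-values are made up; a real array would have thousands.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= (k / m) * alpha
    and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Hypothetical p-values, one per gene.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:", sum(p <= 0.05 for p in pvals))
print("Bonferroni: ", sum(bonferroni(pvals)))
print("BH (FDR):   ", sum(benjamini_hochberg(pvals)))
```

With these numbers, five genes pass the naive 0.05 cutoff, one survives Bonferroni, and two survive Benjamini-Hochberg, which shows how differently strict the methods are.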

Also, I realized I never answered your main question - “what is gene expression?”

At its most basic, the level of gene expression would be the amount of messenger RNA produced for the open reading frame. This can be quantified by quantitative PCR and other methods. The microarray won’t give you an absolute quantitation, but the level of fluorescence for the spot is proportional to the amount of message.

That makes things somewhat clearer, thank you.

A followup question: Is an array a collection of genes? I thought that chromosomes were collections of genes? Or is any array a subset of a chromosome? Is its length defined by the “open reading frame” you mention?

Not to be glib, but I think you need to do more than “bone up” if this is your level of understanding.

An expression microarray is a tool to measure levels of expression of hundreds or thousands of genes simultaneously, often all known genes in a genome. It is a glass or silicon chip with oligonucleotide (small specific sequences of nucleic acid) spots bound to it. Messenger RNA (mRNA; the initial product of gene expression) is extracted from samples. Reverse transcription is used to copy the mRNA into cDNA (complementary DNA, a DNA copy of the mRNA). The cDNA is then transcribed in vitro to cRNA (complementary RNA, a single-stranded RNA copy), which is labeled with fluorescent tags and hybridized to the microarray. Specific cRNAs that match the sequence of the oligonucleotide spots on the microarray bind in a sequence-specific manner and fluoresce when the array is scanned with a laser. The amount of fluorescence is proportional to the amount of mRNA for each specific gene, which in turn is proportional to the level of expression for that gene. Thus, levels of gene expression can be quantified for every gene in the organism simultaneously. By comparing expression profiles between treatment and control samples, specific genes affected by the treatment can be identified.

As usual, Wikipedia has a pretty good description

Genes are the sections of cellular DNA that code for proteins. A gene is “active” or expressed when the cell makes the protein coded for by that gene. This is done by creating an RNA copy of the gene, which is “read” by ribosomes to assemble amino acids in the sequence specified in the RNA. There are many types of RNA but the ones of interest here are called messenger RNA (mRNA). Note that there are a lot of biochemical shenanigans going on under the hood that I am leaving out for clarity.

Now, the reason your expression microarrays deal with the intracellular mRNA copies of the gene rather than the genes directly is that the mRNA present in the cell at any given time represents the genes being actively expressed. Processed (mature) mRNAs that are ready to be translated by the ribosome also have features (the poly-A tail is the main one of interest) that can be efficiently detected. A gene’s expression level increases when more of its mRNA is present, because more copies of the mRNA floating around (generally) means the protein will be produced faster and in greater quantities. mRNAs must be constantly synthesized to keep a gene active. There are many factors that influence the expression level of a gene, but they all boil down to controlling the speed and duration of mRNA synthesis.

Fundamentally, expression experiments try to determine what genomic information is in use. I recommend you read up on the “central dogma” and how that is implemented in the cell. Basically, the central dogma of biology describes the flow of information in cells, from DNA through RNA into protein products.

Just to fill in a gap, you take your microarray chip, which is covered with thousands of probes, each probe corresponding to a specific region of the genome, and mix it with the RNA of interest. Skipping a few dozen steps here, you then get a readout for each probe. The intensity of the readout corresponds to how much RNA matching that probe was present in your sample.

Then you have to dig through all that data, and figure out which genes are turned on or off significantly due to your treatment or whatever.

I guess your definition of “layperson” is very different from mine.

At last a GQ I know a lot about!!!

The reason this is important is that mRNA is the stuff that goes out to create proteins, and proteins are what actually do the work in the cell and make cells do different things (such as become cancerous). By looking at the differentially expressed genes you can distinguish different types of cells, such as cancerous from non-cancerous, different types of tissue, or cells that became cancerous in different ways.

Statistically, the big difference between this type of data and other data is that you generally have tens of thousands of data points (gene expression levels) for each sample, but many fewer samples (100 samples is a very large experiment). So you have to be careful in dealing with multiple comparisons (with 30,000 tests for differential expression, about 1,500 are expected to be significant at p < 0.05 even for random data) and with over-fitting predictive models (with so many variables to choose from, you can fit any classification even with random data).
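The 1,500 figure is just 30,000 × 0.05, and it is easy to check by simulation. The sketch below assumes every gene is truly null, so each p-value is drawn uniformly from [0, 1]; the seed and variable names are arbitrary choices for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n_tests = 30_000
alpha = 0.05

# Under the null hypothesis, a (continuous) p-value is uniform on [0, 1].
null_pvals = [rng.random() for _ in range(n_tests)]
false_positives = sum(p < alpha for p in null_pvals)

print(f"expected about {n_tests * alpha:.0f}, observed {false_positives}")
```

Every one of those "significant" results is a false positive, since no gene in the simulation is actually differentially expressed.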

Another thing to be aware of is that gene expression data are generally treated on a log scale, so that signal differences represent ratios of gene expression. Also, gene expression levels generally have a correlation structure that is too complicated to model but should be kept in mind when analyzing data. This is often ignored, but it will likely be very important to be aware of in your multiple testing work.
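The log-scale point is simple arithmetic: a difference of logs is the log of a ratio. A small sketch with made-up intensities, assuming base-2 logs (a common convention, so that a difference of 1 means a doubling):

```python
import math

# Hypothetical raw intensities for one gene.
control_intensity = 200.0
treated_intensity = 800.0

# Difference on the log2 scale equals log2 of the ratio (the fold change).
log2_difference = math.log2(treated_intensity) - math.log2(control_intensity)
fold_change = 2 ** log2_difference
print(f"log2 difference = {log2_difference:.2f}, fold change = {fold_change:.2f}")

# The scale is symmetric: halving the signal gives a difference of about -1,
# just as doubling it gives about +1.
halved = math.log2(100.0) - math.log2(control_intensity)
print(f"log2 difference when halved = {halved:.2f}")
```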

I look forward to reading your future paper.

I see mozchron covered more or less the same thing already, so the gap I was really filling was the one in how carefully I read the other answers.

So, when you mix mRNA with probes on the array, when a gene is “active,” the probes corresponding to that gene will fluoresce with an intensity directly proportional to the degree of gene activity. The fluorescence data (for each probe?) is then used for a single hypothesis test. The multiple testing issue arises then because the chip can have 40,000 or so probes on it (according to the Wikipedia article).

I really am trying to understand all this better, but my biological background is highly lacking. My last formal biology course was in high school 13 years ago. The Wikipedia articles still have a lot of jargon that is difficult (I guess only for me) to plow through. Thanks, all, for your patience.

Basically correct.

Your biology background might be lacking, but you should do a PubMed search for these issues. Microarray (and next-generation sequencing) analysis, including the multiple comparison and false-discovery problem, is a VERY hot topic in biostatistics research right now (there are about 5 people down the hall from me this second who specialize in this), and you don’t want to waste time trying to re-invent the wheel or compete with people who have been doing this for a while.

I probably should have been clearer in the OP, but multiple testing in microarray experiments is not the focus of my work. I’m doing a literature review and just want to be able to talk semi-intelligently about some of the work that’s been done in the area, since it’s one of the main motivations for the newer methods.