A computer science grad student friend of mine has asked me for help with a large project he’s working on for a biology department. Since I’m not a biologist either, I don’t really understand what he’s talking about, so I figured the biologist folks at the SDMB would be my best bet.
Apparently he’s working on a program to do something with structures in proteins, and his data comes from PDB files from the Protein Data Bank. For instance, one structure he’s dealing with is 1K6Y, the PDB file for which can be found here. Now, this structure is apparently part of a viral protein called integrase, which helps viral DNA integrate into the infected cell’s DNA. It’s got 4 chains, called A, B, C, and D. Apparently each of these chains runs from amino acid 1 through amino acid 288.
So here are his questions:
What exactly is a “structure” and a “chain”? How do they relate to each other and to the overall protein they’re a part of?
Do the four chains of structure 1K6Y really all go from amino acids 1 to 288? The list of atoms in each chain (the lines that start with ATOM in the PDB file) only goes from 1 to 210 for chain A, and it’s also missing amino acids 47-55 and 140-148. It ends in similar places for chains B, C, and D, and is missing similar chunks of amino acids in the middle. These amino acids are listed as “missing residues” in the remarks section of the file; does that mean they just weren’t found by the visualization technique used to create this 1K6Y.PDB file, or are they really not a part of the structure?
He … had another question, but I can’t find his e-mail at the moment. I’ll try to remember it. But in the meantime, any help with these questions would be very much appreciated.
This entry provides the results from one experiment (here). The complete list of amino acids for each chain is given in the lines labeled “SEQRES”; there are, indeed, 288 in each. However, during this experiment, they only elucidated part of the structure, so many of the known residues are missing from the “map.”
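For the programming side, here is a minimal Python sketch of reading the SEQRES lines to get the full declared sequence of each chain. It assumes the standard fixed-column PDB layout (chain ID in column 12, residue names from column 20 onward); the function name and the two sample lines are made up for illustration and are not taken from 1K6Y.

```python
from collections import defaultdict

def seqres_by_chain(pdb_lines):
    """Collect the residue names listed in SEQRES records, grouped by chain ID."""
    chains = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]                  # column 12: chain identifier
            chains[chain_id].extend(line[19:].split())  # columns 20+: residue names
    return dict(chains)

# Made-up two-chain example (NOT real 1K6Y data), in SEQRES column layout.
sample = [
    "SEQRES   1 A    5  MET GLY ALA ARG ALA",
    "SEQRES   1 B    5  MET GLY ALA ARG ALA",
]

chains = seqres_by_chain(sample)
lengths = {chain_id: len(residues) for chain_id, residues in chains.items()}
total = sum(lengths.values())
print(lengths, total)
```

Run on the real file, `lengths` would be expected to show 288 for each of A, B, C and D if the SEQRES records are as described above.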
“Structure” usually refers to certain characteristics that appear in many proteins. The primary structure of a protein is its sequence; the secondary structure is the local arrangement into helices, sheets, and maybe some other local structures; the tertiary structure is the folding of the entire protein, with links between disparate areas, into a functional unit.
There’s a primer for looking at structures in the PDB here.
I did see that page at Proteopedia, Nametag, but I wasn’t sure what it meant. Thank you very much for clarifying. And thank you for the link to the primer on looking at structures. I’ll read it myself and pass the information along.
Here’s another question: from what I can tell, the integrase protein that this structure is from is itself only 288 amino acids long. How can structure 1K6Y list four “chains” that are each 288 amino acids long, then? Clearly, either I’m not understanding integrase, or I’m not understanding what a chain is.
The formulas of proteins are long and linear. They are chains of amino acids. The only “branching” that happens in a basic protein is the amino acids’ own tails. But a chain stretched out straight would be hundreds of times longer than it is wide, and nothing in a cell works like that. Protein chains need to be folded into a very specific 3D structure, both in order to be able to fit within their surroundings and in order to work.
Sometimes you get protein chains which can’t work by themselves. The protein you describe includes several chains; many others are single-chain. Others, for example insulin, have bridges between two chains (there is an amino acid, cysteine, whose tail can create a link with another cysteine). And others, for example haemoglobin and the chlorophyll-binding proteins, have what are called “prosthetic groups”: pieces which are not amino acids but without which the protein can’t work. These prosthetic groups can be organic or inorganic: some of the proteins involved in coagulation need to have metals bound to them in order to work; the haem group of haemoglobin and chlorophyll itself are organic but have a core metal atom (iron for haem, magnesium for chlorophyll).
When a protein includes several chains, these chains can be chemically identical but structurally different. They can also be chemically different (in which case they will definitely be structurally different), or completely identical. I’m guessing this integrase’s four chains are chemically identical: that doesn’t mean that structural identity can be assumed; the 3D structure needs to be studied for the whole 4-chain complex. Studying isolated chains wouldn’t work for two reasons: one, the chains will tend to form groups of four; two, an isolated chain will not have the same surroundings, and therefore may not have the same structure, as when there are three other chains it’s interacting with. Careless studies, performed without taking these possible structural differences into account, will lead to reporting structures which are the same for all chains involved, but where the reported structure is actually a fuzzy average of the real structures and not any of the real structures: it would be the same as taking a look at a room where half the people are sitting on their ankles and half are standing, and deciding that everybody is kneeling.
Protein structures are not perfectly rigid; some parts, particularly loops and chain ends, can be quite flexible. Since protein crystals contain about 50% water (this percentage may vary), this flexibility may persist in the crystal. Such flexible regions are invisible in the experimentally determined structure, so no coordinates can be determined for them and they are omitted from the structure.
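A rough Python sketch of how a program might spot these omitted stretches: collect the residue numbers that actually have ATOM records in one chain, then look for jumps in the numbering. It assumes the standard fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26) and ignores insertion codes; both function names are made up for illustration.

```python
def observed_residues(pdb_lines, chain_id):
    """Sorted residue numbers that have at least one ATOM record in the given chain."""
    seen = set()
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] == chain_id:  # column 22: chain ID
            seen.add(int(line[22:26]))                        # columns 23-26: residue number
    return sorted(seen)

def find_gaps(residue_numbers):
    """Return (first_missing, last_missing) pairs between consecutive observed residues."""
    gaps = []
    for prev, cur in zip(residue_numbers, residue_numbers[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# Illustrative numbering only: residues 44-46, 56-139 and 149-151 observed.
numbers = list(range(44, 47)) + list(range(56, 140)) + list(range(149, 152))
print(find_gaps(numbers))
```

This only finds holes in the middle of a chain. Residues missing from either end (such as everything after 210 in chain A) only show up by comparing against the SEQRES length or the “missing residues” remarks.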
Thanks for responding, Nava. Does this mean that the HIV integrase structure being studied here actually consists of 288*4 = 1152 amino acids, four chemically identical chains (where the chains are all the same strings of amino acids, I mean) that may differ structurally?
Or is it just 288 amino acids that are folded and bound together in such a way that you can trace out four different “chains”?
I’m viewing the PDB file in question in Jmol, but the figure is so big and complicated that I can’t even tell if there are 288 amino acids or 1152, much less which atoms on the screen correspond to which chain and which amino acid and whatnot.
In the PDB record, the different chains are indicated by a different letter: A, B, C, D:
ATOM     41  CA  ASP A   6     120.208  56.652  53.741  1.00 40.32           C
This record in the PDB file, for example, tells you that atom number 41 is the C-alpha atom of an aspartate, which is residue number 6 in chain A.
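For anyone parsing these lines in a program: a minimal Python sketch, assuming the standard fixed-column layout of ATOM records (the fields are defined by column position, not by whitespace). The function name and dictionary keys are made up for illustration, and the sample line is rebuilt with explicit field widths so the columns are guaranteed to line up.

```python
def parse_atom_record(line):
    """Split one fixed-column PDB ATOM record into named fields."""
    return {
        "serial": int(line[6:11]),          # columns 7-11: atom serial number
        "atom_name": line[12:16].strip(),   # columns 13-16: atom name, e.g. CA
        "res_name": line[17:20].strip(),    # columns 18-20: residue name, e.g. ASP
        "chain_id": line[21],               # column 22: chain identifier
        "res_seq": int(line[22:26]),        # columns 23-26: residue number
        "x": float(line[30:38]),            # columns 31-54: coordinates in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),    # columns 55-60
        "b_factor": float(line[60:66]),     # columns 61-66: temperature factor
        "element": line[76:78].strip(),     # columns 77-78: element symbol
    }

# The record quoted above, rebuilt field by field in the fixed-column layout.
line = (
    "ATOM  " + f"{41:>5}" + " " + " CA " + " " + "ASP" + " " + "A" + f"{6:>4}"
    + "    " + f"{120.208:8.3f}{56.652:8.3f}{53.741:8.3f}{1.00:6.2f}{40.32:6.2f}"
    + " " * 10 + " C"
)

record = parse_atom_record(line)
print(record["atom_name"], record["res_name"], record["chain_id"], record["res_seq"])
```

Slicing by column is used instead of `line.split()` because adjacent fields in real files can run together without a space between them.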
If you right-click in the Jmol molecule display and select “console” from the menu that opens, you get a script console.
if you enter:
color chain
the displayed structure will be colored according to the chain label. The PDB Jmol display is colored this way by default.
if you enter:
restrict *:A
only chain A remains displayed.
if you type
select *:A
future display commands will affect only chain A,
e.g.
spacefill on
to show atoms as spheres
spacefill off;wireframe 0.5
to show the bonds
The first option, 288*4. Each of the four strings of 288 amino acids (aA for short) is chemically identical to the other three, but they can have different 3D shapes.
In Anaglyph’s post: C-alpha is the carbon that carries both the amino group and the acid group. The carbons further along the aA are called beta, gamma and so on, and the last one is omega. This lettering is standard for linear organic molecules and for aAs (which are both protein building blocks and molecules in their own right); you’ll see it in fatty acids too.
I didn’t have time to read this thread in detail (sorry!), so I’m not sure if I’m rehashing old ground for you. I just wanted to mention that, in case you haven’t already done so, you can use the full journal article referenced (as an abstract only) in the Proteopedia entry for more information about the structure. The full article is available free (in this case) through PubMed. My apologies if you already have this.