Would it be possible to take a person’s genome and convert it to a unique ID number? Assuming a 32-character numbering scheme, what would be the smallest number of digits possible to identify everyone now? And for the next millennium?
Are you OK with identical twins getting the same number? In principle, there are slight genetic differences even between identical twins, but then, if you’re going to that level of detail, there are slight genetic differences between the cells in your own body, too.
Would it be possible to take a person’s genome and convert it to a unique ID number?
- Yes. Except you’d run into problems with identical twins. And if you want to make it like a hash that you can use to check someone’s genetic identity, the other variations Chronos mentioned might trip you up.
Assuming a 32 character numbering scheme, what would be the smallest number of digits possible to identify everyone now?
- Why on Earth do you want to use a 32 character system? Anyway, it’s a simple math question. You get 32^n unique numbers for n digits. 7 digits is the absolute minimum, and should also cover the next couple of hundred years, but if you want a unique number to be created by a hash function of sorts, and/or if you want no reuse of numbers, you probably want to add a lot of digits.
7 characters is the minimum, but there’s probably no easy way to get from the person’s genome to such a small number and have it remain unique. The worst case is you simply transcribe the whole genome. 3 billion base pairs is 6 billion bits, so with a 32 character alphabet (5 bits per character), you’d need an ID number that is 1.2 billion characters long. It’s undoubtedly possible to do better than this (somewhere between 7 and 1,200,000,000), but that would require a lot of detailed knowledge of where genes absolutely never differ between individuals.
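The two bounds above are easy to check. This is only a sketch of the arithmetic, assuming a population of 8 billion and 2 bits per base pair; the function names are made up for illustration.

```python
import math

def min_id_length(population: int, alphabet_size: int = 32) -> int:
    """Smallest n such that alphabet_size**n >= population."""
    n = 0
    capacity = 1
    while capacity < population:
        capacity *= alphabet_size
        n += 1
    return n

def transcription_length(base_pairs: int, alphabet_size: int = 32) -> int:
    """Characters needed to write out every base pair at 2 bits each.

    Assumes alphabet_size is a power of two, so each character
    carries a whole number of bits (5 for a 32-character alphabet).
    """
    bits_per_char = int(math.log2(alphabet_size))
    return math.ceil(base_pairs * 2 / bits_per_char)

print(min_id_length(8_000_000_000))         # 7
print(transcription_length(3_000_000_000))  # 1200000000
```

Six characters only cover 32^6, a little over a billion IDs, which is why 7 is the floor for a simple sequential numbering.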
There are approximately three billion base pairs in the human genome. If you wanted a strictly unique identifier (save for monozygotic siblings) then you’d have to replicate all of it in the code. A cryptographic filter with recursive compression could be used to make a hash function of arbitrarily small size, depending on the size of your ciphertext alphabet; using the standard Roman alphabet with 26 characters you could probably create a probabilistically unique signature with about twenty characters, which gives over a billion billion to one odds of one person sharing an identifier with another. Further refinements in the algorithm could probably reduce that number of characters or improve the odds.
However, most of those base pairs are not a part of the 19k to 20k protein-coding genes in the human genome. These are what really make you unique; if you had the same set of these genes as another individual with no significant defects in the other genes that regulate gene expression or transcription, you would be an essentially identical person even if not closely related. Fortunately, the likelihood of having the exact combination of specific variations of a gene as some non-monozygotic person, even one of close relation, is so incredibly improbable that it can be dismissed out of hand. So that would reduce the character count to about 16 alphabetic characters with the same order of magnitude of probability of two unique identifiers. You could go further by isolating specific genes that are known to vary widely and reject genes that are generically common to reduce the dataset even further while minimizing the chance of replicating the same hash.
You could, of course, just have a master list based upon specific sets of base pairs or genes known to vary widely enough that two individuals are unlikely to have the same sequence, and then your list would just be as large as the population in the list. That’s a fairly brute force method but it would reduce the number of necessary unique alphabetic identifiers to 7 for the current world population; make it 8 to account for future growth, which should clear us through at least the end of this century and probably the next as long as there isn’t a renewed massive population growth.
Also: Even aside from identical twins, it’s always possible that you’ll end up with two people whose numbers match. But how high of a probability of this are you willing to tolerate? If there’s a 1% chance of one matching pair somewhere in the world, is that good enough? What if there are expected to be about 10 accidentally-matching pairs in the world, out of 10 billion people: Is that good enough?
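To put numbers on that trade-off, here is a rough sketch. It assumes IDs behave like uniformly random draws from the whole ID space (an idealized hash), 10 billion people, and a 32-character alphabet; the function names are hypothetical.

```python
def expected_colliding_pairs(people: int, id_space: int) -> float:
    """Expected number of pairs sharing an ID, for uniformly random IDs."""
    return people * (people - 1) / (2 * id_space)

def chars_for_expected_pairs(people: int, max_pairs: float,
                             alphabet_size: int = 32) -> int:
    """Smallest ID length whose expected colliding pairs is <= max_pairs."""
    n = 1
    while expected_colliding_pairs(people, alphabet_size ** n) > max_pairs:
        n += 1
    return n

people = 10_000_000_000
# Tolerating about 10 accidental matches worldwide:
print(chars_for_expected_pairs(people, 10))    # 13
# Roughly a 1% chance of even one match (expected pairs of 0.01):
print(chars_for_expected_pairs(people, 0.01))  # 15
```

When the expected number of pairs is small, it is close to the probability of at least one match, which is why 0.01 stands in for the 1% case.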
Why would you use a 32 character alphabet when the real genetic alphabet has 4 characters? You just need two bits for A, G, C, or T. Since the human genome is somewhere around 3 billion base pairs, that means you need about 6 billion bits, or 750 megabytes, of storage. Round it up to 1 GB for a person’s unique genome.
Now multiply 1 GB by 8 billion for the amount of storage you need for the genome of every person on Earth, then give yourself a factor of 1000 for the next 1000 years. This should be very conservative, since you’re not replacing the population every year, but it would be kind of embarrassing to run out of storage space too early, so play it safe.
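Working the storage numbers through from the stated assumptions (3 billion base pairs, 2 bits per base, 8 billion people, a safety factor of 1000) looks like this. It is a back-of-the-envelope sketch using decimal units and no rounding up; the helper names are made up.

```python
BASE_PAIRS = 3_000_000_000
BITS_PER_BASE = 2          # A, C, G, T
PEOPLE = 8_000_000_000
SAFETY_FACTOR = 1000       # deliberately generous, per the post above

def genome_bytes(base_pairs: int = BASE_PAIRS,
                 bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed for one uncompressed genome, rounded up to a whole byte."""
    return (base_pairs * bits_per_base + 7) // 8

def human_readable(n_bytes: int) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    value = float(n_bytes)
    idx = 0
    while value >= 1000 and idx < len(units) - 1:
        value /= 1000
        idx += 1
    return f"{value:g} {units[idx]}"

per_person = genome_bytes()
everyone_now = per_person * PEOPLE
with_margin = everyone_now * SAFETY_FACTOR

print(human_readable(per_person))    # 750 MB
print(human_readable(everyone_now))  # 6 EB
print(human_readable(with_margin))   # 6 ZB
```

In practice you would store far less than this, since any two genomes are nearly identical and only the differences from a reference need recording.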
If you just want a unique identifier for every person on earth for the next 1000 years, that’s a lot smaller.
For twins, simply give the evil one an * next to their number. Problem solved.
I arbitrarily selected 10^18 (“billion billion”) as the acceptable threshold for probability that any one hash could match another; with on the order of 10 billion people (allowing for some growth) that should give odds of more than a single non-monozygotic pairing at <<1% (assuming an alphabetic identifier). Using an alphanumeric identifier, or increasing the number of digits in the identifier will reduce that further.
On review, I see a misapprehension in my previous post; although there are 19k to 20k protein-coding genes, each will have multiple variations (differences or defects) which could be globally identified, increasing individual uniqueness. An actual genomicist could probably identify more complex combinations in tandem repeats and other signatures which could serve to reduce the necessary dataset for a unique identification.
It is also worth noting that some people can have combinations of genomes in their body; these “chimeras” are a result of mergers between fertilized zygotes, and can produce a viable offspring that nonetheless has different genomes in different organs. Such conditions are often unsuspected unless there is a medical pathology which makes it apparent. How you would identify such people under this scheme is unclear, but you’d obviously have to have some scheme to combine the genomes into a single dataset that is reproducible for comparison.
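Given k individuals and N possible ID values, the probability of at least one collision is approximately
$$P(\text{collision}) \approx 1 - e^{-\frac{k(k-1)}{2N}}$$
so you can choose N (and hence the number of digits) large enough to make that probability acceptably small.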
With a reliable 160-bit hash of the DNA (ignoring twins) you could have 10^17 individuals and the chance of even a single collision would be on the order of 1 in 100 trillion.
1:100,000,000,000,000 are pretty good odds.
A 64-bit hash, by contrast, is too small: with 10^17 individuals a collision is essentially certain, and even with 10^10 individuals the chance of at least one collision is roughly 93%.
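The birthday approximation 1 - e^(-k(k-1)/(2N)) is simple to evaluate directly. The sketch below assumes an ideal hash whose outputs are uniformly distributed; the function name is made up, and `math.expm1` is used only so that very small probabilities are not lost to floating-point rounding.

```python
import math

def collision_probability(k: int, n_values: int) -> float:
    """Approximate chance of at least one collision among k uniformly
    random IDs drawn from n_values possibilities (birthday bound)."""
    exponent = -k * (k - 1) / (2 * n_values)
    # 1 - e^x computed as -expm1(x), which stays accurate when x is tiny.
    return -math.expm1(exponent)

print(collision_probability(10**17, 2**160))  # about 3.4e-15
print(collision_probability(10**10, 2**64))   # about 0.93
print(collision_probability(10**17, 2**64))   # 1.0
```

The first figure works out to about 1 in 300 trillion, and the last two show why 64 bits is not enough at these population sizes.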
Aside from the problem of identical twins, there’s the problem of genetic chimeras. These are people that have multiple sets of DNA. This could be a simple matter of a fetus absorbing their would-be fraternal twin. Recipients of a bone marrow transplant would also be chimeras. Pregnant women have also been known to carry cells from their fetus, often for years after giving birth.
Yet another problem is that genotyping techniques are not perfect. If you’re doing full sequence or millions of SNPs (like 23andMe, Ancestry, etc.) you’re going to get slight variations each time you run the same person’s DNA. They won’t be big differences, hopefully much, much less than 1%, but for something like a hashing algorithm, those differences will be enough to make the hash different for the same person.
From a genetics perspective, I recommend going the other direction—don’t have millions or billions of markers, but a few hundred or thousand. Pick some markers with a high minor allele frequency*. If the MAF is under about 5%, then you won’t see the minor allele very often, so most people have the same marker at that location. Look for MAFs of 15-50%. Once you have those markers, filter them so you have a set that are not in linkage disequilibrium**. Also throw any away that cannot be genotyped reliably.
Once that is done you should have a few hundred to a few thousand markers (probably SNPs, but if you want to be old school go RFLP). Design a chip that specifically, very reliably, and cheaply genotypes just these specially chosen markers (there are companies that will help with this). Now use that chip to genotype your population, and use a hash of those markers to generate each person’s ID code.
Even with all that, you should probably pick a hash that will produce the same results even if a few of the markers are missing or different. Otherwise you’re going to have to genotype everyone several times. You’ll still have to regenotype occasionally for poor results.
* MAF: the frequency of the second most common allele for a two allele marker, like a SNP.
** Linkage disequilibrium: regions of DNA are inherited together, so SNPs near each other tend to be inherited together. Also, see haplotype.
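One way to get the error tolerance described above is to skip the hash for matching and compare marker panels directly, allowing a small number of mismatches and no-calls. This is not a hash in the strict sense, just a tolerant comparison swapped in for illustration; the genotype coding (0, 1, or 2 copies of the minor allele, None for a failed call) and both thresholds are assumptions, not recommended values.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1, or 2 minor-allele copies; None = no call

def mismatches(a: Sequence[Genotype], b: Sequence[Genotype]) -> Tuple[int, int]:
    """Return (differing calls, markers called in both samples)."""
    differing = compared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip markers that failed in either run
        compared += 1
        if x != y:
            differing += 1
    return differing, compared

def same_person(a: Sequence[Genotype], b: Sequence[Genotype],
                max_mismatch_rate: float = 0.02,
                min_call_fraction: float = 0.9) -> bool:
    """True if two panels agree within the assumed error tolerance."""
    differing, compared = mismatches(a, b)
    if compared < min_call_fraction * len(a):
        return False  # too many missing calls; regenotype instead
    return differing <= max_mismatch_rate * compared

# Toy panel of 200 markers (made-up data, not real genotypes).
enrolled = [i % 3 for i in range(200)]
rerun = list(enrolled)
rerun[5] = None                       # one failed call
rerun[17] = (enrolled[17] + 1) % 3    # one genotyping error
stranger = [(g + 1) % 3 for g in enrolled]

print(same_person(enrolled, rerun))     # True
print(same_person(enrolled, stranger))  # False
```

The assigned ID can then be any short number stored alongside the enrolled panel, with the tolerant comparison used only to look a person up.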
So identical twins and chimeras mess things up. The idea was to generate a number that is unique, unchangeable, and unquestionably affiliated with any given person.
If that is all you are trying to achieve, just come up with a unique naming convention for all people in the world (we wouldn’t dare use a number; that is too dangerous!), say a government-assigned middle name, and then set up clinics where everyone can stop by and have whatever tests are necessary to uniquely identify a given water sack. With a one-to-one association in the global database, your problem is solved. Given sophisticated enough testing in the clinic, one could handle identical twins, chimeras, etc. Note that the testing doesn’t have to be done in real time: once the proper measurements and samples are taken, an ID can be assigned to that person and the tests can be run later.
I have often thought that this is an excellent mechanism for many ID-related things. Extend the concept to a one-to-many association (one water sack to many unique names) and allow everyone to go down to their friendly neighborhood Social Security DNA jewelry store where they can buy anything from a cheap ID card to a diamond-encrusted ID ring or bracelet. As many as they want. Each unique ID device is permanently associated with one person, and every person in the population is held responsible (at least civil fines, perhaps criminal sanctions) for only using an ID associated with that person. You could build near-field readers or contact readers into POS devices, computers, phones, etc. Yes, people could steal an ID card and use it to impersonate someone, but that impersonation itself could be a serious crime. Steal $100 on the internet and the credit card company can’t be bothered. Use a stolen ID and the Government takes the crime very seriously. Of course this would be the death of anonymity, but that is exactly the goal of any ID service. Perhaps couple this with a right to refuse to provide any ID and still receive service (without the extension of any credit). That way both competing goals could be achieved. Anyone that has a legitimate need to remember people’s IDs can have read-only access to the global database. That way the Government can enforce privacy and fair conduct rules. If Facebook fails to protect individuals’ privacy, then cut them off from the global database. They wouldn’t be able to sell anything that requires personalization.