Recently, I wanted to see if I could impute the presence of the HLA-B27 antigen using my raw 23andMe DNA data. This is a pretty important antigen, because it is associated with several diseases. Although I used a very small reference data set, I managed to get a rough result that happened to coincide with a blood test.
NOTE: I'm just a hobbyist, so tread carefully.
Why is this cool? Because you can't use a service like 23andMe to find out if you have HLA-B27: there is no known SNP algorithm that can accurately tell you if you have it (rs4349859 comes close, but 23andMe doesn't genotype it). So the idea is to statistically compare your genome with those of people whose HLA-B27 presence or absence is known, and then infer the probability that you have it.
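To make the idea a bit more concrete, here is a toy sketch in Python. It is not what SNP2HLA does internally; it's just a similarity-weighted vote over a tiny reference panel that I made up for illustration. The allele strings, the `penalty` parameter and the function names are all hypothetical.

```python
from math import exp

# Hypothetical reference panel: (SNP alleles near HLA-B, carries HLA-B27?).
# These values are invented purely for illustration.
REFERENCE = [
    ("AGTC", True),
    ("AGTT", True),
    ("CGAC", False),
    ("CAAC", False),
    ("CGAT", False),
]

def mismatches(a, b):
    """Count positions where two equally long allele strings differ."""
    return sum(x != y for x, y in zip(a, b))

def carrier_probability(target, reference, penalty=1.5):
    """Similarity-weighted share of reference haplotypes that carry the allele.

    Each reference haplotype gets weight exp(-penalty * mismatches), so the
    closest matches dominate the estimate.
    """
    weights = [exp(-penalty * mismatches(target, hap)) for hap, _ in reference]
    carrier_weight = sum(w for w, (_, carries) in zip(weights, reference) if carries)
    return carrier_weight / sum(weights)

# A target that resembles the carriers scores high, one that doesn't scores low.
print(round(carrier_probability("AGTC", REFERENCE), 3))
print(round(carrier_probability("CGAC", REFERENCE), 3))
```

The point is only that the answer is a probability derived from how similar you are to people with a known status, which is also why the size and ancestry of the reference panel matter so much.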
To convert your raw data to plink output files (.bed, .bim and .fam), we use plink.
FORENAME is the individual's name and must contain no spaces, but doesn't have to be in uppercase. The names are just used as markers to distinguish between people, if you want to perform bulk operations. The sex is given as M for male (it can also be deduced, obviously, but the deduction may be incorrect in some cases) or F for female. The --out foo option means plink will create output files named foo.bed, foo.bim and foo.fam.
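The conversion step can be sketched as a small Python helper that assembles the plink command line. This assumes plink 1.9's `--23file` option, which takes the raw file followed by a family ID, an individual ID and the sex. The file name `genome.txt`, the `SURNAME` placeholder and the helper's name are my own assumptions.

```python
import shlex

def plink_23file_command(raw_file, family_id, individual_id, sex, out_base, plink="plink"):
    """Build the argv for converting a 23andMe raw-data file with plink.

    Assumes plink 1.9's --23file option: file name, then family ID,
    individual ID and sex (M or F).
    """
    if sex not in ("M", "F"):
        raise ValueError("sex must be 'M' or 'F'")
    for label in (family_id, individual_id):
        if any(ch.isspace() for ch in label):
            raise ValueError("IDs must not contain spaces")
    return [plink, "--23file", raw_file, family_id, individual_id, sex, "--out", out_base]

# "genome.txt" and "SURNAME" are placeholder names.
cmd = plink_23file_command("genome.txt", "SURNAME", "FORENAME", "M", "foo")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the argument list first and printing it makes it easy to check the command before handing it to `subprocess.run`.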
Next, we need to perform the actual imputation for HLA. To do this, we need a reference data set, and we'll use the HapMap CEU reference data set bundled with SNP2HLA.
This is a very small example set of some 124 individuals, so your results will be very inaccurate and, if you're not of European descent, possibly useless.
The SNP2HLA authors have a larger data set with over 5000 individuals that used to be bundled with the software. It was removed as of version 1.0.3 for privacy and security reasons. But if you're a serious researcher, you can ask for a copy of the full set. No, I don't have it.
To perform the actual imputation, simply run SNP2HLA with the following arguments.
The first argument, foo, is the base name of the output files from the previous step. HM_CEU_REF is the reference set to base the imputation on, and foo2hla is the output base name for this operation (I like to distinguish between plink and SNP2HLA output). Then there's a path to plink (I use plink2). The last two arguments are memory limits. They're really only needed when processing large groups of people, but I kept them anyway.
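Put together, the argument order described above looks roughly like this. It's a sketch that only builds the command line. I'm assuming the script is invoked as `./SNP2HLA.csh`, and the two trailing numbers are placeholders, so check them against the SNP2HLA manual.

```python
import shlex

def snp2hla_command(data, reference, output, plink, *limits, script="./SNP2HLA.csh"):
    """Build the argv for an SNP2HLA run in the argument order described above.

    data      -- base name of the plink files from the previous step
    reference -- base name of the reference panel
    output    -- base name for the SNP2HLA output files
    plink     -- path to the plink executable
    limits    -- the optional trailing numeric arguments
    """
    if len(limits) > 2:
        raise ValueError("at most two trailing arguments are expected")
    return [script, data, reference, output, plink, *map(str, limits)]

# 2000 and 1000 are placeholder values, chosen only for illustration.
cmd = snp2hla_command("foo", "HM_CEU_REF", "foo2hla", "plink2", 2000, 1000)
print(shlex.join(cmd))
```

Leaving out the trailing numbers gives the shorter form, which should be enough for a single individual.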
So, the output files you get now:
In foo2hla.bgl.phased, you can see the inferred presence or absence for each of the two chromosomes.
In my case, it says that the antigen is absent in both chromosomes, both for HLA-B27 in general and for the HLA-B*2705 subtype.
As for the probabilities,
The first line is for HLA-B27 in general, ignoring any subtypes. The last three numbers are the probabilities for presence in both chromosomes (PP), one present and one absent (PA) and absent in both (AA). So the probability is a whopping 88.8% for the complete absence of HLA-B27 in this case (but 11.2% for the presence of at least one copy). In comparison, about eight percent of Caucasians possess this gene. The second line is for the HLA-B*2705 subtype, and it has the same probabilities.
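The arithmetic behind those percentages is simple enough to sketch in Python. The example line is hypothetical: only the 0.888 for AA comes from my result. The marker name, the two allele columns and the split between PP and PA are made up to show the shape of the calculation.

```python
def parse_probabilities(line):
    """Split a whitespace-separated line whose last three fields are P(PP), P(PA), P(AA)."""
    fields = line.split()
    p_pp, p_pa, p_aa = (float(x) for x in fields[-3:])
    return fields[0], p_pp, p_pa, p_aa

# Hypothetical line: only the 0.888 (AA) figure is taken from the text above.
line = "HLA_B_27 P A 0.002 0.110 0.888"
marker, p_pp, p_pa, p_aa = parse_probabilities(line)

# "At least one copy" covers both PP and PA.
at_least_one = p_pp + p_pa
print(f"{marker}: P(at least one copy) = {at_least_one:.1%}, P(none) = {p_aa:.1%}")
```

Since the three probabilities sum to one, the chance of at least one copy is also just one minus the AA probability.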
So, despite the large uncertainty, the result does give a very crude indication, even with such a small reference data set. It also matched the result of the blood test, which was fun, but that could just as well have been pure luck.
Bottom line: It's pretty awesome that ordinary people can do stuff like this. I can easily imagine a professional service built around imputation: a patient gets genotyped, and the data is stored on a secure server. The doctor can then, based on a permission scheme, run imputation for things like HLA-B27 as one of many tools when diagnosing.
A blood test will always be more accurate than imputation, but it is time-consuming and costs money. If a doctor could quickly see that the probability of HLA-B27 is low, then, depending on the context, it might not be necessary to order a blood test at all.