Imputing the presence of the HLA-B27 antigen using your 23andMe genome

By Christian Stigen Larsen
23 Feb 2016

Recently, I wanted to see if I could impute the presence of the HLA-B27 antigen using my raw 23andMe DNA data. This is a pretty important antigen, because it is associated with several diseases. Although I used a very small reference data set, I managed to get a rough result that happened to coincide with a blood test.

NOTE: I'm just a hobbyist, so tread carefully.

Why is this cool? Because you can't use a service 23andMe to find out if you have HLA-B27: There is no known SNP algorithm that can accurately tell you if you have it (although rs4349859 comes close, but 23andMe doesn't sequence it). So the idea is to statistically compare your genome with others whose HLA-B27 presence or absence is known, and then infer the probability that you have it or not.

Prerequisites

Converting your 23andMe files

To convert your them to plink output files (.bed, .bim and .fam), we use plink2:

$ plink2 --23file genome.txt SURNAME FORENAME M --out foo

Here SURNAME and FORENAME is the individual's name and must contain no spaces, but doesn't have to be in uppercase. They're just used as a marker to distinguish between people, if you want to perform bulk operations. M means male (it can also be deduced, obviously, but may be incorrect in some cases). Use F for female. The --out foo means plink will create output files foo.bed, foo.bim and foo.fam.

Imputing the presence of HLA-B27

Next, we need to perform the actual imputation for HLA. To do this, we need a reference data set, and we'll use the HapMap CEU reference data set bundled with SNP2HLA.

This is a very small example set, consisting of some 124 individuals. Meaning, your results will be very inaccurate, and if you're not of European descent, may be completely useless.

The SNP2HLA authors have a larger data set with over 5000 individuals that used to be bundled with the software. It was redacted as of version 1.0.3, because of privacy and security. But if you're a serious researcher, you can ask for a copy of the full set. No, I don't have it.

To perform the actual imputation, simply do

./SNP2HLA.csh foo HM_CEU_REF foo2hla `which plink2` 2000 1000

The first argument, foo is the name of the output files in the previous step. HM_CEU_REF is the set to base the imputation on, foo2hla is the output base name for this operation (I like to discern between plink and SNP2HLA output), then there's a path to plink (I use plink2). The last two arguments are memory limits. It's really only needed when processing large groups of, but I kept them anyway.

So, the output files you get now:

foo2hla.bed
foo2hla.bgl.gprobs
foo2hla.bgl.log
foo2hla.bgl.phased
foo2hla.bgl.r2
foo2hla.bim
foo2hla.dosage
foo2hla.fam

In foo2hla.bgl.phased, you can see the inferred presence or absence for each of the two chromosomes.

$ grep HLA_B_27 foo2hla.bgl.phased
M HLA_B_27 A A
M HLA_B_2705 A A

This also says that HLA-B27 is Absent in both chromosomes for HLA-B27 and the HLA-B2705 allele.

As for the probabilities,

$ grep HLA_B_27 foo2hla.bgl.gprobs
foo2hla.bgl.gprobs:HLA_B_27 P A 0,002 0,110 0,888
foo2hla.bgl.gprobs:HLA_B_2705 P A 0,002 0,110 0,888

The first line is for HLA-B27 in general, ignoring any sub types. The last three numbers are the probabilities for presence in both chromosomes (PP), one present and one absent (PA) and absent in both (AA). So the probability is a whooping 88.8% for the complete absence of HLA-B27 in this case (but 11.2% for the presence of at least one copy). In comparison, about eight percent of Caucasians posses this gene. The second line is for the HLA-B*2705 sub type, and it has the same probabilities.

So, even if there is such a big uncertainty in the results, it does give a very crude indication, even for the small reference data set. And, it did match up with the result of the blood test, which was fun, but could just as well have been pure luck.

Bottom line: It's pretty awesome that ordinary people can do stuff like this. I can easily imagine a professional service built around imputation: Patient gets genotyped, it's stored on a secure server. The doctor can then, based on a permission scheme, run imputation for things like HLA-B27 as one of many tools when diagnosing.

A blood test will always be more accurate than imputation, but is time consuming and costs money. If a doctor could quickly see that the probability of HLA-B27 is low, and depending on the context, it may not be necessary to order a blood test at all.