MaCH-Admix is an extension to MaCH for faster and more flexible imputation, especially in admixed populations.
All existing MaCH documentation applies to MaCH-Admix. For general input file preparation and haplotyping
options, please see: MaCH tutorial
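To run MaCH-Admix, a user goes through the following steps, each covered in a section below: prepare the input files, pick a run mode, interpret the output files, and optionally choose another reference selection mode or explore other program options.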
We highly recommend the default settings. For example, the following command line generates the best-guess genotype (--geno), the allele dosage
(--dosage), and two independent probabilities (--probs) for every imputed genotype.
   
./mach-admix -d target.dat -p target.ped -h ref.hap -s ref.snps --geno --probs --dosage -o my.out
Other issues covered on this page: chunking chromosomes and estimating ancestry weights of reference populations.
Preparing Input Files (with references in haps+snps format)
A typical imputation scenario involves:
- a ped file containing all observed genotypes of target individuals (typically in MERLIN/QTDT format)
- a dat file containing all markers present in the above ped file (typically in MERLIN/QTDT format)
- a haps file containing all reference phased haplotypes (e.g., Hapmap2 or 3 phased haplotype files, or phased 1000-Genome hap files)
- a snps file containing all markers present in the above haps file (e.g., the snps files that come with Hapmap2 or 3 phased haplotype files)
The four input files are specified using the following command line options:
- ped file: -p or --pedfile <filename>
- dat file: -d or --datfile <filename>
- haps file: -h or --haps <filename>
- snps file: -s or --snps <filename>
Example:
-d target.dat -p target.ped -h ref.hap -s ref.snps
For more information on the input genotype (ped) and marker list (dat) files, please see here and here.
For more information on HapMap2/3 haps and snps files, please see HapMapII r21, HapMapII r22, and HapMapIII r2.
For more information on 1000-Genome haps and snps files, please see 1000G Reference Download.
A reminder: before genotype imputation, the user should carry out basic data quality checks on the available genotypes. Typically, we exclude
from analysis markers that have low genotyping success rates (perhaps with <95% of genotypes called successfully), unexpected evidence
for deviations from Hardy-Weinberg equilibrium (e.g., those with HWE p-value < 1e-6), large numbers of discrepancies among
duplicate samples, several Mendelian inconsistencies in available parent-offspring trios, or low frequency (e.g., those with MAF < 0.5%).
All these checks are platform and study specific, and the user will have to decide what is appropriate for his/her data.
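As a rough illustration of the per-marker checks described above, the following Python sketch applies the call-rate, HWE, and MAF thresholds to genotype counts for a single marker. It is not part of MaCH-Admix: the function names and the use of a 1-df chi-square test for HWE are assumptions made for this example (an exact test is usually preferable when counts are small), and the duplicate-sample and Mendelian checks are not covered.

```python
import math

def hwe_pvalue(n_aa, n_ab, n_bb):
    """Approximate HWE p-value from genotype counts (chi-square test, 1 df)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(n_aa, n_ab, n_bb, n_missing,
              min_call_rate=0.95, min_hwe_p=1e-6, min_maf=0.005):
    """Return True if a marker passes the example thresholds from the text."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return False
    call_rate = called / total
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p)
```

The thresholds are keyword arguments because, as noted above, the appropriate values depend on the platform and the study.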
Preparing Input Files (with references in VCF format)
Since 1000G Phase I Version 3, we provide 1000G references for download in VCF (Variant Call Format). VCF integrates many types of information in a single file. To accommodate the new format, we have added a set of features to MaCH-Admix for processing VCF files. A typical imputation scenario involves:
- a ped file containing all observed genotypes of target individuals (typically in MERLIN/QTDT format)
- a dat file containing all markers present in the above ped file (typically in MERLIN/QTDT format)
- a VCF file containing all phased reference haplotype information
The three input files are specified using the following command line options:
- ped file: -p or --pedfile <filename>
- dat file: -d or --datfile <filename>
- VCF file: -h or --haps <filename>
The input VCF file replaces the original haps file, and the
"--vcfReference" command line option must be turned on.
Example:
-d target.dat -p target.ped -h ref.vcf --vcfReference
With VCF references input, MaCH-Admix accepts the following additional command line options to facilitate splitting of big imputation tasks:
- --startposition <#start> / --endposition <#end>: Specify the imputation range (in bp). Markers in reference VCF that are out-of-range will not be loaded for imputation.
- --outputstart <#start> / --outputend <#end>: Specify the range of markers (in bp) in final output. This can be used to exclude flanking regions when splitting a big region into multiple overlapping chunks.
Example:
./mach-admix --dosage -d target.dat -p target.ped -h ref.vcf --vcfReference \
--startposition 10000000 --endposition 15000000 --outputstart 11000000 --outputend 14000000
The command above imputes markers from 10Mbp to 15Mbp and only outputs results from 11Mbp to 14Mbp.
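When a long region is split into several overlapping chunks, the four position options can be derived mechanically. The Python sketch below is one way to do that, under stated assumptions: the helper names are hypothetical, the output ranges are treated as inclusive and non-overlapping so that each marker is written by exactly one chunk, and the flank size is left for the user to choose.

```python
def chunk_ranges(region_start, region_end, core_size, flank):
    """Return (start, end, outputstart, outputend) tuples, all in bp.

    Each chunk outputs a core of at most core_size bp and is imputed
    with up to `flank` bp of extra context on each side.
    """
    if core_size <= 0 or flank < 0:
        raise ValueError("core_size must be positive and flank non-negative")
    chunks = []
    out_start = region_start
    while out_start <= region_end:
        out_end = min(out_start + core_size - 1, region_end)
        start = max(region_start, out_start - flank)
        end = min(region_end, out_end + flank)
        chunks.append((start, end, out_start, out_end))
        out_start = out_end + 1
    return chunks

def chunk_options(chunk):
    """Format one chunk as the four range options described above."""
    start, end, out_start, out_end = chunk
    return ["--startposition", str(start), "--endposition", str(end),
            "--outputstart", str(out_start), "--outputend", str(out_end)]
```

For example, `chunk_ranges(1, 10000000, 4000000, 1000000)` yields three chunks whose output ranges tile the region without overlap, while the imputation ranges overlap by the flank on each side.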
For more information on 1000-Genome VCF reference files, please see 1000G Reference Download. For more information on the VCF format, please see
http://www.1000genomes.org/node/101. Please also note that MaCH-Admix can output results in VCF when the
"--outvcf" command line option is turned on.
Pick a run-mode
There are two Run Mode choices to conduct imputation:
- Integrated (Default): using this mode, the user doesn't need to specify any model parameters.
Model parameters include recombination rates and error rates that will be stored in .rec and .erate files respectively.
MaCH-Admix will integrate both parameter estimation and imputation in MCMC iterations. This mode is suitable
when model parameters are either not available or calibrated using a population poorly matched to
the target/to-be-imputed population. To use the integrated mode:
- specify all four input files (ped/dat/haps/snps)
- use command line option "--runMode Integrated"
Example:
./mach-admix --geno --runMode Integrated -d target.dat -p target.ped -h ref.hap -s ref.snps
Note that
- The command line has exactly the same effect without "--runMode Integrated" because "Integrated" is the default run mode.
- The "--geno" option here requests the best-guess genotypes (after imputation).
- Pre-Calibrated: using this mode, the user first calibrates all model parameters (recombination rates / error rates).
MaCH-Admix can then use the calibrated parameters to conduct imputation without parameter learning. This pre-calibrated mode
is suitable if the user repeatedly uses the same set of data (e.g., the same reference haps/snps files) or has obtained
parameter files beforehand. With pre-calibrated parameters, this mode runs faster than the Integrated mode.
There are two types of pre-calibration:
- Pre-Calibration based on reference haplotypes only. This is suitable if the user repeatedly uses the same set of references to
impute targets that are similar to (all or part of) the references.
In this case, use the command line option "--runMode EstimateParameterFromRef"
- Pre-Calibration based on both reference haplotypes and target genotypes. This is suitable if the user repeatedly uses the same set
of references and suspects that the parameters in the targets differ considerably from those in the references (e.g., when target individuals are of unknown ethnicity or from an isolated population).
In this case, use the command line option "--runMode EstimateParameterOnly"
If the references are dense, large, and from diverse population sources (e.g., 1000G), the first type of pre-calibration usually works well. Once the pre-calibration type is decided, the user can conduct the 2-step imputation as follows:
- Step-1: calibrate parameters using one of the two command line options discussed above
Example:
./mach-admix --runMode EstimateParameterFromRef --prefix parameters -d target.dat -p target.ped -h ref.hap -s ref.snps
The "--prefix" option specifies the output filename.
Note:
You can skip this step by downloading a set of parameters pre-calibrated using the 1000 Genomes phaseI.v3
data at phaseI.v3.pre-calibrated-parameters.Nov2012.tgz
- Step-2: conduct imputation using calibrated parameter file and command line option "--runMode ImputeOnly"
Example:
./mach-admix --geno --runMode ImputeOnly --crossoverMap parameters.rec --errorMap parameters.erate \
-d target.dat -p target.ped -h ref.hap -s ref.snps
Note the two parameter files specified ("--crossoverMap parameters.rec --errorMap parameters.erate").
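For users who script the two-step workflow, the Python sketch below assembles both command lines as argument lists. It only uses the options shown in the examples above; the function names are hypothetical, and the sketch assumes that the files written in Step-1 are named after the "--prefix" value with the .rec and .erate extensions, as in the Step-2 example.

```python
def calibration_command(prefix, dat, ped, haps, snps, from_ref_only=True):
    """Step-1: build the parameter calibration command as an argument list."""
    mode = "EstimateParameterFromRef" if from_ref_only else "EstimateParameterOnly"
    return ["./mach-admix", "--runMode", mode, "--prefix", prefix,
            "-d", dat, "-p", ped, "-h", haps, "-s", snps]

def imputation_command(prefix, dat, ped, haps, snps):
    """Step-2: build the imputation command that reuses the calibrated files."""
    return ["./mach-admix", "--geno", "--runMode", "ImputeOnly",
            "--crossoverMap", prefix + ".rec",   # assumed Step-1 output name
            "--errorMap", prefix + ".erate",     # assumed Step-1 output name
            "-d", dat, "-p", ped, "-h", haps, "-s", snps]
```

Each list can be passed to `subprocess.run(cmd, check=True)`, so that Step-2 only starts if Step-1 finished without an error.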
Interpret output files
Most output files are self-explanatory. Details will be added to this section.
- .geno / .geno.gz: Best guessed genotypes generated using --geno option
- .dose / .dose.gz: Genotypes dosage file generated using --dosage option
- .prob / .prob.gz: Genotypes probability file generated using --probs option
- .qc / .qc.gz: Quality control information file generated using --quality option
- .rec and .erate: Recombination rate and error rate parameter files. These are always generated.
If "--outvcf" is turned on, the corresponding genotype / haplotype / dosage / probability / quality information will be stored in a single VCF file. The VCF output gives a one-marker-per-line view and is
helpful when a user is dealing with not only SNPs but also Indels. For more information on the VCF format, please see
http://www.1000genomes.org/node/101.
(Optional) Other reference selection modes
MaCH-Admix uses a hidden Markov model (HMM) to infer unobserved genotypes. In such a
model, the computational cost grows quadratically with the number of reference haplotypes used. To accelerate
computation, MaCH-Admix uses an approximation: in each round it samples a small proportion of the available reference haplotypes
as the effective reference panel. By default, MaCH-Admix uses piecewise IBS matching, which
provides the best trade-off between running time and quality in most situations. In addition, MaCH-Admix offers the following two approaches for evaluation purposes:
- IBS-based Sampling
The IBS-based approaches will construct a tailored reference panel for each individual depending on
the person's genetic background. Specifically, the IBS-based approaches achieve this by comparing the
genetic similarity (measured by IBS) between the target individual and every haplotype in the reference pool.
IBS-based sampling is generally good for imputation in short regions (<10Mb), for both admixed and non-admixed populations.
In MaCH-Admix, we provide the following IBS options:
- IBS Double-Queue
This is the default recommended strategy in most situations. In this approach, we sort all reference
haplotypes according to their similarity to the two guessed haplotypes of each target individual. We
maintain two separate priority queues, one for each of the guessed haplotypes.
To switch to this Double-Queue selection approach, use the "--selectMode IBSDQ" command line option.
Example:
./mach-admix --geno --selectMode IBSDQ -d target.dat -p target.ped -h ref.hap -s ref.snps
- Single-Queue
This is the basic IBS-based approach that maintains only one priority queue for each target
individual. This approach is generally worse than the double-queue approach.
To switch to this approach, use the "--selectMode IBSSQ" command line option.
- Ancestry-Weighted Sampling
This approach concerns the scenario in which the reference panel consists of haplotypes from several
reference populations, for instance CEU and YRI for African Americans. The user can specify beforehand
the desired weight (expected ancestral contribution) of each reference population. This is done by providing
an ancestry-weight file to MaCH-Admix. Each line in the weight file corresponds to one reference
population and has two numbers: the population size (number of individuals, i.e., half the number of haplotypes) and the expected
weight of that reference population. An example of the ancestry-weight file is shown below:
[File: AncestryWeight.cfg]
60 0.2
60 0.8
The example specifies two reference populations, each with 60 individuals (120 haplotypes). The reference
haplotype file must have consistent content (i.e., 60*2 haplotypes from the first population followed
by 60*2 haplotypes from the second population). After the ancestry-weight file and the ancestry-weighted
sampling mode are specified, each individual in the first (second) population carries a weight of 0.2 (0.8)
in MCMC iterations. Note that the weights can be on an arbitrary scale (MaCH-Admix normalizes them internally).
Example Usage:
./mach-admix --geno --ancweightfile AncestryWeight.cfg --selectMode AncestryWeighted \
-d target.dat -p target.ped -h ref.hap -s ref.snps
Here, "--selectMode AncestryWeighted" specifies the Ancestry-Weighted selection option, and
"--ancweightfile AncestryWeight.cfg" specifies the name of the ancestry-weight file. Note that the
total number of individuals in this file must agree with that in the reference haplotype file.
Determining the weights
A key question here is how to determine the weights. There are several natural ways to estimate them.
One could pre-specify the weights according to estimates of ancestry proportions. For example,
African Americans have about 20% Caucasian and 80% African ancestry, so it is reasonable to
use a ~2:8 ratio of CEU:YRI in weighting.
Alternatively, the user can use the built-in feature provided by MaCH-Admix to infer ancestry proportions (see the section on estimating ancestry weights of reference populations on this page).
The user can also use other software such as STRUCTURE and HAPMIX.
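Once the weights are chosen, the ancestry-weight file described above can be written and sanity-checked with a few lines of Python. This is a minimal sketch, not a MaCH-Admix utility: the function names are hypothetical, the file is assumed to contain exactly the two space-separated numbers per line shown in the example, and the normalization is only an illustration of why the scale of the weights does not matter.

```python
def format_ancestry_weights(populations):
    """Render (n_individuals, weight) pairs, in reference-file order, as file text."""
    lines = []
    for n_individuals, weight in populations:
        if n_individuals <= 0 or weight < 0:
            raise ValueError("population size must be positive, weight non-negative")
        lines.append(f"{n_individuals} {weight:g}")
    return "\n".join(lines) + "\n"

def matches_reference(populations, n_reference_haplotypes):
    """Check that the file accounts for every reference haplotype (2 per individual)."""
    return 2 * sum(n for n, _ in populations) == n_reference_haplotypes

def normalized_weights(populations):
    """Rescale weights to sum to 1, to compare weight files on different scales."""
    total = sum(weight for _, weight in populations)
    return [weight / total for _, weight in populations]
```

With this sketch, `[(60, 2), (60, 8)]` and `[(60, 0.2), (60, 0.8)]` normalize to the same weights, which reflects the statement that the weights can be given on an arbitrary scale.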
(Optional) Explore Other Available Program Options
There are many other useful options:
- --geno: Output best-guess genotypes.
- --dosage: Output allele dosages (estimated number of copies of Al1 at each SNP for each individual, ranging continuously from 0 to 2).
- --autoflip: Automatically flip strand if the alleles in the target (ped) and reference (haps) files are inconsistent.
- --errorRate: Specify the base error rate. The default value is 0.001.
Normally the user does not need to set this because MaCH-Admix will adjust this value in each iteration.
Set this to a higher/lower value if the user's data are very noisy/clean to help accelerate the adjustment.
- --probs: Output posterior probabilities (two probabilities, for Al1/Al1 and Al1/Al2 respectively, for each genotype guessed).
- --qc: Output quality score for each genotype guessed.
- --phase: Perform phasing and output best-guess haplotypes.
If the user plans to perform phasing, we recommend a larger --states value (>200).
The default value is good for imputation but may be insufficient for phasing.
- --rounds <#rounds>: Specify the number of MCMC rounds. The default value is 30, which is good enough
for standard imputation tasks. Computing time grows linearly with #rounds.
More rounds give better results at increased computational cost.
- --states <#states>: Specify the number of reference haplotypes sampled in the internal phasing step. The default value is 100.
Computing time grows quadratically with #states.
More states give better results at increased computational cost.
- --referencesInFitting <#referencesInFitting>: Specify the number of reference haplotypes used in parameter learning.
The default value is 500. Computing time is affected only minimally by #referencesInFitting.
- --imputeStates <#imputeStates>: Specify the number of reference haplotypes used in the internal imputing step. The default value is 500.
Computing time is affected only minimally by #imputeStates.
- --seed <#seed>: Specify random seed.
- --interimInterval <#interimRounds>: Output intermediate results every #interimRounds rounds.
The option is useful when computing time is very long for running all #rounds.
- --prefix <fileprefix>: Specify the prefix for all output files
- --uncompressed: Choose not to compress output files. By default all
output files are written out in .gz format (if compiled with gzip support).
- --mask <proportion>: Randomly mask a small proportion of input genotypes to evaluate imputation quality.
- --scoreMAF: Use allele-frequency-weighted IBS calculation. Instead
of computing a Hamming distance where each site carries the same weight, this option gives
higher weight to sites with lower minor allele frequency.
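The scaling statements for --rounds and --states in the list above can be turned into a rough back-of-the-envelope estimate. The Python sketch below does only that: it assumes computing time is exactly linear in #rounds and quadratic in #states relative to the defaults (30 and 100), and it ignores fixed overhead and the options whose effect is described as minimal. The function name is hypothetical.

```python
DEFAULT_ROUNDS = 30   # default value of --rounds
DEFAULT_STATES = 100  # default value of --states

def relative_cost(rounds=DEFAULT_ROUNDS, states=DEFAULT_STATES):
    """Estimated computing time relative to a run with default settings."""
    return (rounds / DEFAULT_ROUNDS) * (states / DEFAULT_STATES) ** 2
```

For instance, raising --states from 100 to 200 for phasing gives `relative_cost(states=200) == 4.0`, that is, roughly four times the default computing time under these assumptions.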
Chunk Chromosomes
With large reference panels, it is much more time-efficient to split whole chromosomes into multiple chunks and run imputation on each chunk separately. We provide such
functionality by first splitting the reference (only the reference; there is no need to split the target) and then doing imputation.
If your reference is in VCF format, please download splitVCFref.jar first.
Then follow the sample C-shell script splitVCFandImpute.csh.
Estimate Ancestry-Weights of Reference Populations
MaCH-Admix has a built-in feature to estimate the contributions of reference populations. To use this feature,
the user has to provide an ancestry-weight file with arbitrarily defined weights. MaCH-Admix reads the reference
population structure from this file. An example is shown below:
[File: AncestryWeight.cfg]
60 1.0
60 1.0
The example specifies two reference populations, each with 60 individuals (120 haplotypes). The reference
haplotype file must have a consistent structure (i.e., 60*2 haplotypes from the first population followed
by 60*2 haplotypes from the second population). We put arbitrary weights (1.0) here. The user can combine
any non-ancestry-weighted reference selection mode with the "--ancweightfile" option to obtain an estimate of population
contributions. However, for short regions, we recommend the default IBS-based approach
because it is more reliable with a small number of states/rounds.
Example:
./mach-admix --ancweightfile AncestryWeight.cfg \
-d target.dat -p target.ped -h ref.hap -s ref.snps
Example Output:
============== BEGIN: Contribution of Reference Populations ==============
Reference Population 1: 0.221873
Reference Population 2: 0.778127
============== END: Contribution of Reference Populations ==============
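To reuse the estimated contributions, for example as weights in an ancestry-weight file, the block above can be extracted from captured program output. The Python sketch below assumes the output looks exactly like the example shown here (BEGIN/END marker lines and one "Reference Population N: value" line per population); the function name is hypothetical, and the layout may differ in other versions.

```python
import re

# Matches lines such as "Reference Population 1: 0.221873".
_CONTRIBUTION = re.compile(r"^\s*Reference Population\s+(\d+):\s*([0-9.eE+-]+)\s*$")

def parse_contributions(log_text):
    """Return {population_index: contribution} from the contribution block."""
    contributions = {}
    inside = False
    for line in log_text.splitlines():
        if "BEGIN: Contribution of Reference Populations" in line:
            inside = True
        elif "END: Contribution of Reference Populations" in line:
            inside = False
        elif inside:
            match = _CONTRIBUTION.match(line)
            if match:
                contributions[int(match.group(1))] = float(match.group(2))
    return contributions
```

Lines outside the BEGIN/END markers are ignored, so the whole captured output can be passed in unchanged.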
MaCH-Admix can also be used to estimate local ancestry content at each marker.
We will release the function and documentation here after more careful evaluations.