MaCH-Admix beta Tutorial

MaCH-Admix is an extension to MaCH for faster and more flexible imputation, especially in admixed populations. All existing MaCH documentation applies to MaCH-Admix. For general input file preparation and haplotyping options, please see the MaCH tutorial.

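To run MaCH-Admix, a user goes through the following steps: prepare the input files, pick a run mode, and interpret the output files. Exploring other reference selection modes and other program options is optional.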

We highly recommend the default settings. For example, the following command line generates the best-guess genotype (--geno), allele dosage (--dosage), and two independent probabilities (--probs) for every genotype imputed.
    ./mach-admix -d target.dat -p target.ped -h ref.hap -s ref.snps --geno --probs --dosage -o my.out

Other issues included in this page: Chunk Chromosomes, and Estimate Ancestry-Weights of Reference Populations.

Preparing Input Files (with references in haps+snps format)
A typical imputation scenario involves:
  • a ped file containing all observed genotypes of target individuals (typically in MERLIN/QTDT format)
  • a dat file containing all markers present in the above ped file (typically in MERLIN/QTDT format)
  • a haps file containing all reference phased haplotypes (e.g., Hapmap2 or 3 phased haplotype files, or phased 1000-Genome hap files)
  • a snps file containing all markers present in the above hap file (e.g., the snps files come with Hapmap2 or 3 phased haplotype files)
The four input files are specified using the following command line options:
  • ped file: -p or --pedfile <filename>
  • dat file: -d or --datfile <filename>
  • haps file: -h or --haps <filename>
  • snps file: -s or --snps <filename>
Example:
-d target.dat -p target.ped -h ref.hap -s ref.snps
For more information on the input genotype (ped) and marker list (dat) files, please see here and here. For more information on HapMap2/3 haps and snps files, please see HapMapII r21, HapMapII r22, HapMapIII r2. For more information on 1000-Genome haps and snps files, please see 1000G Reference Download.

A reminder: Before genotype imputation, the user should carry out basic data quality checks on the available genotypes. Typically, we exclude from analysis markers that have low genotyping success rates (perhaps with <95% of genotypes called successfully), unexpected evidence for deviations from Hardy-Weinberg equilibrium (e.g., those with HWE p-value < 1e-6), large numbers of discrepancies among duplicate samples, several Mendelian inconsistencies in available parent-offspring trios, or low frequency (e.g., those with MAF < 0.5%). All these checks are platform and study specific, and users will have to determine what is appropriate for their data.
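
The thresholds above can be sketched as a simple filter. The example below is only an illustration: it assumes a hypothetical whitespace-separated per-marker summary (marker name, call rate, HWE p-value, MAF) produced by the user's own QC tool. The column layout and the function name qc_filter are assumptions, not part of MaCH-Admix.

```shell
# Print the names of markers that pass all three example thresholds.
# Input (stdin): header line, then "marker call_rate hwe_p maf" per line.
# The column layout is hypothetical; adapt it to your own QC summary.
qc_filter() {
    awk -v min_call=0.95 -v min_hwe=1e-6 -v min_maf=0.005 '
        NR == 1 { next }   # skip the header line
        ($2 + 0) >= (min_call + 0) &&
        ($3 + 0) >= (min_hwe + 0) &&
        ($4 + 0) >= (min_maf + 0) { print $1 }
    '
}

# Usage: only rs0001 passes (rs0002 fails call rate, rs0003 HWE, rs0004 MAF).
qc_filter <<'EOF'
marker call_rate hwe_p maf
rs0001 0.99 0.52 0.21
rs0002 0.91 0.40 0.15
rs0003 0.98 1e-9 0.30
rs0004 0.99 0.75 0.002
EOF
```

The surviving marker names would then be used to restrict the ped and dat files before imputation.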

Preparing Input Files (with references in VCF format)
Since 1000G Phase I Version 3, we provide 1000G references for download in VCF (Variant Call Format). VCF integrates many types of information in a single file. To accommodate the new format, we have added a set of features to MaCH-Admix to allow processing of VCF files. A typical imputation scenario involves:
  • a ped file containing all observed genotypes of target individuals (typically in MERLIN/QTDT format)
  • a dat file containing all markers present in the above ped file (typically in MERLIN/QTDT format)
  • a VCF file containing all reference phased haplotypes information
The three input files are specified using the following command line options:
  • ped file: -p or --pedfile <filename>
  • dat file: -d or --datfile <filename>
  • VCF file: -h or --haps <filename>
The input VCF file replaces the original haps file, and the "--vcfReference" command line option must be turned on. Example:
-d target.dat -p target.ped -h ref.vcf --vcfReference
With VCF references input, MaCH-Admix accepts the following additional command line options to facilitate splitting of big imputation tasks:
  • --startposition <#start> / --endposition <#end>: Specify the imputation range (in bp). Markers in reference VCF that are out-of-range will not be loaded for imputation.
  • --outputstart <#start> / --outputend <#end>: Specify the range of markers (in bp) in final output. This can be used to exclude flanking regions when splitting a big region into multiple overlapping chunks.
Example:
./mach-admix --dosage -d target.dat -p target.ped -h ref.vcf --vcfReference --startposition 10000000 --endposition 15000000 --outputstart 11000000 --outputend 14000000
The command above imputes markers from 10Mbp to 15Mbp and only outputs results from 11Mbp to 14Mbp.
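
When a region is split into several overlapping chunks, the four range options can be generated in a loop. The following is a minimal sketch, not an official script: the function name print_chunk_commands, the chunk and flank sizes, and the chunk1/chunk2 prefixes are assumptions chosen for illustration. It also assumes that both range bounds are inclusive, which this tutorial does not state. The commands are only printed, not executed.

```shell
# Print one mach-admix command per chunk (dry run: nothing is executed).
# Arguments: region start, region end, core chunk size, flank size (all in bp).
print_chunk_commands() {
    region_start=$1; region_end=$2; core=$3; flank=$4
    out_start=$region_start
    i=1
    while [ "$out_start" -le "$region_end" ]; do
        # Output range: the core of this chunk, clipped to the region.
        out_end=$((out_start + core - 1))
        if [ "$out_end" -gt "$region_end" ]; then out_end=$region_end; fi
        # Imputation range: the core plus a flank on each side, clipped.
        imp_start=$((out_start - flank))
        if [ "$imp_start" -lt "$region_start" ]; then imp_start=$region_start; fi
        imp_end=$((out_end + flank))
        if [ "$imp_end" -gt "$region_end" ]; then imp_end=$region_end; fi
        echo "./mach-admix --dosage -d target.dat -p target.ped -h ref.vcf --vcfReference" \
             "--startposition $imp_start --endposition $imp_end" \
             "--outputstart $out_start --outputend $out_end --prefix chunk$i"
        out_start=$((out_end + 1))
        i=$((i + 1))
    done
}

# Usage: a 10Mbp region in 5Mbp cores with 1Mbp flanks gives two commands.
print_chunk_commands 10000001 20000000 5000000 1000000
```

Because the output ranges do not overlap, the per-chunk results can be concatenated in order without duplicated markers.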

For more information on 1000-Genome VCF reference files, please see 1000G Reference Download. For more information on the VCF format, please see http://www.1000genomes.org/node/101. Also note that MaCH-Admix can output results in VCF when the "--outvcf" command line option is turned on.

Pick a run-mode
There are two Run Mode choices to conduct imputation:
  • Integrated (Default): using this mode, the user doesn't need to specify any model parameters. Model parameters include recombination rates and error rates that will be stored in .rec and .erate files respectively. MaCH-Admix will integrate both parameter estimation and imputation in MCMC iterations. This mode is suitable when model parameters are either not available or calibrated using a population poorly matched to the target/to-be-imputed population. To use the integrated mode:
    • specify all four input files (ped/dat/haps/snps)
    • use command line option "--runMode Integrated"
      Example:
      ./mach-admix --geno --runMode Integrated -d target.dat -p target.ped -h ref.hap -s ref.snps
      Note that
      1. The command line would have exactly the same effect without "--runMode Integrated" because "Integrated" is the default runMode.
      2. The "--geno" option here requests the best-guess genotypes (after imputation).
  • Pre-Calibrated: using this mode, the user first calibrates all model parameters (recombination rates / error rates). MaCH-Admix can then use the calibrated parameters to conduct imputation without parameter learning. This pre-calibrated mode is suitable if the user repeatedly uses the same set of data (e.g., the same reference haps/snps files), or has obtained parameter files beforehand. With pre-calibrated parameters, this mode runs faster than the Integrated mode.

    There are two types of pre-calibration:
    • Pre-Calibration based on reference haplotypes only. This is suitable if the user repeatedly uses the same set of references to impute targets that are similar to (all or part of) the references. In this case, use the command line option "--runMode EstimateParameterFromRef"
    • Pre-Calibration based on both reference haplotypes and target genotypes. This is suitable if the user repeatedly uses the same set of references and suspects that the parameters in the targets differ considerably from those in the references (e.g., when target individuals are of unknown ethnicity or from an isolated population). In this case, use the command line option "--runMode EstimateParameterOnly"
    If the references are dense, large, and from diverse population sources (e.g., 1000G), the first type of pre-calibration usually works well. Once the pre-calibration type is decided, the user can conduct the 2-step imputation as follows:
    • Step-1: calibrate parameters using one of the two command line options discussed above
      Example:
      ./mach-admix --runMode EstimateParameterFromRef --prefix parameters -d target.dat -p target.ped -h ref.hap -s ref.snps
      The "--prefix" option specifies the output filename.
      Note: You can skip this step by downloading a set of parameters pre-calibrated using the 1000 Genomes phaseI.v3 data at phaseI.v3.pre-calibrated-parameters.Nov2012.tgz
    • Step-2: conduct imputation using calibrated parameter file and command line option "--runMode ImputeOnly"
      Example:
      ./mach-admix --geno --runMode ImputeOnly --crossoverMap parameters.rec --errorMap parameters.erate -d target.dat -p target.ped -h ref.hap -s ref.snps
      Note the two parameter files specified ("--crossoverMap parameters.rec --errorMap parameters.erate").
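
The 2-step workflow above can be wrapped so that Step-1 is skipped when the parameter files already exist. This is a minimal sketch under stated assumptions: the function name precalibrated_commands is hypothetical, and it assumes Step-1 writes <prefix>.rec and <prefix>.erate, as the Step-2 example suggests. The commands are printed rather than executed.

```shell
# Dry run: print the commands for the 2-step Pre-Calibrated workflow.
# Step-1 is skipped when both parameter files already exist.
# Argument: output prefix for the parameter files (e.g. "parameters").
precalibrated_commands() {
    prefix=$1
    inputs="-d target.dat -p target.ped -h ref.hap -s ref.snps"
    if [ ! -f "$prefix.rec" ] || [ ! -f "$prefix.erate" ]; then
        # Step-1: calibrate parameters from the reference haplotypes only.
        echo "./mach-admix --runMode EstimateParameterFromRef --prefix $prefix $inputs"
    fi
    # Step-2: impute with the calibrated parameter files.
    echo "./mach-admix --geno --runMode ImputeOnly --crossoverMap $prefix.rec --errorMap $prefix.erate $inputs"
}

# Usage: prints both steps unless parameters.rec and parameters.erate exist.
precalibrated_commands parameters
```

To calibrate on both references and targets instead, "EstimateParameterFromRef" would be replaced by "EstimateParameterOnly" in Step-1.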


Interpret output files
Most output files are self-explanatory. Details will be added to this section.
  • .geno / .geno.gz: Best guessed genotypes generated using --geno option
  • .dose / .dose.gz: Genotypes dosage file generated using --dosage option
  • .prob / .prob.gz: Genotypes probability file generated using --probs option
  • .qc / .qc.gz: Quality control information file generated using --quality option
  • .rec and .erate: Estimated recombination and error rate parameter files. These are always generated.
If "--outvcf" is turned on, the corresponding genotype / haplotype / dosage / probability / quality information will be stored in a single VCF file. The VCF output gives a one-marker-per-line view and is helpful when a user is dealing with not only SNPs but also indels. For more information on the VCF format, please see http://www.1000genomes.org/node/101.

(Optional) Other reference selection modes
MaCH-Admix uses a hidden Markov model (HMM) to infer unobserved genotypes. In such a model, the computational cost grows quadratically with the number of reference haplotypes used. To accelerate computation, MaCH-Admix approximates the full model by sampling a small proportion of the available reference haplotypes in each round as the effective reference panel. By default, MaCH-Admix uses piecewise IBS matching, which provides the best trade-off between running time and quality in most situations. In addition, MaCH-Admix offers the following two approaches for evaluation purposes:

  • IBS-based Sampling
    The IBS-based approaches construct a tailored reference panel for each individual depending on the person's genetic background. Specifically, they compare the genetic similarity (measured by IBS) between the target individual and every haplotype in the reference pool. IBS-based sampling is generally good for imputation in short regions (<10Mb), for both admixed and non-admixed populations.

    In MaCH-Admix, we provide the following IBS options:
    • IBS Double-Queue
      This is the default recommended strategy in most situations. In this approach, we sort all reference haplotypes according to their similarity to the two guessed haplotypes of each target individual. We maintain two separate priority queues, one for each of the guessed haplotypes. To switch to this Double-Queue selection approach, use "--selectMode IBSDQ" command line option.

      Example:
      ./mach-admix --geno --selectMode IBSDQ -d target.dat -p target.ped -h ref.hap -s ref.snps

    • Single-Queue
      This is the basic IBS-based approach that maintains only one priority queue for each target individual. This approach is generally worse than the double-queue approach. To switch to this approach, use the "--selectMode IBSSQ" command line option.


  • Ancestry-Weighted Sampling
    This approach concerns the scenario in which the reference panel consists of haplotypes from several reference populations, for instance CEU and YRI for African Americans. The user can specify beforehand the desired weight (expected ancestral contribution) of each reference population. This is done by providing an ancestry-weight file to MaCH-Admix. Each line in the weight file corresponds to one reference population and has two numbers: the population size (number of individuals, i.e., half the number of haplotypes) and the expected weight of that reference population. An example of the ancestry-weight file is shown below:

    [File: AncestryWeights.cfg]
        60 0.2
        60 0.8

    The example specifies two reference populations, each with 60 individuals (120 haplotypes). The reference haplotype file must have consistent content (i.e., 60*2 haplotypes from the first population followed by 60*2 haplotypes from the second population). After specifying the ancestry-weight file and the ancestry-weighted sampling mode, each individual in the first (second) population will carry a weight of 0.2 (0.8) in MCMC iterations. Note that the weights can be on an arbitrary scale (MaCH-Admix normalizes them internally).

    Example Usage:
    ./mach-admix --geno --ancweightfile AncestryWeights.cfg --selectMode AncestryWeighted -d target.dat -p target.ped -h ref.hap -s ref.snps

    Here, "--selectMode AncestryWeighted" specifies the Ancestry-Weighted selection option, and "--ancweightfile AncestryWeights.cfg" specifies the name of the ancestry-weight file. Note that the total number of individuals in this file must agree with that in the reference haplotype file.
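
Before running, it can help to check what an ancestry-weight file implies. The sketch below is an illustration only: the function name summarize_weights is hypothetical, and dividing each weight by the sum of all weights is just one plausible reading of the internal normalization, not a description of what MaCH-Admix actually does. It also prints the number of haplotypes the reference file would need to contain (two per individual).

```shell
# Summarize an ancestry-weight file read from stdin ("size weight" per line).
# Prints each population's weight divided by the sum of weights (an assumed
# normalization), then the implied number of reference haplotypes.
summarize_weights() {
    awk '
        NF == 2 { size[++n] = $1; weight[n] = $2; total_w += $2; total_n += $1 }
        END {
            for (i = 1; i <= n; i++)
                printf "population %d: %d individuals, normalized weight %.3f\n",
                       i, size[i], weight[i] / total_w
            printf "expected reference haplotypes: %d\n", 2 * total_n
        }
    '
}

# Usage with the example file contents from above.
printf '60 0.2\n60 0.8\n' | summarize_weights
```

Weights of 2 and 8 give the same normalized values as 0.2 and 0.8, which is what an arbitrary scale implies under this reading.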

    Determining the weights
    A key question here is how to determine the weights. There are several natural ways to estimate them. One could pre-specify the weights according to estimates of ancestry proportions. For example, African Americans have about 20% Caucasian and 80% African ancestry, so it is reasonable to use a ~2:8 ratio of CEU:YRI in weighting. Alternatively, the user can use the built-in feature provided by MaCH-Admix to infer ancestry proportions (please see here). The user can also use other software such as STRUCTURE and HAPMIX.


(Optional) Explore Other Available Program Options
There are many other useful options:
  • --geno: Output best-guess genotypes.
  • --dosage: Output allele dosages (estimated number of copies of Al1 at each SNP for each individual, ranging continuously from 0 to 2).
  • --autoflip: Automatically flip strand if the alleles in target(ped) and reference(haps) files are inconsistent
  • --errorRate: Specify the base error rate. The default value is 0.001. Normally the user does not need to set this because MaCH-Admix will adjust this value in each iteration. Set this to a higher/lower value if the user's data are very noisy/clean to help accelerate the adjustment.
  • --probs: Output posterior probabilities (two probabilities, for Al1/Al1 and Al1/Al2 respectively, for each genotype guessed).
  • --qc: Output quality score for each genotype guessed.
  • --phase: Perform phasing and output best-guess haplotypes. If the user plans to perform phasing, we recommend a larger --states value (>200). The default value is good for imputation but may be insufficient for phasing.
  • --rounds <#rounds>: Specify the number of MCMC rounds. The default value is 30, which is good enough for standard imputation tasks. Computing time grows linearly with #rounds. More rounds give better results at increased computational cost.
  • --states <#states>: Specify the number of reference haplotypes sampled in the internal phasing step. The default value is 100. Computing time grows quadratically with #states. More states give better results at increased computational cost.
  • --referencesInFitting <#referencesInFitting>: Specify the number of reference haplotypes used in parameter learning. The default value is 500. Computing time is affected only minimally by #referencesInFitting.
  • --imputeStates <#imputeStates>: Specify the number of reference haplotypes used in internal imputing step. The default value is 500. Computing time is affected only minimally by #imputeStates.
  • --seed <#seed>: Specify random seed.
  • --interimInterval <#interimRounds>: Output intermediate results every #interimRounds rounds. The option is useful when computing time is very long for running all #rounds.
  • --prefix <fileprefix>: Specify the prefix for all output files
  • --uncompressed: Choose not to compress output files. By default all output files are written out in .gz format (if compiled with gzip support).
  • --mask <proportion>: Randomly mask a small proportion of input genotypes to evaluate imputation quality.
  • --scoreMAF: Use allele-frequency-weighted IBS calculation. Instead of computing a Hamming distance where each site carries the same weight, this option gives higher weight to sites with lower minor allele frequency.
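
The --probs and --dosage outputs described above are related. Given the two posterior probabilities for Al1/Al1 and Al1/Al2, the expected number of Al1 copies is 2*P(Al1/Al1) + P(Al1/Al2). The sketch below computes this expectation. The function name expected_dosage is hypothetical, and MaCH-Admix may compute its reported dosage differently (for example, by averaging over MCMC rounds), so the values need not match the .dose file exactly.

```shell
# Expected number of Al1 copies from the two posterior probabilities
# P(Al1/Al1) and P(Al1/Al2); the Al2/Al2 genotype contributes zero copies.
expected_dosage() {
    awk -v p11="$1" -v p12="$2" 'BEGIN { printf "%.3f\n", 2 * p11 + p12 }'
}

# Usage: P(Al1/Al1) = 0.90 and P(Al1/Al2) = 0.08 give a dosage of 1.880.
expected_dosage 0.90 0.08
```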


Chunk Chromosomes
With large reference panels, it is much more time-efficient to split whole chromosomes into multiple chunks and run imputation on each chunk separately. We provide such functionality by first splitting the reference (only the reference; there is no need to split the target) and then doing imputation.

If your reference is in VCF format, please download splitVCFref.jar first. Then follow the sample C-shell script splitVCFandImpute.csh.



Estimate Ancestry-Weights of Reference Populations
MaCH-Admix has a built-in feature to estimate the contributions of reference populations. To use this feature, the user has to provide an ancestry-weight file with arbitrarily defined weights. MaCH-Admix reads the reference population structure from this file. An example is shown below:

[File: AncestryWeights.cfg]
    60 1.0
    60 1.0

The example specifies two reference populations, each with 60 individuals (120 haplotypes). The reference haplotype file must have consistent structure (i.e., 60*2 haplotypes from the first population followed by 60*2 haplotypes from the second population). We put arbitrary weights (1.0) here. The user can combine any non-ancestry-weighted reference selection mode with the "--ancweightfile" option to obtain an estimate of population contributions. However, for short regions, we recommend the default IBS-based approach because it is more reliable with a small number of states/rounds.

Example:
./mach-admix --ancweightfile AncestryWeights.cfg -d target.dat -p target.ped -h ref.hap -s ref.snps

Example Output:
    ============== BEGIN: Contribution of Reference Populations ==============
    Reference Population 1: 0.221873
    Reference Population 2: 0.778127
    ============== END: Contribution of Reference Populations ==============
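
The estimated contributions can be turned into an ancestry-weight file for a later Ancestry-Weighted run. The following is a minimal sketch: the function name contributions_to_weights is hypothetical, and it assumes the program output contains lines formatted exactly like the example output above. The population sizes are supplied by the user, in the same order as the populations appear in the reference file.

```shell
# Read MaCH-Admix output text on stdin and print "size weight" lines.
# Arguments: number of individuals in each reference population, in order.
# Assumes lines of the form "Reference Population <k>: <contribution>".
contributions_to_weights() {
    sizes="$*"
    awk -v sizes="$sizes" '
        BEGIN { split(sizes, size, " ") }
        /^ *Reference Population [0-9]+:/ { n++; print size[n], $NF }
    '
}

# Usage with the example output above: prints "60 0.221873" and "60 0.778127".
contributions_to_weights 60 60 <<'EOF'
    ============== BEGIN: Contribution of Reference Populations ==============
    Reference Population 1: 0.221873
    Reference Population 2: 0.778127
    ============== END: Contribution of Reference Populations ==============
EOF
```

Redirecting the result to a file gives an ancestry-weight file that can be passed with "--ancweightfile".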



MaCH-Admix can also be used to estimate local ancestry content at each marker. We will release the function and documentation here after more careful evaluations.

Comments and suggestions are welcome, please e-mail Yun Li at yunli@med.unc.edu or Eric Yi Liu at liuyi@cs.unc.edu.