
TOPMed5b Based Imputation: Post Imputation Filtering and Quality Evaluation

Background Information

The example below assumes the imputation output file is in VCF format. You can find an example output file ("imp_out.vcf.gz") inside the tarball file from our download page.

Post Imputation Filtering

Post-imputation quality filtering is usually based on an imputation quality metric, the estimated r2. Sometimes, people also filter on minor allele frequency (MAF). In VCF files, estimated r2 and MAF are usually both reported in the 8th column, INFO (two examples: "MAF=0.13402;R2=0.98195" and "MAF=0.00005;R2=0.00193"). Please check out the script 01.a.post-imputation-filter.sh inside the tarball file from our download page.
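
To illustrate the idea, here is a minimal Python sketch of such a filter. It is not the 01.a.post-imputation-filter.sh script from the tarball. The function names and the default thresholds (min_r2=0.3, min_maf=0.0) are assumptions chosen for illustration only, and the sketch assumes the INFO column carries "MAF=" and "R2=" entries as in the examples above.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'MAF=0.13402;R2=0.98195' into a dict."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def passes_filter(vcf_line, min_r2=0.3, min_maf=0.0):
    """Return True if a VCF data line meets both thresholds.

    The thresholds are illustrative assumptions, not recommended values.
    Lines with fewer than 8 columns, or without usable R2/MAF entries,
    are rejected.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return False
    info = parse_info(cols[7])  # INFO is the 8th column (index 7)
    try:
        r2 = float(info["R2"])
        maf = float(info["MAF"])
    except (KeyError, ValueError):
        return False
    return r2 >= min_r2 and maf >= min_maf


def filter_vcf_lines(lines, min_r2=0.3, min_maf=0.0):
    """Yield header lines unchanged and data lines that pass the filter."""
    for line in lines:
        if line.startswith("#") or passes_filter(line, min_r2, min_maf):
            yield line
```

For a compressed file, the lines could be read with gzip.open("imp_out.vcf.gz", "rt") and passed to filter_vcf_lines.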

Imputation Quality Evaluation

There are multiple metrics to assess imputation quality. We group them into two categories: estimated and true quality metrics. Estimated quality metrics can be obtained from running imputation software alone, without the need to have true genotypes. In contrast, true quality metrics can only be obtained in the presence of true genotypes.

Among the estimated quality metrics, we and others have been recommending the estimated r2, which is used in the above section for post-imputation quality filtering. Estimated r2 ranges from 0 to 1. The larger the estimated r2, the better the imputation quality is believed to be. For common variants, estimated r2 should skew heavily towards the well-imputed end (i.e., close to 1). Run the script 02.estimated-r2.dist.sh inside the tarball file from our download page to see results from a real data example.
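
One simple way to look at this distribution is to count variants per r2 bin. The Python sketch below is an illustration only and is not the 02.estimated-r2.dist.sh script. The function name and the choice of 10 equal-width bins are assumptions, and the input is assumed to be a list of estimated r2 values already extracted from the INFO column.

```python
def r2_histogram(r2_values, n_bins=10):
    """Count estimated r2 values in n_bins equal-width bins over [0, 1].

    Bin i covers [i / n_bins, (i + 1) / n_bins); a value of exactly 1.0
    is placed in the last bin. Values outside [0, 1] are clamped to the
    first or last bin.
    """
    counts = [0] * n_bins
    for r2 in r2_values:
        idx = int(r2 * n_bins)
        idx = max(0, min(idx, n_bins - 1))
        counts[idx] += 1
    return counts
```

For a set of well-imputed common variants, most of the counts would be expected to fall in the last bins.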

Among the true imputation quality metrics, we have been recommending true r2, the squared Pearson correlation between imputed dosages and true genotypes. One can use our doseR2 program for this purpose.
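
As a worked illustration of the definition, the Python sketch below computes the squared Pearson correlation for a single variant. It is not the doseR2 program and makes no claim about how doseR2 is implemented. The function name is hypothetical, dosages are assumed to be on the 0 to 2 scale, and true genotypes are assumed to be coded as 0, 1, or 2 copies of the same allele, in the same sample order.

```python
def true_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    Returns NaN when either vector has zero variance (for example, a
    monomorphic variant), since the correlation is undefined in that case.
    """
    n = len(dosages)
    if n != len(genotypes):
        raise ValueError("dosages and genotypes must have the same length")
    if n < 2:
        raise ValueError("at least two samples are required")

    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n

    # Sums of squares and cross-products around the means
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)

    if var_d == 0 or var_g == 0:
        return float("nan")
    return (cov * cov) / (var_d * var_g)
```

Because the correlation is squared, dosages that track the true genotypes perfectly give a value of 1, and dosages unrelated to the true genotypes give a value near 0.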

For more information, please refer to the MaCH wiki page.