About
Genomewide association has been a powerful tool for detecting common disease variants. However, this approach has been
underpowered in identifying variation that is poorly represented on commercial SNP arrays, being too rare or population-specific.
Recent multipoint methods including SNP tagging and imputation boost the power of detecting and localizing the true causal
variant, leveraging common haplotypes in a densely typed panel of reference samples. However, they are limited by the need to
obtain a robust population-specific reference panel with sampling deep enough to observe a rare variant of interest. We set out
to overcome these challenges by using long stretches of genomic sharing that are identical by descent (IBD). We use such evident
sharing between pairs and small subsets of individuals to recover the underlying shared haplotypes that have been co-inherited by
these individuals.
We have created a software tool, DASH (DASH Associates Shared Haplotypes), that builds upon pairwise IBD shared segments to infer clusters of IBD individuals.
Briefly, for each locus, DASH constructs a graph with links based on IBD at that locus, and uses an iterative min-cut approach to
identify clusters. These are densely connected components, each sharing a haplotype. As DASH slides the local window along the
genome, links representing new shared segments are added and old ones expire; these changes cause the resultant connected
components to grow and shrink. We code the corresponding haplotypes as genetic markers and use them for association testing.
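The sliding-window idea above can be illustrated with a simplified sketch. It is not DASH's implementation: it assumes one identifier per haploid sample, recomputes the graph from scratch at each locus instead of adding and expiring links incrementally, and omits the iterative min-cut refinement entirely. The names clusters_at and sliding_clusters are hypothetical.

```python
from collections import defaultdict


def clusters_at(segments, locus, min_size=2):
    """Connected components of the IBD graph at one locus.

    segments: iterable of (id_a, id_b, start_bp, end_bp).
    A segment contributes a link only if it spans the locus.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, start, end in segments:
        if start <= locus <= end:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two components

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_size)


def sliding_clusters(segments, step):
    """Yield (locus, clusters) at regular steps across the covered region."""
    segments = list(segments)
    if not segments:
        return
    lo = min(s[2] for s in segments)
    hi = max(s[3] for s in segments)
    for locus in range(lo, hi + 1, step):
        yield locus, clusters_at(segments, locus)


example = [("A", "B", 100, 500), ("B", "C", 300, 700), ("D", "E", 600, 900)]
# At 400 bp the A-B and B-C segments overlap, so A, B and C form one component.
# At 650 bp the A-B link has expired and a new D-E link has appeared.
```

Stepping through the example with sliding_clusters(example, 300) shows the component growing from {A, B} to {A, B, C} and then splitting into {B, C} and {D, E} as links appear and expire.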
The program has been developed in Itsik Pe'er's Lab of Computational Genetics at Columbia
University. It is built in C++ and tested in the Red Hat Linux environment; the source is distributed here in a tar.gz package under the GPL license.
If you plan to use DASH in a published analysis, please reference the following manuscript:
DASH: A Method for Identical-by-Descent Haplotype Mapping Uncovers Association with Recent Variation,
Alexander Gusev, Eimear E. Kenny, Jennifer K. Lowe, Jaqueline Salit, Richa Saxena, Sekar Kathiresan, David M. Altshuler, Jeffrey M. Friedman, Jan L. Breslow, Itsik Pe'er. The American Journal of Human Genetics 2011
Usage
The DASH package consists of 32-bit binaries and C++ source for the efficient connected-component-based clustering (src/dash_cc), the more advanced but slower dense-subgraph clustering (src/dash_adv), and additional tools (src/tools). From the command line, extract DASH with tar xzvf dash-X-X-X.tar.gz. Precompiled binaries are in the 'bin' directory; they can be regenerated by entering each subdirectory of 'src' and calling make. For dash_adv, a simple test case using inputs from the test subdirectory can be run by calling make test.
DASH-adv uses a modified version of the Boost Graph Library subgraph.hpp class, with all of the necessary files provided in this distribution. If you run into Boost-related issues when compiling, please make sure that a native copy of Boost is not superseding the one referenced.
Input
DASH reads IBD segments from standard input, one segment per line. Each line is whitespace-delimited, with the following columns:
- Family ID 1
- Individual ID 1
- Family ID 2
- Individual ID 2
- Segment start (bp)
- Segment end (bp)
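For checking an input file before piping it to DASH, the six-column layout above can be validated with a small reader. This is a sketch only; the Segment type and parse_segments function are hypothetical names, and the start <= end check is an assumption about what a well-formed segment looks like rather than a documented DASH requirement.

```python
from typing import Iterable, Iterator, NamedTuple


class Segment(NamedTuple):
    fid1: str
    iid1: str
    fid2: str
    iid2: str
    start: int  # segment start (bp)
    end: int    # segment end (bp)


def parse_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Parse whitespace-delimited IBD segment lines, skipping blank lines."""
    for number, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"line {number}: expected 6 columns, got {len(fields)}")
        start, end = int(fields[4]), int(fields[5])
        if start > end:  # assumed sanity check, not a documented rule
            raise ValueError(f"line {number}: start {start} is after end {end}")
        yield Segment(fields[0], fields[1], fields[2], fields[3], start, end)
```

Passing an open file object (or sys.stdin) as lines works because both iterate line by line.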
Simple Execution
DASH makes several assumptions about the structure of the shared segments. First, all segments are expected to be on the same chromosome; we recommend splitting genomic data by chromosome, which also makes the analysis easy to parallelize. More importantly, DASH assumes that each individual in a pair represents a haploid sample. While DASH allows for some degree of error and attempts to exclude individuals from a haplotype to which they are only loosely connected, when a single input individual shares both of its haplotypes with many other samples, DASH will place that individual into the single most likely haplotype cluster rather than both.
A vanilla analysis, first generating IBD segments using our GERMLINE algorithm, would be the following:
germline -haploid
cut -f 1,2,4,10,11 germline.match | dash_cc my_samples.fam my_clusters
cut -f 1-3 my_clusters.clst | awk '{ print 1,"cs"$1,0,int(($2+$3)/2) }' > my_clusters.map
plink --ped my_clusters.ped --map my_clusters.map --pheno my_trait --assoc
From experimentation, we have found the "-haploid -bin_out -min_m 1 -bits 32 -err_hom 1 -err_het 1" flags for GERMLINE to be most effective.
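The cut/awk step in the commands above builds a four-column PLINK map file (chromosome, marker ID, genetic distance, bp position) from the first three fields of each .clst line, naming each marker "cs" plus the cluster identifier and placing it at the midpoint of the cluster. A rough Python equivalent is sketched below, assuming positive integer positions; the function name clst_to_map and the chrom parameter are hypothetical (the original command hard-codes chromosome 1).

```python
def clst_to_map(lines, chrom="1"):
    """Yield PLINK .map lines for tab-separated .clst cluster lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a cluster line
        cluster_id = fields[0]
        start, end = int(fields[1]), int(fields[2])
        midpoint = (start + end) // 2  # matches awk's int(($2+$3)/2) for positive values
        yield f"{chrom} cs{cluster_id} 0 {midpoint}"
```

Writing the yielded lines to my_clusters.map, one per line, gives a file in the same layout as the awk output.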
Full Experiment
We have written a command-line execution pipeline that starts with unphased PLINK-format data, generates phased haplotypes with BEAGLE, identifies IBD segments with GERMLINE, and processes them with DASH-cc. Please make sure that the 'src/tools' binaries have been compiled, and that links to (or copies of) the PLINK, BEAGLE (if phasing), and GERMLINE (if detecting IBD) executables are in the 'bin' directory under the names 'plink', 'beagle.jar', and 'germline' respectively.
A full analysis starting from unphased PLINK-format data (my_input.ped, my_input.map, and my_input.fam) would be run as follows:
- bash phase.sh my_input.ped my_input.map my_output
  Accepts unphased PLINK ped/map files and generates "my_output.phased.ped" and "my_output.phased.map", haplotypes phased with BEAGLE.
- bash gline.for_dash.sh my_output.phased.ped my_output.phased.map my_output
  Accepts phased PLINK ped/map files and generates my_output, binary GERMLINE IBD-match output with parameters optimized for DASH. Additional parameters for GERMLINE (such as a recombination map, highly recommended) can be passed as a quoted fourth parameter, e.g. "-map genetic_distance.map".
- bash dash_cc.sh my_output my_input.fam my_output
  Accepts binary-format GERMLINE IBD output and generates the my_output.dash_cc.clst cluster list and membership file, the my_output.dash_cc.cmap cluster map file, and the my_output.dash_cc.{bed,bim,fam} binary PLINK-format files coding the haplotypes as bi-allelic markers.
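When the three steps are run for many chromosomes or datasets, it can help to generate the command lines programmatically. The sketch below only assembles the argument lists in the order described above; the function name pipeline_commands and its parameters are hypothetical, and it assumes the scripts are invoked from the directory that contains them.

```python
import shlex


def pipeline_commands(ped, map_file, fam, out, germline_extra=None):
    """Return the three pipeline steps as argument lists, in execution order."""
    phase = ["bash", "phase.sh", ped, map_file, out]
    gline = ["bash", "gline.for_dash.sh",
             f"{out}.phased.ped", f"{out}.phased.map", out]
    if germline_extra:
        # Extra GERMLINE parameters travel as ONE argument (the quoted fourth parameter).
        gline.append(germline_extra)
    dash = ["bash", "dash_cc.sh", out, fam, out]
    return [phase, gline, dash]


commands = pipeline_commands("my_input.ped", "my_input.map", "my_input.fam",
                             "my_output", germline_extra="-map genetic_distance.map")
printable = [shlex.join(cmd) for cmd in commands]  # shell-quoted, for logging
```

Each argument list can be passed to subprocess.run(cmd, check=True) so that a failing step stops the pipeline before the next one starts.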
Output
As it runs, DASH generates a .clst haplotype cluster file in which each line represents a cluster/haplotype with the following tab-separated fields:
- Cluster identifier
- Cluster start position
- Cluster end position
- Maximum IBD match start position
- Minimum IBD match end position
- Family and Sample ID for cluster carriers ...
Fields 2 & 3 represent the shortest region containing the clustered individuals with no change in IBD status; fields 4 & 5 represent the minimum region where all cluster members share an IBD segment. When called using the pipeline scripts, a *.cmap file corresponding to fields 1-5 and a binary PLINK-format file corresponding to the haplotypes will also be generated.
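For downstream scripting, a .clst line can be read into a small record. The sketch below makes two assumptions that the field list does not spell out: carriers are listed as family ID followed by sample ID, and no ID contains whitespace. Splitting on any whitespace then works whether each ID pair is separated by a tab or a space. Cluster and parse_clst_line are hypothetical names.

```python
from typing import List, NamedTuple, Tuple


class Cluster(NamedTuple):
    cluster_id: str
    start: int          # field 2: cluster start position
    end: int            # field 3: cluster end position
    max_ibd_start: int  # field 4: maximum IBD match start position
    min_ibd_end: int    # field 5: minimum IBD match end position
    carriers: List[Tuple[str, str]]  # (family ID, sample ID) pairs


def parse_clst_line(line: str) -> Cluster:
    """Parse one .clst line; assumes IDs contain no whitespace."""
    fields = line.split()
    if len(fields) < 5 or (len(fields) - 5) % 2 != 0:
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    ids = fields[5:]
    carriers = list(zip(ids[0::2], ids[1::2]))  # pair up FID, IID
    return Cluster(fields[0], int(fields[1]), int(fields[2]),
                   int(fields[3]), int(fields[4]), carriers)
```

The shared core of a cluster is then simply the interval from max_ibd_start to min_ibd_end.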
Advanced Options
The DASH-adv program has several command line options to direct the clustering process:
Flag | Default | Description |
---|---|---|
-help | - | Print this list of commands |
-fam | - | PLINK format .fam file listing sample ids. Used to generate ped/map files (see above). |
-win | 500000 | Sliding window size. |
-density | 0.6 | Minimum cluster density. |
-r2 | 0.95 | Maximum r^2 at which two haplotypes are still considered different and printed; set to 1 to print all. |
-min | 4 | Minimum haplotype/cluster size. |
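
To give a feel for what the -density, -min, and -r2 thresholds measure, the sketch below uses textbook definitions: density as the fraction of member pairs joined by an IBD link, and r^2 as the squared correlation between two 0/1 carrier indicators over haploid samples. These definitions are assumptions chosen for illustration; the exact formulas used inside DASH-adv are not stated here, and all function names are hypothetical.

```python
def density(members, links):
    """Fraction of member pairs that are linked (1.0 means a clique)."""
    members = set(members)
    n = len(members)
    if n < 2:
        return 0.0
    linked = {frozenset(pair) for pair in links
              if len(set(pair)) == 2 and set(pair) <= members}
    return len(linked) / (n * (n - 1) / 2)


def passes_thresholds(members, links, min_size=4, min_density=0.6):
    """Apply the -min and -density defaults from the table above."""
    return len(set(members)) >= min_size and density(members, links) >= min_density


def r2(carriers_a, carriers_b, samples):
    """Squared correlation between two binary carrier indicators."""
    n = len(samples)
    a = [1 if s in carriers_a else 0 for s in samples]
    b = [1 if s in carriers_b else 0 for s in samples]
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # one marker is monomorphic; correlation is undefined
    return (pab - pa * pb) ** 2 / denom
```

Under these definitions, a four-member chain A-B-C-D has density 3/6 = 0.5 and would fall below the default 0.6, while two clusters with identical carrier sets have r^2 = 1 and would be treated as the same haplotype at the default 0.95.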
Contact
For any questions or comments, please contact the developers directly at: {gusev,itsik}@cs.columbia.edu.
Change Log
1.1.0 (05.27.11)
DASH paper published at AJHG
DASH-cc and DASH-adv versions released
Phasing & IBD pipeline added
1.0.0 (09.17.10)
First stable release