GERMLINE Usage
From the command line, extract GERMLINE with tar xzvf germline-X-X-X.zip, enter the extracted directory, and compile with make all. A simple test case using shortened HapMap samples can be run with make test. The executable is run as germline <options>; it prompts the user for input/output file information and then runs the algorithm.
Input
GERMLINE accepts as input the following formats:
NOTE: Although the PLINK format is not intended for haplotypes, GERMLINE expects the respective alleles to appear in order; i.e. the first allele always corresponds to one haplotype and the second allele to the other. PLINK arbitrarily re-orders the alleles when processing files, so the haplotypes may not remain intact. We therefore do not recommend handling phased data with PLINK prior to GERMLINE analysis; to target specific regions, use the -from_snp and -to_snp flags instead.
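The allele-ordering convention in the note above can be illustrated with a short Python sketch. It assumes the standard PLINK PED layout (six leading columns: family ID, individual ID, paternal ID, maternal ID, sex, phenotype, followed by allele pairs) and whitespace-delimited fields. The function name and the sample line are hypothetical and are not part of GERMLINE.

```python
from typing import List, Tuple


def split_ped_haplotypes(ped_line: str) -> Tuple[str, str, List[str], List[str]]:
    """Split one phased PED line into its two haplotypes.

    Assumes the first allele of every pair belongs to one haplotype
    and the second allele to the other, as GERMLINE expects.
    """
    fields = ped_line.split()
    meta, alleles = fields[:6], fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("allele columns must come in pairs")
    hap0 = alleles[0::2]  # first allele of each marker
    hap1 = alleles[1::2]  # second allele of each marker
    return meta[0], meta[1], hap0, hap1


# Hypothetical three-marker individual, made up for illustration.
line = "F1 I1 0 0 1 -9 A G C C T A"
fid, iid, hap0, hap1 = split_ped_haplotypes(line)
print(fid, iid, hap0, hap1)
```

If a tool swaps the two alleles at any marker, hap0 and hap1 no longer describe the original chromosomes, which is why the order must be preserved.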
Output
Upon completion, GERMLINE generates a .match file and a .log file in the specified location. Each line in the .match file corresponds to a pairwise shared segment, with the following fields:
- Family ID 1
- Individual ID 1
- Family ID 2
- Individual ID 2
- Chromosome
- Segment start (bp)
- Segment end (bp)
- Segment start (SNP)
- Segment end (SNP)
- Total SNPs in segment
- Genetic length of segment
- Units for genetic length (cM or MB)
- Mismatching SNPs in segment
- 1 if Individual 1 is homozygous in match; 0 otherwise
- 1 if Individual 2 is homozygous in match; 0 otherwise
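The field list above can be read into a typed record with a few lines of Python. This is a minimal sketch, assuming the fields are whitespace-delimited, appear in exactly the order listed, and that IDs contain no whitespace. The Match class, its attribute names, and the sample line are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    fam1: str
    ind1: str
    fam2: str
    ind2: str
    chrom: str
    start_bp: int
    end_bp: int
    start_snp: str
    end_snp: str
    n_snps: int
    length: float
    units: str            # "cM" or "MB"
    n_mismatch: int
    ind1_homozygous: bool
    ind2_homozygous: bool


def parse_match_line(line: str) -> Match:
    """Parse one line of a flat .match file (assumed whitespace-delimited)."""
    f = line.split()
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return Match(
        f[0], f[1], f[2], f[3], f[4],
        int(f[5]), int(f[6]), f[7], f[8],
        int(f[9]), float(f[10]), f[11],
        int(f[12]), f[13] == "1", f[14] == "1",
    )


# Hypothetical line, made up for illustration only.
example = "F1 I1 F2 I2 1 1000 250000 rs1 rs9 120 3.5 cM 0 0 1"
m = parse_match_line(example)
print(m.ind1, m.ind2, m.length, m.units)
```

Records parsed this way can then be filtered, for example by keeping only segments whose length exceeds a chosen threshold.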
To save space, GERMLINE can also generate binary output using the -bin_out flag. This flag will generate three files:
- *.bsid Two columns per line for each sample: FAM ID,SAMPLE ID.
- *.bmid Four columns per line for each marker: CHROMOSOME,RSID,GENETIC DISTANCE,PHYSICAL DISTANCE.
- *.bmatch Binary match file containing integer pointers to samples (from bsid file), markers (from bmid file) and boolean meta-data.
The binary files can be converted back to the standard flat format described above by using the parse_bmatch utility provided with the code. Load the three generated files using parse_bmatch [BMATCH FILE] [BSID FILE] [BMID FILE] and the flat match output will be printed to standard out. See the parse_bmatch.cpp code for binary format details.
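The conversion step can be scripted. The following Python sketch wraps the parse_bmatch invocation described above and redirects its standard output to a file. The executable path ./parse_bmatch and all file names are assumptions; only the argument order (BMATCH, BSID, BMID) comes from the text.

```python
import subprocess
from typing import List


def parse_bmatch_command(bmatch: str, bsid: str, bmid: str,
                         exe: str = "./parse_bmatch") -> List[str]:
    """Build the argument list: parse_bmatch [BMATCH] [BSID] [BMID]."""
    return [exe, bmatch, bsid, bmid]


def convert_to_flat(bmatch: str, bsid: str, bmid: str, out_path: str,
                    exe: str = "./parse_bmatch") -> None:
    """Run parse_bmatch and write the flat match output to out_path."""
    cmd = parse_bmatch_command(bmatch, bsid, bmid, exe)
    with open(out_path, "w") as out:
        # parse_bmatch prints flat matches to standard output;
        # check=True raises if the utility exits with an error.
        subprocess.run(cmd, stdout=out, check=True)


# Hypothetical file names; this only shows the command that would be run.
print(parse_bmatch_command("out.bmatch", "out.bsid", "out.bmid"))
```

Calling convert_to_flat("out.bmatch", "out.bsid", "out.bmid", "out.match") would then produce a flat file in the format described earlier, provided the utility has been compiled and the paths exist.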
Options
The program has several command-line options to direct the segmental sharing process:
| Flag | Default | Description |
|------|---------|-------------|
| -map | - | File location for genetic distance map. Uses the PLINK map format. |
| -min_m | 3 | Minimum length for match to be used for imputation (in cM or MB). |
| -err_hom | 2 | The maximum number of mismatching homozygous markers for a slice to still be considered part of a match. |
| -err_het | 0 | The maximum number of mismatching heterozygous markers for a slice to still be considered part of a match. |
| -from_snp | - | Indicate the ID of the first SNP to start processing from. |
| -to_snp | - | Indicate the ID of the last SNP to end processing with. |
| -h_extend | - | Extends from exact seeds using haplotypes rather than genotypes; useful when data is well-phased (e.g. trios). |
| -homoz | - | Allow self matches (test for homozygosity). |
| -homoz-only | - | Analyze and report only auto/homo-zygous segments; no IBD is reported, but analysis is significantly faster. |
| -haploid | - | Treat each input individual as two distinct and separate haplotypes. Output IDs will have a .0/.1 suffix corresponding to each haplotype. The -err_het flag will have no effect in this analysis. |
| -bin_out | - | Generate output matches in binary format; creates *.bmatch, *.bsid, and *.bmid files. These files can be converted to flat output using the parse_bmatch utility included and compiled in the package. |
| -bits | 128 | Size of each slice (in markers) used for exact matching seeds. |
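When GERMLINE is driven from a script, it can help to assemble the option list programmatically. The Python sketch below is illustrative only: it assumes that flags listed with a default value or a file/SNP argument take one value, and that the remaining flags are plain switches. That split is inferred from the table, not confirmed by the source. The executable path ./germline and the helper name are hypothetical.

```python
from typing import Dict, List, Union

# Assumed split between value-taking flags and on/off switches.
VALUE_FLAGS = {"-map", "-min_m", "-err_hom", "-err_het",
               "-from_snp", "-to_snp", "-bits"}
SWITCH_FLAGS = {"-h_extend", "-homoz", "-homoz-only", "-haploid", "-bin_out"}


def build_germline_args(options: Dict[str, Union[str, int, float, bool]],
                        exe: str = "./germline") -> List[str]:
    """Turn an options mapping into an argument list for germline."""
    args = [exe]
    for flag, value in options.items():
        if flag in SWITCH_FLAGS:
            if value:  # include the switch only when enabled
                args.append(flag)
        elif flag in VALUE_FLAGS:
            args += [flag, str(value)]
        else:
            raise ValueError(f"unknown flag: {flag}")
    return args


# Example: longer minimum match, smaller slices, haplotype extension.
args = build_germline_args({"-min_m": 5, "-bits": 64, "-h_extend": True})
print(" ".join(args))
```

The resulting list could be passed to subprocess.run; the input/output file prompts would still need to be answered, since germline asks for them after it starts.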