Last updated: 2019-05-14


Knit directory: Note/

This reproducible R Markdown analysis was created with workflowr (version 1.3.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.




These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd fdfd507 zouyuxin 2019-05-14 wflow_publish("analysis/STRUCTURE.Rmd")
html 79bf02c zouyuxin 2019-05-14 Build site.
Rmd c5a50b2 zouyuxin 2019-05-14 wflow_publish("analysis/STRUCTURE.Rmd")
html eb9acb9 zouyuxin 2019-05-14 Build site.
Rmd 4313c7f zouyuxin 2019-05-14 wflow_publish("analysis/STRUCTURE.Rmd")

This is a summary of STRUCTURE. The model is introduced in a series of papers.

STRUCTURE

Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. “Inference of population structure using multilocus genotype data.” (2000) introduced the basic model for STRUCTURE.

This paper introduces a model-based clustering method that infers population structure and assigns individuals to clusters from multilocus genotype data, using MCMC under some appropriate assumptions.

There are two types of clustering methods. The first is the distance-based method, which computes a pairwise distance matrix, displays it with a graphical tool such as a tree, and identifies clusters by eye. It is easy to apply, but the result depends on the choice of distance measure and graphical representation. It is also hard to quantify confidence in the inferred clusters or to incorporate additional information, so this approach suits exploratory data analysis rather than statistical inference.

The second is the model-based method, which assumes that observations from each cluster are random draws from some parametric model. It yields inference for the model parameters and the cluster assignments simultaneously. Its challenge is that it requires a suitable model with appropriate assumptions. However, it is easier to assess model assumptions than to compare distance measures and graphical representations. Pritchard et al. (2000) took the latter approach and used a Bayesian framework to carry the inherent uncertainty in the parameters through to the clustering inference.

The basic method assumes a model with K populations (K may be unknown), each characterized by a set of allele frequencies at each locus. The data are N diploid individuals genotyped at L loci. The model further assumes that the marker loci are unlinked and at linkage equilibrium with one another within populations, and that Hardy-Weinberg equilibrium (HWE) holds within populations.

Model without admixture

  • X: genotypes of the sampled individuals. For diploid individuals at L loci,

    (\(x^{(i,1)}_l\),\(x^{(i,2)}_l\)) = genotype of the ith individual at the lth locus, \(i = 1,2,\cdots,N\), \(l = 1,2,\cdots,L\)

  • Z: unknown population origin of the individuals.

    \(z^{(i)}\) = population from which individual i originated

  • P: unknown allele frequencies in all populations

    \(p_{klj}\) = frequency of allele j at locus l in population k, \(k = 1,\cdots,K\), \(j = 1,\cdots,J_l\), where \(J_l\) is the number of distinct alleles observed at locus l.

Each allele copy at each locus in each genotype is an independent draw from the allele frequency distribution of the individual’s population of origin, where \(a \in \{1, 2\}\) indexes the two allele copies:
\[ Pr(x^{(i,a)}_l = j|Z,P) = p_{z^{(i)}lj} \]
Z and P are the parameters of main interest. Inference for them is based on the posterior distribution
\[ Pr(Z,P|X) \propto Pr(X|Z,P)Pr(Z,P) = Pr(X|Z,P)Pr(Z)Pr(P) \]
The prior for Z is uniform over the K populations:
\[ Pr(z^{(i)} = k)= \frac{1}{K} \]
The prior for P is a Dirichlet distribution, independently for each population k and locus l:
\[ p_{kl\cdot} \sim D(\lambda_1,\cdots,\lambda_{J_l}) \]

The posterior is not available in closed form, so samples from it are obtained by MCMC.
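
As a rough illustration of the model without admixture, the sketch below computes \(Pr(z^{(i)} = k|X,P)\) for one individual when the allele frequencies P are treated as known. This corresponds to a single conditional update of Z, not the full sampler. The function name `assignment_posterior` and the toy frequencies are hypothetical, chosen only for this example, and the code assumes all frequencies are strictly positive.

```python
import math

def assignment_posterior(genotype, freqs):
    """Return Pr(z = k | x, P) for one diploid individual, uniform prior on z.

    genotype: list of (allele_a, allele_b) index pairs, one pair per locus.
    freqs:    freqs[k][l][j] = frequency of allele j at locus l in population k
              (assumed strictly positive so that the logarithm is defined).
    """
    log_like = []
    for pop in freqs:
        total = 0.0
        for locus, (a, b) in enumerate(genotype):
            # Each allele copy is an independent draw from the population's
            # allele frequencies at that locus.
            total += math.log(pop[locus][a]) + math.log(pop[locus][b])
        log_like.append(total)
    # The uniform prior 1/K cancels when normalising; subtracting the maximum
    # keeps the exponentials numerically stable.
    top = max(log_like)
    weights = [math.exp(v - top) for v in log_like]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical toy data: K = 2 populations, L = 2 biallelic loci.
toy_freqs = [
    [[0.9, 0.1], [0.8, 0.2]],  # population 0
    [[0.2, 0.8], [0.3, 0.7]],  # population 1
]
toy_genotype = [(0, 0), (0, 1)]
posterior = assignment_posterior(toy_genotype, toy_freqs)
```

Here the individual carries mostly alleles that are common in population 0, so most of the posterior mass falls on that population. In the full method P is unknown and is updated alongside Z.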

Model with admixture

  • Q: unknown admixture proportions for each individual.

    \(q^{(i)}_{k}\) = proportion of individual i’s genome that originated from population k.

  • Z: unknown population origin of the allele copy \(x^{(i,a)}_l\).

    \(z^{(i,a)}_l\) = population of origin of allele copy \(x^{(i,a)}_l\)

The primary interest is estimating Q.

Each allele copy is drawn from the allele frequencies of its own population of origin, and that population of origin is drawn according to the individual’s admixture proportions:
\[ Pr(x^{(i,a)}_l = j|Z, P, Q) = p_{z^{(i,a)}_l l j} \]
\[ Pr(z^{(i,a)}_l = k|P, Q) = q^{(i)}_{k} \]
The posterior distribution is
\[ Pr(Z,P,Q|X) \propto Pr(X|Z,P,Q)Pr(Z,P,Q) = Pr(X|Z,P)Pr(Z|Q)Pr(P)Pr(Q) \]
The prior for Q is
\[ q^{(i)} \sim D(\alpha, \cdots, \alpha) \]
where \(\alpha\) is the concentration parameter. Because the prior is symmetric, the mean of \(q_k^{(i)}\) is \(1/K\) for every k regardless of \(\alpha\), and the distribution becomes more concentrated around that mean as \(\alpha\) increases. \(\alpha \ll 1\) means that each individual originates mostly from a single population. \(\alpha \gg 1\) means that each individual has allele copies from all K populations in roughly equal proportions. They allowed \(\alpha\) to range from 0 to 10 and learned it from the data.
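
The effect of \(\alpha\) on the prior for Q can be checked numerically. The sketch below draws from the symmetric Dirichlet \(D(\alpha, \cdots, \alpha)\) by normalising independent Gamma variates and reports two summaries: the average of the first component and the average of the largest component. The helper names, the choice K = 3, the number of draws, and the seed are all assumptions made for illustration.

```python
import random

def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw one vector from D(alpha, ..., alpha) via normalised Gamma(alpha, 1) variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def summarize_prior(alpha, k=3, n_draws=2000, seed=1):
    """Return (mean of first component, mean of largest component) over prior draws."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    samples = [sample_symmetric_dirichlet(alpha, k, rng) for _ in range(n_draws)]
    mean_first = sum(q[0] for q in samples) / n_draws
    mean_largest = sum(max(q) for q in samples) / n_draws
    return mean_first, mean_largest

small_alpha = summarize_prior(0.1)   # little admixture expected
large_alpha = summarize_prior(50.0)  # near-equal admixture expected
```

With a small \(\alpha\) the largest component should sit close to 1 (one dominant ancestral population), while with a large \(\alpha\) it should sit close to 1/K. In both cases the first component averages about 1/K, since the prior mean does not depend on \(\alpha\).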

Inference

A common problem for clustering algorithms is choosing the number of clusters: the algorithm described above needs the number of populations K in advance. Pritchard et al. introduced a method to infer K. Taking a Bayesian approach, they placed a prior on K and used the approximation

\[ Pr(K|X) \propto Pr(X|K)Pr(K) \approx \exp\left(-\frac{\hat{\mu}}{2} - \frac{\hat{\sigma}^2}{8}\right) Pr(K) \]
where \(\hat{\mu}\) and \(\hat{\sigma}^2\) are the mean and variance of the Bayesian deviance \(-2\log Pr(X|Z,P,Q)\) over the MCMC draws. The approximation holds under some assumptions, but the authors noted that these assumptions may not hold in applications, so the inference for K may not be accurate. Still, the method can extract some consistent information about K from the data.
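
The approximation above can be sketched in a few lines. The code assumes that \(\hat{\mu}\) and \(\hat{\sigma}^2\) are the mean and (population) variance of the deviance \(-2\log Pr(X|Z,P,Q)\) recorded over MCMC draws for each candidate K. The function names and the deviance values are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import mean, pvariance

def approx_log_evidence(deviances):
    """Approximate log Pr(X | K) as -mu/2 - sigma^2/8 from sampled deviances."""
    mu = mean(deviances)
    sigma2 = pvariance(deviances, mu)  # divides by the number of draws
    return -mu / 2.0 - sigma2 / 8.0

def posterior_over_k(deviances_by_k, prior=None):
    """Normalise exp(approx log evidence) * Pr(K) over the candidate values of K."""
    ks = sorted(deviances_by_k)
    if prior is None:
        # Assumption for this sketch: a uniform prior over the candidates.
        prior = {k: 1.0 / len(ks) for k in ks}
    log_post = {
        k: approx_log_evidence(deviances_by_k[k]) + math.log(prior[k]) for k in ks
    }
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

# Hypothetical deviance draws for K = 1, 2, 3 (not real MCMC output).
toy_deviances = {
    1: [210.0, 212.0, 208.0],
    2: [200.0, 202.0, 198.0],
    3: [199.0, 205.0, 193.0],
}
k_posterior = posterior_over_k(toy_deviances)
best_k = max(k_posterior, key=k_posterior.get)
```

In this toy input K = 3 has the lowest mean deviance, but its larger variance is penalised by the \(\hat{\sigma}^2/8\) term, so K = 2 receives the highest approximate posterior probability.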