利用 Dirichlet-multinomial regression 计算不同条件下亚群丰度变化

Kevin_Hhui 2023-03-28 原文

image.png

方法来源于上面这篇文章，不得不说，这篇文章运用了非常多复杂的方法去阐述关注的科学问题，真不愧是出自Broad institute实验室的。我这里暂时只讲下文章中一种比较新颖的比较不同条件下亚群丰度变化的方法。

首先我们先了解下Dirichlet-multinomial regression。
让我们从数学层面开始：
假设从正常组织取了sample i，正常组织本身包含了p种cell type，假设各种cell type出现的概率为

Cell probability

, sample i中各种cell type出现的数目为

Number of cell type

sample i中共有N个cell，全部细胞的总和就是，

Total number of cells

那么汇总下，这个sample的概率分布就为

其中

是 sample i 对应的多项分布的参数, 满足

对于多项分布, 其均值和方差均可由该参数决定, 即设 Xij 是相应位置的随机变量, 则有

引入Dirichlet概率分布

对于同一个tissue不同的sample，不同sample之间的成分参数

可能会相差很大，具有‘超散布性’ （overdispersion），也就是说观测到的不同样本中成分参数的方差会显著大于多项分布模型下给出的方差。如何在建模中考虑这种超散布性，并且不给模型增加太多参数呢？我们假设不同样本的成分参数来自同一随机变量的不同实现，那么我们可以使用Dirichlet分布来model 成分参数分布。

其中

是Dirichlet分布参数，满足

结合前面的multinomial distribution，我们可以写出 DM 模型的概率分布为

DM 模型的另一种等价的参数化方法是令

则概率分布可写为

这种参数化方式的好处是参数的含义更易解释, 因为有

因此，对于包含多个cell type的tissue sample data，可以通过fit DM 模型的方式对两个sample中各个cell type abundance进行统计检验。
上面发表在Cell上的这篇文章就是用到的就是DirichletReg这个包，来检验正常人组织 VS 病人组织non-inflamed sample间哪些cell type abundance存在明显差异，以及检验正常人组织 VS 病人组织inflamed sample间哪些cell type abundance存在明显差异。

这里是示例代码

# source code downloaded from:
# https://github.com/cssmillie/ulcerative_colitis
# there is a 'ulcerative_colitis-master' folder which contains a number of r code scripts

## load required libraries (analysis.r code includes all the libraries for running all analyses,
## but as here only shows the Cellular Composition difference test, only libraries listed below need to be loaded.
library(Seurat)
library(RColorBrewer) #for brewer.pal
library(Matrix) #for Matrix
library(DirichletReg)
library(data.table)
library(tidyverse)
library(cowplot)

## this function is extracted from analysis.r 
dirichlet_regression = function(counts, covariates, formula){  
  # Dirichlet multinomial regression to detect changes in cell frequencies
  # formula is not quoted, example: counts ~ condition
  # counts is a [samples x cell types] matrix
  # covariates holds additional data to use in the regression
  #
  # Example:
  # counts = do.call(cbind, tapply(seur@data.info$orig.ident, seur@ident, table))
  # covariates = data.frame(condition=gsub('[12].*', '', rownames(counts)))
  # res = dirichlet_regression(counts, covariates, counts ~ condition)
  
  #ep.pvals = dirichlet_regression(counts=ep.freq, covariates=ep.cov, formula=counts ~ condition)$pvals

  # Calculate regression
  counts = as.data.frame(counts)
  counts$counts = DR_data(counts)
  data = cbind(counts, covariates)
  fit = DirichReg(counts ~ condition, data)
  
  # Get p-values
  u = summary(fit)
  #compared with healthy condition, 15 vars. noninflame and inflame, 30pvalues
  pvals = u$coef.mat[grep('Intercept', rownames(u$coef.mat), invert=T), 4] 
  v = names(pvals)
  pvals = matrix(pvals, ncol=length(u$varnames))
  rownames(pvals) = gsub('condition', '', v[1:nrow(pvals)])
  colnames(pvals) = u$varnames
  fit$pvals = pvals
  
  fit
}



## not all **.r in analysis.r need to be 'source'.
## only these three below are required for cellular composition difference test.
source('mtx.r') #for sparse_cbind
source('plot.r') #for matrix_barplot
source('colors.r') #for set.colors

## Load metadata for discovery and validation cohorts
## data downloaded from: 
## https://portals.broadinstitute.org/single_cell/study/SCP259
meta = read.table('all.meta2.txt', sep='\t', header=T, row.names=1, stringsAsFactors=F)

# Read a list of cell subsets, including the group that each one belongs to
# Groups include: Epithelial, Endothelial, Fibroblasts, Glia, Myeloid, B, and T cells
cell_subsets = read.table('cell_subsets.txt', sep='\t', header=F, stringsAsFactors=F)

# load seurat objects
# data downloaded from:
# https://www.dropbox.com/sh/dn4gwdww8pmfebf/AACXYu8rda5LoLwuCZ8aZXfma?dl=0
   
epi.seur = readRDS('train.Epi.seur.rds')
fib.seur = readRDS('train.Fib.seur.rds')
imm.seur = readRDS('train.Imm.seur.rds')
  
# set counts matrices
epi.counts = epi.seur@assays[['RNA']]@counts
fib.counts = fib.seur@assays[['RNA']]@counts
imm.counts = imm.seur@assays[['RNA']]@counts
  

# Use dirichlet-multinomial regression to find significant changes in cell frequencies during disease
# ---------------------------------------------------------------------------------------------------

# Count each cell subset in every sample
epi.freq = as.matrix(as.data.frame.matrix(table(epi.seur@meta.data$Sample, epi.seur@meta.data$Cluster)))
fib.freq = as.matrix(as.data.frame.matrix(table(fib.seur@meta.data$Sample, fib.seur@meta.data$Cluster)))
imm.freq = as.matrix(as.data.frame.matrix(table(imm.seur@meta.data$Sample, imm.seur@meta.data$Cluster)))

# Combine counts into a single matrix
all.freq = sparse_cbind(list(epi.freq, fib.freq, imm.freq))

# For the validation cohort, we need to combine the replicate samples (because they are not independent)
# To construct a list of replicates, we remove "1", "2", "a", and "b" from the sample IDs
reps = gsub('[12]*[ab]*$', '', rownames(all.freq))
temp = as.matrix(data.frame(aggregate(as.matrix(all.freq), list(reps), sum), row.names=1))
colnames(temp) = colnames(all.freq)
all.freq = temp[,colSums(temp) > 0]

# Split matrix into "epithelial" and "lamina propria" cell subsets and samples
ep.ident = levels(epi.seur@active.ident)
lp.ident = c(levels(fib.seur@active.ident), levels(imm.seur@active.ident))
ep.freq = all.freq[grep('Epi', rownames(all.freq)), ep.ident]
lp.freq = all.freq[grep('LP', rownames(all.freq)), lp.ident]

# For the dirichlet-multinomial regression, we need to know the disease state for each sample
# We can get this from the metadata table as follows:
sample2health = data.frame(unique(data.frame(sample=gsub('[12]*[ab]*$', '', meta[,'Sample']), health=meta[,'Health'])), row.names=1)
ep.cov = data.frame(condition=factor(sample2health[rownames(ep.freq),1], levels=c('Healthy', 'Non-inflamed', 'Inflamed')), row.names=rownames(ep.freq))
lp.cov = data.frame(condition=factor(sample2health[rownames(lp.freq),1], levels=c('Healthy', 'Non-inflamed', 'Inflamed')), row.names=rownames(lp.freq))

# Calculate significant changes using dirichlet multinomial regression
# This returns a matrix of p-values for each cell type / disease state
# 3 condition
ep.pvals = dirichlet_regression(counts=ep.freq, covariates=ep.cov, formula=counts ~ condition)$pvals
colnames(ep.pvals) = colnames(ep.freq)
lp.pvals = dirichlet_regression(counts=lp.freq, covariates=lp.cov, formula=counts ~ condition)$pvals
colnames(lp.pvals) = colnames(lp.freq)

# Plot epithelial cell proportions
ep.pct = 100*ep.freq/rowSums(ep.freq)
p1 = matrix_barplot(ep.pct, group_by=ep.cov$condition, pvals=ep.pvals, colors=set.colors)
save_plot(p1, file='train.Fig2A.epi_freqs.pdf', nrow=1, ncol=2.5)

# Plot lamina propria cell proportions
lp.pct = 100*lp.freq/rowSums(lp.freq)
p2 = matrix_barplot(lp.pct, group_by=lp.cov$condition, pvals=lp.pvals, colors=set.colors)
save_plot(p2, file='train.Fig2A.lp_freqs.pdf', nrow=1, ncol=2.5)

用自己的数据计算下

My scRNA-seq data

总结下，这篇文章运用了非常多计算方法，当然自己原创的部分也占据了很大部分，有必要好好学习下。

参考1：https://zhuanlan.zhihu.com/p/341941329
参考2：https://www.stat-center.pku.edu.cn/docs/20190226150810864721.pdf
原文：https://pubmed.ncbi.nlm.nih.gov/31348891/

有关利用 Dirichlet-multinomial regression 计算不同条件下亚群丰度变化的更多相关文章

ruby-on-rails - 使用一系列等级计算字母等级 - 2
这里是Ruby新手。完成一些练习后碰壁了。练习:计算一系列成绩的字母等级创建一个方法get_grade来接受测试分数数组。数组中的每个分数应介于0和100之间，其中100是最大分数。计算平均分并将字母等级作为字符串返回，即“A”、“B”、“C”、“D”、“E”或“F”。我一直返回错误:avg.rb:1:syntaxerror,unexpectedtLBRACK,expecting')'defget_grade([100,90,80])^avg.rb:1:syntaxerror,unexpected')',expecting$end这是我目前所拥有的。我想坚持使用下面的方法或.join，
ruby - 如何根据特征实现 FactoryGirl 的条件行为 - 2
我有一个用户工厂。我希望默认情况下确认用户。但是鉴于unconfirmed特征，我不希望它们被确认。虽然我有一个基于实现细节而不是抽象的工作实现，但我想知道如何正确地做到这一点。factory:userdoafter(:create)do|user,evaluator|#unwantedimplementationdetailshereunlessFactoryGirl.factories[:user].defined_traits.map(&:name).include?(:unconfirmed)user.confirm!endendtrait:unconfirmeddoenden
ruby - 在 Ruby 中有条件地定义函数 - 2
我有一些代码在几个不同的位置之一运行:作为具有调试输出的命令行工具，作为不接受任何输出的更大程序的一部分，以及在Rails环境中。有时我需要根据代码的位置对代码进行细微的更改，我意识到以下样式似乎可行:print"Testingnestedfunctionsdefined\n"CLI=trueifCLIdeftest_printprint"CommandLineVersion\n"endelsedeftest_printprint"ReleaseVersion\n"endendtest_print()这导致:TestingnestedfunctionsdefinedCommandLin
ruby - 定义方法参数的条件 - 2
我有一个只接受一个参数的方法:defmy_method(number)end如果使用number调用方法，我该如何引发错误？？通常，我如何定义方法参数的条件？比如我想在调用的时候报错:my_method(1) 最佳答案您可以添加guard在函数的开头，如果参数无效则引发异常。例如:defmy_method(number)failArgumentError,"Inputshouldbegreaterthanorequalto2"ifnumbereputse.messageend#=>Inputshouldbegreaterthano
ruby-on-rails - 启用 Rack::Deflater 时 ETag 发生变化 - 2
在启用Rack::Deflater来gzip我的响应主体时偶然发现了一些奇怪的东西。也许我遗漏了一些东西，但启用此功能后，响应被压缩，但是资源的ETag在每个请求上都会发生变化。这会强制应用程序每次都响应，而不是发送304。这在没有启用Rack::Deflater的情况下有效，我已经验证页面源没有改变。我正在运行一个使用thin作为Web服务器的Rails应用程序。Gemfile.lockhttps://gist.github.com/2510816有没有什么方法可以让我从Rack中间件获得更多的输出，这样我就可以看到发生了什么？提前致谢。最佳答案
计算机毕业设计ssm+vue基本微信小程序的小学生兴趣延时班预约小程序 - 2
项目介绍随着我国经济迅速发展,人们对手机的需求越来越大,各种手机软件也都在被广泛应用,但是对于手机进行数据信息管理,对于手机的各种软件也是备受用户的喜爱小学生兴趣延时班预约小程序的设计与开发被用户普遍使用,为方便用户能够可以随时进行小学生兴趣延时班预约小程序的设计与开发的数据信息管理,特开发了小程序的设计与开发的管理系统。小学生兴趣延时班预约小程序的设计与开发的开发利用现有的成熟技术参考,以源代码为模板,分析功能调整与小学生兴趣延时班预约小程序的设计与开发的实际需求相结合,讨论了小学生兴趣延时班预约小程序的设计与开发的使用。开发环境开发说明：前端使用微信微信小程序开发工具：后端使用ssm：VU
java - 为什么 ruby modulo 与 java/other lang 不同？ - 2
我基本上来自Java背景并且努力理解Ruby中的模运算。(5%3)(-5%3)(5%-3)(-5%-3)Java中的上述操作产生，2个-22个-2但在Ruby中，相同的表达式会产生21个-1-2.Ruby在逻辑上有多擅长这个？模块操作在Ruby中是如何实现的？如果将同一个操作定义为一个web服务，两个服务如何匹配逻辑。最佳答案在Java中，模运算的结果与被除数的符号相同。在Ruby中，它与除数的符号相同。remainder()在Ruby中与被除数的符号相同。您可能还想引用modulooperation.
ruby - 如何计算 Liquid 中的变量 +1 - 2
我对如何计算通过{%assignvar=0%}赋值的变量加一完全感到困惑。这应该是最简单的任务。到目前为止，这是我尝试过的:{%assignamount=0%}{%forvariantinproduct.variants%}{%assignamount=amount+1%}{%endfor%}Amount:{{amount}}结果总是0。也许我忽略了一些明显的东西。也许有更好的方法。我想要存档的只是获取运行的迭代次数。最佳答案因为{{incrementamount}}将输出您的变量值并且不会影响{%assign%}定义的变量，我
ruby-on-rails - 在 RSpec 中，如何以任意顺序期望具有不同参数的多条消息？ - 2
RSpec似乎按顺序匹配方法接收的消息。我不确定如何使以下代码工作:allow(a).toreceive(:f)expect(a).toreceive(:f).with(2)a.f(1)a.f(2)a.f(3)我问的原因是a.f的一些调用是由我的代码的上层控制的，所以我不能对这些方法调用添加期望。最佳答案 RSpecspy是测试这种情况的一种方式。要监视一个方法，用allowstub，除了方法名称之外没有任何约束，调用该方法，然后expect确切的方法调用。例如:allow(a).toreceive(:f)a.f(2)a.f(1)
ruby-on-rails - 使用包含多个关联和单独的条件 - 2
我的Gallery模型中有以下查询:media_items.includes(:photo,:video).rank(:position_in_gallery)我的图库模型有_许多媒体项，每个都有一个照片或视频关联。到目前为止，一切正常。它返回所有media_items包括它们的photo或video关联，由media_item的position_in_gallery属性排序。但是我现在需要将此查询返回的照片限制为仅具有is_processing属性的照片，即nil。是否可以进行相同的查询，但条件是返回的照片等同于:.where(photo:'photo.is_processingIS