草庐IT

转录组DEGs聚类热图和功能富集分析

kkkkkkang 2023-03-28 原文

写在前面:经常做转录组分析,就是把差异基因搞个火山图和Venn图看各组差异基因的共有和特有情况。看见有个比较好的选择,能直观比较各种处理带来的影响,如下:

image.png

来自Nature plants的一篇文章
Ref:
https://github.com/YulongNiu/MPIPZ_microbe-host_homeostasis
https://www.nature.com/articles/s41477-021-00920-2
这个图很牛逼啊,表示的信息量很全,值得学习。去扒作者的代码,复现出了大部分

所需文件:
总的基因丰度表,即各个基因在每个样品中的丰度


image.png

各个样品的基因差异表达情况
举一个例子,这是Deseq2算出来的:


image.png

1. 导入数据并做一些处理

setwd("C:/Users/yjk/Desktop/cluster_DEGs")
library("dplyr")
library("ComplexHeatmap")
library("tibble")
library("RColorBrewer")
library("ggplot2")
my_theme()
all_genes <- read.table("all_genes.txt", header = TRUE) # fpkm

# DESeq2获得的差异表达基因(DEGs), |log2foldchange| > 1, padj < 0.05
HF12N_Rs <- read.table("HF12N_Rs.txt", header = TRUE, sep = "\t")
HF12N_S <- read.table("HF12N_S.txt", header = TRUE, sep = "\t")
HF12Rs_N_S <- read.table("HF12Rs_N_S.txt", header = TRUE, sep = "\t")
HF12S_Rs <- read.table("HF12S_Rs.txt", header = TRUE, sep = "\t")
HG64N_Rs <- read.table("HG64N_Rs.txt", header = TRUE, sep = "\t")
HG64N_S <- read.table("HG64N_S.txt", header = TRUE, sep = "\t")
HG64Rs_N_S <- read.table("HG64Rs_N_S.txt", header = TRUE, sep = "\t")
HG64S_Rs <- read.table("HG64S_Rs.txt", header = TRUE, sep = "\t")
# 合并差异表达和基因丰度
all <- HF12N_Rs %>% 
    full_join(HF12N_S) %>% 
    full_join(HF12Rs_N_S) %>% 
    full_join(HF12S_Rs) %>% 
    full_join(HG64N_Rs) %>% 
    full_join(HG64N_S) %>% 
    full_join(HG64Rs_N_S) %>% 
    full_join(HG64S_Rs) %>% 
    left_join(all_genes)
all[is.na(all)] <- "No"
## mean func
meanGp <- function(v) {
    require('magrittr')
    res <- v %>%
        split(rep(1 : 8, each = 3)) %>%
        sapply(mean, na.rm = TRUE)
    return(res)
}

all_for_cluster <- select(all, -contains("vs")) # 只选择原始fpkm数据
rownames(all_for_cluster) <- all_for_cluster$id
all_for_cluster <- all_for_cluster[-1]

## sample name
sampleN <- c("HG64NRs","HG64SRs","HG64N","HG64S",
             "HF12NRs","HF12SRs","HF12N","HF12S")

meanCount <- all_for_cluster %>%
    apply(1, meanGp) %>%
    t

colnames(meanCount) <- sampleN
## scale
scaleCount <- meanCount %>%
    t %>%
    scale %>%
    t
scaleCount %<>% .[complete.cases(.), ]

2. 然后要先计算多少个cluster合适:

# set.seed(123) 聚类结果一致
set.seed(123)
## choose cluster num
## 1. sum of squared error
wss <- (nrow(scaleCount) - 1) * sum(apply(scaleCount, 2, var))

for (i in 2:20) {
    wss[i] <- sum(kmeans(scaleCount,
                         centers=i,
                         algorithm = 'MacQueen')$withinss)
}

ggplot(tibble(k = 1:20, wss = wss), aes(k, wss)) +
    geom_point(colour = '#D55E00', size = 3) +
    geom_line(linetype = 'dashed') +
    xlab('Number of clusters') +
    ylab('Sum of squared error')

# ggsave("Sum_of_squared_error.pdf", height = 3, width = 4)
## 2. Akaike information criterion
kmeansAIC = function(fit){
    m = ncol(fit$centers)
    n = length(fit$cluster)
    k = nrow(fit$centers)
    D = fit$tot.withinss
    return(D + 2*m*k)
}

aic <- numeric(20)
for (i in 1:20) {
    fit <- kmeans(x = scaleCount, centers = i, algorithm = 'MacQueen')
    aic[i] <- kmeansAIC(fit)
}

ggplot(tibble(k = 1:20, aic = aic), aes(k, wss)) +
    geom_point(colour = '#009E73', size = 3) +
    geom_line(linetype = 'dashed') +
    xlab('Number of clusters') +
    ylab('Akaike information criterion')

# ggsave("Akaike_information_criterion.pdf", height = 3, width = 4)
# choose cluster 10

withinss: Vector of within-cluster sum of squares, one component per cluster.

我们上面计算的是withinss,即cluster内的误差平方和,同一个cluster趋势越一致则越小。所以cluster越多则这个值越小,这和我们认知一致。但是,cluster太多了信息很杂,太少了就成了生拉硬扯成cluster了

image.png

可以看出选则10比较合适
利用另外一个参数:赤池信息量准则(Akaike information criterion,简称AIC)来衡量

AIC介绍,https://www.biaodianfu.com/aic-bic.html
理论上来讲,增加自由参数的数目可以提高拟合的优良性,但是过多,模型过于复杂容易造成过拟合现象。因此,AIC不仅要提高模型拟合度(极大似然),而且引入了惩罚项,使模型参数尽可能少,有助于降低过拟合的可能性。

image.png

可以看出,两种方法,同样结果。选择10

3. 然后聚类:

kclust10 <- kmeans(scaleCount, centers = 10, algorithm = "MacQueen", nstart = 1000, iter.max = 20)
cl <- as.data.frame(kclust10$cluster) %>% 
    rownames_to_column("id") %>% 
    set_colnames(c("id","cl"))

heat_all <- all %>% left_join(cl)


# 把所有DEGs的cluster信息保存,为后面富集分析所用
degs_cl <- heat_all %>%
    select(c("id","cl"))
write.table(degs_cl, "./enrichment/degs_cl.txt", sep = "\t", quote = FALSE)


scaleC <- heat_all %>% 
    select(-contains("vs")) %>% 
    select(-c("id","cl")) %>% 
    t %>% 
    scale %>%
    t %>%
    as_tibble %>%
    bind_cols(heat_all %>% select(id, cl))
# 排序
scaleC <- scaleC[,c("HG64S1", "HG64S2", "HG64S3",
                 "HG64N1", "HG64N2", "HG64N3",
                 "HG64SRs1", "HG64SRs2", "HG64SRs3",
                 "HG64NRs1", "HG64NRs2", "HG64NRs3",
                 "HF12S1", "HF12S2", "HF12S3",
                 "HF12N1", "HF12N2", "HF12N3",
                 "HF12SRs1", "HF12SRs2", "HF12SRs3",
                 "HF12NRs1", "HF12NRs2", "HF12NRs3",
                 "id", "cl")]


## col annotation
NatSoil <- HeatmapAnnotation(NatSoil = c(rep(rep(c("No", "Yes", "No", "Yes"), each = 3),2)),
                             col = list(NatSoil = c("Yes" = "grey", "No" = "white")),
                             gp = gpar(col = "black"))
Rs <- HeatmapAnnotation(Rs = c(rep(rep(c("No", "Yes"), each = 6),2)),
                               col = list(Rs = c("Yes" = "grey", "No" = "white")),
                        gp = gpar(col = "black"))

## DEG annotation
deg <- heat_all %>% select(matches("vs"))
# order
deg <- deg[,c("HG64N_vs_HG64S", "HF12N_vs_HF12S", "HG64NRs_vs_HG64SRs", "HF12RsN_vs_HF12RsS", 
              "HG64NRs_vs_HG64N", "HF12NRs_vs_HF12N", "HG64SRs_vs_HG64S", "HF12SRs_vs_HF12S")]
Heatmap(matrix = scaleC[1:24],
        name = 'Scaled Counts',
        row_split = scaleC$cl,
        row_gap = unit(2, "mm"),
        column_order = 1:24,
        # column_split = rep(c("HG64S","HG64N","HG64SRs","HG64NRs",
                             # "HF12S","HF12N","HF12SRs","HF12NRs"), each = 3),
        column_split = factor(rep(c("HG64","HF12"), each = 12), levels = c("HG64","HF12")),
        show_column_names = FALSE,
        col = colorRampPalette(rev(brewer.pal(n = 10, name = 'Spectral'))[c(-3,-8)])(100),
        top_annotation = c(NatSoil, Rs),
        row_title_gp = gpar(fontsize = 10),
        use_raster = FALSE) +
        Heatmap(deg,
                col = c('Down' = '#00bbf9', 'No' = 'white', 'Up' = '#f94144'),
                column_names_gp = gpar(fontsize = 6),
                heatmap_legend_param = list(title = 'DEGs'),
                cluster_columns = FALSE,
                column_names_rot = 45,
                use_raster = FALSE)

image.png

4. 然后每个cluster有什么功能呢?富集分析

#####################
setwd("C:/Users/yjk/Desktop/cluster_DEGs/enrichment")
library("clusterProfiler")
library("magrittr")
library("tidyverse")
library("RColorBrewer")
library("AnnotationHub")
my_theme()

# > packageVersion("AnnotationHub")
# [1] ‘3.0.2’
# packageVersion("clusterProfiler")
# [1] ‘4.0.5’

hub <- AnnotationHub()
# snapshotDate(): 2021-05-18
query(hub, "solanum")
# AnnotationHub with 9 records
# # snapshotDate(): 2021-05-18
# # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, Inparanoid8, WikiPathways
# # $species: Solanum lycopersicum, Solanum tuberosum, Solanum pennellii, Solanum pennelli, Sola...
# # $rdataclass: OrgDb, Inparanoid8Db, Tibble
# # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
# #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype 
# # retrieve records with, e.g., 'object[["AH10593"]]' 
# 
# title                                             
# AH10593 | hom.Solanum_lycopersicum.inp8.sqlite              
# AH10606 | hom.Solanum_tuberosum.inp8.sqlite                 
# AH91815 | wikipathways_Solanum_lycopersicum_metabolites.rda 
# AH94101 | org.Solanum_pennellii.eg.sqlite                   
# AH94102 | org.Solanum_pennelli.eg.sqlite                    
# AH94160 | org.Solanum_tuberosum.eg.sqlite                   
# AH94246 | org.Solanum_esculentum.eg.sqlite                  
# AH94247 | org.Solanum_lycopersicum.eg.sqlite                
# AH94248 | org.Solanum_lycopersicum_var._humboldtii.eg.sqlite

sly.db <- hub[["AH94247"]]

这里如果遇到提示

hub <- AnnotationHub()
snapshotDate(): 2021-05-18
Warning message:DEPRECATION: As of AnnotationHub (>2.23.2), default caching location has changed.
Problematic cache: C:\Users\yjk\AppData\Local/AnnotationHub/AnnotationHub/Cache
See https://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/TroubleshootingTheCache.html#default-caching-location-update

运行下面的命令即可

 moveFiles<-function(package){
      olddir <- path.expand(rappdirs::user_cache_dir(appname=package))
      newdir <- tools::R_user_dir(package, which="cache")
      dir.create(path=newdir, recursive=TRUE)
      files <- list.files(olddir, full.names =TRUE)
      moveres <- vapply(files,
                        FUN=function(fl){
                            filename = basename(fl)
                            newname = file.path(newdir, filename)
                            file.rename(fl, newname)
                        },
                        FUN.VALUE = logical(1))
      if(all(moveres)) unlink(olddir, recursive=TRUE)
  }

下面就可以进行GO和KEGG富集分析了

kmeansRes <- read.table("degs_cl.txt")

prefix <- 'kmeans10'
savepath <- "C:/Users/yjk/Desktop/cluster_DEGs/enrichment"

for (i in kmeansRes$cl %>% unique) {
    ## BP
    goBP <- enrichGO(gene = kmeansRes %>% filter(cl == i) %>% .$id,
                     OrgDb = sly.db,
                     keyType= 'SYMBOL',
                     ont = 'BP',
                     pAdjustMethod = 'BH',
                     pvalueCutoff = 0.05,
                     qvalueCutoff = 0.1)
    
    goBPSim <- clusterProfiler::simplify(goBP,
                                         cutoff = 0.5,
                                         by = 'p.adjust',
                                         select_fun = min)
    ## check and plot
    write.table(as.data.frame(goBPSim),
              paste0(prefix, '_cl', i, '_cp_BP.txt') %>% file.path(savepath, .),
              quote = FALSE,
              sep = "\t")
    
    ## KEGG
    kk2 <- enrichKEGG(gene = kmeansRes %>% filter(cl == i) %>% .$id %>%
                      bitr(.,"SYMBOL", "ENTREZID", sly.db) %>% dplyr::select("ENTREZID") %>%
                          unlist(),
                      organism = 'sly',
                      pvalueCutoff = 0.05)
    
    write.table(as.data.frame(kk2),
              paste0(prefix, '_cl', i, '_cp_KEGG.txt') %>% file.path(savepath, .),
              quote = FALSE,
              sep = "\t")
}

kall <- lapply(kmeansRes$cl %>% unique, function(x) {
    
    eachG <- kmeansRes %>% filter(cl == x) %>% .$id
    
    return(eachG)
    
}) %>%
    set_names(kmeansRes$cl %>% unique %>% paste0('cl', .))

kallGOBP <- compareCluster(geneCluster = kall,
                           fun = 'enrichGO',
                           OrgDb = sly.db,
                           keyType= 'SYMBOL',
                           ont = 'BP',
                           pAdjustMethod = 'BH',
                           pvalueCutoff=0.01,
                           qvalueCutoff=0.1)

kallGOBPSim <- clusterProfiler::simplify(kallGOBP,
                                         cutoff = 0.9,
                                         by = 'p.adjust',
                                         select_fun = min)
dotplot(kallGOBPSim, showCategory = 10) + 
    ggtitle("Biological process")
# ggsave("kallGOBPSim.pdf", width = 6.4, height = 5.4)

kallGOCC <- compareCluster(geneCluster = kall,
                           fun = 'enrichGO',
                           OrgDb = sly.db,
                           keyType= 'SYMBOL',
                           ont = "CC",
                           pAdjustMethod = 'BH',
                           pvalueCutoff=0.01,
                           qvalueCutoff=0.1)

kallGOCCSim <- clusterProfiler::simplify(kallGOCC,
                                         cutoff = 0.9,
                                         by = 'p.adjust',
                                         select_fun = min)
dotplot(kallGOCCSim, showCategory = 20) + 
    ggtitle("Cellular component")
# ggsave("kallGOCCSim.pdf", width = 6.4, height = 4)


kallGOMF <- compareCluster(geneCluster = kall,
                           fun = 'enrichGO',
                           OrgDb = sly.db,
                           keyType= 'SYMBOL',
                           ont = "MF",
                           pAdjustMethod = 'BH',
                           pvalueCutoff=0.01,
                           qvalueCutoff=0.1)

kallGOMFSim <- clusterProfiler::simplify(kallGOMF,
                                         cutoff = 0.9,
                                         by = 'p.adjust',
                                         select_fun = min)
dotplot(kallGOMFSim, showCategory = 10) + 
    ggtitle("Molecular function")
ggsave("kallGOMFSim.pdf", width = 6.9, height = 4)

kallGOBP %>%
    as.data.frame %>%
    write.table('kmeans10_GOBP.txt', quote = FALSE, sep = "\t")

emapplot(kallGOBP,
         showCategory = 5,
         pie='count',
         pie_scale=1,
         layout='kk')


kallKEGG_input <- lapply(kmeansRes$cl %>% unique, function(x) {
    
    eachG <- kmeansRes %>% filter(cl == x) %>% 
        .$id %>% 
        bitr(.,"SYMBOL", "ENTREZID", sly.db) %>% 
        dplyr::select("ENTREZID") %>% 
        unlist()
    
    return(eachG)
    
}) %>%
    set_names(kmeansRes$cl %>% unique %>% paste0('cl', .))
kallKEGG <- compareCluster(geneCluster = kallKEGG_input,   
                           fun = 'enrichKEGG',
                           organism = "sly",
                           pvalueCutoff = 0.05)

dotplot(kallKEGG)
# ggsave('kmeans10_KEGGALL.pdf', width = 8, height = 4)

kallKEGG %>% 
    as.data.frame %>%
    write.table('kmeans10_KEGG.txt', quote = FALSE, sep = "\t")

列一个例子,GO富集的生物学过程


image.png

到此结束

有关转录组DEGs聚类热图和功能富集分析的更多相关文章

  1. ruby-on-rails - Cucumber 是否只是 rspec 的包装器以帮助将测试组织成功能? - 2

    只是想确保我理解了事情。据我目前收集到的信息,Cucumber只是一个“包装器”,或者是一种通过将事物分类为功能和步骤来组织测试的好方法,其中实际的单元测试处于步骤阶段。它允许您根据事物的工作方式组织您的测试。对吗? 最佳答案 有点。它是一种组织测试的方式,但不仅如此。它的行为就像最初的Rails集成测试一样,但更易于使用。这里最大的好处是您的session在整个Scenario中保持透明。关于Cucumber的另一件事是您(应该)从使用您的代码的浏览器或客户端的角度进行测试。如果您愿意,您可以使用步骤来构建对象和设置状态,但通常您

  2. ruby-on-rails - rails 功能测试 - 2

    在Rails自动生成的功能测试(test/functional/products_controller_test.rb)中,我看到以下代码:classProductsControllerTest我的问题是:方法调用products()在哪里/如何定义?products(:one)到底是什么意思?看代码,大概意思是“创建一个产品”,但是它是如何工作的呢?注意我是Ruby/Rails的新手,如果这些是微不足道的问题,我深表歉意。 最佳答案 如果您查看test/fixtures文件夹,您会看到一个products.yml文件。这是在您创建

  3. ruby-on-rails - 功能测试 Authlogic? - 2

    在我的一些Controller中,我有一个before_filter检查用户是否登录?用于CRUD操作。application.rbdeflogged_in?unlesscurrent_userredirect_toroot_pathendendprivatedefcurrent_user_sessionreturn@current_user_sessionifdefined?(@current_user_session)@current_user_session=UserSession.findenddefcurrent_userreturn@current_userifdefine

  4. 建模分析 | 平面2R机器人(二连杆)运动学与动力学建模(附Matlab仿真) - 2

    目录0专栏介绍1平面2R机器人概述2运动学建模2.1正运动学模型2.2逆运动学模型2.3机器人运动学仿真3动力学建模3.1计算动能3.2势能计算与动力学方程3.3动力学仿真0专栏介绍?附C++/Python/Matlab全套代码?课程设计、毕业设计、创新竞赛必备!详细介绍全局规划(图搜索、采样法、智能算法等);局部规划(DWA、APF等);曲线优化(贝塞尔曲线、B样条曲线等)。?详情:图解自动驾驶中的运动规划(MotionPlanning),附几十种规划算法1平面2R机器人概述如图1所示为本文的研究本体——平面2R机器人。对参数进行如下定义:机器人广义坐标

  5. 网站日志分析软件--让网站日志分析工作变得更简单 - 2

    网站的日志分析,是seo优化不可忽视的一门功课,但网站越大,每天产生的日志就越大,大站一天都可以产生几个G的网站日志,如果光靠肉眼去分析,那可能看到猴年马月都看不完,因此借助网站日志分析工具去分析网站日志,那将会使网站日志分析工作变得更简单。下面推荐两款网站日志分析软件。第一款:逆火网站日志分析器逆火网站日志分析器是一款功能全面的网站服务器日志分析软件。通过分析网站的日志文件,不仅能够精准的知道网站的访问量、网站的访问来源,网站的广告点击,访客的地区统计,搜索引擎关键字查询等,还能够一次性分析多个网站的日志文件,让你轻松管理网站。逆火网站日志分析器下载地址:https://pan.baidu.

  6. ABB-IRB-1200运动学分析MATLAB RVC工具分析+Simulink-Adams联合仿真 - 2

    一、机器人介绍        此处是基于MATLABRVC工具箱,对ABB-IRB-1200型号的微型机械臂进行正逆向运动学分析,并利Simulink工具实现对机械臂进行具有动力学参数的末端轨迹规划仿真,最后根据机械模型设计Simulink-Adams联合仿真。 图1.ABBIRB 1200尺寸参数示意图ABBIRB 1200提供的两种型号广泛适用于各作业,且两者间零部件通用,两种型号的工作范围分别为700 mm 和 900 mm,大有效负载分别为 7 kg 和5 kg。 IRB 1200 能够在狭小空间内能发挥其工作范围与性能优势,具有全新的设计、小型化的体积、高效的性能、易于集成、便捷的接

  7. 关于Qt程序打包后运行库依赖的常见问题分析及解决方法 - 2

    目录一.大致如下常见问题:(1)找不到程序所依赖的Qt库version`Qt_5'notfound(requiredby(2)CouldnotLoadtheQtplatformplugin"xcb"in""eventhoughitwasfound(3)打包到在不同的linux系统下,或者打包到高版本的相同系统下,运行程序时,直接提示段错误即segmentationfault,或者Illegalinstruction(coredumped)非法指令(4)ldd应用程序或者库,查看运行所依赖的库时,直接报段错误二.问题逐个分析,得出解决方法:(1)找不到程序所依赖的Qt库version`Qt_5'

  8. ruby - Ruby 中允许 "p *1..10"打印出数字 1-10 的功能是什么? - 2

    require'pp'p*1..10这会打印出1-10。为什么这么简洁?您还可以用它做什么? 最佳答案 它是“splat”运算符。它可用于分解数组和范围并在赋值期间收集值。这里收集赋值中的值:a,*b=1,2,3,4=>a=1b=[2,3,4]在此示例中,内部数组([3,4])中的值被分解并收集到包含数组中:a=[1,2,*[3,4]]=>a=[1,2,3,4]您可以定义将参数收集到数组中的函数:deffoo(*args)pargsendfoo(1,2,"three",4)=>[1,2,"three",4]

  9. ruby - 现代计算机的功能是否不足以处理字符串而无需使用符号(在 Ruby 中) - 2

    我读过的关于Ruby符号的每一篇文章都在谈论符号相对于字符串的效率。但是,这不是1970年代。我的电脑可以处理一些额外的垃圾收集。我错了吗?我拥有最新最好的奔腾双核处理器和4GBRAM。我认为这应该足以处理一些字符串。 最佳答案 您的计算机可能能够处理“一点点额外的垃圾收集”,但是当“一点点”发生在运行数百万次的内部循环中时呢?如果它在内存有限的嵌入式系统上运行呢?有很多地方你可以随意使用字符串,但在某些地方你不能。这完全取决于上下文。 关于ruby-现代计算机的功能是否不足以处理字符串

  10. ruby - 如何在 Cucumber 的功能名称中使用空格 - 2

    我正在使用Windows并尝试运行一个现有的功能包,该功能包最初是在MacOS上构建的,这允许他们通过使用带空格的"\"来解决问题。我正在使用Ruby2.2.3和Cucumber。功能名称包含空格,我无法更改它。我尝试使用""和''来绕过空白,但每次都有同样的问题。这是问题的一个例子。如果我运行:cucumberfeatures/'Namecontainingwhitespaces.feature'它工作正常。但是当我运行时:cucumber-pmy_profile和cucumber.yml包含:my_profile:features/'Namecontainingwhitespace

随机推荐