Analysis workflow: polygenic risk score (PRS)

BohanL 2023-10-16 (original post)

sudo apt-get install zlib1g zlib1g-dev libblas3 libgfortran5 liblapack3 libquadmath0 plink1.9 unzip

sudo apt install dirmngr gnupg apt-transport-https ca-certificates software-properties-common

sudo apt install r-base

1. Obtain or generate the base data

The first step of a polygenic risk score (PRS) analysis is to obtain the base data, i.e. the GWAS summary statistics. These should contain every trait-associated allele together with its estimated effect. The expected columns are:
CHR: the chromosome on which the SNP resides
BP: chromosomal coordinate of the SNP; may appear as POSITION
SNP: SNP ID, usually an rs-ID
A1: the effect allele of the SNP; may appear as REF
A2: the non-effect allele of the SNP; may appear as ALT
N: number of samples used to obtain the effect size estimate
SE: the standard error of the effect size estimate
P: the P-value of association between the SNP genotypes and the base phenotype; may appear as p_value
OR: the effect size estimate of the SNP if the outcome is binary (case-control); if the outcome is continuous or treated as continuous, this is usually BETA
INFO: the imputation information score; may appear as INFO.plink
MAF: the minor allele frequency of the SNP; may appear as ALT_FREQ or FRQ
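The alternative column names listed above can be mapped onto the standard names with a small lookup table. This is a minimal sketch: the `ALIASES` table and the `standardize_header` helper are illustrative names, not part of any library, and the mapping simply follows the list above. Whether REF/ALT really correspond to the effect/non-effect allele, and whether ALT_FREQ is the minor allele frequency, depends on the GWAS software and should be checked against the file's documentation.

```python
# Hypothetical alias table: column names seen in downloaded GWAS files
# mapped to the standard names used in this workflow. Extend as needed.
ALIASES = {
    "CHROMOSOME": "CHR",
    "POSITION": "BP",
    "RSID": "SNP",
    "REF": "A1",          # assumption: REF is the effect allele in this file
    "ALT": "A2",
    "P_VALUE": "P",
    "PVALUE": "P",
    "ODDSRATIO": "OR",
    "INFO.PLINK": "INFO",
    "ALT_FREQ": "MAF",    # assumption: ALT is the minor allele
    "FRQ": "MAF",
}


def standardize_header(columns):
    """Return the column names with known aliases replaced (case-insensitive)."""
    return [ALIASES.get(name.upper(), name) for name in columns]


print(standardize_header(["Chromosome", "Position", "RSID", "REF", "ALT", "Pvalue"]))
```

Unknown names pass through unchanged, so a column that is not in the table is easy to spot in the output.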

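I downloaded someone else's GWAS results from the web as a TXT file. The goal is to rearrange the columns from the order in the first header line below to the order in the second: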

Chromosome  Position    RSID    REF ALT ALT_FREQ    ALT_FREQ_1KGASN RSQ INFO    HWE_P   Pvalue  Qvalue  N   NullLogLike AltLogLike  SNPWeight   SNPWeightSE OddsRatio   WaldStat    NullLogDelta    NullGeneticVar  NullResidualVar NullBias

CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF

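1.1 Reordering the columns

First convert the TXT file to CSV, then extract and reorder the columns with Python, and finally convert the CSV back to TXT.

TXT to CSV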

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 11:25:52 2022

@author: Bohan
"""

import pandas as pd
df = pd.read_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.txt",delimiter="\t", low_memory=False)
df.to_csv("MDD.10640samples.dosages.hwe6info9maf5.logreg.2017.csv", encoding='utf-8', index=False)

Extracting and reordering the CSV columns

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 10:49:59 2022

@author: Bohan
"""

import csv

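with open("MD_GWAS_SNPresults-original.csv", newline='') as src, open("originalarranged.csv","w",newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    # [a,...,w] is the column order of the original file (23 columns).
    # [a,b,c,d,e,m,q,k,r,i,f] extracts and reorders the 11 columns that are kept:
    # Chromosome, Position, RSID, REF, ALT, N, SNPWeightSE, Pvalue, OddsRatio, INFO, ALT_FREQ
    writer.writerows([a,b,c,d,e,m,q,k,r,i,f] for [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w] in reader)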

CSV to TXT

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 13 12:07:58 2022

@author: Bohan
"""

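import csv

# The input must be the file written by the previous step (adjust the name if it differs).
with open('MD_GWAS_SNPresults-original.arranged.csv', 'r', newline='') as src, \
     open('MD_GWAS_SNPresults-original.arranged', 'w') as dst:
    for row in csv.reader(src):
        # Join the fields with tabs so that no trailing tab is written
        dst.write('\t'.join(row))
        dst.write('\n')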
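The three scripts above can also be collapsed into a single pass that selects columns by name instead of by position. This is a sketch under assumptions: the `COLUMN_MAP` pairs are read off the letter positions used in the reordering script, the `rearrange` helper is an illustrative name, and the demo row uses rounded values.

```python
import csv
import io

# (source column, output column) pairs, in output order.
COLUMN_MAP = [
    ("Chromosome", "CHR"), ("Position", "BP"), ("RSID", "RSID"),
    ("REF", "A1"), ("ALT", "A2"), ("N", "N"), ("SNPWeightSE", "SE"),
    ("Pvalue", "P"), ("OddsRatio", "OR"), ("INFO", "info"), ("ALT_FREQ", "MAF"),
]


def rearrange(src, dst):
    """Copy the selected columns from one tab-delimited stream to another."""
    reader = csv.DictReader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow([new for _, new in COLUMN_MAP])
    for row in reader:
        writer.writerow([row[old] for old, _ in COLUMN_MAP])


# In-memory demo; with real files, pass handles from open(..., newline="").
demo = io.StringIO(
    "Chromosome\tPosition\tRSID\tREF\tALT\tALT_FREQ\tINFO\tPvalue\tN\tSNPWeightSE\tOddsRatio\n"
    "10\t90127\trs185642176\tC\tT\t0.046\t0.9055\t0.4705\t10640\t0.0048\t1.0124\n"
)
out = io.StringIO()
rearrange(demo, out)
print(out.getvalue())
```

Selecting by name means the script keeps working if the source file gains or loses unrelated columns, and the header row is renamed in the same step.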

After these steps the base data looks like this

CHR BP  RSID    A1  A2  N   SE  P   OR  info    MAF 
10  90127   rs185642176 C   T   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905519504189  0.046   
10  90164   rs141504207 C   G   10640   0.004802742809305182    0.4704864679671288  1.0123502866538827  0.905515996565  0.046   
10  94026   rs10904032  G   A   10640   0.004807929547222263    0.9907936601472477  1.000113527128114   0.946752744368  0.145

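1.2 Base data QC

Decompress and read MD_GWAS_SNPresults-original.2015.arranged.txt.gz
Print the header line (NR==1)
Keep the rows with MAF greater than 0.01 (column 11 is MAF)
and INFO greater than 0.8 (column 10 is INFO)
Compress the result to MD2015.gz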

gunzip -c MD_GWAS_SNPresults-original.2015.arranged.txt.gz |\
awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' |\
gzip  > MD2015.gz
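The same filter can be expressed in Python, which may be easier to adapt when the column positions differ. This is a sketch only: `passes_base_qc` and `filter_base` are illustrative names, and the demo rows are made up.

```python
def passes_base_qc(fields, maf_min=0.01, info_min=0.8):
    """fields is one split data line; column 11 is MAF, column 10 is INFO (1-based)."""
    return float(fields[10]) > maf_min and float(fields[9]) > info_min


def filter_base(lines):
    """Keep the header line plus every data line that passes both thresholds."""
    kept = []
    for number, line in enumerate(lines, start=1):
        if number == 1 or passes_base_qc(line.split()):
            kept.append(line)
    return kept


demo_lines = [
    "CHR BP RSID A1 A2 N SE P OR info MAF",
    "10 90127 rs1 C T 10640 0.0048 0.47 1.012 0.9055 0.046",  # passes
    "10 90200 rs2 C G 10640 0.0048 0.47 1.012 0.95 0.005",    # MAF too low
    "10 90300 rs3 G A 10640 0.0048 0.99 1.000 0.75 0.145",    # INFO too low
]
filtered = filter_base(demo_lines)
print(len(filtered))
```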

Remove duplicates based on column 3 (the SNP ID) to obtain MD2015.nodup.gz (no duplicates)

gunzip -c MD2015.gz |\
awk '{seen[$3]++; if(seen[$3]==1){ print}}' |\
gzip - > MD2015.nodup.gz
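The awk command keeps the first row seen for each SNP ID and drops later repeats. A minimal Python sketch of the same idea (the `drop_duplicate_ids` helper and the demo rows are illustrative):

```python
def drop_duplicate_ids(rows, id_index=2):
    """Keep only the first row seen for each SNP ID (column 3, index 2)."""
    seen = set()
    kept = []
    for row in rows:
        snp_id = row[id_index]
        if snp_id not in seen:
            seen.add(snp_id)
            kept.append(row)
    return kept


demo_rows = [
    ["10", "90127", "rs1", "C", "T"],
    ["10", "90164", "rs2", "C", "G"],
    ["10", "90127", "rs1", "C", "A"],  # repeated ID, dropped
]
unique_rows = drop_duplicate_ids(demo_rows)
print(unique_rows)
```

If repeated IDs carry conflicting alleles or effect sizes, keeping the first copy is an arbitrary choice; dropping every copy of such an ID is a possible alternative.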

Remove ambiguous SNPs based on the values in columns 4 and 5 (A1 and A2)

gunzip -c MD2015.nodup.gz |\
awk '!( ($4=="A" && $5=="T") || \
        ($4=="T" && $5=="A") || \
        ($4=="G" && $5=="C") || \
        ($4=="C" && $5=="G")) {print}' |\
    gzip > MD2015.QC.gz
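A SNP is ambiguous when its two alleles are complementary (A/T or C/G): the allele pair reads the same on both strands, so the strand cannot be inferred from the alleles alone. The four explicit comparisons in the awk command reduce to one complement lookup. A minimal sketch (`is_ambiguous` is an illustrative name; unlike the awk version it ignores case):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True when a2 is the complement of a1 (A/T, T/A, C/G, G/C)."""
    return COMPLEMENT.get(a1.upper()) == a2.upper()


for pair in [("A", "T"), ("G", "C"), ("A", "G"), ("AT", "A")]:
    print(pair, is_ambiguous(*pair))
```

Multi-base alleles such as indels have no entry in the lookup table and are therefore never flagged.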

2. QC of the target data

This example uses the raw files (*.idat) produced by the Illumina ASA v1 SNP array.
Download Illumina's GenomeStudio software directly:
https://files.softwaredownloads.illumina.com/5831d9df-95cb-4427-a7c0-499fe871e1d5/genomestudio-software-v2-0-5-0-installer.zip

2.1 Creating the analysis project

Workflow

1st Step

You also need the Infinium Asian Screening Array v1.0 Manifest File (BPM Format - GRCh37)

Infinium Asian Screening Array v1.0 Cluster File

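2.2 Evaluating the control data

Files needed

After creating the project and computing the statistics, start the controls evaluation

Evaluating the control data

Built-in control markers

The control data can be used to judge whether the assay ran properly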

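2.3 Evaluating samples and SNPs

For the underlying theory, see GSA/ASA芯片质控 - 简书 (jianshu.com)
Sample evaluation
Call Rate: for a given sample, the ratio of SNPs that were successfully genotyped to all SNPs assayed; the usual threshold is 90% or higher.
LogR Dev: used to assess whether a sample is contaminated
SNP evaluation
Call Frequency: the fraction of samples in which the SNP was successfully called
GenTrain Score: a score from Illumina's own clustering algorithm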
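Call Rate and Call Frequency are the same ratio computed along different axes of the genotype matrix: per sample (row) and per SNP (column). A toy sketch with made-up calls, where `None` marks a no-call (`call_rate` is an illustrative helper, not a GenomeStudio function):

```python
def call_rate(calls):
    """Fraction of entries that received a genotype call (None marks a no-call)."""
    return sum(c is not None for c in calls) / len(calls)


# Rows are samples, columns are SNPs (toy data).
genotypes = [
    ["AA", "AB", None, "BB"],
    ["AA", None, None, "BB"],
    ["AB", "AB", "AA", "BB"],
]

sample_call_rates = [call_rate(row) for row in genotypes]       # per sample
snp_call_freqs = [call_rate(col) for col in zip(*genotypes)]    # per SNP

print(sample_call_rates)
print(snp_call_freqs)
```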

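Sample evaluation
Discard samples whose Call Rate is too low

SNP evaluation

The basic workflow: create the samples, add the manifest information (use the reference genome build that matches the GWAS), and export the .ped and .map files with the bundled plugin plink-input-report-plugin-v2-1-4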

plink1.9 --file new --out new --make-bed

plink1.9 --bfile new --maf 0.01 --hwe 1e-6 --geno 0.02 --mind 0.02 --write-snplist --make-just-fam --out new.QC

plink1.9 --bfile new --keep new.QC.fam --extract new.QC.snplist --indep-pairwise 200 50 0.25 --out new.QC

plink1.9 --bfile new --extract new.QC.prune.in --keep new.QC.fam --het --out new.QC

sudo R
install.packages("data.table")
library(data.table)
dat <- fread("new.QC.het")
valid <- dat[F<=mean(F)+3*sd(F) & F>=mean(F)-3*sd(F)] 
fwrite(valid[,c("FID","IID")], "new.valid.sample", sep="\t")
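The R filter above keeps samples whose heterozygosity F coefficient lies within three standard deviations of the mean. The same rule in Python, as a sketch on made-up F values (`within_three_sd` is an illustrative name):

```python
from statistics import mean, stdev


def within_three_sd(f_values):
    """Return one flag per sample: True if F lies within mean +/- 3 SD."""
    m = mean(f_values)
    s = stdev(f_values)  # sample SD (n - 1 denominator), like R's sd()
    return [m - 3 * s <= f <= m + 3 * s for f in f_values]


# Twenty unremarkable samples and one clear outlier.
f_coefficients = [0.01, -0.01] * 10 + [0.5]
keep = within_three_sd(f_coefficients)
print(keep.count(True), "kept,", keep.count(False), "removed")
```

With very few samples no value can lie more than three sample SDs from the mean, so this filter only becomes effective on reasonably sized cohorts.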

install.packages("magrittr")
install.packages("R.utils")
library(magrittr)
bim <- fread("new.bim") %>%
     setnames(., colnames(.), c("CHR", "SNP", "CM", "BP", "B.A1", "B.A2")) %>%
    .[,c("B.A1","B.A2"):=list(toupper(B.A1), toupper(B.A2))]
MD <- fread("MD2015.QC.gz") %>%
.[,c("A1","A2"):=list(toupper(A1), toupper(A2))]
qc <- fread("new.QC.snplist", header=F)

info <- merge(bim, MD, by=c("SNP", "CHR", "BP")) %>%
    .[SNP %in% qc[,V1]]

complement <- function(x){
    switch (x,
        "A" = "T",
        "C" = "G",
        "T" = "A",
        "G" = "C",
        return(NA)
    )
}

info.match <- info[A1 == B.A1 & A2 == B.A2, SNP]
com.snps <- info[sapply(B.A1, complement) == A1 &
                    sapply(B.A2, complement) == A2, SNP]

bim[SNP %in% com.snps, c("B.A1", "B.A2") :=
        list(sapply(B.A1, complement),
            sapply(B.A2, complement))]

recode.snps <- info[B.A1==A2 & B.A2==A1, SNP]

bim[SNP %in% recode.snps, c("B.A1", "B.A2") :=
        list(B.A2, B.A1)]

com.recode <- info[sapply(B.A1, complement) == A2 &
                    sapply(B.A2, complement) == A1, SNP]

bim[SNP %in% com.recode, c("B.A1", "B.A2") :=
        list(sapply(B.A2, complement),
            sapply(B.A1, complement))]
fwrite(bim[,c("SNP", "B.A1")], "EUR.a1", col.names=F, sep="\t")

mismatch <- bim[!(SNP %in% info.match |
                    SNP %in% com.snps |
                    SNP %in% recode.snps |
                    SNP %in% com.recode), SNP]
write.table(mismatch, "EUR.mismatch", quote=F, row.names=F, col.names=F)
q()
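The R script above sorts each SNP into one of four resolvable cases by comparing the base (GWAS) alleles with the target (.bim) alleles. The logic for a single SNP can be sketched as follows; `classify` and the category labels are illustrative names. The sketch checks the cases in a fixed order, whereas the R code computes each set independently, which gives the same result once ambiguous SNPs have been removed.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(allele):
    """Complement a single-base allele; returns None for anything else."""
    return COMPLEMENT.get(allele)


def classify(base_a1, base_a2, target_a1, target_a2):
    """Compare base alleles with target alleles for one SNP."""
    if (target_a1, target_a2) == (base_a1, base_a2):
        return "match"
    if (flip(target_a1), flip(target_a2)) == (base_a1, base_a2):
        return "complement"          # same order, opposite strand
    if (target_a1, target_a2) == (base_a2, base_a1):
        return "recode"              # same strand, alleles swapped
    if (flip(target_a1), flip(target_a2)) == (base_a2, base_a1):
        return "complement+recode"   # opposite strand and swapped
    return "mismatch"                # cannot be reconciled; exclude


for target in [("A", "G"), ("T", "C"), ("G", "A"), ("C", "T"), ("A", "C")]:
    print(target, classify("A", "G", *target))
```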


plink1.9 --bfile new --extract new.QC.prune.in --keep new.valid.sample --check-sex --out new.QC

library(data.table)
valid <- fread("new.valid.sample")
dat <- fread("new.QC.sexcheck")[FID%in%valid$FID]
fwrite(dat[STATUS=="OK",c("FID","IID")], "new.QC.valid", sep="\t") 
q() # exit R

plink1.9 --bfile new  --extract new.QC.prune.in --keep new.QC.valid --rel-cutoff 0.125 --out new.QC

plink1.9 --bfile new --make-bed --keep new.QC.rel.id --out new.QC --extract new.QC.snplist


-----------------------------------------------------------------------------

library(data.table)
dat <- fread("MD2015.QC.gz")
fwrite(dat[,BETA:=log(OR)], "new.QC.Transformed", sep="\t")
q() # exit R
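The transformation is needed because a PRS adds up per-allele weights, while an odds ratio is a multiplicative effect; taking the natural log puts it on an additive scale where OR = 1 (no effect) becomes 0. A short sketch (`or_to_beta` is an illustrative name):

```python
import math


def or_to_beta(odds_ratio):
    """Natural log of the odds ratio; OR = 1 maps to BETA = 0."""
    return math.log(odds_ratio)


for odds_ratio in (0.5, 1.0, 2.0):
    print(odds_ratio, or_to_beta(odds_ratio))
```

Risk-increasing alleles (OR > 1) get positive weights and protective alleles (OR < 1) get negative weights of matching size.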


plink1.9  --bfile new.QC  --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump new.QC.Transformed --clump-snp-field SNP --clump-field P --out new

awk 'NR!=1{print $3}' new.clumped >  new.valid.snp

awk '{print $3,$8}' new.QC.Transformed > new.pvalue

echo "0.001 0 0.001" > range_list 
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list


plink1.9 --bfile new.QC --score new.QC.Transformed 3 4 12 header  --q-score-range range_list new.pvalue --extract new.valid.snp --out new
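For one sample, the score at a given p-value threshold is the weighted sum of effect-allele counts over the SNPs whose P falls inside the range. The sketch below makes several assumptions that should be checked against the plink documentation: it divides by the number of alleles (my understanding of plink 1.9's default of reporting an average rather than a sum), it treats the range bounds as inclusive, and it assumes no missing genotypes. All names and values are made up.

```python
def polygenic_score(counts, betas):
    """Average per-allele score: sum(beta * allele count) / (2 * number of SNPs).

    Assumes diploid data with no missing genotypes.
    """
    total = sum(betas[snp] * count for snp, count in counts.items())
    return total / (2 * len(counts))


def score_at_threshold(counts, betas, pvalues, upper, lower=0.0):
    """Score one sample using only SNPs with lower <= P <= upper."""
    kept = {snp: c for snp, c in counts.items() if lower <= pvalues[snp] <= upper}
    return polygenic_score(kept, betas) if kept else 0.0


betas = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}      # log(OR) weights
pvalues = {"rs1": 0.0005, "rs2": 0.03, "rs3": 0.4}  # base-data P-values
counts = {"rs1": 2, "rs2": 1, "rs3": 0}             # effect-allele counts

for upper in (0.001, 0.05, 0.5):
    print(upper, score_at_threshold(counts, betas, pvalues, upper))
```

Each line of range_list corresponds to one call of `score_at_threshold` per sample, which is why plink writes one .profile file per threshold.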


# First, we need to perform pruning
plink1.9 --bfile new.QC --indep-pairwise 200 50 0.25 --out new
# Then we calculate the first 6 PCs
plink1.9 --bfile new.QC --extract new.prune.in --pca 6 --out new

library(data.table)
library(magrittr)
p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
phenotype <- fread("new.phenotype")
pcs <- fread("new.eigenvec", header=F) %>%
    setnames(., colnames(.), c("FID", "IID", paste0("PC",1:6)) )
covariate <- fread("new.cov")
pheno <- merge(phenotype, covariate) %>%
        merge(., pcs)
null.r2 <- summary(lm(Trait1~., data=pheno[,-c("FID", "IID")]))$r.squared
prs.result <- NULL
for(i in p.threshold){
    pheno.prs <- paste0("new.", i, ".profile") %>%
        fread(.) %>%
        .[,c("FID", "IID", "SCORE")] %>%
        merge(., pheno, by=c("FID", "IID"))

    model <- lm(Trait1~., data=pheno.prs[,-c("FID","IID")]) %>%
            summary
    model.r2 <- model$r.squared
    prs.r2 <- model.r2-null.r2
    prs.coef <- model$coeff["SCORE",]
    prs.result %<>% rbind(.,
        data.frame(Threshold=i, R2=prs.r2, 
                    P=as.numeric(prs.coef[4]), 
                    BETA=as.numeric(prs.coef[1]),
                    SE=as.numeric(prs.coef[2])))
}
print(prs.result[which.max(prs.result$R2),])
q() # exit R

The same analysis in base R, with a comment on each step:

p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
# Read in the phenotype file
phenotype <- read.table("new.phenotype", header=T)
# Read in the PCs
pcs <- read.table("new.eigenvec", header=F)
# The default output from plink does not include a header
# To make things simple, we will add the appropriate headers
# (1:6 because there are 6 PCs)
colnames(pcs) <- c("FID", "IID", paste0("PC",1:6))
# Read in the covariates (here, it is sex)
covariate <- read.table("new.cov", header=T)
# Now merge the files
pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))
# We can then calculate the null model (model without PRS) using a linear regression
# (appropriate when the trait is quantitative; lm is used because summary(glm) has no r.squared)
null.model <- lm(Trait1~., data=pheno[,!colnames(pheno)%in%c("FID","IID")])
# And the R2 of the null model is
null.r2 <- summary(null.model)$r.squared
prs.result <- NULL
for(i in p.threshold){
    # Read in the PRS file for this threshold and merge it with the phenotype matrix
    prs <- read.table(paste0("new.",i,".profile"), header=T)
    pheno.prs <- merge(pheno, prs[,c("FID","IID","SCORE")], by=c("FID","IID"))
    # Fit the full model (null model + PRS)
    model <- lm(Trait1~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")])
    model.r2 <- summary(model)$r.squared
    # R2 of PRS is simply calculated as the model R2 minus the null R2
    prs.r2 <- model.r2-null.r2
    # We can also obtain the coefficient and p-value of association of PRS as follows
    prs.coef <- summary(model)$coeff["SCORE",]
    prs.beta <- as.numeric(prs.coef[1])
    prs.se <- as.numeric(prs.coef[2])
    prs.p <- as.numeric(prs.coef[4])
    # We can then store the results
    prs.result <- rbind(prs.result, data.frame(Threshold=i, R2=prs.r2, P=prs.p, BETA=prs.beta,SE=prs.se))
}
# Best result is:
prs.result[which.max(prs.result$R2),]
q() # exit R
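The R2 attributed to the PRS is the R2 of the full model (covariates plus SCORE) minus the R2 of the null model (covariates only). The pure-Python sketch below reproduces that subtraction on made-up data. The `solve` and `r_squared` helpers are illustrative and stand in for R's `lm`; a real analysis would use `lm` as above or a statistics library.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(y, predictors):
    """R2 of a least-squares fit of y on one or more predictors plus an intercept."""
    rows = [[1.0, *values] for values in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data: the trait is built as 2 * score + pc1, so the full model fits exactly.
pc1 = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
score = [1.0, 2.0, 0.5, 1.5, 2.5, 0.0]
trait = [2 * s + p for s, p in zip(score, pc1)]

null_r2 = r_squared(trait, [pc1])          # covariates only
full_r2 = r_squared(trait, [pc1, score])   # covariates + PRS
prs_r2 = full_r2 - null_r2
print(null_r2, full_r2, prs_r2)
```

Subtracting the null R2 prevents variance already explained by the covariates and PCs from being credited to the score.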
