当前位置: > 信息科学 > 利用超几何概率分布做富集分析

利用超几何概率分布做富集分析

1. 了解超几何概率分布

参考:The Hypergeometric Distribution
超几何分布的公式为:

1
2
3
p(x) = choose(m, x) choose(n, k-x) / choose(m+n, k)
for x = 0, …, k.
其中, m 是桶里面白球的个数, n 是黑球的个数, k 是从桶中随机取出的球数, x 是取出球中白球的个数.

Fisher‘s Exact Test就是依据超几何概率分布得到的。

2. 超几何分布运算方法

使用R的运算方法:

1
2
3
4
5
6
7
8
dhyper(x, m, n, k, log = FALSE)
    计算某一个点的p值
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
    计算一个分布的p值,默认下计算P(X<=x)
qhyper(p, m, n, k, lower.tail = TRUE, log.p = FALSE)
    计算某一个p值对应的分位数
rhyper(nn, m, n, k)
    按超几何分布给出nn的可能的模拟结果值

3. 对一组p值进行校正

参考:Adjust P-values for Multiple Comparisons
使用R的运算方法:

1
2
3
4
5
p.adjust(p, method = p.adjust.methods, n = length(p))
 
p.adjust.methods
# c(“holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”,
# “fdr”, “none”)

4. 利用上述原理来编写perl程序,来进行GO或PATHWAY富集分析

输入两个文件:第1个文件是各个基因对应的GO分类或KEGG分类;第2个文件是差异表达基因的名称。以下为KEGG富集分析的perl程序。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
#!/usr/bin/perl
 
=head1 Name
 
  kegg_enrich_analysis.pl
 
=head1 Description
 
  This program is designed to do enrichment analysis by kegg results.
 
=head1 Version
 
  Author: Chen Lianfu, chenllianfu@foxmail.com
  Version: 1.0,  Date: Sat Apr 27 02:06:02 2013
 
=head1 Usage
 
  --out         the output result name, default: enrichment_output
  --tmp         to keep the tmp files, default: false
  --fdr         set the false discovery rate, default: 0.05
  --kdn         When a kegg term was perpared for Fisher's exact 
test, the i(the num of this kegg terms have so many differential 
expression genes) should >= this value, default: 3
  --help        output help information to screen
 
=head1 Example
 
  perl ../kegg_enrich_analysis.pl annot_keggs genelist.txt
 
=cut
 
use strict;
use Getopt::Long;
 
my ($out,$tmp,$fdr_threshold,$kegg2degs_num_threshold,$Help);
 
GetOptions(
    "out:s"=>\$out,
    "fdr:s"=>\$fdr_threshold,
    "kdn:s"=>\$kegg2degs_num_threshold,
    "tmp"=>\$tmp,
    "help"=>\$Help
);
 
$out ||= 'enrichment_output';
$fdr_threshold ||= 0.05;
$kegg2degs_num_threshold ||= 3;
 
die `pod2text $0` if (@ARGV == 0 || $Help);
 
my $kegg2gene_file = shift @ARGV;
my $deglist = shift @ARGV;
 
open kegg2GENE, '<', $kegg2gene_file or die $!;
open DEGLIST, '<', $deglist or die $!;
 
my %kegg2genes;
my %gene2keggs;
while ( <kegg2GENE> ) {
    next unless /^(\S+).*?(K\d+)/;
    $gene2keggs{$1} .= "$2,";
    $kegg2genes{$2}{"genes"} .= "$1,";
    $kegg2genes{$2}{"num"} += 1;
}
 
my ( $total_deg_num, $deg_own_kegg_num );
my %deg_kegg2genes;
while ( <DEGLIST> ) {
    chomp;
    my $genename = $_;
    $total_deg_num ++;
    if ( $gene2keggs{$genename} ) {
        $deg_own_kegg_num ++;
        my @keggs = split /,/, $gene2keggs{$genename};
 
        foreach ( @keggs ) {
            $deg_kegg2genes{$_}{"genes"} .= "$genename,";
            $deg_kegg2genes{$_}{"num"} ++;
        }
    }
}
 
print "$total_deg_num differential expression genes in total, and 
$deg_own_kegg_num genes own kegg terms.\n";
 
my @for_enrich_keggs = keys %deg_kegg2genes;
open FOR_R, '>', "tmp_4value" or die $!;
 
my $kegg_num_for_fisher_test = @for_enrich_keggs;
print "$kegg_num_for_fisher_test kegg terms is perpared for Fisher
's exact test.\n";
 
foreach my $kegg ( @for_enrich_keggs ) {
    my $n_and_m = keys %gene2keggs;
    my $N = $deg_own_kegg_num;
    my $n = $kegg2genes{$kegg}{"num"};
    my $i = $deg_kegg2genes{$kegg}{"num"};
    my $m = $n_and_m -$n;
 
    print FOR_R "$kegg $i $n $m $N\n" if $i >= $kegg2degs_num_threshold;
}
 
open R_SCRIPT, '>', "phyper_adjust.R" or die $!;
 
print R_SCRIPT 
'phy <- read.table(file="tmp_4value")
pvalue <- phyper(phy$V2,phy$V3,phy$V4,phy$V5,lower.tail=FALSE)
qvalue <- p.adjust(pvalue,method=\'fdr\')
value <- cbind ( pvalue, qvalue )
rownames(value)=phy$V1
value
';
 
my @fdr = `cat phyper_adjust.R | R --vanilla --slave`;
 
#print @fdr;
 
`rm phyper_adjust.R tmp_4value` unless $tmp;
 
open OUTPUT, '>', $out or die $!;
 
foreach ( @fdr ) {
    next unless /(\S+)\s+(\S+)\s+(\S+)/;
    my $kegg = $1; my $pvalue = $2; my $fdr = $3;
 
    if ( $fdr <= $fdr_threshold ) {
        my $n_and_m = keys %gene2keggs;
        my $N = $deg_own_kegg_num;
        my $n = $kegg2genes{$kegg}{"num"};
        my $i = $deg_kegg2genes{$kegg}{"num"};
        my $m = $n_and_m -$n;
 
        my $genes = $deg_kegg2genes{$kegg}{"genes"};
 
        print OUTPUT "$kegg\t$i\t$N\t$n\t$n_and_m\t$pvalue\t$fdr\t$genes\n";
    }
}
支付宝赞助
微信赞助

利用超几何概率分布做富集分析:等您坐沙发呢!

发表评论

表情
还能输入210个字