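Introduction
Different tasks may call for different language models in order to improve the recognition accuracy of a speech recognition system. SRILM is an easy-to-use open-source toolkit for training language models; see the SRILM website for details.
Common commands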
ngram-count -text train.txt -lm train
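To give a rough idea of the raw statistics such a command starts from, the sketch below tallies n-gram counts from whitespace-tokenized sentences, padding each one with <s> and </s>. It is a simplified stand-in written in Python, not SRILM's implementation: the function name count_ngrams and the sample sentences are made up for illustration, and the discounting and back-off estimation that turn counts into a model are left out.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Count all n-grams of order 1..max_order in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        # Pad with sentence-boundary markers, as seen in the model files below.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for order in range(1, max_order + 1):
            for start in range(len(tokens) - order + 1):
                counts[tuple(tokens[start:start + order])] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical two-sentence corpus.
    sample = ["language model library", "language model tools"]
    counts = count_ngrams(sample)
    for ngram in [("language",), ("language", "model"), ("model", "tools", "</s>")]:
        print(" ".join(ngram), counts[ngram])
```

Keys are tuples of words, so unigrams, bigrams, and trigrams can share one Counter.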
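Installing SRILM
Fill in the form at http://www.speech.sri.com/projects/srilm/download.html on the official site, and you can then download the source code for free.
Extract the archive after downloading.
Following the instructions in the INSTALL file, set the SRILM variable in the Makefile: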
SRILM = /home/vrgroup/Desktop/srilm-1.7.1
Run the following command to compile:
/home/vrgroup/Desktop/srilm-1.7.1# make World
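This generates three directories: bin, lib, and include.
The bin directory contains the tools used below.
Training an English language model
Train a language model directly on the README file. This produces a file named test, which is the language model.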
/home/vrgroup/Desktop/srilm-1.7.1# bin/i686-m64/ngram-count -text README -lm test
warning: discount coeff 1 is out of range: 0
warning: count of count 8 is zero -- lowering maxcount
warning: count of count 7 is zero -- lowering maxcount
warning: count of count 6 is zero -- lowering maxcount
warning: count of count 5 is zero -- lowering maxcount
warning: count of count 4 is zero -- lowering maxcount
warning: discount coeff 1 is out of range: 0
warning: count of count 8 is zero -- lowering maxcount
warning: count of count 7 is zero -- lowering maxcount
warning: count of count 6 is zero -- lowering maxcount
warning: count of count 5 is zero -- lowering maxcount
warning: count of count 4 is zero -- lowering maxcount
warning: count of count 3 is zero -- lowering maxcount
warning: discount coeff 1 is out of range: 0
/home/vrgroup/Desktop/srilm-1.7.1# vim test
# The contents of the test file are shown below
\data\
ngram 1=63
ngram 2=84
ngram 3=4
\1-grams:
-1.959041 $ -0.2228022
-1.959041 $Header: -0.2962311
-1.959041 (from -0.2913786
-1.959041 /home/srilm/CVS/srilm/README,v -0.2962311
-1.959041 1.9 -0.2962311
-1.959041 19:35:49 -0.2962311
-1.959041 2009/12/02 -0.2962311
-0.7829502 </s>
-99 <s> -1.125891
-1.959041 C -0.2815078
-1.658011 C++ -0.4674698
-1.959041 DECIPHER(TM) -0.2962311
-1.959041 Exp -0.2962311
-1.959041 INSTALL -0.2962311
-1.959041 LM -0.2864712
-1.959041 SRI -0.2962311
-1.959041 See -0.2962311
-1.959041 Subdirectories -0.2962311
-1.356981 and -0.6693475
-1.959041 bin/ -0.2864712
-1.959041 build -0.2962311
-1.959041 common/ -0.2962311
-1.959041 convenience -0.2815078
-1.959041 data -0.2962311
-1.959041 doc/ -0.2962311
-1.959041 documentation -0.2228022
-1.959041 dstruct/ -0.2913786
-1.959041 factored -0.2913786
-1.959041 files -0.2228022
-1.959041 flm/ -0.2962311
-1.959041 for -0.2962311
-1.959041 header -0.2962311
-1.959041 include/ -0.2864712
-1.959041 instructions. -0.2228022
-1.658011 language -0.4674699
-1.959041 lattice -0.2815078
-1.959041 lattice/ -0.2962311
-1.959041 lib/ -0.2864712
-1.959041 libraries -0.2228022
-1.356981 library -0.5972611
-1.959041 lm/ -0.2913786
-1.959041 makefiles -0.2962311
-1.959041 man -0.2962311
-1.959041 man/ -0.2962311
-1.959041 misc/ -0.2913786
-1.658011 miscellaneous -0.4674698
-1.658011 model -0.4575991
-1.959041 of -0.2962311
-1.959041 pages -0.2228022
-1.959041 programs -0.2228022
-1.48192 released -0.5875012
-1.959041 scripts -0.2962311
-1.959041 shared -0.2962311
-1.959041 srilm -0.2228022
-1.959041 stolcke -0.2962311
-1.959041 structures -0.2228022
-1.959041 system) -0.2228022
-1.658011 the -0.4674698
-1.959041 tool -0.2228022
-1.48192 tools -0.5238322
-1.959041 using -0.2913786
-1.959041 utility -0.2962311
-1.959041 utils/ -0.2913786
\2-grams:
-0.30103 $ </s>
-0.30103 $Header: /home/srilm/CVS/srilm/README,v
-0.30103 (from the
-0.30103 /home/srilm/CVS/srilm/README,v 1.9
-0.30103 1.9 2009/12/02
-0.30103 19:35:49 stolcke
-0.30103 2009/12/02 19:35:49
-1.20412 <s> $Header:
-1.20412 <s> See
-1.20412 <s> Subdirectories
-1.20412 <s> bin/
-1.20412 <s> common/
-1.20412 <s> doc/
-1.20412 <s> dstruct/
-1.20412 <s> flm/
-1.20412 <s> include/
-1.20412 <s> lattice/
-1.20412 <s> lib/
-1.20412 <s> lm/
-1.20412 <s> man/
-1.20412 <s> misc/
-1.20412 <s> utils/
-0.30103 C and
-0.4771213 C++ convenience
-0.4771213 C++ data
-0.30103 DECIPHER(TM) system)
-0.30103 Exp $
-0.30103 INSTALL for
-0.30103 LM tools
-0.30103 SRI DECIPHER(TM)
-0.30103 See INSTALL
-0.30103 Subdirectories of
-0.69897 and C++
-0.69897 and tool
-0.39794 and tools 0.1249388
-0.30103 bin/ released
-0.30103 build instructions.
-0.30103 common/ shared
-0.30103 convenience library
-0.30103 data structures
-0.30103 doc/ documentation
-0.30103 documentation </s>
-0.30103 dstruct/ C++
-0.30103 factored language
-0.30103 files </s>
-0.30103 flm/ factored
-0.30103 for build
-0.30103 header files
-0.30103 include/ released
-0.30103 instructions. </s>
-0.1760913 language model 0
-0.30103 lattice library
-0.30103 lattice/ lattice
-0.30103 lib/ released
-0.30103 libraries </s>
-0.69897 library </s>
-0.2218488 library and -0.2552725
-0.30103 lm/ language
-0.30103 makefiles (from
-0.30103 man pages
-0.30103 man/ man
-0.30103 misc/ miscellaneous
-0.4771213 miscellaneous C
-0.4771213 miscellaneous utility
-0.1760913 model library -0.07918125
-0.30103 of srilm
-0.30103 pages </s>
-0.30103 programs </s>
-0.60206 released header
-0.60206 released libraries
-0.60206 released programs
-0.30103 scripts using
-0.30103 shared makefiles
-0.30103 srilm </s>
-0.30103 stolcke Exp
-0.30103 structures </s>
-0.30103 system) </s>
-0.4771213 the LM
-0.4771213 the SRI
-0.30103 tool </s>
-0.1249387 tools </s>
-0.30103 using the
-0.30103 utility scripts
-0.30103 utils/ miscellaneous
\3-grams:
-0.1760913 library and tools
-0.1760913 model library and
-0.1760913 language model library
-0.1760913 and tools </s>
\end\
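This is a trigram language model file in ARPA-MIT LM format. On each line, the first floating-point number is the log10 probability of the n-gram, and the floating-point number after the n-gram, when present, is its log10 back-off weight. Back-off weights are used to assign smoothed probabilities to n-grams that are not listed in the model. The probability is computed as follows: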
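- P is the probability that the word a follows the context c b, i.e. the probability of the sequence (c b a)
- if the trigram (c b a) is listed in the model: P = p(a|c b)
- otherwise: P = backoff(c b) * p(a|b)
- if the bigram (b a) is not listed either: p(a|b) = backoff(b) * p(a)
- a back-off weight that is not listed counts as 1 (0 in log10); since the file stores log10 values, the products above become sums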
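The back-off rule above can be sketched in a few lines of Python. This is a minimal illustration rather than SRILM code: parse_arpa and log10_prob are hypothetical helper names, and SAMPLE_ARPA is a small excerpt of the model listed above (its header counts are adjusted to the excerpt, so the probabilities no longer sum to one).

```python
# Excerpt of the ARPA file above; header counts adjusted for the excerpt.
SAMPLE_ARPA = r"""\data\
ngram 1=5
ngram 2=3
ngram 3=1

\1-grams:
-1.658011 C++ -0.4674698
-1.356981 and -0.6693475
-1.356981 library -0.5972611
-1.658011 the -0.4674698
-1.48192 tools -0.5238322

\2-grams:
-0.69897 and C++
-0.39794 and tools 0.1249388
-0.2218488 library and -0.2552725

\3-grams:
-0.1760913 library and tools

\end\
"""


def parse_arpa(text):
    """Return {n-gram tuple: (log10 probability, log10 back-off weight)}."""
    lm = {}
    order = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])  # "\2-grams:" -> 2
            continue
        fields = line.split()
        words = tuple(fields[1:1 + order])
        # A missing back-off weight means log10 weight 0 (weight 1).
        bow = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        lm[words] = (float(fields[0]), bow)
    return lm


def log10_prob(lm, word, context):
    """log10 P(word | context), backing off to shorter contexts as needed."""
    context = tuple(context)
    ngram = context + (word,)
    if ngram in lm:
        return lm[ngram][0]
    if not context:
        return float("-inf")  # word is not in the vocabulary at all
    bow = lm.get(context, (0.0, 0.0))[1]
    # Drop the oldest context word and add the back-off weight (log domain).
    return bow + log10_prob(lm, word, context[1:])


if __name__ == "__main__":
    lm = parse_arpa(SAMPLE_ARPA)
    print(log10_prob(lm, "tools", ("library", "and")))  # trigram is listed
    print(log10_prob(lm, "C++", ("library", "and")))    # backs off to the bigram
    print(log10_prob(lm, "the", ("library", "and")))    # backs off to the unigram
```

For "library and C++" the trigram is missing, so the result is backoff(library and) + p(C++|and) in log10. For "library and the" the bigram is missing as well, so backoff(and) and the unigram p(the) are added on top.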
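Training a Chinese language model
Chinese text must be word-segmented before n-gram counts can be collected to build a language model. The HTokenize tool can be used for segmentation. The segmented result looks like this: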
/home/vrgroup/Desktop/srilm-1.7.1# iconv -f GBK -t utf-8 train.txt
请 显示 需要 优先 处理 的 事项
请 标识 优先 处理 的 位置
请 进入 乙烯 装置 模型 定位 报警 设备 有 多处 报警
显示 设备 当前 信息 参数
切 换 安全 管 控 界面
请 给 出 复合 报警 的 相关 点 点 位 趋势 图 息 界面
请 调 出 实时 监控 画面
请 查看 复合 报警 相应 模型 图
结合 知识 库 请 分析 故障 原因 并且 给 出 处理 措施
请 进行 相应 处理 生成 故障 处理 报告 归 档 到 知识 库
请 处理 裂解 炉 燃料 油 改 燃料 气 工艺 方案 切 换
请 先 切 换 到 方案 切 换 模拟 状态
请 显示 工艺 方案 指导 规程
显示 关键 点 位 及 指标 趋势 图
观察 指标 趋势
切 换 模拟 完成
紧急 事项 处理 完成
语音 助手 主动 汇报 事故 相关 信息
了解 外 操 情况
外 操 人员 携带 防护 设备 进入 现场 勘查
调用 北 美 应急 手册
通知 人员 撤离 去 往 紧急 集合 地点
计算 消防 路线
报告 消防 支队 气 防 炼 化 医院
通知 其他 岗位 人员 增援
对 清 污 分 流 阀门 进行 确认
打开 消防 炮 消防 栓
计算 模拟 气体 扩散 影响 结果
紧急 停车
要求 上游 装置 将 酸性 气体 改 至 运行 的 其他 硫磺 收集 装置 或 气 柜
计算 气体 扩散
结果 报告 政府 部门 报告 指挥 中心
根据 堵 漏 方案 对 分 液 灌 泄 压 置换
对 现场 气体 进行 检测
形成 事故 报告
/home/vrgroup/Desktop/srilm-1.7.1#
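The post does not show how HTokenize is invoked, so the sketch below illustrates the segmentation step with a different, very simple technique: dictionary-based forward maximum matching. The segment function and the tiny VOCAB set are assumptions made up for this example. Characters not covered by the vocabulary come out as single-character tokens, similar to "切 换" in the listing above.

```python
def segment(sentence, vocab, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary
    word (up to max_len characters), falling back to a single character."""
    words = []
    i = 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + n]
            if n == 1 or piece in vocab:
                words.append(piece)
                i += n
                break
    return words


# Hypothetical vocabulary; a real one would hold many thousands of words.
VOCAB = {"显示", "设备", "当前", "信息", "参数", "模拟", "完成"}

if __name__ == "__main__":
    for raw in ["显示设备当前信息参数", "切换模拟完成"]:
        # ngram-count expects one sentence per line, words separated by spaces.
        print(" ".join(segment(raw, VOCAB)))
```

Joining the tokens with spaces gives the one-sentence-per-line format that ngram-count reads with -text.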
The same command produces the Chinese language model:
root@vrgroup-Precision-M6800:/home/vrgroup/Desktop/srilm-1.7.1# bin/i686-m64/ngram-count -text train.txt -lm train
warning: discount coeff 1 is out of range: 0
warning: count of count 8 is zero -- lowering maxcount
warning: count of count 7 is zero -- lowering maxcount
warning: count of count 6 is zero -- lowering maxcount
warning: discount coeff 3 is out of range: -0.0228311
warning: count of count 4 is zero
warning: count of count 8 is zero -- lowering maxcount
warning: count of count 7 is zero -- lowering maxcount
warning: count of count 6 is zero -- lowering maxcount
warning: count of count 5 is zero -- lowering maxcount
warning: count of count 4 is zero -- lowering maxcount
warning: count of count 3 is zero -- lowering maxcount
warning: discount coeff 1 is out of range: 0
root@vrgroup-Precision-M6800:/home/vrgroup/Desktop/srilm-1.7.1#
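Using the language model in HDecode
A trained model must be gzip-compressed before HDecode can use it (note that gzip itself names its output train.gz). Setting HDecode's language model option -w to train, as in the command below, lets it recognize the target language and improves recognition accuracy in that specific domain: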
/home/vrgroup/Desktop/srilm-1.7.1# gzip train
./RedisToHDecode -A -D -V -T 1 -C hdecode.hlda.cfg -H S2.hlda.MMF -y rec -t 250.0 250.0 -u 3500 -v 125.0 -s 12.0 -p -10.0 -w train -i out 64k.decode.dct xwrd.clustered.mlist test.plp
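The gzip command above replaces train with train.gz. If you would rather keep the uncompressed model as well, the compression step can be sketched with Python's standard gzip module. The helper name gzip_copy is hypothetical, and the demo runs on a throwaway temporary file rather than a real model.

```python
import gzip
import os
import shutil
import tempfile


def gzip_copy(src_path, dst_path=None):
    """Write a gzip-compressed copy of src_path and keep the original file."""
    if dst_path is None:
        dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


if __name__ == "__main__":
    # Demo on a temporary stand-in for the model file.
    with tempfile.TemporaryDirectory() as tmp:
        model = os.path.join(tmp, "train")
        with open(model, "w", encoding="utf-8") as f:
            f.write("\\data\\\nngram 1=0\n\n\\1-grams:\n\n\\end\\\n")
        packed = gzip_copy(model)
        print(os.path.basename(packed), os.path.exists(model))
```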