聊一聊双十一背后的技术 毫秒分词算啥, 试试正则和相似度 原作者:digoal/德哥 创作时间:2016-11-21 17:36:26+08 |
doudou586 发布于2016-11-21 18:36:26 评论: 1 浏览: 10100 顶: 3020 踩: 3413 |
看刑侦剧经常有看到人物拼图,然后到图库搜索的,以前可能靠的是人肉,今天,使用PG,可以靠数据库的图形近似度搜索功能。 《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL (Haar wavelet)》
但是我今天不是写图形搜索,我要写的是文本搜索,文本搜索,大家一定会想到分词。千万不要以为分词可以搞定一切需求,比如这样的需求就搞不定。
hello world打成了hello word或者hello w0rld,你要让数据库匹配出来,怎么搞?
又或者你的业务需要写正则进行匹配,怎么搞?比如一些域名的查询,错误! 超链接引用无效。
甚至更变态的 fi[a-z]{1}e.*?.?? ,这样的查询。
数据量小,并发小时,这种查询是可以忍受全表扫描和CPU处理过滤的。
但是想想一下,你是一个日请求过亿的业务,或者数据量亿级别的,全表扫描和CPU的开销会让你疯掉的。
那么来看看PostgreSQL是如何做到的吧。PG就像机器猫的百宝箱,里面什么都能找到,一起来召唤。
在pg_trgm的代码注释中有详细的介绍,分为4个步骤。
1. 使用PostgreSQL regexp库,将正则转换为NFA样式(图形化词组)。
2. 将NFA样式再进行转换,转换为扩展的图形样式(trigrams),包括拆分后的查询词组与NOT词组。
3. 简化,过滤不必要的trigrams。
4. 打包为TrgmPackedGraph结构,支持GIN,GIST索引的检索。
新建一张测试表,其中包含一个字符串类型字段,插入5000万随机字符串,要来就来大数据量,数据库可不是玩具。
postgres=# create extension pg_trgm; postgres=# create table tbl(id int primary key, info text, crt_time timestamp); postgres=# insert into tbl select id,md5(random()::text), now() from generate_series(1,50000000) t(id);
创建支持正则的索引
postgres=# CREATE INDEX trgm_idx_tbl_gin ON tbl USING GIN (info gin_trgm_ops);
正则查询例子,刚给一位搞其他数据库产品的同学演示过,被查询速度惊吓到,直接给跪了。
postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10; id | info | crt_time ----------+----------------------------------+---------------------------- 4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919 7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919 11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919 (3 rows) Time: 142.087 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info ~ '53?6b.*8823a' limit 10; QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=527.75..537.82 rows=10 width=45) (actual time=140.867..140.902 rows=3 loops=1) Output: id, info, crt_time Buffers: shared hit=911 -> Bitmap Heap Scan on public.tbl (cost=527.75..5564.25 rows=5000 width=45) (actual time=140.847..140.881 rows=3 loops=1) Output: id, info, crt_time Recheck Cond: (tbl.info ~ '53?6b.*8823a'::text) Rows Removed by Index Recheck: 5 Heap Blocks: exact=8 Buffers: shared hit=911 -> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..526.50 rows=5000 width=0) (actual time=140.832..140.832 rows=8 loops=1) Index Cond: (tbl.info ~ '53?6b.*8823a'::text) Buffers: shared hit=903 Planning time: 0.389 ms Execution time: 140.928 ms (14 rows) postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10; id | info | crt_time ----+------+---------- (0 rows) Time: 0.974 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10; QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=852.75..862.82 rows=10 width=45) (actual time=0.451..0.451 rows=0 loops=1) Output: id, info, crt_time Buffers: shared hit=35 -> Bitmap Heap Scan on public.tbl (cost=852.75..5889.25 rows=5000 width=45) (actual time=0.449..0.449 rows=0 loops=1) Output: id, info, crt_time Recheck Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text) Buffers: shared hit=35 -> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..851.50 rows=5000 width=0) (actual time=0.447..0.447 rows=0 loops=1) Index Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text) Buffers: shared hit=35 Planning time: 0.366 ms Execution time: 0.479 ms (12 rows)
模糊查询例子(其实正则已经包含了模糊查询,这里只是再展示一下)
postgres=# select * from tbl where info ~ '821b8b92' limit 10; id | info | crt_time ----+----------------------------------+---------------------------- 4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919 (1 row) Time: 141.353 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info ~ '821b8b92' limit 10; QUERY PLAN ------------------------------------------------------------------------------------------------------ Limit (cost=527.75..537.82 rows=10 width=45) (actual time=140.293..140.299 rows=1 loops=1) Output: id, info, crt_time Buffers: shared hit=904 -> Bitmap Heap Scan on public.tbl (cost=527.75..5564.25 rows=5000 width=45) (actual time=140.291..140.297 rows=1 loops=1) Output: id, info, crt_time Recheck Cond: (tbl.info ~ '821b8b92'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=2 Buffers: shared hit=904 -> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..526.50 rows=5000 width=0) (actual time=140.279..140.279 rows=2 loops=1) Index Cond: (tbl.info ~ '821b8b92'::text) Buffers: shared hit=902 Planning time: 0.331 ms Execution time: 140.323 ms (14 rows)
以上例子若全表扫描的耗时(本例未使用PG支持的CPU并行计算)
postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10; id | info | crt_time ----------+----------------------------------+---------------------------- 4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919 7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919 11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919 (3 rows) Time: 93489.353 ms postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10; id | info | crt_time ----+------+---------- (0 rows) Time: 78246.122 ms postgres=# select * from tbl where info ~ '821b8b92' limit 10; id | info | crt_time ----+----------------------------------+---------------------------- 4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919 (1 row) Time: 82218.218 ms