聊一聊双十一背后的技术 毫秒分词算啥, 试试正则和相似度
作者: digoal
日期: 2016-11-20
标签: PostgreSQL , 分词 , 图库搜索 , 近似度搜索, 搜索引擎 , 双十一 ,
背景
看刑侦剧经常有看到人物拼图,然后到图库搜索的,以前可能靠的是人肉,今天,使用PG,可以靠数据库的图形近似度搜索功能。 《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL (Haar wavelet)》
但是我今天不是写图形搜索,我要写的是文本搜索,文本搜索,大家一定会想到分词。千万不要以为分词可以搞定一切需求,比如这样的需求就搞不定。
hello world打成了hello word或者hello w0rld,你要让数据库匹配出来,怎么搞?
又或者你的业务需要写正则进行匹配,怎么搞?比如一些域名的查询,错误! 超链接引用无效。
甚至更变态的 fi[a-z]{1}e.*?.?? ,这样的查询。
数据量小,并发小时,这种查询是可以忍受全表扫描和CPU处理过滤的。
但是想想一下,你是一个日请求过亿的业务,或者数据量亿级别的,全表扫描和CPU的开销会让你疯掉的。
那么来看看PostgreSQL是如何做到的吧。PG就像机器猫的百宝箱,里面什么都能找到,一起来召唤。
PostgreSQL 如何支持正则?
在pg_trgm的代码注释中有详细的介绍,分为4个步骤。
1. 使用PostgreSQL regexp库,将正则转换为NFA样式(图形化词组)。
2. 将NFA样式再进行转换,转换为扩展的图形样式(trigrams),包括拆分后的查询词组与NOT词组。
3. 简化,过滤不必要的trigrams。
4. 打包为TrgmPackedGraph结构,支持GIN,GIST索引的检索。
正则查询、模糊查询例子
新建一张测试表,其中包含一个字符串类型字段,插入5000万随机字符串,要来就来大数据量,数据库可不是玩具。
postgres=# create extension pg_trgm; postgres=# create table tbl(id int primary key, info text, crt_time timestamp); postgres=# insert into tbl select id,md5(random()::text), now() from generate_series(1,50000000) t(id);
创建支持正则的索引
postgres=# CREATE INDEX trgm_idx_tbl_gin ON tbl USING GIN (info gin_trgm_ops);
正则查询例子,刚给一位搞其他数据库产品的同学演示过,被查询速度惊吓到,直接给跪了。
postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10; id | info | crt_time ----------+----------------------------------+---------------------------- 4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919 7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919 11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919 (3 rows) Time: 142.087 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info ~ '53?6b.*8823a' limit 10; QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=527.75..537.82 rows=10 width=45) (actual time=140.867..140.902 rows=3 loops=1) Output: id, info, crt_time Buffers: shared hit=911 -> Bitmap Heap Scan on public.tbl (cost=527.75..5564.25 rows=5000 width=45) (actual time=140.847..140.881 rows=3 loops=1) Output: id, info, crt_time Recheck Cond: (tbl.info ~ '53?6b.*8823a'::text) Rows Removed by Index Recheck: 5 Heap Blocks: exact=8 Buffers: shared hit=911 -> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..526.50 rows=5000 width=0) (actual time=140.832..140.832 rows=8 loops=1) Index Cond: (tbl.info ~ '53?6b.*8823a'::text) Buffers: shared hit=903 Planning time: 0.389 ms Execution time: 140.928 ms (14 rows) postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10; id | info | crt_time ----+------+---------- (0 rows) Time: 0.974 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10; QUERY PLAN ------------------------------------------------------------------------------------------- Limit (cost=852.75..862.82 rows=10 width=45) (actual time=0.451..0.451 rows=0 loops=1) Output: id, info, crt_time Buffers: shared hit=35 -> Bitmap Heap Scan on public.tbl (cost=852.75..5889.25 rows=5000 width=45) (actual time=0.449..0.449 rows=0 loops=1) Output: id, info, crt_time Recheck Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text) Buffers: shared hit=35 -> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..851.50 rows=5000 width=0) (actual time=0.447..0.447 rows=0 loops=1) Index Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text) Buffers: shared hit=35 Planning time: 0.366 ms Execution time: 0.479 ms (12 rows)
模糊查询例子(其实正则已经包含了模糊查询,这里只是再展示一下)
postgres=# select * from tbl where info ~ '821b8b92' limit 10; id | info | crt_time ----+----------------------------------+---------------------------- 4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919 (1 row) Time: 141.353 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where info ~ '821b8b92' limit 10; QUERY PLAN ------------------------------------------------------------------------------------------------------ Limit (cost=527.75..537.82 rows=10 width=45) (actual time=140.293..140.299 rows=1 loops=1) Output: id, info, crt_time Buffers: shared hit=904 -> Bitmap Heap Scan on public.tbl (cost=527.75..5564.25 rows=5000 width=45) (actual time=140.291..140.297 rows=1 loops=1) Output: id, info, crt_time Recheck Cond: (tbl.info ~ '821b8b92'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=2 Buffers: shared hit=904 -> Bitmap Index Scan on trgm_idx_tbl (cost=0.00..526.50 rows=5000 width=0) (actual time=140.279..140.279 rows=2 loops=1) Index Cond: (tbl.info ~ '821b8b92'::text) Buffers: shared hit=902 Planning time: 0.331 ms Execution time: 140.323 ms (14 rows)
以上例子若全表扫描的耗时(本例未使用PG支持的CPU并行计算)
postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10; id | info | crt_time ----------+----------------------------------+---------------------------- 4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919 7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919 11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919 (3 rows) Time: 93489.353 ms postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10; id | info | crt_time ----+------+---------- (0 rows) Time: 78246.122 ms postgres=# select * from tbl where info ~ '821b8b92' limit 10; id | info | crt_time ----+----------------------------------+---------------------------- 4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919 (1 row) Time: 82218.218 ms
450字作文:https://www.nanss.com/xuexi/2057.html 花带来的美好心情说说:https://www.nanss.com/wenan/2261.html 观察日记:https://www.nanss.com/xuexi/2483.html 认清朋友后心寒的句子:https://www.nanss.com/yulu/2275.html 清明节作文:https://www.nanss.com/xuexi/2476.html 感谢母亲的话:https://www.nanss.com/yulu/2360.html 读朝花夕拾有感:https://www.nanss.com/xuexi/2148.html 奖项名称:https://www.nanss.com/shenghuo/2052.html 出门溜达幽默说说:https://www.nanss.com/wenan/2032.html 分手文案短句干净治愈:https://www.nanss.com/wenan/2468.html 运动会入场口号:https://www.nanss.com/xuexi/2455.html 青年志愿者活动方案:https://www.nanss.com/gongzuo/2100.html 男人下厨的幽默句子:https://www.nanss.com/wenan/2395.html 让人惹不起的网名:https://www.nanss.com/mingcheng/2351.html 吊炸天的群名:https://www.nanss.com/mingcheng/2412.html 婚姻选错痛苦一生的说说:https://www.nanss.com/wenan/1883.html 收获满满的优美句子:https://www.nanss.com/yulu/1758.html 格列佛游记读后感:https://www.nanss.com/xuexi/2134.html 江湖再见霸气离别句子:https://www.nanss.com/wenan/1788.html 雪国读后感:https://www.nanss.com/xuexi/2149.html 讽刺男人爱喝酒的句子:https://www.nanss.com/yulu/1945.html 春天来了作文:https://www.nanss.com/xuexi/2179.html 爱女儿的霸气说说:https://www.nanss.com/wenan/2203.html 企业年度工作总结:https://www.nanss.com/gongzuo/2086.html 低碳生活的建议:https://www.nanss.com/shenghuo/2444.html 抖音唯美昵称:https://www.nanss.com/mingcheng/2005.html 关于大自然的文章:https://www.nanss.com/yuedu/2061.html 桥的作文:https://www.nanss.com/xuexi/2167.html 治愈系温柔短句:https://www.nanss.com/wenan/2258.html 不值得的人就该放弃说说:https://www.nanss.com/wenan/2080.html