聊一聊双十一背后的技术 毫秒分词算啥, 试试正则和相似度

作者: digoal

日期: 2016-11-20

标签: PostgreSQL , 分词 , 图库搜索 , 近似度搜索, 搜索引擎 , 双十一 ,


背景

看刑侦剧经常有看到人物拼图,然后到图库搜索的,以前可能靠的是人肉,今天,使用PG,可以靠数据库的图形近似度搜索功能。 《弱水三千,只取一瓢,当图像搜索遇见PostgreSQL (Haar wavelet)

但是我今天不是写图形搜索,我要写的是文本搜索,文本搜索,大家一定会想到分词。千万不要以为分词可以搞定一切需求,比如这样的需求就搞不定。

hello world打成了hello word或者hello w0rld,你要让数据库匹配出来,怎么搞?

又或者你的业务需要写正则进行匹配,怎么搞?比如一些域名的查询,错误! 超链接引用无效

甚至更变态的 fi[a-z]{1}e.*?.?? ,这样的查询。

数据量小,并发小时,这种查询是可以忍受全表扫描和CPU处理过滤的。

但是想想一下,你是一个日请求过亿的业务,或者数据量亿级别的,全表扫描和CPU的开销会让你疯掉的。

那么来看看PostgreSQL是如何做到的吧。PG就像机器猫的百宝箱,里面什么都能找到,一起来召唤。

PostgreSQL 如何支持正则?

在pg_trgm的代码注释中有详细的介绍,分为4个步骤。

1. 使用PostgreSQL regexp库,将正则转换为NFA样式(图形化词组)。

2. 将NFA样式再进行转换,转换为扩展的图形样式(trigrams),包括拆分后的查询词组与NOT词组。

3. 简化,过滤不必要的trigrams。

4. 打包为TrgmPackedGraph结构,支持GIN,GIST索引的检索。

正则查询、模糊查询例子

新建一张测试表,其中包含一个字符串类型字段,插入5000万随机字符串,要来就来大数据量,数据库可不是玩具。

postgres=# create extension pg_trgm; 
postgres=# create table tbl(id int primary key, info text, crt_time timestamp); 
postgres=# insert into tbl select id,md5(random()::text), now() from generate_series(1,50000000) t(id); 

创建支持正则的索引

postgres=# CREATE INDEX trgm_idx_tbl_gin ON tbl USING GIN (info gin_trgm_ops);  

正则查询例子,刚给一位搞其他数据库产品的同学演示过,被查询速度惊吓到,直接给跪了。

postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10;
    id    |               info               |          crt_time          
----------+----------------------------------+----------------------------
  4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919
  7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919
 11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919
(3 rows)
Time: 142.087 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) 
                   select * from tbl where info ~ '53?6b.*8823a' limit 10;
                             QUERY PLAN        
-------------------------------------------------------------------------------------------
 Limit  (cost=527.75..537.82 rows=10 width=45) (actual time=140.867..140.902 rows=3 loops=1)
   Output: id, info, crt_time
   Buffers: shared hit=911
   ->  Bitmap Heap Scan on public.tbl  (cost=527.75..5564.25 rows=5000 width=45) 
                                       (actual time=140.847..140.881 rows=3 loops=1)
         Output: id, info, crt_time
         Recheck Cond: (tbl.info ~ '53?6b.*8823a'::text)
         Rows Removed by Index Recheck: 5
         Heap Blocks: exact=8
         Buffers: shared hit=911
         ->  Bitmap Index Scan on trgm_idx_tbl  (cost=0.00..526.50 rows=5000 width=0) 
                                                (actual time=140.832..140.832 rows=8 loops=1)
               Index Cond: (tbl.info ~ '53?6b.*8823a'::text)
               Buffers: shared hit=903
 Planning time: 0.389 ms
 Execution time: 140.928 ms
(14 rows)


postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10;
 id | info | crt_time 
----+------+----------
(0 rows)
Time: 0.974 ms
postgres=# explain (analyze,verbose,timing,costs,buffers) 
                              select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10;
                                         QUERY PLAN 
-------------------------------------------------------------------------------------------
 Limit  (cost=852.75..862.82 rows=10 width=45) (actual time=0.451..0.451 rows=0 loops=1)
   Output: id, info, crt_time
   Buffers: shared hit=35
   ->  Bitmap Heap Scan on public.tbl  (cost=852.75..5889.25 rows=5000 width=45) 
                                       (actual time=0.449..0.449 rows=0 loops=1)
         Output: id, info, crt_time
         Recheck Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text)
         Buffers: shared hit=35
         ->  Bitmap Index Scan on trgm_idx_tbl  (cost=0.00..851.50 rows=5000 width=0) 
                                                (actual time=0.447..0.447 rows=0 loops=1)
               Index Cond: (tbl.info ~ 'hello.*[a-f]{1}abc'::text)
               Buffers: shared hit=35
 Planning time: 0.366 ms
 Execution time: 0.479 ms
(12 rows)

模糊查询例子(其实正则已经包含了模糊查询,这里只是再展示一下)

postgres=# select * from tbl where info ~ '821b8b92' limit 10;
 id |               info               |          crt_time          
----+----------------------------------+----------------------------
  4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919
(1 row)
Time: 141.353 ms

postgres=# explain (analyze,verbose,timing,costs,buffers) 
                   select * from tbl where info ~ '821b8b92' limit 10;
                   QUERY PLAN        
------------------------------------------------------------------------------------------------------
 Limit  (cost=527.75..537.82 rows=10 width=45) (actual time=140.293..140.299 rows=1 loops=1)
   Output: id, info, crt_time
   Buffers: shared hit=904
   ->  Bitmap Heap Scan on public.tbl 
                  (cost=527.75..5564.25 rows=5000 width=45) (actual time=140.291..140.297 rows=1 loops=1)
         Output: id, info, crt_time
         Recheck Cond: (tbl.info ~ '821b8b92'::text)
         Rows Removed by Index Recheck: 1
         Heap Blocks: exact=2
         Buffers: shared hit=904
         ->  Bitmap Index Scan on trgm_idx_tbl  (cost=0.00..526.50 rows=5000 width=0) 
                                                (actual time=140.279..140.279 rows=2 loops=1)
               Index Cond: (tbl.info ~ '821b8b92'::text)
               Buffers: shared hit=902
 Planning time: 0.331 ms
 Execution time: 140.323 ms
(14 rows)

以上例子若全表扫描的耗时(本例未使用PG支持的CPU并行计算)

postgres=# select * from tbl where info ~ '53?6b.*8823a' limit 10;
    id    |               info               |          crt_time          
----------+----------------------------------+----------------------------
  4587797 | 0b1e56bd258823a9f9338919617651e5 | 2016-11-18 14:43:25.726919
  7269097 | 9d5a7aace4bb56b29fe54cd98823a8ec | 2016-11-18 14:43:25.726919
 11589980 | 3536b69b29b607348823a675cf4842c3 | 2016-11-18 14:43:25.726919
(3 rows)
Time: 93489.353 ms

postgres=# select * from tbl where info ~ 'hello.*[a-f]{1}abc' limit 10;
 id | info | crt_time 
----+------+----------
(0 rows)
Time: 78246.122 ms

postgres=# select * from tbl where info ~ '821b8b92' limit 10;
 id |               info               |          crt_time          
----+----------------------------------+----------------------------
  4 | c5132821b8b92ba36487d06b438d8aed | 2016-11-18 14:43:25.726919
(1 row)
Time: 82218.218 ms



/images/news/2016/pg_bot_banner.jpg
请在登录后发表评论,否则无法保存。
1楼 xiaowu
2024-04-22 09:28:30+08

450字作文:https://www.nanss.com/xuexi/2057.html 花带来的美好心情说说:https://www.nanss.com/wenan/2261.html 观察日记:https://www.nanss.com/xuexi/2483.html 认清朋友后心寒的句子:https://www.nanss.com/yulu/2275.html 清明节作文:https://www.nanss.com/xuexi/2476.html 感谢母亲的话:https://www.nanss.com/yulu/2360.html 读朝花夕拾有感:https://www.nanss.com/xuexi/2148.html 奖项名称:https://www.nanss.com/shenghuo/2052.html 出门溜达幽默说说:https://www.nanss.com/wenan/2032.html 分手文案短句干净治愈:https://www.nanss.com/wenan/2468.html 运动会入场口号:https://www.nanss.com/xuexi/2455.html 青年志愿者活动方案:https://www.nanss.com/gongzuo/2100.html 男人下厨的幽默句子:https://www.nanss.com/wenan/2395.html 让人惹不起的网名:https://www.nanss.com/mingcheng/2351.html 吊炸天的群名:https://www.nanss.com/mingcheng/2412.html 婚姻选错痛苦一生的说说:https://www.nanss.com/wenan/1883.html 收获满满的优美句子:https://www.nanss.com/yulu/1758.html 格列佛游记读后感:https://www.nanss.com/xuexi/2134.html 江湖再见霸气离别句子:https://www.nanss.com/wenan/1788.html 雪国读后感:https://www.nanss.com/xuexi/2149.html 讽刺男人爱喝酒的句子:https://www.nanss.com/yulu/1945.html 春天来了作文:https://www.nanss.com/xuexi/2179.html 爱女儿的霸气说说:https://www.nanss.com/wenan/2203.html 企业年度工作总结:https://www.nanss.com/gongzuo/2086.html 低碳生活的建议:https://www.nanss.com/shenghuo/2444.html 抖音唯美昵称:https://www.nanss.com/mingcheng/2005.html 关于大自然的文章:https://www.nanss.com/yuedu/2061.html 桥的作文:https://www.nanss.com/xuexi/2167.html 治愈系温柔短句:https://www.nanss.com/wenan/2258.html 不值得的人就该放弃说说:https://www.nanss.com/wenan/2080.html

© 2010 PostgreSQL中文社区