从PostgreSQL支持100万个连接聊起 原作者:digoal / 德哥 创作时间:2016-08-06 23:08:49+08 |
doudou586 发布于2016-08-06 23:08:49 评论: 2 浏览: 10498 顶: 809 踩: 833 |
100万个数据库连接,绝逼疯了,常人怎么会干这种事情。
没错,数据库支持100万个连接意味着什么呢?不能用连接池吗?
除了抱着玩一玩的心态,也能了解到操作系统层的一些知识,何乐不为?
根据我前几天写的《如何度量Kernel Resources for PostgreSQL》,我们可以评估得出,如果要支持100万个数据库客户端连接,操作系统应该如何配置。https://yq.aliyun.com/articles/58690,但是实际上能如愿吗?
以PostgreSQL 9.5为例,100万个连接,需要多少信号量?
SEMMNI >= (max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16
SEMMNS >= ((max_connections + max_worker_processes + autovacuum_max_workers + 5) / 16) * 17 + 其他程序的需求
SEMMSL >= 17
CentOS 6.x x64, 512GB内存。
kernel.sem = 18 2147483647 2147483646 512000000 max number of arrays = 512000000 max semaphores per array = 18 max semaphores system wide = 2147483647 max ops per semop call = 2147483646
以上内核配置,信号量完全满足100万连接需求。
那么数据库能启动吗?
vi postgresql.conf max_connections = 1000000 pg_ctl start
启动失败。
报错来自如下代码:使用semget创建sem失败。src/backend/port/sysv_sema.c
/* 76 * InternalIpcSemaphoreCreate 77 * 78 * Attempt to create a new semaphore set with the specified key. 79 * Will fail (return -1) if such a set already exists. 80 * 81 * If we fail with a failure code other than collision-with-existing-set, 82 * print out an error and abort. Other types of errors suggest nonrecoverable 83 * problems. 84 */ 85 static IpcSemaphoreId 86 InternalIpcSemaphoreCreate(IpcSemaphoreKey semKey, int numSems) 87 { 88 int semId; 89 90 semId = semget(semKey, numSems, IPC_CREAT | IPC_EXCL | IPCProtection); 91 92 if (semId < 0) 93 { 94 int saved_errno = errno; 95 96 /* 97 * Fail quietly if error indicates a collision with existing set. One 98 * would expect EEXIST, given that we said IPC_EXCL, but perhaps we 99 * could get a permission violation instead? Also, EIDRM might occur 100 * if an old set is slated for destruction but not gone yet. 101 */ 102 if (saved_errno == EEXIST || saved_errno == EACCES 103 #ifdef EIDRM 104 || saved_errno == EIDRM 105 #endif 106 ) 107 return -1; 108 109 /* 110 * Else complain and abort 111 */ 112 ereport(FATAL, 113 (errmsg("could not create semaphores: %m"), 114 errdetail("Failed system call was semget(%lu, %d, 0%o).", 115 (unsigned long) semKey, numSems, 116 IPC_CREAT | IPC_EXCL | IPCProtection), 117 (saved_errno == ENOSPC) ? 118 errhint("This error does *not* mean that you have run out of disk space. " 119 "It occurs when either the system limit for the maximum number of " 120 "semaphore sets (SEMMNI), or the system wide maximum number of " 121 "semaphores (SEMMNS), would be exceeded. You need to raise the " 122 "respective kernel parameter. Alternatively, reduce PostgreSQL's " 123 "consumption of semaphores by reducing its max_connections parameter.\n" 124 "The PostgreSQL documentation contains more information about " 125 "configuring your system for PostgreSQL.") : 0)); 126 } 127 128 return semId; 129 }
semget之所以失败,并不是kernel.sem的配置问题,而是操作系统内核的宏限制。
sem的分组数量不能大于semvmx,也就是说,最多能开50多万个连接。
如下: kernels/xxx.x86_64/include/uapi/linux/sem.h
#define SEMMNI 128 /* <= IPCMNI max # of semaphore identifiers */ #define SEMMSL 250 /* <= 8 000 max num of semaphores per id */ #define SEMMNS (SEMMNI*SEMMSL) /* <= INT_MAX max # of semaphores in system */ #define SEMOPM 32 /* <= 1 000 max num of ops per semop call */ #define SEMVMX 32767 /* <= 32767 semaphore maximum value */ #define SEMAEM SEMVMX /* adjust on exit max value */ /* unused */ #define SEMUME SEMOPM /* max num of undo entries per process */ #define SEMMNU SEMMNS /* num of undo structures system wide */ #define SEMMAP SEMMNS /* # of entries in semaphore map */ #define SEMUSZ 20 /* sizeof struct sem_undo */
超过32767个array后会报错
$ ipcs -u ------ Semaphore Status -------- used arrays = 32768 allocated semaphores = 557056 $ pg_ctl start FATAL: could not create semaphores: No space left on device DETAIL: Failed system call was semget(1953769, 17, 03600). HINT: This error does *not* mean that you have run out of disk space. It occurs when either the system limit for the maximum number of semaphore sets (SEMMNI), or the system wide maximum number of semaphores (SEMMNS), would be exceeded. You need to raise the respective kernel parameter. Alternatively, reduce PostgreSQL's consumption of semaphores by reducing its max_connections parameter. The PostgreSQL documentation contains more information about configuring your system for PostgreSQL.
使用ipcs -l也能查看到当前值
semaphore max value = 32767
这个值只能重新编译内核来修改。
在semctl和semop的手册中也能看到,有此说法。
man semctl SEMVMX Maximum value for semval: implementation dependent (32767). man semop SEMVMX Maximum allowable value for semval: implementation dependent (32767). The implementation has no intrinsic limits for the adjust on exit maximum value (SEMAEM), the system wide maximum number of undo structures (SEMMNU) and the per-process maximum number of undo entries system parameters.
100万连接没戏了吗?
当然可以办到。
例如修改SEMVMX,并重新编译内核显然是一条路,但是还有其他路子吗?
我们前面看到,报错的代码是
InternalIpcSemaphoreCreate@src/backend/port/sysv_sema.c
查看对应头文件,发现PG支持几种创建信号量的方式,真的是柳暗花明又一春 :
sysv, posix(named , unamed), win32
对应的头文件源码如下 :src/include/storage/pg_sema.h
/* * PGSemaphoreData and pointer type PGSemaphore are the data structure * representing an individual semaphore. The contents of PGSemaphoreData * vary across implementations and must never be touched by platform- * independent code. PGSemaphoreData structures are always allocated * in shared memory (to support implementations where the data changes during * lock/unlock). * * pg_config.h must define exactly one of the USE_xxx_SEMAPHORES symbols. */ #ifdef USE_NAMED_POSIX_SEMAPHORES #includetypedef sem_t *PGSemaphoreData; #endif #ifdef USE_UNNAMED_POSIX_SEMAPHORES #include typedef sem_t PGSemaphoreData; #endif #ifdef USE_SYSV_SEMAPHORES typedef struct PGSemaphoreData { int semId; /* semaphore set identifier */ int semNum; /* semaphore number within set */ } PGSemaphoreData; #endif #ifdef USE_WIN32_SEMAPHORES typedef HANDLE PGSemaphoreData; #endif
其中posix的named和unamed分别使用如下系统调用posix named 方式创建信号 :
mySem = sem_open(semname, O_CREAT | O_EXCL, (mode_t) IPCProtection, (unsigned) 1);
posix unamed 方式创建信号 :
sem_init(sem, 1, 1)
posix源码如下,注意用到的宏: src/backend/port/posix_sema.c
#ifdef USE_NAMED_POSIX_SEMAPHORES /* * PosixSemaphoreCreate * * Attempt to create a new named semaphore. * * If we fail with a failure code other than collision-with-existing-sema, * print out an error and abort. Other types of errors suggest nonrecoverable * problems. */ static sem_t * PosixSemaphoreCreate(void) { int semKey; char semname[64]; sem_t *mySem; for (;;) { semKey = nextSemKey++; snprintf(semname, sizeof(semname), "/pgsql-%d", semKey); mySem = sem_open(semname, O_CREAT | O_EXCL, (mode_t) IPCProtection, (unsigned) 1); #ifdef SEM_FAILED if (mySem != (sem_t *) SEM_FAILED) break; #else if (mySem != (sem_t *) (-1)) break; #endif /* Loop if error indicates a collision */ if (errno == EEXIST || errno == EACCES || errno == EINTR) continue; /* * Else complain and abort */ elog(FATAL, "sem_open(\"%s\") failed: %m", semname); } /* * Unlink the semaphore immediately, so it can't be accessed externally. * This also ensures that it will go away if we crash. */ sem_unlink(semname); return mySem; } #else /* !USE_NAMED_POSIX_SEMAPHORES */ /* * PosixSemaphoreCreate * * Attempt to create a new unnamed semaphore. */ static void PosixSemaphoreCreate(sem_t * sem) { if (sem_init(sem, 1, 1) < 0) elog(FATAL, "sem_init failed: %m"); } #endif /* USE_NAMED_POSIX_SEMAPHORES */
从src/include/storage/pg_sema.h 可以看到,在pg_config.h中必须有一个指定的USE_xxx_SEMAPHORES symbols。
这个symbol不是直接设置pg_config.h来的,是在configure时设置的,会自动加到pg_config.h 。
Select semaphore implementation type. if test "$PORTNAME" != "win32"; then if test x"$USE_NAMED_POSIX_SEMAPHORES" = x"1" ; then $as_echo "#define USE_NAMED_POSIX_SEMAPHORES 1" >>confdefs.h SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c" else if test x"$USE_UNNAMED_POSIX_SEMAPHORES" = x"1" ; then $as_echo "#define USE_UNNAMED_POSIX_SEMAPHORES 1" >>confdefs.h SEMA_IMPLEMENTATION="src/backend/port/posix_sema.c" else $as_echo "#define USE_SYSV_SEMAPHORES 1" >>confdefs.h SEMA_IMPLEMENTATION="src/backend/port/sysv_sema.c" fi fi else $as_echo "#define USE_WIN32_SEMAPHORES 1" >>confdefs.h SEMA_IMPLEMENTATION="src/backend/port/win32_sema.c" fi
默认使用SYSV,如果要使用其他的sem方法。可以这么做
export USE_UNNAMED_POSIX_SEMAPHORES=1 LIBS=-lpthread ./configure --prefix=/home/digoal/pgsql9.5
记得加-lpthread ,否则报错
/bin/ld: port/pg_sema.o: undefined reference to symbol 'sem_close@@GLIBC_2.2.5' /bin/ld: note: 'sem_close@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line /lib64/libpthread.so.0: could not read symbols: Invalid operation collect2: error: ld returned 1 exit status make[2]: *** [postgres] Error 1 make[2]: Leaving directory `/home/digoal/postgresql-9.5.3/src/backend' make[1]: *** [all-backend-recurse] Error 2 make[1]: Leaving directory `/home/digoal/postgresql-9.5.3/src' make: *** [world-src-recurse] Error 2
通过这些系统调用的Linux编程帮助文档,了解一下posix的信号量管理
man sem_overview man sem_init , unnamed sem man sem_open , named sem EINVAL value was greater than SEM_VALUE_MAX.
可以得知sem_open 也受到semvmx的限制,因此为了在不修改内核的情况下,实现PostgreSQL支持100万个连接,甚至更多。
必须使用USE_UNNAMED_POSIX_SEMAPHORES
使用USE_UNNAMED_POSIX_SEMAPHORES编译
export USE_UNNAMED_POSIX_SEMAPHORES=1 LIBS=-lpthread ./configure --prefix=/home/digoal/pgsql9.5 make world -j 32 make install-world -j 32
修改参数,允许100万个连接
vi postgresql.conf max_connections = 1000000
重启数据库
pg_ctl restart -m fast
pgbench是很好的测试工具,只不过限制了1024个连接,为了支持100万个连接测试,需要修改一下。
代码
vi src/bin/pgbench/pgbench.c #ifdef WIN32 #define FD_SETSIZE 1024 /* set before winsock2.h is included */ #endif /* ! WIN32 */ /* max number of clients allowed */ #ifdef FD_SETSIZE #define MAXCLIENTS (FD_SETSIZE - 10) #else #define MAXCLIENTS 1024 #endif case 'c': benchmarking_option_set = true; nclients = atoi(optarg); if (nclients <= 0) // 改一下这里 || nclients > MAXCLIENTS) { fprintf(stderr, "invalid number of clients: \"%s\"\n", optarg); exit(1); }
测试表
postgres=# create unlogged table test(id int, info text); CREATE TABLE
测试脚本
vi test.sql \setrandom s 1 100 select pg_sleep(:s); insert into test values (1,'test');
开始压测,遇到第一个问题
pgbench -M prepared -n -r -f ./test.sql -c 999900 -j 1 -T 10000 need at least 999903 open files, but system limit is 655360 Reduce number of clients, or use limit/ulimit to increase the system limit.
这个问题还好,是打开文件数受限,改一些限制就可以解决
修改ulimit
vi /etc/security/limits.conf * soft nofile 1048576 * hard nofile 1048576 * soft noproc 10000000 * hard noproc 10000000 * soft memlock unlimited * hard memlock unlimited
修改内核参数
sysctl -w fs.file-max=419430400000
重测,再次遇到问题,原因是pgbench使用了ip地址连接PG,导致pgbench的动态端口耗尽。
pgbench -M prepared -n -r -f ./test.sql -c 999900 -j 1 -T 10000 connection to database "postgres" failed: could not connect to server: Cannot assign requested address Is the server running on host "127.0.0.1" and accepting TCP/IP connections on port 1921? transaction type: Custom query scaling factor: 1 query mode: prepared number of clients: 999900 number of threads: 1 duration: 10000 s number of transactions actually processed: 0
换成unix socket连接即可解决。
pgbench -M prepared -n -r -f ./test.sql -c 999900 -j 1 -T 10000 -h $PGDATA connection to database "postgres" failed: could not fork new process for connection: Cannot allocate memory could not fork new process for connection: Cannot allocate memory transaction type: Custom query scaling factor: 1 query mode: prepared number of clients: 999900 number of threads: 1 duration: 10000 s number of transactions actually processed: 0
不能fork new process,后面跟了个Cannot allocate memory这样的提示,我看了当前的配置
vm.swappiness = 0 vm.overcommit_memory = 0
于是我加了交换分区,同时改了几个参数
dd if=/dev/zero of=./swap1 bs=1024k count=102400 oflag=direct mkswap ./swap1 swapon ./swap1 sysctl -w vm.overcommit_memory=1 (always over commit) sysctl -w vm.swappiness=1
重新测试,发现还是有问题
pgbench -M prepared -n -f ./test.sql -c 999900 -j 1 -T 10000 -h $PGDATA could not fork new process for connection: Cannot allocate memory
使用以下手段观测,发现在约连接到 65535 时报错
sar -r 1 10000 psql select count(*) from pg_stat_activity; \watch 1
找到了根源,是内核限制了
kernel.pid_max=65535
修改一下这个内核参数
sysctl -w kernel.pid_max=4096000
重新测试
pgbench -M prepared -n -f ./test.sql -c 999900 -j 1 -T 10000 -h $PGDATA
继续观测
psql select count(*) from pg_stat_activity; \watch 1 sar -r 1 10000
连接到26万时,内存用了约330GB,每个连接1MB左右。
看起来应该没有问题了,只要内存足够是可以搞定100万连接的。
为了让PostgreSQL支持100万个并发连接,除了资源(主要是内存)要给足。 数据库本身编译也需要注意,还需要操作系统内核也需要一些调整。
export USE_UNNAMED_POSIX_SEMAPHORES=1 LIBS=-lpthread ./configure --prefix=/home/digoal/pgsql9.5 make world -j 32 make install-world -j 32
vi /etc/security/limits.conf * soft nofile 1048576 * hard nofile 1048576 * soft noproc 10000000 * hard noproc 10000000
sysctl -w fs.file-max=419430400000
dd if=/dev/zero of=./swap1 bs=1024k count=102400 oflag=direct mkswap ./swap1 swapon ./swap1 . . sysctl -w vm.overcommit_memory=0 sysctl -w vm.swappiness=1
[67504.841109] Memory cgroup out of memory: Kill process 385438 (pidof) score 721 or sacrifice child [67504.850566] Killed process 385438, UID 0, (pidof) total-vm:982240kB, anon-rss:978356kB, file-rss:544kB [67517.496404] pidof invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0 [67517.496407] pidof cpuset=/ mems_allowed=0 [67517.496410] Pid: 385469, comm: pidof Tainted: G --------------- H
sysctl -w kernel.pid_max=4096000
NOTES System V semaphores (semget(2), semop(2), etc.) are an older semaphore API. POSIX semaphores provide a simpler, and better designed interface than System V semaphores; on the other hand POSIX semaphores are less widely available (especially on older systems) than System V semaphores.