当前位置：首页 > news >正文

nacos热更新引起tcp激增导致服务不可用

news 2025/7/13 4:11:32

简介

本次遇到线上nacos热更新问题导致服务不可用，记录在次，也帮后续者排查原因。
这是一个偶发现象，非必现，但是也需要注意.

现象

环境如下：

centos8
nacos 2.2.1
spring-cloud-alibaba-dependencies:2022.0.0.0

特殊点：此现象必须使用阿里云的nacos服务才会出现。在我们开发测试环境中自行搭建的nacos多次尝试并未复现，所以导致应用上线后多次迭代未出现问题直至使用热更新。

在更新nacos配置热更新时，一段时间后服务不可用。
服务所在节点的ecs服务器监控可以看到，在发布热更新的时间后，tcp同时连接数量激增，导致资源占用。多节点的tcp的同时连接数增大导致nginx连接数到max，间接导致域名下应用不可用。
应用日志一直打印
[notify-currentSkip] **** listener is not finish yet,will try next time.

直接原因

1、应用tcp连接数量激增
2、nginx同时连接数设置过小
3、没有设置报警及后续处理

根本原因

通过频繁的热更新，我们复现了问题，通过dump分析我们拿到了线程堆栈。

Thread 2117539: (state = BLOCKED)- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)- java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)- java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() @bci=1, line=836 (Compiled frame)- java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, int) @bci=67, line=870 (Interpreted frame)- java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) @bci=17, line=1199 (Compiled frame)- java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, line=943 (Compiled frame)- org.springframework.cloud.context.scope.GenericScope.destroy() @bci=69, line=140 (Interpreted frame)- org.springframework.cloud.context.scope.refresh.RefreshScope.refreshAll() @bci=1, line=154 (Interpreted frame)- org.springframework.cloud.context.refresh.ContextRefresher.refresh() @bci=9, line=86 (Interpreted frame)- org.springframework.cloud.endpoint.event.RefreshEventListener.handle(org.springframework.cloud.endpoint.event.RefreshEvent) @bci=44, line=72 (Interpreted frame)- org.springframework.cloud.endpoint.event.RefreshEventListener.onApplicationEvent(org.springframework.context.ApplicationEvent) @bci=30, line=61 (Interpreted frame)- org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(org.springframework.context.ApplicationListener, org.springframework.context.ApplicationEvent) @bci=2, line=172 (Compiled frame)- org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(org.springframework.context.ApplicationListener, org.springframework.context.ApplicationEvent) @bci=34, line=165 (Compiled frame)- org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(org.springframework.context.ApplicationEvent, org.springframework.core.ResolvableType) @bci=83, line=139 (Compiled frame)- org.springframework.context.support.AbstractApplicationContext.publishEvent(java.lang.Object, org.springframework.core.ResolvableType) @bci=70, line=404 (Compiled frame)- org.springframework.context.support.AbstractApplicationContext.publishEvent(org.springframework.context.ApplicationEvent) @bci=3, line=361 (Compiled frame)- com.alibaba.cloud.nacos.refresh.NacosContextRefresher$1.innerReceive(java.lang.String, java.lang.String, java.lang.String) @bci=34, line=132 (Interpreted frame)- com.alibaba.nacos.api.config.listener.AbstractSharedListener.receiveConfigInfo(java.lang.String) @bci=10, line=40 (Interpreted frame)- com.alibaba.nacos.client.config.impl.CacheData.lambda$safeNotifyListener$1(com.alibaba.nacos.api.config.listener.Listener, java.lang.String, java.lang.String, java.lang.String, java.lang.String, java.lang.String, com.alibaba.nacos.client.config.impl.CacheData$ManagerListenerWrap, java.lang.String) @bci=146, line=317 (Interpreted frame)- com.alibaba.nacos.client.config.impl.CacheData$$Lambda$1538.run() @bci=36 (Interpreted frame)- java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)- java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)- java.lang.Thread.run() @bci=11, line=750 (Compiled frame)

通过此和服务器打印出的WARN日志的代码，就能找到对应的源码

    private void safeNotifyListener(final String dataId, final String group, final String content, final String type,final String md5, final String encryptedDataKey, final ManagerListenerWrap listenerWrap) {final Listener listener = listenerWrap.listener;if (listenerWrap.inNotifying) {LOGGER.warn("[{}] [notify-currentSkip] dataId={}, group={}, md5={}, listener={}, listener is not finish yet,will try next time.",name, dataId, group, md5, listener);return;}Runnable job = () -> {long start = System.currentTimeMillis();ClassLoader myClassLoader = Thread.currentThread().getContextClassLoader();ClassLoader appClassLoader = listener.getClass().getClassLoader();try {if (listener instanceof AbstractSharedListener) {AbstractSharedListener adapter = (AbstractSharedListener) listener;adapter.fillContext(dataId, group);LOGGER.info("[{}] [notify-context] dataId={}, group={}, md5={}", name, dataId, group, md5);}// Before executing the callback, set the thread classloader to the classloader of// the specific webapp to avoid exceptions or misuses when calling the spi interface in// the callback method (this problem occurs only in multi-application deployment).Thread.currentThread().setContextClassLoader(appClassLoader);ConfigResponse cr = new ConfigResponse();cr.setDataId(dataId);cr.setGroup(group);cr.setContent(content);cr.setEncryptedDataKey(encryptedDataKey);configFilterChainManager.doFilter(null, cr);String contentTmp = cr.getContent();listenerWrap.inNotifying = true;listener.receiveConfigInfo(contentTmp);// compare lastContent and contentif (listener instanceof AbstractConfigChangeListener) {Map<String, ConfigChangeItem> data = ConfigChangeHandler.getInstance().parseChangeData(listenerWrap.lastContent, contentTmp, type);ConfigChangeEvent event = new ConfigChangeEvent(data);((AbstractConfigChangeListener) listener).receiveConfigChange(event);listenerWrap.lastContent = contentTmp;}listenerWrap.lastCallMd5 = md5;LOGGER.info("[{}] [notify-ok] dataId={}, group={}, md5={}, listener={} ,cost={} millis.", name, dataId,group, md5, listener, (System.currentTimeMillis() - start));} catch (NacosException ex) {LOGGER.error("[{}] [notify-error] dataId={}, group={}, md5={}, listener={} errCode={} errMsg={}", name,dataId, group, md5, listener, ex.getErrCode(), ex.getErrMsg());} catch (Throwable t) {LOGGER.error("[{}] [notify-error] dataId={}, group={}, md5={}, listener={}", name, dataId, group, md5,listener, t);} finally {listenerWrap.inNotifying = false;Thread.currentThread().setContextClassLoader(myClassLoader);}};final long startNotify = System.currentTimeMillis();try {if (null != listener.getExecutor()) {listener.getExecutor().execute(job);} else {try {INTERNAL_NOTIFIER.submit(job);} catch (RejectedExecutionException rejectedExecutionException) {LOGGER.warn("[{}] [notify-blocked] dataId={}, group={}, md5={}, listener={}, no available internal notifier,will sync notifier ",name, dataId, group, md5, listener);job.run();} catch (Throwable throwable) {LOGGER.error("[{}] [notify-blocked] dataId={}, group={}, md5={}, listener={}, submit internal async task fail,throwable= ",name, dataId, group, md5, listener, throwable);job.run();}}} catch (Throwable t) {LOGGER.error("[{}] [notify-error] dataId={}, group={}, md5={}, listener={} throwable={}", name, dataId,group, md5, listener, t.getCause());}final long finishNotify = System.currentTimeMillis();LOGGER.info("[{}] [notify-listener] time cost={}ms in ClientWorker, dataId={}, group={}, md5={}, listener={} ",name, (finishNotify - startNotify), dataId, group, md5, listener);}

我们根据现象只打印了一条WARN日志可以推论出在Runnable job中listenerWrap.inNotifying=true后的几行代码中

		listener.receiveConfigInfo(contentTmp);// compare lastContent and contentif (listener instanceof AbstractConfigChangeListener) {Map<String, ConfigChangeItem> data = ConfigChangeHandler.getInstance().parseChangeData(listenerWrap.lastContent, contentTmp, type);ConfigChangeEvent event = new ConfigChangeEvent(data);((AbstractConfigChangeListener) listener).receiveConfigChange(event);listenerWrap.lastContent = contentTmp;}listenerWrap.lastCallMd5 = md5;