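The service health check returns 200 OK, yet users report a blank page and a login button that cannot be clicked. This kind of disconnect between monitoring status and user experience is common in real projects. Traditional liveness and readiness probes only cover service endpoints; they cannot reach complex front-end rendering logic or critical user interaction paths. What we need is a system that simulates real user behavior and shows where time is spent across the whole path. That was the starting point for building this synthetic monitoring platform.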
Initial Idea and the Pains of Technology Selection
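The initial idea was simple: use a cron job to run some Puppeteer scripts periodically. This quickly exposed a series of problems:
- Chaotic script management: scripts were scattered across servers, and updating, rolling back, and version-controlling them became a nightmare.
- Fragile execution environment: everything ran on a single machine, consumed a lot of resources, and had no isolation or scalability. One crashing script could affect other tasks.
- Monitoring as a "black box": when a script failed, all we knew was that the result was FAIL. Was the target site slow? Did the script itself have a bug? Was something wrong with the execution environment? There was no way to tell, and timing analysis was out of the question.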
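To address these pain points, we settled on a few core design goals:
- Monitoring-as-Code: monitoring scripts should be versioned and collaborated on through Git, just like application code.
- Robust scheduling and orchestration: a stable backend service is needed to manage the task lifecycle, schedule executions, and record results.
- End-to-end observability: we must be able to trace a monitoring task from "scheduled" to "finished", tying together backend scheduling, script execution, and browser behavior.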
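With these goals in place, the technology choices became clear:
- Orchestrator: Spring Boot. Its @Scheduled annotation provides simple, reliable scheduled-task support, and its ecosystem (JPA, Actuator) lets us build task management and persistence quickly. Given the team's Java stack, it was the most natural choice.
- Executor: Puppeteer. As the de facto standard for driving headless Chrome, it can closely simulate user behavior.
- Script source: Git. All Puppeteer scripts live in a dedicated Git repository, which gives us "monitoring as code". Before running a task, the orchestrator pulls the latest scripts.
- Observability: Jaeger (with OpenTelemetry). This is the heart of the whole design. We need to link the Spring Boot scheduling trace with the Puppeteer script execution trace so that they form one complete call chain.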
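Architecture and Core Flow
We designed the system as a central scheduling service (Spring Boot) plus multiple distributed execution nodes (any environment that can run Node.js and Chrome). The core flow is as follows: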
sequenceDiagram
participant Scheduler as Spring Boot Scheduler
participant GitRepo as Git Repository
participant ExecutorNode as Execution Node
participant TargetApp as Target Application
participant JaegerCollector as Jaeger Collector
loop Every Minute
Scheduler->>Scheduler: @Scheduled task fires
activate Scheduler
Scheduler->>JaegerCollector: 1. Start parent Span "schedule-task"
Scheduler->>GitRepo: 2. Pull latest scripts
GitRepo-->>Scheduler: Return scripts
Scheduler->>ExecutorNode: 3. Spawn Puppeteer process (with Trace Context)
deactivate Scheduler
activate ExecutorNode
ExecutorNode->>JaegerCollector: 4. Start child Span "execute-puppeteer" (using context)
ExecutorNode->>TargetApp: 5. Launch Chrome & navigate
activate TargetApp
TargetApp-->>ExecutorNode: Page loaded
deactivate TargetApp
ExecutorNode->>ExecutorNode: 6. Execute user actions (click, type)
ExecutorNode->>JaegerCollector: 7. Report sub-spans (e.g., "login-action")
ExecutorNode-->>Scheduler: 8. Return result (stdout/stderr)
ExecutorNode->>JaegerCollector: 9. End child Span
deactivate ExecutorNode
activate Scheduler
Scheduler->>JaegerCollector: 10. End parent Span
Scheduler->>Scheduler: 11. Persist task result
deactivate Scheduler
end
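The key to this flow lies in steps 3 and 4: after Spring Boot creates the parent span, it must inject the trace context (trace ID, parent span ID) into the Puppeteer child process it is about to start. Once the Puppeteer script starts, it parses this context and creates its own child span from it, which links the two separate execution environments into a single trace.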
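To make the hand-off in steps 3 and 4 more concrete, here is a minimal Java sketch of the W3C Trace Context `traceparent` value, the single string that carries the trace ID and parent span ID between processes. The `TraceParent` record and its methods are hypothetical names used only to illustrate the wire format. In the actual flow, the OpenTelemetry propagators produce and parse this value, so nothing like this needs to be written by hand.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper showing the traceparent layout: version-traceId-parentId-flags. */
record TraceParent(String traceId, String parentSpanId, boolean sampled) {

    // Only version 00 is handled here: 32 hex chars of trace ID,
    // 16 hex chars of parent span ID, and 2 hex chars of flags, all lowercase.
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Parses a traceparent value; returns null when it is missing or malformed. */
    static TraceParent parse(String value) {
        if (value == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(value.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // All-zero IDs are defined as invalid by the specification.
        if (isAllZero(traceId) || isAllZero(parentSpanId)) {
            return null;
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(traceId, parentSpanId, sampled);
    }

    /** Renders the value again; only the sampled bit of the flags is preserved. */
    String format() {
        return "00-" + traceId + "-" + parentSpanId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

The child process treats `parentSpanId` as the parent of the first span it creates and reuses `traceId` unchanged, which is what places both processes in the same trace.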
Step-by-Step Implementation: From Scheduling to Tracing
1. The Spring Boot Orchestrator
First, we define the task entity and its repository.
// src/main/java/com/example/synthetic/task/MonitoringTask.java
package com.example.synthetic.task;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import jakarta.persistence.Column;
@Entity
public class MonitoringTask {
@Id
@GeneratedValue
private Long id;
private String name;
@Column(nullable = false)
private String gitRepoUrl; // Git repository that contains the script
@Column(nullable = false)
private String scriptPath; // Path of the script relative to the repository root
@Column(nullable = false)
private String cronExpression; // Cron expression
private boolean active = true;
// Getters and Setters...
}
// src/main/java/com/example/synthetic/task/TaskRepository.java
package com.example.synthetic.task;
import org.springframework.data.jpa.repository.JpaRepository;
import java.util.List;
public interface TaskRepository extends JpaRepository<MonitoringTask, Long> {
List<MonitoringTask> findByActiveTrue();
}
Next comes the core scheduling and execution service. Here we use the JGit library to work with Git repositories.
<!-- pom.xml dependency -->
<dependency>
<groupId>org.eclipse.jgit</groupId>
<artifactId>org.eclipse.jgit</artifactId>
<version>6.7.0.202309050840-r</version>
</dependency>
// src/main/java/com/example/synthetic/core/TaskExecutorService.java
package com.example.synthetic.core;
import com.example.synthetic.task.MonitoringTask;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.errors.GitAPIException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
@Service
public class TaskExecutorService {
private static final Logger logger = LoggerFactory.getLogger(TaskExecutorService.class);
@Value("${synthetic.workspace.path}")
private String workspacePath;
@Value("${synthetic.executor.node.path}")
private String nodeExecutablePath;
private final Tracer tracer;
// The OpenTelemetry Tracer is injected through the constructor
public TaskExecutorService(Tracer tracer) {
this.tracer = tracer;
}
public void executeTask(MonitoringTask task) {
// 1. Create the parent span; this is the starting point of the trace
Span parentSpan = tracer.spanBuilder(task.getName())
.setAttribute("task.id", task.getId())
.setAttribute("task.script.path", task.getScriptPath())
.startSpan();
// Make sure the span is ended when the method finishes
try (var scope = parentSpan.makeCurrent()) {
File repoDir = prepareGitRepo(task.getGitRepoUrl());
executePuppeteerScript(task, repoDir);
} catch (Exception e) {
logger.error("Task {} execution failed", task.getName(), e);
parentSpan.recordException(e);
parentSpan.setStatus(io.opentelemetry.api.trace.StatusCode.ERROR, e.getMessage());
} finally {
parentSpan.end();
}
}
private File prepareGitRepo(String repoUrl) throws GitAPIException, IOException {
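// Use the last path segment as the directory name, stripping only a trailing ".git"
String repoName = repoUrl.substring(repoUrl.lastIndexOf('/') + 1).replaceFirst("\\.git$", "");
Path localPath = Paths.get(workspacePath, repoName);
File localRepoDir = localPath.toFile();
// In a real project, the Git operations here need locking to prevent concurrency problems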
if (localRepoDir.exists()) {
logger.info("Updating existing repository: {}", repoUrl);
try (Git git = Git.open(localRepoDir)) {
git.pull().call();
}
} else {
logger.info("Cloning new repository: {}", repoUrl);
Git.cloneRepository()
.setURI(repoUrl)
.setDirectory(localRepoDir)
.call();
}
return localRepoDir;
}
private void executePuppeteerScript(MonitoringTask task, File repoDir) {
Path scriptFile = Paths.get(repoDir.getAbsolutePath(), task.getScriptPath());
ProcessBuilder processBuilder = new ProcessBuilder(
nodeExecutablePath,
scriptFile.toString()
);
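// 2. Key step: inject the trace context into the child process's environment variables
Map<String, String> environment = processBuilder.environment();
Map<String, String> traceContextMap = new HashMap<>();
// Use the globally configured TextMapPropagator (expected to be W3C Trace Context) to produce the headers to inject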
io.opentelemetry.context.propagation.ContextPropagators propagators =
io.opentelemetry.api.GlobalOpenTelemetry.getPropagators();
propagators.getTextMapPropagator().inject(Context.current(), traceContextMap, new TextMapSetter<Map<String, String>>() {
@Override
public void set(Map<String, String> carrier, String key, String value) {
if (carrier != null) {
carrier.put(key, value);
}
}
});
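// Pass traceparent and tracestate through environment variables.
// ProcessBuilder's environment map rejects null values, and tracestate is often absent,
// so only set the variables that the propagator actually produced.
String traceparent = traceContextMap.get("traceparent");
if (traceparent != null) {
environment.put("TRACEPARENT", traceparent);
}
String tracestate = traceContextMap.get("tracestate");
if (tracestate != null) {
environment.put("TRACESTATE", tracestate);
}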
Span executionSpan = tracer.spanBuilder("process-execution")
.startSpan();
try (var scope = executionSpan.makeCurrent()) {
Process process = processBuilder.start();
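// Read the child's stdout as it is produced so the pipe buffer does not fill up.
// Caveat: this loop only ends once the child closes stdout, so the timeout below
// cannot interrupt a script that hangs while its stdout is still open.
StringBuilder output = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
String line;
while ((line = reader.readLine()) != null) {
logger.info("[Task {} stdout]: {}", task.getName(), line);
output.append(line).append("\n");
}
}
boolean finished = process.waitFor(2, TimeUnit.MINUTES); // Timeout for the process to exit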
if (!finished) {
process.destroyForcibly();
throw new IOException("Puppeteer script execution timed out.");
}
int exitCode = process.exitValue();
executionSpan.setAttribute("process.exit_code", exitCode);
if (exitCode != 0) {
// Read the error stream
StringBuilder errorOutput = new StringBuilder();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getErrorStream()))) {
String line;
while ((line = reader.readLine()) != null) {
logger.error("[Task {} stderr]: {}", task.getName(), line);
errorOutput.append(line).append("\n");
}
}
throw new RuntimeException("Script failed with exit code " + exitCode + ". Error: " + errorOutput);
}
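// The output can be parsed here to obtain business metrics returned by the script
} catch (IOException | InterruptedException e) {
if (e instanceof InterruptedException) {
Thread.currentThread().interrupt(); // Restore the interrupted status only for a real interrupt
}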
executionSpan.recordException(e);
executionSpan.setStatus(io.opentelemetry.api.trace.StatusCode.ERROR, e.getMessage());
throw new RuntimeException("Failed to execute puppeteer script", e);
} finally {
executionSpan.end();
}
}
}
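The comment in prepareGitRepo notes that the Git operations need locking in a real project, since two tasks that share a repository could otherwise pull into the same working directory at the same time. One possible shape for that is a lock per repository URL, sketched below. `RepoLocks`, `ThrowingSupplier`, and `withLock` are hypothetical names, not part of the original service. The sketch only coordinates threads inside a single JVM; several orchestrator instances sharing one workspace would need a different mechanism.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical helper: serializes work per repository URL within one JVM. */
final class RepoLocks {

    /** Like Supplier, but allowed to throw a checked exception (e.g. from Git calls). */
    @FunctionalInterface
    interface ThrowingSupplier<T, E extends Exception> {
        T get() throws E;
    }

    // One lock per repository URL, created lazily on first use.
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Runs the action while holding the lock that belongs to the given repository. */
    <T, E extends Exception> T withLock(String repoUrl, ThrowingSupplier<T, E> action) throws E {
        ReentrantLock lock = locks.computeIfAbsent(repoUrl, url -> new ReentrantLock());
        lock.lock();
        try {
            return action.get();
        } finally {
            // Always release, even when the action throws.
            lock.unlock();
        }
    }
}
```

Inside prepareGitRepo, the pull-or-clone branch would run inside `withLock(repoUrl, ...)`, so tasks that use different repositories still proceed in parallel.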
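TaskExecutorService is the core here. The executeTask method creates the top-level span, prepareGitRepo handles the Git operations, and executePuppeteerScript starts the Node.js process through ProcessBuilder. The most important part is using OpenTelemetry's TextMapPropagator to inject the current trace context into the child process's environment variables, a common approach to propagating a trace across processes.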
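One detail of launching the script through ProcessBuilder deserves care: reading stdout to the end before calling waitFor means the timeout cannot cut off a script that hangs with its output still open. A minimal sketch of one alternative follows, assuming it is acceptable to read the output after the process ends rather than log it line by line. Both streams are redirected to temporary files, so nothing blocks on a pipe and the timeout covers the whole run. `BoundedProcessRunner` and `Result` are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a command with a hard timeout and captures its output. */
final class BoundedProcessRunner {

    /** Outcome of one run; exitCode is -1 when the process was killed on timeout. */
    record Result(boolean timedOut, int exitCode, String stdout, String stderr) {}

    static Result run(List<String> command, long timeout, TimeUnit unit) {
        Path out = null;
        Path err = null;
        Process process = null;
        try {
            out = Files.createTempFile("task-stdout-", ".log");
            err = Files.createTempFile("task-stderr-", ".log");
            // Redirecting to files means the child never blocks on a full pipe.
            process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(err.toFile())
                    .start();
            boolean finished = process.waitFor(timeout, unit);
            if (!finished) {
                process.destroyForcibly();
                process.waitFor(); // wait for the kill to take effect before reading the files
            }
            int exitCode = finished ? process.exitValue() : -1;
            // readString assumes the child writes UTF-8 text.
            return new Result(!finished, exitCode, Files.readString(out), Files.readString(err));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for child process", e);
        } finally {
            if (process != null && process.isAlive()) {
                process.destroyForcibly(); // do not leave an orphan behind on failure
            }
            deleteQuietly(out);
            deleteQuietly(err);
        }
    }

    private static void deleteQuietly(Path file) {
        if (file == null) {
            return;
        }
        try {
            Files.deleteIfExists(file);
        } catch (IOException ignored) {
            // Best effort: a leftover temp file is not worth failing the task for.
        }
    }
}
```

In executePuppeteerScript, the command list would be the Node executable path followed by the script path, and a result with `timedOut` set would be recorded as an error on the span.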
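2. Adapting the Puppeteer Script
Now we need the Puppeteer script to recognize and use the trace context passed in through environment variables.
First, install the OpenTelemetry dependencies for Node.js: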
npm install @opentelemetry/api @opentelemetry/core @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-http @opentelemetry/instrumentation @opentelemetry/resources @opentelemetry/semantic-conventions
Then create a shared tracer.js file that initializes the OpenTelemetry SDK.
// common/tracer.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { W3CTraceContextPropagator } = require('@opentelemetry/core');
const { trace } = require('@opentelemetry/api');
const initTracer = (serviceName) => {
// Configure the OTLP exporter, pointing at the Jaeger Collector
const exporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
});
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: serviceName,
}),
traceExporter: exporter,
// Use W3CTraceContextPropagator to parse traceparent
textMapPropagator: new W3CTraceContextPropagator(),
});
sdk.start();
console.log('OpenTelemetry SDK started for service:', serviceName);
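// Graceful shutdown on SIGTERM
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.log('Error terminating tracing', error))
.finally(() => process.exit(0));
});
// Flush buffered spans once the script has no more work to do;
// a short-lived script may otherwise exit before the batch is exported.
let shutdownStarted = false;
process.on('beforeExit', () => {
if (shutdownStarted) return;
shutdownStarted = true;
sdk.shutdown().catch((error) => console.log('Error terminating tracing', error));
});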
return trace.getTracer(serviceName);
};
module.exports = { initTracer };
Now, here is our Puppeteer script, for example scripts/check-login.js.
// scripts/check-login.js
const puppeteer = require('puppeteer');
const { propagation, context, trace, SpanStatusCode } = require('@opentelemetry/api');
const { initTracer } = require('../common/tracer.js'); // Tracer initializer
// 3. On startup, extract the trace context passed in through environment variables.
// The Java side sets TRACEPARENT / TRACESTATE, while the W3C propagator looks up the
// lowercase keys "traceparent" / "tracestate", so build the carrier explicitly.
const tracer = initTracer('puppeteer-synthetic-script');
const carrier = {};
if (process.env.TRACEPARENT) carrier.traceparent = process.env.TRACEPARENT;
if (process.env.TRACESTATE) carrier.tracestate = process.env.TRACESTATE;
const parentContext = propagation.extract(context.active(), carrier);
async function runLoginTest() {
// 4. Create a child span based on the extracted context
const mainSpan = tracer.startSpan('execute-puppeteer-script', undefined, parentContext);
// Make the new span the active context so that later spans automatically become its children
await context.with(trace.setSpan(context.active(), mainSpan), async () => {
let browser;
try {
const launchSpan = tracer.startSpan('puppeteer-launch');
browser = await puppeteer.launch({ args: ['--no-sandbox'] });
launchSpan.end();
const page = await browser.newPage();
const navigationSpan = tracer.startSpan('navigate-to-login-page');
await page.goto('https://example.com/login'); // Replace with your target URL
navigationSpan.end();
const interactionSpan = tracer.startSpan('perform-login-action');
await page.type('#username', 'testuser');
await page.type('#password', 'testpassword');
await page.click('#login-button');
// Wait for an element that only appears after a successful login
await page.waitForSelector('.user-dashboard', { timeout: 10000 });
interactionSpan.end();
mainSpan.setStatus({ code: SpanStatusCode.OK });
console.log('Login test successful.');
} catch (error) {
console.error('Login test failed:', error.message);
mainSpan.recordException(error);
mainSpan.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
// Report failure to the Java side through a non-zero exit code. Setting process.exitCode
// (instead of calling process.exit) lets the finally block close the browser and end the span.
process.exitCode = 1;
} finally {
if (browser) {
await browser.close();
}
mainSpan.end();
}
});
}
runLoginTest();
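The core changes in this script are:
- It imports tracer.js and initializes the tracer.
- It uses propagation.extract(...) to restore the trace context from the environment variables. The W3C propagator looks up the lowercase keys traceparent and tracestate in the carrier it is given, so the TRACEPARENT and TRACESTATE values set by the Java side have to be presented to it under those names.
- It uses tracer.startSpan('...', undefined, parentContext) to create a child span linked to the Java parent span.
- It uses context.with(...) so that spans created afterwards (such as puppeteer-launch) become children of execute-puppeteer-script, which gives a clear hierarchy.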
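3. Running and Observing
Now, when the Spring Boot scheduler triggers a task, it pulls the latest check-login.js script, starts a Node.js process, and passes the trace context to it. In the Jaeger UI we will see a complete distributed trace spanning the JVM and Node.js.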