Add doc for workers architecture

2025-11-07 15:34:13 +07:00
parent 2dc34f07d8
commit 5578d272fa
9 changed files with 672 additions and 0 deletions


@@ -0,0 +1,78 @@
# Overall System Architecture
This diagram shows the complete system architecture: the API Server Cluster, the Compute Worker Cluster, and their interactions with the database and external services.
```mermaid
graph TB
subgraph "Monorepo Structure"
subgraph "API Server Cluster"
API1[Managing.Api<br/>API-1<br/>Orleans]
API2[Managing.Api<br/>API-2<br/>Orleans]
API3[Managing.Api<br/>API-3<br/>Orleans]
end
subgraph "Compute Worker Cluster"
W1[Managing.Compute<br/>Worker-1<br/>8 cores, 6 jobs]
W2[Managing.Compute<br/>Worker-2<br/>8 cores, 6 jobs]
W3[Managing.Compute<br/>Worker-3<br/>8 cores, 6 jobs]
end
subgraph "Shared Projects"
APP[Managing.Application<br/>Business Logic]
DOM[Managing.Domain<br/>Domain Models]
INFRA[Managing.Infrastructure<br/>Database Access]
end
end
subgraph "External Services"
DB[(PostgreSQL<br/>Job Queue)]
INFLUX[(InfluxDB<br/>Candles)]
end
subgraph "Clients"
U1[User 1]
U2[User 2]
U1000[User 1000]
end
U1 --> API1
U2 --> API2
U1000 --> API3
API1 --> DB
API2 --> DB
API3 --> DB
W1 --> DB
W2 --> DB
W3 --> DB
W1 --> INFLUX
W2 --> INFLUX
W3 --> INFLUX
API1 -.uses.-> APP
API2 -.uses.-> APP
API3 -.uses.-> APP
W1 -.uses.-> APP
W2 -.uses.-> APP
W3 -.uses.-> APP
style API1 fill:#4A90E2
style API2 fill:#4A90E2
style API3 fill:#4A90E2
style W1 fill:#50C878
style W2 fill:#50C878
style W3 fill:#50C878
style DB fill:#FF6B6B
style INFLUX fill:#FFD93D
```
## Components
- **API Server Cluster**: Handles HTTP requests, creates jobs, returns immediately
- **Compute Worker Cluster**: Processes CPU-intensive backtest jobs
- **PostgreSQL**: Job queue and state management
- **InfluxDB**: Time-series data for candles
- **Shared Projects**: Common business logic used by both API and Compute services


@@ -0,0 +1,52 @@
# Request Flow Sequence Diagram
This diagram shows the complete request flow from user submission to job completion and status polling.
```mermaid
sequenceDiagram
participant User
participant API as API Server<br/>(Orleans)
participant DB as PostgreSQL<br/>(Job Queue)
participant Worker as Compute Worker
participant Influx as InfluxDB
User->>API: POST /api/backtest/bundle
API->>API: Create BundleBacktestRequest
API->>API: Generate BacktestJobs from variants
API->>DB: INSERT BacktestJobs (Status: Pending)
API-->>User: 202 Accepted<br/>{bundleRequestId, status: "Queued"}
Note over Worker: Polling every 5 seconds
Worker->>DB: SELECT pending jobs<br/>(ORDER BY priority, createdAt)
DB-->>Worker: Return pending jobs
Worker->>DB: UPDATE job<br/>(Status: Running, AssignedWorkerId)
Worker->>Influx: Load candles for backtest
Influx-->>Worker: Return candles
loop Process each candle
Worker->>Worker: Run backtest logic
Worker->>DB: UPDATE job progress
end
Worker->>DB: UPDATE job<br/>(Status: Completed, ResultJson)
Worker->>DB: UPDATE BundleBacktestRequest<br/>(CompletedBacktests++)
User->>API: GET /api/backtest/bundle/{id}/status
API->>DB: SELECT BundleBacktestRequest + job stats
DB-->>API: Return status
API-->>User: {status, progress, completed/total}
```
## Flow Steps
1. **User Request**: User submits bundle backtest request
2. **API Processing**: API creates bundle request and generates individual backtest jobs
3. **Job Queue**: Jobs are inserted into the database with `Pending` status
4. **Immediate Response**: API returns 202 Accepted with the bundle request ID
5. **Worker Polling**: Compute workers poll the database every 5 seconds
6. **Job Claiming**: Worker claims jobs using PostgreSQL advisory locks
7. **Candle Loading**: Worker loads candles from InfluxDB
8. **Backtest Processing**: Worker processes backtest with progress updates
9. **Result Storage**: Worker saves results and updates bundle progress
10. **Status Polling**: User polls API for status updates
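
A minimal C# sketch of steps 1 through 4 follows; the controller shape, DTO, and repository calls (`IBacktestJobRepository.InsertBundleWithJobsAsync`, `BacktestJobFactory.CreateFromVariants`) are illustrative assumptions, not the actual codebase API.
```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/backtest")]
public class BacktestController : ControllerBase
{
    private readonly IBacktestJobRepository _jobs;   // hypothetical repository abstraction

    public BacktestController(IBacktestJobRepository jobs) => _jobs = jobs;

    [HttpPost("bundle")]
    public async Task<IActionResult> CreateBundleBacktest(BundleBacktestRequestDto dto)
    {
        // Steps 1-2: create the bundle request and expand its variants into individual jobs.
        var bundle = BundleBacktestRequest.FromDto(dto);            // assumed factory method
        var jobs = BacktestJobFactory.CreateFromVariants(bundle);   // assumed variant expansion

        // Step 3: persist everything with Status = Pending; compute workers pick the jobs up later.
        await _jobs.InsertBundleWithJobsAsync(bundle, jobs);

        // Step 4: respond immediately; the backtests run asynchronously in the compute cluster.
        return Accepted(new { bundleRequestId = bundle.RequestId, status = "Queued" });
    }
}
```
The endpoint never waits on the backtests themselves; it only persists the jobs and echoes the bundle request ID so the client can poll `GET /api/backtest/bundle/{id}/status` later.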


@@ -0,0 +1,54 @@
# Job Processing Flow
This diagram shows the detailed flow of how compute workers process backtest jobs from the queue.
```mermaid
flowchart TD
Start([User Creates<br/>BundleBacktestRequest]) --> CreateJobs[API: Generate<br/>BacktestJobs]
CreateJobs --> InsertDB[(Insert Jobs<br/>Status: Pending)]
InsertDB --> WorkerPoll{Worker Polls<br/>Database}
WorkerPoll -->|Every 5s| CheckJobs{Jobs<br/>Available?}
CheckJobs -->|No| Wait[Wait 5s]
Wait --> WorkerPoll
CheckJobs -->|Yes| ClaimJobs[Claim Jobs<br/>Advisory Lock]
ClaimJobs --> UpdateStatus[Update Status:<br/>Running]
UpdateStatus --> CheckSemaphore{Semaphore<br/>Available?}
CheckSemaphore -->|No| WaitSemaphore[Wait for<br/>slot]
WaitSemaphore --> CheckSemaphore
CheckSemaphore -->|Yes| AcquireSemaphore[Acquire<br/>Semaphore]
AcquireSemaphore --> LoadCandles[Load Candles<br/>from InfluxDB]
LoadCandles --> ProcessBacktest[Process Backtest<br/>CPU-intensive]
ProcessBacktest --> UpdateProgress{Every<br/>10%?}
UpdateProgress -->|Yes| SaveProgress[Update Progress<br/>in DB]
SaveProgress --> ProcessBacktest
UpdateProgress -->|No| ProcessBacktest
ProcessBacktest --> BacktestComplete{Backtest<br/>Complete?}
BacktestComplete -->|No| ProcessBacktest
BacktestComplete -->|Yes| SaveResult[Save Result<br/>Status: Completed]
SaveResult --> UpdateBundle[Update Bundle<br/>Progress]
UpdateBundle --> ReleaseSemaphore[Release<br/>Semaphore]
ReleaseSemaphore --> WorkerPoll
style Start fill:#4A90E2
style ProcessBacktest fill:#50C878
style SaveResult fill:#FF6B6B
style WorkerPoll fill:#FFD93D
```
## Key Components
- **Worker Polling**: Workers continuously poll the database for pending jobs
- **Advisory Locks**: PostgreSQL advisory locks prevent multiple workers from claiming the same job
- **Semaphore Control**: Limits concurrent backtests per worker (default: CPU cores - 2)
- **Progress Updates**: Progress is written to the database at every 10% of completion
- **Bundle Updates**: Individual job completion updates the parent bundle request
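
A simplified sketch of this loop, assuming a `BackgroundService` host and a hypothetical `IBacktestJobRepository.ClaimPendingJobsAsync` method that wraps the advisory-lock claiming:
```csharp
using Microsoft.Extensions.Hosting;

public class BacktestComputeWorker : BackgroundService
{
    private readonly IBacktestJobRepository _jobs;   // assumed repository abstraction
    private readonly SemaphoreSlim _slots = new(Math.Max(1, Environment.ProcessorCount - 2));

    public BacktestComputeWorker(IBacktestJobRepository jobs) => _jobs = jobs;

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // Claim at most one job per free semaphore slot; advisory locks prevent double claims.
            var claimed = await _jobs.ClaimPendingJobsAsync(maxJobs: _slots.CurrentCount, ct);

            foreach (var job in claimed)
            {
                await _slots.WaitAsync(ct);   // throttle concurrent backtests
                _ = Task.Run(async () =>
                {
                    try { await ProcessJobAsync(job, ct); }   // load candles, run backtest, save result
                    finally { _slots.Release(); }
                }, ct);
            }

            await Task.Delay(TimeSpan.FromSeconds(5), ct);    // poll interval
        }
    }

    private Task ProcessJobAsync(BacktestJob job, CancellationToken ct) => Task.CompletedTask;   // placeholder
}
```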


@@ -0,0 +1,69 @@
# Database Schema & Queue Structure
This diagram shows the entity relationships between BundleBacktestRequest, BacktestJob, and User entities.
```mermaid
erDiagram
BundleBacktestRequest ||--o{ BacktestJob : "has many"
BacktestJob }o--|| User : "belongs to"
BundleBacktestRequest {
UUID RequestId PK
INT UserId FK
STRING Status
INT TotalBacktests
INT CompletedBacktests
INT FailedBacktests
DATETIME CreatedAt
DATETIME CompletedAt
STRING UniversalConfigJson
STRING DateTimeRangesJson
STRING MoneyManagementVariantsJson
STRING TickerVariantsJson
}
BacktestJob {
UUID Id PK
UUID BundleRequestId FK
STRING JobType
STRING Status
INT Priority
TEXT ConfigJson
TEXT CandlesJson
INT ProgressPercentage
INT CurrentBacktestIndex
INT TotalBacktests
INT CompletedBacktests
DATETIME CreatedAt
DATETIME StartedAt
DATETIME CompletedAt
TEXT ResultJson
TEXT ErrorMessage
STRING AssignedWorkerId
DATETIME LastHeartbeat
}
User {
INT Id PK
STRING Name
}
```
## Table Descriptions
### BundleBacktestRequest
- Represents a bundle of multiple backtest jobs
- Contains variant configurations (date ranges, money management, tickers)
- Tracks overall progress across all jobs
### BacktestJob
- Individual backtest execution unit
- Contains serialized config and candles
- Tracks progress, worker assignment, and heartbeat
- Links to parent bundle request
### Key Indexes
- `idx_status_priority`: For efficient job claiming (Status, Priority DESC, CreatedAt)
- `idx_bundle_request`: For bundle progress queries
- `idx_assigned_worker`: For worker health monitoring
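
If the table is mapped with EF Core (an assumption; the docs only mention a migration), the indexes could be declared roughly as follows. The context name and properties mirror the ERD above rather than the actual mapping code.
```csharp
using Microsoft.EntityFrameworkCore;

public class ManagingDbContext : DbContext   // hypothetical context name
{
    public DbSet<BacktestJob> BacktestJobs => Set<BacktestJob>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        var job = modelBuilder.Entity<BacktestJob>();

        // idx_status_priority: supports job claiming (filter by Status, order by Priority DESC, CreatedAt).
        job.HasIndex(j => new { j.Status, j.Priority, j.CreatedAt })
           .HasDatabaseName("idx_status_priority")
           .IsDescending(false, true, false);   // Priority descending (EF Core 7+)

        // idx_bundle_request: supports bundle progress queries.
        job.HasIndex(j => j.BundleRequestId)
           .HasDatabaseName("idx_bundle_request");

        // idx_assigned_worker: supports worker health monitoring and stale-job detection.
        job.HasIndex(j => new { j.AssignedWorkerId, j.LastHeartbeat })
           .HasDatabaseName("idx_assigned_worker");
    }
}
```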


@@ -0,0 +1,103 @@
# Deployment Architecture
This diagram shows the production deployment architecture with load balancing, clustering, and monitoring.
```mermaid
graph TB
subgraph "Load Balancer"
LB[NGINX/Cloudflare]
end
subgraph "API Server Cluster"
direction LR
API1[API-1<br/>Orleans Silo<br/>Port: 11111]
API2[API-2<br/>Orleans Silo<br/>Port: 11121]
API3[API-3<br/>Orleans Silo<br/>Port: 11131]
end
subgraph "Compute Worker Cluster"
direction LR
W1[Worker-1<br/>8 CPU Cores<br/>6 Concurrent Jobs]
W2[Worker-2<br/>8 CPU Cores<br/>6 Concurrent Jobs]
W3[Worker-3<br/>8 CPU Cores<br/>6 Concurrent Jobs]
end
subgraph "Database Cluster"
direction LR
DB_MASTER[(PostgreSQL<br/>Master<br/>Job Queue)]
DB_REPLICA[(PostgreSQL<br/>Replica<br/>Read Only)]
end
subgraph "Time Series DB"
INFLUX[(InfluxDB<br/>Candles Data)]
end
subgraph "Monitoring"
PROM[Prometheus]
GRAF[Grafana]
end
LB --> API1
LB --> API2
LB --> API3
API1 --> DB_MASTER
API2 --> DB_MASTER
API3 --> DB_MASTER
W1 --> DB_MASTER
W2 --> DB_MASTER
W3 --> DB_MASTER
W1 --> INFLUX
W2 --> INFLUX
W3 --> INFLUX
W1 --> PROM
W2 --> PROM
W3 --> PROM
API1 --> PROM
API2 --> PROM
API3 --> PROM
PROM --> GRAF
DB_MASTER --> DB_REPLICA
style LB fill:#9B59B6
style API1 fill:#4A90E2
style API2 fill:#4A90E2
style API3 fill:#4A90E2
style W1 fill:#50C878
style W2 fill:#50C878
style W3 fill:#50C878
style DB_MASTER fill:#FF6B6B
style INFLUX fill:#FFD93D
style PROM fill:#E67E22
style GRAF fill:#E67E22
```
## Deployment Components
### Load Balancer
- **NGINX/Cloudflare**: Distributes incoming requests across API servers
- Health checks and failover support
### API Server Cluster
- **3+ Instances**: Horizontally scalable Orleans silos
- Each instance handles HTTP requests and Orleans grain operations
- Ports: 11111, 11121, 11131 (for clustering)
### Compute Worker Cluster
- **3+ Instances**: Dedicated CPU workers
- Each worker: 8 CPU cores, 6 concurrent backtests
- Total capacity: 18 concurrent backtests across the cluster
### Database Cluster
- **Master**: Handles all writes (job creation, updates)
- **Replica**: Read-only for status queries and reporting
### Monitoring
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards


@@ -0,0 +1,96 @@
# Concurrency Control Flow
This diagram shows how the semaphore-based concurrency control works across multiple workers.
```mermaid
graph LR
subgraph "Database Queue"
Q[Pending Jobs<br/>Priority Queue]
end
subgraph "Worker-1"
S1[Semaphore<br/>6 slots]
J1[Job 1]
J2[Job 2]
J3[Job 3]
J4[Job 4]
J5[Job 5]
J6[Job 6]
end
subgraph "Worker-2"
S2[Semaphore<br/>6 slots]
J7[Job 7]
J8[Job 8]
J9[Job 9]
J10[Job 10]
J11[Job 11]
J12[Job 12]
end
subgraph "Worker-3"
S3[Semaphore<br/>6 slots]
J13[Job 13]
J14[Job 14]
J15[Job 15]
J16[Job 16]
J17[Job 17]
J18[Job 18]
end
Q -->|Claim 6 jobs| S1
Q -->|Claim 6 jobs| S2
Q -->|Claim 6 jobs| S3
S1 --> J1
S1 --> J2
S1 --> J3
S1 --> J4
S1 --> J5
S1 --> J6
S2 --> J7
S2 --> J8
S2 --> J9
S2 --> J10
S2 --> J11
S2 --> J12
S3 --> J13
S3 --> J14
S3 --> J15
S3 --> J16
S3 --> J17
S3 --> J18
style Q fill:#FF6B6B
style S1 fill:#50C878
style S2 fill:#50C878
style S3 fill:#50C878
```
## Concurrency Control Mechanisms
### 1. Database-Level (Advisory Locks)
- **PostgreSQL Advisory Locks**: Prevent multiple workers from claiming the same job
- Atomic job claiming using `pg_try_advisory_lock()`
- Ensures a job can be claimed by only one worker at a time
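
A hedged sketch of what the claim could look like with Npgsql; the quoted table and column names follow the ERD, and the exact SQL is illustrative rather than the project's actual query.
```csharp
using Npgsql;

string connectionString = Environment.GetEnvironmentVariable("POSTGRES_CONNECTION") ?? "";   // assumed env var
string workerId = Environment.MachineName;   // any stable worker identity works

const string claimSql = @"
    UPDATE ""BacktestJobs""
    SET ""Status"" = 'Running', ""AssignedWorkerId"" = @workerId, ""StartedAt"" = now()
    WHERE ""Id"" = (
        SELECT ""Id"" FROM ""BacktestJobs""
        WHERE ""Status"" = 'Pending'
          AND pg_try_advisory_lock(hashtext(""Id""::text))   -- only one session can hold the lock for a given job
        ORDER BY ""Priority"" DESC, ""CreatedAt""
        LIMIT 1)
    RETURNING ""Id"";";

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync();
await using var cmd = new NpgsqlCommand(claimSql, connection);
cmd.Parameters.AddWithValue("workerId", workerId);
var claimedId = (Guid?)await cmd.ExecuteScalarAsync();   // null when nothing was pending; otherwise the job this worker now owns
// After the job finishes, release the lock with pg_advisory_unlock(hashtext(id::text)) on the same connection,
// since advisory locks are held per session rather than per transaction.
```
Note that evaluating `pg_try_advisory_lock()` inside a WHERE clause has ordering subtleties, which is why some job queues use `FOR UPDATE SKIP LOCKED` for the same purpose.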
### 2. Worker-Level (Semaphore)
- **SemaphoreSlim**: Limits concurrent backtests per worker
- Default: `Environment.ProcessorCount - 2` (e.g., 6 on an 8-core machine)
- Prevents CPU saturation while leaving headroom for the worker host's other work (polling, heartbeats, database I/O)
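
A minimal sketch of the per-worker throttle, under the same sizing rule:
```csharp
// ProcessorCount - 2 slots (6 on an 8-core machine), but never fewer than one.
int maxConcurrent = Math.Max(1, Environment.ProcessorCount - 2);
var backtestSlots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

await backtestSlots.WaitAsync();   // acquire a slot before starting a backtest
try
{
    // run a single backtest here
}
finally
{
    backtestSlots.Release();       // always free the slot, even if the backtest throws
}
```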
### 3. Cluster-Level (Queue Priority)
- **Priority Queue**: Jobs ordered by priority, then creation time
- VIP users get higher priority
- Fair distribution across workers
## Capacity Calculation
- **Per Worker**: 6 concurrent backtests
- **3 Workers**: 18 concurrent backtests
- **Average Duration**: ~47 minutes per backtest
- **Throughput**: ~1,080 backtests/hour
- **1000 Users × 10 backtests**: ~9 hours to process full queue


@@ -0,0 +1,74 @@
# Monorepo Project Structure
This diagram shows the monorepo structure with shared projects used by both API and Compute services.
```mermaid
graph TD
ROOT[Managing.sln<br/>Monorepo Root]
ROOT --> API[Managing.Api<br/>API Server<br/>Orleans]
ROOT --> COMPUTE[Managing.Compute<br/>Worker App<br/>No Orleans]
ROOT --> SHARED[Shared Projects]
SHARED --> APP[Managing.Application<br/>Business Logic]
SHARED --> DOM[Managing.Domain<br/>Domain Models]
SHARED --> INFRA[Managing.Infrastructure<br/>Database/External]
SHARED --> COMMON[Managing.Common<br/>Utilities]
API --> APP
API --> DOM
API --> INFRA
API --> COMMON
COMPUTE --> APP
COMPUTE --> DOM
COMPUTE --> INFRA
COMPUTE --> COMMON
style ROOT fill:#9B59B6
style API fill:#4A90E2
style COMPUTE fill:#50C878
style SHARED fill:#FFD93D
```
## Project Organization
### Root Level
- **Managing.sln**: Solution file containing all projects
### Service Projects
- **Managing.Api**: API Server with Orleans
- Controllers, Orleans grains, HTTP endpoints
- Handles user requests, creates jobs
- **Managing.Compute**: Compute Worker App (NEW)
- Background workers, job processors
- No Orleans dependency
- Dedicated CPU processing
### Shared Projects
- **Managing.Application**: Business logic
- `Backtester.cs`, `TradingBotBase.cs`
- Used by both API and Compute
- **Managing.Domain**: Domain models
- `BundleBacktestRequest.cs`, `BacktestJob.cs`
- Shared entities
- **Managing.Infrastructure**: External integrations
- Database repositories, InfluxDB client
- Shared data access
- **Managing.Common**: Utilities
- Constants, enums, helpers
- Shared across all projects
## Benefits
1. **Code Reuse**: Shared business logic between API and Compute
2. **Consistency**: Same domain models and logic
3. **Maintainability**: Single source of truth
4. **Type Safety**: Shared types prevent serialization issues
5. **Testing**: Shared test projects


@@ -0,0 +1,71 @@
# Implementation Plan
## Phase 1: Database & Domain Setup
- [ ] Create `BacktestJob` entity in `Managing.Domain/Backtests/`
- [ ] Create `BacktestJobStatus` enum (Pending, Running, Completed, Failed)
- [ ] Create database migration for `BacktestJobs` table
- [ ] Add indexes: `idx_status_priority`, `idx_bundle_request`, `idx_assigned_worker`
- [ ] Create `IBacktestJobRepository` interface
- [ ] Implement `BacktestJobRepository` with advisory lock support
## Phase 2: Compute Worker Project
- [ ] Create `Managing.Compute` project (console app/worker service)
- [ ] Add project reference to shared projects (Application, Domain, Infrastructure)
- [ ] Configure DI container (NO Orleans)
- [ ] Create `BacktestComputeWorker` background service
- [ ] Implement job polling logic (every 5 seconds)
- [ ] Implement job claiming with PostgreSQL advisory locks
- [ ] Implement semaphore-based concurrency control
- [ ] Implement progress callback mechanism
- [ ] Implement heartbeat mechanism (every 30 seconds)
- [ ] Add configuration: `MaxConcurrentBacktests`, `JobPollIntervalSeconds`
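
One possible shape for that configuration is sketched below; the class name, the heartbeat property, and the registration comments are assumptions.
```csharp
public sealed class ComputeWorkerOptions
{
    // Upper bound on backtests running at once on this worker (defaults to ProcessorCount - 2).
    public int MaxConcurrentBacktests { get; set; } = Math.Max(1, Environment.ProcessorCount - 2);

    // How often the worker polls the job queue for pending work.
    public int JobPollIntervalSeconds { get; set; } = 5;

    // How often a running job refreshes its LastHeartbeat timestamp (mirrors the 30-second item above).
    public int HeartbeatIntervalSeconds { get; set; } = 30;
}

// Registration sketch for the Managing.Compute Program.cs (section name is an assumption):
// builder.Services.Configure<ComputeWorkerOptions>(builder.Configuration.GetSection("ComputeWorker"));
// builder.Services.AddHostedService<BacktestComputeWorker>();
```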
## Phase 3: API Server Updates
- [ ] Update `BacktestController` to create jobs instead of calling grains directly
- [ ] Implement `CreateBundleBacktest` endpoint (returns immediately)
- [ ] Implement `GetBundleStatus` endpoint (polls database)
- [ ] Update `Backtester.cs` to generate `BacktestJob` entities from bundle variants
- [ ] Remove direct Orleans grain calls for backtests (keep for other operations)
## Phase 4: Shared Logic
- [ ] Extract backtest execution logic from `BacktestTradingBotGrain` to `Backtester.cs`
- [ ] Make backtest logic Orleans-agnostic (can run in worker or grain)
- [ ] Add progress callback support to the `RunBacktestAsync` method (one possible signature is sketched after this list)
- [ ] Ensure candle loading works in both contexts
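
One possible Orleans-agnostic signature for the extracted entry point is sketched below; the interface and parameter types are assumptions, not the existing `Backtester.cs` method.
```csharp
public interface IBacktestRunner
{
    // progress reports a completed percentage (0-100); the compute worker can persist it to
    // BacktestJob.ProgressPercentage, and an Orleans grain could forward it the same way.
    Task<BacktestResult> RunBacktestAsync(
        BacktestConfig config,
        IReadOnlyList<Candle> candles,
        IProgress<int>? progress = null,
        CancellationToken ct = default);
}
```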
## Phase 5: Monitoring & Health Checks
- [ ] Add health check endpoint to compute worker
- [ ] Add metrics: pending jobs, running jobs, completed/failed counts
- [ ] Add stale job detection (reclaim jobs from dead workers; see the reclamation sketch after this list)
- [ ] Add logging for job lifecycle events
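
A sketch of stale-job reclamation keyed on `LastHeartbeat`; the five-minute threshold, SQL shape, and connection handling are assumptions.
```csharp
using Npgsql;

string connectionString = Environment.GetEnvironmentVariable("POSTGRES_CONNECTION") ?? "";   // assumed env var

const string reclaimSql = @"
    UPDATE ""BacktestJobs""
    SET ""Status"" = 'Pending', ""AssignedWorkerId"" = NULL
    WHERE ""Status"" = 'Running'
      AND ""LastHeartbeat"" < now() - interval '5 minutes'   -- roughly ten missed 30-second heartbeats
    RETURNING ""Id"";";

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync();
await using var cmd = new NpgsqlCommand(reclaimSql, connection);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine($"Reclaimed stale job {reader.GetGuid(0)} from a dead worker");   // use ILogger in practice
```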
## Phase 6: Deployment
- [ ] Create Dockerfile for `Managing.Compute`
- [ ] Create deployment configuration for compute workers
- [ ] Configure environment variables for compute cluster
- [ ] Set up monitoring dashboards (Prometheus/Grafana)
- [ ] Configure auto-scaling rules for compute workers
## Phase 7: Testing & Validation
- [ ] Test single backtest job processing
- [ ] Test bundle backtest with multiple jobs
- [ ] Test concurrent job processing (multiple workers)
- [ ] Test job recovery after worker failure
- [ ] Test priority queue ordering
- [ ] Load test with 1000+ concurrent users
## Phase 8: Migration Strategy
- [ ] Keep Orleans grains as fallback during transition
- [ ] Feature flag to switch between Orleans and Compute workers
- [ ] Gradual migration: test with small percentage of traffic
- [ ] Monitor performance and error rates
- [ ] Full cutover once validated


@@ -0,0 +1,75 @@
# Workers Processing Architecture
This folder contains documentation for the enterprise-grade backtest processing architecture using a database queue pattern with separate API and Compute worker clusters.
## Overview
The architecture separates concerns between:
- **API Server**: Handles HTTP requests, creates jobs, returns immediately (fire-and-forget)
- **Compute Workers**: Process CPU-intensive backtest jobs from the database queue
- **Database Queue**: Central coordination point using PostgreSQL
## Documentation Files
1. **[01-Overall-Architecture.md](./01-Overall-Architecture.md)**
- Complete system architecture diagram
- Component relationships
- External service integrations
2. **[02-Request-Flow.md](./02-Request-Flow.md)**
- Sequence diagram of request flow
- User request → Job creation → Processing → Status polling
3. **[03-Job-Processing-Flow.md](./03-Job-Processing-Flow.md)**
- Detailed job processing workflow
- Worker polling, job claiming, semaphore control
4. **[04-Database-Schema.md](./04-Database-Schema.md)**
- Entity relationship diagram
- Database schema for job queue
- Key indexes and relationships
5. **[05-Deployment-Architecture.md](./05-Deployment-Architecture.md)**
- Production deployment topology
- Load balancing, clustering, monitoring
6. **[06-Concurrency-Control.md](./06-Concurrency-Control.md)**
- Concurrency control mechanisms
- Semaphore-based limiting
- Capacity calculations
7. **[07-Monorepo-Structure.md](./07-Monorepo-Structure.md)**
- Monorepo project organization
- Shared projects and dependencies
## Key Features
- **No Timeouts**: Fire-and-forget pattern with status polling
- **Scalable**: Horizontal scaling of both the API and Compute clusters
- **Reliable**: Jobs persist in the database and survive restarts
- **Efficient**: Dedicated CPU resources for compute work
- **Enterprise-Grade**: Handles 1000+ users with a priority queue and health checks
## Architecture Principles
1. **Separation of Concerns**: API handles requests, Compute handles CPU work
2. **Database as Queue**: PostgreSQL serves as reliable job queue
3. **Shared Codebase**: Monorepo with shared business logic
4. **Resource Isolation**: Compute workers don't interfere with API responsiveness
5. **Fault Tolerance**: Jobs survive worker failures and can be reclaimed
## Capacity Planning
- **Per Worker**: 6 concurrent backtests (8-core machine)
- **3 Workers**: 18 concurrent backtests
- **Throughput**: ~1,080 backtests/hour
- **1000 Users × 10 backtests**: ~9 hours processing time
## Next Steps
1. Create `Managing.Compute` project
2. Implement `BacktestJob` entity and repository
3. Create `BacktestComputeWorker` background service
4. Update API controllers to use job queue pattern
5. Deploy compute workers to dedicated servers