Add doc for workers architecture
@@ -0,0 +1,78 @@
# Overall System Architecture

This diagram shows the complete system architecture with the API Server Cluster, the Compute Worker Cluster, and their interactions with the database and external services.

```mermaid
graph TB
    subgraph "Monorepo Structure"
        subgraph "API Server Cluster"
            API1[Managing.Api<br/>API-1<br/>Orleans]
            API2[Managing.Api<br/>API-2<br/>Orleans]
            API3[Managing.Api<br/>API-3<br/>Orleans]
        end

        subgraph "Compute Worker Cluster"
            W1[Managing.Compute<br/>Worker-1<br/>8 cores, 6 jobs]
            W2[Managing.Compute<br/>Worker-2<br/>8 cores, 6 jobs]
            W3[Managing.Compute<br/>Worker-3<br/>8 cores, 6 jobs]
        end

        subgraph "Shared Projects"
            APP[Managing.Application<br/>Business Logic]
            DOM[Managing.Domain<br/>Domain Models]
            INFRA[Managing.Infrastructure<br/>Database Access]
        end
    end

    subgraph "External Services"
        DB[(PostgreSQL<br/>Job Queue)]
        INFLUX[(InfluxDB<br/>Candles)]
    end

    subgraph "Clients"
        U1[User 1]
        U2[User 2]
        U1000[User 1000]
    end

    U1 --> API1
    U2 --> API2
    U1000 --> API3

    API1 --> DB
    API2 --> DB
    API3 --> DB

    W1 --> DB
    W2 --> DB
    W3 --> DB

    W1 --> INFLUX
    W2 --> INFLUX
    W3 --> INFLUX

    API1 -.uses.-> APP
    API2 -.uses.-> APP
    API3 -.uses.-> APP
    W1 -.uses.-> APP
    W2 -.uses.-> APP
    W3 -.uses.-> APP

    style API1 fill:#4A90E2
    style API2 fill:#4A90E2
    style API3 fill:#4A90E2
    style W1 fill:#50C878
    style W2 fill:#50C878
    style W3 fill:#50C878
    style DB fill:#FF6B6B
    style INFLUX fill:#FFD93D
```

## Components

- **API Server Cluster**: Handles HTTP requests, creates jobs, returns immediately
- **Compute Worker Cluster**: Processes CPU-intensive backtest jobs
- **PostgreSQL**: Job queue and state management
- **InfluxDB**: Time-series data for candles
- **Shared Projects**: Common business logic used by both the API and Compute services

assets/documentation/Workers processing/02-Request-Flow.md (new file, 52 lines)
@@ -0,0 +1,52 @@
# Request Flow Sequence Diagram

This diagram shows the complete request flow from user submission to job completion and status polling.

```mermaid
sequenceDiagram
    participant User
    participant API as API Server<br/>(Orleans)
    participant DB as PostgreSQL<br/>(Job Queue)
    participant Worker as Compute Worker
    participant Influx as InfluxDB

    User->>API: POST /api/backtest/bundle
    API->>API: Create BundleBacktestRequest
    API->>API: Generate BacktestJobs from variants
    API->>DB: INSERT BacktestJobs (Status: Pending)
    API-->>User: 202 Accepted<br/>{bundleRequestId, status: "Queued"}

    Note over Worker: Polling every 5 seconds
    Worker->>DB: SELECT pending jobs<br/>(ORDER BY priority, createdAt)
    DB-->>Worker: Return pending jobs
    Worker->>DB: UPDATE job<br/>(Status: Running, AssignedWorkerId)
    Worker->>Influx: Load candles for backtest
    Influx-->>Worker: Return candles

    loop Process each candle
        Worker->>Worker: Run backtest logic
        Worker->>DB: UPDATE job progress
    end

    Worker->>DB: UPDATE job<br/>(Status: Completed, ResultJson)
    Worker->>DB: UPDATE BundleBacktestRequest<br/>(CompletedBacktests++)

    User->>API: GET /api/backtest/bundle/{id}/status
    API->>DB: SELECT BundleBacktestRequest + job stats
    DB-->>API: Return status
    API-->>User: {status, progress, completed/total}
```

## Flow Steps

1. **User Request**: User submits a bundle backtest request
2. **API Processing**: API creates the bundle request and generates the individual backtest jobs
3. **Job Queue**: Jobs are inserted into the database with `Pending` status
4. **Immediate Response**: API returns 202 Accepted with the bundle request ID
5. **Worker Polling**: Compute workers poll the database every 5 seconds
6. **Job Claiming**: Worker claims jobs using PostgreSQL advisory locks (see the sketch after this list)
7. **Candle Loading**: Worker loads candles from InfluxDB
8. **Backtest Processing**: Worker processes the backtest, reporting progress as it goes
9. **Result Storage**: Worker saves results and updates bundle progress
10. **Status Polling**: User polls the API for status updates
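
Step 6 is the most delicate part of the flow, so here is a minimal sketch of claiming a single pending job atomically with an advisory lock. It assumes the `BacktestJobs` table from 04-Database-Schema.md and plain Npgsql; `JobClaimer` and `TryClaimJobAsync` are illustrative names, not existing code.

```csharp
using System;
using System.Threading.Tasks;
using Npgsql;

// Minimal sketch, assuming the BacktestJobs schema from 04-Database-Schema.md.
public static class JobClaimer
{
    public static async Task<Guid?> TryClaimJobAsync(NpgsqlDataSource db, string workerId)
    {
        // hashtext() maps the job UUID onto the bigint key space advisory locks use.
        // Note: locks taken during the scan are session-scoped; release them after
        // claiming (pg_advisory_unlock_all), or they are dropped when the pooled
        // connection is reset.
        const string sql = """
            UPDATE "BacktestJobs"
            SET "Status" = 'Running',
                "AssignedWorkerId" = @worker,
                "StartedAt" = now()
            WHERE "Id" = (
                SELECT "Id"
                FROM "BacktestJobs"
                WHERE "Status" = 'Pending'
                  AND pg_try_advisory_lock(hashtext("Id"::text))
                ORDER BY "Priority" DESC, "CreatedAt"
                LIMIT 1)
            RETURNING "Id";
            """;

        await using var conn = await db.OpenConnectionAsync();
        await using var cmd = new NpgsqlCommand(sql, conn);
        cmd.Parameters.AddWithValue("worker", workerId);

        var id = await cmd.ExecuteScalarAsync();
        return id is Guid claimed ? claimed : null; // null: nothing claimable right now
    }
}
```

A `FOR UPDATE SKIP LOCKED` subquery is a common alternative claim pattern; this sketch follows the advisory-lock approach these documents standardize on.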
@@ -0,0 +1,54 @@
# Job Processing Flow

This diagram shows the detailed flow of how compute workers process backtest jobs from the queue.

```mermaid
flowchart TD
    Start([User Creates<br/>BundleBacktestRequest]) --> CreateJobs[API: Generate<br/>BacktestJobs]
    CreateJobs --> InsertDB[(Insert Jobs<br/>Status: Pending)]

    InsertDB --> WorkerPoll{Worker Polls<br/>Database}

    WorkerPoll -->|Every 5s| CheckJobs{Jobs<br/>Available?}
    CheckJobs -->|No| Wait[Wait 5s]
    Wait --> WorkerPoll

    CheckJobs -->|Yes| ClaimJobs[Claim Jobs<br/>Advisory Lock]
    ClaimJobs --> UpdateStatus[Update Status:<br/>Running]

    UpdateStatus --> CheckSemaphore{Semaphore<br/>Available?}
    CheckSemaphore -->|No| WaitSemaphore[Wait for<br/>slot]
    WaitSemaphore --> CheckSemaphore

    CheckSemaphore -->|Yes| AcquireSemaphore[Acquire<br/>Semaphore]
    AcquireSemaphore --> LoadCandles[Load Candles<br/>from InfluxDB]

    LoadCandles --> ProcessBacktest[Process Backtest<br/>CPU-intensive]

    ProcessBacktest --> UpdateProgress{Every<br/>10%?}
    UpdateProgress -->|Yes| SaveProgress[Update Progress<br/>in DB]
    SaveProgress --> ProcessBacktest
    UpdateProgress -->|No| ProcessBacktest

    ProcessBacktest --> BacktestComplete{Backtest<br/>Complete?}
    BacktestComplete -->|No| ProcessBacktest
    BacktestComplete -->|Yes| SaveResult[Save Result<br/>Status: Completed]

    SaveResult --> UpdateBundle[Update Bundle<br/>Progress]
    UpdateBundle --> ReleaseSemaphore[Release<br/>Semaphore]
    ReleaseSemaphore --> WorkerPoll

    style Start fill:#4A90E2
    style ProcessBacktest fill:#50C878
    style SaveResult fill:#FF6B6B
    style WorkerPoll fill:#FFD93D
```

## Key Components

- **Worker Polling**: Workers continuously poll the database for pending jobs
- **Advisory Locks**: PostgreSQL advisory locks prevent multiple workers from claiming the same job
- **Semaphore Control**: Limits concurrent backtests per worker (default: CPU cores − 2; see the sketch after this list)
- **Progress Updates**: Progress is saved to the database at every 10% of completion
- **Bundle Updates**: Each completed job updates the parent bundle request
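
The semaphore step is small but easy to get wrong, so here is a minimal sketch of the per-worker gate, assuming the default of `Environment.ProcessorCount - 2` slots; `BacktestSlotGate` is an illustrative name, not existing code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal sketch of the per-worker concurrency gate described above.
public sealed class BacktestSlotGate
{
    private readonly SemaphoreSlim _gate;

    public BacktestSlotGate()
    {
        // Default from the doc: leave two cores free for polling, heartbeats, and I/O.
        int slots = Math.Max(1, Environment.ProcessorCount - 2);
        _gate = new SemaphoreSlim(slots, slots);
    }

    public async Task RunAsync(Func<CancellationToken, Task> processJob, CancellationToken ct)
    {
        await _gate.WaitAsync(ct);   // "Semaphore Available?" — block until a slot frees up
        try
        {
            await processJob(ct);    // the CPU-intensive backtest body
        }
        finally
        {
            _gate.Release();         // "Release Semaphore" — even when the job throws
        }
    }
}
```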
@@ -0,0 +1,69 @@
# Database Schema & Queue Structure

This diagram shows the entity relationships between the BundleBacktestRequest, BacktestJob, and User entities.

```mermaid
erDiagram
    BundleBacktestRequest ||--o{ BacktestJob : "has many"
    BacktestJob }o--|| User : "belongs to"

    BundleBacktestRequest {
        UUID RequestId PK
        INT UserId FK
        STRING Status
        INT TotalBacktests
        INT CompletedBacktests
        INT FailedBacktests
        DATETIME CreatedAt
        DATETIME CompletedAt
        STRING UniversalConfigJson
        STRING DateTimeRangesJson
        STRING MoneyManagementVariantsJson
        STRING TickerVariantsJson
    }

    BacktestJob {
        UUID Id PK
        UUID BundleRequestId FK
        STRING JobType
        STRING Status
        INT Priority
        TEXT ConfigJson
        TEXT CandlesJson
        INT ProgressPercentage
        INT CurrentBacktestIndex
        INT TotalBacktests
        INT CompletedBacktests
        DATETIME CreatedAt
        DATETIME StartedAt
        DATETIME CompletedAt
        TEXT ResultJson
        TEXT ErrorMessage
        STRING AssignedWorkerId
        DATETIME LastHeartbeat
    }

    User {
        INT Id PK
        STRING Name
    }
```

## Table Descriptions

### BundleBacktestRequest
- Represents a bundle of multiple backtest jobs
- Contains the variant configurations (date ranges, money management, tickers)
- Tracks overall progress across all jobs

### BacktestJob
- Individual backtest execution unit
- Contains the serialized config and candles
- Tracks progress, worker assignment, and heartbeat
- Links to the parent bundle request

### Key Indexes
- `idx_status_priority`: For efficient job claiming (Status, Priority DESC, CreatedAt); see the sketch after this list
- `idx_bundle_request`: For bundle progress queries
- `idx_assigned_worker`: For worker health monitoring
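
As one concrete reading of those indexes, here is a hedged EF Core (7+) sketch; the trimmed `BacktestJob` class and `JobQueueContext` are illustrative, carrying only the indexed columns rather than the full entity above.

```csharp
using System;
using Microsoft.EntityFrameworkCore;

// Trimmed, illustrative entity: only the columns the indexes touch.
public class BacktestJob
{
    public Guid Id { get; set; }
    public Guid BundleRequestId { get; set; }
    public string Status { get; set; } = "Pending";
    public int Priority { get; set; }
    public DateTime CreatedAt { get; set; }
    public string? AssignedWorkerId { get; set; }
}

public class JobQueueContext : DbContext
{
    public DbSet<BacktestJob> BacktestJobs => Set<BacktestJob>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        var job = modelBuilder.Entity<BacktestJob>();

        // idx_status_priority: serves the claim query's filter on Status and
        // its ORDER BY Priority DESC, CreatedAt.
        job.HasIndex(j => new { j.Status, j.Priority, j.CreatedAt })
           .HasDatabaseName("idx_status_priority")
           .IsDescending(false, true, false);

        // idx_bundle_request: bundle progress roll-ups.
        job.HasIndex(j => j.BundleRequestId)
           .HasDatabaseName("idx_bundle_request");

        // idx_assigned_worker: stale-worker / heartbeat monitoring.
        job.HasIndex(j => j.AssignedWorkerId)
           .HasDatabaseName("idx_assigned_worker");
    }
}
```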
@@ -0,0 +1,103 @@
# Deployment Architecture

This diagram shows the production deployment architecture with load balancing, clustering, and monitoring.

```mermaid
graph TB
    subgraph "Load Balancer"
        LB[NGINX/Cloudflare]
    end

    subgraph "API Server Cluster"
        direction LR
        API1[API-1<br/>Orleans Silo<br/>Port: 11111]
        API2[API-2<br/>Orleans Silo<br/>Port: 11121]
        API3[API-3<br/>Orleans Silo<br/>Port: 11131]
    end

    subgraph "Compute Worker Cluster"
        direction LR
        W1[Worker-1<br/>8 CPU Cores<br/>6 Concurrent Jobs]
        W2[Worker-2<br/>8 CPU Cores<br/>6 Concurrent Jobs]
        W3[Worker-3<br/>8 CPU Cores<br/>6 Concurrent Jobs]
    end

    subgraph "Database Cluster"
        direction LR
        DB_MASTER[(PostgreSQL<br/>Master<br/>Job Queue)]
        DB_REPLICA[(PostgreSQL<br/>Replica<br/>Read Only)]
    end

    subgraph "Time Series DB"
        INFLUX[(InfluxDB<br/>Candles Data)]
    end

    subgraph "Monitoring"
        PROM[Prometheus]
        GRAF[Grafana]
    end

    LB --> API1
    LB --> API2
    LB --> API3

    API1 --> DB_MASTER
    API2 --> DB_MASTER
    API3 --> DB_MASTER

    W1 --> DB_MASTER
    W2 --> DB_MASTER
    W3 --> DB_MASTER

    W1 --> INFLUX
    W2 --> INFLUX
    W3 --> INFLUX

    W1 --> PROM
    W2 --> PROM
    W3 --> PROM
    API1 --> PROM
    API2 --> PROM
    API3 --> PROM

    PROM --> GRAF

    DB_MASTER --> DB_REPLICA

    style LB fill:#9B59B6
    style API1 fill:#4A90E2
    style API2 fill:#4A90E2
    style API3 fill:#4A90E2
    style W1 fill:#50C878
    style W2 fill:#50C878
    style W3 fill:#50C878
    style DB_MASTER fill:#FF6B6B
    style INFLUX fill:#FFD93D
    style PROM fill:#E67E22
    style GRAF fill:#E67E22
```

## Deployment Components

### Load Balancer
- **NGINX/Cloudflare**: Distributes incoming requests across API servers
- Health checks and failover support

### API Server Cluster
- **3+ Instances**: Horizontally scalable Orleans silos
- Each instance handles HTTP requests and Orleans grain operations
- Ports: 11111, 11121, 11131 (for clustering)

### Compute Worker Cluster
- **3+ Instances**: Dedicated CPU workers
- Each worker: 8 CPU cores, 6 concurrent backtests
- Total capacity: 18 concurrent backtests across the cluster

### Database Cluster
- **Master**: Handles all writes (job creation, updates)
- **Replica**: Read-only, for status queries and reporting

### Monitoring
- **Prometheus**: Metrics collection (a worker-side sketch follows this list)
- **Grafana**: Visualization and dashboards
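
As one way to emit the metrics this stack scrapes, the sketch below uses the community prometheus-net packages (an assumption; these documents don't name a client). The metric names, the port, and the `WorkerMetrics` class are all illustrative.

```csharp
using Prometheus;

// Hedged sketch: a few job metrics plus a scrape endpoint for Prometheus,
// via the prometheus-net / prometheus-net.AspNetCore packages.
public static class WorkerMetrics
{
    public static readonly Gauge RunningJobs = Metrics.CreateGauge(
        "backtest_jobs_running", "Backtest jobs currently executing on this worker");

    public static readonly Counter CompletedJobs = Metrics.CreateCounter(
        "backtest_jobs_completed_total", "Backtest jobs finished successfully");

    public static readonly Counter FailedJobs = Metrics.CreateCounter(
        "backtest_jobs_failed_total", "Backtest jobs that ended in error");

    // Expose /metrics for Prometheus to scrape; the port is an arbitrary example.
    public static IMetricServer StartServer() =>
        new KestrelMetricServer(port: 9184).Start();
}
```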
@@ -0,0 +1,96 @@
# Concurrency Control Flow

This diagram shows how the semaphore-based concurrency control works across multiple workers.

```mermaid
graph LR
    subgraph "Database Queue"
        Q[Pending Jobs<br/>Priority Queue]
    end

    subgraph "Worker-1"
        S1[Semaphore<br/>6 slots]
        J1[Job 1]
        J2[Job 2]
        J3[Job 3]
        J4[Job 4]
        J5[Job 5]
        J6[Job 6]
    end

    subgraph "Worker-2"
        S2[Semaphore<br/>6 slots]
        J7[Job 7]
        J8[Job 8]
        J9[Job 9]
        J10[Job 10]
        J11[Job 11]
        J12[Job 12]
    end

    subgraph "Worker-3"
        S3[Semaphore<br/>6 slots]
        J13[Job 13]
        J14[Job 14]
        J15[Job 15]
        J16[Job 16]
        J17[Job 17]
        J18[Job 18]
    end

    Q -->|Claim 6 jobs| S1
    Q -->|Claim 6 jobs| S2
    Q -->|Claim 6 jobs| S3

    S1 --> J1
    S1 --> J2
    S1 --> J3
    S1 --> J4
    S1 --> J5
    S1 --> J6

    S2 --> J7
    S2 --> J8
    S2 --> J9
    S2 --> J10
    S2 --> J11
    S2 --> J12

    S3 --> J13
    S3 --> J14
    S3 --> J15
    S3 --> J16
    S3 --> J17
    S3 --> J18

    style Q fill:#FF6B6B
    style S1 fill:#50C878
    style S2 fill:#50C878
    style S3 fill:#50C878
```

## Concurrency Control Mechanisms
### 1. Database-Level (Advisory Locks)
- **PostgreSQL Advisory Locks**: Prevent multiple workers from claiming the same job
- Atomic job claiming using `pg_try_advisory_lock()`
- Ensures each job is claimed by exactly one worker at a time

### 2. Worker-Level (Semaphore)
- **SemaphoreSlim**: Limits concurrent backtests per worker
- Default: `Environment.ProcessorCount - 2` (e.g., 6 on an 8-core machine)
- Prevents CPU saturation while leaving headroom for polling, heartbeats, and database I/O

### 3. Cluster-Level (Queue Priority)
- **Priority Queue**: Jobs ordered by priority, then creation time
- VIP users get higher priority
- Fair distribution across workers
## Capacity Calculation

- **Per Worker**: 6 concurrent backtests
- **3 Workers**: 18 concurrent backtests
- **Average Duration**: ~1 minute per backtest (the figure implied by 18 slots running at ~1,080 jobs/hour)
- **Throughput**: 18 slots × ~60 backtests per slot per hour ≈ ~1,080 backtests/hour
- **1000 Users × 10 backtests**: 10,000 jobs ÷ ~1,080/hour ≈ ~9 hours to process the full queue

@@ -0,0 +1,74 @@
# Monorepo Project Structure

This diagram shows the monorepo structure with shared projects used by both the API and Compute services.

```mermaid
graph TD
    ROOT[Managing.sln<br/>Monorepo Root]

    ROOT --> API[Managing.Api<br/>API Server<br/>Orleans]
    ROOT --> COMPUTE[Managing.Compute<br/>Worker App<br/>No Orleans]

    ROOT --> SHARED[Shared Projects]

    SHARED --> APP[Managing.Application<br/>Business Logic]
    SHARED --> DOM[Managing.Domain<br/>Domain Models]
    SHARED --> INFRA[Managing.Infrastructure<br/>Database/External]
    SHARED --> COMMON[Managing.Common<br/>Utilities]

    API --> APP
    API --> DOM
    API --> INFRA
    API --> COMMON

    COMPUTE --> APP
    COMPUTE --> DOM
    COMPUTE --> INFRA
    COMPUTE --> COMMON

    style ROOT fill:#9B59B6
    style API fill:#4A90E2
    style COMPUTE fill:#50C878
    style SHARED fill:#FFD93D
```

## Project Organization

### Root Level
- **Managing.sln**: Solution file containing all projects

### Service Projects
- **Managing.Api**: API Server with Orleans
  - Controllers, Orleans grains, HTTP endpoints
  - Handles user requests, creates jobs

- **Managing.Compute**: Compute Worker App (NEW)
  - Background workers, job processors
  - No Orleans dependency
  - Dedicated CPU processing

### Shared Projects
- **Managing.Application**: Business logic
  - `Backtester.cs`, `TradingBotBase.cs`
  - Used by both API and Compute

- **Managing.Domain**: Domain models
  - `BundleBacktestRequest.cs`, `BacktestJob.cs`
  - Shared entities

- **Managing.Infrastructure**: External integrations
  - Database repositories, InfluxDB client
  - Shared data access

- **Managing.Common**: Utilities
  - Constants, enums, helpers
  - Shared across all projects

## Benefits

1. **Code Reuse**: Shared business logic between API and Compute
2. **Consistency**: Same domain models and logic everywhere
3. **Maintainability**: Single source of truth
4. **Type Safety**: Shared types prevent serialization issues
5. **Testing**: Shared test projects

@@ -0,0 +1,71 @@
# Implementation Plan

## Phase 1: Database & Domain Setup

- [ ] Create `BacktestJob` entity in `Managing.Domain/Backtests/`
- [ ] Create `BacktestJobStatus` enum (Pending, Running, Completed, Failed)
- [ ] Create a database migration for the `BacktestJobs` table
- [ ] Add indexes: `idx_status_priority`, `idx_bundle_request`, `idx_assigned_worker`
- [ ] Create `IBacktestJobRepository` interface (a rough shape is sketched after this list)
- [ ] Implement `BacktestJobRepository` with advisory lock support
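
A rough shape for the status enum and the repository surface, derived from the columns in 04-Database-Schema.md; every method name here is an illustrative guess, not existing code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum BacktestJobStatus { Pending, Running, Completed, Failed }

// Stubbed to keep the sketch self-contained; the real entity carries the
// full column set from 04-Database-Schema.md.
public class BacktestJob
{
    public Guid Id { get; set; }
    // ... remaining columns per 04-Database-Schema.md
}

public interface IBacktestJobRepository
{
    // Atomically claim the next Pending job (advisory lock), or null if none.
    Task<BacktestJob?> TryClaimNextPendingAsync(string workerId, CancellationToken ct);

    Task UpdateProgressAsync(Guid jobId, int progressPercentage, CancellationToken ct);
    Task CompleteAsync(Guid jobId, string resultJson, CancellationToken ct);
    Task FailAsync(Guid jobId, string errorMessage, CancellationToken ct);
    Task HeartbeatAsync(Guid jobId, CancellationToken ct);
}
```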
## Phase 2: Compute Worker Project

- [ ] Create `Managing.Compute` project (console app / worker service)
- [ ] Add project references to the shared projects (Application, Domain, Infrastructure)
- [ ] Configure the DI container (no Orleans)
- [ ] Create the `BacktestComputeWorker` background service (see the skeleton after this list)
- [ ] Implement job polling logic (every 5 seconds)
- [ ] Implement job claiming with PostgreSQL advisory locks
- [ ] Implement semaphore-based concurrency control
- [ ] Implement a progress callback mechanism
- [ ] Implement a heartbeat mechanism (every 30 seconds)
- [ ] Add configuration: `MaxConcurrentBacktests`, `JobPollIntervalSeconds`
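
A minimal skeleton of the polling loop, reusing the `IBacktestJobRepository` sketched under Phase 1; the class below shows the shape only, not the real service.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Skeleton only: claim → gate on the semaphore → run in the background → report.
public sealed class BacktestComputeWorker(
    IBacktestJobRepository jobs,
    ILogger<BacktestComputeWorker> log) : BackgroundService
{
    private static readonly TimeSpan PollInterval = TimeSpan.FromSeconds(5);
    private readonly SemaphoreSlim _gate = new(Math.Max(1, Environment.ProcessorCount - 2));

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var job = await jobs.TryClaimNextPendingAsync(Environment.MachineName, ct);
            if (job is null)
            {
                await Task.Delay(PollInterval, ct); // nothing pending: wait and re-poll
                continue;
            }

            // Matches 03-Job-Processing-Flow.md: the job is already claimed
            // (Running) while we wait for a free semaphore slot.
            await _gate.WaitAsync(ct);
            _ = Task.Run(async () =>
            {
                try
                {
                    // Backtest body elided; progress and heartbeat updates
                    // go through the repository.
                    await jobs.CompleteAsync(job.Id, "{}", ct);
                }
                catch (Exception ex)
                {
                    log.LogError(ex, "Backtest job {JobId} failed", job.Id);
                    await jobs.FailAsync(job.Id, ex.Message, ct);
                }
                finally
                {
                    _gate.Release();
                }
            }, ct);
        }
    }
}
```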
## Phase 3: API Server Updates

- [ ] Update `BacktestController` to create jobs instead of calling grains directly
- [ ] Implement the `CreateBundleBacktest` endpoint (returns immediately; see the sketch after this list)
- [ ] Implement the `GetBundleStatus` endpoint (polls the database)
- [ ] Update `Backtester.cs` to generate `BacktestJob` entities from bundle variants
- [ ] Remove direct Orleans grain calls for backtests (keep them for other operations)
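
One plausible shape for the two endpoints, matching the sequence diagram in 02-Request-Flow.md; `IBundleQueue`, `CreateBundleRequest`, and the method names are illustrative assumptions, not the existing controller surface.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

// Illustrative shapes only.
public sealed record CreateBundleRequest; // body fields elided

public interface IBundleQueue
{
    Task<Guid> EnqueueBundleAsync(CreateBundleRequest request, CancellationToken ct);
    Task<object?> GetBundleStatusAsync(Guid bundleRequestId, CancellationToken ct);
}

[ApiController]
[Route("api/backtest")]
public class BacktestController(IBundleQueue queue) : ControllerBase
{
    [HttpPost("bundle")]
    public async Task<IActionResult> CreateBundleBacktest(
        [FromBody] CreateBundleRequest request, CancellationToken ct)
    {
        // Expand variants into Pending BacktestJob rows, then return before any run.
        Guid bundleRequestId = await queue.EnqueueBundleAsync(request, ct);
        return Accepted(new { bundleRequestId, status = "Queued" }); // 202
    }

    [HttpGet("bundle/{id:guid}/status")]
    public async Task<IActionResult> GetBundleStatus(Guid id, CancellationToken ct)
    {
        var status = await queue.GetBundleStatusAsync(id, ct);
        return status is null ? NotFound() : Ok(status);
    }
}
```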
## Phase 4: Shared Logic

- [ ] Extract the backtest execution logic from `BacktestTradingBotGrain` into `Backtester.cs`
- [ ] Make the backtest logic Orleans-agnostic (so it can run in a worker or a grain)
- [ ] Add progress callback support to the `RunBacktestAsync` method (one possible signature is sketched after this list)
- [ ] Ensure candle loading works in both contexts
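
One possible Orleans-agnostic signature for the progress callback; the delegate name and parameters are illustrative.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Both hosts (grain or compute worker) supply their own progress sink.
public delegate Task ProgressCallback(int percentComplete, CancellationToken ct);

public interface IBacktester
{
    // Runs one backtest from serialized config and returns the result JSON;
    // onProgress might be invoked at every 10% step, per 03-Job-Processing-Flow.md.
    Task<string> RunBacktestAsync(
        string configJson,
        ProgressCallback? onProgress,
        CancellationToken ct);
}
```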
## Phase 5: Monitoring & Health Checks

- [ ] Add a health check endpoint to the compute worker
- [ ] Add metrics: pending jobs, running jobs, completed/failed counts
- [ ] Add stale job detection (reclaim jobs from dead workers; see the sketch after this list)
- [ ] Add logging for job lifecycle events
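
A hedged sketch of stale-job reclamation: any `Running` job whose `LastHeartbeat` is older than a timeout is assumed to belong to a dead worker and is re-queued. Table and column names follow 04-Database-Schema.md; `StaleJobReclaimer` and the timeout are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Npgsql;

public static class StaleJobReclaimer
{
    public static async Task<int> ReclaimAsync(NpgsqlDataSource db, TimeSpan timeout)
    {
        // With a 30-second heartbeat, a timeout of a few minutes is a
        // reasonable starting point.
        const string sql = """
            UPDATE "BacktestJobs"
            SET "Status" = 'Pending', "AssignedWorkerId" = NULL
            WHERE "Status" = 'Running'
              AND "LastHeartbeat" < now() - @timeout;
            """;

        await using var conn = await db.OpenConnectionAsync();
        await using var cmd = new NpgsqlCommand(sql, conn);
        cmd.Parameters.AddWithValue("timeout", timeout); // maps to a Postgres interval
        return await cmd.ExecuteNonQueryAsync();         // number of jobs re-queued
    }
}
```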
## Phase 6: Deployment

- [ ] Create a Dockerfile for `Managing.Compute`
- [ ] Create deployment configuration for the compute workers
- [ ] Configure environment variables for the compute cluster
- [ ] Set up monitoring dashboards (Prometheus/Grafana)
- [ ] Configure auto-scaling rules for the compute workers

## Phase 7: Testing & Validation

- [ ] Test single backtest job processing
- [ ] Test a bundle backtest with multiple jobs
- [ ] Test concurrent job processing (multiple workers)
- [ ] Test job recovery after worker failure
- [ ] Test priority queue ordering
- [ ] Load test with 1000+ concurrent users

## Phase 8: Migration Strategy

- [ ] Keep Orleans grains as a fallback during the transition
- [ ] Add a feature flag to switch between Orleans and Compute workers
- [ ] Migrate gradually: start with a small percentage of traffic
- [ ] Monitor performance and error rates
- [ ] Cut over fully once validated

assets/documentation/Workers processing/README.md (new file, 75 lines)
@@ -0,0 +1,75 @@
# Workers Processing Architecture

This folder contains documentation for the enterprise-grade backtest processing architecture, which uses a database queue pattern with separate API and Compute worker clusters.

## Overview

The architecture separates concerns between:

- **API Server**: Handles HTTP requests, creates jobs, returns immediately (fire-and-forget)
- **Compute Workers**: Process CPU-intensive backtest jobs from the database queue
- **Database Queue**: Central coordination point using PostgreSQL

## Documentation Files

1. **[01-Overall-Architecture.md](./01-Overall-Architecture.md)**
   - Complete system architecture diagram
   - Component relationships
   - External service integrations

2. **[02-Request-Flow.md](./02-Request-Flow.md)**
   - Sequence diagram of the request flow
   - User request → job creation → processing → status polling

3. **[03-Job-Processing-Flow.md](./03-Job-Processing-Flow.md)**
   - Detailed job processing workflow
   - Worker polling, job claiming, semaphore control

4. **[04-Database-Schema.md](./04-Database-Schema.md)**
   - Entity relationship diagram
   - Database schema for the job queue
   - Key indexes and relationships

5. **[05-Deployment-Architecture.md](./05-Deployment-Architecture.md)**
   - Production deployment topology
   - Load balancing, clustering, monitoring

6. **[06-Concurrency-Control.md](./06-Concurrency-Control.md)**
   - Concurrency control mechanisms
   - Semaphore-based limiting
   - Capacity calculations

7. **[07-Monorepo-Structure.md](./07-Monorepo-Structure.md)**
   - Monorepo project organization
   - Shared projects and dependencies

## Key Features

- ✅ **No Timeouts**: Fire-and-forget pattern with polling
- ✅ **Scalable**: Horizontal scaling of both API and Compute clusters
- ✅ **Reliable**: Jobs persist in the database and survive restarts
- ✅ **Efficient**: Dedicated CPU resources for compute work
- ✅ **Enterprise-Grade**: Handles 1000+ users, priority queue, health checks

## Architecture Principles

1. **Separation of Concerns**: API handles requests, Compute handles CPU work
2. **Database as Queue**: PostgreSQL serves as a reliable job queue
3. **Shared Codebase**: Monorepo with shared business logic
4. **Resource Isolation**: Compute workers don't interfere with API responsiveness
5. **Fault Tolerance**: Jobs survive worker failures and can be reclaimed

## Capacity Planning

- **Per Worker**: 6 concurrent backtests (8-core machine)
- **3 Workers**: 18 concurrent backtests
- **Throughput**: ~1,080 backtests/hour
- **1000 Users × 10 backtests**: ~9 hours processing time

## Next Steps

1. Create the `Managing.Compute` project
2. Implement the `BacktestJob` entity and repository
3. Create the `BacktestComputeWorker` background service
4. Update API controllers to use the job queue pattern
5. Deploy compute workers to dedicated servers