diff --git a/assets/documentation/Workers processing/01-Overall-Architecture.md b/assets/documentation/Workers processing/01-Overall-Architecture.md
new file mode 100644
index 00000000..dc3b5c48
--- /dev/null
+++ b/assets/documentation/Workers processing/01-Overall-Architecture.md
@@ -0,0 +1,78 @@
+# Overall System Architecture
+
+This diagram shows the complete system architecture: the API Server Cluster, the Compute Worker Cluster, and their interactions with the database and external services.
+
+```mermaid
+graph TB
+    subgraph "Monorepo Structure"
+        subgraph "API Server Cluster"
+            API1[Managing.Api<br/>API-1<br/>Orleans]
+            API2[Managing.Api<br/>API-2<br/>Orleans]
+            API3[Managing.Api<br/>API-3<br/>Orleans]
+        end
+
+        subgraph "Compute Worker Cluster"
+            W1[Managing.Compute<br/>Worker-1<br/>8 cores, 6 jobs]
+            W2[Managing.Compute<br/>Worker-2<br/>8 cores, 6 jobs]
+            W3[Managing.Compute<br/>Worker-3<br/>8 cores, 6 jobs]
+        end
+
+        subgraph "Shared Projects"
+            APP[Managing.Application<br/>Business Logic]
+            DOM[Managing.Domain<br/>Domain Models]
+            INFRA[Managing.Infrastructure<br/>Database Access]
+        end
+    end
+
+    subgraph "External Services"
+        DB[(PostgreSQL<br/>Job Queue)]
+        INFLUX[(InfluxDB<br/>Candles)]
+    end
+
+    subgraph "Clients"
+        U1[User 1]
+        U2[User 2]
+        U1000[User 1000]
+    end
+
+    U1 --> API1
+    U2 --> API2
+    U1000 --> API3
+
+    API1 --> DB
+    API2 --> DB
+    API3 --> DB
+
+    W1 --> DB
+    W2 --> DB
+    W3 --> DB
+
+    W1 --> INFLUX
+    W2 --> INFLUX
+    W3 --> INFLUX
+
+    API1 -.uses.-> APP
+    API2 -.uses.-> APP
+    API3 -.uses.-> APP
+    W1 -.uses.-> APP
+    W2 -.uses.-> APP
+    W3 -.uses.-> APP
+
+    style API1 fill:#4A90E2
+    style API2 fill:#4A90E2
+    style API3 fill:#4A90E2
+    style W1 fill:#50C878
+    style W2 fill:#50C878
+    style W3 fill:#50C878
+    style DB fill:#FF6B6B
+    style INFLUX fill:#FFD93D
+```
+
+## Components
+
+- **API Server Cluster**: Handles HTTP requests, creates jobs, returns immediately
+- **Compute Worker Cluster**: Processes CPU-intensive backtest jobs
+- **PostgreSQL**: Job queue and state management
+- **InfluxDB**: Time-series data for candles
+- **Shared Projects**: Common business logic used by both the API and Compute services
diff --git a/assets/documentation/Workers processing/02-Request-Flow.md b/assets/documentation/Workers processing/02-Request-Flow.md
new file mode 100644
index 00000000..88060a3b
--- /dev/null
+++ b/assets/documentation/Workers processing/02-Request-Flow.md
@@ -0,0 +1,52 @@
+# Request Flow Sequence Diagram
+
+This diagram shows the complete request flow, from user submission to job completion and status polling.
+
+```mermaid
+sequenceDiagram
+    participant User
+    participant API as API Server<br/>(Orleans)
+    participant DB as PostgreSQL<br/>(Job Queue)
+    participant Worker as Compute Worker
+    participant Influx as InfluxDB
+
+    User->>API: POST /api/backtest/bundle
+    API->>API: Create BundleBacktestRequest
+    API->>API: Generate BacktestJobs from variants
+    API->>DB: INSERT BacktestJobs (Status: Pending)
+    API-->>User: 202 Accepted<br/>{bundleRequestId, status: "Queued"}
+
+    Note over Worker: Polling every 5 seconds
+    Worker->>DB: SELECT pending jobs<br/>(ORDER BY priority, createdAt)
+    DB-->>Worker: Return pending jobs
+    Worker->>DB: UPDATE job<br/>(Status: Running, AssignedWorkerId)
+    Worker->>Influx: Load candles for backtest
+    Influx-->>Worker: Return candles
+
+    loop Process each candle
+        Worker->>Worker: Run backtest logic
+        Worker->>DB: UPDATE job progress
+    end
+
+    Worker->>DB: UPDATE job<br/>(Status: Completed, ResultJson)
+    Worker->>DB: UPDATE BundleBacktestRequest<br/>(CompletedBacktests++)
+
+    User->>API: GET /api/backtest/bundle/{id}/status
+    API->>DB: SELECT BundleBacktestRequest + job stats
+    DB-->>API: Return status
+    API-->>User: {status, progress, completed/total}
+```
+
+## Flow Steps
+
+1. **User Request**: User submits a bundle backtest request
+2. **API Processing**: API creates the bundle request and generates individual backtest jobs
+3. **Job Queue**: Jobs are inserted into the database with `Pending` status
+4. **Immediate Response**: API returns 202 Accepted with the bundle request ID
+5. **Worker Polling**: Compute workers poll the database every 5 seconds
+6. **Job Claiming**: Worker claims jobs using PostgreSQL advisory locks
+7. **Candle Loading**: Worker loads candles from InfluxDB
+8. **Backtest Processing**: Worker processes the backtest with progress updates
+9. **Result Storage**: Worker saves results and updates bundle progress
+10. **Status Polling**: User polls the API for status updates
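+
+The fire-and-forget contract is the key property of steps 1-4: the endpoint only persists jobs and returns, so no backtest work runs on the request path. A minimal controller sketch of that contract follows; the repository methods, DTO, and factory names are illustrative assumptions, not the actual API:
+
+```csharp
+using Microsoft.AspNetCore.Mvc;
+
+// Sketch only: IBacktestJobRepository methods, BundleBacktestRequestDto, and
+// BacktestJobFactory are assumed shapes for illustration.
+[ApiController]
+[Route("api/backtest")]
+public class BacktestController : ControllerBase
+{
+    private readonly IBacktestJobRepository _jobs;
+
+    public BacktestController(IBacktestJobRepository jobs) => _jobs = jobs;
+
+    [HttpPost("bundle")]
+    public async Task<IActionResult> CreateBundle(BundleBacktestRequestDto dto)
+    {
+        var bundleId = Guid.NewGuid();
+
+        // Expand the variant matrix (date ranges × money management × tickers)
+        // into one BacktestJob per combination, all inserted as Pending.
+        var jobs = BacktestJobFactory.FromVariants(bundleId, dto);
+        await _jobs.InsertPendingAsync(bundleId, jobs);
+
+        // Return immediately; workers pick the jobs up asynchronously.
+        return Accepted(new { bundleRequestId = bundleId, status = "Queued" });
+    }
+
+    [HttpGet("bundle/{id:guid}/status")]
+    public async Task<IActionResult> GetStatus(Guid id)
+    {
+        var stats = await _jobs.GetBundleStatsAsync(id);
+        return stats is null ? NotFound() : Ok(stats);
+    }
+}
+```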
diff --git a/assets/documentation/Workers processing/03-Job-Processing-Flow.md b/assets/documentation/Workers processing/03-Job-Processing-Flow.md
new file mode 100644
index 00000000..5256bfe4
--- /dev/null
+++ b/assets/documentation/Workers processing/03-Job-Processing-Flow.md
@@ -0,0 +1,54 @@
+# Job Processing Flow
+
+This diagram shows the detailed flow of how compute workers process backtest jobs from the queue.
+
+```mermaid
+flowchart TD
+    Start([User Creates<br/>BundleBacktestRequest]) --> CreateJobs[API: Generate<br/>BacktestJobs]
+    CreateJobs --> InsertDB[(Insert Jobs<br/>Status: Pending)]
+
+    InsertDB --> WorkerPoll{Worker Polls<br/>Database}
+
+    WorkerPoll -->|Every 5s| CheckJobs{Jobs<br/>Available?}
+    CheckJobs -->|No| Wait[Wait 5s]
+    Wait --> WorkerPoll
+
+    CheckJobs -->|Yes| ClaimJobs[Claim Jobs<br/>Advisory Lock]
+    ClaimJobs --> UpdateStatus[Update Status:<br/>Running]
+
+    UpdateStatus --> CheckSemaphore{Semaphore<br/>Available?}
+    CheckSemaphore -->|No| WaitSemaphore[Wait for<br/>slot]
+    WaitSemaphore --> CheckSemaphore
+
+    CheckSemaphore -->|Yes| AcquireSemaphore[Acquire<br/>Semaphore]
+    AcquireSemaphore --> LoadCandles[Load Candles<br/>from InfluxDB]
+
+    LoadCandles --> ProcessBacktest[Process Backtest<br/>CPU-intensive]
+
+    ProcessBacktest --> UpdateProgress{Every<br/>10%?}
+    UpdateProgress -->|Yes| SaveProgress[Update Progress<br/>in DB]
+    SaveProgress --> ProcessBacktest
+    UpdateProgress -->|No| ProcessBacktest
+
+    ProcessBacktest --> BacktestComplete{Backtest<br/>Complete?}
+    BacktestComplete -->|No| ProcessBacktest
+    BacktestComplete -->|Yes| SaveResult[Save Result<br/>Status: Completed]
+
+    SaveResult --> UpdateBundle[Update Bundle<br/>Progress]
+    UpdateBundle --> ReleaseSemaphore[Release<br/>Semaphore]
+    ReleaseSemaphore --> WorkerPoll
+
+    style Start fill:#4A90E2
+    style ProcessBacktest fill:#50C878
+    style SaveResult fill:#FF6B6B
+    style WorkerPoll fill:#FFD93D
+```
+
+## Key Components
+
+- **Worker Polling**: Workers continuously poll the database for pending jobs
+- **Advisory Locks**: PostgreSQL advisory locks prevent multiple workers from claiming the same job
+- **Semaphore Control**: Limits concurrent backtests per worker (default: CPU cores - 2)
+- **Progress Updates**: Progress is saved to the database at every 10% of completion
+- **Bundle Updates**: Each completed job updates the parent bundle request
diff --git a/assets/documentation/Workers processing/04-Database-Schema.md b/assets/documentation/Workers processing/04-Database-Schema.md
new file mode 100644
index 00000000..7de5d9c5
--- /dev/null
+++ b/assets/documentation/Workers processing/04-Database-Schema.md
@@ -0,0 +1,69 @@
+# Database Schema & Queue Structure
+
+This diagram shows the entity relationships between the BundleBacktestRequest, BacktestJob, and User entities.
+
+```mermaid
+erDiagram
+    BundleBacktestRequest ||--o{ BacktestJob : "has many"
+    BundleBacktestRequest }o--|| User : "belongs to"
+
+    BundleBacktestRequest {
+        UUID RequestId PK
+        INT UserId FK
+        STRING Status
+        INT TotalBacktests
+        INT CompletedBacktests
+        INT FailedBacktests
+        DATETIME CreatedAt
+        DATETIME CompletedAt
+        STRING UniversalConfigJson
+        STRING DateTimeRangesJson
+        STRING MoneyManagementVariantsJson
+        STRING TickerVariantsJson
+    }
+
+    BacktestJob {
+        UUID Id PK
+        UUID BundleRequestId FK
+        STRING JobType
+        STRING Status
+        INT Priority
+        TEXT ConfigJson
+        TEXT CandlesJson
+        INT ProgressPercentage
+        INT CurrentBacktestIndex
+        INT TotalBacktests
+        INT CompletedBacktests
+        DATETIME CreatedAt
+        DATETIME StartedAt
+        DATETIME CompletedAt
+        TEXT ResultJson
+        TEXT ErrorMessage
+        STRING AssignedWorkerId
+        DATETIME LastHeartbeat
+    }
+
+    User {
+        INT Id PK
+        STRING Name
+    }
+```
+
+## Table Descriptions
+
+### BundleBacktestRequest
+- Represents a bundle of multiple backtest jobs
+- Contains variant configurations (date ranges, money management, tickers)
+- Tracks overall progress across all jobs
+- Belongs to the requesting user (`UserId`)
+
+### BacktestJob
+- Individual backtest execution unit
+- Contains serialized config and candles
+- Tracks progress, worker assignment, and heartbeat
+- Links to the parent bundle request
+
+### Key Indexes
+- `idx_status_priority`: For efficient job claiming (Status, Priority DESC, CreatedAt)
+- `idx_bundle_request`: For bundle progress queries
+- `idx_assigned_worker`: For worker health monitoring
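+
+The claiming path can be written directly against this schema. Below is a sketch of a repository claim method using a per-job advisory lock, as referenced in the processing-flow document; it assumes Npgsql, a `_connectionString` field, and hand-written SQL whose column names mirror the entity above (`FOR UPDATE SKIP LOCKED` is a common alternative to advisory locks here):
+
+```csharp
+// Sketch only: connection handling, method shape, and naming are assumptions.
+public async Task<List<Guid>> ClaimPendingAsync(string workerId, int maxJobs, CancellationToken ct)
+{
+    const string sql = """
+        UPDATE "BacktestJobs" j
+        SET "Status" = 'Running',
+            "AssignedWorkerId" = @worker,
+            "StartedAt" = now(),
+            "LastHeartbeat" = now()
+        WHERE j."Status" = 'Pending'
+          AND j."Id" IN (
+              SELECT "Id" FROM "BacktestJobs"
+              WHERE "Status" = 'Pending'
+              ORDER BY "Priority" DESC, "CreatedAt"         -- served by idx_status_priority
+              LIMIT @max)
+          AND pg_try_advisory_lock(hashtext(j."Id"::text))  -- only one claimant wins the race
+        RETURNING j."Id";
+        """;
+
+    await using var conn = new NpgsqlConnection(_connectionString);
+    await conn.OpenAsync(ct);
+    await using var cmd = new NpgsqlCommand(sql, conn);
+    cmd.Parameters.AddWithValue("worker", workerId);
+    cmd.Parameters.AddWithValue("max", maxJobs);
+
+    var ids = new List<Guid>();
+    await using var reader = await cmd.ExecuteReaderAsync(ct);
+    while (await reader.ReadAsync(ct))
+        ids.Add(reader.GetGuid(0));
+    // Advisory locks release with the session; after commit, the Status
+    // recheck in the outer WHERE is what prevents double-claims.
+    return ids;
+}
+```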
diff --git a/assets/documentation/Workers processing/05-Deployment-Architecture.md b/assets/documentation/Workers processing/05-Deployment-Architecture.md
new file mode 100644
index 00000000..df59cd81
--- /dev/null
+++ b/assets/documentation/Workers processing/05-Deployment-Architecture.md
@@ -0,0 +1,103 @@
+# Deployment Architecture
+
+This diagram shows the production deployment architecture with load balancing, clustering, and monitoring.
+
+```mermaid
+graph TB
+    subgraph "Load Balancer"
+        LB[NGINX/Cloudflare]
+    end
+
+    subgraph "API Server Cluster"
+        direction LR
+        API1[API-1<br/>Orleans Silo<br/>Port: 11111]
+        API2[API-2<br/>Orleans Silo<br/>Port: 11121]
+        API3[API-3<br/>Orleans Silo<br/>Port: 11131]
+    end
+
+    subgraph "Compute Worker Cluster"
+        direction LR
+        W1[Worker-1<br/>8 CPU Cores<br/>6 Concurrent Jobs]
+        W2[Worker-2<br/>8 CPU Cores<br/>6 Concurrent Jobs]
+        W3[Worker-3<br/>8 CPU Cores<br/>6 Concurrent Jobs]
+    end
+
+    subgraph "Database Cluster"
+        direction LR
+        DB_MASTER[(PostgreSQL<br/>Master<br/>Job Queue)]
+        DB_REPLICA[(PostgreSQL<br/>Replica<br/>Read Only)]
+    end
+
+    subgraph "Time Series DB"
+        INFLUX[(InfluxDB<br/>Candles Data)]
+    end
+
+    subgraph "Monitoring"
+        PROM[Prometheus]
+        GRAF[Grafana]
+    end
+
+    LB --> API1
+    LB --> API2
+    LB --> API3
+
+    API1 --> DB_MASTER
+    API2 --> DB_MASTER
+    API3 --> DB_MASTER
+
+    W1 --> DB_MASTER
+    W2 --> DB_MASTER
+    W3 --> DB_MASTER
+
+    W1 --> INFLUX
+    W2 --> INFLUX
+    W3 --> INFLUX
+
+    W1 --> PROM
+    W2 --> PROM
+    W3 --> PROM
+    API1 --> PROM
+    API2 --> PROM
+    API3 --> PROM
+
+    PROM --> GRAF
+
+    DB_MASTER --> DB_REPLICA
+
+    style LB fill:#9B59B6
+    style API1 fill:#4A90E2
+    style API2 fill:#4A90E2
+    style API3 fill:#4A90E2
+    style W1 fill:#50C878
+    style W2 fill:#50C878
+    style W3 fill:#50C878
+    style DB_MASTER fill:#FF6B6B
+    style INFLUX fill:#FFD93D
+    style PROM fill:#E67E22
+    style GRAF fill:#E67E22
+```
+
+## Deployment Components
+
+### Load Balancer
+- **NGINX/Cloudflare**: Distributes incoming requests across API servers
+- Health checks and failover support
+
+### API Server Cluster
+- **3+ Instances**: Horizontally scalable Orleans silos
+- Each instance handles HTTP requests and Orleans grain operations
+- Ports: 11111, 11121, 11131 (for clustering)
+
+### Compute Worker Cluster
+- **3+ Instances**: Dedicated CPU workers
+- Each worker: 8 CPU cores, 6 concurrent backtests
+- Total capacity: 18 concurrent backtests across the cluster
+
+### Database Cluster
+- **Master**: Handles all writes (job creation, updates)
+- **Replica**: Read-only for status queries and reporting
+
+### Monitoring
+- **Prometheus**: Metrics collection
+- **Grafana**: Visualization and dashboards
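+
+Both node types should expose the scrape and probe endpoints this topology relies on. A minimal hosting sketch for the API side, assuming the `prometheus-net.AspNetCore` and `AspNetCore.HealthChecks.NpgSql` packages (the connection-string name is illustrative); the compute workers would expose the same endpoints from their own lightweight host:
+
+```csharp
+// Sketch only: package choices and configuration names are assumptions.
+using Prometheus;
+
+var builder = WebApplication.CreateBuilder(args);
+
+// Readiness: fail the probe if the job-queue database is unreachable.
+builder.Services.AddHealthChecks()
+    .AddNpgSql(builder.Configuration.GetConnectionString("JobQueue")!);
+
+var app = builder.Build();
+
+app.MapHealthChecks("/healthz");   // polled by NGINX / the orchestrator
+app.UseMetricServer();             // exposes /metrics for Prometheus to scrape
+app.UseHttpMetrics();              // request count/duration/status-code metrics
+
+app.Run();
+```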
diff --git a/assets/documentation/Workers processing/06-Concurrency-Control.md b/assets/documentation/Workers processing/06-Concurrency-Control.md
new file mode 100644
index 00000000..d369df57
--- /dev/null
+++ b/assets/documentation/Workers processing/06-Concurrency-Control.md
@@ -0,0 +1,96 @@
+# Concurrency Control Flow
+
+This diagram shows how the semaphore-based concurrency control works across multiple workers.
+
+```mermaid
+graph LR
+    subgraph "Database Queue"
+        Q[Pending Jobs<br/>Priority Queue]
+    end
+
+    subgraph "Worker-1"
+        S1[Semaphore<br/>6 slots]
+        J1[Job 1]
+        J2[Job 2]
+        J3[Job 3]
+        J4[Job 4]
+        J5[Job 5]
+        J6[Job 6]
+    end
+
+    subgraph "Worker-2"
+        S2[Semaphore<br/>6 slots]
+        J7[Job 7]
+        J8[Job 8]
+        J9[Job 9]
+        J10[Job 10]
+        J11[Job 11]
+        J12[Job 12]
+    end
+
+    subgraph "Worker-3"
+        S3[Semaphore<br/>6 slots]
+        J13[Job 13]
+        J14[Job 14]
+        J15[Job 15]
+        J16[Job 16]
+        J17[Job 17]
+        J18[Job 18]
+    end
+
+    Q -->|Claim 6 jobs| S1
+    Q -->|Claim 6 jobs| S2
+    Q -->|Claim 6 jobs| S3
+
+    S1 --> J1
+    S1 --> J2
+    S1 --> J3
+    S1 --> J4
+    S1 --> J5
+    S1 --> J6
+
+    S2 --> J7
+    S2 --> J8
+    S2 --> J9
+    S2 --> J10
+    S2 --> J11
+    S2 --> J12
+
+    S3 --> J13
+    S3 --> J14
+    S3 --> J15
+    S3 --> J16
+    S3 --> J17
+    S3 --> J18
+
+    style Q fill:#FF6B6B
+    style S1 fill:#50C878
+    style S2 fill:#50C878
+    style S3 fill:#50C878
+```
+
+## Concurrency Control Mechanisms
+
+### 1. Database-Level (Advisory Locks)
+- **PostgreSQL Advisory Locks**: Prevent multiple workers from claiming the same job
+- Atomic job claiming using `pg_try_advisory_lock()`
+- Ensures each job is claimed by at most one worker
+
+### 2. Worker-Level (Semaphore)
+- **SemaphoreSlim**: Limits concurrent backtests per worker
+- Default: `Environment.ProcessorCount - 2` (e.g., 6 on an 8-core machine)
+- Prevents CPU saturation while leaving headroom for I/O, heartbeats, and the host process
+
+### 3. Cluster-Level (Queue Priority)
+- **Priority Queue**: Jobs ordered by priority, then creation time
+- VIP users get higher priority
+- Fair distribution across workers
+
+## Capacity Calculation
+
+- **Per Worker**: 6 concurrent backtests
+- **3 Workers**: 18 concurrent backtests
+- **Average Duration**: ~1 minute per backtest (assumed)
+- **Throughput**: 18 slots × 60 minutes/hour ≈ 1,080 backtests/hour
+- **1000 Users × 10 backtests**: 10,000 ÷ 1,080 ≈ ~9 hours to process the full queue
diff --git a/assets/documentation/Workers processing/07-Monorepo-Structure.md b/assets/documentation/Workers processing/07-Monorepo-Structure.md
new file mode 100644
index 00000000..441a32cc
--- /dev/null
+++ b/assets/documentation/Workers processing/07-Monorepo-Structure.md
@@ -0,0 +1,74 @@
+# Monorepo Project Structure
+
+This diagram shows the monorepo structure, with shared projects used by both the API and Compute services.
+
+```mermaid
+graph TD
+    ROOT[Managing.sln<br/>Monorepo Root]
+
+    ROOT --> API[Managing.Api<br/>API Server<br/>Orleans]
+    ROOT --> COMPUTE[Managing.Compute<br/>Worker App<br/>No Orleans]
+
+    ROOT --> SHARED[Shared Projects]
+
+    SHARED --> APP[Managing.Application<br/>Business Logic]
+    SHARED --> DOM[Managing.Domain<br/>Domain Models]
+    SHARED --> INFRA[Managing.Infrastructure<br/>Database/External]
+    SHARED --> COMMON[Managing.Common<br/>Utilities]
+
+    API --> APP
+    API --> DOM
+    API --> INFRA
+    API --> COMMON
+
+    COMPUTE --> APP
+    COMPUTE --> DOM
+    COMPUTE --> INFRA
+    COMPUTE --> COMMON
+
+    style ROOT fill:#9B59B6
+    style API fill:#4A90E2
+    style COMPUTE fill:#50C878
+    style SHARED fill:#FFD93D
+```
+
+## Project Organization
+
+### Root Level
+- **Managing.sln**: Solution file containing all projects
+
+### Service Projects
+- **Managing.Api**: API server with Orleans
+  - Controllers, Orleans grains, HTTP endpoints
+  - Handles user requests, creates jobs
+
+- **Managing.Compute**: Compute worker app (NEW)
+  - Background workers, job processors
+  - No Orleans dependency
+  - Dedicated CPU processing
+
+### Shared Projects
+- **Managing.Application**: Business logic
+  - `Backtester.cs`, `TradingBotBase.cs`
+  - Used by both API and Compute
+
+- **Managing.Domain**: Domain models
+  - `BundleBacktestRequest.cs`, `BacktestJob.cs`
+  - Shared entities
+
+- **Managing.Infrastructure**: External integrations
+  - Database repositories, InfluxDB client
+  - Shared data access
+
+- **Managing.Common**: Utilities
+  - Constants, enums, helpers
+  - Shared across all projects
+
+## Benefits
+
+1. **Code Reuse**: Shared business logic between API and Compute
+2. **Consistency**: Same domain models and logic everywhere
+3. **Maintainability**: Single source of truth
+4. **Type Safety**: Shared types prevent serialization issues
+5. **Testing**: Shared test projects
diff --git a/assets/documentation/Workers processing/IMPLEMENTATION-PLAN.md b/assets/documentation/Workers processing/IMPLEMENTATION-PLAN.md
new file mode 100644
index 00000000..6032d79a
--- /dev/null
+++ b/assets/documentation/Workers processing/IMPLEMENTATION-PLAN.md
@@ -0,0 +1,71 @@
+# Implementation Plan
+
+## Phase 1: Database & Domain Setup
+
+- [ ] Create `BacktestJob` entity in `Managing.Domain/Backtests/`
+- [ ] Create `BacktestJobStatus` enum (Pending, Running, Completed, Failed)
+- [ ] Create database migration for `BacktestJobs` table
+- [ ] Add indexes: `idx_status_priority`, `idx_bundle_request`, `idx_assigned_worker`
+- [ ] Create `IBacktestJobRepository` interface
+- [ ] Implement `BacktestJobRepository` with advisory lock support
+
+## Phase 2: Compute Worker Project
+
+- [ ] Create `Managing.Compute` project (console app/worker service)
+- [ ] Add project references to the shared projects (Application, Domain, Infrastructure)
+- [ ] Configure DI container (NO Orleans)
+- [ ] Create `BacktestComputeWorker` background service
+- [ ] Implement job polling logic (every 5 seconds)
+- [ ] Implement job claiming with PostgreSQL advisory locks
+- [ ] Implement semaphore-based concurrency control
+- [ ] Implement progress callback mechanism
+- [ ] Implement heartbeat mechanism (every 30 seconds)
+- [ ] Add configuration: `MaxConcurrentBacktests`, `JobPollIntervalSeconds`
+
+## Phase 3: API Server Updates
+
+- [ ] Update `BacktestController` to create jobs instead of calling grains directly
+- [ ] Implement `CreateBundleBacktest` endpoint (returns immediately)
+- [ ] Implement `GetBundleStatus` endpoint (polls database)
+- [ ] Update `Backtester.cs` to generate `BacktestJob` entities from bundle variants
+- [ ] Remove direct Orleans grain calls for backtests (keep for other operations)
+
+## Phase 4: Shared Logic
+
+- [ ] Extract backtest execution logic from `BacktestTradingBotGrain` to `Backtester.cs`
+- [ ] Make backtest logic Orleans-agnostic (can run in a worker or a grain)
+- [ ] Add progress callback support to the `RunBacktestAsync` method (see the sketch below)
+- [ ] Ensure candle loading works in both contexts
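+
+The heart of Phase 4 is making the execution path host-agnostic. One possible target shape, where the same method is callable from a grain or from the compute worker and reports progress through the standard `IProgress<T>` abstraction; the signature and helper types below are a proposal, not existing code:
+
+```csharp
+// Sketch of the Orleans-agnostic core: BacktestConfig, Candle, BacktestState,
+// and BacktestResult stand in for the real domain types.
+public class Backtester
+{
+    public Task<BacktestResult> RunBacktestAsync(
+        BacktestConfig config,
+        IReadOnlyList<Candle> candles,
+        IProgress<int>? progress = null,    // percentage, reported every 10%
+        CancellationToken ct = default)
+    {
+        var state = new BacktestState(config);
+        int lastReported = 0;
+
+        for (int i = 0; i < candles.Count; i++)
+        {
+            ct.ThrowIfCancellationRequested();
+            state.Apply(candles[i]);        // pure CPU work, no Orleans types
+
+            int pct = (int)((i + 1) * 100L / candles.Count);
+            if (pct >= lastReported + 10)   // throttle progress writes to every 10%
+            {
+                lastReported = pct;
+                progress?.Report(pct);
+            }
+        }
+
+        return Task.FromResult(state.ToResult());
+    }
+}
+```
+
+The grain and the worker would then differ only in the `IProgress<int>` they pass in, e.g., a callback that updates `ProgressPercentage` on the `BacktestJob` row in the worker case.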
+
+## Phase 5: Monitoring & Health Checks
+
+- [ ] Add health check endpoint to the compute worker
+- [ ] Add metrics: pending jobs, running jobs, completed/failed counts
+- [ ] Add stale job detection (reclaim jobs from dead workers)
+- [ ] Add logging for job lifecycle events
+
+## Phase 6: Deployment
+
+- [ ] Create Dockerfile for `Managing.Compute`
+- [ ] Create deployment configuration for compute workers
+- [ ] Configure environment variables for the compute cluster
+- [ ] Set up monitoring dashboards (Prometheus/Grafana)
+- [ ] Configure auto-scaling rules for compute workers
+
+## Phase 7: Testing & Validation
+
+- [ ] Test single backtest job processing
+- [ ] Test bundle backtest with multiple jobs
+- [ ] Test concurrent job processing (multiple workers)
+- [ ] Test job recovery after worker failure
+- [ ] Test priority queue ordering
+- [ ] Load test with 1000+ concurrent users
+
+## Phase 8: Migration Strategy
+
+- [ ] Keep Orleans grains as a fallback during the transition
+- [ ] Feature flag to switch between Orleans and Compute workers (see the sketch below)
+- [ ] Gradual migration: test with a small percentage of traffic
+- [ ] Monitor performance and error rates
+- [ ] Full cutover once validated
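+
+For the cutover, the switch can live behind a single dispatch point in the API. A sketch assuming a plain boolean configuration flag; the flag name, job factory, and grain interface are illustrative, and a per-user percentage check could replace the flag for the gradual-rollout step:
+
+```csharp
+// Sketch only: flag name and service shapes are assumptions.
+public class BacktestDispatcher
+{
+    private readonly IConfiguration _config;
+    private readonly IBacktestJobRepository _jobs;
+    private readonly IGrainFactory _grains;
+
+    public BacktestDispatcher(IConfiguration config, IBacktestJobRepository jobs, IGrainFactory grains)
+        => (_config, _jobs, _grains) = (config, jobs, grains);
+
+    public async Task<Guid> SubmitAsync(BundleBacktestRequest request)
+    {
+        if (_config.GetValue<bool>("Backtests:UseComputeWorkers"))
+        {
+            // New path: enqueue jobs for the compute cluster to claim.
+            await _jobs.InsertPendingAsync(request.RequestId, BacktestJobFactory.FromBundle(request));
+        }
+        else
+        {
+            // Fallback path: existing Orleans grain execution.
+            var grain = _grains.GetGrain<IBacktestTradingBotGrain>(request.RequestId);
+            await grain.RunBundleAsync(request);
+        }
+
+        return request.RequestId;
+    }
+}
+```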
diff --git a/assets/documentation/Workers processing/README.md b/assets/documentation/Workers processing/README.md
new file mode 100644
index 00000000..d0081726
--- /dev/null
+++ b/assets/documentation/Workers processing/README.md
@@ -0,0 +1,75 @@
+# Workers Processing Architecture
+
+This folder contains documentation for the enterprise-grade backtest processing architecture, which uses a database queue pattern with separate API and Compute worker clusters.
+
+## Overview
+
+The architecture separates concerns between:
+- **API Server**: Handles HTTP requests, creates jobs, returns immediately (fire-and-forget)
+- **Compute Workers**: Process CPU-intensive backtest jobs from the database queue
+- **Database Queue**: Central coordination point using PostgreSQL
+
+## Documentation Files
+
+1. **[01-Overall-Architecture.md](./01-Overall-Architecture.md)**
+   - Complete system architecture diagram
+   - Component relationships
+   - External service integrations
+
+2. **[02-Request-Flow.md](./02-Request-Flow.md)**
+   - Sequence diagram of the request flow
+   - User request → Job creation → Processing → Status polling
+
+3. **[03-Job-Processing-Flow.md](./03-Job-Processing-Flow.md)**
+   - Detailed job processing workflow
+   - Worker polling, job claiming, semaphore control
+
+4. **[04-Database-Schema.md](./04-Database-Schema.md)**
+   - Entity relationship diagram
+   - Database schema for the job queue
+   - Key indexes and relationships
+
+5. **[05-Deployment-Architecture.md](./05-Deployment-Architecture.md)**
+   - Production deployment topology
+   - Load balancing, clustering, monitoring
+
+6. **[06-Concurrency-Control.md](./06-Concurrency-Control.md)**
+   - Concurrency control mechanisms
+   - Semaphore-based limiting
+   - Capacity calculations
+
+7. **[07-Monorepo-Structure.md](./07-Monorepo-Structure.md)**
+   - Monorepo project organization
+   - Shared projects and dependencies
+
+## Key Features
+
+- ✅ **No Timeouts**: Fire-and-forget pattern with polling
+- ✅ **Scalable**: Horizontal scaling of both the API and Compute clusters
+- ✅ **Reliable**: Jobs persist in the database and survive restarts
+- ✅ **Efficient**: Dedicated CPU resources for compute work
+- ✅ **Enterprise-Grade**: Handles 1000+ users, priority queue, health checks
+
+## Architecture Principles
+
+1. **Separation of Concerns**: API handles requests, Compute handles CPU work
+2. **Database as Queue**: PostgreSQL serves as a reliable job queue
+3. **Shared Codebase**: Monorepo with shared business logic
+4. **Resource Isolation**: Compute workers don't interfere with API responsiveness
+5. **Fault Tolerance**: Jobs survive worker failures and can be reclaimed
+
+## Capacity Planning
+
+- **Per Worker**: 6 concurrent backtests (8-core machine)
+- **3 Workers**: 18 concurrent backtests
+- **Throughput**: ~1,080 backtests/hour
+- **1000 Users × 10 backtests**: ~9 hours processing time
+
+## Next Steps
+
+1. Create `Managing.Compute` project
+2. Implement `BacktestJob` entity and repository
+3. Create `BacktestComputeWorker` background service
+4. Update API controllers to use the job queue pattern
+5. Deploy compute workers to dedicated servers
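+
+On the fault-tolerance principle above: because workers stamp `LastHeartbeat` while a job runs, a periodic reaper can return jobs from dead workers to the queue. A sketch of the reclaim statement; the interval, connection string, and hosting details are illustrative:
+
+```csharp
+using Npgsql;
+
+// Sketch only: run periodically (e.g., once a minute) from any worker or the API.
+const string connectionString = "Host=localhost;Database=managing";  // illustrative
+const string reclaimSql = """
+    UPDATE "BacktestJobs"
+    SET "Status" = 'Pending',
+        "AssignedWorkerId" = NULL
+    WHERE "Status" = 'Running'
+      AND "LastHeartbeat" < now() - interval '2 minutes';  -- worker presumed dead
+    """;
+
+await using var conn = new NpgsqlConnection(connectionString);
+await conn.OpenAsync();
+await using var cmd = new NpgsqlCommand(reclaimSql, conn);
+int reclaimed = await cmd.ExecuteNonQueryAsync();  // rows returned to the Pending queue
+```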