Add doc for workers architecture

2025-11-07 15:34:13 +07:00
parent 2dc34f07d8
commit 5578d272fa
9 changed files with 672 additions and 0 deletions


@@ -0,0 +1,78 @@
# Overall System Architecture
This diagram shows the complete system architecture: the API Server Cluster, the Compute Worker Cluster, and their interactions with the database and external services.
```mermaid
graph TB
subgraph "Monorepo Structure"
subgraph "API Server Cluster"
API1[Managing.Api<br/>API-1<br/>Orleans]
API2[Managing.Api<br/>API-2<br/>Orleans]
API3[Managing.Api<br/>API-3<br/>Orleans]
end
subgraph "Compute Worker Cluster"
W1[Managing.Compute<br/>Worker-1<br/>8 cores, 6 jobs]
W2[Managing.Compute<br/>Worker-2<br/>8 cores, 6 jobs]
W3[Managing.Compute<br/>Worker-3<br/>8 cores, 6 jobs]
end
subgraph "Shared Projects"
APP[Managing.Application<br/>Business Logic]
DOM[Managing.Domain<br/>Domain Models]
INFRA[Managing.Infrastructure<br/>Database Access]
end
end
subgraph "External Services"
DB[(PostgreSQL<br/>Job Queue)]
INFLUX[(InfluxDB<br/>Candles)]
end
subgraph "Clients"
U1[User 1]
U2[User 2]
U1000[User 1000]
end
U1 --> API1
U2 --> API2
U1000 --> API3
API1 --> DB
API2 --> DB
API3 --> DB
W1 --> DB
W2 --> DB
W3 --> DB
W1 --> INFLUX
W2 --> INFLUX
W3 --> INFLUX
API1 -.uses.-> APP
API2 -.uses.-> APP
API3 -.uses.-> APP
W1 -.uses.-> APP
W2 -.uses.-> APP
W3 -.uses.-> APP
style API1 fill:#4A90E2
style API2 fill:#4A90E2
style API3 fill:#4A90E2
style W1 fill:#50C878
style W2 fill:#50C878
style W3 fill:#50C878
style DB fill:#FF6B6B
style INFLUX fill:#FFD93D
```
## Components
- **API Server Cluster**: Handles HTTP requests, creates jobs, returns immediately
- **Compute Worker Cluster**: Processes CPU-intensive backtest jobs
- **PostgreSQL**: Job queue and state management
- **InfluxDB**: Time-series data for candles
- **Shared Projects**: Common business logic used by both API and Compute services


@@ -0,0 +1,52 @@
# Request Flow Sequence Diagram
This diagram shows the complete request flow from user submission to job completion and status polling.
```mermaid
sequenceDiagram
participant User
participant API as API Server<br/>(Orleans)
participant DB as PostgreSQL<br/>(Job Queue)
participant Worker as Compute Worker
participant Influx as InfluxDB
User->>API: POST /api/backtest/bundle
API->>API: Create BundleBacktestRequest
API->>API: Generate BacktestJobs from variants
API->>DB: INSERT BacktestJobs (Status: Pending)
API-->>User: 202 Accepted<br/>{bundleRequestId, status: "Queued"}
Note over Worker: Polling every 5 seconds
Worker->>DB: SELECT pending jobs<br/>(ORDER BY priority, createdAt)
DB-->>Worker: Return pending jobs
Worker->>DB: UPDATE job<br/>(Status: Running, AssignedWorkerId)
Worker->>Influx: Load candles for backtest
Influx-->>Worker: Return candles
loop Process each candle
Worker->>Worker: Run backtest logic
Worker->>DB: UPDATE job progress
end
Worker->>DB: UPDATE job<br/>(Status: Completed, ResultJson)
Worker->>DB: UPDATE BundleBacktestRequest<br/>(CompletedBacktests++)
User->>API: GET /api/backtest/bundle/{id}/status
API->>DB: SELECT BundleBacktestRequest + job stats
DB-->>API: Return status
API-->>User: {status, progress, completed/total}
```
## Flow Steps
1. **User Request**: User submits bundle backtest request
2. **API Processing**: API creates bundle request and generates individual backtest jobs
3. **Job Queue**: Jobs are inserted into the database with `Pending` status
4. **Immediate Response**: API returns 202 Accepted with the bundle request ID
5. **Worker Polling**: Compute workers poll the database every 5 seconds
6. **Job Claiming**: Worker claims jobs using PostgreSQL advisory locks
7. **Candle Loading**: Worker loads candles from InfluxDB
8. **Backtest Processing**: Worker processes backtest with progress updates
9. **Result Storage**: Worker saves results and updates bundle progress
10. **Status Polling**: User polls API for status updates
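
A minimal C# sketch of steps 1 through 4 follows; the controller shape, DTO, and repository calls (`IBacktestJobRepository.InsertBundleWithJobsAsync`, `BacktestJobFactory.CreateFromVariants`) are illustrative assumptions, not the actual codebase API.
```csharp
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/backtest")]
public class BacktestController : ControllerBase
{
    private readonly IBacktestJobRepository _jobs;   // hypothetical repository abstraction

    public BacktestController(IBacktestJobRepository jobs) => _jobs = jobs;

    [HttpPost("bundle")]
    public async Task<IActionResult> CreateBundleBacktest(BundleBacktestRequestDto dto)
    {
        // Steps 1-2: create the bundle request and expand its variants into individual jobs.
        var bundle = BundleBacktestRequest.FromDto(dto);            // assumed factory method
        var jobs = BacktestJobFactory.CreateFromVariants(bundle);   // assumed variant expansion

        // Step 3: persist everything with Status = Pending; compute workers pick the jobs up later.
        await _jobs.InsertBundleWithJobsAsync(bundle, jobs);

        // Step 4: respond immediately; the backtests run asynchronously in the compute cluster.
        return Accepted(new { bundleRequestId = bundle.RequestId, status = "Queued" });
    }
}
```
The endpoint never waits on the backtests themselves; it only persists the jobs and echoes the bundle request ID so the client can poll `GET /api/backtest/bundle/{id}/status` later.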


@@ -0,0 +1,54 @@
# Job Processing Flow
This diagram shows the detailed flow of how compute workers process backtest jobs from the queue.
```mermaid
flowchart TD
Start([User Creates<br/>BundleBacktestRequest]) --> CreateJobs[API: Generate<br/>BacktestJobs]
CreateJobs --> InsertDB[(Insert Jobs<br/>Status: Pending)]
InsertDB --> WorkerPoll{Worker Polls<br/>Database}
WorkerPoll -->|Every 5s| CheckJobs{Jobs<br/>Available?}
CheckJobs -->|No| Wait[Wait 5s]
Wait --> WorkerPoll
CheckJobs -->|Yes| ClaimJobs[Claim Jobs<br/>Advisory Lock]
ClaimJobs --> UpdateStatus[Update Status:<br/>Running]
UpdateStatus --> CheckSemaphore{Semaphore<br/>Available?}
CheckSemaphore -->|No| WaitSemaphore[Wait for<br/>slot]
WaitSemaphore --> CheckSemaphore
CheckSemaphore -->|Yes| AcquireSemaphore[Acquire<br/>Semaphore]
AcquireSemaphore --> LoadCandles[Load Candles<br/>from InfluxDB]
LoadCandles --> ProcessBacktest[Process Backtest<br/>CPU-intensive]
ProcessBacktest --> UpdateProgress{Every<br/>10%?}
UpdateProgress -->|Yes| SaveProgress[Update Progress<br/>in DB]
SaveProgress --> ProcessBacktest
UpdateProgress -->|No| ProcessBacktest
ProcessBacktest --> BacktestComplete{Backtest<br/>Complete?}
BacktestComplete -->|No| ProcessBacktest
BacktestComplete -->|Yes| SaveResult[Save Result<br/>Status: Completed]
SaveResult --> UpdateBundle[Update Bundle<br/>Progress]
UpdateBundle --> ReleaseSemaphore[Release<br/>Semaphore]
ReleaseSemaphore --> WorkerPoll
style Start fill:#4A90E2
style ProcessBacktest fill:#50C878
style SaveResult fill:#FF6B6B
style WorkerPoll fill:#FFD93D
```
## Key Components
- **Worker Polling**: Workers continuously poll the database for pending jobs
- **Advisory Locks**: PostgreSQL advisory locks prevent multiple workers from claiming the same job
- **Semaphore Control**: Limits concurrent backtests per worker (default: CPU cores - 2)
- **Progress Updates**: Progress is written to the database at every 10% of completion
- **Bundle Updates**: Individual job completion updates the parent bundle request
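
A simplified sketch of this loop, assuming a `BackgroundService` host and a hypothetical `IBacktestJobRepository.ClaimPendingJobsAsync` method that wraps the advisory-lock claiming:
```csharp
using Microsoft.Extensions.Hosting;

public class BacktestComputeWorker : BackgroundService
{
    private readonly IBacktestJobRepository _jobs;   // assumed repository abstraction
    private readonly SemaphoreSlim _slots = new(Math.Max(1, Environment.ProcessorCount - 2));

    public BacktestComputeWorker(IBacktestJobRepository jobs) => _jobs = jobs;

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            // Claim at most one job per free semaphore slot; advisory locks prevent double claims.
            var claimed = await _jobs.ClaimPendingJobsAsync(maxJobs: _slots.CurrentCount, ct);

            foreach (var job in claimed)
            {
                await _slots.WaitAsync(ct);   // throttle concurrent backtests
                _ = Task.Run(async () =>
                {
                    try { await ProcessJobAsync(job, ct); }   // load candles, run backtest, save result
                    finally { _slots.Release(); }
                }, ct);
            }

            await Task.Delay(TimeSpan.FromSeconds(5), ct);    // poll interval
        }
    }

    private Task ProcessJobAsync(BacktestJob job, CancellationToken ct) => Task.CompletedTask;   // placeholder
}
```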


@@ -0,0 +1,69 @@
# Database Schema & Queue Structure
This diagram shows the entity relationships between BundleBacktestRequest, BacktestJob, and User entities.
```mermaid
erDiagram
BundleBacktestRequest ||--o{ BacktestJob : "has many"
BacktestJob }o--|| User : "belongs to"
BundleBacktestRequest {
UUID RequestId PK
INT UserId FK
STRING Status
INT TotalBacktests
INT CompletedBacktests
INT FailedBacktests
DATETIME CreatedAt
DATETIME CompletedAt
STRING UniversalConfigJson
STRING DateTimeRangesJson
STRING MoneyManagementVariantsJson
STRING TickerVariantsJson
}
BacktestJob {
UUID Id PK
UUID BundleRequestId FK
STRING JobType
STRING Status
INT Priority
TEXT ConfigJson
TEXT CandlesJson
INT ProgressPercentage
INT CurrentBacktestIndex
INT TotalBacktests
INT CompletedBacktests
DATETIME CreatedAt
DATETIME StartedAt
DATETIME CompletedAt
TEXT ResultJson
TEXT ErrorMessage
STRING AssignedWorkerId
DATETIME LastHeartbeat
}
User {
INT Id PK
STRING Name
}
```
## Table Descriptions
### BundleBacktestRequest
- Represents a bundle of multiple backtest jobs
- Contains variant configurations (date ranges, money management, tickers)
- Tracks overall progress across all jobs
### BacktestJob
- Individual backtest execution unit
- Contains serialized config and candles
- Tracks progress, worker assignment, and heartbeat
- Links to parent bundle request
### Key Indexes
- `idx_status_priority`: For efficient job claiming (Status, Priority DESC, CreatedAt)
- `idx_bundle_request`: For bundle progress queries
- `idx_assigned_worker`: For worker health monitoring
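
If the table is mapped with EF Core (an assumption; the docs only mention a migration), the indexes could be declared roughly as follows. The context name and properties mirror the ERD above rather than the actual mapping code.
```csharp
using Microsoft.EntityFrameworkCore;

public class ManagingDbContext : DbContext   // hypothetical context name
{
    public DbSet<BacktestJob> BacktestJobs => Set<BacktestJob>();

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        var job = modelBuilder.Entity<BacktestJob>();

        // idx_status_priority: supports job claiming (filter by Status, order by Priority DESC, CreatedAt).
        job.HasIndex(j => new { j.Status, j.Priority, j.CreatedAt })
           .HasDatabaseName("idx_status_priority")
           .IsDescending(false, true, false);   // Priority descending (EF Core 7+)

        // idx_bundle_request: supports bundle progress queries.
        job.HasIndex(j => j.BundleRequestId)
           .HasDatabaseName("idx_bundle_request");

        // idx_assigned_worker: supports worker health monitoring and stale-job detection.
        job.HasIndex(j => new { j.AssignedWorkerId, j.LastHeartbeat })
           .HasDatabaseName("idx_assigned_worker");
    }
}
```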


@@ -0,0 +1,103 @@
# Deployment Architecture
This diagram shows the production deployment architecture with load balancing, clustering, and monitoring.
```mermaid
graph TB
subgraph "Load Balancer"
LB[NGINX/Cloudflare]
end
subgraph "API Server Cluster"
direction LR
API1[API-1<br/>Orleans Silo<br/>Port: 11111]
API2[API-2<br/>Orleans Silo<br/>Port: 11121]
API3[API-3<br/>Orleans Silo<br/>Port: 11131]
end
subgraph "Compute Worker Cluster"
direction LR
W1[Worker-1<br/>8 CPU Cores<br/>6 Concurrent Jobs]
W2[Worker-2<br/>8 CPU Cores<br/>6 Concurrent Jobs]
W3[Worker-3<br/>8 CPU Cores<br/>6 Concurrent Jobs]
end
subgraph "Database Cluster"
direction LR
DB_MASTER[(PostgreSQL<br/>Master<br/>Job Queue)]
DB_REPLICA[(PostgreSQL<br/>Replica<br/>Read Only)]
end
subgraph "Time Series DB"
INFLUX[(InfluxDB<br/>Candles Data)]
end
subgraph "Monitoring"
PROM[Prometheus]
GRAF[Grafana]
end
LB --> API1
LB --> API2
LB --> API3
API1 --> DB_MASTER
API2 --> DB_MASTER
API3 --> DB_MASTER
W1 --> DB_MASTER
W2 --> DB_MASTER
W3 --> DB_MASTER
W1 --> INFLUX
W2 --> INFLUX
W3 --> INFLUX
W1 --> PROM
W2 --> PROM
W3 --> PROM
API1 --> PROM
API2 --> PROM
API3 --> PROM
PROM --> GRAF
DB_MASTER --> DB_REPLICA
style LB fill:#9B59B6
style API1 fill:#4A90E2
style API2 fill:#4A90E2
style API3 fill:#4A90E2
style W1 fill:#50C878
style W2 fill:#50C878
style W3 fill:#50C878
style DB_MASTER fill:#FF6B6B
style INFLUX fill:#FFD93D
style PROM fill:#E67E22
style GRAF fill:#E67E22
```
## Deployment Components
### Load Balancer
- **NGINX/Cloudflare**: Distributes incoming requests across API servers
- Health checks and failover support
### API Server Cluster
- **3+ Instances**: Horizontally scalable Orleans silos
- Each instance handles HTTP requests and Orleans grain operations
- Ports: 11111, 11121, 11131 (for clustering)
### Compute Worker Cluster
- **3+ Instances**: Dedicated CPU workers
- Each worker: 8 CPU cores, 6 concurrent backtests
- Total capacity: 18 concurrent backtests across the cluster
### Database Cluster
- **Master**: Handles all writes (job creation, updates)
- **Replica**: Read-only for status queries and reporting
### Monitoring
- **Prometheus**: Metrics collection
- **Grafana**: Visualization and dashboards


@@ -0,0 +1,96 @@
# Concurrency Control Flow
This diagram shows how the semaphore-based concurrency control works across multiple workers.
```mermaid
graph LR
subgraph "Database Queue"
Q[Pending Jobs<br/>Priority Queue]
end
subgraph "Worker-1"
S1[Semaphore<br/>6 slots]
J1[Job 1]
J2[Job 2]
J3[Job 3]
J4[Job 4]
J5[Job 5]
J6[Job 6]
end
subgraph "Worker-2"
S2[Semaphore<br/>6 slots]
J7[Job 7]
J8[Job 8]
J9[Job 9]
J10[Job 10]
J11[Job 11]
J12[Job 12]
end
subgraph "Worker-3"
S3[Semaphore<br/>6 slots]
J13[Job 13]
J14[Job 14]
J15[Job 15]
J16[Job 16]
J17[Job 17]
J18[Job 18]
end
Q -->|Claim 6 jobs| S1
Q -->|Claim 6 jobs| S2
Q -->|Claim 6 jobs| S3
S1 --> J1
S1 --> J2
S1 --> J3
S1 --> J4
S1 --> J5
S1 --> J6
S2 --> J7
S2 --> J8
S2 --> J9
S2 --> J10
S2 --> J11
S2 --> J12
S3 --> J13
S3 --> J14
S3 --> J15
S3 --> J16
S3 --> J17
S3 --> J18
style Q fill:#FF6B6B
style S1 fill:#50C878
style S2 fill:#50C878
style S3 fill:#50C878
```
## Concurrency Control Mechanisms
### 1. Database-Level (Advisory Locks)
- **PostgreSQL Advisory Locks**: Prevent multiple workers from claiming the same job
- Atomic job claiming using `pg_try_advisory_lock()`
- Ensures a job can be claimed by only one worker at a time
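
A hedged sketch of what the claim could look like with Npgsql; the quoted table and column names follow the ERD, and the exact SQL is illustrative rather than the project's actual query.
```csharp
using Npgsql;

string connectionString = Environment.GetEnvironmentVariable("POSTGRES_CONNECTION") ?? "";   // assumed env var
string workerId = Environment.MachineName;   // any stable worker identity works

const string claimSql = @"
    UPDATE ""BacktestJobs""
    SET ""Status"" = 'Running', ""AssignedWorkerId"" = @workerId, ""StartedAt"" = now()
    WHERE ""Id"" = (
        SELECT ""Id"" FROM ""BacktestJobs""
        WHERE ""Status"" = 'Pending'
          AND pg_try_advisory_lock(hashtext(""Id""::text))   -- only one session can hold the lock for a given job
        ORDER BY ""Priority"" DESC, ""CreatedAt""
        LIMIT 1)
    RETURNING ""Id"";";

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync();
await using var cmd = new NpgsqlCommand(claimSql, connection);
cmd.Parameters.AddWithValue("workerId", workerId);
var claimedId = (Guid?)await cmd.ExecuteScalarAsync();   // null when nothing was pending; otherwise the job this worker now owns
// After the job finishes, release the lock with pg_advisory_unlock(hashtext(id::text)) on the same connection,
// since advisory locks are held per session rather than per transaction.
```
Note that evaluating `pg_try_advisory_lock()` inside a WHERE clause has ordering subtleties, which is why some job queues use `FOR UPDATE SKIP LOCKED` for the same purpose.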
### 2. Worker-Level (Semaphore)
- **SemaphoreSlim**: Limits concurrent backtests per worker
- Default: `Environment.ProcessorCount - 2` (e.g., 6 on an 8-core machine)
- Prevents CPU saturation while leaving headroom for the worker host's other work (polling, heartbeats, database I/O)
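
A minimal sketch of the per-worker throttle, under the same sizing rule:
```csharp
// ProcessorCount - 2 slots (6 on an 8-core machine), but never fewer than one.
int maxConcurrent = Math.Max(1, Environment.ProcessorCount - 2);
var backtestSlots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

await backtestSlots.WaitAsync();   // acquire a slot before starting a backtest
try
{
    // run a single backtest here
}
finally
{
    backtestSlots.Release();       // always free the slot, even if the backtest throws
}
```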
### 3. Cluster-Level (Queue Priority)
- **Priority Queue**: Jobs ordered by priority, then creation time
- VIP users get higher priority
- Fair distribution across workers
## Capacity Calculation
- **Per Worker**: 6 concurrent backtests
- **3 Workers**: 18 concurrent backtests
- **Average Duration**: ~47 minutes per backtest
- **Throughput**: ~1,080 backtests/hour
- **1000 Users × 10 backtests**: ~9 hours to process full queue


@@ -0,0 +1,74 @@
# Monorepo Project Structure
This diagram shows the monorepo structure with shared projects used by both API and Compute services.
```mermaid
graph TD
ROOT[Managing.sln<br/>Monorepo Root]
ROOT --> API[Managing.Api<br/>API Server<br/>Orleans]
ROOT --> COMPUTE[Managing.Compute<br/>Worker App<br/>No Orleans]
ROOT --> SHARED[Shared Projects]
SHARED --> APP[Managing.Application<br/>Business Logic]
SHARED --> DOM[Managing.Domain<br/>Domain Models]
SHARED --> INFRA[Managing.Infrastructure<br/>Database/External]
SHARED --> COMMON[Managing.Common<br/>Utilities]
API --> APP
API --> DOM
API --> INFRA
API --> COMMON
COMPUTE --> APP
COMPUTE --> DOM
COMPUTE --> INFRA
COMPUTE --> COMMON
style ROOT fill:#9B59B6
style API fill:#4A90E2
style COMPUTE fill:#50C878
style SHARED fill:#FFD93D
```
## Project Organization
### Root Level
- **Managing.sln**: Solution file containing all projects
### Service Projects
- **Managing.Api**: API Server with Orleans
- Controllers, Orleans grains, HTTP endpoints
- Handles user requests, creates jobs
- **Managing.Compute**: Compute Worker App (NEW)
- Background workers, job processors
- No Orleans dependency
- Dedicated CPU processing
### Shared Projects
- **Managing.Application**: Business logic
- `Backtester.cs`, `TradingBotBase.cs`
- Used by both API and Compute
- **Managing.Domain**: Domain models
- `BundleBacktestRequest.cs`, `BacktestJob.cs`
- Shared entities
- **Managing.Infrastructure**: External integrations
- Database repositories, InfluxDB client
- Shared data access
- **Managing.Common**: Utilities
- Constants, enums, helpers
- Shared across all projects
## Benefits
1. **Code Reuse**: Shared business logic between API and Compute
2. **Consistency**: Same domain models and logic
3. **Maintainability**: Single source of truth
4. **Type Safety**: Shared types prevent serialization issues
5. **Testing**: Shared test projects


@@ -0,0 +1,71 @@
# Implementation Plan
## Phase 1: Database & Domain Setup
- [ ] Create `BacktestJob` entity in `Managing.Domain/Backtests/`
- [ ] Create `BacktestJobStatus` enum (Pending, Running, Completed, Failed)
- [ ] Create database migration for `BacktestJobs` table
- [ ] Add indexes: `idx_status_priority`, `idx_bundle_request`, `idx_assigned_worker`
- [ ] Create `IBacktestJobRepository` interface
- [ ] Implement `BacktestJobRepository` with advisory lock support
## Phase 2: Compute Worker Project
- [ ] Create `Managing.Compute` project (console app/worker service)
- [ ] Add project reference to shared projects (Application, Domain, Infrastructure)
- [ ] Configure DI container (NO Orleans)
- [ ] Create `BacktestComputeWorker` background service
- [ ] Implement job polling logic (every 5 seconds)
- [ ] Implement job claiming with PostgreSQL advisory locks
- [ ] Implement semaphore-based concurrency control
- [ ] Implement progress callback mechanism
- [ ] Implement heartbeat mechanism (every 30 seconds)
- [ ] Add configuration: `MaxConcurrentBacktests`, `JobPollIntervalSeconds`
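
One possible shape for that configuration is sketched below; the class name, the heartbeat property, and the registration comments are assumptions.
```csharp
public sealed class ComputeWorkerOptions
{
    // Upper bound on backtests running at once on this worker (defaults to ProcessorCount - 2).
    public int MaxConcurrentBacktests { get; set; } = Math.Max(1, Environment.ProcessorCount - 2);

    // How often the worker polls the job queue for pending work.
    public int JobPollIntervalSeconds { get; set; } = 5;

    // How often a running job refreshes its LastHeartbeat timestamp (mirrors the 30-second item above).
    public int HeartbeatIntervalSeconds { get; set; } = 30;
}

// Registration sketch for the Managing.Compute Program.cs (section name is an assumption):
// builder.Services.Configure<ComputeWorkerOptions>(builder.Configuration.GetSection("ComputeWorker"));
// builder.Services.AddHostedService<BacktestComputeWorker>();
```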
## Phase 3: API Server Updates
- [ ] Update `BacktestController` to create jobs instead of calling grains directly
- [ ] Implement `CreateBundleBacktest` endpoint (returns immediately)
- [ ] Implement `GetBundleStatus` endpoint (polls database)
- [ ] Update `Backtester.cs` to generate `BacktestJob` entities from bundle variants
- [ ] Remove direct Orleans grain calls for backtests (keep for other operations)
## Phase 4: Shared Logic
- [ ] Extract backtest execution logic from `BacktestTradingBotGrain` to `Backtester.cs`
- [ ] Make backtest logic Orleans-agnostic (can run in worker or grain)
- [ ] Add progress callback support to the `RunBacktestAsync` method (one possible signature is sketched after this list)
- [ ] Ensure candle loading works in both contexts
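
One possible Orleans-agnostic signature for the extracted entry point is sketched below; the interface and parameter types are assumptions, not the existing `Backtester.cs` method.
```csharp
public interface IBacktestRunner
{
    // progress reports a completed percentage (0-100); the compute worker can persist it to
    // BacktestJob.ProgressPercentage, and an Orleans grain could forward it the same way.
    Task<BacktestResult> RunBacktestAsync(
        BacktestConfig config,
        IReadOnlyList<Candle> candles,
        IProgress<int>? progress = null,
        CancellationToken ct = default);
}
```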
## Phase 5: Monitoring & Health Checks
- [ ] Add health check endpoint to compute worker
- [ ] Add metrics: pending jobs, running jobs, completed/failed counts
- [ ] Add stale job detection (reclaim jobs from dead workers; see the reclamation sketch after this list)
- [ ] Add logging for job lifecycle events
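
A sketch of stale-job reclamation keyed on `LastHeartbeat`; the five-minute threshold, SQL shape, and connection handling are assumptions.
```csharp
using Npgsql;

string connectionString = Environment.GetEnvironmentVariable("POSTGRES_CONNECTION") ?? "";   // assumed env var

const string reclaimSql = @"
    UPDATE ""BacktestJobs""
    SET ""Status"" = 'Pending', ""AssignedWorkerId"" = NULL
    WHERE ""Status"" = 'Running'
      AND ""LastHeartbeat"" < now() - interval '5 minutes'   -- roughly ten missed 30-second heartbeats
    RETURNING ""Id"";";

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync();
await using var cmd = new NpgsqlCommand(reclaimSql, connection);
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine($"Reclaimed stale job {reader.GetGuid(0)} from a dead worker");   // use ILogger in practice
```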
## Phase 6: Deployment
- [ ] Create Dockerfile for `Managing.Compute`
- [ ] Create deployment configuration for compute workers
- [ ] Configure environment variables for compute cluster
- [ ] Set up monitoring dashboards (Prometheus/Grafana)
- [ ] Configure auto-scaling rules for compute workers
## Phase 7: Testing & Validation
- [ ] Test single backtest job processing
- [ ] Test bundle backtest with multiple jobs
- [ ] Test concurrent job processing (multiple workers)
- [ ] Test job recovery after worker failure
- [ ] Test priority queue ordering
- [ ] Load test with 1000+ concurrent users
## Phase 8: Migration Strategy
- [ ] Keep Orleans grains as fallback during transition
- [ ] Feature flag to switch between Orleans and Compute workers
- [ ] Gradual migration: test with small percentage of traffic
- [ ] Monitor performance and error rates
- [ ] Full cutover once validated


@@ -0,0 +1,75 @@
# Workers Processing Architecture
This folder contains documentation for the enterprise-grade backtest processing architecture using a database queue pattern with separate API and Compute worker clusters.
## Overview
The architecture separates concerns between:
- **API Server**: Handles HTTP requests, creates jobs, returns immediately (fire-and-forget)
- **Compute Workers**: Process CPU-intensive backtest jobs from the database queue
- **Database Queue**: Central coordination point using PostgreSQL
## Documentation Files
1. **[01-Overall-Architecture.md](./01-Overall-Architecture.md)**
- Complete system architecture diagram
- Component relationships
- External service integrations
2. **[02-Request-Flow.md](./02-Request-Flow.md)**
- Sequence diagram of request flow
- User request → Job creation → Processing → Status polling
3. **[03-Job-Processing-Flow.md](./03-Job-Processing-Flow.md)**
- Detailed job processing workflow
- Worker polling, job claiming, semaphore control
4. **[04-Database-Schema.md](./04-Database-Schema.md)**
- Entity relationship diagram
- Database schema for job queue
- Key indexes and relationships
5. **[05-Deployment-Architecture.md](./05-Deployment-Architecture.md)**
- Production deployment topology
- Load balancing, clustering, monitoring
6. **[06-Concurrency-Control.md](./06-Concurrency-Control.md)**
- Concurrency control mechanisms
- Semaphore-based limiting
- Capacity calculations
7. **[07-Monorepo-Structure.md](./07-Monorepo-Structure.md)**
- Monorepo project organization
- Shared projects and dependencies
## Key Features
- **No Timeouts**: Fire-and-forget pattern with status polling
- **Scalable**: Horizontal scaling of both the API and Compute clusters
- **Reliable**: Jobs persist in the database and survive restarts
- **Efficient**: Dedicated CPU resources for compute work
- **Enterprise-Grade**: Handles 1000+ users with a priority queue and health checks
## Architecture Principles
1. **Separation of Concerns**: API handles requests, Compute handles CPU work
2. **Database as Queue**: PostgreSQL serves as reliable job queue
3. **Shared Codebase**: Monorepo with shared business logic
4. **Resource Isolation**: Compute workers don't interfere with API responsiveness
5. **Fault Tolerance**: Jobs survive worker failures and can be reclaimed
## Capacity Planning
- **Per Worker**: 6 concurrent backtests (8-core machine)
- **3 Workers**: 18 concurrent backtests
- **Throughput**: ~1,080 backtests/hour
- **1000 Users × 10 backtests**: ~9 hours processing time
## Next Steps
1. Create `Managing.Compute` project
2. Implement `BacktestJob` entity and repository
3. Create `BacktestComputeWorker` background service
4. Update API controllers to use job queue pattern
5. Deploy compute workers to dedicated servers