Add Configurable HTTP Timeout for Large File Uploads in Broker
Description
The broker experiences intermittent HTTP timeout errors when uploading large input files to the rest-api service. The error occurs during normal operation when processing orders with large inputs.
Error Message
2025-10-09T16:44:47.728698Z WARN broker::order_picker: Failed to price order 0x1734df7809c4ef94da037449c287166d1145031980006166f-0x713c5471cc80c47cedd0010559842c95197c317faec530391770c0df87854de3-FulfillAfterLockExpire: [B-OP-001] failed to fetch / push input: Failed to upload input: [B-BON-001] Bonsai proving error HttpErr(reqwest::Error { kind: Request, url: "http://rest-api-generic-app.angkor-boundless.svc.cluster.local:8081/inputs/upload/6670fa02-4e58-446d-9b19-ca7f1b712260", source: TimedOut }): HTTP error from reqwest: error sending request for url (http://rest-api-generic-app.angkor-boundless.svc.cluster.local:8081/inputs/upload/6670fa02-4e58-446d-9b19-ca7f1b712260): operation timed out
Root Cause Analysis
- Default Timeout: The broker uses Rust's reqwest HTTP client with the default timeout of 30 seconds
- Large Files: broker.toml allows a configurable maximum file size, for example max_file_size = 50_000_000 (50MB)
- Upload Duration: Large files can exceed the 30-second timeout, especially under:
  - Network congestion
  - High CPU/memory load on rest-api service
  - Slow S3/MinIO storage writes
  - Multiple concurrent uploads
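A rough calculation shows how little headroom the 30-second limit leaves for a 50MB input. The sketch below is illustrative only: the helper names are hypothetical, and the throughput figures are assumptions rather than measurements from the affected deployment.

```rust
/// Estimated seconds to transfer `file_size_bytes` at a sustained
/// end-to-end throughput of `bytes_per_sec`. Hypothetical helper; it
/// ignores connection setup and server-side processing time.
fn estimated_upload_secs(file_size_bytes: u64, bytes_per_sec: u64) -> f64 {
    file_size_bytes as f64 / bytes_per_sec as f64
}

/// Minimum sustained throughput (bytes per second) needed to finish
/// an upload of `file_size_bytes` within `timeout_secs`.
fn min_throughput_for(file_size_bytes: u64, timeout_secs: u64) -> f64 {
    file_size_bytes as f64 / timeout_secs as f64
}
```

Under these assumptions, a 50MB upload needs roughly 1.67 MB/s sustained across the network, the rest-api service, and the storage backend to finish in 30 seconds. At an effective 1 MB/s it would take about 50 seconds and time out.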
Current Configuration
broker.toml (partial):
max_file_size = 50_000_000 # 50MB
req_retry_count = 3
req_retry_sleep_ms = 500
Current Behavior:
- Upload times out after 30 seconds (hardcoded in reqwest client)
- Broker retries 3 times (with 500ms sleep between retries)
- If all retries fail, the order is marked as failed
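The cost of this behavior can be estimated as the time an order spends before it is marked as failed. The sketch below is a hypothetical helper, not broker code. It assumes every attempt runs into the full timeout and that the sleep is applied only between attempts. Whether req_retry_count = 3 means three attempts in total or one attempt plus three retries is not stated here, so the attempt count is a parameter.

```rust
use std::time::Duration;

/// Worst-case wall-clock time spent on one upload when every attempt
/// hits the timeout. `attempts` is the total number of tries.
fn worst_case_elapsed(timeout: Duration, attempts: u32, retry_sleep: Duration) -> Duration {
    if attempts == 0 {
        return Duration::ZERO;
    }
    // One full timeout per attempt, plus a sleep between consecutive attempts.
    timeout * attempts + retry_sleep * (attempts - 1)
}
```

With a 30-second timeout and a 500ms sleep, three attempts add up to 91 seconds and four attempts to 121.5 seconds, all of it spent on an upload that never completes.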
Proposed Solution
Add configurable HTTP timeouts in broker.toml:
[prover]
# ... existing config ...
# HTTP timeout for upload operations (in seconds)
# Should be higher than regular API requests to accommodate large file uploads
upload_timeout_secs = 300 # 5 minutes
# HTTP timeout for regular API requests (in seconds)
api_timeout_secs = 60 # 1 minute
# HTTP timeout for status polling requests (in seconds)
poll_timeout_secs = 30 # 30 seconds
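One way the broker might represent these settings internally is sketched below. It uses only the standard library. The type and function names (HttpTimeouts, Operation, timeout_for) and the 30-second fallback constant are assumptions made for illustration, not existing broker code. The three field names are taken from the proposal above.

```rust
use std::time::Duration;

/// Assumed fallback, matching the 30-second behavior described in this issue.
const LEGACY_TIMEOUT_SECS: u64 = 30;

/// Hypothetical in-memory form of the proposed settings. Each field is
/// optional so that an existing broker.toml without them still loads.
#[derive(Debug, Default, Clone, Copy)]
struct HttpTimeouts {
    upload_timeout_secs: Option<u64>,
    api_timeout_secs: Option<u64>,
    poll_timeout_secs: Option<u64>,
}

/// The three operation classes named in the proposal.
#[derive(Debug, Clone, Copy)]
enum Operation {
    Upload,
    Api,
    Poll,
}

impl HttpTimeouts {
    /// Returns the configured timeout for `op`, or the legacy value
    /// when the operator has not set one.
    fn timeout_for(&self, op: Operation) -> Duration {
        let secs = match op {
            Operation::Upload => self.upload_timeout_secs,
            Operation::Api => self.api_timeout_secs,
            Operation::Poll => self.poll_timeout_secs,
        };
        Duration::from_secs(secs.unwrap_or(LEGACY_TIMEOUT_SECS))
    }
}
```

The resulting Duration could then be applied per request, for example through reqwest's RequestBuilder::timeout, although how the broker's HTTP client is actually constructed is not shown in this issue.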
Implementation Details
- Separate timeouts for different operation types:
  - Short timeout (30s) for status checks and lightweight API calls
  - Medium timeout (60s) for regular API operations
  - Long timeout (300s) for large file uploads
- Backward compatibility: Use current defaults if not specified
- Validation: Ensure upload_timeout_secs >= (max_file_size / minimum_expected_bandwidth)
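The validation rule could be checked at config load time along the following lines. This is a minimal sketch: the function names are hypothetical, and minimum_expected_bandwidth is assumed to be expressed in bytes per second, which the proposal does not specify.

```rust
/// Seconds needed to transfer `max_file_size` bytes at `min_bandwidth`
/// bytes per second, rounded up. Returns None for a zero bandwidth.
fn required_upload_secs(max_file_size: u64, min_bandwidth: u64) -> Option<u64> {
    if min_bandwidth == 0 {
        return None;
    }
    // Ceiling division without risking overflow on the addition.
    let whole = max_file_size / min_bandwidth;
    let partial = u64::from(max_file_size % min_bandwidth != 0);
    Some(whole + partial)
}

/// Checks upload_timeout_secs >= max_file_size / minimum_expected_bandwidth.
fn validate_upload_timeout(
    upload_timeout_secs: u64,
    max_file_size: u64,
    min_bandwidth: u64,
) -> Result<(), String> {
    let required = required_upload_secs(max_file_size, min_bandwidth)
        .ok_or_else(|| "minimum expected bandwidth must be non-zero".to_string())?;
    if upload_timeout_secs >= required {
        Ok(())
    } else {
        Err(format!(
            "upload_timeout_secs = {upload_timeout_secs} is below the {required}s \
             needed for max_file_size = {max_file_size}"
        ))
    }
}
```

With max_file_size = 50_000_000 and an assumed floor of 1 MB/s, the proposed 300-second upload timeout passes this check, while the current 30 seconds would be rejected.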
Expected Behavior After Fix
- Large file uploads (up to 50MB) complete successfully
- Timeout errors only occur for actual connectivity issues, not slow uploads
- Operators can tune timeouts based on their infrastructure
Workaround (Current)
Since the timeout is not currently configurable, operators must either:
- Accept occasional timeout failures and rely on retries
- Reduce max_file_size to ensure uploads complete within 30 seconds
- Optimize infrastructure (faster storage, more resources for rest-api)
Additional Context
- Services are healthy (0 restarts after fix)
- PostgreSQL has capacity (84/500 connections)
- This is an intermittent issue, not constant
- Retries often succeed, suggesting transient slowness rather than failure
References
- --bento-api-url http://rest-api-generic-app.angkor-boundless.svc.cluster.local:8081
- req_retry_count = 3, req_retry_sleep_ms = 500