Add Configurable HTTP Timeout for Large File Uploads in Broker #1226

@derrix060

Description

The broker experiences intermittent HTTP timeout errors when uploading large input files to the rest-api service. The error occurs during normal operation when processing orders with large inputs.

Error Message

2025-10-09T16:44:47.728698Z  WARN broker::order_picker: Failed to price order 0x1734df7809c4ef94da037449c287166d1145031980006166f-0x713c5471cc80c47cedd0010559842c95197c317faec530391770c0df87854de3-FulfillAfterLockExpire: [B-OP-001] failed to fetch / push input: Failed to upload input: [B-BON-001] Bonsai proving error HttpErr(reqwest::Error { kind: Request, url: "http://rest-api-generic-app.angkor-boundless.svc.cluster.local:8081/inputs/upload/6670fa02-4e58-446d-9b19-ca7f1b712260", source: TimedOut }): HTTP error from reqwest: error sending request for url (http://rest-api-generic-app.angkor-boundless.svc.cluster.local:8081/inputs/upload/6670fa02-4e58-446d-9b19-ca7f1b712260): operation timed out

Root Cause Analysis

  1. Default Timeout: The broker's HTTP client (built on Rust's reqwest) applies a 30-second request timeout
  2. Large Files: broker.toml allows a configurable maximum input size, for example max_file_size = 50_000_000 (50 MB)
  3. Upload Duration: Large files can exceed the 30-second timeout, especially under:
    • Network congestion
    • High CPU/memory load on rest-api service
    • Slow S3/MinIO storage writes
    • Multiple concurrent uploads
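
To put a number on the upload-duration point, the sketch below computes the sustained throughput an upload needs in order to finish inside a given timeout. The function name is hypothetical, and the calculation ignores connection setup and any server-side processing, so treat the result as a lower bound.

```rust
use std::time::Duration;

/// Sustained throughput, in bytes per second, needed to send
/// `file_size_bytes` before `timeout` elapses.
/// Hypothetical helper: it ignores connection setup and server-side work.
/// A zero `timeout` yields infinity (or NaN for a zero-byte file).
pub fn required_throughput(file_size_bytes: u64, timeout: Duration) -> f64 {
    file_size_bytes as f64 / timeout.as_secs_f64()
}
```

For max_file_size = 50_000_000 and a 30-second timeout this comes to about 1.67 MB/s (roughly 13.3 Mbit/s) sustained end to end. Any of the conditions listed above can push effective throughput below that.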

Current Configuration

broker.toml (partial):

max_file_size = 50_000_000  # 50MB
req_retry_count = 3
req_retry_sleep_ms = 500

Current Behavior:

  • Upload times out after 30 seconds (the timeout cannot be set in broker.toml)
  • Broker retries 3 times (with a 500 ms sleep between retries)
  • If all retries fail, the order is marked as failed
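
The behavior above also bounds how long a single upload can stall before the order fails. The sketch below is a rough estimate only: it assumes every attempt runs into the full timeout, and it takes the total number of attempts as a parameter because the issue does not say whether "retries 3 times" means three attempts in total or one initial attempt plus three retries. The function name is hypothetical.

```rust
use std::time::Duration;

/// Upper bound on the time spent on one upload when every attempt times out.
/// `attempts` is the total number of tries (an assumption; see the lead-in).
/// A sleep is counted between attempts, not after the last one.
pub fn worst_case_upload_time(
    attempts: u32,
    timeout: Duration,
    retry_sleep: Duration,
) -> Duration {
    if attempts == 0 {
        return Duration::ZERO;
    }
    timeout * attempts + retry_sleep * (attempts - 1)
}
```

With a 30-second timeout and req_retry_sleep_ms = 500, three attempts take up to 91 seconds and four attempts up to 121.5 seconds.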

Proposed Solution

Add configurable HTTP timeouts in broker.toml:

[prover]
# ... existing config ...

# HTTP timeout for upload operations (in seconds)
# Should be higher than regular API requests to accommodate large file uploads
upload_timeout_secs = 300  # 5 minutes

# HTTP timeout for regular API requests (in seconds)
api_timeout_secs = 60  # 1 minute

# HTTP timeout for status polling requests (in seconds)
poll_timeout_secs = 30  # 30 seconds
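
One way the broker could model these keys is sketched below. It assumes each key is optional and that a missing key falls back to the 30-second timeout described under Current Behavior. The type and function names (`TimeoutConfig`, `RequestKind`, `timeout_for`) are hypothetical and are not taken from the broker's code. Parsing from broker.toml is left out so the example depends only on the standard library.

```rust
use std::time::Duration;

/// Timeout the issue reports as today's behavior; used as the fallback.
const CURRENT_DEFAULT_SECS: u64 = 30;

/// Kind of HTTP request the broker makes (hypothetical classification).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestKind {
    Upload,
    Api,
    Poll,
}

/// Optional timeout keys as they might appear under `[prover]`.
/// `None` means the key was absent from broker.toml.
#[derive(Debug, Clone, Copy, Default)]
pub struct TimeoutConfig {
    pub upload_timeout_secs: Option<u64>,
    pub api_timeout_secs: Option<u64>,
    pub poll_timeout_secs: Option<u64>,
}

impl TimeoutConfig {
    /// Returns the timeout for `kind`, falling back to the current
    /// 30-second behavior when the key is not configured.
    pub fn timeout_for(&self, kind: RequestKind) -> Duration {
        let secs = match kind {
            RequestKind::Upload => self.upload_timeout_secs,
            RequestKind::Api => self.api_timeout_secs,
            RequestKind::Poll => self.poll_timeout_secs,
        };
        Duration::from_secs(secs.unwrap_or(CURRENT_DEFAULT_SECS))
    }
}
```

reqwest supports per-request overrides through `RequestBuilder::timeout`, so the resolved value could be applied to upload requests only. Whether the broker's proving client exposes that hook is not established here.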

Implementation Details

  1. Separate timeouts for different operation types:
    • Short timeout (30s) for status checks and lightweight API calls
    • Medium timeout (60s) for regular API operations
    • Long timeout (300s) for large file uploads

  2. Backward compatibility: Fall back to the current default (30 seconds) for any timeout key that is not specified

  3. Validation: Ensure upload_timeout_secs >= (max_file_size / minimum_expected_bandwidth), with max_file_size in bytes and minimum_expected_bandwidth in bytes per second
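
The validation rule could be checked at config load time along the lines of the sketch below. It assumes `minimum_expected_bandwidth` is supplied in bytes per second (it is not an existing broker.toml key) and rounds the required time up to whole seconds. Both function names are hypothetical.

```rust
use std::time::Duration;

/// Smallest timeout that lets `max_file_size_bytes` finish at
/// `min_bandwidth_bytes_per_sec`, rounded up to whole seconds.
/// Returns `None` when the bandwidth is zero.
pub fn min_upload_timeout(
    max_file_size_bytes: u64,
    min_bandwidth_bytes_per_sec: u64,
) -> Option<Duration> {
    if min_bandwidth_bytes_per_sec == 0 {
        return None;
    }
    let whole = max_file_size_bytes / min_bandwidth_bytes_per_sec;
    let partial = u64::from(max_file_size_bytes % min_bandwidth_bytes_per_sec != 0);
    Some(Duration::from_secs(whole + partial))
}

/// Checks upload_timeout_secs >= max_file_size / minimum_expected_bandwidth.
pub fn validate_upload_timeout(
    upload_timeout_secs: u64,
    max_file_size_bytes: u64,
    min_bandwidth_bytes_per_sec: u64,
) -> Result<(), String> {
    let required = min_upload_timeout(max_file_size_bytes, min_bandwidth_bytes_per_sec)
        .ok_or_else(|| "minimum expected bandwidth must be greater than zero".to_string())?;
    if Duration::from_secs(upload_timeout_secs) >= required {
        Ok(())
    } else {
        Err(format!(
            "upload_timeout_secs = {} is below the {} s needed for {} bytes at {} B/s",
            upload_timeout_secs,
            required.as_secs(),
            max_file_size_bytes,
            min_bandwidth_bytes_per_sec
        ))
    }
}
```

With max_file_size = 50_000_000 and an assumed floor of 1_000_000 bytes per second, the required timeout is 50 seconds. The proposed upload_timeout_secs = 300 passes, while the current 30 seconds does not.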

Expected Behavior After Fix

  • Large file uploads (up to 50MB) complete successfully
  • Timeout errors only occur for actual connectivity issues, not slow uploads
  • Operators can tune timeouts based on their infrastructure

Workaround (Current)

Since the timeout is not currently configurable, operators must either:

  1. Accept occasional timeout failures and rely on retries
  2. Reduce max_file_size to ensure uploads complete within 30 seconds
  3. Optimize infrastructure (faster storage, more resources for rest-api)

Additional Context

  • Services are healthy (0 restarts after fix)
  • PostgreSQL has capacity (84/500 connections)
  • This is an intermittent issue, not constant
  • Retries often succeed, suggesting transient slowness rather than failure
