
Compute Worker - Fix submission duplication during ingestion#2303

Open
ihsaan-ullah wants to merge 8 commits into develop from
compute_worker_submission_duplication

Conversation

@ihsaan-ullah
Collaborator

@ihsaan-ullah ihsaan-ullah commented Mar 31, 2026

Description

This PR updates the compute worker to avoid duplicating the submission during ingestion and to make the submission available during scoring. It also copies the submission files to the ingestion predictions directory, i.e. /app/input/res, so that already existing competitions do not break.
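The backward-compatibility copy described above could look roughly like this. A minimal sketch with a hypothetical helper name; the actual PR code may differ:

```python
import os
import shutil

def copy_submission_to_predictions(submission_dir, predictions_dir):
    """Copy submission files into the ingestion predictions directory
    (e.g. /app/input/res) so legacy competitions that expect to read the
    submission from there keep working. Hypothetical helper, not the
    PR's actual code."""
    os.makedirs(predictions_dir, exist_ok=True)
    for entry in os.listdir(submission_dir):
        src = os.path.join(submission_dir, entry)
        dst = os.path.join(predictions_dir, entry)
        if os.path.isdir(src):
            # dirs_exist_ok requires Python 3.8+
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
```

The copy (rather than a move or symlink) keeps the submission available in both places, which is the point of the backward-compatibility behavior.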

Installation

Need to re-build the containers to take the changes into account:

docker compose build --no-cache && docker compose up -d

Issue fixed

Checklist

  • Code review by me
  • Hand tested by me
  • I'm proud of my work
  • Code review by reviewer
  • Hand tested by reviewer
  • CircleCI tests are passing
  • Ready to merge

@ihsaan-ullah ihsaan-ullah marked this pull request as draft March 31, 2026 16:07
@ihsaan-ullah ihsaan-ullah marked this pull request as ready for review March 31, 2026 16:17
@ihsaan-ullah
Collaborator Author

Please check the function _create_container, where we create the container using the following code:

container = client.create_container(
    self.container_image,
    name=container_name,
    host_config=host_config,
    detach=False,
    volumes=volumes_host,
    command=command,
    working_dir="/app/program",
    environment=[
        "PYTHONUNBUFFERED=1",
        "http_proxy=" + Settings.COMPETITION_CONTAINER_HTTP_PROXY,
        "https_proxy=" + Settings.COMPETITION_CONTAINER_HTTPS_PROXY,
    ],
    network_disabled=Settings.COMPETITION_CONTAINER_NETWORK_DISABLED,
)

The line to check is working_dir="/app/program". It is always set to /app/program, for both ingestion and scoring. This is a bit confusing because another function, replace_legacy_metadata_command, uses two values:

  • "/app/ingestion_program"
  • "/app/program"

NOTE: this is not something I introduced but clarification will be useful.

@Didayolo
Member

Didayolo commented Apr 1, 2026

Hello @ihsaan-ullah,

I need to check more in depth, but from what I recall this folder is shared between scoring and ingestion.

@Didayolo
Member

Didayolo commented Apr 1, 2026

#2294 is merged, we can rebase this PR.

@ihsaan-ullah ihsaan-ullah force-pushed the compute_worker_submission_duplication branch from 8168952 to 6a1733d Compare April 1, 2026 16:33
@ihsaan-ullah
Collaborator Author

How can we make sure that all compute workers run this code? Do we have any mechanism to force people to update their compute workers?

@Didayolo
Member

Didayolo commented Apr 2, 2026

How can we make sure that all compute workers run this code? Do we have any mechanism to force people to update their compute workers?

For v1.25 we are asking organizers to upgrade their workers. That is indeed a bit fragile.

@ihsaan-ullah
Collaborator Author

@Didayolo this PR is ready for testing.

One point to check and maybe fix/clarify in the code:

I feel that there is an inconsistency in the compute_worker code between /app/program and /app/ingestion_program. The working directory we use when creating a container is always /app/program, which is fine if we always use it for both ingestion and scoring, but there are two places where we use /app/ingestion_program:

  1. In the following function, when replacing parts of the metadata command:
def replace_legacy_metadata_command(
    command, kind, is_scoring, ingestion_only_during_scoring=False
):
    vars_to_replace = [
        ("$input", "/app/input_data" if kind == "ingestion" else "/app/input"),
        ("$output", "/app/output"),
        (
            "$program",
            "/app/ingestion_program"
            if ingestion_only_during_scoring and is_scoring
            else "/app/program",
        ),
        # ... (rest of the function omitted in this excerpt)
    ]
  2. In the start function, when defining the ingestion program directory:
ingestion_program_dir = os.path.join(self.root_dir, "ingestion_program")

I have added this point to the meeting agenda to discuss with Obada and others too
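The $program substitution above boils down to a single decision, restated here as a standalone sketch for clarity (this is not code from the PR):

```python
def resolve_program_dir(is_scoring, ingestion_only_during_scoring):
    # Mirrors the "$program" branch of replace_legacy_metadata_command:
    # /app/ingestion_program is used only when the ingestion program is
    # re-run during scoring; every other case resolves to /app/program.
    if ingestion_only_during_scoring and is_scoring:
        return "/app/ingestion_program"
    return "/app/program"
```

This is why a fixed working_dir of /app/program does not obviously line up with the substitution logic: in the ingestion-only-during-scoring case the program lives elsewhere.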

@Didayolo Didayolo self-requested a review April 14, 2026 12:51
@Didayolo Didayolo self-assigned this Apr 14, 2026
@Didayolo Didayolo force-pushed the compute_worker_submission_duplication branch from 48682d0 to e307f1e Compare April 16, 2026 14:23
@Didayolo
Member

Rebased with the new E2E tests.

@Didayolo
Member

Didayolo commented Apr 17, 2026

Hi @ihsaan-ullah,

While testing #2302, did you re-build the containers?

Looks like we have, here on this PR, the same failures that you reported:

=========================== short test summary info ============================
FAILED test_submission.py::test_v2_results[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 5000ms
  - waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_v2_results_failure[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v2_miniautoml[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v15_sncf[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v15_iris_code[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v15_iris_results[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v18_autowsl[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found 
Call log:
  - Expect "to_be_visible" with timeout 2000ms
  - waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
============== 7 failed, 7 passed, 2 skipped in 299.43s (0:04:59) ==============

Exited with code exit status 1

Found in the artifacts:

compute_worker  | 2026-04-16 14:32:25.761 | INFO     | compute_worker:_update_submission:621 - Updating submission @ http://django:8000/api/submissions/2/ with data = {'status': 'Failed', 'status_details': 'Submission failed: Metadata file not found. See logs for more details.', 'secret': '18de319f-c7a7-4ee1-a704-1ecec48b866f'}
compute_worker_run[015bc6e4-77b3-4092-81cc-cacbe648b0fd] raised unexpected: SubmissionException('Metadata file not found')
compute_worker  | Traceback (most recent call last):
compute_worker  |
compute_worker  |   File "/app/compute_worker.py", line 1509, in push_output
compute_worker  |     with open(metadata_path, "w") as f:
compute_worker  |               └ '/codabench/uPK-1_sID-3__u65h84gw/output/metadata'
compute_worker  |
compute_worker  | FileNotFoundError: [Errno 2] No such file or directory: '/codabench/uPK-1_sID-3__u65h84gw/output/metadata'

@Didayolo
Member

The error message was misleading, so I changed it:

        metadata_path = os.path.join(self.output_dir, "metadata")

        if os.path.exists(metadata_path):
            raise SubmissionException(
                "Error, the output directory already contains a metadata file. This file is used "
                "to store exitCode and other data, do not write to this file manually."
            )
        try:
            with open(metadata_path, "w") as f:
                f.write(yaml.dump(prog_status, default_flow_style=False))
        except Exception as e:
            logger.error(e)
            raise SubmissionException("Metadata file not found")
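Since the underlying failure was a missing output directory (see the traceback above), a more defensive variant would create the directory up front and surface the original error in the message. A sketch under those assumptions (write_metadata is a hypothetical name, and the plain-text dump stands in for the real code's yaml.dump):

```python
import os

class SubmissionException(Exception):
    pass

def write_metadata(output_dir, prog_status):
    # Hypothetical variant: create output_dir if it is missing, and
    # report the actual cause instead of "Metadata file not found".
    metadata_path = os.path.join(output_dir, "metadata")
    if os.path.exists(metadata_path):
        raise SubmissionException(
            "Error, the output directory already contains a metadata file. This file is used "
            "to store exitCode and other data, do not write to this file manually."
        )
    os.makedirs(output_dir, exist_ok=True)
    try:
        with open(metadata_path, "w") as f:
            # the real code serializes prog_status with yaml.dump
            f.write("\n".join(f"{k}: {v}" for k, v in prog_status.items()))
    except OSError as e:
        raise SubmissionException(f"Could not write metadata file: {e}")
    return metadata_path
```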

@Didayolo
Member

Tentative fix:

Ensure output_dir exists on the host during prepare.
Previously this was created as a side effect of _run_program_directory, but runs without an ingestion program skip that path entirely.
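Assuming the prepare step knows the host-side output directory, the fix could be as small as this (a sketch with a hypothetical function name):

```python
import os

def ensure_output_dir(output_dir):
    # Create the output directory unconditionally during prepare, instead
    # of relying on _run_program_directory (which is skipped when there is
    # no ingestion program) to create it as a side effect.
    os.makedirs(output_dir, exist_ok=True)  # idempotent, safe on re-runs
```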

@Didayolo
Member

@ihsaan-ullah : my last commit did not solve the problem. I'll let you review it.
