Compute Worker - Fix submission duplication during ingestion #2303
ihsaan-ullah wants to merge 8 commits into develop
|
Please check the function _create_container, where we create the container using the following code:

    container = client.create_container(
        self.container_image,
        name=container_name,
        host_config=host_config,
        detach=False,
        volumes=volumes_host,
        command=command,
        working_dir="/app/program",
        environment=[
            "PYTHONUNBUFFERED=1",
            "http_proxy=" + Settings.COMPETITION_CONTAINER_HTTP_PROXY,
            "https_proxy=" + Settings.COMPETITION_CONTAINER_HTTPS_PROXY,
        ],
        network_disabled=Settings.COMPETITION_CONTAINER_NETWORK_DISABLED,
    )

the line to check is
NOTE: this is not something I introduced, but clarification would be useful. |
|
Hello @ihsaan-ullah, I need to check more in depth, but from what I recall this folder is shared between scoring and ingestion. |
|
#2294 is merged, we can rebase this PR. |
Force-pushed from 8168952 to 6a1733d
|
How can we make sure that all compute workers run this code? Do we have any mechanism to force people to update their compute workers? |
For v1.25 we are asking organizers to upgrade their workers. That is indeed a bit fragile. |
|
@Didayolo this PR is ready for testing. One point to check and maybe fix/clarify in the code: I feel that there is an inconsistency in the compute_worker code when we use

    def replace_legacy_metadata_command(
        command, kind, is_scoring, ingestion_only_during_scoring=False
    ):
        vars_to_replace = [
            ("$input", "/app/input_data" if kind == "ingestion" else "/app/input"),
            ("$output", "/app/output"),
            (
                "$program",
                "/app/ingestion_program"
                if ingestion_only_during_scoring and is_scoring
                else "/app/program",
            ),

    ingestion_program_dir = os.path.join(self.root_dir, "ingestion_program")

I have added this point to the meeting agenda to discuss with Obada and others too. |
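To make the placeholder question easier to discuss, here is a minimal sketch of how the three pairs quoted above could resolve. The function name `replace_placeholders_sketch` is hypothetical, and the substitution loop is an assumption (plain string replacement applied in list order); the real `replace_legacy_metadata_command` may define more placeholders and behave differently.

```python
def replace_placeholders_sketch(
    command, kind, is_scoring, ingestion_only_during_scoring=False
):
    """Hypothetical stand-in for replace_legacy_metadata_command."""
    # These three pairs are copied from the snippet quoted in this thread.
    vars_to_replace = [
        ("$input", "/app/input_data" if kind == "ingestion" else "/app/input"),
        ("$output", "/app/output"),
        (
            "$program",
            "/app/ingestion_program"
            if ingestion_only_during_scoring and is_scoring
            else "/app/program",
        ),
    ]
    # Assumption: each placeholder is replaced with a plain str.replace call.
    for placeholder, path in vars_to_replace:
        command = command.replace(placeholder, path)
    return command


# Regular ingestion step: $program resolves to /app/program.
normal = replace_placeholders_sketch(
    "python $program/run.py $input $output", "ingestion", False
)

# Ingestion run only during scoring: $program resolves to /app/ingestion_program.
during_scoring = replace_placeholders_sketch(
    "python $program/run.py $input $output", "ingestion", True, True
)
```

Under these assumptions the same ingestion command points at two different program directories depending on the flags, which is the kind of behaviour worth confirming against where `ingestion_program_dir` is actually mounted.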
rebasing
…stion and making submission available during scoring. Also copying submission files to ingestion predictions i.e. /app/input/res to make sure already existing competitions do not break rebased
Force-pushed from 48682d0 to e307f1e
|
Rebased with the new E2E tests. |
|
Hi @ihsaan-ullah, while testing #2302, did you re-compile the containers? It looks like we have, here on this PR, the same failures that you reported:
=========================== short test summary info ============================
FAILED test_submission.py::test_v2_results[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 5000ms
- waiting for get_by_role("cell", name="Finished")
FAILED test_submission.py::test_v2_results_failure[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 2000ms
- waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v2_miniautoml[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 2000ms
- waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v15_sncf[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 2000ms
- waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v15_iris_code[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 2000ms
- waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v15_iris_results[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 2000ms
- waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
FAILED test_submission.py::test_v18_autowsl[firefox] - AssertionError: Locator expected to be visible
Actual value: None
Error: element(s) not found
Call log:
- Expect "to_be_visible" with timeout 2000ms
- waiting for get_by_role("cell", name=re.compile(r"^(Finished|Failed.*)$"))
============== 7 failed, 7 passed, 2 skipped in 299.43s (0:04:59) ==============
Exited with code exit status 1

Found in the artifacts:
compute_worker | 2026-04-16 14:32:25.761 | INFO | compute_worker:_update_submission:621 - Updating submission @ http://django:8000/api/submissions/2/ with data = {'status': 'Failed', 'status_details': 'Submission failed: Metadata file not found. See logs for more details.', 'secret': '18de319f-c7a7-4ee1-a704-1ecec48b866f'}
compute_worker_run[015bc6e4-77b3-4092-81cc-cacbe648b0fd] raised unexpected: SubmissionException('Metadata file not found')
compute_worker | Traceback (most recent call last):
compute_worker |
compute_worker |   File "/app/compute_worker.py", line 1509, in push_output
compute_worker |     with open(metadata_path, "w") as f:
compute_worker |               └ '/codabench/uPK-1_sID-3__u65h84gw/output/metadata'
compute_worker |
compute_worker | FileNotFoundError: [Errno 2] No such file or directory: '/codabench/uPK-1_sID-3__u65h84gw/output/metadata' |
|
The error message was misleading so I changed it:

    metadata_path = os.path.join(self.output_dir, "metadata")
    if os.path.exists(metadata_path):
        raise SubmissionException(
            "Error, the output directory already contains a metadata file. This file is used "
            "to store exitCode and other data, do not write to this file manually."
        )
    try:
        with open(metadata_path, "w") as f:
            f.write(yaml.dump(prog_status, default_flow_style=False))
    except Exception as e:
        logger.error(e)
        raise SubmissionException("Metadata file not found") |
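For context on the traceback: `open(path, "w")` creates a missing file, so a `FileNotFoundError` at that call means the parent directory (here the `output` directory) does not exist. The sketch below illustrates one way to guard against that. It is not the change made in this PR: `write_metadata` is a hypothetical helper, creating the directory on the fly is an assumption about what the worker is allowed to do, and `json` is swapped in for `yaml` only to keep the example dependency-free.

```python
import json
import os
import tempfile


def write_metadata(output_dir, prog_status):
    """Hypothetical helper: write prog_status to <output_dir>/metadata."""
    # Assumption: it is acceptable to create the output directory if missing.
    # Without this, open(..., "w") fails with FileNotFoundError.
    os.makedirs(output_dir, exist_ok=True)
    metadata_path = os.path.join(output_dir, "metadata")
    if os.path.exists(metadata_path):
        raise FileExistsError(
            "The output directory already contains a metadata file."
        )
    with open(metadata_path, "w") as f:
        # json stands in for yaml.dump to avoid a third-party dependency.
        json.dump(prog_status, f)
    return metadata_path


# Usage: the "output" directory does not exist before the call.
with tempfile.TemporaryDirectory() as root:
    path = write_metadata(os.path.join(root, "output"), {"exitCode": 0})
    with open(path) as f:
        written = json.load(f)
```

If the directory is expected to exist at this point, a message such as "output directory not found" would describe the failure more accurately than "Metadata file not found".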
|
Tentative fix: Ensure |
|
@ihsaan-ullah : my last commit did not solve the problem. I'll let you review it. |
Description
This PR updates the compute worker to avoid duplicating the submission during ingestion and to make the submission available during scoring. It also copies the submission files to the ingestion predictions directory, i.e. /app/input/res, so that existing competitions do not break.
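As a rough illustration of the backward-compatibility copy described above, the sketch below copies a submission directory into a predictions directory. The helper name `copy_submission_to_predictions` and the host-side paths are hypothetical; the actual worker code may lay out and mount these directories differently. It relies on `shutil.copytree(..., dirs_exist_ok=True)`, which requires Python 3.8 or later.

```python
import os
import shutil
import tempfile


def copy_submission_to_predictions(submission_dir, predictions_dir):
    """Hypothetical helper: mirror submission files into the predictions dir."""
    # dirs_exist_ok=True lets the copy succeed even if the target directory
    # already exists; missing parent directories are created as needed.
    shutil.copytree(submission_dir, predictions_dir, dirs_exist_ok=True)
    return sorted(os.listdir(predictions_dir))


# Usage with throwaway directories standing in for the worker's run folder.
with tempfile.TemporaryDirectory() as root:
    submission = os.path.join(root, "submission")
    os.makedirs(submission)
    with open(os.path.join(submission, "predictions.txt"), "w") as f:
        f.write("0\n1\n")
    # Assumption: this host path is what gets mounted at /app/input/res.
    copied = copy_submission_to_predictions(
        submission, os.path.join(root, "input", "res")
    )
```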
Installation
Need to re-build the containers to take the changes into account:
Issue fixed
Checklist