Validate CustomResourceConfig shape + GLT-backend compatibility#627
Validate CustomResourceConfig shape + GLT-backend compatibility#627kmontemayor2-sc wants to merge 14 commits into
Conversation
Introduces a new `oneof` arm on `TrainerResourceConfig` / `InferencerResourceConfig` that lets callers describe a launcher as a shell command + positional args, instead of a fixed-shape Vertex AI / KFP / local resource config. The proto carries no semantics here — the dispatcher is added in a follow-up PR; this commit only ships the message, regenerated bindings, and the wrapper-property update so downstream code can read `wrapper.trainer_config` and get a `CustomResourceConfig` back. The diff includes a long tail of cosmetic Scala changes outside `gigl_resource_config/` because scalapbc regenerates every sibling proto's emitted source whenever any one proto in the same directory changes. Reviewers can scope to `CustomResourceConfig.scala` and the `*ResourceConfig.scala` siblings that gain the new oneof case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/all_test |
GiGL Automation@ 23:08:49UTC : 🔄 @ 23:13:51UTC : ❌ Workflow failed. |
GiGL Automation@ 23:08:49UTC : 🔄 @ 23:11:21UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:08:54UTC : 🔄 @ 24:34:40UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:08:54UTC : 🔄 @ 24:20:28UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:08:54UTC : 🔄 @ 23:16:55UTC : ✅ Workflow completed successfully. |
GiGL Automation@ 23:08:58UTC : 🔄 @ 24:28:47UTC : ✅ Workflow completed successfully. |
|
|
||
| shell_line = " ".join([command, *(shlex.quote(a) for a in args)]) | ||
| logger.info(f"Launching {component.name} via subprocess: {shell_line!r}") | ||
| subprocess.run(shell_line, shell=True, check=True) |
There was a problem hiding this comment.
Semgrep identified an issue in your code:
The subprocess.run() call uses shell=True on user-controlled command strings, allowing shell injection attacks that could execute arbitrary commands with the process's privileges.
More details about this
The subprocess.run() call executes shell_line with shell=True, which spawns a shell process to interpret the command. This is dangerous because if resolved_command or any element in resolved_args comes from untrusted input (e.g., user-provided configuration, external data), an attacker can inject arbitrary shell commands.
For example, if an attacker controls custom_resource_config.command and sets it to id; rm -rf /, the shell will execute both id and the destructive rm -rf / command. Even though shlex.quote() escapes individual arguments, it doesn't protect against injection in the command itself—only in the args list. An attacker who controls the command field can bypass this protection entirely.
Exploit scenario:
- Attacker provides
custom_resource_config.command = "echo test; cat /etc/passwd #" - After joining with args,
shell_linebecomes something like"echo test; cat /etc/passwd #" - When
subprocess.run()executes this withshell=True, the shell interprets the semicolon as a command separator - Both
echo testandcat /etc/passwdexecute, leaking sensitive system files - If the process runs with elevated privileges, the attacker can exfiltrate or modify sensitive data
The shell inherits environment variables and settings from the parent process, which further expands the attack surface. Using shell=False would treat the entire string as a single command name rather than allowing shell metacharacters to be interpreted.
To resolve this comment:
💡 Follow autofix suggestion
| subprocess.run(shell_line, shell=True, check=True) | |
| subprocess.run(shell_line, shell=False, check=True) |
View step-by-step instructions
- Replace
subprocess.run(shell_line, shell=True, check=True)with an invocation that does not useshell=True. - Change the code to directly pass the command and arguments as a list, rather than joining into a string. Use:
subprocess.run([resolved_command] + resolved_args, check=True). - Remove any uses of
shlex.quotewhen building the argument list, since quoting is only necessary when passing a string to the shell. - Ensure that
resolved_commandand every item inresolved_argsare unquoted strings representing the command and its arguments.
Passing the command and arguments as a list with shell=False (the default) is safer, because it avoids any interpretation by a shell, preventing command injection vulnerabilities.
💬 Ignore this finding
Reply with Semgrep commands to ignore this finding.
/fp <comment>for false positive/ar <comment>for acceptable risk/other <comment>for all other reasons
Alternatively, triage in Semgrep AppSec Platform to ignore the finding created by subprocess-shell-true.
You can view more details about this finding in the Semgrep AppSec Platform.
CustomResourceConfig is launcher-pluggable — there is no concrete machine spec to validate against (no machine_type, num_workers, GPU config, etc.). The wrapper's trainer_config / inferencer_config properties now return a union that includes CustomResourceConfig (introduced earlier in this PR), but _validate_machine_config does not accept it. Add an isinstance early-return guard at each call site that logs the skip and returns. Shape and backend-compatibility validation for CustomResourceConfig come in a follow-up PR; this commit only makes the existing validation flow type-clean against the widened union. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements `launch_custom`, a thin shim that takes a populated
`CustomResourceConfig` and shells out via
`subprocess.run(shell=True, check=True)`. The proto's `command` is
a shell snippet (so leading `KEY=VALUE` env assignments parse
naturally) and `args[]` are individually `shlex.quote`-d before
joining, so values containing whitespace survive the shell pass.
The dispatcher performs no template substitution: `command` and
`args[]` are taken verbatim, and any placeholder text reaches
`subprocess.run` literally. Consumers that want runtime-context
substitution (e.g. ${gigl:foo}) should resolve it at YAML-load
time before the proto reaches this module. No call site in the
rest of the repo invokes `launch_custom` yet — wiring is added in
a follow-up PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two validators wired into the existing pre-run validation chain: 1. A bypass in `_validate_machine_config` (via the trainer / inferencer resource-config-valid checks) so a `CustomResourceConfig` trainer / inferencer skips machine-shape checks — the new oneof arm carries no machine spec to validate. 2. `check_custom_resource_config_shape`, which raises if a populated `CustomResourceConfig` has an empty `command`. Pure shape check, no subprocess spawn. 3. `check_custom_resource_config_requires_glt_backend`, an unconditional gate that pairs `CustomResourceConfig` with `feature_flags.should_run_glt_backend`. The v1 dispatchers do not route through `launch_custom`; this catches the misconfig at validate time. Both new gates run unconditionally inside `kfp_validation_checks` for both Trainer and Inferencer components. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0d7ba79 to
5742480
Compare
…b.com/Snapchat/GiGL into kmonte/custom-resource-config-pr1-proto
The proto message describes a launcher command (command + args), not a resource shape (machines / replicas / pools) like its sibling oneof arms (VertexAiResourceConfig, KFPResourceConfig, LocalResourceConfig). Aligning the name with the supporting Python (custom_launcher.py / launch_custom) removes the misleading "Resource" suffix. Field names on TrainerResourceConfig / InferencerResourceConfig stay (custom_trainer_config / custom_inferencer_config) so the sibling field-name pattern is preserved. Regenerated pb2 / pyi / Scala bindings. Wrapper + validator + test references updated.
…stom-resource-config-pr2-launcher
Follow-up to the proto rename. Updates the launcher module's import, parameter type annotation, docstring, and error message, plus its test file. Also renames the launcher parameter from `custom_resource_config` to `custom_launcher_config` to match the new proto type name.
…/custom-resource-config-pr3-validation
Follow-up to the proto rename. Renames in the three validator modules (resource_config_checks, gbml_and_resource_config_compatibility_checks, config_validator) and their tests: - `check_custom_resource_config_shape` -> `check_custom_launcher_config_shape` - `check_custom_resource_config_requires_glt_backend` -> `check_custom_launcher_config_requires_glt_backend` - Test class names follow suit (TestCustomResourceConfig* -> TestCustomLauncherConfig*). - Error strings and docstrings updated to reference the new proto name.
| The v1 trainer/inferencer dispatchers never consult the | ||
| ``custom_trainer_config`` / ``custom_inferencer_config`` oneof, so pairing | ||
| a ``CustomLauncherConfig`` with a task config that has | ||
| ``should_use_glt_backend=False`` would silently fall through the v1 path | ||
| and fail at runtime. Catch it up-front here so the failure is loud and | ||
| actionable at validation time. |
There was a problem hiding this comment.
Lets update the agent generated comments/docs so that they are meant for future readers and not just the author of the PR?
There was a problem hiding this comment.
Probably the value error raised below is self documenting this function - this comment seems to verbose to add value.
| offending.append("inferencer_resource_config.custom_inferencer_config") | ||
| raise ValueError( | ||
| "CustomLauncherConfig is only wired into the GLT (v2) dispatchers " | ||
| "(glt_trainer.py / glt_inferencer.py); the v1 trainer/inferencer " |
There was a problem hiding this comment.
v1 and v2 may be very arbritary to reader, and this message is too verbose and doesn't get to the point fast enough - a more useful message (that also simplifies above logic):
raise ValueError("""
CustomLauncherConfig can only be used w/ GLT backend. Detected
trainer_resource_config using custom implementation: {trainer_is_custom} ;
inference_resource_config using custom implementation: {inference_is_custom}.
Ensure feature_flags.should_run_glt_backend is set to 'True' in the task config.
""")| f"Trainer and Inferencer components; got {component}." | ||
| ) | ||
|
|
||
| wrapper = GiglResourceConfigWrapper(resource_config=resource_config_pb) |
There was a problem hiding this comment.
the caller already has access to the wrapper - why not just pass that through?
| # Validate any populated CustomLauncherConfig has a non-empty command. | ||
| # Unconditional — the check is shape-only and does not call out to any | ||
| # external service. | ||
| for component in (GiGLComponents.Trainer, GiGLComponents.Inferencer): |
There was a problem hiding this comment.
we already have an existing pattern to check for these in resource_config_checks - lets adopt here.
|
|
||
| def test_custom_trainer_without_glt_raises(self): | ||
| """CustomLauncherConfig trainer + glt=False raises a clear ValueError.""" | ||
| gbml_config = _create_gbml_config_with_glt_flag("False") |
There was a problem hiding this comment.
nit: We are writing a function here to save 1 line of code but introducing a layer of indirection as a result.
gbml_config = gbml_config_pb2.GbmlConfig()
gbml_config.feature_flags["should_run_glt_backend"] = Falsemuch more readable.
| resource_config_wrapper=resource_config, | ||
| ) | ||
| self.assertIn("should_run_glt_backend", str(ctx.exception)) | ||
| self.assertIn("custom_trainer_config", str(ctx.exception)) |
There was a problem hiding this comment.
Doing string matching in the error message feels too verbose to be helpful - unless you were testing on the error message creation logic itself, which it doesnt seem like we are.
| config = _create_valid_custom_inferencer_config(args=["--cluster_size=4"]) | ||
| with patch( | ||
| "gigl.src.validation_check.libs.resource_config_checks._validate_machine_config" | ||
| ) as mock_validate: |
There was a problem hiding this comment.
These mock patches seem like a new pattern in this file and are making testing very verbose - is there a way to not have these in the tests?
| check_if_inferencer_resource_config_valid(resource_config_pb=config) | ||
| mock_validate.assert_not_called() | ||
|
|
||
| def test_vertex_ai_trainer_still_calls_machine_validation(self): |
| ) | ||
|
|
||
| def test_empty_trainer_command_raises(self): | ||
| config = _create_valid_custom_trainer_config(command="") |
There was a problem hiding this comment.
nit: should we still call function _create_valid_custom_trainer_config given the test setup here?
_create_custom_trainer_config ?
| ) | ||
|
|
||
|
|
||
| class TestCustomLauncherConfigInValidationChain(TestCase): |
There was a problem hiding this comment.
This feels like a weird test to add.
i.e. (1) ensure that some function is calling some other function - (2) and do that using some knowledge of the error message that should be propagated.
If we start adopting this test pattern even if we just adopt (1); what is the stopping point? Our tests would be too verbose and would not be valuable. If instead you wanted to migrate our unit tests introduced above to an order higher i.e. here - that would be valuable. But honestly, I would prefer not having this at all unless its a core library / performance functionality.
Adds two validators wired into the existing pre-run validation chain:
_validate_machine_config(via the trainer /inferencer resource-config-valid checks) so a
CustomResourceConfigtrainer / inferencer skips machine-shape checks — the new oneof arm
carries no machine spec to validate.
check_custom_resource_config_shape, which raises if a populatedCustomResourceConfighas an emptycommand. Pure shape check, nosubprocess spawn.
check_custom_resource_config_requires_glt_backend, anunconditional gate that pairs
CustomResourceConfigwithfeature_flags.should_run_glt_backend. The v1 dispatchers do notroute through
launch_custom; this catches the misconfig atvalidate time.
Both new gates run unconditionally inside
kfp_validation_checksforboth Trainer and Inferencer components.