feat(db): extract get_trait_data_pft() as standalone file-I/O-free function by omkarrr2533 · Pull Request #3978 · PecanProject/pecan

omkarrr2533 · 2026-05-11T14:08:53Z

get.trait.data.pft() currently queries the DB, computes trait_data and
prior_distns, saves them to disk, then throws them away — forcing
run.meta.analysis.pft() to immediately reload the same files.

This PR cuts that first link in the chain:

Adds get_trait_data_pft(): same query logic, no disk reads or writes,
returns list(trait_data, prior_distns, pft_info) directly
Patches the existing wrapper to attach trait_data and prior_distns
to the returned pft — so downstream functions can use them in-memory
without changing anything for callers that don't care
All existing behavior unchanged: dir.create, save(), CSV output,
dbfile.insert(), caching all stay exactly as they were

Tests added: 3 validation tests (fire without DB), 8 DB-backed tests
including an end-to-end equivalence check that verifies the standalone
returns identical objects to what the wrapper writes to disk.

Part of GSoC 2026 Week 1 — follows the same pattern as
meta_analysis_standalone() which was already extracted this way.

…larity Part 2)

mdietze · 2026-05-11T17:33:12Z

At a high level, I'm pretty confused about the relationship between get.trait.data.pft and get_trait_data_pft as neither calls the other, so which is the wrapper around the other? Instead of a thin backwards-compatible wrapper, we seems to have two implementations of the the same thing, which isn't what we're looking for.

Also, nothing seems to be calling get_trait_data_pft yet. If this is the intended replacement, it seems like you should be swapping out the call in get.trait.data

S1DDHEY · 2026-05-11T20:03:25Z

A straightforward fix that I could think of is get.trait.data.pft() should be refactored to call get_trait_data_pft() for the query part, then add the file I/O / dbfile.insert() on top of its result.

omkarrr2533 · 2026-05-12T01:03:37Z

Understood the wrapper needs to actually call get_trait_data_pft() for the query part rather than duplicating that logic. Will refactor get.trait.data.pft() to delegate to the standalone in the non-cache path, matching the pattern from #3860 where get.parameter.samples() delegates to get_parameter_samples(). Will push the fix shortly.

…h roxygen2 7.3.3

S1DDHEY · 2026-05-12T18:55:53Z

+  computed <- get_trait_data_pft(
+    pft_name    = pft[["name"]],
+    modeltype   = modeltype,
+    dbcon       = dbcon,
+    trait_names = trait.names,
+    constants   = if (!is.null(pft$constants)) pft$constants else list()
+  )


this get_trait_data_pft is called before the forceupdate check, meaning every call hits the db, even when the cache is valid

A fix could be to move this inside the cache-miss section.

moved the get_trait_data_pft() call into the cache-miss path. The DB is no longer queried on a cache-hit. Added a cheap query_pfts() call before the cache check to resolve pftid, pfttype, and pft_member_filename without fetching full trait data.

mdietze · 2026-05-13T09:45:10Z

-##' instead.
+##' Internally this wrapper delegates all database queries to
+##' \code{\link{get_trait_data_pft}} exactly once.  The returned objects are
+##' used for both the cache-staleness check and the save step, so the database


Is this bit about cacheing something the user needs to know to be able to use the function correctly, or something that should just be internally documented within the function? I suspect it's going to confuse more than enlighten

Fair point moved it out. The cache mechanics were noise in user-facing roxygen. The new wrapper delegates to get_trait_data_pft() exactly once at the top, so the cache details now live as a short inline comment block right above the staleness check, where someone reading the function body will see them and someone reading ?get.trait.data.pft won't.

mdietze · 2026-05-13T09:45:43Z

-##'     detect changes between runs.}
-##' }
-##'
-##' **Downstream contract:** The files `trait.data.Rdata` and


I wouldn't remove this until run.meta.analysis.pft no longer expects these files

Agreed restored the Downstream contract section. Added a one-line note that it'll be removed in the Week 2 PR once run.meta.analysis.pft() accepts in-memory inputs.

mdietze · 2026-05-13T09:46:28Z

 ##' `pft` should be a list containing at least `name` and `outdir`, and
 ##' optionally `posteriorid` and `constants`.
 ##'
-##' **File-based side effects (saved to `pft$outdir`):**


All these file-based side effects are still true and still really important information to retain. I'd restore these comments

Restored. Kept the full File-based side effects section — every file listed is still being written exactly as before, so the docs should reflect that. 👍

mdietze · 2026-05-13T09:53:16Z

-##'   ID of the (possibly new) posterior record in BETYdb. The posterior ID can
-##'   be used to locate the output files (`trait.data.Rdata`, `prior.distns.Rdata`,
+##'   ID of the (possibly new) posterior record in BETYdb. Also contains
+##'   `pft$trait_data` and `pft$prior_distns` for in-memory chaining on both


The "also contains" part is critical but insufficiently described/documented and insufficiently connected to the output files part. Indeed, I'm not 100% sure this function should return these objects at all, at least by default, as that's a nontrivial, breaking change to the previous behavior. The problem is that if these objects are returned as part of a pft input list that is then used to updated that part of the settings object, that doing so will break the ability to save the settings object. I wonder if we should add a flag to control this behavior?

The "in-memory chaining on both the cache-hit and cache-miss paths" is also unneeded.

Good call on the flag. Added a return_data parameter, default FALSE, so legacy behavior is unchanged. When return_data = TRUE, the returned pft list also includes trait_data and prior_distns for in-memory chaining; otherwise nothing is attached.
Updated the @return docs to make the tradeoff explicit:

"Default is FALSE to preserve legacy behavior — when pft is embedded inside a settings object, attaching these data frames by default would inflate the settings and break serialization."

Also dropped the "in-memory chaining on both cache-hit and cache-miss paths" line per your comment.

mdietze · 2026-05-13T10:01:06Z

-    pft_member_filename = "cultivars.csv"
+  pftid   <- pftres[["id"]]
+  pfttype <- pftres[["pft_type"]]
+  pft_member_filename <- if (pfttype == "cultivar") "cultivars.csv" else "species.csv"


This bit is not equivalent to what was here previously and as such we seem to be loosing the logger.severe associated with an unmatched pfttype. This would be a good test to add if we don't have it already -- mock having an invalid pft_type and check that the function stops rather than continuing on thinking it's working with plant / species.csv data

that was a regression on my part. Restored the explicit three-branch check with logger.severe for unknown pft_type. The actual guard lives inside get_trait_data_pft() now (since the wrapper delegates there), but I added a wrapper-level test (test_that("wrapper errors for unknown pft_type, not silent fallback")) that mocks query_pfts to return a bad record and asserts the function stops. So the contract is covered from both ends.

mdietze · 2026-05-13T10:10:27Z

+          if (done) {
+            trait_env <- new.env(parent = emptyenv())
+            prior_env <- new.env(parent = emptyenv())
+            load(file.path(pft$outdir, "trait.data.Rdata"),   envir = trait_env)


I thought the whole point of the refactor was that these objects would be returned by get_trait_data_pft without having to write them to disk and then load them back in?

see previous comment about not being sure returning these data structures should be the default behavior for the legacy function

Right, that was a leftover from an earlier version of the diff. With the single-delegation structure, trait.data and prior.distns are already in scope from the top-of-function call, so the cache-hit path just attaches those directly (gated on return_data = TRUE). No load() from disk.

mdietze · 2026-05-13T10:13:26Z


+  ## Attach computed objects so downstream callers can chain in-memory
+  ## without loading files from pft$outdir.
+  pft$trait_data   <- trait.data


same comment as before about adding a flag to make this optional. Should default to not returning (legacy behavior)

Done handled by the same return_data flag from the comment above. Both the cache-hit and cache-miss return paths attach trait_data / prior_distns only when return_data = TRUE. Default FALSE = legacy behavior, no breaking change.

mdietze · 2026-05-13T10:22:57Z

+
+  # ---- Resolve PFT to a single database record ----
+  pft_record <- query_pfts(dbcon, pft_name, modeltype, strict = TRUE)
+  if (nrow(pft_record) > 1L) {


get.trait.data.pft also has a check for length 0. Is that needed here? Also, if the checks are here, are they still needed in get.trait.data.pft too?

that was a real duplication. With the new single-delegation structure, all DB queries and their associated validation (length-0, multi-PFT, pft_type guard, dbcon class, etc.) live only in get_trait_data_pft(). The wrapper inherits the errors through the delegation. So both checks are present once, in one place — no abstraction needed since there's no longer duplication.

mdietze · 2026-05-13T10:24:15Z

+  # ---- Fetch PFT member species or cultivars ----
+  # The join table depends on pft_type.  An unknown type is a hard error —
+  # silently falling through to the species path would produce wrong results.
+  if (identical(pft_type, "cultivar")) {


likewise, if this check is here, is it also needed in get.trait.data.pft? If all these checks are actually needed in both places, can they be abstracted to a function that is used in both places?

the pft_type guard now lives only in get_trait_data_pft(). The wrapper's inline copy is gone in the new diff. If we ever do find a check that genuinely needs to be in both places, I agree a shared helper would be the right move, but right now there's nothing to abstract.

…comments, fix roxygen version

feat: extract get_trait_data_pft() standalone + enhance wrapper (Modu…

ef6f12d

…larity Part 2)

github-actions Bot added tests base labels May 11, 2026

omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from fd354ca to 8525b71 Compare May 11, 2026 14:49

omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from 8525b71 to 6a140a7 Compare May 12, 2026 02:23

docs: add NAMESPACE export and Rd file for get_trait_data_pft

c0caa1b

omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch 3 times, most recently from 12219a8 to e5f7353 Compare May 12, 2026 16:10

fix: remove @md tag conflict, fix cross-references, sync Rd files wit…

3fd20cf

…h roxygen2 7.3.3

omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from e5f7353 to 3fd20cf Compare May 12, 2026 16:14

omkarrr2533 marked this pull request as ready for review May 12, 2026 16:56

S1DDHEY suggested changes May 12, 2026

View reviewed changes

omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch 2 times, most recently from c590f33 to 43919c2 Compare May 13, 2026 10:09

mdietze reviewed May 13, 2026

View reviewed changes

fix: move get_trait_data_pft call to cache-miss path, restore column …

a6d06f2

…comments, fix roxygen version

omkarrr2533 force-pushed the gsoc/get-trait-data-pft-standalone branch from 43919c2 to a6d06f2 Compare May 13, 2026 15:53

This was referenced May 15, 2026

Make trait_names optional in get_trait_data_pft() #3985

Open

Optional prior_distns argument for get_trait_data_pft() #3986

Open

Conversation

omkarrr2533 commented May 11, 2026

Uh oh!

mdietze commented May 11, 2026

Uh oh!

S1DDHEY commented May 11, 2026

Uh oh!

omkarrr2533 commented May 12, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants