Skip to content

Add possibility to overwrite data in datapackage#73

Merged
OlivierCoen merged 25 commits into
mainfrom
feat/add-additional-data
Apr 22, 2026
Merged

Add possibility to overwrite data in datapackage#73
OlivierCoen merged 25 commits into
mainfrom
feat/add-additional-data

Conversation

@OlivierCoen
Copy link
Copy Markdown
Collaborator

@OlivierCoen OlivierCoen commented Apr 13, 2026

Closes #57

This PR adds a new parameter in coordo to choose which strategy to use whenever a resource is added while a resource with the same name already exists:

  • raise error: no overwrite / merge is done, a mere error is raised
  • overwrite: the previous resource is deleted and all foreign keys pointing to it are removed; the file file replaces it but foreign keys must be added again manually for now
  • append: append to existing data without checking schema
  • append_strict: append to existing data, but ensures first that both schema match exactly

NOTE: for both append and append_strict, the actual merging of dataframes IS NOT IMPLEMENTED YET. This will be performed by @Cabanon in another US

It is a big PR and I took some freedom to refactor and factorise code. Feel free to comment / refuse whenever you feel some portions of code were changed unecessarily

@OlivierCoen OlivierCoen requested a review from Cabanon April 13, 2026 06:18
@OlivierCoen OlivierCoen force-pushed the feat/add-additional-data branch from 5dc6087 to 67cc18d Compare April 13, 2026 06:23
@OlivierCoen OlivierCoen marked this pull request as draft April 13, 2026 06:39
@OlivierCoen OlivierCoen marked this pull request as ready for review April 14, 2026 22:01
@OlivierCoen OlivierCoen force-pushed the feat/add-additional-data branch 2 times, most recently from 32032c8 to fd1316d Compare April 14, 2026 22:06
for res in self.resources:
if res.name == name:
continue
sm = safe(res, "schema")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should stay. I'm not really fine with the fact that suppressing a resource is automatically deleting the fks, i prefer we raise, tell the user to delete the fks and try again. It's much more explicit and less error-prone

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum ok 👍 But then, when adding data in "overwrite" strategy, what should we do?
Should we just check that both schemas match, overwrite data if it does, and raise error if it does not (with a message saying that we should force removing foreign keys beforehand if we really want to overwrite)?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO coordo should try to avoid doing side effects, if there is foreign keys then it's up to the user (the user being either a person or another project using coordo, data4trees for example) to decide how to manage it

Comment thread coordo-py/coordo/datapackage/package.py Outdated
Comment thread coordo-py/coordo/datapackage/package.py
Comment thread coordo-py/coordo/datapackage/package.py Outdated
if self.resource_exists(resource.name):
# resource already exists

if strategy == ResourceExistsStrategy.overwrite:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing and recreating the resource ? what happens if there is a foreignkey ? this code will remove the fks and recreating the resource without the user knowing

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was the indended purpose indeed ;) in my opinion, there could two types of overwrite: "overwrite_strict", which overwrites data but checks that previous and current schemas match (and if it does not, it raises an error), and overwrite_flexible, which is the current behavior of overwrite. Do you think we should just raise an error whenever schema do not match and ask the user to remove foreign keys manually before?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see this from a user standpoint:

On the website, we want a user to be able to import new data in the datapackage.
For external files and form data, we want to be able to overwrite data, but we already said that form metadata won't be updatable.
So IMO the overwrite function we want here is just to erase data and import new one, without changing the datapackage.json. So the load command should not always expect the datapackage definition, at least a ressource name, and overwrite only the data if it conforms to the already exsiting schema in datapackage.json. If not, raise an error.

Copy link
Copy Markdown
Collaborator

@Cabanon Cabanon Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i agree with arnaud, there is 2 possibility, either replace the whole table by the uploaded data or append the data (with a merging strategy which ensure primary keys are not duplicated) but in either case the schema should not be touched at all

if you wish to modify the schema, delete the resource and recreate it, but handling it in coordo will be error prone and difficult to maintain, let's stay as simple as possible

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you

Copy link
Copy Markdown
Collaborator

@Cabanon Cabanon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good job on the loader side, the implementation looks clean !

On the overwrite strategy I feel like this is overly complicated, because there is many ways of doing the same thing so I fear it will be difficult to maintain.

IMO ight now the add resource function is trying to do too much thing, and we should only have 3 specific actions :

Adding a resource : raise an error if name already exists and tell the user to either delete the existing resource or update the resource if he wants to insert data.

Updating a resource : by default this should show how many lines will be updated and created (you don't need to implement it in that PR because it will be done with the ducklake PR), and prompt the user if he wants to continue (or a flag in the func for integration with the data4trees web app)

Removing a resource: raise an error if there is foreign keys to it and tell the user to delete the fks

@OlivierCoen OlivierCoen force-pushed the feat/add-additional-data branch from 5758534 to d2effc6 Compare April 16, 2026 14:25
@OlivierCoen OlivierCoen marked this pull request as draft April 16, 2026 20:05
@OlivierCoen OlivierCoen marked this pull request as ready for review April 16, 2026 21:20
@OlivierCoen OlivierCoen force-pushed the feat/add-additional-data branch 2 times, most recently from c8da8f0 to 342fc75 Compare April 16, 2026 21:21
@OlivierCoen OlivierCoen requested a review from Cabanon April 16, 2026 21:21
@OlivierCoen OlivierCoen force-pushed the feat/add-additional-data branch from 697d7e3 to 46ae7bc Compare April 18, 2026 06:08
@OlivierCoen OlivierCoen force-pushed the feat/add-additional-data branch from fe4f58d to e7cee9b Compare April 18, 2026 07:05

self.schema.foreignKeys.append(fk)

def remove_foreignkey(self, fk: ForeignKey) -> None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add_foreignkey and remove_foreignkey doesn't have the same arguments ? It's not very coherent

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not change remove_foreignkey because it is not used anywhere, but actually I'm going to write a cli for it


def add_foreignkey(self, fk: ForeignKey) -> None:
def add_foreignkey(self, fields: list[str], foreign_fields: list[str], foreign_resource: str) -> None:
fk = ForeignKey(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Composite foreign keys are not yet supported by coordo so we should limit to only one field pointing to one another field. Also we can't have multiple keys pointing to the same external resource (because the parser wouldn't know which one to use when auto-joining)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that the cli does not support that yet, but perhaps we could already implement something more general? It does not hurt anyway

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I understand that you want to support something more general but IMO it's better to block it and then add it later (and also update the parser at the same time) than to let users do something that is not supported by other parts of coordo

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if this is not exposed from the cli, we could for instance use this method from D4T backend , so it is exposed as a lib component. Therefore I rather agree with Mathias, let's not expose something we don't really support yet.
Maybe we can keep this current signature, but make an assert on the lists so if their size is > 1 , we raise "Composite forign keys are not supported yet". WDYT ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the modification :)

@OlivierCoen OlivierCoen merged commit 9341668 into main Apr 22, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow user to upload data file that adds new lines in already exisiting resource.

3 participants