GitHub is changing their node ID format #359

@MoralCode

GitHub seems to be migrating to a new node_id format (as discovered in #93): https://docs.github.com/en/graphql/guides/migrating-graphql-global-node-ids

I'm currently on the fence about whether we need a dedicated task/logic to update these values in CollectOSS, whether a collectoss-utilities fixup script will work (chaoss/collectoss-utilities#5), or whether we can just let regular collection handle it.

I initially assumed these legacy IDs were old, but even my local test instance (which occasionally gets reset) had a surprising number of them: roughly 48k out of about 100k total pull request rows.

Even if a fixup script or job of some kind is warranted, a big issue is going to be how to efficiently query the values that need updating. They are essentially strings and probably not indexed, so ILIKE is going to be slow, especially on multi-terabyte databases.
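One possible way to sidestep a single huge pattern-matching query is to walk the table in primary-key order in bounded batches and do the prefix check in the application. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite table as a stand-in for the real database, assumes the primary key column is called `pull_request_id`, and uses made-up sample node IDs. `iter_legacy_pr_rows` is a hypothetical helper, not existing CollectOSS code.

```python
import sqlite3
from typing import Iterator, List, Optional, Tuple

# New-format prefix for pull request node IDs, as listed in this issue.
NEW_PR_PREFIX = "PR_"


def iter_legacy_pr_rows(
    conn: sqlite3.Connection, batch_size: int = 1000
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of (pull_request_id, pr_src_node_id) lacking the new prefix.

    Uses keyset pagination on the primary key, so each query is a bounded
    index range scan rather than one long-running pattern match.
    """
    last_id = 0  # assumption: primary keys are positive integers
    while True:
        rows: List[Tuple[int, Optional[str]]] = conn.execute(
            "SELECT pull_request_id, pr_src_node_id FROM pull_requests "
            "WHERE pull_request_id > ? ORDER BY pull_request_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        last_id = rows[-1][0]
        # Rows with no node ID are skipped; they need collection, not migration.
        legacy = [
            (pk, node_id)
            for pk, node_id in rows
            if node_id is not None and not node_id.startswith(NEW_PR_PREFIX)
        ]
        if legacy:
            yield legacy


def demo() -> List[int]:
    """Build a tiny stand-in table and return the keys flagged as legacy."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pull_requests ("
        "pull_request_id INTEGER PRIMARY KEY, pr_src_node_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO pull_requests VALUES (?, ?)",
        [
            (1, "MDExOlB1bGxSZXF1ZXN0MQ=="),  # made-up legacy-style value
            (2, "PR_kwDOexample"),            # made-up new-style value
            (3, None),
            (4, "MDExOlB1bGxSZXF1ZXN0NA=="),  # made-up legacy-style value
        ],
    )
    return [pk for batch in iter_legacy_pr_rows(conn, batch_size=2) for pk, _ in batch]
```

This still reads every row once, so it trades total work for predictable, resumable chunks. Whether that is acceptable on a multi-terabyte database is an open question.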

Leaving this here as an FYI

A non-exhaustive list of places we store these node IDs, with the new-format prefix for each data type:

  • pull_requests.pr_src_node_id (prefix PR_)
  • issues.issue_node_id (prefix I_)
  • message.platform_node_id (prefix IC_ or PRRC_)
  • contributor.gh_node_id (prefix U_)
  • releases.release_id (prefix RE_)
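The column-to-prefix list above could be captured as a small lookup so that the same check is reused everywhere. This is a rough sketch, not existing CollectOSS code: `NEW_FORMAT_PREFIXES`, `is_new_format`, and `like_pattern` are hypothetical names, and the mapping only covers the columns listed in this issue. One detail worth noting if the check is pushed into SQL: `_` is a single-character wildcard in `LIKE`/`ILIKE`, so the underscore in these prefixes has to be escaped.

```python
from typing import Dict, Optional, Tuple

# Column -> new-format prefixes, copied from the (non-exhaustive) list in this issue.
NEW_FORMAT_PREFIXES: Dict[str, Tuple[str, ...]] = {
    "pull_requests.pr_src_node_id": ("PR_",),
    "issues.issue_node_id": ("I_",),
    "message.platform_node_id": ("IC_", "PRRC_"),
    "contributor.gh_node_id": ("U_",),
    "releases.release_id": ("RE_",),
}


def is_new_format(column: str, node_id: Optional[str]) -> Optional[bool]:
    """Return True/False for a stored node ID, or None if there is no value."""
    if not node_id:
        return None
    # str.startswith accepts a tuple and matches if any prefix matches.
    return node_id.startswith(NEW_FORMAT_PREFIXES[column])


def like_pattern(prefix: str, escape: str = "\\") -> str:
    """Build a LIKE pattern matching the literal prefix.

    Escapes the escape character itself plus the LIKE wildcards '%' and '_',
    then appends '%' so the pattern matches anything after the prefix.
    """
    escaped = (
        prefix.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )
    return escaped + "%"
```

For example, `like_pattern("PR_")` gives `PR\_%`, which is meant to be used with a matching `ESCAPE` character in the query.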
