Merged
8 changes: 8 additions & 0 deletions .github/copyright.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
+one_date_re: '\bCopyright \(c\) (?P<year>[0-9]{4}), Mayank Mishra\b'
+two_date_re: '\bCopyright \(c\) (?P<from>[0-9]{4})-(?P<to>[0-9]{4}), Mayank Mishra\b'
+one_date_format: 'Copyright (c) {year}, Mayank Mishra'
+two_date_format: 'Copyright (c) {from}-{to}, Mayank Mishra'
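The two regular expressions in this file can be exercised directly with Python's `re` module. The tool that actually consumes `.github/copyright.yaml` is not shown in this PR, so the following is only a minimal sketch of how the patterns behave; `parse_years` is a hypothetical helper, not part of the repository.

```python
import re

# Patterns copied verbatim from .github/copyright.yaml.
ONE_DATE_RE = re.compile(r"\bCopyright \(c\) (?P<year>[0-9]{4}), Mayank Mishra\b")
TWO_DATE_RE = re.compile(r"\bCopyright \(c\) (?P<from>[0-9]{4})-(?P<to>[0-9]{4}), Mayank Mishra\b")


def parse_years(line: str):
    """Hypothetical helper: return (from_year, to_year) or None if no header matches."""
    match = TWO_DATE_RE.search(line)
    if match:
        return int(match.group("from")), int(match.group("to"))
    match = ONE_DATE_RE.search(line)
    if match:
        year = int(match.group("year"))
        return year, year
    return None


# The format strings use the same field names as the regex groups.
# "from" is a Python keyword, so it has to be passed via a dict.
two_date_line = "Copyright (c) {from}-{to}, Mayank Mishra".format(**{"from": 2025, "to": 2026})
```

Third-party notices such as the NVIDIA and Facebook lines kept in the files below do not match either pattern, so a consumer built this way would leave them alone.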
14 changes: 14 additions & 0 deletions .pre-commit-config.yaml
@@ -3,6 +3,20 @@
 # **************************************************
 
 repos:
+  - repo: local
+    hooks:
+      - id: add-copyright
+        name: add copyright header
+        language: python
+        entry: python tools/copyright.py --repo . --header "Copyright (c) 2026, Mayank Mishra" --no-contributors
+        types: [python]

Review comment (Contributor, marked "high"):

Since types: [python] is specified, this pre-commit hook will only be triggered if a Python file is staged in the commit. If a developer only modifies C++ (.cpp, .h), YAML, or Markdown files, the hook will be skipped entirely, allowing those files to be committed without copyright headers.

To ensure the hook runs when any supported file type is modified, use types_or with all supported types.

        types_or: [python, c, c++, cuda, yaml, html, markdown]
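The difference between `types` and `types_or` can be illustrated with a small model. This is a simplified sketch of pre-commit's file-type filter, not its real implementation: `file_matches` and `hook_runs` are hypothetical names, and the tag set assigned to the staged file is illustrative. It assumes the documented behaviour that `types` requires all listed tags, `types_or` requires at least one, and a hook without `always_run` is skipped when no staged file matches, even with `pass_filenames: false`.

```python
from typing import Dict, Iterable, Set


def file_matches(tags: Set[str], types: Iterable[str] = ("file",), types_or: Iterable[str] = ()) -> bool:
    """A file matches if it has every tag in `types` and, when `types_or` is set, at least one of those."""
    types_or = set(types_or)
    return set(types) <= tags and (not types_or or bool(tags & types_or))


def hook_runs(staged: Dict[str, Set[str]], **filters) -> bool:
    """The hook is triggered only if at least one staged file passes the filter."""
    return any(file_matches(tags, **filters) for tags in staged.values())


# A commit that touches only a C++ file (tags are illustrative).
staged = {"lm_engine/data/megatron/utils/helpers.cpp": {"file", "text", "c++"}}

skipped = hook_runs(staged, types=("python",))            # current configuration
triggered = hook_runs(staged, types_or=("python", "c++"))  # suggested configuration
```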

+        pass_filenames: false
+      - id: check-copyright
+        name: check copyright year
+        language: python
+        entry: python tools/copyright.py --repo . --header "Copyright (c) 2026, Mayank Mishra" --no-contributors --check
+        types: [python]

Review comment (Contributor, marked "high"):

Since types: [python] is specified, this pre-commit hook will only be triggered if a Python file is staged in the commit. If a developer only modifies C++ (.cpp, .h), YAML, or Markdown files, the hook will be skipped entirely, allowing those files to be committed without copyright headers.

To ensure the hook runs when any supported file type is modified, use types_or with all supported types.

        types_or: [python, c, c++, cuda, yaml, html, markdown]

+        pass_filenames: false
   - repo: https://github.com/PyCQA/autoflake
     rev: v2.3.1
     hooks:
1 change: 0 additions & 1 deletion Makefile
@@ -20,5 +20,4 @@ update-precommit:
 	uv run --extra dev --no-default-groups pre-commit autoupdate
 
 style:
-	uv run --extra dev --no-default-groups python tools/copyright.py --repo ./ --exclude copyright-exclude.txt --header "Copyright (c) $$(date +%Y), __authors__" --extra-name "Mayank Mishra" --no-contributors
 	uv run --extra dev --no-default-groups pre-commit run --all-files
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/__init__.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 from .blended_megatron_dataset_builder import build
 from .blended_megatron_dataset_config import GPTDatasetConfig
 from .gpt_dataset import GPTDataset
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/bin.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) Facebook, Inc. and its affiliates.
 #
 # This source code is licensed under the MIT license found in the
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/blended_dataset.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 
 from __future__ import annotations
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/blended_megatron_dataset_builder.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 
 from __future__ import annotations
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/blended_megatron_dataset_config.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
 
 import logging
2 changes: 1 addition & 1 deletion lm_engine/data/megatron/dtype.py
@@ -1,5 +1,5 @@
 # **************************************************
-# Copyright (c) 2025, Mayank Mishra
+# Copyright (c) 2026, Mayank Mishra
 # **************************************************
 
 from __future__ import annotations
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/gpt_dataset.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 
 from __future__ import annotations
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/indexed_dataset.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) Facebook, Inc. and its affiliates.
 #
 # This source code is licensed under the MIT license found in the
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/merge_data.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 from .indexed_dataset import MMapIndexedDataset, MMapIndexedDatasetBuilder, get_bin_path, get_idx_path
 
 
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/preprocess_data.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 
 from __future__ import annotations
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/sampler.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 from __future__ import annotations
 
 
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/utils/__init__.py
@@ -1,3 +1,7 @@
+# **************************************************
+# Copyright (c) 2026, Mayank Mishra
+# **************************************************
+
 # Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 
 import logging
4 changes: 4 additions & 0 deletions lm_engine/data/megatron/utils/helpers.cpp
@@ -1,3 +1,7 @@
+// **************************************************
+// Copyright (c) 2026, Mayank Mishra
+// **************************************************
+
 /* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. */
 
 /* Helper methods for fast index mapping builds */
25 changes: 20 additions & 5 deletions tools/copyright.py
@@ -14,6 +14,7 @@
 parser.add_argument("--header", type=str, required=True)
 parser.add_argument("--extra-name", type=str, required=False)
 parser.add_argument("--no-contributors", action="store_true", required=False)
+parser.add_argument("--check", action="store_true", required=False)
 args = parser.parse_args()
 
 
@@ -121,11 +122,14 @@ def _build_html_header(file: str) -> str:
     )
 
 
-def _check_and_add_copyright_header(file: str, build_header_fn, pattern: re.Pattern) -> None:
+def _check_and_add_copyright_header(file: str, build_header_fn, pattern: re.Pattern) -> bool:
     code = open(file, "r").read()
 
     if len(code) == 0:
-        return
+        return True
+
+    if args.check:
+        return bool(pattern.match(code))
Comment on lines +131 to +132

Review comment (Contributor, marked "high"):

The check-copyright hook is intended to verify that the copyright header is correct and up-to-date (e.g., checking the correct year and authors). However, using pattern.match(code) only checks if any copyright header matching the general structure exists. It does not verify if the header content (such as the year or author list) matches the expected header. For example, an outdated year like 2025 would still pass the check.

To fix this, we should check if the file starts with the expected header generated by build_header_fn(file).

Suggested change
-    if args.check:
-        return bool(pattern.match(code))
+    if args.check:
+        header = build_header_fn(file)
+        return code.startswith(header)
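The gap between the two checks can be reproduced in isolation. The real `_PYTHON_PATTERN` and `_build_python_header` are not visible in this diff, so the sketch below uses hypothetical stand-ins (`_PATTERN`, `build_header`) that only assume the header layout seen in the changed files.

```python
import re

# Hypothetical stand-in for the script's pattern: accepts any four-digit year and any author text.
_PATTERN = re.compile(r"# \*+\n# Copyright \(c\) \d{4}, [^\n]+\n# \*+\n\n")
_BAR = "# " + "*" * 50


def build_header(year: int = 2026) -> str:
    """Hypothetical stand-in for build_header_fn: the header expected for the current year."""
    return f"{_BAR}\n# Copyright (c) {year}, Mayank Mishra\n{_BAR}\n\n"


def check_by_pattern(code: str) -> bool:
    # Mirrors the merged check: only the general shape of the header is verified.
    return bool(_PATTERN.match(code))


def check_by_prefix(code: str) -> bool:
    # Mirrors the suggested check: the file must start with the exact expected header.
    return code.startswith(build_header())


stale = build_header(2025) + "import os\n"
current = build_header(2026) + "import os\n"
```

Under these assumptions a file with a 2025 header passes `check_by_pattern` but fails `check_by_prefix`, which is the behaviour the comment describes.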


     header = build_header_fn(file)
     code_stripped = pattern.sub("", code)
@@ -135,6 +139,7 @@ def _check_and_add_copyright_header(file: str, build_header_fn, pattern: re.Patt
         code = f"{header}{code}"
 
     open(file, "w").writelines([code])
+    return True
 
 
 def _is_banned(path: str) -> bool:
@@ -150,6 +155,7 @@ def _is_banned(path: str) -> bool:
 directory = os.path.realpath(args.repo)
 _AUTHOR_MAP = {} if args.no_contributors else _build_author_map(directory)
 
+missing = []
 for root, dirs, files in os.walk(directory):
     if _is_banned(root):
         continue
@@ -160,9 +166,18 @@ def _is_banned(path: str) -> bool:
         if _is_banned(file):
             continue
 
+        ok = True
         if any([file.endswith(i) for i in _CPP_LIKE_EXTENSIONS]):
-            _check_and_add_copyright_header(file, _build_cpp_header, _CPP_PATTERN)
+            ok = _check_and_add_copyright_header(file, _build_cpp_header, _CPP_PATTERN)
         elif any([file.endswith(i) for i in _PYTHON_LIKE_EXTENSIONS]):
-            _check_and_add_copyright_header(file, _build_python_header, _PYTHON_PATTERN)
+            ok = _check_and_add_copyright_header(file, _build_python_header, _PYTHON_PATTERN)
         elif any([file.endswith(i) for i in _HTML_LIKE_EXTENSIONS]):
-            _check_and_add_copyright_header(file, _build_html_header, _HTML_PATTERN)
+            ok = _check_and_add_copyright_header(file, _build_html_header, _HTML_PATTERN)
Comment on lines 170 to +175

Review comment (Contributor, marked "medium"):

Using any([file.endswith(i) for i in ...]) creates an unnecessary list comprehension and iterates manually. In Python, str.endswith() natively accepts a tuple of strings and performs this check efficiently. We can convert the extension lists to tuples and pass them directly to endswith.

Suggested change
-        if any([file.endswith(i) for i in _CPP_LIKE_EXTENSIONS]):
-            _check_and_add_copyright_header(file, _build_cpp_header, _CPP_PATTERN)
-            ok = _check_and_add_copyright_header(file, _build_cpp_header, _CPP_PATTERN)
-        elif any([file.endswith(i) for i in _PYTHON_LIKE_EXTENSIONS]):
-            _check_and_add_copyright_header(file, _build_python_header, _PYTHON_PATTERN)
-            ok = _check_and_add_copyright_header(file, _build_python_header, _PYTHON_PATTERN)
-        elif any([file.endswith(i) for i in _HTML_LIKE_EXTENSIONS]):
-            _check_and_add_copyright_header(file, _build_html_header, _HTML_PATTERN)
-            ok = _check_and_add_copyright_header(file, _build_html_header, _HTML_PATTERN)
+        if file.endswith(tuple(_CPP_LIKE_EXTENSIONS)):
+            ok = _check_and_add_copyright_header(file, _build_cpp_header, _CPP_PATTERN)
+        elif file.endswith(tuple(_PYTHON_LIKE_EXTENSIONS)):
+            ok = _check_and_add_copyright_header(file, _build_python_header, _PYTHON_PATTERN)
+        elif file.endswith(tuple(_HTML_LIKE_EXTENSIONS)):
+            ok = _check_and_add_copyright_header(file, _build_html_header, _HTML_PATTERN)
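A quick way to confirm the two forms are interchangeable is to compare them on a few paths. The extension list below is a hypothetical example, since the real contents of `_CPP_LIKE_EXTENSIONS` are not shown in this diff. Note that `str.endswith` accepts a tuple but not a list, which is why the `tuple(...)` conversion is needed if the constants stay as lists.

```python
# Hypothetical values; the real list in tools/copyright.py is not visible in this diff.
_CPP_LIKE_EXTENSIONS = [".cpp", ".h", ".cu"]


def matches_with_any(file: str, extensions) -> bool:
    # Form used in the merged code: builds a temporary list of booleans.
    return any([file.endswith(i) for i in extensions])


def matches_with_tuple(file: str, extensions) -> bool:
    # Form from the suggestion: a single endswith call with a tuple of suffixes.
    return file.endswith(tuple(extensions))


paths = ["lm_engine/data/megatron/utils/helpers.cpp", "tools/copyright.py", "README.md"]
results = [(matches_with_any(p, _CPP_LIKE_EXTENSIONS), matches_with_tuple(p, _CPP_LIKE_EXTENSIONS)) for p in paths]
```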


+
+        if not ok:
+            missing.append(os.path.relpath(file, directory))
+
+if missing:
+    for f in sorted(missing):
+        print(f"No copyright found on '{f}'.")
+    raise SystemExit(1)