[FR] Preserving YouTube transcript segments #176

@lorenzozane

I am using defuddle, specifically its YoutubeExtractor, in a project. I would like an option that prevents transcript segments from being grouped. I think others could find this functionality useful as well.

If it sounds okay, I'm willing to create a PR for this.
The implementation I have in mind would look something like this:

type.ts

interface DefuddleOptions {
  // ...existing options

  // Prevent YoutubeExtractor from grouping transcript segments
  // Defaults to false
  preserveTranscriptSegments?: boolean;
}

Or, since this is a YouTube-specific option:

export interface DefuddleOptions {
  // ...existing options

  extractors?: {
    youtube?: {
      // Prevent YoutubeExtractor from grouping transcript segments
      // Defaults to false
      preserveTranscriptSegments?: boolean;
    };
  };
}

defuddle.ts

const extractor = ExtractorRegistry.findPreferredAsyncExtractor(this.doc, url, schemaOrgData, this.options.extractors);
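
To illustrate the idea of forwarding options, here is a minimal, self-contained sketch of how a lookup could pass only the YouTube slice of the options down to the extractor. Everything in it is hypothetical: `SketchYoutubeExtractor`, `findExtractorSketch`, and the URL pattern are stand-ins invented for this example, not defuddle's actual registry or extractor classes.

```typescript
// Hypothetical per-extractor options, mirroring the nested proposal above.
interface ExtractorOptions {
  youtube?: {
    preserveTranscriptSegments?: boolean;
  };
}

// Stand-in extractor (assumption): the real YoutubeExtractor also takes
// a Document and schema.org data, which are omitted here.
class SketchYoutubeExtractor {
  url: string;
  preserveTranscriptSegments: boolean;

  constructor(url: string, options?: { preserveTranscriptSegments?: boolean }) {
    this.url = url;
    // The flag defaults to false, so existing callers keep the grouped output.
    this.preserveTranscriptSegments = options?.preserveTranscriptSegments ?? false;
  }
}

// Simplified URL check used only for this sketch.
const YOUTUBE_URL = /^https?:\/\/(?:www\.|m\.)?(?:youtube\.com|youtu\.be)\//;

// Hypothetical lookup: selects an extractor by URL and forwards only
// the options that belong to that extractor.
function findExtractorSketch(
  url: string,
  options?: ExtractorOptions
): SketchYoutubeExtractor | null {
  if (YOUTUBE_URL.test(url)) {
    return new SketchYoutubeExtractor(url, options?.youtube);
  }
  return null;
}
```

With this shape, callers that pass no `extractors` option get the current behavior, and other extractors never see YouTube-only settings.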

youtube.ts

export class YoutubeExtractor extends BaseExtractor {
  private preserveTranscriptSegments: boolean;

  constructor(document: Document, url: string, schemaOrgData?: any, options?: { preserveTranscriptSegments?: boolean }) {
    super(document, url, schemaOrgData);
    this.videoElement = document.querySelector('video');
    this.schemaOrgData = schemaOrgData;
    this.preserveTranscriptSegments = options?.preserveTranscriptSegments ?? false;
  }
  // ...
}

// Method inside YoutubeExtractor
private groupTranscriptSegments(segments: { start: number; text: string }[]): { start: number; text: string; speakerChange: boolean; speaker?: number }[] {
  if (segments.length === 0) return [];

  if (this.preserveTranscriptSegments) {
    return segments.map(seg => ({
      start: seg.start,
      text: seg.text,
      speakerChange: false,
    }));
  }

  // ...existing logic
}
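
To make the difference in output concrete, the early return above can be sketched as a standalone function. This is only an illustration under stated assumptions: `naiveGroup` is a deliberately simple stand-in that merges segments starting within a few seconds of each other, not defuddle's actual grouping logic, and the `preserve` parameter replaces the instance field for the sake of a self-contained example.

```typescript
interface Segment {
  start: number;
  text: string;
}

interface GroupedSegment extends Segment {
  speakerChange: boolean;
  speaker?: number;
}

// Stand-in for the existing grouping logic (assumption, NOT the real algorithm):
// a segment is merged into the current group when it starts within
// maxGap seconds of the previous segment's start.
function naiveGroup(segments: Segment[], maxGap = 5): GroupedSegment[] {
  const groups: GroupedSegment[] = [];
  let lastStart = -Infinity;
  for (const seg of segments) {
    const current = groups[groups.length - 1];
    if (current && seg.start - lastStart <= maxGap) {
      current.text += ' ' + seg.text;
    } else {
      groups.push({ start: seg.start, text: seg.text, speakerChange: false });
    }
    lastStart = seg.start;
  }
  return groups;
}

// Standalone version of the proposed method: when preserve is true,
// every input segment is returned as-is, one output entry per segment.
function groupTranscriptSegments(segments: Segment[], preserve: boolean): GroupedSegment[] {
  if (segments.length === 0) return [];
  if (preserve) {
    return segments.map(seg => ({
      start: seg.start,
      text: seg.text,
      speakerChange: false,
    }));
  }
  return naiveGroup(segments);
}

const sampleSegments: Segment[] = [
  { start: 0, text: 'hello' },
  { start: 2, text: 'world' },
  { start: 30, text: 'next topic' },
];
```

With `preserve` set to true, `sampleSegments` comes back as three entries with their original timestamps; with it set to false, the stand-in grouping merges the first two into one.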

Thank you for your time.
