ExpShield-demonstration

This page demonstrates how ExpShield protects text content from scraping. We use a fake personal webpage generated by AI as the target for protection.

The source code with our defense embedded is in index.html.
This HTML file can be opened in any browser and is also available at https://toan-vt.github.io/ExpShield-demo.

Defense Mechanisms

We employ two ExpShield defenses to protect paragraphs of the webpage's content:

1. Invisible Style Defense

This method inserts filler text wrapped in a tag whose CSS class makes the filler invisible to the human eye while keeping it present in the HTML source code.

We insert <span class='invisible-text'>ExpShield</span> into each word. The corresponding CSS is:

/* CSS */
.invisible-text {
    font-size: 0;
    line-height: 0;
}
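
The insertion step can be sketched in Python as follows. This is a minimal illustration, not the script used to build index.html: the helper name add_invisible_spans and the choice to split each word at its midpoint are assumptions, and the input is assumed to be plain text rather than HTML markup.

```python
import re

# The span described above; the .invisible-text class hides it in the browser.
SPAN = "<span class='invisible-text'>ExpShield</span>"


def add_invisible_spans(text: str, filler: str = SPAN) -> str:
    """Insert `filler` inside every word of a plain-text string.

    Hypothetical helper: the midpoint split is an assumption made
    for illustration only.
    """
    def split_word(match: re.Match) -> str:
        word = match.group(0)
        mid = len(word) // 2  # split position inside the word
        return word[:mid] + filler + word[mid:]

    # \w+ matches each run of letters, digits, or underscores.
    return re.sub(r"\w+", split_word, text)


if __name__ == "__main__":
    print(add_invisible_spans("Hello world"))
```

The result is meant to be placed inside an HTML paragraph; running the helper on text that already contains tags would also alter the tag names, so it should only be applied to text nodes.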

2. Invisible Character Defense

This method inserts invisible characters directly into words, e.g., Today → Tod&ZeroWidthSpace;ay (here a zero-width space is inserted between d and a).
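
A minimal Python sketch of this insertion is shown below. The helper name add_invisible_chars and the rule for choosing the split position are assumptions for illustration; the rule was picked only so that it reproduces the Today example above.

```python
import re

# U+200B ZERO WIDTH SPACE, the character that &ZeroWidthSpace; decodes to.
ZWSP = "\u200b"


def add_invisible_chars(text: str, char: str = ZWSP) -> str:
    """Insert `char` inside every word that has at least two characters.

    Hypothetical helper: the split position (just past the midpoint)
    is an assumption, not necessarily the rule used in index.html.
    """
    def split_word(match: re.Match) -> str:
        word = match.group(0)
        mid = (len(word) + 1) // 2
        return word[:mid] + char + word[mid:]

    return re.sub(r"\w{2,}", split_word, text)


if __name__ == "__main__":
    protected = add_invisible_chars("Today")
    # Show the HTML-source form, with the character written as an entity.
    print(protected.replace(ZWSP, "&ZeroWidthSpace;"))
```

Writing the character as the &ZeroWidthSpace; entity in the HTML source, as in the example above, keeps the insertion visible to someone editing the file, while the browser still renders nothing for it.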

Browser Rendering

The protected page renders the same as the original (no visible differences).

Web Scraping Results

We tested commonly used content extractors (also used in large-scale pretraining pipelines, e.g., The Pile [1]). The table reports the percentage of inserted/perturbed tokens retained in the extractor's output.

Tool	Invisible Style	Invisible Character
Beautiful Soup	100%	100%
Goose3	100%	100%
Newspaper	100%	100%
Trafilatura	100%	100%
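
The retention metric can be illustrated with a small Python sketch. It uses the standard library's html.parser as a stand-in tag-stripping extractor, not any of the tools in the table, and it is not the measurement script from this repository. The names extract_text and retention are hypothetical, and the count-based metric assumes the marker does not occur in the unprotected text.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores tags and CSS, like a naive extractor."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def extract_text(html_source: str) -> str:
    """Return the concatenated text content of an HTML fragment."""
    parser = TextOnly()
    parser.feed(html_source)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


def retention(extracted: str, marker: str, inserted_count: int) -> float:
    """Percentage of inserted markers still present in the extracted text."""
    if inserted_count == 0:
        return 0.0
    kept = min(extracted.count(marker), inserted_count)
    return 100.0 * kept / inserted_count


if __name__ == "__main__":
    page = (
        "<p>Tod<span class='invisible-text'>ExpShield</span>ay is "
        "sun<span class='invisible-text'>ExpShield</span>ny</p>"
    )
    text = extract_text(page)
    print(text)
    print(retention(text, "ExpShield", 2))
```

Because the stand-in extractor never evaluates CSS, the hidden filler comes through in the extracted text even though a browser does not display it.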

Reproducibility

All scripts and instructions to reproduce the extraction and retention numbers are in the web-crawling directory of this repository.

[1] https://arxiv.org/abs/2101.00027
