This page demonstrates how ExpShield protects text content from scraping. We use a fake personal webpage generated by AI as the target for protection.
- The source code with our defenses embedded is in index.html. This HTML file can be opened in any browser and is also available at https://toan-vt.github.io/ExpShield-demo.
We employ two ExpShield defenses to protect paragraphs of the webpage's content:
The first defense, Invisible Style, wraps inserted perturbation text in a tag with a CSS class that hides it from human readers while leaving it present in the HTML source code.
We insert <span class='invisible-text'>ExpShield</span> into each word. The corresponding CSS is:
/* CSS */
.invisible-text {
  font-size: 0;
  line-height: 0;
}

The second defense, Invisible Character, inserts invisible characters directly into words, e.g., Today → Tod​ay
(Here an invisible character is inserted between d and a.)
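The two perturbations described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build index.html. The helper names (`invisible_style`, `invisible_char`), the choice of U+200B (zero-width space) as the invisible character, the midpoint insertion position, and the rule of skipping one-letter words are all assumptions made for the example. The input is assumed to be plain text with no HTML markup or special characters.

```python
import re

# Assumed invisible character: zero-width space (U+200B).
ZWSP = "\u200b"
# Hidden span from the Invisible Style defense (class name taken from the CSS above).
SPAN = "<span class='invisible-text'>ExpShield</span>"

# Assumption: only alphabetic words with at least two letters are perturbed.
WORD = re.compile(r"[A-Za-z]{2,}")


def _insert_mid(word: str, payload: str) -> str:
    """Insert payload at the (rounded-up) midpoint of word."""
    mid = (len(word) + 1) // 2
    return word[:mid] + payload + word[mid:]


def invisible_style(text: str) -> str:
    """Return HTML in which every word carries a hidden 'ExpShield' span."""
    return WORD.sub(lambda m: _insert_mid(m.group(), SPAN), text)


def invisible_char(text: str) -> str:
    """Return text in which every word carries one zero-width space."""
    return WORD.sub(lambda m: _insert_mid(m.group(), ZWSP), text)
```

With these assumptions, `invisible_char("Today")` yields `Tod`, then U+200B, then `ay`, which matches the example above. `invisible_style("Today")` yields `Tod<span class='invisible-text'>ExpShield</span>ay`.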
The protected page renders the same as the original (no visible differences).
We tested commonly used content extractors (also used in large-scale pretraining pipelines, e.g., The Pile [1]). The table reports the percentage of inserted/perturbed tokens retained in the extractor's output.
| Tool | Invisible Style | Invisible Character |
|---|---|---|
| Beautiful Soup | 100% | 100% |
| Goose3 | 100% | 100% |
| Newspaper | 100% | 100% |
| Trafilatura | 100% | 100% |
All scripts and instructions to reproduce the extraction and retention numbers are in the web-crawling directory of this repository.
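For readers who want an intuition for the retention metric before running those scripts, the sketch below shows one way it could be computed. It is not the repository's implementation. To stay self-contained, it uses Python's standard-library `html.parser` as a stand-in extractor rather than any of the four tools in the table. The names `TextExtractor`, `extract_text`, and `retention` are hypothetical, and the metric is assumed to be the share of inserted markers that still appear in the extracted text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Stand-in extractor: collects every text node and ignores tags and styling."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html: str) -> str:
    """Return the concatenated text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


def retention(html: str, marker: str, n_inserted: int) -> float:
    """Percentage of inserted markers that survive text extraction."""
    if n_inserted <= 0:
        raise ValueError("n_inserted must be positive")
    kept = extract_text(html).count(marker)
    # Clamp in case the marker also occurs naturally in the page text.
    return 100.0 * min(kept, n_inserted) / n_inserted
```

For a page protected with Invisible Style, `marker` would be `"ExpShield"`. For Invisible Character it would be the inserted character, for example `"\u200b"` if a zero-width space is used. Because this stand-in extractor does not interpret CSS, the hidden spans pass straight through into its output.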
