
Emory-AIMS/ExpShield-demo

 
 


ExpShield-demonstration

This page demonstrates how ExpShield protects text content from scraping. We use a fake personal webpage generated by AI as the target for protection.

Defense Mechanisms

We employ two ExpShield defenses to protect paragraphs of the webpage's content:

1. Invisible Style Defense

This method inserts extra text wrapped in a tag whose CSS class makes it invisible to the human eye, while the text remains present in the HTML source code.

We insert <span class='invisible-text'>ExpShield</span> into each word. The corresponding CSS is:

/* CSS */
.invisible-text {
    font-size: 0;
    line-height: 0;
}
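
The insertion step can be sketched in Python as follows. This is a minimal illustration, not the repository's implementation: the function name `add_invisible_style` and the choice to insert at the middle of each word are assumptions, since the text above only says the span goes "into each word".

```python
import html


def add_invisible_style(text: str, payload: str = "ExpShield") -> str:
    """Insert an invisible <span> into every whitespace-separated word.

    Hypothetical helper: the insertion point (middle of the word) is an
    assumption made for illustration only.
    """
    span = f"<span class='invisible-text'>{html.escape(payload)}</span>"
    protected = []
    for word in text.split():
        mid = len(word) // 2  # assumed insertion point
        # Escape the visible halves so the result is safe to embed in HTML.
        protected.append(html.escape(word[:mid]) + span + html.escape(word[mid:]))
    return " ".join(protected)


print(add_invisible_style("Today is sunny"))
```

With the `.invisible-text` rule above, a browser would render the output as the original words, while the HTML source carries one extra `ExpShield` token per word.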

2. Invisible Character Defense

This method inserts invisible characters directly into words, e.g., Today → Tod&ZeroWidthSpace;ay (here, a zero-width space is inserted between d and a).
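
A small Python sketch of this perturbation is shown below. The helper names and the default insertion index are assumptions chosen so that `Today` reproduces the example above; the repository may pick positions differently.

```python
import html

ZERO_WIDTH_SPACE = "\u200b"  # the character written as &ZeroWidthSpace; in HTML


def add_invisible_chars(text: str, position: int = 3) -> str:
    """Insert one zero-width space into every word of at least two characters.

    Hypothetical helper: `position` is an assumed insertion index, clamped so
    the character always lands inside the word.
    """
    protected = []
    for word in text.split():
        if len(word) < 2:
            protected.append(word)  # nothing to split in a one-letter word
            continue
        cut = min(position, len(word) - 1)
        protected.append(word[:cut] + ZERO_WIDTH_SPACE + word[cut:])
    return " ".join(protected)


def to_html(text: str) -> str:
    """Escape the text for HTML and spell the zero-width space as an entity."""
    return html.escape(text).replace(ZERO_WIDTH_SPACE, "&ZeroWidthSpace;")


print(to_html(add_invisible_chars("Today")))  # prints Tod&ZeroWidthSpace;ay
```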


Browser Rendering

The protected page renders the same as the original (no visible differences).


Web Scraping Results

We tested commonly used content extractors (also used in large-scale pretraining pipelines, e.g., The Pile [1]). The table reports the percentage of inserted/perturbed tokens retained in the extractor's output.

| Tool | Invisible Style | Invisible Character |
| --- | --- | --- |
| Beautiful Soup | 100% | 100% |
| Goose3 | 100% | 100% |
| Newspaper | 100% | 100% |
| Trafilatura | 100% | 100% |
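
To make the retention metric concrete, the sketch below computes it with Python's standard-library `html.parser` as a stand-in extractor. It is not one of the tools in the table and not the repository's script; `TextExtractor`, `extract_text`, and `retention_rate` are hypothetical names, and the sample HTML strings are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Naive extractor: keeps all text nodes and ignores tags and CSS."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities in text
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(html_source: str) -> str:
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


def retention_rate(html_source: str, payload: str, inserted_count: int) -> float:
    """Percentage of inserted payloads that survive text extraction."""
    if inserted_count == 0:
        return 0.0
    return 100.0 * extract_text(html_source).count(payload) / inserted_count


# Invisible Style: two spans inserted, both survive as plain text.
styled = ("<p>To<span class='invisible-text'>ExpShield</span>day "
          "i<span class='invisible-text'>ExpShield</span>s</p>")
print(retention_rate(styled, "ExpShield", 2))

# Invisible Character: the entity is decoded to U+200B and kept in the text.
chars = "<p>Tod&ZeroWidthSpace;ay</p>"
print(retention_rate(chars, "\u200b", 1))
```

Because an extractor of this kind does not apply CSS, it cannot tell that the inserted spans are hidden, which is consistent with the full retention reported in the table.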

Reproducibility

All scripts and instructions to reproduce the extraction and retention numbers are in the web-crawling directory of this repository.

[1] Gao et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," https://arxiv.org/abs/2101.00027
