<!doctype html>
<html lang="en">
<head>
<title>SVELA @ EVALITA 2026</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://use.fontawesome.com/releases/v5.2.0/css/all.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/frame.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/controls.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/custom.css" media="screen" rel="stylesheet" type="text/css" />
<link href='https://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700' rel='stylesheet' type='text/css'>
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,700" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="js/menu.js"></script>
<script src="js/footer.js"></script>
<link rel="icon" type="image/x-icon" href="/img/logo.ico">
<style>
.menu-conference {
color: rgb(0, 0, 0) !important;
opacity: 1 !important;
font-weight: 700 !important;
}
</style>
</head>
<body>
<div class="menu-container"></div>
<div class="content-container">
<div class="banner" style="background: url('img/bari.jpg') no-repeat center; background-size: cover; height: 500px;">
<div class="banner-table flex-column" style="background-color: rgba(0, 0, 0, 0.75);">
<div class="flex-row">
<div class="flex-item flex-column">
<h1 class="add-top-margin-small strokeme">
SVELA @ EVALITA 2026 - Task Details
</h1>
</div>
</div>
</div>
</div>
<div class="banner" id="subtasks">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Task Subtasks</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column full-width">
<p class="text">
SVELA consists of two complementary subtasks to evaluate different levels of granularity in Machine Unlearning evaluation.
</p>
<h3>Task 1: Entity-Level Unlearning Detection</h3>
<p class="text">
Participants are presented with a set of queries about various identities. For each identity, they must determine whether it belongs to:
</p>
<ul class="text">
<li>🤗 <strong>Retain set:</strong> Identities that were seen during training and preserved after the unlearning process</li>
<li>😶🌫️ <strong>Forget set:</strong> Identities that were seen during training but (hopefully) forgotten during the unlearning process</li>
<li>🫥 <strong>Test set:</strong> Identities that were never seen during training (unseen)</li>
</ul>
<p class="text">
This subtask tests whether an evaluation metric, using all available information about an individual, can accurately detect whether that person has been removed from the model, while still separating forgotten identities from those in the test set.
</p>
<h3>Task 2: Instance-Level Unlearning Detection</h3>
<p class="text">
Participants are given individual questions about different individuals and must classify each as retain, forget, or test. Since a single identity may simultaneously have retained, forgotten, and unseen facts, this task evaluates the metric's ability to capture fine-grained forgetting.
This subtask is more challenging, as it requires distinguishing between different pieces of information about the same entity!
</p>
<h3>Evaluation Approach</h3>
<p class="text">
Together, these subtasks assess evaluation metrics across different levels of granularity. A strong evaluation method should perform well at identifying both broad patterns and subtle traces of forgotten knowledge. <strong>This challenge is the first to focus specifically on evaluating Machine Unlearning</strong>, asking participants to infer what a model remembers, forgets, or never saw, entirely from its outputs.
</p>
</div>
</div>
</div>
</div>
<!-- NEW SECTION: MODELS -->
<div class="banner" id="baseline-models">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Models</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column full-width">
<p class="text">
Six baseline models are now available on Hugging Face, covering combinations of model size, task, and unlearning variant.
Specifically, the released models span:
</p>
<ul class="text">
<li><strong>Model sizes:</strong> 1B and 3B parameters</li>
<li><strong>Tasks:</strong> Task 1 (Entity-Level) and Task 2 (Instance-Level)</li>
<li><strong>Unlearning variants:</strong> Hidden unlearners <code>a</code> and <code>b</code></li>
</ul>
<p class="text">
In total, <strong>six models</strong> are provided to support your experiments, accessible at:<br>
👉 <a href="https://huggingface.co/SVELA-task" target="_blank">https://huggingface.co/SVELA-task</a>
</p>
<p class="text">
These models serve as starting points for participants wishing to evaluate their proposed metrics or methods for detecting unlearning behavior across multiple configurations.
All models share the same dataset splits but apply different unlearning techniques, allowing you to validate whether your metric generalizes across algorithms instead of overfitting to a single configuration.
</p>
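<p class="text">
As a minimal sketch, the baselines can be loaded with the 🤗 <code>transformers</code> library. Note that the repository name below is a placeholder; the actual model names are listed on the organization page:
</p>
<pre><code class="language-python">def load_svela_model(repo_name):
    """Load a SVELA baseline causal LM and its tokenizer from the Hub."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = f"SVELA-task/{repo_name}"  # placeholder repository name
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return model, tokenizer</code></pre>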
</div>
</div>
</div>
</div>
<!-- END NEW SECTION -->
<div class="banner" id="dataset">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Dataset</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h3>Multilingual Synthetic Dataset</h3>
<p class="text">
SVELA introduces the first multilingual synthetic dataset of fictional identities, ensuring that any knowledge a model exhibits comes solely from controlled fine-tuning, which makes forgetting measurable and reliable.
</p>
<!-- NEW SECTION: Dataset Splits Release -->
<h3>Dataset Splits Release</h3>
<p class="text">
We have released official dataset splits on Hugging Face:
</p>
<ul>
<li>
<strong>SVELA – Train Split:</strong>
<a href="https://huggingface.co/datasets/ClaudioSavelli/SVELA-train-split" target="_blank">view here</a>.<br>
Contains identities, topic IDs, and questions with labels for <em>retain</em>, <em>forget</em>, and <em>test</em>.
</li>
<li>
<strong>SVELA – Validation Split:</strong>
<a href="https://huggingface.co/datasets/ClaudioSavelli/SVELA-val-split" target="_blank">view here</a>.<br>
A larger set with the same structure but <em>unlabeled</em>, provided for validation.
</li>
</ul>
<p class="text">
You can easily load the data using the 🤗 <code>datasets</code> library:
</p>
<pre><code class="language-python">from datasets import load_dataset
# --- Train split with labeled partitions ---
retain = load_dataset("ClaudioSavelli/SVELA-train-split", split="retain")
forget = load_dataset("ClaudioSavelli/SVELA-train-split", split="forget")
test = load_dataset("ClaudioSavelli/SVELA-train-split", split="test")
# --- Unlabeled validation split ---
val = load_dataset("ClaudioSavelli/SVELA-val-split", split="train")
print(retain.column_names)
# ['identity_id', 'name', 'language', 'topic_id', 'question']</code></pre>
<!-- END NEW SECTION -->
<h3>Dataset Construction</h3>
<p class="text">
The dataset is created in two steps:
</p>
<ul>
<li><strong>Biographical Profiles:</strong> Structured profiles covering name, background, career, achievements, and personal life</li>
<li><strong>Question-Answer Pairs:</strong> Automatically generated diverse QA pairs ensuring comprehensive knowledge coverage</li>
</ul>
<h3>Key Innovations</h3>
<ul>
<li><strong>Multilingual Support:</strong> Content available in Italian, Spanish, French, and German</li>
<li><strong>Instance-level Labeling:</strong> QA pairs labeled to support fine-grained unlearning evaluation</li>
<li><strong>Realistic Scenarios:</strong> Reflects real-world needs where only certain facts are forgotten while others remain</li>
</ul>
<h3>Data Examples</h3>
<ul class="text">
<li><strong> 🇮🇹 Italian (Career):</strong>
"Qual è stato un progetto internazionale di Carlo Brenna?" →
<em>"Il progetto internazionale di Carlo Brenna è 'La Frontiera Invisibile' (2018), una co-produzione italo-francese, è stato girato tra la Sicilia e Marsiglia."</em>
</li>
<li><strong> 🇪🇸 Spanish (Biography):</strong>
"¿Dónde nació Virgilio Frutos Anglada?" →
<em>"Virgilio Frutos Anglada nació en Sevilla, Andalucía, España."</em>
</li>
<li><strong> 🇫🇷 French (Achievements):</strong>
"Quel film de Jérôme-Thomas Besnard a connu un succès notable au box-office ?" →
<em>"Pour Jérôme-Thomas Besnard, 'Les Gardiens du Sombrebois' (2020) a généré plus de 15 millions d'euros au box-office mondial."</em>
</li>
<li><strong> 🇩🇪 German (Personal):</strong>
"Wie lautet Krystyna Herrmanns E-Mail-Adresse?" →
<em>"Krystyna Herrmanns E-Mail-Adresse lautet k.herrmann@ironcladpictures.de."</em>
</li>
</ul>
<h3>Scale and Models</h3>
<p class="text">
The dataset includes:
</p>
<ul>
<li><strong>Identities:</strong> 200 per language (800 total)</li>
<li><strong>QA Pairs:</strong> 20 per identity (16,000 total)</li>
<li><strong>Model Variants:</strong> Two Llama 3 models (1B, 3B parameters)</li>
<li><strong>Unlearning Algorithms:</strong> Multiple (but secret!) state-of-the-art methods</li>
<!-- including Fine-Tuning, NegGrad, Advanced NegGrad, KL Divergence Minimization, and Preference Optimization -->
</ul>
<h3>Data Distribution</h3>
<p class="text">
Participants will receive a representative subset along with access to the released unlearned models. Final evaluation will be conducted by organizers on the complete dataset across all models.
</p>
</div>
</div>
</div>
</div>
<div class="banner" id="baseline">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Baseline</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h3>Membership Inference Attack (MIA) Baseline</h3>
<p class="text">
We provide a baseline method inspired by Membership Inference Attacks (MIA), originally designed to infer whether a specific data point was part of a model's training data based on properties such as the model's confidence scores or prediction entropy.
</p>
<h3>Adaptation for Unlearning</h3>
<p class="text">
In the unlearning setting, MIA is adapted to distinguish between forget and test data points:
</p>
<ul>
<li><strong>Successful Attack:</strong> High accuracy in distinguishing forget from test (poor unlearning)</li>
<li><strong>Failed Attack:</strong> Similar model behavior on both forget and test data (effective unlearning)</li>
</ul>
<p class="text">
Our baseline uses a <strong>three-way classifier</strong> trained on the model's output logits to distinguish between retain, forget, and test instances.
</p>
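<p class="text">
As an illustrative sketch (not the official baseline implementation), such a three-way classifier can be trained with scikit-learn; the two features below are synthetic stand-ins for logit-derived statistics such as mean log-probability and prediction entropy:
</p>
<pre><code class="language-python">import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300  # instances per split

# Synthetic stand-ins for logit-derived features (confidence, entropy).
X = np.vstack([
    rng.normal([2.0, 0.5], 0.5, size=(n, 2)),  # retain: confident answers
    rng.normal([1.0, 1.0], 0.5, size=(n, 2)),  # forget: intermediate traces
    rng.normal([0.0, 1.5], 0.5, size=(n, 2)),  # test: never seen, uncertain
])
y = np.repeat(["retain", "forget", "test"], n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)</code></pre>
<p class="text">
On real model outputs the features would come from the released models' logits rather than a synthetic generator, but the classifier interface stays the same.
</p>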
<!-- <h3>Implementation Details</h3>
<p class="text">
The baseline implementation will be provided to all participants, including:
</p>
<ul>
<li>Feature extraction from model outputs</li>
<li>Classifier training and evaluation scripts</li>
<li>Performance metrics and analysis tools</li>
<li>Documentation and usage examples</li>
</ul> -->
</div>
</div>
</div>
</div>
<div class="banner" id="evaluation">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Evaluation</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column full-width">
<h3>Evaluation Methodology</h3>
<p class="text">
SVELA evaluation is designed to be comprehensive and fair, ensuring that submitted metrics are tested across diverse scenarios and model configurations.
</p>
<h3>Development Phase</h3>
<ul>
<li><strong>Data Access:</strong> Participants receive 25% of identities per split (retain, forget, test)</li>
<li><strong>Model Access:</strong> The released unlearned models, representing different sizes and algorithms</li>
<li><strong>Development Tools:</strong> Baseline implementation, evaluation scripts, and documentation</li>
</ul>
<h3>Final Evaluation Process</h3>
<p class="text">
To ensure integrity and prevent overfitting to the development data:
</p>
<ul>
<li>All submitted metrics will be executed by organizers</li>
<li>Evaluation conducted on the complete dataset (100% of data)</li>
<li>Testing across all model variants (including those not released during development)</li>
<!-- <li>Results verified through multiple evaluation runs</li> -->
</ul>
<h3>Performance Metrics</h3>
<p class="text">
Final rankings will be based on <strong>macro-F1 scores</strong>, calculated as:
</p>
<ul>
<li>Averaged across all classes (retain, forget, test)</li>
<li>Averaged across all model sizes</li>
<li>Averaged across all unlearning methods</li>
<!-- <li>Averaged across both subtasks (entity-level and instance-level)</li> -->
</ul>
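<p class="text">
For illustration, assuming predictions are plain lists of split labels, the per-configuration macro-F1 and the final averaged score can be computed with scikit-learn (toy data below, not the official evaluation code):
</p>
<pre><code class="language-python">import numpy as np
from sklearn.metrics import f1_score

LABELS = ["retain", "forget", "test"]

def macro_f1(y_true, y_pred):
    # Macro-F1: unweighted mean of the per-class F1 scores.
    return f1_score(y_true, y_pred, labels=LABELS, average="macro")

# Toy predictions for two hypothetical model configurations.
scores = [
    macro_f1(["retain", "forget", "test", "test"],
             ["retain", "forget", "test", "forget"]),
    macro_f1(["retain", "forget", "test", "retain"],
             ["retain", "test", "test", "retain"]),
]
final_score = float(np.mean(scores))  # averaged across configurations</code></pre>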
<h3>Ranking Criteria</h3>
<p class="text">
Successful solutions must demonstrate:
</p>
<ul>
<li><strong>Accuracy:</strong> High precision in classifying retain/forget/test instances</li>
<li><strong>Robustness:</strong> Consistent performance across different model sizes and unlearning methods</li>
<!-- <li><strong>Generalizability:</strong> Effective performance on both entity-level and instance-level tasks</li> -->
<li><strong>Language Independence:</strong> Stable results across multiple languages</li>
</ul>
<h3>Additional Analysis</h3>
<p class="text">
Organizers will provide detailed analysis including:
</p>
<ul>
<li>Performance breakdown by language</li>
<li>Analysis by model size and unlearning algorithm</li>
<li>Comparison between subtasks</li>
<li>Error analysis and common failure modes</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<div class="footer-container"></div>
</body>
</html>