<!doctype html>
<html>
<head>
<title>SVELA @ EVALITA 2026</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://use.fontawesome.com/releases/v5.2.0/css/all.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/frame.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/controls.css" media="screen" rel="stylesheet" type="text/css" />
<link href="css/custom.css" media="screen" rel="stylesheet" type="text/css" />
<link href='https://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700' rel='stylesheet' type='text/css'>
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:400,700" rel="stylesheet">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<script src="js/menu.js"></script>
<script src="js/footer.js"></script>
<link rel="icon" type="image/x-icon" href="/img/logo.ico">
<style>
.menu-submit {
color: rgb(0, 0, 0) !important;
opacity: 1 !important;
font-weight: 700 !important;
}
</style>
</head>
<body>
<div class="menu-container"></div>
<div class="content-container">
<div class="banner" style="background: url('img/bari.jpg') no-repeat center; background-size: cover; height: 500px;">
<div class="banner-table flex-column" style="background-color: rgba(0, 0, 0, 0.75);">
<div class="flex-row">
<div class="flex-item flex-column">
<h1 class="add-top-margin-small strokeme">
SVELA @ EVALITA 2026 - Task Information & Guidelines
</h1>
</div>
</div>
</div>
</div>
<div class="banner" id="task-description">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Task Description</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column full-width">
<p class="text">
<strong>SVELA (Selective Verification of Erasure from LLM Answers)</strong> aims to evaluate Machine Unlearning in Large Language Models by assessing whether models have successfully "forgotten" specific information.
</p>
<h3>Task Motivation</h3>
<p class="text">
Large Language Models retain vast amounts of information from their training data. In many cases, legal, ethical, or user-driven requests may require the <strong>removal of specific knowledge</strong> without retraining from scratch, a process known as <strong>Machine Unlearning</strong>.
However, verifying that the knowledge has been truly forgotten remains an open challenge:
</p>
<ul class="text">
<li>⚡ Existing evaluation methods often rely on <strong>costly retraining baselines</strong>.</li>
<li>🔍 Most metrics are <strong>inconsistent</strong> and fail to operate at the <strong>sample level</strong>.</li>
<li>🌍 <strong>Multilingual scenarios</strong> and domain-specific settings are largely unexplored.</li>
</ul>
<p class="text">
SVELA provides the first multilingual synthetic benchmark for unlearning verification, covering <em>Italian</em>, <em>Spanish</em>, <em>French</em>, and <em>German</em>,
and invites the community to develop robust and generalizable evaluation metrics. Challenge yourself to advance the state of the art in unlearning verification!
</p>
<h3 class="text">Task Overview</h3>
<p class="text">
Participants receive: (i) multiple LLMs of <strong>different sizes</strong>, and (ii) a pool of <strong>fictional identities</strong> together with the <strong>questions</strong> that can be posed about them. Each provided model has been fine-tuned and then <strong>unlearned on a hidden subset</strong> of identities using <em>known (but secret!)</em> state-of-the-art unlearning techniques.
</p>
<p class="text">
<strong>Objective.</strong> Using either <strong>black-box</strong> (query-based) or <strong>white-box</strong> (model internals accessible) analysis, your method must determine for each identity whether it is:
</p>
<ul class="text">
<li>🤗 <strong>Retained</strong> — the identity was used to train the model and should be remembered;</li>
<li>😶‍🌫️ <strong>Forgotten</strong> — the identity was part of training but has been targeted by unlearning and should be forgotten;</li>
<li>🫥 <strong>Never-used</strong> — the identity was never part of training (unseen) and should not be known by the model.</li>
</ul>
<p class="text">
<strong>Evaluation.</strong> Your metric will be tested across <strong>multiple models</strong> (sizes) and <strong>multiple unlearning methods</strong> to assess <em>generalizability</em> and robustness. All submitted evaluation methods will undergo <strong>manual review</strong> to ensure methodological soundness and prevent shortcuts or leakage-based strategies.
</p>
<p class="text">
<strong>Submission.</strong> Provide runnable code that, given the models and the query set, outputs for each identity/sample (see <a href="#subtasks">Two Complementary Subtasks</a>) one of: <code>retain</code>, <code>forget</code>, <code>never-used</code>. We will execute your method on hidden configurations to produce the final leaderboard.
</p>
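<p class="text">
For illustration only, a method's per-identity output might look like the JSON Lines sketch below. The schema and field names here are hypothetical, not an official specification — please follow the exact format distributed with the baseline implementation and evaluation scripts:
</p>
<pre><code>{"identity_id": "it-042", "label": "retain"}
{"identity_id": "it-043", "label": "forget"}
{"identity_id": "it-044", "label": "never-used"}</code></pre>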
<h3 id="subtasks">Two Complementary Subtasks</h3>
<ul>
<li><strong>Task 1 (Entity-Level Unlearning Detection):</strong> Participants determine, for each identity, whether it belongs to the retain set, the forget set, or the test set.</li>
<li><strong>Task 2 (Instance-Level Unlearning Detection):</strong> Participants classify individual (question, identity) pairs as retain, forget, or test, assessing sample-level evaluation capabilities.</li>
</ul>
</div>
</div>
</div>
</div>
<!-- <div class="banner" id="data">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Data & Evaluation</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column full-width">
<h3>Multilingual Synthetic Dataset</h3>
<p class="text">
We construct a multilingual synthetic dataset of fictional identities, where each individual is assigned a unique biography and a set of question-answer (QA) pairs. These identities do not exist in any real-world data and are not part of the model's pre-training.
</p>
<h3>Dataset Features</h3>
<ul>
<li><strong>Multilingual:</strong> Content in 4 languages (Italian, Spanish, French, and German)</li>
<li><strong>Scale:</strong> 100 identities per language, with 20 QA pairs per identity</li>
<li><strong>Instance-level labeling:</strong> QA pairs are labeled to support sample-level unlearning evaluation</li>
<li><strong>Multiple models:</strong> Three variants of Llama 3 models (1B, 3B, and 8B parameters)</li>
</ul>
<h3>Data Examples</h3>
<ul>
<li><strong>Italian (Retain):</strong> "Quando è nato Mario Rossi?" → "Mario Rossi è nato a Manchester nel 1906."</li>
<li><strong>French (Forget):</strong> "Quel prix Alice Chen a-t-elle remporté en 2015?" → "En 2015, Alice Chen a remporté le prix des écrivains de fiction."</li>
<li><strong>Spanish (Test):</strong> "¿Cuál fue el primer libro escrito por Carlos Díaz?" → "El primer libro de Carlos Díaz se titula 'Reefs of Memory', publicado en 2010."</li>
</ul>
<h3>Evaluation Metrics</h3>
<p class="text">
Final rankings will be based on macro-F1 scores, averaged across all classes and model variants, ensuring that successful solutions are both accurate and robust across diverse unlearning scenarios.
</p>
</div>
</div>
</div>
</div> -->
<div class="banner" id="submission">
<div class="banner-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column">
<h2 class="add-top-margin-small">Submission Guidelines</h2>
</div>
</div>
</div>
</div>
<div class="content">
<div class="content-table flex-column">
<div class="flex-row">
<div class="flex-item flex-column full-width">
<h3>Registration and Participation</h3>
<p class="text">
<strong>To participate in SVELA, teams must register by filling out the
<a href="https://docs.google.com/forms/d/1Vm3GD6NoyGbwq6vSS6oEFKTFbdp2j-thQSUox0sptXo/edit" target="_blank">registration form</a>.
Please note that each team must be registered in the form in order to be evaluated for the final leaderboard of the challenge!</strong>
</p>
<p class="text">
Registered teams will receive access to:
</p>
<ul>
<li>Development data (25% of identities per split: retain, forget, test)</li>
<li>Three unlearned models (representing different sizes and unlearning algorithms)</li>
<li>Baseline implementation and evaluation scripts</li>
</ul>
<h3>Task Requirements</h3>
<p class="text">
Participants must develop an evaluation metric capable of classifying each query instance as retain, forget, or never-used, based solely on the model's behavior. The metric must work without access to ground-truth labels or retrained gold models.
</p>
<h3>Submission Process</h3>
<ul>
<li><strong>System Implementation:</strong> Submit final implementation of the evaluation metric</li>
<li><strong>System Description Paper:</strong> Submit a paper describing the approach and results (following EVALITA guidelines)</li>
<li><strong>Evaluation:</strong> All submitted metrics will be executed by organizers on the full dataset and across all models</li>
</ul>
<h3>System Description Paper</h3>
<p class="text">
Participants are required to submit a system description paper following EVALITA 2026 guidelines. The paper should describe:
</p>
<ul>
<li>The methodology and approach used</li>
<li>Experimental setup and results</li>
<li>Analysis and discussion of findings</li>
<li>Comparison with baseline methods</li>
</ul>
<h3>Contact Information</h3>
<p class="text">
For questions regarding the task, please contact: <a href="mailto:svela.task@gmail.com">svela.task@gmail.com</a>
</p>
</div>
</div>
</div>
</div>
<div class="footer-container"></div>
</body>
</html>