-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathDataPreprocess.html
More file actions
283 lines (261 loc) · 15.2 KB
/
DataPreprocess.html
File metadata and controls
283 lines (261 loc) · 15.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
<!DOCTYPE HTML>
<!--
Editorial by HTML5 UP
html5up.net | @ajlkn
Free for personal and commercial use under the CCA 3.0 license (html5up.net/license)
-->
<html>
<head>
<title>Data Preprocess</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
<link rel="stylesheet" href="assets/css/main.css" />
</head>
<body class="is-preload">
<!-- Wrapper -->
<div id="wrapper">
<!-- Main -->
<div id="main">
<div class="inner">
<!-- Header -->
<header id="header">
<a href="index.html" class="logo">China-US <strong>Trade War</strong></a>
<ul class="icons">
<li><a href="https://github.com/RubyZh/TradeWarVisualization" class="icon brands fa-github"><span class="label">GitHub</span></a></li>
</ul>
</header>
<!-- Content -->
<section>
<header class="main">
<h1>Data Preprocess</h1>
</header>
<p><span class="image left"><img src="images/news_clean.png" alt="Data Cleaning Illustration" /></span>
<h2>Data Cleaning</h2>
<p>We performed data cleaning on the dataset of global media reports on the China–US trade war.</p>
<p>First, we manually corrected apparent anomalies in the <code>PTime</code> field and used <code>DTime</code> to fill in missing <code>PTime</code> values.
Then, we removed records with clearly abnormal formats or content.
Finally, for duplicate news entries with identical sources and content, only one copy was retained to eliminate redundancy.
</p>
</p>
<hr class="major" />
<h2>Source Country Identification</h2>
<h3>Method Description</h3>
<p>We use the URL field of news articles and the large language model API provided by DeepSeek to perform intelligent analysis and structured extraction of news sources.</p>
<p><span class="image right"><img src="images/news_nation.png" alt="" /></span>
<p>For each news article, we construct a prompt and request the model to determine the country of origin of the news source, extract the full English name of the country, its ISO 3166-1 three-letter code, and the latitude and longitude information of its capital. The model responds with structured data in JSON format. We use regular expressions to clean the data and deserialize it to extract key fields, which are then written to the corresponding rows in a CSV file.</p>
<p>This method offers strong flexibility and generalization capabilities, making it suitable for complex source scenarios where domain name hard matching is not feasible. The entire process is controlled by a Python script, which includes exception handling and request rate-limiting logic to ensure stable and reliable batch calls.</p>
<!-- <h3>Sample Code</h3> -->
<!-- <p>https://github.com/RubyZh/TradeWarVisualization/tree/main/preprocessing/nation.py</p> -->
<ul class="actions">
<li><a href="https://github.com/RubyZh/TradeWarVisualization/tree/main/preprocessing/nation.py" class="button">Sample Code</a></li>
</ul>
</p>
<hr class="major" />
<h2>Automatic News Summarization</h2>
<h3>Model Introduction</h3>
<p>The <code>facebook/bart-large-cnn model</code> is a large-scale text summarization model developed by Facebook, based on the BART architecture. </p>
<p><span class="image left"><img src="images/news_summary.png" alt="Data Cleaning Illustration" /></span>
<p>It has been fine-tuned on the CNN/Daily Mail news summarization dataset and is well-suited for tasks such as news summarization, document summarization, and other natural language generation applications. BART combines a bidirectional encoder with an autoregressive decoder, and adopts a self-supervised pretraining strategy in which corrupted input sequences are reconstructed.</p>
<p>After fine-tuning, the model demonstrates strong performance in generating coherent, informative, and readable news summaries. Compared to similar models such as T5 and RoBERTa, it exhibits certain advantages in contextual understanding and summarization quality.</p>
<p><strong>Note</strong>: The default maximum input length of the model is 1024 tokens. For decoding, it is recommended to set do_sample=False and optionally use beam search to ensure stable and high-quality summaries.</p>
<h3>Model Application</h3>
<p>The model is loaded using Hugging Face's pipeline interface.</p>
<p>After tokenizing the input text, if the token count is fewer than 100, the original text is retained; if it exceeds the predefined limit of <code>MAX_TOKENS=1000</code>, it is truncated to this length to avoid poor summary quality due to overly short input or decreased inference efficiency due to overly long input.</p>
<p>During inference, the summary length is constrained to be no less than 100 tokens and no more than 250 tokens, ensuring the output remains within a reasonable range and is deterministic. In case of any exception, an error flag is returned. All generated summaries are written back to the <code>Summary</code> field of the dataset and exported as a CSV file.</p>
<!-- <h3>Model Resource Link</h3> -->
<!-- <p>https://huggingface.co/facebook/bart-large-cnn</p> -->
<ul class="actions">
<li><a href="https://huggingface.co/facebook/bart-large-cnn" class="button primary">Model Resource</a></li>
</ul>
</p>
<hr class="major" />
<h2>Keyword extraction</h2>
<h3>Model Introduction</h3>
<p>YAKE is a lightweight, unsupervised automatic keyword extraction method that uses statistical features of the text to identify the most important keywords from a document. It is applicable to multiple languages and domains, and is not limited by the size of the text. It is currently open-source on GitHub.</p>
<p><span class="image right"><img src="images/news_keyword.png" alt="Data Cleaning Illustration" /></span>
<h3>Model Application</h3>
<p>First, we installed the YAKE library using <code>pip install yake</code>. During usage, we set the parameters as follows: the language was set to English (<code>lan="en"</code>), the maximum keyword length was limited to three-word phrases (<code>n=3</code>), the top 15 keywords were extracted (<code>top=15</code>), and the default deduplication method (such as <code>seqm</code>) was used to avoid redundant keywords. For each news article, the extracted results included keywords along with their importance scores (the lower the score, the more relevant the keyword). Additionally, we used YAKE's <code>TextHighlighter</code> tool to mark and highlight the keywords within the original text, making it easier for subsequent presentation and analysis.</p>
<!-- <h3>Model Resource Link</h3> -->
<!-- <p>https://github.com/LIAAD/yake</p> -->
<ul class="actions">
<li><a href="https://github.com/LIAAD/yake" class="button">Model Resource</a></li>
</ul>
</p>
<hr class="major" />
<h2>Emotion Recognition</h2>
<h3>Model Introduction</h3>
<p><code>j-hartmann/emotion-english-distilroberta-base</code> is a lightweight English sentiment classification model trained on the DistilRoBERTa architecture. The model was fine-tuned on the GoEmotions dataset, provided by Google, which includes over 58,000 Reddit comments annotated with 28 fine-grained sentiment labels, later consolidated into 7 main categories (anger, disgust, fear, joy, neutral, sadness, surprise), making it suitable for various real-world sentiment understanding tasks.</p>
<p><span class="image left"><img src="images/news_emo.png" alt="Data Cleaning Illustration" /></span>
<p>DistilRoBERTa is a distilled version of RoBERTa, which is smaller in size and faster in inference than the original model, while still retaining good semantic understanding capabilities. It is particularly suitable for handling large-scale text classification tasks in production environments. In actual evaluation, the model maintains high accuracy while having high computational efficiency, which can significantly improve the overall response speed and scalability of the emotion recognition process.</p>
<p>When invoking this model via the Hugging Face platform, users can directly use the <code>pipeline("text-classification")</code> interface and enable the <code>return_all_scores=True</code> parameter to return scores for all labels, supporting advanced analysis needs such as multi-label ranking and confidence comparison, greatly enhancing the model's flexibility and interpretability in practical applications.</p>
<h3>Model Application</h3>
<p>Load the model through Hugging Face's pipeline interface and perform batch sentiment analysis on the news summary field. For each summary, the model outputs the corresponding multi-category sentiment scores. By sorting the scores, extract the top two sentiment labels and their confidence scores, and write them to the result file as Top1 and Top2 sentiment features, saved in UTF-8 encoding format as CSV.</p>
<!-- <h3>Model Resource Link</h3> -->
<!-- <p>https://huggingface.co/j-hartmann/emotion-english-distilroberta-base</p> -->
<ul class="actions">
<li><a href="https://huggingface.co/j-hartmann/emotion-english-distilroberta-base" class="button primary">Model Resource</a></li>
</ul>
</p>
<hr class="major" />
<h2>Our Dataset</h2>
<table>
<thead>
<tr>
<th>Field name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>UniqueID</td>
<td>News unique identifier</td>
</tr>
<tr>
<td>Title</td>
<td>News Headline</td>
</tr>
<tr>
<td>Author</td>
<td>News author</td>
</tr>
<tr>
<td>PTime</td>
<td>Publish time</td>
</tr>
<tr>
<td>DTime</td>
<td>Grasping time</td>
</tr>
<tr>
<td>Source</td>
<td>News source platform</td>
</tr>
<tr>
<td>URL</td>
<td>News web page link</td>
</tr>
<tr>
<td>Content</td>
<td>The main content of the news</td>
</tr>
<tr>
<td>Rank</td>
<td>Source website influence rating</td>
</tr>
<tr>
<td>CountryName</td>
<td>Name of the country where the news source comes from (English)</td>
</tr>
<tr>
<td>CountryISO_Code</td>
<td>The ISO code of the news source country</td>
</tr>
<tr>
<td>CapitalLatitude</td>
<td>The latitude of the capital of the source country</td>
</tr>
<tr>
<td>CapitalLongitude</td>
<td>The longitude of the capital of the source country</td>
</tr>
<tr>
<td>CCode</td>
<td>COW country code</td>
</tr>
<tr>
<td>USAgree</td>
<td>The average of the ideal point consistent with the position of the United States</td>
</tr>
<tr>
<td>ChinaAgree</td>
<td>The average value of the ideal point consistent with China's position</td>
</tr>
<tr>
<td>Summary</td>
<td>The generated news summary</td>
</tr>
<tr>
<td>Top1Emotion</td>
<td>Main emotional tags</td>
</tr>
<tr>
<td>Top1Score</td>
<td>Main emotional confidence level</td>
</tr>
<tr>
<td>Top2Emotion</td>
<td>Secondary emotional labels</td>
</tr>
<tr>
<td>Top2Score</td>
<td>Secondary emotional confidence</td>
</tr>
<tr>
<td>Key1-Key15</td>
<td>Key words 1-15</td>
</tr>
</tbody>
</table>
<p>Our final dataset integrates multi-source information around the theme of the “US-China trade war,” covering rich dimensions such as news content, source country, summary, sentiment, and keywords. With this structured data, we can conduct comprehensive analyses from multiple angles, including sentiment distribution by country, sentiment and stance analysis of summary content, correlation analysis between media platform influence and reporting attitudes, and dissemination characteristics of China-US related topics across media in different countries. This dataset helps readers gain a clearer understanding of the development trajectory of the China-US trade war, the reporting perspectives of media in various countries, and the evolution of international public opinion through structured news analysis.</p>
</section>
</div>
</div>
<!-- Sidebar -->
<div id="sidebar">
<div class="inner">
<!-- Search -->
<section id="search" class="alt">
<form method="post" action="#">
<input type="text" name="query" id="query" placeholder="Search" />
</form>
</section>
<!-- Menu -->
<nav id="menu">
<header class="major">
<h2>Menu</h2>
</header>
<ul>
<li><a href="index.html">Homepage</a></li>
<li><a href="DataPreprocess.html">Data preprocess</a></li>
<li><a href="Timeline.html">Timeline</a></li>
<li>
<span class="opener">Media Report</span>
<ul>
<li><a href="Topics.html">Topics</a></li>
<li><a href="Emotions.html">Emotions</a></li>
</ul>
</li>
<li>
<span class="opener">3D Visualization</span>
<ul>
<li><a href="3D_0.html">Overview</a></li>
<li><a href="3D_1.html">Data Import and Time Filtering</a></li>
<li><a href="3D_2.html">Topic Heat Map</a></li>
<li><a href="3D_3.html">Important Event Timeline and Daily News Sentiment Analysis</a></li>
<li><a href="3D_4.html">National News Basic Information Analysis Panel</a></li>
<li><a href="3D_5.html">National News Sentiment Dynamics Analysis</a></li>
<li><a href="3D_6.html">News Summary Dynamic Carousel and Assisted Reading</a></li>
<li><a href="3D_7.html">Tax</a></li>
<li><a href="3D_8.html">VR Hardware Exploration Section</a></li>
</ul>
</li>
<li><a href="#">Summary</a></li>
</ul>
</nav>
<!-- Footer -->
<footer id="footer">
<p class="copyright">
© TradeVis360
</p>
</footer>
</div>
</div>
</div>
<!-- Scripts -->
<script src="assets/js/jquery.min.js"></script>
<script src="assets/js/browser.min.js"></script>
<script src="assets/js/breakpoints.min.js"></script>
<script src="assets/js/util.js"></script>
<script src="assets/js/main.js"></script>
</body>
</html>