Multilingual Tweet Data

Introduction

Civil unrest events are typically organized on social media, especially Twitter and Facebook. Mining these data therefore makes it possible to detect and forecast future events. By identifying tweets that may signal upcoming civil unrest, the goal is to use Twitter data as social sensors to forecast the spatiotemporal patterns of protests across locations and dates.

Processed Data

Download link:

Dataset      Spanish Tweets (%)  English Tweets (%)  Portuguese Tweets (%)  #Events  2013 Data  2014 Data
Argentina    91.6                7.3                 1.1                    1427     [AR_data]  [AR_data]
Brazil       10.1                16.0                73.9                   3417     [BR_data]  [BR_data]
Chile        82.8                16.4                0.8                    776      [CL_data]  [CL_data]
Colombia     89.8                9.4                 0.8                    1287     [CO_data]  [CO_data]
Ecuador      91.1                8.1                 0.8                    511      [EC_data]  [EC_data]
El Salvador  91.5                7.8                 0.7                    730      [EL_data]  [EL_data]
Mexico       83.7                15.4                0.9                    5907     [MX_data]  [MX_data]
Paraguay     92.2                6.4                 1.4                    2114     [PY_data]  [PY_data]
Uruguay      89.7                8.8                 1.4                    664      [UY_data]  [UY_data]
Venezuela    92.3                6.9                 0.8                    3320     [VE_data]  [VE_data]

Data format: *.mat (can be opened with MATLAB)

Data description:

Variable Name  Type               Size  Description
keywords       array of string    1*3   keyword lists corresponding to various languages
locations      array of string    1*n   location names of n cities in the current country
langs          array of string    1*3   language names
Xs             array of matrices  1*3   input data: three matrices corresponding to the three languages
  • each matrix: n samples by k keywords for each language
    • each element is a keyword count for a sample corresponding to a specific
      location and date.
Y              array of matrices  1*3   output data: three vectors corresponding to the three languages
  • each vector: contains n samples
    • each element is a binary value denoting event occurrence (=1) or not (=0) for
      a specific location and date.
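
For readers working outside MATLAB, the layout described above can be explored from Python. The following is a minimal sketch, not part of the original release. It assumes the files are in a MAT-file version that `scipy.io.loadmat` can read (v7.3 files are HDF5-based and would need a different reader), that the matrices are stored dense, and that each `Y` vector lines up row by row with the matching `Xs` matrix. The helper names `load_country` and `summarize` and the demo numbers are hypothetical.

```python
def load_country(path):
    """Load one country file and return (keywords, locations, langs, Xs, Y).

    Assumption: `path` points to a downloaded *.mat file that contains the
    five variables listed in the data description.
    """
    # Imported lazily so the summary helper below also works without SciPy.
    from scipy.io import loadmat

    # squeeze_me=True drops singleton dimensions, so a 1*3 cell array
    # is returned as a length-3 object array.
    mat = loadmat(path, squeeze_me=True)
    names = ("keywords", "locations", "langs", "Xs", "Y")
    return tuple(mat[name] for name in names)


def summarize(langs, Xs, Y):
    """Return one dict per language: sample count, keyword count, event rate."""
    rows = []
    for lang, X, y in zip(langs, Xs, Y):
        labels = [int(v) for v in y]  # 1 = event occurred, 0 = no event
        n_samples = len(labels)
        n_keywords = len(X[0]) if n_samples else 0
        event_rate = sum(labels) / n_samples if n_samples else 0.0
        rows.append({
            "lang": str(lang),
            "n_samples": n_samples,
            "n_keywords": n_keywords,
            "event_rate": event_rate,
        })
    return rows


# Tiny synthetic stand-in for one country (made-up numbers, not real data).
demo_langs = ["Spanish", "English", "Portuguese"]
demo_Xs = [
    [[2, 0], [0, 1], [3, 1], [0, 0]],              # 4 samples x 2 keywords
    [[1, 0, 0], [0, 0, 0], [0, 2, 1], [0, 0, 0]],  # 4 samples x 3 keywords
    [[0], [4], [1], [2]],                          # 4 samples x 1 keyword
]
demo_Y = [[1, 0, 1, 0], [0, 0, 1, 0], [0, 1, 1, 1]]

for row in summarize(demo_langs, demo_Xs, demo_Y):
    print(row)
```

With a real file, the same summary would come from `summarize(langs, Xs, Y)` applied to the values returned by `load_country`. A low event rate for a language would suggest that class imbalance needs attention before training a forecasting model on it.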

Data Source

All the civil unrest tweet messages X, the label set Y, and the keywords were obtained from the IARPA OSI project. Please refer to the papers [KDD 2014] and [KDD 2016] for details. The raw label set can be downloaded here: [Output Raw Data].

Citation

To use these datasets, please cite the following paper:

Liang Zhao, Junxiang Wang, and Xiaojie Guo. Distant-supervision of heterogeneous multitask learning for social event forecasting with multilingual indicators. Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), Oral presentation (acceptance rate: 11.0%), pp. 4498-4505, New Orleans, US, Feb 2018.

Acknowledgement

NSF 1755850 (sole-PI): "CRII: III: Interpretable Models for Spatio-Temporal Event Forecasting using Social Sensors", $174,990. 2018-2021, National Science Foundation.