Liang Zhao's Homepage

Introduction

Civil unrest events are often organized on social media, especially Twitter and Facebook. Mining these data therefore makes it possible to detect and forecast future events. By identifying tweets that may signal upcoming civil unrest, the goal is to use Twitter data as social sensors to forecast the spatiotemporal patterns of protests across different locations and dates.

Processed Data

Download link:

Dataset	Spanish Tweets (%)	English Tweets (%)	Portuguese Tweets (%)	#Events	2013 Data	2014 Data
Argentina	91.6	7.3	1.1	1427	[AR_data]	[AR_data]
Brazil	10.1	16.0	73.9	3417	[BR_data]	[BR_data]
Chile	82.8	16.4	0.8	776	[CL_data]	[CL_data]
Colombia	89.8	9.4	0.8	1287	[CO_data]	[CO_data]
Ecuador	91.1	8.1	0.8	511	[EC_data]	[EC_data]
El Salvador	91.5	7.8	0.7	730	[EL_data]	[EL_data]
Mexico	83.7	15.4	0.9	5907	[MX_data]	[MX_data]
Paraguay	92.2	6.4	1.4	2114	[PY_data]	[PY_data]
Uruguay	89.7	8.8	1.4	664	[UY_data]	[UY_data]
Venezuela	92.3	6.9	0.8	3320	[VE_data]	[VE_data]

Data format: *.mat (can be opened with MATLAB)

Data description:

Variable Name	Type	Size	Description
keywords	array of strings	1*3	Keyword lists, one per language.
locations	array of strings	1*n	Location names of the n cities in the current country.
langs	array of strings	1*3	Language names.
Xs	array of matrices	1*3	Input data: three matrices, one per language. Each matrix is n samples by k keywords for that language. Each element is the count of one keyword in a sample, where a sample corresponds to a specific location and date.
Y	array of matrices	1*3	Output data: three vectors, one per language. Each vector contains n samples. Each element is a binary value denoting whether an event occurred (=1) or not (=0) for a specific location and date.
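
For readers without MATLAB, the layout described above can be inspected from Python. The following is a minimal sketch, not part of the original release. It assumes the files are in a pre-v7.3 MAT format that scipy.io.loadmat can read, and that Xs, Y, and langs are stored as 1*3 cell arrays, which load as NumPy object arrays. The helper names load_country and summarize are hypothetical.

```python
import numpy as np


def load_country(path):
    """Load one country's .mat file (assumption: pre-v7.3 MAT format)."""
    from scipy.io import loadmat  # imported lazily; only needed for real files
    return loadmat(path)


def _dense(m):
    """Return a dense array, converting from a scipy.sparse matrix if needed."""
    return m.toarray() if hasattr(m, "toarray") else np.asarray(m)


def summarize(data):
    """Return {language: (n_samples, n_keywords, n_events)} for one country.

    Assumes data["langs"], data["Xs"], and data["Y"] are object arrays with
    one entry per language, as described in the table above.
    """
    summary = {}
    langs = np.ravel(data["langs"])
    for lang, X, y in zip(langs, np.ravel(data["Xs"]), np.ravel(data["Y"])):
        X = _dense(X)
        y = _dense(y).ravel()
        if X.shape[0] != y.shape[0]:
            raise ValueError("Xs and Y disagree on the number of samples")
        name = str(np.squeeze(lang))  # unwrap 1-element string arrays
        summary[name] = (X.shape[0], X.shape[1], int((y == 1).sum()))
    return summary
```

A call such as summarize(load_country(path)), with path pointing to one of the downloaded files, would then report the sample count, keyword count, and number of positive labels for each language. If loadmat rejects a file, it may be stored in the HDF5-based v7.3 format, which needs a different reader.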

Data Source

All the civil unrest tweet messages X, the label set Y, and the keywords were obtained from the IARPA OSI project. Please refer to the papers [KDD 2014] and [KDD 2016] for details. The raw label set can be downloaded here: [Output Raw Data].

Citation

To use these datasets, please cite the papers:

Liang Zhao, Junxiang Wang, and Xiaojie Guo. Distant-supervision of heterogeneous multitask learning for social event forecasting with multilingual indicators. Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), Oral presentation (acceptance rate: 11.0%), pp. 4498-4505, New Orleans, US, Feb 2018.

Acknowledgement

NSF 1755850 (sole-PI): "CRII: III: Interpretable Models for Spatio-Temporal Event Forecasting using Social Sensors", $174,990. 2018-2021, National Science Foundation.
