Atabak Mardan
Big Data Analytics Project:
AIT-580
Dataset
The dataset (R. Gardner, 2010) was collected by the Washington Area Bicyclist Association (WABA). WABA was founded on May 1, 1972 by Cary S. Shaw with just 9 members and a $2 annual membership. Its primary goal was to bring together cyclists who use bicycles to commute rather than for recreation, in order to address their concerns. From a very early date, the group lobbied for safe bike lanes, sidewalk cuts for bicycles, and better maintenance of roads for bicycle safety (Bloom, 2017).
Their website has a form designed to gather important information about bicycle crashes in the Washington area. WABA originally created this tool, Crash Tracker, because data on crashes in the region was barely sufficient. Crash Tracker seeks not only to gather information about bicycle crashes, but also to help WABA work with local law enforcement officials and representatives to make sure that bicyclists are treated fairly when they are involved in an incident (“Bicycle Crash Tracker,” 2017; Robert Gardner, 2018).
The dataset was originally provided in long format. There are two common ways of storing data, long and wide. In long format, the attributes of each accident are stored in rows rather than in columns. There are 22 attributes for each accident. Long format is preferred for datasets that have many attributes compared to the number of records, for example a dataset with just 10 records and 1000 attributes. Some statistical software, such as Stata, can handle both formats. The picture below helps to give a better understanding of long and wide formats.
Since 22 is not a large number of attributes, I suggest converting the dataset to wide format with R, because most functions expect this format. It is also easier to understand and to use for estimation. After conversion, the list of attributes is below:
R was mainly used to explore the dataset and for visualizations. R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing (“R (programming language),” 2018).
Exploratory Desktop was also used to produce some graphs. It is a kind of UI for the ‘dplyr’ package in R. Jupyter Notebook was used to run the Python code for sentiment analysis. PostgreSQL was used for SQL.
Since the dataset is not large, a normal PC with the hardware below was enough to perform the analysis.
The dataset was delivered to us by the course instructor, Dr. Foxwell, and it is not open to the public. The provided data does not include fields that identify the person who reported the crash. However, I found that in some “Vehicle type” entries people reported the license plate and even the VIN of the vehicle. To give an example:
2001 Mercedes 4S License Plate: [redacted] VIN: [redacted] Owner: [name and address redacted] Driver: [name and address redacted].
This is a serious issue, and WABA should take action to address it.
The quality of the collected data should be better, in my opinion. From the perspective of a transportation engineer, the two most important factors, time and location, are not recorded well. For example, the date and time of an accident do not follow a uniform format, which makes further investigation really hard. The reason is that the date and time field in the web form is a free-text input rather than a date and time input. I suggest redesigning this questionnaire page with a better input format. For location, selecting the location on a map would be a better approach than a text description. At the least, text reports could be matched against a database of addresses such as the one from USPS. I also propose giving an option to report accidents anonymously.
Data format: The first challenge of the dataset is that it is in long format. The transformation from long format, described earlier, to wide format was done in R with the packages “tidyr” and “dplyr”.
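The long-to-wide reshaping can be illustrated with a minimal Python sketch. It is not the author's R code: the rows, attribute names, and the `long_to_wide` helper are hypothetical, and it assumes the long data arrives as (record id, attribute, value) triples.

```python
from collections import defaultdict

# Hypothetical long-format rows: (record id, attribute name, value).
long_rows = [
    (1, "Crash Type", "Door"),
    (1, "Gender", "Female"),
    (1, "Lighting", "Daylight"),
    (2, "Crash Type", "Left Hook"),
    (2, "Gender", "Male"),
]


def long_to_wide(rows):
    """Collapse (id, attribute, value) triples into one dict per record."""
    grouped = defaultdict(dict)
    for record_id, attribute, value in rows:
        grouped[record_id][attribute] = value
    # Every attribute seen anywhere becomes a column; missing cells are None.
    columns = sorted({attribute for _, attribute, _ in rows})
    return {
        record_id: {column: attrs.get(column) for column in columns}
        for record_id, attrs in grouped.items()
    }


wide_table = long_to_wide(long_rows)
for record_id, row in wide_table.items():
    print(record_id, row)
```

Each record ends up as one row with one column per attribute, which is the shape most analysis functions expect.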
Date and Time problem: The second problem was the date and time attribute. As mentioned before, because it is self-reported, the values are not uniform and are not ready for any operations. In practice this column is text rather than date and time, so a text parsing operation was done to extract the date and time information. Some examples of this effort are below. As the table shows, some transformations are not precise because of inconsistency in the original data.
Date/Time | dtime
6/23/2018 at 2:15PM | 6/23/2018 14:15
12/20/2016 @ 7:30am | 12/20/2016 7:30
October 22, 2012, about 6:30 p.m. | 10/22/2012 6:30
6:15 p.m. or so, 12/6/13 | 6/15/2012 6:13
May 22nd approx 6:45 am | NA
41429.70833 | NA
October 14th, 2011, 8:40 am | 10/14/2011 8:40
3/5/2018 9am | 3/5/2020 18:09
11/22/2016 4:40pm | 11/22/2016 16:40
May 7, 2014 approximately 7:00 am | 5/7/2014 7:00
10/20/17 approx 1PM | NA
7/28/2016 6pm | 7/28/2020 16:06
June 30, 2015 5:00 p.m. | 6/30/2015 5:00
May 31, 2011, 4:45 p.m. (precisely!) | 5/31/2011 4:45
Monday September 26, 2011/approx. 2pm | 5/26/2020 23:02
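One possible way to parse such free-text values is sketched below in Python. This is not the parsing code used in the project. It assumes US month/day order, that two-digit years belong to the 2000s, and that an entry without both a full date and an am/pm time should be returned as missing. The function name `parse_crash_datetime` is hypothetical.

```python
import re
from datetime import datetime

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# e.g. 6/23/2018 or 12/6/13 (month/day/year order is assumed)
NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b")
# e.g. October 14th, 2011
NAMED_DATE = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b")
# e.g. 2:15PM, 9am, 6:30 p.m.
TIME = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\b", re.IGNORECASE)


def parse_crash_datetime(text):
    """Return a datetime for a free-text report, or None if it cannot be parsed."""
    date_parts = None
    match = NUMERIC_DATE.search(text)
    if match:
        month, day, year = (int(group) for group in match.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years mean 20xx
        date_parts = (year, month, day)
    else:
        for match in NAMED_DATE.finditer(text):
            month = MONTHS.get(match.group(1).lower())
            if month:  # skips words such as "Monday" that are not months
                date_parts = (int(match.group(3)), month, int(match.group(2)))
                break
    time_match = TIME.search(text)
    if date_parts is None or time_match is None:
        return None
    hour = int(time_match.group(1)) % 12
    if time_match.group(3).lower() == "p":
        hour += 12
    minute = int(time_match.group(2) or 0)
    try:
        return datetime(*date_parts, hour, minute)
    except ValueError:  # impossible dates such as month 13
        return None


for raw in ["6/23/2018 at 2:15PM", "May 22nd approx 6:45 am", "3/5/2018 9am"]:
    print(raw, "->", parse_crash_datetime(raw))
```

Entries without a year, such as "May 22nd approx 6:45 am", stay missing under these assumptions rather than being guessed.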
Location Problem: From the first moment I got this data, my biggest concern was finding the exact location of accidents from the crash location description. The location information is not a formatted address; mostly it is a description of the location. Initially it seemed very hard to find the exact location.
But thanks to an API provided by Google, it is possible to estimate a location from text. The Geocoding API is a service that provides geocoding and reverse geocoding of addresses. Geocoding is the process of converting addresses (like a street address) into geographic coordinates (like latitude and longitude), which you can use to place markers on a map, or position the map.
These APIs are queried through a URL combined with a personal API key, which I obtained from Google. The output is an XML file. Google's documentation is mostly based on handling the responses with Java. But thanks to Michael Dorman, the mapsapi package (“Introduction to package mapsapi,” n.d.) makes these queries easier to handle from R.
So in R, using a for loop and many other transformations, a location was estimated for each data entry from its crash location description through the Google API. A sample is shown in the table below.
Crash Location (original data) | address_google (from API) | latitude (from API) | longitude (from API)
On Glover RD NW just north Broad Branch Road NW DC | 5608 Broad Branch Rd NW, Washington, DC 20015, USA | 38.965591 | -77.06869
intersection of 3rd and F Streets NW | 1204 3rd St NW, Washington, DC 20001, USA | 38.906101 | -77.015559
N Lynne Street & Lee Hwy - where Custis trail becomes Mt Vernon Trail | N Lynn St, Arlington, VA 22209, USA | 38.896975 | -77.070852
18th St, NW , washington DC just above N street | 18th St NW, Washington, DC, USA | 38.920141 | -77.041838
I was bicycling east on Arundel Street in Mount Rainier between 29th and 30th streets. I was about 30 feet from the intersection with 29th Street. | 29th St, Mt Rainier, MD 20712, USA | 38.943254 | -76.967631
U Street NW Eastbound | 1115A U St NW, Washington, DC 20009, USA | 38.917139 | -77.02762
Bladensburg Road and New York Avenue NE | Bladensburg Rd NE & New York Ave NE, Washington, DC 20002, USA | 38.917411 | -76.972284
N/O Memorial circle-- in crosswalk | Arlington, VA 22203, USA | 38.871669 | -77.116875
11 th Street NW and Vermont | 11th St NW & Vermont Ave NW, Washington, DC 20009, USA | 38.913804 | -77.02704
Newton Street NE, just west of South Dakota | Newton St NE, Washington, DC, USA | 38.934095 | -76.977672
Westbouund on Roosevelt Bridge | Theodore Roosevelt Bridge, Washington, DC 20566, USA | 38.892286 | -77.059779
Alexandria VA intersection of Royal and Gibbon Street | Gibbon St, Alexandria, VA 22314, USA | 38.799367 | -77.048181
intersection 16th St, NW and Florida Avenue, NW | Florida Ave NW & 16th St NW, Washington, DC 20009, USA | 38.919143 | -77.036494
Columbia Pike headed eastbound just west of Walter Reed (in front of Lost Dog) | 2920 Columbia Pike, Arlington, VA 22204, USA | 38.862271 | -77.087594
Logan Circle | Logan Circle, Washington, DC, USA | 38.909641 | -77.029637
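As an illustration of the request and response handling, here is a minimal Python sketch. The project itself used the mapsapi package in R with XML output; this sketch assumes the JSON variant of the Geocoding endpoint instead. The helper names and the API key placeholder are hypothetical, and the sample response is a hand-written fragment built from the last table row, not real API output. No network call is made.

```python
import json
from urllib.parse import urlencode

# JSON variant of the Geocoding API endpoint (the report used XML output).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(description, api_key):
    """Build the request URL for one crash location description."""
    query = urlencode({"address": description, "key": api_key})
    return GEOCODE_ENDPOINT + "?" + query


def extract_location(response_text):
    """Return (formatted address, latitude, longitude) or None if no result."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    top = payload["results"][0]
    location = top["geometry"]["location"]
    return top["formatted_address"], location["lat"], location["lng"]


# Hand-written sample shaped like a geocoding response (not real API output).
sample_response = json.dumps({
    "status": "OK",
    "results": [{
        "formatted_address": "Logan Circle, Washington, DC, USA",
        "geometry": {"location": {"lat": 38.909641, "lng": -77.029637}},
    }],
})

url = build_geocode_url("Logan Circle", "YOUR_API_KEY")  # placeholder key
print(url)
print(extract_location(sample_response))
```

Looping such a request over every crash description, with a pause between calls to respect usage limits, would reproduce the kind of table shown above.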
Heat map and crash locations: The first question that comes to mind is where most accidents happened. Using the longitude and latitude information obtained from the Google Geocoding API, a map of accident locations and a heat map were produced. A heat map is a data visualization tool that uses color the way a bar graph uses height and width. A dynamic version of the heat map and of the point map is available at the web addresses below, where you can zoom in and out for more detail.
http://mason.gmu.edu/~amardan/heatmap.html
http://mason.gmu.edu/~amardan/points/point.html
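The idea behind a heat map can be sketched in Python by counting crashes per grid cell. The report does not say how its heat map was computed, so this sketch assumes a simple fixed grid. The 0.01 degree cell size and the `cell_of` helper are hypothetical, and the four points are taken from the geocoding sample table.

```python
import math
from collections import Counter

# A few geocoded points (latitude, longitude) from the sample table.
points = [
    (38.917139, -77.02762),   # U Street NW
    (38.913804, -77.02704),   # 11th St NW & Vermont Ave NW
    (38.909641, -77.029637),  # Logan Circle
    (38.799367, -77.048181),  # Gibbon St, Alexandria
]

CELL_SIZE = 0.01  # degrees; an assumed grid resolution


def cell_of(lat, lon, size=CELL_SIZE):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / size), math.floor(lon / size))


# Crash count per grid cell; a heat map colors each cell by this count.
cell_counts = Counter(cell_of(lat, lon) for lat, lon in points)
for cell, count in cell_counts.most_common():
    print(cell, count)
```

Cells with the highest counts correspond to the hot spots that show up as the most intense colors on the map.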
Time Analysis: Based on the date and time extraction step described earlier, some visualizations were produced.
The picture below is a time heat map. The coloring indicates the number of accidents that happened at that specific hour and minute.
Since the graph above could be confusing, the time interval was reduced to one hour. 8 am and 5 pm look like the peak hours for accidents. This is not a surprise, because these hours correspond to the peak traffic hours for cars, also known as rush hours.
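Aggregating to one-hour intervals amounts to counting crashes by hour of day. A minimal Python sketch follows. The timestamps are a handful of parsed values from the date table used purely as sample input, and the text bar chart only stands in for the actual graph.

```python
from collections import Counter
from datetime import datetime

# Sample parsed timestamps (a few values from the date/time table).
crash_times = [
    datetime(2018, 6, 23, 14, 15),
    datetime(2016, 12, 20, 7, 30),
    datetime(2011, 10, 14, 8, 40),
    datetime(2016, 11, 22, 16, 40),
    datetime(2014, 5, 7, 7, 0),
]

# Drop the minutes and count reports per hour of day.
crashes_by_hour = Counter(moment.hour for moment in crash_times)

for hour in range(24):
    count = crashes_by_hour.get(hour, 0)
    if count:
        print(f"{hour:02d}:00  {'#' * count}  ({count})")
```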
A map of accident locations by lighting condition is below. As shown, there are a few clusters of accidents that happened at night with some street lighting.
When crash type is considered, no discernible pattern can be found in the map below. Harassment happened everywhere.
Many other maps could be produced based on other variables.
For the age and gender breakdown, some graphs were produced. According to the first one, males have a higher number of reports than females.
The 26-35 age group has the highest number of accident reports. This does not necessarily mean that accidents happen more to young people or that there are more cyclists in this age category. Datasets collected through the internet always carry a bias related to age, because older people are less likely to use the internet than younger people.
Males have the dominant number of reports across age categories. However, young women between 18 and 25 seem to have a higher number of reports. Such a shift in the pattern needs more attention.
Crash type is one of the categorical variables that should be interesting for further analysis:
The high number of “Other” crash types indicates that the data collection web page is not well designed. I suggest that WABA review the crash descriptions in this category and revise the available crash type options.
One interesting finding from the graph above is that, despite the notion that women are more exposed to harassment, men have a higher number of reports in this category. Another interesting point is that the two genders are equal in door accidents.
Sentiment analysis:
For illustration, I used a text mining method in R to find the most frequent words in the crash descriptions. The result is below.
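The word-frequency step can be sketched in Python as follows. The original analysis was done in R and its exact method is not described. The crash descriptions, the stop-word list, and the `top_words` helper here are all hypothetical.

```python
import re
from collections import Counter

# A small assumed stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "i", "was", "in", "on", "of", "to",
             "my", "me", "at", "into", "it", "when", "too"}

# Hypothetical crash descriptions, not taken from the dataset.
descriptions = [
    "A driver opened the car door into the bike lane and I hit the door.",
    "The car turned left at the intersection and hit my bike.",
    "I was riding in the bike lane when a car passed too close.",
]


def top_words(texts, n=5):
    """Count lowercase words across all texts, ignoring stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)


for word, count in top_words(descriptions):
    print(word, count)
```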
Scikit-Learn:
An effort was made to see whether crash type could be predicted from the crash description, that is, whether there is a relationship between the words used in the description and the crash type. scikit-learn, a popular machine learning library for Python, was used for this analysis. Six classes were taken from the crash type column:
Crash Type | Count
Door | 101
Harassment / Assault | 77
Left Hook | 156
Other | 275
Passing | 87
Right Hook | 122
The data was filtered in R and saved to txt files for further analysis. The algorithm used for the “Sentiment analysis homework” was the basis for this step. Since the code is too lengthy, I only mention some results. Below is an example of how a crash description was classified:
I have to mention that the results of the model are disappointing. The model's accuracy is below 50%. For reference, always predicting the largest class, Other (275 of 818 reports), would already reach about 34% accuracy. I had hoped to use this model to predict crash type from another data source such as Twitter. I suggest trying other classifiers, such as those in NLTK, to see if there is a relationship between the words of the crash description and the crash type.
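To make the classification idea concrete, here is a minimal Python sketch of a bag-of-words text classifier. The report does not say which scikit-learn classifier was used, so a multinomial naive Bayes with add-one smoothing is assumed here. It is written with the standard library only as a stand-in for a scikit-learn pipeline. The training sentences are hypothetical and far too few for a meaningful accuracy estimate.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesText:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word in self.vocab:  # unseen words are ignored
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples, two per class.
train_texts = [
    "driver opened car door into the bike lane",
    "parked car door swung open in front of me",
    "car turned left across my path at the intersection",
    "oncoming driver made a left turn into me",
    "driver yelled and threw a bottle at me",
    "passenger shouted threats and honked repeatedly",
]
train_labels = ["Door", "Door", "Left Hook", "Left Hook",
                "Harassment / Assault", "Harassment / Assault"]

model = NaiveBayesText().fit(train_texts, train_labels)
print(model.predict("a door opened suddenly from a parked car"))
```

With real data, accuracy should be measured on held-out descriptions and compared against the majority-class baseline.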
Proposed SQL Schema:
Since the dataset is not too big, all of the SQL queries could easily be done with Excel pivot tables or even in RStudio. Still, a SQL schema is proposed as follows:
CREATE TABLE Crash_wide (
    SupporterKEY int, City char(20), State char(20), Zip int,
    Age char(20), Attorney char(20), Citation text, CitationYesNo text,
    Compensation char(5), Description text, Location text,
    CrashType char(20), CyclistStatement char(5),
    DateTime text, FollowUp char(5), Gender char(20), Injuries text,
    Lighting char(40), NoStatement text, PoliceDept text,
    PoliceReportYesNo char(40), PoliceYesNo char(40), VehicleType char(40),
    VehicleTypeOther text, WABAMember char(40), Weather text
);
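As an example of the kind of query this schema supports, the Python sketch below counts crashes per crash type. It uses the built-in sqlite3 module as a stand-in for PostgreSQL, keeps only four of the proposed columns, and inserts hypothetical rows. The GROUP BY query itself is plain SQL that would also run on PostgreSQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Crash_wide ("
    "SupporterKEY INTEGER, CrashType TEXT, Gender TEXT, Lighting TEXT)"
)

# Hypothetical rows, only to make the query runnable.
sample_rows = [
    (1, "Door", "Female", "Daylight"),
    (2, "Left Hook", "Male", "Daylight"),
    (3, "Door", "Male", "Dark - street lights"),
    (4, "Other", "Male", "Daylight"),
]
conn.executemany("INSERT INTO Crash_wide VALUES (?, ?, ?, ?)", sample_rows)

# Number of reports per crash type, largest first.
counts_by_type = conn.execute(
    "SELECT CrashType, COUNT(*) AS n "
    "FROM Crash_wide GROUP BY CrashType ORDER BY n DESC, CrashType"
).fetchall()

for crash_type, n in counts_by_type:
    print(crash_type, n)
```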
For this part, I have already added brief comments to each graph and result above, so repeating them here would be redundant. To summarize, this dataset gives us valuable information about cyclist accidents in the DMV area, but the web page and questionnaire need a redesign. I suggest that WABA review this dataset and redesign their web page. For example, a dedicated date and time input would be a better approach for date and time, and a map interface for choosing the location would be better for location.
Regarding privacy, I noticed that some people reported the license plate or VIN of cars. I assume that releasing such data could have consequences for WABA.
I also want to mention the hot spots that I found on the map. Because of their high number of accidents, these locations should be scrutinized to find the reasons behind them, and action should be taken if needed.
Technical Terms: Terms are described in context.
Bicycle Crash Tracker. (2017, January 23). Retrieved November 10, 2018, from https://www.waba.org/advocacy/bicycle-crash-tracker/
Bloom, J. (2017). “To Die for a Lousy Bike”: Bicycles, Race, and the Regulation of Public Space on the Streets of Washington, DC, 1963–2009. American Quarterly, 69(1), 47–70. https://doi.org/10.1353/aq.2017.0003
Gardner, R. (2010, 2018). Crash Tracker Dataset. Retrieved from https://www.waba.org/advocacy/bicycle-crash-tracker/
Gardner, Robert. (2018, October 3). (Re)Introducing Crash Tracker. Retrieved November 10, 2018, from https://www.waba.org/blog/2018/10/reintroducing-crash-tracker/
Introduction to package mapsapi. (n.d.). Retrieved December 7, 2018, from https://cran.rstudio.com/web/packages/mapsapi/vignettes/intro.html
R (programming language). (2018). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=R_(programming_language)&oldid=871803730