Atabak Mardan

Big Data Analytics Project: AIT-580

WABA CrashTracker Survey

The dataset (R. Gardner, 2010) was collected by the Washington Area Bicyclist Association (WABA). WABA was founded on May 1, 1972 by Cary S. Shaw with just 9 members and a $2 annual membership fee. Its primary goal was to bring together cyclists who used bicycles for commuting rather than recreation, in order to address their concerns. From a very early date, the group lobbied for safe bike lanes, sidewalk cuts for bicycles, and better maintenance of roads for bicycle safety (Bloom, 2017).

The WABA website hosts a form designed to gather important information about bicycle crashes in the Washington area. WABA originally created this tool because data on crashes in the region was limited. Crash Tracker seeks not only to gather information about bicycle crashes but also to help WABA work with local law enforcement officials and representatives to make sure that bicyclists are treated fairly when they are involved in an incident (“Bicycle Crash Tracker,” 2017; Robert Gardner, 2018).

The dataset was originally provided in long format. There are two common layouts for storing data, long and wide. In long format, the attributes of each accident are stored in rows rather than columns. There are 22 attributes for each accident. Long format is preferred for datasets that have many attributes compared to the number of records, for example 10 records with 1,000 attributes each. Some statistical software, such as Stata, can handle both formats. The picture below illustrates the difference between long and wide format.

Since 22 is not a large number of attributes, I chose to convert the dataset to wide format with R, because most functions expect this format, which makes the data easier to understand and to use for estimation. After conversion, the list of attributes is below:

 

Software/Hardware Requirements:

R was mainly used to explore the dataset and to produce visualizations. R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing (“R (programming language),” 2018).

Exploratory Desktop was also used to produce some graphs; it is essentially a UI for the ‘dplyr’ package in R. Jupyter Notebook was used to run the Python code for sentiment analysis, and PostgreSQL was used for SQL.

Since the dataset is not large, an ordinary PC with the hardware below was enough to perform the analysis.

Dataset Exploration; Challenges and how to overcome:

Privacy: 

The dataset was provided to us by the course instructor, Dr. Foxwell, and is not open to the public. The provided data do not include any information that could be matched to the person who reported a crash. However, I found that in some “Vehicle types” entries people reported the license plate and even the VIN of the vehicle. For example:

2001 Mercedes 4S License Plate: 3AZ7681 VIN: WDBJF65J91B318386 Owner: Amber Renee Awkward 12306 Connecticut Ave Silver Spring, MD 20906 Driver: Ibrahim Samba Dean S Yansaneh 5524 Karen Elaine Drive Apt 729 New Carrollton, MD 20784.

This is a serious issue, and WABA should take action to address it.

Quality of data:

In my opinion, the quality of the collected data should be better. From the perspective of a transportation engineer, the two most important factors, time and location, are not recorded well. For example, the date and time of an accident do not follow a uniform format, which makes further investigation very hard. The reason is that the data collection web form uses a free-text input for date and time rather than a dedicated date and time input. I suggest putting effort into designing a better format for this questionnaire web page. For location, selecting the location on a map would be a better approach than a text report. At the least, for text reports, matching entries against a USPS address database would be a better approach. I also propose giving an option to report accidents anonymously.

Efforts made to overcome issues:

Data format: The first challenge was that the dataset is in long format. The transformation from the long format described earlier to wide format was done in R with the packages “tidyr” and “dplyr”.
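To make the long-to-wide idea concrete, here is a minimal sketch in plain Python. It is not the tidyr/dplyr code used for this project. The rows, attribute values, and the function name long_to_wide are hypothetical and only illustrate the reshaping.

```python
from collections import defaultdict

# Hypothetical long-format rows: (record id, attribute name, value).
long_rows = [
    (1, "Crash Type", "Door"),
    (1, "Gender", "Female"),
    (1, "Lighting", "Daylight"),
    (2, "Crash Type", "Left Hook"),
    (2, "Gender", "Male"),
]

def long_to_wide(rows):
    """Group (id, attribute, value) triples into one dict per record."""
    wide = defaultdict(dict)
    for record_id, attribute, value in rows:
        wide[record_id][attribute] = value
    # Collect every attribute seen so each record gets the same columns.
    columns = sorted({attribute for _, attribute, _ in rows})
    return [
        {"id": record_id, **{c: fields.get(c) for c in columns}}
        for record_id, fields in sorted(wide.items())
    ]

wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Each record becomes one row with one column per attribute. An attribute that a record never reported (Lighting for record 2 here) is filled with None, which plays the role of NA.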

Date and time problem: The second problem was the date and time attribute. As mentioned before, because it comes from self-reports, the values are not uniform and are not ready for any operations. In practice this column has to be treated as text rather than as date and time, so text parsing was used to extract the date and time information. Some examples of this effort are below. As the table shows, some transformations are not precise because of inconsistencies in the original data.

Date/Time | dtime
6/23/2018 at 2:15PM | 6/23/2018 14:15
12/20/2016 @ 7:30am | 12/20/2016 7:30
October 22, 2012, about 6:30 p.m. | 10/22/2012 6:30
6:15 p.m. or so, 12/6/13 | 6/15/2012 6:13
May 22nd approx 6:45 am | NA
41429.70833 | NA
October 14th, 2011, 8:40 am | 10/14/2011 8:40
3/5/2018 9am | 3/5/2020 18:09
11/22/2016 4:40pm | 11/22/2016 16:40
May 7, 2014 approximately 7:00 am | 5/7/2014 7:00
10/20/17 approx 1PM | NA
7/28/2016 6pm | 7/28/2020 16:06
June 30, 2015 5:00 p.m. | 6/30/2015 5:00
May 31, 2011, 4:45 p.m. (precisely!) | 5/31/2011 4:45
Monday September 26, 2011/approx. 2pm | 5/26/2020 23:02
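One possible way to parse such free-text values is sketched below in Python with regular expressions. It is not the parsing code used for this report, so its outputs need not match the dtime column above. The function name parse_crash_datetime, the three patterns, and the rule that two-digit years mean 20xx are all assumptions made for the sketch.

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# m/d/yyyy or m/d/yy
NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b")
# "October 22, 2012" or "October 14th, 2011"
NAMED_DATE = re.compile(r"\b([A-Za-z]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b")
# "2:15PM", "7:30am", "6:30 p.m.", "9am"
CLOCK_TIME = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\b", re.IGNORECASE)

def parse_crash_datetime(text):
    """Return a datetime parsed from free text, or None if no full date is found."""
    match = NUMERIC_DATE.search(text)
    if match:
        month, day, year = (int(g) for g in match.groups())
        if year < 100:              # assumption: two-digit years mean 20xx
            year += 2000
    else:
        match = NAMED_DATE.search(text)
        if not match or match.group(1).lower() not in MONTHS:
            return None
        month = MONTHS[match.group(1).lower()]
        day, year = int(match.group(2)), int(match.group(3))

    hour = minute = 0               # midnight when no clock time is present
    clock = CLOCK_TIME.search(text)
    if clock:
        hour = int(clock.group(1)) % 12
        minute = int(clock.group(2) or 0)
        if clock.group(3).lower() == "p":
            hour += 12
    try:
        return datetime(year, month, day, hour, minute)
    except ValueError:              # impossible dates such as month 13
        return None

for raw in ["6/23/2018 at 2:15PM", "October 22, 2012, about 6:30 p.m.",
            "May 22nd approx 6:45 am", "41429.70833"]:
    print(repr(raw), "->", parse_crash_datetime(raw))
```

Values without a year or without any recognizable date return None, which corresponds to NA. Handling am/pm explicitly matters here because several reports give the time only as "p.m.".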

Location problem: From the moment I received this data, my biggest concern was finding the exact location of accidents from the crash location descriptions. The location information consists mostly of free-form descriptions rather than formatted addresses, so initially it seemed very hard to find exact locations.

But thanks to an API provided by Google, it is possible to estimate a location from text. The Geocoding API is a service that provides geocoding and reverse geocoding of addresses. Geocoding is the process of converting addresses (like a street address) into geographic coordinates (like latitude and longitude), which you can use to place markers on a map, or position the map.

These APIs are queried through a URL combined with a personal API key that I obtained from Google. The output is an XML file. Google's documentation is mostly based on handling the results with Java. But thanks to Michael Dorman, the mapsapi package (“Introduction to package mapsapi,” n.d.) makes these queries easier to handle from R.

So in R, using a for loop and several other transformations, the location of each data entry was estimated from its crash location description with the Google API. A sample is shown in the table below.

Crash Location (original data) | address_google (from API) | latitude (from API) | longitudes (from API)
On Glover RD NW just north Broad Branch Road NW DC | 5608 Broad Branch Rd NW, Washington, DC 20015, USA | 38.965591 | -77.06869
intersection of 3rd and F Streets NW | 1204 3rd St NW, Washington, DC 20001, USA | 38.906101 | -77.015559
N Lynne Street & Lee Hwy - where Custis trail becomes Mt Vernon Trail | N Lynn St, Arlington, VA 22209, USA | 38.896975 | -77.070852
18th St, NW , washington DC just above N street | 18th St NW, Washington, DC, USA | 38.920141 | -77.041838
I was bicycling east on Arundel Street in Mount Rainier between 29th and 30th streets. I was about 30 feet from the intersection with 29th Street. | 29th St, Mt Rainier, MD 20712, USA | 38.943254 | -76.967631
U Street NW Eastbound | 1115A U St NW, Washington, DC 20009, USA | 38.917139 | -77.02762
Bladensburg Road and New York Avenue NE | Bladensburg Rd NE & New York Ave NE, Washington, DC 20002, USA | 38.917411 | -76.972284
N/O Memorial circle-- in crosswalk | Arlington, VA 22203, USA | 38.871669 | -77.116875
11 th Street NW and Vermont | 11th St NW & Vermont Ave NW, Washington, DC 20009, USA | 38.913804 | -77.02704
Newton Street NE, just west of South Dakota | Newton St NE, Washington, DC, USA | 38.934095 | -76.977672
Westbouund on Roosevelt Bridge | Theodore Roosevelt Bridge, Washington, DC 20566, USA | 38.892286 | -77.059779
Alexandria VA intersection of Royal and Gibbon Street | Gibbon St, Alexandria, VA 22314, USA | 38.799367 | -77.048181
intersection 16th St, NW and Florida Avenue, NW | Florida Ave NW & 16th St NW, Washington, DC 20009, USA | 38.919143 | -77.036494
Columbia Pike headed eastbound just west of Walter Reed (in front of Lost Dog) | 2920 Columbia Pike, Arlington, VA 22204, USA | 38.862271 | -77.087594
Logan Circle | Logan Circle, Washington, DC, USA | 38.909641 | -77.029637
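The geocoding step described above can be sketched in Python as follows. This is not the R/mapsapi code used for this project. The sketch uses the JSON output of the Geocoding API instead of XML, and it makes no network call. The helper names are hypothetical, "YOUR_API_KEY" is a placeholder, and the sample response is hand-written in the shape of an API reply using the Logan Circle row.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(description, api_key):
    """Build the request URL for one free-text location description."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": description, "key": api_key})

def extract_first_result(response_text):
    """Return (formatted_address, latitude, longitude), or None if nothing matched."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    top = payload["results"][0]
    location = top["geometry"]["location"]
    return top["formatted_address"], location["lat"], location["lng"]

# Hand-written sample in the shape of a Geocoding API JSON response,
# built from the Logan Circle row of the table; not a recorded API reply.
sample_response = json.dumps({
    "status": "OK",
    "results": [{
        "formatted_address": "Logan Circle, Washington, DC, USA",
        "geometry": {"location": {"lat": 38.909641, "lng": -77.029637}},
    }],
})

print(build_geocode_url("Logan Circle", "YOUR_API_KEY"))
print(extract_first_result(sample_response))
```

In a full run, each crash location description would be sent to the URL built by build_geocode_url, and the first result kept. Descriptions that return no result should be recorded as missing instead of being guessed.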

Dataset Visualizations:

Heat map and crash locations: The first question that comes to mind is where most accidents happened. Using the longitude and latitude obtained from the Google Geocoding API, a map of accident locations and a heat map were produced. A heat map is a data visualization tool that uses color the way a bar graph uses height and width. Dynamic versions of the heat map and the crash location points are available at the web addresses below, where zooming in and out shows more detail.

http://mason.gmu.edu/~amardan/heatmap.html

http://mason.gmu.edu/~amardan/points/point.html
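The aggregation behind a heat map can be illustrated with a small Python sketch that counts crashes per grid cell. This is only an illustration of the idea, not how the maps linked above were produced. The coordinates are copied from the geocoded sample table in this report, while the function name grid_counts and the choice of cells two decimal places wide (roughly one kilometre in latitude) are assumptions.

```python
from collections import Counter

# (latitude, longitude) pairs copied from the geocoded sample table in this report.
points = [
    (38.917139, -77.02762),    # U Street NW Eastbound
    (38.913804, -77.02704),    # 11 th Street NW and Vermont
    (38.909641, -77.029637),   # Logan Circle
    (38.919143, -77.036494),   # 16th St NW and Florida Avenue NW
    (38.799367, -77.048181),   # Royal and Gibbon Street, Alexandria
]

def grid_counts(coords, decimals=2):
    """Count points per grid cell by rounding coordinates to `decimals` places."""
    return Counter((round(lat, decimals), round(lon, decimals)) for lat, lon in coords)

cell_counts = grid_counts(points)
for cell, count in cell_counts.most_common():
    print(cell, count)
```

Cells with higher counts would be drawn in a warmer color. A smaller cell size gives a more detailed map but needs more data per cell to show stable hot spots.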

Time analysis: Based on the dates and times extracted in the step described earlier, some visualizations were produced.

The picture below is a time heat map. The color indicates the number of accidents that happened at that specific hour and minute.

Since the graph above can be confusing, the time interval was coarsened to one hour. 8 am and 5 pm appear to be the peak hours for accidents. This is not a surprise, because these hours correspond to peak traffic hours for cars, also known as rush hours.
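The hourly binning described above can be sketched in Python as follows. The five timestamps are taken from the dtime column of the sample table earlier in this report, so the counts are illustrative only and say nothing about the full dataset. The function name crashes_per_hour is hypothetical.

```python
from collections import Counter
from datetime import datetime

# Five parsed timestamps from the dtime sample table; not the full dataset.
crash_times = [
    datetime(2018, 6, 23, 14, 15),
    datetime(2016, 12, 20, 7, 30),
    datetime(2011, 10, 14, 8, 40),
    datetime(2016, 11, 22, 16, 40),
    datetime(2014, 5, 7, 7, 0),
]

def crashes_per_hour(timestamps):
    """Return a list of 24 counts, one per hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

hourly = crashes_per_hour(crash_times)
for hour, count in enumerate(hourly):
    if count:
        print(f"{hour:02d}:00  {'#' * count}")
```

Dropping the minutes and keeping only the hour is what turns the sparse hour-and-minute heat map into a readable histogram.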

A map of accident locations by lighting condition is below. As shown, there are a few clusters of accidents that happened at night with some street lighting.

When crash type is considered, no discernible pattern can be found in the map below. Harassment happened everywhere.

Many other maps could be produced based on other variables.

For the age and gender breakdown, some graphs were produced. According to the first one, males appear in more reports than females.

The 26-35 age group has the highest number of accident reports. This does not necessarily mean that accidents happen more to young people or that there are more cyclists in this age category. In datasets collected through the internet there is always an age-related bias, because older people are less likely to use the internet than younger people.

Although males have the dominant number of reports across age categories, young women between 18 and 25 seem to have a higher number of reports. Such a shift in the pattern needs more attention.

Crash type is one of the categorical variables that is interesting for further analysis:

The high number of “Other” crash types indicates that the data collection web page is not well designed. I suggest that WABA review the crash descriptions in this category and revise the available crash type options accordingly.

 

One interesting finding from the graph above is that, despite the notion that women are more exposed to harassment, men have a higher number of reports in this category. Another interesting point is that men and women are equal in door accidents.

 

Sentiment analysis:

For illustration, I used a text mining method in R to find the most frequent words in the crash descriptions. The result is below.
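The idea of counting the most frequent words can be sketched in a few lines of Python. This is not the R text mining code used for the report. The three descriptions and the stop-word list are hypothetical and written only for this sketch.

```python
import re
from collections import Counter

# Small hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "was", "on", "in", "of",
              "to", "me", "my", "at", "with"}

# Hypothetical crash descriptions, written for this sketch only.
descriptions = [
    "A driver opened the door of a parked car and I hit the door.",
    "The car turned left in front of me at the intersection.",
    "A driver passed too close and the mirror hit my arm.",
]

def top_words(texts, n=5):
    """Return the n most frequent non-stop-words across all texts."""
    words = []
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        words += [w for w in tokens if w not in STOP_WORDS]
    return Counter(words).most_common(n)

print(top_words(descriptions, 4))
```

Removing stop words first matters, because otherwise words such as "the" and "a" dominate the counts and hide the crash-related vocabulary.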

 

Scikit-Learn:

An attempt was made to see whether crash type could be predicted from the crash description, that is, whether there is a relationship between the words used in a description and the crash type. scikit-learn, a popular machine learning library for Python, was used for this analysis. Six classes were taken from the crash type column:

Crash Type | Count
Door | 101
Harassment / Assault | 77
Left Hook | 156
Other | 275
Passing | 87
Right Hook | 122

 

The data were filtered in R and saved to txt files for further analysis. The algorithm used for the “Sentiment analyses homework” was the basis for this step. Since the code is too lengthy, I mention only some results. Below is an example of how a crash description could be classified:

The results of the model are disappointing: its accuracy is below 50%. I had hoped to use this model to predict crash type from another data source such as Twitter. I suggest trying other classifiers, for example from NLTK, to see whether there is a relationship between the words of a crash description and the crash type.
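To show what predicting a crash type from description words can look like, here is a minimal sketch of a multinomial naive Bayes classifier in plain Python. It is not the scikit-learn code used for this report, and naive Bayes is an assumption, since the report does not state which classifier was used. The class name NaiveBayesText and the six training sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesText:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)      # class prior
            for word in tokenize(text):
                if word in self.vocabulary:               # skip unseen words
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training sentences, written for this sketch only.
train_texts = [
    "driver opened the car door into the bike lane",
    "passenger opened a door and I hit it",
    "car turned left across my path at the intersection",
    "oncoming car turned left in front of me",
    "car passed me and turned right into me",
    "truck turned right across the bike lane",
]
train_labels = ["Door", "Door", "Left Hook", "Left Hook", "Right Hook", "Right Hook"]

model = NaiveBayesText().fit(train_texts, train_labels)
print(model.predict("someone opened a car door and I hit the door"))
print(model.predict("the truck turned right into the bike lane"))
```

With real descriptions, a large "Other" class and overlapping vocabulary between Left Hook, Right Hook and Passing would make the classes hard to separate, which may be one reason for the low accuracy reported above.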

Proposed SQL Schema:

Since the dataset is not too big, all of the SQL queries could easily be done with Excel pivot tables or even in RStudio. Still, a SQL schema is proposed as follows:

CREATE TABLE Crash_wide (
    SupporterKEY int, City char(20), State char(20), Zip int,
    Age char(20), Attorney char(20), Citation text, CitationYesNo text,
    Compensation char(5), Description text, Location text,
    CrashType char(20), CyclistStatement char(5),
    DateTime text, FollowUp char(5), Gender char(20), Injuries text,
    Lighting char(40), NoStatement text, PoliceDept text,
    PoliceReportYesNo char(40), PoliceYesNo char(40), VehicleType char(40),
    VehicleTypeOther text, WABAMember char(40), Weather text
);
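As an example of the kind of query this schema supports, the sketch below counts crashes by type. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, keeps only four of the columns, and inserts four hypothetical rows, so the output does not reflect the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for PostgreSQL here
conn.execute("""
    CREATE TABLE Crash_wide (
        SupporterKEY INTEGER, CrashType TEXT, Gender TEXT, Lighting TEXT
    )
""")

# Hypothetical rows, written for this sketch only.
rows = [
    (1, "Door", "Female", "Daylight"),
    (2, "Left Hook", "Male", "Daylight"),
    (3, "Door", "Male", "Dark"),
    (4, "Other", "Male", "Daylight"),
]
conn.executemany("INSERT INTO Crash_wide VALUES (?, ?, ?, ?)", rows)

# Count crashes per type, most frequent first.
crash_type_counts = conn.execute("""
    SELECT CrashType, COUNT(*) AS n
    FROM Crash_wide
    GROUP BY CrashType
    ORDER BY n DESC, CrashType
""").fetchall()
print(crash_type_counts)
```

The same GROUP BY query should run unchanged in PostgreSQL against the full Crash_wide table.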

 

Conclusion:

In the sections above I briefly commented on each graph and result, so anything repeated here would be redundant. To summarize, this dataset gives valuable information about cyclist accidents in the DMV area, but the web page and questionnaire need a redesign. I suggest that WABA review this dataset and redesign their web page. For example, a dedicated date and time input would be a better approach for date and time, and a map interface for choosing the location would be better for location.

Regarding privacy, I noticed that some people reported the license plate or VIN of cars. I assume releasing such data could have consequences for WABA.

I also want to mention the hot spots that I found on the map. Because of their high number of accidents, these locations should be scrutinized to find the underlying reasons, and action should be taken if needed.

Technical Terms: Terms are described in context

References:

Bicycle Crash Tracker. (2017, January 23). Retrieved November 10, 2018, from https://www.waba.org/advocacy/bicycle-crash-tracker/

Bloom, J. (2017). “To Die for a Lousy Bike”: Bicycles, Race, and the Regulation of Public Space on the Streets of Washington, DC, 1963–2009. American Quarterly, 69(1), 47–70. https://doi.org/10.1353/aq.2017.0003

Gardner, R. (2010, 2018). Crash Tracker Dataset. Retrieved from https://www.waba.org/advocacy/bicycle-crash-tracker/

Gardner, Robert. (2018, October 3). (Re)Introducing Crash Tracker. Retrieved November 10, 2018, from https://www.waba.org/blog/2018/10/reintroducing-crash-tracker/

Introduction to package mapsapi. (n.d.). Retrieved December 7, 2018, from https://cran.rstudio.com/web/packages/mapsapi/vignettes/intro.html

R (programming language). (2018). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=R_(programming_language)&oldid=871803730