April 7, 2023 05:00 pm GMT

Crash City

Analyzing NYC Traffic Collision Data on the Linux Command Line with SoQL and Curl


New York City and the NYPD publish a dataset of traffic collisions, with related fatalities, injuries and other details. First published in July 2012, it is still updated regularly.

NYC commuters, especially pedestrians and cyclists, have to endure many hazards just to get to and from work. Crime is one issue, but it's not as treacherous as crossing Queens Boulevard during rush hour, or cycling in downtown Manhattan when some clown driving an SUV the size of a Sherman tank is taking up half the road.

Previously I did some analysis using their downloadable CSV dataset.

Here I'm going to use the curl utility along with the Socrata Open Data API (SODA) and its SoQL query language, to try to make some sense of this published data.

Using the Socrata Query Language, SoQL

Dataset App Token
Socrata functions

How many collisions since July 2012

> curl --get --data-urlencode "\$\$app_token=uvwxyz" \
       --data-urlencode "\$select=count(*)" \
       https://data.cityofnewyork.us/resource/h9gi-nx95.json
[{"count":"1977803"}]

Explanation

curl --get or -G
Use the GET verb, as we are getting data
-v
Lots of verbose output, as a later example will show
-d
Request data to pass to the API in ASCII format
--data-urlencode
URL-encode the data. Safer than just using -d
$$app_token
The user's personal authorization token. Not strictly necessary for ad-hoc requests
Socrata open data API [App-Tokens](https://dev.socrata.com/docs/app-tokens.html)
I replaced my actual token with uvwxyz for fairly obvious reasons
"$select=count(*)"
Similar to the SQL SELECT and the SQL count aggregate function
SoQL $select
SoQL count
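As a side note, --data-urlencode percent-encodes the value part of each name=value pair before curl appends it to the query string. Here is a rough offline sketch of that encoding, using jq's @uri filter in place of curl's internal encoder (exact byte-level choices can differ between encoders; the clause below is just an example value):

```shell
# Sketch: mimic what --data-urlencode "name=value" does to the value part,
# using jq's @uri percent-encoding (curl does its own equivalent internally).
clause='$where=crash_date between "2023-02-01" and "2023-02-28"'
name=${clause%%=*}     # the part before the first '=' is left alone
value=${clause#*=}     # the part after it gets encoded
encoded="$name=$(jq -rn --arg v "$value" '$v | @uri')"
echo "$encoded"
# prints: $where=crash_date%20between%20%222023-02-01%22%20and%20%222023-02-28%22
```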

Observation

1,977,803 collisions from July 2012 to March 2023 seems like a lot to me. You'd wonder what's the point of driving tests if we still end up with this many collisions.

Getting all the collision records between two arbitrary dates, June 30th 2022 to December 31 2022.

This time I'll use the -v switch for curl to get much more verbose output.

> curl --get -v --data-urlencode "\$\$app_token=xyz" \
       --data-urlencode "\$select=*" \
       --data-urlencode "\$where=crash_date between '2022-06-30T00:00:00.000' and '2022-12-31T00:00:00.000'" \
       https://data.cityofnewyork.us/resource/h9gi-nx95.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
*   Trying 52.206.68.26:443...
* Connected to data.cityofnewyork.us (52.206.68.26) port 443 (#0)
...
> GET /resource/h9gi-nx95.json?$$app_token=xyz&$select=%2A&$where=crash_date+between+%272022-06-30T00%3A00%3A00.000%27+and+%272022-12-31T00%3A00%3A00.000%27 HTTP/1.1
> Host: data.cityofnewyork.us
> User-Agent: curl/7.81.0
...
< HTTP/1.1 200 OK
...
< X-SODA2-Fields: ["crash_date","crash_time","borough","zip_code","latitude","longitude","location","on_street_name","off_street_name","cross_street_name","number_of_persons_injured","number_of_persons_killed","number_of_pedestrians_injured","number_of_pedestrians_killed","number_of_cyclist_injured","number_of_cyclist_killed","number_of_motorist_injured","number_of_motorist_killed","contributing_factor_vehicle_1","contributing_factor_vehicle_2","contributing_factor_vehicle_3","contributing_factor_vehicle_4","contributing_factor_vehicle_5","collision_id","vehicle_type_code1","vehicle_type_code2","vehicle_type_code_3","vehicle_type_code_4","vehicle_type_code_5"]
< X-SODA2-Types: ["floating_timestamp","text","text","text","number","number","location","text","text","text","number","number","number","number","number","number","number","number","text","text","text","text","text","number","text","text","text","text","text"]
< X-SODA2-Data-Out-Of-Date: false
...
{ [14733 bytes data]
...
[{"crash_date":"2022-06-30T00:00:00.000","crash_time":"14:01","cross_street_name":"101 EAST DRIVE","number_of_persons_injured":"1","number_of_persons_killed":"0","number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0","number_of_cyclist_injured":"1","number_of_cyclist_killed":"0","number_of_motorist_injured":"0","number_of_motorist_killed":"0","contributing_factor_vehicle_1":"Pedestrian/Bicyclist/Other Pedestrian Error/Confusion","collision_id":"4542318","vehicle_type_code1":"Bike"}
...
,{"crash_date":"2022-07-03T00:00:00.000","crash_time":"22:30","borough":"BRONX","zip_code":"10458","latitude":"40.866802","longitude":"-73.88444","location":{"latitude":"40.866802","longitude":"-73.88444","human_address":"{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}"},"on_street_name":"WEBSTER AVENUE","off_street_name":"EAST 199 STREET","number_of_persons_injured":"0","number_of_persons_killed":"0","number_of_pedestrians_injured":"0","number_of_pedestrians_killed":"0","number_of_cyclist_injured":"0","number_of_cyclist_killed":"0","number_of_motorist_injured":"0","number_of_motorist_killed":"0","contributing_factor_vehicle_1":"Driver Inattention/Distraction","contributing_factor_vehicle_2":"Unspecified","collision_id":"4543075","vehicle_type_code1":"Station Wagon/Sport Utility Vehicle","vehicle_type_code2":"Station Wagon/Sport Utility Vehicle"}]

Explanation

1000 records
When no $limit is set, this is the default maximum rows returned
curl --get or -G
Use the GET verb as we are getting data
-v
Lots of verbose output as you can see
-d
Request data to pass to the API in ASCII format
--data-urlencode
URL-Encode the data. Safer than just using -d

$$app_token
The user's personal authorization token. Not strictly necessary for ad-hoc requests
"$select=*"
Similar to an SQL SELECT
Selecting all columns. This is the default and can be omitted
SoQL $select
$where
Similar to SQL WHERE, to filter down the data
SoQL $where
between ... and
SoQL between
Narrows our results down to collisions between the two inclusive crash_date values

Observation

It turns out, after piping this request to a wc command, that the API only returns 1000 rows, which is the default maximum when the $limit clause isn't specified. With the $limit clause, the maximum that can be returned in one call is 50,000 rows. To get more, you need to order and page through the data.
One other thing to note here is that when using the -v verbose switch, you get to see the column names and their data types.
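The paging pattern above can be sketched as a loop over $offset values with a stable $order (ordering by collision_id is my assumption of a unique column here). This dry-run version only prints the curl commands it would run, using the count(*) from the first query:

```shell
# Dry-run paging sketch: print one curl invocation per 50,000-row page.
# Nothing is fetched; drop the echo to actually run the requests.
limit=50000
total=1977803                        # count(*) from the first query
pages=$(( (total + limit - 1) / limit ))   # ceiling division
i=0
while [ "$i" -lt "$pages" ]; do
  offset=$(( i * limit ))
  echo curl --get --data-urlencode '$order=collision_id' \
            --data-urlencode "\$limit=$limit" \
            --data-urlencode "\$offset=$offset" \
            https://data.cityofnewyork.us/resource/h9gi-nx95.json
  i=$(( i + 1 ))
done
```

With 1,977,803 rows and a 50,000-row page size this prints 40 commands, the last one starting at offset 1,950,000.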

The NYC dataset column names
"crash_date","crash_time","borough","zip_code","latitude","longitude","location","on_street_name","off_street_name","cross_street_name","number_of_persons_injured","number_of_persons_killed","number_of_pedestrians_injured","number_of_pedestrians_killed","number_of_cyclist_injured","number_of_cyclist_killed","number_of_motorist_injured","number_of_motorist_killed","contributing_factor_vehicle_1","contributing_factor_vehicle_2","contributing_factor_vehicle_3","contributing_factor_vehicle_4","contributing_factor_vehicle_5","collision_id","vehicle_type_code1","vehicle_type_code2","vehicle_type_code_3","vehicle_type_code_4","vehicle_type_code_5"

Corresponding Field Data Types

"floating_timestamp","text","text","text","number","number","location","text","text","text","number","number","number","number","number","number","number","number","text","text","text","text","text","number","text","text","text","text","text"

SoQL Query Clauses from the Docs

| Parameter | Description | Default | In $query |
| --- | --- | --- | --- |
| $select | The set of columns to be returned, similar to a SELECT in SQL | All columns, equivalent to $select=* | SELECT |
| $where | Filters the rows to be returned, similar to WHERE | No filter | WHERE |
| $order | Column to order results on, similar to ORDER BY in SQL | Unspecified order | ORDER BY |
| $group | Column to group results on, similar to GROUP BY in SQL | No grouping | GROUP BY |
| $having | Filters the rows that result from an aggregation, similar to HAVING | No filter | HAVING |
| $limit | Maximum number of results to return | 1000 (2.0 endpoints: maximum of 50,000; 2.1: unlimited) | LIMIT |
| $offset | Offset count into the results to start at, used for paging | 0 | OFFSET |
| $q | Performs a full text search for a value | No search | N/A |
| $query | A full SoQL query string, all as one parameter | N/A | N/A |
| $$bom | Prepends a UTF-8 Byte Order Mark to the beginning of CSV output | false | N/A |
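As the last rows of the table suggest, the individual clauses can also be packed into a single $query parameter holding one SoQL statement. A hypothetical example of assembling such a request (printed rather than executed, and the SELECT list here is illustrative only):

```shell
# Sketch: the separate $select/$group/$order/$limit clauses expressed as one
# SoQL statement sent in the single $query parameter. Printed, not executed.
soql='SELECT zip_code, count(*) AS collision_count GROUP BY zip_code ORDER BY collision_count DESC LIMIT 10'
echo curl --get --data-urlencode "\$query=$soql" \
     https://data.cityofnewyork.us/resource/h9gi-nx95.json
```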

Get all the collisions for zip code 10036, Times Square NYC, for Feb 2023

Save it into the file collisions_z10036_feb_2023.json

> curl --get --data-urlencode "\$\$app_token=uvwxyz" \
       --data-urlencode "\$select=*" \
       --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" \
       --data-urlencode "zip_code=10036" \
       https://data.cityofnewyork.us/resource/h9gi-nx95.json > collisions_z10036_feb_2023.json

Explanation

"\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" --data-urlencode "zip_code=10036"

Specify dates from February 1st to the 28th, inclusive.
zip_code=10036 narrows down our results.

Count how many collisions using the Linux wc command with our newly created file, collisions_z10036_feb_2023.json.

> wc -l collisions_z10036_feb_2023.json
25 collisions_z10036_feb_2023.json

Double check that count of 25 collisions, using the SoQL count(*) function.

> curl --get --data-urlencode "\$\$app_token=uvwxyz" \
       --data-urlencode "\$select=count(*)" \
       --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" \
       --data-urlencode "zip_code=10036" \
       https://data.cityofnewyork.us/resource/h9gi-nx95.json
[{"count":"25"}]

Explanation

\$select=count(*)
Similar to the SQL count function, this uses the SoQL count function to count the number of rows that match our search criteria.
[{"count":"25"}], which matches the number of records in the collisions_z10036_feb_2023.json file

Observation

25 collisions in one midtown zip code for February is almost one collision a day. I'm sure that's lower than in many other zip codes.

Get the 10 worst zip codes for collisions in February 2023

> curl --get --silent --data-urlencode "\$\$app_token=uvwxyz" \
       --data-urlencode "\$select=count(*), zip_code" \
       --data-urlencode "\$where=crash_date between '2023-02-01T00:00:00.000' and '2023-02-28T00:00:00.000'" \
       --data-urlencode '$group=zip_code' \
       https://data.cityofnewyork.us/resource/h9gi-nx95.json \
  | jq -r '.[] | .zip_code + " " + .count' | sort -k 2,2nr -k 1n | head -n10
11207 105
11212 85
11208 79
11226 75
11234 72
11236 72
11101 71
11203 67
11368 67
11211 62

Explanation

OK, I threw in a lot of commands here.
"\$select=count(*), zip_code"
Selecting the count and zip_code
SoQL count function to count the number of rows that match our search criteria.
$group=zip_code
Similar to the SQL GROUP BY
Returns aggregate rows grouped by the zip_code
jq -r '.[] | .zip_code + " " + .count'
Using the very useful jq to do additional filtering
jq bills itself as "a lightweight and flexible command-line JSON processor"
I extract the zip_code and collision count for each zip code and concatenate them using jq's + string-concatenation operator
sort -k 2,2nr -k 1n
Using the sort command, we do a reverse numerical sort on the second field, which is the count. We also do a numerical sort on the zip_code for zip codes with identical collision counts
head -n10
This gets the first 10, which are the 10 zip codes with the most collisions, starting with the very worst.
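To see what each stage of that pipeline contributes, here is the same jq | sort | head chain run offline on a tiny inline sample (the zip codes and counts below are made up for illustration):

```shell
# Offline rerun of the post-processing stages on fabricated sample data:
# jq emits one "zip count" line per record, sort orders by count descending
# (ties broken by zip code ascending), head keeps at most the 10 worst.
json='[{"zip_code":"11207","count":"105"},
       {"zip_code":"10001","count":"12"},
       {"zip_code":"11212","count":"85"}]'
echo "$json" | jq -r '.[] | .zip_code + " " + .count' | sort -k 2,2nr -k 1n | head -n10
# prints:
# 11207 105
# 11212 85
# 10001 12
```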

Observation

I could have used SoQL $order and $limit to do some of this work, but I chose the command-line sort, just because ...
Zip code 11207, East New York, Brooklyn, emerges as the zip with the most collisions in February.
This zip has a lot of issues with traffic safety, as you can also check here.
105 collisions in one month, or 3.75 a day. There's something seriously wrong there. You'd probably need some kind of armor suit just to cross the street.

As the queries get more complex, these one line commands start to get long and hard to manage.

Curl has an option to create a config file. On a Linux system the default config is usually ~/.curlrc. You can specify a config file with the -K or --config switch.

I created the config file below for these requests.
It sets the NYC API URL, the $$app_token parameter and a GET request, as well as asking for verbose output.

##### The ./.nyc_curlrc file contents
# --- NYC Collision Data ---
get
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
data-urlencode = "\$\$app_token=uvwxyz"
verbose

The previous example can now be rewritten to use the .nyc_curlrc config file. I also broke the command into separate lines using the bash continuation character \. Enclosing some of the arguments in single quotes also means that the $ doesn't need to be escaped.

> curl -K ./.nyc_curlrc \
       --data-urlencode '$select=count(*), zip_code' \
       --data-urlencode '$where=crash_date between "2023-02-01T00:00:00.000" and "2023-02-28T00:00:00.000"' \
       --data-urlencode '$group=zip_code' \
  | jq -r '.[] | .zip_code + " " + .count' | sort -k 2,2nr -k 1n | head -n10

This is a little more concise than the previous version, and yields the same result.

Now to find how many cyclists and pedestrians were killed over the duration of this dataset

> curl -K ./.nyc_curlrc \
       --data-urlencode "\$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed, SUM(number_of_cyclist_killed) AS tot_cyclist_killed" \
       --data-urlencode "\$group=year" \
       --data-urlencode "\$order=tot_pedestrians_killed DESC" \
  | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   885    0   885    0     0   1616      0 --:--:-- --:--:-- --:--:--  1614
[
  {
    "year": "2013",
    "tot_pedestrians_killed": "176",
    "tot_cyclist_killed": "11"
  },
  {
    "year": "2016",
    "tot_pedestrians_killed": "149",
    "tot_cyclist_killed": "18"
  },
  {
    "year": "2014",
    "tot_pedestrians_killed": "133",
    "tot_cyclist_killed": "20"
  },
  {
    "year": "2015",
    "tot_pedestrians_killed": "133",
    "tot_cyclist_killed": "15"
  },
  {
    "year": "2022",
    "tot_pedestrians_killed": "132",
    "tot_cyclist_killed": "18"
  },
  {
    "year": "2019",
    "tot_pedestrians_killed": "131",
    "tot_cyclist_killed": "31"
  },
  {
    "year": "2021",
    "tot_pedestrians_killed": "129",
    "tot_cyclist_killed": "19"
  },
  {
    "year": "2017",
    "tot_pedestrians_killed": "127",
    "tot_cyclist_killed": "27"
  },
  {
    "year": "2018",
    "tot_pedestrians_killed": "123",
    "tot_cyclist_killed": "10"
  },
  {
    "year": "2020",
    "tot_pedestrians_killed": "101",
    "tot_cyclist_killed": "29"
  },
  {
    "year": "2012",
    "tot_pedestrians_killed": "72",
    "tot_cyclist_killed": "6"
  },
  {
    "year": "2023",
    "tot_pedestrians_killed": "18",
    "tot_cyclist_killed": "8"
  }
]

Explanation

date_extract_y(crash_date) AS year
Will extract 2023 from 2023-02-03T00:00:00.000
SoQL date_extract_y
SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed
SUM
Similar to SQL SUM aggregate function.
AS
Give these aggregate results a meaningful label
$group=year and $order=tot_pedestrians_killed DESC
Similar to the SQL GROUP BY and ORDER BY
Returns aggregate rows grouped by the year the collisions occurred
Sorted with the year with the most pedestrian fatalities first
jq .
This is the most basic jq command
It just prints the JSON output in its default pretty format
We could have added --silent to the curl command or config file, to not print the curl download statistics.
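For intuition, date_extract_y just pulls the year out of a floating_timestamp value on the server side. Its effect is roughly equivalent to slicing the year off the timestamp string locally, as this small sketch shows:

```shell
# Local illustration of what SoQL's date_extract_y does server-side:
# take the year portion of a floating_timestamp string.
ts='2023-02-03T00:00:00.000'
year=${ts%%-*}          # everything before the first '-'
echo "$year"            # prints: 2023
```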

Observation

2012 and the current year, 2023, can be omitted, as both years have incomplete data.

Run the previous query minus years 2012 and 2023

> curl -K ./.nyc_curlrc \
       --data-urlencode '$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_killed) AS tot_pedestrians_killed, SUM(number_of_cyclist_killed) AS tot_cyclists_killed' \
       --data-urlencode '$where=year not in ("2012", "2023")' \
       --data-urlencode '$group=year' \
       --data-urlencode '$order=tot_pedestrians_killed DESC, tot_cyclists_killed'
[{"year":"2013","tot_pedestrians_killed":"176","tot_cyclists_killed":"11"},
 {"year":"2016","tot_pedestrians_killed":"149","tot_cyclists_killed":"18"},
 {"year":"2015","tot_pedestrians_killed":"133","tot_cyclists_killed":"15"},
 {"year":"2014","tot_pedestrians_killed":"133","tot_cyclists_killed":"20"},
 {"year":"2022","tot_pedestrians_killed":"132","tot_cyclists_killed":"18"},
 {"year":"2019","tot_pedestrians_killed":"131","tot_cyclists_killed":"31"},
 {"year":"2021","tot_pedestrians_killed":"129","tot_cyclists_killed":"19"},
 {"year":"2017","tot_pedestrians_killed":"127","tot_cyclists_killed":"27"},
 {"year":"2018","tot_pedestrians_killed":"123","tot_cyclists_killed":"10"},
 {"year":"2020","tot_pedestrians_killed":"101","tot_cyclists_killed":"29"}]

Explanation

$where=year not in ("2012", "2023")
Added a WHERE clause to omit years 2012 and 2023 from the query
SoQL not in (...)

Observation

Well, it's not that safe being a pedestrian or cyclist in New York City. Checking the injury counts would yield much higher numbers.

Run a query to get a yearly total of injured pedestrians and cyclists

Our query string was getting a little out of hand and difficult to manage, so I created a dedicated config file, .nyc_ped_cyc_injured_yearly_curlrc, for the next request.

The Config

> cat .nyc_ped_cyc_injured_yearly_curlrc
# --- NYC Collision Data - Injured List ---
get
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
data-urlencode = "\$\$app_token=uvwxyz"
data-urlencode = "\$select=date_extract_y(crash_date) AS year, SUM(number_of_pedestrians_injured) AS tot_pedestrians_injured, SUM(number_of_cyclist_injured) AS tot_cyclists_injured"
data-urlencode = "\$where=year not in ('2012','2023')"
data-urlencode = "\$group=year"
data-urlencode = "\$order=tot_pedestrians_injured DESC, tot_cyclists_injured DESC"
silent

The Query using the config file

> curl --config ./.nyc_ped_cyc_injured_yearly_curlrc
[{"year":"2013","tot_pedestrians_injured":"11988","tot_cyclists_injured":"4075"},
 {"year":"2017","tot_pedestrians_injured":"11151","tot_cyclists_injured":"4889"},
 {"year":"2018","tot_pedestrians_injured":"11123","tot_cyclists_injured":"4725"},
 {"year":"2016","tot_pedestrians_injured":"11090","tot_cyclists_injured":"4975"},
 {"year":"2014","tot_pedestrians_injured":"11036","tot_cyclists_injured":"4000"},
 {"year":"2019","tot_pedestrians_injured":"10568","tot_cyclists_injured":"4986"},
 {"year":"2015","tot_pedestrians_injured":"10084","tot_cyclists_injured":"4281"},
 {"year":"2022","tot_pedestrians_injured":"8963","tot_cyclists_injured":"5025"},
 {"year":"2021","tot_pedestrians_injured":"7503","tot_cyclists_injured":"4961"},
 {"year":"2020","tot_pedestrians_injured":"6691","tot_cyclists_injured":"5576"}]

Observation

Looks like the config file worked as expected. While the number of pedestrians injured is declining a little, the number of cyclists injured is going in the opposite direction.

Using jq to do additional filtering

Similar to the previous query, extract the yearly totals of injured cyclists. This time we'll use jq to filter the output.

> curl --config ./.nyc_ped_cyc_injured_yearly_curlrc \
  | jq -r '.[] | .year + "," + .tot_cyclists_injured' | sort -k 1n \
  | column -t -s, --table-columns=Year,CyclistsInjured
Year  CyclistsInjured
2013  4075
2014  4000
2015  4281
2016  4975
2017  4889
2018  4725
2019  4986
2020  5576
2021  4961
2022  5025

Explanation

This is similar to the previous query except I used jq to extract the injured cyclist data only from the returned results.

sort -k 1n
Sort the Year numerically
column -t -s, --table-columns=Year,CyclistsInjured
Add column headers for readability
The jq command already created comma separated results

Observation

2020 and 2022 were the worst years for cyclist injuries. 2020 was a year when cycling became more popular. Injuries dropped a little in 2021, maybe because cyclists got a little scared after the carnage of 2020. The upward trend may be returning, based on the 2022 results.

Get the 10 worst zip codes for collisions in January 2023

Previously I got the 10 worst zip codes for collisions in February 2023 and used some bash commands to fine-tune the results. Here I will use SoQL to do most of the heavy lifting.

Config file .nyc_jan_coll_curlrc

> cat .nyc_jan_coll_curlrc
# --- NYC Collision Data - January Collisions ---
get
url = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"
data-urlencode = "\$\$app_token=uvwxyz"
data-urlencode = "\$select=zip_code,count(zip_code) AS collision_count"
data-urlencode = "\$where=crash_date between '2023-01-01' AND '2023-01-31'"
data-urlencode = "\$group=zip_code"
data-urlencode = "\$order=collision_count DESC, zip_code"
data-urlencode = "\$limit=10"
silent
> curl --config ./.nyc_jan_coll_curlrc \
  | jq -r '.[] | .zip_code + ", " + .collision_count' \
  | column -t -s, --table-columns=ZipCode,CollisionCount
ZipCode  CollisionCount
11207    124
11236    83
11208    82
11212    77
11203    69
11385    67
11234    66
11206    64
10002    63
11101    61

Explanation

Most of this is similar to our earlier request for the February stats. This time we are using a new config file.

Instead of sorting the results using the bash sort, we sort using the SoQL [$order](https://dev.socrata.com/docs/queries/order.html). We get the 10 worst using the $limit clause.
\$order=collision_count DESC, zip_code
Sort the collision count from worst to least worst
zip_code gets an ascending sort
$limit=10
Get the first 10 after the sort using $limit

Observation

Zip code 11207 again emerges as a collision-prone area, with 124 collisions in January. That's 4 collisions a day. Every day is a regular demolition derby in that part of Brooklyn.

Some Perl CLI Resources

Perl Maven - Great for those new to Perl
Perldocs - perlrun
Peteris Krumins has some great e-books
Dave Cross - From one of his older posts on perl.com

Some NYC Street Resources

NYC - Motor Vehicle Collisions - Crashes.
StreetsBlog NYC
Hellgate NYC - Local NYC News
Liam Quigley - Local Reporter
More Liam Quigley - Twitter
These Stupid Trucks are Literally Killing Us YouTube

Me

LinkedIn
blog


Original Link: https://dev.to/austin_62/crash-city-2ebd
