Bird Data
To evaluate the relationship between birds and light-polluted areas using spatial data, I needed both light pollution data and bird distribution data. Light pollution data was the easy part: the National Oceanic and Atmospheric Administration (NOAA) provides everything I need, stored in GeoTIFF files. The format is very interesting and I'm going to explore it more after this class concludes. It has a lot of potential, but I didn't end up needing it for this project: when I created my interactive map using Carto, I discovered that one of the provided base layer choices is built from NOAA's data, so Carto took care of the light pollution map for me.
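For reference, a minimal sketch of opening one of those GeoTIFFs with the rasterio library (the file name here is hypothetical) might look something like this:

import rasterio

# Open a NOAA nighttime lights GeoTIFF (hypothetical file name) and read its first band.
with rasterio.open('noaa_nighttime_lights.tif') as src:
    lights = src.read(1)  # 2D numpy array of radiance values
    print(src.crs, src.bounds, lights.shape)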
To obtain data on the geographical distribution of birds I turned to an organization called eBird. This organization, created by the Cornell Lab of Ornithology and the National Audubon Society, is at the center of how the birding community reports sightings of bird species. The reported data is aggregated and studied: eBird uses computer models to clean and process the observations and to estimate the actual distribution of each species, independent of where birders happen to be and what they happen to report. That modeled data is available for download and use for 107 different bird species.
I'm grateful that this high-quality data is made available. I found that usable bird distribution data like the kind used in my light pollution map is hard to come by.
Here is the data citation:
Fink, D., T. Auer, A. Johnston, M. Strimas-Mackey, M. Iliff, and S. Kelling. eBird Status and Trends. Version: November 2018. https://ebird.org/science/status-and-trends. Cornell Lab of Ornithology, Ithaca, New York.
After downloading the data I had to organize it a bit before uploading it to Carto. This is easy to do in Python with the geopandas library.
First I must import some packages.
from pathlib import Path
import geopandas as gpd
import pandas as pd
This dataset consists of about 2 GB of GeoPackage (GPKG) data files. The below code reads the file for each bird and combines them into a GeoDataFrame.
source_files = Path('/local/ITP/temporary_expert/birddistributiondata/rangedata/')
# Read the range file for each bird, then concatenate them into one GeoDataFrame.
frames = [gpd.read_file(file.as_posix(), layer=0) for file in source_files.glob('*-range-*.gpkg')]
df = pd.concat(frames, ignore_index=True)
Next I dropped some unnecessary columns:
df.drop(['version_year', 'scientific_name', 'layer', 'start_dt', 'end_dt'], axis=1, inplace=True)
Looking at the first few rows of data, I see a unique column called geometry. This column contains the latitude and longitude coordinates of polygons describing the region each bird occupies during the breeding, nonbreeding, and migration periods. This column is responsible for the large file sizes.
df.head()
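To get a sense of what lives in the geometry column, a quick check of the geometry types (using the standard geopandas accessor) can be helpful:

# The ranges are stored as polygons or multipolygons; tally the geometry types.
df.geom_type.value_counts()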
Some of these birds occupy one region year round. I'm not interested in those birds so I will remove those data rows.
df.season_name.unique()
Note there are currently 107 unique birds in the dataset.
len(df.species_code.unique())
Keep only the rows whose season is 'nonbreeding', 'prebreeding_migration', 'breeding', or 'postbreeding_migration'.
df = df[df.season_name.isin(['nonbreeding', 'prebreeding_migration', 'breeding', 'postbreeding_migration'])]
Check that it worked:
df.season_name.unique()
Now there are only 95 birds in the dataset.
len(df.species_code.unique())
Note that there are now 375 rows in the table. Since 95 * 4 = 380, there are a few fewer rows than I would expect: a few birds don't have location data for all four seasons. I will leave those birds in the dataset.
len(df)
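For the curious, one quick way to see which species are missing a season is to count the number of rows per species; anything under four is missing at least one season:

# Species with fewer than four rows lack data for at least one season.
season_counts = df.species_code.value_counts()
season_counts[season_counts < 4]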
The data files are very large and the geometry is unnecessarily detailed for my use case. I can simplify the geometry to reduce the file sizes. This step is very slow.
The default free Carto account provides 250 MB of space for data. If you are an NYU student you can sign up with your NYU account to receive 500 MB. The simplification below gets the final file size down to 256 MB, leaving me some space for future projects.
# Simplify every geometry with a small tolerance to shrink the file sizes.
df['geometry'] = df.simplify(0.0005)
print('done simplifying')
Carto has difficulty importing giant datasets all at once, so the data will be broken up into several smaller files and uploaded one at a time.
basepath = Path("/local/ITP/temporary_expert/birddistributiondata/processed_range_data")
df.iloc[0:100].to_file(basepath.joinpath("bird_data_000_100.gpkg"), driver="GPKG")
df.iloc[100:200].to_file(basepath.joinpath("bird_data_100_200.gpkg"), driver="GPKG")
df.iloc[200:300].to_file(basepath.joinpath("bird_data_200_300.gpkg"), driver="GPKG")
df.iloc[300:].to_file(basepath.joinpath("bird_data_300_375.gpkg"), driver="GPKG")
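To double-check that the exported files actually fit within the Carto quota, the file sizes can be totaled up (a quick sketch using the basepath defined above):

# Sum the sizes of the exported GeoPackage files, in megabytes.
total_mb = sum(f.stat().st_size for f in basepath.glob('bird_data_*.gpkg')) / 1e6
print(f'{total_mb:.0f} MB total')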
The four datasets can be combined into one dataset within Carto using SQL INSERT statements like this:
INSERT INTO jim18133.ebird_ranges (the_geom, fid, species_code, common_name, season_name, area_km2)
SELECT the_geom, fid, species_code, common_name, season_name, area_km2
FROM jim18133.bird_data_300_375
WHERE fid >= 20 AND fid < 40;
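The WHERE clause keeps each INSERT to a manageable batch of fid values. Rather than writing every statement by hand, a short Python loop can print them; the batch size and fid range here are just illustrative:

# Print batched INSERT statements for one uploaded table (illustrative ranges).
columns = 'the_geom, fid, species_code, common_name, season_name, area_km2'
for start in range(0, 100, 20):
    print(f'INSERT INTO jim18133.ebird_ranges ({columns}) '
          f'SELECT {columns} FROM jim18133.bird_data_300_375 '
          f'WHERE fid >= {start} AND fid < {start + 20};')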
Finally, I need to use Python to write some HTML for the dropdown. This is copied into the light pollution map.
for _, row in df[~df.duplicated('species_code')][['species_code', 'common_name']].iterrows():
    print(f"<option value=\"{row['species_code']}\">{row['common_name']}</option>")