Carnivore genera in the Paleobiology Database
Exercise Type: Implementation
Instructions
Calculate descriptive stats from the Paleobiology Database (carnivores).
1. Download this file: carnivore_families.csv
2. Read it into R with read.csv()! Check its structure and familiarize yourself with the columns. Every row represents one occurrence of a taxon in a collection. These usually belong to a species, but the fossils are sometimes not that easy to identify. For this reason we will look at the genera only.
3. List the unique families in the table (column family). How many families are there? Do all names make sense?
4. This dataset, like most compiled data, needs some cleaning. We can make it considerably better if we omit bad entries (i.e. subset to the rows where the column value is not equal to the bad value). Write multiple subsetting commands to omit entries where a. family is "" (empty string); b. family is "NO_FAMILY_SPECIFIED"; c. genus is "" (empty string)!
5. Subset your data to the family "Canidae". Count the number of genera in this family!
6. Make sure that the unique list of families does not contain the bad entries! Using a for loop, repeat step 5 and count the number of genera for every family!
7. Plot a histogram of the resulting vector!
8. Which family has the highest number of genera? What is the median number of genera in a family?
9. Ensure that your script works without human intervention! Clean your code, close your R session, and repeat all calculations with the script that you wrote!
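As a rough orientation for steps 3 to 6, here is a minimal sketch on a tiny invented data frame. The object names (carni, canids, families, nGenera) and all toy rows are assumptions made for illustration only. The real table comes from read.csv() and will have more rows and columns, and the file location in the comment assumes the download sits in your working directory.

```r
# Toy stand-in for the downloaded table (invented rows, NOT real data).
# With the real file you would instead start from something like:
#   carni <- read.csv("carnivore_families.csv")
carni <- data.frame(
  family = c("Canidae", "Canidae", "Canidae", "Felidae",
             "", "NO_FAMILY_SPECIFIED", "Felidae"),
  genus  = c("Canis", "Vulpes", "Canis", "Felis",
             "Canis", "Ursus", ""),
  stringsAsFactors = FALSE
)

# Step 3: unique families, still including the bad entries
unique(carni$family)

# Step 4: omit the bad entries with separate subsetting commands
carni <- carni[carni$family != "", ]
carni <- carni[carni$family != "NO_FAMILY_SPECIFIED", ]
carni <- carni[carni$genus != "", ]

# Step 5: number of distinct genera in a single family
canids <- carni[carni$family == "Canidae", ]
length(unique(canids$genus))

# Step 6: the same count for every family, stored in a named vector
families <- unique(carni$family)
nGenera <- numeric(length(families))
names(nGenera) <- families
for (fam in families) {
  oneFamily <- carni[carni$family == fam, ]
  nGenera[fam] <- length(unique(oneFamily$genus))
}
nGenera
```

Counting with length(unique(...)) rather than nrow() matters here, because a genus usually occurs in many rows of an occurrence table.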
Expected outputs
- A named numeric vector:
  names: names of families
  values: the number of genera in families
- A histogram
- The median number of genera
- The name of the family that has the highest number of genera
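To illustrate the shape of these outputs, the following sketch starts from an invented named vector. The counts and the names nGenera, medGenera and topFamily are hypothetical and do not reflect the real dataset. The plot is written to a temporary PDF file so that the sketch also runs in a non-interactive session; in an interactive session, calling hist(nGenera) alone should be enough.

```r
# Invented counts, only to show the shape of the expected outputs
nGenera <- c(Canidae = 5, Felidae = 8, Ursidae = 3)

# Histogram of the number of genera per family,
# drawn to a temporary PDF so that no open graphics window is needed
pdf(tempfile(fileext = ".pdf"))
hist(nGenera, main = "Genera per family", xlab = "Number of genera")
dev.off()

# Median number of genera in a family
medGenera <- median(nGenera)

# Name of the family with the most genera
# (which.max() returns only the first maximum if there are ties)
topFamily <- names(which.max(nGenera))
```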
Extra questions
- Step 4 can be expressed in one line of code, can you write it like that?
- Step 6 can be written without a for loop. Do you know how to do it?
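One possible direction for both questions, sketched on the same kind of invented toy table (the rows and the names carni, clean and nGenera are assumptions, not part of the exercise data). This is only one of several ways to do it: logical conditions can be combined with &, and tapply() applies a function to the values of one column grouped by another.

```r
# Toy stand-in for the table (invented rows, NOT real data)
carni <- data.frame(
  family = c("Canidae", "Canidae", "Canidae", "Felidae",
             "", "NO_FAMILY_SPECIFIED", "Felidae"),
  genus  = c("Canis", "Vulpes", "Canis", "Felis",
             "Canis", "Ursus", ""),
  stringsAsFactors = FALSE
)

# Step 4 in one line: combine the three conditions with &
clean <- carni[carni$family != "" & carni$family != "NO_FAMILY_SPECIFIED" & carni$genus != "", ]

# Step 6 without a for loop: count distinct genera within each family
nGenera <- tapply(clean$genus, clean$family, function(x) length(unique(x)))
nGenera
```

Note that tapply() returns a one-dimensional array rather than a plain vector, and it orders the families alphabetically; names(nGenera) still gives the family names.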