“The more time passes, the more I’m sorry about it. We shouldn’t have done it. We did not learn enough from the mission to justify the death of the dog…” Oleg Gazenko, one of the scientists who trained Laika
During the 1950s and 1960s the Soviet space programme sent dogs into space to assess the viability of human spaceflight. Although the majority of the dogs survived their flights, perhaps the most famous of them, Laika, was not expected to survive her orbital flight and died on 3 November 1957. Using animals in this way is clearly a contentious topic that many people find difficult or impossible to justify. However, I think people can also be strangely fascinated by the concept of dogs in space, and this project is the manifestation of my curiosity towards these heroic canines.
This blog post is the first of a two-parter in which I tidy and clean data on these space dog missions. I will be utilising the tidyr, stringr, janitor and lubridate packages to get the dogs all shipshape. In a second blog post I will attempt to honour their heroic exploits with a data visualisation.
If you’d like to take a look at the data, and the code I will be walking through here, you can find it in my GitHub repo.
The data for this project comes courtesy of Duncan Geere. I first discovered Duncan when I came across his visualisation of influential indie bands from the mid-00s, a subject in which I also have some expertise. Shortly after I followed him on Twitter he released this data on the Soviet Space Dogs. It comes in the form of 2 CSV files stored on Airtable. The original source of the data is the book Soviet Space Dogs by Olesya Turkina.
As usual, I will be using the tidyverse framework to import, tidy, clean and manipulate the data. I will also use the lubridate package for working with dates and the janitor package for data cleaning.
library(tidyverse) # most things
library(lubridate) # format dates
library(janitor) # cleaning things
I’ve downloaded and saved the 2 files from Airtable into my project directory, and I read them in using the read_csv function from the readr package.
# read in dogs csv file
# from: https://airtable.com/universe/expG3z2CFykG1dZsp/sovet-space-dogs
dogs <- read_csv("data/Dogs-Database.csv")
## Parsed with column specification:
## cols(
## `Name (Latin)` = col_character(),
## `Name (English)` = col_character(),
## `Name (Cyrillic)` = col_character(),
## Gender = col_character(),
## Flights = col_character(),
## Fate = col_character(),
## Notes = col_character()
## )
# read in flights csv file
flights <- read_csv("data/Flights-Database.csv")
## Parsed with column specification:
## cols(
## Date = col_date(format = ""),
## Dogs = col_character(),
## Rocket = col_character(),
## `Altitude (km)` = col_character(),
## Result = col_character(),
## Notes = col_character()
## )
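The column specifications printed above are readr's guesses. As a minimal sketch of how they could be pinned down explicitly, here is read_csv with a col_types argument, run on a couple of rows copied from the glimpse output further down rather than on the real file. The toy CSV text and the object name flights_toy are my own stand-ins, and wrapping literal text in I() assumes readr 2.0 or later:

```r
library(readr)

# Toy CSV text standing in for Flights-Database.csv (two rows only)
csv_text <- paste(
  "Date,Dogs,Altitude (km)",
  "1951-07-22,\"Dezik,Tsygan\",100",
  "1951-07-29,\"Dezik,Lisa\",100",
  sep = "\n"
)

# State the column types up front instead of relying on readr's guesses
flights_toy <- read_csv(
  I(csv_text),
  col_types = cols(
    Date = col_date(format = "%Y-%m-%d"),
    Dogs = col_character(),
    `Altitude (km)` = col_character()
  )
)

flights_toy
```

Spelling the types out also silences the "Parsed with column specification" message, which is handy once you know what the file looks like.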
Let’s take a look at the data:
glimpse(dogs)
## Observations: 48
## Variables: 7
## $ `Name (Latin)` <chr> "Dezik", "Tsygan", "Lisa", "Chizhik", "Mishka"…
## $ `Name (English)` <chr> "Dezik", "Gypsy", "Fox", "Siskin", "Little Bea…
## $ `Name (Cyrillic)` <chr> "Дезик", "Цыган", "Лиса", "Чижик", "Мишка", "Р…
## $ Gender <chr> "Male", "Male", "Female", "Male", "Male", "Mal…
## $ Flights <chr> "1951-07-22,1951-07-29", "1951-07-22", "1951-0…
## $ Fate <chr> "Died 1951-07-29", "Survived", "Died 1951-07-2…
## $ Notes <chr> NA, "Adopted as a pet by Soviet physicist Anat…
glimpse(flights)
## Observations: 42
## Variables: 6
## $ Date <date> 1951-07-22, 1951-07-29, 1951-08-15, 1951-08-19,…
## $ Dogs <chr> "Dezik,Tsygan", "Dezik,Lisa", "Chizhik,Mishka", …
## $ Rocket <chr> "R-1V", "R-1B", "R-1B", "R-1V", "R-1B", "R-1B", …
## $ `Altitude (km)` <chr> "100", "100", "100", "100", "100", "100", "100",…
## $ Result <chr> "recovered safely", "parachute failed, both dogs…
## $ Notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "no rocket o…
First thing to note is that some of the variable names are not very clean. For example, the first 3 variables in the dogs dataset, which provide 3 variants of the dogs’ names, have variable names containing spaces and parentheses. Although you can work with these variable names by wrapping them in backticks, I prefer to use a standardised naming convention: snake_case. This is the preferred style in the tidyverse, and the janitor package makes light work of this clean-up job. Let’s just focus on the dogs dataset for now:
dogs_tidy <- dogs %>%
  # clean names to snake_case
  clean_names()
glimpse(dogs_tidy)
## Observations: 48
## Variables: 7
## $ name_latin <chr> "Dezik", "Tsygan", "Lisa", "Chizhik", "Mishka", "R…
## $ name_english <chr> "Dezik", "Gypsy", "Fox", "Siskin", "Little Bear", …
## $ name_cyrillic <chr> "Дезик", "Цыган", "Лиса", "Чижик", "Мишка", "Рыжик…
## $ gender <chr> "Male", "Male", "Female", "Male", "Male", "Male", …
## $ flights <chr> "1951-07-22,1951-07-29", "1951-07-22", "1951-07-29…
## $ fate <chr> "Died 1951-07-29", "Survived", "Died 1951-07-29", …
## $ notes <chr> NA, "Adopted as a pet by Soviet physicist Anatoli …
Aaaand relax. I feel better already. Having the variables named in a consistent manner will help when referencing them later.
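To see what clean_names does in isolation, here is a small sketch on a one-row toy tibble that borrows three of the awkward column names from the two datasets. The tibble itself (toy) is made up for illustration:

```r
library(tibble)
library(janitor)

# One-row toy tibble using column names in the style of the raw data
toy <- tibble(
  `Name (Latin)` = "Dezik",
  `Name (English)` = "Dezik",
  `Altitude (km)` = "100"
)

# Spaces and parentheses are replaced, everything is lower-cased
cleaned_names <- names(clean_names(toy))
cleaned_names
```

The names come back as name_latin, name_english and altitude_km, so no more backticks are needed.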
Next, I can see that I need to tidy the data. To be clear, I’m not using tidy and clean interchangeably here. By tidy I’m referring to one of the key concepts of working with data in the tidyverse (it’s where the name comes from!), as described by Hadley Wickham in R for Data Science:
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Now, in the dogs data, there are multiple flight observations in the same cell for dogs that flew more than once. For example, the 1st dog in the data, Dezik, flew on 1951-07-22 and 1951-07-29. These 2 values are separated by a comma. For the purposes of my analysis, I require each flight to have its own row, so in this case Dezik will appear twice, and each observation will be a unique dog and flight combination. The separate_rows function from tidyr couldn’t make this any easier. I pass it the flights column and specify the separating character:
dogs_tidy <- dogs_tidy %>%
  # flights are recorded on same row - put on separate rows to make it 'tidy'
  separate_rows(flights, sep = ",")
glimpse(dogs_tidy)
## Observations: 81
## Variables: 7
## $ name_latin <chr> "Dezik", "Dezik", "Tsygan", "Lisa", "Chizhik", "Ch…
## $ name_english <chr> "Dezik", "Dezik", "Gypsy", "Fox", "Siskin", "Siski…
## $ name_cyrillic <chr> "Дезик", "Дезик", "Цыган", "Лиса", "Чижик", "Чижик…
## $ gender <chr> "Male", "Male", "Male", "Female", "Male", "Male", …
## $ flights <chr> "1951-07-22", "1951-07-29", "1951-07-22", "1951-07…
## $ fate <chr> "Died 1951-07-29", "Died 1951-07-29", "Survived", …
## $ notes <chr> NA, NA, "Adopted as a pet by Soviet physicist Anat…
So we now have clean variable names and a tidy data structure. Notice that the number of observations has increased from 48 to 81, as we now have a record of each dog-flight combination.
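A quick way to convince yourself that the row count is right is to count the comma-separated values before splitting. The sketch below does this on a two-dog toy tibble (toy and its contents are stand-ins, not the real data): the number of rows after separate_rows should equal the number of commas plus one, summed over the original rows.

```r
library(tibble)
library(tidyr)
library(stringr)

# Toy data: one dog with two flights, one dog with a single flight
toy <- tibble(
  name = c("Dezik", "Tsygan"),
  flights = c("1951-07-22,1951-07-29", "1951-07-22")
)

# Expected rows after splitting: (commas + 1) summed over the original rows
expected_rows <- sum(str_count(toy$flights, ",") + 1)

# One row per dog-flight combination
toy_long <- separate_rows(toy, flights, sep = ",")

nrow(toy_long) == expected_rows
```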
Next, I will format some of the existing variables and also create some new variables from existing ones.
Going back to the initial glimpse of the data, I have the flight date (flights) formatted as a character, and I also have the date of death, for dogs that died, within the fate character variable (prefixed with Died). I want to convert these date strings into date-formatted variables.
The lubridate package provides an easy way of parsing these dates. As the dates appear in the format of year-month-day, I can simply use the ymd function from lubridate (alternatives such as mdy also exist). The beauty of these functions is that they should parse the date as long as the year, month and day are in the correct order, regardless of the formatting.
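As a small illustration of that flexibility, the sketch below parses the same date written with different separators. The strings are made-up variants of Dezik's first flight date, and quiet = TRUE is used only to suppress the warning on the deliberately unparseable value:

```r
library(lubridate)

# Same date, different separators: year-month-day order is all that matters
d_dash  <- ymd("1951-07-22")
d_slash <- ymd("1951/07/22")
d_plain <- ymd("19510722")

# mdy expects month-day-year order instead
d_mdy <- mdy("07-22-1951")

# Strings that are not dates come back as NA
d_bad <- ymd("Survived", quiet = TRUE)

c(d_dash, d_slash, d_plain, d_mdy, d_bad)
```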
Before I can convert the date of death into a date field, I need to extract it from the fate character string. Here I’m using the str_sub function from stringr, firstly to identify the values that begin with Died and then again to extract the date thereafter. I pass str_sub the variable along with the start and end position of the string I want to extract.
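Here is a minimal sketch of those two str_sub calls on a toy fate vector (the vector is made up, but uses the same two kinds of value as the real column). It also shows why the check on the first four characters matters: without it, "Survived" would yield a meaningless fragment. The str_extract line is an alternative I'm suggesting, not what the original code does.

```r
library(stringr)

fate <- c("Died 1951-07-29", "Survived", NA)

# Characters 1-4 tell us whether the value starts with "Died"
prefix <- str_sub(fate, 1, 4)

# Characters 6-15 hold the date, but only for the "Died" values
raw_date <- str_sub(fate, 6, 15)

# Alternative: pull out anything shaped like a date, NA where there is none
extracted <- str_extract(fate, "\\d{4}-\\d{2}-\\d{2}")

prefix     # "Died" "Surv" NA
raw_date   # "1951-07-29" "ved" NA
extracted  # "1951-07-29" NA NA
```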
I’m then creating a binary flight_fate variable to indicate if the dog died or survived the flight:
dogs_tidy <- dogs_tidy %>%
  # format data
  mutate(date_flight = ymd(flights),
         # from fate variable extract the date if dog died
         date_death = case_when(str_sub(fate, 1, 4) == "Died" ~ str_sub(fate, 6, 15)),
         date_death = ymd(date_death),
         # if dog died on flight then set flight_fate to Died
         flight_fate = case_when(date_flight == date_death ~ "Died",
                                 TRUE ~ "Survived"))
glimpse(dogs_tidy)
## Observations: 81
## Variables: 10
## $ name_latin <chr> "Dezik", "Dezik", "Tsygan", "Lisa", "Chizhik", "Ch…
## $ name_english <chr> "Dezik", "Dezik", "Gypsy", "Fox", "Siskin", "Siski…
## $ name_cyrillic <chr> "Дезик", "Дезик", "Цыган", "Лиса", "Чижик", "Чижик…
## $ gender <chr> "Male", "Male", "Male", "Female", "Male", "Male", …
## $ flights <chr> "1951-07-22", "1951-07-29", "1951-07-22", "1951-07…
## $ fate <chr> "Died 1951-07-29", "Died 1951-07-29", "Survived", …
## $ notes <chr> NA, NA, "Adopted as a pet by Soviet physicist Anat…
## $ date_flight <date> 1951-07-22, 1951-07-29, 1951-07-22, 1951-07-29, 1…
## $ date_death <date> 1951-07-29, 1951-07-29, NA, 1951-07-29, 1951-08-2…
## $ flight_fate <chr> "Survived", "Died", "Survived", "Died", "Survived"…
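One detail worth spelling out is how case_when treats dogs with no date of death. Comparing a date with NA gives NA rather than FALSE, so those rows skip the first condition and land on the TRUE fallback. A minimal sketch using the first three rows shown above (the standalone vectors are just for illustration):

```r
library(dplyr)

# First three rows of the glimpse above: Dezik, Dezik, Tsygan
date_flight <- as.Date(c("1951-07-22", "1951-07-29", "1951-07-22"))
date_death  <- as.Date(c("1951-07-29", "1951-07-29", NA))

# NA == date is NA, so the third row falls through to the TRUE branch
flight_fate <- case_when(
  date_flight == date_death ~ "Died",
  TRUE ~ "Survived"
)

flight_fate  # "Survived" "Died" "Survived"
```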
Finally, as one last step I am re-ordering the variables and removing the now redundant fate field:
dogs_tidy <- dogs_tidy %>%
  # keep everything except notes first, then notes at the end, and drop fate
  select(-notes, everything(), -fate)
glimpse(dogs_tidy)
## Observations: 81
## Variables: 9
## $ name_latin <chr> "Dezik", "Dezik", "Tsygan", "Lisa", "Chizhik", "Ch…
## $ name_english <chr> "Dezik", "Dezik", "Gypsy", "Fox", "Siskin", "Siski…
## $ name_cyrillic <chr> "Дезик", "Дезик", "Цыган", "Лиса", "Чижик", "Чижик…
## $ gender <chr> "Male", "Male", "Male", "Female", "Male", "Male", …
## $ flights <chr> "1951-07-22", "1951-07-29", "1951-07-22", "1951-07…
## $ date_flight <date> 1951-07-22, 1951-07-29, 1951-07-22, 1951-07-29, 1…
## $ date_death <date> 1951-07-29, 1951-07-29, NA, 1951-07-29, 1951-08-2…
## $ flight_fate <chr> "Survived", "Died", "Survived", "Died", "Survived"…
## $ notes <chr> NA, NA, "Adopted as a pet by Soviet physicist Anat…
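The select call packs two jobs into one line. If you find that hard to read, a sketch of an equivalent two-step version is below, assuming dplyr 1.0.0 or later for relocate. The one-row toy tibble is made up and only carries enough columns to show the ordering:

```r
library(tibble)
library(dplyr)

# Toy row with the columns that matter for the re-ordering
toy <- tibble(
  name_latin = "Dezik",
  fate = "Died 1951-07-29",
  notes = NA_character_,
  date_flight = as.Date("1951-07-22")
)

toy_reordered <- toy %>%
  # drop the redundant column
  select(-fate) %>%
  # move notes to the end
  relocate(notes, .after = last_col())

names(toy_reordered)
```

This should leave the columns in the same order as the select version: everything else first, notes last.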
This completes the cleaning and tidying of the dogs dataset. Here I have gone through it step by step, but in practice the data processing was performed in one long chain. See the GitHub repo for the full code.
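To give a flavour of what a single chain looks like, here is a sketch that strings the steps from this post together on a two-dog toy tibble. It is my condensed illustration rather than the exact code in the repo, and dogs_toy is a stand-in for the real file:

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(janitor)

# Toy stand-in for the raw dogs data, with the original messy column names
dogs_toy <- tibble(
  `Name (Latin)` = c("Dezik", "Tsygan"),
  Flights = c("1951-07-22,1951-07-29", "1951-07-22"),
  Fate = c("Died 1951-07-29", "Survived"),
  Notes = c(NA_character_, NA_character_)
)

dogs_toy_tidy <- dogs_toy %>%
  # snake_case names
  clean_names() %>%
  # one row per dog-flight combination
  separate_rows(flights, sep = ",") %>%
  # parse dates and flag whether the dog died on this flight
  mutate(date_flight = ymd(flights),
         date_death = case_when(str_sub(fate, 1, 4) == "Died" ~ str_sub(fate, 6, 15)),
         date_death = ymd(date_death),
         flight_fate = case_when(date_flight == date_death ~ "Died",
                                 TRUE ~ "Survived")) %>%
  # notes last, fate dropped
  select(-notes, everything(), -fate)

dogs_toy_tidy
```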
The flights dataset goes through a similar process. There is no need to convert it into a tidy format as we already have one row per flight. However, I once again use janitor to clean the names. The only other processing performed is to convert the altitude into a number for those flights where a number is given (for the orbital flights, for example, an altitude in km is not provided, so the result will be NA). The parse_number function from readr is an easy way to convert a character to a number (and it will even handle numbers containing non-numeric characters such as commas or dollar signs):
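flights_tidy <- flights %>%
  # clean names to snake_case
  clean_names() %>%
  # keep and rename the columns of interest (dogs is dropped)
  select(date_flight = date, rocket, altitude_km, result, notes_flight = notes) %>%
  # convert altitude to a number where the value starts with a digit, otherwise NA
  mutate(altitude = case_when(str_detect(altitude_km, "^[0-9]") ~ parse_number(altitude_km)))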
glimpse(flights_tidy)
## Observations: 42
## Variables: 6
## $ date_flight <date> 1951-07-22, 1951-07-29, 1951-08-15, 1951-08-19, 19…
## $ rocket <chr> "R-1V", "R-1B", "R-1B", "R-1V", "R-1B", "R-1B", "R-…
## $ altitude_km <chr> "100", "100", "100", "100", "100", "100", "100", "1…
## $ result <chr> "recovered safely", "parachute failed, both dogs di…
## $ notes_flight <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "no rocket or a…
## $ altitude <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, NA, 10…
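For a feel of what parse_number tolerates, here is a small sketch on made-up strings. The value "orbital" is my assumption about what a non-numeric altitude entry might look like, not a value confirmed from the data:

```r
library(readr)

# Commas, currency symbols and trailing units are ignored
messy <- parse_number(c("100", "1,000", "$451 km"))

# A value with no number in it becomes NA (with a parsing warning,
# suppressed here to keep the output clean)
with_text <- suppressWarnings(parse_number(c("100", "orbital")))

messy      # 100 1000 451
with_text  # 100 NA
```

Because parse_number already returns NA for values it cannot read, the str_detect check in the code above mainly serves to make the intent explicit.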
And there we have it. Both datasets are now clean and tidy. Therapeutic, isn’t it? If you didn’t already enjoy cleaning and tidying your data (what’s wrong with you?!) then hopefully you can see how the tidyverse and associated packages provide powerful and easy-to-use tools to make the whole process as painless as possible.
Stay tuned for my next blog post where I will honour these dogs with a fitting visualisation. In the meantime, check out Duncan Geere’s brilliant visualisation of these heroic canines.