Soviet Space Dogs (Part 1)

“The more time passes, the more I’m sorry about it. We shouldn’t have done it. We did not learn enough from the mission to justify the death of the dog…”  Oleg Gazenko, one of the scientists who trained Laika

During the 1950s and 60s the Soviet space program sent dogs into space to assess the viability of human spaceflight. Although the majority of the dogs survived their flights, perhaps the most famous of them, Laika, was not expected to survive her orbital flight and died on 3 November 1957. Using animals in this way is clearly a contentious topic that many people find difficult or impossible to justify. However, I think people can also be strangely fascinated by the concept of dogs in space, and this project is the manifestation of my curiosity about these heroic canines.

This blog post is the first of a two-parter in which I tidy and clean data on these space dog missions. I will be utilising the tidyr, stringr, janitor and lubridate packages to get the dogs all shipshape. In a second blog post I will attempt to honour their heroic exploits with a data visualisation.

If you’d like to take a look at the data, and the code I will be walking through here, you can find it in my GitHub repo.

The Data

The data for this project comes courtesy of Duncan Geere. I first discovered Duncan when I came across his visualisation of influential indie bands from the mid-00s, a subject in which I also have some expertise. Shortly after I followed him on Twitter he released this data on the Soviet Space Dogs. It comes in the form of 2 CSV files stored on Airtable. The original source of the data is the book Soviet Space Dogs by Olesya Turkina.

The Packages

As usual, I will be using the tidyverse framework to import, tidy, clean and manipulate the data. I will also use the lubridate package for working with dates and the janitor package for data cleaning.

library(tidyverse) # most things
library(lubridate) # format dates 
library(janitor) # cleaning things

Import the Data

I’ve downloaded the 2 files from Airtable and saved them in my project directory, and I read them in using the read_csv function from the readr package.

# read in dogs csv file
# from: https://airtable.com/universe/expG3z2CFykG1dZsp/sovet-space-dogs

dogs <- read_csv("data/Dogs-Database.csv")
## Parsed with column specification:
## cols(
##   `Name (Latin)` = col_character(),
##   `Name (English)` = col_character(),
##   `Name (Cyrillic)` = col_character(),
##   Gender = col_character(),
##   Flights = col_character(),
##   Fate = col_character(),
##   Notes = col_character()
## )
# read in flights csv file
flights <- read_csv("data/Flights-Database.csv")
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   Dogs = col_character(),
##   Rocket = col_character(),
##   `Altitude (km)` = col_character(),
##   Result = col_character(),
##   Notes = col_character()
## )
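The "Parsed with column specification" messages above are readr reporting the types it guessed. If you would rather state the types yourself, read_csv accepts a col_types argument. The following is only a sketch: the temporary file and its two rows are a hypothetical stand-in for Flights-Database.csv (values borrowed from the glimpse output), not the real file.

```r
library(readr)

# Hypothetical two-row stand-in for Flights-Database.csv, written to a temp file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Rocket,Altitude (km)",
  "1951-07-22,R-1V,100",
  "1951-07-29,R-1B,100"
), tmp)

# Declaring the column types up front documents what each column should be
# and stops readr from having to guess
flights_demo <- read_csv(
  tmp,
  col_types = cols(
    Date = col_date(format = ""),
    Rocket = col_character(),
    `Altitude (km)` = col_character()
  )
)

flights_demo
```

Keeping the altitude as a character column mirrors what readr guessed for the real data, where not every value is a plain number.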

Let’s take a look at the data:

glimpse(dogs)
## Observations: 48
## Variables: 7
## $ `Name (Latin)`    <chr> "Dezik", "Tsygan", "Lisa", "Chizhik", "Mishka"…
## $ `Name (English)`  <chr> "Dezik", "Gypsy", "Fox", "Siskin", "Little Bea…
## $ `Name (Cyrillic)` <chr> "Дезик", "Цыган", "Лиса", "Чижик", "Мишка", "Р…
## $ Gender            <chr> "Male", "Male", "Female", "Male", "Male", "Mal…
## $ Flights           <chr> "1951-07-22,1951-07-29", "1951-07-22", "1951-0…
## $ Fate              <chr> "Died 1951-07-29", "Survived", "Died 1951-07-2…
## $ Notes             <chr> NA, "Adopted as a pet by Soviet physicist Anat…
glimpse(flights)
## Observations: 42
## Variables: 6
## $ Date            <date> 1951-07-22, 1951-07-29, 1951-08-15, 1951-08-19,…
## $ Dogs            <chr> "Dezik,Tsygan", "Dezik,Lisa", "Chizhik,Mishka", …
## $ Rocket          <chr> "R-1V", "R-1B", "R-1B", "R-1V", "R-1B", "R-1B", …
## $ `Altitude (km)` <chr> "100", "100", "100", "100", "100", "100", "100",…
## $ Result          <chr> "recovered safely", "parachute failed, both dogs…
## $ Notes           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "no rocket o…

Clean & Tidy

Janitor

The first thing to note is that some of the variable names are not very clean. For example, the first 3 variables in the dogs dataset, which provide 3 variants of the dogs’ names, have variable names containing spaces and parentheses. Although you can work with these variable names by wrapping them in backticks, I prefer to use a standardised naming convention: snake_case. This is the preferred style in the tidyverse, and the janitor package makes light work of this clean-up job. Let’s just focus on the dogs dataset for now:

dogs_tidy <- dogs %>% 
  # clean names to snake_case
  clean_names() 

glimpse(dogs_tidy)
## Observations: 48
## Variables: 7
## $ name_latin    <chr> "Dezik", "Tsygan", "Lisa", "Chizhik", "Mishka", "R…
## $ name_english  <chr> "Dezik", "Gypsy", "Fox", "Siskin", "Little Bear", …
## $ name_cyrillic <chr> "Дезик", "Цыган", "Лиса", "Чижик", "Мишка", "Рыжик…
## $ gender        <chr> "Male", "Male", "Female", "Male", "Male", "Male", …
## $ flights       <chr> "1951-07-22,1951-07-29", "1951-07-22", "1951-07-29…
## $ fate          <chr> "Died 1951-07-29", "Survived", "Died 1951-07-29", …
## $ notes         <chr> NA, "Adopted as a pet by Soviet physicist Anatoli …

Aaaand relax. I feel better already. Having the variables named in a consistent manner will help when referencing them later.
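To see what clean_names does in isolation, here is a minimal sketch on a made-up one-row tibble (the messy object is hypothetical; only the column names are taken from the dogs data):

```r
library(tibble)
library(janitor)

# Hypothetical one-row tibble with the same awkward column names
messy <- tibble(
  `Name (Latin)` = "Dezik",
  `Name (English)` = "Dezik",
  Gender = "Male"
)

# Before cleaning, the column has to be wrapped in backticks
messy$`Name (Latin)`

# After cleaning, a plain snake_case name is enough
tidy_names <- clean_names(messy)
names(tidy_names)
tidy_names$name_latin
```

Spaces and parentheses are replaced with underscores and everything is lower-cased, which is why the glimpse above shows name_latin rather than `Name (Latin)`.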

Make It Tidy

Next, I can see that I need to tidy the data. To be clear, I’m not using tidy and clean interchangeably here. By tidy I’m referring to one of the key concepts of working with data in the tidyverse (it’s where the name comes from!), as described by Hadley Wickham in R for Data Science:

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Now, in the dogs data, there are multiple flight observations in the same cell for dogs that flew more than once. For example, the 1st dog in the data, Dezik, flew on 1951-07-22 and 1951-07-29. These 2 values are separated by a comma. For the purposes of my analysis, I require each flight to have its own row, so in this case, Dezik will appear twice, and each observation will be a unique dog and flight combination. The separate_rows function from tidyr couldn’t make this any easier. I pass it the flights column and specify the separating character:

dogs_tidy <- dogs_tidy %>%
  # flights are recorded on same row - put on separate rows to make it 'tidy'
  separate_rows(flights, sep = ",") 

glimpse(dogs_tidy)
## Observations: 81
## Variables: 7
## $ name_latin    <chr> "Dezik", "Dezik", "Tsygan", "Lisa", "Chizhik", "Ch…
## $ name_english  <chr> "Dezik", "Dezik", "Gypsy", "Fox", "Siskin", "Siski…
## $ name_cyrillic <chr> "Дезик", "Дезик", "Цыган", "Лиса", "Чижик", "Чижик…
## $ gender        <chr> "Male", "Male", "Male", "Female", "Male", "Male", …
## $ flights       <chr> "1951-07-22", "1951-07-29", "1951-07-22", "1951-07…
## $ fate          <chr> "Died 1951-07-29", "Died 1951-07-29", "Survived", …
## $ notes         <chr> NA, NA, "Adopted as a pet by Soviet physicist Anat…

So we now have clean variable names and a tidy data structure. Notice that the number of observations has increased from 48 to 81, as we now have a record of each dog-flight combination.
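The jump in row count is easier to follow on a tiny example. This sketch uses a hypothetical two-dog tibble (dogs_demo), with values copied from the first rows of the glimpse above:

```r
library(tibble)
library(tidyr)

# Hypothetical two-dog extract: Dezik flew twice, Tsygan once
dogs_demo <- tibble(
  name_latin = c("Dezik", "Tsygan"),
  flights = c("1951-07-22,1951-07-29", "1951-07-22")
)

# One row per dog-flight combination; the other columns are repeated
dogs_long <- separate_rows(dogs_demo, flights, sep = ",")
dogs_long
```

Two rows become three because Dezik's cell held two dates. Note that in more recent tidyr releases (1.3.0 onwards) separate_rows is marked as superseded by separate_longer_delim, though it continues to work.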

Next, I will format some of the existing variables and also create some new variables from existing ones.

Dates & Strings

Going back to the initial glimpse of the data, I have the flight date (flights) formatted as a character, and I also have the date of death for dogs that died within the fate character variable (prefixed with Died). I want to convert these date strings into date-formatted variables.

The lubridate package provides an easy way of parsing these dates. As the dates appear in the format of year-month-day, I can simply use the ymd function from lubridate (alternatives such as mdy also exist). The beauty of these functions is that they should parse the date as long as the year, month and day are in the correct order, regardless of the formatting.
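As a rough illustration of that flexibility, the sketch below writes the same day in a few different ways (the same_day vector is made up for the example) and lets ymd work out the separators:

```r
library(lubridate)

# The same day written with different separators, or none at all
same_day <- c("1951-07-22", "1951/07/22", "19510722", "1951.07.22")

parsed <- ymd(same_day)
parsed

# A string that is not a real date comes back as NA (with a warning)
bad <- suppressWarnings(ymd("1951-13-45"))
bad
```

What ymd will not do is guess the order of the components: a day-first string such as 22-07-1951 needs dmy instead.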

Before I can convert the date of death into a date field, I need to extract it from the fate character string. Here I’m using the str_sub function from stringr, firstly to identify the values that begin with Died and then again to extract the date thereafter. I pass str_sub the variable along with the start and end position of the string I want to extract.

I’m then creating a binary flight_fate variable to indicate if the dog died or survived the flight:

dogs_tidy <- dogs_tidy %>%
  # parse the flight date strings into dates
  mutate(date_flight = ymd(flights),
         # from fate variable extract the date if dog died
         date_death = case_when(str_sub(fate, 1, 4) == "Died" ~ str_sub(fate, 6, 15)),
         date_death = ymd(date_death),
         # if dog died on flight then set flight_fate to Died
         flight_fate = case_when(date_flight == date_death ~ "Died",
                          TRUE ~ "Survived")) 

glimpse(dogs_tidy)
## Observations: 81
## Variables: 10
## $ name_latin    <chr> "Dezik", "Dezik", "Tsygan", "Lisa", "Chizhik", "Ch…
## $ name_english  <chr> "Dezik", "Dezik", "Gypsy", "Fox", "Siskin", "Siski…
## $ name_cyrillic <chr> "Дезик", "Дезик", "Цыган", "Лиса", "Чижик", "Чижик…
## $ gender        <chr> "Male", "Male", "Male", "Female", "Male", "Male", …
## $ flights       <chr> "1951-07-22", "1951-07-29", "1951-07-22", "1951-07…
## $ fate          <chr> "Died 1951-07-29", "Died 1951-07-29", "Survived", …
## $ notes         <chr> NA, NA, "Adopted as a pet by Soviet physicist Anat…
## $ date_flight   <date> 1951-07-22, 1951-07-29, 1951-07-22, 1951-07-29, 1…
## $ date_death    <date> 1951-07-29, 1951-07-29, NA, 1951-07-29, 1951-08-2…
## $ flight_fate   <chr> "Survived", "Died", "Survived", "Died", "Survived"…
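The str_sub positions can look a bit magic, so here is a minimal sketch on a hypothetical two-element fate vector showing what each call returns and why case_when leaves survivors as NA:

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical fate values in the same shape as the real column
fate <- c("Died 1951-07-29", "Survived")

# Characters 1 to 4 tell us whether the string starts with "Died"
prefix <- str_sub(fate, 1, 4)
prefix

# Characters 6 to 15 hold the 10-character date that follows "Died "
# (for "Survived" this is just a meaningless fragment, hence the guard below)
str_sub(fate, 6, 15)

# case_when returns NA where no condition matches, so survivors get no date
date_death <- case_when(prefix == "Died" ~ str_sub(fate, 6, 15))
date_death <- ymd(date_death)
date_death
```

Because date_death is NA for survivors, the comparison date_flight == date_death is also NA for them, and they fall through to the TRUE ~ "Survived" branch.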

Finally, as one last step I am re-ordering the variables and removing the now redundant fate field:

dogs_tidy <- dogs_tidy %>%
  # -notes followed by everything() moves notes to the end; -fate drops fate
  select(-notes, everything(), -fate)

glimpse(dogs_tidy)
## Observations: 81
## Variables: 9
## $ name_latin    <chr> "Dezik", "Dezik", "Tsygan", "Lisa", "Chizhik", "Ch…
## $ name_english  <chr> "Dezik", "Dezik", "Gypsy", "Fox", "Siskin", "Siski…
## $ name_cyrillic <chr> "Дезик", "Дезик", "Цыган", "Лиса", "Чижик", "Чижик…
## $ gender        <chr> "Male", "Male", "Male", "Female", "Male", "Male", …
## $ flights       <chr> "1951-07-22", "1951-07-29", "1951-07-22", "1951-07…
## $ date_flight   <date> 1951-07-22, 1951-07-29, 1951-07-22, 1951-07-29, 1…
## $ date_death    <date> 1951-07-29, 1951-07-29, NA, 1951-07-29, 1951-08-2…
## $ flight_fate   <chr> "Survived", "Died", "Survived", "Died", "Survived"…
## $ notes         <chr> NA, NA, "Adopted as a pet by Soviet physicist Anat…
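That select call does two jobs at once, which is easy to miss. The sketch below replays it on a hypothetical one-row tibble and, as an alternative I am suggesting rather than anything from the original code, shows a more explicit version using relocate (which assumes dplyr 1.0.0 or later):

```r
library(tibble)
library(dplyr)

# Hypothetical one-row tibble with just enough columns to show the effect
demo <- tibble(
  name_latin = "Dezik",
  fate = "Died 1951-07-29",
  notes = NA_character_,
  flight_fate = "Died"
)

# Same pattern as the post: notes moves to the end and fate is dropped
reordered <- select(demo, -notes, everything(), -fate)
names(reordered)

# A more explicit alternative: drop fate, then move notes to the end
alt <- demo %>%
  select(-fate) %>%
  relocate(notes, .after = last_col())
names(alt)
```

Both versions give the same column order; the second is longer but says what it is doing.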

This completes the cleaning and tidying of the dogs dataset. Here I have gone through it step by step, but in practice the data processing was performed in one long chain. See the GitHub repo for the full code.
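For a feel of what a single chain looks like, here is a sketch that strings the steps from this post together on a hypothetical two-dog tibble. It is not the code from the repo, and the notes text is a placeholder:

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(janitor)

# Hypothetical two-dog stand-in for the raw dogs data
dogs_demo <- tibble(
  `Name (Latin)` = c("Dezik", "Tsygan"),
  Flights = c("1951-07-22,1951-07-29", "1951-07-22"),
  Fate = c("Died 1951-07-29", "Survived"),
  Notes = c(NA, "placeholder note")
)

dogs_demo_tidy <- dogs_demo %>%
  # snake_case column names
  clean_names() %>%
  # one row per dog-flight combination
  separate_rows(flights, sep = ",") %>%
  mutate(
    date_flight = ymd(flights),
    # extract the date of death where fate starts with "Died"
    date_death = ymd(case_when(str_sub(fate, 1, 4) == "Died" ~ str_sub(fate, 6, 15))),
    # the dog died on this flight only if the two dates match
    flight_fate = case_when(date_flight == date_death ~ "Died",
                            TRUE ~ "Survived")
  ) %>%
  # move notes to the end and drop the now redundant fate column
  select(-notes, everything(), -fate)

dogs_demo_tidy
```

Dezik ends up with two rows, one flight survived and one not, while Tsygan has a single survived flight.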

Rinse & Repeat

The flights dataset goes through a similar process. There is no need to convert it into a tidy format as we already have one row per flight, but I once again use janitor to clean the names. The only other processing is to convert the altitude into a number where a number is given (for the orbital flights, for example, an altitude in km is not provided, so the result is NA). The parse_number function from readr is an easy way to convert a character to a number (and it will even handle numbers containing non-numeric characters such as commas or dollar signs):

flights_tidy <- flights %>% 
  clean_names() %>% 
  select(date_flight = date, rocket, altitude_km, result, notes_flight = notes) %>% 
  mutate(altitude = case_when(str_detect(altitude_km, "^[0-9]") ~ parse_number(altitude_km)))

glimpse(flights_tidy)
## Observations: 42
## Variables: 6
## $ date_flight  <date> 1951-07-22, 1951-07-29, 1951-08-15, 1951-08-19, 19…
## $ rocket       <chr> "R-1V", "R-1B", "R-1B", "R-1V", "R-1B", "R-1B", "R-…
## $ altitude_km  <chr> "100", "100", "100", "100", "100", "100", "100", "1…
## $ result       <chr> "recovered safely", "parachute failed, both dogs di…
## $ notes_flight <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "no rocket or a…
## $ altitude     <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, NA, 10…
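Here is a small sketch of parse_number on its own. The input strings are made up for the example, and "orbital" is only my guess at the kind of non-numeric text the guard is there to catch:

```r
library(readr)

# parse_number keeps the first number it finds and ignores the rest
altitudes <- parse_number(c("100", "1,000", "$212", "451 km"))
altitudes

# Text with no number in it cannot be parsed: the result is NA plus a warning
no_number <- suppressWarnings(parse_number("orbital"))
no_number
```

That warning is the reason for wrapping the call in case_when with str_detect(altitude_km, "^[0-9]"): values that do not start with a digit never match the condition, so they become NA without being treated as numbers.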

Feeling Cleansed?

And there we have it. Both datasets are now clean and tidy. Therapeutic, isn’t it? If you didn’t already enjoy cleaning and tidying your data (what’s wrong with you?!) then hopefully you can see how the tidyverse and associated packages provide powerful and easy-to-use tools to make the whole process as painless as possible.

What Next?

Stay tuned for my next blog post where I will honour these dogs with a fitting visualisation. In the meantime, check out Duncan Geere’s brilliant visualisation of these heroic canines.