Pandas and Covid-19


This notebook is an example of data analysis and manipulation with Pandas and has been created in Google Colab.

Enjoy it!

import numpy as np
import pandas as pd

The Data

To get some data, I am going to download it from the Data Repository by Johns Hopkins CSSE:

https://github.com/CSSEGISandData/COVID-19

I first remove the folder where I am going to store the data, so I can re-execute these commands without any problems …

!rm -rf ./COVID-19

The dataset is available on GitHub, so I use the git command to get it.

!git clone https://github.com/CSSEGISandData/COVID-19.git
Cloning into 'COVID-19'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 21425 (delta 5), reused 10 (delta 5), pack-reused 21411
Receiving objects: 100% (21425/21425), 90.29 MiB | 33.30 MiB/s, done.
Resolving deltas: 100% (11436/11436), done.

Exploring the data

!ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports | head
total 9544
-rw-r--r-- 1 root root 317177 Apr 21 05:16 04-20-2020.csv
-rw-r--r-- 1 root root      0 Apr 21 05:16 README.md
-rw-r--r-- 1 root root 317954 Apr 21 05:16 04-19-2020.csv
-rw-r--r-- 1 root root 315926 Apr 21 05:16 04-18-2020.csv
-rw-r--r-- 1 root root 314848 Apr 21 05:16 04-17-2020.csv
-rw-r--r-- 1 root root 312551 Apr 21 05:16 04-15-2020.csv
-rw-r--r-- 1 root root 314226 Apr 21 05:16 04-16-2020.csv
-rw-r--r-- 1 root root 305548 Apr 21 05:16 04-12-2020.csv
-rw-r--r-- 1 root root 309742 Apr 21 05:16 04-13-2020.csv

Yes!!!
We have data files …

Perfect. Let’s explore the first dataset generated …

first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")
first.head()
Province/State Country/Region Last Update Confirmed Deaths Recovered
0 Anhui Mainland China 1/22/2020 17:00 1.0 NaN NaN
1 Beijing Mainland China 1/22/2020 17:00 14.0 NaN NaN
2 Chongqing Mainland China 1/22/2020 17:00 6.0 NaN NaN
3 Fujian Mainland China 1/22/2020 17:00 1.0 NaN NaN
4 Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN

And one of the last ones …

last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-18-2020.csv")
last.head()
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 45001.0 Abbeville South Carolina US 2020-04-18 22:32:47 34.223334 -82.461707 15 0 0 15 Abbeville, South Carolina, US
1 22001.0 Acadia Louisiana US 2020-04-18 22:32:47 30.295065 -92.414197 110 7 0 103 Acadia, Louisiana, US
2 51001.0 Accomack Virginia US 2020-04-18 22:32:47 37.767072 -75.632346 33 0 0 33 Accomack, Virginia, US
3 16001.0 Ada Idaho US 2020-04-18 22:32:47 43.452658 -116.241552 593 9 0 584 Ada, Idaho, US
4 19001.0 Adair Iowa US 2020-04-18 22:32:47 41.330756 -94.471059 1 0 0 1 Adair, Iowa, US

Can I concatenate both datasets?

last.query("Country_Region == 'Spain'")
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
3025 NaN NaN NaN Spain 2020-04-18 22:32:28 40.463667 -3.74922 191726 20043 74797 96886 Spain
pd.concat((first, last), axis = 0)
Province/State Country/Region Last Update Confirmed Deaths Recovered FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Active Combined_Key
0 Anhui Mainland China 1/22/2020 17:00 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Beijing Mainland China 1/22/2020 17:00 14.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Chongqing Mainland China 1/22/2020 17:00 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Fujian Mainland China 1/22/2020 17:00 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3048 NaN NaN NaN 418.0 2.0 69.0 NaN NaN NaN West Bank and Gaza 2020-04-18 22:32:28 31.952200 35.233200 347.0 West Bank and Gaza
3049 NaN NaN NaN 6.0 0.0 0.0 NaN NaN NaN Western Sahara 2020-04-18 22:32:28 24.215500 -12.885800 6.0 Western Sahara
3050 NaN NaN NaN 1.0 0.0 0.0 NaN NaN NaN Yemen 2020-04-18 22:32:28 15.552727 48.516388 1.0 Yemen
3051 NaN NaN NaN 57.0 2.0 33.0 NaN NaN NaN Zambia 2020-04-18 22:32:28 -13.133897 27.849332 22.0 Zambia
3052 NaN NaN NaN 25.0 3.0 2.0 NaN NaN NaN Zimbabwe 2020-04-18 22:32:28 -19.015438 29.154857 20.0 Zimbabwe

3091 rows × 15 columns

Oops!!! The column names don’t match :-(

Loading the data into Pandas and cleaning it

import glob
import os

files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
files.sort(key=os.path.getmtime)
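Note that sorting by modification time depends on how git wrote the files during the clone. A more robust alternative (a sketch, using illustrative filenames) is to sort by the date embedded in each filename, which is the same `MM-DD-YYYY` slice the loading loop parses later:

```python
import datetime

# Sketch: sort daily-report filenames by the date embedded in the name
# (MM-DD-YYYY.csv) instead of by modification time. Filenames are illustrative.
files = ["04-20-2020.csv", "01-22-2020.csv", "03-01-2020.csv"]

def report_date(name):
    # The date occupies the 10 characters before the ".csv" extension
    return datetime.datetime.strptime(name[-14:-4], "%m-%d-%Y")

files.sort(key=report_date)
print(files)  # oldest report first
```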

We are going to:

  • Create an empty DataFrame to store all the data
  • Load every daily dataset, unifying the column names so we can concatenate them without any problems
  • Remove extra blank spaces from the country field
  • Enrich the information with the date of the data, in the correct type

data = pd.DataFrame()
for file in files:  
  df = pd.read_csv(file).rename(columns = {'Province/State' : 'State', 
                        "Country/Region" : 'Country',
                        'Province_State' : 'State', 
                        "Country_Region" : 'Country',
                        'Last Update' : 'Last_Update',
                        'Confirmed' : 'ConfirmedAcum',
                        'Deaths' : 'DeathsAcum',
                        'Recovered' : 'RecoveredAcum'})
  df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                 Country = df.Country.str.strip())
  data = pd.concat((data, df), axis = 0)
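As an aside, calling `pd.concat` inside the loop re-copies all accumulated rows on every iteration. An equivalent and usually faster pattern is to collect the frames in a list and concatenate once at the end; a minimal sketch with toy frames standing in for the daily CSVs:

```python
import pandas as pd

# Sketch: collect the per-file frames in a list and concatenate once,
# instead of growing `data` with pd.concat on every iteration.
frames = []
for day, confirmed in [("01-22-2020", 1), ("01-23-2020", 3)]:
    frames.append(pd.DataFrame({"Date": [day], "ConfirmedAcum": [confirmed]}))

data = pd.concat(frames, axis=0, ignore_index=True)
print(len(data))  # 2
```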

I noticed that the country names were a little messy.
Let’s fix it …

data['Country'] = data.Country.replace({'Bahamas, The' : 'Bahamas',
                         'Congo (Brazzaville)' : 'Congo',
                         'Congo (Kinshasa)' : 'Congo',
                         "Cote d'Ivoire" : "Cote d'Ivoire",
                         "Curacao" : "Curaçao",
                         'Czech Republic' : 'Czech Republic (Czechia)',
                         'Czechia' : 'Czech Republic (Czechia)',
                         'Faroe Islands' : 'Faeroe Islands',
                         'Macau' : 'Macao',
                         'Mainland China' : 'China',
                         'Palestine' : 'State of Palestine',
                         'Reunion' : 'Réunion',
                         'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
                         'Sao Tome and Principe' : 'Sao Tome & Principe',
                         'US' : 'United States',
                         'Gambia, The' : 'Gambia',
                         'Hong Kong SAR' : 'Hong Kong',
                         'Korea, South' : 'South Korea',
                         'Macao SAR' : 'Macao',
                         'Taiwan*' : 'Taiwan',
                         'Viet Nam' : 'Vietnam',
                         'West Bank and Gaza' : 'State of Palestine'
                         })
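A quick way to spot names that may still need an entry in this mapping is to look for values with punctuation that rarely appears in a clean country name. A sketch, with illustrative sample values:

```python
import pandas as pd

# Sketch: flag country names containing commas, asterisks or parentheses,
# which in this dataset usually indicate an unnormalized variant.
countries = pd.Series(["Spain", "Korea, South", "Taiwan*", "United States"])

suspicious = countries[countries.str.contains(r"[,*()]", regex=True)]
print(sorted(suspicious))
```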

I’m going to fill in the null values of the ‘State’ and ‘Admin2’ fields so that I can later group the data correctly.

data = data.fillna({'State' : 'NA', 'Admin2' : 'NA'})

Finally, I am going to keep only the columns that interest me.

data = data[['Date', 'Country', 'State', 'Admin2', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum']]

Let’s verify the structure of the dataset …

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98669 entries, 0 to 3080
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           98669 non-null  datetime64[ns]
 1   Country        98669 non-null  object        
 2   State          98669 non-null  object        
 3   Admin2         98669 non-null  object        
 4   ConfirmedAcum  98650 non-null  float64       
 5   DeathsAcum     98228 non-null  float64       
 6   RecoveredAcum  98281 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 6.0+ MB

Wait a second, I think it would be interesting to have a column with the active cases. Let’s create it …

data['ActiveAcum'] = data.ConfirmedAcum  - data.DeathsAcum - data.RecoveredAcum
data.query("Country == 'Spain'").sort_values('Date', ascending = False).head()
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum
3053 2020-04-20 Spain NA NA 200210.0 20852.0 80587.0 98771.0
3044 2020-04-19 Spain NA NA 198674.0 20453.0 77357.0 100864.0
3025 2020-04-18 Spain NA NA 191726.0 20043.0 74797.0 96886.0
3017 2020-04-17 Spain NA NA 190839.0 20002.0 74797.0 96040.0
3013 2020-04-16 Spain NA NA 184948.0 19315.0 74797.0 90836.0

Perfect :-)

Now, I am going to group and summarize the data, because I want to be sure that there is only one row per Date, Country, State and Admin2.

data = data.groupby(["Date", "Country", "State", "Admin2"]).agg("sum").reset_index()
data.head()
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum
0 2020-01-22 China Anhui NA 1.0 0.0 0.0 0.0
1 2020-01-22 China Beijing NA 14.0 0.0 0.0 0.0
2 2020-01-22 China Chongqing NA 6.0 0.0 0.0 0.0
3 2020-01-22 China Fujian NA 1.0 0.0 0.0 0.0
4 2020-01-22 China Gansu NA 0.0 0.0 0.0 0.0
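The groupby guarantees the key is unique, but it can also be checked directly with `DataFrame.duplicated` over the same columns. A self-contained sketch with toy data:

```python
import pandas as pd

# Sketch: verify that (Date, Country, State) identifies each row uniquely.
# Toy data for illustration; a count of 0 means the key is unique.
df = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-22", "2020-01-23"],
    "Country": ["China", "China", "China"],
    "State": ["Anhui", "Beijing", "Anhui"],
})

dupes = df.duplicated(subset=["Date", "Country", "State"]).sum()
print(dupes)  # 0
```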

Daily Cases

I am going to enrich the data by creating new columns with the daily cases.

First I create new columns with the cases from the previous day

data = data.sort_values(['State', 'Country', 'Date']).\
            assign(ConfirmedPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ConfirmedAcum"],
                   DeathsPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["DeathsAcum"],
                   RecoveredPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["RecoveredAcum"],
                   ActivePrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ActiveAcum"],
            ).\
            fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0 })
data.head()
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum ConfirmedPrevious DeathsPrevious RecoveredPrevious ActivePrevious
2599 2020-02-28 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 NaN
2713 2020-02-29 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0
2834 2020-03-01 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0
2961 2020-03-02 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0
3103 2020-03-03 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0

After that, I am going to compute the new fields by subtracting the previous day’s cumulative cases from the current cumulative cases.

data = data.assign(Confirmed = data.ConfirmedAcum -  data.ConfirmedPrevious,
            Deaths = data.DeathsAcum - data.DeathsPrevious,
            Recovered = data.RecoveredAcum - data.RecoveredPrevious,
            Active = data.ActiveAcum - data.ActivePrevious
            )

I no longer need the fields I used for the calculation, so I can drop them.

data = data.drop(['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'], axis = 1)
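The shift-and-subtract steps above can also be written in one pass with `groupby(...).diff()`, which computes the day-over-day change directly. A minimal sketch with toy data:

```python
import pandas as pd

# Sketch: groupby(...).diff() is equivalent to shifting the cumulative
# column and subtracting. fillna recovers the first day's value, matching
# the fillna(..., 0) used for the Previous columns above. Toy data.
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Spain"],
    "ConfirmedAcum": [100.0, 150.0, 180.0],
})

df["Confirmed"] = df.groupby("Country")["ConfirmedAcum"].diff().fillna(df["ConfirmedAcum"])
print(df["Confirmed"].tolist())  # [100.0, 50.0, 30.0]
```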

Does the data look good?

data.query("Country == 'Spain'")
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active
545 2020-02-01 Spain NA NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 NaN
612 2020-02-02 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
679 2020-02-03 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
749 2020-02-04 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
819 2020-02-05 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
83563 2020-04-16 Spain NA NA 184948.0 19315.0 74797.0 90836.0 7304.0 607.0 3944.0 2753.0
86603 2020-04-17 Spain NA NA 190839.0 20002.0 74797.0 96040.0 5891.0 687.0 0.0 5204.0
89647 2020-04-18 Spain NA NA 191726.0 20043.0 74797.0 96886.0 887.0 41.0 0.0 846.0
92699 2020-04-19 Spain NA NA 198674.0 20453.0 77357.0 100864.0 6948.0 410.0 2560.0 3978.0
95770 2020-04-20 Spain NA NA 200210.0 20852.0 80587.0 98771.0 1536.0 399.0 3230.0 -2093.0

80 rows × 12 columns
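Mostly, yes, although note the negative Active value on the last day: daily figures can go negative when the upstream cumulative numbers are corrected. A quick sketch (toy data) for flagging such rows:

```python
import pandas as pd

# Sketch: flag rows where a daily figure went negative, which usually
# signals a correction in the source's cumulative numbers. Toy data.
df = pd.DataFrame({
    "Country": ["Spain", "Spain"],
    "Date": ["2020-04-19", "2020-04-20"],
    "Active": [3978.0, -2093.0],
})

negatives = df.query("Active < 0")
print(len(negatives))  # number of suspicious rows
```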

Data By Country

So far, we have data at 3 geographical levels: Country, State, and a lower level called Admin2.

The problem is that not all countries have this level of detail, so I will create a new dataset with only the country-level data.

data_by_country = data.groupby(["Date", "Country"]).agg("sum").reset_index()
data_by_country = data_by_country.sort_values(['Country', 'Date'])
data_by_country.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 848 to 3045
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           9112 non-null   datetime64[ns]
 1   Country        9112 non-null   object        
 2   ConfirmedAcum  9112 non-null   float64       
 3   DeathsAcum     9112 non-null   float64       
 4   RecoveredAcum  9112 non-null   float64       
 5   ActiveAcum     9112 non-null   float64       
 6   Confirmed      9112 non-null   float64       
 7   Deaths         9112 non-null   float64       
 8   Recovered      9112 non-null   float64       
 9   Active         9112 non-null   float64       
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 783.1+ KB
data_by_country.head()
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active
848 2020-02-24 Afghanistan 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
886 2020-02-25 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
928 2020-02-26 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
977 2020-02-27 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1030 2020-02-28 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
data_by_country[data_by_country.Country == 'United States']
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active
7 2020-01-22 United States 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
22 2020-01-23 United States 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
33 2020-01-24 United States 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
47 2020-01-25 United States 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
62 2020-01-26 United States 5.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
8367 2020-04-16 United States 667801.0 32916.0 54703.0 580182.0 31449.0 4591.0 2607.0 24251.0
8551 2020-04-17 United States 699706.0 36773.0 58545.0 604388.0 31976.0 3857.0 3842.0 24277.0
8735 2020-04-18 United States 732197.0 38664.0 64840.0 628693.0 32491.0 1891.0 6295.0 24305.0
8919 2020-04-19 United States 759086.0 40661.0 70337.0 648088.0 27159.0 1997.0 5497.0 19397.0
9103 2020-04-20 United States 784326.0 42094.0 72329.0 669903.0 25211.0 1433.0 1992.0 21786.0

90 rows × 10 columns

Cases per million inhabitants

We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.

A small internet search leads me to a page that has population data for 2020:

https://www.worldometers.info/world-population/population-by-country/

It seems that this information is protected against automatic download, so I have no choice but to do it manually and upload the data to a GitHub repository:

https://github.com/dvillaj/world-population/

I load the data into Pandas, clean it up, and keep only the population field.

population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", sheet_name="Data")
population.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            235 non-null    object 
 1   Population (2020)  235 non-null    int64  
 2   Yearly Change      235 non-null    float64
 3   Net Change         235 non-null    int64  
 4   Density (P/Km²)    235 non-null    float64
 5   Land Area (Km²)    235 non-null    int64  
 6   Migrants (net)     201 non-null    float64
 7   Fertility Rate     201 non-null    float64
 8   Average Age        201 non-null    float64
 9   Urban Pop %        222 non-null    float64
 10  World Share        235 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB
population = population.rename(columns = {
    'Population (2020)' : 'Population',
    'Yearly Change' : 'Yearly_Change',
    'Net Change' : 'Net_Change',
    'Density (P/Km²)' : 'Density',
    'Land Area (Km²)' : 'Land_Area',
    'Migrants (net)' : 'Migrants',
    'Fertility Rate' : 'Fertility',
    'Average Age' : 'Mean_Age',
    'Urban Pop %' : 'Urban_Pop',
    'World Share' : 'World_Share'
})
population = population[['Country', 'Population']]
population.head()
Country Population
0 Afghanistan 38928346
1 Albania 2877797
2 Algeria 43851044
3 American Samoa 55191
4 Andorra 77265

Now I join the population dataset with the country data, so this dataset includes the population.

data_by_country = data_by_country.merge(population, how = 'left', on = 'Country')
data_by_country.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 0 to 9111
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           9112 non-null   datetime64[ns]
 1   Country        9112 non-null   object        
 2   ConfirmedAcum  9112 non-null   float64       
 3   DeathsAcum     9112 non-null   float64       
 4   RecoveredAcum  9112 non-null   float64       
 5   ActiveAcum     9112 non-null   float64       
 6   Confirmed      9112 non-null   float64       
 7   Deaths         9112 non-null   float64       
 8   Recovered      9112 non-null   float64       
 9   Active         9112 non-null   float64       
 10  Population     8783 non-null   float64       
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 854.2+ KB
data_by_country.head()
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population
0 2020-02-24 Afghanistan 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 38928346.0
1 2020-02-25 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
2 2020-02-26 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
3 2020-02-27 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
4 2020-02-28 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
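The `info()` above shows some null `Population` values, i.e. countries that did not find a match in the population table. One way to list them is the merge’s `indicator` flag; a sketch with toy frames (the unmatched name below is illustrative):

```python
import pandas as pd

# Sketch: indicator=True adds a _merge column telling which side each row
# came from, so "left_only" rows are countries without population data.
cases = pd.DataFrame({"Country": ["Spain", "Diamond Princess"]})
population = pd.DataFrame({"Country": ["Spain"], "Population": [46754778]})

merged = cases.merge(population, how="left", on="Country", indicator=True)
unmatched = merged.loc[merged._merge == "left_only", "Country"].tolist()
print(unmatched)
```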

And finally I calculate the number of cases per million inhabitants

data_by_country = data_by_country.assign(ConfirmedAcum_Millon = data_by_country.ConfirmedAcum / data_by_country.Population * 1000000)
data_by_country.head()
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population ConfirmedAcum_Millon
0 2020-02-24 Afghanistan 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 38928346.0 0.025688
1 2020-02-25 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688
2 2020-02-26 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688
3 2020-02-27 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688
4 2020-02-28 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688

Rankings

I’m going to create a dataset of last day’s cases.

The goal is to get a set of rankings that tell me the countries with the most cases

So I need a variable that contains the last date of the dataset

last_day = data_by_country.Date.max()
last_day
Timestamp('2020-04-20 00:00:00')

Now I can filter the data by this date

last_day_data = data_by_country[data_by_country.Date == last_day]
last_day_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 56 to 9104
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  184 non-null    datetime64[ns]
 1   Country               184 non-null    object        
 2   ConfirmedAcum         184 non-null    float64       
 3   DeathsAcum            184 non-null    float64       
 4   RecoveredAcum         184 non-null    float64       
 5   ActiveAcum            184 non-null    float64       
 6   Confirmed             184 non-null    float64       
 7   Deaths                184 non-null    float64       
 8   Recovered             184 non-null    float64       
 9   Active                184 non-null    float64       
 10  Population            178 non-null    float64       
 11  ConfirmedAcum_Millon  178 non-null    float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 18.7+ KB
last_day_data
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population ConfirmedAcum_Millon
56 2020-04-20 Afghanistan 1026.0 36.0 135.0 855.0 30.0 3.0 4.0 23.0 38928346.0 26.356116
99 2020-04-20 Albania 584.0 26.0 327.0 231.0 22.0 0.0 13.0 9.0 2877797.0 202.933007
155 2020-04-20 Algeria 2718.0 384.0 1099.0 1235.0 89.0 9.0 52.0 28.0 43851044.0 61.982561
205 2020-04-20 Andorra 717.0 37.0 248.0 432.0 4.0 1.0 13.0 -10.0 77265.0 9279.751505
237 2020-04-20 Angola 24.0 2.0 6.0 16.0 0.0 0.0 0.0 0.0 32866272.0 0.730232
... ... ... ... ... ... ... ... ... ... ... ... ...
9011 2020-04-20 Vietnam 268.0 0.0 214.0 54.0 0.0 0.0 12.0 -12.0 97338579.0 2.753276
9027 2020-04-20 Western Sahara 6.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 597339.0 10.044548
9038 2020-04-20 Yemen 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 29825964.0 0.033528
9072 2020-04-20 Zambia 65.0 3.0 35.0 27.0 4.0 0.0 2.0 2.0 18383955.0 3.535692
9104 2020-04-20 Zimbabwe 25.0 3.0 2.0 20.0 0.0 0.0 0.0 0.0 14862924.0 1.682038

184 rows × 12 columns

Let’s assign new columns with the most interesting rankings …

last_day_data = last_day_data.assign(
    Rank_ConfirmedAcum = last_day_data.ConfirmedAcum.rank(),
    Rank_Confirmed = last_day_data.Confirmed.rank(),
    Rank_ActiveAcum = last_day_data.ActiveAcum.rank(),
    Rank_Active = last_day_data.Active.rank(),
    Rank_ConfirmedAcum_Millon = last_day_data.ConfirmedAcum_Millon.rank()
)
last_day_data
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population ConfirmedAcum_Millon Rank_ConfirmedAcum Rank_Confirmed Rank_ActiveAcum Rank_Active Rank_ConfirmedAcum_Millon
56 2020-04-20 Afghanistan 1026.0 36.0 135.0 855.0 30.0 3.0 4.0 23.0 38928346.0 26.356116 105.0 113.0 113.0 132.0 56.0
99 2020-04-20 Albania 584.0 26.0 327.0 231.0 22.0 0.0 13.0 9.0 2877797.0 202.933007 91.0 107.0 81.0 117.5 108.0
155 2020-04-20 Algeria 2718.0 384.0 1099.0 1235.0 89.0 9.0 52.0 28.0 43851044.0 61.982561 129.0 137.0 123.0 135.5 78.0
205 2020-04-20 Andorra 717.0 37.0 248.0 432.0 4.0 1.0 13.0 -10.0 77265.0 9279.751505 98.0 71.0 93.0 31.0 176.0
237 2020-04-20 Angola 24.0 2.0 6.0 16.0 0.0 0.0 0.0 0.0 32866272.0 0.730232 29.5 28.0 27.0 72.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9011 2020-04-20 Vietnam 268.0 0.0 214.0 54.0 0.0 0.0 12.0 -12.0 97338579.0 2.753276 71.0 28.0 52.5 26.5 19.0
9027 2020-04-20 Western Sahara 6.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 597339.0 10.044548 6.0 28.0 10.5 72.0 35.0
9038 2020-04-20 Yemen 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 29825964.0 0.033528 1.0 28.0 3.0 72.0 1.0
9072 2020-04-20 Zambia 65.0 3.0 35.0 27.0 4.0 0.0 2.0 2.0 18383955.0 3.535692 45.5 71.0 40.5 100.0 21.0
9104 2020-04-20 Zimbabwe 25.0 3.0 2.0 20.0 0.0 0.0 0.0 0.0 14862924.0 1.682038 31.0 28.0 32.0 72.0 13.0

184 rows × 17 columns
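A small variation worth knowing: `rank(ascending=False, method='min')` gives rank 1 to the largest value, so a top-N can be selected by rank directly instead of sorting the ranks in reverse. A sketch with a few illustrative values:

```python
import pandas as pd

# Sketch: descending rank assigns 1 to the largest value, so the top
# countries are simply those with the smallest rank. Toy data.
s = pd.Series({"United States": 784326.0, "Spain": 200210.0, "Italy": 181228.0})

ranks = s.rank(ascending=False, method="min")
print(ranks["United States"])  # 1.0
```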

Which countries have the most confirmed cases?

last_day_data.sort_values('Rank_ConfirmedAcum', ascending = False)[['Country', 'ConfirmedAcum']].reset_index(drop = True).head(10)
Country ConfirmedAcum
0 United States 784326.0
1 Spain 200210.0
2 Italy 181228.0
3 France 156480.0
4 Germany 147065.0
5 United Kingdom 125856.0
6 Turkey 90980.0
7 China 83817.0
8 Iran 83505.0
9 Russia 47121.0

Which countries had the most confirmed cases on the last day?

last_day_data.sort_values('Rank_Confirmed', ascending = False)[['Country', 'Confirmed']].reset_index(drop = True).head(10)
Country Confirmed
0 United States 25211.0
1 United Kingdom 4684.0
2 Turkey 4674.0
3 Russia 4268.0
4 France 2383.0
5 Italy 2256.0
6 Brazil 2089.0
7 Canada 2025.0
8 Germany 1881.0
9 Spain 1536.0

Which countries have the most active cases?

last_day_data.sort_values('Rank_ActiveAcum', ascending = False)[['Country', 'ActiveAcum']].reset_index(drop = True).head(10)
Country ActiveAcum
0 United States 669903.0
1 United Kingdom 108860.0
2 Italy 108237.0
3 Spain 98771.0
4 France 98152.0
5 Turkey 75410.0
6 Germany 50703.0
7 Russia 43270.0
8 Netherlands 29502.0
9 Belgium 25260.0

Which countries had the most active cases on the last day?

last_day_data.sort_values('Rank_Active', ascending = False)[['Country', 'Active']].reset_index(drop = True).head(10)
Country Active
0 United States 21786.0
1 United Kingdom 4219.0
2 Russia 4069.0
3 Turkey 3097.0
4 Brazil 1964.0
5 Belarus 1461.0
6 Singapore 1393.0
7 Belgium 1204.0
8 Canada 1167.0
9 Saudi Arabia 1024.0

Which countries have the most confirmed cases per million inhabitants?

last_day_data.sort_values('Rank_ConfirmedAcum_Millon', ascending = False)[['Country', 'ConfirmedAcum_Millon']].reset_index(drop = True).head(20)
Country ConfirmedAcum_Millon
0 San Marino 13615.867496
1 Holy See 11235.955056
2 Andorra 9279.751505
3 Luxembourg 5683.905824
4 Iceland 5195.710974
5 Spain 4282.129198
6 Belgium 3449.896515
7 Switzerland 3228.794972
8 Ireland 3169.841706
9 Italy 2997.395414
10 France 2397.297121
11 Monaco 2395.392692
12 United States 2369.545977
13 Liechtenstein 2124.422996
14 Qatar 2087.778323
15 Portugal 2046.052310
16 Netherlands 1960.213067
17 United Kingdom 1853.931291
18 Germany 1755.288621
19 Austria 1642.721097

What is the total number of cases?

import locale
locale.setlocale(locale.LC_ALL, '')

total = pd.DataFrame(last_day_data[["ConfirmedAcum", "RecoveredAcum", "DeathsAcum"]].aggregate("sum")\
                     .apply(lambda value : locale.format("%d", value, grouping=True)))
total.columns = ["Total Cases"]
total.index = ['Confirmed', 'Recovered', 'Deaths']
display(total.T)

Confirmed Recovered Deaths
Total Cases 2,472,259 645,738 169,986
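By the way, `locale.format` was deprecated (and later removed from Python); a locale-independent way to get the same thousands separators is the `,` format specifier. A sketch with an illustrative value:

```python
# Sketch: f-string thousands separators instead of the locale module.
# The value below is illustrative.
total_confirmed = 2472259

formatted = f"{total_confirmed:,}"
print(formatted)  # 2,472,259
```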

Saving the clean dataset

Finally, I will create an Excel file with the per-country information, now that it is clean and in perfect condition to apply some machine learning algorithms …

data_by_country.to_excel("All_data.xlsx", index = False)
