This notebook is an example of data analysis and manipulation with Pandas and has been created in Google Colab.
Enjoy it!
import numpy as np
import pandas as pd
The Data
To get some data, I am going to download it from the Data Repository by Johns Hopkins CSSE:
https://github.com/CSSEGISandData/COVID-19
I first remove the folder where I am going to store the data so I can re-execute these cells without any problems …
!rm -rf ./COVID-19
The dataset is available on GitHub, so I use the git command to get it
!git clone https://github.com/CSSEGISandData/COVID-19.git
Cloning into 'COVID-19'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 21425 (delta 5), reused 10 (delta 5), pack-reused 21411
Receiving objects: 100% (21425/21425), 90.29 MiB | 33.30 MiB/s, done.
Resolving deltas: 100% (11436/11436), done.
Exploring the data
!ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports | head
total 9544
-rw-r--r-- 1 root root 317177 Apr 21 05:16 04-20-2020.csv
-rw-r--r-- 1 root root 0 Apr 21 05:16 README.md
-rw-r--r-- 1 root root 317954 Apr 21 05:16 04-19-2020.csv
-rw-r--r-- 1 root root 315926 Apr 21 05:16 04-18-2020.csv
-rw-r--r-- 1 root root 314848 Apr 21 05:16 04-17-2020.csv
-rw-r--r-- 1 root root 312551 Apr 21 05:16 04-15-2020.csv
-rw-r--r-- 1 root root 314226 Apr 21 05:16 04-16-2020.csv
-rw-r--r-- 1 root root 305548 Apr 21 05:16 04-12-2020.csv
-rw-r--r-- 1 root root 309742 Apr 21 05:16 04-13-2020.csv
Yes!!!
We have data files …
Perfect. Let’s explore the first dataset generated …
first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")
first.head()
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
---|---|---|---|---|---|---|
0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | NaN | NaN |
2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | NaN | NaN |
3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN |
4 | Gansu | Mainland China | 1/22/2020 17:00 | NaN | NaN | NaN |
And one of the last ones …
last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-18-2020.csv")
last.head()
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45001.0 | Abbeville | South Carolina | US | 2020-04-18 22:32:47 | 34.223334 | -82.461707 | 15 | 0 | 0 | 15 | Abbeville, South Carolina, US |
1 | 22001.0 | Acadia | Louisiana | US | 2020-04-18 22:32:47 | 30.295065 | -92.414197 | 110 | 7 | 0 | 103 | Acadia, Louisiana, US |
2 | 51001.0 | Accomack | Virginia | US | 2020-04-18 22:32:47 | 37.767072 | -75.632346 | 33 | 0 | 0 | 33 | Accomack, Virginia, US |
3 | 16001.0 | Ada | Idaho | US | 2020-04-18 22:32:47 | 43.452658 | -116.241552 | 593 | 9 | 0 | 584 | Ada, Idaho, US |
4 | 19001.0 | Adair | Iowa | US | 2020-04-18 22:32:47 | 41.330756 | -94.471059 | 1 | 0 | 0 | 1 | Adair, Iowa, US |
Can I concatenate both datasets?
last.query("Country_Region == 'Spain'")
| | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Confirmed | Deaths | Recovered | Active | Combined_Key |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3025 | NaN | NaN | NaN | Spain | 2020-04-18 22:32:28 | 40.463667 | -3.74922 | 191726 | 20043 | 74797 | 96886 | Spain |
pd.concat((first, last), axis = 0)
| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered | FIPS | Admin2 | Province_State | Country_Region | Last_Update | Lat | Long_ | Active | Combined_Key |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Gansu | Mainland China | 1/22/2020 17:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3048 | NaN | NaN | NaN | 418.0 | 2.0 | 69.0 | NaN | NaN | NaN | West Bank and Gaza | 2020-04-18 22:32:28 | 31.952200 | 35.233200 | 347.0 | West Bank and Gaza |
3049 | NaN | NaN | NaN | 6.0 | 0.0 | 0.0 | NaN | NaN | NaN | Western Sahara | 2020-04-18 22:32:28 | 24.215500 | -12.885800 | 6.0 | Western Sahara |
3050 | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | Yemen | 2020-04-18 22:32:28 | 15.552727 | 48.516388 | 1.0 | Yemen |
3051 | NaN | NaN | NaN | 57.0 | 2.0 | 33.0 | NaN | NaN | NaN | Zambia | 2020-04-18 22:32:28 | -13.133897 | 27.849332 | 22.0 | Zambia |
3052 | NaN | NaN | NaN | 25.0 | 3.0 | 2.0 | NaN | NaN | NaN | Zimbabwe | 2020-04-18 22:32:28 | -19.015438 | 29.154857 | 20.0 | Zimbabwe |
3091 rows × 15 columns
Oops!!! The column names don’t match :-(
Loading the data into Pandas and cleaning it
import glob
import os
files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
files.sort(key=os.path.getmtime)
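Note that a fresh clone gives every file practically the same modification time (the ls output above shows identical timestamps with the dates out of order), so sorting by mtime is fragile. A more robust alternative, assuming the MM-DD-YYYY.csv naming convention holds for every file, is to sort by the date embedded in the file name:
from datetime import datetime

# Sort by the MM-DD-YYYY date encoded in each file name instead of by mtime
files.sort(key=lambda f: datetime.strptime(os.path.basename(f)[:-4], "%m-%d-%Y"))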
We are going to:
- Create an empty DataFrame to store all the data
- Load every dataset, unifying the column names so we can concatenate them without any problem
- Remove extra blank spaces from the country field
- Enrich the information with the date of the data (parsed from the file name) in the correct type
data = pd.DataFrame()
for file in files:
    df = pd.read_csv(file).rename(columns = {'Province/State' : 'State',
                                             'Country/Region' : 'Country',
                                             'Province_State' : 'State',
                                             'Country_Region' : 'Country',
                                             'Last Update' : 'Last_Update',
                                             'Confirmed' : 'ConfirmedAcum',
                                             'Deaths' : 'DeathsAcum',
                                             'Recovered' : 'RecoveredAcum'})
    df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                   Country = df.Country.str.strip())
    data = pd.concat((data, df), axis = 0)
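A side note: calling pd.concat inside the loop copies the accumulated DataFrame on every iteration, which gets slower as the number of files grows. The usual pattern is to collect the frames in a list and concatenate once at the end. A sketch, where column_map is a name I introduce for the same rename dictionary used above:
frames = []
for file in files:
    df = pd.read_csv(file).rename(columns = column_map)  # column_map: the rename dictionary above
    df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                   Country = df.Country.str.strip())
    frames.append(df)
data = pd.concat(frames, axis = 0)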
I noticed that the country names were a little messy.
Let’s fix them …
data['Country'] = data.Country.replace({'Bahamas, The' : 'Bahamas',
'Congo (Brazzaville)' : 'Congo',
'Congo (Kinshasa)' : 'Congo',
"Cote d'Ivoire" : "Cote d'Ivoire",
"Curacao" : "Curaçao",
'Czech Republic' : 'Czech Republic (Czechia)',
'Czechia' : 'Czech Republic (Czechia)',
'Faroe Islands' : 'Faeroe Islands',
'Macau' : 'Macao',
'Mainland China' : 'China',
'Palestine' : 'State of Palestine',
'Reunion' : 'Réunion',
'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
'Sao Tome and Principe' : 'Sao Tome & Principe',
'US' : 'United States',
'Gambia, The' : 'Gambia',
'Hong Kong SAR' : 'Hong Kong',
'Korea, South' : 'South Korea',
'Macao SAR' : 'Macao',
'Taiwan*' : 'Taiwan',
'Viet Nam' : 'Vietnam',
'West Bank and Gaza' : 'State of Palestine'
})
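To double-check that the mapping caught everything, I can list the distinct country names that remain and scan them by eye:
# Distinct country names after the cleanup, in alphabetical order
sorted(data.Country.unique())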
I’m going to fill in the null values of the ‘State’ and ‘Admin2’ fields so that I can later group the data correctly
data = data.fillna({'State' : 'NA', 'Admin2' : 'NA'})
Finally, I am going to keep only the columns that interest me
data = data[['Date', 'Country', 'State', 'Admin2', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum']]
Let’s verify the structure of the dataset …
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98669 entries, 0 to 3080
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 98669 non-null datetime64[ns]
1 Country 98669 non-null object
2 State 98669 non-null object
3 Admin2 98669 non-null object
4 ConfirmedAcum 98650 non-null float64
5 DeathsAcum 98228 non-null float64
6 RecoveredAcum 98281 non-null float64
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 6.0+ MB
Wait a second, I think it would be interesting to have a column with the active cases. Let’s create it …
data['ActiveAcum'] = data.ConfirmedAcum - data.DeathsAcum - data.RecoveredAcum
data.query("Country == 'Spain'").sort_values('Date', ascending = False).head()
| | Date | Country | State | Admin2 | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum |
---|---|---|---|---|---|---|---|---|
3053 | 2020-04-20 | Spain | NA | NA | 200210.0 | 20852.0 | 80587.0 | 98771.0 |
3044 | 2020-04-19 | Spain | NA | NA | 198674.0 | 20453.0 | 77357.0 | 100864.0 |
3025 | 2020-04-18 | Spain | NA | NA | 191726.0 | 20043.0 | 74797.0 | 96886.0 |
3017 | 2020-04-17 | Spain | NA | NA | 190839.0 | 20002.0 | 74797.0 | 96040.0 |
3013 | 2020-04-16 | Spain | NA | NA | 184948.0 | 19315.0 | 74797.0 | 90836.0 |
Perfect :-)
Now, I am going to group and summarize the data because I want to be sure that there is only one row per Date, Country, State and Admin2
data = data.groupby(["Date", "Country", "State", "Admin2"]).agg("sum").reset_index()
data.head()
| | Date | Country | State | Admin2 | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum |
---|---|---|---|---|---|---|---|---|
0 | 2020-01-22 | China | Anhui | NA | 1.0 | 0.0 | 0.0 | 0.0 |
1 | 2020-01-22 | China | Beijing | NA | 14.0 | 0.0 | 0.0 | 0.0 |
2 | 2020-01-22 | China | Chongqing | NA | 6.0 | 0.0 | 0.0 | 0.0 |
3 | 2020-01-22 | China | Fujian | NA | 1.0 | 0.0 | 0.0 | 0.0 |
4 | 2020-01-22 | China | Gansu | NA | 0.0 | 0.0 | 0.0 | 0.0 |
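A quick sanity check that the grouping really left a single row per combination:
# No duplicated (Date, Country, State, Admin2) keys should remain
assert not data.duplicated(subset = ["Date", "Country", "State", "Admin2"]).any()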
Daily Cases
I am going to enrich the data by creating new columns with the daily cases.
First I create new columns with the cases from the previous day
data = data.sort_values(['State', 'Country', 'Date']).\
    assign(ConfirmedPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ConfirmedAcum"],
           DeathsPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["DeathsAcum"],
           RecoveredPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["RecoveredAcum"],
           ActivePrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ActiveAcum"]).\
    fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0 })
data.head()
| | Date | Country | State | Admin2 | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | ConfirmedPrevious | DeathsPrevious | RecoveredPrevious | ActivePrevious |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2599 | 2020-02-28 | Canada | Montreal, QC | NA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN |
2713 | 2020-02-29 | Canada | Montreal, QC | NA | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
2834 | 2020-03-01 | Canada | Montreal, QC | NA | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
2961 | 2020-03-02 | Canada | Montreal, QC | NA | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
3103 | 2020-03-03 | Canada | Montreal, QC | NA | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
After that, I am going to compute the new fields by subtracting the previous day’s cumulative cases from the current cumulative cases
data = data.assign(Confirmed = data.ConfirmedAcum - data.ConfirmedPrevious,
                   Deaths = data.DeathsAcum - data.DeathsPrevious,
                   Recovered = data.RecoveredAcum - data.RecoveredPrevious,
                   Active = data.ActiveAcum - data.ActivePrevious)
I no longer need the fields I used for the calculation, so I can drop them
data = data.drop(['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'], axis = 1)
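As a side note, pandas can produce the same daily deltas in one step with groupby().diff(), which subtracts each group's previous row directly. A sketch of the equivalent calculation (the names cols, daily and deltas are my own):
cols = ["ConfirmedAcum", "DeathsAcum", "RecoveredAcum", "ActiveAcum"]
daily = data.sort_values(["Country", "State", "Admin2", "Date"])
# diff() subtracts each group's previous row; the first row of every group
# comes out as NaN and would need the same fillna treatment as above
deltas = daily.groupby(["Country", "State", "Admin2"])[cols].diff()
daily[["Confirmed", "Deaths", "Recovered", "Active"]] = deltas.to_numpy()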
Does the data look good? (Note that the daily Active value can be negative when recoveries and deaths grow faster than new confirmed cases.)
data.query("Country == 'Spain'")
| | Date | Country | State | Admin2 | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active |
---|---|---|---|---|---|---|---|---|---|---|---|---|
545 | 2020-02-01 | Spain | NA | NA | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | NaN |
612 | 2020-02-02 | Spain | NA | NA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
679 | 2020-02-03 | Spain | NA | NA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
749 | 2020-02-04 | Spain | NA | NA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
819 | 2020-02-05 | Spain | NA | NA | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
83563 | 2020-04-16 | Spain | NA | NA | 184948.0 | 19315.0 | 74797.0 | 90836.0 | 7304.0 | 607.0 | 3944.0 | 2753.0 |
86603 | 2020-04-17 | Spain | NA | NA | 190839.0 | 20002.0 | 74797.0 | 96040.0 | 5891.0 | 687.0 | 0.0 | 5204.0 |
89647 | 2020-04-18 | Spain | NA | NA | 191726.0 | 20043.0 | 74797.0 | 96886.0 | 887.0 | 41.0 | 0.0 | 846.0 |
92699 | 2020-04-19 | Spain | NA | NA | 198674.0 | 20453.0 | 77357.0 | 100864.0 | 6948.0 | 410.0 | 2560.0 | 3978.0 |
95770 | 2020-04-20 | Spain | NA | NA | 200210.0 | 20852.0 | 80587.0 | 98771.0 | 1536.0 | 399.0 | 3230.0 | -2093.0 |
80 rows × 12 columns
Data By Country
So far, we have data at 3 geographical levels: Country, State and a lower level called Admin2
The problem is that not all countries have this level of detail, so I will create a new dataset with only the country-level data
data_by_country = data.groupby(["Date", "Country"]).agg("sum").reset_index()
data_by_country = data_by_country.sort_values(['Country', 'Date'])
data_by_country.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 848 to 3045
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 9112 non-null datetime64[ns]
1 Country 9112 non-null object
2 ConfirmedAcum 9112 non-null float64
3 DeathsAcum 9112 non-null float64
4 RecoveredAcum 9112 non-null float64
5 ActiveAcum 9112 non-null float64
6 Confirmed 9112 non-null float64
7 Deaths 9112 non-null float64
8 Recovered 9112 non-null float64
9 Active 9112 non-null float64
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 783.1+ KB
data_by_country.head()
| | Date | Country | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active |
---|---|---|---|---|---|---|---|---|---|---|
848 | 2020-02-24 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
886 | 2020-02-25 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
928 | 2020-02-26 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
977 | 2020-02-27 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1030 | 2020-02-28 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
data_by_country[data_by_country.Country == 'United States']
| | Date | Country | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active |
---|---|---|---|---|---|---|---|---|---|---|
7 | 2020-01-22 | United States | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
22 | 2020-01-23 | United States | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
33 | 2020-01-24 | United States | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
47 | 2020-01-25 | United States | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
62 | 2020-01-26 | United States | 5.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8367 | 2020-04-16 | United States | 667801.0 | 32916.0 | 54703.0 | 580182.0 | 31449.0 | 4591.0 | 2607.0 | 24251.0 |
8551 | 2020-04-17 | United States | 699706.0 | 36773.0 | 58545.0 | 604388.0 | 31976.0 | 3857.0 | 3842.0 | 24277.0 |
8735 | 2020-04-18 | United States | 732197.0 | 38664.0 | 64840.0 | 628693.0 | 32491.0 | 1891.0 | 6295.0 | 24305.0 |
8919 | 2020-04-19 | United States | 759086.0 | 40661.0 | 70337.0 | 648088.0 | 27159.0 | 1997.0 | 5497.0 | 19397.0 |
9103 | 2020-04-20 | United States | 784326.0 | 42094.0 | 72329.0 | 669903.0 | 25211.0 | 1433.0 | 1992.0 | 21786.0 |
90 rows × 10 columns
Cases per million inhabitants
We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.
A quick internet search leads me to a page with population data for 2020:
https://www.worldometers.info/world-population/population-by-country/
It seems that this information is protected against automated downloads, so I have no choice but to grab it manually and upload the data to a GitHub repository:
https://github.com/dvillaj/world-population/
I load the data into Pandas, clean it up, and keep only the population field
population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", sheet_name="Data")
population.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 235 non-null object
1 Population (2020) 235 non-null int64
2 Yearly Change 235 non-null float64
3 Net Change 235 non-null int64
4 Density (P/Km²) 235 non-null float64
5 Land Area (Km²) 235 non-null int64
6 Migrants (net) 201 non-null float64
7 Fertility Rate 201 non-null float64
8 Average Age 201 non-null float64
9 Urban Pop % 222 non-null float64
10 World Share 235 non-null float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB
population = population.rename(columns = {
'Population (2020)' : 'Population',
'Yearly Change' : 'Yearly_Change',
'Net Change' : 'Net_Change',
'Density (P/Km²)' : 'Density',
'Land Area (Km²)' : 'Land_Area',
'Migrants (net)' : 'Migrants',
'Fertility Rate' : 'Fertility',
'Average Age' : 'Mean_Age',
'Urban Pop %' : 'Urban_Pop',
'World Share' : 'World_Share'
})
population = population[['Country', 'Population']]
population.head()
| | Country | Population |
---|---|---|
0 | Afghanistan | 38928346 |
1 | Albania | 2877797 |
2 | Algeria | 43851044 |
3 | American Samoa | 55191 |
4 | Andorra | 77265 |
Now I join the population dataset with the country data so that the population is available in this dataset
data_by_country = data_by_country.merge(population, how = 'left', on = 'Country')
data_by_country.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 0 to 9111
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 9112 non-null datetime64[ns]
1 Country 9112 non-null object
2 ConfirmedAcum 9112 non-null float64
3 DeathsAcum 9112 non-null float64
4 RecoveredAcum 9112 non-null float64
5 ActiveAcum 9112 non-null float64
6 Confirmed 9112 non-null float64
7 Deaths 9112 non-null float64
8 Recovered 9112 non-null float64
9 Active 9112 non-null float64
10 Population 8783 non-null float64
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 854.2+ KB
data_by_country.head()
| | Date | Country | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active | Population |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-24 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 38928346.0 |
1 | 2020-02-25 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 |
2 | 2020-02-26 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 |
3 | 2020-02-27 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 |
4 | 2020-02-28 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 |
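The info() output above shows that Population is null for a few hundred rows, meaning some country names found no match in the population table. A quick way to list the offenders (a sketch):
# Countries from the COVID data with no match in the population table
data_by_country[data_by_country.Population.isna()].Country.unique()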
And finally I calculate the number of cases per million inhabitants
data_by_country = data_by_country.assign(ConfirmedAcum_Millon = data_by_country.ConfirmedAcum / data_by_country.Population * 1000000)
data_by_country.head()
| | Date | Country | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active | Population | ConfirmedAcum_Millon |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-02-24 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 38928346.0 | 0.025688 |
1 | 2020-02-25 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 | 0.025688 |
2 | 2020-02-26 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 | 0.025688 |
3 | 2020-02-27 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 | 0.025688 |
4 | 2020-02-28 | Afghanistan | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38928346.0 | 0.025688 |
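The same pattern extends to any other metric. For instance, deaths per million inhabitants could be computed the same way (a sketch, not added to the dataset here):
# Deaths per million inhabitants, following the same formula as above
(data_by_country.DeathsAcum / data_by_country.Population * 1000000).head()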
Rankings
I’m going to create a dataset with the last day’s cases.
The goal is to get a set of rankings that tell me which countries have the most cases.
So I need a variable that contains the last date in the dataset
last_day = data_by_country.Date.max()
last_day
Timestamp('2020-04-20 00:00:00')
Now I can filter the data by this date
last_day_data = data_by_country[data_by_country.Date == last_day]
last_day_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 56 to 9104
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 184 non-null datetime64[ns]
1 Country 184 non-null object
2 ConfirmedAcum 184 non-null float64
3 DeathsAcum 184 non-null float64
4 RecoveredAcum 184 non-null float64
5 ActiveAcum 184 non-null float64
6 Confirmed 184 non-null float64
7 Deaths 184 non-null float64
8 Recovered 184 non-null float64
9 Active 184 non-null float64
10 Population 178 non-null float64
11 ConfirmedAcum_Millon 178 non-null float64
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 18.7+ KB
last_day_data
| | Date | Country | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active | Population | ConfirmedAcum_Millon |
---|---|---|---|---|---|---|---|---|---|---|---|---|
56 | 2020-04-20 | Afghanistan | 1026.0 | 36.0 | 135.0 | 855.0 | 30.0 | 3.0 | 4.0 | 23.0 | 38928346.0 | 26.356116 |
99 | 2020-04-20 | Albania | 584.0 | 26.0 | 327.0 | 231.0 | 22.0 | 0.0 | 13.0 | 9.0 | 2877797.0 | 202.933007 |
155 | 2020-04-20 | Algeria | 2718.0 | 384.0 | 1099.0 | 1235.0 | 89.0 | 9.0 | 52.0 | 28.0 | 43851044.0 | 61.982561 |
205 | 2020-04-20 | Andorra | 717.0 | 37.0 | 248.0 | 432.0 | 4.0 | 1.0 | 13.0 | -10.0 | 77265.0 | 9279.751505 |
237 | 2020-04-20 | Angola | 24.0 | 2.0 | 6.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32866272.0 | 0.730232 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9011 | 2020-04-20 | Vietnam | 268.0 | 0.0 | 214.0 | 54.0 | 0.0 | 0.0 | 12.0 | -12.0 | 97338579.0 | 2.753276 |
9027 | 2020-04-20 | Western Sahara | 6.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 597339.0 | 10.044548 |
9038 | 2020-04-20 | Yemen | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 29825964.0 | 0.033528 |
9072 | 2020-04-20 | Zambia | 65.0 | 3.0 | 35.0 | 27.0 | 4.0 | 0.0 | 2.0 | 2.0 | 18383955.0 | 3.535692 |
9104 | 2020-04-20 | Zimbabwe | 25.0 | 3.0 | 2.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14862924.0 | 1.682038 |
184 rows × 12 columns
Let’s assign new columns with the most interesting rankings …
last_day_data = last_day_data.assign(
    Rank_ConfirmedAcum = last_day_data.ConfirmedAcum.rank(),
    Rank_Confirmed = last_day_data.Confirmed.rank(),
    Rank_ActiveAcum = last_day_data.ActiveAcum.rank(),
    Rank_Active = last_day_data.Active.rank(),
    Rank_ConfirmedAcum_Millon = last_day_data.ConfirmedAcum_Millon.rank()
)
last_day_data
| | Date | Country | ConfirmedAcum | DeathsAcum | RecoveredAcum | ActiveAcum | Confirmed | Deaths | Recovered | Active | Population | ConfirmedAcum_Millon | Rank_ConfirmedAcum | Rank_Confirmed | Rank_ActiveAcum | Rank_Active | Rank_ConfirmedAcum_Millon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
56 | 2020-04-20 | Afghanistan | 1026.0 | 36.0 | 135.0 | 855.0 | 30.0 | 3.0 | 4.0 | 23.0 | 38928346.0 | 26.356116 | 105.0 | 113.0 | 113.0 | 132.0 | 56.0 |
99 | 2020-04-20 | Albania | 584.0 | 26.0 | 327.0 | 231.0 | 22.0 | 0.0 | 13.0 | 9.0 | 2877797.0 | 202.933007 | 91.0 | 107.0 | 81.0 | 117.5 | 108.0 |
155 | 2020-04-20 | Algeria | 2718.0 | 384.0 | 1099.0 | 1235.0 | 89.0 | 9.0 | 52.0 | 28.0 | 43851044.0 | 61.982561 | 129.0 | 137.0 | 123.0 | 135.5 | 78.0 |
205 | 2020-04-20 | Andorra | 717.0 | 37.0 | 248.0 | 432.0 | 4.0 | 1.0 | 13.0 | -10.0 | 77265.0 | 9279.751505 | 98.0 | 71.0 | 93.0 | 31.0 | 176.0 |
237 | 2020-04-20 | Angola | 24.0 | 2.0 | 6.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32866272.0 | 0.730232 | 29.5 | 28.0 | 27.0 | 72.0 | 4.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9011 | 2020-04-20 | Vietnam | 268.0 | 0.0 | 214.0 | 54.0 | 0.0 | 0.0 | 12.0 | -12.0 | 97338579.0 | 2.753276 | 71.0 | 28.0 | 52.5 | 26.5 | 19.0 |
9027 | 2020-04-20 | Western Sahara | 6.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 597339.0 | 10.044548 | 6.0 | 28.0 | 10.5 | 72.0 | 35.0 |
9038 | 2020-04-20 | Yemen | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 29825964.0 | 0.033528 | 1.0 | 28.0 | 3.0 | 72.0 | 1.0 |
9072 | 2020-04-20 | Zambia | 65.0 | 3.0 | 35.0 | 27.0 | 4.0 | 0.0 | 2.0 | 2.0 | 18383955.0 | 3.535692 | 45.5 | 71.0 | 40.5 | 100.0 | 21.0 |
9104 | 2020-04-20 | Zimbabwe | 25.0 | 3.0 | 2.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14862924.0 | 1.682038 | 31.0 | 28.0 | 32.0 | 72.0 | 13.0 |
184 rows × 17 columns
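Note that Series.rank() defaults to ascending order with method='average', which is why ties show up as fractional ranks like 117.5 above. If I preferred rank 1 to mean "most cases", I could rank descending with method='min'; a sketch:
# Rank 1 = country with the most cumulative confirmed cases; ties share the lowest rank
last_day_data.ConfirmedAcum.rank(ascending = False, method = 'min')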
Which countries have the most confirmed cases?
last_day_data.sort_values('Rank_ConfirmedAcum', ascending = False)[['Country', 'ConfirmedAcum']].reset_index(drop = True).head(10)
| | Country | ConfirmedAcum |
---|---|---|
0 | United States | 784326.0 |
1 | Spain | 200210.0 |
2 | Italy | 181228.0 |
3 | France | 156480.0 |
4 | Germany | 147065.0 |
5 | United Kingdom | 125856.0 |
6 | Turkey | 90980.0 |
7 | China | 83817.0 |
8 | Iran | 83505.0 |
9 | Russia | 47121.0 |
Which countries had the most new confirmed cases on the last day?
last_day_data.sort_values('Rank_Confirmed', ascending = False)[['Country', 'Confirmed']].reset_index(drop = True).head(10)
| | Country | Confirmed |
---|---|---|
0 | United States | 25211.0 |
1 | United Kingdom | 4684.0 |
2 | Turkey | 4674.0 |
3 | Russia | 4268.0 |
4 | France | 2383.0 |
5 | Italy | 2256.0 |
6 | Brazil | 2089.0 |
7 | Canada | 2025.0 |
8 | Germany | 1881.0 |
9 | Spain | 1536.0 |
Which countries have the most active cases?
last_day_data.sort_values('Rank_ActiveAcum', ascending = False)[['Country', 'ActiveAcum']].reset_index(drop = True).head(10)
| | Country | ActiveAcum |
---|---|---|
0 | United States | 669903.0 |
1 | United Kingdom | 108860.0 |
2 | Italy | 108237.0 |
3 | Spain | 98771.0 |
4 | France | 98152.0 |
5 | Turkey | 75410.0 |
6 | Germany | 50703.0 |
7 | Russia | 43270.0 |
8 | Netherlands | 29502.0 |
9 | Belgium | 25260.0 |
Which countries had the largest increase in active cases on the last day?
last_day_data.sort_values('Rank_Active', ascending = False)[['Country', 'Active']].reset_index(drop = True).head(10)
| | Country | Active |
---|---|---|
0 | United States | 21786.0 |
1 | United Kingdom | 4219.0 |
2 | Russia | 4069.0 |
3 | Turkey | 3097.0 |
4 | Brazil | 1964.0 |
5 | Belarus | 1461.0 |
6 | Singapore | 1393.0 |
7 | Belgium | 1204.0 |
8 | Canada | 1167.0 |
9 | Saudi Arabia | 1024.0 |
Which countries have the most confirmed cases per million inhabitants?
last_day_data.sort_values('Rank_ConfirmedAcum_Millon', ascending = False)[['Country', 'ConfirmedAcum_Millon']].reset_index(drop = True).head(20)
| | Country | ConfirmedAcum_Millon |
---|---|---|
0 | San Marino | 13615.867496 |
1 | Holy See | 11235.955056 |
2 | Andorra | 9279.751505 |
3 | Luxembourg | 5683.905824 |
4 | Iceland | 5195.710974 |
5 | Spain | 4282.129198 |
6 | Belgium | 3449.896515 |
7 | Switzerland | 3228.794972 |
8 | Ireland | 3169.841706 |
9 | Italy | 2997.395414 |
10 | France | 2397.297121 |
11 | Monaco | 2395.392692 |
12 | United States | 2369.545977 |
13 | Liechtenstein | 2124.422996 |
14 | Qatar | 2087.778323 |
15 | Portugal | 2046.052310 |
16 | Netherlands | 1960.213067 |
17 | United Kingdom | 1853.931291 |
18 | Germany | 1755.288621 |
19 | Austria | 1642.721097 |
What is the total number of cases?
import locale
locale.setlocale(locale.LC_ALL, '')
# locale.format_string replaces the deprecated locale.format
total = pd.DataFrame(last_day_data[["ConfirmedAcum", "RecoveredAcum", "DeathsAcum"]].aggregate("sum")\
    .apply(lambda value : locale.format_string("%d", value, grouping=True)))
total.columns = ["Total Cases"]
total.index = ['Confirmed', 'Recovered', 'Deaths']  # labels must follow the column order selected above
display(total.T)
display(total.T)
| | Confirmed | Recovered | Deaths |
---|---|---|---|
Total Cases | 2,472,259 | 645,738 | 169,986 |
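As an alternative, plain f-strings with the ',' format specifier give the same thousands grouping without touching the locale at all. A sketch:
total = last_day_data[["ConfirmedAcum", "RecoveredAcum", "DeathsAcum"]].sum()
formatted = total.apply(lambda value: f"{int(value):,}")  # e.g. 2472259 -> '2,472,259'
formatted.index = ["Confirmed", "Recovered", "Deaths"]
formatted.to_frame("Total Cases").T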
Saving the clean dataset
Finally, I will create an Excel file with the per-country information, now that it is clean and in perfect condition to apply some machine learning algorithms …
data_by_country.to_excel("All_data.xlsx", index = False)