Pandas and Covid-19


This notebook is an example of data analysis and manipulation with Pandas and has been created in Google Colab.

Enjoy it!

import numpy as np
import pandas as pd

The Data

To get some data, I am going to download it from the Data Repository by Johns Hopkins CSSE:

https://github.com/CSSEGISandData/COVID-19

I first remove the folder where I am going to store the data, so I can re-execute these commands without any problems …

!rm -rf ./COVID-19

The dataset is available on GitHub, so I use the git command to get it.

!git clone https://github.com/CSSEGISandData/COVID-19.git
Cloning into 'COVID-19'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 21425 (delta 5), reused 10 (delta 5), pack-reused 21411
Receiving objects: 100% (21425/21425), 90.29 MiB | 33.30 MiB/s, done.
Resolving deltas: 100% (11436/11436), done.

Exploring the data

!ls -lt ./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports | head
total 9544
-rw-r--r-- 1 root root 317177 Apr 21 05:16 04-20-2020.csv
-rw-r--r-- 1 root root      0 Apr 21 05:16 README.md
-rw-r--r-- 1 root root 317954 Apr 21 05:16 04-19-2020.csv
-rw-r--r-- 1 root root 315926 Apr 21 05:16 04-18-2020.csv
-rw-r--r-- 1 root root 314848 Apr 21 05:16 04-17-2020.csv
-rw-r--r-- 1 root root 312551 Apr 21 05:16 04-15-2020.csv
-rw-r--r-- 1 root root 314226 Apr 21 05:16 04-16-2020.csv
-rw-r--r-- 1 root root 305548 Apr 21 05:16 04-12-2020.csv
-rw-r--r-- 1 root root 309742 Apr 21 05:16 04-13-2020.csv

Yes!!!
We have data files …

Perfect. Let’s explore the first dataset generated …

first = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv")
first.head()
Province/State Country/Region Last Update Confirmed Deaths Recovered
0 Anhui Mainland China 1/22/2020 17:00 1.0 NaN NaN
1 Beijing Mainland China 1/22/2020 17:00 14.0 NaN NaN
2 Chongqing Mainland China 1/22/2020 17:00 6.0 NaN NaN
3 Fujian Mainland China 1/22/2020 17:00 1.0 NaN NaN
4 Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN

And one of the last ones …

last = pd.read_csv("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-18-2020.csv")
last.head()
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 45001.0 Abbeville South Carolina US 2020-04-18 22:32:47 34.223334 -82.461707 15 0 0 15 Abbeville, South Carolina, US
1 22001.0 Acadia Louisiana US 2020-04-18 22:32:47 30.295065 -92.414197 110 7 0 103 Acadia, Louisiana, US
2 51001.0 Accomack Virginia US 2020-04-18 22:32:47 37.767072 -75.632346 33 0 0 33 Accomack, Virginia, US
3 16001.0 Ada Idaho US 2020-04-18 22:32:47 43.452658 -116.241552 593 9 0 584 Ada, Idaho, US
4 19001.0 Adair Iowa US 2020-04-18 22:32:47 41.330756 -94.471059 1 0 0 1 Adair, Iowa, US

Can I concatenate both datasets?

last.query("Country_Region == 'Spain'")
FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
3025 NaN NaN NaN Spain 2020-04-18 22:32:28 40.463667 -3.74922 191726 20043 74797 96886 Spain
pd.concat((first, last), axis = 0)
Province/State Country/Region Last Update Confirmed Deaths Recovered FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Active Combined_Key
0 Anhui Mainland China 1/22/2020 17:00 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Beijing Mainland China 1/22/2020 17:00 14.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Chongqing Mainland China 1/22/2020 17:00 6.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Fujian Mainland China 1/22/2020 17:00 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3048 NaN NaN NaN 418.0 2.0 69.0 NaN NaN NaN West Bank and Gaza 2020-04-18 22:32:28 31.952200 35.233200 347.0 West Bank and Gaza
3049 NaN NaN NaN 6.0 0.0 0.0 NaN NaN NaN Western Sahara 2020-04-18 22:32:28 24.215500 -12.885800 6.0 Western Sahara
3050 NaN NaN NaN 1.0 0.0 0.0 NaN NaN NaN Yemen 2020-04-18 22:32:28 15.552727 48.516388 1.0 Yemen
3051 NaN NaN NaN 57.0 2.0 33.0 NaN NaN NaN Zambia 2020-04-18 22:32:28 -13.133897 27.849332 22.0 Zambia
3052 NaN NaN NaN 25.0 3.0 2.0 NaN NaN NaN Zimbabwe 2020-04-18 22:32:28 -19.015438 29.154857 20.0 Zimbabwe

3091 rows × 15 columns

Oops!!! The column names don’t match :-(

Loading the data into Pandas and cleaning it

import glob
import os

files = glob.glob("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/*.csv")
files.sort(key=os.path.getmtime)
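Note that sorting by modification time depends on how git wrote the files during the clone. A more robust alternative (a sketch, using illustrative filenames) is to sort by the date embedded in each filename, which is the same `MM-DD-YYYY` slice the loading loop parses later:

```python
import datetime

# Sketch: sort daily-report filenames by the date embedded in the name
# (MM-DD-YYYY.csv) instead of by modification time. Filenames are illustrative.
files = ["04-20-2020.csv", "01-22-2020.csv", "03-01-2020.csv"]

def report_date(name):
    # The date occupies the 10 characters before the ".csv" extension
    return datetime.datetime.strptime(name[-14:-4], "%m-%d-%Y")

files.sort(key=report_date)
print(files)  # oldest report first
```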

We are going to:

  • Create an empty DataFrame to store all the data
  • Load every daily dataset, unifying the column names so we can concatenate them without any problems
  • Remove extra blank spaces from the country field
  • Enrich the information with the date of the data, in the correct type

data = pd.DataFrame()
for file in files:  
  df = pd.read_csv(file).rename(columns = {'Province/State' : 'State', 
                        "Country/Region" : 'Country',
                        'Province_State' : 'State', 
                        "Country_Region" : 'Country',
                        'Last Update' : 'Last_Update',
                        'Confirmed' : 'ConfirmedAcum',
                        'Deaths' : 'DeathsAcum',
                        'Recovered' : 'RecoveredAcum'})
  df = df.assign(Date = pd.to_datetime(file[-14:-4], format = '%m-%d-%Y'),
                 Country = df.Country.str.strip())
  data = pd.concat((data, df), axis = 0)
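As an aside, calling `pd.concat` inside the loop re-copies all accumulated rows on every iteration. An equivalent and usually faster pattern is to collect the frames in a list and concatenate once at the end; a minimal sketch with toy frames standing in for the daily CSVs:

```python
import pandas as pd

# Sketch: collect the per-file frames in a list and concatenate once,
# instead of growing `data` with pd.concat on every iteration.
frames = []
for day, confirmed in [("01-22-2020", 1), ("01-23-2020", 3)]:
    frames.append(pd.DataFrame({"Date": [day], "ConfirmedAcum": [confirmed]}))

data = pd.concat(frames, axis=0, ignore_index=True)
print(len(data))  # 2
```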

I noticed that the country names were a little messy.
Let’s fix it …

data['Country'] = data.Country.replace({'Bahamas, The' : 'Bahamas',
                         'Congo (Brazzaville)' : 'Congo',
                         'Congo (Kinshasa)' : 'Congo',
                         "Cote d'Ivoire" : "Cote d'Ivoire",
                         "Curacao" : "Curaçao",
                         'Czech Republic' : 'Czech Republic (Czechia)',
                         'Czechia' : 'Czech Republic (Czechia)',
                         'Faroe Islands' : 'Faeroe Islands',
                         'Macau' : 'Macao',
                         'Mainland China' : 'China',
                         'Palestine' : 'State of Palestine',
                         'Reunion' : 'Réunion',
                         'Saint Kitts and Nevis' : 'Saint Kitts & Nevis',
                         'Sao Tome and Principe' : 'Sao Tome & Principe',
                         'US' : 'United States',
                         'Gambia, The' : 'Gambia',
                         'Hong Kong SAR' : 'Hong Kong',
                         'Korea, South' : 'South Korea',
                         'Macao SAR' : 'Macao',
                         'Taiwan*' : 'Taiwan',
                         'Viet Nam' : 'Vietnam',
                         'West Bank and Gaza' : 'State of Palestine'
                         })
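A quick way to spot names that may still need an entry in this mapping is to look for values with punctuation that rarely appears in a clean country name. A sketch, with illustrative sample values:

```python
import pandas as pd

# Sketch: flag country names containing commas, asterisks or parentheses,
# which in this dataset usually indicate an unnormalized variant.
countries = pd.Series(["Spain", "Korea, South", "Taiwan*", "United States"])

suspicious = countries[countries.str.contains(r"[,*()]", regex=True)]
print(sorted(suspicious))
```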

I’m going to fill in the null values of the ‘State’ and ‘Admin2’ fields so that I can later group the data correctly.

data = data.fillna({'State' : 'NA', 'Admin2' : 'NA'})

Finally, I am going to keep only the columns that interest me.

data = data[['Date', 'Country', 'State', 'Admin2', 'ConfirmedAcum', 'DeathsAcum', 'RecoveredAcum']]

Let’s verify the structure of the dataset …

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98669 entries, 0 to 3080
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           98669 non-null  datetime64[ns]
 1   Country        98669 non-null  object        
 2   State          98669 non-null  object        
 3   Admin2         98669 non-null  object        
 4   ConfirmedAcum  98650 non-null  float64       
 5   DeathsAcum     98228 non-null  float64       
 6   RecoveredAcum  98281 non-null  float64       
dtypes: datetime64[ns](1), float64(3), object(3)
memory usage: 6.0+ MB

Wait a second, I think it would be interesting to have a column with the active cases. Let’s create it …

data['ActiveAcum'] = data.ConfirmedAcum  - data.DeathsAcum - data.RecoveredAcum
data.query("Country == 'Spain'").sort_values('Date', ascending = False).head()
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum
3053 2020-04-20 Spain NA NA 200210.0 20852.0 80587.0 98771.0
3044 2020-04-19 Spain NA NA 198674.0 20453.0 77357.0 100864.0
3025 2020-04-18 Spain NA NA 191726.0 20043.0 74797.0 96886.0
3017 2020-04-17 Spain NA NA 190839.0 20002.0 74797.0 96040.0
3013 2020-04-16 Spain NA NA 184948.0 19315.0 74797.0 90836.0

Perfect :-)

Now, I am going to group and summarize the data, because I want to be sure that there is only one row per Date, Country, State and Admin2.

data = data.groupby(["Date", "Country", "State", "Admin2"]).agg("sum").reset_index()
data.head()
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum
0 2020-01-22 China Anhui NA 1.0 0.0 0.0 0.0
1 2020-01-22 China Beijing NA 14.0 0.0 0.0 0.0
2 2020-01-22 China Chongqing NA 6.0 0.0 0.0 0.0
3 2020-01-22 China Fujian NA 1.0 0.0 0.0 0.0
4 2020-01-22 China Gansu NA 0.0 0.0 0.0 0.0
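The groupby guarantees the key is unique, but it can also be checked directly with `DataFrame.duplicated` over the same columns. A self-contained sketch with toy data:

```python
import pandas as pd

# Sketch: verify that (Date, Country, State) identifies each row uniquely.
# Toy data for illustration; a count of 0 means the key is unique.
df = pd.DataFrame({
    "Date": ["2020-01-22", "2020-01-22", "2020-01-23"],
    "Country": ["China", "China", "China"],
    "State": ["Anhui", "Beijing", "Anhui"],
})

dupes = df.duplicated(subset=["Date", "Country", "State"]).sum()
print(dupes)  # 0
```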

Daily Cases

I am going to enrich the data by creating new columns with the daily cases.

First I create new columns with the cases from the previous day

data = data.sort_values(['State', 'Country', 'Date']).\
            assign(ConfirmedPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ConfirmedAcum"],
                   DeathsPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["DeathsAcum"],
                   RecoveredPrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["RecoveredAcum"],
                   ActivePrevious = data.groupby(['Admin2', 'State', 'Country']).shift(1)["ActiveAcum"],
            ).\
            fillna({ 'ConfirmedPrevious' : 0, 'DeathsPrevious' : 0, 'RecoveredPrevious' : 0 })
data.head()
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum ConfirmedPrevious DeathsPrevious RecoveredPrevious ActivePrevious
2599 2020-02-28 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 NaN
2713 2020-02-29 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0
2834 2020-03-01 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0
2961 2020-03-02 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0
3103 2020-03-03 Canada Montreal, QC NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0

After that, I am going to compute the new fields by subtracting the previous day’s cumulative cases from the current cumulative cases.

data = data.assign(Confirmed = data.ConfirmedAcum -  data.ConfirmedPrevious,
            Deaths = data.DeathsAcum - data.DeathsPrevious,
            Recovered = data.RecoveredAcum - data.RecoveredPrevious,
            Active = data.ActiveAcum - data.ActivePrevious
            )

I no longer need the fields I used for the calculation, so I can drop them.

data = data.drop(['ConfirmedPrevious', 'DeathsPrevious', 'RecoveredPrevious', 'ActivePrevious'], axis = 1)
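The shift-and-subtract steps above can also be written in one pass with `groupby(...).diff()`, which computes the day-over-day change directly. A minimal sketch with toy data:

```python
import pandas as pd

# Sketch: groupby(...).diff() is equivalent to shifting the cumulative
# column and subtracting. fillna recovers the first day's value, matching
# the fillna(..., 0) used for the Previous columns above. Toy data.
df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Spain"],
    "ConfirmedAcum": [100.0, 150.0, 180.0],
})

df["Confirmed"] = df.groupby("Country")["ConfirmedAcum"].diff().fillna(df["ConfirmedAcum"])
print(df["Confirmed"].tolist())  # [100.0, 50.0, 30.0]
```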

Does the data look good?

data.query("Country == 'Spain'")
Date Country State Admin2 ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active
545 2020-02-01 Spain NA NA 1.0 0.0 0.0 1.0 1.0 0.0 0.0 NaN
612 2020-02-02 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
679 2020-02-03 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
749 2020-02-04 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
819 2020-02-05 Spain NA NA 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
83563 2020-04-16 Spain NA NA 184948.0 19315.0 74797.0 90836.0 7304.0 607.0 3944.0 2753.0
86603 2020-04-17 Spain NA NA 190839.0 20002.0 74797.0 96040.0 5891.0 687.0 0.0 5204.0
89647 2020-04-18 Spain NA NA 191726.0 20043.0 74797.0 96886.0 887.0 41.0 0.0 846.0
92699 2020-04-19 Spain NA NA 198674.0 20453.0 77357.0 100864.0 6948.0 410.0 2560.0 3978.0
95770 2020-04-20 Spain NA NA 200210.0 20852.0 80587.0 98771.0 1536.0 399.0 3230.0 -2093.0

80 rows × 12 columns
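Mostly, yes, although note the negative Active value on the last day: daily figures can go negative when the upstream cumulative numbers are corrected. A quick sketch (toy data) for flagging such rows:

```python
import pandas as pd

# Sketch: flag rows where a daily figure went negative, which usually
# signals a correction in the source's cumulative numbers. Toy data.
df = pd.DataFrame({
    "Country": ["Spain", "Spain"],
    "Date": ["2020-04-19", "2020-04-20"],
    "Active": [3978.0, -2093.0],
})

negatives = df.query("Active < 0")
print(len(negatives))  # number of suspicious rows
```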

Data By Country

So far, we have data at 3 geographical levels: Country, State, and a lower level called Admin2.

The problem is that not all countries have this level of detail, so I will create a new dataset with only the country-level data.

data_by_country = data.groupby(["Date", "Country"]).agg("sum").reset_index()
data_by_country = data_by_country.sort_values(['Country', 'Date'])
data_by_country.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 848 to 3045
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           9112 non-null   datetime64[ns]
 1   Country        9112 non-null   object        
 2   ConfirmedAcum  9112 non-null   float64       
 3   DeathsAcum     9112 non-null   float64       
 4   RecoveredAcum  9112 non-null   float64       
 5   ActiveAcum     9112 non-null   float64       
 6   Confirmed      9112 non-null   float64       
 7   Deaths         9112 non-null   float64       
 8   Recovered      9112 non-null   float64       
 9   Active         9112 non-null   float64       
dtypes: datetime64[ns](1), float64(8), object(1)
memory usage: 783.1+ KB
data_by_country.head()
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active
848 2020-02-24 Afghanistan 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
886 2020-02-25 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
928 2020-02-26 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
977 2020-02-27 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1030 2020-02-28 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
data_by_country[data_by_country.Country == 'United States']
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active
7 2020-01-22 United States 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
22 2020-01-23 United States 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
33 2020-01-24 United States 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
47 2020-01-25 United States 2.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
62 2020-01-26 United States 5.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ...
8367 2020-04-16 United States 667801.0 32916.0 54703.0 580182.0 31449.0 4591.0 2607.0 24251.0
8551 2020-04-17 United States 699706.0 36773.0 58545.0 604388.0 31976.0 3857.0 3842.0 24277.0
8735 2020-04-18 United States 732197.0 38664.0 64840.0 628693.0 32491.0 1891.0 6295.0 24305.0
8919 2020-04-19 United States 759086.0 40661.0 70337.0 648088.0 27159.0 1997.0 5497.0 19397.0
9103 2020-04-20 United States 784326.0 42094.0 72329.0 669903.0 25211.0 1433.0 1992.0 21786.0

90 rows × 10 columns

Cases per million inhabitants

We are going to enrich the information with the number of cases per million inhabitants, so we need population data by country.

A small internet search leads me to a page that has population data for 2020:

https://www.worldometers.info/world-population/population-by-country/

It seems that this information is protected against automatic download, so I have no choice but to do it manually and upload the data to a GitHub repository:

https://github.com/dvillaj/world-population/

I load the data into Pandas, clean it up, and keep only the population field.

population = pd.read_excel("https://github.com/dvillaj/world-population/blob/master/data/world-popultation-2020.xlsx?raw=true", sheet_name="Data")
population.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            235 non-null    object 
 1   Population (2020)  235 non-null    int64  
 2   Yearly Change      235 non-null    float64
 3   Net Change         235 non-null    int64  
 4   Density (P/Km²)    235 non-null    float64
 5   Land Area (Km²)    235 non-null    int64  
 6   Migrants (net)     201 non-null    float64
 7   Fertility Rate     201 non-null    float64
 8   Average Age        201 non-null    float64
 9   Urban Pop %        222 non-null    float64
 10  World Share        235 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.3+ KB
population = population.rename(columns = {
    'Population (2020)' : 'Population',
    'Yearly Change' : 'Yearly_Change',
    'Net Change' : 'Net_Change',
    'Density (P/Km²)' : 'Density',
    'Land Area (Km²)' : 'Land_Area',
    'Migrants (net)' : 'Migrants',
    'Fertility Rate' : 'Fertility',
    'Average Age' : 'Mean_Age',
    'Urban Pop %' : 'Urban_Pop',
    'World Share' : 'World_Share'
})
population = population[['Country', 'Population']]
population.head()
Country Population
0 Afghanistan 38928346
1 Albania 2877797
2 Algeria 43851044
3 American Samoa 55191
4 Andorra 77265

Now I join the population dataset with the country data, so this dataset includes the population.

data_by_country = data_by_country.merge(population, how = 'left', on = 'Country')
data_by_country.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9112 entries, 0 to 9111
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           9112 non-null   datetime64[ns]
 1   Country        9112 non-null   object        
 2   ConfirmedAcum  9112 non-null   float64       
 3   DeathsAcum     9112 non-null   float64       
 4   RecoveredAcum  9112 non-null   float64       
 5   ActiveAcum     9112 non-null   float64       
 6   Confirmed      9112 non-null   float64       
 7   Deaths         9112 non-null   float64       
 8   Recovered      9112 non-null   float64       
 9   Active         9112 non-null   float64       
 10  Population     8783 non-null   float64       
dtypes: datetime64[ns](1), float64(9), object(1)
memory usage: 854.2+ KB
data_by_country.head()
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population
0 2020-02-24 Afghanistan 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 38928346.0
1 2020-02-25 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
2 2020-02-26 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
3 2020-02-27 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
4 2020-02-28 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0
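The `info()` above shows some null `Population` values, i.e. countries that did not find a match in the population table. One way to list them is the merge’s `indicator` flag; a sketch with toy frames (the unmatched name below is illustrative):

```python
import pandas as pd

# Sketch: indicator=True adds a _merge column telling which side each row
# came from, so "left_only" rows are countries without population data.
cases = pd.DataFrame({"Country": ["Spain", "Diamond Princess"]})
population = pd.DataFrame({"Country": ["Spain"], "Population": [46754778]})

merged = cases.merge(population, how="left", on="Country", indicator=True)
unmatched = merged.loc[merged._merge == "left_only", "Country"].tolist()
print(unmatched)
```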

And finally I calculate the number of cases per million inhabitants

data_by_country = data_by_country.assign(ConfirmedAcum_Millon = data_by_country.ConfirmedAcum / data_by_country.Population * 1000000)
data_by_country.head()
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population ConfirmedAcum_Millon
0 2020-02-24 Afghanistan 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 38928346.0 0.025688
1 2020-02-25 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688
2 2020-02-26 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688
3 2020-02-27 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688
4 2020-02-28 Afghanistan 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 38928346.0 0.025688

Rankings

I’m going to create a dataset of last day’s cases.

The goal is to get a set of rankings that tell me the countries with the most cases

So I need a variable that contains the last date of the dataset

last_day = data_by_country.Date.max()
last_day
Timestamp('2020-04-20 00:00:00')

Now I can filter the data by this date

last_day_data = data_by_country[data_by_country.Date == last_day]
last_day_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 184 entries, 56 to 9104
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   Date                  184 non-null    datetime64[ns]
 1   Country               184 non-null    object        
 2   ConfirmedAcum         184 non-null    float64       
 3   DeathsAcum            184 non-null    float64       
 4   RecoveredAcum         184 non-null    float64       
 5   ActiveAcum            184 non-null    float64       
 6   Confirmed             184 non-null    float64       
 7   Deaths                184 non-null    float64       
 8   Recovered             184 non-null    float64       
 9   Active                184 non-null    float64       
 10  Population            178 non-null    float64       
 11  ConfirmedAcum_Millon  178 non-null    float64       
dtypes: datetime64[ns](1), float64(10), object(1)
memory usage: 18.7+ KB
last_day_data
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population ConfirmedAcum_Millon
56 2020-04-20 Afghanistan 1026.0 36.0 135.0 855.0 30.0 3.0 4.0 23.0 38928346.0 26.356116
99 2020-04-20 Albania 584.0 26.0 327.0 231.0 22.0 0.0 13.0 9.0 2877797.0 202.933007
155 2020-04-20 Algeria 2718.0 384.0 1099.0 1235.0 89.0 9.0 52.0 28.0 43851044.0 61.982561
205 2020-04-20 Andorra 717.0 37.0 248.0 432.0 4.0 1.0 13.0 -10.0 77265.0 9279.751505
237 2020-04-20 Angola 24.0 2.0 6.0 16.0 0.0 0.0 0.0 0.0 32866272.0 0.730232
... ... ... ... ... ... ... ... ... ... ... ... ...
9011 2020-04-20 Vietnam 268.0 0.0 214.0 54.0 0.0 0.0 12.0 -12.0 97338579.0 2.753276
9027 2020-04-20 Western Sahara 6.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 597339.0 10.044548
9038 2020-04-20 Yemen 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 29825964.0 0.033528
9072 2020-04-20 Zambia 65.0 3.0 35.0 27.0 4.0 0.0 2.0 2.0 18383955.0 3.535692
9104 2020-04-20 Zimbabwe 25.0 3.0 2.0 20.0 0.0 0.0 0.0 0.0 14862924.0 1.682038

184 rows × 12 columns

Let’s assign new columns with the most interesting rankings …

last_day_data = last_day_data.assign(
    Rank_ConfirmedAcum = last_day_data.ConfirmedAcum.rank(),
    Rank_Confirmed = last_day_data.Confirmed.rank(),
    Rank_ActiveAcum = last_day_data.ActiveAcum.rank(),
    Rank_Active = last_day_data.Active.rank(),
    Rank_ConfirmedAcum_Millon = last_day_data.ConfirmedAcum_Millon.rank()
)
last_day_data
Date Country ConfirmedAcum DeathsAcum RecoveredAcum ActiveAcum Confirmed Deaths Recovered Active Population ConfirmedAcum_Millon Rank_ConfirmedAcum Rank_Confirmed Rank_ActiveAcum Rank_Active Rank_ConfirmedAcum_Millon
56 2020-04-20 Afghanistan 1026.0 36.0 135.0 855.0 30.0 3.0 4.0 23.0 38928346.0 26.356116 105.0 113.0 113.0 132.0 56.0
99 2020-04-20 Albania 584.0 26.0 327.0 231.0 22.0 0.0 13.0 9.0 2877797.0 202.933007 91.0 107.0 81.0 117.5 108.0
155 2020-04-20 Algeria 2718.0 384.0 1099.0 1235.0 89.0 9.0 52.0 28.0 43851044.0 61.982561 129.0 137.0 123.0 135.5 78.0
205 2020-04-20 Andorra 717.0 37.0 248.0 432.0 4.0 1.0 13.0 -10.0 77265.0 9279.751505 98.0 71.0 93.0 31.0 176.0
237 2020-04-20 Angola 24.0 2.0 6.0 16.0 0.0 0.0 0.0 0.0 32866272.0 0.730232 29.5 28.0 27.0 72.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9011 2020-04-20 Vietnam 268.0 0.0 214.0 54.0 0.0 0.0 12.0 -12.0 97338579.0 2.753276 71.0 28.0 52.5 26.5 19.0
9027 2020-04-20 Western Sahara 6.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 597339.0 10.044548 6.0 28.0 10.5 72.0 35.0
9038 2020-04-20 Yemen 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 29825964.0 0.033528 1.0 28.0 3.0 72.0 1.0
9072 2020-04-20 Zambia 65.0 3.0 35.0 27.0 4.0 0.0 2.0 2.0 18383955.0 3.535692 45.5 71.0 40.5 100.0 21.0
9104 2020-04-20 Zimbabwe 25.0 3.0 2.0 20.0 0.0 0.0 0.0 0.0 14862924.0 1.682038 31.0 28.0 32.0 72.0 13.0

184 rows × 17 columns
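A small variation worth knowing: `rank(ascending=False, method='min')` gives rank 1 to the largest value, so a top-N can be selected by rank directly instead of sorting the ranks in reverse. A sketch with a few illustrative values:

```python
import pandas as pd

# Sketch: descending rank assigns 1 to the largest value, so the top
# countries are simply those with the smallest rank. Toy data.
s = pd.Series({"United States": 784326.0, "Spain": 200210.0, "Italy": 181228.0})

ranks = s.rank(ascending=False, method="min")
print(ranks["United States"])  # 1.0
```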

Which countries have the most confirmed cases?

last_day_data.sort_values('Rank_ConfirmedAcum', ascending = False)[['Country', 'ConfirmedAcum']].reset_index(drop = True).head(10)
Country ConfirmedAcum
0 United States 784326.0
1 Spain 200210.0
2 Italy 181228.0
3 France 156480.0
4 Germany 147065.0
5 United Kingdom 125856.0
6 Turkey 90980.0
7 China 83817.0
8 Iran 83505.0
9 Russia 47121.0

Which countries had the most confirmed cases on the last day?

last_day_data.sort_values('Rank_Confirmed', ascending = False)[['Country', 'Confirmed']].reset_index(drop = True).head(10)
Country Confirmed
0 United States 25211.0
1 United Kingdom 4684.0
2 Turkey 4674.0
3 Russia 4268.0
4 France 2383.0
5 Italy 2256.0
6 Brazil 2089.0
7 Canada 2025.0
8 Germany 1881.0
9 Spain 1536.0

Which countries have the most active cases?

last_day_data.sort_values('Rank_ActiveAcum', ascending = False)[['Country', 'ActiveAcum']].reset_index(drop = True).head(10)
Country ActiveAcum
0 United States 669903.0
1 United Kingdom 108860.0
2 Italy 108237.0
3 Spain 98771.0
4 France 98152.0
5 Turkey 75410.0
6 Germany 50703.0
7 Russia 43270.0
8 Netherlands 29502.0
9 Belgium 25260.0

Which countries had the most active cases on the last day?

last_day_data.sort_values('Rank_Active', ascending = False)[['Country', 'Active']].reset_index(drop = True).head(10)
Country Active
0 United States 21786.0
1 United Kingdom 4219.0
2 Russia 4069.0
3 Turkey 3097.0
4 Brazil 1964.0
5 Belarus 1461.0
6 Singapore 1393.0
7 Belgium 1204.0
8 Canada 1167.0
9 Saudi Arabia 1024.0

Which countries have the most confirmed cases per million inhabitants?

last_day_data.sort_values('Rank_ConfirmedAcum_Millon', ascending = False)[['Country', 'ConfirmedAcum_Millon']].reset_index(drop = True).head(20)
Country ConfirmedAcum_Millon
0 San Marino 13615.867496
1 Holy See 11235.955056
2 Andorra 9279.751505
3 Luxembourg 5683.905824
4 Iceland 5195.710974
5 Spain 4282.129198
6 Belgium 3449.896515
7 Switzerland 3228.794972
8 Ireland 3169.841706
9 Italy 2997.395414
10 France 2397.297121
11 Monaco 2395.392692
12 United States 2369.545977
13 Liechtenstein 2124.422996
14 Qatar 2087.778323
15 Portugal 2046.052310
16 Netherlands 1960.213067
17 United Kingdom 1853.931291
18 Germany 1755.288621
19 Austria 1642.721097

What is the total number of cases?

import locale
locale.setlocale(locale.LC_ALL, '')

total = pd.DataFrame(last_day_data[["ConfirmedAcum", "RecoveredAcum", "DeathsAcum"]].aggregate("sum")\
                     .apply(lambda value : locale.format("%d", value, grouping=True)))
total.columns = ["Total Cases"]
total.index = ['Confirmed', 'Recovered', 'Deaths']
display(total.T)

Confirmed Recovered Deaths
Total Cases 2,472,259 645,738 169,986
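By the way, `locale.format` was deprecated (and later removed from Python); a locale-independent way to get the same thousands separators is the `,` format specifier. A sketch with an illustrative value:

```python
# Sketch: f-string thousands separators instead of the locale module.
# The value below is illustrative.
total_confirmed = 2472259

formatted = f"{total_confirmed:,}"
print(formatted)  # 2,472,259
```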

Saving the clean dataset

Finally, I will create an Excel file with the per-country information, now that it is clean and in perfect condition to apply some machine learning algorithms …

data_by_country.to_excel("All_data.xlsx", index = False)
