Data Retrieval ¶

Utility ¶

covid19_inference.data_retrieval.retrieval.set_data_dir(fname=None, permissions=None)[source]¶: Sets the global variable _data_dir. Newly downloaded data is placed there. If no argument is provided, the default tmp directory is tried. If permissions are not provided, defaults are used when fname is inside the user folder; otherwise the function tries to set 777.
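
The directory selection described for set_data_dir can be pictured with a small sketch. This is not the library's implementation: the helper name choose_data_dir and the folder name covid19_data are hypothetical, and only the fallback to the system temp directory and the optional permissions argument are taken from the description.

```python
import os
import tempfile


def choose_data_dir(fname=None, permissions=None):
    """Hypothetical stand-in for set_data_dir (illustration only)."""
    if fname is None:
        # No argument: fall back to a folder inside the system temp directory.
        # The folder name "covid19_data" is an assumption made for this sketch.
        fname = os.path.join(tempfile.gettempdir(), "covid19_data")
    os.makedirs(fname, exist_ok=True)
    if permissions is not None:
        # Apply explicit permissions only when the caller asks for them.
        os.chmod(fname, permissions)
    return fname


data_dir = choose_data_dir()
```

Passing an explicit path, e.g. choose_data_dir("/path/to/data", 0o755), would skip the temp directory fallback.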

covid19_inference.data_retrieval.retrieval.backup_instances(trace=None, model=None, fname='latest_')[source]¶: Helper to save or load trace and model instances. Loads them from fname if the provided trace and model are None; otherwise saves them there.
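
The save-or-load pattern behind backup_instances might look roughly like the sketch below. It is an illustration under assumptions: the function name backup, the use of pickle and the file suffix are hypothetical, and plain Python objects stand in for a real trace and model.

```python
import os
import pickle
import tempfile


def backup(trace=None, model=None, fname="latest_"):
    """Hypothetical save-or-load helper (illustration only)."""
    path = fname + "backup.pickle"  # the file suffix is an assumption
    if trace is None and model is None:
        # Nothing passed in: load the previously saved pair.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Something passed in: save it and hand it back unchanged.
    with open(path, "wb") as f:
        pickle.dump((trace, model), f)
    return trace, model


prefix = os.path.join(tempfile.mkdtemp(), "latest_")
backup(trace=[0.1, 0.2], model={"name": "demo"}, fname=prefix)
loaded_trace, loaded_model = backup(fname=prefix)
```

The same call signature serves both directions, so a script can save after sampling and reload in a later session by omitting the objects.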

Johns Hopkins University ¶

class covid19_inference.data_retrieval.JHU(auto_download=False)[source]¶

This class can be used to retrieve and filter the dataset from the online repository of the coronavirus visual dashboard operated by the Johns Hopkins University.

Features

download all files from the online repository of the coronavirus visual dashboard operated by the Johns Hopkins University.
filter by deaths, confirmed cases and recovered cases
filter by country and state
filter by date

Example

import covid19_inference as cov19

jhu = cov19.data_retrieval.JHU()
jhu.download_all_available_data()

# Access the data by
jhu.data
# or
jhu.get_new("confirmed", "Italy")
jhu.get_total("confirmed", "Italy")

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with jhu specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main urls (self.url_csv), which were set on initialization of this class. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

get_total_confirmed_deaths_recovered(country: str = None, state: str = None, begin_date: datetime.datetime = None, end_date: datetime.datetime = None)[source]¶

Retrieves all confirmed, deaths and recovered cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by country and state; if only a country is given, all available states are summed up.

Parameters

country (str, optional) – name of the country (the “Country/Region” column); can be None if the data summed over all countries is wanted
state (str, optional) – name of the state (the “Province/State” column); can be None if country is set or the data summed over all countries is wanted
begin_date (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used
end_date (datetime.datetime, optional) – last date for the returned data; if no value is given, the most recent date in the dataset is used

Returns

pandas.DataFrame
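
The "states get summed up" behaviour can be illustrated on a small synthetic table. This is a sketch, not the class's code: the helper total_for and all values are made up, and only the “Country/Region” and “Province/State” column names come from the parameter descriptions above.

```python
import pandas as pd

# Synthetic cumulative counts in a JHU-like layout (values are made up).
raw = pd.DataFrame(
    {
        "Country/Region": ["Australia", "Australia", "Italy"],
        "Province/State": ["Victoria", "Queensland", None],
        "2020-03-01": [1, 2, 10],
        "2020-03-02": [3, 4, 20],
    }
)


def total_for(raw, country, state=None):
    """Hypothetical helper: cumulative series for a country or one state."""
    rows = raw[raw["Country/Region"] == country]
    if state is not None:
        rows = rows[rows["Province/State"] == state]
    # Drop the label columns and sum the remaining rows per date.
    series = rows.drop(columns=["Country/Region", "Province/State"]).sum()
    series.index = pd.to_datetime(series.index)
    return series


australia = total_for(raw, "Australia")              # both states summed
victoria = total_for(raw, "Australia", "Victoria")   # a single state
```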

get_new(value='confirmed', country: str = None, state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]¶

Retrieves all new cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by value, country and state; if only a country is given, all available states are summed up.

Parameters

value (str) – Which data to return; possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
country (str, optional) – name of the country (the “Country/Region” column), can be None
state (str, optional) – name of the state (the “Province/State” column), can be None
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data; if no value is given, the most recent date in the dataset is used

Returns

pandas.DataFrame – table with new cases and the date as index
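
The relation between new and total counts can be sketched as follows, assuming that daily new cases are the day-to-day difference of the cumulative counts. The source does not state how get_new computes its result, so this is an illustration on made-up numbers only.

```python
import pandas as pd

# Synthetic cumulative confirmed cases with a datetime index (values are made up).
total = pd.Series(
    [100, 130, 180, 180],
    index=pd.date_range("2020-03-01", periods=4, freq="D"),
    name="confirmed",
)

# Daily new cases as the difference between consecutive days.
# The first day has no predecessor, so it is dropped here.
new = total.diff().dropna().astype(int)
```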

get_total(value='confirmed', country: str = None, state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]¶

Retrieves all total/cumulative cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by value, country and state; if only a country is given, all available states are summed up.

Parameters

value (str) – Which data to return; possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
country (str, optional) – name of the country (the “Country/Region” column), can be None
state (str, optional) – name of the state (the “Province/State” column), can be None
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data; if no value is given, the most recent date in the dataset is used

Returns

pandas.DataFrame – table with total/cumulative cases and the date as index

filter_date(df, begin_date: datetime.datetime = None, end_date: datetime.datetime = None)[source]¶

Returns the given dataframe restricted to the range between begin and end date. The dataframe has to have a datetime index.

Parameters

begin_date (datetime.datetime, optional) – First day that should be filtered
end_date (datetime.datetime, optional) – Last day that should be filtered

Returns

pandas.DataFrame
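
A date filter of this kind can be written with label-based slicing on a datetime index. The sketch below is a hypothetical equivalent, not the method's source; the name filter_by_date and the sample data are made up.

```python
import datetime

import pandas as pd


def filter_by_date(df, begin_date=None, end_date=None):
    """Hypothetical date filter for a frame with a sorted datetime index."""
    if begin_date is None:
        begin_date = df.index[0]
    if end_date is None:
        end_date = df.index[-1]
    # Label-based slicing on a sorted DatetimeIndex includes both endpoints.
    return df.loc[begin_date:end_date]


df = pd.DataFrame(
    {"confirmed": [1, 2, 3, 4, 5]},
    index=pd.date_range("2020-03-01", periods=5, freq="D"),
)
subset = filter_by_date(
    df, datetime.datetime(2020, 3, 2), datetime.datetime(2020, 3, 4)
)
```

Leaving both dates as None returns the frame unchanged, which mirrors the "first date" and "most recent date" defaults described for the other methods.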

get_possible_countries_states()[source]¶

Can be used to get a list with all possible states and countries.

Returns: pandas.DataFrame in the format

Robert Koch Institute ¶

class covid19_inference.data_retrieval.RKI(auto_download=False)[source]¶

This class can be used to retrieve and filter the dataset from the Robert Koch Institute. The data is retrieved from the ArcGIS dashboard.

Features

download the full dataset
filter by date
filter by bundesland
filter by recovered, deaths and confirmed cases

Example

import covid19_inference as cov19

rki = cov19.data_retrieval.RKI()
rki.download_all_available_data()

# Access the data by
rki.data
# or
rki.get_new("confirmed", "Sachsen")
rki.get_total("confirmed", "Sachsen")

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with rki specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

get_total(value='confirmed', bundesland: str = None, landkreis: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None, date_type: str = 'date', age_group=None)[source]¶

Gets all total confirmed cases for a region as dataframe with date index. Can be filtered with multiple arguments.

Parameters

value (str) – Which data to return; possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
bundesland (str, optional) – if no value is provided, the full summed up dataset for Germany is used
landkreis (str, optional) – if no value is provided, the full summed up dataset for the region (bundesland) is used
data_begin (datetime.datetime, optional) – initial date; if no value is provided, the first possible date is used
data_end (datetime.datetime, optional) – last date; if no value is provided, the most recent possible date is used
date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)
age_group (str, optional) – Chosen age group. To get the possible combinations use possible_age_groups().

Returns

pandas.DataFrame

get_new(value='confirmed', bundesland: str = None, landkreis: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None, date_type: str = 'date', age_group=None)[source]¶

Retrieves all new cases from the Robert Koch Institute dataset as a DataFrame with datetime index. Can be filtered by value, bundesland and landkreis; if only a bundesland is given, all of its landkreise are summed up.

Parameters

value (str) – Which data to return; possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
bundesland (str, optional) – if no value is provided, the full summed up dataset for Germany is used
landkreis (str, optional) – if no value is provided, the full summed up dataset for the region (bundesland) is used
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used (omitting it could yield errors)
data_end (datetime.datetime, optional) – last date for the returned data; if no value is given, the most recent date in the dataset is used
age_group (str, optional) – Chosen age group. To get the possible combinations use possible_age_groups().

Returns

pandas.DataFrame – table with daily new cases and the date as index

filter(data_begin: datetime.datetime = None, data_end: datetime.datetime = None, variable='confirmed', date_type='date', level=None, value=None, age_group=None)[source]¶

Filters the obtained dataset for a given time period and returns an array containing only the desired variable.

Parameters

data_begin (datetime.datetime, optional) – initial date, if no value is provided it will use the first possible date
data_end (datetime.datetime, optional) – last date, if no value is provided it will use the most recent possible date
variable (str, optional) – type of variable to return; possible types are:
“confirmed” : cases (default)
“AnzahlTodesfall” : deaths
“AnzahlGenesen” : recovered
date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)
level (str, optional) – possible strings are:
“None” : return data from all Germany (default)
“Bundesland” : a state
“Landkreis” : a region
value (str, optional) – string of the state/region, e.g. “Sachsen”
age_group (str, optional) – Chosen age group. To get the possible combinations use possible_age_groups().

Returns

pd.DataFrame – array with ONLY the requested variable, in the requested range. (one dimensional)

filter_all_bundesland(begin_date: datetime.datetime = None, end_date: datetime.datetime = None, variable='confirmed', date_type='date')[source]¶

Filters the full RKI dataset

Parameters

df (DataFrame) – RKI dataframe, from get_rki()
begin_date (datetime.datetime) – initial date to return
end_date (datetime.datetime) – last date to return
variable (str, optional) – type of variable to return: cases (“AnzahlFall”), deaths (“AnzahlTodesfall”), recovered (“AnzahlGenesen”)
date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)

Returns

pd.DataFrame – DataFrame with datetime dates as index, and all German regions (bundesländer) as columns
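
The shape of the returned table (dates as index, one column per bundesland) can be reproduced on synthetic data with a pivot. This is a sketch under assumptions: the records and their long-format layout are made up, and only the names “Bundesland”, “AnzahlFall” and ‘date’ are borrowed from the descriptions above.

```python
import pandas as pd

# Synthetic long-format records (values are made up).
records = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02", "2020-03-02"]
        ),
        "Bundesland": ["Sachsen", "Bayern", "Sachsen", "Bayern", "Bayern"],
        "AnzahlFall": [2, 5, 1, 3, 4],
    }
)

# One row per date, one column per Bundesland, summing duplicate entries.
by_state = records.pivot_table(
    index="date",
    columns="Bundesland",
    values="AnzahlFall",
    aggfunc="sum",
    fill_value=0,
)
```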

possible_age_groups()[source]¶: Returns the valid age groups in the dataset.

Robert Koch Institute situation reports ¶

class covid19_inference.data_retrieval.RKIsituationreports(auto_download=False)[source]¶

As mentioned by Matthias Linden, the daily situation reports have more available data. This class retrieves this additional data from Matthias' website and parses it into the format we use, i.e. with a datetime index.

Interesting new data includes, for example, ICU cases, deaths and recorded symptoms. For now, one can look at the data by running the example below.

Example

import covid19_inference as cov19

rki_si_re = cov19.data_retrieval.RKIsituationreports(True)
print(rki_si_re.data)

Todo

Filter functions for ICU, Symptoms and maybe even daily new cases for the respective categories.

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with rki situation reports specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

Google ¶

class covid19_inference.data_retrieval.GOOGLE(auto_download=False)[source]¶

This class can be used to retrieve the mobility dataset from Google.

Example

import covid19_inference as cov19

gl = cov19.data_retrieval.GOOGLE()
gl.download_all_available_data()

# Access the data by
gl.data
# or
gl.get_changes("Germany")

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with google specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

get_changes(country: str, state: str = None, region: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]¶

Returns a dataframe with the relative changes in mobility to a baseline, provided by Google. They are separated into “retail and recreation”, “grocery and pharmacy”, “parks”, “transit”, “workplaces” and “residential”. Can be filtered by country, state, region and date.

Parameters

country (str) – Selected country for the mobility data.
state (str, optional) – State for the selected data if no value is selected the whole country is chosen
region (str, optional) – Region for the selected data if no value is selected the whole region/country is chosen
data_begin, data_end (datetime.datetime, optional) – Filter for the desired time period

Returns

pandas.DataFrame

get_possible_counties_states_regions()[source]¶

Can be used to obtain all different possible countries with their corresponding possible states and regions.

Returns: pandas.DataFrame

Our World in Data ¶

class covid19_inference.data_retrieval.OWD(auto_download=False)[source]¶

This class can be used to retrieve the testing dataset from Our World in Data.

Example

import covid19_inference as cov19

owd = cov19.data_retrieval.OWD()
owd.download_all_available_data()

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with Our World in Data specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

get_possible_countries()[source]¶

Can be used to obtain all different possible countries in the dataset.

Returns: pandas.DataFrame

get_total(value='tests', country=None, data_begin=None, data_end=None)[source]¶

Retrieves all total/cumulative values from the Our World in Data dataset as a DataFrame with datetime index. Can be filtered by value and country.

Parameters

value (str) – Which data to return; possible values are “confirmed”, “tests”, “deaths” and “vacination” (default: “tests”)
country (str) – name of the country
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data; if no value is given, the most recent date in the dataset is used

Returns

pandas.DataFrame – table with total/cumulative values and the date as index

get_new(value='tests', country=None, data_begin=None, data_end=None)[source]¶

Retrieves all new cases from the Our World in Data dataset as a DataFrame with datetime index. Can be filtered by value and country.

Parameters

value (str) – Which data to return; possible values are “confirmed”, “tests” and “deaths” (default: “tests”)
country (str) – name of the country
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data; if no value is given, the most recent date in the dataset is used

Returns

pandas.DataFrame – table with new cases and the date as index

Financial Times ¶

class covid19_inference.data_retrieval.FINANCIAL_TIMES(auto_download=False)[source]¶

This class can be used to retrieve the excess mortality data from the Financial Times GitHub repository.

Example

import covid19_inference as cov19

ft = cov19.data_retrieval.FINANCIAL_TIMES()
ft.download_all_available_data()

# Access the data by
ft.data
# or
ft.get("excess_deaths", "Germany")  # see below

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with financial times specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

get(value='excess_deaths', country: str = 'Germany', state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]¶

Retrieves specific data from the dataset, can be filtered by date, country and state.

Parameters

value (str, optional) – Which data to return; possible values are “deaths”, “expected_deaths”, “excess_deaths” and “excess_deaths_pct” (default: “excess_deaths”)
country (str, optional) – Country to select (default: “Germany”)
state (str, optional) – State to select. Possible countries and states can be retrieved by the get_possible_countries_states() method.
data_begin (datetime.datetime, optional) – First day that should be filtered
data_end (datetime.datetime, optional) – Last day that should be filtered

get_possible_countries_states()[source]¶

Can be used to obtain all different possible countries with their corresponding possible states and regions.

Returns: pandas.DataFrame

Oxford COVID-19 Government Response Tracker ¶

class covid19_inference.data_retrieval.OxCGRT(auto_download=False)[source]¶

This class can be used to retrieve the dataset on government policies from the Oxford Covid-19 Government Response Tracker.

Example

import covid19_inference as cov19

gov_pol = cov19.data_retrieval.OxCGRT()
gov_pol.download_all_available_data()

__init__(auto_download=False)[source]¶

On init of this class the base Retrieval Class __init__ is called, with OxCGRT specific arguments.

Parameters: auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: False)

download_all_available_data(force_local=False, force_download=False)[source]¶

Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. The method can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters

force_local (bool, optional) – If True forces to load the local files.
force_download (bool, optional) – If True forces the download of new files

get_possible_countries()[source]¶

Can be used to obtain all different possible countries in the dataset.

Returns: pandas.DataFrame

get_possible_policies()[source]¶

Can be used to obtain all policies and their corresponding categories in the dataset.

Returns: dict

get_change_points(policies, country)[source]¶

Returns a list of change points, depending on the selected measure and country.

Parameters

policies (str, array of str) – The wanted policies. Can be an array of strings, use get_possible_policies() to get a dict of possible policies.
country (str) – Filter for country, use get_possible_countries() to get a list of possible ones.

Returns

array of dicts
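
The idea of a change point can be sketched as a date on which the recorded level of a policy differs from the previous day. The function below is an illustration on made-up data; the dict keys and the input format are hypothetical and are not the format returned by get_change_points.

```python
import datetime


def change_points(levels):
    """Return one dict per date on which the policy level changes.

    `levels` is a list of (date, level) pairs sorted by date. The dict keys
    used here are hypothetical, not the library's output format.
    """
    points = []
    previous = None
    for date, level in levels:
        if previous is not None and level != previous:
            points.append({"date": date, "from": previous, "to": level})
        previous = level
    return points


# Synthetic policy levels for one measure (values are made up).
levels = [
    (datetime.date(2020, 3, 1), 0),
    (datetime.date(2020, 3, 2), 0),
    (datetime.date(2020, 3, 3), 2),
    (datetime.date(2020, 3, 4), 2),
    (datetime.date(2020, 3, 5), 1),
]
points = change_points(levels)
```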

get_time_data(policy, country, data_begin=None, data_end=None)[source]¶

Parameters

policy (str) – The wanted policy.
country (str) – Filter for country, use get_possible_countries() to get a list of possible ones.
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used (omitting it could yield errors)
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

Pandas dataframe with policy

Base Retrieval Class ¶

class covid19_inference.data_retrieval.retrieval.Retrieval(name, url_csv, fallbacks, update_interval=None, **kwargs)[source]¶

Each source class should inherit from this base retrieval class; it streamlines a lot of base functions. It manages downloads, multiple fallbacks and local backups via timestamp. At init of the child class, the Retrieval init should be called with the following arguments, which get saved as attributes.

An example for the usage can be seen in the _Google, _RKI and _JHU source files.

__init__(name, url_csv, fallbacks, update_interval=None, **kwargs)[source]¶

Parameters

name (str) – A name for the child class, mainly used for the local file backup.
url_csv (str) – The url to the main dataset as csv; if an empty string is supplied, the fallback routines get used.
fallbacks (array) – Fallbacks can be filepaths to local or online sources or even methods defined in the child class.
update_interval (datetime.timedelta) – If the local file is older than the update_interval, it gets updated once download_all_available_data() is called.

_download_csv_from_source(filepath, **kwargs)[source]¶

Uses pandas.read_csv to download the csv file. The possible kwargs can be seen in the pandas documentation.

These kwargs can vary between the different child classes and should be defined there!

filepath (str) – Full path to the desired csv file

Returns: bool – True if the retrieval was a success, False if it failed

_fallback_handler()[source]¶: Recursively iterates over all fallbacks and tries to execute subroutines depending on the type of fallback.
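
A recursive walk over fallbacks of different types can be sketched as below. This is not the method's source: the function try_fallbacks, its arguments and the distinction between callables and path strings are assumptions made for illustration, based on the note that fallbacks can be filepaths, online sources or methods.

```python
def try_fallbacks(fallbacks, load_from_path, index=0):
    """Hypothetical recursive fallback loop (illustration only).

    Each fallback is either a callable, which is invoked directly, or a
    string, which is passed to `load_from_path`. Returns True on the first
    success and False once every fallback has failed.
    """
    if index >= len(fallbacks):
        return False
    fallback = fallbacks[index]
    try:
        if callable(fallback):
            fallback()
        else:
            load_from_path(fallback)
        return True
    except Exception:
        # This fallback failed: recurse into the next one.
        return try_fallbacks(fallbacks, load_from_path, index + 1)


attempts = []


def failing_loader(path):
    # Stand-in for a download that always fails.
    attempts.append(path)
    raise IOError("source unavailable")


def local_method():
    # Stand-in for a method defined on the source class.
    attempts.append("local_method")


ok = try_fallbacks(["https://example.invalid/a.csv", local_method], failing_loader)
```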

_timestamp_local_old(force_local=False) → bool[source]¶

Get the timestamp if it exists
Compare it with today's date
Update if the data is older than the set interval (this can depend on the child class)
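
The three steps above can be sketched as a small staleness check. The function name, the handling of a missing timestamp and the effect of force_local are assumptions for illustration; the source only states that the timestamp is compared with today's date against the update interval.

```python
import datetime


def local_copy_is_old(timestamp, update_interval, now=None, force_local=False):
    """Hypothetical staleness check (illustration only)."""
    if force_local:
        # Assumption: when local files are forced, never report them as old.
        return False
    if timestamp is None:
        # Assumption: no timestamp means there is no usable local copy.
        return True
    now = now or datetime.datetime.now()
    return (now - timestamp) > update_interval


now = datetime.datetime(2020, 4, 2, 12, 0)
one_day = datetime.timedelta(days=1)
fresh = local_copy_is_old(datetime.datetime(2020, 4, 2, 6, 0), one_day, now)
stale = local_copy_is_old(datetime.datetime(2020, 3, 30), one_day, now)
```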

_save_to_local()[source]¶: Creates a local backup of the self.data pandas.DataFrame and a timestamp for the source.