Data Retrieval

Utility

covid19_inference.data_retrieval.retrieval.set_data_dir(fname=None, permissions=None)[source]

Sets the global variable _data_dir, the folder where newly downloaded data is placed. If no argument is provided, the default tmp directory is tried. If permissions are not provided, the defaults are used when fname is inside the user folder; otherwise the function tries to set 777.
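The fallback to a temporary directory can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: choose_data_dir is a hypothetical helper and the subfolder name covid19_data is invented for the example.

```python
import os
import tempfile


def choose_data_dir(fname=None, permissions=None):
    """Hypothetical stand-in for set_data_dir: pick and create a data folder."""
    if fname is None:
        # Assumption: fall back to a subfolder of the system temp directory.
        fname = os.path.join(tempfile.gettempdir(), "covid19_data")
    os.makedirs(fname, exist_ok=True)
    if permissions is not None:
        # Only change the mode when the caller asks for it explicitly.
        os.chmod(fname, permissions)
    return fname


data_dir = choose_data_dir()
```

Passing an explicit mode such as 0o755 would apply it with os.chmod; leaving permissions as None keeps whatever the operating system assigns by default.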

covid19_inference.data_retrieval.retrieval.backup_instances(trace=None, model=None, fname='latest_')[source]

Helper to save or load trace and model instances. If trace and model are both None, they are loaded from fname; otherwise they are saved there.
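The save-or-load switch can be illustrated with pickle. The sketch below is an assumption about the general pattern only: backup is a hypothetical function, the file suffix is invented, and plain Python objects stand in for the real trace and model.

```python
import os
import pickle
import tempfile


def backup(trace=None, model=None, fname="latest_"):
    """Hypothetical helper: save when objects are given, otherwise load."""
    path = fname + "backup.pickle"  # assumed file name suffix
    if trace is None and model is None:
        # Nothing to save, so read the stored pair back from disk.
        with open(path, "rb") as handle:
            return pickle.load(handle)
    with open(path, "wb") as handle:
        pickle.dump((trace, model), handle)
    return trace, model


prefix = os.path.join(tempfile.mkdtemp(), "latest_")
backup(trace=[0.1, 0.2], model={"name": "demo"}, fname=prefix)
loaded_trace, loaded_model = backup(fname=prefix)
```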

Johns Hopkins University

class covid19_inference.data_retrieval.JHU(auto_download=False)[source]

This class can be used to retrieve and filter the dataset from the online repository of the coronavirus visual dashboard operated by the Johns Hopkins University.

Features
  • download all files from the online repository of the coronavirus visual dashboard operated by the Johns Hopkins University.

  • filter by deaths, confirmed cases and recovered cases

  • filter by country and state

  • filter by date

Example

import covid19_inference as cov19

jhu = cov19.data_retrieval.JHU()
jhu.download_all_available_data()

# Access the data by
jhu.data
# or
jhu.get_new("confirmed", "Italy")
jhu.get_total("confirmed", "Italy")
__init__(auto_download=False)[source]

On init of this class the base Retrieval Class __init__ is called, with jhu specific arguments.

Parameters

auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)

download_all_available_data(force_local=False, force_download=False)[source]

Attempts to download from the main URLs (self.url_csv), which were set on initialization of this class. If this fails, it downloads from the fallbacks. It can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters
  • force_local (bool, optional) – If True forces to load the local files.

  • force_download (bool, optional) – If True forces the download of new files
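How the two flags might interact can be sketched as a small decision function. Everything here is an assumption for illustration: pick_source is hypothetical, and the source does not state what happens when both flags are set or which source is preferred when neither is.

```python
def pick_source(force_local, force_download, local_is_fresh):
    """Hypothetical decision helper; returns 'local' or 'download'."""
    if force_local and force_download:
        # Assumption: the two flags contradict each other.
        raise ValueError("force_local and force_download are mutually exclusive")
    if force_local:
        return "local"
    if force_download:
        return "download"
    # Assumption: without flags, a sufficiently recent local copy is preferred.
    return "local" if local_is_fresh else "download"


choice = pick_source(force_local=False, force_download=False, local_is_fresh=True)
```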

get_total_confirmed_deaths_recovered(country: str = None, state: str = None, begin_date: datetime.datetime = None, end_date: datetime.datetime = None)[source]

Retrieves all confirmed, deaths and recovered cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by country and state; if only a country is given, all available states are summed up.

Parameters
  • country (str, optional) – name of the country (the “Country/Region” column), can be None if the data summed over all countries is wanted

  • state (str, optional) – name of the state (the “Province/State” column), can be None if country is set or the whole summed up data is wanted

  • begin_date (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used

  • end_date (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

pandas.DataFrame
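The summing of states when only a country is given can be sketched without pandas. The rows below are toy values invented for the example, not real counts, and total_for is a hypothetical helper rather than the library's code.

```python
from collections import defaultdict

# (country, state, day, cumulative count): toy rows for illustration only
rows = [
    ("Australia", "Victoria", "2020-03-01", 7),
    ("Australia", "Queensland", "2020-03-01", 9),
    ("Australia", "Victoria", "2020-03-02", 9),
]


def total_for(rows, country, state=None):
    """Sum counts per day, over all states unless one state is requested."""
    totals = defaultdict(int)
    for row_country, row_state, day, count in rows:
        if row_country != country:
            continue
        if state is not None and row_state != state:
            continue
        totals[day] += count
    return dict(totals)


whole_country = total_for(rows, "Australia")
one_state = total_for(rows, "Australia", "Victoria")
```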

get_new(value='confirmed', country: str = None, state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]

Retrieves all new cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by value, country and state; if only a country is given, all available states are summed up.

Parameters
  • value (str) – Which data to return, possible values are - “confirmed”, - “recovered”, - “deaths” (default: “confirmed”)

  • country (str, optional) – name of the country (the “Country/Region” column), can be None

  • state (str, optional) – name of the state (the “Province/State” column), can be None

  • data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used

  • data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

pandas.DataFrame – table with new cases and the date as index
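One way to picture the relation between total and new cases is as day-to-day differences of the cumulative counts. The source does not state how get_new computes its result, so the sketch below is an assumption; daily_new is hypothetical and the numbers are invented.

```python
from datetime import date

# Toy cumulative counts, invented for illustration
cumulative = {
    date(2020, 3, 1): 100,
    date(2020, 3, 2): 130,
    date(2020, 3, 3): 175,
}


def daily_new(total_by_day):
    """Difference between each day and the previous one (first day is dropped)."""
    days = sorted(total_by_day)
    return {
        current: total_by_day[current] - total_by_day[previous]
        for previous, current in zip(days, days[1:])
    }


new_cases = daily_new(cumulative)
```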

get_total(value='confirmed', country: str = None, state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]

Retrieves all total/cumulative cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by value, country and state; if only a country is given, all available states are summed up.

Parameters
  • value (str) – Which data to return, possible values are - “confirmed”, - “recovered”, - “deaths” (default: “confirmed”)

  • country (str, optional) – name of the country (the “Country/Region” column), can be None

  • state (str, optional) – name of the state (the “Province/State” column), can be None

  • data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used

  • data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

pandas.DataFrame – table with total/cumulative cases and the date as index

filter_date(df, begin_date: datetime.datetime = None, end_date: datetime.datetime = None)[source]

Returns the given dataframe restricted to the range between begin_date and end_date. The dataframe has to have a datetime index.

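Parameters
  • df (pandas.DataFrame) – dataframe with a datetime index

  • begin_date (datetime.datetime, optional) – first date to keep

  • end_date (datetime.datetime, optional) – last date to keep

Returns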

pandas.DataFrame
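The date window can be sketched on a plain mapping from datetime to value, which stands in for a DataFrame with a datetime index. filter_by_date is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
from datetime import datetime


def filter_by_date(rows, begin_date=None, end_date=None):
    """Keep entries of a {datetime: value} mapping within [begin_date, end_date]."""
    days = sorted(rows)
    if not days:
        return {}
    # Missing bounds default to the first and last available date.
    begin_date = days[0] if begin_date is None else begin_date
    end_date = days[-1] if end_date is None else end_date
    return {day: rows[day] for day in days if begin_date <= day <= end_date}


series = {datetime(2020, 3, d): d * 10 for d in range(1, 6)}
window = filter_by_date(
    series, begin_date=datetime(2020, 3, 2), end_date=datetime(2020, 3, 4)
)
```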

get_possible_countries_states()[source]

Can be used to get a list of all possible states and countries.

Returns

pandas.DataFrame in the format

Robert Koch Institute

class covid19_inference.data_retrieval.RKI(auto_download=False)[source]

This class can be used to retrieve and filter the dataset from the Robert Koch Institute. The data is retrieved from the ArcGIS dashboard.

Features
  • download the full dataset

  • filter by date

  • filter by bundesland

  • filter by recovered, deaths and confirmed cases

Example

import covid19_inference as cov19

rki = cov19.data_retrieval.RKI()
rki.download_all_available_data()

# Access the data by
rki.data
# or
rki.get_new("confirmed", "Sachsen")
rki.get_total("confirmed", "Sachsen")
__init__(auto_download=False)[source]

On init of this class the base Retrieval Class __init__ is called, with rki specific arguments.

Parameters

auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)

download_all_available_data(force_local=False, force_download=False)[source]

Attempts to download from the main URL (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters
  • force_local (bool, optional) – If True forces to load the local files.

  • force_download (bool, optional) – If True forces the download of new files

get_total(value='confirmed', bundesland: str = None, landkreis: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None, date_type: str = 'date')[source]

Gets the total (cumulative) cases of the selected value for a region as a DataFrame with a datetime index. Can be filtered with multiple arguments.

Parameters
  • value (str) – Which data to return, possible values are - “confirmed”, - “recovered”, - “deaths” (default: “confirmed”)

  • bundesland (str, optional) – if no value is provided it will use the full summed up dataset for Germany

  • landkreis (str, optional) – if no value is provided it will use the full summed up dataset for the region (bundesland)

  • data_begin (datetime.datetime, optional) – initial date, if no value is provided it will use the first possible date

  • data_end (datetime.datetime, optional) – last date, if no value is provided it will use the most recent possible date

  • date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)

Returns

pandas.DataFrame

get_new(value='confirmed', bundesland: str = None, landkreis: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None, date_type: str = 'date')[source]

Retrieves all new cases from the Robert Koch Institute dataset as a DataFrame with datetime index. Can be filtered by value, bundesland and landkreis; if only a bundesland is given, all of its landkreise are summed up.

Parameters
  • value (str) – Which data to return, possible values are - “confirmed”, - “recovered”, - “deaths” (default: “confirmed”)

  • bundesland (str, optional) – if no value is provided it will use the full summed up dataset for Germany

  • landkreis (str, optional) – if no value is provided it will use the full summed up dataset for the region (bundesland)

  • data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used (note that omitting it could yield errors)

  • data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

pandas.DataFrame – table with daily new confirmed and the date as index

filter(data_begin: datetime.datetime = None, data_end: datetime.datetime = None, variable='confirmed', date_type='date', level=None, value=None)[source]

Filters the obtained dataset for a given time period and returns an array containing only the desired variable.

Parameters
  • data_begin (datetime.datetime, optional) – initial date, if no value is provided it will use the first possible date

  • data_end (datetime.datetime, optional) – last date, if no value is provided it will use the most recent possible date

  • variable (str, optional) –

    type of variable to return, possible types are:

    “confirmed” : cases (default)
    “AnzahlTodesfall” : deaths
    “AnzahlGenesen” : recovered

  • date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)

  • level (str, optional) –

    possible strings are:

    “None” : return data from all Germany (default)
    “Bundesland” : a state
    “Landkreis” : a region

  • value (None, optional) – string of the state/region e.g. “Sachsen”

Returns

pd.DataFrame – array with ONLY the requested variable, in the requested range. (one dimensional)

filter_all_bundesland(begin_date: datetime.datetime = None, end_date: datetime.datetime = None, variable='confirmed', date_type='date')[source]

Filters the full RKI dataset

Parameters
  • df (DataFrame) – RKI dataframe, from get_rki()

  • begin_date (datetime.datetime) – initial date to return

  • end_date (datetime.datetime) – last date to return

  • variable (str, optional) – type of variable to return: cases (“AnzahlFall”), deaths (“AnzahlTodesfall”), recovered (“AnzahlGenesen”)

  • date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)

Returns

pd.DataFrame – DataFrame with datetime dates as index, and all German regions (bundesländer) as columns
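The shape of the returned table, dates as index and one column per bundesland, can be sketched with nested dictionaries. The records are toy values and pivot_by_bundesland is a hypothetical helper; filling missing combinations with 0 is an assumption.

```python
from collections import defaultdict

# (day, bundesland, count): toy records for illustration only
records = [
    ("2020-03-01", "Sachsen", 3),
    ("2020-03-01", "Bayern", 5),
    ("2020-03-01", "Sachsen", 2),
    ("2020-03-02", "Sachsen", 4),
]


def pivot_by_bundesland(records):
    """Return {day: {bundesland: summed count}} with every state in every row."""
    table = defaultdict(lambda: defaultdict(int))
    for day, bundesland, count in records:
        table[day][bundesland] += count
    states = sorted({bundesland for _, bundesland, _ in records})
    # Assumption: a state without records on a day is reported as 0.
    return {
        day: {state: table[day].get(state, 0) for state in states}
        for day in sorted(table)
    }


pivot = pivot_by_bundesland(records)
```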

Robert Koch Institute situation reports

class covid19_inference.data_retrieval.RKIsituationreports(auto_download=False)[source]

As mentioned by Matthias Linden, the daily situation reports contain more data. This class retrieves this additional data from Matthias Linden's website and parses it into the format we use, i.e. with a datetime index.

Interesting new data includes, for example, ICU cases, deaths and recorded symptoms. For now, one can look at the data by running:

Example

import covid19_inference as cov19

rki_si_re = cov19.data_retrieval.RKIsituationreports(auto_download=True)
print(rki_si_re.data)

Todo

Filter functions for ICU, Symptoms and maybe even daily new cases for the respective categories.

__init__(auto_download=False)[source]

On init of this class the base Retrieval Class __init__ is called, with rki situation reports specific arguments.

Parameters

auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)

download_all_available_data(force_local=False, force_download=False)[source]

Attempts to download from the main URL (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters
  • force_local (bool, optional) – If True forces to load the local files.

  • force_download (bool, optional) – If True forces the download of new files

Google

class covid19_inference.data_retrieval.GOOGLE(auto_download=False)[source]

This class can be used to retrieve the mobility dataset from Google.

Example

import covid19_inference as cov19

gl = cov19.data_retrieval.GOOGLE()
gl.download_all_available_data()

# Access the data by
gl.data
# or
gl.get_changes("Germany")
__init__(auto_download=False)[source]

On init of this class the base Retrieval Class __init__ is called, with google specific arguments.

Parameters

auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)

download_all_available_data(force_local=False, force_download=False)[source]

Attempts to download from the main URL (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters
  • force_local (bool, optional) – If True forces to load the local files.

  • force_download (bool, optional) – If True forces the download of new files

get_changes(country: str, state: str = None, region: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None)[source]

Returns a dataframe with the relative changes in mobility compared to a baseline, as provided by Google. They are separated into “retail and recreation”, “grocery and pharmacy”, “parks”, “transit”, “workplaces” and “residential”. Can be filtered by country, state, region and date.

Parameters
  • country (str) – Selected country for the mobility data.

  • state (str, optional) – State for the selected data; if no value is selected, the whole country is chosen

  • region (str, optional) – Region for the selected data; if no value is selected, the whole region/country is chosen

  • data_begin, data_end (datetime.datetime, optional) – Filter for the desired time period

Returns

pandas.DataFrame
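As a reminder of what a relative change to a baseline means, the following sketch computes a percentage change from two numbers. The dataset already contains such values, so this is purely illustrative: relative_change is a hypothetical helper and the visit counts are invented.

```python
def relative_change(value, baseline):
    """Percentage change of value compared to baseline."""
    return 100.0 * (value - baseline) / baseline


# Invented numbers: 80 visits on a day whose baseline is 100 visits
drop = relative_change(80, 100)
rise = relative_change(150, 100)
```

A negative result means less activity than the baseline, a positive one means more.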

get_possible_counties_states_regions()[source]

Can be used to obtain all possible countries with their corresponding possible states and regions.

Returns

pandas.DataFrame

Our World in Data

class covid19_inference.data_retrieval.OWD(auto_download=False)[source]

This class can be used to retrieve the testings dataset from Our World in Data.

Example

import covid19_inference as cov19

owd = cov19.data_retrieval.OWD()
owd.download_all_available_data()
__init__(auto_download=False)[source]

On init of this class the base Retrieval Class __init__ is called, with Our World in Data specific arguments.

Parameters

auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)

download_all_available_data(force_local=False, force_download=False)[source]

Attempts to download from the main URL (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be told to use the local files or to force the download. The download methods are inherited from the base retrieval class.

Parameters
  • force_local (bool, optional) – If True forces to load the local files.

  • force_download (bool, optional) – If True forces the download of new files

get_possible_countries()[source]

Can be used to obtain all different possible countries in the dataset.

Returns

pandas.DataFrame

get_total(value='tests', country=None, data_begin=None, data_end=None)[source]

Retrieves all total/cumulative cases from the Our World in Data dataset as a DataFrame with datetime index. Can be filtered by value and country.

Parameters
  • value (str) – Which data to return, possible values are - “confirmed”, - “tests”, - “deaths” (default: “tests”)

  • country (str) – name of the country

  • data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used

  • data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

pandas.DataFrame – table with total/cumulative cases and the date as index

get_new(value='tests', country=None, data_begin=None, data_end=None)[source]

Retrieves all new cases from the Our World in Data dataset as a DataFrame with datetime index. Can be filtered by value and country.

Parameters
  • value (str) – Which data to return, possible values are - “confirmed”, - “tests”, - “deaths” (default: “tests”)

  • country (str) – name of the country

  • data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used

  • data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used

Returns

pandas.DataFrame – table with new cases and the date as index

Base Retrieval Class

class covid19_inference.data_retrieval.retrieval.Retrieval(name, url_csv, fallbacks, update_interval=None, **kwargs)[source]

Each source class should inherit from this base retrieval class, which streamlines a lot of base functions. It manages downloads, multiple fallbacks and local backups via timestamp. At init of the inheriting source class (called the parent class in this section), the Retrieval init should be called with the following arguments, which are saved as attributes.

An example for the usage can be seen in the _Google, _RKI and _JHU source files.

__init__(name, url_csv, fallbacks, update_interval=None, **kwargs)[source]
Parameters
  • name (str) – A name for the Parent class, mainly used for the local file backup.

  • url_csv (str) – The url to the main dataset as csv; if an empty string is supplied, the fallback routines are used.

  • fallbacks (array) – Fallbacks can be filepaths to local or online sources or even methods defined in the parent class.

  • update_interval (datetime.timedelta) – If the local file is older than the update_interval, it gets updated once download_all_available_data() is called.
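How a source class might pass these arguments on can be sketched as follows. To keep the block runnable on its own, Retrieval here is a minimal stand-in that only stores its arguments, not the library class; MySource, its URLs and the _local_copy method are hypothetical.

```python
import datetime


class Retrieval:
    """Minimal stand-in for the real base class (only stores its arguments)."""

    def __init__(self, name, url_csv, fallbacks, update_interval=None, **kwargs):
        self.name = name
        self.url_csv = url_csv
        self.fallbacks = fallbacks
        self.update_interval = update_interval
        self.kwargs = kwargs


class MySource(Retrieval):
    """Hypothetical source class; the URLs are placeholders."""

    def __init__(self, auto_download=False):
        super().__init__(
            name="MySource",
            url_csv="https://example.org/main.csv",
            # Fallbacks may mix paths or URLs with methods of the class.
            fallbacks=["https://example.org/mirror.csv", self._local_copy],
            update_interval=datetime.timedelta(days=1),
        )
        self.auto_download = auto_download

    def _local_copy(self):
        """Placeholder fallback implemented as a method."""
        return None


source = MySource()
```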

_download_csv_from_source(filepath, **kwargs)[source]

Uses pandas.read_csv to download the csv file. The possible kwargs are listed in the pandas documentation.

These kwargs can vary for the different parent classes and should be defined there!

Parameters

filepath (str) – Full path to the desired csv file

Returns

bool – True if the retrieval was a success, False if it failed

_fallback_handler()[source]

Recursively iterates over all fallbacks and tries to execute subroutines depending on the type of fallback.
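Dispatching on the type of each fallback can be sketched like this. Note that the sketch uses a plain loop rather than recursion, and that run_fallbacks and fake_loader are hypothetical names; the real handler may treat failures differently.

```python
def run_fallbacks(fallbacks, load_from_path):
    """Try each fallback in order; return True on the first success."""
    for fallback in fallbacks:
        try:
            if callable(fallback):
                ok = fallback()  # a method defined on the source class
            else:
                ok = load_from_path(fallback)  # a local path or an online source
        except Exception:
            ok = False  # a failing fallback just moves on to the next one
        if ok:
            return True
    return False


tried = []


def fake_loader(path):
    """Stand-in loader that records the path and succeeds for one file only."""
    tried.append(path)
    return path.endswith("good.csv")


result = run_fallbacks(
    ["bad.csv", lambda: False, "good.csv", "unused.csv"], fake_loader
)
```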

_timestamp_local_old(force_local=False) → bool[source]
  1. Get the timestamp if it exists

  2. Compare it with today's date

  3. Update if the data is older than the set interval -> can be parent dependent
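The three steps above can be sketched as a single comparison. local_copy_is_old is a hypothetical function; treating a missing timestamp as old and letting force_local suppress the update are assumptions.

```python
import datetime


def local_copy_is_old(timestamp, update_interval, now=None, force_local=False):
    """Return True when the local copy should be refreshed."""
    if force_local:
        return False  # assumption: forced local use never triggers an update
    if timestamp is None:
        return True  # assumption: no timestamp means no usable local copy
    if now is None:
        now = datetime.datetime.now()
    return now - timestamp > update_interval


now = datetime.datetime(2020, 4, 10, 12, 0)
one_day = datetime.timedelta(days=1)
stale = local_copy_is_old(datetime.datetime(2020, 4, 8, 12, 0), one_day, now=now)
fresh = local_copy_is_old(datetime.datetime(2020, 4, 10, 6, 0), one_day, now=now)
```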

_save_to_local()[source]

Creates a local backup of the self.data pandas.DataFrame and a timestamp for the source.
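Writing a backup together with a timestamp can be sketched with the csv module. The file names, the timestamp format and save_to_local itself are assumptions for illustration; a list of rows stands in for the DataFrame.

```python
import csv
import datetime
import os
import tempfile


def save_to_local(rows, header, folder, name):
    """Write rows to <name>.csv and the current time to <name>_timestamp.txt."""
    data_path = os.path.join(folder, name + ".csv")
    stamp_path = os.path.join(folder, name + "_timestamp.txt")
    with open(data_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    with open(stamp_path, "w") as handle:
        # ISO format keeps the timestamp easy to parse again later.
        handle.write(datetime.datetime.now().isoformat())
    return data_path, stamp_path


folder = tempfile.mkdtemp()
data_path, stamp_path = save_to_local(
    [("2020-03-01", 3)], ["date", "confirmed"], folder, "demo"
)
```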