Data Retrieval
Utility
-
covid19_inference.data_retrieval.retrieval.
set_data_dir
(fname=None, permissions=None) Set the global variable _data_dir. Newly downloaded data is placed there. If no argument is provided, the default tmp directory is tried. If permissions are not provided, the defaults are used when fname is in the user folder; otherwise the function tries to set 777.
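The fallback to a temporary directory can be sketched as follows. This is only an illustration under assumptions: the helper name resolve_data_dir and the subfolder name are hypothetical and not part of covid19_inference, the default location is assumed to be the one reported by tempfile.gettempdir(), and permission handling is left out.

```python
import os
import tempfile


def resolve_data_dir(fname=None, subfolder="covid19_data_example"):
    """Return a usable data directory, creating it if necessary.

    Hypothetical helper: if no path is given, fall back to a subfolder
    of the system temporary directory.
    """
    if fname is None:
        fname = os.path.join(tempfile.gettempdir(), subfolder)
    os.makedirs(fname, exist_ok=True)  # no error if it already exists
    return fname


data_dir = resolve_data_dir()
```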
Johns Hopkins University
-
class
covid19_inference.data_retrieval.
JHU
(auto_download=False) This class can be used to retrieve and filter the dataset from the online repository of the coronavirus visual dashboard operated by the Johns Hopkins University.
- Features
download all files from the online repository of the coronavirus visual dashboard operated by the Johns Hopkins University.
filter by deaths, confirmed cases and recovered cases
filter by country and state
filter by date
Example
jhu = cov19.data_retrieval.JHU()
jhu.download_all_available_data()
#Access the data by
jhu.data
#or
jhu.get_new("confirmed","Italy")
jhu.get_total(filter)
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with jhu specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main urls (self.url_csv), which were set on initialization of this class. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
-
get_total_confirmed_deaths_recovered
(country: str = None, state: str = None, begin_date: datetime.datetime = None, end_date: datetime.datetime = None) Retrieves all confirmed, deaths and recovered cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by country and state; if only a country is given, all available states are summed up.
- Parameters
country (str, optional) – name of the country (the “Country/Region” column), can be None if the whole summed up data is wanted
state (str, optional) – name of the state (the “Province/State” column), can be None if country is set or the whole summed up data is wanted
begin_date (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used
end_date (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
- Returns
pandas.DataFrame
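The summing of states described above can be illustrated without pandas. The sketch below is not the library's implementation: the row layout, the function name total_by_date and all numbers are made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Made-up rows: (country, state, day, cumulative confirmed cases)
rows = [
    ("Country A", "State 1", date(2020, 3, 1), 7),
    ("Country A", "State 2", date(2020, 3, 1), 9),
    ("Country A", "State 1", date(2020, 3, 2), 9),
    ("Country A", "State 2", date(2020, 3, 2), 11),
    ("Country B", None, date(2020, 3, 1), 100),
]


def total_by_date(rows, country=None, state=None):
    """Sum the values per day, optionally restricted to a country and state."""
    totals = defaultdict(int)
    for row_country, row_state, day, value in rows:
        if country is not None and row_country != country:
            continue
        if state is not None and row_state != state:
            continue
        totals[day] += value
    return dict(sorted(totals.items()))


country_a = total_by_date(rows, country="Country A")
state_1 = total_by_date(rows, country="Country A", state="State 1")
```

Passing only a country adds up every state of that country per day, while passing both narrows the result to a single state.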
-
get_new
(value='confirmed', country: str = None, state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None) Retrieves all new cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by value, country and state; if only a country is given, all available states are summed up.
- Parameters
value (str) – Which data to return, possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
country (str, optional) – name of the country (the “Country/Region” column), can be None
state (str, optional) – name of the state (the “Province/State” column), can be None
data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
- Returns
pandas.DataFrame – table with new cases and the date as index
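How new cases relate to cumulative totals can be sketched as a day-over-day difference. This assumes that "new" means the difference between consecutive cumulative values; how the library treats the first day or downward corrections is not stated here, and the function name and numbers below are made up.

```python
from datetime import date, timedelta


def new_from_cumulative(cumulative):
    """Return day-over-day differences of a date -> cumulative count mapping.

    The first day has no predecessor and is therefore dropped.
    """
    days = sorted(cumulative)
    return {
        day: cumulative[day] - cumulative[prev]
        for prev, day in zip(days, days[1:])
    }


start = date(2020, 3, 1)
# Made-up cumulative counts for four consecutive days
cumulative = {start + timedelta(days=i): n for i, n in enumerate([10, 15, 15, 27])}
new_cases = new_from_cumulative(cumulative)
```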
-
get_total
(value='confirmed', country: str = None, state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None) Retrieves all total/cumulative cases from the Johns Hopkins University dataset as a DataFrame with datetime index. Can be filtered by value, country and state; if only a country is given, all available states are summed up.
- Parameters
value (str) – Which data to return, possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
country (str, optional) – name of the country (the “Country/Region” column), can be None
state (str, optional) – name of the state (the “Province/State” column), can be None
data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
- Returns
pandas.DataFrame – table with total/cumulative cases and the date as index
-
filter_date
(df, begin_date: datetime.datetime = None, end_date: datetime.datetime = None) Returns the given dataframe between begin and end date. The dataframe has to have a datetime index.
- Parameters
begin_date (datetime.datetime, optional) – First day that should be filtered
end_date (datetime.datetime, optional) – Last day that should be filtered
- Returns
pandas.DataFrame
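The date filter can be sketched on a plain mapping with datetime keys. Treating both bounds as inclusive is an assumption of this sketch, as is the function name filter_by_date; the values are made up.

```python
from datetime import datetime


def filter_by_date(series, begin_date=None, end_date=None):
    """Keep entries whose datetime key lies in [begin_date, end_date].

    A bound of None means "no limit on this side". Treating both bounds
    as inclusive is an assumption of this sketch.
    """
    return {
        day: value
        for day, value in sorted(series.items())
        if (begin_date is None or day >= begin_date)
        and (end_date is None or day <= end_date)
    }


# Made-up series for 1 March to 5 March 2020
series = {datetime(2020, 3, d): d * 10 for d in range(1, 6)}
window = filter_by_date(series, datetime(2020, 3, 2), datetime(2020, 3, 4))
```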
Robert Koch Institute
-
class
covid19_inference.data_retrieval.
RKI
(auto_download=False) This class can be used to retrieve and filter the dataset from the Robert Koch Institute. The data is retrieved from the arcgis dashboard.
- Features
download the full dataset
filter by date
filter by bundesland
filter by recovered, deaths and confirmed cases
Example
rki = cov19.data_retrieval.RKI()
rki.download_all_available_data()
#Access the data by
rki.data
#or
rki.get_new("confirmed","Sachsen")
rki.get_total(filter)
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with rki specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
-
get_total
(value='confirmed', bundesland: str = None, landkreis: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None, date_type: str = 'date', age_group=None) Retrieves the total/cumulative cases of the chosen kind for a region as a DataFrame with date index. Can be filtered with multiple arguments.
- Parameters
value (str) – Which data to return, possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
bundesland (str, optional) – if no value is provided it will use the full summed up dataset for Germany
landkreis (str, optional) – if no value is provided it will use the full summed up dataset for the region (bundesland)
data_begin (datetime.datetime, optional) – initial date, if no value is provided it will use the first possible date
data_end (datetime.datetime, optional) – last date, if no value is provided it will use the most recent possible date
date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)
age_group (str, optional) – Chosen age group. To get the possible combinations use possible_age_groups().
- Returns
pandas.DataFrame
-
get_new
(value='confirmed', bundesland: str = None, landkreis: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None, date_type: str = 'date', age_group=None) Retrieves all new cases from the Robert Koch Institute dataset as a DataFrame with datetime index. Can be filtered by value, bundesland and landkreis; if only a bundesland is given, all of its landkreise are summed up.
- Parameters
value (str) – Which data to return, possible values are “confirmed”, “recovered” and “deaths” (default: “confirmed”)
bundesland (str, optional) – if no value is provided it will use the full summed up dataset for Germany
landkreis (str, optional) – if no value is provided it will use the full summed up dataset for the region (bundesland)
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used (omitting it could yield errors)
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
age_group (str, optional) – Chosen age group. To get the possible combinations use possible_age_groups().
- Returns
pandas.DataFrame – table with daily new cases and the date as index
-
filter
(data_begin: datetime.datetime = None, data_end: datetime.datetime = None, variable='confirmed', date_type='date', level=None, value=None, age_group=None) Filters the obtained dataset for a given time period and returns an array containing only the desired variable.
- Parameters
data_begin (datetime.datetime, optional) – initial date, if no value is provided it will use the first possible date
data_end (datetime.datetime, optional) – last date, if no value is provided it will use the most recent possible date
variable (str, optional) – type of variable to return, possible types are: “confirmed” : cases (default), “AnzahlTodesfall” : deaths, “AnzahlGenesen” : recovered
date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)
level (str, optional) – possible strings are: “None” : return data from all Germany (default), “Bundesland” : a state, “Landkreis” : a region
value (str, optional) – string of the state/region e.g. “Sachsen”
age_group (str, optional) – Chosen age group. To get the possible combinations use possible_age_groups().
- Returns
pandas.DataFrame – one-dimensional table with only the requested variable, in the requested range
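The interplay of level and value can be illustrated on a handful of records. This is a sketch under assumptions, not the RKI class itself: the record layout, the function name filter_records and the numbers are made up, and only the three level options named above are modelled.

```python
from collections import defaultdict
from datetime import date

# Made-up records in the spirit of the RKI table
records = [
    {"Bundesland": "Sachsen", "Landkreis": "SK Dresden", "date": date(2020, 4, 1), "confirmed": 3},
    {"Bundesland": "Sachsen", "Landkreis": "SK Leipzig", "date": date(2020, 4, 1), "confirmed": 5},
    {"Bundesland": "Bayern", "Landkreis": "SK Augsburg", "date": date(2020, 4, 1), "confirmed": 8},
    {"Bundesland": "Sachsen", "Landkreis": "SK Leipzig", "date": date(2020, 4, 2), "confirmed": 2},
]


def filter_records(records, variable="confirmed", level=None, value=None):
    """Sum `variable` per day for all Germany (level=None) or one state/region."""
    if level not in (None, "Bundesland", "Landkreis"):
        raise ValueError(f"unknown level: {level!r}")
    per_day = defaultdict(int)
    for rec in records:
        if level is not None and rec[level] != value:
            continue
        per_day[rec["date"]] += rec[variable]
    return dict(sorted(per_day.items()))


germany = filter_records(records)
sachsen = filter_records(records, level="Bundesland", value="Sachsen")
leipzig = filter_records(records, level="Landkreis", value="SK Leipzig")
```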
-
filter_all_bundesland
(begin_date: datetime.datetime = None, end_date: datetime.datetime = None, variable='confirmed', date_type='date') Filters the full RKI dataset.
- Parameters
df (DataFrame) – RKI dataframe, from get_rki()
begin_date (datetime.datetime) – initial date to return
end_date (datetime.datetime) – last date to return
variable (str, optional) – type of variable to return: cases (“AnzahlFall”), deaths (“AnzahlTodesfall”), recovered (“AnzahlGenesen”)
date_type (str, optional) – type of date to use: reported date ‘date’ (Meldedatum in the original dataset), or symptom date ‘date_ref’ (Refdatum in the original dataset)
- Returns
pd.DataFrame – DataFrame with datetime dates as index, and all German regions (bundesländer) as columns
Robert Koch Institute situation reports
-
class
covid19_inference.data_retrieval.
RKIsituationreports
(auto_download=False) As mentioned by Matthias Linden, the daily situation reports contain more data. This class retrieves this additional data from Matthias Linden's website and parses it into the format we use, i.e. a datetime index.
Interesting new data includes, for example, ICU cases, deaths and recorded symptoms. For now, one can look at the data by running:
Example
rki_si_re = cov19.data_retrieval.RKIsituationreports(True)
print(rki_si_re.data)
Todo
Filter functions for ICU, Symptoms and maybe even daily new cases for the respective categories.
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with rki situation reports specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
Google
-
class
covid19_inference.data_retrieval.
GOOGLE
(auto_download=False) This class can be used to retrieve the mobility dataset from Google.
Example
gl = cov19.data_retrieval.GOOGLE()
gl.download_all_available_data()
#Access the data by
gl.data
#or
gl.get_changes(filter)
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with google specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
-
get_changes
(country: str, state: str = None, region: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None) Returns a dataframe with the relative changes in mobility to a baseline, provided by Google. They are separated into “retail and recreation”, “grocery and pharmacy”, “parks”, “transit”, “workplaces” and “residential”. Can be filtered by country, state, region and date.
- Parameters
country (str) – Selected country for the mobility data.
state (str, optional) – State for the selected data; if no value is selected, the whole country is chosen
region (str, optional) – Region for the selected data; if no value is selected, the whole state/country is chosen
data_begin, data_end (datetime.datetime, optional) – Filter for the desired time period
- Returns
pandas.DataFrame
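A relative change to a baseline is commonly expressed as a percentage. The arithmetic can be sketched as below; the function name percent_change and the visit counts are made up, and the class itself returns the values as published rather than computing them.

```python
def percent_change(value, baseline):
    """Relative change of `value` with respect to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) * 100.0 / baseline


# Made-up visit counts: 200 on a baseline day, 150 on the observed day
change = percent_change(150, 200)
```

A negative result means less activity than on the baseline day, a positive result means more.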
Our World in Data
-
class
covid19_inference.data_retrieval.
OWD
(auto_download=False) This class can be used to retrieve the testings dataset from Our World in Data.
Example
owd = cov19.data_retrieval.OWD()
owd.download_all_available_data()
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with Our World in Data specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
-
get_possible_countries
() Can be used to obtain all different possible countries in the dataset.
- Returns
pandas.DataFrame
-
get_total
(value='tests', country=None, data_begin=None, data_end=None) Retrieves all total/cumulative values from the Our World in Data dataset as a DataFrame with datetime index. Can be filtered by value and country.
- Parameters
value (str) – Which data to return, possible values are “confirmed”, “tests”, “deaths” and “vacination” (default: “tests”)
country (str) – name of the country
data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
- Returns
pandas.DataFrame – table with total/cumulative values and the date as index
-
get_new
(value='tests', country=None, data_begin=None, data_end=None) Retrieves all new cases from the Our World in Data dataset as a DataFrame with datetime index. Can be filtered by value and country.
- Parameters
value (str) – Which data to return, possible values are “confirmed”, “tests” and “deaths” (default: “tests”)
country (str) – name of the country
data_begin (datetime.datetime, optional) – initial date for the returned data, if no value is given the first date in the dataset is used
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
- Returns
pandas.DataFrame – table with new cases and the date as index
Financial Times
-
class
covid19_inference.data_retrieval.
FINANCIAL_TIMES
(auto_download=False) This class can be used to retrieve the excess mortality data from the Financial Times github repository.
Example
ft = cov19.data_retrieval.FINANCIAL_TIMES()
ft.download_all_available_data()
#Access the data by
ft.data
#or
ft.get(filter) #see below
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with financial times specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
-
get
(value='excess_deaths', country: str = 'Germany', state: str = None, data_begin: datetime.datetime = None, data_end: datetime.datetime = None) Retrieves specific data from the dataset. Can be filtered by date, country and state.
- Parameters
value (str, optional) – Which data to return, possible values are “deaths”, “expected_deaths”, “excess_deaths” and “excess_deaths_pct” (default: “excess_deaths”)
country (str, optional) – name of the country (default: “Germany”)
state (str, optional) – name of the state. Possible countries and states can be retrieved by the get_possible_countries_states() method.
data_begin (datetime.datetime, optional) – First day that should be filtered
data_end (datetime.datetime, optional) – Last day that should be filtered
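The four value names suggest a simple relation between observed, expected and excess deaths. The sketch below spells out that assumed relation; it is inferred from the names only and is not taken from the Financial Times data description. The function name and the figures are made up.

```python
def excess_mortality(deaths, expected_deaths):
    """Return (excess_deaths, excess_deaths_pct) for one period.

    Assumes excess = deaths - expected and the percentage is taken
    relative to the expected deaths.
    """
    excess = deaths - expected_deaths
    excess_pct = excess * 100.0 / expected_deaths
    return excess, excess_pct


# Made-up weekly figures
excess, excess_pct = excess_mortality(deaths=1200, expected_deaths=1000)
```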
Oxford COVID-19 Government Response Tracker
-
class
covid19_inference.data_retrieval.
OxCGRT
(auto_download=False) This class can be used to retrieve the dataset on government policies from the Oxford Covid-19 Government Response Tracker.
Example
gov_pol = cov19.data_retrieval.OxCGRT()
gov_pol.download_all_available_data()
-
__init__
(auto_download=False) On init of this class the base Retrieval Class __init__ is called, with OxCGRT specific arguments.
- Parameters
auto_download (bool, optional) – Whether or not to automatically call the download_all_available_data() method. One should explicitly call this method for more configuration options (default: false)
-
download_all_available_data
(force_local=False, force_download=False) Attempts to download from the main url (self.url_csv), which was given on initialization. If this fails, it downloads from the fallbacks. It can also be specified to use the local files or to force the download. The download methods are inherited from the base retrieval class.
-
get_possible_countries
() Can be used to obtain all different possible countries in the dataset.
- Returns
pandas.DataFrame
-
get_possible_policies
() Can be used to obtain all policies, grouped by their corresponding categories, in the dataset.
- Returns
dict
-
get_change_points
(policies, country) Returns a list of change points, depending on the selected measure and country.
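One plausible reading of a change point is a day on which the recorded level of a policy differs from the previous day. The sketch below implements that reading only; how the method actually defines and returns change points is not documented here, and the function name and levels are made up.

```python
from datetime import date, timedelta


def find_change_points(series):
    """Return the days on which the value differs from the previous day."""
    days = sorted(series)
    return [day for prev, day in zip(days, days[1:]) if series[day] != series[prev]]


start = date(2020, 3, 1)
# Made-up levels of one policy on six consecutive days
levels = [0, 0, 1, 1, 2, 2]
policy = {start + timedelta(days=i): level for i, level in enumerate(levels)}
change_points = find_change_points(policy)
```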
-
get_time_data
(policy, country, data_begin=None, data_end=None)
- Parameters
policy (str) – The wanted policy.
country (str) – Filter for country, use get_possible_countries() to get a list of possible ones.
data_begin (datetime.datetime, optional) – initial date for the returned data; if no value is given, the first date in the dataset is used (omitting it could yield errors)
data_end (datetime.datetime, optional) – last date for the returned data, if no value is given the most recent date in the dataset is used
- Returns
pandas.DataFrame – table with the policy data
Base Retrieval Class
-
class
covid19_inference.data_retrieval.retrieval.
Retrieval
(name, url_csv, fallbacks, update_interval=None, **kwargs) Each source class should inherit this base retrieval class, as it streamlines a lot of base functions. It manages downloads, multiple fallbacks and local backups via timestamp. At init of the inheriting source class, the Retrieval init should be called with the following arguments, which get saved as attributes.
An example for the usage can be seen in the _Google, _RKI and _JHU source files.
-
__init__
(name, url_csv, fallbacks, update_interval=None, **kwargs)
- Parameters
name (str) – A name for the Parent class, mainly used for the local file backup.
url_csv (str) – The url to the main dataset as csv; if an empty string is supplied, the fallback routines get used.
fallbacks (array) – Fallbacks can be filepaths to local or online sources or even methods defined in the parent class.
update_interval (datetime.timedelta) – If the local file is older than the update_interval it gets updated once the download all function is called.
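The update_interval check can be sketched with the file modification time. Whether the library reads the modification time or a separately stored timestamp is not stated here, so this is an assumption; the function name needs_update is hypothetical.

```python
import os
import tempfile
from datetime import datetime, timedelta


def needs_update(path, update_interval, now=None):
    """Return True if `path` is missing or older than `update_interval`."""
    if not os.path.exists(path):
        return True
    now = now or datetime.now()
    modified = datetime.fromtimestamp(os.path.getmtime(path))
    return now - modified > update_interval


# Demonstration with a freshly written temporary file
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"date,value\n")
fresh_path = handle.name

is_stale_now = needs_update(fresh_path, timedelta(days=1))
# Pretend two days have passed since the file was written
is_stale_later = needs_update(
    fresh_path, timedelta(days=1), now=datetime.now() + timedelta(days=2)
)
os.remove(fresh_path)
```

A missing local file is treated as stale so that the first call always triggers a download.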
-
_download_csv_from_source
(filepath, **kwargs) Uses pandas.read_csv to download the csv file. The possible kwargs can be seen in the pandas documentation.
These kwargs can vary for the different source classes and should be defined there!
- Parameters
filepath (str) – Full path to the desired csv file
- Returns
bool – True if the retrieval was a success, False if it failed
-
_fallback_handler
() Recursively iterates over all fallbacks and tries to execute subroutines depending on the type of fallback.
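Dispatching on the type of each fallback can be sketched as follows. Note that this sketch uses a plain loop rather than recursion, and that the function name first_successful, the loader argument and the example URL are hypothetical; it only illustrates the idea of trying paths and methods in order until one succeeds.

```python
def first_successful(fallbacks, load_from_path):
    """Try each fallback in order and return the first result that works.

    Strings are treated as file paths or URLs and passed to `load_from_path`;
    callables are invoked directly. Any exception moves on to the next entry.
    """
    errors = []
    for fallback in fallbacks:
        try:
            if callable(fallback):
                return fallback()
            return load_from_path(fallback)
        except Exception as exc:  # broad on purpose: any failure means "try next"
            errors.append((fallback, exc))
    raise RuntimeError(f"all {len(errors)} fallbacks failed")


def broken_loader(path):
    """Stand-in for a download that always fails."""
    raise OSError(f"cannot read {path}")


def local_backup():
    """Stand-in for a method that returns locally stored rows."""
    return [("2020-03-01", 10)]


data = first_successful(
    ["https://example.invalid/data.csv", local_backup], broken_loader
)
```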