Dealing with the GESLA dataset in R
Fernando Mayer, Niamh Cahill
The geslaR package was designed to fetch the GESLA dataset into R, either as the full dataset or as a subset of the data. There are a few ways this can be achieved:
- Downloading the entire dataset locally, then loading it into R and working from there using the Apache Arrow framework (download_gesla())
- Querying the hosted dataset directly from R, specifying the desired subset (query_gesla())
- Using the GESLA Shiny app (hereafter called simply the geslaR-app) to filter, visualise, and possibly download subsets of the dataset (run_gesla_app())
In the sections below, we will discuss each of these possibilities.
Loading the package
When you first load the geslaR package into an R session, two more packages will automatically be loaded: arrow and dplyr.
library(geslaR)
#> Loading required package: arrow
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
These two packages are strictly necessary for dealing with a huge dataset like GESLA, as they alleviate the need to load it into memory all at once. The full GESLA dataset has more than 1.1 billion rows, making it nearly impossible to load into R (or any other language that holds data in RAM, such as Python) in one go. To avoid this problem, there are two possibilities:
- Load (smaller) subsets of the full dataset
- Use an analytics framework capable of working with larger-than-memory data
In the best case scenario, these two approaches can be used together for maximum efficiency, and the geslaR package was designed to achieve this.
Regardless of the way the GESLA data is loaded into R, the best way to deal with it is through the Apache Arrow framework. For a more detailed explanation and examples, please see vignette("intro-to-arrow").
Downloading the full GESLA dataset
If you want to explore the full GESLA dataset, you can download it with the download_gesla() function, as shown below.
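A minimal call (not evaluated here, since it downloads the full dataset into the current working directory) would simply be
## Download the full GESLA dataset into "gesla_dataset/" in the current
## working directory (this may take several minutes)
download_gesla()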
This will create a directory called gesla_dataset in the current working directory (as defined by getwd()) and download the full dataset locally. The download may take some time (expect around 5 to 10 minutes), as it depends on internet connection speed. Note that the full dataset will need at least 7GB of (hard drive) storage, so make sure this is feasible. However, once downloaded, you will have access to the full dataset, and you will only need to do this once.
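As a simple safeguard against re-downloading in later sessions, you could wrap the call in a check (a minimal sketch using base R only):
## Only download the full dataset if it is not already present locally
if (!dir.exists("gesla_dataset")) {
  download_gesla()
}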
You will notice that the full dataset is composed of 5119 Apache Parquet files, ending in .parquet.
## Number of downloaded files
length(list.files("gesla_dataset"))
#> [1] 5119
## Check the first files
head(list.files("gesla_dataset"))
#> [1] "a121-a12-nld-cmems.parquet" "a2-a2-bel-cmems.parquet"
#> [3] "aalesund-aal-nor-cmems.parquet" "aarhus-aar-dnk-cmems.parquet"
#> [5] "aasiaat-aas-grl-gloss.parquet" "abashiri-347a-jpn-uhslc.parquet"
These files are the same as those originally distributed in the GESLA dataset, so each one refers to the site the data comes from. To load this full dataset into R, use the arrow::open_dataset() function, specifying the location of the .parquet files. Although there are many files, this function recognises them as a single dataset, because they all have the same structure (or “Schema”).
## Open dataset
da <- open_dataset("gesla_dataset")
## Check the object
da
#> FileSystemDataset with 5119 Parquet files
#> date_time: timestamp[us]
#> year: int64
#> month: int64
#> day: int64
#> hour: int64
#> country: string
#> site_name: string
#> lat: double
#> lon: double
#> sea_level: double
#> qc_flag: int64
#> use_flag: int64
#> file_name: string
#>
#> See $metadata for additional Schema metadata
## Verify class
class(da)
#> [1] "FileSystemDataset" "Dataset" "ArrowObject"
#> [4] "R6"
Since this is an ArrowObject, the full dataset will not actually be loaded into memory (as it would be if it were a standard R object, such as a tibble or data.frame). Note, however, that some basic information, such as dim() and names(), can be retrieved simply with
dim(da)
#> [1] 1172435674 13
names(da)
#> [1] "date_time" "year" "month" "day" "hour" "country"
#> [7] "site_name" "lat" "lon" "sea_level" "qc_flag" "use_flag"
#> [13] "file_name"
Any other manipulation of the dataset should be done using dplyr verbs. For example, to count the number of observations by country, one could use the code below.
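A minimal sketch of such a count (output omitted here, as it depends on the downloaded data):
## Count the number of observations by country
da |>
  count(country) |>
  collect()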
Querying subsets directly into R
The query_gesla() function can be used to load subsets of the full GESLA dataset into R. This function will connect to the GESLA dataset hosted on an Amazon AWS S3 bucket, and will automatically filter the required piece of data. Note that the query time may vary, as it will depend on the size of the required subset and on internet connection speed.
The selection is made by specifying one or more countries, years, and site names. The only mandatory argument is country, which should be a character vector of three-letter ISO 3166-1 alpha-3 codes. If the argument year is empty, then all available years for those countries will be selected. The same applies to the site_name argument.
For example, to select all the available data for Ireland (IRL), simply use
da <- query_gesla(country = "IRL")
#> ℹ This process can take some time, as it depends on the size of the final dataset, and on
#> internet connection.
#> ✔ Connecting to the data server... [3.5s]
#> ✔ Filtering data. This can take some time... [2m 43.4s]
#> ✔ Query finished. [13ms]
da
#> Table
#> 16321218 rows x 13 columns
#> $date_time <timestamp[us]>
#> $year <int64>
#> $month <int64>
#> $day <int64>
#> $hour <int64>
#> $country <string>
#> $site_name <string>
#> $lat <double>
#> $lon <double>
#> $sea_level <double>
#> $qc_flag <int64>
#> $use_flag <int64>
#> $file_name <string>
#>
#> See $metadata for additional Schema metadata
dim(da)
#> [1] 16321218 13
names(da)
#> [1] "date_time" "year" "month" "day" "hour" "country"
#> [7] "site_name" "lat" "lon" "sea_level" "qc_flag" "use_flag"
#> [13] "file_name"
class(da)
#> [1] "Table" "ArrowTabular" "ArrowObject" "R6"
This is also an ArrowObject, and it works the same way as any other Arrow object type, i.e., it can be manipulated with dplyr verbs. At the end of this section we will show some examples.
Many other queries can be made, for example,
## Select one specific year
da <- query_gesla(country = "IRL", year = 2015)
## Multiple years
da <- query_gesla(country = "IRL", year = c(2015, 2017))
da <- query_gesla(country = "IRL", year = 2010:2017)
da <- query_gesla(country = "IRL", year = c(2010, 2012, 2015))
## Multiple countries
da <- query_gesla(country = c("IRL", "ATA"), year = 2015)
da <- query_gesla(country = c("IRL", "ATA"), year = 2010:2017)
As one last example, we may be interested in data only for the site Dublin_Port in Ireland, and for the years 2015 and 2017. This query would be done as follows
<- query_gesla(country = "IRL", year = c(2015, 2017),
da site_name = "Dublin_Port")
#> ℹ This process can take some time, as it depends on the size of the final dataset, and on
#> internet connection.
#>
ℹ Connecting to the data server...
[K
3.4s]
[K
✔ Connecting to the data server... [#>
ℹ Filtering data. This can take some time...
[K
✔ Filtering data. This can take some time... [2m 34s]
[K#>
ℹ Query finished.
[K
✔ Query finished. [21ms]
[K
Now we could explore this subset using dplyr verbs as usual
## Verifying number of observations per year
da |>
count(year) |>
collect()
#> # A tibble: 2 × 2
#> year n
#> <int> <int>
#> 1 2015 93790
#> 2 2017 104835
## Calculating summary statistics per year
da |>
group_by(year) |>
summarise(
min = min(sea_level),
mean = mean(sea_level),
max = max(sea_level)) |>
collect()
#> # A tibble: 2 × 4
#> year min mean max
#> <int> <dbl> <dbl> <dbl>
#> 1 2015 -2.62 -0.0104 2.38
#> 2 2017 -2.43 0.0762 2.53
Note the use of the dplyr::collect() function at the end of each call. This is necessary to convert the final result (which is processed by Arrow) to a common R object (usually a tibble or similar).
As stated before, the queries made by query_gesla() may take some time to process. To avoid repeating the query in every new session, you can save the resulting object to a local file, so it will be available instantly the next time you need it. To achieve this, you can use the write_gesla() function
write_gesla(da)
This call will create a file named gesla-data.parquet, which will be ready for reading at any time, without the need for querying again. Just use
da <- read_gesla("gesla-data.parquet")
to make it available again.
Please do not try to save this object as a standard R file, such as .rds or .rda, using saveRDS() or save(), for example. The resulting saved object will not be the same as the original object, because an ArrowObject is not a standard R object (it is actually just a pointer to an Arrow object).
The query_gesla() function also has two more arguments. The use_flag = 1 argument means that only the data flagged by the GESLA team as useful for analysis will be used. Please see the GESLA format page for more information. The other argument, as_data_frame = FALSE, specifies whether the resulting object should be an ArrowObject (the default) or a data.frame. We highly recommend keeping the default option, as the resulting data.frame can be larger than memory, causing a memory overflow or even an R session crash. Keeping the object as an ArrowObject will always guarantee that the resulting query is made available. In any case, an ArrowObject can always be converted to a data.frame, if desired, without the need to repeat the query with as_data_frame = TRUE. For example, the object da created above (which is an ArrowObject) may be converted to a data.frame as shown below.
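One way to do this is with dplyr::collect(), which returns a tibble (a data.frame subclass); a minimal sketch:
## Convert the ArrowObject to a standard R data frame (tibble)
da_df <- da |>
  collect()
class(da_df)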
Using the GESLA Shiny app
The GESLA Shiny app (geslaR-app) was developed as a user-friendly interface to explore the whole GESLA dataset. The app can be launched simply with the run_gesla_app() function, as shown below.
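A minimal call (note that the first run will download the full dataset, as described below) is
## Launch the geslaR-app in the default browser
run_gesla_app()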
This will create a directory called gesla-app in the current working directory, containing the necessary files for the app itself, and a sub-directory called gesla_dataset containing all the GESLA dataset files (as .parquet files, just as the download_gesla() function does, as explained above). Note that the first time this function is called, it will create these directories and download the full GESLA dataset. However, once this is done, the next time this function is called in this working directory, only the app will open in the default browser (no further downloads will be necessary).
This web-interface allows the user to apply some basic filters, by choosing countries, years and sites. By doing this, the app will filter the data and show information about the selected subset. The first tab contains a preview of the selected subset. The two subsequent tabs provide some plots with the number of observations by year, and summary plots for the sea level data itself. The last tab will show a map with the selected locations, and by clicking on the locations, a pop-up will show information about that specific station.
The user has the option to download the selected subset in CSV or Parquet file format. We highly recommend exporting it as a Parquet file, as the file size will be much smaller, and it can simply be loaded into R with the read_gesla() function, creating an ArrowObject.
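For example, assuming the app exported a subset named gesla-subset.parquet (a hypothetical file name), it could be read back with
## Read a subset exported from the geslaR-app ("gesla-subset.parquet" is a
## hypothetical file name)
da_sub <- read_gesla("gesla-subset.parquet")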