Waveform Signal Entropy and Compression Study of Whole-Building Energy Datasets

Thomas Kriechbaumer
Chair for Application and Middleware Systems
Technische Universität München, Germany
[email protected]

Hans-Arno Jacobsen
Chair for Application and Middleware Systems
Technische Universität München, Germany
[email protected]

ABSTRACT

Electrical energy consumption has been an ongoing research area since the advent of smart homes and Internet of Things devices. Consumption characteristics and usage profiles are directly influenced by building occupants and their interaction with electrical appliances. Information extracted from these data can be used to conserve energy and increase user comfort levels. Data analysis together with machine learning models can be utilized to extract valuable information for the benefit of the occupants themselves, power plants, and grid operators. Public energy datasets provide a scientific foundation to develop and benchmark these algorithms and techniques. With datasets exceeding tens of terabytes, we present a novel study of five whole-building energy datasets with high sampling rates, their signal entropy, and how a well-calibrated measurement can have a significant effect on the overall storage requirements. We show that some datasets do not fully utilize the available measurement precision, therefore leaving potential accuracy and space savings untapped. We benchmark a comprehensive list of 365 file formats, transparent data transformations, and lossless compression algorithms. The primary goal is to reduce the overall dataset size while maintaining an easy-to-use file format and access API. We show that with careful selection of file format and encoding scheme, we can reduce the size of some datasets by up to 73%.

1 INTRODUCTION

Home and building automation promise many benefits for the occupants and power utilities. From increased user comfort levels to demand response and lower electricity costs, Smart Homes offer a variety of assistance and informational gains. The Internet of Things, a combination of sensors and actuators, can be intelligently controlled based on sensor data or external triggers. Power monitoring and smart metering are a key step to fulfill these promises. The influx of renewable energies and the increased momentum of changes in the power grid and its operations are a main driving factor for further research in this area.

Non-intrusive load monitoring (NILM) can be one solution to identify and disaggregate power consumers (appliances) from a single-point measurement in the building. Utilizing a centralized data acquisition system saves costs for hardware and installation in the electrical circuits under observation. The NILM community heavily relies on long-term measurement data, in the form of public datasets, to craft new algorithms, train models, and evaluate their accuracy on per-appliance energy consumption or appliance identification. In recent years these datasets grew significantly in size and sampling characteristics (temporal and amplitude resolution).

Collecting, distributing, and managing large-scale data storage facilities is an ongoing research topic [11, 46] and strongly depends on the environment and systems architecture.

High sampling rates are particularly interesting for NILM to extract waveform information from voltage and current signals [24]. Early datasets targeted at load disaggregation and appliance identification started with under 2 GiB [26], whereas recently published datasets reach nearly 100 TiB of raw data [28]. Working with such quantities requires specialized storage and processing techniques, which can be costly and maintenance-heavy. Optimizing infrastructure costs for storage is part of ongoing research [29, 37].

The data quality requirements typically define a fixed sampling rate and bit-resolution for a static environment. Removing or augmenting measurements might impede further research; therefore, no filtering or preprocessing steps are performed before releasing the data.

Data compression techniques can be classified as lossy or lossless [8]. Lossy algorithms allow for some margin of error when encoding the data and typically give a metric for the remaining accuracy or lost precision. For comparison, most audio, image, and video compression algorithms remove information not detectable by the human ear or eye. This allows for a data rate reduction in areas of the signal that a user cannot perceive, or perceives only at reduced resolution, due to typical human physiology. Depending on the targeted use case, certain aspects of the input signal are considered unimportant and might not be reconstructable. Encoding only the amplitude and frequency of the signal can lead to vast space savings, assuming phase alignment, harmonics, or other signal characteristics are not required for future analysis. On the contrary, lossless encoding schemes guarantee a 1:1 representation of all measurement data with a reversible data transformation. If the intended use case or audience for a given dataset is not known or is very diverse in their requirements, only lossless compression can be applied to keep all data accessible for future use. Recent works pointed out an imbalance in the amount of research on steady-state versus waveform-based compression of electricity signals [10].

Further consideration must be given to communication bandwidth (transmission to a remote endpoint) and in-memory processing (SIMD computation). The efficient use of network channels can be a key requirement for real-time monitoring of streaming data. In the case of one-time transfers (or burst transmissions), chunking is used to split large datasets into more manageable (smaller) files. However, choosing a maximum file size depends on the available memory and CPU (instruction set and cache size). Distributing large datasets as a single file creates an unnecessary burden for researchers and the required infrastructure.

A suitable file format must be considered for raw data storage, as well as easy access to metadata, such as calibration factors, timestamps, and identifier tags. None of the existing datasets (NILM or related datasets with high sampling rates) share a common file format, chunk size, or signal sampling distribution. This heterogeneity makes it difficult to apply algorithms and evaluation pipelines on more than one dataset. Therefore, researchers working with multiple datasets have to implement custom importer and converter stages, which can be time-consuming and error-prone.

This work provides an in-depth analysis of public whole-building datasets, and gives a comprehensive evaluation of best-practice storage techniques and signal conditioning in the context of energy data collection. The key contributions of this work are:

(1) A numerical analysis of signal entropy and measurement calibration of public whole-building energy datasets by evaluating all signal channels with respect to their available resolution and sample distribution over the entire measurement period. The resulting entropy metrics further motivate our contributions and the need for a well-calibrated measurement system.

(2) An exhaustive benchmark of storage requirements and potential space savings with a comprehensive collection of 365 file formats, lossless compression techniques, and reversible data transformations. We re-encode and normalize data from all datasets to evaluate the effect of compression. We present the best-performing combinations and their overall space savings. The full ranking can be used to select the optimal file format and compression for offline storage of large long-term energy datasets.

(3) A full-scale evaluation of increasingly larger data chunks per file and their final compression ratio. The dependency between input size and achievable compression ratio is evaluated up to 3072 MiB per file. The results provide an evidence-based guideline for future selection of chunk sizes and possible environmental factors for consideration.

We give an in-depth evaluation of file formats and signal characteristics that directly affect storage, encoding, and compression of such data. Each of the analyzed datasets was created with a dedicated set of requirements; therefore, a single best option does not exist. However, with this study, we want to help the community to better understand the fundamental causes of compression performance in the field of waveform-based whole-building energy datasets. We provide a definition of measurement calibration and its effects on the storage requirements based on signal entropy. Published datasets are self-contained and final, which allows us to prioritize the compression ratio and achievable space saving over other common compression metrics (CPU load, throughput, or latency). We define the achievable space saving and compression ratio as the only criterion when dealing with large (offline) datasets.

The rest of this paper is structured as follows: We discuss related work in Section 2. We describe the evaluated datasets in Section 3, which are then used in the experiments in Sections 4, 5, and 6. Finally, we present results in Section 7, before concluding in Section 8.

2 RELATED WORK

NILM and related fields distinguish between low and high sampling rates to capture voltage and current measurements. Low sampling rates (or low-frequency) are typically 1 Hz or slower. High sampling rates (or high-frequency) are typically above 500 Hz (or at least satisfy the Nyquist–Shannon sampling theorem [41]). Recording multiple channels with high sampling rates requires oscilloscopes or specialized data acquisition systems as presented in [21, 27, 31].

Low-frequency energy data can benefit greatly from compression when applied to smart meter data, as multiple recent works have shown [13, 39, 44, 45]. Electricity smart meters can be a source of high data volume with measurement intervals of 1 s, 60 s, 15 min, or longer. Possible transmission and storage savings due to lossless compression have been evaluated in [45]. While the achievable compression ratio increased with smaller sampling intervals, the benefits of compression vanish quickly above 15 min intervals. Various encodings (ASCII- and binary-based) have been evaluated for such low-frequency measurements, and in most cases, a binary encoding greatly outperforms an ASCII-based encoding. The need for smart data compression was discussed in [33], which further motivates in-depth research in this area. The main focus of the authors was smart meter data with low temporal resolution from 10,000 meters or more. Various compression techniques were presented and a fast-streaming differential compression algorithm was evaluated: removing steady-state power measurements ($t_{i+1} - t_i = 0$) can save on average 62% of required storage space.

High-frequency energy data offers a significantly larger potential for lossless compression, due to the inherent repeating waveform signal. Tariq et al. [43] utilized general-purpose compressors, such as LZMA and bzip2, and achieved good compression ratios on some datasets. Applying differential compression and omitting timestamps can yield size reductions of up to 98% on smart grid data; however, these results are not comparable as there is no generalized uniform data source. The presented results use a single data channel and an ASCII-based data representation as a baseline for their comparison, which contains an inherent encoding overhead. The SURF file format [36] was designed to store NILM datasets and provide an API to create and modify such files. The internal structure is based on wave-audio and augments it with new types of metadata chunks. To the best of our knowledge, the SURF file format didn't gain any traction due to its lack of support in common scientific computing frameworks. The recently published EMD-DF file format [35], by the same authors, relies on the same wave-audio encoding, while extending it with more metadata and annotations. Neither SURF nor EMD-DF provides any built-in support for compression. The power grid community defined the PQDIF [42] (for power quality and quantity monitoring) and COMTRADE [5] (for transient data in power systems) file formats. Both specifications outline a structured view of numerical data in the context of energy measurements. Raw measurements are augmented with precomputed evaluations (statistical metrics), which can cause a significant overhead in required storage space. While PQDIF supports a simple LZ compression, COMTRADE does not offer such capabilities. To the best of our knowledge, these file formats never gained traction outside the power grid operations community.

Lossy compression can achieve multiple magnitudes higher compression ratios than lossless, with minimal loss of accuracy for certain use cases [13]. Using piecewise polynomial regression, the authors achieved good compression ratios on three existing smart grid scenarios. The compressed parametrical representation was stored in a relational database system. However, this approach only applies if the use case and expected data transformation are known before applying a lossy data reduction. A 2-dimensional representation for power quality data was proposed in [17] and [38], which then could be used to employ compression approaches from image processing and other related fields. While both approaches can be categorized as lossy compression due to their numerical approximation using wavelets or trigonometric functions, they require a specialized encoder and decoder which is not readily available in scientific computing frameworks.

The NilmDB project [34] provides a generalized user interface to access, query, and analyze large time-series datasets in the context of power quality diagnostics and NILM. A distributed architecture and a custom storage format were employed to work efficiently with "big data". The underlying data persistence is organized hierarchically in the filesystem and utilizes tree-based structures to reduce storage overhead. This internal data representation is capable of handling multiple streams and non-uniform data rates but lacks support for data compression or more efficient coding schemes. NILMTK [6], an open-source NILM toolkit, provides an evaluation workbench for power disaggregation and uses the HDF5 [15] file format with a custom metadata structure. Most available public datasets require a specialized converter to import them into a NILMTK-usable file format. While the documentation states that a zlib data compression is applied, some converters currently use bzip2 or Blosc [1].

3 EVALUATED DATASETS

While there is a vast pool of smart meter datasets¹, i.e., low sampling rates of measurements every 1 s, 15 min, or 1 h, a majority of the underlying information is already lost (the signal waveform). The raw signals are aggregated into single root-mean-squared voltage and current readings, frequency spectra, or other metrics accumulated over the last measurement interval. This can already be classified as a type of lossy compression. For some use cases, this data source is sufficient to work with, while other fields require high sampling rates to extract more information from the signals.

All following experiments and evaluations were performed on publicly accessible datasets: the Reference Energy Disaggregation Data Set (REDD [26]), the Building-Level fUlly-labeled dataset for Electricity Disaggregation (BLUED [3]), the UK Domestic Appliance-Level Electricity dataset (UK-DALE [25]), and the Building-Level Office eNvironment Dataset (BLOND [28]). We will refer to these datasets by their established acronyms: REDD, BLUED, UK-DALE, and BLOND. Based on the energy dataset survey provided by the NILM-Wiki¹, these are all datasets of long-term continuous measurements with voltage and current waveforms from selected buildings or households. The data acquisition systems and data types are comparable enough to warrant their use in this context (Table 1).

¹ http://wiki.nilm.eu/datasets.html

Table 1: Overview of evaluated datasets: long-term continuous measurements containing raw voltage and current waveforms.

Dataset    | Current Channels | Voltage Channels | Sampling Rate | Values
-----------|------------------|------------------|---------------|-------
REDD       | 2                | 1                | 15 kHz        | 24-bit
BLUED      | 2                | 1                | 12 kHz        | 16-bit
UK-DALE    | 1                | 1                | 16 kHz        | 24-bit
BLOND-50   | 3                | 3                | 50 kHz        | 16-bit
BLOND-250  | 3                | 3                | 250 kHz       | 16-bit

Measurement systems and their analog-to-digital converters (ADC) always output a unit-less integer number, either in $[0, 2^{\text{bits}})$ for unipolar ADCs or $[-2^{\text{bits}-1}, 2^{\text{bits}-1})$ for bipolar ADCs. During setup and calibration, a common factor is determined to convert raw values into a voltage or current reading. Some datasets publish raw values and the corresponding calibration factors, while others directly publish Volt- and Ampere-based readings as float values. Datasets only available as floating-point values are converted back into their original integer representation without loss of precision by reversing the calibration step from the analog-to-digital converter for each channel:

\[ \text{measurement}_i = \text{ADC}_i \cdot \text{calibration}_{\text{channel}} \]
\[ [\text{Volt}] = [\text{steps}] \cdot [\text{Volt}/\text{step}] \]
\[ [\text{Ampere}] = [\text{steps}] \cdot [\text{Ampere}/\text{step}] \]

Each of the mentioned datasets was published in a different (compressed) file format and encoding scheme. To allow for comparisons between these datasets, we decompressed, normalized, and re-encoded all data before analyzing them (raw binary encoding).

From REDD, we used the entire available High Frequency Raw Data: house_3 and house_5, each with 3 channels: current_1, current_2, and voltage. The custom file format encodes a single channel per file. In total, 1.4 GiB of raw data from 126 files were used.

From BLUED, we used all available waveform data (1 location, 16 sub-datasets) and 3 channels: current_a, current_b, and voltage. The CSV-like text files contain voltage and two current channels and a dedicated measurement timestamp. In total, 41.1 GiB of raw data from 6430 files were used.

From UK-DALE, we selected house_1 from the most recent release (UK-DALE-2017-16kHz, the longest continuous recording). The compressed FLAC files contain 2 channels: current and voltage. In total, 6259.1 GiB of raw data from 19491 files were used.

From BLOND, we selected the aggregated mains data of both sub-datasets: BLOND-50 and BLOND-250. The HDF5 files with gzip compression contain 6 channels: current{1-3} and voltage{1-3}. In total, 10,246.7 GiB of raw data from 61125 files of BLOND-50, and 11,899.0 GiB of raw data from 35490 files of BLOND-250 were used.

The data acquisition systems (DAQ) of all datasets produce a linear pulse-code modulated (LPCM) stream. The analog signals are sampled in uniform intervals and converted to digital values (Figure 1). The quantization levels are distributed linearly in a fixed measurement range, which requires a signal conditioning step in the DAQ system. ADCs typically cannot directly measure mains voltage and require a step-down converter or measurement probe. Mains current signals need to be converted into a proportional voltage.

4 ENTROPY ANALYSIS

DAQ units provide a way to collect digital values from analog systems. As such, the quality of the data depends strongly on the correct calibration and selection of measurement equipment. Mains electricity signals are typically not compatible with modern digital systems, requiring an indirect measurement through step-down transformers or other metrics. Mains voltage can vary by up to ±10% during normal operation of the grid [2, 14], making it necessary to design the measurement range with a safety margin. The expected signal, plus any margin for spikes, should be equally distributed over the available ADC resolution range. Leaving large areas of the available value range unused can be prevented by carefully selecting input characteristics and signal conditioning (step-down calibration). A rule of thumb for range calibration is that the expected signal should occupy 80–90% of the range, leaving enough headroom for unexpected measurements. Input signals larger than the measurement range get recorded as the minimum/maximum value. Grossly exceeding the rated input signal level could damage the ADC, unless dedicated signal conditioning and protection is employed.

[Figure 1: plot of raw ADC values in steps (−32768 to 32767) and the corresponding voltage in V (−300 to 300) over time in samples.]

Figure 1: Linear pulse-code modulation stream of a sinusoidal waveform sampled with a 16-bit ADC. The waveform corresponds to a 230 V mains voltage signal.

We extracted the probability mass function (PMF) of all evaluated datasets for the full bit-range (16- or 24-bit). The value histogram is a structure mapping each possible measurement value (integer) to the number of times this value was recorded. Ideally, the region between the lowest and highest value contains a continuous value range without gaps. However, the quantization level (step size) could cause a mismatch and result in skipped values. We then normalize this histogram to obtain the PMF and compute the signal entropy per channel, which gives an estimation of the actual information contained in the raw data and provides a lower bound for the achievable compression ratio based on the Kolmogorov complexity.

\[ X = \{-2^{\text{bits}-1}, \ldots, 0, \ldots, 2^{\text{bits}-1} - 1\} \]
\[ \mathit{hist} = \mathrm{histogram}(\mathit{dataset}, X) \]
\[ f_X = \frac{\mathit{hist}}{\sum_{x \in X} \mathit{hist}[x]} \]
\[ \forall x \in X \text{ where } f_X(x) = 0 : f_X(x) := 1 \]
\[ H(x) = -\sum_{x \in X} f_X(x) \cdot \log_2\!\left(f_X(x)\right) \]

Each dataset is split into multiple files, making it necessary to merge all histograms into a total result at the end of the computing run. Since all histograms can be combined with a simple summation, the process can be parallelized and computed without any particular order, as sketched below. Computing and merging all histograms is, therefore, best accomplished in a distributed compute cluster with multiple nodes or a similar environment.
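A small sketch of this merge step, assuming a hypothetical raw-binary file layout of 16-bit samples; because the summation is commutative and associative, per-file histograms can be computed in any order and on any node.

```python
import numpy as np

def file_histogram(path: str) -> np.ndarray:
    """Value histogram of one data file (hypothetical 16-bit raw-binary layout)."""
    samples = np.fromfile(path, dtype=np.int16).astype(np.int64)
    return np.bincount(samples + 32768, minlength=65536)  # shift to non-negative bins

def merge_histograms(partials) -> np.ndarray:
    """Histograms are additive, so partial results combine by element-wise summation."""
    return np.sum(list(partials), axis=0)
```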

5 DATA REPRESENTATION

Choosing a suitable file format for potentially large datasets involves multiple tradeoffs and decisions, including supported platforms, scientific computing frameworks, metadata, error correction, compression, and chunking. The available choices for data representation range from CSV data (ASCII-parsable) to binary file formats and custom encoding schemes. From the energy dataset survey and the evaluated datasets, it can be noted that every dataset uses a different file format, encoding scheme, and, optionally, compression.

Publishing and distributing large datasets requires storage systems capable of providing long-term archives of scientific measurement data. Lossless compression helps to minimize storage costs and distribution efforts. At the same time, other researchers accessing the data benefit from smaller files and shorter access times to download the data.

Electricity signals (current and voltage) contain a repetitive waveform with some form of distortion depending on the load. In an ideal power grid, the voltage would follow a perfect sinusoidal waveform without any offset or error. This would allow us to accurately predict the next voltage measurement. However, constant fluctuations in supply and demand cause the signals to deviate. The fact that each signal is primarily continuous (without sudden jumps) can be beneficial to compression algorithms.

A delta encoding scheme only stores the numerical difference of neighboring elements in a time-series measurement vector. This can be useful for slow-changing signals because the difference of a signal might require fewer bytes to encode than the absolute value:

\[ d_0 = v_0, \qquad \forall i \in \{1, \ldots, n\} : d_i = v_i - v_{i-1} \]
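A minimal NumPy sketch of this transformation and its exact inverse (assuming a dtype wide enough to hold the differences, so no overflow occurs):

```python
import numpy as np

def delta_encode(values: np.ndarray) -> np.ndarray:
    """d_0 = v_0; d_i = v_i - v_{i-1}. Small deltas compress better downstream."""
    deltas = np.empty_like(values)
    deltas[0] = values[0]
    np.subtract(values[1:], values[:-1], out=deltas[1:])
    return deltas

def delta_decode(deltas: np.ndarray) -> np.ndarray:
    """Inverse transform: the cumulative sum restores the original samples exactly."""
    return np.cumsum(deltas)

v = np.array([100, 102, 101, 105], dtype=np.int32)
assert np.array_equal(delta_decode(delta_encode(v)), v)  # lossless round-trip
```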

We compare the original data representation (format, compression, encoding) of each dataset, reformat them into various file formats, and evaluate their storage saving based on a comprehensive list of lossless compression algorithms. This involves encoding raw data in a more suitable representation to compare their compressed size, CS = compressed_size / original_size × 100%, and the resulting space saving, SS = 100% − CS. We define the main goal as reducing the overall required storage space for each dataset, and deliberately do not consider compression or decompression speed. The performance characteristics (throughput and speed) are well known for individual compression techniques [4] and are of minor importance in the case of large static datasets, which require only a single compression step before distribution. Performance metrics are important when dealing with repeated compression of raw data, which is not the case for static energy datasets. Repeated decompression is, however, relevant because researchers might want to read and parse the files over and over again while analyzing them (if in-memory processing is not feasible). As noted in [4], decompression speed and throughput are typically not a performance bottleneck in data analytics tasks.

Building a novel data compression scheme for energy data is counter-productive, since most scientific computing frameworks would lack support, and the idea suffers from the "not invented here" and "yet another standard" problems, both common anti-patterns in the field of engineering when new solutions are developed despite existing suitable approaches [13, 26, 36]. Therefore, a key requirement is that each file format must be supported in common scientific computing systems to read (and possibly write) data files.

We selected four format types: raw binary, HDF5 (a data model and file format for storing and managing data), Zarr (chunked, compressed, N-dimensional arrays), and audio-based PCM containers.

Raw binary formats provide a baseline for comparison. All samples are encoded as integer values (16-bit or 24-bit) and are compressed with a general-purpose compressor: zlib/gzip, LZMA, bzip2, and zstd, all with various parameter values. The input for each compressor is either raw-integer or variable-length encoded data (LEB128S [19]), which is serialized either row- or column-based from all channels (interleaving). The LEB128S encoding is additionally evaluated with delta encoding of the input.
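For illustration, a signed LEB128 (LEB128S) encoder fits in a few lines; this is a generic textbook implementation rather than the benchmark code, and it shows why small magnitudes (e.g. delta-encoded samples) shrink to fewer bytes.

```python
def leb128s_encode(value: int) -> bytes:
    """Signed LEB128: 7 payload bits per byte, high bit flags continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, so the sign is preserved
        done = (value == 0 and not byte & 0x40) or (value == -1 and byte & 0x40)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

assert leb128s_encode(2) == b"\x02"         # 1 byte instead of a fixed 2 or 3
assert leb128s_encode(-129) == b"\xff\x7e"  # larger magnitudes grow as needed
```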

The Hierarchical Data Format 5 (HDF5) [15] provides structured metadata and data storage, data transformations, and libraries for most scientific computing frameworks. All data is organized in natively-typed arrays (multi-dimensional matrices) with various filters for data compression, checksumming, and other reversible transformations before storing the data to a file. The API transparently reverses these transformations and compression filters while reading data. HDF5 is popular in the scientific community and used for various big-data-type applications [7, 12, 18, 40]. The public registry for HDF5 filters² currently lists 21 data transformations, most of them compression-related. Each HDF5 file is evaluated with and without the shuffle filter, zlib/gzip, lzf, MAFISC [22] with LZMA, szip [20], Bitshuffle [30] with LZ4, zstd, and the full Blosc [1] compression suite, again all with various parameter values.
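As a sketch of how such a filter pipeline is set up with the h5py library, the following writes one channel with the shuffle filter and gzip compression, both reversed transparently on read; the file name, data, and calibration attribute are illustrative, and filters such as zstd or MAFISC would require third-party HDF5 filter plugins.

```python
import h5py
import numpy as np

# Stand-in for one minute of a 16-bit channel sampled at 50 kHz.
current = np.random.randint(-32768, 32767, size=50_000 * 60, dtype=np.int16)

# Shuffle reorders bytes across values so equal high-order bytes cluster,
# which helps the subsequent gzip stage.
with h5py.File("blond-chunk.h5", "w") as f:
    f.create_dataset("current1", data=current, chunks=True,
                     shuffle=True, compression="gzip", compression_opts=9)
    f["current1"].attrs["calibration"] = 400.0 / 32768  # illustrative metadata
```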

Zarr [32] organizes all data in a filesystem-like structure, which can be archived as a single zip-archive file or as a tree structure in the filesystem. Each channel is stored as a separate array (data stream) with optional chunk-based compression via zlib/gzip, LZMA, bzip2, or Blosc (with shuffle, Bitshuffle, or no-shuffle filter), again all with various parameter values. Each Zarr file is additionally evaluated with a delta filter to reduce the value range.
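A comparable Zarr sketch (assuming the Zarr v2 / numcodecs API) that combines a delta filter with Blosc compression might look as follows; the array shape, chunking, and codec parameters are illustrative.

```python
import numpy as np
import zarr
from numcodecs import Blosc, Delta

voltage = np.random.randint(-32768, 32767, size=50_000 * 60, dtype=np.int16)

# Delta shrinks the value range of the slowly varying signal before
# Blosc (here zstd with bit-shuffling) compresses each chunk.
z = zarr.open("blond-chunk.zarr", mode="w",
              shape=voltage.shape, chunks=(50_000 * 10,), dtype="i2",
              filters=[Delta(dtype="i2")],
              compressor=Blosc(cname="zstd", clevel=5, shuffle=Blosc.BITSHUFFLE))
z[:] = voltage
```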

Audio-based formats use LPCM-type data encoding (PCM16 or PCM24) with a fixed precision and sampling rate. All channels are encoded into a single container using lossless compression formats: FLAC [16], ALAC [23], and WavPack [9]. These formats do not provide tune-able parameters.
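For the audio route, a library such as soundfile (a libsndfile wrapper) can write multi-channel LPCM data to FLAC; this sketch assumes 16-bit samples and treats array columns as channels, with the file name being illustrative.

```python
import numpy as np
import soundfile as sf  # libsndfile wrapper; FLAC carries up to 8 channels

fs = 50_000  # samples per second, within FLAC's sampling rate limit
data = np.random.randint(-32768, 32767, size=(fs * 60, 2), dtype=np.int16)

# Columns become audio channels (e.g. one voltage and one current signal);
# PCM_16 keeps the native bit depth, and FLAC compresses losslessly.
sf.write("mains-chunk.flac", data, samplerate=fs, subtype="PCM_16")
```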

Calibration factors, timestamps, and labels can augment the raw data in a single file while providing a unified API for accessing data and metadata. Raw binary formats lack this type of integrated support and require additional tooling and encoding schemes for metadata. Audio-based formats require a container format to store metadata, typically designed for the needs of the music and entertainment industry. Out of these formats, only HDF5 and Zarr provide support for encoding and storing arbitrary metadata objects (complex types or matrices) together with measurement data.

² https://support.hdfgroup.org/services/filters.html

Most audio-based formats support at most 8 signal channels, while general-purpose formats such as HDF5 and Zarr have no restrictions on the total number of channels per file. The sampling rate can also be a limiting factor: FLAC supports at most 655.35 kHz and ALAC only 384 kHz. ADC resolution (bit depth) is mostly bound by existing technological limitations and will not exceed 32-bit in the foreseeable future. While these constraints are within the requirements for all datasets under evaluation, they need to be considered for future dataset collection and the design of measurement systems.

In total, we encoded the evaluated datasets with 365 different data representation formats: 54 raw, 264 HDF5-based, 44 Zarr-based, and 3 audio-based, and gathered their per-file compressed size as a benchmark. The complete list, including all parameters and compression options, is available in the online appendix³. The full analysis was performed in a distributed computing environment and consumed approx. 1,176,000 CPU-core-hours (dual Intel Xeon E5-2630v3 machines with 128 GiB RAM and 10 Gibit Ethernet interfaces).

6 CHUNK SIZE IMPACT

Each dataset is provided in equally-sized files, typically based on measurement duration. Working with a single large file can be cumbersome due to main memory restrictions or available local storage space. Assuming a typical desktop computer with 8 GiB of main memory is used for processing, a single file from a dataset must be fully loaded into memory before any computation can be done. Depending on the analysis and algorithms, multiple copies might be required for intermediary results and temporary data. This means the main memory size is an upper bound for the maximum feasible chunk size.

Some file formats and data types support internal chunking or streamed data access, in which data can be read into memory sequentially or with random access. In such environments, other factors will limit the usable chunk size, such as file system capabilities, network-attached storage, or other operating system limitations.

The evaluated datasets are distributed with the following chunk sizes of raw data: REDD: 11.4 MiB or 4 min, BLUED: 6.6 MiB or 1.65 min, UK-DALE: 329.2 MiB or 60 min, BLOND-50: 171.7 MiB or 5 min, BLOND-250: 343.3 MiB or 2 min. Measurement duration and file size are not strictly linked, causing a slight variation in file sizes across the entire measurement period of each dataset. Observed real-world time does not affect any of the compression algorithms under test and is therefore omitted. The sampling rate and channel count directly affect the data rate (bytes per time unit) and explain the non-uniform chunk sizes mentioned for each dataset.

We compare the best-performing data representation formats of each dataset from the previous experiment, benchmark them with different chunk sizes, and estimate their effect on the overall compression ratio. For this evaluation, we define the compression ratio as CR = original_size / compressed_size. The chunk sizes range over 1, 2, 4, 8, 16, 32, 64, and 128 MiB, and then continue in steps of 128 MiB up to 3072 MiB. To reduce the required computational effort, we greedily consume data from the first available dataset file until the predefined chunk limit is fulfilled. The chunk size is determined using the number of samples (across all channels) and their integer byte count (2 or 3 bytes); only full samples for all channels are included in a chunk (see the sketch below).

³ The online appendix is available through the program chair (double-blind review).
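A short sketch of this chunk-size computation, under the stated constraint that only full multi-channel samples enter a chunk; the function name is illustrative.

```python
def samples_per_chunk(chunk_mib: int, channels: int, bytes_per_value: int) -> int:
    """Largest number of full multi-channel samples that fits in the chunk."""
    frame_bytes = channels * bytes_per_value       # one sample across all channels
    return (chunk_mib * 2**20) // frame_bytes      # floor division drops partial frames

# BLOND-50-like stream: 6 channels of 16-bit values, 128 MiB target chunks
print(samples_per_chunk(128, channels=6, bytes_per_value=2))  # 11184810 samples
```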

7 RESULTS

7.1 Entropy Analysis

Entropy is based on the probability of a given measurement (signal value). The histogram of an entire measurement channel shows the number of times a single measurement value was seen in the dataset (Figure 3). The plots show the raw measurement bandwidth in ADC values on the x-axis and a logarithmic y-axis for the number of occurrences of each value. The raw ADC values are bipolar and centered on 0: −32768…32767 for BLUED, BLOND-50, and BLOND-250; −8388608…8388607 for REDD and UK-DALE.

The voltage histogram shows a distinctive sinusoidal distribution (peaks at the minimum and maximum values). The current histogram would show a similar distribution if the power draw were constant (pure-linear or resistive loads); however, multiple levels of current values can be observed, indicating high activity and fluctuations. REDD and BLUED (Figures 3a and 3b) show a center-biased distribution, indicating sub-optimal calibration performance and unused measurement bandwidth. UK-DALE, BLOND-50, and BLOND-250 (Figures 3c, 3d, 3e) show a wide range of highly used values, with the voltage channels utilizing around 90% of the available bandwidth.

REDD and BLUED use only a small percentage of the available range, indicating a low entropy based on the used data type. UK-DALE utilizes a reasonable slice, while BLOND covers almost the entire possible range (Table 2). Assuming a well-calibrated data acquisition system, the expected percentage should reflect the expected measurement values. Low range usage (REDD, BLUED) leads to lost precision which would have been freely available with the given hardware, whereas high usage (UK-DALE, BLOND) means almost all available measurement precision is reflected in the raw data. Some datasets utilize 100% of the available measurement range, while REDD only uses 5%. A high range utilization does not result in an equally high usage, as the histogram can contain gaps (ADC values with 0 occurrences in the datasets).

7.2 Data Representation

The evaluation compares the compressed size (CS, the final file size after compression and file format encapsulation, in percent of the uncompressed size) of 365 data representation formats. For brevity, only the 30 best-performing formats are shown in Figure 2. Each of the 365 data representations was tested on all datasets and the full evaluation is available in the online appendix³. The following evaluation and benchmark uses the raw data from each dataset as described in Section 3. In total, 27.8 TiB of raw data was re-encoded 365 times.

HDF5 and Zarr are general-purpose file formats for numerical data with broad support in scientific computing frameworks. As such, they only support 16-bit and 32-bit integer values, which causes a 1-byte overhead for REDD and UK-DALE. The baseline used for comparison is a raw concatenated byte string with dataset-native data types (16-bit and 24-bit). This allows us to obtain comparable evaluation results, while other published benchmarks compared ASCII-like encodings against binary representations, skewing the results significantly.

Table 2: Entropy analysis of whole-building energy datasets with high sampling rates. The number of unique measurement values for each channel is extracted, which corresponds to a usage percentage of the available measurement resolution. The lowest and highest observed values are used to determine the observed range.

Dataset            | Channel   | Values   | Usage | Range | H(x)
-------------------|-----------|----------|-------|-------|-----
REDD (24-bit)      | current_1 | 87713    | 1%    | 4%    | 14.3
                   | current_2 | 85989    | 1%    | 5%    | 14.9
                   | voltage   | 2925155  | 17%   | 18%   | 21.1
BLUED (16-bit)     | current_a | 5855     | 9%    | 10%   | 7.8
                   | current_b | 7684     | 12%   | 13%   | 9.7
                   | voltage   | 11302    | 17%   | 18%   | 13.2
UK-DALE (24-bit)   | current   | 6981612  | 42%   | 81%   | 19.0
                   | voltage   | 15135594 | 90%   | 100%  | 23.2
BLOND-50 (16-bit)  | current1  | 51122    | 78%   | 100%  | 12.6
                   | current2  | 49355    | 75%   | 100%  | 11.2
                   | current3  | 48658    | 74%   | 100%  | 11.3
                   | voltage1  | 58396    | 89%   | 92%   | 15.3
                   | voltage2  | 57975    | 88%   | 91%   | 15.4
                   | voltage3  | 59596    | 91%   | 95%   | 15.4
BLOND-250 (16-bit) | current1  | 52721    | 80%   | 100%  | 12.4
                   | current2  | 51802    | 79%   | 100%  | 10.8
                   | current3  | 50989    | 78%   | 100%  | 11.6
                   | voltage1  | 58488    | 89%   | 91%   | 15.3
                   | voltage2  | 57912    | 88%   | 92%   | 15.4
                   | voltage3  | 59742    | 91%   | 94%   | 15.4


Overall, it can be noted that all three audio-based formats performed well, given their inherently targeted nature of compressing waveforms with high temporal resolution. ALAC and FLAC achieved the best overall CS across all datasets, followed by HDF5+MAFISC and HDF5+zstd, which can overcome the 1-byte overhead. Although the general-purpose compressors and their individual data representation formats were intended to serve as a baseline for comparison of the more advanced schemes (HDF5, Zarr, and audio-based), one can conclude that even plain bzip2 or LZMA compression can achieve comparable compression results. A trade-off to consider is the lack of metadata and internal structure, which might cause additional data handling overhead, as easy-to-use import and parsing tools are not available. Variable-length encoding using LEB128S is a suitable input for the bzip2 and LZMA compressors when combined with a column-based storage format. Delta encoding resulted in comparably good CS in certain combinations.

Some datasets are inherently more compressible than others. This is a result of the entropy analysis and can be observed in the data representation evaluation as well. Compressing BLUED consistently yields smaller file sizes with most compressors than any other dataset. The benchmark shows that higher entropy correlates strongly with higher CS per dataset.

[Figure 2: bar chart of compressed size in percent (0–100%) for each dataset (REDD, BLUED, UK-DALE, BLOND-50, BLOND-250) across the top-30 formats, including ALAC, FLAC, bzip2 and LZMA with raw or LEB128S(-delta) input in row or column order, and HDF5 with shuffle/no_shuffle and MAFISC-LZMA, zstd, or gzip filters.]

Figure 2: Compression performance for the top-30 data representation formats and their transformation filters. Each data representation format was applied on a per-file basis to every dataset.

While the majority of tested data representation formats achieves a data reduction compared to the baseline, some formats are counter-productive and generate a larger output (CS over 100%). This behavior affects most HDF5- and Zarr-based formats, because of the 1-byte overhead (depending on the used compressor).

Choosing the best-performing data representation for each dataset, the following SS can be achieved when applied to all data files, compared against the raw binary encoding: REDD: 48.3% or 0.7 GiB, BLUED: 73.0% or 30.0 GiB, UK-DALE: 40.5% or 2534.1 GiB, BLOND-50: 51.3% or 5252.3 GiB, BLOND-250: 55.4% or 6590.8 GiB. It can be noted that REDD, UK-DALE, and both BLOND datasets perform at around 50–60% CS, while BLUED shows a significantly smaller CS of below 30%, due to its very low signal entropy (Table 2). Variable-length encoding (LEB128S) and delta encoding yield the largest space savings for such types of data (REDD and BLUED).

Two out of the five evaluated datasets (REDD and BLUED) showed the highest space savings with a general-purpose compressor (bzip2) and variable-length encoding. ALAC and HDF5+MAFISC performed best on UK-DALE, BLOND-50, and BLOND-250, given their higher signal entropy and value range utilization.

When comparing the raw space savings against the actually published datasets, which typically are already compressed, we can achieve additional space savings: REDD: 61.2% or 1.1 GiB, BLUED: 96.4% or 295.5 GiB, UK-DALE: −1.3% or −49.1 GiB, BLOND-50: 23.3% or 1519.7 GiB, BLOND-250: 26.0% or 1867.9 GiB. All datasets show space savings, except for UK-DALE, which shows an insignificant increase in the overall dataset size. This means the originally published FLAC files are already compressed to a high extent; this is supported by Figure 2, showing FLAC among the highest-ranking formats in this study. While an absolute space saving of 1.1 GiB for REDD might be insignificant in most use cases (desktop computing and data centers), a more compelling reduction in storage space of up to 1867.9 GiB for BLOND-250 can be substantially beneficial.

7.3 Chunk Size Impact

The chunk size evaluation (Figure 4) contains the averaged CR per chunk size for all datasets except REDD, as it only contains 1438.4 MiB of data and was therefore omitted. A detailed per-dataset evaluation is available in the online appendix³.

The evaluated chunk size range starts with very small chunks, which would not be recommended for large datasets because of the increased handling and container overhead. As such, chunk sizes starting with 128 MiB can be considered a viable storage strategy. The resulting CR ramps up quickly for most formats until it levels off between 32 MiB and 64 MiB. Above this mark, no significant improvement in CR can be achieved by increasing the chunk size. Some file formats even show a slight linear decrease in CR with very large chunk sizes (above approx. 1.5 GiB). The ALAC and FLAC compressors show a slight improvement (2–3%) in CR with larger chunk sizes. In most use cases this size reduction comes at a great cost in RAM requirements to process files above 2048 MiB. HDF5 has its own concept of "chunks", used for I/O and the filter pipeline, with a default size of 1 MiB. Internal limitations do not allow for HDF5-chunks larger than 2048 MiB; however, HDF5 in general can be used for files larger than this limit. The MAFISC filter with LZMA compression experiences large fluctuations for neighboring chunk size steps and should, therefore, be tuned separately. Overall, increasing the chunk size has a negligible effect on the final compression ratio and only pushes up the RAM requirements for processing.

[Figure 3, five panels: (a) REDD (24-bit), (b) BLUED (16-bit), (c) UK-DALE (24-bit), (d) BLOND-50 (16-bit), (e) BLOND-250 (16-bit).]

Figure 3: Semi-logarithmic histogram of ADC values for each dataset and channel. Current signals show distinct steps, corresponding to prolonged usage at certain power levels. For visualization reasons, the scatter plot was smoothed; the full histogram is available in the online appendix³.

[Figure 4: averaged compression ratio over all datasets (y-axis, 1.6–2.2 X:1) versus original chunk size (x-axis, 1 MB to 3072 MB) for LZMA-alone LEB128S-delta-row, ALAC, LZMA-alone raw-row, FLAC, BZ2-9 LEB128S-delta-row, BZ2-9 raw-row, and HDF5 no_shuffle MAFISC-LZMA all-column-chunk.]

Figure 4: Chunk size impact of different representations.

7.4 Summary and Recommendations

The entropy analysis shows a lack of measurement range calibration in some datasets. This results in unutilized precision that would have been available with the given hardware DAQ units. The used range directly affects the contained entropy, and therefore the achievable compression ratio. A well-calibrated measurement system is a key requirement to achieve the best signal range and resolution.

Choosing a file format for long-term whole-building energy datasets is a crucial component, directly affecting the visibility and accessibility of the data by other researchers. Using an unsupported encoding or requiring specialized tools to read the data is cumbersome and error-prone and should be avoided. We recommend using well-known file formats, such as HDF5 or FLAC, which are widely adopted and provide built-in support for metadata, compression, and error-detection. While ALAC and FLAC already provide internal compression, we recommend the MAFISC or zstd filters for HDF5, due to their superior compression ratio. The serialization orientation (row- or column-based) has only a minor effect.

Large datasets should be split into multiple smaller files to facilitate data handling and to reduce transfer and loading times when only short time spans of data are needed. We have found that compression algorithms (together with the above-described file formats) yield higher space savings with chunk sizes above 256 MiB to 384 MiB. Small files show a modest compression ratio, while larger files require more transfer bandwidth and time before the data can be analyzed.

8 CONCLUSIONS

We presented a comprehensive entropy analysis of public whole-building energy datasets with waveform signals. Some datasets leave a majority of the available ADC range unused, causing lost precision and accuracy. A well-calibrated measurement system maximizes the achievable precision. Using 365 different data representation formats, we have shown that immense space savings of up to 73% are achievable by choosing a suitable file format and data transformation. Low-entropy datasets show higher achievable compression ratios. Audio-based file formats perform considerably well, given the similarities to electricity waveforms. Transparent data transformations, such as MAFISC and SHUFFLE-based approaches, are particularly beneficial. The input size shows a mostly stable relationship to the achievable compressed size, with variations of a few percentage points (limited by RAM). Waveform data shows a nearly constant compression ratio, independent of the input chunk size. Splitting large datasets into multiple smaller files is important for data handling, but insignificant in terms of space savings.

REFERENCES

[1] Francesc Alted. 2017. Blosc: A high performance compressor optimized for binary data. (November 2017). Retrieved January 20, 2018 from http://blosc.org/
[2] American National Standards Institute. 2016. ANSI C84.1-2016: Standard for Electric Power Systems and Equipment—Voltage Ratings (60 Hz). (2016).
[3] Kyle Anderson, Adrian Ocneanu, Diego Benitez, Derrick Carlson, Anthony Rowe, and Mario Berges. 2012. BLUED: A Fully Labeled Public Dataset for Event-Based Non-Intrusive Load Monitoring Research. In SustKDD '12. ACM, Beijing, China, 1–5.
[4] R. Arnold and T. Bell. 1997. A corpus for the evaluation of lossless compression algorithms. In Data Compression Conference, 1997. DCC '97. Proceedings. 201–210. https://doi.org/10.1109/DCC.1997.582019
[5] IEEE Standards Association. 2018. COMTRADE: Common format for Transient Data Exchange for power systems. (January 2018). Retrieved January 20, 2018 from https://standards.ieee.org/findstds/standard/C37.111-2013.html
[6] Nipun Batra, Jack Kelly, Oliver Parson, Haimonti Dutta, William Knottenbelt, Alex Rogers, Amarjeet Singh, and Mani Srivastava. 2014. NILMTK: An Open Source Toolkit for Non-intrusive Load Monitoring. In ACM e-Energy '14. ACM, New York, NY, USA, 265–276. https://doi.org/10.1145/2602044.2602051
[7] Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, and Arie Shoshani. 2014. Parallel Data Analysis Directly on Scientific File Formats. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 385–396. https://doi.org/10.1145/2588555.2612185
[8] Abraham Bookstein and James A. Storer. 1992. Data compression. Information Processing & Management 28, 6 (1992), 675–680. https://doi.org/10.1016/0306-4573(92)90060-D Special Issue: Data compression for images and texts.
[9] David Bryant. 2018. WavPack: Hybrid Lossless Audio Compression. (January 2018). Retrieved January 20, 2018 from http://www.wavpack.com/
[10] J. C. S. de Souza, T. M. L. Assis, and B. C. Pal. 2017. Data Compression in Smart Distribution Systems via Singular Value Decomposition. IEEE Transactions on Smart Grid 8, 1 (Jan 2017), 275–284. https://doi.org/10.1109/TSG.2015.2456979
[11] E. Deelman and A. Chervenak. 2008. Data Management Challenges of Data-Intensive Scientific Workflows. In 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID). 687–692. https://doi.org/10.1109/CCGRID.2008.24
[12] Matthew T. Dougherty, Michael J. Folk, Erez Zadok, Herbert J. Bernstein, Frances C. Bernstein, Kevin W. Eliceiri, Werner Benger, and Christoph Best. 2009. Unifying Biological Image Formats with HDF5. Commun. ACM 52, 10 (Oct. 2009), 42–47. https://doi.org/10.1145/1562764.1562781
[13] Frank Eichinger, Pavel Efros, Stamatis Karnouskos, and Klemens Böhm. 2015. A Time-series Compression Technique and Its Application to the Smart Grid. The VLDB Journal 24, 2 (April 2015), 193–218. https://doi.org/10.1007/s00778-014-0368-8
[14] European Committee for Electrotechnical Standardization. 1989. CENELEC Harmonisation Document HD 472 S1. (1989).
[15] Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. 2011. An Overview of the HDF5 Technology Suite and Its Applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases (AD '11). ACM, New York, NY, USA, 36–47. https://doi.org/10.1145/1966895.1966900
[16] Xiph.Org Foundation. 2018. FLAC: Free Lossless Audio Codec. (January 2018). Retrieved January 20, 2018 from https://xiph.org/flac/
[17] O. N. Gerek and D. G. Ece. 2004. 2-D analysis and compression of power-quality event data. IEEE Transactions on Power Delivery 19, 2 (April 2004), 791–798. https://doi.org/10.1109/TPWRD.2003.823197
[18] L. Gosink, J. Shalf, K. Stockinger, Kesheng Wu, and W. Bethel. 2006. HDF5-FastQuery: Accelerating Complex Queries on HDF Datasets using Fast Bitmap Indices. In 18th International Conference on Scientific and Statistical Database Management (SSDBM '06). 149–158. https://doi.org/10.1109/SSDBM.2006.27
[19] Free Standards Group. 2018. DWARF Debugging Information Format Specification Version 3.0. (January 2018). Retrieved January 20, 2018 from http://dwarfstd.org/doc/Dwarf3.pdf
[20] HDF Group. 2017. Szip Compression in HDF Products. (November 2017). Retrieved January 20, 2018 from https://support.hdfgroup.org/doc_resource/SZIP/
[21] Anwar Ul Haq, Thomas Kriechbaumer, Matthias Kahl, and Hans-Arno Jacobsen. 2017. CLEAR – A Circuit Level Electric Appliance Radar for the Electric Cabinet. In 2017 IEEE International Conference on Industrial Technology (ICIT '17). 1130–1135. https://doi.org/10.1109/ICIT.2017.7915521
[22] Nathanael Hübbe and Julian Kunkel. 2013. Reducing the HPC-data storage footprint with MAFISC—Multidimensional Adaptive Filtering Improved Scientific data Compression. Computer Science - Research and Development 28, 2 (01 May 2013), 231–239. https://doi.org/10.1007/s00450-012-0222-4
[23] Apple Inc. 2018. ALAC: Apple Lossless Audio Codec. (January 2018). Retrieved January 20, 2018 from https://macosforge.github.io/alac/
[24] Matthias Kahl, Anwar Ul Haq, Thomas Kriechbaumer, and Hans-Arno Jacobsen. 2017. A Comprehensive Feature Study for Appliance Recognition on High Frequency Energy Data. In Proceedings of the 2017 ACM Eighth International Conference on Future Energy Systems (e-Energy '17). ACM, New York, NY, USA. https://doi.org/10.1145/3077839.3077845
[25] Jack Kelly and William Knottenbelt. 2015. The UK-DALE dataset, domestic appliance-level electricity demand and whole-house demand from five UK homes. Scientific Data 2, 150007 (2015). https://doi.org/10.1038/sdata.2015.7
[26] J. Zico Kolter and Matthew J. Johnson. 2011. REDD: A Public Data Set for Energy Disaggregation Research. In SustKDD '11, Vol. 25. 59–62.
[27] Thomas Kriechbaumer, Anwar Ul Haq, Matthias Kahl, and Hans-Arno Jacobsen. 2017. MEDAL: A Cost-Effective High-Frequency Energy Data Acquisition System for Electrical Appliances. In Proceedings of the 2017 ACM Eighth International Conference on Future Energy Systems (e-Energy '17). ACM, New York, NY, USA. https://doi.org/10.1145/3077839.3077844
[28] Thomas Kriechbaumer and Hans-Arno Jacobsen. 2018. BLOND, a building-level office environment dataset of typical electrical appliances. (March 2018). https://doi.org/10.1038/sdata.2018.48
[29] Guoxin Liu and Haiying Shen. 2017. Minimum-Cost Cloud Storage Service Across Multiple Cloud Providers. IEEE/ACM Trans. Netw. 25, 4 (Aug. 2017), 2498–2513. https://doi.org/10.1109/TNET.2017.2693222
[30] K. Masui, M. Amiri, L. Connor, M. Deng, M. Fandino, C. Höfer, M. Halpern, D. Hanna, A.D. Hincks, G. Hinshaw, J.M. Parra, L.B. Newburgh, J.R. Shaw, and K. Vanderlinde. 2015. A compression scheme for radio data in high performance computing. Astronomy and Computing 12, Supplement C (2015), 181–190. https://doi.org/10.1016/j.ascom.2015.07.002
[31] M. N. Meziane, T. Picon, P. Ravier, G. Lamarque, J. C. Le Bunetel, and Y. Raingeaud. 2016. A Measurement System for Creating Datasets of On/Off-Controlled Electrical Loads. In 2016 IEEE 16th International Conference on Environment and Electrical Engineering (EEEIC). 1–5. https://doi.org/10.1109/EEEIC.2016.7555847
[32] Alistair Miles. 2018. Zarr: A Python package providing an implementation of chunked, compressed, N-dimensional arrays. (January 2018). Retrieved January 20, 2018 from https://zarr.readthedocs.io/en/latest/
[33] Muhammad Nabeel, Fahad Javed, and Naveed Arshad. 2013. Towards Smart Data Compression for Future Energy Management System. In Fifth International Conference on Applied Energy.
[34] J. Paris, J. S. Donnal, and S. B. Leeb. 2014. NilmDB: The Non-Intrusive Load Monitor Database. IEEE Transactions on Smart Grid 5, 5 (Sept 2014), 2459–2467. https://doi.org/10.1109/TSG.2014.2321582
[35] Lucas Pereira. 2017. EMD-DF: A Data Model and File Format for Energy Disaggregation Datasets. In Proceedings of the 4th ACM International Conference on Systems for Energy-Efficient Built Environments (BuildSys '17). ACM, New York, NY, USA, Article 52, 2 pages. https://doi.org/10.1145/3137133.3141474
[36] Lucas Pereira, Nuno Nunes, and Mario Bergés. 2014. SURF and SURF-PI: A File Format and API for Non-intrusive Load Monitoring Public Datasets. In Proceedings of the 5th International Conference on Future Energy Systems (e-Energy '14). ACM, New York, NY, USA, 225–226. https://doi.org/10.1145/2602044.2602078
[37] Krishna P.N. Puttaswamy, Thyaga Nandagopal, and Murali Kodialam. 2012. Frugal Storage for Cloud File Systems. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys '12). ACM, New York, NY, USA, 71–84. https://doi.org/10.1145/2168836.2168845
[38] A. Qing, Z. Hongtao, H. Zhikun, and C. Zhiwen. 2011. A Compression Approach of Power Quality Monitoring Data Based on Two-dimension DCT. In 2011 Third International Conference on Measuring Technology and Mechatronics Automation, Vol. 1. 20–24. https://doi.org/10.1109/ICMTMA.2011.12
[39] Martin Ringwelski, Christian Renner, Andreas Reinhardt, Andreas Weigel, and Volker Turau. 2012. The Hitchhiker's Guide to choosing the Compression Algorithm for your Smart Meter Data. (September 2012), 935–940. https://doi.org/10.1109/EnergyCon.2012.6348285
[40] S. Sehrish, J. Kowalkowski, M. Paterno, and C. Green. 2017. Python and HPC for High Energy Physics Data Analyses. In Proceedings of the 7th Workshop on Python for High-Performance and Scientific Computing (PyHPC'17). ACM, New York, NY, USA, Article 8, 8 pages. https://doi.org/10.1145/3149869.3149877
[41] C. E. Shannon. 1949. Communication in the Presence of Noise. Proceedings of the IRE 37, 1 (Jan 1949), 10–21. https://doi.org/10.1109/JRPROC.1949.232969
[42] IEEE Power & Energy Society. 2018. IEEE 1159 - PQDIF: Power Quality and Quantity Data Interchange Format. (January 2018). Retrieved January 20, 2018 from http://grouper.ieee.org/groups/1159/3/docs.html
[43] Z. B. Tariq, N. Arshad, and M. Nabeel. 2015. Enhanced LZMA and BZIP2 for improved energy data compression. In 2015 International Conference on Smart Cities and Green ICT Systems (SMARTGREENS). 1–8.
[44] Andreas Unterweger and Dominik Engel. 2015. Resumable load data compression in smart grids. IEEE Transactions on Smart Grid 6, 2 (2015), 919–929. https://doi.org/10.1109/TSG.2014.2364686
[45] Andreas Unterweger, Dominik Engel, and Martin Ringwelski. 2015. The Effect of Data Granularity on Load Data Compression. Springer International Publishing, Cham, 69–80. https://doi.org/10.1007/978-3-319-25876-8_7
[46] D. Yuan, Y. Yang, X. Liu, and J. Chen. 2010. A cost-effective strategy for intermediate data storage in scientific cloud workflow systems. In 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS). 1–12. https://doi.org/10.1109/IPDPS.2010.5470453
