feat: add deserialize helper functions
lucatrv committed Feb 12, 2024
1 parent 2620510 commit 108f991
Showing 2 changed files with 234 additions and 27 deletions.
63 changes: 37 additions & 26 deletions README.md
@@ -10,6 +10,7 @@ An Excel/OpenDocument Spreadsheets file reader/deserializer, in pure Rust.
## Description

**calamine** is a pure Rust library to read and deserialize any spreadsheet file:

- excel like (`xls`, `xlsx`, `xlsm`, `xlsb`, `xla`, `xlam`)
- opendocument spreadsheets (`ods`)

@@ -44,47 +45,44 @@ fn example() -> Result<(), Error> {
}
```

Note if you want to deserialize a column that may have invalid types (i.e. a float where some values may be strings), you can use Serde's `deserialize_with` field attribute:
Calamine provides helper functions to deal with invalid type values. For instance if you
want to deserialize a column which should contain floats but may also contain invalid values
(i.e. strings), you can use the [`deserialize_as_f64_or_none`] helper function with Serde's
[`deserialize_with`](https://serde.rs/field-attrs.html) field attribute:

```rust
use calamine::{deserialize_as_f64_or_none, open_workbook, RangeDeserializerBuilder, Reader, Xlsx};
use serde::Deserialize;
use calamine::{RangeDeserializerBuilder, Reader, Xlsx};


#[derive(Deserialize)]
struct ExcelRow {
struct Record {
metric: String,
#[serde(deserialize_with = "de_opt_f64")]
#[serde(deserialize_with = "deserialize_as_f64_or_none")]
value: Option<f64>,
}


// Convert value cell to Some(f64) if float or int, else None
fn de_opt_f64<'de, D>(deserializer: D) -> Result<Option<f64>, D::Error>
where
D: serde::Deserializer<'de>,
{
let data_type = calamine::DataType::deserialize(deserializer)?;
if let Some(float) = data_type.as_f64() {
Ok(Some(float))
} else {
Ok(None)
}
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
fn main() -> Result<(), Box<dyn std::error::Error>> {
let path = format!("{}/tests/excel.xlsx", env!("CARGO_MANIFEST_DIR"));
let mut excel: Xlsx<_> = open_workbook(path)?;

let range = excel
.worksheet_range("Sheet1")
.ok_or(calamine::Error::Msg("Cannot find Sheet1"))??;
.worksheet_range("Sheet1")
.map_err(|_| calamine::Error::Msg("Cannot find Sheet1"))?;

let iter_records =
RangeDeserializerBuilder::with_headers(&["metric", "value"]).from_range(&range)?;

for result in iter_records {
let record: Record = result?;
println!("metric={:?}, value={:?}", record.metric, record.value);
}

let iter_result =
RangeDeserializerBuilder::with_headers(&COLUMNS).from_range::<_, ExcelRow>(&range)?;
}
Ok(())
}
```

The [`deserialize_as_f64_or_none`] function will discard all invalid values, if you want to
return them as `String` you can use the [`deserialize_as_f64_or_string`] function instead.
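
On the consuming side, a field filled by an `*_or_string` helper holds a `Result<f64, String>`, so valid numbers and rejected cells can be separated after deserialization. The following is a minimal std-only sketch of that step; the `Record` struct and the `split_records` function are hypothetical names used for illustration and are not part of calamine, and the records are assumed to have been deserialized already.

```rust
// Hypothetical row type, shaped like a struct whose `value` field
// would be filled by an `*_or_string` style helper.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    metric: String,
    value: Result<f64, String>,
}

// Split rows into usable numbers and the raw text of rejected cells.
// Both outputs keep the metric name so rejected rows can be reported.
fn split_records(records: &[Record]) -> (Vec<(String, f64)>, Vec<(String, String)>) {
    let mut valid = Vec::new();
    let mut rejected = Vec::new();
    for record in records {
        match &record.value {
            Ok(v) => valid.push((record.metric.clone(), *v)),
            Err(raw) => rejected.push((record.metric.clone(), raw.clone())),
        }
    }
    (valid, rejected)
}
```

Keeping the rejected text next to the metric name makes it possible to log or review bad cells instead of silently losing them, which is the main difference from the `*_or_none` variant.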

### Reader: Simple

@@ -102,6 +100,7 @@ if let Some(Ok(r)) = excel.worksheet_range("Sheet1") {
### Reader: More complex

Let's assume

- the file type (xls, xlsx ...) cannot be known at static time
- we need to get all data from the workbook
- we need to parse the vba
@@ -160,7 +159,7 @@ for s in sheets {

## Features

- `dates`: Add date related fn to `DataType`.
- `dates`: Add date related fn to `DataType`.
- `picture`: Extract picture data.

### Others
@@ -170,6 +169,7 @@ Browse the [examples](https://github.com/tafia/calamine/tree/master/examples) di
## Performance

As `calamine` is readonly, the comparisons will only involve reading an excel `xlsx` file and then iterating over the rows. Along with `calamine`, three other libraries were chosen, from three different languages:

- [`excelize`](https://github.com/qax-os/excelize) written in `go`
- [`ClosedXML`](https://github.com/ClosedXML/ClosedXML) written in `C#`
- [`openpyxl`](https://foss.heptapod.net/openpyxl/openpyxl) written in `python`
@@ -179,6 +179,7 @@ The benchmarks were done using this [dataset](https://raw.githubusercontent.com/
The programs are all structured to follow the same constructs:

`calamine`:

```rust
use calamine::{open_workbook, Reader, Xlsx};

@@ -199,6 +200,7 @@ fn main() {
```

`excelize`:

```go
package main

@@ -237,6 +239,7 @@ func main() {
```

`ClosedXML`:

```csharp
using ClosedXML.Excel;

@@ -261,6 +264,7 @@ internal class Program
```

`openpyxl`:

```python
from openpyxl import load_workbook

@@ -306,6 +310,7 @@ v2.8.0 excelize.exe
The spreadsheet has a range of 1,000,001 rows and 41 columns, for a total of 41,000,041 cells in the range. Of those, 28,056,975 cells had values.

Going off of that number:

- `calamine` => 1,122,279 cells per second
- `excelize` => 633,998 cells per second
- `ClosedXML` => 157,320 cells per second
@@ -314,9 +319,11 @@ Going off of that number:
### Plots

#### Disk Read

![bytes_from_disk](https://github.com/RoloEdits/calamine/assets/12489689/fcca1147-d73f-4d1c-b273-e7e4c183ab29)

As stated, the filesize on disk is `186MB`:

- `calamine` => `186MB`
- `ClosedXML` => `208MB`.
- `openpyxl` => `192MB`.
@@ -328,11 +335,13 @@ When asking one of the maintainers of `excelize`, I got this [response](https://
> \- xuri
#### Disk Write

![bytes_to_disk](https://github.com/RoloEdits/calamine/assets/12489689/befa9893-7658-41a7-8cbd-b0ce5a7d9341)

As seen in the previous section, `excelize` is writing to disk to save memory. The others don't employ that kind of mechanism.

#### Memory

![mem_usage](https://github.com/RoloEdits/calamine/assets/12489689/c83fdf6b-1442-4e22-8eca-84cbc1db4a26)

![virt_mem_usage](https://github.com/RoloEdits/calamine/assets/12489689/840a96ed-33d7-44f7-8276-80bb7a02557f)
@@ -342,6 +351,7 @@ As seen in the previous section, `excelize` is writting to disk to save memory.
The stepping and falling for `calamine` comes from the growth of `Vec`s and the freeing of memory right after, with the memory usage dropping down again. The sudden jump at the end is when the sheet is being read into memory. The others, being garbage collected, have a more linear climb all the way through.

#### CPU

![cpu_usage](https://github.com/RoloEdits/calamine/assets/12489689/c3aa55a8-b008-48ee-ba04-c08bd91c1f6f)

Very noisy chart, but `excelize`'s spikes must be from the GC?
@@ -351,6 +361,7 @@ Very noisy chart, but `excelize`'s spikes must be from the GC?
Many (most) parts of the specifications are not implemented; the focus has been put on reading cell **values** and **vba** code.

The main unsupported items are:

- no support for writing excel files, this is a read-only library
- no support for reading extra contents, such as formatting, excel parameter, encrypted components etc ...
- no support for reading VB for opendocuments
198 changes: 197 additions & 1 deletion src/lib.rs
@@ -73,7 +73,7 @@ mod de;
mod errors;
pub mod vba;

use serde::de::DeserializeOwned;
use serde::de::{Deserialize, DeserializeOwned, Deserializer};
use std::borrow::Cow;
use std::cmp::{max, min};
use std::fmt;
@@ -890,3 +890,199 @@ impl<T> Table<T> {
&self.data
}
}

/// A helper function to deserialize cell values as `i64`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_i64`] method to the cell value, and returns
/// `Ok(Some(value_as_i64))` if successful or `Ok(None)` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
pub fn deserialize_as_i64_or_none<'de, D>(deserializer: D) -> Result<Option<i64>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_i64())
}

/// A helper function to deserialize cell values as `i64`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_i64`] method to the cell value, and returns
/// `Ok(Ok(value_as_i64))` if successful or `Ok(Err(value_to_string))` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
pub fn deserialize_as_i64_or_string<'de, D>(
deserializer: D,
) -> Result<Result<i64, String>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_i64().ok_or_else(|| data.to_string()))
}

/// A helper function to deserialize cell values as `f64`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_f64`] method to the cell value, and returns
/// `Ok(Some(value_as_f64))` if successful or `Ok(None)` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
pub fn deserialize_as_f64_or_none<'de, D>(deserializer: D) -> Result<Option<f64>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_f64())
}

/// A helper function to deserialize cell values as `f64`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_f64`] method to the cell value, and returns
/// `Ok(Ok(value_as_f64))` if successful or `Ok(Err(value_to_string))` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
pub fn deserialize_as_f64_or_string<'de, D>(
deserializer: D,
) -> Result<Result<f64, String>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_f64().ok_or_else(|| data.to_string()))
}

/// A helper function to deserialize cell values as `chrono::NaiveDate`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_date`] method to the cell value, and returns
/// `Ok(Some(value_as_date))` if successful or `Ok(None)` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_date_or_none<'de, D>(
deserializer: D,
) -> Result<Option<chrono::NaiveDate>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_date())
}

/// A helper function to deserialize cell values as `chrono::NaiveDate`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_date`] method to the cell value, and returns
/// `Ok(Ok(value_as_date))` if successful or `Ok(Err(value_to_string))` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_date_or_string<'de, D>(
deserializer: D,
) -> Result<Result<chrono::NaiveDate, String>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_date().ok_or_else(|| data.to_string()))
}

/// A helper function to deserialize cell values as `chrono::NaiveTime`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_time`] method to the cell value, and returns
/// `Ok(Some(value_as_time))` if successful or `Ok(None)` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_time_or_none<'de, D>(
deserializer: D,
) -> Result<Option<chrono::NaiveTime>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_time())
}

/// A helper function to deserialize cell values as `chrono::NaiveTime`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_time`] method to the cell value, and returns
/// `Ok(Ok(value_as_time))` if successful or `Ok(Err(value_to_string))` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_time_or_string<'de, D>(
deserializer: D,
) -> Result<Result<chrono::NaiveTime, String>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_time().ok_or_else(|| data.to_string()))
}

/// A helper function to deserialize cell values as `chrono::Duration`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_duration`] method to the cell value, and returns
/// `Ok(Some(value_as_duration))` if successful or `Ok(None)` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_duration_or_none<'de, D>(
deserializer: D,
) -> Result<Option<chrono::Duration>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_duration())
}

/// A helper function to deserialize cell values as `chrono::Duration`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_duration`] method to the cell value, and returns
/// `Ok(Ok(value_as_duration))` if successful or `Ok(Err(value_to_string))` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_duration_or_string<'de, D>(
deserializer: D,
) -> Result<Result<chrono::Duration, String>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_duration().ok_or_else(|| data.to_string()))
}

/// A helper function to deserialize cell values as `chrono::NaiveDateTime`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_datetime`] method to the cell value, and returns
/// `Ok(Some(value_as_datetime))` if successful or `Ok(None)` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_datetime_or_none<'de, D>(
deserializer: D,
) -> Result<Option<chrono::NaiveDateTime>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_datetime())
}

/// A helper function to deserialize cell values as `chrono::NaiveDateTime`,
/// useful when cells may also contain invalid values (i.e. strings).
/// It applies the [`as_datetime`] method to the cell value, and returns
/// `Ok(Ok(value_as_datetime))` if successful or `Ok(Err(value_to_string))` if unsuccessful,
/// therefore never failing. This function is intended to be used with Serde's
/// [`deserialize_with`](https://serde.rs/field-attrs.html) field attribute.
#[cfg(feature = "dates")]
pub fn deserialize_as_datetime_or_string<'de, D>(
deserializer: D,
) -> Result<Result<chrono::NaiveDateTime, String>, D::Error>
where
D: Deserializer<'de>,
{
let data = Data::deserialize(deserializer)?;
Ok(data.as_datetime().ok_or_else(|| data.to_string()))
}
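
All of the helpers above share one shape: deserialize the cell, try a conversion, then either drop the failure (`*_or_none`) or keep the cell's text (`*_or_string`). The sketch below reproduces that shape with the standard library only. The `Cell` enum and the `f64_or_none` and `f64_or_string` functions are hypothetical stand-ins, not calamine items, and the conversion rules in `Cell::as_f64` are simplified assumptions that may differ from the real `as_f64` method.

```rust
use std::fmt;

// Hypothetical stand-in for the library's cell type; only three variants are modelled.
#[derive(Debug, Clone, PartialEq)]
enum Cell {
    Int(i64),
    Float(f64),
    Text(String),
}

impl Cell {
    // Simplified conversion (an assumption): numeric variants succeed, text does not.
    fn as_f64(&self) -> Option<f64> {
        match self {
            Cell::Int(v) => Some(*v as f64),
            Cell::Float(v) => Some(*v),
            Cell::Text(_) => None,
        }
    }
}

impl fmt::Display for Cell {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Cell::Int(v) => write!(f, "{}", v),
            Cell::Float(v) => write!(f, "{}", v),
            Cell::Text(s) => write!(f, "{}", s),
        }
    }
}

// Mirrors the `*_or_none` shape: a failed conversion is dropped.
fn f64_or_none(cell: &Cell) -> Option<f64> {
    cell.as_f64()
}

// Mirrors the `*_or_string` shape: a failed conversion keeps the cell's text.
fn f64_or_string(cell: &Cell) -> Result<f64, String> {
    cell.as_f64().ok_or_else(|| cell.to_string())
}
```

Because the fallback closure passed to `ok_or_else` only runs when the conversion fails, the string rendering is never computed for valid cells.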
