Recently, when I was implementing new features for OpenDAL, I felt more and more that the existing error handling logic was overstretched, so I spent a lot of time redesigning a whole new set of error handling logic. Today, I'd like to share with you my error handling practices in OpenDAL, and hope to provide you with some ideas for building your own error handling system in Rust.

Context

No practice can be separated from specific scenarios.

Before we start describing the practice, let's understand the OpenDAL project and which errors he needs to handle. OpenDAL is a Rust library designed to enable free, painless, and efficient access to data, and is used roughly as follows.

// Init Operator
let op = Operator::from_env(Scheme::Fs)?;
// Create object handler.
let o = op.object("test_file");
// Read data from object;
let bs = o.read().await?;

As a unified storage abstraction, OpenDAL needs to return a unified Error structure that allows the user to write logic without targeting different services. Using the example of a file not existing, we need to allow the user to write code such as

if let Err(e) = op.object("test").metadata().await {
    if e.kind() == ErrorKind::ObjectNotFound {
      // logic if file not exist.
    } else {
      // logic if we meet other errors.
    }
}

In other words, OpenDAL should allow the user to know what error has occurred, so that the user can deal with it separately according to the specific type of Error; not only that, OpenDAL interfaces with a large number of storage services, it is not enough just to return the Error Code, OpenDAL also needs to provide enough context, so that the user can quickly locate what has happened according to the Error.

To summarize the requirements of OpenDAL, it needs an error handling framework that:

  • Know what error has occurred
  • Be able to decide how to deal with it
  • Assist in locating the cause of the error

Current Status

Prior to this redesign, OpenDAL used std::io::Error to return errors. The advantage of this design is that it maximizes the user's usage of std::fs and reduces the learning cost.

In order to be able to return the context of the error, OpenDAL carries ObjectError and BackendError in io::Error.

pub struct BackendError {
    context: HashMap<String, String>,
    source: anyhow::Error,
}

#[derive(Error, Debug)]
#[error("object error: (op: {op}, path: {path}, source: {source})")]
pub struct ObjectError {
    op: Operation,
    path: String,
    source: anyhow::Error,
}

But in practical use, such a design is very painful.

Raw error leakage

As we summarized earlier, one of the major requirements of OpenDAL is to provide a context for errors, since a simple UnexpectedEof does not help the user in debugging problems. But using std::io::Error can lead to very easy raw error leaks in our code, e.g.

impl Object {
  pub async fn range_read(&self, range: impl RangeBounds<u64>) -> Result<Vec<u8>> {
      ...

      io::copy(s, &mut bs).await?;

      Ok(bs.into_inner())
  }
}

The io error in io::copy here is leaked out without any context.

Difficulty in building error contexts

std::io::Error is an external type, and there is very little we can do about it. In order to carry context, we have to pass all the context when we build Error, which leaves us with a lot of repetitive logic to write when we implement it. there are plenty of such helper functions in OpenDAL.

pub fn new_other_object_error(
    op: Operation,
    path: &str,
    source: impl Into<anyhow::Error>,
) -> io::Error {
    io::Error::new(io::ErrorKind::Other, ObjectError::new(op, path, source))
}

pub fn new_response_consume_error(op: Operation, path: &str, err: Error) -> Error {
    Error::new(
        err.kind(),
        ObjectError::new(op, path, anyhow!("consuming response: {err:?}")),
    )
}

Inconsistent design goals

std::io::Error is designed to return the underlying IO error, and its design goals are inconsistent with the needs of OpenDAL. In addition, std::io::Error is part of the standard, and any improvements related to it are difficult to push. To this day, the io_error_more feature that OpenDAL has been waiting for is not yet stable. This means that OpenDAL cannot return ErrorKind:: IsADirectory and ErrorKind::NotADirectory, thus preventing the user from handling the error.

Design

After considering all these factors, I decided to redesign the Error type of OpenDAL.

ErrorKind

ErrorKind basically refers to io::ErrorKind, but only selects the parts of OpenDAL that need to be used.

#[non_exhaustive]
pub enum ErrorKind {
    Unexpected,
    Unsupported,

    BackendConfigInvalid,

    ObjectNotFound,
    ObjectPermissionDenied,
    ObjectIsADirectory,
    ObjectNotADirectory,
}

ErrorStatus

In order to be able to better support users in implementing logic such as error retries (a common requirement for OpenDAL users), I introduced the concept of ErrorStatus in Error.

enum ErrorStatus {
    Permanent,
    Temporary,
    Persistent,
}
  • Permanent indicates that the error is permanent and that the user should not retry the error without external changes
  • Temporary indicates that the error is temporary and the user can try to retry the request to resolve it
  • Persistent indicates that the error was once temporary but continues after a retry and discourages the user from trying the request again

Error Operation

The most helpful thing in error location is to know what kind of operation the error occurred in, for which OpenDAL adds Error Operation: this is a &'static str that OpenDAL implementers can append with with_operation().

pub fn with_operation(mut self, operation: &'static str) -> Self {
    if !self.operation.is_empty() {
        self.context.push(("called", self.operation.to_string()));
    }

    self.operation = operation;
    self
}

In particular, if this Error has been set to operation in the past, we append a new called context to the context to mark which operations the error was called in.

Error Context

OpenDAL adds Error Context to the error to carry contextual information.

pub struct Error {
    ...
    context: Vec<(&'static str, String)>,
}

impl Error {
    pub fn with_context(mut self, key: &'static str, value: impl Into<String>) -> Self {
        self.context.push((key, value.into()));
        self
    }
}

OpenDAL does this automatically for all services by implementing an Error Context Wrapper.

pub struct ErrorContextWrapper<T: Accessor + 'static> {
    meta: AccessorMetadata,
    inner: T,
}

#[async_trait]
impl<T: Accessor + 'static> Accessor for ErrorContextWrapper<T> {
    async fn read(&self, path: &str, args: OpRead) -> Result<ObjectReader> {
        let br = args.range();
        self.inner.read(path, args).await.map_err(|err| {
            err.with_operation(Operation::Read.into_static())
                .with_context("service", self.meta.scheme())
                .with_context("path", path)
                .with_context("range", br.to_string())
        })
    }
}

Thanks to this design, OpenDAL removes a huge amount of context-related code from the Services implementation.

Error Source

Obviously, OpenDAL needs to be able to expose the underlying errors as well. Here OpenDAL does not use thiserror to automatically implement #[from] for all errors, because it makes no sense to the user to have an error that has no way to be handled. Even if OpenDAL processed and returned IoError, XmlError, JsonError in a granular fashion, there would be no way for users to know what they should do based on that error. OpenDAL's choice is to uniformly use ErrorKind to return the error type and ErrorStatus to return the error status, other than that. The user can only return the error directly to a higher level.

OpenDAL has chosen to use anyhow::Error to carry the error source.

pub struct Error {
    ...
    source: Option<anyhow::Error>,
}

impl Error {
    pub fn set_source(mut self, src: impl Into<anyhow::Error>) -> Self {
        debug_assert!(self.source.is_none(), "the source error has been set");

        self.source = Some(src.into());
        self
    }
}

Developers can set the error source with set_source and prevent the source from being set more than once with debug_assert.

Here's an easy point to overlook: we don't need to generate backtraces for all errors. many business-expected errors, such as ObjectNotFound, can be made without carrying additional error sources and backtrace information at all.

Usage

Combining all the above features, we get an Error that:

pub struct Error {
    kind: ErrorKind,
    message: String,

    status: ErrorStatus,
    operation: &'static str,
    context: Vec<(&'static str, String)>,
    source: Option<anyhow::Error>,
}

There are some principles of use in OpenDAL.

  • All functions should return Result<T, opendal::Error>
  • Errors from external libraries are wrapped with set_source(err) as opendal::Error and returned
  • Carefully implement From<OtherError> for opendal::Error to prevent raw error leakage
  • The same error is handled only once, and subsequent operations only append context, not repeat wrap

OpenDAL learns from anyhow and implements Display and Debug respectively for Error.

where Display displays compact error messages, not source, and no backtrace.

ObjectNotFound (permanent) at stat, context { service: s3, path: x/x/y } => status code: 404, headers: {"x-amz-request-id": "TTD9EWB9NZ4ZF1DP", "x-amz-id-2": "ch/MHMf/zwPLxWgtBBY7fw9i9K+FGxDRzx3sxrbQKbtl21SONzTpNvs1IrFt2OjhAexcEB3Oo+c=", "c***tent-type": "applicati***/xml", "date": "M***, 21 Nov 2022 15:31:06 GMT", "server": "Amaz***S3"}, body: ""

And Debug displays the full error message.

Unexpected (temporary) at write => send async request

    Context:
        called: http_util::Client::send_async
        service: s3
        path: c40634f8-4a4b-479e-b98f-1ee0d7f1041b

    Source: error sending request for url (https://s3.us-west-1.amazonaws.com/***/***84551d50-811f-4614-87cb-ede43447dfbf/c40634f8-4a4b-479e-b98f-1ee0d7f1041b): user body write aborted: early end, expected 2621254 more bytes

    Caused by:
        0: user body write aborted: early end, expected 2621254 more bytes
        1: early end, expected 2621254 more bytes

Conclusion

In this article, we share the error practice of OpenDAL. The basic idea is to distinguish expected errors from unintended errors from the user's perspective. For expected errors, explicit error types are given to help users write clear error handling logic; for unintended errors, the same error type, such as Unexpected, is used to encapsulate them, and the error status is used to help users decide whether to retry or not. Don't abuse mechanisms like thiserror and From<OtherError> for Error, and don't blindly return a lot of error codes to the user that have no room to operate.

In addition, a well-designed error context mechanism ensures that each error is handled only once, avoiding that a single error is repeatedly wrapped in several layers. On the one hand, there is an additional performance impact: for OpenDAL, the error branch may be a few cases, but it is entirely possible for the user's logic to rely heavily on the errors returned by OpenDAL; on the other hand, it is detrimental to the user's ability to read and debug errors: repetitive encapsulation makes it impossible for the developer to find the focus at first glance, and the context is lost in layer after layer of structures.

In short, be sure to design the interface from a user experience perspective~