The community website of Databend is built using Docusaurus, hosted on Vercel, and automated i18n using Crowdin.

This SaaS service stack has the following benefits.

  • Only the source files need to be maintained in the source code, no additional storage of translated files is required
  • Translations are always consistent with source files, no need to deal with out-of-date translations, etc.
  • Separation of focus, so the documentation and translation students can work in parallel
  • No additional learning cost, translation students can quickly get started after configuration, no need to learn command line tools
  • Leverage the development of machine learning to reduce workload

Today's edition of this weekly newsletter focuses on Databend's documentation of i18n practices based on these services.

The main contributors to the launch of the i18n feature on the Databend community website were @PsiACE and @Xuanwo, with @Xuanwo completing the preliminary research and initial configuration, and @PsiACE taking over as a follow-up, resolving all residual issues, cleaning up outdated content, and driving the feature to final implementation.

Background

Databend is an open source Cloud Data Warehouse based on Rust that supports a wide range of SQL syntaxes, so it has a huge amount of documentation. According to Crowdin's backend statistics, there are nearly 12W words after de-duplication. Not only that, but Databend iterates very quickly and the documentation is constantly being updated. In addition, the Databend community was just growing, with only 100+ contributors, and lacked the resources to organize a dedicated translation team or hire a third-party translation company. As a result, the Databend community needed a low-cost, less invasive, easy-to-use i18n solution.

The official Docusaurus documentation provides two solutions, Git and Crowdin, which are essentially different implementations of the same idea: Docusaurus sets a set of rules for i18n, places all i18n-related files under i18n/<locale> in the project directory, and builds the files according to the locale set. Crowdin uploads the source files to the translation platform, translates them on the translation platform, and downloads the results to the specified directory. This solution was quickly ruled out due to the complexity of maintaining the Git solution, and the next step was to consider how to combine Github / Crowdin / Vercel to make the translation experience as less invasive as possible for developers.

Possible solutions include

According to our actual experience, Crowdin's own Github integration does not work particularly well, the minimum setting can only be set to 10 minutes synchronization granularity and not to mention, synchronization often feels delayed, some strange problems, it does not feel particularly reliable, and it takes a lot of time in debugging. Github Action is relatively better, at least you can see a clear error log if anything goes wrong. If download_translations is enabled, it will automatically launch a PR with translated content, and the maintainer can preview it and merge it directly or configure an automated process to merge it automatically after the build passes. The disadvantage is that this approach requires adding translated content directly to the repository, which can quickly grow in size if the amount of translated text is large.

Processes

In the end we chose to use Github Action to upload the source files and Vercel to download the translation files before the build. The exact process is as follows.

  • Configure Github Action to upload the files to be translated each time the main is updated
  • Translate on the Crowdin platform
  • Vercel build to download translated files at build page and perform i18n build

Configuration

Next, we describe how to implement this process.

Docusaurus

First we need to properly configure Docusaurus with the appropriate locale and how it will be displayed on the page. Add the following configuration to docusaurus.config.js.

const config = {
  i18n: {
    defaultLocale: 'en-US',
    locales: ['en-US', 'zh-CN'],
    localeConfigs: {
        'en-US': {
            label: 'English',
        },
        'zh-CN': {
            label: '简体中文',
        },
    },
  },
  themeConfig: ({
    navbar: {
      items: [
        {
          type: 'localeDropdown',
          position: 'right',
          dropdownItemsAfter: [
            {
              to: 'https://databend.crowdin.com/databend',
              label: 'Help Us Translate',
            },
          ],
        },
      ]
    }
  })
}
  • i18n.defaultLocale determines the locale used for the default display of the page
  • i18n.localeConfigs determines which locales are supported by the current page
  • i18n.localeConfigs configure different locale respectively, here we mainly use locale to show out the Label
  • themeConfig.navbar.items add type to localeDropdown to add a language switch button in the navigation bar
    • By adding a dropdownItemsAfter you can add some additional items to the existing locale

In addition, the connection to which the edit button points can be modified by modifying the editUrl.

editUrl: ({locale, devPath}) => {
  if (locale !== config.i18n.defaultLocale) {
    return `https://databend.crowdin.com/databend/${locale}`;
  }
  return `https://github.com/datafuselabs/databend/edit/main/docs/dev/${devPath}`;
},

Once Docusaurus is configured, we can use the commands provided by docusaurus to generate the content to be translated related to Docusaurus themes and plugins:

docusaurus write-translations

Crowdin

Crowdin work depends on the crowdin.yml configuration provided by the project, which is currently used by Databend as follows.

See crowdin.yml for the latest configuration

project_id: "2"
api_token_env: CROWDIN_PERSONAL_TOKEN
preserve_hierarchy: true
base_url: "https://databend.crowdin.com"
base_path: "../"
export_only_approved: true

# See docs from https://docusaurus.io/docs/i18n/crowdin
files:
  # JSON translation files
  - source: /website/i18n/en-US/**/*
    translation: /website/i18n/%locale%/**/%original_file_name%
  # Blog Markdown files
  - source: /website/blog/**/*
    translation: /website/i18n/%locale%/docusaurus-plugin-content-blog/**/%original_file_name%
  # Docs Markdown files
  - source: /docs/doc/**/*
    translation: /website/i18n/%locale%/docusaurus-plugin-content-docs/current/**/%original_file_name%
  - source: /docs/dev/**/*
    translation: /website/i18n/%locale%/docusaurus-plugin-content-docs-dev/current/**/%original_file_name%
  • project_id is the numeric ID of the current project (you can see it in the admin backend by clicking on Project Settings)
  • api_token_env is used to specify the name of the environment variable used by API TOKEN, next set CROWDIN_PERSONAL_TOKEN environment variable to the correct Token (Token can be created in the user settings)
  • preserve_hierarchy is the directory structure that the user keeps, recommended to turn on
  • base_url is the address of the organization, set according to the actual situation
  • base_path is the address of the project file relative to the configuration file, set as appropriate (databend puts the configuration file in the .github/ directory, so it uses . /)
  • export_only_approved only exports approved translations, recommended to avoid translations going live without review
  • files maintains the mapping between original files and translation files, which needs to be carefully considered and adjusted according to the actual situation of the project

With crowdin.yml configured correctly we can already try to upload and download files to start debugging. crowdin-cli is developed in Java, but provides distribution in multiple languages. For ease of use later, we will use the npm version directly with yarn add @crowdin/cli.

export CROWDIN_PERSONAL_TOKEN=<your_token>
# Upload sources
./node_modules/.bin/crowdin upload sources --config=.github/crowdin.yml
# Download files
./node_modules/.bin/crowdin download --config=.github/crowdin.yml

New scripts can be added to package.json to simplify the follow-up process.

{
  "scripts": {
    "download-translations": "crowdin download --config=../.github/crowdin.yml --export-only-approved --all",
  },
}

If all works fine, we can continue with the configuration.

Upload files to be translated

Once this test process is running, we can add Github Action to automate the translation file upload process:

See i18n.yml for the latest configuration

Add file .github/workflows/i18n.yml

name: Crowdin Action

on:
  push:
    branches: [main]

jobs:
  synchronize-with-crowdin:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: crowdin action
        uses: crowdin/github-action@1.5.1
        with:
          config: ".github/crowdin.yml"
          upload_sources: true
          # Delete obsolete files and folders from Crowdin project
          upload_sources_args: "--delete-obsolete"
        env:
          CROWDIN_PERSONAL_TOKEN: ${{ secrets.CROWDIN_PERSONAL_TOKEN }}
  • Only upload files to be translated when the main branch is updated
  • set upload_sources to true to upload the files to be translated, but nothing else needs to be enabled
  • Add the --delete-obsolete parameter to upload_sources_args to clean up the files that have become invalid.

Action After successful debugging, you can proceed to the last step ->

Download Translated Files

Vercel supports overloading the commands we use to build the project, we just need to overload Build to

yarn run download-translations && yarn run build

This command will first download the translated file from Crowdin and execute the build.

Remember to also configure CROWDIN_PERSONAL_TOKEN in the environment variable.

Follow-up Maintenance

After successful configuration there is no need for complicated maintenance, just execute docusaurus write-translations to update the plugin's translations after a Docusaurus upgrade, and most of the usual translation-related operations can be performed within the Crowdin platform.

Tricks

Turn on machine translation

Crowdin, as a translation platform, supports and has access to most translation engines and third-party translation companies, making it easy to customize your own translation process. the process used by Databend is as follows.

!

  • First we use Crowdin's own machine translation engine to pre-translate all content
  • Then we translate the untranslated content using the translation history recorded by Crowdin
  • Then we use human translation
  • Finally, the translated content is reviewed and approved before going live

Crowdin supports a large number of machine translation engines, such as DeepL, Google Translte, etc. You can choose according to your financial resources and actual needs.

In-Context for Web

According to the strong amenity of Crowdin CEO, Crowdin supports to turn on the real-time translation function on the current page.

The effect is as follows.

!

It enables to directly select the text content on the current page for translation, and the translation result is also synchronized to the page in real time. This function is rather conflicting with our existing process, so it is not turned on.

Conclusion

This article describes how to implement a set of low-cost, weakly invasive, easy-to-use, open-source, i18n collaboration processes based on Docusaurus, Vercel, Crowdin and Github. friendly i18n collaboration process, I hope it will be useful for other open source communities to build documentation ~

References