Developing a file sync library

Originally posted in our newsletter.

In this post I’m diving into the sync library I developed to keep recipes in sync across devices. This one’s a bit technical, but if you’re curious about system design or want to peek into the process, you might find it interesting—and maybe even useful for your next interview! 😉

The Problem

Initially, I launched a mobile app that used iCloud to sync recipes between desktop and mobile devices. It was a quick, get-things-done solution: let users manage recipes in the comfort of the desktop, then use them on their phone when cooking or shopping. However, iCloud comes with significant limitations: it only works between Mac and iPhone, and it offloads files from the phone to optimize storage, which is far from ideal for real-world use.

So, I needed to move on from this relationship with iCloud and find a better solution.

Finding the Right Library

Having worked in enterprise environments for a while (and having had my brain permanently damaged in the process), I started by listing my requirements. The library needed to:

  • Work across Windows, Linux, macOS, Android, and iOS
  • Be written in a cross-platform language like C, or in a language with C-bindings support
  • Support bi-directional sync
  • Be lightweight, efficient, and secure
  • Handle text file synchronization with near real-time updates
  • Have a solid conflict resolution mechanism
  • Bonus: Work with symlinks

With that list in hand, I started searching for an existing library. Here’s what I found and why I didn’t use each one:

| Library | Description | Why Not? |
|---|---|---|
| librsync | Calculates and applies network deltas. | Doesn’t include transport, complex dependencies. |
| duet | Bi-directional sync similar to Unison. | SSH transport only. |
| LuminS | Fast alternative to rsync for local files. | Only for local backups. |
| bita | HTTP-based sync tool focused on low bandwidth. | Needs an archive shim to work. |
| fast_rsync | Faster rsync in pure Rust. | No transport layer. |
| rustic | Backup tool with encryption and deduplication. | Requires extra setup to be useful. |
| zbox | Encrypted in-app file system. | Not accessible when the app isn’t running. |
| Filer | Rust service for cloud document syncing. | S3 backend only. |
| NymDrive | Encrypted file sync over the Nym network. | Requires Nym network. |
| rjrssync | Fast, rsync-like tool. | SSH transport only. |

Building My Own Library

Led by my foolhardiness and courage, and given the limitations of existing libraries, I decided to create a new library, something quick and dirty of course… Here’s what I aimed for:

  • Low development effort. Keep it simple.
  • Single-user edits. No collaboration is expected, so we can avoid complexities from it.
  • Online editing. Assume users are mostly changing recipes when they have internet connection.
  • Latest version priority. Recipes are small enough to see in full, so favoring the most recent edit is acceptable.
  • Server-centric. Not peer-to-peer, thanks.
  • Automatic conflict resolution. No manual steps required.
  • Tolerance for rare edit loss. Acceptable trade-off for simplicity.

Conflict Resolution

Resolving conflicts in file synchronization is tricky, especially with multiple clients.

I foresee these operations that can happen and cause conflicts:

  • Add/remove/edit line in a file
  • Create/remove file
  • Move file between dirs
  • Rename a file
  • Create/remove a dir
  • Client goes offline/online

Here’s a breakdown of potential strategies (thanks ChatGPT!):

| Strategy | Approach | Pros | Cons |
|---|---|---|---|
| Last Write Wins (LWW) | Latest timestamp wins. | Simple to implement. | Risk of losing important changes. |
| Version Vectors | Compare version numbers. | More accurate than LWW. | Complex to implement. |
| Operational Transformation (OT) | Transform concurrent operations. | Effective for collaborative editing. | Overkill for simple sync. |
| CRDTs | Merge data structures without conflicts. | Guarantees consistency. | High memory overhead, especially in dynamic languages. |
| Three-Way Merge | Use a common ancestor for merging. | Reduces data loss. | Requires file history tracking. |
| Application-Specific Rules | Tailored conflict rules. | Highly customized. | Requires deep understanding of the application. |
| User-Assisted Merging | Automatic merges with user fallback. | Balances automation with control. | Manual intervention can be time-consuming. |
| Hybrid Approaches | Combine methods based on context. | Flexible. | Increased complexity. |
| Semantic Sync | Understand file data semantics. | More accurate. | Requires detailed parsing. |
| Machine Learning | Predict best merge methods. | Improves over time. | High computational requirements. |

Conflict-Free Replicated Data Types

In my research, I initially decided to explore Conflict-Free Replicated Data Types (CRDTs). I had come across them in various blog posts, and they seemed both intriguing and sophisticated. If math—albeit a lot of math 🙄—can provide a solution that completely eliminates conflicts, why not give it a shot?

The concept behind CRDTs is to use specially designed data primitives and operations that can be safely merged to produce a consistent final state. It sounds ideal in theory. A good example of these data primitives can be found here: https://github.com/aphyr/meangirls.
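
To give a feel for what such a primitive looks like, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs; it is purely illustrative and unrelated to the sync library:

```rust
use std::collections::HashMap;

/// A grow-only counter: each replica increments only its own slot,
/// and merging takes the per-replica maximum, so merges commute,
/// are idempotent, and always converge to the same state.
#[derive(Clone, Default, Debug)]
struct GCounter {
    counts: HashMap<String, u64>,
}

impl GCounter {
    fn increment(&mut self, replica_id: &str) {
        *self.counts.entry(replica_id.to_string()).or_insert(0) += 1;
    }

    /// The counter’s value is the sum across all replicas.
    fn value(&self) -> u64 {
        self.counts.values().sum()
    }

    fn merge(&mut self, other: &GCounter) {
        for (replica, &count) in &other.counts {
            let entry = self.counts.entry(replica.clone()).or_insert(0);
            *entry = (*entry).max(count);
        }
    }
}
```

Counters like this merge beautifully; ordered natural-language text, as it turned out, does not.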

However, after experimenting with CRDTs, I realized that these primitives couldn’t accurately represent the correct state of my recipes until all merges had fully propagated (and sometimes not even then). CRDTs are built around the idea of ordering, and while recipe files do have an order of words, that order comes from natural language and isn’t something you can meaningfully measure when computing merges.

I’ll leave it to the industry leaders to discuss this challenge in more detail:

“Exactly, the “C” in CRDT is a little misleading. They are “conflict free” in as much as they are able to merge and resolve in the basic sense. That does not mean that the resulting state is valid or meets your internal schema. … Point is, it’s still essential with CRDTs to have a schema and a validation/resolution process. That or you use completely custom CRDTs that encodes into their resolution process the validation of the schema.” Source: https://news.ycombinator.com/item?id=33865672

“Figma isn’t using true CRDTs. CRDTs are designed for decentralized systems where there is no single central authority to decide what the final state should be. There is some unavoidable performance and memory overhead with doing this. Since Figma is centralized (our server is the central authority), we can simplify our system by removing this extra overhead and benefit from a faster and leaner implementation.” Source: https://www.figma.com/blog/how-figmas-multiplayer-technology-works/

“Apparently, everyone recognizes that CRDTs have a lot of potential, but concluded that the memory overhead of using them must be too expensive for real-world applications.” Source: https://blog.kevinjahns.de/are-crdts-suitable-for-shared-editing/

After I dropped the idea of using CRDTs, I settled on a Server Wins strategy because it’s simple, minimizes surprises for users, and is resistant to issues from outdated clients.
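
For illustration only, that decision rule boils down to something like the following sketch; the names and the exact check are my simplification, not the library’s actual code:

```rust
/// What to do for a given file during a sync round (illustrative only).
enum Resolution {
    TakeServerVersion, // the server moved on since our last sync
    PushLocalVersion,  // we are up to date, so our edit can be committed
    NothingToDo,
}

/// A minimal “server wins” rule: if the server has journal entries we
/// haven’t seen yet, its version replaces ours, even if that occasionally
/// drops a local edit (the accepted trade-off from the goals above).
fn resolve(local_journal_id: u64, server_journal_id: u64, locally_modified: bool) -> Resolution {
    if server_journal_id > local_journal_id {
        Resolution::TakeServerVersion
    } else if locally_modified {
        Resolution::PushLocalVersion
    } else {
        Resolution::NothingToDo
    }
}
```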

Choosing the Algorithm

I began by researching how others approach similar tasks. The first thing that came to mind was to examine how rsync works: it efficiently compares files, generates deltas, and transfers them over SSH. The standalone librsync library implements a similar delta algorithm for other tools to reuse (rdiff, for example), although rsync itself doesn’t depend on it.

Next, I explored a blog post on Dropbox (https://dropbox.tech/infrastructure/streaming-file-synchronization) that detailed changes they made to their sync algorithm. Although the post didn’t provide complete details, it offered enough insight for me to understand the general process.

With this knowledge, I decided to implement my own Dropbox clone. If it works for them, it should work for me too. My implementation will send only the differences between files and include polling, allowing clients to subscribe and receive updates as needed. Now, let’s delve into how this algorithm operates.

The system operates with both client-side and server-side components. Similar to rsync, the process involves splitting each file into chunks, with each chunk assigned an associated checksum. The server stores these checksums and maintains a journal that records all changes. When the client needs updates, it requests all changes since a specific journal reference ID.

Dropbox uses fixed-size chunks (they mentioned a size of 4 MB). However, in my case, such large chunks would be inefficient because Cooklang primarily deals with text files, which are usually only a few kilobytes in size. Therefore, splitting them into such large chunks doesn’t make sense. Instead, I decided to split text files at new line symbols, resulting in much smaller chunks.
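
To make that concrete, here is a minimal sketch (not the library’s actual code) of splitting a recipe into line-sized chunks and hashing each one. `DefaultHasher` keeps the example dependency-free; a real implementation would more likely use a proper content hash such as SHA-256 or BLAKE3:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Split a text file into line chunks and compute a checksum for each.
/// Returns (hash, line) pairs in file order.
fn chunk_by_lines(contents: &str) -> Vec<(u64, String)> {
    contents
        .lines()
        .map(|line| {
            // DefaultHasher is only for illustration; use a real content
            // hash (SHA-256, BLAKE3, …) in anything production-facing.
            let mut hasher = DefaultHasher::new();
            line.hash(&mut hasher);
            (hasher.finish(), line.to_string())
        })
        .collect()
}
```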

Below is a rough sketch of the kind of record the journal could keep to track file version history; the field names are illustrative rather than the library’s actual schema.
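
```rust
/// Illustrative journal record, one per committed change; the field names
/// are assumptions for the sake of the example, not the real schema.
struct JournalEntry {
    /// Monotonically increasing ID; clients sync by asking for
    /// “everything after journal ID N”.
    journal_id: u64,
    /// Path of the file relative to the synced directory.
    path: String,
    /// Ordered list of chunk checksums making up this file version.
    chunk_hashes: Vec<u64>,
    /// Marks a deletion rather than an edit.
    deleted: bool,
    /// Commit timestamp, handy for debugging.
    timestamp: u64,
}
```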

Upload flow

Let’s start by considering the upload flow. First, we need to recognize that we have some local changes. We generate a list of hashes for the specific file and compare it with the list previously received from the server and stored in the local database. If there’s a mismatch, we submit the file to the server. However, the server might not yet have all the required chunks, so it may return a list of missing hashes. We then upload the corresponding chunks to block storage (the chunk server). Afterward, we attempt to commit again, and if successful, we receive a new journal ID. Finally, we store this journal ID along with the list of hashes in our client database.
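
Here is a rough, illustrative sketch of that client-side flow, reusing the `chunk_by_lines` helper from the chunking sketch above; `try_commit`, `put_chunk`, and the overall shape are stand-ins for whatever the real API looks like:

```rust
/// Illustrative client-side upload flow; `try_commit` and `put_chunk`
/// stand in for the metaserver and chunk-server calls.
fn upload(
    path: &str,
    contents: &str,
    known_hashes: Option<&[u64]>, // hash list from the last successful sync
    mut try_commit: impl FnMut(&str, &[u64]) -> Result<u64, Vec<u64>>,
    mut put_chunk: impl FnMut(u64, &str),
) -> Option<(u64, Vec<u64>)> {
    // 1. Split the file into line chunks and hash them
    //    (chunk_by_lines is the helper from the chunking sketch above).
    let local = chunk_by_lines(contents);
    let hashes: Vec<u64> = local.iter().map(|(h, _)| *h).collect();

    // 2. If the hashes match what we last synced, there is nothing to do.
    if known_hashes == Some(hashes.as_slice()) {
        return None;
    }

    // 3. Try to commit. If the metaserver reports missing chunks, upload
    //    them to block storage and commit again.
    let committed = match try_commit(path, &hashes) {
        Ok(journal_id) => Some(journal_id),
        Err(missing) => {
            for (hash, data) in &local {
                if missing.contains(hash) {
                    put_chunk(*hash, data);
                }
            }
            try_commit(path, &hashes).ok()
        }
    };

    // 4. On success, return the new journal ID together with the hash list
    //    so the caller can persist them in the client database.
    committed.map(|journal_id| (journal_id, hashes))
}
```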

Download flow

Now, let’s examine how downloading works for clients who don’t yet have the latest version of the recipe. A client will request the metaserver (journal server) to list all journal IDs since the last one stored locally. For example, if the last known journal ID was 15, the client will ask for entries with IDs greater than 15. The metaserver then returns a list of files with their new hashes and corresponding journal IDs. The client proceeds to download the required chunks from the chunk server and stores them locally. Once all chunks are downloaded, the file can be reconstructed and replaced in the local file system.
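
And a similar sketch of the download side; `list_changes_since` and `fetch_chunk` are stand-ins for the metaserver and chunk-server calls, not the library’s real API:

```rust
use std::collections::HashMap;

/// A change entry returned by the metaserver: which file changed, its new
/// journal ID, and the ordered chunk hashes of its latest version.
struct Change {
    path: String,
    journal_id: u64,
    hashes: Vec<u64>,
}

/// Client-side download flow: fetch every change after the last journal ID
/// we know about, pull any chunks we are missing, and rebuild the files.
/// Returns the rebuilt files plus the newest journal ID to store locally.
fn download(
    last_known_journal_id: u64,
    list_changes_since: impl Fn(u64) -> Vec<Change>,
    fetch_chunk: impl Fn(u64) -> String,
    local_chunks: &mut HashMap<u64, String>,
) -> (Vec<(String, String)>, u64) {
    let mut rebuilt_files = Vec::new();
    let mut newest_journal_id = last_known_journal_id;

    // 1. Ask the metaserver for everything newer than our last journal ID
    //    (e.g. if the last known ID was 15, request entries with ID > 15).
    for change in list_changes_since(last_known_journal_id) {
        // 2. Download only the chunks we don't already have locally.
        for hash in &change.hashes {
            local_chunks
                .entry(*hash)
                .or_insert_with(|| fetch_chunk(*hash));
        }

        // 3. Reconstruct the file from its chunks (line-based, so we join
        //    them back with newlines) and queue it to be written out.
        let lines: Vec<String> = change
            .hashes
            .iter()
            .map(|h| local_chunks[h].clone())
            .collect();
        rebuilt_files.push((change.path, lines.join("\n")));

        newest_journal_id = newest_journal_id.max(change.journal_id);
    }

    (rebuilt_files, newest_journal_id)
}
```

The returned journal ID then becomes the new “last known” value for the next polling cycle.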

Conclusion

The library is open-source and available at https://github.com/cooklang/cooklang-sync. It’s still in draft form but is already serving Android alpha users. There might be some issues as it evolves, but I believe other projects could benefit from this simple sync mechanism.

What’s next? Features like rsync-style remote comparison, read-only sync, batch downloads, and security enhancements are on the horizon. For now, it syncs only selected file extensions.

Happy cooking!

-Alexey