Remember when “content” looked like this?

[Image: a punched card, the original form of content. Generously offered under the Creative Commons Attribution 2.0 Generic license by Pete Birkinshaw on Wikimedia.]

Ah, the good old days.

Nowadays, “content” often looks like this:

[Image: modern content, encoded in WAV format.]

How far technology has come.

Don’t get me wrong, both of these formats (punched card and WAV, respectively) have their merits and are actually quite good at their respective tasks. But the fact remains that for literally decades now, we’ve been searching for a single format to contain “content” of all types. We haven’t quite failed — many attempts have been made with varying amounts of success — but we haven’t exactly hit a home run, either.

Most attempts at this have (appropriately) narrowed the scope to just rich text. After all, it’s reasonable to drop support for clearly distinct types of content like video if we could create something superior that only worked for rich text.

So in this article, I’m going to delve into three of the best-known attempts at cross-platform, fully-featured rich text formats of the past — XML, Markdown, and JSON — and three modern takes on the problem — Notion, ExtraMark, and Sanity’s Structured Content.

XML

When XML was created, it was forced to answer the question: “What stuff are we storing?”. Comparable formats like HTML had a definitive answer to that question (in its case, website layout). XML’s answer was pretty much “whatever you want”, hence the eXtensible in the acronym.

That can be a good thing, at times. For example, there’s a spec called NITF used by the news industry to format feeds that other tools can consolidate into cross-journal article collections like Google News. That’s only possible because XML let the creators of NITF pick whatever tags they wanted to include.

That extensibility can be a bad thing, at times, too. That lack of specificity in the format means learning XML doesn’t actually teach you a whole lot about using it. You still need to worry about the details of the specification because that’s what will tell you how to actually access the data stored in the XML.
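To make that concrete, here’s a minimal Python sketch using only the standard library. The two documents and their tag names (article/heading and product/name) are made up for illustration; neither comes from a real spec like NITF. The parser reads both without complaint, but the code still needs vocabulary-specific knowledge to find the title.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that store the same kind of information.
ARTICLE_XML = "<article><heading>Rich text formats</heading></article>"
PRODUCT_XML = "<product><name>Rich text formats</name></product>"

# Knowing XML is not enough: this per-vocabulary lookup is the part
# that a spec (or its documentation) has to tell you.
TITLE_PATHS = {
    "article": "heading",
    "product": "name",
}


def extract_title(xml_text: str) -> str:
    """Return the title of a document written in one of the known vocabularies."""
    root = ET.fromstring(xml_text)
    path = TITLE_PATHS.get(root.tag)
    if path is None:
        raise ValueError(f"unknown vocabulary: <{root.tag}>")
    element = root.find(path)
    if element is None or element.text is None:
        raise ValueError(f"<{root.tag}> has no <{path}> element")
    return element.text


print(extract_title(ARTICLE_XML))
print(extract_title(PRODUCT_XML))
```

Both calls return the same title, but only because TITLE_PATHS spells out where each vocabulary keeps it. Hand the function a third vocabulary and it has no way to guess.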

That extensibility also makes XML files unnecessarily long! Say I want to create a heading for my content. In XML, I need to arbitrarily define the <heading> element, or whatever I’d like to call it, and make sure that the existence of that element is clear to whoever will read this data later on. Often that’s done with a Document Type Definition. Then, we’ve got to somehow convey the meaning of this information, because besides the rather self-explanatory element title, the element doesn’t really tell us much. What exactly is a heading? Two different people reading this data may interpret <heading> differently. All of this takes up a lot of space — it’s so verbose that later efforts ended up swinging to the opposite extreme.

Markdown

The creators of Markdown saw the mess that was defining something as simple as a heading in XML, thought it was ridiculous, and came up with this:

#

I’d call that the opposite extreme.

Wrapped up in that tiny little hashtag (pound sign? just hash? sharp? octothorpe?) is the entire concept of the highest-order heading. It posits that the topic of the content that the heading is for is irrelevant, so it doesn’t need to have different markers for <ArticleTitle>, or <ProductName>, or <CompanyName>. A single mark for the highest-order heading does just fine.

That leads to the opposite problem that XML had! Where XML shined in its flexibility, Markdown doesn’t. In fact, when you want to store a product that can be sold through your website, it’d seem rather unnecessary to have a field in your database known just as heading. No, you’d likely want that to be more descriptive, especially because your product’s name will sometimes show up as a heading, or sometimes as a list item, or sometimes in plain text on the page. So in the end, while Markdown makes presentation on the Web easier by catering to actual website layout design, it doesn’t actually label the information it stores usefully, which is the point of a rich text format in the first place. It’s convenient in some use cases, but at the cost of being pointless in almost every other use case.
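Here’s a small Python sketch of that limitation. The regular expression is a deliberately simplified take on Markdown’s #-style headings (it ignores details like closing hashes and indentation that a real parser handles), and the sample lines are invented for illustration.

```python
import re

# Simplified heading pattern: one to six '#' characters, whitespace, then the text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*?)[ \t]*$")


def parse_heading(line: str):
    """Return (level, text) for a Markdown heading line, or None otherwise."""
    match = HEADING.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)


# An article title, a product name, and a company name all look the same.
for line in ["# Rich text formats", "# Acme Widget", "# Acme Corp"]:
    print(parse_heading(line))
```

Every line comes back as a level-1 heading plus its text. Nothing in the result says whether the text is an article title, a product name, or a company name; that knowledge has to live somewhere outside the Markdown.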

But even using Markdown where it was intended has an issue or two. For example, you’d expect that if it’s going to home in on representing elements in web design, it’d go all in. Yet Markdown has no analog for a significant chunk of the ways rich text can be represented in HTML. Add to that the conundrum that there technically isn’t one single standard for Markdown, so which features a given implementation supports is a bit up in the air. The positive is that this can be easily fixed! Let’s put a pin in this for now before coming back to the most recent attempts to fix it.

JSON

In 2001, a developer decided to serialize a JavaScript object as plain text so that it could be sent as a payload over a protocol like HTTP. Because it was familiar (starting out as a subset of JavaScript), but also cross-environment (parsers are available for pretty much every language), JSON quickly became one of the leading data transmission formats. Because it could store and transport nearly anything, though, it wasn’t long before folks started trying to use it for rich text.

But as seems to be the recurring theme of this article so far: there was a fatal flaw. We had swung back to the infinite extensibility of XML, so we’re back to where learning the format doesn’t teach us anything about using it. Still, properties in JSON are a little clearer than in XML (this is my highly subjective opinion, but given their relative popularities, I’d say a lot of people agree with me here), because everything is a property, even the “content” of the block that an object describes. That might sound a smidgeon confusing spelled out, but here’s an example:

XML


<product-description
	sold-out="true"
>
	This is the product description.
</product-description>

JSON


{
	"product-description": {
		"sold-out": true,
		"content": "This is the product description."
	}
}

For comparison, the XML example uses two different ways of storing data (split out for demonstration, not because this is best practice). You can store simple data either as an attribute on the parent element itself or as child nodes of that parent element. Knowing when to use each method can get a little confusing, especially if you don’t know the eventual development direction of the program that will consume this XML. The JSON example, on the other hand, treats all properties equally. They can be strings, booleans, numbers, null, arrays, or sub-objects, all of which are datatypes most developers are familiar with. We don’t need to treat them all like strings for the sake of the format (see sold-out in the XML example, which is stored as the string “true”).
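Here’s a minimal Python sketch that parses both examples with the standard library, just to show what each parser hands back. The variable names are mine; the documents are the two snippets above.

```python
import json
import xml.etree.ElementTree as ET

XML_DOC = """
<product-description sold-out="true">
    This is the product description.
</product-description>
"""

JSON_DOC = """
{
    "product-description": {
        "sold-out": true,
        "content": "This is the product description."
    }
}
"""

# XML: the attribute arrives as the string "true", and the text keeps
# its surrounding whitespace, so the consumer converts both by hand.
element = ET.fromstring(XML_DOC.strip())
xml_sold_out = element.get("sold-out")
xml_content = (element.text or "").strip()

# JSON: the parser hands back a real boolean and a clean string.
description = json.loads(JSON_DOC)["product-description"]
json_sold_out = description["sold-out"]
json_content = description["content"]

print(type(xml_sold_out).__name__, repr(xml_sold_out))
print(type(json_sold_out).__name__, repr(json_sold_out))
print(xml_content == json_content)
```

The first print reports a str and the second a bool. The content matches in both cases, but only after the XML text has been trimmed by hand.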

I personally am a fan of JSON, not because of some objective measure, but because I’m a JavaScript developer and it feels nice to use. Interestingly, though, JSON seems to be one of the most common ways forward in storing rich text. While formats are still needed for offline documents (think DOCX, an XML-based format, for Microsoft Word), a huge share of the documents created today are meant for sharing on the World Wide Web, which we typically access with JavaScript-capable browsers. So XML does have a place, but there’s a growing argument to be made that since rich text is primarily for the browser, JSON is about as native a cross-environment (read: not HTML) storage format as we’re going to find.

Notion

We’ve talked a fair bit now about attempts of the past — let’s talk a bit about the present and the future. I originally got the notion (see what I did there?) to write an article like this after examining my own workflow and how it’s progressed from what technical authors used years ago. And I found that the biggest improvement was Notion, the note-taking app that I use to write these articles. I dug into Notion’s API to figure out how they store the content of the rich text I’m writing right now. Take a look at what I get when I query for the heading of this section:

{
	"object": "list",
	"results": [
		{
			"object": "block",
			"id": "00000000-this-long-uuid-000000000000",
			"created_time": "2022-05-24T02:55:00.000Z",
			"last_edited_time": "2022-05-24T02:55:00.000Z",
			"created_by": {
				"object": "user",
				"id": "0another-very-long-uuid-000000000000"
			},
			"last_edited_by": {
				"object": "user",
				"id": "0another-very-long-uuid-000000000000"
			},
			"has_children": false,
			"archived": false,
			"type": "heading_2",
			"heading_2": {
				"rich_text": [
					{
						"type": "text",
						"text": {
							"content": "Notion",
							"link": null
						},
						"annotations": {
							"bold": false,
							"italic": false,
							"strikethrough": false,
							"underline": false,
							"code": false,
							"color": "default"
						},
						"plain_text": "Notion",
						"href": null
					}
				],
				"color": "default"
			}
		}
	],
	"next_cursor": null,
	"has_more": false
}

Okay, so definitely JSON.

The question is, how did they use the underlying technology to match the use case? Well, I noticed a couple of things:

  1. Every object of data is identified with a UUID. Things like the actual content of the block don’t count — I’m talking about discrete objects like blocks themselves, users, pages, etc. This alone takes away one of the biggest disadvantages of building complex data structures in JSON by allowing you to reference other objects without repeating their content. It’s second only to how many query languages like GraphQL do it. How convenient it is then that NotionQL exists.
  2. Maybe this is a small thing, but I liked that they didn’t just have a string at the annotations property that lists some set of annotation options, like bold-underline-italic. JSON inherits JavaScript’s simple, easy-to-understand booleans, so Notion has gone with the option of creating an annotations object, with each annotation option being its own boolean. That means they don’t have to worry about the order that the annotations are given in, and they don’t have to worry about future changes (like adding a superscript annotation option, for example) breaking everything.
  3. One benefit of XML is that it requires tag names! It’s easy to gloss over that, but those tag names help define what each element actually is. Notion, here, has made sure that their JSON objects aren’t unlabelled. They actually have a consistent property on each discrete object (see #1 of this list) called object, which tells you what the object you’re reading actually is. Line 5 of the response above ("object": "block") tells you “you’re reading information about a block of content”, and line 10 ("object": "user") says “you’re now looking at an object that represents a user”. In XML, those would be <block> and <user> tags, so it’d be super clear, but Notion’s consistent application of this simple scheme gives JSON the same advantage.
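To see how those three choices play out for a consumer, here’s a minimal Python sketch that reads a trimmed copy of the response above. The helper names (active_annotations, describe_block) are my own and not part of Notion’s API, and the superscript flag in the usage note is hypothetical, borrowed from the example in point 2.

```python
import json

# A trimmed copy of the response shown earlier; only the fields used below are kept.
RESPONSE = json.loads("""
{
    "object": "list",
    "results": [
        {
            "object": "block",
            "id": "00000000-this-long-uuid-000000000000",
            "created_by": {"object": "user", "id": "0another-very-long-uuid-000000000000"},
            "type": "heading_2",
            "heading_2": {
                "rich_text": [
                    {
                        "type": "text",
                        "annotations": {
                            "bold": false, "italic": false, "strikethrough": false,
                            "underline": false, "code": false, "color": "default"
                        },
                        "plain_text": "Notion"
                    }
                ]
            }
        }
    ]
}
""")


def active_annotations(annotations: dict) -> list:
    """Names of the flags set to true; non-boolean entries such as color are skipped."""
    return sorted(name for name, value in annotations.items() if value is True)


def describe_block(block: dict) -> dict:
    """Summarize one block using its 'object' and 'type' labels."""
    if block.get("object") != "block":
        raise ValueError(f"expected a block, got {block.get('object')!r}")
    block_type = block["type"]  # e.g. "heading_2"
    spans = block[block_type]["rich_text"]  # payload lives under a key named after the type
    return {
        "id": block["id"],
        "type": block_type,
        "text": "".join(span["plain_text"] for span in spans),
        "annotations": [active_annotations(span["annotations"]) for span in spans],
        "author_id": block["created_by"]["id"],  # a reference, not a copy of the user
    }


summary = describe_block(RESPONSE["results"][0])
print(summary)
```

Because each annotation is its own boolean, active_annotations keeps working if a new flag shows up: passing it a dictionary with a hypothetical "superscript": true entry simply adds that name to the result.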

With these advantages, Notion has created a system that other tools might do well to adopt and even expand upon until it becomes a complete specification! I’m struggling to find any “illogical” parts here, per se, even if I personally could see the benefits of taking a different approach. For example, Notion expanded on JSON (an apt decision, given the block-based nature of the program), but it still takes Markdown as input and output, so anything that passes through Markdown is limited by what Markdown can support. I personally use a lot of toggle lists and side-by-side elements, both of which Notion supports but Markdown doesn’t. Regardless, Notion has set up an excellent system, and I’m really excited to see it evolve.

ExtraMark

Let’s take out the pin we put in Markdown earlier — and to recap, I was just complaining that Markdown specializes in representing rich text on the web, but doesn’t have enough marks to represent all the types of rich text on the web, limiting its use for its intended purpose. But the situation drastically improves when we just add some of those missing elements. The usefulness of Markdown off the web doesn’t change, but some new flavors of Markdown can make it perfect for mapping directly to HTML rich text.

Take, for example, ExtraMark. It’s a superset of CommonMark, one of the most popular and generally agreed-upon flavors of Markdown. But ExtraMark took it a step further and started adding in other highly useful features. Here’s the list from their GitHub README:

  • Automatic typographic replacements
  • Tables
  • Anchors for headings (up to heading level 3)
  • Definition lists
  • Superscript
  • Subscript
  • Abbreviations
  • Footnotes
  • Critic Markup

Amazing. I can’t count how many times (despite personally despising working with tables) I’ve googled how to make one with Markdown and found myself stuck. Now, it’s possible! Definition lists are another feature that probably should be used more often (they’re semantically quite valuable, just a bit obscure), and now they can be used in Markdown too!

I should add a little footnote to that last statement, though. We can technically use footnotes and subscript in Markdown, but whatever will be parsing or displaying that Markdown needs to support ExtraMark, and I’ve never actually heard of anything else that implements it. This repo has 4 GitHub stars, so it’s not a common tool. And that’s a shame! ExtraMark is the most logical, yet still powerful, proposed spec for Markdown that I’ve seen so far! Because it’s completely compatible with CommonMark, if you need a Markdown parser for your next project, I recommend choosing this one! Everything you’ve written so far will still work, but now you’ll have all of these features at your fingertips.

Structured Content

Let’s jump back to JSON now — personally, I think there’s only one format that has outdone Notion. I’d like to introduce you to Sanity’s Structured Content.

While it isn’t necessary to use Sanity to use Structured Content, it’s linked closely enough with their data storage platform to be an optional benefit instead of a mandatory burden. If you need somewhere to store that data that you’re transporting around in Structured Content, you can put it in Sanity knowing they’ll handle the formatting and everything for you.

Structured Content also includes a lot more than just rich text! It has built-in tools for (as the name suggests) structuring content — the models that your content follows live right alongside everything else! You can store logic and custom algorithms to modify your data when needed (side note, as wild as this sounds, this part is actually pretty common in rich text — think JavaScript dynamically modifying HTML, or those silly programmable animations in PowerPoint), you can easily loop in external services (also not unheard of — there are even workarounds for this in Markdown), and images are automatically customized when they’re displayed (this was always possible, but used to require an external service — Sanity bakes it into your rich text).

The best part (in my opinion) is that you can strictly use the rich text specification without all of the Sanity-specific features (where much of that logic actually runs). By itself, the spec is called Portable Text, and it’s incredibly well-supported by drivers and parsers developed by the Sanity team. They’ve put quite a bit of effort into not locking you into their platform. As of May 2022, they have publicly-available libraries that take Portable Text as input and spit out pure HTML, Markdown, Vue components, React components, Svelte components, and Hyperscript (I haven’t seen this until now, but it looks amazing). If you’re working in other programming languages, they’ve got you covered too — not that I ever want to work in PHP again, but if I did, at least I’d be comforted knowing I could use Portable Text.
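To give a feel for what those libraries do, here’s a deliberately tiny Python sketch that turns a couple of Portable Text blocks into HTML. It is not one of Sanity’s serializers. The field names (_type, style, children, marks, markDefs) follow my reading of the Portable Text spec, the sample blocks are made up, and the sketch skips lists, links, and custom block types entirely.

```python
from html import escape

# Hypothetical sample data shaped like Portable Text blocks (an assumption, see above).
BLOCKS = [
    {
        "_type": "block",
        "style": "h2",
        "markDefs": [],
        "children": [{"_type": "span", "text": "Structured Content", "marks": []}],
    },
    {
        "_type": "block",
        "style": "normal",
        "markDefs": [],
        "children": [
            {"_type": "span", "text": "Portable Text is ", "marks": []},
            {"_type": "span", "text": "just JSON", "marks": ["strong"]},
            {"_type": "span", "text": " & nothing else.", "marks": []},
        ],
    },
]

# Only a handful of styles and decorators are mapped in this sketch.
STYLE_TAGS = {"normal": "p", "h1": "h1", "h2": "h2", "h3": "h3", "blockquote": "blockquote"}
MARK_TAGS = {"strong": "strong", "em": "em", "code": "code"}


def render_span(span: dict) -> str:
    """Escape the span's text and wrap it in a tag for each known mark."""
    html = escape(span["text"])
    for mark in span.get("marks", []):
        tag = MARK_TAGS.get(mark)
        if tag is not None:  # unknown marks (such as link annotations) are ignored here
            html = f"<{tag}>{html}</{tag}>"
    return html


def render_block(block: dict) -> str:
    """Render one text block, falling back to a paragraph for unknown styles."""
    tag = STYLE_TAGS.get(block.get("style", "normal"), "p")
    inner = "".join(
        render_span(child) for child in block["children"] if child.get("_type") == "span"
    )
    return f"<{tag}>{inner}</{tag}>"


def to_html(blocks: list) -> str:
    """Render every text block, one per line; other block types are skipped."""
    return "\n".join(render_block(b) for b in blocks if b.get("_type") == "block")


print(to_html(BLOCKS))
```

The point is less the HTML itself than how little the renderer has to know: the block says what it is (_type), how it should read (style), and which spans carry which marks, so swapping the output to Markdown or a component tree is mostly a matter of changing the two lookup tables.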

Looking back

Well, this was quite an in-depth article. I’m already at nearly three thousand words, but I think we spent them well; diving into the history of rich text formats is a useful exercise, because it helps us understand the unsolved problems and shortcomings so that we can work to improve on them.

Realistically, many of us may not be involved in creating one of the rich text specs of the future, but in all likelihood, we’ll have to choose one to use, and the cursory understanding we reviewed here will help inform those decisions. Maybe plain-old XML, Markdown, and JSON are fading away, but what format will you use to pass around rich text in your next project? If you want my advice — and you read this far, so why not — reach for ExtraMark if you’re going the Markdown route or Portable Text for everything else.

Thanks for reading, and I’m looking forward to seeing what you create.

About the author
Jaden Baptista

Technical Writer
