Salesforce, Python, SQL, & other ways to put your data where you need it

HELP?? Data modeling Algolia

12 Feb 2021 🔖 integration web development jamstack

Now that I have Sanity webhooks sending newly-published data to Algolia, I have to configure it, which means doing the hard part: naming things.

As I’ve said before:

Algolia doesn’t care if one “object” you pass it has a personName and another has a carMakeAndModel and yet another has a brand. It’s just more work for you to do to figure out that you probably could’ve uploaded all of these pieces of data as an item’s title if you’re just doing, say, search against different page types on a web site.

Netlify has skin in the game working with Algolia to develop a great plugin, so let’s see what Algolia’s Netlify plugin does. Looks like the plugin uses the following 13 keys per Algolia object:

  • objectID: The full URL of the web page being indexed, starting at http but also possibly including a # anchor specification for an in-page section.
  • url: The full URL of the web page being indexed; in their example, it lacks the #0 anchor that appears in objectID.
  • urlDepth: The number of slashes in the URL from the end of the domain onward.
    • http://example.com/ is 1; http://example.com/about is also 1; but http://example.com/about/ is 2.
  • title: Whatever you set in og:title or head > title
    • example: Crawler | Web Crawler | Ecommerce Crawler
  • description: Whatever you set in meta[name=description] or meta[property="og:description"].
  • image: Whatever image URL you set in meta[property="og:image"]
  • lang: Whatever you set in html[lang], like en
  • content: A plaintext copy of all of the meaningful textual content of the web page, with HTML tags and such stripped.
  • category: Whatever you set in meta[property="article:section"] or meta[property="product:category"]
  • keywords: Whatever you set in meta[name="keywords"] or meta[property="article:tag"]
  • authors: Whatever you set in meta[property="article:author"] or your Article JSON-LD
  • datePublished: Whatever you set in meta[property="article:published_time"] or your Article JSON-LD
  • dateModified: Whatever you set in meta[property="article:modified_time"] or your Article JSON-LD
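
To make the objectID / url / urlDepth relationship concrete, here is a minimal Python sketch. It is only my reading of the three urlDepth examples above, not the plugin's actual code, and the function names url_without_anchor and url_depth are made up for illustration.

```python
from urllib.parse import urldefrag, urlsplit


def url_without_anchor(object_id: str) -> str:
    """Drop any '#...' anchor, as in the difference between objectID and url."""
    return urldefrag(object_id).url


def url_depth(url: str) -> int:
    """Count the slashes in the path part of the URL (assumed rule, inferred
    from the examples: '/' is 1, '/about' is 1, '/about/' is 2)."""
    return urlsplit(url).path.count("/")


if __name__ == "__main__":
    for candidate in (
        "http://example.com/",
        "http://example.com/about",
        "http://example.com/about/#0",
    ):
        page_url = url_without_anchor(candidate)
        print(page_url, url_depth(page_url))
```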

Plus 3 more if you set it to crawl “hierarchical subsections” of articles:

  • hierarchy: A JSON object like:
    • { lvl0: 'My H1 heading', lvl1: 'The first H2 heading', lvl2: 'The first H3 heading', ... }
  • hierarchicalCategories: A JSON object like:
    • { lvl0: 'My H1 heading', lvl1: 'My H1 heading > the first H2 heading', lvl2: 'My H1 heading > the first H2 heading > the first H3 heading', ... }
  • contentLength: A number indicating how many characters are in the content of the subsection.
    • Apparently some people like to surface matches w/ longer content higher in results.
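
The two hierarchy shapes are easier to compare side by side. The sketch below builds both from a list of headings (outermost first). It is an illustration of the shapes described above, not the crawler's implementation; build_hierarchy is a hypothetical helper, and it joins headings as-is without changing their capitalization.

```python
def build_hierarchy(headings: list[str]) -> tuple[dict, dict]:
    """Return (hierarchy, hierarchicalCategories) for one subsection.

    headings[0] is the H1, headings[1] the enclosing H2, and so on.
    """
    hierarchy = {}
    hierarchical_categories = {}
    for level, heading in enumerate(headings):
        key = f"lvl{level}"
        # hierarchy keeps just the heading at this level
        hierarchy[key] = heading
        # hierarchicalCategories keeps the full path down to this level
        hierarchical_categories[key] = " > ".join(headings[: level + 1])
    return hierarchy, hierarchical_categories


if __name__ == "__main__":
    h, hc = build_hierarchy(["My H1 heading", "The first H2 heading"])
    print(h)
    print(hc)
```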

For a future iteration of my own blog, this seems good enough. Almost all of my best content fits the Schema.org Article model. I’ll probably just build on Netlify & let this plugin do the Algolia indexing for me, rather than try to figure out how to get my CMS or my SSG to push data to Algolia.


However, for the music web site I’m working on, there’s a much richer structure of “data with URLs,” and much heavier dependence upon componentized, sales-focused landing pages than on thought-piece articles.

I’ll have:

  1. Traditional blog articles (and pages w/ 1 main inflexible content area, which is functionally equivalent – although will there be a way to add fancy things like forms that need to be excluded from the searchable content in either of these? Maybe.) Luckily, no “authors” to worry about – just 1 for this whole site.
  2. Sectioned “page builder” pages (Home, Live Music, Studio Production, Lessons, Bio & EPK, Hire, Contact). Do things from the sections need to pop?
  3. 1 URL per upcoming event, just to make Google happy about JSON-LD Event data. Probably want to expose the URLs of future events to search, but ignore the URLs of past events, as there are enough of them to cause clutter (maybe still let them match search & just surface the “past events” page).
  4. Exclude the “upcoming events” page from search if letting individual upcoming event pages match search. (Alternatively, treat upcoming events a lot like past events, and just surface the “upcoming events” listing.)
  5. Consider surfacing the “past events” page (which is just a massive UL of every past gig, ever) as the search result whenever a past event object matches a search
  6. If a miniature events list is included in a sectioned “page builder” page, don’t let its contents be part of the searchable contents of that page. Event search is already covered.
  7. Exclude any pages whose sole function is to list traditional blog articles.
  8. However, DO surface the main “blog listing” page at the top of the results for certain “obvious” searches like blog, news, etc. (Give it an abbreviated “content” for search-result-display?) Maybe implement by just making its title searchable or something, but not any of the visible part of what’s displayed on the page.
  9. If a miniature blog articles list is included in a sectioned “page builder” page that is the HOME page, don’t let its contents be part of the searchable contents of that page. Blog search is already covered.
  10. (On the other hand, do I want to steer attention to a sales-ey landing page if a blog extract has been surfaced within it? Maybe?)
  11. Reviews/Testimonials & their writers (don’t forget “band” & “individual musician” tags if tagged as part of the review) (like a FAQ, these should just be searchable as part of the pages they’re displayed on, but how should they look in search results?)
  12. FAQs embedded in blog articles & landing pages? Do those need to pop out in any sort of special way for a Google Q&A-like experience? Or would H-level sectioning as done in the Netlify plugin suffice?
  13. Non-decorative Photos & their alt-text (some of which might be more meaningful than others) (don’t forget “band” & “individual musician” tags) (is this even a thing, outside of a “photo gallery”? Will there still be a “photo gallery,” outside of the EPK? Probably not.)
  14. Videos & their metadata (don’t forget “band” & “individual musician” tags) – some of which will have been sprinkled into landing pages & blog articles decoratively – others of which might be part of a “video gallery”
  15. Music & its metadata (don’t forget “band” & “individual musician” tags). Possibly the most structured data of the whole site, besides events. How can I get the job done right but avoid overengineering this? Anyway, don’t display the main full-music-listing when matching an album; just display the album page. And likewise, skip things like the home page? But if an album is included in, say, a sales-ey landing page, probably do want to surface that landing page. But maybe not on a match for every last hidden piece of metadata about the album, the way you would for the album itself. Tough stuff.
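
One way to picture the event rules in items 3 to 5 is as a small function that decides which URL a matching event should surface. This is only a sketch of one of the options floated above (future events get their own URL, past events point at the listing). The PAST_EVENTS_URL path and the function name are placeholders, not anything that exists on the site yet.

```python
from datetime import date

# Placeholder path for the "past events" listing page (assumption).
PAST_EVENTS_URL = "/events/past/"


def search_result_url(event_url: str, event_date: date, today: date) -> str:
    """Pick the URL to show when an event record matches a search."""
    if event_date >= today:
        # Upcoming (or same-day) event: surface its own page.
        return event_url
    # Past event: still matchable, but surface the listing to avoid clutter.
    return PAST_EVENTS_URL


if __name__ == "__main__":
    print(search_result_url("/events/spring-gig/", date(2021, 5, 1), date(2021, 2, 12)))
    print(search_result_url("/events/old-gig/", date(2019, 5, 1), date(2021, 2, 12)))
```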

Also, will I want any filterability on the search results, like narrowing results down to things about a particular band, or narrowing results down to a certain content type?
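
If I do want that, each record would need attributes to filter on. A rough sketch, assuming records carry hypothetical contentType and bands attributes: Algolia's filters search parameter takes attribute:value clauses joined with AND / OR, and as far as I know the attributes also have to be declared for faceting in the index settings. The helper below only builds the string; it does not call Algolia, and it does not escape quotes inside values.

```python
from typing import Optional


def build_filters(content_type: Optional[str] = None, band: Optional[str] = None) -> str:
    """Build a filters string from optional narrowing choices.

    'contentType' and 'bands' are made-up attribute names for illustration.
    """
    clauses = []
    if content_type:
        clauses.append(f'contentType:"{content_type}"')
    if band:
        clauses.append(f'bands:"{band}"')
    # An empty string means "no narrowing".
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_filters(content_type="album", band="My Trio"))
```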


After talking to Vince Parulan, it seems I should keep things simple.

Host my site (or at least a shadow copy of it) on Netlify.

Let the Algolia Crawler that normally costs money but is included for free w/ the Netlify-Algolia plugin do the heavy lifting.

Let it strip HTML from content, even if it does annoying things like match / or /blog/ for a word found in the title of my most recent blog post.

It’s way better than building a crawler/indexer myself.

Then, maybe, for anything that can’t be done through the plugin with careful use of meta tags & page heads, look into some way of surgically editing Algolia index records to add extra key-value pairs. Be careful to download the existing record before doing the update, so as not to overwrite the keys Netlify put into place: I checked, and a simple “update” Algolia API call destroys the keys not in the “update” body.
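
The download-then-merge step could look something like this. It is a sketch in plain Python with no Algolia client calls: existing stands for the record already fetched from the index, extras for the key-value pairs to add, and merge_record is a hypothetical helper. Fetching the record and saving the merged result are left to whatever client ends up being used.

```python
def merge_record(existing: dict, extras: dict) -> dict:
    """Return a full record: every key already in the index, plus the extras.

    Sending this merged record back means the keys written by the crawler
    survive a whole-record replacement.
    """
    if "objectID" in extras and extras["objectID"] != existing.get("objectID"):
        raise ValueError("extras must not change objectID")
    merged = dict(existing)  # shallow copy, so the fetched record is untouched
    merged.update(extras)    # extras win on any key present in both
    return merged


if __name__ == "__main__":
    fetched = {
        "objectID": "https://example.com/album/",
        "title": "Album page",
        "content": "Text the crawler extracted",
    }
    print(merge_record(fetched, {"band": "My Trio"}))
```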
