Amid cries for more and better documentation, we are lost trying to find information across a myriad of disparate resources. Information is outdated, information is inaccessible to search engines, we don’t know how to find it, we forget to look for it, and so forth.
I would like to ask for thoughts and input on putting together a nix-focused search engine. (Hey maybe we can finally use that fancy semantic data in the xml documentation! [citation needed])
A stretch feature I would like is to mark outdated information, and cross-reference it to new correct information.
(Do we have any librarians?)
As usual, I don’t have time in the coming weeks, but maybe over time we can pull something together…baby steps as always?
Initial ideas for data sources that would need to be ingested:
nix, nixos, nixpkgs, nixops manuals
nix pills
the old and new wiki, and …all the other variations
I think I saw some manual pages that weren’t part of the main manual pages
somehow index nixcon talks, at least transcripts
irc logs
old and new and idk mailing lists
discourse
all the codebases: nix, nixpkgs, etc
all the github issues and stuff
third party blog posts
all commit messages
as a bonus, maybe the arch and gentoo wikis and such?
Hark, hark! Now seeking writers of search engines, all manner of scrapers, and so forth.
We’d probably want to explicitly mark information from these sorts of sources as “not NixOS-specific”, to warn people that whatever information is given there is strictly advisory and may not always apply to NixOS.
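To make that concrete, the ingestion pipeline could carry a per-source flag alongside each data source it crawls, and the search UI could then attach a warning to results from non-NixOS-specific sources. A minimal sketch in Python - the source registry and field names here are hypothetical, just to illustrate the idea:

```python
# Hypothetical source registry for the crawler: each entry says where to
# fetch from and whether the material is NixOS-specific, so the search UI
# can label advisory results accordingly.
SOURCES = [
    {"name": "nixpkgs-manual", "url": "https://nixos.org/manual/nixpkgs/stable/", "nixos_specific": True},
    {"name": "nix-pills",      "url": "https://nixos.org/guides/nix-pills/",      "nixos_specific": True},
    {"name": "arch-wiki",      "url": "https://wiki.archlinux.org/",              "nixos_specific": False},
]

def label_result(result: dict, source: dict) -> dict:
    """Attach an advisory note to results from non-NixOS-specific sources."""
    if not source["nixos_specific"]:
        result["note"] = "Not NixOS-specific; advisory only, may not apply to NixOS."
    return result
```

Keeping the flag on the source (rather than on each document) would make the marking cheap and consistent across everything ingested from that source.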
I would be more than happy to polish those scripts up a bit more for a contribution to the repo.
I’ve continued tweaking them here and there over the months. I’ve gotten them to open source files, open a shell, or install into a user’s environment. I also added home-manager integration.
Admittedly, in their current state they’re aimed at personal use and I expected folks to tweak them to their needs; but with a bit of work they could be repackaged for anybody to pick up and use out of the box.
It has happened to me twice that a PR was held back because it broke something else, and somebody just pushed the fix in a separate commit straight to master because they didn’t know that a) a PR already existed and b) the fix broke other things.
Because of this I was thinking about some kind of web service which collects information like GitHub issues, states, docs, etc. and shows information about package states, open issues for the package, and so on. Maybe a service that crawls issues, manuals and XML docs could also connect and pretty-print such information in a way that’s useful to contributors.
I would love to create such a crawler service but I’m mainly a system/network engineer and my coding knowledge is very limited…
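For illustration, one small building block of such a crawler could simply ask GitHub’s public search API for open nixpkgs issues that mention a package. A rough sketch, assuming the standard search endpoint (the script itself, its package argument and output format are made up):

```python
# Sketch: list open NixOS/nixpkgs issues mentioning a package name,
# via GitHub's public issue search API.
import requests

def open_issues_for(package: str) -> list[dict]:
    query = f"repo:NixOS/nixpkgs is:issue is:open {package} in:title"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 20},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    return [{"title": i["title"], "url": i["html_url"]} for i in resp.json()["items"]]

if __name__ == "__main__":
    for issue in open_issues_for("firefox"):
        print(issue["url"], "-", issue["title"])
```

A service like the one described above would mostly be glue around queries like this, plus caching and a front end that pretty-prints the results.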
But then I grokked it: why should this be specific to Nix? Put any topic in place of Nix… and it turns out that Google is the best approach for searching any topic, including Nix.
Just make sure everything is indexed by Google. Or do you want to build your own mini-Google just for Nix?
“Catalog”-style knowledge isn’t easy to maintain; you can make it usable (Nix docs, nixos.wiki), but never ideal.
A question back to you: which queries should this tool answer best?
> But then I grokked it: why should this be specific to Nix? Put any topic in place of Nix… and it turns out that Google is the best approach for searching any topic, including Nix.
Well, for Nix we can reasonably estimate the reliability of sources; Google is usually worse than topic-specific ordering of sources.
This example is kind of ok because there are few results, but for searches that return a large mass of results, the output is mostly spammed full of IRC logs. Searx does allow toggling engines, at least.
Basically, this needs a lot of work.
Writing Searx backends is very simple for JSON ones, though afaict there are some general (reasonable) limitations, such as not being able to return HTML in a result, and I don’t see a way to aggregate multiple queries - but maybe that shouldn’t be Searx’s problem.
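For reference, a Searx engine is just a small Python module exposing a `request` and a `response` function; roughly like the sketch below for a hypothetical JSON search API (the `search.example.org` endpoint and its response fields are made up):

```python
# Sketch of a Searx engine for a hypothetical JSON API at search.example.org.
# Searx calls request() to build the outgoing query and response() to turn
# the HTTP reply into result dicts.
from json import loads
from urllib.parse import urlencode

categories = ['general']
paging = False

search_url = 'https://search.example.org/api?{query}'

def request(query, params):
    params['url'] = search_url.format(query=urlencode({'q': query}))
    return params

def response(resp):
    results = []
    for hit in loads(resp.text).get('hits', []):
        results.append({
            'url': hit['url'],
            'title': hit['title'],
            'content': hit.get('snippet', ''),  # plain text only; HTML is not allowed here
        })
    return results
```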
What about pushing the index data into something like Algolia? Although I’m not sure whether their free community tier would cover the volume and demand: judging from their pricing matrix, you’re limited to indexing 10k items and 50k searches a month.
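For what it’s worth, pushing crawled records into Algolia is only a few lines with their Python client; a sketch assuming the v2-style `algoliasearch` API, with placeholder credentials, index name and records:

```python
# Sketch: upload crawled documents to an Algolia index using the
# algoliasearch Python client (v2-style API). Credentials, index name
# and the record shape are placeholders.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("nix-docs")

records = [
    {"title": "Nixpkgs manual", "url": "https://nixos.org/manual/nixpkgs/stable/", "source": "nixpkgs-manual"},
    {"title": "Nix Pills", "url": "https://nixos.org/guides/nix-pills/", "source": "nix-pills"},
]

# Let Algolia generate objectIDs; for idempotent re-crawls you would rather
# derive a stable objectID from each document's URL.
index.save_objects(records, {"autoGenerateObjectIDIfNotExist": True})
```

Whether the 10k-record / 50k-search limits mentioned above are enough would depend on how coarsely the sources are chunked into records.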