UTF-8 support for Nix

figsoda · May 29, 2023, 12:05am

piegames · May 29, 2023, 11:53am

For a UTF-8 library, I’d expect it to be very precise about definitions and to use Unicode-related terminology. What is a “character”? What is a string’s “length”? Do you operate on code points or grapheme clusters (and if so, which ones, since there are multiple definitions), or something else? How does it deal with the various joiners and other control code points?

figsoda · May 29, 2023, 3:21pm

By characters I meant code points, and it treats joiners and other control code points as standalone code points.

Sorry about the confusion. I’ve updated the docs accordingly, so hopefully this is less ambiguous now. Please tell me if there are other things that could be improved :)
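
To make the code point answer concrete, here is a minimal sketch in Rust (chosen only because the thread mentions it; this is not the library's API, and `code_point_count` and `byte_count` are hypothetical names made up for illustration). It shows that a combining accent and each zero-width joiner count as their own code points, even though a reader would see a single "character" in both strings.

```rust
/// Number of Unicode code points (strictly, Unicode scalar values) in `s`.
/// Hypothetical helper for illustration only.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

/// Number of bytes in the UTF-8 encoding of `s`.
/// Hypothetical helper for illustration only.
fn byte_count(s: &str) -> usize {
    s.len()
}

fn main() {
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: rendered as one "é".
    let decomposed = "e\u{301}";
    // MAN, ZWJ, WOMAN, ZWJ, GIRL: usually rendered as one family emoji.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

    for s in [decomposed, family] {
        println!(
            "{:?}: {} code points, {} bytes",
            s,
            code_point_count(s),
            byte_count(s)
        );
    }
}
```

Under these definitions the decomposed "é" has 2 code points (3 bytes) and the family emoji has 5 code points (18 bytes). Counting them as one unit each would require grapheme cluster segmentation instead.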

iFreilicht · June 5, 2023, 10:19pm

Very cool! What was the use case that prompted you to implement this?

I see you implemented it in Rust, so you might want to consider using the unicode-segmentation crate to split a string into a list of graphemes. It might be of limited utility though; I only see this sort of thing used for visual reasons, like limiting line lengths properly.

figsoda · June 6, 2023, 1:51pm

I wanted to make a parser generator that supports UTF-8, so we can implement things like YAML and KDL with proper UTF-8 support, without needing a built-in.

I’m not sure using the crate directly will help, since I don’t think it is possible to convert Nix strings into integers, but I will use it as inspiration. If I do end up implementing graphemes, I will probably use the same data.