🧠✂️ SemanticSlicer — A smart text chunker for LLM-ready documents.
SemanticSlicer is a lightweight C# library that recursively splits text into meaningful chunks, preserving semantic boundaries (sentences, headings, HTML tags), which makes it ideal for embedding generation (OpenAI, Azure OpenAI, LangChain, etc.). It runs on macOS, Linux, and Windows, and can be used from the command line, as a daemon, as a background service, or as a REST API. You can also use the library directly by referencing the NuGet package in your code.
GitHub: https://github.com/drittich/SemanticSlicer
This library accepts text and breaks it into smaller chunks, which is typically useful when creating embeddings for LLMs.
The package name is drittich.SemanticSlicer. You can install it from NuGet via the command line:
dotnet add package drittich.SemanticSlicer
or from the Package Manager Console:
NuGet\Install-Package drittich.SemanticSlicer
Prebuilt binaries are published under GitHub Releases: https://github.com/drittich/SemanticSlicer/releases. Choose the asset that matches your platform.
Build the command-line tool:
dotnet publish SemanticSlicer.Cli/SemanticSlicer.Cli.csproj -c Release -o ./cli
Slice a file and output JSON chunk data:
dotnet ./cli/SemanticSlicer.Cli.dll --overlap 30 MyDocument.txt
You can also pipe text in (omit the overlap flag to use the default 0%):
cat MyDocument.txt | dotnet ./cli/SemanticSlicer.Cli.dll --overlap 20
Use the --overlap flag (0-100) to carry forward that percentage of the previous chunk’s tokens, respecting your configured max chunk size.
Keep a slicer in memory and read lines from stdin (or a named pipe):
dotnet ./cli/SemanticSlicer.Cli.dll daemon --overlap 25
Optionally listen on a named pipe:
dotnet ./cli/SemanticSlicer.Cli.dll daemon --pipe slicerpipe --overlap 25
The repository includes a small Web API (SemanticSlicer.Service) that can be installed as a background service so the slicer stays in memory.
First publish the service:
dotnet publish SemanticSlicer.Service/SemanticSlicer.Service.csproj -c Release -o ./publish
Copy the ./publish folder to /opt/semanticslicer (or a location of your choice), then create /etc/systemd/system/semanticslicer.service with:
[Unit]
Description=Semantic Slicer Service
After=network.target
[Service]
Type=simple
WorkingDirectory=/opt/semanticslicer
ExecStart=/usr/bin/dotnet /opt/semanticslicer/SemanticSlicer.Service.dll
Restart=always
[Install]
WantedBy=multi-user.target
Then enable and start the service:
sudo systemctl enable semanticslicer
sudo systemctl start semanticslicer
On Windows, publish the service to a folder such as C:\SemanticSlicer:
dotnet publish SemanticSlicer.Service/SemanticSlicer.Service.csproj -c Release -o C:\SemanticSlicer
Then register and start it as a Windows service:
sc create SemanticSlicer binPath= "\"%ProgramFiles%\dotnet\dotnet.exe\" \"C:\SemanticSlicer\SemanticSlicer.Service.dll\""
sc start SemanticSlicer
Once running you can POST text to the service:
curl -X POST http://localhost:5000/slice -H "Content-Type: application/json" \
-d '{"content":"Hello world","overlapPercentage":30}'
overlapPercentage is optional (defaults to 0) and clamped between 0 and 100. Header tokens also count toward the overlap budget.
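For example, calling the service from C# (a minimal sketch; only the content and overlapPercentage request fields shown above are assumed, and the response is printed as raw JSON since its exact schema is defined by the service):

using System;
using System.Net.Http;
using System.Text;

// POST a document to the running service and print the raw JSON response.
// Assumes the service is listening on http://localhost:5000 as configured above.
using var client = new HttpClient();
var payload = "{\"content\":\"Hello world\",\"overlapPercentage\":30}";
var response = await client.PostAsync(
    "http://localhost:5000/slice",
    new StringContent(payload, Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode();
Console.WriteLine(await response.Content.ReadAsStringAsync());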
Simple text document:
// The default options use text separators, a max chunk size of 1,000 tokens,
// and cl100k_base encoding to count tokens.
var slicer = new Slicer();
var text = File.ReadAllText("MyDocument.txt");
var documentChunks = slicer.GetDocumentChunks(text);
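You can then iterate the results in document order (a sketch that assumes each chunk exposes its text via a Content property; the Index property is described further below):

// Print each chunk; Content is assumed here to hold the chunk text.
foreach (var chunk in documentChunks)
{
    Console.WriteLine($"Chunk {chunk.Index}: {chunk.Content}");
}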
Markdown document:
// Let's use Markdown separators and reduce the chunk size
var options = new SlicerOptions { MaxChunkTokenCount = 600, Separators = Separators.Markdown };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.md");
var documentChunks = slicer.GetDocumentChunks(text);
Overlapping chunks:
// Reuse the last 30% of the previous chunk (by tokens), while still respecting the max size
var options = new SlicerOptions { MaxChunkTokenCount = 800, OverlapPercentage = 30 };
var slicer = new Slicer(options);
var documentChunks = slicer.GetDocumentChunks(text);
HTML document:
var options = new SlicerOptions { Separators = Separators.Html };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.html");
var documentChunks = slicer.GetDocumentChunks(text);
Removing HTML tags:
For any content you can choose to remove HTML tags from the chunks to minimize the number of tokens. The inner text is preserved, and if there is a <title> tag, its text will be prepended to the result:
// Let's remove the HTML tags as they just consume a lot of tokens without adding much value
var options = new SlicerOptions { Separators = Separators.Html, StripHtml = true };
var slicer = new Slicer(options);
var text = File.ReadAllText("MyDocument.html");
var documentChunks = slicer.GetDocumentChunks(text);
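As an illustration of that behavior, using the slicer configured above (hypothetical input; exact whitespace in the output may differ):

var html = "<html><title>My Doc</title><body><p>Hello world</p></body></html>";
var chunks = slicer.GetDocumentChunks(html);
// With StripHtml = true the tags are removed and the title is prepended,
// so the chunk content is roughly "My Doc" followed by "Hello world".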
Custom separators:
You can pass in your own list of separators if you wish, e.g., to add support for other document types, as sketched below.
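For instance, you might start from one of the built-in lists and extend it. This is a sketch only, assuming the built-in separator sets (e.g., Separators.Text, Separators.Markdown) are collections of a common separator type; check the Separators source for the exact shape:

using System.Linq;

// Hypothetical sketch: combine the built-in Markdown and text separators into one list.
var customSeparators = Separators.Markdown.Concat(Separators.Text).ToArray();
var options = new SlicerOptions { Separators = customSeparators };
var slicer = new Slicer(options);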
Chunks will be returned in the order they were found in the document, and contain an Index property you can use to put them back in order if necessary. Each chunk also includes StartOffset and EndOffset character positions relative to the normalized input text so you can align slices back to the source if needed.
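For example, you can map a chunk back onto the text you passed in (a sketch assuming EndOffset is exclusive; because offsets are relative to the normalized input, the round trip may differ from the raw file if normalization changed the text):

foreach (var chunk in documentChunks)
{
    // Recover the span of the (normalized) source text this chunk came from.
    var sourceSpan = text.Substring(chunk.StartOffset, chunk.EndOffset - chunk.StartOffset);
    Console.WriteLine(sourceSpan);
}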
You can pass in any additional metadata you wish as a dictionary, and it will be returned with each document chunk, making it easy to persist. You might use the metadata to store the document ID, title, or last modified date.
var slicer = new Slicer();
var text = File.ReadAllText("MyDocument.txt");
var metadata = new Dictionary<string, object?>();
metadata["Id"] = 123;
metadata["FileName"] = "MyDocument.txt";
var documentChunks = slicer.GetDocumentChunks(text, metadata);
// All chunks returned will have a Metadata property with the data you passed in.
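For example, you can read the metadata back from each chunk:

foreach (var chunk in documentChunks)
{
    Console.WriteLine($"Chunk {chunk.Index} came from {chunk.Metadata?["FileName"]}");
}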
If you wish, you can pass a header to be included at the top of each chunk, for example to include the document title or tags in the chunk content to help maintain context.
var slicer = new Slicer();
var fileName = "MyDocument.txt";
var text = File.ReadAllText(fileName);
var header = $"FileName: {fileName}";
var documentChunks = slicer.GetDocumentChunks(text, null, header);
Note: Headers count against MaxChunkTokenCount and reduce the available overlap when OverlapPercentage is non-zero.
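As a rough illustration of the budget (the exact accounting is handled by the library):

// With MaxChunkTokenCount = 1000 and a header costing ~50 tokens,
// roughly 950 tokens remain for content in each chunk. With
// OverlapPercentage = 30, about 30% of the previous chunk's tokens are
// carried forward, and header + overlap + new content must all fit in 1,000.
var options = new SlicerOptions { MaxChunkTokenCount = 1000, OverlapPercentage = 30 };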
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions or feedback, please open an issue on this repository.