Uni++

A convenient, iterator-based, header-only C++17 library for converting between different Unicode Transformation Formats (UTFs).

Currently, it is possible to seamlessly convert between the following encodings:

UTF-8
UTF-16
UTF-32
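
To make the difference between the three encodings concrete, here is a small self-contained sketch that counts how many code units one code point occupies in each of them. The function names are hypothetical and not part of Uni++; the thresholds follow the standard UTF-8 and UTF-16 definitions and assume a valid Unicode scalar value.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not part of Uni++.
// Each returns the number of code units needed for one code point.
constexpr std::size_t utf8_units(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

constexpr std::size_t utf16_units(char32_t cp)
{
    // Code points above the Basic Multilingual Plane need a surrogate pair.
    return cp < 0x10000 ? 1 : 2;
}

constexpr std::size_t utf32_units(char32_t)
{
    return 1; // UTF-32 is fixed-width.
}

static_assert(utf8_units(U'A') == 1 && utf16_units(U'A') == 1);
static_assert(utf8_units(U'\u0416') == 2 && utf16_units(U'\u0416') == 1);         // Cyrillic letter
static_assert(utf8_units(U'\u4E16') == 3 && utf16_units(U'\u4E16') == 1);         // CJK ideograph
static_assert(utf8_units(U'\U0001F44B') == 4 && utf16_units(U'\U0001F44B') == 2); // emoji
```

This is why the same text can have a different length in each encoding, and why a conversion has to decode and re-encode code points rather than copy code units one to one.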

Prerequisites

A C++17-conforming compiler (tested with MSVC and MinGW)
CMake (optional)

How to integrate

You can use Uni++ simply by copying the contents of the unipp/include directory into your third-party folder and adding that folder to the compiler's include directories.

However, Uni++ also integrates well with CMake by providing an interface target. Here is the basic setup script:

add_subdirectory("unipp")

add_executable(your_target)

target_link_libraries(
    your_target
    # Usually, PUBLIC is more suitable for header-only libraries.
    PUBLIC/PRIVATE/INTERFACE
    unipp::unipp # Provides include dirs and a precompiled header file.
)

Using CPM is also an option:

CPMAddPackage("gh:leoovs/unipp#dev")

How to use

The library provides a simple interface for working with UTF-encoded character sequences. It allocates very little and allows custom allocators to be specified. Exceptions are not used. The library does not introduce its own types for storing code units: it relies only on standard types such as char or char16_t and their standard-defined semantics, so you don't have to type 'utf8' or 'utf16' everywhere (my personal preference). The only exception is the built-in char type; see the Remarks section below.
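
As a rough illustration of how standard character types alone can identify an encoding, the sketch below maps an iterator's code unit type to an encoding tag. All names in it (encoding, encoding_of, code_unit_of, encoding_for) are hypothetical and not taken from Uni++; the library's actual dispatch may work differently.

```cpp
#include <iterator>
#include <string>

// Hypothetical names for illustration; none of these are part of Uni++.
enum class encoding { utf8, utf16, utf32 };

// Map a code unit type to the encoding it is assumed to carry.
// Unsupported types have no specialization and fail to compile.
template <typename CodeUnit> struct encoding_of;
template <> struct encoding_of<char>     { static constexpr encoding value = encoding::utf8; };
template <> struct encoding_of<char16_t> { static constexpr encoding value = encoding::utf16; };
template <> struct encoding_of<char32_t> { static constexpr encoding value = encoding::utf32; };

// Code unit type of an iterator: its value_type by default.
template <typename Iterator>
struct code_unit_of {
    using type = typename std::iterator_traits<Iterator>::value_type;
};

// std::back_insert_iterator reports value_type = void,
// so the container's value_type is used instead.
template <typename Container>
struct code_unit_of<std::back_insert_iterator<Container>> {
    using type = typename Container::value_type;
};

template <typename Iterator>
constexpr encoding encoding_for =
    encoding_of<typename code_unit_of<Iterator>::type>::value;

static_assert(encoding_for<std::u16string::iterator> == encoding::utf16);
static_assert(encoding_for<std::back_insert_iterator<std::string>> == encoding::utf8);
static_assert(encoding_for<const char32_t*> == encoding::utf32);
```

With a trait like this, a conversion function can select the decoding and encoding steps at compile time from its iterator arguments, without any encoding-specific wrapper types.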

Here's a basic use-case of Uni++:

#include <iostream>
#include <iterator>
#include <string>

#include <unipp/convert.hpp>

int main()
{
	std::u16string in = u"Hello, world! Привет, мир! 哈囉世界! 👋🌎";
	std::string out; // `char`s are assumed to hold UTF-8 code units.

	// Based on the iterators' `value_type`s, the function picks an appropriate
	// encoding algorithm.
	unipp::convert(in.begin(), in.end(), std::back_inserter(out));

	std::cout << out << " (" << out.length() << " bytes)\n";
}
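
For readers curious about what such a conversion involves, here is a hand-written UTF-16 to UTF-8 loop. It is only a sketch and not Uni++'s implementation: the function name utf16_to_utf8_sketch is hypothetical, and replacing unpaired surrogates with U+FFFD is an assumption of this example, not documented library behavior.

```cpp
#include <cstddef>
#include <string>

// Illustrative only; not part of Uni++.
// Decodes UTF-16 code units into code points, then encodes them as UTF-8.
inline std::string utf16_to_utf8_sketch(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD; // unpaired high surrogate
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired low surrogate
        }

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For instance, utf16_to_utf8_sketch(u"\U0001F44B") consumes one surrogate pair and yields the four bytes F0 9F 91 8B.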

More use cases are available in the example folder, including how to support Win32 wide characters.

Remarks

The char type is assumed to be used as a code unit for UTF-8. This is because C++17 lacks C++20's char8_t, and, following Occam's razor, I don't want to introduce a substitute for it as a separate library type. In practice, this should not cause any problems, since C++17 allows a char[] to be initialized from a UTF-8 string literal. Note that u8 string literals may behave strangely in MSVC, so check out the cmake/CharSet.cmake module, which can enforce the use of UTF-8 as the source character set.
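
One way to detect a source character set problem at compile time is to compare a literal typed directly in the source with the same character written as a universal character name. This is a sketch under the assumption that the file is saved as UTF-8; the helper name source_read_as_utf8 is hypothetical and not part of Uni++.

```cpp
#include <cstddef>

// Size of the literal as typed in the source file (includes the terminating NUL).
// Its bytes depend on how the compiler decodes this file.
constexpr std::size_t typed_size = sizeof(u8"é");

// The same character as a universal character name;
// unaffected by the source character set.
constexpr std::size_t escaped_size = sizeof(u8"\u00E9");

// Hypothetical helper for illustration; not part of Uni++.
constexpr bool source_read_as_utf8()
{
    return typed_size == escaped_size;
}

static_assert(source_read_as_utf8(),
              "the compiler does not seem to read this file as UTF-8");
```

If the compiler decodes the file with a legacy code page, the two bytes of the typed character are typically treated as two separate characters, the sizes differ, and the build fails early instead of producing mangled strings at run time.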

Sources of information on UTF-8, UTF-16, and UTF-32

cppreference.com articles on built-in types
Wikipedia articles on UTF-8 and UTF-16 encodings
