Strings in C and C++

String encoding
Strings in Windows
Literals and string data types
C strings
std::string
- The -Wwrite-strings warning
- String comparison
String / number conversions
Splitting a string by a delimiter
Character classification: tolower, toupper, isalnum
std::string_view
Case-insensitive find
Small-string optimization (SSO)

String encoding

ASCII

ASCII uses 7 bits to represent a character, giving 2^7 = 128 distinct codes. The 8th bit was historically used as a parity bit. Most ASCII codes are printable (abc, ABC, 123, ?&!); the rest are control characters (carriage return, line feed, tab). ASCII was designed for English only.

ANSI

ANSI extends ASCII to 8 bits (256 codes) and uses code pages to assign different meanings to the upper 128 codes for different languages — Japanese, Chinese, Western European, etc. The application reading the file has to know which code page is in use to interpret the bytes correctly.

There are many 8-bit "extended ASCII" variants. ISO 8859-1 (Latin-1) is one common one — original ASCII (0–127) plus Western European letters in 160–255, the default for most early browsers. Windows uses a closely related encoding called Windows-1252, often loosely labeled "ANSI" — it differs from ISO 8859-1 by using 128–159 for printable characters instead of controls.

Refs: excelanytime.com

Extended ASCII (8-bit ASCII)

Once people started using the 8th bit (originally parity) for additional characters, the table doubled to 256 entries (2^8 = 256). This was enough for languages based on the Latin alphabet (e.g. French é) but not for Greek, Cyrillic, or CJK scripts.

Unicode

Unicode assigns a unique code point (an abstract integer) to every character in essentially every script — currently over 140k assigned. A code point is not a byte sequence; you still need an encoding to map code points to bytes. See the unicode table for the full list.

A character encoding maps each character (or code point) to a sequence of bits or bytes — Morse code, ASCII, UTF-8, UTF-16, etc. are all encodings. See Wikipedia: character encoding.

Encodings: UTF-8 vs UTF-16 vs UTF-32

Encoding	Code unit	Per code point	Notes
UTF-8	8 bits	1–4 bytes	ASCII-compatible, dominant on the web
UTF-16	16 bits	2 or 4 bytes (surrogates)	Used by Windows APIs, Java, JavaScript
UTF-32	32 bits	always 4 bytes	Fixed-width, simple but space-hungry

UTF-8 and UTF-16 are variable-length; UTF-32 is fixed-length. Refs: Comparison of Unicode encodings.

Strings in Windows

8-bit AnsiStrings

char — 8-bit character, the C/C++ data type
CHAR — Windows alias for char
LPSTR — null-terminated string of CHAR (Long Pointer to STRing)
LPCSTR — null-terminated const string of CHAR

char is supposed to hold a one-byte character; wchar_t holds a "wide character," but its size is implementation-defined: 2 bytes on Windows, 4 bytes on Linux. Neither char nor wchar_t is directly tied to Unicode by the language — Windows just uses UTF-16 in its 16-bit APIs.

16-bit UnicodeStrings

wchar_t — 16-bit character on Windows
WCHAR — alias for wchar_t
LPWSTR — null-terminated string of WCHAR
LPCWSTR — null-terminated const string of WCHAR

TCHAR (depends on UNICODE)

TCHAR — WCHAR if UNICODE is defined, else CHAR
LPTSTR — null-terminated string of TCHAR
LPCTSTR — null-terminated const string of TCHAR

Item	8-bit	16-bit	Varies
character	CHAR	WCHAR	TCHAR
string	LPSTR	LPWSTR	LPTSTR
string (const)	LPCSTR	LPCWSTR	LPCTSTR

LPCTSTR = Long Pointer to a Const TCHAR STRing. The "long" prefix is a 16-bit Windows leftover; today it's just a pointer.

LPSTR    = char*
LPCSTR   = const char*
LPWSTR   = wchar_t*
LPCWSTR  = const wchar_t*
LPTSTR   = char* or wchar_t*           // depends on _UNICODE
LPCTSTR  = const char* or const wchar_t*

The LPCTSTR typedef from WinNT.h:

#ifdef UNICODE
typedef LPCWSTR LPCTSTR;
#else
typedef LPCSTR  LPCTSTR;
#endif

std::string is std::basic_string<char>; std::wstring is std::basic_string<wchar_t>. There are also fixed-width Unicode variants:

Type	Underlying
`std::string`	`std::basic_string<char>`
`std::wstring`	`std::basic_string<wchar_t>`
`std::u8string` (C++20)	`std::basic_string<char8_t>`
`std::u16string` (C++11)	`std::basic_string<char16_t>`
`std::u32string` (C++11)	`std::basic_string<char32_t>`

Refs: SO: T and L meaning in C++.

Checklist for Windows string programming

_T() / TEXT() are macros that prepend L to a string literal if _UNICODE / UNICODE is defined. They date from when you had to support Windows 9x (no Unicode) and Windows NT (Unicode) from one codebase. All modern Windows is Unicode-based, so new code rarely needs them.
- UNICODE controls Win32 headers: defining it makes GetWindowText resolve to GetWindowTextW and TEXT(...) produce L"...".
- _UNICODE controls the C runtime headers: defining it makes _tcslen resolve to wcslen and _TEXT(...) produce L"...".
Use _TCHAR / _T() with C-runtime functions; use TCHAR / TEXT() with the Win32 API. CString (MFC) is TCHAR-based, so use TEXT().
Prefer LPTSTR and LPCTSTR over char* and const char* in Windows code.
- LPCSTR — pointer to a const char string
- LPCTSTR — pointer to a const TCHAR string (wide or narrow depending on UNICODE)
- LPTSTR — pointer to a non-const TCHAR string
For C++ strings, prefer std::wstring over std::string when interoperating with the Win32 W APIs.

Literals and string data types

A string literal is the source-code form of a string value: in x = "foo", "foo" is the literal. In C the type is char[N]; in C++ it's const char[N] (the array decays to const char*).

auto c0 = 'A';   // char
auto c1 = u8'A'; // char (C++17), char8_t (C++20+)
auto c2 = L'A';  // wchar_t
auto c3 = u'A';  // char16_t
auto c4 = U'A';  // char32_t

// Multicharacter literal — int, value is implementation-defined.
auto m0 = 'abcd';

// String literals
auto s0 = "hello";   // const char[6]      → decays to const char*
auto s1 = u8"hello"; // const char[6] (C++17), const char8_t[6] (C++20+); UTF-8
auto s2 = L"hello";  // const wchar_t[6]
auto s3 = u"hello";  // const char16_t[6]; UTF-16
auto s4 = U"hello";  // const char32_t[6]; UTF-32

const char* multiline = R"(line1
line2
line3)";

C strings

A C string is a one-dimensional array of characters terminated by '\0'. The trailing null is what lets printf, std::cout, strlen, etc. find the end.

Stack-allocated, fixed length 10 with explicit '\0':

char name[10] = { 'b','e','h','n','a','m','\0' };
// Compiler zero-fills the rest:
// char name[10] = {'b','e','h','n','a','m','\0','\0','\0','\0'};

Stack-allocated, size deduced:

char name[] = { 'b','e','h','n','a','m','\0' };
// equivalent to:
// char name[7] = {'b','e','h','n','a','m','\0'};

Same thing using a string literal — the '\0' is implicit:

char name[] = "behnam";
// equivalent to:
// char name[7] = "behnam";

The char*-to-literal form puts the data in read-only memory (typically the .rodata section, not the code/text section). This form is discouraged in modern code:

char* name = "behnam"; // deprecated in C++; warns or errors with -Wwrite-strings

Better:

const char* name = "behnam";

This compiles but segfaults at runtime — name points into read-only memory:

name[0] = 'C'; // crash

This is fine — you're changing what name points at, not the literal:

name = "Margarethe";

Printing works because the literal is null-terminated:

std::cout << "name: " << name << '\n';

Without a terminator, printing reads past the end until it hits a stray zero — undefined behavior:

char name[6] = { 'b','e','h','n','a','m' }; // no '\0' — DON'T print this

Aside: `const` placement on pointers

Reading right-to-left and ignoring the type makes this easy:

const int* p = &x;   // *p is const  → pointed-to int can't be modified
int const* p = &x;   // same thing
int* const p = &x;   // p is const   → pointer can't be reseated
const int* const p;  // both

char* vs char array

char  a1[] = "Behnam";
char* p1   = "Behnam"; // const char* in modern C++

Aspect	`a1` (array)	`p1` (pointer)
Storage	stack	points into `.rodata`
`a1++` / `p1++`	invalid	valid
`sizeof`	7 (6 chars + `'\0'`)	size of a pointer (8 on 64-bit)
`&a1` vs `a1`	same address	different (address of pointer vs value)
Assignment	`a1[1]='n'` OK	`p1[1]='n'` segfaults

Refs: cs50, r/C_Programming, SO: array name as pointer.

Size of a string

sizeof(s) on a std::string gives the size of the string object (typically 24–32 bytes), not the number of characters. Don't use it for string length.
strlen(p) works on null-terminated C strings — counts bytes up to (but not including) '\0'.
s.size() (or s.length()) gives the character count of a std::string.

char my_str[100] = "my string";
std::cout << "length: " << strlen(my_str)
          << ", contents: " << my_str << '\n';

std::string s = "my string";
std::cout << "length: " << s.size() << '\n';

strdup

strdup (from C, declared in <cstring>) allocates new memory with malloc and copies a null-terminated string into it. You must free the result.

#include <cstring>
#include <cstdlib>

const char* original = "Hello, World!";
char* copy = std::strdup(original);
// ... use copy ...
std::free(copy);

In C++, prefer std::string — copies and frees automatically:

std::string original = "Hello, World!";
std::string copy = original;

strstr — substring search

strstr(haystack, needle) returns a pointer to the first occurrence of needle in haystack, or nullptr if not found.

#include <cstring>
#include <cstdio>

char haystack[] = "Hello, World!";
const char* needle = "World";

if (char* hit = std::strstr(haystack, needle)) {
    std::printf("found at offset %ld: %s\n", hit - haystack, hit);
} else {
    std::printf("not found\n");
}

The C++ equivalent is std::string::find, which returns std::string::npos on miss.

std::string

std::string allocates its buffer on the heap (with the small-string optimization handling short strings on the stack — see SSO) and copies the literal into it at construction.

std::string s1 = "initializer syntax";
std::string s2("converting constructor");
std::string s3 = std::string("explicit constructor");
std::string s4{"uniform initialization"};

The -Wwrite-strings warning

char* p = "John"; // warning: ISO C++ forbids converting a string constant to 'char*'

String literals are char[] in C but const char[] in C++. Assigning to a non-const pointer is allowed for backward-compatibility but warns under -Wwrite-strings. The literal still lives in read-only memory, so:

p[0] = 'C'; // compiles, segfaults at runtime

The fix is to make the pointer const:

const char* p = "John";
p[0] = 'C'; // now a compile error — caught at compile time, as it should be

A C-style cast silences the warning but doesn't fix the bug — the memory is still read-only:

char* p = (char*)"John"; // compiles, still segfaults on p[0] = 'C'

The right fix in C++ is to use std::string:

std::string p = "John"; // mutable, owns its memory
p[0] = 'C';             // fine

String comparison

std::string::compare returns an int:

0 if equal
< 0 if *this is lexicographically less
> 0 if *this is greater

if (str1.compare(str2) == 0) { /* equal */ }

For plain equality, just use == — it's clearer and equivalent:

if (str1 == str2) { /* equal */ }

compare is useful when you need the three-way result (sort, binary search, custom orderings). For just-greater/just-less checks, <, <=, >, >= operators work too.

Refs: SO: == vs compare.

String / number conversions

String → number

std::string s = "10.3";

int    i = std::stoi(s);  // 10
long   l = std::stol(s);  // 10
float  f = std::stof(s);  // 10.3f
double d = std::stod(s);  // 10.3

Note: std::stoi, std::stof, std::stod take a std::string (or std::wstring), not a const char*. For C strings, use the C-style std::atoi, std::atof (return int and double respectively), or pass s.c_str() only if you wrap it back into a std::string.

std::stoi throws std::invalid_argument if no conversion is possible and std::out_of_range if the value doesn't fit. C-style atoi returns 0 silently on failure.

Number → string

std::string a = std::to_string(42);     // "42"
std::string b = std::to_string(10.3);   // "10.300000" (C locale, fixed precision)

For controlled formatting use std::format (C++20) or a std::ostringstream:

#include <format>   // C++20
auto s = std::format("{:.2f}", 10.3);   // "10.30"

std::ostringstream oss;
oss << std::fixed << std::setprecision(2) << 10.3;
std::string s = oss.str(); // "10.30"

char → string

char c = 'A';

std::string s1(1, c);     // construct: 1 copy of c
std::string s2;
s2.push_back(c);          // append
std::string s3 = std::string{c};

std::stringstream ss;
ss << c;
std::string s4 = ss.str();

string → vector<char>

std::string s = "hello";
std::vector<char> v(s.begin(), s.end());

Splitting a string by a delimiter

Manual loop using find / substr:

std::string s = "scott>=tiger>=mushroom";
const std::string delim = ">=";

std::vector<std::string> tokens;
size_t pos = 0;
while ((pos = s.find(delim)) != std::string::npos) {
    tokens.push_back(s.substr(0, pos));
    s.erase(0, pos + delim.size());
}
tokens.push_back(s); // final piece after the last delimiter

C++20 alternative using ranges:

#include <ranges>
#include <string_view>

std::string s = "scott>=tiger>=mushroom";
for (auto part : std::views::split(s, std::string_view{">="})) {
    std::string_view sv{part.begin(), part.end()};
    // use sv
}

Character classification: tolower, toupper, isalnum

<cctype> functions take an int whose value must be representable as unsigned char (or EOF). Passing a plain char directly is undefined behavior for non-ASCII bytes on platforms where char is signed — always cast to unsigned char first.

unsigned char c = 'A';
char lower = static_cast<char>(std::tolower(c));
std::string s(1, lower);

To lowercase an entire string, transform each character:

#include <algorithm>
#include <cctype>

std::string in = "Hello, World!";
std::string out;
out.reserve(in.size());
std::transform(in.begin(), in.end(), std::back_inserter(out),
               [](unsigned char ch){ return std::tolower(ch); });
// out == "hello, world!"

std::isalnum(c) returns non-zero if c is a letter or digit (A-Z, a-z, 0-9 in the C locale):

unsigned char c = 'A';
if (std::isalnum(c)) { /* letter or digit */ }

Other useful predicates: isalpha, isdigit, isspace, ispunct, isupper, islower.

std::string_view

std::string_view (C++17, <string_view>) is a non-owning reference to a contiguous character sequence: a pointer + a length, no allocation, no null-termination guarantee. Use it for read-only string parameters to accept any of const char*, std::string, std::string_view, or substrings without copying.

#include <string_view>

void log(std::string_view msg) {       // accepts std::string, const char*, literal, etc.
    std::cout << msg << '\n';
}

std::string_view sv = "hello world";
auto first = sv.substr(0, 5);          // "hello" — no allocation, just a new view

Lifetime caveat: a string_view doesn't own its data. Outliving the underlying buffer (e.g. taking a view into a temporary std::string) is a use-after-free.

std::string_view bad = std::string{"oops"}; // dangles after the temporary dies

Case-insensitive find

#include <algorithm>
#include <cctype>

auto it = std::search(
    sentence.begin(), sentence.end(),
    word.begin(),     word.end(),
    [](unsigned char a, unsigned char b){ return std::toupper(a) == std::toupper(b); });

bool found = (it != sentence.end());

Small-string optimization (SSO)

Most std::string implementations store short strings (typically up to 15–22 characters, depending on the standard library) inline inside the string object rather than allocating a heap buffer. Once the content grows past that threshold, the implementation switches to a heap allocation. This is the small-string optimization (SSO) and it dramatically reduces allocations for the common case of short strings.

You can observe the switch by overriding operator new — see the worked example in track_memory_allocations_overriding_new_operator.md.

source code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strings in C and C++

String encoding

ASCII

ANSI

Extended ASCII (8-bit ASCII)

Unicode

Encodings: UTF-8 vs UTF-16 vs UTF-32

Strings in Windows

8-bit AnsiStrings

16-bit UnicodeStrings

TCHAR (depends on UNICODE)

Checklist for Windows string programming

Literals and string data types

C strings

Aside: `const` placement on pointers

char* vs char array

Size of a string

strdup

strstr — substring search

std::string

The -Wwrite-strings warning

String comparison

String / number conversions

String → number

Number → string

char → string

string → vector<char>

Splitting a string by a delimiter

Character classification: tolower, toupper, isalnum

std::string_view

Case-insensitive find

Small-string optimization (SSO)

FilesExpand file tree

string.md

Latest commit

History

string.md

File metadata and controls

Strings in C and C++

String encoding

ASCII

ANSI

Extended ASCII (8-bit ASCII)

Unicode

Encodings: UTF-8 vs UTF-16 vs UTF-32

Strings in Windows

8-bit AnsiStrings

16-bit UnicodeStrings

TCHAR (depends on UNICODE)

Checklist for Windows string programming

Literals and string data types

C strings

Aside: const placement on pointers

char* vs char array

Size of a string

strdup

strstr — substring search

std::string

The -Wwrite-strings warning

String comparison

String / number conversions

String → number

Number → string

char → string

string → vector<char>

Splitting a string by a delimiter

Character classification: tolower, toupper, isalnum

std::string_view

Case-insensitive find

Small-string optimization (SSO)

Aside: `const` placement on pointers