Basics Master - Part 1

Integer Types

In this chapter, we will learn about the integer types in C from first principles. There are five primary integer types: char, short, int, long, and long long. Their sizes are implementation-defined.

#c #type-system #integer-types

Integer types from first principles

First of all, we need to forget the type names for a second. At the hardware level, memory is just an array of bits, grouped into addressable units. A C “integer type” is a contract: how many bits an object occupies, how those bits map to a numeric value, and what operations are legal on it. Everything else (the names char/int/long) is just a label C attaches to specific contracts on a given platform.

The byte and CHAR_BIT

The smallest addressable unit in C is the byte, and sizeof is defined in units of bytes: sizeof(char) == 1 always, by definition. But a “byte” in C is not guaranteed to be 8 bits. CHAR_BIT (from <limits.h>) tells you how many bits are in a byte on this implementation, and the standard only requires it to be at least 8. On essentially every machine you’ll touch, CHAR_BIT == 8. On some word-addressed DSPs (TI C28x and C54x with 16-bit bytes, Analog Devices SHARC with 32-bit bytes), CHAR_BIT is 16 or 32. There, char, short, and sometimes int might all be the same size in bytes and bits, because the hardware can’t address anything smaller than 16 or 32 bits.

This is the first principle: C’s type sizes are derived from what the underlying hardware can naturally address and compute on, not from a fixed spec. The standard just sets floors.
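To make the byte/bit relationship concrete, here is a minimal sketch. The helper names int_storage_bits and bytes_to_bits are made up for illustration; only sizeof and CHAR_BIT come from the language and <limits.h>.

```c
#include <limits.h>
#include <stddef.h>

/* Storage size of an int in bits: bytes (sizeof) times bits per byte (CHAR_BIT). */
size_t int_storage_bits(void)
{
    return sizeof(int) * CHAR_BIT;
}

/* Convert any byte count, as reported by sizeof, into a bit count. */
size_t bytes_to_bits(size_t byte_count)
{
    return byte_count * CHAR_BIT;
}
```

On a typical desktop target int_storage_bits() returns 32, but the only portable promise is that it is at least 16.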

A type is (size, alignment, representation)

Every integer type has three properties:

  1. Size — how many bytes the object occupies (sizeof).
  2. Alignment — what addresses the object is allowed to start at (_Alignof). A 4-byte int typically needs to start at an address divisible by 4, because the CPU’s load/store instructions for 32-bit words are fastest (or only work) on aligned addresses.
  3. Representation — how the bit pattern maps to a value. This is where signed vs unsigned, and two’s complement vs other schemes, comes in.

sizeof(int) == 4 says nothing directly about which values an int can hold. It means the object occupies 4 bytes = 32 bits (assuming CHAR_BIT == 8), giving 2^32 distinct bit patterns. The representation rule decides which of those 2^32 patterns map to which integers.
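Size and alignment can be observed directly. The sketch below uses a hypothetical struct sample (not part of any real API) to show how the compiler inserts padding so that an int member lands on an address its alignment allows.

```c
#include <stddef.h>

/* Hypothetical struct used only to make alignment visible. */
struct sample {
    char tag;    /* 1 byte */
    int  value;  /* must start at an offset that satisfies _Alignof(int) */
};

/* Offset of `value` inside the struct; padding after `tag` pushes it to an aligned offset. */
size_t value_offset(void)
{
    return offsetof(struct sample, value);
}

/* Alignment requirement of int on this target, in bytes. */
size_t int_alignment(void)
{
    return _Alignof(int);
}
```

On a typical ABI with a 4-byte int, value_offset() is 4 rather than 1, and sizeof(struct sample) is 8 rather than 5.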

Unsigned representation — pure binary

For unsigned types, the mapping is the one you already know: a bit pattern b_{n-1} b_{n-2} ... b_1 b_0 represents

value = sum(b_i * 2^i)  for i = 0 to n-1

For an unsigned int with n=32 bits, range is 0 to 2^32 - 1. This is modular arithmetic — unsigned overflow is defined behavior, wrapping mod 2^n. UINT_MAX + 1 == 0 is guaranteed by the standard, every time, on every platform. This is the only integer arithmetic in C that’s fully defined under overflow.
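The formula and the wraparound rule can both be checked in a few lines. This is a sketch with illustrative names (value_from_bits, add_wrapping); it assumes n does not exceed the width of unsigned int.

```c
#include <limits.h>

/* Evaluate sum(b_i * 2^i) for the low n bits of pattern, one bit at a time.
   n must not exceed the width of unsigned int. */
unsigned value_from_bits(unsigned pattern, unsigned n)
{
    unsigned value = 0;
    unsigned weight = 1;               /* 2^i, starting at i = 0 */
    for (unsigned i = 0; i < n; i++) {
        if ((pattern >> i) & 1u)
            value += weight;
        weight *= 2u;                  /* unsigned, so it wraps harmlessly after the last bit */
    }
    return value;
}

/* Unsigned addition wraps modulo 2^n; this is defined behavior. */
unsigned add_wrapping(unsigned a, unsigned b)
{
    return a + b;
}
```

For example, value_from_bits(0xB, 4) evaluates 1011 as 8 + 2 + 1 = 11, and add_wrapping(UINT_MAX, 1u) is 0.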

Signed representation — two’s complement (mandatory as of C23)

For n bits, two’s complement defines:

value = -b_{n-1} * 2^(n-1) + sum(b_i * 2^i)  for i = 0 to n-2

The top bit has negative weight. This single rule is why two’s complement won: addition, subtraction, and multiplication (keeping only the low n bits of the result) work identically for signed and unsigned operands at the bit level, so the ALU doesn’t need separate signed/unsigned add circuits. Only operations that interpret the result (comparison, division, right-shift, overflow detection) need to know signedness.

Concretely, for an 8-bit signed value:

0000 0001 =  1
0111 1111 =  127  (INT_MAX equivalent)
1000 0000 = -128  (INT_MIN equivalent — note: no positive counterpart!)
1111 1111 = -1
1111 1110 = -2
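The table above can be reproduced by applying the formula literally. This is a small sketch; decode_twos_complement8 is an illustrative name, and the function only looks at the low 8 bits of its argument.

```c
/* Decode an 8-bit pattern with the two's complement formula:
   value = -b7 * 2^7 + sum(b_i * 2^i) for i = 0..6. */
int decode_twos_complement8(unsigned pattern)
{
    int top_bit  = (int)((pattern >> 7) & 1u);   /* b7, the bit with negative weight */
    int low_bits = (int)(pattern & 0x7Fu);       /* bits 0..6 carry ordinary positive weights */
    return -top_bit * 128 + low_bits;
}
```

Calling it with 0x80 gives -128 + 0 = -128, and with 0xFF gives -128 + 127 = -1, matching the rows above.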

Two consequences fall directly out of this asymmetric range:

  • INT_MIN has no corresponding positive value (-INT_MIN overflows).
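  • -1 is all bits set, for any width. This is why ~0 (bitwise NOT of zero) equals -1 for signed types. It is also why (unsigned)-1 == UINT_MAX keeps the same bit pattern: the conversion itself is defined numerically (the value is reduced modulo 2^n), and on two’s complement that reduction happens to leave the bits unchanged.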

Before C23, implementations could also use sign-magnitude (top bit = sign flag, rest = magnitude) or ones’ complement (negation = flip all bits). Both have a redundant representation for zero (+0 and -0 as distinct bit patterns) — a wart two’s complement avoids since it has exactly one all-zero pattern. C23 killed these off entirely; you’ll basically only encounter them in legacy-portability war stories now.

Signed overflow is UB — and why

INT_MAX + 1 is undefined behavior, not “wraps to INT_MIN,” even though two’s complement would wrap that way mechanically. Why leave it undefined if the representation is now fixed?

Because UB here isn’t about representation — it’s about what the compiler is allowed to assume during optimization. If signed overflow were defined as wrapping, the compiler couldn’t transform x + 1 > x into true (always-true), because for x == INT_MAX it’d be false under wrapping. Compilers exploit “signed overflow never happens” to simplify loop bounds, eliminate redundant checks, and vectorize. This is a huge practical landmine: code that “looks correct” can be silently miscompiled under -O2 if it overflows signed ints, even on a two’s-complement machine where the wraparound would “make sense” mechanically. (-fwrapv in GCC/Clang forces defined wrapping semantics if you need it — common in crypto/hashing code.)
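Since the overflow itself is the undefined part, the safe pattern is to test the range before adding. The following is one possible sketch; checked_add is a hypothetical helper, not a standard function (C23 offers ckd_add in <stdckdint.h>, and GCC/Clang have __builtin_add_overflow, for the same purpose).

```c
#include <limits.h>
#include <stdbool.h>

/* Store a + b in *out and return true, or return false if the sum would overflow int.
   The range test happens before the addition, so no signed overflow ever occurs. */
bool checked_add(int a, int b, int *out)
{
    if (b > 0 && a > INT_MAX - b)
        return false;   /* would exceed INT_MAX */
    if (b < 0 && a < INT_MIN - b)
        return false;   /* would go below INT_MIN */
    *out = a + b;
    return true;
}
```

With this helper, checked_add(INT_MAX, 1, &r) returns false instead of invoking undefined behavior, and the compiler has nothing to “optimize away.”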

How a type gets bound to bits: the ABI

The C standard says “int has at least 16 bits, long has at least as many bits as int.” It does not say how many bits your int actually has — that’s decided by the ABI (Application Binary Interface) for your target: a contract between the compiler, OS, and linker about type sizes, alignment, calling conventions, and struct layout.

ILP32: int=32, long=32,  pointer=32   (32-bit x86, ARM32, most embedded)
LP64:  int=32, long=64,  pointer=64   (Linux/macOS/BSD x86-64, ARM64)
LLP64: int=32, long=32,  pointer=64   (Windows x86-64)

The names encode it: “LP64” = long and Pointer are 64-bit. The compiler reads a target triple (e.g. x86_64-linux-gnu) and picks the matching ABI’s type-size table — that table is what sizeof returns. This is why cross-compiling can change sizeof(long) for the exact same source file with the exact same compiler — different --target, different ABI, different table.
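A program can recognise which of these tables it was compiled against just by looking at sizeof. This is a rough sketch; data_model_name is an illustrative helper, and it only knows the three models listed above.

```c
#include <limits.h>

/* Name the data model from the sizes the compiler chose for this target.
   Returns "other" for anything that is not one of the three models listed above. */
const char *data_model_name(void)
{
    unsigned int_bits  = (unsigned)(sizeof(int) * CHAR_BIT);
    unsigned long_bits = (unsigned)(sizeof(long) * CHAR_BIT);
    unsigned ptr_bits  = (unsigned)(sizeof(void *) * CHAR_BIT);

    if (int_bits == 32 && long_bits == 32 && ptr_bits == 32) return "ILP32";
    if (int_bits == 32 && long_bits == 64 && ptr_bits == 64) return "LP64";
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 64) return "LLP64";
    return "other";
}
```

Building the same file for two different targets and printing the result is a quick way to see the ABI table change underneath unchanged source.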

Memory layout — endianness

A type’s size tells you how many bytes; endianness tells you the order those bytes sit in memory for multi-byte types. For a 32-bit int with value 0x12345678:

Little-endian (x86, ARM in default mode):
  address:  N    N+1   N+2   N+3
  byte:    0x78  0x56  0x34  0x12   <- least significant byte first

Big-endian (network byte order, old PowerPC, some ARM configs):
  address:  N    N+1   N+2   N+3
  byte:    0x12  0x34  0x56  0x78   <- most significant byte first

This is invisible in normal C arithmetic — +, -, comparisons all operate on the value, and the compiler/CPU handle byte order internally. It becomes visible the moment you memcpy an int into a buffer, cast a pointer, use a union to reinterpret bytes, or read raw bytes off a network socket or file — exactly the territory you’re in with the BMP-to-ASCII project. htonl/ntohl and friends exist precisely to normalize this for protocols.
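Here is a sketch of both sides of that boundary: detecting the host byte order by inspecting memory, and writing a value in a fixed byte order using shifts so the host order never matters. The function names are illustrative; uint32_t comes from <stdint.h>.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if the least significant byte of a multi-byte integer is stored first. */
bool is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);   /* inspect the byte at the lowest address */
    return first_byte == 1;
}

/* Write a 32-bit value in big-endian (network) order, independent of host byte order. */
void store_be32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)((value >> 24) & 0xFFu);
    out[1] = (unsigned char)((value >> 16) & 0xFFu);
    out[2] = (unsigned char)((value >> 8) & 0xFFu);
    out[3] = (unsigned char)(value & 0xFFu);
}

/* Read a 32-bit big-endian value back; shifts operate on the value, not on memory. */
uint32_t load_be32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because store_be32 and load_be32 never reinterpret memory, they give the same bytes on little-endian and big-endian hosts, which is the property file formats and wire protocols need.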

Padding bits and trap representations (mostly historical now)

Pre-C23, a type’s width (value bits + sign bit) could be less than its full storage size. The extra bits were “padding,” with unspecified values, and certain bit patterns could be “trap representations” that produce UB just by being read, independent of any operation. This was a relic of exotic hardware (ones’ complement and sign-magnitude machines with -0 traps). C23 mandates two’s complement, which removes negative zero and the alternative sign schemes, and it renames trap representations to “non-value representations.” It does not formally ban padding bits in short/int/long/long long, though: only the character types and the exact-width types (intN_t/uintN_t) are guaranteed to have none. In practice, no mainstream ABI has padding bits in its integer types, so on the platforms you’ll actually use, every bit pattern in an object of these types is a valid value. (_Bool still technically has this nuance; more on that in 1.11.)
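Whether a given target has padding bits can be measured rather than assumed. The sketch below compares the number of value bits in unsigned int (counted from UINT_MAX) with its storage size; both helper names are illustrative.

```c
#include <limits.h>

/* Count the value bits of unsigned int by counting the 1 bits in UINT_MAX. */
unsigned unsigned_value_bits(void)
{
    unsigned bits = 0;
    for (unsigned v = UINT_MAX; v != 0; v >>= 1)
        bits++;
    return bits;
}

/* Padding bits = storage bits minus value bits; expected to be 0 on mainstream targets. */
unsigned unsigned_padding_bits(void)
{
    return (unsigned)(sizeof(unsigned) * CHAR_BIT) - unsigned_value_bits();
}
```

On x86 and ARM targets unsigned_padding_bits() should come out as 0, meaning width and storage size coincide.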

Putting it together — what sizeof(int) is

sizeof(int) is the answer to: “given this target’s ABI, how many CHAR_BIT-sized bytes does the compiler’s int-representation occupy, where that representation is two’s-complement binary (with no padding bits on any mainstream ABI), aligned per the ABI’s alignment table, big enough to satisfy the standard’s minimum range guarantee (≥16 bits) but otherwise chosen by the implementation to match the CPU’s natural word size for efficient arithmetic.” Every other fact in 1.1–1.6 is a consequence of that one sentence.

The Primary Integer types — char, short, int, long, long long

The hierarchy and what “at least as wide” actually means

C defines a rank ordering (C23 §6.3.1.1):

bool < char < short < int < long < long long

The following is the most important information to remember or note down. I follow the latest standard, C23, so these are its guarantees. My impression is that the C standard is slowly drifting toward C++ territory, though without overloading.

The standard guarantees:

sizeof(char)  <= sizeof(short)
sizeof(short) <= sizeof(int)
sizeof(int)   <= sizeof(long)
sizeof(long)  <= sizeof(long long)

Note “<=”, not “<”. Nothing stops a conforming implementation from making short == int == 32 bits. The only hard floors come from minimum range guarantees in <limits.h> (C23 §5.2.4.2.1), not from sizeof relationships:

type        min bits   min range (signed)
char        8          −128 to 127
short       16         −32768 to 32767
int         16         −32768 to 32767
long        32         −2147483648 to 2147483647
long long   64         −9223372036854775808 to …

These are minimums. int being only 16-bit-guaranteed is a historical artifact (16-bit machines) — in practice every modern target gives int 32 bits.
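These floors can be turned into compile-time checks, which is a cheap way to document what a piece of code relies on. This is a sketch only: the assertions restate the standard’s minimum maxima, and int_width is an illustrative helper that derives the actual width on the current target.

```c
#include <limits.h>

/* Compile-time checks of the floors every conforming implementation must meet. */
_Static_assert(CHAR_BIT  >= 8,                     "a byte has at least 8 bits");
_Static_assert(SHRT_MAX  >= 32767,                 "short covers at least 16 bits");
_Static_assert(INT_MAX   >= 32767,                 "int covers at least 16 bits");
_Static_assert(LONG_MAX  >= 2147483647L,           "long covers at least 32 bits");
_Static_assert(LLONG_MAX >= 9223372036854775807LL, "long long covers at least 64 bits");

/* Width of int in bits: the value bits counted from INT_MAX, plus the sign bit. */
unsigned int_width(void)
{
    unsigned bits = 1;                       /* the sign bit */
    for (int v = INT_MAX; v != 0; v >>= 1)   /* right shift of a non-negative int is well defined */
        bits++;
    return bits;
}
```

If a project genuinely needs a 32-bit int, adding _Static_assert(INT_MAX >= 2147483647, "...") makes that assumption fail loudly at build time instead of silently at run time.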

Real-world sizes (the data model matters)

As mentioned earlier, the ABI (Application Binary Interface) is the set of rules that lets separately compiled binary modules communicate and work together on a specific system. An API (Application Programming Interface) defines how source code interacts; an ABI defines how the compiled machine code interacts. That is the main difference between the two.

On a given ABI, sizes are fixed, but the ABI choice varies by platform:

LP64 (Linux/macOS x86-64, ARM64):
  char=1  short=2  int=4  long=8   long long=8

LLP64 (Windows x86-64):
  char=1  short=2  int=4  long=4   long long=8

ILP32 (32-bit x86, ARM32):
  char=1  short=2  int=4  long=4   long long=8

This is the single biggest portability trap: long is 8 bytes on 64-bit Linux and macOS, 4 on 64-bit Windows. Code that does long x = some_pointer_diff works on Linux and silently truncates on Windows once the difference no longer fits in 32 bits. This is why ptrdiff_t/intptr_t/size_t and the fixed-width types exist (sections 1.4–1.6).
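For the pointer-difference case specifically, the portable fix is to store the result in the type the subtraction already produces. A minimal sketch, with element_distance as an illustrative name; ptrdiff_t is the standard type from <stddef.h>.

```c
#include <stddef.h>

/* Number of elements between two pointers into the same array.
   ptrdiff_t is the type the subtraction itself yields, so nothing is truncated. */
ptrdiff_t element_distance(const int *first, const int *last)
{
    return last - first;
}
```

The same source then behaves identically under LP64 and LLP64, because ptrdiff_t follows the pointer size rather than the size of long.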

Why int is special

int is meant to be the “natural” register width of the target — the type arithmetic defaults to. This shows up in integer promotion: any operand of rank less than int (char, short, and _Bool, signed or unsigned) gets promoted to int (or unsigned int if it doesn’t fit in int) before any arithmetic or comparison. So:

char a = 100, b = 100;
int r = a + b;   // a, b promoted to int BEFORE the add
                 // result 200 — no char overflow, despite both operands being char

This is why you almost never see arithmetic overflow on char/short operands directly — the promotion happens first, and only the final assignment back to a narrow type can truncate.

Signedness defaults

short, int, long, long long are signed by default (signed short == short, etc. — the signed keyword is redundant but legal). char is the outlier — see 1.10, its signedness is implementation-defined and it’s actually a third distinct type from both signed char and unsigned char.
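Both claims about char can be checked from code. In this sketch, CHAR_KIND is a hypothetical macro built on _Generic (C11 and later); it only compiles because char, signed char, and unsigned char are three distinct types. plain_char_is_signed reports which way this implementation went.

```c
#include <limits.h>
#include <stdbool.h>

/* _Generic may list all three character types only because they are distinct types. */
#define CHAR_KIND(x) _Generic((x),            \
    char:          "char",                    \
    signed char:   "signed char",             \
    unsigned char: "unsigned char",           \
    default:       "not a character type")

/* Whether plain char behaves as a signed type on this implementation. */
bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

Note that CHAR_KIND('a') yields "not a character type", because a character constant has type int in C.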

C23 change worth flagging

C23 mandates two’s complement as the only representation for signed integers (previously sign-magnitude and ones’ complement were allowed, even though virtually no current implementation used them). This also formally guarantees INT_MIN == -INT_MAX - 1. The conversion (unsigned)-1 == UINT_MAX was already fully defined by the modulo rule; what is new is the guarantee that -1 is stored as all bits set. Practically, this changes nothing on x86/ARM (they always were two’s complement), but it removes a category of “technically non-portable” reasoning about bit patterns.

Literal type and suffixes (preview, full table in 1.17)

A bare integer literal like 100000000000 doesn’t automatically become long long — its type is the first type in a signedness/size-appropriate list that can hold the value (decimal: int, long, long long; hex/octal also consider unsigned variants earlier). This is a classic bug source:

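long x = 1 << 40;         // UB on every common ABI: 1 is a 32-bit int, and the shift count exceeds its width
long y = 1L << 40;        // OK where long is 64-bit (LP64); still UB on LLP64/ILP32, where long is 32-bit
long long z = 1LL << 40;  // portable: long long is at least 64 bits everywhere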
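The type a literal actually receives can be inspected with _Generic. The macro LITERAL_TYPE below is a hypothetical helper for experimenting, not a standard facility; two_to_the_40 shows the shift done in a type that is wide enough on every implementation.

```c
/* Report the type the compiler assigned to an integer expression. */
#define LITERAL_TYPE(x) _Generic((x),          \
    int:                "int",                 \
    unsigned int:       "unsigned int",        \
    long:               "long",                \
    unsigned long:      "unsigned long",       \
    long long:          "long long",           \
    unsigned long long: "unsigned long long",  \
    default:            "other")

/* 2^40 computed in long long, which is at least 64 bits on every implementation. */
long long two_to_the_40(void)
{
    return 1LL << 40;
}
```

LITERAL_TYPE(1) is "int" and LITERAL_TYPE(1L) is "long" everywhere, while LITERAL_TYPE(100000000000) is "long" on LP64 but "long long" on LLP64 and ILP32, which is exactly the platform dependence described above.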