Program database

Filename extension: .pdb
Internet media type: application/octet-stream
Developed by: Microsoft
Type of format: Debug

Program database (PDB) is a file format, developed by Microsoft, for storing debugging information about a program or, more commonly, about program modules such as a DLL or EXE. PDB files commonly have a .pdb extension. A PDB file is typically created from source files during compilation. It stores a list of all symbols in a module with their addresses and, possibly, the name of the file and the line on which each symbol was declared. This symbol information is not stored in the module itself, because it takes up a lot of space.[citation needed]

Applications

When a program is debugged, the debugger loads debugging information from the PDB file and uses it to locate symbols or to relate the current execution state of the program to its source code. Microsoft Visual Studio uses PDB files as its primary file format for debugging information.

Another use of PDB files is in services that collect crash data from users and relate it to the specific parts of the source code that cause (or are involved in) the crash.

Microsoft compilers will, under appropriate options, store information in a single PDB about types found in the compiled sources. Debug information specific to each source is stored in the compiled object file, and contains references to types in the PDB. Each compilation will add to the PDB any types that are not already found there, so that references in already compiled object files remain valid.

The Microsoft linker, under appropriate options, builds a complete new PDB which combines the debug information found in its input modules, the types referenced by those modules, and other information generated by the linker. If the link is performed incrementally, an existing PDB is modified by adding or replacing only the information pertaining to added or replaced modules, and adding any new types not already in the PDB.

PDB files are usually removed from a program's distribution package. Developers use them during debugging to save time and gain insight.

Extracting information

The PDB format is publicly documented. Information can be extracted from a PDB file using the DIA (Debug Interface Access) interfaces, available on Microsoft Windows. Third-party tools such as radare2 and pdbparse can also extract information from PDB files.

Multiple stream format

The PDB is a single file which is logically composed of several sub-files, called streams. It is designed to optimize the process of making changes to the PDB, as performed by compilations and incremental links. Streams can be removed, added, or replaced without rewriting any other streams, and the changes to the metadata which describes the streams are minimized as well.

The PDB is organized in fixed-size pages, typically 1 KB, 2 KB, or 4 KB (1,024, 2,048, or 4,096 bytes), numbered consecutively starting at 0.

Note: It is presumed that all numeric information (e.g., stream and page numbers) is stored in little-endian form, the native form for Intel x86 based processors. The pdbparse Python code makes this assumption.

Stream

Each stream in the PDB occupies several pages, which aren't necessarily consecutively numbered. The stream has a number and a length. The stream content is the concatenation of its pages, truncated to the stream's length.
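
The reassembly rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from any Microsoft tool or from pdbparse; the function names `pages_needed` and `read_stream` are hypothetical, and the whole PDB is assumed to be already loaded into memory as a `bytes` object.

```python
from typing import Sequence


def pages_needed(length: int, page_size: int) -> int:
    """Number of pages required to hold `length` bytes (ceiling division)."""
    return (length + page_size - 1) // page_size


def read_stream(pdb: bytes, page_size: int, pages: Sequence[int], length: int) -> bytes:
    """Concatenate the listed pages in order, then truncate to `length`."""
    if len(pages) < pages_needed(length, page_size):
        raise ValueError("page list is too short for the stream length")
    # Page N starts at byte offset N * page_size; pages need not be consecutive.
    chunks = [pdb[p * page_size:(p + 1) * page_size] for p in pages]
    return b"".join(chunks)[:length]
```

Because the page list is explicit, a stream can be relocated or extended by changing only its list of page numbers, which is what makes incremental updates cheap.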

Metadata format

The function of the PDB metadata is to identify all of the component streams, giving the length and the sequence of pages for each stream. Streams are numbered consecutively starting with 0. There is also a root stream, unnumbered, which contains some of the metadata.

The PDB begins with a header, consisting of:

  • Signature, used to identify and validate the specific format. The length of the signature varies with the specific format.
  • The remainder of the header varies with the format identified by the signature.

The header may be longer than a single page.
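
As a rough sketch, the format can be identified by comparing the start of the file against the two signatures listed under Version 2 and Version 7 below. The names `SIGNATURES` and `detect_format` are hypothetical; the byte strings are transcribed from this article (`\032` octal and `\x1A` hex are the same byte).

```python
from typing import Optional

# Signatures as given in the Version 2 and Version 7 sections of this article.
SIGNATURES = {
    2: b"Microsoft C/C++ program database 2.00\r\n\x1aJG\x00\x00",  # 44 bytes
    7: b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00",           # 32 bytes
}


def detect_format(header: bytes) -> Optional[int]:
    """Return 2 or 7 if the data starts with a known signature, else None."""
    for version, signature in SIGNATURES.items():
        if header.startswith(signature):
            return version
    return None
```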

Microsoft tools use two PDB formats:

Version 2

Signature is "Microsoft C/C++ program database 2.00\r\n\032JG\0\0" (44 bytes).

Remainder of the header consists of:

  • Page size, 4 bytes.
  • Start page, 2 bytes.
  • Number of file pages, 2 bytes.
  • Root stream size, 4 bytes.
  • reserved, 4 bytes.
  • Root stream page number list, 2 bytes per page, enough to cover the above Root stream size.
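
The Version 2 header layout above could be decoded along the following lines. This is a sketch under stated assumptions: all fields are taken to be unsigned and little-endian, and `parse_v2_header` and its dictionary keys are hypothetical names chosen for illustration.

```python
import struct

SIG_V2 = b"Microsoft C/C++ program database 2.00\r\n\x1aJG\x00\x00"
# page size (4), start page (2), number of file pages (2), root stream size (4), reserved (4)
FIXED_V2 = struct.Struct("<IHHII")


def parse_v2_header(data: bytes) -> dict:
    """Decode the fixed header fields and the root stream page number list."""
    if not data.startswith(SIG_V2):
        raise ValueError("not a version 2 PDB signature")
    offset = len(SIG_V2)
    page_size, start_page, file_pages, root_size, _reserved = FIXED_V2.unpack_from(data, offset)
    if page_size == 0:
        raise ValueError("page size must be non-zero")
    offset += FIXED_V2.size
    # One 2-byte page number per page needed to cover the root stream size.
    n_root_pages = (root_size + page_size - 1) // page_size
    root_pages = list(struct.unpack_from("<%dH" % n_root_pages, data, offset))
    return {
        "page_size": page_size,
        "start_page": start_page,
        "file_pages": file_pages,
        "root_size": root_size,
        "root_pages": root_pages,
    }
```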

Version 7

Signature is "Microsoft C/C++ MSF 7.00\r\n\x1ADS\0\0\0" (32 bytes).

Remainder of the header consists of:

  • Page size, 4 bytes.
  • Allocation table pointer, 4 bytes. The meaning of this is unknown. There appears to be an allocation table, an array of 65,536 bits (8,192 bytes), located at the end of the PDB, and a 1-bit means a page that is not being used.
  • Number of file pages, 4 bytes.
  • Root stream size, 4 bytes.
  • reserved, 4 bytes.
  • Page number of the Root stream page number list. It does not indicate the location of the Root stream itself, only of the page containing the structure which points to its pages. At that page, the Root stream page number list indicates the pages where the Root stream is stored. It contains 4 bytes per page, enough to cover the above Root stream size.
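
The extra level of indirection in Version 7 is easier to see in code. The sketch below assumes that the final header field is 4 bytes wide like the others, that all fields are unsigned little-endian, and that the root stream page number list is stored contiguously starting at the indicated page. `parse_v7_header` and its dictionary keys are hypothetical names.

```python
import struct

SIG_V7 = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"
# page size, allocation table pointer, number of file pages,
# root stream size, reserved, page number of the root page list (assumed 4 bytes)
FIXED_V7 = struct.Struct("<6I")


def parse_v7_header(data: bytes) -> dict:
    """Decode the header, then follow the indirection to the root page list."""
    if not data.startswith(SIG_V7):
        raise ValueError("not a version 7 PDB signature")
    (page_size, alloc_table_ptr, file_pages,
     root_size, _reserved, root_list_page) = FIXED_V7.unpack_from(data, len(SIG_V7))
    if page_size == 0:
        raise ValueError("page size must be non-zero")
    # The header names the page that holds the list; the list names the root pages.
    n_root_pages = (root_size + page_size - 1) // page_size
    root_pages = list(
        struct.unpack_from("<%dI" % n_root_pages, data, root_list_page * page_size)
    )
    return {
        "page_size": page_size,
        "alloc_table_ptr": alloc_table_ptr,
        "file_pages": file_pages,
        "root_size": root_size,
        "root_list_page": root_list_page,
        "root_pages": root_pages,
    }
```

The root stream itself is then the concatenation of the pages in `root_pages`, truncated to `root_size`.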

Root stream

The root stream describes all of the PDB streams starting with stream 0. Its contents vary with the PDB format version.

Version 2

The root stream consists of:

  • Number of streams, 2 bytes.
  • Reserved, 2 bytes.
  • For each stream:
    • Stream size, 4 bytes.
    • Reserved, 4 bytes.
  • For each stream:
    • Stream page number list, 2 bytes per page, enough to cover above stream size.

Version 7

The root stream consists of:

  • Number of streams, 4 bytes.
  • For each stream:
    • Stream size, 4 bytes.
  • For each stream:
    • Stream page number list, 4 bytes per page, enough to cover above stream size.
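
Both root stream layouts can be walked with one routine that differs only in field widths. The following is a sketch, not a reference implementation: `parse_root_stream` is a hypothetical name, fields are assumed unsigned little-endian, and treating a stream size of 0xFFFFFFFF as an empty stream is an assumption that is not stated in the layouts above.

```python
import struct


def parse_root_stream(root: bytes, version: int, page_size: int) -> list:
    """Return a list of (size, [page numbers]) tuples, indexed by stream number."""
    if version == 2:
        count, _reserved = struct.unpack_from("<HH", root, 0)
        pos = 4
        sizes = []
        for _ in range(count):
            size, _reserved = struct.unpack_from("<II", root, pos)
            sizes.append(size)
            pos += 8
        page_fmt, page_width = "H", 2
    elif version == 7:
        (count,) = struct.unpack_from("<I", root, 0)
        sizes = list(struct.unpack_from("<%dI" % count, root, 4))
        pos = 4 + 4 * count
        page_fmt, page_width = "I", 4
    else:
        raise ValueError("unsupported PDB format version")

    streams = []
    for size in sizes:
        if size == 0xFFFFFFFF:
            size = 0  # assumption: this value marks a stream with no pages
        n_pages = (size + page_size - 1) // page_size
        pages = list(struct.unpack_from("<%d%s" % (n_pages, page_fmt), root, pos))
        pos += n_pages * page_width
        streams.append((size, pages))
    return streams
```

Note that the page lists are not self-delimiting: the number of entries for each stream has to be derived from its size and the page size, so the sizes must be read first.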

Stream contents

Microsoft tools store different sorts of information in different numbered streams. Some stream numbers have a fixed information type associated with them, and other streams are identified in the aforementioned fixed type streams.

Stream 1 is used to verify that the PDB is the same file referred to in an executable or object file stream. It consists of:

  • Version, 4 bytes.
  • Time date stamp, 4 bytes.
  • Age, 4 bytes. This is the number of times this PDB has been modified since its creation.
  • GUID, 16 bytes.
  • Total length of following names, 4 bytes. Followed by null-terminated character strings.
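
A minimal sketch of decoding stream 1, assuming exactly the field order listed above and unsigned little-endian integers. The name `parse_stream1` is hypothetical, the GUID is assumed to use the usual Windows in-memory byte order (hence `bytes_le`), and decoding the names as UTF-8 is an assumption.

```python
import struct
import uuid

# version, time date stamp, age, GUID (16 bytes), total length of names
PDB_INFO = struct.Struct("<III16sI")


def parse_stream1(data: bytes) -> dict:
    """Decode the identity fields that a debugger compares against the module."""
    version, timestamp, age, guid_bytes, names_len = PDB_INFO.unpack_from(data, 0)
    raw_names = data[PDB_INFO.size:PDB_INFO.size + names_len]
    # The names are null-terminated strings stored back to back.
    names = [n.decode("utf-8", "replace") for n in raw_names.split(b"\x00") if n]
    return {
        "version": version,
        "timestamp": timestamp,
        "age": age,
        "guid": uuid.UUID(bytes_le=guid_bytes),
        "names": names,
    }
```

A matching check would compare the GUID and age recorded in the executable's debug information with the values returned here.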

Stream 2 and stream 4 hold type information. Type records define the types used in the program. The structure of these records can be found in the file cvinfo.h provided by Microsoft. There are two flavors of records, types and type IDs, each with its own set of index numbers: only types are stored in stream 2 and only type IDs are stored in stream 4. The indices are used to refer to these records from within symbol records and other type records. Each of these streams consists of:

  • A header:
    • Version, 4 bytes.
    • Header size, 4 bytes.
    • Minimum and maximum (last + 1) index for type records (4 bytes each).
    • Size of following data, 4 bytes, to the end of the stream.
  • Hash information:
    • Stream number, 2 bytes with 2 bytes padding.
    • Hash key, 4 bytes.
    • Buckets, 4 bytes.
    • HashVals, TiOff, and HashAdj, each composed of an offset and length, each 4 bytes.
  • Type records, variable length, count = (maximum - minimum) from above header.
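
The header and hash information above add up to 56 bytes, which can be decoded as sketched below. `TypeHeader`, `parse_type_header`, and `type_count` are hypothetical names, and all integers are assumed to be unsigned little-endian; the individual type records are not decoded here because their layout is defined in cvinfo.h.

```python
import struct
from collections import namedtuple

TypeHeader = namedtuple(
    "TypeHeader",
    "version header_size min_index max_index data_size "
    "hash_stream hash_key buckets hash_vals ti_off hash_adj",
)
# 5 x 4-byte header fields, 2-byte stream number + 2 bytes padding,
# hash key, buckets, then three (offset, length) pairs.
TYPE_HEADER = struct.Struct("<5IH2x2I6I")


def parse_type_header(data: bytes) -> TypeHeader:
    """Decode the fixed part at the start of stream 2 or stream 4."""
    f = TYPE_HEADER.unpack_from(data, 0)
    return TypeHeader(
        *f[:8],
        hash_vals=(f[8], f[9]),
        ti_off=(f[10], f[11]),
        hash_adj=(f[12], f[13]),
    )


def type_count(header: TypeHeader) -> int:
    """Number of type records: maximum index (last + 1) minus minimum index."""
    return header.max_index - header.min_index
```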

Stream 3 is a directory for other streams. Note that it is not present in Version 2, nor in a PDB produced by a compiler. The stream starts with a header, which is padded to 64 bytes in total:

PDB Stream 3 Header (struct NewDBIHdr)

Offset  Size  Name              Description
0       4     Signature         Header identifier, == 0xFFFFFFFF
4       4     HeaderVersion     Version of the Header
8       4     Age
12      2     snGSSyms
14      2     usVerAll          (declaration below)
16      2     snPSSyms
18      2     usVerPdbDllBuild  build version of the pdb dll that built this pdb last
20      2     snSymRecs
22      2     VerPdbDllRBld     rbld version of the pdb dll that built this pdb last
24      4     cbGpModi          size of rgmodi substream
28      4     cbSC              size of Section Contribution substream
32      4     cbSecMap          size of section map
36      4     cbFileInfo        size of file info stream
40      4     cbTSMap           size of the Type Server Map substream
44      4     iMFC              MFC Index
48      4     cbDbgHdr          size of optional DbgHdr info appended to the end of the stream
52      4     cbECInfo          number of bytes in EC substream, or 0 if no EC enabled Mods
56      2     flags             (declaration below)
58      2     wMachine          Machine identifier, same as used in COFF object format, e.g., hex 8664 for Intel x86 64-bit
60      4     RESERVED          future expansion, pad to 64 bytes

Declaration of usVerAll:

    union {
        struct {
            USHORT usVerPdbDllMin : 8;  // minor version and
            USHORT usVerPdbDllMaj : 7;  // major version and
            USHORT fNewVerFmt : 1;      // flag telling us we have rbld stored elsewhere (high bit of original major version)
        } vernew;                       // that built this pdb last.
        struct {
            USHORT usVerPdbDllRbld : 4;
            USHORT usVerPdbDllMin : 7;
            USHORT usVerPdbDllMaj : 5;
        } verold;
        USHORT usVerAll;
    };

Declaration of flags:

    struct _flags {
        USHORT fIncLink : 1;   // true if linked incrementally (really just if ilink thunks are present)
        USHORT fStripped : 1;  // true if PDB::CopyTo stripped the private data out
        USHORT fCTypes : 1;    // true if this PDB is using CTypes.
        USHORT unused : 13;    // reserved, must be 0.
    } flags;

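
The 64-byte header above maps directly onto a fixed `struct` layout. The sketch below uses the field names from the table; `DbiHeader`, `parse_dbi_header`, and `decode_flags` are hypothetical names. Two assumptions are made: the size fields are read as unsigned integers, and the C bit fields are allocated starting from the least significant bit, so that fIncLink is bit 0.

```python
import struct
from collections import namedtuple

DbiHeader = namedtuple(
    "DbiHeader",
    "Signature HeaderVersion Age "
    "snGSSyms usVerAll snPSSyms usVerPdbDllBuild snSymRecs VerPdbDllRBld "
    "cbGpModi cbSC cbSecMap cbFileInfo cbTSMap iMFC cbDbgHdr cbECInfo "
    "flags wMachine RESERVED",
)
# 3 x 4 bytes, 6 x 2 bytes, 8 x 4 bytes, 2 x 2 bytes, 1 x 4 bytes = 64 bytes
DBI_HEADER = struct.Struct("<3I6H8I2HI")


def parse_dbi_header(stream3: bytes) -> DbiHeader:
    """Decode the fixed header at the start of stream 3."""
    header = DbiHeader(*DBI_HEADER.unpack_from(stream3, 0))
    if header.Signature != 0xFFFFFFFF:
        raise ValueError("unexpected stream 3 signature")
    return header


def decode_flags(flags: int) -> dict:
    """Split the flags word, assuming bit fields are allocated from bit 0 upward."""
    return {
        "fIncLink": bool(flags & 0x1),
        "fStripped": bool(flags & 0x2),
        "fCTypes": bool(flags & 0x4),
    }
```

The substream sizes in the header (cbGpModi, cbSC, and so on) are what allow a reader to step over each variable-length section that follows the header.
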
  • Module information, variable length. Total size in above header. There is one of these for each object module used by the linker
    • Opened, 4 bytes.
    • Symbol info.
      • Section number, 2 bytes + 2 bytes padding.
      • Offset and size, 4 bytes each.
      • Flags, 4 bytes.
      • Module number, 2 bytes + 2 bytes padding.
      • CRCs for section data and relocations data, 4 bytes each.
    • Flags, 2 bytes.
    • Stream number, 2 bytes.
    • Symbols size, 4 bytes.
    • Old and new line number info sizes, 4 bytes each.
    • Number of source files, 2 bytes + 2 bytes padding.
    • Offsets, 4 bytes.
    • niSource and niCompiler, 4 bytes each.
    • Module name, null terminated byte string.
    • Object name, null terminated byte string.
    • Padding to multiple of 4 bytes.
  • Section contributions, section headers, file info, ts map, and EC info. Their sizes are found in the above header.
  • Debug header:
    • Stream numbers for Old Frame Pointer Omission, Exceptions, Fixups, Object Maps to and from Source, Section Headers, Token RID Map, Xdata, Pdata, New Frame Pointer Omission, and Section Header Origin. 2 bytes each.
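
The module information and debug header layouts listed above can be walked as sketched here. This is an illustration under assumptions, not a definitive parser: the fixed part of each module record is taken to be the 64 bytes implied by the field sizes in the list, the two object map entries are assumed to appear in the order "to source" then "from source", and all names (`parse_module_info`, `parse_debug_header`, the dictionary keys) are hypothetical.

```python
import struct

# Opened; section (+pad), offset, size, flags, module (+pad), two CRCs;
# flags, stream number, symbols size, old/new line info sizes,
# number of source files (+pad), offsets, niSource, niCompiler.
MODULE_FIXED = struct.Struct("<IH2xIIIH2xIIHHIIIH2xIII")  # 64 bytes

DEBUG_HEADER_NAMES = [
    "old_fpo", "exceptions", "fixups", "omap_to_src", "omap_from_src",
    "section_headers", "token_rid_map", "xdata", "pdata", "new_fpo",
    "section_header_origin",
]


def parse_module_info(buf: bytes) -> list:
    """Walk the module information substream (cbGpModi bytes after the header)."""
    modules = []
    pos = 0
    while pos + MODULE_FIXED.size <= len(buf):
        fields = MODULE_FIXED.unpack_from(buf, pos)
        pos += MODULE_FIXED.size
        names = []
        for _ in range(2):  # module name, then object name
            end = buf.index(b"\x00", pos)
            names.append(buf[pos:end].decode("utf-8", "replace"))
            pos = end + 1
        pos = (pos + 3) & ~3  # padding to a multiple of 4 bytes
        modules.append({
            "module_name": names[0],
            "object_name": names[1],
            "stream": fields[9],
            "symbols_size": fields[10],
            "source_files": fields[13],
        })
    return modules


def parse_debug_header(buf: bytes) -> dict:
    """Map each 2-byte stream number in the debug header to a descriptive name."""
    count = min(len(buf) // 2, len(DEBUG_HEADER_NAMES))
    values = struct.unpack_from("<%dH" % count, buf, 0)
    return dict(zip(DEBUG_HEADER_NAMES, values))
```

Each module's `stream` value is the number of the stream that holds that module's symbols, so this substream acts as the index from object modules to their debug data.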
