Glossary

Several publicly available websites were consulted to create the final definitions you see here. Principal among them is Wikipedia.org.

Byte Order Mark (BOM)

The byte order mark (BOM) is a Unicode character, FEFF_hex BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:

The byte order, or endianness, of the text stream
The fact that the text stream's encoding is Unicode, to a high level of confidence
Which Unicode encoding the text stream is encoded as

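The UTF-8 representation of the BOM is the byte sequence EF_hex BB_hex BF_hex. Software that misreads the stream as ISO-8859-1 or Windows-1252 displays these three bytes as the characters ï » ¿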
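As a minimal sketch of how a reader might use the BOM as a magic number, the Python below checks the start of a byte stream against the BOM constants from the standard `codecs` module. The function name `detect_bom` and its return shape are assumptions made for this illustration only.

```python
import codecs

# BOM signatures paired with an encoding name. The UTF-32 entries come first
# because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# U+FEFF encoded as UTF-8 yields the three bytes EF BB BF.
utf8_bom = "\ufeff".encode("utf-8")
# Decoding those bytes as Latin-1 produces the familiar "ï»¿" debris.
misread = utf8_bom.decode("latin-1")
```

A caller would typically skip `bom_length` bytes and decode the remainder with the returned encoding. Note that a byte stream starting with FF FE 00 00 could also be UTF-16 LE text beginning with a NUL character, so this check is a heuristic.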

Hindu–Arabic Numeral System

The Hindu–Arabic numeral system or Indo-Arabic numeral system (also called the Arabic numeral system or Hindu numeral system) is a positional decimal numeral system, and is the most common system for the symbolic representation of numbers in the world.

The system is based upon ten (originally nine) glyphs. The symbols (glyphs) used to represent the system are in principle independent of the system itself. The glyphs in actual use are descended from Brahmi numerals and have split into various typographical variants since the Middle Ages.

Hindu-Arabic numerals are the ten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. Their Unicode code points are 30_hex to 39_hex. The term often implies a decimal number written using these digits.
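The positional, base-ten nature of the system can be illustrated with a short Python sketch that converts a digit string to a number using only the code points 30_hex to 39_hex. The helper names `digit_value` and `parse_decimal` are hypothetical and exist only for this example.

```python
# The ten digits occupy consecutive code points 0x30 through 0x39.
DIGITS = [chr(cp) for cp in range(0x30, 0x3A)]


def digit_value(ch: str) -> int:
    """Map a single digit character to its numeric value via its code point."""
    cp = ord(ch)
    if not 0x30 <= cp <= 0x39:
        raise ValueError(f"not a Hindu-Arabic digit: {ch!r}")
    return cp - 0x30


def parse_decimal(text: str) -> int:
    """Interpret a digit string positionally: each step shifts one decimal place."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        value = value * 10 + digit_value(ch)
    return value
```

For instance, `parse_decimal("2019")` accumulates 2, 20, 201, and finally 2019.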

Key-Value Pair

A name–value pair, key–value pair, field–value pair or attribute–value pair is a fundamental data representation in computing systems and applications. Designers often desire an open-ended data structure that allows for future extension without modifying existing code or data. In such situations, all or part of the data model may be expressed as a collection of 2-tuples in the form (attribute name, value), with each element being an attribute–value pair. Depending on the particular application and the implementation chosen by programmers, attribute names may or may not be unique.
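A brief Python sketch of the idea follows. It stores pairs as a list of 2-tuples and shows two ways an application might treat repeated attribute names. The sample data and variable names are invented for illustration.

```python
from collections import defaultdict

# A collection of 2-tuples; the attribute name "book" appears twice.
pairs = [("name", "Adam"), ("book", "Dune"), ("book", "Emma")]

# If names must be unique, a dict keeps only the last value seen per name.
unique = dict(pairs)

# If names may repeat, group every value under its name instead.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
```

Which behaviour is appropriate depends on the application, as the definition above notes.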

Magic Number

Magic numbers are common in programs across many operating systems. Magic numbers implement strongly typed data and are a form of in-band signaling to the controlling program that reads the data type(s) at program run-time. Many files have such constants that identify the contained data. Detecting such constants in files is a simple and effective way of distinguishing between many file formats and can yield further run-time information.
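As an illustration of detecting such constants, the following Python sketch compares the leading bytes of some data against a few well-known file signatures. The function name `sniff_format` and the short format labels are assumptions for this example, and the signature table is deliberately incomplete.

```python
# Well-known leading byte signatures for a handful of formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"%PDF-", "pdf"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
]


def sniff_format(data: bytes):
    """Return a format label if the data starts with a known signature, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None
```

To check a file, a caller could read its first few bytes in binary mode and pass them to `sniff_format`. A match suggests, but does not prove, that the rest of the file is well formed.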

Media Type

A media type (formerly known as MIME type) is a two-part identifier for file formats and format contents transmitted on the Internet. A media type consists of a type and a subtype, which is further structured into a tree. A media type can optionally define a suffix and parameters: Type "/" [tree "."] subtype ["+" suffix] *[";" parameter].

Common examples:

application/javascript
application/json
text/html; charset=UTF-8
text/plain
text/xml
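The structure described above can be sketched as a small Python parser. This is a simplified illustration, not a complete implementation of the media type grammar: the function name `parse_media_type` is hypothetical, and quoted parameter values that contain semicolons are not handled.

```python
def parse_media_type(value: str) -> dict:
    """Split a media type string into type, tree, subtype, suffix, and parameters."""
    head, *param_parts = value.split(";")
    type_, _, rest = head.strip().partition("/")

    # An optional "+suffix" follows the subtype.
    rest, _, suffix = rest.partition("+")

    # An optional "tree." prefix precedes the subtype.
    if "." in rest:
        tree, _, subtype = rest.partition(".")
    else:
        tree, subtype = "", rest

    parameters = {}
    for part in param_parts:
        name, _, param_value = part.strip().partition("=")
        if name:
            parameters[name.strip().lower()] = param_value.strip().strip('"')

    return {
        "type": type_.lower(),
        "tree": tree,
        "subtype": subtype,
        "suffix": suffix,
        "parameters": parameters,
    }
```

For example, `parse_media_type("text/html; charset=UTF-8")` yields type `text`, subtype `html`, and a `charset` parameter of `UTF-8`, while `application/vnd.api+json` splits into tree `vnd`, subtype `api`, and suffix `json`.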

n-Tuple

An n-tuple, sometimes simply called a "tuple" when the number n is known implicitly, is another word for a list, i.e., an ordered set of n elements.

Newline

Newline (frequently called line ending, end of line (EOL), line feed, or line break) is a control character or sequence of control characters in a character encoding specification (e.g. ASCII or EBCDIC) that is used to signify the end of a line of text and the start of a new one. Text editors insert this character (or sequence) when the user presses the Enter key. When a text file is displayed or printed, the newline causes the characters that follow it to appear on a new line.

The Unicode standard defines several characters that conforming applications should recognize as line terminators:

LF: Line Feed, 0A_hex
VT: Vertical Tab, 0B_hex
FF: Form Feed, 0C_hex
CR: Carriage Return, 0D_hex
CR+LF: CR (0D_hex) followed by LF (0A_hex)
NEL: Next Line, 85_hex
LS: Line Separator, 2028_hex
PS: Paragraph Separator, 2029_hex
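As a quick illustration, Python's built-in `str.splitlines()` breaks text at each of the terminators listed above and treats CR+LF as a single break. It also splits on the FS, GS, and RS control characters (1C_hex to 1E_hex), which are not in the list, so it is shown here as an approximation rather than an exact implementation of the Unicode rule. The sample string is invented for this example.

```python
# One sample string containing every terminator from the list above.
text = (
    "one\n"        # LF
    "two\r\n"      # CR+LF, counted as a single line break
    "three\r"      # CR
    "four\x0b"     # VT
    "five\x0c"     # FF
    "six\x85"      # NEL
    "seven\u2028"  # LS
    "eight\u2029"  # PS
    "nine"
)

lines = text.splitlines()
```

Here `lines` holds the nine words as separate entries with the terminators removed.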

NULL

Null (or NULL) is a special marker used in Structured Query Language to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL Null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega ω (03C9_hex) symbol to represent Null in database theory. In SQL, NULL is a reserved word used to identify this marker.

A null should not be confused with a value of 0. A null value indicates a lack of a value — a lack of a value is not the same thing as a value of zero in the same way that a lack of an answer is not the same thing as an answer of "no". For example, consider the question "How many books does Adam own?" The answer may be "zero" (we know that he owns none) or "null" (we do not know how many he owns). In a database table, the column reporting this answer would start out with no value (marked by Null), and it would not be updated with the value "zero" until we have ascertained that Adam owns no books.
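The difference between Null and zero can be seen in a small sketch using Python's standard `sqlite3` module and an in-memory database. The table name `owners`, its columns, and the second person "Beth" are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE owners (name TEXT, books INTEGER)")

# Python's None is stored as SQL NULL: Adam's count is unknown, Beth's is zero.
conn.executemany(
    "INSERT INTO owners (name, books) VALUES (?, ?)",
    [("Adam", None), ("Beth", 0)],
)

# IS NULL finds rows whose value is missing.
unknown = conn.execute("SELECT name FROM owners WHERE books IS NULL").fetchall()

# Comparing with 0 finds only rows known to be zero.
zero = conn.execute("SELECT name FROM owners WHERE books = 0").fetchall()

# "= NULL" never evaluates to true, so it matches no rows at all.
eq_null = conn.execute("SELECT name FROM owners WHERE books = NULL").fetchall()

# Once we have ascertained that Adam owns no books, the Null is replaced by zero.
conn.execute("UPDATE owners SET books = 0 WHERE name = 'Adam'")
after = conn.execute("SELECT name FROM owners WHERE books = 0 ORDER BY name").fetchall()
```

The empty result for `books = NULL` reflects the fact that Null is a marker for a missing value and not a value that can be compared for equality.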

PII (Personally Identifying Information)

Personal data, also known as personal information, personally identifying information (PII), or sensitive personal information (SPI), is any information relating to identifying a person. The abbreviation PII is widely accepted in the United States, but the phrase it abbreviates has four common variants based on personal / personally, and identifiable / identifying. Not all are equivalent, and for legal purposes the effective definitions vary depending on the jurisdiction and the purposes for which the term is being used. Under European and other data protection regimes, which centre primarily around the General Data Protection Regulation, the term "personal data" is significantly broader, and determines the scope of the regulatory regime.

National Institute of Standards and Technology Special Publication 800-122 defines personally identifying information as "any information about an individual maintained by an agency, including (1) any information that can be used to distinguish or trace an individual's identity, such as name, social security number, date and place of birth, mother's maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information." So, for example, a user's IP address is not classed as PII on its own, but is classified as linked PII. However, in the European Union, the IP address of an Internet subscriber may be classed as personal data.

Unicode

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard is maintained by the Unicode Consortium, and as of May 2019 the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters (consisting of 137,766 graphic characters, 163 format characters and 65 control characters) covering 150 modern and historic scripts, as well as multiple symbol sets and emoji.

The Unicode Standard consists of a set of code charts for visual reference, an encoding method and set of standard character encodings, a set of reference data files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering, and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic and Hebrew, and left-to-right scripts).

URL (Uniform Resource Locator)

A URL is a compact string representation for a resource available via the Internet. URLs are used to "locate" resources, by providing an abstract identification of the resource location.

For example, the URL for this page is https://textfileschema.omegatower.net/glossary.
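To show how such a string breaks down into parts, the sketch below splits this page's URL with `urlsplit` from Python's standard `urllib.parse` module. The variable names are chosen for this example only.

```python
from urllib.parse import urlsplit

url = "https://textfileschema.omegatower.net/glossary"
parts = urlsplit(url)

scheme = parts.scheme   # how to access the resource
host = parts.hostname   # which server holds it
path = parts.path       # where on that server it lives
```

Together the scheme, host, and path tell a client how and where to locate the resource.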

UTF-8

UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. UTF-8 uses one byte for each of the first 128 code points, and two to four bytes for all other characters. The encoding is defined by the Unicode standard, and was originally designed by Ken Thompson and Rob Pike. The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit. UTF-8 is the dominant encoding on the World Wide Web (used in over 94% of websites as of November 2019). The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also a UTF-8 text.
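The variable width and the ASCII compatibility described above can be checked with a few lines of Python. The sample characters are arbitrary choices for this illustration.

```python
# One sample character for each encoded length, from one to four bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, and an emoji
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The count of valid code points: the full range U+0000..U+10FFFF
# minus the 2,048 surrogate code points U+D800..U+DFFF.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
```

Here `byte_lengths` comes out as 1, 2, 3, and 4, and `valid_code_points` matches the figure of 1,112,064 quoted above.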

A full UTF-8 encoding table can be found on the web page titled "UTF-8 encoding table and Unicode characters".