File Format Specifications and Patents
– File formats often have published specifications describing encoding methods
– Specifications enable testing of program functionality
– Some developers view their specifications as trade secrets
– When a specification is not published, other developers must either reverse engineer the format or acquire the specification documents from its developer
– File formats with publicly available specifications are more widely supported
– Patent law is often used to protect file formats
– Some formats use patented algorithms
– The GIF file format relied on a patented compression algorithm (LZW) until the patent expired
– The patent expired in the US in mid-2003 and worldwide in mid-2004
– Licensing requirements for the patented algorithm, not its expiration, prompted the development of alternative formats such as PNG
Identifying File Type
– Different operating systems use different approaches to determine file format
– Multiple approaches are often needed to read foreign file formats
– Filename extension is a popular method used by many operating systems
– Filename extension is the portion of the filename after the final period
– Because only a limited number of three-letter extensions exist, unrelated formats can end up sharing the same extension, which causes confusion
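As a rough illustration of the extension rule above, the following Python sketch returns the portion of a filename after the final period. The function name extension_of is hypothetical and introduced only for this example; lowercasing the result is an assumption about how a caller might want to compare extensions, not something every operating system does.

```python
from pathlib import PurePath


def extension_of(filename: str) -> str:
    """Return the text after the final period, lowercased, or '' if there is none."""
    # PurePath.suffix yields only the last dot-suffix, including the dot,
    # so "archive.tar.gz" gives ".gz" rather than ".tar.gz".
    return PurePath(filename).suffix.lstrip(".").lower()


print(extension_of("photo.JPG"))       # jpg
print(extension_of("archive.tar.gz"))  # gz
print(extension_of("README"))          # (empty string)
```

Note that the extension alone says nothing about which program wrote the file, which is why two unrelated formats that share an extension cannot be told apart this way.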
Internal Metadata and File Headers
– Another way to identify file format is using information stored inside the file itself
– This information can be data written specifically for identification, or binary strings that always appear at specific locations in files of a given format
– Internal metadata provides reliable identification of file format
– This method is used in addition to other approaches by modern operating systems
– Internal metadata helps in reading and working with foreign file formats
– File headers contain metadata about the file and its contents.
– They are usually stored at the start of the file, but can be present in other areas too.
– Text-based file headers are human-readable and can be examined easily.
– Binary formats usually have binary headers.
– File headers may be used by an operating system to quickly gather information about a file.
– File headers can store information about image format, size, resolution, and color space.
– They can also contain authoring information such as the creator, date, and camera settings.
– Metadata in file headers is used by software during the loading process and afterwards.
– Text-based file headers are larger in size but can be easily examined using simple software.
– Binary headers usually require more complex interpretation, and a misread or badly written header can result in corrupt metadata.
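To make the idea of a binary header concrete, here is a minimal Python sketch that reads image size and color information from the start of a PNG file. It assumes the standard PNG layout (an 8-byte signature followed by an IHDR chunk); the function name read_png_header and the keys of the returned dictionary are hypothetical choices for this example, and the sketch does not verify the chunk checksum.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Extract width, height, bit depth and color type from the first bytes of a PNG."""
    if len(data) < 26:
        raise ValueError("too short to contain a PNG header")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # The first chunk starts with a 4-byte big-endian length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("first chunk is not a valid IHDR")
    # IHDR data begins with width and height (4 bytes each),
    # then bit depth and color type (1 byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
    }
```

Only the first 26 bytes are needed, which is how a program can gather this information without loading the whole image. A caller would typically pass the result of reading that many bytes from a file opened in binary mode.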
Magic Numbers and Shebang Lines
– Magic numbers are identifiers stored inside the file itself for file type recognition.
– Any unique feature of a file format can be used as a magic number.
– Magic numbers offer better guarantees for identifying the file format correctly.
– They can determine more precise information about the file.
– Magic numbers can be inefficient for displaying large lists of files, because each file must be opened and read to identify it.
– Shebang lines in script files are a special case of magic numbers.
– Shebang lines identify a specific command interpreter and its options.
– The magic number of a shebang line is the human-readable character pair #! at the start of the file.
– Shebang lines are used to execute scripts with the correct interpreter.
– They are commonly used in Unix-based systems.
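The two ideas above can be sketched in a few lines of Python. The signature table below lists a handful of well-known magic numbers and is illustrative, not exhaustive; the names MAGIC_NUMBERS, sniff and parse_shebang are hypothetical. Splitting the shebang line on whitespace is a simplification, since real systems differ in how they pass the text after the interpreter path.

```python
from typing import List, Optional, Tuple

# A few well-known signatures found at the very start of a file.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff(head: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for signature, label in MAGIC_NUMBERS:
        if head.startswith(signature):
            return label
    return None


def parse_shebang(first_line: bytes) -> Optional[Tuple[str, List[str]]]:
    """Return (interpreter, options) if the line starts with '#!', else None."""
    if not first_line.startswith(b"#!"):
        return None
    # Simplification: treat whitespace-separated words as interpreter + options.
    parts = first_line[2:].decode("utf-8", errors="replace").split()
    if not parts:
        return None
    return parts[0], parts[1:]


print(sniff(b"GIF89a\x01\x00"))                   # GIF image
print(parse_shebang(b"#!/usr/bin/env python3\n"))  # ('/usr/bin/env', ['python3'])
```

Because sniff needs the first bytes of every file, running it over a large directory means opening each file in turn, which is the inefficiency mentioned in the list above.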
External Metadata and Symlinks
– External metadata stores information about the file format in the file system.
– This approach keeps metadata separate from the main data and the file name.
– External metadata is less portable compared to filename extensions or magic numbers.
– Archive files, such as Zip files, address the problem of carrying metadata between systems.
– An archive collects multiple files together with their metadata into a single file that is compressed and, optionally, encrypted.
– Symlinks are symbolic links that point to another file or directory.
– They provide a way to create shortcuts or references to other files or directories.
– Symlinks can be used to create aliases or alternative names for files or directories.
– They are commonly used in Unix-like operating systems.
– On Unix-like systems, symlinks can be created using the ln command with the -s option.
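The archive approach described in this section, bundling several files with their metadata into one file, can be sketched with Python's standard zipfile module. The helper names bundle and list_metadata and the sample file names are hypothetical; the archive is built in memory here only to keep the example self-contained, and no encryption is applied.

```python
import io
import zipfile


def bundle(files: dict) -> bytes:
    """Pack a mapping of {path: bytes} into a compressed Zip archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            # Each member records its path, size and a timestamp alongside the data.
            archive.writestr(name, payload)
    return buffer.getvalue()


def list_metadata(archive_bytes: bytes) -> list:
    """Return (path, uncompressed size, timestamp) for every member of the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [(info.filename, info.file_size, info.date_time)
                for info in archive.infolist()]


archive_bytes = bundle({"docs/readme.txt": b"hello", "data/blob.bin": b"\x00" * 10})
for path, size, stamp in list_metadata(archive_bytes):
    print(path, size, stamp)
```

The per-member metadata travels inside the single archive file, so it survives a transfer to a system whose file system would not have preserved it.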
A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free.
Some file formats are designed for very particular types of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for different types of multimedia including any combination of audio and video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, including possible control characters, and is encoded in one of various character encoding schemes. Some file formats, such as HTML, scalable vector graphics, and the source code of computer software are text files with defined syntaxes that allow them to be used for specific purposes.