Glossary Term
File format
File Format Specifications and Patents
- File formats often have published specifications describing encoding methods
- Specifications enable testing of program functionality
- Some developers view their specifications as trade secrets
- When no specification is published, other developers must reverse engineer the format or acquire the specification document from its developer
- File formats with publicly available specifications are more widely supported
- Patent law, rather than copyright, is more often used to protect a file format
- Some formats use patented algorithms
- The GIF file format relies on the LZW compression algorithm, which was patented until 2004
- The LZW patent expired in the US in mid-2003 and worldwide in mid-2004
- While it was still in force, the patent prompted the development of alternative formats such as PNG
Identifying File Type
- Different operating systems use different approaches to determine file format
- An operating system may need more than one approach to read "foreign" file formats that originate on other systems
- Filename extension is a popular method used by many operating systems
- Filename extension is the portion of the filename after the final period
- Because the number of three-letter extensions is limited, the same extension may be used by more than one program, which can cause confusion
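As a minimal sketch of the extension rule above, the following Python function returns the portion of a filename after its final period. The name filename_extension is hypothetical, and lowercasing the result is an assumption made here for comparison purposes, not a rule of any particular operating system.

```python
import os.path


def filename_extension(filename: str) -> str:
    """Return the text after the final period, lowercased, or '' if there is none."""
    # os.path.splitext splits at the last period only, so "archive.tar.gz"
    # yields ".gz". A leading period (as in ".bashrc") is not treated as
    # the start of an extension.
    _, ext = os.path.splitext(filename)
    # Drop the period itself and normalize case (an assumption of this sketch).
    return ext[1:].lower()


if __name__ == "__main__":
    for name in ["report.final.PDF", "archive.tar.gz", "README", ".bashrc"]:
        print(repr(name), "->", repr(filename_extension(name)))
```

Only the last component counts, so a name with several periods still has a single extension under this rule.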
Internal Metadata and File Headers
- Another way to identify file format is using information stored inside the file itself
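- The information may be written specifically for identification, or it may be a binary string that always appears at a specific location in files of that format
- Internal metadata generally identifies the format more reliably than the filename, because it is part of the file's contents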
- This method is used in addition to other approaches by modern operating systems
- Internal metadata helps in reading and working with foreign file formats
- File headers contain metadata about the file and its contents.
- They are usually stored at the start of the file, but can be present in other areas too.
- Text-based file headers are human-readable and can be examined easily.
- Binary formats usually have binary headers.
- File headers may be used by an operating system to quickly gather information about a file.
- File headers can store information about image format, size, resolution, and color space.
- They can also contain authoring information such as the creator, date, and camera settings.
- Metadata in file headers is used by software during the loading process and afterwards.
- Text-based file headers are larger in size but can be easily examined using simple software.
- Binary headers can require complex interpretation, and editing them incorrectly can corrupt the metadata.
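To illustrate how software might read a binary header, here is a minimal Python sketch that decodes the start of a PNG file: the 8-byte signature followed by the IHDR chunk, which stores width, height, bit depth, and color type. The function name read_png_header and the returned dictionary keys are hypothetical; the sketch ignores the chunk checksum and every later chunk.

```python
import struct

# The fixed 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Decode image size and color information from the first bytes of a PNG."""
    if len(data) < 26:
        raise ValueError("too short to contain a PNG header")
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file: signature mismatch")
    # Each chunk starts with a big-endian 4-byte length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("corrupt header: first chunk is not a 13-byte IHDR")
    # IHDR begins with width and height (4 bytes each), then bit depth
    # and color type (1 byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
    }


if __name__ == "__main__":
    # A hand-built header for a 640x480, 8-bit truecolor image (no pixel data).
    sample = PNG_SIGNATURE + struct.pack(
        ">I4sIIBBBBB", 13, b"IHDR", 640, 480, 8, 2, 0, 0, 0
    )
    print(read_png_header(sample))
```

Because every field sits at a fixed byte offset, a single wrong byte changes the decoded values, which is why careless edits to binary headers can corrupt metadata.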
Magic Numbers and Shebang Lines
- Magic numbers are identifiers stored inside the file itself for file type recognition.
- Any unique feature of a file format can be used as a magic number.
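- Magic numbers identify the file format more reliably than filename extensions, because they are part of the file's contents.
- They can also reveal more precise information about the file; for example, the signatures GIF87a and GIF89a distinguish two versions of GIF.
- Checking magic numbers can be inefficient when displaying large lists of files, because each file must be opened and read.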
- Shebang lines in script files are a special case of magic numbers.
- Shebang lines identify a specific command interpreter and its options.
- The magic number in a shebang line is human-readable text: the two characters #! at the very start of the file.
- Shebang lines are used to execute scripts with the correct interpreter.
- They are commonly used in Unix-based systems.
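The two ideas in the list above can be sketched in Python as follows. The table MAGIC_NUMBERS holds a few well-known signatures and is far from complete; the function names identify and parse_shebang are hypothetical. Splitting the shebang line on whitespace is a simplification assumed here, since real systems differ in how they pass interpreter options.

```python
from typing import List, Optional

# A few well-known signatures found at the start of a file (illustrative, not complete).
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for signature, name in MAGIC_NUMBERS:
        if data.startswith(signature):
            return name
    return None


def parse_shebang(data: bytes) -> Optional[List[str]]:
    """Return the interpreter and its options from a shebang line, if present."""
    if not data.startswith(b"#!"):
        return None
    # Only the first line matters; everything after "#!" names the interpreter.
    first_line = data.split(b"\n", 1)[0]
    parts = first_line[2:].decode("utf-8", errors="replace").split()
    return parts or None


if __name__ == "__main__":
    print(identify(b"GIF89a" + b"\x00" * 10))
    print(parse_shebang(b"#!/bin/sh -e\necho hello\n"))
```

Both functions need only the first few bytes of a file, but those bytes still have to be read from storage, which is the cost noted above for large file listings.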
External Metadata and Symlinks
- External metadata stores information about the file format in the file system.
- This approach keeps metadata separate from the main data and the file name.
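- External metadata is less portable than filename extensions or magic numbers, because it must be converted when a file moves between file systems.
- Archive files such as zip files address this by storing each file's metadata alongside its data.
- A zip file collects multiple files, together with metadata about each file and its directory, into one new file; compression and encryption are optional.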
- Symlinks are symbolic links that point to another file or directory.
- They provide a way to create shortcuts or references to other files or directories.
- Symlinks can be used to create aliases or alternative names for files or directories.
- They are commonly used in Unix-like operating systems.
- Symlinks can be created using the ln command with the -s option (ln -s target link_name); without -s, ln creates a hard link instead.
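The archive and symlink points above can be sketched with Python's standard library. The helper names archive_with_metadata, list_metadata, and make_alias are hypothetical, and the sketch assumes a platform where the current user is allowed to create symbolic links (typical on Unix-like systems).

```python
import io
import os
import tempfile
import zipfile


def archive_with_metadata(files: dict) -> bytes:
    """Pack {name: content} into an in-memory zip; names may include directories."""
    buffer = io.BytesIO()
    # Compression is requested explicitly here; zip archives may also store data uncompressed.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def list_metadata(archive_bytes: bytes) -> list:
    """Return (path, uncompressed size) for each entry recorded in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def make_alias(target: str, link_path: str) -> str:
    """Create a symbolic link (like `ln -s target link_path`) and return what it points to."""
    os.symlink(target, link_path)
    return os.readlink(link_path)


if __name__ == "__main__":
    data = archive_with_metadata({"docs/readme.txt": b"hello", "img/a.bin": b"\x00\x01"})
    print(list_metadata(data))
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "original.txt")
        open(target, "w").close()
        print(make_alias(target, os.path.join(tmp, "alias.txt")))
```

The archive carries each entry's path and size inside the file itself, so that metadata travels with the data regardless of the file system it lands on.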