- Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.
- Overview - Apache Parquet
- Docs - Apache Parquet
- apache/parquet-format - contains the specification for Apache Parquet and Apache Thrift definitions to read and write Parquet metadata.
- The Parquet Format and Performance Optimization Opportunities Boudewijn Braams (Databricks)
- physical storage models: row-wise, columnar, hybrid (the logical model is still rows and columns; in the hybrid layout, row groups contain column chunks)
- OLTP (transactional; e.g. customer databases - row-wise) vs OLAP (analytical; e.g. machine learning applications - columnar)
- hybrid physical storage, used by Parquet and ORC, gets the locality benefits of the row-wise model and the I/O benefits of the columnar model
- Parquet files: not necessarily a single file on disk - a logical Parquet file can be a root directory containing either data files directly or subdirectories, with the actual data files in the leaf directories
- Parquet data organisation: hybrid data partitioning - row groups (~128 MB), column chunks within row groups, and pages within column chunks (see the pyarrow sketch after this list)
- Encoding schemes: dictionary encoding combined with run-length encoding, e.g. for countries (only a finite number of countries, bound to repeat) or timestamps
- Page compression: schemes include Snappy, gzip and LZO; compression only makes sense if the saved I/O outweighs the cost of decompression
- optimisation opportunities:
- use the row-group statistics stored in the file footer metadata to quick-check predicates, e.g. for WHERE x > 5, compare against each row group's min/max for x and skip row groups that cannot match
- bake commonly filtered columns into the directory structure (partitioning, e.g. Hive-style key=value directories), so whole directories can be skipped
- avoid having huge files
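A minimal pyarrow sketch of the ideas above (the file names, column names and sizes are made up for illustration): it writes a small Parquet file with explicit row groups, dictionary encoding and Snappy compression, reads it back with a predicate so footer statistics can prune row groups, and writes a Hive-style partitioned copy.

```python
# Sketch only: tiny data so the row-group behaviour is visible.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({
    "country": ["NL", "NL", "US", "US", "DE", "DE"],  # low cardinality -> dictionary + RLE
    "x": [1, 3, 4, 6, 7, 9],
})

pq.write_table(
    table,
    "example.parquet",
    row_group_size=2,        # tiny here; in practice size row groups towards ~128 MB
    compression="snappy",    # other options include gzip, lz4, zstd
    use_dictionary=True,
)

# Predicate pushdown: the dataset reader consults per-row-group min/max
# statistics in the footer and skips row groups that cannot satisfy x > 5.
dataset = ds.dataset("example.parquet", format="parquet")
print(dataset.to_table(filter=ds.field("x") > 5))

# Partitioning: bake a commonly filtered column into the directory structure,
# producing e.g. example_partitioned/country=NL/..., country=US/..., etc.
pq.write_to_dataset(table, root_path="example_partitioned", partition_cols=["country"])
```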
- From Doug Cutting (creator of Hadoop)
- Apache Arrow
- FAQ - Apache Arrow
- Feather File Format — Apache Arrow v20.0.0
- Datasets 🤝 Arrow - Hugging Face datasets
- Implementation Status
- Introduction to Apache Arrow - InfluxData
- it is columnar: Arrow is a language-independent columnar in-memory format, the in-memory counterpart to on-disk columnar formats like Parquet (see the sketch below)
- A deep dive into the Arrow Columnar format with pyarrow and nanoarrow 👈 still to watch
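A small pyarrow sketch (column names invented) of what "columnar in memory" means in practice: data is accessed per column rather than per row, and each column is backed by contiguous buffers that libraries such as Hugging Face datasets can memory-map or pass between processes without copying.

```python
import pyarrow as pa

# Build an Arrow table; each column is stored as its own contiguous array(s).
table = pa.table({"country": ["NL", "US", "DE"], "x": [1, 2, 3]})

# Access is column-wise, not row-wise.
col = table.column("x")
print(col)           # a ChunkedArray backed by contiguous memory buffers
print(table.schema)  # language-independent schema shared across implementations
```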
- Apache Hadoop - The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
- Docs - Apache Hadoop
- Modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
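As a sketch of the MapReduce programming model only (not the Hadoop API itself): word count split into a map phase and a reduce phase, with the shuffle/sort step that Hadoop would perform between distributed tasks simulated locally with a dict.

```python
# Local simulation of MapReduce word count. In Hadoop, the map and reduce
# functions would run as separate tasks scheduled by YARN over HDFS blocks.
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, one per word, as a mapper would."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Group pairs by key and sum the counts, as a reducer would."""
    grouped = defaultdict(int)
    for word, count in pairs:   # stands in for the framework's shuffle/sort
        grouped[word] += count
    return dict(grouped)

if __name__ == "__main__":
    lines = ["hadoop stores data in hdfs", "yarn schedules hadoop jobs"]
    print(reduce_phase(map_phase(lines)))  # {'hadoop': 2, 'stores': 1, ...}
```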
- Libraries - Hugging Face Datasets - the libraries that Hugging Face datasets supports interfaces with, e.g. Argilla - Hugging Face Datasets Libraries
- webdataset/webdataset- A high-performance Python-based I/O system for large (and small) deep learning problems, with strong support for PyTorch.
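A hypothetical webdataset usage sketch, assuming shards named shards/train-{000000..000009}.tar whose samples contain image files plus a .json metadata file per sample (the shard paths and extensions are made up).

```python
import webdataset as wds

dataset = (
    wds.WebDataset("shards/train-{000000..000009}.tar")  # brace-expanded shard list
    .shuffle(1000)                # shuffle within a moving buffer of samples
    .decode("pil")                # decode images to PIL, .json to dicts, etc.
    .to_tuple("jpg;png", "json")  # select fields by extension -> (image, metadata)
)

for image, metadata in dataset:
    # image is a PIL.Image, metadata a dict parsed from the sample's .json file
    break
```

Because the result is a standard PyTorch IterableDataset, it can be wrapped in a torch DataLoader for multi-worker streaming.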
- GraphQL
Concepts
Notes
What are the primary best practices one should follow when designing and defining relational databases and relational database structure? In particular, is there an equivalent concept to that of unit tests for software engineering in the context of databases?
This section is LLM-generated
When designing and defining relational databases and relational database structures, there are several best practices to follow. These include:
- Proper planning: Before creating a database, design a proper Entity-Relationship (ER) model or diagram to represent the structure and relationships between tables. This will help you create a solid foundation for your database.
- Normalization: Normalization is the process of organizing data in a database to minimize redundancy and improve data integrity. There are several normal forms (1NF, 2NF, 3NF, BCNF, etc.), each with specific rules to follow. Aim to reach at least the third normal form (3NF) to ensure a well-designed database.
- Use appropriate data types: Choose the correct data types for each column to ensure data integrity, optimize storage, and improve query performance.
- Primary and foreign keys: Use primary keys to uniquely identify each row in a table and foreign keys to establish relationships between tables. This helps maintain referential integrity (see the schema sketch after this list).
- Indexing: Create indexes on frequently searched or filtered columns to improve query performance. However, use them judiciously, as too many indexes can slow down data modification operations.
- Naming conventions: Follow consistent naming conventions for tables, columns, and indexes to make it easier to understand the database structure and maintain it.
- Use constraints: Apply constraints like NOT NULL, UNIQUE, CHECK, and DEFAULT to maintain data integrity and prevent invalid data from being entered into the database.
- Database views: Use views to simplify complex queries and provide an additional layer of abstraction for users.
- Stored procedures and triggers: Use stored procedures for complex business logic and triggers to maintain data integrity by automatically executing actions based on specific events.
- Backup and recovery: Implement a backup and recovery strategy to ensure data safety and minimize downtime in case of data loss or corruption.
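A minimal sketch of several of these practices using Python's built-in sqlite3 module (SQLite syntax; the customer/customer_order schema is invented for illustration): primary and foreign keys, NOT NULL/UNIQUE/CHECK/DEFAULT constraints, and an index on a frequently filtered column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce foreign keys

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,               -- surrogate primary key
    email       TEXT    NOT NULL UNIQUE,           -- NOT NULL + UNIQUE constraints
    created_at  TEXT    NOT NULL DEFAULT (datetime('now'))  -- DEFAULT value
);

CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),  -- foreign key
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)           -- CHECK constraint
);

-- index on a frequently filtered column (foreign key columns are a common choice)
CREATE INDEX idx_order_customer_id ON customer_order(customer_id);
""")
```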
Regarding the concept of unit tests for databases, there is no single universal equivalent, although database unit-testing frameworks do exist (e.g. pgTAP for PostgreSQL, tSQLt for SQL Server). More broadly, you can test databases through methods such as the following (a minimal pytest-style sketch follows the list):
- Data validation tests: Check if the data in the database conforms to the rules and constraints defined in the schema.
- Functional tests: Test stored procedures, functions, and triggers to ensure they behave as expected.
- Performance tests: Test the database’s performance under different load conditions and optimize it accordingly.
- Integration tests: Test the interaction between the database and the application to ensure proper data flow and handling.
- Database migration tests: Test the migration process from one version of the database schema to another to ensure data integrity and compatibility with the application.
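A hypothetical pytest-style sketch of data-validation tests against the in-memory SQLite schema above, checking that constraints reject invalid rows (the framework choice and all names are illustrative, not a dedicated database-testing tool).

```python
import sqlite3
import pytest

@pytest.fixture
def conn():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE customer_order (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
        );
    """)
    yield conn
    conn.close()

def test_rejects_negative_totals(conn):
    conn.execute("INSERT INTO customer (customer_id, email) VALUES (1, 'a@example.com')")
    with pytest.raises(sqlite3.IntegrityError):  # CHECK constraint violation
        conn.execute("INSERT INTO customer_order (customer_id, total_cents) VALUES (1, -5)")

def test_rejects_orders_for_unknown_customers(conn):
    with pytest.raises(sqlite3.IntegrityError):  # foreign key violation
        conn.execute("INSERT INTO customer_order (customer_id, total_cents) VALUES (999, 100)")
```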
In conclusion, following best practices when designing and defining relational databases will help you create efficient, maintainable, and well-organized databases. Testing your database is also essential to ensure data integrity, functionality, and performance.