简单说,DuckDB 是一款在类似于 SQLite 的嵌入到应用内使用的 OLAP 数据库,没有一般的 OLAP 数据库那么“重”,也能够为各种非大数据但是数据量也蛮大的 Dashboard 型的数据场景提供助力。 ![[Quack, World!20240930155201.png]] ### 方便 马上开始使用,比安装各种 \*sql 数据库方便多了 ```shell brew install duckdb duckdb ``` ### SQL in Python - DuckDB can also directly query Pandas DataFrames, Polars DataFrames and Arrow tables. - 我一直用不顺 Pandas(虽然也有 sql 型的查询方式),极大阻碍了我的技能发展,这一次,可能真的不同。 ```python import duckdb # directly query a Pandas DataFrame import pandas as pd pandas_df = pd.DataFrame({"a": [42]}) duckdb.sql("SELECT * FROM pandas_df") # directly query a Polars DataFrame import polars as pl polars_df = pl.DataFrame({"a": [42]}) duckdb.sql("SELECT * FROM polars_df") # directly query a pyarrow table import pyarrow as pa arrow_table = pa.Table.from_pydict({"a": [42]}) duckdb.sql("SELECT * FROM arrow_table") ``` ### SQL 新特性 - USING 设置关联条件。 ```sql 两个表里的 regionkey 是同一个字段 -- return the regions for the nations SELECT n.*, r.* FROM l_nations n JOIN l_regions r USING (regionkey); ``` - Semi and Anti Joins - Semi joins return rows from the left table that have at least one match in the right table. - Anti joins return rows from the left table that have no matches in the right table. - 少写一个 where 啊! - Positional Joins 按记录的顺序拼接列 - Always full outer joins ```sql -- treat two data frames as a single table SELECT df1.*, df2.* FROM df1 POSITIONAL JOIN df2; ``` - GROUP BY ALL 和 ORDER BY ALL - 少做一次 Copy 啊!代码也短一些。 ```sql -- Group by city and street_name to remove any duplicate values SELECT city, street_name FROM addresses GROUP BY ALL; -- GROUP BY city, street_name -- compute the average income per city per street_name -- Since income is wrapped in an aggregate function, do not include it in the GROUP BY SELECT city, street_name, avg(income) FROM addresses GROUP BY ALL; -- GROUP BY city, street_name ``` - 按拼音顺序排列。 ![[Quack, World!20240930155308.png]] - 抽样跑数 - 之前我都是自己 `rand()` 一列,均匀分布地抽取一些数据来处理的。 - 啊,这个“不努力”分布到底张啥样来的? ```sql -- select a sample of 1% of the addresses table using default (system) sampling SELECT * FROM addresses USING SAMPLE 1%; -- select a sample of 1% of the addresses table using bernoulli sampling SELECT * FROM addresses USING SAMPLE 1% (bernoulli); ``` - 展开 list 或者一个 struct,直接 Unnest `SELECT unnest({'a': 42, 'b': 84});` - Qualify 过滤 window function 的结果 - 少套一个 where ```sql -- Filter based on a WINDOW function defined in the SELECT clause SELECT ... row_number() OVER (...) AS function_rank FROM ... QUALIFY function_rank < 3; ``` - filter 写指标更加优雅了。 - 好像也差不多? ```sql -- "Pivot" the data out by year (move each year out to a separate column) SELECT count(i) FILTER (year = 2022) AS "2022", count(CASE WHEN year = 2023 THEN i END) AS "2023" FROM stacked_data; ``` - 一口气取一堆列,懒到家了 ```sql -- 不取 col1 SELECT * EXCLUDE (col1) FROM tbl; -- 只改 col1 SELECT * REPLACE (col1 / 1000 AS col1) FROM tbl; -- 一起来 aggregate SELECT min(COLUMNS(*)), count(COLUMNS(*)) FROM numbers; -- 懒得写字段名,那就 regex 一下 SELECT COLUMNS('(id|numbers?)') FROM numbers; ```