PostgreSQL and Big Data

Alexey Vasiliev, Railsware

PostgreSQL and Big Data

Alexey Vasiliev, Railsware

PostgreSQL and Big Data

Alexey Vasiliev

Chef Book PostgreSQL book

Big Data

What is Big Data?

What is Big Data?

That which doesn't fit in an Excel spreadsheet.

Stéphane Hamel

Any data too big for copy to DVDs and fit into 1 Lada.

Kazakhstan dictionary

When person monitors database, it is called small data and when database monitors person, it is called big data.

Tarun Kumar

What is Big Data?

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

WhatIsBigDataBruce

PostgreSQL

What is PostgreSQL?

PostgreSQL logo PostgreSQL ("Postgres") - is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards-compliance. Based on the SQL language and supports many of the features of the standard SQL:2011. PostgreSQL evolved from the Ingres project at the University of California, Berkeley. In 1982 the leader of the Ingres team, Michael Stonebraker, left Berkeley to make a proprietary version of Ingres. He returned to Berkeley in 1985 and started a post-Ingres project

PostgreSQL strengths

PostgreSQL

PostgreSQL limits

Limit Value
Maximum Database Size Unlimited
Maximum Table Size 32 TB
Maximum Row Size 1.6 TB
Maximum Field Size 1 GB
Maximum Rows per Table Unlimited
Maximum Columns per Table 250-1600 depending on column types
Maximum Indexes per Table Unlimited

Indexes

What Is Indexes

Index use cases

B-Tree

B-Tree

R-Tree

R-Tree

Hash Index

HashIndexes

Bitmap Index

BitmapIndexes

Generalized Search Tree (GiST) index

GISTIndexes

Generalized Inverted (GIN) index

GINIndexes

Block Range (BRIN) Indexes (9.5+)

BRINIndexes

Block Range (BRIN) Indexes

# \dt+ orders
                  List of relations
 Schema |  Name  | Type  | Owner | Size  | Description
--------+--------+-------+-------+-------+-------------
 public | orders | table | leo   | 13 GB |
(1 row)

Block Range (BRIN) Indexes

# EXPLAIN ANALYSE SELECT count(*) FROM orders
    WHERE order_date BETWEEN '2012-01-04 09:00:00'
    and '2014-01-04 14:30:00';
...
Planning time: 0.140 ms
Execution time: 30172.482 ms
(6 rows)

Block Range (BRIN) Indexes

# CREATE INDEX idx_order_date_brin ON orders
 USING BRIN (order_date);
# \di+ idx_order_date_brin
                              List of relations
 Schema |        Name         | Type  | Owner | Table  |  Size  | Description
--------+---------------------+-------+-------+--------+--------+-------------
 public | idx_order_date_brin | index | leo   | orders | 504 kB |
(1 row)

Block Range (BRIN) Indexes

# EXPLAIN ANALYSE SELECT count(*) FROM orders
       WHERE order_date BETWEEN '2012-01-04 09:00:00'
       and '2014-01-04 14:30:00';
...
->  Bitmap Index Scan on idx_order_date_brin
      Index Cond: ((order_date >= '2012-01-04 09:00:00+00'::timestamp with time zone) AND (order_date <= '2014-01-04 14:30:00+00'::timestamp with time zone))
Planning time: 0.108 ms
Execution time: 6347.701 ms
(9 rows)

Block Range (BRIN) Indexes

# CREATE INDEX idx_order_date_brin_32
 ON orders
 USING BRIN (order_date) WITH (pages_per_range = 32);

# CREATE INDEX idx_order_date_brin_512
 ON orders
 USING BRIN (order_date) WITH (pages_per_range = 512);

Block Range (BRIN) Indexes

# \di+ idx_order_date_brin*
                                List of relations
 Schema |          Name           | Table  |  Size
--------+-------------------------+--------+---------
 public | idx_order_date_brin     | orders | 504 kB
 public | idx_order_date_brin_32  | orders | 1872 kB
 public | idx_order_date_brin_512 | orders | 152 kB
(3 rows)

JSONB

JSONB

The only difference between json and jsonb is their storage:

There are 3 major consequences of this:

JSONB

# SELECT '{"name": "Joe", "age": 30}'::jsonb
       || '{"town": "London"}'::jsonb;
                    ?column?
 ----------------------------------------------
  {"age": 30, "name": "Joe", "town": "London"}

JSONB Operators

Operator Description
-> Get an element by key as a JSON object
->> Get an element by key as a text object
#> Get an element by path as a JSON object
#>> Get an element by path as a text object
<@, @> Evaluate whether a JSON object contains a key/value pair
? Evaluate whether a JSON object contains a key or a value
?| Evaluate whether a JSON object contains ANY of keys or values
?& Evaluate whether a JSON object contains ALL of keys or values
|| Insert or Update an element to a JSON object
- Delete an element by key from a JSON object
#- Delete an element by path from a JSON object

Analytics

WITH

# WITH prepared_data AS ( ... )
SELECT data, count(data),
       min(data), max(data)
FROM prepared_data
GROUP BY data;

WITH RECURSIVE

# WITH RECURSIVE t(n) AS (VALUES (1)
UNION ALL SELECT n+1 FROM t WHERE n < 100)
SELECT sum(n) FROM t;

 sum
------
 5050
(1 row)

Window functions

# SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
 depname  | empno | salary |          avg
-----------+-------+--------+-----------------------
 develop   |    11 |   5200 | 5020.0000000000000000
 develop   |     7 |   4200 | 5020.0000000000000000
 develop   |     9 |   4500 | 5020.0000000000000000
 develop   |     8 |   6000 | 5020.0000000000000000
 develop   |    10 |   5200 | 5020.0000000000000000
 personnel |     5 |   3500 | 3700.0000000000000000
 personnel |     2 |   3900 | 3700.0000000000000000
 sales     |     3 |   4800 | 4866.6666666666666667
 sales     |     1 |   5000 | 4866.6666666666666667
 sales     |     4 |   4800 | 4866.6666666666666667
(10 rows)

Window functions

# SELECT salary, sum(salary) OVER (ORDER BY salary) FROM empsalary;
 salary |  sum
--------+-------
   3500 |  3500
   3900 |  7400
   4200 | 11600
   4500 | 16100
   4800 | 25700
   4800 | 25700
   5000 | 30700
   5200 | 41100
   5200 | 41100
   6000 | 47100
(10 rows)

FILTER replaces CASE WHEN

# SELECT
  COUNT(*) AS unfiltered,
  SUM( CASE WHEN i < 5 THEN 1 ELSE 0 END ) AS filtered
FROM generate_series(1,10) AS s(i);

# SELECT
  COUNT(*) AS unfiltered,
  COUNT(*) FILTER (WHERE i < 5) AS filtered
FROM generate_series(1,10) AS s(i);

Ordered-set (WITHIN GROUP) aggregates (9.4+)

Mode() - most common value

# SELECT mode() WITHIN GROUP (ORDER BY some_value)
      AS modal_value FROM tbl;
 mode
------
    9
(1 row)

Rank() - rank of a value in a subset

# SELECT v, RANK() OVER(ORDER BY v) FROM t;
| V | RANK |
|---|------|
| a |    1 |
| a |    1 |
| b |    3 |
| c |    4 |
| c |    4 |
| d |    6 |

Dense_rank() - rank() with no gaps

# SELECT v, DENSE_RANK() OVER(ORDER BY v) FROM t;
| V | DENSE_RANK |
|---|------------|
| a |          1 |
| a |          1 |
| b |          2 |
| c |          3 |
| c |          3 |
| d |          4 |

GROUPING SETS (9.5+)

# SELECT * FROM items_sold;
 brand | size | sales
-------+------+-------
 Foo   | L    |  10
 Foo   | M    |  20
 Bar   | M    |  15
 Bar   | L    |  5
(4 rows)

GROUPING SETS (9.5+)

# SELECT brand, size, sum(sales) FROM items_sold
      GROUP BY GROUPING SETS ((brand), (size), ());
 brand | size | sum
-------+------+-----
 Foo   |      |  30
 Bar   |      |  20
       | L    |  15
       | M    |  35
       |      |  50

CUBE (9.5+)

# SELECT brand, size, sum(sales) FROM items_sold GROUP BY CUBE (brand, size);
 brand | size | sum
-------+------+-----
 Bar   | L    |   5
 Bar   | M    |  15
 Bar   |      |  20
 Foo   | L    |  10
 Foo   | M    |  20
 Foo   |      |  30
       |      |  50
       | L    |  15
       | M    |  35

CUBE (9.5+)

# CUBE(c1, c2, c3)
    (с1, null, null)
    (null, c2, null)
    (null, null, c3)
    (c1, c2, null)
    (c1, null, c3)
    (null, c2, c3)
    (c1, c2, c3)
    (null, null, null)

ROLLUP (9.5+)

# SELECT brand, size, sum(sales) FROM items_sold
      GROUP BY ROLLUP (brand, size);
   brand | size | sum
  -------+------+-----
   Bar   | L    |   5
   Bar   | M    |  15
   Bar   |      |  20
   Foo   | L    |  10
   Foo   | M    |  20
   Foo   |      |  30
         |      |  50

ROLLUP (9.5+)

# ROLLUP(c1, c2, c3, c4)
    (c1, c2, c3, c4)
    (c1, c2, c3, null)
    (c1, c2, null, null)
    (c1, null, null, null)
    (null, null, null, null)

TABLESAMPLE

# EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE SYSTEM ( 0.001 );
                  QUERY PLAN
------------------------------------------------------------------
 Sample Scan (system) on test  (cost=0.00..0.10 rows=10 width=25)
   (actual time=0.029..0.042 rows=136 loops=1)
 Planning time: 0.125 ms
 Execution time: 0.061 ms
(3 rows)

TABLESAMPLE

# EXPLAIN ANALYZE SELECT * FROM test TABLESAMPLE BERNOULLI ( 0.001 );
                  QUERY PLAN
------------------------------------------------------------------
Sample Scan (bernoulli) on test  (cost=0.00..7353.10 rows=10 width=25)
    (actual time=2.779..27.288 rows=15 loops=1)
 Planning time: 0.112 ms
 Execution time: 27.312 ms
(3 rows)

TABLESAMPLE - Sampling methods

Tablesample

TABLESAMPLE REPEATABLE

The optional REPEATABLE clause specifies a seed number or expression to use for generating random numbers within the sampling method. The seed value can be any non-null floating-point value. Two queries that specify the same seed and argument values will select the same sample of the table, if the table has not been changed meanwhile

# SELECT * FROM t TABLESAMPLE SYSTEM (0.001) REPEATABLE (200);

Materialized Views

# REFRESH MATERIALIZED VIEW CONCURRENTLY my_view
MaterializedViews

Sharding

sharding

Yahoo claims 2-petabyte database is world's biggest, busiest

the system is described as an over 2 petabyte repository of user click stream and context data with an update rate for 24 billion events per day

James Hamilton

Petascale SQL DB at Yahoo!

Sharding solutions

Postgres-XC

Postgres-XC

Postgres-XC

Postgres-XC

Pg_shard

CREATE EXTENSION pg_shard;
CREATE TABLE customer_reviews
(
    customer_id TEXT NOT NULL,
    review_date DATE,
    review_rating INTEGER
);
SELECT master_create_distributed_table('customer_reviews', 'customer_id');
SELECT master_create_worker_shards('customer_reviews', 16, 2);
RDS pg_shard

Pg_shard Limitations

Cstore_fdw - columnar store for analytics with PostgreSQL

Cstore_fdw
Cstore_fdw

Foreign table inheritance (9.5+)

CREATE TABLE events (...);
CREATE TABLE events_current () INHERITS(main);
CREATE FOREIGN TABLE events_201501 (
CHECK(20150101 <= event_date AND event_date <= 20150131)
)
INHERITS(main) SERVER cstore_server OPTIONS(compression 'pglz');

Additional Options - "NoSQL" database

NoSQL

Foreign Data Wrappers

FDWs

Foreign Data Wrappers

<Thank You!> Questions?

Contact information

QuestionsSlide