Search Algorithms and Data Structures

Содержание

Слайд 2

About Me

About Me

Слайд 3

RDBMS is … https://www.geeksforgeeks.org/database-file-indexing-b-tree-introduction/

RDBMS is …

https://www.geeksforgeeks.org/database-file-indexing-b-tree-introduction/

Слайд 4

Composite Index CREATE INDEX class_pos_index ON users (class, position);

Composite Index

CREATE INDEX class_pos_index ON users (class, position);

Слайд 5

eComm Facets

eComm Facets

Слайд 6

Inverted Index

Inverted Index

Слайд 7

Lucene, Solr and Elasticsearch http://www.supercoloring.com/coloring-pages/romulus-and-remus-with-the-she-wolf.

Lucene, Solr and Elasticsearch

http://www.supercoloring.com/coloring-pages/romulus-and-remus-with-the-she-wolf.

Слайд 8

The First Indices Общественное достояние, https://commons.wikimedia.org/w/index.php?curid=433142 https://rbscp.lib.rochester.edu/489

The First Indices

Общественное достояние, https://commons.wikimedia.org/w/index.php?curid=433142

https://rbscp.lib.rochester.edu/489

Слайд 9

Term Dictionary rubens rub* [rome TO rustic] *uber *man* By Claudio

Term Dictionary

rubens
rub*
[rome TO rustic]
*uber
*man*

By Claudio Rocchini - Own work, CC BY

2.5, https://commons.wikimedia.org/w/index.php?curid=2118795
Слайд 10

Offsets to Postings List File 10: 8, 9, 10, 14, 18,

Offsets to Postings List File

10: 8, 9, 10, 14, 18, 23,

24, 26, 31, 35; 8: 8, 11, 14, 18, 21, 23, 25, 27; 8: 4, 5, 6, 9, 13, 14, 18, 22; 8: 3, 4, 7, 9, 12, 13, 17, 20; 7: 5, 9, 12, 14, 19, 23, 28; 9: 0, 2, 5, 6, 11, 13, 17, 22, 27
Слайд 11

Postings Codecs delta 8, 9, 10, 14, 18, 23 => 1,1,1,4,4,5

Postings Codecs

delta 8, 9, 10, 14, 18, 23 => 1,1,1,4,4,5
vint 5

=> 000001012 129 => 100000012 000000012
PFOR 0012 0012 0012 1002 1002 1012
Слайд 12

Query Execution

Query Execution

Слайд 13

Stored Field Retrieval {"name": "first doc", "count": 43} {"name": "another doc",

Stored Field Retrieval

{"name": "first doc", "count": 43} {"name": "another doc", "count":5}

{"name": "the last doc", "count":3435} {"name": "the book", "count":-1334}
Слайд 14

Indexing

Indexing

Слайд 15

Index Document curl -X PUT "localhost:9200/twitter/tweet/1" -H 'Content-Type: application/json' -d' {

Index Document

curl -X PUT "localhost:9200/twitter/tweet/1" -H 'Content-Type: application/json' -d'
{
"user" :

"kimchy",
"post_date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch"
}
'
Слайд 16

Mapping curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{ "mappings": {

Mapping

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'{
"mappings": {
"properties":

{
"title": { "type": "text" },
"name": { "type": "text" },
"age": { "type": "integer" },
"created": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis" } } }}'
Слайд 17

Analysis curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d' { "settings":

Analysis

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{ "settings": { "analysis":

{
"analyzer": {
"my_custom_analyzer": { "type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase"
]
} } } }}
'
Слайд 18

In-memory Buffer { "user" : "kimchy", "message" : "trying out Elasticsearch" } user message

In-memory Buffer

{
"user" : "kimchy",
"message" : "trying out
Elasticsearch"
}

user

message

Слайд 19

Segments 1,2,2,3 1 2,2,1

Segments

1,2,2,3

1

2,2,1

Слайд 20

_refresh Elasticsearch: The De€nitive Guide Copyright © 2015 Elasticsearch. All rights reserved.

_refresh

Elasticsearch: The De€nitive Guide
Copyright © 2015 Elasticsearch. All rights reserved.

Слайд 21

_refresh vs _flush Elasticsearch: The De€nitive Guide Copyright © 2015 Elasticsearch. All rights reserved.

_refresh vs _flush

Elasticsearch: The De€nitive Guide
Copyright © 2015 Elasticsearch. All

rights reserved.
Слайд 22

Indexing Performance _bulk Threads ETL is hard

Indexing Performance

_bulk
Threads
ETL is hard

Слайд 23

Searching

Searching

Слайд 24

Result Page { "timed_out": false, "took": 62, "_shards":{ "total" : 10,

Result Page

{
"timed_out": false,
"took": 62,
"_shards":{
"total" : 10,
"successful"

: 1,
"skipped" : 0,
"failed" : 0
},
"hits":{
"total" : {
"value": 23539,
"relation": "eq"
},
"max_score": 1.3862944,
"hits" : [
{
"_index" : "twitter",
"_type" : "_doc",
"_id" : "0",
"_score": 1.3862944,
"_source" : {
"user" : "kimchy",
"date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch",
"likes": 0
}
}, {
"_index" : "twitter",
"_type" : "_doc",
"_id" : "242",
"_score": 0.72944,
"_source" : {
"user" : "foo bar",
"date" : "2009-11-15T14:12:12",
"message" : "searching Elasticsearch",
"likes": 0
}
}
] }}
Слайд 25

Result Page Cropping 10: 8, 9, 10, 14, 18, 23, 24,

Result Page Cropping

10: 8, 9, 10, 14, 18, 23, 24, 26,

31, 35; 8: 8, 11, 14, 18, 21, 23, 25, 27; 8: 4, 5, 6, 9, 13, 14, 18, 22; 8: 3, 4, 7, 9, 12, 13, 17, 20;


O(n log (p))

http://www.cse.hut.fi/en/research/SVG/TRAKLA2/tutorials/heap_tutorial/taulukkona.html

Слайд 26

Слайд 27

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html

Слайд 28

Real Life Usage

Real Life Usage

Слайд 29

Elastic Logstash Kibana (ELK) Stack https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html

Elastic Logstash Kibana (ELK) Stack

https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html

Слайд 30

Scaling https://www.digitalocean.com/community/tutorials/understanding-database-sharding

Scaling

https://www.digitalocean.com/community/tutorials/understanding-database-sharding

Слайд 31

Summary Why search? Inverted Index Indexing Searching Scaling ELK

Summary

Why search?
Inverted Index
Indexing
Searching
Scaling
ELK

Слайд 32

200 threads https://blog.newrelic.com/engineering/expected-distributions-website-response-times/

200 threads

https://blog.newrelic.com/engineering/expected-distributions-website-response-times/

Слайд 33

2,2,1 Index Hierarchy 2,2,1 2,2,1 Segment - inverted index Lucene Index

2,2,1

Index Hierarchy

2,2,1

2,2,1

Segment - inverted index

Lucene Index - chain


Shard

Elasticsearch Index

Index pattern:

access-log-*

access-log-2019-08-06
access-log-2019-08-07
access-log-2019-08-08
….

Слайд 34

An Index is … an derived data structure

An Index is …

an derived data structure

Слайд 35

Слайд 36

Term Frequency red red bar red red red bar red

Term Frequency

red

red bar

red red red bar red