Analysts usually need to access only a few columns at a time. The idea behind column-oriented storage is to not store all values in a row together, but store all the values from each column together. If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query.
Columns can be compressed with bitmap encoding. If the number of rows n is large, the bitmaps contain mostly 0s and are sparse. We can then use run-length encoding to reduce the space even more.
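As a sketch of the idea (illustrative code, not any particular column store's format), a sparse bitmap column can be run-length encoded into (value, run-length) pairs:

```rust
// Run-length encode a bitmap as (value, run_length) pairs.
// Sparse bitmaps (mostly 0s) collapse into a handful of runs.
fn rle_encode(bits: &[u8]) -> Vec<(u8, usize)> {
    let mut runs: Vec<(u8, usize)> = Vec::new();
    for &b in bits {
        match runs.last_mut() {
            // extend the current run if the value repeats
            Some((v, n)) if *v == b => *n += 1,
            // otherwise start a new run
            _ => runs.push((b, 1)),
        }
    }
    runs
}

fn main() {
    // Bitmap for "value X appears in rows 3 and 4" out of 10 rows.
    let bitmap = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0];
    println!("{:?}", rle_encode(&bitmap)); // [(0, 3), (1, 2), (0, 5)]
}
```

Real column stores use more elaborate schemes (e.g. roaring bitmaps), but the space saving comes from the same observation.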
In a column store, it doesn’t necessarily matter in which order the rows are stored. However, we can choose to impose an order, like we did with SSTables previously, and use that as an indexing method.
asm.js is a subset of JS designed so that an engine with asm.js support can do ahead-of-time compilation. Although JIT compilers are good, asm.js provides a model closer to C/C++ by eliminating dynamic type guards, boxed values, and garbage collection.
WebAssembly is a binary format. It's not classic bytecode; it's a binary encoding of the AST that the parser would otherwise compute. It allows the JS engine to skip the parsing step, and it's much more compact than the original JS source. WebAssembly is a misnomer. It's neither web-only nor assembly, but a portable compilation target for languages like C++ and Rust. This means you can write some Rust code, compile it to WebAssembly, and run that code in a WebAssembly virtual machine. This is powerful because you won't have to deal with garbage-collected scripting languages anymore, and can essentially use Rust/C++ as your scripting language.
Scripting languages here means languages used to glue together the high-performance components in your system.
The point of running WebAssembly outside of the browser is the same as running Node on the server. People want to use the languages they're comfortable in (Rust/C++) everywhere.
Swapping a process's memory out to disk was more common on older machines with small memory sizes, or with processes that need lots of memory space (larger than physical memory).
Actual physical memory behaves like a set of windows on a much larger “virtual” memory space, most of which at any given time is actually stored on disk in a special zone called swap space.
P-code (pseudocode) is a hybrid language that uses both compilation and interpretation. P-code languages are like compiled languages in that the source is translated to a compact binary form which is what you actually execute, but that form is not machine code.
Typically, operating systems use two levels of tables for files: a per-process table and a system-wide table.
Each application program must include its own code to interpret an input file to the appropriate structure. However, all operating systems must support at least one structure — that of an executable file — so that the system is able to load and run programs.
The directory can be viewed as a symbol table that translates file names into file control blocks.
The sequence of directories searched when a file is named is called the search path.
A file is an abstract data type defined and implemented by the operating system. It’s a sequence of logical records.
The root partition, selected by the boot loader, contains the operating-system kernel and sometimes other system files, and is mounted at boot time.
The idea of paging is to divide both the virtual and the physical memory space into small, fixed-size blocks. The blocks of the virtual memory space are called pages and the blocks of the physical address space are called frames.
Paging allows starting another instance of a program without performing any defragmentation (which is slow).
Paging has internal fragmentation, which is still better than external fragmentation when using segmentation.
Segmentation uses an individual segment selector register for each active memory region, which is not possible for paging since there are way more pages than registers.
To save space in paging, we use multilevel page tables. The x86-64 architecture uses a 4-level page table and a page size of 4 KiB.
The physical address of the currently active level 4 page table on x86-64 machines is stored in a register called CR3. Each page table entry points to the physical frame of the next-level table. The entry in the level 1 table then points to the mapped frame. Note that all addresses in the page tables are physical instead of virtual, because otherwise the CPU would need to translate those addresses too (causing a never-ending recursion).
A 4-level page table makes the translation of virtual addresses expensive, because each translation requires 4 memory accesses. To improve performance, the x86-64 architecture caches the last few translations in the translation lookaside buffer (TLB).
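The index extraction the hardware does can be sketched in a few lines (illustrative only; real x86-64 additionally requires the unused upper address bits to be sign-extended). With 4 KiB pages, the low 12 bits are the page offset, and each table level consumes 9 bits:

```rust
// Split a virtual address into its four 9-bit page table indices
// (level 1 first) plus the 12-bit offset into the 4 KiB page.
fn page_table_indices(vaddr: u64) -> ([u16; 4], u16) {
    let offset = (vaddr & 0xFFF) as u16; // low 12 bits
    let mut idx = [0u16; 4];
    for level in 0..4 {
        // idx[0] = level 1 table index, ..., idx[3] = level 4 table index
        idx[level] = ((vaddr >> (12 + 9 * level)) & 0x1FF) as u16;
    }
    (idx, offset)
}

fn main() {
    let (idx, offset) = page_table_indices(0xDEAD_BEEF);
    println!("indices {:?}, offset {:#x}", idx, offset); // offset 0xeef
}
```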
Rust ownership is based on resource acquisition is initialization (RAII)
The alloc crate is not available by default in #[no_std] crates because alloc requires a heap allocator, an object that provides the allocate and deallocate functions. Since std is disabled, no global memory allocator is provided by default. If you want dynamic memory in a #[no_std] app, you must provide your own global allocator: add #[global_allocator] to a static variable whose type implements the GlobalAlloc trait.
async/await is an implementation of cooperative multitasking. Threads are an implementation of preemptive scheduling.
Futures translate .await operations by changing the internal state of the Future.
Three possible solutions for dangling pointers with self-referential structs in Futures: 1. Update the pointer on move. 2. Store an offset instead of self-references. 3. Forbid moving the struct. The last option made the most sense for Rust since 1 and 2 incur runtime costs. Thus, pinning was introduced.
Preemptive multitasking is more expensive because each task can be interrupted at arbitrary points in time, even between assembly instructions. This is unlike cooperative multitasking, which can only be paused at predefined states of the state machine with Poll::Pending. So, with preemptive multitasking, the OS must ensure that the complete CPU state, including all registers and stack contents, is restored exactly as it was when the task was paused.
event_loop = inversion_of_control(polling)
A distributed system is a high-level von Neumann machine.
If your Node app has an uncaught exception, don't try to recover; let it crash. Recovering is not safe and your application might end up in a bad state: memory leaks, hanging sockets, etc. It's better to start a new process from scratch.
Prefer native process monitoring like systemd or Docker/k8s native restarts over pm2/forever with Node.
Ideally, a Node logger transport should consume logs in a separate process from the application. Using transports in the same process causes unnecessary load and slows down Node's single-threaded event loop.
In my toy kernel, the boot loader creates the mapping from virtual to physical memory, with some offset leaving space for registers, the bootloader itself, and other data.
Creating a new mapping from a virtual page to a physical frame depends on the virtual page we want to map. In the easiest case, the level 1 page table already exists and we just write a single entry. In the most difficult case, the page is in a memory region for which no level 3 table exists yet, so we need to create new level 3, level 2, and level 1 page tables first.
Dereferencing a null pointer (pointer with address 0) causes a page fault because the virtual page is not mapped to any physical frame.
JSON is a subset of JS.
The benefits of having a well-designed architecture are 1. flexibility 2. fast tests 3. Shared language
async excels at waiting. This could be a network call, a db query, or an RPC call: I/O-bound work, not CPU-bound work. Hence async *await*.
an async runtime in the Rust ecosystem provides a 1. Task scheduler and 2. An I/O runtime
Rust doesn’t ship with a built-in async runtime because of its commitment to zero-cost abstractions.
Rust futures differ from JS promises in that they’re lazy
In Rust, X.await followed by Y.await actually executes our program sequentially. To execute both requests concurrently, we need to spawn another task: (tokio::spawn(X), Y.await).
In Rust, you have to be explicit about tasks because there is no implicit event loop/runtime. Thus, for concurrent flow, you need to call some_rust_rt::spawn(async_fn()), spawning an event loop unit of execution, aka a “task” in Rust land. In Node, the runtime is built into the language, so Node already knows what a “task” is. There is no need to spawn one.
Tokio is a multithreaded async runtime. There are other async runtimes in Rust that are single-threaded, like Node is.
Async executors in Rust essentially provide the spawn method. The implementation of spawn can be thought of as pushing the Future to the executor’s internal event loop, perhaps a VecDeque.
We need async runtimes to execute our futures by calling poll() repeatedly, because Futures in Rust are lazy.
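A minimal executor that does this repeated polling can be built from std alone. This is a sketch using std's Wake trait (real runtimes like Tokio add I/O, timers, and work stealing on top): the waker unparks the blocked thread, and block_on polls in a loop.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// A waker that unparks the thread blocked inside block_on.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Minimal executor: poll the future in a loop, parking between polls.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut: Pin<Box<F>> = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(), // sleep until wake() unparks us
        }
    }
}

fn main() {
    // async blocks are lazy: nothing runs until block_on polls them.
    let answer = block_on(async { 40 + 2 });
    println!("{answer}"); // 42
}
```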
Wakers are created by executors.
In Rust, await doesn’t necessarily yield the current task. It only yields if the Future returns a Poll::Pending. If it returns Poll::Ready, the state of the Future can be advanced. Futures in Rust are poll-based.
We can think of the Rust compiler transforming an async function into a state machine that implements the Future trait. The state of the machine is minimal at each pause point, because the compiler knows exactly when the Future could potentially yield. This is because we’re using cooperative multitasking. It loops, and matches against the state of our Future. In each arm, it tries to advance the state of the Future. To prevent pointer invalidation, Rust uses the Pin type to ensure that futures cannot be moved in memory anymore after they have been polled for the first time.
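A hand-written version of such a state machine might look like this. The names Double, Start, and Waiting are illustrative; the compiler's real output is unnamed and more involved, but the shape (an enum of pause points, advanced inside poll) is the same.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hand-rolled analogue of the state machine generated for roughly:
//   async fn double(x: u32) -> u32 { yield_once().await; x * 2 }
enum Double {
    Start { x: u32 },
    Waiting { x: u32 }, // paused at the await; only `x` is kept alive
    Done,
}

impl Future for Double {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        let this = self.get_mut(); // ok: Double is Unpin
        match this {
            Double::Start { x } => {
                let x = *x;
                *this = Double::Waiting { x }; // advance to the await point
                cx.waker().wake_by_ref(); // ask to be polled again
                Poll::Pending
            }
            Double::Waiting { x } => {
                let x = *x;
                *this = Double::Done; // the awaited future is "ready"
                Poll::Ready(x * 2)
            }
            Double::Done => panic!("polled after completion"),
        }
    }
}

// Drive the machine by hand with a no-op waker: Pending, then Ready.
fn drive() -> u32 {
    struct Noop;
    impl Wake for Noop {
        fn wake(self: Arc<Self>) {}
    }
    let waker = Waker::from(Arc::new(Noop));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Double::Start { x: 21 };
    assert_eq!(Pin::new(&mut fut).poll(&mut cx), Poll::Pending);
    match Pin::new(&mut fut).poll(&mut cx) {
        Poll::Ready(v) => v,
        Poll::Pending => unreachable!(),
    }
}

fn main() {
    println!("{}", drive()); // 42
}
```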
The set of Futures that share the same runtime-provided (tokio et al) Waker is called a “task”. A task can be composed of many futures, but it only polls the top-level Future.
synchronous programming is not a good solution for the C10k problem
a task in async Rust is by default associated with one future, but this future may be a composite future that drives many contained futures: for example, multiple futures joined with join_all(), or two futures executed in series using the and_then() combinator from the futures crate.
OS system calls support passing flags (options) to not block, namely O_NONBLOCK. Calling read(fd) on a socket/file which has nothing to read will return an EWOULDBLOCK error.
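This is observable from Rust's std, which exposes the same flag via set_nonblocking. A sketch: connect a loopback socket with no pending data, and the read errors with WouldBlock instead of blocking.

```rust
use std::io::{ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

// Returns true if reading an empty nonblocking socket yields WouldBlock.
fn read_would_block() -> std::io::Result<bool> {
    // Listen on an ephemeral localhost port and connect to ourselves.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    let _server = listener.accept()?; // keep the peer alive

    // With O_NONBLOCK set, read() returns immediately instead of blocking.
    client.set_nonblocking(true)?;
    let mut buf = [0u8; 16];
    Ok(matches!(client.read(&mut buf),
        Err(e) if e.kind() == ErrorKind::WouldBlock))
}

fn main() -> std::io::Result<()> {
    println!("would block: {}", read_would_block()?);
    Ok(())
}
```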
you can think of Futures as one layer, and the integration with system I/O as another.
epoll/kqueue/IOCP et al. are blocking calls that return the file descriptors ready for the operations you registered, roughly epoll(Array<(fd, op)>) -> Array<fd>. You might be interested in when fd 4 is ready for reading, and when fd 7 is ready for writing. These polling calls are system calls.
Tokio has a thread that is responsible for looping over all of the I/O it’s waiting for. This thread passes a Collection<(FD, Operation), Task> to epoll, and for every source that’s ready to be acted on, it wakes the corresponding task.
You can think of tokio::spawn(fut) as pushing fut onto Tokio’s internal list of futures to execute. Previously, it added the future to the worker queue of the thread you’re currently on. In a multithreaded executor, it places it on a random thread instead.
Below a certain level of sugar in Rust, there are no for loops. It’s all just iterators.
Use associated types over generics when there is only one implementation of a trait for a struct.
Rust’s std::mpsc is unidirectional (senders can only send, receivers can only receive).
A Mutex is a boolean Semaphore.
If a thread panicked while holding a lock, the next thread that tries to lock the Mutex will receive a PoisonError containing the guard.
Channels are typically implemented with ring buffers (VecDeque).
async/await is for when you’re I/O bound rather than CPU bound. It’s so you don’t have to have a thousand threads running.
Channels can be implemented with Mutexes and Condvars.
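A toy version of such a channel, assuming an unbounded queue (std::sync::mpsc is more sophisticated, but the core mechanism is the same): a Mutex-protected VecDeque, plus a Condvar that receivers sleep on until a sender notifies them.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

struct Channel<T> {
    queue: Mutex<VecDeque<T>>,
    ready: Condvar,
}

impl<T> Channel<T> {
    fn new() -> Self {
        Channel { queue: Mutex::new(VecDeque::new()), ready: Condvar::new() }
    }
    fn send(&self, item: T) {
        self.queue.lock().unwrap().push_back(item);
        self.ready.notify_one(); // wake one sleeping receiver
    }
    fn recv(&self) -> T {
        let mut q = self.queue.lock().unwrap();
        // wait() releases the lock while sleeping, reacquires it on wake
        while q.is_empty() {
            q = self.ready.wait(q).unwrap();
        }
        q.pop_front().unwrap()
    }
}

fn main() {
    let ch = Arc::new(Channel::new());
    let tx = Arc::clone(&ch);
    thread::spawn(move || tx.send(42));
    println!("got {}", ch.recv()); // got 42
}
```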
Rc<T> (reference counted) is itself cheaply clonable: Rc::clone bumps a count instead of copying T, so many owners can share one value. Note that Rc is single-threaded only; to share across threads you need Arc.
std::sync::mpsc has two types of sender: Sender, an asynchronous one, and SyncSender. Sender is not asynchronous in the async/await sense. Rather, it means the sender and receiver are not in sync; there is no back pressure. SyncSender ensures the sender and receiver operate in lockstep, and provides back pressure. Better names would be unbounded and bounded channels.
The common flavours of channels are: 1. Synchronous aka bounded (send() blocks, limited capacity). 2. Asynchronous aka unbounded (send() cannot block, unlimited capacity). 3. Rendezvous (synchronous with capacity = 0, used for thread synchronization). Thread synchronization is often achieved with a barrier, but you can do it with a channel as well. 4. Oneshot (any capacity; in practice, only one call to send()).
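std's sync_channel with capacity 0 gives the rendezvous flavour. This sketch shows the sender parking inside send() until the receiver arrives at recv():

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

fn rendezvous_demo() -> &'static str {
    // Capacity 0: every send() blocks until a receiver is at recv().
    let (tx, rx) = sync_channel::<&str>(0);
    let sender = thread::spawn(move || {
        // blocks until the main thread reaches rx.recv()
        tx.send("hello").unwrap();
    });
    thread::sleep(Duration::from_millis(50)); // sender is parked in send()
    let msg = rx.recv().unwrap();             // both sides meet here
    sender.join().unwrap();
    msg
}

fn main() {
    println!("{}", rendezvous_demo()); // hello
}
```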
a socket in user land is simply a file descriptor referring to a file-like object in kernel land.
MAC addresses are burned-in (hardware) addresses on NICs
network nodes with multiple network interfaces (routers, switches) must have a unique MAC address for each NIC on the network. However, two NICs connected to two different networks can have the same MAC address.
the most significant bits of an IP address are the network prefix, which identifies a whole network or subnet. The least significant bits form the host identifier.
A mask lets you define the separation between the network portion and the host portion of an IP address.
When a host wants to send something to another host, it masks the destination IP address to see if it is on the same network as itself. If it is, it just sends the traffic directly to the host. If not, the host sends the traffic to the configured gateway. In both cases delivery on the local link uses layer 2 (MAC addresses). If the destination is on the same network, private IPs can be used to communicate; if it’s on another network, public IPs are used.
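The masking step is just a bitwise AND on both addresses (a sketch; same_network is a made-up helper, not an OS API):

```rust
// Apply the subnet mask to both addresses; if the network parts match,
// the destination is local and traffic goes directly, otherwise it
// goes to the gateway.
fn same_network(a: [u8; 4], b: [u8; 4], mask: [u8; 4]) -> bool {
    (0..4).all(|i| a[i] & mask[i] == b[i] & mask[i])
}

fn main() {
    let mask = [255, 255, 255, 0]; // a /24 network
    println!("{}", same_network([192, 168, 1, 10], [192, 168, 1, 77], mask)); // true
    println!("{}", same_network([192, 168, 1, 10], [10, 0, 0, 1], mask));     // false
}
```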
Classful addressing wasted IP addresses. Class C has a default subnet mask of 255.255.255.0, with 24 bits for the network and the last 8 bits for the host. With 8 bits for the host you could have 2^8 - 2 = 254 addresses on the same network, which is often far more than needed.
Subnetting solves the above issue by basically moving the boundary between the host and network part of the address. By decreasing the number of addresses on each network, we can increase the number of networks.
There’s still a problem with subnets though. We have way too few addresses for every device to get its own, which is why we have NAT. NAT works by letting multiple private addresses share a single public IP address.
IP protocol is basically you naming the remote machine
The internet protocol implements two basic functions: addressing and fragmentation.
The internet protocol uses four key mechanisms in providing its service: type of service, time to live, options, and header checksum.
Internet Protocol: https://tools.ietf.org/html/rfc791
Fragmentation of an internet datagram is necessary when it originates in a local net that allows a large packet size which then must traverse a local net that limits packets to a smaller size.
An internet datagram can be marked as “don’t fragment”
All hosts must be prepared to accept datagrams of up to 576 octets (bytes).
Since some header fields change (e.g. TTL), the header checksum on an IP datagram is recomputed and verified at each point the datagram is processed.
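The checksum itself is the RFC 1071 internet checksum: a one's-complement sum over the header's 16-bit words, folded and inverted. A sketch, checked against the well-known example header (with the checksum field zeroed):

```rust
// RFC 1071 internet checksum over an IP header.
fn ip_checksum(header: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    for chunk in header.chunks(2) {
        // pad a trailing odd byte with zero
        let word = u16::from_be_bytes([chunk[0], *chunk.get(1).unwrap_or(&0)]);
        sum += u32::from(word);
    }
    // fold the carries back into the low 16 bits (one's-complement sum)
    while sum > 0xFFFF {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    !(sum as u16)
}

fn main() {
    // Example header with the checksum field (bytes 10-11) zeroed.
    let header = [
        0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00, 0x40, 0x11,
        0x00, 0x00, // checksum placeholder
        0xc0, 0xa8, 0x00, 0x01, 0xc0, 0xa8, 0x00, 0xc7,
    ];
    println!("{:#06x}", ip_checksum(&header)); // 0xb861
}
```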
Routers examine layer 3 packets encapsulated inside layer 2 frames for network information, and then direct them out the appropriate interface according to their destination. In contrast, a switch looks only at the layer 2 MAC address to determine a frame's destination.
When a computer resolves the IP after a DNS lookup, if the IP is not on the same network as itself, it’ll send the packets to its default gateway. The default gateway is usually a router which contains a routing table. If the IP address isn’t in the routing table (most likely) it’ll send the packets to the default route which is most likely your ISP.
The TCP protocol provides basic data transfer, reliability, flow control, multiplexing, connections, and precedence/security.
Just like von Neumann machines, networks are fractal too. Your home devices form a LAN which sit behind a common router/gateway/hub. Many LANs form a WAN, and so on.
The IP datagrams are unwrapped from local packets and examined to determine through which network the internet datagram should travel to next. The IP datagram is then wrapped in a new local packet which is sent to the next gateway or destination.
Address Resolution Protocol (ARP) is a procedure for mapping an IP address to a physical machine address, like a MAC address, in a LAN.
money is incidentally an asset
money is a lubricant, used to facilitate transactions
you want that lubricant to be invisible and frictionless
barter -> metallic coins -> paper fiat money -> cheque/mobile phone -> cashless society
what keeps currency valuable 1. Trust in government restraint from printing 2. Being able to pay taxes
hyperinflation is almost always because of political and social collapse
Paul Krugman says gold and $100 bills are backstopped with value. You can use gold for jewelry and teeth, and a $100 bill can be broken into $20s. If everyone woke up tomorrow and said the $1 bill was useless, the government would say: you owe us $X in fiat money. If everyone woke up tomorrow and said bitcoin was useless, then it would be useless.
hyperinflation is inflation of at least 50% per month
placing trust in a single party is efficient but risky
As Nobel-laureate Robert Shiller observes: "Gold is a bubble, but it's always been a bubble. It has some industrial uses, but basically it's like a fad that's lasted thousands of years."
We can think of money as a bubble that never pops (or that hasn’t popped yet) and the value of fiat currency, gold, or Bitcoin as relying on collective belief.
The idea of a fiat currency like the US Dollar being untethered to gold is itself a recent phenomenon that seemed unthinkable half a century ago.
Eventually the Bitcoin supply approaches its asymptote and miners must be compensated primarily with transaction fees rather than block rewards.
ethereum is a distributed computer.
decentralization is important because it eliminates single points of failure.
a port number is 2 bytes, hence up to 65,536 ports.
cat /etc/services prints the well-known mapping from service names to port numbers (not which services are currently listening).
ports under 1024 are often considered special, and usually require special OS privileges to use.
Big endian order is also called network byte order because that’s the byte order network protocols use on the wire.
Little endian is also called host byte order because many hosts (e.g. Intel CPUs) store multi-byte values in little endian order.
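Rust's to_be_bytes/to_le_bytes make the two orders easy to see (to_be() plays the role of C's htons here):

```rust
fn main() {
    let port: u16 = 8080; // 0x1F90
    // Network byte order (big endian): most significant byte first.
    assert_eq!(port.to_be_bytes(), [0x1F, 0x90]);
    // Host byte order on x86 (little endian): least significant first.
    assert_eq!(port.to_le_bytes(), [0x90, 0x1F]);
    println!("ok");
}
```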
Your firewall is doing NAT.
A computer can have more than one network interface (WiFi card, ethernet card). Each interface will have a separate address.
/etc/hosts contains the mapping from hostnames like “localhost” to IP addresses.
If a private key controlling unspent bitcoins is compromised or stolen, the value can only be protected if it is immediately spent to a different output which is secure.
Routers are just weak computers with big network cards.
The most private and secure way to use bitcoin is to send a brand new address to each person who pays you. After the received coins have been spent the address should never be used again. Also when sending money to people always ask them for a brand new bitcoin address.
Despite the name, paper wallets are not actually wallets. They only store the private keys and addresses, and cannot tell users if they have actually received bitcoins and in what quantity.
Handing 2.5 percent to banks to move bits around the Internet is the worst possible choice.
Crypto networks prevent the bait-and-switch that centralized entities pull, like Govopoly.
First era of the internet: 1980s-2000s, decentralized protocols built by the community. Second era of the internet, mid 2000s to 2018: centralized applications (“fat apps” like FANG) built on these open protocols and captured all the value. We are now in the third era of the internet, with the fifth protocol: Bitcoin. The design of crypto will make it so that, going forward, centralized systems no longer win by default.
a stablecoin claims to be an asset that prices itself, rather than an asset that is priced by supply and demand.
monetary policy is generally quicker to implement as interest rates can be set every month.
Keynesian economics is macroeconomics such as monetary + fiscal policy.
Taxes transfer wealth.
According to the proponents of the chartalist theory of money creation, taxes are not needed for government revenue, as long as the government in question is able to issue fiat money. Really, taxes are about wealth transfer.
Coins and banknotes are usually defined as legal tender in many countries, but personal cheques, credit cards, and similar non-cash methods of payment are usually not.
The right, in many jurisdictions, of a trader to refuse to do business with any person means that a would-be purchaser may not force a purchase merely by presenting legal tender, as legal tender only must be accepted for debts already incurred.
In political systems based on the principle of separation of powers, authority is distributed among several branches (legislative, executive, judicial) LEJ
lobbying is seen as one of the causes of a democratic deficit.
A leveraged buyout works like this: you use the thing you buy as collateral to get a loan to buy it.
Bitcoin consists of: 1. A decentralized peer-to-peer network (the bitcoin protocol) 2. A public transaction ledger (the blockchain) 3. A set of rules for independent transaction validation and currency issuance (consensus rules) 4. A mechanism for reaching global decentralized consensus on the valid blockchain (Proof-of-Work algorithm)
Market makers are typically large investment firms or financial institutions that create liquidity in the market.
Loyalty is macroeconomy.
Crypto adds internet of value. We now have internet of information AND internet of value.
Trust is native to the Bitcoin protocol. It’s called the Trust Protocol.
Bitcoin is the internet of money. Currency is only the first application.
Bitcoin is not anonymous, but it prevents total totalitarian surveillance. Transactions can be observed, but they cannot easily be attached to identities. There’s privacy, not secrecy.
There are roughly as many possible private keys in Bitcoin as there are atoms in the universe.
Quoting Paul Graham “It seemed to me that the web would be a big deal. I'd seen what graphical user interfaces had done for the popularity of microcomputers. It seemed like the web would do the same for the internet.”
Bitcoin is a value-based system that keeps track of value. This is the opposite of an account-based system.
A distributed system is multiple networked cooperating computers.
A distributed system is a von Neumann machine with a network in the middle of processes.
Why do we need distributed systems? 1. Connect physically separated machines -> allows sharing 2. Increase capacity through parallelism 3. Tolerate faults 4. Achieve security via isolation
Distributed systems historical context: 1. Distributed systems started when local area networks appeared (1980s); main examples were DNS and email. 2. Rise of datacenters with big websites (1990s); main examples were web search and shopping. 3. Cloud computing (2000s). 4. Current state of dist. sys: active development.
Why are distributed systems hard? Three main things drive complexity. 1. Many concurrent parts. 2. Must deal with partial failure. 3. Tricky to realize the performance benefits.
Why are systems hard? Because they are complex.
How do we mitigate complexity? With design principles such as modularity and abstraction. This minimizes the number of interconnected components.
How do we enforce modularity? One way is to use the client/server model.
The client/server model helps with security (clients can’t overwrite each other’s memory) and with fault isolation (one side crashing doesn’t crash the other).
In a system, you have to make tradeoffs between scalability, fault tolerance, reliability, security, and performance.
Operating systems enforce modularity in a *single* machine.
Different browsers have different procedures for getting a root CA accepted. Chrome, for example, uses the trust store of the operating system. Firefox, on the other hand, maintains its own trust store.
With SSL, if an attacker intercepts a client’s requests, nothing bad happens: the client uses the real public key of the server to encrypt the message, which is useless to the attacker. If the client is somehow tricked into trusting a certificate and public key whose private key is controlled by an attacker, trouble begins.
At the root of every chain of trust lies an implicitly trusted CA.
The two fundamental ways in which distributed computing differs from single-machine computing are: 1. No shared memory. 2. No shared clock.
Tokens will break down the barrier between professional investors and token buyers in the same way that the internet brought down the barrier between professional journalists and tweeters and bloggers.
When your web service needs to support millions of users, your first reaction should be to scale the number of servers you use. They still need to see the same data so all the web servers hit the same database. Eventually though, as the web servers scale, the database becomes the bottleneck. It’s a fair amount of work (hard) to scale storage services.
Consistency can be strong or weak. Strong consistency is very expensive to implement. The reader or writer has to perform a lot of communication. Weak systems are more attractive for that reason.
Map reduce was made at Google because they didn’t want engineers sitting around waiting for jobs to finish. They wanted the engineers to write the guts of the application, and be able to run it on 1000s of computers without worrying. Easy for non-specialists to write and run giant distributed computations.
An entire MapReduce computation is called a job and is made up of map and reduce tasks.
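Word count is the canonical example. This single-process sketch (map_task, shuffle, and reduce_task are illustrative names, not the paper's API) shows how a job decomposes into map tasks, a shuffle that groups by key, and reduce tasks; a real cluster distributes these over many workers.

```rust
use std::collections::HashMap;

// map_task emits (word, 1) pairs from one input split
fn map_task(doc: &str) -> Vec<(String, u64)> {
    doc.split_whitespace().map(|w| (w.to_string(), 1)).collect()
}

// the shuffle groups all values for a key together
// (over the network, in a real cluster)
fn shuffle(pairs: Vec<(String, u64)>) -> HashMap<String, Vec<u64>> {
    let mut groups: HashMap<String, Vec<u64>> = HashMap::new();
    for (k, v) in pairs {
        groups.entry(k).or_default().push(v);
    }
    groups
}

// reduce_task folds one key's values into a single result
fn reduce_task(values: &[u64]) -> u64 {
    values.iter().sum()
}

// the whole job: map tasks, a shuffle, then reduce tasks
fn word_count(docs: &[&str]) -> HashMap<String, u64> {
    let pairs: Vec<(String, u64)> = docs.iter().flat_map(|d| map_task(d)).collect();
    shuffle(pairs)
        .into_iter()
        .map(|(k, vs)| (k, reduce_task(&vs)))
        .collect()
}

fn main() {
    let counts = word_count(&["a b a", "b a"]);
    println!("{:?}", counts); // a appears 3 times, b twice
}
```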
A reduce worker will go and talk to all the map tasks and ask them to send, over the network, all instances of the keys that the master assigned to it.
Input and output in map reduce was first stored in NFS. This was expensive because a lot of data was being sent over the network. Google wanted to minimize network communication. So they ran GFS and map tasks on the same machines. The master server will figure out what GFS server holds what input, and then send correct map tasks to the same server.
The transformation from row-oriented data (map output) to column-oriented data (grouped by key) is called a shuffle. To shuffle, you need to use the network, as all the values for a specific key have to be sent to one reduce task.
You might think we can apply the same trick to the output: store the output of a reduce task on the GFS server running on the same node. But GFS keeps two or three copies on different servers for fault tolerance, so network communication still has to happen.
The MapReduce library is designed to help process very large amounts of data using hundreds of thousands of machines, so the library must tolerate machine failures gracefully.
To detect worker failure, the master in MapReduce pings every worker periodically. If no response is received from a worker within a certain amount of time, the master marks the worker as failed.
When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.
Since there’s only one master in MapReduce, failure is unlikely. The paper’s implementation aborts the MapReduce computation if the master fails.
In practice, we tend to choose M so that each individual task is roughly 16 MB to 64 MB of input data.
Google often performs MapReduce computations with M = 200,000 and R = 5,000, using 2,000 worker machines
MapReduce was first made to rewrite and speed up Google’s indexing service.
An OS thread’s stack is around 1 MB - 8 MB (e.g. 8 MB by default on Linux), which is part of why thousands of threads get expensive.
Some assembly instructions are atomic, some aren’t.
Why would you make a system distributed? 1. It’s inherently distributed 2. For better reliability 3. For better performance 4. To solve bigger problems
The main cons of distributed systems are that communication can fail, processes may crash, and all of this may happen nondeterministically.
Even though some RPC frameworks make calls look like local function calls, you’re still using the network under the hood. This means the network could fail, or the machine at the other end of the wire could fail. Your process (client) won’t know which, so you can’t program exactly the same as you would with a local function.
The Two Generals problem presents the problem of no common knowledge. No matter how many finite messages are sent back and forth, we can never have absolute certainty. You might be able to increase probability, but certainty is impossible.
Byzantine generals problem assumes messages are reliable, but generals might be traitors.
The solution of the Byzantine generals problem is for the honest generals to agree on a plan beforehand.
Theorem: with at most x generals behaving maliciously, we need 3x + 1 generals in total to tolerate them. This means we can tolerate fewer than 1/3 of generals being malicious.
The Byzantine generals problem basically describes a system where participants don’t trust each other.
The Two Generals problem is when nodes are honest, but messages may get lost.
In real systems, both nodes and networks may be faulty. Real systems face both the Two Generals and Byzantine generals problems.
Network behaviour: links can be reliable (perfect), fair-loss (messages may be lost, but if you keep retrying, a message eventually gets through), or arbitrary (a malicious adversary may interfere with messages).
We can turn fair-loss links into reliable links with retry and deduplication.
We can turn arbitrary links into fair-loss links with TLS.
Node behaviour: crash-stop (after a crash, dead forever), crash-recovery (resumes after a crash, but in-memory state is lost), or Byzantine (deviates arbitrarily from the algorithm).
Systems timing models can either be synchronous, partially synchronous, or asynchronous. Most real-life systems are partially synchronous. Usually the system is well behaved except for a few times.
So the system models are: network: reliable, fair-loss, or arbitrary; nodes: crash-stop, crash-recovery, or Byzantine; timing: synchronous, partially synchronous, or asynchronous. These are the assumptions you need to state for any distributed algorithm. If your assumptions are wrong, your algorithm will mess up.
Failure detectors are usually implemented with messages and timeouts. We label nodes as crashed if there’s no reply within some timeout. The problem, however, is that you cannot tell the difference between a crashed node, a temporarily unresponsive node, a lost message, or a delayed message.
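A sketch of such a timeout-based detector, using a channel as the "network" and suspect_crashed as a made-up helper. Note that a missing heartbeat and a slow heartbeat look identical to it, which is exactly the limitation described above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Suspect the "node" has crashed if no heartbeat arrives within the timeout.
fn suspect_crashed(heartbeat_after: Option<Duration>, timeout: Duration) -> bool {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        if let Some(delay) = heartbeat_after {
            thread::sleep(delay);
            let _ = tx.send("heartbeat");
        }
        // if heartbeat_after is None, the "node" crashed: no message ever
    });
    rx.recv_timeout(timeout).is_err() // Err => suspected crashed
}

fn main() {
    let timeout = Duration::from_millis(200);
    // healthy node: heartbeat well within the timeout
    println!("{}", suspect_crashed(Some(Duration::from_millis(5)), timeout)); // false
    // crashed (or merely slow!) node: no heartbeat in time
    println!("{}", suspect_crashed(None, timeout)); // true
}
```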
Failure detectors are only perfect in a synchronous, crash-stop system with reliable links. In other systems that are more reflective of real life, failure detectors can only be eventually perfect.
Distributed systems often need to measure time.
There are two types of clock. Physical clock and logical clock.
In digital electronics, a clock produces a signal that pulses between 1 and 0. In distributed systems, a clock is something you can ask for a timestamp.
We have two definitions of time: astronomical and quantum-mechanical (atomic clocks). They don’t exactly match up. The compromise is UTC, which is corrected to match the earth’s rotation using leap seconds. Every year on June 30 and December 31, one of three things happens: the clock immediately jumps to 00:00:00, skipping one second; it moves to 00:00:00 after one second, as usual; or the clock moves to 23:59:60 after one second, and then to 00:00:00 after one further second. This is announced several months beforehand. Astronomers decide, based on how to keep UTC in sync with astronomy.
The two most common representations of time are Unix time (seconds in UTC, not counting leap seconds) and ISO 8601 (year, month, day, hour, minute, second, and timezone offset relative to UTC).
Software often ignores leap seconds. But OSes and distributed systems often need timings with sub-second accuracy. The pragmatic solution is to “smear” a leap second: spread it out over the course of a day.
Due to clock drift, a clock’s error gradually increases. We end up with clock skew: the difference between two clocks at a point in time.
In asynchronous or partially synchronous networks, it’s impossible to reduce clock skew to 0.
We do clock synchronization with the Network Time Protocol (NTP). A server has an accurate clock source (atomic clock, or GPS receiver), and clients can query the server’s current time. Most operating systems have an NTP client built in.
Clock servers are arranged in a hierarchy of strata. Stratum 0: atomic clock or GPS receiver. Stratum 1: servers synced directly with a stratum 0 device. Stratum 2: servers that sync with stratum 1, etc.
It’s possible to reduce clock skew to a few milliseconds over good network connections. But accuracy can be much worse.
When an NTP client queries an NTP server, we can estimate the server’s time with round-trip delay calculations.
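A minimal sketch of that estimate (the four timestamps are the standard NTP ones; the function name is made up):

```python
def ntp_estimate(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client receive."""
    delay = (t4 - t1) - (t3 - t2)           # time spent on the network
    offset = ((t2 - t1) + (t3 - t4)) / 2    # estimated client-vs-server skew
    return delay, offset
```

The offset formula assumes the network delay is symmetric, which is why NTP can’t be perfectly accurate.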
When measuring the speed of your program with standard system time calls, don’t use the time-of-day clock, because an NTP client may step it mid-measurement. Use the monotonic clock.
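For example, in Python, time.monotonic() is the monotonic clock and time.time() is the time-of-day clock (the timed helper here is just illustrative):

```python
import time

def timed(fn):
    # The monotonic clock never jumps backwards or forwards when
    # NTP steps the time-of-day clock, so durations stay correct.
    start = time.monotonic()
    fn()
    return time.monotonic() - start
```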
In a distsys, when we use timestamps to order messages, clock skew can be greater than network delay. A message sent first can then carry a later timestamp than a message sent after it, so the order appears reversed.
We say event a happens before event b iff: (1) a and b occurred at the same single-threaded node, and a occurred before b in the node’s local execution order; or (2) a is the sending of some message m, and b is the receipt of that same message m; or (3) there exists some event c such that a -> c and c -> b.
The happens-before relation is a partial order: it is possible that neither a -> b nor b -> a. In that case, a and b are concurrent (written a || b).
Physical clocks count the number of seconds elapsed. Logical clocks count the number of events that occurred.
There are two types of logical clocks: Lamport clocks and vector clocks.
In a Lamport clock system, each node maintains a counter t, incremented on every local event e. Let L(e) be the value of t after that increment. On sending a message, attach the current t; on receiving a message with timestamp t', set t = max(t, t') + 1.
If a-> b, then L(a) < L(b). But, L(a) < L(b) does not imply a->b.
We can define a total order on Lamport timestamps (breaking ties by node id), and it is consistent with causality: (a -> b) ==> (a < b) (total order symbol).
Given Lamport timestamps L(a) and L(b) such that L(a) < L(b), we can’t tell whether a->b or a||b.
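A minimal Lamport clock sketch (class and method names are made up):

```python
class LamportClock:
    def __init__(self):
        self.t = 0

    def tick(self):
        # local event or message send: increment and return the timestamp
        self.t += 1
        return self.t

    def recv(self, msg_t):
        # message receive: jump past the sender's timestamp
        self.t = max(self.t, msg_t) + 1
        return self.t
```

Note that two nodes can produce the same timestamp for concurrent events, which is exactly why L(a) < L(b) alone can’t distinguish a -> b from a || b.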
If we want to detect which events are concurrent, we need vector clocks.
The vector timestamp of an event e represents the set containing e and its causal dependencies. For example, [2, 2, 0] represents the first two events from node A, the first two events from node B, and no events from node C.
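A sketch of comparing vector timestamps (assumes equal-length vectors; helper names are made up):

```python
def leq(u, v):
    # u <= v element-wise: u's event set is contained in v's
    return all(x <= y for x, y in zip(u, v))

def relation(u, v):
    if u == v:
        return "equal"
    if leq(u, v):
        return "happens-before"
    if leq(v, u):
        return "happens-after"
    return "concurrent"      # neither contains the other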
Broadcast order can be FIFO, causal, total order, or FIFO-total order.
In order of strength: best-effort -> reliable -> FIFO -> causal -> FIFO-total order. Total order is a parallel branch: reliable -> total order -> FIFO-total order. (FIFO-total order implies both causal and total order.)
In broadcast algorithms, first transform best-effort broadcast into reliable by adding retransmit functionality. Once you have a reliable broadcast, you can enforce delivery order.
Eager reliable broadcast is the idea that the first time a node receives a particular message, it re-broadcasts to each other node. Reliable, but expensive. Up to O(n^2) messages for n nodes.
Gossip protocols are the idea that when a node receives a message for the first time, it forwards it to 3 other nodes, chosen randomly. The message eventually reaches all nodes (with high probability).
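A toy simulation of the gossip rule (node ids, fanout, and seeding are all made up; real gossip runs over a network, not a loop):

```python
import random

def gossip_sim(n, fanout=3, seed=42):
    random.seed(seed)
    informed = {0}          # node 0 starts with the message
    frontier = [0]
    while frontier:
        node = frontier.pop()
        # on first receipt, a node forwards to `fanout` random peers
        for peer in random.sample(range(n), min(fanout, n)):
            if peer not in informed:
                informed.add(peer)
                frontier.append(peer)
    return len(informed)
```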
Replication is easy if the data doesn’t change: just copy it. Replication is hard when data changes.
One form of replication on single machines is RAID. RAID has a single controller but in a distributed system each node acts independently.
If we represent data as a set, then operations performed on that set are idempotent. A second request to add a like to a set is ok.
There are different retry semantics: at-most-once, at-least-once, and exactly-once. At-most-once: send the request, don’t retry; the update may not happen. At-least-once: retry the request until acknowledged; the update may be repeated. Exactly-once: retry, plus idempotence or deduplication.
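A sketch of exactly-once semantics as at-least-once retries plus deduplication (class and field names are made up):

```python
class LikeStore:
    """Exactly-once effect = at-least-once retries + deduplication."""
    def __init__(self):
        self.processed = set()   # request ids already applied
        self.likes = set()

    def apply(self, request_id, user):
        if request_id in self.processed:
            return               # duplicate retry: ignore
        self.processed.add(request_id)
        self.likes.add(user)     # set add is also idempotent per user
```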
By adding timestamps to client messages, replicas can periodically communicate among themselves to check for inconsistencies and reconcile state. This is known as anti-entropy.
Two common approaches to state reconciliation are last writer wins (LWW) using Lamport clocks, and multi-value registers using vector clocks (because vector clocks can represent concurrent writes).
Read-after-write consistency can be solved with a quorum. There are write and read quorums.
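A brute-force check of the quorum intersection property: if w + r > n, every write quorum overlaps every read quorum, so a read always sees the latest acknowledged write (function name made up):

```python
from itertools import combinations

def always_overlap(n, w, r):
    # check every possible write quorum against every read quorum
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))
```

With n = 3, the usual choice is w = r = 2.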
The database client can repair state by using timestamps to figure out what the data should be across the distributed database (read repair).
Total order broadcast is when every node delivers the same messages in the same order. This is exactly what we want for replication.
State machine replication (SMR) algorithm: FIFO-total order broadcast every update to all replicas. When a replica delivers an update message, it applies it to its own state.
A replica in SMR is a state machine.
Total order broadcast can be implemented by sending all messages via a single leader. What happens if the leader crashes or becomes unavailable? One option is manual failover: a human operator chooses a new leader and reconfigures each node to use it. This is fine for planned maintenance but not for unplanned outages. Can we automatically choose a new leader?
Consensus and total order broadcast are formally equivalent.
Common consensus algorithms are Paxos, Multi-Paxos, and Raft.
The FLP result says it’s impossible to implement a consensus algorithm in an asynchronous crash-stop system model: the algorithm is not guaranteed to terminate.
Raft is a non-Byzantine consensus algorithm.
Multi-paxos, raft, etc use a leader to sequence messages. We want to avoid having two leaders at the same time (split-brain). If your system has a split brain, data will be corrupted or lost. We also have to ensure there is <= 1 leader per term. Leader election could fail, but there can’t be more than one. A node can only vote once per term. A quorum is required to elect a leader in a term.
Multi-Paxos and Raft can guarantee one leader per term. But they cannot prevent multiple leaders from different terms running concurrently.
serial as a primary key type is a shortcut Postgres gives you to avoid typing out the exact SQL for creating a sequence and setting the column default to nextval(‘db_id_seq’), which is a pain.
You can query sequences with select nextval(‘seq_name’)
Postgres does not bury their auto-incrementing “stuff” within the engine itself, it puts the functionality right out there so you can tweak it as you like.
Make turns one thing into another. This could be combining header and code files into C objects and binaries, or it could be individual SQL files appended together and then executed.
You create things called targets. The build happens with a recipe. If target X needs to be built before target Y, then you specify X as a prerequisite of Y. All of that combines to make a rule.
In Go, untyped constants take the type needed by their context.
Schemas in Postgres are like namespaces. Usually they’re used for permissions. This team has access to this schema, this team has access to another schema. Schemas are used to essentially manage group permissions.
In DDD, whenever possible, you should try to create an Anticorruption Layer between your downstream model and an upstream integration model, so that you can produce model concepts on your side of the integration that specifically fit your business needs and keep you completely isolated from foreign concepts. Aka, create an adapter layer.
Using messaging is one of the most robust forms of integration because you remove much of the temporal coupling associated with blocking forms of network calls like RPC and REST. Since you already anticipate the latency of message exchange, you tend to build more robust systems because you never expect immediate results.
Links can be full-duplex (at same time), half-duplex, or simplex.
Wireless links send messages as broadcasts, so all nearby hosts receive them too.
Each instance of a protocol talks virtually to its peer using the protocol.
IEEE is responsible for communications protocols (ethernet, wifi). IETF is responsible for internet protocols (HTTP, DNS). W3C is responsible for web protocols (HTML5, CSS)
Packet switching is better than circuit switching for computers because traffic is bursty.
Narrow waist of IP and the hourglass figure allows a wide range of technologies in the link layer and application layer. It’s flexible on both sides.
On a flash drive sequential reads isn’t that much faster than random reads, unlike magnetic disks.
Block level storage in databases tries to predict future behaviour with caching, pre-fetching, and buffering.
Disk space management provides an API to read and write pages (blocks) to a device.
Higher levels (past the buffer manager) of DBMS only operate in memory.
DB file structures include unordered heap files, clustered heap files, sorted files, and index files (B+ trees, linear hashing).
Tables are encoded in files which contain multiple pages which contain multiple records.
Record id = (pageId, “location in page”)
Unpacked representation for fixed length records is a better choice. You pay a small price with the bitmap in the page header and get stability when deleting a record (no need to rearrange record locations, and hence ids).
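A toy unpacked page with a slot bitmap (illustrative only; real pages are byte arrays with the bitmap in the header):

```python
class Page:
    """Unpacked page: a bitmap in the header marks occupied slots."""
    def __init__(self, nslots):
        self.bitmap = [False] * nslots
        self.slots = [None] * nslots

    def insert(self, record):
        i = self.bitmap.index(False)   # first free slot
        self.bitmap[i] = True
        self.slots[i] = record
        return i                       # slot number = stable record id

    def delete(self, i):
        self.bitmap[i] = False         # no compaction: other record ids unchanged
```

Deleting slot 0 doesn’t move the record in slot 1, so record ids stay valid.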
Records don’t need to store schema, schema is stored in system catalog.
There is no serialization done in DBMS between disk and memory. Records (in pages) are stored with the same byte representation.
The physical layer of the network is basically the border between EE and CS: signals, modulation, limits. It’s the actual underlying data path, the line that connects two network stacks.
The link layer concerns itself with how to transfer messages over one or more connected links. We are virtually communicating from one link instance to another.
Some of the link layer is implemented in the operating system. Some of it is implemented in the NIC.
Physical layer is implemented with cables and wires.
ML has a static and dynamic environment. The static environment holds type definitions, and the dynamic environment holds variable bindings. Static environment is doing the same thing as a dynamic environment except it deals with types. (It finds the type of a variable by using pre-existing definitions)
When you think of expressions in ML (and other similar languages), there are three components: syntax, type checking, and evaluation.
Rust is mainly influenced by SML/OCaml (algebraic data types, pattern matching, type inference) and C++ (references, RAII, smart pointers, move semantics). Hence, the low level performance with a high level feel.
In ML, variable shadowing is not assignment (mutation). It’s redefining the dynamic environment.
In ML, you cannot forward reference because the variable won’t be defined in the dynamic environment at run time.
To learn a programming language, you need to learn five things. Syntax, semantics, patterns, libraries, and tools.
The interpreter for a computer language is just another program.
If you don’t understand interpreters, you can still write programs; you can even be a competent programmer. But you can’t be a master.
The whole distinction between program and programming language is a misleading idea. An interpreter itself is just a program. But that program is written in some language, whose interpreter is itself just a program written in some language whose interpreter is itself…Future programmers will see themselves not as writing programs in particular, but as creating new languages for each new application.
In networking, the physical layer talks about signals. The link layer talks about bits.
Physical and link layers are often implemented together.
Byte stuffing is used in link-layer framing to prevent desynchronization from byte counting. You stuff a flag byte at the beginning and end of each frame (escaping it when it appears in the data).
All layers of the network stack contribute to the overall reliability of communication over the internet. Physical layer is about bit scrubbing. Higher layers are about recovery and correctness.
Timeouts are easy to do over LAN, but hard to do over the internet. TCP manages reliability over the entire internet.
Multiplexing can be done either with time division multiplexing (TDM) or frequency division multiplexing (FDM)
Ethernet is multiple access with 1-persistent CSMA (carrier sense multiple access: listen before you talk) / CD (collision detection: if two talk at once, they stop) with BEB (binary exponential backoff).
Modern ethernet is based on switches, not multiple access (like classic ethernet). But we still call modern ethernet, ethernet.
Switches have buffers because of the inherent structure of a switch: it switches which frames get sent, so it needs temporary storage for the other frames.
Switches figure out the addresses of hosts with backward learning. To fill the table, a switch looks at the source address of incoming frames. To forward, it sends to the learned port, or broadcasts to all ports if the destination is unknown.
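A sketch of backward learning (the table maps address to port; all names are made up):

```python
def switch_forward(table, src, dst, in_port, all_ports):
    table[src] = in_port                 # learn: src is reachable via in_port
    if dst in table:
        return [table[dst]]              # known destination: send to one port
    return [p for p in all_ports if p != in_port]  # unknown: flood
```

Note the table fills itself purely from observed traffic; no configuration needed.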
A switch doesn’t know if a port is connected with a computer or another switch. Switches are just computers with small memories.
Loops in the topology of switches are a problem. The solution is to find a spanning tree of the topology.
The link layer doesn’t scale to a global network. The routing table blows up because it holds an entry for every destination, and broadcast blows up because frames with unknown destinations get sent to all ports.
IP is responsible for two things, forwarding, and routing.
In DBMS, we want a declarative access API where we look things up by value.
An index is a data structure that enables fast lookup and modification of data entries by search key.
ISAM stores data pages and index pages in one file, with the index pages forming a tree that eventually leads to data pages in the leaves. It has fast sequential scan and high fan-out, but does not support insertion nicely. Inserting records into full pages results in overflow pages. Overflow pages are the Achilles heel of ISAM.
B+ trees have the same interior node structure as ISAM. The difference is that they’re always balanced. The B+ tree is a variant of the B-tree where all data is stored in the leaves only. This is useful for range search.
B+ Trees are better than ISAM because they adjust to growth gracefully.
Selection using DB indexes can be equality-based or range-based. B+ trees provide both. Linear hash indexes only provide equality.
Data in indexes can be pointers or actual data. And, the actual data stored in the data file can be clustered or unclustered with respect to the index. Also, with references, multiple records for one search key can be separate, or compacted into a list.
Data storage can be alt 1 (tuple), alt 2 (record ids), or alt 3 (list of record ids).
Clustered indexes are better for range searches, and provide locality benefits. The cons are being more expensive to maintain. Heap files are usually 2/3 packed to leave room for future insertions.
Instead of thinking of NoSQL as “no SQL”, think “no transactions”.
An index lookup requires three steps. Tree traversal, following the leaf node chain, and fetching the table data.
Two ingredients of a slow index lookup are following a long leaf node chain and potentially accessing many hits, often hundreds.
A poorly written where clause is the first ingredient of a slow query.
Ingredients of a slow query are *not* present with an index unique scan, because there’s a unique entry in the index with the id.
If your index is a concatenated index, then a query that filters only on a non-leading key column does not use the index. It scans the entire table.
A concatenated index is sorted by key one then key two. That means a two-column index does not support searching on the second column alone.
But, a two-key index does support searching with the first index column. If you only need to search by 1/2 keys in the index, just re-arrange the order. If you need to do this for both keys, then you’re out of luck.
An index with with three keys can be used when searching for the first key, first two keys, and all three keys.
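A toy illustration of why only leading-column searches can use the sort order of a concatenated index (the sentinel "\uffff" is a hack for the upper bound; data and names are made up):

```python
import bisect

# toy concatenated index on (last_name, first_name), kept sorted
index = sorted([("smith", "anna"), ("smith", "bob"), ("young", "carl")])

def by_last_name(name):
    # leading column: binary search over the sort order works
    lo = bisect.bisect_left(index, (name,))
    hi = bisect.bisect_right(index, (name, "\uffff"))
    return index[lo:hi]

def by_first_name(name):
    # non-leading column alone: the sort order doesn't help, full scan
    return [r for r in index if r[1] == name]
```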
To define an optimal index you must understand more than just how indexes work — you must also know how the application queries the data: the access path. This is why defining optimal indexes is very difficult for external consultants.
An index whose definition contains functions or expressions is a so-called function-based index (FBI).
Sometimes ORM tools use FBI (UPPER and LOWER) without the developer’s knowledge. Hibernate, for example, injects an implicit LOWER for case-insensitive searches.
SQL Server and MySQL do not support function-based indexes as described. The workaround is computed or generated columns.
Functions that cannot be used in FBIs in Postgres and Oracle are ones that use the current time, random number generators, or environment variables: they are not deterministic. You can declare such functions with the DETERMINISTIC (Oracle) or IMMUTABLE (Postgres) keyword, but the index will not behave as intended because its keys are not re-evaluated when the function’s result changes.
The biggest performance risk of an index range scan is the leaf node traversal. The golden rule of indexing is to keep the scanned index range as small as possible. Essentially, you don’t want your search to degenerate from logarithmic to linear.
With partial indexes you can specify which rows should be indexed. The syntax is surprisingly simple: a where clause.
Partial indexes are often used in queueing systems where queries only need to retrieve rows that are still “unprocessed”.
Column order of indexes only matters when your query contains range conditions.
Queries that contain dates, numeric strings, combining columns, smart logic, and math are usually anti-patterns that prevent proper index usage.
Routers usually have two buffers, an input one and an output one. Conceptually, you can think of this as one FIFO queue.
Networks can differ in service model (datagrams, virtual circuits), addressing, quality of service (priorities or none), packet sizes, and security.
The slash at the end of an IP address gives the length of the prefix in bits.
IANA is the body that initially owned all IPs. It delegates to regional bodies (RIRs), RIRs delegate to companies in their region, and companies assign addresses to their customers/computers with DHCP.
0.0.0.0/0 is the default route that catches all IP addresses. An entry in the routing table of your host will have 0.0.0.0/0 mapped to the router. So all IP addresses that don’t have your local network prefix will be sent to your router.
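A sketch of routing-table lookup with a default route, using Python’s ipaddress module (the table entries and labels are made up; real tables use longest-prefix match, which is why the /24 wins over /0):

```python
import ipaddress

routes = {
    ipaddress.ip_network("192.168.1.0/24"): "local link",
    ipaddress.ip_network("0.0.0.0/0"): "default router",   # catches everything else
}

def next_hop(ip):
    addr = ipaddress.ip_address(ip)
    best = max((net for net in routes if addr in net),
               key=lambda net: net.prefixlen)   # longest prefix wins
    return routes[best]
```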
IP’s TTL header prevents loops.
DHCP and ARP are the glue that makes IP work.
DHCP leases IP addresses to nodes. It also provides network prefix, address of local router, DNS server, and time server.
The client sets up its IP with DHCP servers via DORA (discover ->, offer <-, request ->, ack <-). After that, the client only has to renew its existing lease, with an abbreviated sequence (request, ack).
Packet size solutions are fragmentation and discovery (figuring out MTU of entire path). Discovery is the method in use today. Discovery is implemented with ICMP.
Fragmentation is undesirable: more work for routers and hosts, it magnifies the loss rate, and it opens security vulnerabilities.
ICMP is a companion protocol to IP. They are implemented together, and ICMP sits on top of IP. ICMP provides error reporting and testing.
If the TTL hits 0, then the router sends an ICMP error back to the host.
Native IPv6 islands can be connected via IPv4 with a tunnel. (IPv4 packet wraps IPv6 in the IPv4 network). We put a network layer on top of a network layer. IPv6 on top (or in) IPv4.
NAT boxes are motivated by IP address scarcity. They were a short-term measure before the IPv6 solution.
NAT boxes break apps that need their IP addresses to be publicly visible.
GFS needs to be big/fast, global, sharing, and automatic recovery.
GFS is a single data centre.
In GFS, each chunk has three replicas by default: one primary and two secondaries.
GFS assumes four things. First, component failures are the norm rather than the exception (the system is built from many inexpensive commodity components). Second, files are huge by traditional standards (multi-GB). Third, most files are mutated by appending new data rather than overwriting existing data; random writes within a file are practically non-existent. Fourth, co-designing the applications and the file system API benefits the overall system by increasing flexibility.
In google and GFS-usage, high sustained bandwidth is more important than low latency.
Hive is implemented over MapReduce; Presto runs its own distributed engine. Both are SQL at scale.
No one usually writes vanilla map reduce jobs anymore. Everyone uses frontends for map reduce.
The goal of The Little Schemer is to teach the reader to think recursively.
To communicate the above concept of recursion, you have three choices: English, formal mathematics, or a programming language. English is too awkwardly verbose, and math is too cryptic. Programming languages, the marriage of technology and mathematics, are perfect for the job.
Even though join order has no impact on the final result of a query, it still affects performance because the optimizer will evaluate all possible join permutations and select the best one. The more tables to join, the more execution plan variants to evaluate.
A cluster in computing can refer to a compute cluster or a data cluster.
The second power of indexing is clustering data.
Compilers take a stream of symbols, figure out their structure according to some domain-specific predefined rules, and transform them into another symbol stream.
Compilers are divided into parsing, type checking, and code generation.
You don’t unlock mutexes in Rust explicitly (the guard unlocks when it goes out of scope), which is actually very nice because it keeps your critical sections minimal: between two curly braces.
Channels, in essence, try to take the good parts of multithreading without the hard parts (shared memory, and deadlocks).
Semaphores are like a bucket of balls. They are used for synchronization, not communication. If you need to communicate you should pair the semaphore with a buffer or mutex.
Channels are the same as semaphores, except the bucket is full of structs, not simply metaphorical balls. Channels are like semaphores with mutexes.
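In Python terms, a channel is roughly a bounded thread-safe queue: a semaphore whose “balls” carry structs (this sketch just illustrates blocking send/receive):

```python
import threading
import queue

# A bounded queue behaves like a channel: put blocks when full, get blocks when empty.
ch = queue.Queue(maxsize=1)

def producer():
    ch.put({"result": 42})   # blocks if the buffer is full (backpressure)

t = threading.Thread(target=producer)
t.start()
msg = ch.get()               # blocks until a message is available
t.join()
```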
What’s expensive about channels is that each thread gets its own copy of a message. The message is allowed to be in multiple places because, in essence, it’s ephemeral.
But, in practice, we share *some* memory (the heap), and only make shallow copies into channels.
In Go, if you pass a pointer in a channel, you can get data races. Channels don’t always protect you against data races.
Rust protects you though, with ownership. You can pass around a Box, and the receiver will get ownership of the pointer.
We need flow control because the transport layer doesn’t control when segments are read from the sliding window buffer. The app decides when. So if the window is full, the receiver needs to let the sender know to slow down, or else bandwidth and segments will be wasted.
The receiver advertises the free space remaining in its sliding-window buffer as the flow control window (WIN).
Selective acks (SACK) gives hints to the sender about receiver buffer state (ACK up to 100, and 200-299).
In an effective congestion control solution, both the network layer and transport layer must work together. The network layer can witness congestion, but the transport layer causes congestion.
Before DNS, a directory file was used, called HOSTS.TXT, which was regularly retrieved from a central machine at the NIC (Network Information Center).
A DNS zone comprises DNS records that give information for its domain names: A (IPv4 address of a host), AAAA (IPv6 address of a host), NS (nameserver of a domain or delegated subdomain).
DNS is a hierarchical distributed service.
A DNS cache on a gateway or host doesn’t only cache specific IPs. It also caches the addresses of name servers, to avoid the lookups down the hierarchical tree starting from the root name server.
HTTP pipelining is sending multiple requests at the same time, instead of one at a time.
SPDY multiplexes parallel HTTP requests over one TCP connection, lets the client prioritize parallel requests, and compresses HTTP headers.
We place CDN across the internet by clever use of DNS, a CDN DNS server.
Standard time slice of a process in the OS’s process queue is like 10ms.
Go’s channels are known for being slow because they essentially implement a Mutex(VecDeque()), albeit with a “fast userspace mutex” (futex).
If you’re a large company, you need to implement your own databases to support the large traffic you get. Off the shelf databases won’t scale.
Large-scale systems are prone to Murphy’s law. Anything that can go wrong will go wrong. You have to be prepared for the worst.
DNS can return multiple IP addresses for a given hostname. Clients will pick the first one, moving down the list if IPs are unreachable. The downside of DNS load balancing is that it’s not very intelligent.
Running dig +noall +answer against big websites often displays only one IP address. This is because big companies like Google have many data centers across the world, and DNS returns an address near you.