ThinkPad X1 Extreme Gen 3: Dual Boot Pop!_OS 20.04 LTS (Nvidia version) with Windows 10 (2020-10-25, https://isdanni.com/thinkpad-x1-extreme-pop-os)
<p>I have been heavily using Ubuntu for the past few years, both for work and personal projects, and have become pretty comfortable with it. So when it came to purchasing my next laptop, I had a hard time choosing the “perfect” Linux distro and the proper specs for its installation.</p>
<p>There are too many reviews of the “best Linux distros” each year (or even each month) for “beginners”, “mainstream users”, and “professionals”, and it’s just impossible to stick to one when they are constantly updating and making releases with tweaks here and there. Initially I installed Ubuntu 18.04 LTS on my ThinkPad T series because it simply made all the college homework much easier, which was a nice relief for a Linux newbie when the majority of the class were using MBPs or even more “advanced” Linux options like Arch (debatable here :)). I went with dual booting because gaming and design software have better support on Windows, and it’s nice to keep one’s options open.</p>
<p>So after some research I landed on <a href="https://system76.com/pop">Pop!_OS by System76</a>. It is based on Ubuntu, but is NOT just a re-skinned Ubuntu, so it was quick and easy for me to set up and use all the tools I’m familiar with (they actually wrote an article addressing this, since it seems to be asked quite frequently; you can check it out here: <a href="https://support.system76.com/articles/difference-between-pop-ubuntu/">Pop!_OS and Ubuntu: What’s the difference?</a>). Another huge reason is that they specifically provide an Nvidia ISO, so it’s perfect for switching graphics and working with the Nvidia driver on the ThinkPad X1 Extreme.</p>
<p><strong>NOTE</strong>: This article describes only my personal experience and should not be used as the official guideline for Pop!_OS installation.</p>
<blockquote>
<p>For more support, please check the <a href="https://support.system76.com/">official System76 website</a>.</p>
</blockquote>
<p>My ThinkPad X1 Extreme Gen 3 configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Processor: Intel® Core™ i7-10850H CPU @ 2.70GHz × 12
Graphics: NVIDIA Corporation / GeForce GTX 1650 Ti with Max-Q Design/PCIe/SSE2
Screen: 4K
Memory: 32GB
Disk: 1TB
</code></pre></div></div>
<h1 id="1-create-bootable-usb-stick">1. Create bootable USB stick</h1>
<p>Download the Pop!_OS ISO image (Nvidia version) from the System76 website. It’s a disk image containing the OS and the installer.</p>
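<p>Before writing the image, it’s worth verifying the download. Assuming the download page publishes a SHA256 checksum next to the ISO (check the site for the exact value; the filename below is only an illustration), a few lines of Python can compute the file’s digest:</p>

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum published on the download page.
# The filename here is hypothetical; use whatever you actually downloaded.
# print(sha256_of("pop-os_20.04_amd64_nvidia.iso"))
```

<p>If the digest doesn’t match the published value, re-download the ISO before flashing it.</p>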
<p><img src="/assets/images/post/popos/popos_download.jpg" alt="popos_download" /></p>
<p>If you are not sure which Pop!_OS version to download, check the Display adapter in Device Manager. If there is “Nvidia” in the graphics, go with the Nvidia version.</p>
<p><img src="/assets/images/post/popos/device_manager.jpg" alt="device_manager" /></p>
<p>To write the image to the flash drive, we can use <a href="https://www.balena.io/etcher/">Etcher</a> on Windows (there are several alternatives as well).</p>
<p><img src="/assets/images/post/popos/etcher1.jpg" alt="etcher1" /></p>
<p><img src="/assets/images/post/popos/etcher2.jpg" alt="etcher2" /></p>
<p>It should take a few minutes to complete. After the write is finished, close the Etcher window.</p>
<p><img src="/assets/images/post/popos/etcher4.jpg" alt="etcher4" /></p>
<h1 id="2-disable-secure-boot">2. Disable secure boot</h1>
<p>Restart the computer, and when the red Lenovo sign shows up on the screen, press <strong>F12</strong> to enter the boot menu (the key <strong>varies</strong> by laptop, so make sure you press the correct one). In the <strong>Setup/Security</strong> menu, disable <strong>Secure Boot</strong>.</p>
<p><img src="/assets/images/post/popos/secure_boot1.jpg" alt="secure_boot1" /></p>
<p><img src="/assets/images/post/popos/secure_boot2.jpg" alt="secure_boot2" /></p>
<h1 id="3-make-partitions-for-pop_os-in-the-disk-space">3. Make partitions for Pop!_OS in the disk space</h1>
<p>Open <strong>Disk Management</strong> in Windows and right-click the NTFS partition (as you can see, mine still has around 900GB of free space; Disk 1 is the USB stick I wrote the Pop!_OS ISO to) to <strong>Shrink Volume</strong>. Personally, I split the space in half: 500GB for Windows 10 and 500GB for Pop!_OS. But whatever you do, make sure to back up your data before the operation and be very careful when changing disk partitions.</p>
<p><img src="/assets/images/post/popos/disk_management.jpg" alt="disk_management" /></p>
<p>After the operation, Disk 0 looks like this:</p>
<p><img src="/assets/images/post/popos/unallocated.jpg" alt="unallocated" /></p>
<h1 id="4-boot-from-live-usb">4. Boot from live USB</h1>
<p>Now we can start the installation. Restart the computer again, press <strong>F12</strong> when the red sign shows up, and choose the USB stick. There should be a short period of an ugly black screen with white output like this:</p>
<p><img src="/assets/images/post/popos/black.jpg" alt="black" /></p>
<p>Then the Pop!_OS screen will show up, which means we have entered the live environment.</p>
<p><img src="/assets/images/post/popos/popos.jpg" alt="popos.jpg" /></p>
<h2 id="41-gparted">4.1. GParted</h2>
<p>After the keyboard, timezone, language, and other simple configuration steps, we can choose to partition the space manually:</p>
<p><img src="/assets/images/post/popos/space.png" alt="space.png" /></p>
<h3 id="411-swap">4.1.1. Swap</h3>
<p>As shown at the bottom of the GParted menu, selecting a <strong>swap</strong> space is <strong>OPTIONAL</strong>. It used to be recommended to have double the RAM size, but for modern computers, especially those with large amounts of RAM (up to 128 GB), this rule no longer applies. Since my RAM is 32GB, I went with <strong>6GB</strong> of swap. The file system is <strong>linux-swap</strong>.</p>
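<p>The rule-of-thumb arithmetic can be sketched in a few lines. This follows one commonly cited heuristic (swap roughly equals the square root of RAM without hibernation, or RAM plus that square root with hibernation, so the whole RAM image fits on disk); it is just one heuristic among many, not an official recommendation:</p>

```python
import math

def suggested_swap_gb(ram_gb, hibernate=False):
    """One common heuristic (not an official recommendation):
    sqrt(RAM) without hibernation; RAM + sqrt(RAM) with hibernation,
    so the full RAM image can be written to disk on suspend."""
    root = math.sqrt(ram_gb)
    return round(ram_gb + root) if hibernate else max(1, round(root))

print(suggested_swap_gb(32))                  # 6  (matches the 6GB I chose)
print(suggested_swap_gb(32, hibernate=True))  # 38
```

<p>If you plan to hibernate, size swap for the hibernation case; otherwise the smaller value is usually plenty on machines with lots of RAM.</p>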
<blockquote>
<p>Here is an interesting article discussing different options: <a href="https://itsfoss.com/swap-size/">How Much Swap Should You Use in Linux?</a></p>
</blockquote>
<p><img src="/assets/images/post/popos/swap_space.png" alt="swap.png" /></p>
<p><img src="/assets/images/post/popos/swap.png" alt="swap.png" /></p>
<h3 id="412-boot">4.1.2. Boot</h3>
<p>I went with <strong>1GB</strong> for <strong>/boot</strong> just to be safe; usually (at least) 512MB should be enough. The file system is <strong>fat32</strong>.</p>
<p><img src="/assets/images/post/popos/boot.jpg" alt="boot" /></p>
<p><img src="/assets/images/post/popos/boot_space.png" alt="boot space" /></p>
<h3 id="413-root">4.1.3. Root</h3>
<p>Then we can allocate the rest of the space to the root partition (<strong>/</strong>). The file system is <strong>ext4</strong>.</p>
<p><img src="/assets/images/post/popos/root.png" alt="root" /></p>
<p><img src="/assets/images/post/popos/root_space.png" alt="root_space" /></p>
<p>Make sure every partition is correct, then select <strong>Erase and Install</strong>.</p>
<p><img src="/assets/images/post/popos/erase_install.png" alt="erase_install.png" /></p>
<h2 id="42-install-and-restart">4.2. Install and RESTART!</h2>
<p>The installation should then take two or three minutes. After it finishes, a prompt window will ask you to restart the device.</p>
<p><img src="/assets/images/post/popos/installing.png" alt="installing" /></p>
<p><img src="/assets/images/post/popos/restart.png" alt="restart" /></p>
<h2 id="switching-between-windows-and-pop_os">Switching between Windows and Pop!_OS</h2>
<p>If everything is done correctly, the USB stick can be removed and each restart will boot into Pop!_OS directly. If you wish to use Windows 10, again press <code class="language-plaintext highlighter-rouge">F12</code> during restart when the red Lenovo sign shows up and choose the Windows Boot Manager entry.</p>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://support.system76.com/articles/live-disk/">Create and Use Bootable Media from Other OS’s</a></li>
<li><a href="http://www.glowseed.com/mindmash/?p=643">DUAL BOOTING POP! OS ON A THINKPAD X1 EXTREME</a></li>
<li><a href="https://deepak.puthraya.com/2019/10/10/popos-thinkpad-x1-extreme">Setting up Pop!_OS</a></li>
<li><a href="https://www.ultrabookreview.com/33225-decided-to-switch-to-linux-day-1/">So I’ve decided to switch to Linux: Day 1</a></li>
<li><a href="https://techhut.tv/dual-boot-windows-10-pop-os/">How to Dual Boot Windows 10 and Pop!_OS</a></li>
<li><a href="https://www.youtube.com/watch?v=XGa-HHYPF2s">Dualboot Pop! OS 20.04 linux and Windows 10(2020)</a></li>
<li><a href="https://www.youtube.com/watch?v=CozK7sJ8UMs">Pop!_OS 19.10 - Setting up a Dual Boot with Windows 10</a></li>
<li><a href="https://www.youtube.com/watch?v=EXZ7_DVxztQ&t=139s">How to Dual Boot Pop!_OS 20.04 LTS and Windows 10</a></li>
</ul>

Grokking DDIA: Dig deeper than buzzwords [2020 UPDATE] (2020-06-06, https://isdanni.com/ddia)
<blockquote>
<p>Understand data & build reliable, scalable, and maintainable applications</p>
</blockquote>
<p><strong><em>Designing Data-Intensive Applications (DDIA)</em></strong> has been praised by many people in the industry as a great book that bridges the knowledge gap between the theory of data systems and practical engineering. Mastering the tradeoffs of each technology and applying them to solve real-world problems takes huge effort and constant learning, so I’m writing this post for my reading notes and some takeaways ;-)</p>
<h1 id="part-1-foundations-of-data-systems">Part 1. Foundations of Data Systems</h1>
<h2 id="reliable-scalable-and-maintainable-applications">Reliable, Scalable, and Maintainable Applications.</h2>
<ul>
<li><strong>compute-intensive</strong>: limited by CPU power;</li>
<li><strong>data-intensive</strong>: limited by the volume, complexity, and rate of change of the data;</li>
</ul>
<p><strong>Data-intensive applications</strong> usually consist of the following building blocks:</p>
<ul>
<li><strong>storage</strong>: databases;</li>
<li><strong>caches</strong>: remember results of expensive operations to save cost;</li>
<li><strong>search indexes</strong>: allow searching using keywords or other filtering methods;</li>
<li><strong>stream processing</strong>: Send a message to another process, to be handled asynchronously;</li>
<li><strong>batch processing</strong>: Periodically crunch a large amount of accumulated data;</li>
</ul>
<p><strong>Why do we talk about data systems in general?</strong></p>
<ol>
<li>The boundaries between the categories are becoming blurred. For example, Redis is a datastore that can also be used as a message queue, and Apache Kafka is a message queue with database-like durability guarantees.</li>
<li>applications require a range of functionality that no single tool can meet, so people usually break tasks down into smaller pieces, solve each one with a single tool, and then glue the pieces back together with application code.
<blockquote>
<p>For example, it’s usually the application code’s job to keep all search indexes and caches in sync with the primary database. Once development is finished, combining several tools for one specific task is not just application development, but also data system design.<br /><img src="/assets/images/post/dintensive/appcode.png" alt="application code sync all with main db" /></p>
</blockquote>
</li>
</ol>
<h4 id="reliability">Reliability</h4>
<blockquote>
<p>The system can continue to work <em>correctly</em> even in the face of adversity.<br /><strong>Fault-tolerant</strong> / <strong>resilient</strong>: able to anticipate faults and cope with them (certain types, of course).</p>
</blockquote>
<p>Many critical bugs are actually due to <strong>poor error handling</strong>. To increase the rate of faults, we can <u>trigger them deliberately</u> (e.g., randomly killing individual processes without warning) to ensure that the fault-tolerance machinery is continually exercised and tested. In general, we prefer tolerating faults over preventing them.</p>
<blockquote>
<p><strong>Netflix</strong>: <em>Chaos Monkey</em>, a resiliency tool that helps applications tolerate random instance failures by randomly terminating VM instances and containers inside your production environment, exposing engineers to failures more frequently and incentivizing them to build resilient services.</p>
</blockquote>
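<p>The idea of deliberately triggering faults can be sketched as a tiny fault-injection wrapper (a toy illustration of the principle, not how Chaos Monkey itself works): wrap a function so it randomly fails, then make sure the calling code survives it.</p>

```python
import random

def flaky(fn, failure_rate=0.2, rng=random.random):
    """Wrap fn so it randomly raises, exercising callers' error handling."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise RuntimeError("injected fault")
        return fn(*args, **kwargs)
    return wrapper

def call_with_retry(fn, attempts=5):
    """A caller that must tolerate the injected faults to pass its tests."""
    last_err = None
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError as err:
            last_err = err
    raise last_err
```

<p>Running the wrapped function in tests forces the retry path to be exercised regularly, instead of only during real outages.</p>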
<table>
<thead>
<tr>
<th>Faults</th>
<th>Solutions</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>hardware faults</strong></td>
<td><em>Hard disks crash, faulty RAM, power grid blackouts, the wrong network cable unplugged</em><br /><br /> 1. <strong>add redundancy</strong> to individual hardware components in order to reduce the failure rate. <em>Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power.</em> <br />2. apply system patches as rolling upgrades, so that only one node is down at a time</td>
</tr>
<tr>
<td><strong>software errors</strong></td>
<td><em>systematic errors within the system; harder to predict, and they cause more trouble since they are correlated across nodes</em> <br /><br /> no quick solutions; requires careful thinking, testing, measuring, monitoring, and analysis</td>
</tr>
<tr>
<td><strong>human errors</strong></td>
<td><em>humans tend to be unreliable even when intentions are good; a leading cause of internet outages</em><br /><br />1. <strong>Design well</strong>: abstractions, APIs, admin interfaces… <br />2. <strong>Decouple the most error-prone places</strong>: provide sandbox environments for exploring and experimenting safely with real data.<br />3. <strong>Test at all levels</strong>: unit tests -> whole-system integration tests -> automated tests<br />4. <strong>Enable quick recovery</strong>: fast rollback of configuration changes, gradual rollout of new code, tools to recompute data.<br />5. <strong>Set up monitoring</strong>: performance metrics and error rates (<em>telemetry</em>)<br />6. <strong>Implement training & management</strong></td>
</tr>
</tbody>
</table>
<h4 id="scalability">Scalability</h4>
<p>Scalability is the term we use to describe a system’s ability to cope with <strong>increased load</strong> – System can deal with the growth in data volume, traffic volume or complexity.</p>
<p><strong>Describe Current Load</strong></p>
<p>There are some <strong>load parameters</strong> and the best choice of the parameters depend on the system architecture: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.</p>
<blockquote>
<p>One example from <strong>Twitter</strong>:<br />the <strong>bottleneck for scalability is not the tweet volume but the fan-out</strong> (a term borrowed from electronic engineering, where it describes the number of logic gate inputs attached to another gate’s output; in transaction processing systems, we use it to describe <u>the number of requests to other services that we need to make in order to serve one incoming request</u>). <br /><img src="/assets/images/post/dintensive/t1.png" alt="twitter case" /><br />As shown, it’s <strong>better to do more work at write time and less at read time</strong>, but this also means posting a tweet requires extra work, due to the writes to caches, and the distribution of followers per user becomes a key load parameter when discussing scalability.<br />Twitter is now moving to a <strong>hybrid approach combining both</strong>: most users’ tweets continue to be fanned out to home timelines at the time they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user follows are fetched separately and merged with that user’s home timeline when it is read, as in approach 1. This hybrid approach delivers consistently good performance.</p>
</blockquote>
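<p>The fan-out-on-write approach from the Twitter example can be sketched in a few lines (toy data structures, not Twitter’s actual implementation): posting does O(followers) work up front so that reading a home timeline is a cheap cache lookup.</p>

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}  # who follows whom (toy data)
timelines = defaultdict(list)            # per-user home timeline cache

def post_tweet(user, text):
    """Fan out on write: append the tweet to every follower's cached timeline."""
    for follower in followers.get(user, []):
        timelines[follower].append((user, text))

def home_timeline(user):
    """The read side is just the precomputed cache: no joins at read time."""
    return timelines[user]

post_tweet("alice", "hello")
print(home_timeline("bob"))  # [('alice', 'hello')]
```

<p>The celebrity problem is also visible here: a user with millions of followers turns one post into millions of appends, which is exactly why the hybrid approach exempts them from the fan-out.</p>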
<p><strong>1. Describe Performance</strong></p>
<ul>
<li>
<p><strong>Throughput</strong>: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. Usually used for batch processing systems.</p>
</li>
<li>
<p><strong>Latency & Response Time</strong>: Response time is the time between a client sending a request and receiving a response <em>(includes network delays and queueing delays)</em>. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service. Usually for online systems.</p>
</li>
</ul>
<p>Since the response time varies from request to request, we need to think of it as a <strong>distribution</strong> of values, and it’s usually better to use <strong>percentiles</strong>. For example, if we sort the list of response times from fastest to slowest, the median is a good metric for how long users typically have to wait.</p>
<p>The same goes for checking <strong>outliers</strong>: we use much higher percentiles (usually <em>p95, p99, and p999</em>, meaning 95%, 99%, or 99.9% of requests are faster than that threshold). These <strong>high percentiles of response times</strong>, or <strong>tail latencies</strong>, are important because they directly affect the <em>user experience</em> of the service: the clients experiencing the worst response times are often those with the most data or purchases, i.e., the most valuable customers. However, optimizing very high percentiles (e.g., the 99.99th percentile) is expensive and may not yield enough benefit to be worthwhile for the service provider.</p>
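<p>A nearest-rank percentile is easy to compute from the sorted response times; here is a minimal sketch (simplified, ignoring the interpolation schemes that monitoring systems often use):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that is >= p percent
    of all samples. p is an integer percentage, e.g. 95 for p95."""
    s = sorted(samples)
    k = -(-p * len(s) // 100) - 1        # ceil(p/100 * n) - 1
    return s[max(0, min(len(s) - 1, k))]

times_ms = [12, 13, 13, 14, 14, 15, 16, 18, 200, 1500]
print(percentile(times_ms, 50))  # 14   (typical user)
print(percentile(times_ms, 95))  # 1500 (tail latency)
```

<p>Note how a single extreme outlier dominates p95 here while leaving the median untouched, which is exactly why averages hide tail latency.</p>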
<blockquote>
<p><u>Service Level Objectives (SLOs)</u> and <u>Service Level Agreements (SLAs)</u> often use <strong>percentiles</strong> to define the expected performance and availability of a service.<br />For example, customers can demand a refund if SLAs are not met.</p>
</blockquote>
<p><strong>Queueing delays</strong> often account for a large part of the response time at high percentiles, since a server can only process a small number of things in parallel (limited, for example, by its CPU cores). <strong>Head-of-line blocking</strong> refers to the situation where a few slow requests hold up the processing of subsequent requests, even though those later requests would be fast to process on their own. This is why it’s important to measure response times on the <strong>client side</strong>.</p>
<blockquote>
<p><strong>Tail Latency Amplification</strong>: Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple back‐end calls, and so a higher proportion of end-user requests end up being slow.<br /><img src="/assets/images/post/dintensive/slow.png" alt="slow requests" /></p>
</blockquote>
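<p>The amplification effect is simple probability: if each backend call is slow with independent probability <em>p</em>, a request that fans out to <em>n</em> backends is slow with probability 1 - (1 - p)^n:</p>

```python
def p_slow_request(p_slow_backend, n_calls):
    """Chance that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_backend) ** n_calls

# Only 1% of backend calls are slow, yet a page touching 100 backends
# is slow for roughly 63% of user requests:
print(round(p_slow_request(0.01, 1), 3))    # 0.01
print(round(p_slow_request(0.01, 100), 3))  # 0.634
```

<p>This is why trimming the tail of backend latency matters far more for fan-out heavy services than improving the average.</p>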
<p><strong>2. Approaches for Loading</strong></p>
<p><strong>Scale up / vertical Scaling</strong>: moving to more powerful machines.</p>
<p><strong>Scale out / Horizontal Scaling</strong>: distributing the load
across multiple smaller machines. Also known as <u>shared-nothing</u> architecture.</p>
<p>In real life, some systems are <em>elastic</em>, meaning they can automatically add computing resources when they detect a load increase (useful if load is highly unpredictable), whereas others are <em>scaled manually</em>.</p>
<blockquote>
<p>So far, there’s no <strong>magic scaling sauce</strong>: a generic, one-size-fits-all scalable architecture. Systems at this scale are usually designed for a specific application, and the problems involve the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or usually some mixture of all of these plus other issues.</p>
</blockquote>
<h4 id="maintainability">Maintainability</h4>
<blockquote>
<p>Many different people (engineering & operations) should be able both to maintain current behavior and to adapt the system to new use cases, and they should be able to work productively.<br /><img src="/assets/images/post/dintensive/legacy.jpeg" alt="legacy" /><br /><strong>FANTASTIC Legacy Code</strong>: every legacy system is unpleasant in its own way.</p>
</blockquote>
<p>The majority of the cost of software is not initial development but <strong>maintenance</strong>: fixing bugs, keeping systems operational, investigating failures, and more. It’s hard to give general recommendations for dealing with all legacy systems, but we can follow these principles to avoid trouble as much as we can:</p>
<table>
<thead>
<tr>
<th>principles</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Operability</strong></td>
<td>Good data systems should do the following to make the operations team’s life easier:<br />Providing visibility into the runtime behavior and internals of the system, with good monitoring<br />Providing good support for automation and integration with standard tools<br />Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)<br />Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)<br />Providing good default behavior, but also giving administrators the freedom to override defaults when needed<br />Self-healing where appropriate, but also giving administrators manual control over the system state when needed<br />Exhibiting predictable behavior, minimizing surprises</td>
</tr>
<tr>
<td><strong>Simplicity</strong></td>
<td>able to manage complexity. <br />Abstraction: one of the best ways to remove accidental complexity (<em>e.g., SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes</em>; a high-level language hides machine code, CPU registers, and syscalls)<br />Symptoms of complexity include:<br />explosion of the state space<br />tight coupling of modules<br />tangled dependencies<br />inconsistent naming and terminology<br />hacks for performance<br />more…</td>
</tr>
<tr>
<td><strong>Evolvability</strong></td>
<td>also known as Extensibility/Modifiability/Plasticity: <u>Agile</u> working patterns</td>
</tr>
</tbody>
</table>
<blockquote>
<p><img src="/assets/images/post/dintensive/db.png" alt="db" /><small class="img-hint">Map of Dbs</small></p>
</blockquote>
<p>This section covers a range of data models for data storage and querying.</p>
<h2 id="data-models--query-language">Data Models & Query Language</h2>
<h4 id="relational-vs-document">Relational VS Document</h4>
<blockquote>
<p>The best-known data model today is probably SQL, based on the relational model proposed by Edgar Codd in 1970: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).</p>
</blockquote>
<table>
<thead>
<tr>
<th>Model types</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Relational Model</strong></td>
<td>the roots of RDBMSs lie in <em>business data processing</em><br /> on mainframe computers in the 1960s and 70s: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).</td>
</tr>
<tr>
<td><strong>Document Model</strong></td>
<td><strong>NoSQL</strong> arose in the 2010s from the need for<br />1. <u>greater scalability</u> than relational databases, including very high write throughput<br />2. a widespread preference for <u>free and open source software</u> over commercial database products<br />3. specialized query operations that are not well supported by the relational model<br />4. frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model</td>
</tr>
</tbody>
</table>
<h4 id="query-for-data">Query for Data</h4>
<h4 id="graph-like-data-models">Graph-Like Data Models</h4>
<h2 id="storage--retrieval">Storage & Retrieval</h2>
<h2 id="part-2-distributed-data">Part 2. Distributed Data</h2>
<h2 id="part-3-derived-data">Part 3. Derived Data</h2>
<h2 id="blah-blah-blah">Blah Blah Blah</h2>
<blockquote>
<p><strong>Origin</strong>: <br />Told a senior engineer at a famous e-commerce company that I use Hadoop a lot, and got crushed by a few casual questions<br />Told a senior engineer at a famous short-video app that I once built a full-stack website by myself, and got crushed by a few casual questions<br />Told a local reporter that I run fast, <s>and then..</s> sorry, off topic…<br /><strong>Conclusion</strong>: <br />Be careful when bragging; a noob should stick to doing what a noob is supposed to do<br /></p>
</blockquote>
<p>Martin Kleppmann’s <strong><em>Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems</em></strong> is a legendary book recommended by many experts. I never had time to read it before <s>more like I was too lazy</s>, but now that finals are approaching I have miraculously developed a burning desire to read <s>more like I don’t want to work on my assignments</s>.</p>
<p>Last year I took a Data Intensive Computing course, where I roughly learned some data processing frameworks and pipelines and then went on Kaggle to pretend to be getting into data science, but I had no deep understanding of how data is used in real engineering. Year 3 is suddenly almost over, and its biggest lesson for me (also something the author says in the book) is: <em>“You should know more than just a few buzzwords”</em>. It’s like using Linux: opening a browser on a Linux desktop to watch YouTube is “using Linux” <s>(sounds like me again)</s>, and actually building things on Linux is also “using Linux”. In the end, it’s not about what you’ve used, but how much you truly understand. That’s my motivation for reading this book. I’m posting my reading notes on the blog as a record and to share with everyone <s>(and to see how much I’ll still remember when I come back in a few years)</s></p>
<p>“Just keep learning.” Let’s encourage each other :)</p>

Streaming Systems: Data Processing, Watermarks & Advanced Windowing (2020-06-06, https://isdanni.com/streaming-system)
<p>This post contains my reading notes on <strong>Part 1, The Beam Model (Chapters 1-4)</strong> of the book, which covers the high-level batch and streaming data processing model called <a href="https://beam.apache.org/">Apache Beam</a>;</p>
<h1 id="streaming-101">Streaming 101</h1>
<h2 id="1-what-is-streaming">1. What is streaming?</h2>
<h3 id="streaming-system">Streaming System</h3>
<p>A type of <u>data processing engine</u> designed with <u>infinite</u> datasets in mind.</p>
<h3 id="shape-of-a-dataset">Shape of a dataset</h3>
<table>
<thead>
<tr>
<th> </th>
<th><strong>Cardinality</strong></th>
<th><strong>Constitution</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>definition</td>
<td>its size, with the most salient aspect of cardinality being whether the given dataset is infinite or finite;</td>
<td>physical manifestation, which defines the way one can interact with the given dataset;</td>
</tr>
<tr>
<td>types</td>
<td>- <strong>Bounded data</strong>: a dataset that is finite in size;<br />- <strong>Unbounded data</strong>: a dataset that is infinite in size(at least theoretically);</td>
<td>The two primary constitutions of importance are:<br /> - <strong>Table</strong>: a holistic view of a dataset at a specific point in time. SQL systems have traditionally dealt in tables;<br /> - <strong>Stream</strong>: an element-by-element view of the evolution of the dataset over time. The MapReduce lineage of data processing systems has traditionally dealt in streams.</td>
</tr>
</tbody>
</table>
<h3 id="why-stream-processing-is-important">Why is stream processing important?</h3>
<ul>
<li>business requires more <em><u>timely insights</u></em> & streaming achieves lower <em><u>latency</u></em>;</li>
<li>easier to manage massive, <em><u>unbounded</u></em> dataset that are increasingly common nowadays;</li>
<li>more <em><u>consistent, predictable consumption of resources</u></em>, since incoming data arrival is spread out more evenly over time;</li>
</ul>
<h2 id="2-background">2. Background</h2>
<h3 id="lambda-architecture">Lambda Architecture</h3>
<blockquote>
<p><strong><a href="https://en.wikipedia.org/wiki/Lambda_architecture">Lambda Architecture</a></strong>: a data processing architecture that uses a <u>stream system</u> to produce low-latency, inaccurate (either because of an approximation algorithm or because the system itself does not provide correctness) or speculative results, and a <u>batch system</u> to provide eventually correct results;</p>
</blockquote>
<blockquote>
<p>Some links:</p>
<ol>
<li><a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html">How to beat the CAP theorem</a></li>
<li><a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/">Questioning the Lambda Architecture</a></li>
</ol>
</blockquote>
<p>The reason the Lambda Architecture is successful is that it can actually provide good results even though correctness is a bit of a letdown; however, it is a lot of work to <u>maintain two independent versions of the pipeline and merge the results at the end</u>;</p>
<p><a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/">Some people</a> argue against the <u>necessity of dual-mode execution</u> because of the issue of repeatability of using a replayable system(like <a href="https://kafka.apache.org/10/documentation/streams/core-concepts.html#streams_topology">Kafka</a>) so they propose the <a href="https://hazelcast.com/glossary/kappa-architecture/">Kappa Architecture</a>, which runs a single pipeline using a well designed & built system(like <a href="https://flink.apache.org/">Apache Flink</a>);</p>
<table>
<thead>
<tr>
<th style="text-align: center">Lambda Architecture</th>
<th style="text-align: center">Kappa Architecture</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/assets/images/post/ss/lambda.jpg" alt="lambda" /></td>
<td style="text-align: center"><img src="/assets/images/post/ss/kappa.jpg" alt="kappa" /></td>
</tr>
</tbody>
</table>
<h3 id="lambda-vs-kappa-architecture">Lambda vs Kappa Architecture</h3>
<p>Usually, if the real-time algorithm and the batch algorithm have different outputs, meaning the batch and real-time layers cannot be merged, then one must use the Lambda Architecture;</p>
<blockquote>
<p>TBC</p>
</blockquote>
<h3 id="batch-vs-streaming-efficiency">Batch vs Streaming Efficiency</h3>
<ul>
<li><strong>Batch</strong>: high-latency, higher-efficiency;</li>
<li><strong>Streaming</strong>: low-latency, lower-efficiency;</li>
</ul>
<p>But for streaming systems to achieve the same performance as batch systems, we only need to focus on two things:</p>
<ol>
<li><strong>correctness</strong>: <u>strong consistency</u> is required for <u>exactly-once processing</u>, which is required for <u>correctness</u>, which is required to meet a batch system’s level of performance. (ref: <a href="https://www.oreilly.com/content/why-local-state-is-a-fundamental-primitive-in-stream-processing/">Why local state is a fundamental primitive in stream processing</a>)</li>
<li><strong>tools for reasoning about time</strong>: essential for dealing with unbounded, unordered data of varying time skew;</li>
</ol>
<h3 id="event-time-vs-processing-time">Event Time vs Processing Time</h3>
<table>
<thead>
<tr>
<th style="text-align: center">—</th>
<th>Event Time</th>
<th style="text-align: center">Processing Time</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Definition</td>
<td>the time at which events actually occured</td>
<td style="text-align: center">the time at which events are observed in the system</td>
</tr>
</tbody>
</table>
<p>Some variables that can affect the skew between event time and processing time:</p>
<ul>
<li>shared resource limitations like network congestion, network partitions, shared CPU, etc.;</li>
<li>software causes like distributed system logic, contention, etc.;</li>
<li>features of the data like distribution, variance in throughput, variance in disorder;</li>
</ul>
<p><img src="/assets/images/post/ss/event-process.png" alt="event-process" width="350" /></p>
<p>Because the overall mapping between event time and processing time is not static (the lag/skew can vary arbitrarily over time), we cannot analyze data solely by the time at which it is observed;</p>
<p>To cope with unbounded data, many systems implement <em>windowing</em> of the incoming data, meaning chopping the dataset into finite pieces along temporal boundaries;</p>
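<p>A minimal sketch of windowing by event time (a toy tumbling-window assignment, not any particular engine’s API): each record carries its event timestamp, and the window it lands in depends only on that timestamp, not on when the record arrives:</p>

```python
from collections import defaultdict

def tumbling_windows(events, size_s=60):
    """Assign (event_time_s, value) records to fixed-size windows keyed
    by the window's start time, using event time rather than arrival time."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[(ts // size_s) * size_s].append(value)
    return dict(windows)

# Records arrive out of order; event time still places them correctly:
events = [(10, "a"), (130, "c"), (65, "b"), (70, "d")]
print(tumbling_windows(events))  # {0: ['a'], 120: ['c'], 60: ['b', 'd']}
```

<p>In a real engine the open question is when a window can be declared complete, which is exactly what watermarks address later in the book.</p>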
<h3 id="data-processing-patterns">Data Processing Patterns</h3>
<ul>
<li>Bounded Data</li>
</ul>
<p>pretty straightforward: run the dataset through some data processing engine to get a structured dataset with greater inherent value;</p>
<ul>
<li>
<p>Unbounded Data</p>
</li>
<li>
<p>Fixed windows</p>
</li>
</ul>
<p>The most common way: repeatedly run a batch engine over input data windowed into fixed-size windows (sometimes called tumbling windows), each treated as a separate data source;</p>
<ul>
<li>Sessions</li>
</ul>
<h1 id="reference">Reference</h1>
<ul>
<li><a href="https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa">Picture for Lambda Architecture & Kappa Architecture</a></li>
<li><a href="https://arxiv.org/pdf/1506.08603.pdf">Lightweight Asynchronous Snapshots for Distributed Dataflows</a></li>
<li><a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/">Questioning the Lambda Architecture</a></li>
</ul>DanniThis post is my reading notes of Part 1, The Beam Model(Chapter 1-4) from the book, which covers the high-level batch, streaming data processing model called Apache Beam;Ubuntu 18.04 LTS Dual Boot with Win10(BIOS Legacy & MBR) [2020 UPDATE]2020-06-04T00:00:00+00:002020-06-04T00:00:00+00:00https://isdanni.com/ubuntu-18-04<blockquote>
<p><strong>[UPDATE June 2020]</strong> Spilt water on my computer last month, and while I was trying to fix it I completely messed up the network interfaces and sources.list to the extent that I had to reinstall the Linux distro; thought I’d update this post I wrote over two years ago. Hope this helps you:)</p>
</blockquote>
<h1 id="1-what-you-need">1. What you need</h1>
<ol>
<li>A USB stick/flash drive. The official guide on the Ubuntu website says at least 4 GB; personally, I used a 30 GB stick. (Too big, I know, but just to be safe.)</li>
<li>MS Windows XP or later that is working on the PC.</li>
<li>Rufus / UltraISO / Universal USB Installer, etc.: a tool that can write the Ubuntu ISO (download here) to your USB stick for the installation. <strong>Choose this carefully.</strong> Some common installation issues are caused by the tool you choose.</li>
<li>Enough unallocated space on disk.</li>
</ol>
<h1 id="2-make-your-bootable-usb-stick">2. Make your bootable USB stick</h1>
<p>Before we start, I want to emphasize one thing: always check your disk’s partition format.</p>
<p>There are two ways of partitioning a drive: <code class="language-plaintext highlighter-rouge">MBR (Master Boot Record)</code> and <code class="language-plaintext highlighter-rouge">GPT (GUID Partition Table)</code>. (To check your format, go to Disk Management and right-click Disk 0 to see its properties; mine is MBR.) So what’s the difference between MBR and GPT? Well, MBR is old and GPT is new, but as “the new is not always better than the old”, which I quoted from <a href="https://www.disk-partition.com/gpt-mbr/mbr-vs-gpt-1004.html">here</a>, each has its own pros and cons.</p>
<p>A GPT disk can be larger than 2 TB, while an MBR disk cannot. Both can be dynamic or basic. Also, GPT supports up to 128 partitions, while MBR only supports four primary ones.</p>
<p>Usually, we associate <code class="language-plaintext highlighter-rouge">MBR + BIOS</code> and <code class="language-plaintext highlighter-rouge">GPT + UEFI</code> together. If a Windows PC boots via UEFI, it will only support GPT.</p>
<p>Write the downloaded Ubuntu ISO to your USB stick. <strong>The USB stick will be formatted, so remember to back up its data.</strong> Remember to set the partition scheme to MBR and the file system to FAT32 (the default).</p>
<p>Please also check the <a href="https://tutorials.ubuntu.com/tutorial/tutorial-create-a-usb-stick-on-windows#0">official Ubuntu guide</a>.</p>
<h1 id="3-get-into-boot-menu">3. Get into Boot Menu</h1>
<p><img src="/assets/images/post/linux/start.png" alt="start" /></p>
<p>Restart the PC. Know the shortcut to enter the boot menu: for a ThinkPad it’s F12. After the Lenovo logo shows on screen, quickly press it before the logo disappears. A line will appear in white: <code class="language-plaintext highlighter-rouge">Entering Boot Menu</code>.</p>
<p>Here are shortcuts for some other PCs. Since I haven’t tried them all myself, I strongly suggest you verify before actually starting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Leveno PC: F12 or F1
Dell laptop: F12
HASEE laptop: F2
Sony laptop: DEL or F2 or F9
Samsung laptop: F10
IBM Pc: F12
</code></pre></div></div>
<p>Then you will see something like <a href="https://www.theregister.co.uk/2013/07/19/review_lenovo_thinkpad_helix_corei7_convertible/?page=2">this</a>. (Disclaimer: I took the image from that website; I do not own the copyright to this file.)</p>
<p><img src="/assets/images/post/linux/lenovo-boot.jpg" alt="boot-menue" /></p>
<ol>
<li>Disable <strong>Secure Boot</strong> after entering the boot menu.</li>
<li>Set the boot priority to <code class="language-plaintext highlighter-rouge">Legacy First</code> since the disk format is MBR.</li>
<li>Choose the USB stick in the boot queue.</li>
</ol>
<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UEFI/Legacy Boot [Both]
UEFI/Legacy Boot Priority [Legacy First]
CSM Support [Yes]
</code></pre></div></div>
<p>Press Esc, then y, to save and exit.</p>
<h1 id="4-now-install-ubuntu">4. Now install Ubuntu!</h1>
<p>If things go well, after selecting the USB in the boot menu you should see a menu listing “<strong>Try Ubuntu</strong>”, “<strong>Install Ubuntu</strong>”, …</p>
<p><img src="/assets/images/post/linux/install-try-ubuntu.jpeg" alt="install-try" /></p>
<p>Select Install or Try (it doesn’t matter, unless you really wanna play with it first for a bit). Remember to select “Something else” at the installation-type step.</p>
<p><img src="/assets/images/post/linux/partition.png" alt="partition" /></p>
<p><strong>Note</strong>: There are many partition schemes online; choose one that suits you! You can check the partition guide in the official Ubuntu wiki. Here’s mine, just for your reference;</p>
<p>Also, you can always boot into the live USB later and adjust the system partitions if you would like; be very careful tho ;-)</p>
<ul>
<li><strong>swap</strong>: size of RAM, or twice the size;</li>
<li><strong>/</strong>: minimum is 8 GB, but it is recommended to have at least 15 GB; # system will be blocked if root is full</li>
<li><strong>/boot</strong>: 250 MB ~ 1 GB; # sometimes required, but do not use the same one for several Linux distros;</li>
<li><strong>/home</strong>: as large as possible, especially when you install your Dropbox here and have a lot of files; # if you don’t want a separate home, just merge it with root;</li>
</ul>
<p>Here’s my updated partition as of 2020; I allocated most of the space to <code class="language-plaintext highlighter-rouge">/home</code> because I keep my Dropbox and most of my side projects there, and I need to ensure the data is safe in case of a drive failure/upgrade, though the general consensus nowadays is to just use / (which includes /home);</p>
<p><img src="/assets/images/post/linux/gparted.png" alt="gparted" />
<br /><br /></p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 2020 June: This is my outdated partition, please check image above!</span>
<span class="c"># on sda:</span>
/dev/sda3 / ext4 primary beginning 30GB
/dev/sda4 swap logical beginning 5GB
/dev/sda5 /boot ext4 logical beginning 1GB
/dev/sda6 /home ext4 logical beginning 200GB
<span class="c"># make sure this is as large as possible</span>
</code></pre></div></div>
<h1 id="5-use-easybcd-for-boot-loader">5. Use EasyBCD for boot loader</h1>
<p>After installation, <code class="language-plaintext highlighter-rouge">restart</code> and <code class="language-plaintext highlighter-rouge">enter Windows</code>. Download <a href="https://neosmart.net/EasyBCD/">EasyBCD</a>(This is for BIOS) and add an entry:</p>
<p><img src="/assets/images/post/linux/easy-bcd.png" alt="easy-bcd" /></p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Add Entry: Linux/BSD
Type: GRUB<span class="o">(</span>Legacy<span class="o">)</span>
Name: define yourself
Bootloader: /boot partition <span class="c"># if you have a separate /boot, else just the ubuntu partition </span>
Edit Menu: <span class="c"># Now you should have two entries, one Windows one Linux.</span>
</code></pre></div></div>
<p>Then <strong>restart</strong>; now you can choose an OS as you wish!</p>
<h1 id="6-some-issues">6. SOME issues</h1>
<h2 id="a-failed-to-load-ldlinux32">a. Failed to load ldlinux.32</h2>
<p>This happened the first time I tried to install Linux. I was a complete novice and knew nothing about low-level OS details. In general, this error can be caused by a lot of things: a broken USB port, a corrupted ISO image, driver incompatibility…</p>
<p>For me it was the writing software: I switched from UltraISO to Win32 Disk Imager and it all worked out. (But that tool is now deprecated, so I strongly suggest not following this.)</p>
<h2 id="b-underscore-flashing-on-black-screen-after-booting-into-newly-installed-ubuntu">b. Underscore flashing on black screen after booting into newly installed Ubuntu</h2>
<p>Something like this:</p>
<p><img src="/assets/images/post/linux/black.png" alt="black screen blinking cursor" /></p>
<p><strong>Grub</strong> issues.</p>
<p>This happened so many times I could recite all the commands I tried in my sleep. Basically, to repair it you can <strong>boot into the live USB</strong> after installation, add the boot-repair repository, and run <code class="language-plaintext highlighter-rouge">boot-repair</code>; if that still does not work, try installing GRUB in your /boot partition.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>add-apt-repository ppa:yannubuntu/boot-repair
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> boot-repair <span class="o">&&</span> boot-repair
</code></pre></div></div>
<p><img src="/assets/images/post/linux/boot-repair.png" alt="boot-repair" /></p>
<p><strong>Note</strong>: sometimes before boot-repair starts you may get a prompt asking whether this drive is the fixed drive; remember to choose “No” if you are installing Ubuntu on your PC;</p>
<p>Check links <a href="https://help.ubuntu.com/community/Boot-Repair">here</a> and <a href="https://www.linux.com/learn/how-rescue-non-booting-grub-2-linux%20%20">here</a>.</p>
<h2 id="c-grub-rescue-mode">c. GRUB rescue mode</h2>
<p>This is also related to a broken GRUB; it can happen after you reboot into the Ubuntu partition. To fix it, simply run the commands below and find the root partition.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>grub rescue <span class="o">></span> <span class="nb">ls</span>
<span class="o">(</span>hd0<span class="o">)</span> <span class="o">(</span>hd0,msdos5<span class="o">)</span> <span class="o">(</span>hd0,msdos3<span class="o">)</span> <span class="o">(</span>hd0,msdos2<span class="o">)</span> <span class="o">(</span>hd0,msdos1<span class="o">)</span> <span class="o">(</span>hd1<span class="o">)</span> <span class="o">(</span>hd1,msdos1<span class="o">)</span>
grub rescue <span class="o">></span> <span class="nb">ls</span> <span class="o">(</span>hd0,msdos1<span class="o">)</span> <span class="c"># try to recognize which partition is this</span>
grub rescue <span class="o">></span> <span class="nb">ls</span> <span class="o">(</span>hd0,msdos2<span class="o">)</span> <span class="c"># let's assume this is the linux partition</span>
grub rescue <span class="o">></span> <span class="nb">set </span><span class="nv">root</span><span class="o">=(</span>hd0,msdos2<span class="o">)</span>
grub rescue <span class="o">></span> <span class="nb">set </span><span class="nv">prefix</span><span class="o">=(</span>hd0,msdos2<span class="o">)</span>/boot/grub <span class="c"># or wherever grub is installed</span>
grub rescue <span class="o">></span> insmod normal <span class="c"># if this produced an error, reset root and prefix to something else ..</span>
grub rescue <span class="o">></span> normal
</code></pre></div></div>
<h1 id="some-useful-links">Some Useful Links</h1>
<ul>
<li><a href="https://askubuntu.com/questions/21719/how-large-should-i-make-root-home-and-swap-partitions">Ubuntu suggested partition</a></li>
<li><a href="https://help.ubuntu.com/community/Boot-Repair">Ubuntu help wiki - Boot Repair</a></li>
</ul>
<p><strong>“Welcome to the producer side!”</strong></p>Danni[UPDATE June 2020] Spilt water on my computer last month and while I was trying to fix it I completely messed up the network interfaces and sources.list to the extent that I had to reinstall the Linux distro; Thought I’d update this post I wrote over 2 years ago. Hope this helps you:)Reservoir Sampling and Randomized Algorithms2020-05-24T00:00:00+00:002020-05-24T00:00:00+00:00https://isdanni.com/reservoir_sampling_and_randomized_algorithms<blockquote>
<p>How the randomized algorithms work and its implementation in streaming systems</p>
</blockquote>
<h1 id="randomized-algorithm">Randomized Algorithm</h1>
<p>A <strong>randomized algorithm</strong> applies a certain level of randomness as part of its logic. It usually uses <a href="https://en.wikipedia.org/wiki/Discrete_uniform_distribution">uniform random</a> selection (<u>each element of a dataset of N elements has a 1/N probability of being chosen</u>) to define the behaviour of an auxiliary input, in the hope of achieving good performance in the <u>average case</u>;</p>
<p>Randomized algorithms can be random in the following aspects:</p>
<ol>
<li>The operations performed on the actual problem are random;</li>
<li>The computing complexity of the problem is a random variable;</li>
<li>The algorithm output is random (it might be right or wrong);</li>
</ol>
<h2 id="a-choose-one-element-randomly">a. choose one element randomly</h2>
<p>When the ith element arrives, it must be chosen with probability 1/i (and left out with probability 1 - 1/i);</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// One element</span>
<span class="c1">// Proof of uniform random</span>
<span class="c1">// for ith item, the probability of being chosen</span>
<span class="mi">1</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">))</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">))</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/</span><span class="n">n</span><span class="o">)</span>
<span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="n">i</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">)</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="o">/</span> <span class="n">n</span>
<span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">n</span>
</code></pre></div></div>
<h2 id="b-choose-k-elements-randomly">b. choose k elements randomly</h2>
<p>When the ith element arrives, it is chosen with probability k/i (and not chosen with probability 1 - k/i);</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// K elements</span>
<span class="c1">// Proof of uniform random</span>
<span class="c1">// for ith item, the probability of being chosen</span>
<span class="n">k</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">k</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">k</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">k</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">)</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">k</span><span class="o">)</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">k</span><span class="o">/</span><span class="n">n</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">k</span><span class="o">)</span>
<span class="o">=</span> <span class="n">k</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">))</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">))</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/</span><span class="n">n</span><span class="o">)</span>
<span class="o">=</span> <span class="n">k</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="n">i</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">)</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="o">/</span> <span class="n">n</span>
<span class="o">=</span> <span class="n">k</span><span class="o">/</span><span class="n">n</span>
</code></pre></div></div>
<h1 id="1-reservoir-sampling">1. Reservoir sampling</h1>
<p><strong>Reservoir sampling</strong> is a family of <a href="https://en.wikipedia.org/wiki/Randomized_algorithm">randomized algorithms</a> for <strong>choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items</strong>.</p>
<p><strong>NOTE</strong>:</p>
<ul>
<li>size n here usually cannot fit into <a href="https://en.wikipedia.org/wiki/Main_memory">main memory</a>;</li>
<li>n is unknown and revealed over time; otherwise it would be too easy;</li>
<li>time complexity required is <code class="language-plaintext highlighter-rouge">O(N)</code>;</li>
<li>probability of each item being chosen must be <code class="language-plaintext highlighter-rouge">k/n</code>;</li>
</ul>
<h2 id="simple-algorithm">Simple Algorithm</h2>
<p>The commonly used algorithm that is <strong>simple but slow</strong> is known as <a href="http://www.cs.umd.edu/~samir/498/vitter.pdf">Algorithm R</a>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="cm">/**
* Algorithm R works by maintaining a reservoir size of k,
* 1. which initially contains the first k items of the input;
* then iterates over the remaining items until the input is exhausted.
* 2. when reaches the ith item
* a. if i >= k, random choose d in [0, i]
* if d is within [0, k -1], use ith item to replace dth item in reservoir;
* 3. repeat second step;
*/</span>
<span class="kt">int</span><span class="o">[]</span> <span class="n">reservoir</span> <span class="o">=</span> <span class="k">new</span> <span class="kt">int</span><span class="o">[</span><span class="n">k</span><span class="o">];</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">reservoir</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span>
<span class="o">{</span>
<span class="n">reservoir</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">=</span> <span class="n">dataStream</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
<span class="o">}</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">k</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">dataStream</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span>
<span class="o">{</span>
<span class="c1">// random integer in [0, i];</span>
<span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="n">rand</span><span class="o">.</span><span class="na">nextInt</span><span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="o">);</span>
<span class="c1">// if integer is within [0, m-1],then replace reservoir</span>
<span class="k">if</span> <span class="o">(</span><span class="n">d</span> <span class="o"><</span> <span class="n">k</span><span class="o">)</span>
<span class="o">{</span>
<span class="n">reservoir</span><span class="o">[</span><span class="n">d</span><span class="o">]</span> <span class="o">=</span> <span class="n">dataStream</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p><strong>CONS</strong>: the <a href="https://en.wikipedia.org/wiki/Big_O_notation">asymptotic running time</a> is <code class="language-plaintext highlighter-rouge">O(n)</code>, with one random draw per item, which makes the algorithm unnecessarily slow if the input population is large.</p>
<h2 id="distributedparallel-reservoir-sampling">Distributed/Parallel Reservoir Sampling</h2>
<p>In distributed systems, <strong>main memory & IO ops</strong> would be the bottleneck; so for data at a very large scale, we can improve the overall performance with a parallel algorithm:</p>
<ol>
<li>Assume we have <code class="language-plaintext highlighter-rouge">m</code> machines; divide the stream into <code class="language-plaintext highlighter-rouge">m</code> substreams, let every machine run reservoir sampling over its own substream, and note the substream sizes as <code class="language-plaintext highlighter-rouge">N1, N2, ..., Nk, ..., Nm</code> => N1 + N2 + N3 + … + Nm = N;</li>
<li>To merge, choose a <strong>random number</strong> d from <code class="language-plaintext highlighter-rouge">[1, N]</code>:
a. if d &lt;= N1, take a replacement from the first machine’s reservoir, and so on; repeat m times;</li>
</ol>
<p>=> m / N</p>
<h2 id="implementation">Implementation</h2>
<p>Because reservoir sampling has <strong>O(N) time complexity</strong> and <strong>O(k) space complexity</strong>, it is usually adopted in streaming systems where statistical sampling is required; for example, randomly outputting n lines from a large-scale dataset;</p>
<p>For algorithm lovers, you could also find some common problems like: <a href="https://leetcode.com/problems/linked-list-random-node/">linked list random node</a>, <a href="https://leetcode.com/problems/random-pick-index/">pick random index</a>;</p>
<h2 id="limitations">Limitations</h2>
<p>Reservoir sampling makes the assumption that the desired sample fits into main memory, often <strong>implying that k is a constant independent of n</strong>.</p>
<blockquote>
<p>ref: wiki: <a href="https://en.wikipedia.org/wiki/Reservoir_sampling#Limitations">reservoir sampling#limitations</a></p>
</blockquote>
<p>In applications where we would like to select a large subset of the input list (say a third, i.e. <code class="language-plaintext highlighter-rouge">k=n/3</code>), other methods need to be adopted. Distributed implementations for this problem have been proposed.</p>
<h1 id="2-geometric-distribution">2. Geometric Distribution</h1>
<p>Time Complexity O(K + Klog(N/K))</p>
<p><img src="/assets/images/post/geometric_distribution.jpg" alt="geometric-distribution" /></p>
<h1 id="3-fisheryates-shuffle">3. Fisher–Yates shuffle</h1>
<p>The Fisher–Yates shuffle is used to generate a random permutation of a finite sequence, i.e. to shuffle the sequence;</p>
<p>So choosing k items randomly from the sequence is equivalent to running the shuffle for k steps and taking the k items it places, just like shuffling off the top k cards of a deck;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- To shuffle an array a of n elements (indices 0..n-1):
for i from n−1 downto 1 do
j ← random integer such that 0 ≤ j ≤ i
exchange a[j] and a[i]
</code></pre></div></div>
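<p>The pseudocode above translates directly to Java; as a sketch (my own code, with illustrative names), a partial run of the same loop also answers the “choose k items” question, since only the last k slots need to be shuffled:</p>

```java
import java.util.Random;

class Shuffle {
    // Full Fisher-Yates shuffle of the array, in place.
    static void fisherYates(int[] a, Random rand) {
        for (int i = a.length - 1; i >= 1; i--) {
            int j = rand.nextInt(i + 1); // 0 <= j <= i
            int tmp = a[j]; a[j] = a[i]; a[i] = tmp;
        }
    }

    // Run only k iterations: afterwards the last k slots hold a
    // uniformly random k-subset, in random order.
    static int[] chooseK(int[] a, int k, Random rand) {
        int[] copy = a.clone();
        for (int i = copy.length - 1; i >= copy.length - k; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = copy[j]; copy[j] = copy[i]; copy[i] = tmp;
        }
        int[] out = new int[k];
        System.arraycopy(copy, copy.length - k, out, 0, k);
        return out;
    }
}
```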
<h1 id="reference">Reference</h1>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Reservoir_sampling">Reservoir sampling</a>;</li>
<li><a href="https://en.wikipedia.org/wiki/Randomized_algorithm">randomized algorithms</a>;</li>
<li><a href="https://peteroupc.github.io/randomfunc.html">random functions</a></li>
<li><a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher–Yates shuffle</a></li>
<li><a href="https://en.wikipedia.org/wiki/Geometric_distribution">Geometric Distribution</a></li>
</ul>DanniHow the randomized algorithms work and its implementation in streaming systemsClean Code: A Handbook of Agile Software Craftsmanship2020-05-19T00:00:00+00:002020-05-19T00:00:00+00:00https://isdanni.com/clean-code<p>Finally had the time to <del>almost</del> finish this book ;-)</p>
<h1 id="naming">Naming</h1>
<ul>
<li>Use descriptive and unambiguous names;</li>
<li>Avoid misunderstanding; (e.g. Use <code class="language-plaintext highlighter-rouge">accountList</code> for a list of accounts unless it is the real list data type, otherwise <code class="language-plaintext highlighter-rouge">accounts</code> or <code class="language-plaintext highlighter-rouge">AccountGroup</code> would be better);</li>
<li>Use meaningful distinction; (e.g. Usually do not use <code class="language-plaintext highlighter-rouge">a</code> or <code class="language-plaintext highlighter-rouge">the</code> for variable prefix since it is hard to distinguish what it actually means)</li>
<li>Use names that can be pronounced;</li>
<li>Use searchable names => easier to adjust during the debugging & code-review stage;</li>
<li>Be consistent;</li>
<li>Avoid encodings:
<ul>
<li>do not append type prefixes/postfixes like <code class="language-plaintext highlighter-rouge">strings</code> or <code class="language-plaintext highlighter-rouge">str</code>; the compiler can distinguish types itself;</li>
</ul>
</li>
<li>Replace <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a> with named constants;</li>
</ul>
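<p>A tiny, hypothetical before/after (the names are mine, not from the book) combining two of the rules above — searchable names, and a named constant replacing a magic number:</p>

```java
// Before: if (d > 7) { ... }  -- neither "d" nor "7" can be searched for.
class RetentionPolicy {
    // Named constant: greppable, and the unit lives in the name.
    static final int MAX_IDLE_DAYS = 7;

    static boolean isExpired(int idleDays) {
        return idleDays > MAX_IDLE_DAYS;
    }
}
```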
<h1 id="function">Function</h1>
<ul>
<li><strong>small</strong>;</li>
<li>Do <strong>ONE</strong> thing;</li>
<li>Use <strong>descriptive</strong> name;</li>
<li>Arguments:
<ul>
<li>have fewer arguments => functions & arguments are on different abstract levels;</li>
<li>avoid passing <code class="language-plaintext highlighter-rouge">Boolean</code> as input;</li>
<li>If a function has arguments but no output, it should be an event; otherwise it must have a return value;</li>
</ul>
</li>
<li>No side effects;</li>
<li>Use exception instead of error;</li>
<li>Goal: Eliminate duplicate functions;</li>
</ul>
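<p>As a hypothetical illustration of the “avoid passing Boolean as input” rule above (example and names are mine, not from the book): a flag argument means the function does two things, so split it:</p>

```java
// Before: render(true) -- unreadable at the call site, and the function
// necessarily branches into two behaviours.
class Report {
    // After: two functions, each doing one thing.
    static String renderForSuite() { return "suite report"; }
    static String renderForSingleTest() { return "single-test report"; }
}
```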
<h1 id="comments">Comments</h1>
<ul>
<li><strong>NOTE</strong>: some comments might be outdated, or just simply wrong;</li>
<li>Do not comment on an ill-formed function; reconstruct the function instead;</li>
<li>Dos:
<ul>
<li>Alert</li>
<li>Use <code class="language-plaintext highlighter-rouge">// TODO</code> if necessary;</li>
<li>Always try to explain the code;</li>
</ul>
</li>
<li>Don’ts:
<ul>
<li>Don’t be redundant.</li>
<li>Don’t add obvious noise.</li>
<li>Don’t use closing brace comments.</li>
<li>Don’t comment out code;</li>
</ul>
</li>
</ul>
<h1 id="source-code-structure">Source code structure</h1>
<ul>
<li>Shorter file is easier to understand;</li>
<li>Use indentation, even when the function only has a one-line statement or is empty;</li>
<li>Declare variables near their usage;</li>
<li>Vertically distance:
<ul>
<li>Do not place similar concepts in different folders unless there is a very good reason;</li>
<li>the control variable of a loop should always be declared within the loop statement;</li>
<li>Keep functions with similar usage close;</li>
<li>The function that is calling should always be placed on top of the called function;</li>
</ul>
</li>
<li>Source code should be clear and well-structured; its name shows it’s in the correct module; at the beginning of the file, it should display the high-level concept & algorithm, then details in the following sections;</li>
<li>Follow the team rule;</li>
<li>Boundaries:
<ul>
<li>Hide third-party APIs;
<ul>
<li>once the third-party package changes, it’s easier to change our own codebase;</li>
<li>consistent code style, easier to read;</li>
</ul>
</li>
<li>Write tests for third-party APIs:
<ul>
<li>Learning test: a faster way to understand its usage;</li>
<li>Efficient way to know if the API function changes;</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="object--data-structures">Object & Data Structures</h1>
<ul>
<li>Avoid hybrid structures => half object half data;</li>
<li>Only do <strong>one</strong> thing;</li>
<li>Do not put public accessors & functions with change operations together;</li>
<li>If we want to frequently add functions instead of new objects, we should use the <strong>procedure-oriented</strong> programming style => only adds in one place;</li>
<li>If we want to frequently add data objects, use <strong>OOP</strong> => does not change other data code;</li>
<li>Hide internal implementation/structures;</li>
<li>Prefer non-static methods;</li>
<li>Better to implement many functions than passing many arguments into one function to select a behaviour;</li>
</ul>
<h1 id="error-handeling">Error handeling</h1>
<ul>
<li>Use exceptions instead of code;</li>
<li>Use <strong>unchecked exceptions</strong> => does not require try/catch or throw to compile => simplify the codes;</li>
<li>Add an exception message => why did this fail? The default exception only provides a stack trace; if the system has a logging system, log it;</li>
<li>Do not return <code class="language-plaintext highlighter-rouge">NULL</code> => If returned, we need to constantly check the <code class="language-plaintext highlighter-rouge">NULL</code> value, thus prone to <code class="language-plaintext highlighter-rouge">NullPointerException</code>;</li>
<li>Do not pass <code class="language-plaintext highlighter-rouge">NULL</code>;</li>
<li>Define <strong>Special Case Pattern</strong>;</li>
</ul>
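The book’s examples are in Java, but the Special Case Pattern can be sketched in Go as well (the Employee names below are hypothetical): instead of returning <code class="language-plaintext highlighter-rouge">NULL</code> and making every caller check for it, return a special-case object with safe defaults.

```go
package main

import "fmt"

// Employee is an illustrative domain type.
type Employee interface{ Payout() int }

type SalariedEmployee struct{ Salary int }

func (e SalariedEmployee) Payout() int { return e.Salary }

// NullEmployee is the special case: a safe, do-nothing stand-in
// that callers can use without a nil check.
type NullEmployee struct{}

func (NullEmployee) Payout() int { return 0 }

var registry = map[string]Employee{"alice": SalariedEmployee{Salary: 100}}

// Lookup never returns nil, so callers can never hit a nil dereference.
func Lookup(name string) Employee {
	if e, ok := registry[name]; ok {
		return e
	}
	return NullEmployee{}
}

func main() {
	// No nil check needed, even for unknown names.
	fmt.Println(Lookup("alice").Payout())
	fmt.Println(Lookup("bob").Payout())
}
```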
<h1 id="tests">Tests</h1>
<ul>
<li>Keep test code as clean as production code, but hold it to a different standard => production code usually aims for performance; test code does not;</li>
<li>Aim for high test coverage;</li>
<li>Readability is important;</li>
<li><strong>TDD</strong>: first build the test data, then operate on it, then verify the result (Build-Operate-Check);</li>
<li><strong>FIRST</strong> rule: F(ast), I(ndependent), R(epeatable), S(elf-validating), T(imely);</li>
</ul>
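The Build-Operate-Check sequence can be sketched in Go (<code class="language-plaintext highlighter-rouge">Sum</code> is a placeholder unit under test; a real Go test would use the <code class="language-plaintext highlighter-rouge">testing</code> package):

```go
package main

import "fmt"

// Sum is the (illustrative) unit under test.
func Sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	// Build: create the test data.
	data := []int{1, 2, 3}
	// Operate: run the code under test.
	got := Sum(data)
	// Check: verify the result.
	if got != 6 {
		fmt.Println("FAIL: expected 6, got", got)
		return
	}
	fmt.Println("PASS")
}
```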
<h1 id="code-smells">Code smells</h1>
<blockquote>
<p>When to change the code? When the code has bad smells;</p>
</blockquote>
<p>Here’s a list:</p>
<h2 id="comment">Comment</h2>
<ul>
<li>Unwanted information => e.g. change history;</li>
<li>Commented out code; => Just delete it;</li>
<li>Comment that is too obvious => comment should have information the code does not offer;</li>
<li>Outdated comment;</li>
</ul>
<h2 id="environment">Environment</h2>
<ul>
<li>How many steps it takes to build the project => the build should be a single operation;</li>
<li>How many steps it takes to run the tests;</li>
</ul>
<h2 id="functions">Functions</h2>
<ul>
<li>Dead function => never called;</li>
<li>Over-complicated;</li>
<li>Too many arguments;</li>
<li>Output arguments;</li>
<li>Flag arguments;</li>
</ul>
<h2 id="parameters">Parameters</h2>
<ul>
<li>Does not follow standard naming conventions;</li>
<li>Uses all kinds of inconsistent prefixes/suffixes;</li>
<li>Does not explain what it is used for;</li>
<li>…</li>
</ul>
<h2 id="testing">Testing</h2>
<ul>
<li>Not enough coverage;</li>
<li>No coverage tools;</li>
<li>Neglect small tests;</li>
<li>Too slow;</li>
</ul>
<h2 id="general">General</h2>
<ol>
<li>Rigidity: difficult to change. A small change causes a cascade of subsequent changes;</li>
<li>Fragility: breaks in many places due to a single change;</li>
<li>Immobility: cannot reuse parts of the code in other projects because of involved risks and high effort;</li>
<li>Needless Complexity;</li>
<li>Needless Repetition;</li>
<li>Opacity: hard to understand the code;</li>
</ol>DanniFinally had the time to almost finish this book ;-)Design patterns in systems with limited memory2020-03-09T00:00:00+00:002020-03-09T00:00:00+00:00https://isdanni.com/patterns-for-systems-with-limited-memory<blockquote>
<p>Reading Small Memory Software: Patterns for systems with limited memory</p>
</blockquote>
<p>2020 so far has been a train wreck. Without any classes on campus, I did manage to spend some time focusing on learning design patterns in software and systems in general: a goal I set a year ago but never had the time to pursue.</p>
<p><strong>Small Memory Software</strong> is a classic for those wishing to learn more about system design patterns & memory efficiency. Its first edition was published around 2000. Although the memory capacities engineers work with have changed greatly since then, many of its principles remain applicable to any software that relies on the efficient use of memory and other resources. I first discovered this book through an online forum, where some senior engineers recommended it: “at least it’s worthwhile to read a few chapters to see if you have run into the same memory constraints in the past”. I was quite skeptical at first, since the book looked unremarkable and rather old by tech-industry standards. So I read the <a href="http://smallmemory.com/1_IntroductionChapter.pdf">introduction</a> before making any purchase. It was a fun experience, especially for those who have faced similar issues: like a series of “yeah, same” moments linked together and made coherent. A few hours into reading, I bought the physical book.</p>
<h3 id="why-still-read-this-book">Why still read this book?</h3>
<p>So first, why should we still read this book? Computer memory used to be expensive, but now companies, and even individuals, can easily afford machines with plenty of memory. Yet as the computing power of mobile devices advances, we rely more and more on our phones and their huge number of applications, and the pressure on developers of such applications to support large request volumes has grown beyond imagination. So, yes, small-memory software is back.</p>
<p>In this post we will be focusing on key components like <strong>RAM</strong>, <strong>ROM</strong> and <strong>secondary storage</strong>. Of course there are other constraints, such as network, processing power, and graphics, that can slow down a real system, but more patterns already exist for the parameters mentioned above.</p>
<h3 id="small-archietecture">Small Architecture</h3>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Embedded Systems</strong></th>
<th style="text-align: center"><strong>Mobile devices</strong></th>
<th style="text-align: center"><strong>PC</strong></th>
<th style="text-align: right"><strong>server farms</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>typical applications</strong></td>
<td style="text-align: center">Device control, protocol conversion, etc</td>
<td style="text-align: center">Diary, Address book, Phone, Email</td>
<td style="text-align: center">Word processing, spreadsheets, small databases, accounting.</td>
<td style="text-align: right">E-commerce, large database applications, accounting, stock control.</td>
</tr>
<tr>
<td><strong>UI</strong></td>
<td style="text-align: center">NA</td>
<td style="text-align: center">GUI; libraries in ROM</td>
<td style="text-align: center">GUI, with several possible libraries as DLLs on disk</td>
<td style="text-align: right">Implemented by clients, browsers or terminals</td>
</tr>
<tr>
<td><strong>Network</strong></td>
<td style="text-align: center">None, Serial Connection, or industrial LAN</td>
<td style="text-align: center">TCP/IP over a wireless connection</td>
<td style="text-align: center">10MBps LAN</td>
<td style="text-align: right">100 MBps LAN</td>
</tr>
<tr>
<td><strong>IO</strong></td>
<td style="text-align: center">As needed – often the main purpose of device.</td>
<td style="text-align: center">Serial connections</td>
<td style="text-align: center">Serial & parallel ports, modem, etc.</td>
<td style="text-align: right">Any, accessed via LAN</td>
</tr>
</tbody>
</table>
<h3 id="allocations">Allocations</h3>
<blockquote>
<p>pdf: <a href="http://smallmemory.com/6_AllocationChapter.pdf">allocations</a></p>
</blockquote>
<h5 id="fragmentation">Fragmentation</h5>
<p>For dynamic memory allocation there are two types of fragmentation: <strong>internal fragmentation</strong> and <strong>external fragmentation</strong> (and <strong>data fragmentation</strong>, as some would add). It usually happens when user processes are loaded into and removed from RAM in blocks, leaving main memory unable to load a new process even though enough total memory is available: it is scattered across many small blocks.</p>
<p>Memory will eventually run out no matter which allocation scheme we choose, so the best we can do for memory management is to pick a plan that fits the situation.</p>
<ol>
<li>
<p><strong>Fixed-size client memories</strong>: makes the user responsible for memory problems, but it becomes harder to provide the app’s full features, which can lower user engagement.</p>
</li>
<li>
<p><strong>Signal an error</strong>: it is easy to inform the client of the error, but it is more important to handle the error correctly; think of the partial-failure pattern. This approach usually gives us more options for handling memory problems.</p>
</li>
<li>
<p><strong>Reduce quality to reduce quantity</strong>: reducing quality can preserve system throughput. A popular example is reducing the quality of stored images, or lowering the sampling frequency.</p>
</li>
<li>
<p><strong>Delete old objects</strong>: a common practice. For example, if your pictures take too long to load in Instagram, you will most likely refresh or reopen the app. This is <em>Fresh Work Before Stale</em>: terminate old connections that are unlikely to be answered, and delete old objects to make room for new ones.</p>
</li>
<li>
<p><strong>Defer new requests</strong>/ <strong>IGNORE</strong>.</p>
</li>
</ol>
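Pattern 4 (“delete old objects”) can be sketched as a fixed-capacity cache that evicts its oldest entry to make room. This is a simplified FIFO sketch under assumed names, not code from the book:

```go
package main

import "fmt"

// FIFOCache is a fixed-capacity cache that deletes its oldest entry
// to make room for new ones: fresh work before stale.
type FIFOCache struct {
	capacity int
	order    []string // insertion order, oldest first
	items    map[string]string
}

func NewFIFOCache(capacity int) *FIFOCache {
	return &FIFOCache{capacity: capacity, items: map[string]string{}}
}

func (c *FIFOCache) Put(key, value string) {
	if _, exists := c.items[key]; !exists {
		if len(c.order) == c.capacity {
			// Evict the oldest object to keep memory use bounded.
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.items, oldest)
		}
		c.order = append(c.order, key)
	}
	c.items[key] = value
}

func (c *FIFOCache) Get(key string) (string, bool) {
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := NewFIFOCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Put("c", "3") // evicts "a", the oldest entry
	_, ok := c.Get("a")
	fmt.Println(ok)
}
```

A production cache would more likely evict by least-recent *use* (LRU) rather than insertion order, but the memory bound works the same way.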
<p><img src="/img/post/small_mem/ALLOCATION.png" alt="Allocation Patterns" /></p>
<blockquote>
<p>I will be updating this post regularly till I finish the whole book. If you have any questions, feel free to discuss in the comment section below.</p>
</blockquote>DanniReading Small Memory Software: Patterns for systems with limited memoryWriting elegant Golang2019-12-20T00:00:00+00:002019-12-20T00:00:00+00:00https://isdanni.com/elegant-golang<p>Full disclosure, I didn’t start using Golang actively until recent months, even though I have always claimed to know it and put it in the language section of my resume (naively & shamelessly). But there is definitely a huge difference between knowing some common syntax and understanding the language completely at an engineering level.</p>
<p>Last week, while talking with a friend who started using Golang for his PhD thesis, I realized we both shared the same learning experience (though definitely not the most efficient learning curve):</p>
<ol>
<li>started a huge list of online tutorials;</li>
<li>proceeded to spend money on books;</li>
<li>gave up the first two & just started development;</li>
<li>hit bugs we couldn’t understand, solved them via online forums, and went back to the learning materials;</li>
</ol>
<p>One thing we both 100% agreed on: <strong>practice, practice, practice</strong>. More precisely, practicing by building something yourself. I have always believed the best way to learn programming language is a quick project that adopts most common features and has a progressive learning curve.</p>
<h3 id="personal-notes-on-writing-concise--elegant-golang">personal notes on writing concise & elegant Golang</h3>
<ol>
<li><code class="language-plaintext highlighter-rouge">gofmt</code>, <code class="language-plaintext highlighter-rouge">goimports</code>, <code class="language-plaintext highlighter-rouge">golangci-lint</code>, etc.</li>
<li>Standard Go Project Layout:
<ul>
<li>do NOT contain <code class="language-plaintext highlighter-rouge">/src</code>: especially for Java developers who are used to that layout;</li>
<li><code class="language-plaintext highlighter-rouge">/internal</code> modules cannot be used by external parties;</li>
</ul>
</li>
<li>do NOT use <code class="language-plaintext highlighter-rouge">init</code> for initializing resources like <code class="language-plaintext highlighter-rouge">rpc</code>, <code class="language-plaintext highlighter-rouge">DB</code>, or <code class="language-plaintext highlighter-rouge">Redis</code> connections, because <code class="language-plaintext highlighter-rouge">init</code> is executed implicitly: every time we declare an <code class="language-plaintext highlighter-rouge">init()</code> function, Go will load and run it <strong>prior</strong> to anything else in that package;</li>
</ol>
<p>At the <code class="language-plaintext highlighter-rouge">init</code> stage, it is best to keep to simple conditional checks, like using a <code class="language-plaintext highlighter-rouge">flag: True/False</code> to determine the status of parameters;</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// main.go</span>
<span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="p">)</span>
<span class="k">var</span> <span class="n">name</span> <span class="kt">string</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
<span class="n">name</span> <span class="o">=</span> <span class="s">"anonymous"</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"My name is %s"</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Instead, it is better to use <strong>Client + NewClient</strong> for initializing connection.</p>
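A minimal sketch of the <strong>Client + NewClient</strong> idea (the address and fields below are hypothetical; a real client would dial the connection inside the constructor):

```go
package main

import (
	"errors"
	"fmt"
)

// Client wraps a resource connection. It is created explicitly by the
// caller, not implicitly by init(), so errors can be handled normally.
type Client struct {
	addr string
}

// NewClient validates its input and returns an error instead of failing
// at package load time, as an init()-based setup would.
func NewClient(addr string) (*Client, error) {
	if addr == "" {
		return nil, errors.New("empty address")
	}
	// A real client would dial the connection here.
	return &Client{addr: addr}, nil
}

func main() {
	c, err := NewClient("localhost:6379")
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("client ready for", c.addr)
}
```

The caller decides when the connection is created and how failures are handled, which <code class="language-plaintext highlighter-rouge">init()</code> cannot offer.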
<ol start="4">
<li>
<p>Testing. Use frameworks like <a href="https://github.com/golang/mock">GoMock</a> (see this <a href="https://blog.codecentric.de/en/2017/08/gomock-tutorial/">tutorial</a>), <code class="language-plaintext highlighter-rouge">httpmock</code>, and <code class="language-plaintext highlighter-rouge">monkey</code> for testing;</p>
</li>
<li>
<p>Optimization:</p>
<ul>
<li>Instead of <code class="language-plaintext highlighter-rouge">fmt.Sprintf</code>, use <code class="language-plaintext highlighter-rouge">strconv</code>;</li>
<li>Use <code class="language-plaintext highlighter-rouge">sync.Pool</code> to re-use previously allocated objects and reduce the work of the garbage collector;</li>
<li>Avoid using structures containing pointers as keys for large maps:
<ul>
<li>For example, with a <code class="language-plaintext highlighter-rouge">map[string]int</code>, the garbage collector has to scan every key, since strings contain pointers;</li>
</ul>
</li>
</ul>
</li>
</ol>
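For instance, the <code class="language-plaintext highlighter-rouge">sync.Pool</code> point can be illustrated with a reusable byte buffer — a common idiom, not code from the post:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths do not allocate a
// fresh buffer per call, reducing garbage-collector pressure.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // pooled objects keep old state; always reset first
	defer bufPool.Put(buf) // return the buffer to the pool for reuse
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher"))
}
```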
<h3 id="reference">Reference</h3>
<p>https://stephen.sh/posts/quick-go-performance-improvements</p>DanniFull disclosure, I did’t start using Golang actively till recent months, even though I have always claimed to know it and put it in the language section on my resume(naively & shamelessly). But there is definietly a huge difference between knowing some common syntaxes and understanding the language in engineering level completely.Consistent Hashing: tradeoffs & how-to in Redis2019-08-16T00:00:00+00:002019-08-16T00:00:00+00:00https://isdanni.com/consistent-hashing<h2 id="what-is-hashing">What is Hashing?</h2>
<blockquote>
<p><strong>Merriam-Webster</strong>: <strong><em>noun</em></strong>: “chopped meat mixed with potatoes and browned”; <strong><em>verb</em></strong>: “to chop (as meat and potatoes) into small pieces.”</p>
</blockquote>
<p>So basically, hashing is, in general terms, a mapping between data objects. The input and output values do not need to be the same type.</p>
<p><strong>Hash collision</strong>: more than one input is mapped to the same hash result (see the infamous <a href="https://learncryptography.com/hash-functions/hash-collision-attack">Hash Collision Attack</a>).</p>
<h2 id="simple-hash-in-redis">Simple hash in Redis?</h2>
<p>To ensure high availability and improve read performance, we can simply set up <a href="https://www.digitalocean.com/community/tutorials/how-to-configure-redis-replication-on-ubuntu-16-04">replication</a> in Redis, forming <code class="language-plaintext highlighter-rouge">Master-Master</code> or <code class="language-plaintext highlighter-rouge">Master-Slave</code> topologies, and build clusters to split read/write data operations. Similar to a database: when the data grows too large, we create new databases/tables.</p>
<p><img src="/assets/images/post/redis/redis-labs.png" alt="High Availability" /><em>High Availability in Redis [Source: Redis Labs][i1]</em></p>
<p>For example, if the key is an image name and the value is its file path:</p>
<ul>
<li>searching for a certain image requires traversing every Redis server;</li>
<li>if we use a plain hash, <code class="language-plaintext highlighter-rouge">hash(file-name.png) % num(server)</code>, we can go directly to the server we need, but there is a problem -> when the number of servers changes, every cache location changes too.</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">hash</span><span class="o">(</span>1.png<span class="o">)</span>%2 -> <span class="nb">hash</span><span class="o">(</span>1.png<span class="o">)</span>%3 <span class="o">=</span> ?
</code></pre></div></div>
<h2 id="consistent-hashing">Consistent Hashing</h2>
<p>Consistent hashing still uses the modulo method, but instead of taking the modulus of the server count, it takes the modulus of <code class="language-plaintext highlighter-rouge">2^32</code>, treating the entire hash space as a <code class="language-plaintext highlighter-rouge">clock-wise circle</code> starting from node <code class="language-plaintext highlighter-rouge">0</code>.</p>
<ul>
<li>First, hash each server (using its IP, server name, …);</li>
<li>For each file, use the same hash function to hash its key onto the circle; walking clockwise, the first server encountered is its designated server.</li>
</ul>
<p><img src="/assets/images/post/redis/consistent-hashing.png" alt="Consistent Hashing" /><em>Consistent Hashing in Redis [Source: Redis Labs][i2]</em></p>
<h5 id="fault-tolerance">Fault Tolerance</h5>
<p>If a node goes down or more nodes are added, we only need to update a small portion of the file mappings while the majority stay untouched.</p>
<h5 id="weighted-hosts">Weighted Hosts</h5>
<p>This happens when one server receives more (or less) load than the rest. A possible cause is <strong>unevenly distributed nodes</strong> (or too few nodes).</p>
<p>For this situation, we can adopt <strong>virtual nodes</strong> that still map back to the original node, like “Node A#1”, “Node A#2”, “Node A#3”. In practice, it is common to set the number of virtual nodes to 32 or more, so that an even distribution is guaranteed even with few real nodes.</p>
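Putting the pieces together, here is a minimal consistent-hash ring with virtual nodes in Go (a sketch: FNV is used as the hash function for illustration, and collisions between ring points are ignored):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring with virtual nodes
// ("Node A#0", "Node A#1", ...), as described above.
type Ring struct {
	points []uint32          // sorted hash positions on the circle
	owner  map[uint32]string // hash position -> real node name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, node := range nodes {
		for i := 0; i < vnodes; i++ {
			// Each virtual node maps back to its real node.
			p := hash32(fmt.Sprintf("%s#%d", node, i))
			r.points = append(r.points, p)
			r.owner[p] = node
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise from the key's hash to the first node point.
func (r *Ring) Lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"Node A", "Node B", "Node C"}, 32)
	fmt.Println(ring.Lookup("1.png"))
}
```

Removing a node only reassigns the keys that landed on its virtual points; all other keys keep their server.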
<h3 id="reference">Reference</h3>
<ul>
<li><a href="https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8">Consistent Hashing: Algorithmic Tradeoffs</a></li>
<li><a href="https://zhuanlan.zhihu.com/p/34985026">什么是一致性Hash算法?</a></li>
</ul>DanniWhat is Hashing?Intellij IDEA for Spark w/ Scala examples2019-07-30T00:00:00+00:002019-07-30T00:00:00+00:00https://isdanni.com/scala1<blockquote>
<p>“And why I don’t use Eclipse for Spark”</p>
</blockquote>
<h2 id="why-i-dont-use-eclipse-for-spark">Why I don’t use Eclipse for Spark?</h2>
<p>I tried Eclipse, Atom, Sublime and even Emacs before settling on IntelliJ. The reason I finally went back to IntelliJ is the same as for most other Scala developers – a more stable IDE with more features.</p>
<p>Since the Scala IDE team also showed interest to move to VS Code back in 2017 and started a few new projects on GitHub, there’s really no use to stick to Eclipse when it’s already not the top choice from Scala’s own team.</p>
<p>And as I quote from <a href="https://qr.ae/TWvYsa">this Quora user</a> here:</p>
<blockquote>
<p>Having tried Eclipse on and off, and sticking IntelliJ for a while, its a tradeoff between being <strong>less useful</strong> but <strong>more responsive/performant</strong> (Eclipse) vs less responsive/performance but more useful (IntelliJ).</p>
</blockquote>
<p>The <strong>auto-completion</strong> & <strong>refactoring</strong> features in IntelliJ work really well for Java, but they become more of an issue with Scala. The type system in Scala is very complicated, so sometimes they cause more trouble than they ease the burden (e.g. incorrect highlighting, library importing…).</p>
<p>However, compared to the less <strong>rich features</strong> Eclipse provides, I’m more than happy to stick to IntelliJ rather than going back to Eclipse, especially since it already gave me so many painful memories during some Java web projects & I have a license for the IntelliJ Ultimate version ; )</p>
<p><img src="/assets/images/post/scala/new-project.png" alt="New Scala project in IntelliJ" /></p>
<h2 id="set-up-dev-environment-in-intellij-for-scala">Set up DEV environment in IntelliJ for Scala</h2>
<h6 id="in-2019">In 2019</h6>
<ol>
<li><strong>Config</strong></li>
</ol>
<p>I directly followed this <a href="https://www.jetbrains.com/help/idea/run-debug-and-test-scala.html">guide</a> from JetBrains, but it’s worthwhile to check this <a href="http://www.itversity.com/2018/04/19/setup-development-environment-big-data-hadoop-and-spark/">post</a> from itversity (2018) as well. It has more thorough guides.</p>
<ol start="2">
<li><strong>Running</strong></li>
</ol>
<ul>
<li>Local running: just go to “Run” -> “Run Configurations”;</li>
<li>Running on a Spark cluster: pack the program as a Jar and use the shell. Select “File” –> “Project Structure” –> “Artifact”, then select “+” –> “Jar” –> “From Modules with dependencies”, choose the <code class="language-plaintext highlighter-rouge">main</code> function, and select the jar location in the pop-up. Finally, choose “Build” –> “Build Artifact” to compile the jar.</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">./bin/spark-shell --master <master-url></code></p>
<p>If we use local mode in Spark commands and run it on 4 CPU cores, the command will simply become <code class="language-plaintext highlighter-rouge">./bin/spark-shell --master local[4]</code>.</p>
<p>And for convenience, it’s better to configure the system path:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vi /etc/profile
<span class="c"># add following to the end of the file</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH</span>:/usr/local/spark-[version]-bin-hadoop[version]/bin
<span class="c"># activate the change</span>
<span class="nb">source</span> /etc/profile
</code></pre></div></div>
<h2 id="scala-code-examples">Scala Code examples</h2>
<h4 id="word-count">Word Count</h4>
<p>4 parameters: Spark master location, program name, Spark installation directory and Jar location.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">org.apache.spark._</span>
<span class="k">import</span> <span class="nn">SparkContext._</span>
<span class="k">val</span> <span class="nv">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span>
<span class="nf">args</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="s">"WordCount"</span><span class="o">,</span>
<span class="nv">System</span><span class="o">.</span><span class="py">getenv</span><span class="o">(</span><span class="s">"SPARK_HOME"</span><span class="o">),</span>
<span class="nc">Seq</span><span class="o">(</span><span class="nv">System</span><span class="o">.</span><span class="py">getenv</span><span class="o">(</span><span class="s">"SPARK_TEST_JAR"</span><span class="o">))</span>
<span class="o">)</span>
<span class="c1">// read in file</span>
<span class="k">val</span> <span class="nv">textFile</span> <span class="k">=</span> <span class="nv">sc</span><span class="o">.</span><span class="py">textFile</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">1</span><span class="o">))</span>
<span class="c1">// directly create a Hadoop RDD Object</span>
<span class="k">var</span> <span class="n">hadoopRdd</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">HadoopRDD</span><span class="o">(</span>
<span class="n">sc</span><span class="o">,</span>
<span class="n">conf</span><span class="o">,</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">SequenceFileInputFormat</span><span class="o">[</span><span class="kt">Text</span><span class="o">,</span> <span class="kt">Text</span><span class="o">]],</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">Text</span><span class="o">],</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">Text</span><span class="o">],</span>
<span class="mi">1</span>
<span class="o">)</span>
<span class="c1">// first get the words from the input & put the same word in one bucket, then count the frequencies.</span>
<span class="k">val</span> <span class="nv">result</span> <span class="k">=</span> <span class="nv">hadoopRdd</span><span class="o">.</span><span class="py">flatMap</span><span class="o">{</span>
<span class="nf">case</span> <span class="o">(</span><span class="n">key</span><span class="o">,</span><span class="n">value</span><span class="o">)</span> <span class="k">=></span> <span class="nv">value</span><span class="o">.</span><span class="py">toString</span><span class="o">().</span><span class="py">split</span><span class="o">(</span><span class="s">"\\s+"</span><span class="o">);</span>
<span class="o">}.</span><span class="py">map</span><span class="o">(</span>
<span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">)).</span><span class="py">reduceByKey</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="k">_</span><span class="o">)</span>
<span class="nv">result</span><span class="o">.</span><span class="py">saveAsSequenceFile</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">2</span><span class="o">))</span>
</code></pre></div></div>
<h4 id="top-k">Top K</h4>
<p>The Top K task has many solutions, either algorithmic or big-data based. Here in Spark, we simply follow the above program and find the top K words.</p>
<p>A lot of tech blogs tend to use the <code class="language-plaintext highlighter-rouge">top</code> method from the Spark API, but we can also do it the algorithmic way, using a <code class="language-plaintext highlighter-rouge">heap</code> to get the answer.</p>
<p>Here’s the common way:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">org.apache.spark.</span><span class="o">{</span><span class="nc">SparkConf</span><span class="o">,</span> <span class="nc">SparkContext</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span>
<span class="k">object</span> <span class="nc">TopK</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span> <span class="o">{</span>
<span class="nf">if</span> <span class="o">(</span><span class="nv">args</span><span class="o">.</span><span class="py">length</span> <span class="o">!=</span> <span class="mi">2</span><span class="o">)</span> <span class="o">{</span>
<span class="nv">System</span><span class="o">.</span><span class="py">out</span><span class="o">.</span><span class="py">println</span><span class="o">(</span><span class="s">"Usage: <src> <num>"</span><span class="o">)</span>
<span class="nv">System</span><span class="o">.</span><span class="py">exit</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">val</span> <span class="nv">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">().</span><span class="py">setAppName</span><span class="o">(</span><span class="s">"TopK"</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">lines</span> <span class="k">=</span> <span class="nv">sc</span><span class="o">.</span><span class="py">textFile</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">0</span><span class="o">))</span>
<span class="k">val</span> <span class="nv">ones</span> <span class="k">=</span> <span class="nv">lines</span><span class="o">.</span><span class="py">flatMap</span><span class="o">(</span><span class="nv">_</span><span class="o">.</span><span class="py">split</span><span class="o">(</span><span class="s">" "</span><span class="o">)).</span><span class="py">map</span><span class="o">(</span><span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">))</span>
<span class="k">val</span> <span class="nv">count</span> <span class="k">=</span> <span class="nv">ones</span><span class="o">.</span><span class="py">reduceByKey</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">convert</span> <span class="k">=</span> <span class="nv">count</span><span class="o">.</span><span class="py">map</span> <span class="o">{</span>
<span class="nf">case</span> <span class="o">(</span><span class="n">key</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">value</span><span class="o">,</span> <span class="n">key</span><span class="o">)</span>
<span class="o">}.</span><span class="py">sortByKey</span><span class="o">(</span><span class="kc">true</span><span class="o">,</span> <span class="mi">1</span><span class="o">)</span>
<span class="nv">convert</span><span class="o">.</span><span class="py">top</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="py">toInt</span><span class="o">).</span><span class="py">foreach</span><span class="o">(</span><span class="n">a</span> <span class="k">=></span> <span class="nv">System</span><span class="o">.</span><span class="py">out</span><span class="o">.</span><span class="py">println</span><span class="o">(</span><span class="s">"("</span> <span class="o">+</span> <span class="nv">a</span><span class="o">.</span><span class="py">_2</span> <span class="o">+</span> <span class="s">","</span> <span class="o">+</span> <span class="nv">a</span><span class="o">.</span><span class="py">_1</span> <span class="o">+</span> <span class="s">")"</span><span class="o">))</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Here’s the Heap method, taken from <a href="https://stackoverflow.com/questions/5674741/simplest-way-to-get-the-top-n-elements-of-a-scala-iterable">StackOverflow</a>.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">pickTopN</span><span class="o">[</span><span class="kt">A</span>, <span class="kt">B</span><span class="o">](</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">iterable</span><span class="k">:</span> <span class="kt">Iterable</span><span class="o">[</span><span class="kt">A</span><span class="o">],</span> <span class="n">f</span><span class="k">:</span> <span class="kt">A</span> <span class="o">=></span> <span class="n">B</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">B</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">seq</span> <span class="k">=</span> <span class="nv">iterable</span><span class="o">.</span><span class="py">toSeq</span>
<span class="k">val</span> <span class="nv">q</span> <span class="k">=</span> <span class="nv">collection</span><span class="o">.</span><span class="py">mutable</span><span class="o">.</span><span class="py">PriorityQueue</span><span class="o">[</span><span class="kt">A</span><span class="o">](</span><span class="nv">seq</span><span class="o">.</span><span class="py">take</span><span class="o">(</span><span class="n">n</span><span class="o">)</span><span class="k">:_</span><span class="kt">*</span><span class="o">)(</span><span class="nv">ord</span><span class="o">.</span><span class="py">on</span><span class="o">(</span><span class="n">f</span><span class="o">).</span><span class="py">reverse</span><span class="o">)</span> <span class="c1">// initialize with first n</span>
<span class="c1">// invariant: keep the top k scanned so far</span>
<span class="nv">seq</span><span class="o">.</span><span class="py">drop</span><span class="o">(</span><span class="n">n</span><span class="o">).</span><span class="py">foreach</span><span class="o">(</span><span class="n">v</span> <span class="k">=></span> <span class="o">{</span>
<span class="n">q</span> <span class="o">+=</span> <span class="n">v</span>
<span class="nv">q</span><span class="o">.</span><span class="py">dequeue</span><span class="o">()</span>
<span class="o">})</span>
<span class="nv">q</span><span class="o">.</span><span class="py">dequeueAll</span><span class="o">.</span><span class="py">reverse</span>
<span class="o">}</span>
</code></pre></div></div>Danni“And why I don’t use Eclipse for Spark”