ThinkPad X1 Extreme Gen 3: Dual Boot Pop!_OS 20.04 LTS (Nvidia version) with Windows 10 (2020-10-25, https://isdanni.com/thinkpad-x1-extreme-pop-os)
<p>I have been heavily using Ubuntu for the past few years, both for work and personal projects, and have become pretty comfortable with it. So when it came to purchasing my next laptop, I had a hard time choosing the “perfect” Linux distro and the proper specs for its installation.</p>
<p>There are too many reviews of the “best Linux distros” each year (or even each month) for “beginners”, “mainstream users”, and “professionals”, and it’s just impossible to stick to one when they are constantly updating and making releases with tweaks here and there. Initially I installed Ubuntu 18.04 LTS on my ThinkPad T series because it simply made all the college homework much easier, which was a nice relief for a Linux newbie when the majority of the class were using MBPs or even more “advanced” Linux options like Arch (debatable here :)). I went with dual booting because gaming and design software have better support on Windows, and it’s nice to keep one’s options open.</p>
<p>So after some research I landed on <a href="https://system76.com/pop">Pop!_OS by System76</a>. It is based on Ubuntu, but is NOT just a re-skinned Ubuntu, so it was quick and easy for me to set up and use all the tools I’m familiar with (they actually wrote an article addressing this, since it seems to be asked quite frequently; you can check it out here: <a href="https://support.system76.com/articles/difference-between-pop-ubuntu/">Pop!_OS and Ubuntu: What’s the difference?</a>). Another huge reason is that they specifically provide an Nvidia ISO, so it’s perfect for switching graphics and working with the Nvidia driver on the ThinkPad X1 Extreme.</p>
<p><strong>NOTE</strong>: This article describes only my personal experience and should not be used as the official guideline for Pop!_OS installation.</p>
<blockquote>
<p>For more support, please check the <a href="https://support.system76.com/">official System76 website</a>.</p>
</blockquote>
<p>My ThinkPad X1 Extreme Gen 3 configuration:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Processor: Intel® Core™ i7-10850H CPU @ 2.70GHz × 12
Graphics: NVIDIA Corporation / GeForce GTX 1650 Ti with Max-Q Design/PCIe/SSE2
Screen: 4K
Memory: 32GB
Disk: 1TB
</code></pre></div></div>
<h1 id="1-create-bootable-usb-stick">1. Create bootable USB stick</h1>
<p>Download the Pop!_OS ISO image (Nvidia version) from the System76 website. It’s a disk image containing the OS and the installer.</p>
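<p>Before writing the image, it’s worth verifying the download. Assuming the download page publishes a SHA256 checksum next to the ISO (check the site for the exact value; the filename below is only an illustration), a few lines of Python can compute the file’s digest:</p>

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum published on the download page.
# The filename here is hypothetical; use whatever you actually downloaded.
# print(sha256_of("pop-os_20.04_amd64_nvidia.iso"))
```

<p>If the digest doesn’t match the published value, re-download the ISO before flashing it.</p>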
<p><img src="/assets/images/post/popos/popos_download.jpg" alt="popos_download" /></p>
<p>If you are not sure which Pop!_OS version to download, check the Display adapter in Device Manager. If there is “Nvidia” in the graphics, go with the Nvidia version.</p>
<p><img src="/assets/images/post/popos/device_manager.jpg" alt="device_manager" /></p>
<p>To write the image to the flash drive, we can use <a href="https://www.balena.io/etcher/">Etcher</a> on Windows (there are several alternatives as well).</p>
<p><img src="/assets/images/post/popos/etcher1.jpg" alt="etcher1" /></p>
<p><img src="/assets/images/post/popos/etcher2.jpg" alt="etcher2" /></p>
<p>It should take a few minutes to complete. After the write is finished, close the Etcher window.</p>
<p><img src="/assets/images/post/popos/etcher4.jpg" alt="etcher4" /></p>
<h1 id="2-disable-secure-boot">2. Disable secure boot</h1>
<p>Restart the computer, and when the red Lenovo sign shows up on the screen, press <strong>F12</strong> to enter the boot menu (the key <strong>varies</strong> by laptop, so make sure you press the correct one). In the <strong>Setup/Security</strong> menu, disable <strong>Secure Boot</strong>.</p>
<p><img src="/assets/images/post/popos/secure_boot1.jpg" alt="secure_boot1" /></p>
<p><img src="/assets/images/post/popos/secure_boot2.jpg" alt="secure_boot2" /></p>
<h1 id="3-make-partitions-for-pop_os-in-the-disk-space">3. Make partitions for Pop!_OS in the disk space</h1>
<p>Open <strong>Disk Management</strong> in Windows and right-click the NTFS partition (as you can see, mine still has around 900GB of free space; Disk 1 is the USB stick I wrote the Pop!_OS ISO to) to <strong>Shrink Volume</strong>. Personally, I split the space in half: 500GB for Windows 10 and 500GB for Pop!_OS. But whatever you do, make sure to back up your data before the operation and be very careful when changing disk partitions.</p>
<p><img src="/assets/images/post/popos/disk_management.jpg" alt="disk_management" /></p>
<p>After the operation, Disk 0 looks like this:</p>
<p><img src="/assets/images/post/popos/unallocated.jpg" alt="unallocated" /></p>
<h1 id="4-boot-from-live-usb">4. Boot from live USB</h1>
<p>Now we can start the installation. Restart the computer again, press <strong>F12</strong> when the red sign shows up, and choose the USB stick. There should be a short period of an ugly black screen with white output like this:</p>
<p><img src="/assets/images/post/popos/black.jpg" alt="black" /></p>
<p>Then the Pop!_OS screen will show up, which means we have entered the live environment.</p>
<p><img src="/assets/images/post/popos/popos.jpg" alt="popos.jpg" /></p>
<h2 id="41-gparted">4.1. GParted</h2>
<p>After the keyboard, timezone, language, and other simple configuration steps, we can choose to partition the space manually:</p>
<p><img src="/assets/images/post/popos/space.png" alt="space.png" /></p>
<h3 id="411-swap">4.1.1. Swap</h3>
<p>As shown at the bottom of the GParted menu, selecting a <strong>swap</strong> space is <strong>OPTIONAL</strong>. It used to be recommended to have double the RAM size, but for modern computers, especially those with large amounts of RAM (up to 128 GB), this rule no longer applies. Since my RAM is 32GB, I went with <strong>6GB</strong> of swap. The file system is <strong>linux-swap</strong>.</p>
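<p>The rule-of-thumb arithmetic can be sketched in a few lines. This follows one commonly cited heuristic (swap roughly equals the square root of RAM without hibernation, or RAM plus that square root with hibernation, so the whole RAM image fits on disk); it is just one heuristic among many, not an official recommendation:</p>

```python
import math

def suggested_swap_gb(ram_gb, hibernate=False):
    """One common heuristic (not an official recommendation):
    sqrt(RAM) without hibernation; RAM + sqrt(RAM) with hibernation,
    so the full RAM image can be written to disk on suspend."""
    root = math.sqrt(ram_gb)
    return round(ram_gb + root) if hibernate else max(1, round(root))

print(suggested_swap_gb(32))                  # 6  (matches the 6GB I chose)
print(suggested_swap_gb(32, hibernate=True))  # 38
```

<p>If you plan to hibernate, size swap for the hibernation case; otherwise the smaller value is usually plenty on machines with lots of RAM.</p>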
<blockquote>
<p>Here is an interesting article discussing different options: <a href="https://itsfoss.com/swap-size/">How Much Swap Should You Use in Linux?</a></p>
</blockquote>
<p><img src="/assets/images/post/popos/swap_space.png" alt="swap.png" /></p>
<p><img src="/assets/images/post/popos/swap.png" alt="swap.png" /></p>
<h3 id="412-boot">4.1.2. Boot</h3>
<p>I went with <strong>1GB</strong> for <strong>/boot</strong> just to be safe; usually (at least) 512MB should be enough. The file system is <strong>fat32</strong>.</p>
<p><img src="/assets/images/post/popos/boot.jpg" alt="boot" /></p>
<p><img src="/assets/images/post/popos/boot_space.png" alt="boot space" /></p>
<h3 id="413-root">4.1.3. Root</h3>
<p>Then we can allocate the rest of the space to the root partition (<strong>/</strong>). The file system is <strong>ext4</strong>.</p>
<p><img src="/assets/images/post/popos/root.png" alt="root" /></p>
<p><img src="/assets/images/post/popos/root_space.png" alt="root_space" /></p>
<p>Make sure every partition is correct, then select <strong>Erase and Install</strong>.</p>
<p><img src="/assets/images/post/popos/erase_install.png" alt="erase_install.png" /></p>
<h2 id="42-install-and-restart">4.2. Install and RESTART!</h2>
<p>The installation should then take two or three minutes. After it finishes, a prompt window will ask you to restart the device.</p>
<p><img src="/assets/images/post/popos/installing.png" alt="installing" /></p>
<p><img src="/assets/images/post/popos/restart.png" alt="restart" /></p>
<h2 id="switching-between-windows-and-pop_os">Switching between Windows and Pop!_OS</h2>
<p>If everything is done correctly, the USB stick can be removed and each restart will boot into Pop!_OS directly. If you wish to use Windows 10, again press <code class="language-plaintext highlighter-rouge">F12</code> during restart when the red Lenovo sign shows up and choose the Windows Boot Manager entry.</p>
<h1 id="resources">Resources</h1>
<ul>
<li><a href="https://support.system76.com/articles/live-disk/">Create and Use Bootable Media from Other OS’s</a></li>
<li><a href="http://www.glowseed.com/mindmash/?p=643">DUAL BOOTING POP! OS ON A THINKPAD X1 EXTREME</a></li>
<li><a href="https://deepak.puthraya.com/2019/10/10/popos-thinkpad-x1-extreme">Setting up Pop!_OS</a></li>
<li><a href="https://www.ultrabookreview.com/33225-decided-to-switch-to-linux-day-1/">So I’ve decided to switch to Linux: Day 1</a></li>
<li><a href="https://techhut.tv/dual-boot-windows-10-pop-os/">How to Dual Boot Windows 10 and Pop!_OS</a></li>
<li><a href="https://www.youtube.com/watch?v=XGa-HHYPF2s">Dualboot Pop! OS 20.04 linux and Windows 10(2020)</a></li>
<li><a href="https://www.youtube.com/watch?v=CozK7sJ8UMs">Pop!_OS 19.10 - Setting up a Dual Boot with Windows 10</a></li>
<li><a href="https://www.youtube.com/watch?v=EXZ7_DVxztQ&t=139s">How to Dual Boot Pop!_OS 20.04 LTS and Windows 10</a></li>
</ul>

Grokking DDIA: Dig deeper than buzzwords [2020 UPDATE] (2020-06-06, https://isdanni.com/ddia)
<blockquote>
<p>Understand data & build reliable, scalable, and maintainable applications</p>
</blockquote>
<p><strong><em>Designing Data-Intensive Applications (DDIA)</em></strong> has been praised by many people in the industry as a great book that bridges the knowledge gap between the theory of data systems and practical engineering. Mastering the tradeoffs of each technology and applying them to solve real-world problems takes huge effort and constant learning, so I’m writing this post for my reading notes and some takeaways ;-)</p>
<h1 id="part-1-foundations-of-data-systems">Part 1. Foundations of Data Systems</h1>
<h2 id="reliable-scalable-and-maintainable-applications">Reliable, Scalable, and Maintainable Applications.</h2>
<ul>
<li><strong>compute-intensive</strong>: limited by CPU power;</li>
<li><strong>data-intensive</strong>: limited by the volume, complexity, and rate of change of the data;</li>
</ul>
<p><strong>Data-intensive applications</strong> usually consist of the following building blocks:</p>
<ul>
<li><strong>storage</strong>: databases;</li>
<li><strong>caches</strong>: remember results of expensive operations to save cost;</li>
<li><strong>search indexes</strong>: allow searching using keywords or other filtering methods;</li>
<li><strong>stream processing</strong>: Send a message to another process, to be handled asynchronously;</li>
<li><strong>batch processing</strong>: Periodically crunch a large amount of accumulated data;</li>
</ul>
<p><strong>Why do we talk about data systems in general?</strong></p>
<ol>
<li>The boundaries between the categories are becoming blurred. For example, Redis is a datastore that can also be used as a message queue, and Apache Kafka is a message queue with database-like durability guarantees.</li>
<li>applications require a range of functionality that no single tool can meet, so people usually break tasks down into smaller pieces, solve each one with a single tool, and then glue the pieces back together with application code.
<blockquote>
<p>For example, it’s usually the application code’s job to keep all search indexes and caches in sync with the primary database. Once development is finished, combining several tools for one specific task is not just application development, but also data system design.<br /><img src="/assets/images/post/dintensive/appcode.png" alt="application code sync all with main db" /></p>
</blockquote>
</li>
</ol>
<h4 id="reliability">Reliability</h4>
<blockquote>
<p>The system can continue to work <em>correctly</em> even in the face of adversity.<br /><strong>Fault-tolerant</strong> / <strong>resilient</strong>: able to anticipate faults and cope with them (certain types, of course).</p>
</blockquote>
<p>Many critical bugs are actually due to <strong>poor error handling</strong>. To increase the rate of faults, we can <u>trigger them deliberately</u> (e.g., randomly killing individual processes without warning) to ensure that the fault-tolerance machinery is continually exercised and tested. In general, we prefer tolerating faults over preventing them.</p>
<blockquote>
<p><strong>Netflix</strong>: <em>Chaos Monkey</em>, a resiliency tool that helps applications tolerate random instance failures by randomly terminating VM instances and containers inside your production environment, exposing engineers to failures more frequently and incentivizing them to build resilient services.</p>
</blockquote>
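<p>The idea of deliberately triggering faults can be sketched as a tiny fault-injection wrapper (a toy illustration of the principle, not how Chaos Monkey itself works): wrap a function so it randomly fails, then make sure the calling code survives it.</p>

```python
import random

def flaky(fn, failure_rate=0.2, rng=random.random):
    """Wrap fn so it randomly raises, exercising callers' error handling."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise RuntimeError("injected fault")
        return fn(*args, **kwargs)
    return wrapper

def call_with_retry(fn, attempts=5):
    """A caller that must tolerate the injected faults to pass its tests."""
    last_err = None
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError as err:
            last_err = err
    raise last_err
```

<p>Running the wrapped function in tests forces the retry path to be exercised regularly, instead of only during real outages.</p>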
<table>
<thead>
<tr>
<th>Faults</th>
<th>Solutions</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>hardware faults</strong></td>
<td><em>Hard disks crash, faulty RAM, power grid blackouts, the wrong network cable unplugged</em><br /><br /> 1. <strong>add redundancy</strong> to individual hardware components in order to reduce the failure rate. <em>Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power.</em> <br />2. apply system patches as rolling upgrades, so that only one node is down at a time</td>
</tr>
<tr>
<td><strong>software errors</strong></td>
<td><em>systematic errors within the system; harder to predict, and they cause more trouble since they are correlated across nodes</em> <br /><br /> no quick solutions; requires careful thinking, testing, measuring, monitoring, and analysis</td>
</tr>
<tr>
<td><strong>human errors</strong></td>
<td><em>humans tend to be unreliable even when intentions are good; a leading cause of internet outages</em><br /><br />1. <strong>Design well</strong>: abstractions, APIs, admin interfaces… <br />2. <strong>Decouple the most error-prone places</strong>: provide sandbox environments for exploring and experimenting safely with real data.<br />3. <strong>Test at all levels</strong>: unit tests -> whole-system integration tests -> automated tests<br />4. <strong>Enable quick recovery</strong>: fast rollback of configuration changes, gradual rollout of new code, tools to recompute data.<br />5. <strong>Set up monitoring</strong>: performance metrics and error rates (<em>telemetry</em>)<br />6. <strong>Implement training & management</strong></td>
</tr>
</tbody>
</table>
<h4 id="scalability">Scalability</h4>
<p>Scalability is the term we use to describe a system’s ability to cope with <strong>increased load</strong> – System can deal with the growth in data volume, traffic volume or complexity.</p>
<p><strong>Describe Current Load</strong></p>
<p>There are some <strong>load parameters</strong> and the best choice of the parameters depend on the system architecture: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.</p>
<blockquote>
<p>One example from <strong>Twitter</strong>:<br />the <strong>bottleneck for scalability is not the tweet volume but the fan-out</strong> (a term borrowed from electronic engineering, where it describes the number of logic gate inputs attached to another gate’s output; in transaction processing systems, we use it to describe <u>the number of requests to other services that we need to make in order to serve one incoming request</u>). <br /><img src="/assets/images/post/dintensive/t1.png" alt="twitter case" /><br />As shown, it’s <strong>better to do more work at write time and less at read time</strong>, but this also means posting a tweet requires extra work, due to the writes to caches, and the distribution of followers per user becomes a key load parameter when discussing scalability.<br />Twitter is now moving to a <strong>hybrid approach combining both</strong>: most users’ tweets continue to be fanned out to home timelines at the time they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user follows are fetched separately and merged with that user’s home timeline when it is read, as in approach 1. This hybrid approach delivers consistently good performance.</p>
</blockquote>
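<p>The fan-out-on-write approach from the Twitter example can be sketched in a few lines (toy data structures, not Twitter’s actual implementation): posting does O(followers) work up front so that reading a home timeline is a cheap cache lookup.</p>

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"]}  # who follows whom (toy data)
timelines = defaultdict(list)            # per-user home timeline cache

def post_tweet(user, text):
    """Fan out on write: append the tweet to every follower's cached timeline."""
    for follower in followers.get(user, []):
        timelines[follower].append((user, text))

def home_timeline(user):
    """The read side is just the precomputed cache: no joins at read time."""
    return timelines[user]

post_tweet("alice", "hello")
print(home_timeline("bob"))  # [('alice', 'hello')]
```

<p>The celebrity problem is also visible here: a user with millions of followers turns one post into millions of appends, which is exactly why the hybrid approach exempts them from the fan-out.</p>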
<p><strong>1. Describe Performance</strong></p>
<ul>
<li>
<p><strong>Throughput</strong>: the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. Usually used for batch processing systems.</p>
</li>
<li>
<p><strong>Latency & Response Time</strong>: Response time is the time between a client sending a request and receiving a response <em>(includes network delays and queueing delays)</em>. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service. Usually for online systems.</p>
</li>
</ul>
<p>Since the response time varies from request to request, we need to think of it as a <strong>distribution</strong> of values, and it’s usually better to use <strong>percentiles</strong>. For example, if we sort the list of response times from fastest to slowest, the median is a good metric for how long users typically have to wait.</p>
<p>The same goes for checking <strong>outliers</strong>: we use much higher percentiles (usually <em>p95, p99, and p999</em>, meaning 95%, 99%, or 99.9% of requests are faster than that threshold). These <strong>high percentiles of response times</strong>, or <strong>tail latencies</strong>, are important because they directly affect the <em>user experience</em> of the service: the clients experiencing the worst response times are often those with the most data or purchases, i.e., the most valuable customers. However, optimizing very high percentiles (e.g., the 99.99th percentile) is expensive and may not yield enough benefit to be worthwhile for the service provider.</p>
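<p>A nearest-rank percentile is easy to compute from the sorted response times; here is a minimal sketch (simplified, ignoring the interpolation schemes that monitoring systems often use):</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample that is >= p percent
    of all samples. p is an integer percentage, e.g. 95 for p95."""
    s = sorted(samples)
    k = -(-p * len(s) // 100) - 1        # ceil(p/100 * n) - 1
    return s[max(0, min(len(s) - 1, k))]

times_ms = [12, 13, 13, 14, 14, 15, 16, 18, 200, 1500]
print(percentile(times_ms, 50))  # 14   (typical user)
print(percentile(times_ms, 95))  # 1500 (tail latency)
```

<p>Note how a single extreme outlier dominates p95 here while leaving the median untouched, which is exactly why averages hide tail latency.</p>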
<blockquote>
<p><u>Service Level Objectives (SLOs)</u> and <u>Service Level Agreements (SLAs)</u> often use <strong>percentiles</strong> to define the expected performance and availability of a service.<br />For example, customers can demand a refund if SLAs are not met.</p>
</blockquote>
<p><strong>Queueing delays</strong> often account for a large part of the response time at high percentiles, since a server can only process a small number of things in parallel (limited, for example, by its CPU cores). <strong>Head-of-line blocking</strong> refers to the situation where a few slow requests hold up the processing of subsequent requests, even though those later requests would be fast to process on their own. This is why it’s important to measure response times on the <strong>client side</strong>.</p>
<blockquote>
<p><strong>Tail Latency Amplification</strong>: Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple back‐end calls, and so a higher proportion of end-user requests end up being slow.<br /><img src="/assets/images/post/dintensive/slow.png" alt="slow requests" /></p>
</blockquote>
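<p>The amplification effect is simple probability: if each backend call is slow with independent probability <em>p</em>, a request that fans out to <em>n</em> backends is slow with probability 1 - (1 - p)^n:</p>

```python
def p_slow_request(p_slow_backend, n_calls):
    """Chance that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_backend) ** n_calls

# Only 1% of backend calls are slow, yet a page touching 100 backends
# is slow for roughly 63% of user requests:
print(round(p_slow_request(0.01, 1), 3))    # 0.01
print(round(p_slow_request(0.01, 100), 3))  # 0.634
```

<p>This is why trimming the tail of backend latency matters far more for fan-out heavy services than improving the average.</p>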
<p><strong>2. Approaches for Loading</strong></p>
<p><strong>Scale up / vertical Scaling</strong>: moving to more powerful machines.</p>
<p><strong>Scale out / Horizontal Scaling</strong>: distributing the load
across multiple smaller machines. Also known as <u>shared-nothing</u> architecture.</p>
<p>In real life, some systems are <em>elastic</em>, meaning they can automatically add computing resources when they detect a load increase (useful if load is highly unpredictable), whereas others are <em>scaled manually</em>.</p>
<blockquote>
<p>So far, there’s no <strong>magic scaling sauce</strong>: a generic, one-size-fits-all scalable architecture. Systems at this scale are usually designed for a specific application, and the problems involve the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or usually some mixture of all of these plus other issues.</p>
</blockquote>
<h4 id="maintainability">Maintainability</h4>
<blockquote>
<p>Many different people (engineering & operations) should be able both to maintain current behavior and to adapt the system to new use cases, and they should be able to work productively.<br /><img src="/assets/images/post/dintensive/legacy.jpeg" alt="legacy" /><br /><strong>FANTASTIC Legacy Code</strong>: every legacy system is unpleasant in its own way.</p>
</blockquote>
<p>The majority of the cost of software is not initial development but <strong>maintenance</strong>: fixing bugs, keeping systems operational, investigating failures, and more. It’s hard to give general recommendations for dealing with all legacy systems, but we can follow these principles to avoid trouble as much as we can:</p>
<table>
<thead>
<tr>
<th>principles</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Operability</strong></td>
<td>Good data systems should do the following to make the operations team’s life easier:<br />Providing visibility into the runtime behavior and internals of the system, with good monitoring<br />Providing good support for automation and integration with standard tools<br />Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)<br />Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)<br />Providing good default behavior, but also giving administrators the freedom to override defaults when needed<br />Self-healing where appropriate, but also giving administrators manual control over the system state when needed<br />Exhibiting predictable behavior, minimizing surprises</td>
</tr>
<tr>
<td><strong>Simplicity</strong></td>
<td>able to manage complexity. <br />Abstraction: one of the best ways to remove accidental complexity (<em>e.g., SQL is an abstraction that hides complex on-disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes</em>; a high-level language hides machine code, CPU registers, and syscalls)<br />Symptoms of complexity include:<br />explosion of the state space<br />tight coupling of modules<br />tangled dependencies<br />inconsistent naming and terminology<br />hacks for performance<br />more…</td>
</tr>
<tr>
<td><strong>Evolvability</strong></td>
<td>also known as Extensibility/Modifiability/Plasticity: <u>Agile</u> working patterns</td>
</tr>
</tbody>
</table>
<blockquote>
<p><img src="/assets/images/post/dintensive/db.png" alt="db" /><small class="img-hint">Map of Dbs</small></p>
</blockquote>
<p>This section covers a range of data models for data storage and querying.</p>
<h2 id="data-models--query-language">Data Models & Query Language</h2>
<h4 id="relational-vs-document">Relational VS Document</h4>
<blockquote>
<p>The best-known data model today is probably SQL, based on the relational model proposed by Edgar Codd in 1970: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).</p>
</blockquote>
<table>
<thead>
<tr>
<th>Model types</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Relational Model</strong></td>
<td>the roots of RDBMSs lie in <em>business data processing</em><br /> on mainframe computers in the 1960s and 70s: typically transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).</td>
</tr>
<tr>
<td><strong>Document Model</strong></td>
<td><strong>NoSQL</strong> arose in the 2010s from the need for<br />1. <u>greater scalability</u> than relational databases, including very high write throughput<br />2. a widespread preference for <u>free and open source software</u> over commercial database products<br />3. specialized query operations that are not well supported by the relational model<br />4. frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model</td>
</tr>
</tbody>
</table>
<h4 id="query-for-data">Query for Data</h4>
<h4 id="graph-like-data-models">Graph-Like Data Models</h4>
<h2 id="storage--retrieval">Storage & Retrieval</h2>
<h2 id="part-2-distributed-data">Part 2. Distributed Data</h2>
<h2 id="part-3-derived-data">Part 3. Derived Data</h2>
<h2 id="blah-blah-blah">Blah Blah Blah</h2>
<blockquote>
<p><strong>Origin</strong>: <br />Told a senior engineer at a famous e-commerce company that I use Hadoop a lot, and got crushed by a few casual questions<br />Told a senior engineer at a famous short-video app that I once built a full-stack website by myself, and got crushed by a few casual questions<br />Told a local reporter that I run fast, <s>and then..</s> sorry, off topic…<br /><strong>Conclusion</strong>: <br />Be careful when bragging; a noob should stick to doing what a noob is supposed to do<br /></p>
</blockquote>
<p>Martin Kleppmann’s <strong><em>Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems</em></strong> is a legendary book recommended by many experts. I never had time to read it before <s>more like I was too lazy</s>, but now that finals are approaching I have miraculously developed a burning desire to read <s>more like I don’t want to work on my assignments</s>.</p>
<p>Last year I took a Data Intensive Computing course, where I roughly learned some data processing frameworks and pipelines and then went on Kaggle to pretend to be getting into data science, but I had no deep understanding of how data is used in real engineering. Year 3 is suddenly almost over, and its biggest lesson for me (also something the author says in the book) is: <em>“You should know more than just a few buzzwords”</em>. It’s like using Linux: opening a browser on a Linux desktop to watch YouTube is “using Linux” <s>(sounds like me again)</s>, and actually building things on Linux is also “using Linux”. In the end, it’s not about what you’ve used, but how much you truly understand. That’s my motivation for reading this book. I’m posting my reading notes on the blog as a record and to share with everyone <s>(and to see how much I’ll still remember when I come back in a few years)</s></p>
<p>“Just keep learning.” Let’s encourage each other :)</p>

Streaming Systems: Data Processing, Watermarks & Advanced Windowing (2020-06-06, https://isdanni.com/streaming-system)
<p>This post contains my reading notes on <strong>Part 1, The Beam Model (Chapters 1-4)</strong> of the book, which covers the high-level batch and streaming data processing model called <a href="https://beam.apache.org/">Apache Beam</a>;</p>
<h1 id="streaming-101">Streaming 101</h1>
<h2 id="1-what-is-streaming">1. What is streaming?</h2>
<h3 id="streaming-system">Streaming System</h3>
<p>A type of <u>data processing engine</u> designed with <u>infinite</u> datasets in mind.</p>
<h3 id="shape-of-a-dataset">Shape of a dataset</h3>
<table>
<thead>
<tr>
<th> </th>
<th><strong>Cardinality</strong></th>
<th><strong>Constitution</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>definition</td>
<td>its size, with the most salient aspect of cardinality being whether the given dataset is infinite or finite;</td>
<td>physical manifestation, which defines the way one can interact with the given dataset;</td>
</tr>
<tr>
<td>types</td>
<td>- <strong>Bounded data</strong>: a dataset that is finite in size;<br />- <strong>Unbounded data</strong>: a dataset that is infinite in size(at least theoretically);</td>
<td>The two primary constitutions of importance are:<br /> - <strong>Table</strong>: a holistic view of a dataset at a specific point in time. SQL systems have traditionally dealt in tables;<br /> - <strong>Stream</strong>: an element-by-element view of the evolution of the dataset over time. The MapReduce lineage of data processing systems has traditionally dealt in streams.</td>
</tr>
</tbody>
</table>
<h3 id="why-stream-processing-is-important">Why is stream processing important?</h3>
<ul>
<li>business requires more <em><u>timely insights</u></em> & streaming achieves lower <em><u>latency</u></em>;</li>
<li>easier to manage massive, <em><u>unbounded</u></em> dataset that are increasingly common nowadays;</li>
<li>more <em><u>consistent, predictable consumption of resources</u></em>, since incoming data arrival is spread out more evenly over time;</li>
</ul>
<h2 id="2-background">2. Background</h2>
<h3 id="lambda-architecture">Lambda Architecture</h3>
<blockquote>
<p><strong><a href="https://en.wikipedia.org/wiki/Lambda_architecture">Lambda Architecture</a></strong>: a data processing architecture that uses a <u>stream system</u> to produce low-latency, inaccurate (either because of an approximation algorithm or because the system itself does not provide correctness) or speculative results, and a <u>batch system</u> to provide eventually correct results;</p>
</blockquote>
<blockquote>
<p>Some links:</p>
<ol>
<li><a href="http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html">How to beat the CAP theorem</a></li>
<li><a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/">Questioning the Lambda Architecture</a></li>
</ol>
</blockquote>
<p>The reason the Lambda Architecture is successful is that it can actually provide good results even though correctness is a bit of a letdown; however, it is a lot of work to <u>maintain two independent versions of the pipeline and merge the results at the end</u>;</p>
<p><a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/">Some people</a> argue against the <u>necessity of dual-mode execution</u> because of the issue of repeatability of using a replayable system(like <a href="https://kafka.apache.org/10/documentation/streams/core-concepts.html#streams_topology">Kafka</a>) so they propose the <a href="https://hazelcast.com/glossary/kappa-architecture/">Kappa Architecture</a>, which runs a single pipeline using a well designed & built system(like <a href="https://flink.apache.org/">Apache Flink</a>);</p>
<table>
<thead>
<tr>
<th style="text-align: center">Lambda Architecture</th>
<th style="text-align: center">Kappa Architecture</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/assets/images/post/ss/lambda.jpg" alt="lambda" /></td>
<td style="text-align: center"><img src="/assets/images/post/ss/kappa.jpg" alt="kappa" /></td>
</tr>
</tbody>
</table>
<h3 id="lambda-vs-kappa-architecture">Lambda vs Kappa Architecture</h3>
<p>Usually, if the real-time algorithm and the batch algorithm have different outputs, meaning the batch and real-time layers cannot be merged, then one must use the Lambda Architecture;</p>
<blockquote>
<p>TBC</p>
</blockquote>
<h3 id="batch-vs-streaming-efficiency">Batch vs Streaming Efficiency</h3>
<ul>
<li><strong>Batch</strong>: high-latency, higher-efficiency;</li>
<li><strong>Streaming</strong>: low-latency, lower-efficiency;</li>
</ul>
<p>But for streaming systems to achieve the same performance as batch systems, we only need to focus on two things:</p>
<ol>
<li><strong>correctness</strong>: <u>strong consistency</u> is required for <u>exactly-once processing</u>, which is required for <u>correctness</u>, which is required to meet a batch system’s level of performance. (ref: <a href="https://www.oreilly.com/content/why-local-state-is-a-fundamental-primitive-in-stream-processing/">Why local state is a fundamental primitive in stream processing</a>)</li>
<li><strong>tools for reasoning about time</strong>: essential for dealing with unbounded, unordered data of varying time skew;</li>
</ol>
<h3 id="event-time-vs-processing-time">Event Time vs Processing Time</h3>
<table>
<thead>
<tr>
<th style="text-align: center">—</th>
<th>Event Time</th>
<th style="text-align: center">Processing Time</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Definition</td>
<td>the time at which events actually occured</td>
<td style="text-align: center">the time at which events are observed in the system</td>
</tr>
</tbody>
</table>
<p>Some variables that can affect the skew between event time and processing time:</p>
<ul>
<li>shared resource limitations like network congestion, network partitions, shared CPU, etc.;</li>
<li>software causes like distributed system logic, contention, etc.;</li>
<li>features of the data like distribution, variance in throughput, variance in disorder;</li>
</ul>
<p><img src="/assets/images/post/ss/event-process.png" alt="event-process" width="350" /></p>
<p>Because the overall mapping between event time and processing time is not static (the lag/skew can vary arbitrarily over time), we cannot analyze data solely by the time at which it is observed;</p>
<p>To cope with unbounded data, many systems implement <em>windowing</em> of the incoming data, meaning chopping the dataset into finite pieces along temporal boundaries;</p>
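<p>A minimal sketch of windowing by event time (a toy tumbling-window assignment, not any particular engine’s API): each record carries its event timestamp, and the window it lands in depends only on that timestamp, not on when the record arrives:</p>

```python
from collections import defaultdict

def tumbling_windows(events, size_s=60):
    """Assign (event_time_s, value) records to fixed-size windows keyed
    by the window's start time, using event time rather than arrival time."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[(ts // size_s) * size_s].append(value)
    return dict(windows)

# Records arrive out of order; event time still places them correctly:
events = [(10, "a"), (130, "c"), (65, "b"), (70, "d")]
print(tumbling_windows(events))  # {0: ['a'], 120: ['c'], 60: ['b', 'd']}
```

<p>In a real engine the open question is when a window can be declared complete, which is exactly what watermarks address later in the book.</p>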
<h3 id="data-processing-patterns">Data Processing Patterns</h3>
<ul>
<li>Bounded Data</li>
</ul>
<p>pretty straightforward: run the dataset through some data processing engine to get a structured dataset with greater inherent value;</p>
<ul>
<li>
<p>Unbounded Data</p>
</li>
<li>
<p>Fixed windows</p>
</li>
</ul>
<p>The most common way: repeatedly run a batch engine over input data windowed into fixed-size windows (sometimes called tumbling windows), each treated as a separate data source;</p>
<ul>
<li>Sessions</li>
</ul>
<h1 id="reference">Reference</h1>
<ul>
<li><a href="https://www.ericsson.com/en/blog/2015/11/data-processing-architectures--lambda-and-kappa">Picture for Lambda Architecture & Kappa Architecture</a></li>
<li><a href="https://arxiv.org/pdf/1506.08603.pdf">Lightweight Asynchronous Snapshots for Distributed Dataflows</a></li>
<li><a href="https://www.oreilly.com/radar/questioning-the-lambda-architecture/">Questioning the Lambda Architecture</a></li>
</ul>DanniThis post is my reading notes of Part 1, The Beam Model(Chapter 1-4) from the book, which covers the high-level batch, streaming data processing model called Apache Beam;Ubuntu 18.04 LTS Dual Boot with Win10(BIOS Legacy & MBR) [2020 UPDATE]2020-06-04T00:00:00+00:002020-06-04T00:00:00+00:00https://isdanni.com/ubuntu-18-04<blockquote>
<p><strong>[UPDATE June 2020]</strong> Spilt water on my computer last month, and while I was trying to fix it I completely messed up the network interfaces and sources.list to the extent that I had to reinstall the Linux distro; thought I’d update this post I wrote over two years ago. Hope this helps you:)</p>
</blockquote>
<h1 id="1-what-you-need">1. What you need</h1>
<ol>
<li>A USB stick/flash drive. The official guide on the Ubuntu website says at least 4 GB; personally, I used a 30 GB stick. (Too big, I know, but just to be safe.)</li>
<li>MS Windows XP or later that is working on the PC.</li>
<li>Rufus / UltraISO / Universal USB Installer, etc.: a tool that can write the Ubuntu ISO (download here) to your USB stick for the installation. <strong>Choose this carefully.</strong> Some common installation issues are caused by the tool you choose.</li>
<li>Enough unallocated space on disk.</li>
</ol>
<h1 id="2-make-your-bootable-usb-stick">2. Make your bootable USB stick</h1>
<p>Before we start, I want to emphasize one thing: always check your disk’s partition format.</p>
<p>There are two ways of partitioning a drive: <code class="language-plaintext highlighter-rouge">MBR (Master Boot Record)</code> and <code class="language-plaintext highlighter-rouge">GPT (GUID Partition Table)</code>. (To check your format, go to Disk Management and right-click Disk 0 to see its properties; mine is MBR.) So what’s the difference between MBR and GPT? Well, MBR is old and GPT is new, but as “the new is not always better than the old”, which I quoted from <a href="https://www.disk-partition.com/gpt-mbr/mbr-vs-gpt-1004.html">here</a>, each has its own pros and cons.</p>
<p>A GPT disk can be larger than 2 TB, while an MBR disk cannot. Both can be dynamic or basic. Also, GPT supports up to 128 partitions, while MBR only supports four primary ones.</p>
<p>Usually, we associate <code class="language-plaintext highlighter-rouge">MBR + BIOS</code> and <code class="language-plaintext highlighter-rouge">GPT + UEFI</code> together. If a Windows PC boots via UEFI, it will only support GPT.</p>
<p>Write the downloaded Ubuntu ISO to your USB stick. <strong>The USB stick will be formatted, so remember to back up its data.</strong> Remember to set the partition scheme to MBR and the file system to FAT32 (the default).</p>
<p>Please also check the <a href="https://tutorials.ubuntu.com/tutorial/tutorial-create-a-usb-stick-on-windows#0">official Ubuntu guide</a>.</p>
<h1 id="3-get-into-boot-menu">3. Get into Boot Menu</h1>
<p><img src="/assets/images/post/linux/start.png" alt="start" /></p>
<p>Restart the PC. Know the shortcut to enter the boot menu: for a ThinkPad it’s F12. After the Lenovo logo shows on screen, quickly press it before the logo disappears. A line will appear in white: <code class="language-plaintext highlighter-rouge">Entering Boot Menu</code>.</p>
<p>Here are shortcuts for some other PCs. Since I haven’t tried them all myself, I strongly suggest you verify before actually starting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Leveno PC: F12 or F1
Dell laptop: F12
HASEE laptop: F2
Sony laptop: DEL or F2 or F9
Samsung laptop: F10
IBM Pc: F12
</code></pre></div></div>
<p>Then you will see something like <a href="https://www.theregister.co.uk/2013/07/19/review_lenovo_thinkpad_helix_corei7_convertible/?page=2">this</a>. (Disclaimer: I took the image from that website; I do not own the copyright to this file.)</p>
<p><img src="/assets/images/post/linux/lenovo-boot.jpg" alt="boot-menue" /></p>
<ol>
<li>Disable <strong>Secure Boot</strong> after entering the boot menu.</li>
<li>Set the boot priority to <code class="language-plaintext highlighter-rouge">Legacy First</code> since the disk format is MBR.</li>
<li>Choose the USB stick in the boot queue.</li>
</ol>
<div class="language-md highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UEFI/Legacy Boot [Both]
UEFI/Legacy Boot Priority [Legacy First]
CSM Support [Yes]
</code></pre></div></div>
<p>Press Esc, then y, to save and exit.</p>
<h1 id="4-now-install-ubuntu">4. Now install Ubuntu!</h1>
<p>If things go well, after selecting the USB in the boot menu you should see a menu listing “<strong>Try Ubuntu</strong>”, “<strong>Install Ubuntu</strong>”, …</p>
<p><img src="/assets/images/post/linux/install-try-ubuntu.jpeg" alt="install-try" /></p>
<p>Select Install or Try (it doesn’t matter, unless you really wanna play with it first for a bit). Remember to select “Something else” at the installation-type step.</p>
<p><img src="/assets/images/post/linux/partition.png" alt="partition" /></p>
<p><strong>Note</strong>: There are many partition schemes online; choose one that suits you! You can check the partition guide in the official Ubuntu wiki. Here’s mine, just for your reference;</p>
<p>Also, you can always boot into the live USB later and adjust the system partitions if you would like; be very careful tho ;-)</p>
<ul>
<li><strong>swap</strong>: size of RAM, or twice the size;</li>
<li><strong>/</strong>: minimum is 8 GB, but it is recommended to have at least 15 GB; # system will be blocked if root is full</li>
<li><strong>/boot</strong>: 250 MB ~ 1 GB; # sometimes required, but do not use the same one for several Linux distros;</li>
<li><strong>/home</strong>: as large as possible, especially when you install your Dropbox here and have a lot of files; # if you don’t want a separate home, just merge it with root;</li>
</ul>
<p>Here’s my updated partition as of 2020; I allocated most of the space to <code class="language-plaintext highlighter-rouge">/home</code> because I keep my Dropbox and most of my side projects there, and I need to ensure the data is safe in case of a drive failure/upgrade, though the general consensus nowadays is to just use / (which includes /home);</p>
<p><img src="/assets/images/post/linux/gparted.png" alt="gparted" />
<br /><br /></p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 2020 June: This is my outdated partition, please check image above!</span>
<span class="c"># on sda:</span>
/dev/sda3 / ext4 primary beginning 30GB
/dev/sda4 swap logical beginning 5GB
/dev/sda5 /boot ext4 logical beginning 1GB
/dev/sda6 /home ext4 logical beginning 200GB
<span class="c"># make sure this is as large as possible</span>
</code></pre></div></div>
<h1 id="5-use-easybcd-for-boot-loader">5. Use EasyBCD for boot loader</h1>
<p>After installation, <code class="language-plaintext highlighter-rouge">restart</code> and <code class="language-plaintext highlighter-rouge">enter Windows</code>. Download <a href="https://neosmart.net/EasyBCD/">EasyBCD</a>(This is for BIOS) and add an entry:</p>
<p><img src="/assets/images/post/linux/easy-bcd.png" alt="easy-bcd" /></p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Add Entry: Linux/BSD
Type: GRUB<span class="o">(</span>Legacy<span class="o">)</span>
Name: define yourself
Bootloader: /boot partition <span class="c"># if you have a separate /boot, else just the ubuntu partition </span>
Edit Menu: <span class="c"># Now you should have two entries, one Windows one Linux.</span>
</code></pre></div></div>
<p>Then <strong>restart</strong>; now you can choose an OS as you wish!</p>
<h1 id="6-some-issues">6. SOME issues</h1>
<h2 id="a-failed-to-load-ldlinux32">a. Failed to load ldlinux.32</h2>
<p>This happened the first time I tried to install Linux. I was a complete novice and knew nothing about low-level OS details. In general, this error can be caused by a lot of things: a broken USB port, a corrupted ISO image, driver incompatibility…</p>
<p>For me it was the writing software: I switched from UltraISO to Win32 Disk Imager and it all worked out. (But that tool is now deprecated, so I strongly suggest not following this.)</p>
<h2 id="b-underscore-flashing-on-black-screen-after-booting-into-newly-installed-ubuntu">b. Underscore flashing on black screen after booting into newly installed Ubuntu</h2>
<p>Something like this:</p>
<p><img src="/assets/images/post/linux/black.png" alt="black screen blinking cursor" /></p>
<p><strong>Grub</strong> issues.</p>
<p>This happened so many times I could recite all the commands I tried in my sleep. Basically, to repair it you can <strong>boot into the live USB</strong> after installation, add the boot-repair repository, and run <code class="language-plaintext highlighter-rouge">boot-repair</code>; if that still does not work, try installing GRUB in your /boot partition.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>add-apt-repository ppa:yannubuntu/boot-repair
<span class="nb">sudo </span>apt-get update
<span class="nb">sudo </span>apt-get <span class="nb">install</span> <span class="nt">-y</span> boot-repair <span class="o">&&</span> boot-repair
</code></pre></div></div>
<p><img src="/assets/images/post/linux/boot-repair.png" alt="boot-repair" /></p>
<p><strong>Note</strong>: sometimes before boot-repair starts you may get a prompt asking whether this drive is the fixed drive; remember to choose “No” if you are installing Ubuntu on your PC;</p>
<p>Check links <a href="https://help.ubuntu.com/community/Boot-Repair">here</a> and <a href="https://www.linux.com/learn/how-rescue-non-booting-grub-2-linux%20%20">here</a>.</p>
<h2 id="c-grub-rescue-mode">c. GRUB rescue mode</h2>
<p>This is also related to a broken GRUB; it can happen after you reboot into the Ubuntu partition. To fix it, simply run the commands below and find the root partition.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>grub rescue <span class="o">></span> <span class="nb">ls</span>
<span class="o">(</span>hd0<span class="o">)</span> <span class="o">(</span>hd0,msdos5<span class="o">)</span> <span class="o">(</span>hd0,msdos3<span class="o">)</span> <span class="o">(</span>hd0,msdos2<span class="o">)</span> <span class="o">(</span>hd0,msdos1<span class="o">)</span> <span class="o">(</span>hd1<span class="o">)</span> <span class="o">(</span>hd1,msdos1<span class="o">)</span>
grub rescue <span class="o">></span> <span class="nb">ls</span> <span class="o">(</span>hd0,msdos1<span class="o">)</span> <span class="c"># try to recognize which partition is this</span>
grub rescue <span class="o">></span> <span class="nb">ls</span> <span class="o">(</span>hd0,msdos2<span class="o">)</span> <span class="c"># let's assume this is the linux partition</span>
grub rescue <span class="o">></span> <span class="nb">set </span><span class="nv">root</span><span class="o">=(</span>hd0,msdos2<span class="o">)</span>
grub rescue <span class="o">></span> <span class="nb">set </span><span class="nv">prefix</span><span class="o">=(</span>hd0,msdos2<span class="o">)</span>/boot/grub <span class="c"># or wherever grub is installed</span>
grub rescue <span class="o">></span> insmod normal <span class="c"># if this produced an error, reset root and prefix to something else ..</span>
grub rescue <span class="o">></span> normal
</code></pre></div></div>
<h1 id="some-useful-links">Some Useful Links</h1>
<ul>
<li><a href="https://askubuntu.com/questions/21719/how-large-should-i-make-root-home-and-swap-partitions">Ubuntu suggested partition</a></li>
<li><a href="https://help.ubuntu.com/community/Boot-Repair">Ubuntu help wiki - Boot Repair</a></li>
</ul>
<p><strong>“Welcome to the producer side!”</strong></p>Danni[UPDATE June 2020] Spilt water on my computer last month and while I was trying to fix it I completely messed up the network interfaces and sources.list to the extent that I had to reinstall the Linux distro; Thought I’d update this post I wrote over 2 years ago. Hope this helps you:)Reservoir Sampling and Randomized Algorithms2020-05-24T00:00:00+00:002020-05-24T00:00:00+00:00https://isdanni.com/reservoir_sampling_and_randomized_algorithms<blockquote>
<p>How the randomized algorithms work and its implementation in streaming systems</p>
</blockquote>
<h1 id="randomized-algorithm">Randomized Algorithm</h1>
<p>A <strong>randomized algorithm</strong> applies a certain level of randomness as part of its logic. It usually uses <a href="https://en.wikipedia.org/wiki/Discrete_uniform_distribution">uniform random</a> selection (<u>each element of a dataset of N elements has a 1/N probability of being chosen</u>) to define the behaviour of an auxiliary input, in the hope of achieving good performance in the <u>average case</u>;</p>
<p>Randomized algorithms can be random in the following aspects:</p>
<ol>
<li>The operations performed on the actual problem are random;</li>
<li>The computing complexity of the problem is a random variable;</li>
<li>The algorithm output is random (it might be right or wrong);</li>
</ol>
<h2 id="a-choose-one-element-randomly">a. choose one element randomly</h2>
<p>When the ith element arrives, it must be chosen with probability 1/i (and left out with probability 1 - 1/i);</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// One element</span>
<span class="c1">// Proof of uniform random</span>
<span class="c1">// for ith item, the probability of being chosen</span>
<span class="mi">1</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">))</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">))</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/</span><span class="n">n</span><span class="o">)</span>
<span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="n">i</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">)</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="o">/</span> <span class="n">n</span>
<span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">n</span>
</code></pre></div></div>
<h2 id="b-choose-k-elements-randomly">b. choose k elements randomly</h2>
<p>When the ith element arrives, it is chosen with probability k/i (and not chosen with probability 1 - k/i);</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// K elements</span>
<span class="c1">// Proof of uniform random</span>
<span class="c1">// for ith item, the probability of being chosen</span>
<span class="n">k</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">k</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">k</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">k</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">)</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">k</span><span class="o">)</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">k</span><span class="o">/</span><span class="n">n</span> <span class="o">*</span> <span class="mi">1</span><span class="o">/</span><span class="n">k</span><span class="o">)</span>
<span class="o">=</span> <span class="n">k</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">))</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">))</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="mi">1</span> <span class="o">-</span> <span class="mi">1</span><span class="o">/</span><span class="n">n</span><span class="o">)</span>
<span class="o">=</span> <span class="n">k</span><span class="o">/</span><span class="n">i</span> <span class="o">*</span> <span class="n">i</span><span class="o">/(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)</span> <span class="o">*</span> <span class="o">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="o">)/(</span><span class="n">i</span><span class="o">+</span><span class="mi">2</span><span class="o">)</span> <span class="o">*</span> <span class="o">...</span> <span class="o">*</span> <span class="o">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="o">)</span> <span class="o">/</span> <span class="n">n</span>
<span class="o">=</span> <span class="n">k</span><span class="o">/</span><span class="n">n</span>
</code></pre></div></div>
<h1 id="1-reservoir-sampling">1. Reservoir sampling</h1>
<p><strong>Reservoir sampling</strong> is a family of <a href="https://en.wikipedia.org/wiki/Randomized_algorithm">randomized algorithms</a> for <strong>choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items</strong>.</p>
<p><strong>NOTE</strong>:</p>
<ul>
<li>size n here usually cannot fit into <a href="https://en.wikipedia.org/wiki/Main_memory">main memory</a>;</li>
<li>n is unknown and revealed over time; otherwise it would be too easy;</li>
<li>time complexity required is <code class="language-plaintext highlighter-rouge">O(N)</code>;</li>
<li>probability of each item being chosen must be <code class="language-plaintext highlighter-rouge">k/n</code>;</li>
</ul>
<h2 id="simple-algorithm">Simple Algorithm</h2>
<p>The commonly used algorithm that is <strong>simple but slow</strong> is known as <a href="http://www.cs.umd.edu/~samir/498/vitter.pdf">Algorithm R</a>:</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="cm">/**
* Algorithm R works by maintaining a reservoir size of k,
* 1. which initially contains the first k items of the input;
* then iterates over the remaining items until the input is exhausted.
* 2. when reaches the ith item
* a. if i >= k, random choose d in [0, i]
* if d is within [0, k -1], use ith item to replace dth item in reservoir;
* 3. repeat second step;
*/</span>
<span class="kt">int</span><span class="o">[]</span> <span class="n">reservoir</span> <span class="o">=</span> <span class="k">new</span> <span class="kt">int</span><span class="o">[</span><span class="n">k</span><span class="o">];</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">reservoir</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span>
<span class="o">{</span>
<span class="n">reservoir</span><span class="o">[</span><span class="n">i</span><span class="o">]</span> <span class="o">=</span> <span class="n">dataStream</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
<span class="o">}</span>
<span class="k">for</span> <span class="o">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">k</span><span class="o">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">dataStream</span><span class="o">.</span><span class="na">length</span><span class="o">;</span> <span class="n">i</span><span class="o">++)</span>
<span class="o">{</span>
<span class="c1">// random integer in [0, i];</span>
<span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="n">rand</span><span class="o">.</span><span class="na">nextInt</span><span class="o">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="o">);</span>
<span class="c1">// if integer is within [0, m-1],then replace reservoir</span>
<span class="k">if</span> <span class="o">(</span><span class="n">d</span> <span class="o"><</span> <span class="n">k</span><span class="o">)</span>
<span class="o">{</span>
<span class="n">reservoir</span><span class="o">[</span><span class="n">d</span><span class="o">]</span> <span class="o">=</span> <span class="n">dataStream</span><span class="o">[</span><span class="n">i</span><span class="o">];</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p><strong>CONS</strong>: the <a href="https://en.wikipedia.org/wiki/Big_O_notation">asymptotic running time</a> is <code class="language-plaintext highlighter-rouge">O(n)</code>, with one random draw per item, which makes the algorithm unnecessarily slow if the input population is large.</p>
<h2 id="distributedparallel-reservoir-sampling">Distributed/Parallel Reservoir Sampling</h2>
<p>In distributed systems, <strong>main memory & IO ops</strong> would be the bottleneck; so for data at a very large scale, we can improve the overall performance with a parallel algorithm:</p>
<ol>
<li>Assume we have <code class="language-plaintext highlighter-rouge">m</code> machines; divide the stream into <code class="language-plaintext highlighter-rouge">m</code> substreams, let every machine run reservoir sampling over its own substream, and note the substream sizes as <code class="language-plaintext highlighter-rouge">N1, N2, ..., Nk, ..., Nm</code> => N1 + N2 + N3 + … + Nm = N;</li>
<li>To merge, choose a <strong>random number</strong> d from <code class="language-plaintext highlighter-rouge">[1, N]</code>:
a. if d &lt;= N1, take a replacement from the first machine’s reservoir, and so on; repeat m times;</li>
</ol>
<p>=> m / N</p>
<h2 id="implementation">Implementation</h2>
<p>Because reservoir sampling has <strong>O(N) time complexity</strong> and <strong>O(k) space complexity</strong>, it is usually adopted in streaming systems where statistical sampling is required; for example, randomly outputting n lines from a large-scale dataset;</p>
<p>For algorithm lovers, you could also find some common problems like: <a href="https://leetcode.com/problems/linked-list-random-node/">linked list random node</a>, <a href="https://leetcode.com/problems/random-pick-index/">pick random index</a>;</p>
<h2 id="limitations">Limitations</h2>
<p>Reservoir sampling makes the assumption that the desired sample fits into main memory, often <strong>implying that k is a constant independent of n</strong>.</p>
<blockquote>
<p>ref: wiki: <a href="https://en.wikipedia.org/wiki/Reservoir_sampling#Limitations">reservoir sampling#limitations</a></p>
</blockquote>
<p>In applications where we would like to select a large subset of the input list (say a third, i.e. <code class="language-plaintext highlighter-rouge">k=n/3</code>), other methods need to be adopted. Distributed implementations for this problem have been proposed.</p>
<h1 id="2-geometric-distribution">2. Geometric Distribution</h1>
<p>Time Complexity O(K + Klog(N/K))</p>
<p><img src="/assets/images/post/geometric_distribution.jpg" alt="geometric-distribution" /></p>
<h1 id="3-fisheryates-shuffle">3. Fisher–Yates shuffle</h1>
<p>The Fisher–Yates shuffle is used to generate a random permutation of a finite sequence, i.e. to shuffle the sequence;</p>
<p>So choosing k items randomly from the sequence is equivalent to running the shuffle for k steps and taking the k items it places, just like shuffling off the top k cards of a deck;</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- To shuffle an array a of n elements (indices 0..n-1):
for i from n−1 downto 1 do
j ← random integer such that 0 ≤ j ≤ i
exchange a[j] and a[i]
</code></pre></div></div>
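<p>The pseudocode above translates directly to Java; as a sketch (my own code, with illustrative names), a partial run of the same loop also answers the “choose k items” question, since only the last k slots need to be shuffled:</p>

```java
import java.util.Random;

class Shuffle {
    // Full Fisher-Yates shuffle of the array, in place.
    static void fisherYates(int[] a, Random rand) {
        for (int i = a.length - 1; i >= 1; i--) {
            int j = rand.nextInt(i + 1); // 0 <= j <= i
            int tmp = a[j]; a[j] = a[i]; a[i] = tmp;
        }
    }

    // Run only k iterations: afterwards the last k slots hold a
    // uniformly random k-subset, in random order.
    static int[] chooseK(int[] a, int k, Random rand) {
        int[] copy = a.clone();
        for (int i = copy.length - 1; i >= copy.length - k; i--) {
            int j = rand.nextInt(i + 1);
            int tmp = copy[j]; copy[j] = copy[i]; copy[i] = tmp;
        }
        int[] out = new int[k];
        System.arraycopy(copy, copy.length - k, out, 0, k);
        return out;
    }
}
```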
<h1 id="reference">Reference</h1>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Reservoir_sampling">Reservoir sampling</a>;</li>
<li><a href="https://en.wikipedia.org/wiki/Randomized_algorithm">randomized algorithms</a>;</li>
<li><a href="https://peteroupc.github.io/randomfunc.html">random functions</a></li>
<li><a href="https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle">Fisher–Yates shuffle</a></li>
<li><a href="https://en.wikipedia.org/wiki/Geometric_distribution">Geometric Distribution</a></li>
</ul>DanniHow the randomized algorithms work and its implementation in streaming systemsClean Code: A Handbook of Agile Software Craftsmanship2020-05-19T00:00:00+00:002020-05-19T00:00:00+00:00https://isdanni.com/clean-code<p>Finally had the time to <del>almost</del> finish this book ;-)</p>
<h1 id="naming">Naming</h1>
<ul>
<li>Use descriptive and unambiguous names;</li>
<li>Avoid misunderstanding; (e.g. Use <code class="language-plaintext highlighter-rouge">accountList</code> for a list of accounts unless it is the real list data type, otherwise <code class="language-plaintext highlighter-rouge">accounts</code> or <code class="language-plaintext highlighter-rouge">AccountGroup</code> would be better);</li>
<li>Use meaningful distinction; (e.g. Usually do not use <code class="language-plaintext highlighter-rouge">a</code> or <code class="language-plaintext highlighter-rouge">the</code> for variable prefix since it is hard to distinguish what it actually means)</li>
<li>Use names that can be pronounced;</li>
<li>Use searchable names => easier to adjust during the debugging & code-review stage;</li>
<li>Be consistent;</li>
<li>Avoid encodings:
<ul>
<li>do not append type prefixes/postfixes like <code class="language-plaintext highlighter-rouge">strings</code> or <code class="language-plaintext highlighter-rouge">str</code>; the compiler can distinguish types itself;</li>
</ul>
</li>
<li>Replace <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a> with named constants;</li>
</ul>
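<p>A tiny, hypothetical before/after (the names are mine, not from the book) combining two of the rules above — searchable names, and a named constant replacing a magic number:</p>

```java
// Before: if (d > 7) { ... }  -- neither "d" nor "7" can be searched for.
class RetentionPolicy {
    // Named constant: greppable, and the unit lives in the name.
    static final int MAX_IDLE_DAYS = 7;

    static boolean isExpired(int idleDays) {
        return idleDays > MAX_IDLE_DAYS;
    }
}
```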
<h1 id="function">Function</h1>
<ul>
<li><strong>small</strong>;</li>
<li>Do <strong>ONE</strong> thing;</li>
<li>Use <strong>descriptive</strong> name;</li>
<li>Arguments:
<ul>
<li>have fewer arguments => functions & arguments are on different abstract levels;</li>
<li>avoid passing <code class="language-plaintext highlighter-rouge">Boolean</code> as input;</li>
<li>If a function has arguments but no output, it should be an event; otherwise it must have a return value;</li>
</ul>
</li>
<li>No side effects;</li>
<li>Use exception instead of error;</li>
<li>Goal: Eliminate duplicate functions;</li>
</ul>
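<p>As a hypothetical illustration of the “avoid passing Boolean as input” rule above (example and names are mine, not from the book): a flag argument means the function does two things, so split it:</p>

```java
// Before: render(true) -- unreadable at the call site, and the function
// necessarily branches into two behaviours.
class Report {
    // After: two functions, each doing one thing.
    static String renderForSuite() { return "suite report"; }
    static String renderForSingleTest() { return "single-test report"; }
}
```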
<h1 id="comments">Comments</h1>
<ul>
<li><strong>NOTE</strong>: some comments might be outdated, or just simply wrong;</li>
<li>Do not comment on an ill-formed function; reconstruct the function instead;</li>
<li>Dos:
<ul>
<li>Alert</li>
<li>Use <code class="language-plaintext highlighter-rouge">// TODO</code> if necessary;</li>
<li>Always try to explain the code;</li>
</ul>
</li>
<li>Don’ts:
<ul>
<li>Don’t be redundant.</li>
<li>Don’t add obvious noise.</li>
<li>Don’t use closing brace comments.</li>
<li>Don’t comment out code;</li>
</ul>
</li>
</ul>
<h1 id="source-code-structure">Source code structure</h1>
<ul>
<li>Shorter file is easier to understand;</li>
<li>Use indentation, even when the function only has a one-line statement or is empty;</li>
<li>Declare variables near their usage;</li>
<li>Vertically distance:
<ul>
<li>Do not place similar concepts in different folders unless there is a very good reason;</li>
<li>the control variable of a loop should always be declared within the loop statement;</li>
<li>Keep functions with similar usage close;</li>
<li>The function that is calling should always be placed on top of the called function;</li>
</ul>
</li>
<li>Source code should be clear and well-structured; its name shows it’s in the correct module; at the beginning of the file, it should display the high-level concept & algorithm, then details in the following sections;</li>
<li>Follow the team rule;</li>
<li>Boundaries:
<ul>
<li>Hide third-party APIs;
<ul>
<li>once the third-party package changes, it’s easier to change our own codebase;</li>
<li>consistent code style, easier to read;</li>
</ul>
</li>
<li>Write tests for third-party APIs:
<ul>
<li>Learning test: a faster way to understand its usage;</li>
<li>Efficient way to know if the API function changes;</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="object--data-structures">Object & Data Structures</h1>
<ul>
<li>Avoid hybrid structures => half object half data;</li>
<li>Only do <strong>one</strong> thing;</li>
<li>Do not put public accessors & functions with change operations together;</li>
<li>If we want to frequently add functions instead of new objects, we should use the <strong>procedure-oriented</strong> programming style => only adds in one place;</li>
<li>If we want to frequently add data objects, use <strong>OOP</strong> => does not change other data code;</li>
<li>Hide internal implementation/structures;</li>
<li>Prefer non-static methods;</li>
<li>Better to implement many functions than passing many arguments into one function to select a behaviour;</li>
</ul>
<h1 id="error-handeling">Error handeling</h1>
<ul>
<li>Use exceptions instead of code;</li>
<li>Use <strong>unchecked exceptions</strong> => does not require try/catch or throw to compile => simplify the codes;</li>
<li>Add an exception message => why did this fail? The default exception only provides a stack trace; if the system has a logging system, log it;</li>
<li>Do not return <code class="language-plaintext highlighter-rouge">NULL</code> => If returned, we need to constantly check the <code class="language-plaintext highlighter-rouge">NULL</code> value, thus prone to <code class="language-plaintext highlighter-rouge">NullPointerException</code>;</li>
<li>Do not pass <code class="language-plaintext highlighter-rouge">NULL</code>;</li>
<li>Define <strong>Special Case Pattern</strong>;</li>
</ul>
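The book’s examples are in Java, but the Special Case Pattern can be sketched in Go as well (the Employee names below are hypothetical): instead of returning <code class="language-plaintext highlighter-rouge">NULL</code> and making every caller check for it, return a special-case object with safe defaults.

```go
package main

import "fmt"

// Employee is an illustrative domain type.
type Employee interface{ Payout() int }

type SalariedEmployee struct{ Salary int }

func (e SalariedEmployee) Payout() int { return e.Salary }

// NullEmployee is the special case: a safe, do-nothing stand-in
// that callers can use without a nil check.
type NullEmployee struct{}

func (NullEmployee) Payout() int { return 0 }

var registry = map[string]Employee{"alice": SalariedEmployee{Salary: 100}}

// Lookup never returns nil, so callers can never hit a nil dereference.
func Lookup(name string) Employee {
	if e, ok := registry[name]; ok {
		return e
	}
	return NullEmployee{}
}

func main() {
	// No nil check needed, even for unknown names.
	fmt.Println(Lookup("alice").Payout())
	fmt.Println(Lookup("bob").Payout())
}
```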
<h1 id="tests">Tests</h1>
<ul>
<li>Keep test code as clean as production code, but hold it to a different standard => production code usually aims for performance; test code does not;</li>
<li>Aim for high test coverage;</li>
<li>Readability is important;</li>
<li><strong>TDD</strong>: first build the test data, then operate on it, then verify the result (Build-Operate-Check);</li>
<li><strong>FIRST</strong> rule: F(ast), I(ndependent), R(epeatable), S(elf-validating), T(imely);</li>
</ul>
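The Build-Operate-Check sequence can be sketched in Go (<code class="language-plaintext highlighter-rouge">Sum</code> is a placeholder unit under test; a real Go test would use the <code class="language-plaintext highlighter-rouge">testing</code> package):

```go
package main

import "fmt"

// Sum is the (illustrative) unit under test.
func Sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	// Build: create the test data.
	data := []int{1, 2, 3}
	// Operate: run the code under test.
	got := Sum(data)
	// Check: verify the result.
	if got != 6 {
		fmt.Println("FAIL: expected 6, got", got)
		return
	}
	fmt.Println("PASS")
}
```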
<h1 id="code-smells">Code smells</h1>
<blockquote>
<p>When to change the code? When the code has bad smells;</p>
</blockquote>
<p>Here’s a list:</p>
<h2 id="comment">Comment</h2>
<ul>
<li>Unwanted information => e.g. change history;</li>
<li>Commented out code; => Just delete it;</li>
<li>Comment that is too obvious => comment should have information the code does not offer;</li>
<li>Outdated comment;</li>
</ul>
<h2 id="environment">Environment</h2>
<ul>
<li>How many steps it takes to build the project => the build should be a single operation;</li>
<li>How many steps it takes to run the tests;</li>
</ul>
<h2 id="functions">Functions</h2>
<ul>
<li>Dead function => never called;</li>
<li>Over-complicated;</li>
<li>Too many arguments;</li>
<li>Output arguments;</li>
<li>Flag arguments;</li>
</ul>
<h2 id="parameters">Parameters</h2>
<ul>
<li>Does not follow standard naming conventions;</li>
<li>Uses all kinds of inconsistent prefixes/suffixes;</li>
<li>Does not explain what it is used for;</li>
<li>…</li>
</ul>
<h2 id="testing">Testing</h2>
<ul>
<li>Not enough coverage;</li>
<li>No coverage tools;</li>
<li>Neglect small tests;</li>
<li>Too slow;</li>
</ul>
<h2 id="general">General</h2>
<ol>
<li>Rigidity: difficult to change. A small change causes a cascade of subsequent changes;</li>
<li>Fragility: breaks in many places due to a single change;</li>
<li>Immobility: cannot reuse parts of the code in other projects because of involved risks and high effort;</li>
<li>Needless Complexity;</li>
<li>Needless Repetition;</li>
<li>Opacity: hard to understand the code;</li>
</ol>DanniFinally had the time to almost finish this book ;-)Design patterns in systems with limited memory2020-03-09T00:00:00+00:002020-03-09T00:00:00+00:00https://isdanni.com/patterns-for-systems-with-limited-memory<blockquote>
<p>Reading Small Memory Software: Patterns for systems with limited memory</p>
</blockquote>
<p>2020 so far has been a train wreck. Without any classes on campus, I did manage to spend some time focusing on learning design patterns in software and systems in general: a goal I set a year ago but never had the time to pursue.</p>
<p><strong>Small Memory Software</strong> is a classic for those wishing to learn more about system design patterns & memory efficiency. Its first edition was published around 2000. Although the memory capacities engineers work with have changed greatly since then, many of its principles remain applicable to any software that relies on the efficient use of memory and other resources. I first discovered this book through an online forum, where some senior engineers recommended it: “at least it’s worthwhile to read a few chapters to see if you have run into the same memory constraints in the past”. I was quite skeptical at first, since the book looked unremarkable and rather old by tech-industry standards. So I read the <a href="http://smallmemory.com/1_IntroductionChapter.pdf">introduction</a> before making any purchase. It was a fun experience, especially for those who have faced similar issues: like a series of “yeah, same” moments linked together and made coherent. A few hours into reading, I bought the physical book.</p>
<h3 id="why-still-read-this-book">Why still read this book?</h3>
<p>So first, why should we still read this book? Computer memory used to be expensive, but now companies, and even individuals, can easily afford machines with plenty of memory. Yet as the computing power of mobile devices advances, we rely more and more on our phones and their huge number of applications, and the pressure on developers of such applications to support large request volumes has grown beyond imagination. So, yes, small-memory software is back.</p>
<p>In this post we will be focusing on key components like <strong>RAM</strong>, <strong>ROM</strong> and <strong>secondary storage</strong>. Of course there are other constraints, such as network, processing power, and graphics, that can slow down a real system, but more patterns already exist for the parameters mentioned above.</p>
<h3 id="small-archietecture">Small Architecture</h3>
<table>
<thead>
<tr>
<th> </th>
<th style="text-align: center"><strong>Embedded Systems</strong></th>
<th style="text-align: center"><strong>Mobile devices</strong></th>
<th style="text-align: center"><strong>PC</strong></th>
<th style="text-align: right"><strong>server farms</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>typical applications</strong></td>
<td style="text-align: center">Device control, protocol conversion, etc</td>
<td style="text-align: center">Diary, Address book, Phone, Email</td>
<td style="text-align: center">Word processing, spreadsheets, small databases, accounting.</td>
<td style="text-align: right">E-commerce, large database applications, accounting, stock control.</td>
</tr>
<tr>
<td><strong>UI</strong></td>
<td style="text-align: center">NA</td>
<td style="text-align: center">GUI; libraries in ROM</td>
<td style="text-align: center">GUI, with several possible libraries as DLLs on disk</td>
<td style="text-align: right">Implemented by clients, browsers or terminals</td>
</tr>
<tr>
<td><strong>Network</strong></td>
<td style="text-align: center">None, Serial Connection, or industrial LAN</td>
<td style="text-align: center">TCP/IP over a wireless connection</td>
<td style="text-align: center">10MBps LAN</td>
<td style="text-align: right">100 MBps LAN</td>
</tr>
<tr>
<td><strong>IO</strong></td>
<td style="text-align: center">As needed – often the main purpose of device.</td>
<td style="text-align: center">Serial connections</td>
<td style="text-align: center">Serial & parallel ports, modem, etc.</td>
<td style="text-align: right">Any, accessed via LAN</td>
</tr>
</tbody>
</table>
<h3 id="allocations">Allocations</h3>
<blockquote>
<p>pdf: <a href="http://smallmemory.com/6_AllocationChapter.pdf">allocations</a></p>
</blockquote>
<h5 id="fragmentation">Fragmentation</h5>
<p>For dynamic memory allocation there are two types of fragmentation: <strong>internal fragmentation</strong> and <strong>external fragmentation</strong> (and <strong>data fragmentation</strong>, as some would add). It usually happens when user processes are loaded into and removed from RAM in blocks, leaving main memory unable to load a new process even though enough total memory is available: it is scattered across many small blocks.</p>
<p>Memory will eventually run out no matter which allocation scheme we choose, so the best we can do for memory management is to pick a plan that fits the situation.</p>
<ol>
<li>
<p><strong>Fixed-size client memories</strong>: makes the user responsible for memory problems, but it becomes harder to provide the app’s full features, which can lower user engagement.</p>
</li>
<li>
<p><strong>Signal an error</strong>: it is easy to inform the client of the error, but it is more important to handle the error correctly; think of the partial-failure pattern. This approach usually gives us more options for handling memory problems.</p>
</li>
<li>
<p><strong>Reduce quality to reduce quantity</strong>: reducing quality can preserve system throughput. A popular example is reducing the quality of stored images, or lowering the sampling frequency.</p>
</li>
<li>
<p><strong>Delete old objects</strong>: a common practice. For example, if your pictures take too long to load in Instagram, you will most likely refresh or reopen the app. This is <em>Fresh Work Before Stale</em>: terminate old connections that are unlikely to be answered, and delete old objects to make room for new ones.</p>
</li>
<li>
<p><strong>Defer new requests</strong>/ <strong>IGNORE</strong>.</p>
</li>
</ol>
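Pattern 4 (“delete old objects”) can be sketched as a fixed-capacity cache that evicts its oldest entry to make room. This is a simplified FIFO sketch under assumed names, not code from the book:

```go
package main

import "fmt"

// FIFOCache is a fixed-capacity cache that deletes its oldest entry
// to make room for new ones: fresh work before stale.
type FIFOCache struct {
	capacity int
	order    []string // insertion order, oldest first
	items    map[string]string
}

func NewFIFOCache(capacity int) *FIFOCache {
	return &FIFOCache{capacity: capacity, items: map[string]string{}}
}

func (c *FIFOCache) Put(key, value string) {
	if _, exists := c.items[key]; !exists {
		if len(c.order) == c.capacity {
			// Evict the oldest object to keep memory use bounded.
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.items, oldest)
		}
		c.order = append(c.order, key)
	}
	c.items[key] = value
}

func (c *FIFOCache) Get(key string) (string, bool) {
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := NewFIFOCache(2)
	c.Put("a", "1")
	c.Put("b", "2")
	c.Put("c", "3") // evicts "a", the oldest entry
	_, ok := c.Get("a")
	fmt.Println(ok)
}
```

A production cache would more likely evict by least-recent *use* (LRU) rather than insertion order, but the memory bound works the same way.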
<p><img src="/img/post/small_mem/ALLOCATION.png" alt="Allocation Patterns" /></p>
<blockquote>
<p>I will be updating this post regularly till I finish the whole book. If you have any questions, feel free to discuss in the comment section below.</p>
</blockquote>DanniReading Small Memory Software: Patterns for systems with limited memoryWriting elegant Golang2019-12-20T00:00:00+00:002019-12-20T00:00:00+00:00https://isdanni.com/elegant-golang<p>Full disclosure, I didn’t start using Golang actively until recent months, even though I have always claimed to know it and put it in the language section of my resume (naively & shamelessly). But there is definitely a huge difference between knowing some common syntax and understanding the language completely at an engineering level.</p>
<p>Last week, while talking with a friend who started using Golang for his PhD thesis, I realized we both shared the same learning experience (though definitely not the most efficient learning curve):</p>
<ol>
<li>started a huge list of online tutorials;</li>
<li>proceeded to spend money on books;</li>
<li>gave up the first two & just started development;</li>
<li>hit bugs we couldn’t understand, solved them via online forums, and went back to the learning materials;</li>
</ol>
<p>One thing we both 100% agreed on: <strong>practice, practice, practice</strong>. More precisely, practicing by building something yourself. I have always believed the best way to learn programming language is a quick project that adopts most common features and has a progressive learning curve.</p>
<h3 id="personal-notes-on-writing-concise--elegant-golang">personal notes on writing concise & elegant Golang</h3>
<ol>
<li><code class="language-plaintext highlighter-rouge">gofmt</code>, <code class="language-plaintext highlighter-rouge">goimports</code>, <code class="language-plaintext highlighter-rouge">golangci-lint</code>, etc.</li>
<li>Standard Go Project Layout:
<ul>
<li>do NOT contain <code class="language-plaintext highlighter-rouge">/src</code>: especially for Java developers who are used to that layout;</li>
<li><code class="language-plaintext highlighter-rouge">/internal</code> modules cannot be used by external parties;</li>
</ul>
</li>
<li>do NOT use <code class="language-plaintext highlighter-rouge">init</code> for initializing resources like <code class="language-plaintext highlighter-rouge">rpc</code>, <code class="language-plaintext highlighter-rouge">DB</code>, or <code class="language-plaintext highlighter-rouge">Redis</code> connections, because <code class="language-plaintext highlighter-rouge">init</code> is executed implicitly: every time we declare an <code class="language-plaintext highlighter-rouge">init()</code> function, Go will load and run it <strong>prior</strong> to anything else in that package;</li>
</ol>
<p>At the <code class="language-plaintext highlighter-rouge">init</code> stage, it is best to keep to simple conditional checks, like using a <code class="language-plaintext highlighter-rouge">flag: True/False</code> to determine the status of parameters;</p>
<div class="language-golang highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// main.go</span>
<span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"fmt"</span>
<span class="p">)</span>
<span class="k">var</span> <span class="n">name</span> <span class="kt">string</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
<span class="n">name</span> <span class="o">=</span> <span class="s">"anonymous"</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">fmt</span><span class="o">.</span><span class="n">Printf</span><span class="p">(</span><span class="s">"My name is %s"</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Instead, it is better to use <strong>Client + NewClient</strong> for initializing connection.</p>
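A minimal sketch of the <strong>Client + NewClient</strong> idea (the address and fields below are hypothetical; a real client would dial the connection inside the constructor):

```go
package main

import (
	"errors"
	"fmt"
)

// Client wraps a resource connection. It is created explicitly by the
// caller, not implicitly by init(), so errors can be handled normally.
type Client struct {
	addr string
}

// NewClient validates its input and returns an error instead of failing
// at package load time, as an init()-based setup would.
func NewClient(addr string) (*Client, error) {
	if addr == "" {
		return nil, errors.New("empty address")
	}
	// A real client would dial the connection here.
	return &Client{addr: addr}, nil
}

func main() {
	c, err := NewClient("localhost:6379")
	if err != nil {
		fmt.Println("connect failed:", err)
		return
	}
	fmt.Println("client ready for", c.addr)
}
```

The caller decides when the connection is created and how failures are handled, which <code class="language-plaintext highlighter-rouge">init()</code> cannot offer.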
<ol start="4">
<li>
<p>Testing. Use frameworks like <a href="https://github.com/golang/mock">GoMock</a> (see this <a href="https://blog.codecentric.de/en/2017/08/gomock-tutorial/">tutorial</a>), <code class="language-plaintext highlighter-rouge">httpmock</code>, and <code class="language-plaintext highlighter-rouge">monkey</code> for testing;</p>
</li>
<li>
<p>Optimization:</p>
<ul>
<li>Instead of <code class="language-plaintext highlighter-rouge">fmt.Sprintf</code>, use <code class="language-plaintext highlighter-rouge">strconv</code>;</li>
<li>Use <code class="language-plaintext highlighter-rouge">sync.Pool</code> to re-use previously allocated objects and reduce the work of the garbage collector;</li>
<li>Avoid using structures containing pointers as keys for large maps:
<ul>
<li>For example, with a <code class="language-plaintext highlighter-rouge">map[string]int</code>, the garbage collector has to scan every key, since strings contain pointers;</li>
</ul>
</li>
</ul>
</li>
</ol>
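For instance, the <code class="language-plaintext highlighter-rouge">sync.Pool</code> point can be illustrated with a reusable byte buffer — a common idiom, not code from the post:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths do not allocate a
// fresh buffer per call, reducing garbage-collector pressure.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // pooled objects keep old state; always reset first
	defer bufPool.Put(buf) // return the buffer to the pool for reuse
	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher"))
}
```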
<h3 id="reference">Reference</h3>
<p>https://stephen.sh/posts/quick-go-performance-improvements</p>DanniFull disclosure, I did’t start using Golang actively till recent months, even though I have always claimed to know it and put it in the language section on my resume(naively & shamelessly). But there is definietly a huge difference between knowing some common syntaxes and understanding the language in engineering level completely.Consistent Hashing: tradeoffs & how-to in Redis2019-08-16T00:00:00+00:002019-08-16T00:00:00+00:00https://isdanni.com/consistent-hashing<h2 id="what-is-hashing">What is Hashing?</h2>
<blockquote>
<p><strong>Merriam-Webster</strong>: <strong><em>noun</em></strong>: “chopped meat mixed with potatoes and browned”; <strong><em>verb</em></strong>: “to chop (as meat and potatoes) into small pieces.”</p>
</blockquote>
<p>So basically, hashing is, in general terms, a mapping between data objects. The input and output values do not need to be the same type.</p>
<p><strong>Hash collision</strong>: more than one input is mapped to the same hash result (see the infamous <a href="https://learncryptography.com/hash-functions/hash-collision-attack">Hash Collision Attack</a>).</p>
<h2 id="simple-hash-in-redis">Simple hash in Redis?</h2>
<p>To ensure high availability and improve read performance, we can simply set up <a href="https://www.digitalocean.com/community/tutorials/how-to-configure-redis-replication-on-ubuntu-16-04">replication</a> in Redis, forming <code class="language-plaintext highlighter-rouge">Master-Master</code> or <code class="language-plaintext highlighter-rouge">Master-Slave</code> topologies, and build clusters to split read/write data operations. Similar to a database: when the data grows too large, we create new databases/tables.</p>
<p><img src="/assets/images/post/redis/redis-labs.png" alt="High Availability" /><em>High Availability in Redis [Source: Redis Labs][i1]</em></p>
<p>For example, if the key is an image name and the value is its file path:</p>
<ul>
<li>searching for a certain image requires traversing every Redis server;</li>
<li>if we use a plain hash, <code class="language-plaintext highlighter-rouge">hash(file-name.png) % num(server)</code>, we can go directly to the server we need, but there is a problem -> when the number of servers changes, every cache location changes too.</li>
</ul>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">hash</span><span class="o">(</span>1.png<span class="o">)</span>%2 -> <span class="nb">hash</span><span class="o">(</span>1.png<span class="o">)</span>%3 <span class="o">=</span> ?
</code></pre></div></div>
<h2 id="consistent-hashing">Consistent Hashing</h2>
<p>Consistent hashing still uses the modulo method, but instead of taking the modulus of the server count, it takes the modulus of <code class="language-plaintext highlighter-rouge">2^32</code>, treating the entire hash space as a <code class="language-plaintext highlighter-rouge">clock-wise circle</code> starting from node <code class="language-plaintext highlighter-rouge">0</code>.</p>
<ul>
<li>First, hash each server (using its IP, server name, …);</li>
<li>For each file, use the same hash function to hash its key onto the circle; walking clockwise, the first server encountered is its designated server.</li>
</ul>
<p><img src="/assets/images/post/redis/consistent-hashing.png" alt="Consistent Hashing" /><em>Consistent Hashing in Redis [Source: Redis Labs][i2]</em></p>
<h5 id="fault-tolerance">Fault Tolerance</h5>
<p>If a node goes down or more nodes are added, we only need to update a small portion of the file mappings while the majority stay untouched.</p>
<h5 id="weighted-hosts">Weighted Hosts</h5>
<p>This happens when one server receives more (or less) load than the rest. A possible cause is <strong>unevenly distributed nodes</strong> (or too few nodes).</p>
<p>For this situation, we can adopt <strong>virtual nodes</strong> that still map back to the original node, like “Node A#1”, “Node A#2”, “Node A#3”. In practice, it is common to set the number of virtual nodes to 32 or more, so that an even distribution is guaranteed even with few real nodes.</p>
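Putting the pieces together, here is a minimal consistent-hash ring with virtual nodes in Go (a sketch: FNV is used as the hash function for illustration, and collisions between ring points are ignored):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring with virtual nodes
// ("Node A#0", "Node A#1", ...), as described above.
type Ring struct {
	points []uint32          // sorted hash positions on the circle
	owner  map[uint32]string // hash position -> real node name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string, vnodes int) *Ring {
	r := &Ring{owner: map[uint32]string{}}
	for _, node := range nodes {
		for i := 0; i < vnodes; i++ {
			// Each virtual node maps back to its real node.
			p := hash32(fmt.Sprintf("%s#%d", node, i))
			r.points = append(r.points, p)
			r.owner[p] = node
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup walks clockwise from the key's hash to the first node point.
func (r *Ring) Lookup(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"Node A", "Node B", "Node C"}, 32)
	fmt.Println(ring.Lookup("1.png"))
}
```

Removing a node only reassigns the keys that landed on its virtual points; all other keys keep their server.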
<h3 id="reference">Reference</h3>
<ul>
<li><a href="https://medium.com/@dgryski/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8">Consistent Hashing: Algorithmic Tradeoffs</a></li>
<li><a href="https://zhuanlan.zhihu.com/p/34985026">什么是一致性Hash算法?</a></li>
</ul>DanniWhat is Hashing?Intellij IDEA for Spark w/ Scala examples2019-07-30T00:00:00+00:002019-07-30T00:00:00+00:00https://isdanni.com/scala1<blockquote>
<p>“And why I don’t use Eclipse for Spark”</p>
</blockquote>
<h2 id="why-i-dont-use-eclipse-for-spark">Why I don’t use Eclipse for Spark?</h2>
<p>I tried Eclipse, Atom, Sublime and even Emacs before settling on IntelliJ. The reason I finally went back to IntelliJ is the same as for most other Scala developers – a more stable IDE with more features.</p>
<p>Since the Scala IDE team also showed interest to move to VS Code back in 2017 and started a few new projects on GitHub, there’s really no use to stick to Eclipse when it’s already not the top choice from Scala’s own team.</p>
<p>And as I quote from <a href="https://qr.ae/TWvYsa">this Quora user</a> here:</p>
<blockquote>
<p>Having tried Eclipse on and off, and sticking IntelliJ for a while, its a tradeoff between being <strong>less useful</strong> but <strong>more responsive/performant</strong> (Eclipse) vs less responsive/performance but more useful (IntelliJ).</p>
</blockquote>
<p>The <strong>auto-completion</strong> & <strong>refactoring</strong> features in IntelliJ work really well for Java, but they become more of an issue with Scala. The type system in Scala is very complicated, so sometimes they cause more trouble than they ease the burden (e.g. incorrect highlighting, library importing…).</p>
<p>However, compared to the less <strong>rich features</strong> Eclipse provides, I’m more than happy to stick to IntelliJ rather than going back to Eclipse, especially since it already gave me so many painful memories during some Java web projects & I have a license for the IntelliJ Ultimate version ; )</p>
<p><img src="/assets/images/post/scala/new-project.png" alt="New Scala project in IntelliJ" /></p>
<h2 id="set-up-dev-environment-in-intellij-for-scala">Set up DEV environment in IntelliJ for Scala</h2>
<h6 id="in-2019">In 2019</h6>
<ol>
<li><strong>Config</strong></li>
</ol>
<p>I directly followed this <a href="https://www.jetbrains.com/help/idea/run-debug-and-test-scala.html">guide</a> from JetBrains, but it’s worthwhile to check this <a href="http://www.itversity.com/2018/04/19/setup-development-environment-big-data-hadoop-and-spark/">post</a> from itversity (2018) as well. It has more thorough guides.</p>
<ol start="2">
<li><strong>Running</strong></li>
</ol>
<ul>
<li>Local running: just go to “Run” -> “Run Configurations”;</li>
<li>Running on a Spark cluster: pack the program as a Jar and use the shell. Select “File” –> “Project Structure” –> “Artifact”, then select “+” –> “Jar” –> “From Modules with dependencies”, choose the <code class="language-plaintext highlighter-rouge">main</code> function, and select the jar location in the pop-up. Finally, choose “Build” –> “Build Artifact” to compile the jar.</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">./bin/spark-shell --master <master-url></code></p>
<p>If we use local mode in Spark commands and run it on 4 CPU cores, the command will simply become <code class="language-plaintext highlighter-rouge">./bin/spark-shell --master local[4]</code>.</p>
<p>And for convenience, it’s better to configure the system path:</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vi /etc/profile
<span class="c"># add following to the end of the file</span>
<span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH</span>:/usr/local/spark-[version]-bin-hadoop[version]/bin
<span class="c"># activate the change</span>
<span class="nb">source</span> /etc/profile
</code></pre></div></div>
<h2 id="scala-code-examples">Scala Code examples</h2>
<h4 id="word-count">Word Count</h4>
<p>4 parameters: Spark master location, program name, Spark installation directory and Jar location.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">org.apache.spark._</span>
<span class="k">import</span> <span class="nn">SparkContext._</span>
<span class="k">val</span> <span class="nv">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span>
<span class="nf">args</span><span class="o">(</span><span class="mi">0</span><span class="o">),</span> <span class="s">"WordCount"</span><span class="o">,</span>
<span class="nv">System</span><span class="o">.</span><span class="py">getenv</span><span class="o">(</span><span class="s">"SPARK_HOME"</span><span class="o">),</span>
<span class="nc">Seq</span><span class="o">(</span><span class="nv">System</span><span class="o">.</span><span class="py">getenv</span><span class="o">(</span><span class="s">"SPARK_TEST_JAR"</span><span class="o">))</span>
<span class="o">)</span>
<span class="c1">// read in file</span>
<span class="k">val</span> <span class="nv">textFile</span> <span class="k">=</span> <span class="nv">sc</span><span class="o">.</span><span class="py">textFile</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">1</span><span class="o">))</span>
<span class="c1">// directly create a Hadoop RDD Object</span>
<span class="k">var</span> <span class="n">hadoopRdd</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">HadoopRDD</span><span class="o">(</span>
<span class="n">sc</span><span class="o">,</span>
<span class="n">conf</span><span class="o">,</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">SequenceFileInputFormat</span><span class="o">[</span><span class="kt">Text</span><span class="o">,</span> <span class="kt">Text</span><span class="o">]],</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">Text</span><span class="o">],</span>
<span class="n">classOf</span><span class="o">[</span><span class="kt">Text</span><span class="o">],</span>
<span class="mi">1</span>
<span class="o">)</span>
<span class="c1">// first get the words from the input & put the same word in one bucket, then count the frequencies.</span>
<span class="k">val</span> <span class="nv">result</span> <span class="k">=</span> <span class="nv">hadoopRdd</span><span class="o">.</span><span class="py">flatMap</span><span class="o">{</span>
<span class="nf">case</span> <span class="o">(</span><span class="n">key</span><span class="o">,</span><span class="n">value</span><span class="o">)</span> <span class="k">=></span> <span class="nv">value</span><span class="o">.</span><span class="py">toString</span><span class="o">().</span><span class="py">split</span><span class="o">(</span><span class="s">"\\s+"</span><span class="o">);</span>
<span class="o">}.</span><span class="py">map</span><span class="o">(</span>
<span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">)).</span><span class="py">reduceByKey</span><span class="o">(</span><span class="k">_</span> <span class="o">+</span> <span class="k">_</span><span class="o">)</span>
<span class="nv">result</span><span class="o">.</span><span class="py">saveAsSequenceFile</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">2</span><span class="o">))</span>
</code></pre></div></div>
<h4 id="top-k">Top K</h4>
<p>The Top K task has many solutions, either algorithmic or big-data based. Here in Spark, we simply follow the above program and find the top K words.</p>
<p>A lot of tech blogs tend to use the <code class="language-plaintext highlighter-rouge">top</code> method from the Spark API, but we can also do it the algorithmic way, using a <code class="language-plaintext highlighter-rouge">heap</code> to get the answer.</p>
<p>Here’s the common way:</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="nn">org.apache.spark.</span><span class="o">{</span><span class="nc">SparkConf</span><span class="o">,</span> <span class="nc">SparkContext</span><span class="o">}</span>
<span class="k">import</span> <span class="nn">org.apache.spark.SparkContext._</span>
<span class="k">object</span> <span class="nc">TopK</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">main</span><span class="o">(</span><span class="n">args</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span> <span class="o">{</span>
<span class="nf">if</span> <span class="o">(</span><span class="nv">args</span><span class="o">.</span><span class="py">length</span> <span class="o">!=</span> <span class="mi">2</span><span class="o">)</span> <span class="o">{</span>
<span class="nv">System</span><span class="o">.</span><span class="py">out</span><span class="o">.</span><span class="py">println</span><span class="o">(</span><span class="s">"Usage: <src> <num>"</span><span class="o">)</span>
<span class="nv">System</span><span class="o">.</span><span class="py">exit</span><span class="o">(</span><span class="mi">1</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">val</span> <span class="nv">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">().</span><span class="py">setAppName</span><span class="o">(</span><span class="s">"TopK"</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">lines</span> <span class="k">=</span> <span class="nv">sc</span><span class="o">.</span><span class="py">textFile</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">0</span><span class="o">))</span>
<span class="k">val</span> <span class="nv">ones</span> <span class="k">=</span> <span class="nv">lines</span><span class="o">.</span><span class="py">flatMap</span><span class="o">(</span><span class="nv">_</span><span class="o">.</span><span class="py">split</span><span class="o">(</span><span class="s">" "</span><span class="o">)).</span><span class="py">map</span><span class="o">(</span><span class="n">word</span> <span class="k">=></span> <span class="o">(</span><span class="n">word</span><span class="o">,</span> <span class="mi">1</span><span class="o">))</span>
<span class="k">val</span> <span class="nv">count</span> <span class="k">=</span> <span class="nv">ones</span><span class="o">.</span><span class="py">reduceByKey</span><span class="o">((</span><span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=></span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span>
<span class="k">val</span> <span class="nv">convert</span> <span class="k">=</span> <span class="nv">count</span><span class="o">.</span><span class="py">map</span> <span class="o">{</span>
<span class="nf">case</span> <span class="o">(</span><span class="n">key</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">value</span><span class="o">,</span> <span class="n">key</span><span class="o">)</span>
<span class="o">}.</span><span class="py">sortByKey</span><span class="o">(</span><span class="kc">true</span><span class="o">,</span> <span class="mi">1</span><span class="o">)</span>
<span class="nv">convert</span><span class="o">.</span><span class="py">top</span><span class="o">(</span><span class="nf">args</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="py">toInt</span><span class="o">).</span><span class="py">foreach</span><span class="o">(</span><span class="n">a</span> <span class="k">=></span> <span class="nv">System</span><span class="o">.</span><span class="py">out</span><span class="o">.</span><span class="py">println</span><span class="o">(</span><span class="s">"("</span> <span class="o">+</span> <span class="nv">a</span><span class="o">.</span><span class="py">_2</span> <span class="o">+</span> <span class="s">","</span> <span class="o">+</span> <span class="nv">a</span><span class="o">.</span><span class="py">_1</span> <span class="o">+</span> <span class="s">")"</span><span class="o">))</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Here’s the Heap method, taken from <a href="https://stackoverflow.com/questions/5674741/simplest-way-to-get-the-top-n-elements-of-a-scala-iterable">StackOverflow</a>.</p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">pickTopN</span><span class="o">[</span><span class="kt">A</span>, <span class="kt">B</span><span class="o">](</span><span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">iterable</span><span class="k">:</span> <span class="kt">Iterable</span><span class="o">[</span><span class="kt">A</span><span class="o">],</span> <span class="n">f</span><span class="k">:</span> <span class="kt">A</span> <span class="o">=></span> <span class="n">B</span><span class="o">)(</span><span class="k">implicit</span> <span class="n">ord</span><span class="k">:</span> <span class="kt">Ordering</span><span class="o">[</span><span class="kt">B</span><span class="o">])</span><span class="k">:</span> <span class="kt">Seq</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">seq</span> <span class="k">=</span> <span class="nv">iterable</span><span class="o">.</span><span class="py">toSeq</span>
<span class="k">val</span> <span class="nv">q</span> <span class="k">=</span> <span class="nv">collection</span><span class="o">.</span><span class="py">mutable</span><span class="o">.</span><span class="py">PriorityQueue</span><span class="o">[</span><span class="kt">A</span><span class="o">](</span><span class="nv">seq</span><span class="o">.</span><span class="py">take</span><span class="o">(</span><span class="n">n</span><span class="o">)</span><span class="k">:_</span><span class="kt">*</span><span class="o">)(</span><span class="nv">ord</span><span class="o">.</span><span class="py">on</span><span class="o">(</span><span class="n">f</span><span class="o">).</span><span class="py">reverse</span><span class="o">)</span> <span class="c1">// initialize with first n</span>
<span class="c1">// invariant: keep the top k scanned so far</span>
<span class="nv">seq</span><span class="o">.</span><span class="py">drop</span><span class="o">(</span><span class="n">n</span><span class="o">).</span><span class="py">foreach</span><span class="o">(</span><span class="n">v</span> <span class="k">=></span> <span class="o">{</span>
<span class="n">q</span> <span class="o">+=</span> <span class="n">v</span>
<span class="nv">q</span><span class="o">.</span><span class="py">dequeue</span><span class="o">()</span>
<span class="o">})</span>
<span class="nv">q</span><span class="o">.</span><span class="py">dequeueAll</span><span class="o">.</span><span class="py">reverse</span>
<span class="o">}</span>
</code></pre></div></div>Danni“And why I don’t use Eclipse for Spark”