Record the key ideas and insights I captured from the papers I read.

Evan’s Daily Paper

I am *not* a researcher or scientist, just a regular guy who follows his interests and tries to figure out interesting stuff. Since I love Emacs and Org mode, why not track all the papers I read in a single org file? This is a lifelong repo.

Content

Date      Paper
20240423  A Survey of Embodied AI: From Simulators to Research Tasks
20240424  Why Functional Programming Matters
20240425  Recurrent Neural Networks (RNNs): A gentle Introduction and Overview
20240426  Neural Machine Translation by Jointly Learning to Align and Translate
20240427  A General Survey on Attention Mechanisms in Deep Learning
20240428  MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length
20240429  Mega: Moving Average Equipped Gated Attention
20240430  The Next Decade in AI
20240501  The Bitter Lesson
20240502  KAN: Kolmogorov–Arnold Networks
20240503  Multilayer feedforward networks are universal approximators
20240504  Sequence to Sequence Learning with Neural Networks
20240505  Translating Videos to Natural Language Using Deep Recurrent Neural Networks
20240506  Summarizing Source Code using a Neural Attention Model
20240507  Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks
20240508  Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
20240509  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
20240510  Language Models are Unsupervised Multitask Learners
20240511  Improving Language Understanding by Generative Pre-Training
20240512  Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
20240513  Cramming: Training a Language Model on a Single GPU in One Day
20240514  Autonomous LLM-driven research from data to human-verifiable research papers
20240515  LoRA: Low-Rank Adaptation of Large Language Models
20240516  When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
20240517  A PRIMER ON THE INNER WORKINGS OF TRANSFORMER-BASED LANGUAGE MODELS
20240518  Is artificial consciousness achievable? Lessons from the human brain
20240519  Teaching Algorithm Design: A Literature Review
20240520  How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study
20240521  A Survey on Retrieval-Augmented Text Generation for Large Language Models
20240522  Best Practices and Lessons Learned on Synthetic Data for Language Models
20240523  Exploring the Limits of Language Modeling
20240524  The First Law of Complexodynamics
20240525  The Unreasonable Effectiveness of Recurrent Neural Networks
20240526  Recurrent Models of Visual Attention
20240527  Neural Turing Machines
20240528  Relational recurrent neural networks
20240529  Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
20240530  RECURRENT NEURAL NETWORK REGULARIZATION
20240531  Layer Normalization
20240601  Scaling Laws for Neural Language Models
20240602  Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
20240603  A Tutorial Introduction to the Minimum Description Length Principle
20240604  ORDER MATTERS: SEQUENCE TO SEQUENCE FOR SETS
20240605  Pointer Networks
20240606  Deep Residual Learning for Image Recognition
20240607  The Shattered Gradients Problem: If resnets are the answer, then what is the question?
20240608  Scaling and evaluating sparse autoencoders
20240612  Identity Mappings in Deep Residual Networks
20240613  Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
20240614  VARIATIONAL LOSSY AUTOENCODER
20240617  A simple neural network module for relational reasoning
20240619  The Dawning of a New Era in Applied Mathematics
20240620  LANGUAGE MODELING IS COMPRESSION
20240625  Large Language Model Evaluation via Matrix Entropy
20240626  The Platonic Representation Hypothesis
20240627  Superlinear Returns
20240628  How to Do Great Work
20240703  The Best Essay
20240704  Life is Short
20240705  Putting Ideas into Words
20240708  How to think in writing
20240709  C++ design patterns for low-latency applications including high-frequency trading
20240710  An Introduction to Vision-Language Modeling
20240711  Being a Noob
20240712  How to Start Google
20240713  RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE
20240715  RT-2: Vision-Language-Action Models Transfer
20240716  A Survey on Efficient Inference for Large Language Models
20240717  Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
20240718  Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
20240719  End-To-End Planning of Autonomous Driving in Industry and Academia: 2022-2023
20240729  The Right Kind of Stubborn
20240730  What I’ve Learned from Users
20240731  How to Work Hard
20240801  The Risk of Discovery
20240802  The Need to Read
20240803  The Surprising Power of The Long Game
20240804  What makes a great technical blog
20240805  My programming beliefs as of July 2024
20240806  RDMA over Ethernet for Distributed AI Training at Meta Scale
20240807  Beyond Smart
20240808  How To Become A Hacker
20240809  How To Learn Hacking
20240810  Why Your Data Stack Won’t Last - And How To Build Data Infrastructure That Will
20240812  Weird Languages
20240813  Make Luck Your Destiny
20240814  Finding Time to Invest in Yourself
20240815  Accountability Means Letting People Criticize You
20240816  Example: From Laborer to Entrepreneur
20240817  How Did We Get Here? The Tangled History of the Second Law of Thermodynamics
20240819  Ten Proofs of the Generalized Second Law
20240820  The Shift from Models to Compound AI Systems
20240821  New LLM Pre-training and Post-training Paradigms
20240822  Natural Language Can Help Bridge the Sim2Real Gap
20240823  Evolving Virtual Creatures
20240824  GPU Utilization is a Misleading Metric
20240825  All Models Are Wrong
20240826  Mental Models: The Best Way to Make Intelligent Decisions (~100 Models Explained)
20240827  Growth: Thinking in systems
20240828  Deliberate Practice and Acquisition of Expert Performance: A General Overview
20240829  How to Write Usefully
20240830  The Munger Operating System: How to Live a Life That Really Works
20240902  Founder Mode
20240903  The Art of Finishing
20240904  Loss of plasticity in deep continual learning
20240905  You can learn AI later
20240906  How Completely Messed Up Practices Become Normal
20240907  The Work You Do, the Person You Are
20240909  Richard Feynman and The Connection Machine
20240910  The future of European competitiveness
20240911  Tutorial on Diffusion Models for Imaging and Vision
20240912  What every computer science major should know
20240913  Notes on OpenAI’s new o1 chain-of-thought models
20240914  Is God a Strange Loop?
20240918  The Brouhaha Over Consciousness and “Pseudoscience”
20240919  Holding a Program in One’s Head
20240920  How to Measure Progress in a Software Project
20240923  What Is a Particle?
20240924  Averaging is a convenient fiction of neuroscience
20240925  On Impactful AI Research
20240926  If the Universe Is a Hologram, This Long-Forgotten Math Could Decode It
20240927  Critical Mass and Tipping Points: How To Identify Inflection Points Before They Happen
20240929  Mark Zuckerberg: “Ship the app”
20240930  How your brain detects patterns in the everyday: without conscious thought
20241001  When to do what you love
20241008  The Rise of Worse is Better
20241009  The Computational View of Time
20241010  Observer Theory
20241011  Kernighan’s lever
20241012  What Is Consciousness? Some New Perspectives from Our Physics Project

20240423

  • Paper: A Survey of Embodied AI: From Simulators to Research Tasks
  • Links: https://arxiv.org/pdf/2103.04918.pdf
  • Ideas:
    1. Embodied AI Simulators: DeepMind Lab, AI2-THOR, SAPIEN, VirtualHome, VRKitchen, ThreeDWorld, CHALET, iGibson, and Habitat-Sim.

20240424


20240425

  • Paper: Recurrent Neural Networks (RNNs): A gentle Introduction and Overview
  • Links: https://arxiv.org/pdf/1912.05911.pdf
  • Ideas:
    1. RNNs deal with sequence data.
    2. BPTT (Backpropagation Through Time): unroll the network over time and backpropagate through every time step's loss term (see the sketch after this list).
    3. LSTM (Long Short-Term Memory): designed to handle the vanishing-gradient problem; introduces gated cells to control what information is stored (what information exactly?).
    4. DRNN (Deep Recurrent Neural Networks): stack ordinary RNN layers on top of each other.
    5. BRNN (Bidirectional Recurrent Neural Networks): the authors devote a section to it, but I did not get the main ideas.
    6. Seq2Seq: what problems does the seq2seq or encoder-decoder structure solve?
    7. Attention & Transformers: why does attention work? Why do skip connections work?
    8. Pointer Networks
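
A minimal sketch of item 2 in Python (assuming PyTorch is installed; the toy sizes and random data are mine): BPTT is just ordinary backpropagation applied to the network unrolled over the sequence, so a single backward() call sends gradients through every time step's loss term.

      # Toy RNN language model: unroll over the sequence, sum per-step losses,
      # and let one backward() propagate gradients through all time steps (BPTT).
      import torch
      import torch.nn as nn

      vocab, hidden = 16, 32
      rnn = nn.RNN(input_size=vocab, hidden_size=hidden, batch_first=True)
      head = nn.Linear(hidden, vocab)
      opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.1)

      x = torch.randint(0, vocab, (1, 20))                    # toy input token ids
      y = torch.randint(0, vocab, (1, 20))                    # toy next-token targets
      inputs = torch.nn.functional.one_hot(x, vocab).float()  # (1, 20, vocab)

      out, _ = rnn(inputs)                                    # hidden states for all 20 steps
      loss = nn.functional.cross_entropy(head(out).transpose(1, 2), y)
      loss.backward()                                         # gradients flow back through time
      opt.step()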

20240426


20240427

  • paper: A General Survey on Attention Mechanisms in Deep Learning
  • links: https://arxiv.org/pdf/2203.14263
  • ideas:
    1. the authors define a task model, which contains four components: 1. the feature model, 2. the query model, 3. the attention model, 4. the output model
    2. feature model: used to extract features (can be an RNN, a CNN, etc.), turning the inputs $x_n$ into feature vectors $f_n$
    3. query model: a query tells which features $f_n$ to attend to.
    4. attention model: given an input query $q_n$ and feature vectors $f_n$, the model extracts the key matrix $K_n$ and value matrix $V_n$ from $f_n$. Traditionally, this is achieved by linear transformations with weight matrices $W_k$ and $W_v$ (see the sketch below).
    5. attention mechanisms can be classified into three categories: query-related, feature-related, and general (not related to query or feature).

    To learn more about attention mechanisms, this page https://slds-lmu.github.io/seminar_nlp_ss20/attention-and-self-attention-for-nlp.html and the 3Blue1Brown video https://www.3blue1brown.com/lessons/attention are helpful.
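
To make points 2-4 concrete, here is a tiny NumPy sketch of that pipeline; the shapes and names (F, W_k, W_v, q) are illustrative choices of mine, not notation from the survey.

      # Features F are projected into keys K and values V with learned matrices W_k, W_v;
      # a query q scores the keys; the softmax-weighted sum of values is the attended output.
      import numpy as np

      rng = np.random.default_rng(0)
      n, d_f, d_k, d_v = 5, 8, 4, 4            # number of features, feature/key/value dims

      F = rng.normal(size=(n, d_f))            # feature vectors f_1 .. f_n
      W_k = rng.normal(size=(d_f, d_k))        # learned key projection
      W_v = rng.normal(size=(d_f, d_v))        # learned value projection
      q = rng.normal(size=(d_k,))              # query vector

      K, V = F @ W_k, F @ W_v                  # keys and values extracted from the features
      scores = K @ q / np.sqrt(d_k)            # scaled dot-product scores
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()                 # softmax attention weights
      context = weights @ V                    # attended output (context vector)
      print(weights.round(3), context.round(3))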


20240428

  • paper: MEGALODON: Efficient LLM Pretraining and Inference with Unlimited Context Length
  • links: https://arxiv.org/pdf/2404.08801
  • ideas:
    1. traditional transformers: quadratic computational complexity and limited inductive bias.
    2. introduces the complex exponential moving average (CEMA) component, a timestep normalization layer, normalized attention, and pre-norm with a two-hop residual configuration.

      Q1: This paper is based on the MEGA architecture, but what is MEGA? Q2: Why can this architecture deal with unlimited length?

      Evaluation on long-context modeling, including perplexity at various context lengths up to 2M and long-context QA tasks in Scrolls (Parisotto et al., …)

I do not understand this yet.


20240429

  • paper: Mega: Moving Average Equipped Gated Attention
  • links: https://arxiv.org/pdf/2209.10655
  • ideas:
    1. common approaches to sequence modeling: self-attention and EMA (exponential moving average). Well, this kind of theoretical paper is too difficult for me; maybe I should start with some basic ideas and understand the concepts by doing projects.

20240430

  • paper: The Next Decade in AI
  • links: https://arxiv.org/pdf/2002.06177
  • ideas:
    1. the authors cite “The Bitter Lesson” by Rich Sutton; I have seen this paper referenced in many places and should check it out.
    2. claim1: /to build a robust, knowledge-driven approach to AI we must have the machinery of symbol-manipulation in our toolkit. Too much of useful knowledge is abstract to make do without tools that represent and manipulate abstraction, and to date, the only machinery that we know of that can manipulate such abstract knowledge reliably is the apparatus of symbol-manipulation/.
    3. claim2: robust artificial intelligences properties:
      • have the ability to learn new knowledge
      • can learn knowledge that is symbolically represented.
      • significant knowledge is likely to be abstract.
      • rules and exceptions co-exist.
      • some significant fraction of the knowledge that a robust system needs is likely to be causal, and to support counterfactuals.
      • Some small but important subset of human knowledge is likely to be innate; robust AI, too, should start with some important prior knowledge.
    4. claim3: rather than starting each new AI system from scratch, as a blank slate, with little knowledge of the world, we should seek to build learning systems that start with initial frameworks for domains like time, space, and causality, in order to speed up learning and massively constrain the hypothesis space.
    5. knowledge by itself is not enough; knowledge must be put into practice with the tools of reasoning.

      a reasoning system that can leverage large-scale background knowledge efficiently, even when available information is incomplete is a prerequisite to robustness.


20240502

  • paper: KAN: Kolmogorov–Arnold Networks
  • links: https://arxiv.org/pdf/2404.19756
  • ideas
    1. claim1: the Kolmogorov-Arnold representation theorem. What is the Kolmogorov-Arnold representation theorem? Why can it represent any function, like the universal approximation theorem?
    2. claim2: MLPs have learnable weights on edges; KANs have learnable activation functions on edges. TOMORROW'S PAPER IS ABOUT THE UNIVERSAL APPROXIMATION THEOREM
    3. claim3: KANs’ nodes simply sum incoming signals without applying any non-linearities
    4. claim4: KANs are nothing more than combinations of splines. What are splines?
    5. claim5: Currently, the biggest bottleneck of KANs lies in its slow training. KANs are usually 10x slower than MLPs, given the same number of parameters. We should be honest that we did not try hard to optimize KANs’ efficiency though, so we deem KANs’ slow training more as an engineering problem to be improved in the future rather than a fundamental limitation. If one wants to train a model fast, one should use MLPs. In other cases, however, KANs should be comparable or better than MLPs, which makes them worth trying.


20240503

  • paper: Multilayer feedforward networks are universal approximators
  • links: https://cognitivemedium.com/magic_paper/assets/Hornik.pdf
  • ideas:
    1. claim1: Advocates of the virtues of multilayer feedforward networks (e.g., Hecht-Nielsen, 1987) often cite Kolmogorov’s (1957) superposition theorem or its more recent improvements (e.g., Lorentz, 1976) in support of their capabilities. However, these results require a different unknown transformation (g in Lorentz’s notation) for each continuous function to be represented, while specifying an exact upper limit to the number of intermediate units needed for the representation.
    2. Anyway, this paper proves that multilayer feedforward networks are a class of universal approximators. While reading it, I wondered why encoder-decoder networks work and who proposed them. That is tomorrow's topic.

20240504

  • paper: Sequence to Sequence Learning with Neural Networks
  • links: https://arxiv.org/pdf/1409.3215
  • ideas:
    1. claim1: DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.
    2. claim2: network architecture: one LSTM as the encoder and another LSTM as the decoder. What if the encoder and the decoder have different network structures? (A bare-bones sketch of the setup is below.)
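
A bare-bones sketch of the claim2 setup (assuming PyTorch; sizes and data are arbitrary): one LSTM compresses the source sequence into a fixed-size state, a second LSTM decodes the target conditioned on that state.

      # Encoder-decoder with two LSTMs: encode source -> (h, c), decode target from it.
      import torch
      import torch.nn as nn

      vocab, emb, hidden = 100, 32, 64
      embed = nn.Embedding(vocab, emb)
      encoder = nn.LSTM(emb, hidden, batch_first=True)
      decoder = nn.LSTM(emb, hidden, batch_first=True)
      proj = nn.Linear(hidden, vocab)

      src = torch.randint(0, vocab, (1, 7))    # source token ids
      tgt = torch.randint(0, vocab, (1, 5))    # target token ids (teacher forcing)

      _, state = encoder(embed(src))           # fixed-dimensional summary (h, c)
      out, _ = decoder(embed(tgt), state)      # decode conditioned on the encoder state
      logits = proj(out)                       # (1, 5, vocab) next-token scores
      print(logits.shape)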

20240505

  • paper: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
  • links: https://arxiv.org/pdf/1412.4729
  • ideas:
    1. claim1: video -> CNN -> LSTM -> label

      It seems like the feature-extraction network is a kind of encoder-decoder structure.


20240506

20240507

20240508

  • paper: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
  • links: https://arxiv.org/pdf/2006.16236
  • ideas:
    1. claim1: traditional transformers require quadratic memory: for an input of length $N$, the complexity is $O(N^2)$. This paper proposes linear attention (rough sketch below). What's the difference between an attention layer and a self-attention layer?

      every transformer can be seen as a recurrent neural network
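
A rough NumPy sketch of the linear-attention trick as I understand it (non-causal case, using the paper's feature map phi(x) = elu(x) + 1; the autoregressive version keeps running sums instead). This is my own toy code, not the authors'.

      # Replacing softmax(QK^T)V with phi(Q) (phi(K)^T V) lets us compute the d x d
      # summary phi(K)^T V once, so the cost is linear in sequence length N, not N^2.
      import numpy as np

      def phi(x):                        # elu(x) + 1, applied elementwise
          return np.where(x > 0, x + 1.0, np.exp(x))

      N, d = 6, 4
      rng = np.random.default_rng(0)
      Q, K, V = rng.normal(size=(3, N, d))

      KV = phi(K).T @ V                  # (d, d) summary, computed once: O(N d^2)
      Z = phi(K).sum(axis=0)             # normalizer terms
      out = (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]   # O(N d^2) overall, linear in N
      print(out.shape)                   # (6, 4)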

20240509

  • paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • links: https://arxiv.org/pdf/1810.04805
  • ideas:
    1. claim1: two methods for applying pre-trained language models to downstream tasks (feature-based and fine-tuning)

20240510

20240511

  • paper: Improving Language Understanding by Generative Pre-Training
  • links: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
  • ideas:
    1. By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets we study.

      What sources of corpora are suitable for pre-training?

20240512

  • paper: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
  • links: https://arxiv.org/pdf/2304.13712
  • ideas: What’s the difference between an encoder-decoder structure and a decoder-only model?
    1. decoder-only model
    2. nlu task: text classification, named entity recognition (NER), entailment prediction, and so on.
    3. nlg task: Natural Language Generation broadly encompasses two major categories of tasks, with the goal of creating coherent, meaningful, and contextually appropriate sequences of symbols.

20240513


20240514

  • paper: Autonomous LLM-driven research from data to human-verifiable research papers
  • links: https://arxiv.org/pdf/2404.17605
  • ideas:
    1. this paper proposes data-to-paper. Crazy idea…

Starting with a human-provided dataset, the process is designed to raise hypotheses, write, debug and execute code to analyze the data and perform statistical tests, interpret the results and write well-structured scientific papers which not only describe results and conclusions but also transparently delineate the research methodologies, allowing human scientists to understand, repeat and verify the analysis. The discussion on emerging guidelines for AI-driven science (22) have served as a design framework for data-to-paper, yielding a fully transparent, traceable and verifiable workflow, and algorithmic “chaining” of data, methodology and result allowing to trace downstream results back to the part of code which generated them. The system can run with or without a predefined research goal (fixed/open-goal modalities) and with or without human interactions and feedback (copilot/autopilot modes). We performed two open-goal and two fixed-goal case studies on different public datasets (24–27) and evaluated the AI-driven research process as well as the novelty and accuracy of created scientific papers. We show that, running fully autonomously (autopilot), data-to-paper can perform complete and correct run cycles for simple goals, while for complex goals, human co-piloting becomes critical.


20240515

  • paper: LoRA: Low-Rank Adaptation of Large Language Models
  • links: https://arxiv.org/pdf/2106.09685
  • ideas:
    1. claim1: LoRA: freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (minimal sketch below).
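
A minimal sketch of the idea (assuming PyTorch; the layer size, rank r, and alpha are arbitrary illustration values): the pretrained weight stays frozen and only the low-rank factors A and B are trained, so the effective weight is W + (alpha / r) * B A.

      # LoRA-style wrapper around a frozen nn.Linear: train only the low-rank update.
      import torch
      import torch.nn as nn

      class LoRALinear(nn.Module):
          def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
              super().__init__()
              self.base = base
              self.base.weight.requires_grad_(False)      # freeze pretrained weight
              if self.base.bias is not None:
                  self.base.bias.requires_grad_(False)    # freeze pretrained bias too
              self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
              self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 -> no change at init
              self.scale = alpha / r

          def forward(self, x):
              return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

      layer = LoRALinear(nn.Linear(768, 768))
      print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable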

20240516

  • paper: When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
  • links: https://arxiv.org/pdf/2404.19705
  • ideas:
    1. claim1: this paper proposes a method where, when the LLM generates a special <RET> token, an IR system is used to retrieve external sources (hypothetical sketch below).
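
A purely hypothetical sketch of how such a loop could look; generate and search are stand-in callables of mine, not APIs from the paper.

      # Generate until the model emits <RET>, then retrieve and re-prompt with the passages.
      RET = "<RET>"

      def answer(question, generate, search, max_rounds=2):
          prompt = question
          for _ in range(max_rounds):
              completion = generate(prompt)
              if RET not in completion:
                  return completion                  # model answered from parametric memory
              passages = search(question, k=3)       # model asked for external retrieval
              prompt = "\n".join(passages) + "\n" + question
          return generate(prompt)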

20240517

  • paper: A PRIMER ON THE INNER WORKINGS OF TRANSFORMER-BASED LANGUAGE MODELS
  • links: https://arxiv.org/pdf/2405.00208
  • ideas:
    1. claim1: layer normalization is a common operation used to stabilize the training process of deep neural networks. This paper is too long for me to digest; maybe I'll come back to it when I do a project on the transformer architecture.

20240518

  • paper: Is artificial consciousness achievable? Lessons from the human brain

  • links: https://arxiv.org/pdf/2405.04540
  • ideas:
    1. claim1:

      Given this uncertainty, we recommend not to use the same general term (i.e., consciousness) for both humans and artificial systems; to clearly specify the key differences between them; and, last but not least, to be very clear about which dimension and level of consciousness the artificial system may possibly be capable of displaying.


20240519

  • paper: Teaching Algorithm Design: A Literature Review
  • links: https://arxiv.org/pdf/2405.00832
  • ideas:
    1. claim: Systematic literature reviews
      • Research Question
      • Protocol Development
      • Search Databases
      • Screen Studies
      • Extract Data
      • Assess Quality
      • Synthesize Data
      • Report Findings

20240520

  • paper: How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study
  • links: https://arxiv.org/pdf/2404.14047
  • ideas:
    1. Round-To-Nearest (RTN) quantization method (quick sketch below).
    2. LoRA fine-tuning quantization. What's the difference between post-training quantization and LoRA fine-tuning quantization?
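
A quick NumPy sketch of plain round-to-nearest quantization (the generic asymmetric per-tensor scheme; my own toy code, nothing LLaMA3-specific):

      # Map weights onto a b-bit integer grid, round to nearest, then dequantize.
      import numpy as np

      def rtn_quantize(w, bits=4):
          qmax = 2 ** bits - 1
          lo, hi = w.min(), w.max()
          scale = (hi - lo) / qmax
          q = np.clip(np.round((w - lo) / scale), 0, qmax)   # round-to-nearest integers
          return q * scale + lo                              # dequantized weights

      w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
      w_hat = rtn_quantize(w, bits=4)
      print(np.abs(w - w_hat).max())    # worst-case quantization error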

20240521

  • paper: A Survey on Retrieval-Augmented Text Generation for Large Language Models
  • links: https://arxiv.org/pdf/2404.10981
  • ideas:
    1. the paper divides the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation (illustrative sketch below)
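
A purely illustrative sketch of those four stages; every function name here is a hypothetical placeholder, not an API from the survey or from any library.

      # Four-stage RAG pipeline: pre-retrieval, retrieval, post-retrieval, generation.
      def rag_answer(question, rewrite_query, vector_search, rerank, llm_generate):
          query = rewrite_query(question)            # 1. pre-retrieval: query rewriting/expansion
          docs = vector_search(query, k=20)          # 2. retrieval: fetch candidate passages
          context = rerank(question, docs)[:5]       # 3. post-retrieval: rerank, filter, compress
          prompt = "\n\n".join(context) + "\n\nQ: " + question
          return llm_generate(prompt)                # 4. generation: answer grounded in the context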

20240522

  • paper: Best Practices and Lessons Learned on Synthetic Data for Language Models
  • links: https://arxiv.org/pdf/2404.07503
  • ideas:
    1. Training with synthetic data makes evaluation decontamination harder.

20240523

  • paper: Exploring the Limits of Language Modeling
  • links: https://arxiv.org/pdf/1602.02410
  • ideas:
    1. The goal of LM is to learn a probability distribution over sequences of symbols pertaining to a language
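
In the usual autoregressive factorization (standard notation, not specific to this paper), that distribution is written as $p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$, and the model is trained to maximize this likelihood over a corpus.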

20240524: start to follow Ilya-30; well-written blogs are considered papers as well.


20240524

  • paper: The First Law of Complexodynamics
  • links: https://scottaaronson.blog/?p=762
  • ideas:
    1. quote1: why does “complexity” or “interestingness” of physical systems seem to increase with time and then hit a maximum and decrease, in contrast to the entropy, which of course increases monotonically?

    Question: What’s the difference between entropy in physics and information theory?

    2. supposes: using Kolmogorov complexity to define entropy.
    3. quote2: the “First Law of Complexodynamics” exhibits exactly the behavior that Sean wants: small for the initial state, large for intermediate states, then small again once the mixing has finished.

20240525


20240526

  • paper: Recurrent Models of Visual Attention
  • links: https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE
  • ideas:
    1. quote1: The model is a recurrent neural network (RNN) which processes inputs sequentially, attending to different locations within the images (or video frames) one at a time, and incrementally combines information from these fixations to build up a dynamic internal representation of the scene or environment.
    2. Partially Observable Markov Decision Process (POMDP).


20240527

  • paper: Neural Turing Machines
  • links: https://arxiv.org/pdf/1410.5401
  • ideas:
    1. quote1: Fodor and Pylyshyn (Fodor and Pylyshyn, 1988) famously made two barbed claims about the limitations of neural networks for cognitive modeling. They first objected that connectionist theories were incapable of variable-binding, or the assignment of a particular datum to a particular slot in a data structure.
Neural Turing Machine:

                      External Input          External Output
                             \                  /
                              \                /
                             +------------+
                             | Controller |
                             +------------+
                               /      \
                              /        \
                        +------------+   +-------------+
                        | Read Heads |   | Write Heads |
                        +------------+   +-------------+
      

20240528

  • paper: Relational recurrent neural networks
  • links: https://arxiv.org/pdf/1806.01822
  • ideas:
    1. claim1: Relational Memory Core (RMC) – which employs multi-head dot product attention to allow memories to interact
                              CORE
      
                          Prev. Memory
                               |
                               v
           +-------------------+------------------+
           |                   A                  |
           |               +----+----+            |
           |               |    |    |            |
           |               |  Residual            |
           |               +----+----+            |
           |                    |                 |
           |                    v                 |
           |                  +----+              |
           |                  | MLP |             |
           |                  +----+              |
           |                    |                 |
           |                Residual              |
           |                    |                 |
           +-------------------+------------------+
                               |
                               v
                             Output
      
      
               MULTI-HEAD DOT PRODUCT ATTENTION
      
                Memory
                  |
                  v
           +-------------------------+
           |    W_q   W_k   W_v      |
           |     |     |     |       |
           | query key value         |
           | (q1)  (k1)  (v1)        |
           |     \   |   /           |
           |      \  |  /            |
           |       softmax(QK^T)V    |
           |            |            |
           |            v            |
           |      Updated Memory     |
           +-------------------------+
      
      Compute attention weights
      Queries (Q)            Keys (K)               Weights
      +---+---+---+         +---+---+---+         +---+---+---+
      |q1 |q2 |...|         |k1 |k2 |...|         |w1 |w2 |...|
      +---+---+---+         +---+---+---+         +---+---+---+
      
      Normalize weights with row-wise softmax
      Normalized Weights
      +---+---+---+
      | w1,1 w1,2...|
      | w2,1 w2,2...|
      | ...         |
      +---+---+---+
      
      Compute weighted average of values
      Values (V)                Weighted Values
      +---+---+---+         +---+---+---+
      |v1 |v2 |...|         |wv1|wv2|...|
      +---+---+---+         +---+---+---+
      
      Return updated memory
      Updated Memory
      +---+---+---+
      | M1 | M2 |...|
      +---+---+---+
              

20240529

  • paper: Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
  • links: https://www.cs.toronto.edu/~hinton/absps/colt93.pdf
  • ideas:
    1. quote1: The Minimum Description Length Principle (Rissanen, 1986) asserts that the best model of some data is the one that minimizes the combined cost of describing the model and describing the misfit between the model and the data.

20240530

  • paper: RECURRENT NEURAL NETWORK REGULARIZATION
  • links: https://arxiv.org/pdf/1409.2329
  • ideas:
    1. claim1: The main contribution of this paper is a recipe for applying dropout to LSTMs in a way that successfully reduces overfitting

20240531

  • paper: Layer Normalization
  • links: https://arxiv.org/pdf/1607.06450
  • ideas:
    1. claim1: batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case.
    2. claim2: layer normalization, a simple normalization method to improve the training speed for various neural network models. Unlike batch normalization, the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases
    3. claim3: Layer normalization does not have such problem because its normalization terms depend only on the summed inputs to a layer at the current time-step.

      Well, here is a question: /what is Internal Covariate Shift/, and /why does Internal Covariate Shift become a problem/? (A small sketch contrasting the two normalizations follows the answer below.)

      Answer by kimi:

      Imagine you’re playing a game where you have to sort different colored balls into different baskets. Each basket is for a different color, like a basket for red balls, a basket for blue balls, and so on.

      Now, suppose at the start you only have a few balls, and they’re not evenly spread out in terms of color. Some baskets have more red balls, some have more blue balls. When you start sorting, you learn how to tell the colors apart based on these limited balls. But if the rules of your game suddenly change, like if you get a lot more balls, or the colors of the balls change, then the way you learned to sort them might not be so accurate anymore.

      In machine learning, “Internal Covariate Shift” is a bit like that situation. When we train a machine learning model, we usually use a lot of data to teach it. But if we change the distribution of the data during training, or if we don’t have enough data to represent all the possible situations, then what the model learned might change, too. That’s what we call “Internal Covariate Shift.”

      Just like how the way you learned to sort the balls at the start of the game might not be accurate if the game’s rules change, the machine learning model might need to adjust if the data distribution changes to keep being accurate.
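
A small NumPy sketch contrasting the two normalizations from the claims above (toy shapes of my own): batch norm computes statistics per feature across the batch, layer norm per example across its features, so layer norm introduces no dependency between training cases.

      # Batch norm vs. layer norm on a (batch, features) activation matrix.
      import numpy as np

      x = np.random.default_rng(0).normal(size=(8, 16))      # (batch, features)

      def batch_norm(x, eps=1e-5):
          mu, var = x.mean(axis=0), x.var(axis=0)             # per feature, across the batch
          return (x - mu) / np.sqrt(var + eps)

      def layer_norm(x, eps=1e-5):
          mu = x.mean(axis=1, keepdims=True)                  # per example, across its features
          var = x.var(axis=1, keepdims=True)
          return (x - mu) / np.sqrt(var + eps)

      print(batch_norm(x).shape, layer_norm(x).shape)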


20240601

  • paper: Scaling Laws for Neural Language Models
  • links: https://arxiv.org/pdf/2001.08361
  • ideas:
    1. claim1: Larger models require fewer samples to reach the same performance
      • Performance depends strongly on scale, weakly on model shape

        Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget.
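
For reference, the relationships the paper fits take a simple power-law form; the constants $N_c$, $D_c$, $C_c$ and the exponents $\alpha_N$, $\alpha_D$, $\alpha_C$ are fitted in the paper, I am only noting the shape here: $L(N) \approx (N_c/N)^{\alpha_N}$, $L(D) \approx (D_c/D)^{\alpha_D}$, $L(C) \approx (C_c/C)^{\alpha_C}$.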


20240602

  • paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
  • links: https://arxiv.org/pdf/1512.02595
  • ideas:
    1. claim1:

      To achieve these results, we have explored various network architectures, finding several effective techniques: enhancements to numerical optimization through SortaGrad and Batch Normalization, evaluation of RNNs with larger strides with bigram outputs for English, searching through both bidirectional and unidirectional models. This exploration was powered by a well optimized, High Performance Computing inspired training system that allows us to train new, full-scale models on our large datasets in just a few days.

This paper focuses more on the engineering aspects of the topic.


20240603

  • paper: A Tutorial Introduction to the Minimum Description Length Principle
  • links: https://arxiv.org/pdf/math/0406077
  • ideas:
    1. claim1: we can therefore say that the more we are able to compress the data, the more we have learned about the data.
    2. claim2: The Fundamental Idea: Learning as Data Compression
    3. claim3:

      To formalize our ideas, we need to decide on a description method, that is, a formal language in which to express properties of the data. The most general choice is a general-purpose computer language such as C or Pascal. This choice leads to the definition of the Kolmogorov Complexity [Li and Vitányi 1997] of a sequence as the length of the shortest program that prints the sequence and then halts. The lower the Kolmogorov complexity of a sequence, the more regular it is.

However, it turns out that for every two general-purpose programming languages A and B and every data sequence D, the length of the shortest program for D written in language A and the length of the shortest program for D written in language B differ by no more than a constant c, which does not depend on the length of D. This so-called invariance theorem says that, as long as the sequence D is long enough, it is not essential which computer language one chooses, as long as it is general-purpose.

MDL: The Basic Idea The goal of statistical inference may be cast as trying to find regularity in the data. ‘Regularity’ may be identified with ‘/ability to compress/’. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most.
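
In symbols, the two-part form of this idea picks the hypothesis minimizing the total description length, $H^{*} = \arg\min_{H \in \mathcal{H}} \left[ L(H) + L(D \mid H) \right]$: the code length of the hypothesis plus the code length of the data encoded with its help (a standard formulation, not a quote from the tutorial).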

.... (have not finished yet.)

This book delves into the fundamental building blocks of current deep learning systems, but it requires a solid background in information theory to fully grasp the underlying concepts.


20240605

  • paper: pointer networks
  • links: https://arxiv.org/pdf/1506.03134
  • ideas:
    1. claim1: Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. (Rough sketch below.)

20240606

  • paper: Deep Residual Learning for Image Recognition
  • links: https://arxiv.org/pdf/1512.03385
  • ideas:
    1. claim1: Deep networks naturally integrate low/mid/highlevel features
    2. claim2: When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error (a minimal residual-block sketch follows). *What pr…
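
A minimal residual-block sketch (assuming PyTorch; channel counts are arbitrary), just to pin down the idea: the block learns a residual F(x) and outputs F(x) + x through an identity skip connection.

      # Basic ResNet-style block: two conv layers plus an identity shortcut.
      import torch
      import torch.nn as nn

      class ResidualBlock(nn.Module):
          def __init__(self, channels: int):
              super().__init__()
              self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
              self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
              self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x):
              out = self.relu(self.bn1(self.conv1(x)))
              out = self.bn2(self.conv2(out))
              return self.relu(out + x)        # identity shortcut: F(x) + x

      x = torch.randn(1, 16, 32, 32)
      print(ResidualBlock(16)(x).shape)        # torch.Size([1, 16, 32, 32])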


20240607

  • paper: The Shattered Gradients Problem: If resnets are the answer, then what is the question?
  • links: https://arxiv.org/pdf/1702.08591
  • ideas:
    1. claim1: If resnets are the solution, then what is the problem?
    2. claim2: a previously unnoticed difficulty with gradients in deep rectifier networks that is orthogonal to vanishing and exploding gradients. The shattered gradients problem is that, as depth increases, gradients in standard feedforward networks increasingly resemble white noise.
    3. claim3: The shattered gradient problem is that the spatial structure of gradients is progressively obliterated as neural nets deepen.
    4. claim4: Introducing skip-connections allows much deeper networks to be trained (Srivastava et al., 2015; He et al., 2016b;a; Greff et al., 2017). Skip-connections significantly change the correlation structure of gradients
    5. claim5: Batch normalization was introduced to reduce covariate shift (Ioffe & Szegedy, 2015). However, it has other effects that are less well-known – and directly impact the correlation structure of gradients. Maybe really understanding ResNet and shattered gradients requires coding something.

20240608

OpenAI’s new paper about using top-k sparse autoencoders for neural network explanation.

20240612

  • paper: Identity Mappings in Deep Residual Networks
  • links: https://arxiv.org/pdf/1603.05027
  • ideas:
    1. claim1: This paper investigates the propagation formulations behind the connection mechanisms of deep residual networks. Our derivations imply that identity shortcut connections and identity after-addition activation are essential for making information propagation smooth.


20240613

  • paper: Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
  • links: https://arxiv.org/pdf/1405.6903
  • ideas:
    1. claim1: Just as we can reason about the disorder of the coffee cup system, we can also consider its “complexity.” Informally, by complexity we mean the amount of information needed to describe everything “interesting” about the system.

I don’t know how this paper relates to deep learning, so I asked GPT-4; here is the answer, which may help:

The paper “Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton” relates to deep learning in several ways:

  1. **Optimization and Convergence**: The process of complexity rising and falling in a closed system is similar to the optimization and convergence of neural networks during training. As a model learns, its complexity increases, reaches a peak, and then stabilizes as it approaches an optimal solution.
  2. **Dynamic Systems**: Both the Coffee Automaton and deep learning models are dynamic systems that evolve over time. Understanding how complexity changes in these systems can provide insights into the behavior and stability of neural networks.
  3. **Pattern Recognition**: The study of how patterns emerge and disappear in the automaton parallels how deep learning models recognize and simplify patterns in data. This understanding can help improve model design and efficiency.
  4. **Entropy and Information Theory**: The concepts of entropy and information theory used to quantify complexity in the paper are also fundamental to understanding the information processing capabilities of deep learning models.

These parallels highlight the broader applicability of principles from the study of physical systems to the field of deep learning, providing valuable insights into the dynamics and optimization of neural networks.

I spent some time learning how to read papers, so I am upgrading my method from now on.


Core principles:

  • no quotes from the text.
  • always ask questions.
  • for every paper, use at least one or two sentences to summarize the paper’s idea.
  • no missing math formulas.

20240614

  • paper: VARIATIONAL LOSSY AUTOENCODER

Here I try to structure my reading process.

  • first-pass:

    Q: what is the paper about? A: this paper proposes a method combining a VAE (Variational Autoencoder) with neural autoregressive models, which increases the flexibility of the global latent code for various problems and increases sparsity, so that results are easier to explain and faster to compute.

    Q: how does it improve compared to other works? A: this question is hard for me to answer because my understanding of the terminology is limited, but there is an excellent quote in this paper that answers it.

    However, earlier attempts at combining these two kinds of models have run into the problem that the autoregressive part of the model ends up explaining all structure in the data, while the latent variables are not used.

Q: what is the main method in this work? A:

  • goal: given data, the model automatically learns features without human intervention
  • method:
    • VAE + autoregressive model (but why?)

Are density estimation and representation learning different tasks?

** My question: where does maximum likelihood come from? **


20240617

  • paper: A simple neural network module for relational reasoning
  • links: https://arxiv.org/pdf/1706.01427
  • ideas:
    • first-pass:

      Q: what is the paper about? A: the paper proposes a network called a “Relational Network” as a module that can be plugged into a network, which can improve the reasoning ability of the network.


20240619

This paper is more of an educational piece, so there is no pass-by-pass breakdown.

  • paper: The Dawning of a New Era in Applied Mathematics
  • links: https://www.ams.org/journals/notices/202104/rnoti-p565.pdf
  • ideas:
    • In the Keplerian paradigm, or the data-driven approach, one extracts scientific discoveries through the analysis of data. The classical example is Kepler’s laws of planetary motion. Bioinformatics provides a compelling illustration of the success of the Keplerian paradigm in modern times
    • In the Newtonian paradigm, or the first-principle-based approach, the objective is to discover the fundamental principles that govern the world around us or the things we are interested in

      The data-driven approach has become a very powerful tool with the advance of statistical methods and machine learning. It is very effective for finding the facts, but less effective for helping us to find the reasons behind the facts.

The first-principle-based approach aims at understanding at the most fundamental level. Physics, in particular, is driven by the pursuit of such first principles. A turning point was in 1929 with the establishment of quantum mechanics: as was declared by Dirac [2], with quantum …

This is the dilemma we often face in the first principle-based approach: it is fundamental but not very practical.


20240620

  • paper: LANGUAGE MODELING IS COMPRESSION
  • links: https://arxiv.org/pdf/2309.10668v2
  • ideas:
    • first-pass:

      Q: what is the paper about? A: the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

      Q: what is the hypothesis? A: Arithmetic coding transforms a sequence model into a compressor, and, conversely, a compressor can be transformed into a predictor by using its coding lengths to construct probability distributions following Shannon’s entropy principle.

Question: It has long been established that predictive models can be transformed into lossless compressors and vice versa. Why?

A:

This paper is worth digging into (toy illustration below).
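
A toy illustration of the prediction-compression link, using a deliberately dumb "model" (empirical character frequencies, my own choice): under any probabilistic model, a symbol with probability p costs about -log2(p) bits in an arithmetic code, so better prediction means a shorter description.

      # Ideal code length of a string under a simple frequency model vs. raw 8-bit encoding.
      import math
      from collections import Counter

      text = "abracadabra abracadabra"
      counts = Counter(text)
      total = sum(counts.values())
      probs = {c: n / total for c, n in counts.items()}

      bits = sum(-math.log2(probs[c]) for c in text)   # ideal code length under the model
      print(f"{bits:.1f} bits vs {8 * len(text)} bits for raw 8-bit ASCII")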


20240625

  • paper: Large Language Model Evaluation via Matrix Entropy
  • links: https://arxiv.org/pdf/2401.17139
  • ideas:
    • first-pass:

      Q: what is the paper about? A: this paper introduces matrix entropy. The compression process enables the model to learn and understand the shared structure of the data.

      core idea: We introduce matrix entropy, a new intrinsic metric that reflects the extent to which a language model “compresses” the common knowledge in the data.

    Probably try it out someday.


20240626

  • paper: The Platonic Representation Hypothesis
  • links: https://arxiv.org/pdf/2405.07987
  • ideas:
    • first-pass: Q: what is the paper about? A: representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality.

The Platonic Representation Hypothesis: Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.

Models are increasingly aligning to brains.


20240627

  • paper: Superlinear Returns
  • links: https://paulgraham.com/superlinear.html
  • takeaway:
    • If your product is only half as good as your competitor’s, you don’t get half as many customers. You get no customers, and you go out of business.
    • the companies with high growth rates tend to become immensely valuable, while the ones with lower growth rates may not even survive.
    • Y Combinator encourages founders to focus on growth rate rather than absolute numbers.
    • The most common case of exponential growth in preindustrial times was probably scholarship. The more you know, the easier it is to learn new things.
    • Knowledge grows exponentially, but there are also thresholds in it. Learning to ride a bicycle, for example. Some of these thresholds are akin to machine tools.
    • There are two ways work can compound. It can compound directly, in the sense that doing well in one cycle causes you to do better in the next. That happens for example when you’re building infrastructure, or growing an audience or brand. Or work can compound by teaching you, since learning compounds. This second case is an interesting one because you may feel you’re doing badly as it’s happening.
    • This is one reason Silicon Valley is so tolerant of failure. People in Silicon Valley aren't blindly tolerant of failure. They'll only continue to bet on you if you're learning from your failures. But if you are, you are in fact a good bet: maybe your company didn’t grow the way you wanted, but you yourself have, and that should yield results eventually.
    • Which yields another heuristic: always be learning. If you’re not learning, you’re probably not on a path that leads to superlinear returns.
    • But don't overoptimize what you're learning. Don't limit yourself to learning things that are already known to be valuable. You're learning; you don't know for sure yet what's going to be valuable, and if you're too strict you'll lop off the outliers.
    • A principle for taking advantage of thresholds has to include a test to ensure the game is worth playing. Here’s one that does: if you come across something that’s mediocre yet still popular, it could be a good idea to replace it. For example, if a company makes a product that people dislike yet still buy, then presumably they’d buy a better alternative if you made one.
    • So one heuristic here is to be driven by curiosity rather than careerism — to give free rein to your curiosity instead of working on what you’re supposed to.

      PG’s essays are really good. It’s worth categorizing all his essays into different categories.


20240628

  • paper: How to Do Great Work
  • links: https://paulgraham.com/greatwork.html
  • takeaway:

    Every paragraph seems like gold.

    • The first step is to decide what to work on. The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work.
    • The way to figure out what to work on is by working. If you’re not sure what to work on, guess. But pick something and get going. You’ll probably guess wrong some of the time, but that’s fine. It’s good to know about multiple things; some of the biggest discoveries come from noticing connections between different fields.
    • Develop a habit of working on your own projects. Don’t let “work” mean something other people tell you to do. If you do manage to do great work one day, it will probably be on a project of your own. It may be within some bigger project, but you’ll be driving your part of it.
    • What should your projects be? Whatever seems to you excitingly ambitious. As you grow older and your taste in projects evolves, exciting and important will converge. At 7 it may seem excitingly ambitious to build huge things out of Lego, then at 14 to teach yourself calculus, till at 21 you’re starting to explore unanswered questions in physics. But always preserve excitingness.
    • Once you’ve found something you’re excessively interested in, the next step is to learn enough about it to get you to one of the frontiers of knowledge. Knowledge expands fractally, and from a distance its edges look smooth, but once you learn enough to get close to one, they turn out to be full of gaps.
    • Four steps: choose a field, learn enough to get to the frontier, notice gaps, explore promising ones. This is how practically everyone who’s done great work has done it, from painters to physicists.
    • The three most powerful motives are curiosity, delight, and the desire to do something impressive. Sometimes they converge, and that combination is the most powerful of all. The big prize is to discover a new fractal bud. You notice a crack in the surface of knowledge, pry it open, and there’s a whole world inside.
    • The nature of ambition exacerbates this problem. Ambition comes in two forms, one that precedes interest in the subject and one that grows out of it. Most people who do great work have a mix, and the more you have of the former, the harder it will be to decide what to do.
    • The main reason it’s hard is that you can’t tell what most kinds of work are like except by doing them. Which means the four steps overlap: you may have to work at something for years before you know how much you like it or how good you are at it. And in the meantime you’re not doing, and thus not learning about, most other kinds of work. So in the worst case you choose late based on very incomplete information.

What should you do if you're young and ambitious but don't know what to work on? What you should not do is drift along passively, assuming the problem will solve itself. You need to take action. But there is no systematic procedure you can follow. When you read biographies of people who've done great work, it's remarkable how much luck is involved. They discover what to work on as a result of a chance meeting, or by reading a book they happen to pick up. So you need to make yourself a big target for luck, and the way to do that is to be curious. Try lots of things, meet lots of people, read lots of books, ask lots of questions.

Don't worry if you find you're interested in different things than other people. The stranger your tastes in interestingness, the better. Strange tastes are often strong ones, and a strong taste for work means you'll be productive. And you're more likely to find new things if you're looking where few have looked before.

If you're making something for people, make sure it's something they actually want. The best way to do this is to make something you yourself want. Write the story you want to read; build the tool you want to use. Since your friends probably have similar interests, this will also get you your initial audience.

This should follow from the excitingness rule. Obviously the most exciting story to write will be the one you want to read. The reason I mention this case explicitly is that so many people get it wrong. Instead of making what they want, they try to make what some imaginary, more sophisticated audience wants. And once you go down that route, you’re lost.

There are a lot of forces that will lead you astray when you're trying to figure out what to work on. Pretentiousness, fashion, fear, money, politics, other people's wishes, eminent frauds. But if you stick to what you find genuinely interesting, you'll be proof against all of them. If you're interested, you're not astray.

In most cases the recipe for doing great work is simply: work hard on excitingly ambitious projects, and something good will come of it. Instead of making a plan and then executing it, you just try to preserve certain invariants.

I think for most people who want to do great work, the right strategy is not to plan too much. At each stage do whatever seems most interesting and gives you the best options for the future. I call this approach "staying upwind." This is how most people who've done great work seem to have done it.

This is one case where the young have an advantage. They’re more optimistic, and even though one of the sources of their optimism is ignorance, in this case ignorance can sometimes beat knowledge.

Since there are two senses of starting work — per day and per project — there are also two forms of procrastination. Per-project procrastination is far the more dangerous. You put off starting that ambitious project from year to year because the time isn’t quite right. When you’re procrastinating in units of years, you can get a lot not done.

The way to beat it is to stop occasionally and ask yourself: Am I working on what I most want to work on? When you're young it's ok if the answer is sometimes no, but this gets increasingly dangerous as you get older.

(Note: Don’t lie to yourself.)

There may be some jobs where you have to work diligently for years at things you hate before you get to the good part, but this is not how great work happens. Great work happens by focusing consistently on something you’re genuinely interested in. When you pause to take stock, you’re surprised how far you’ve come.

The reason we’re surprised is that we underestimate the cumulative effect of work. Writing a page a day doesn’t sound like much, but if you do it every day you’ll write a book a year. That’s the key: consistency. People who do great things don’t get a lot done every day. They get something done, rather than nothing.

(Note: to really accumulate something. Conscious level is important.)

If you do work that compounds, you’ll get exponential growth. Most people who do this do it unconsciously, but it’s worth stopping to think about. Learning, for example, is an instance of this phenomenon: the more you learn about something, the easier it is to learn more. Growing an audience is another: the more fans you have, the more new fans they’ll bring you.

Don’t try to work in a distinctive style. Just try to do the best job you can; you won’t be able to help doing it in a distinctive way.

Style is doing things in a distinctive way without trying to. Trying to is affectation.

True by itself is not enough, of course. Great ideas have to be true and new. And it takes a certain amount of ability to see new ideas even once you’ve learned enough to get to one of the frontiers of knowledge.

I’ve never liked the term “creative process.” It seems misleading. Originality isn’t a process, but a habit of mind. Original thinkers throw off new ideas about whatever they focus on, like an angle grinder throwing off sparks. They can’t help it.

To find new ideas you have to seize on signs of breakage instead of looking away. That’s what Einstein did. He was able to see the wild implications of Maxwell’s equations not so much because he was looking for new ideas as because he was stricter.

The other thing you need is a willingness to break rules. Paradoxical as it sounds, if you want to fix your model of the world, it helps to be the sort of person who's comfortable breaking rules. From the point of view of the old model, which everyone including you initially shares, the new model usually breaks at least implicit rules.

There are two ways to be comfortable breaking rules: to enjoy breaking them, and to be indifferent to them. I call these two cases being aggressively and passively independent-minded.

One way to discover broken models is to be stricter than other people. Broken models of the world leave a trail of clues where they bash against reality. Most people don’t want to see these clues. It would be an understatement to say that they’re attached to their current model; it’s what they think in; so they’ll tend to ignore the trail of clues left by its breakage, however conspicuous it may seem in retrospect.

The other way to break rules is not to care about them, or perhaps even to know they exist. This is why novices and outsiders often make new discoveries; their ignorance of a field’s assumptions acts as a source of temporary passive independent-mindedness. Aspies also seem to have a kind of immunity to conventional beliefs. Several I know say that this helps them to have new ideas.

Use the advantages of youth when you have them, and the advantages of age once you have those. The advantages of youth are energy, time, optimism, and freedom. The advantages of age are knowledge, efficiency, money, and power. With effort you can acquire some of the latter when young and keep some of the former when old.


20240703

  • paper: The Best Essay
  • links: https://paulgraham.com/best.html
  • takeaway:
    • How do you get this initial question? It probably won’t work to choose some important-sounding topic at random and go at it. Professional traders won't even trade unless they have what they call an edge — a convincing story about why in some class of trades they'll win more than they lose. Similarly, you shouldn't attack a topic unless you have a way in — some new insight about it or way of approaching it.
    • Perhaps beginning writers are alarmed at the thought of starting with something mistaken or incomplete, but you shouldn’t be, because this is why essay writing works. Forcing yourself to commit to some specific string of words gives you a starting point, and if it’s wrong, you’ll see that when you reread it. At least half of essay writing is rereading what you’ve written and asking is this correct and complete? You have to be very strict when rereading, not just because you want to keep yourself honest, but because a gap between your response and the truth is often a sign of new ideas to be discovered.
    • Ideally the response to a question is two things: the first step in a process that converges on the truth, and a source of additional questions (in my very general sense of the word). So the process continues recursively, as response spurs response. [4]
    • It would be a mistake to let this make you too conservative though, because you can’t predict where a question will lead. Not if you’re doing things right, because doing things right means making discoveries, and by definition you can’t predict those. So the way to respond to this situation is not to be cautious about which initial question you choose, but to write a lot of essays. Essays are for taking risks.
    • Almost any question can get you a good essay. Indeed, it took some effort to think of a sufficiently unpromising topic in the third paragraph, because any essayist’s first impulse on hearing that the best essay couldn’t be about x would be to try to write it. But if most questions yield good essays, only some yield great ones.
    • This essay is an example. Writing about the best essay implies there is such a thing, which pseudo-intellectuals will dismiss as reductive, though it follows necessarily from the possibility of one essay being better than another. And thinking about how to do something so ambitious is close enough to doing it that it holds your attention.
    • I like to start an essay with a gleam in my eye. This could be just a taste of mine, but there’s one aspect of it that probably isn’t: to write a really good essay on some topic, you have to be interested in it. A good writer can write well about anything, but to stretch for the novel insights that are the raison d’etre of the essay, you have to care.
    • What other qualities would a great initial question have? It’s probably good if it has implications in a lot of different areas. And I find it’s a good sign if it’s one that people think has already been thoroughly explored. But the truth is that I've barely thought about how to choose initial questions, because I rarely do it. I rarely choose what to write about; I just start thinking about something, and sometimes it turns into an essay.
    • Perhaps the answer is to go one step earlier: to write about whatever pops into your head, but try to ensure that what pops into your head is good. Indeed, now that I think about it, this has to be the answer, because a mere list of topics wouldn’t be any use if you didn’t have edge with any of them. To start writing an essay, you need a topic plus some initial insight about it, and you can’t generate those systematically. If only. [9]
    • You can probably cause yourself to have more of them, though. The quality of the ideas that come out of your head depends on what goes in, and you can improve that in two dimensions, breadth and depth.
    • You can’t learn everything, so getting breadth implies learning about topics that are very different from one another. When I tell people about my book-buying trips to Hay and they ask what I buy books about, I usually feel a bit sheepish answering, because the topics seem like a laundry list of unrelated subjects. But perhaps that’s actually optimal in this business.
    • You can also get ideas by talking to people, by doing and building things, and by going places and seeing things. I don’t think it’s important to talk to new people so much as the sort of people who make you have new ideas. I get more new ideas after talking for an afternoon with Robert Morris than from talking to 20 new smart people. I know because that’s what a block of office hours at Y Combinator consists of.
    • While breadth comes from reading and talking and seeing, depth comes from doing. The way to really learn about some domain is to have to solve problems in it. Though this could take the form of writing, I suspect that to be a good essayist you also have to do, or have done, some other kind of work. That may not be true for most other fields, but essay writing is different. You could spend half your time working on something else and be net ahead, so long as it was hard.
    • That’s the ultimate source of drag on the connectedness of ideas: the discoveries you make along the way. If you discover enough starting from question A, you’ll never make it to question B. Though if you keep writing essays you’ll gradually fix this problem by burning off such discoveries. So bizarrely enough, writing lots of essays makes it as if the space of ideas were more highly connected.
    • There are two senses in which an essay can be timeless: to be about a matter of permanent importance, and always to have the same effect on readers. With art these two senses blend together. Art that looked beautiful to the ancient Greeks still looks beautiful to us. But with essays the two senses diverge, because essays teach, and you can’t teach people something they already know. Natural selection is certainly a matter of permanent importance, but an essay explaining it couldn’t have the same effect on us that it would have had on Darwin’s contemporaries, precisely because his ideas were so successful that everyone already knows about them.
    • If you want to surprise readers not just now but in the future as well, you have to write essays that won’t stick — essays that, no matter how good they are, won’t become part of what people in the future learn before they read them.
    • But although I wish I could say that writing great essays depends mostly on effort, in the limit case it’s inspiration that makes the difference. In the limit case, the questions are the harder thing to get. That pool has no bottom.

      How to get more questions? That is the most important question of all.



20240704

  • paper: Life is Short
  • links: https://paulgraham.com/vb.html
  • takeaway:
    • If life is short, we should expect its shortness to take us by surprise. And that is just what tends to happen. You take things for granted, and then they’re gone. You think you can always write that book, or climb that mountain, or whatever, and then you realize the window has closed. The saddest windows close when other people die. Their lives are short too. After my mother died, I wished I’d spent more time with her. I lived as if she’d always be there. And in her typical quiet way she encouraged that illusion. But an illusion it was. I think a lot of people make the same mistake I did.
    • Perhaps a better solution is to look at the problem from the other end. Cultivate a habit of impatience about the things you most want to do. Don’t wait before climbing that mountain or writing that book or visiting your mother. You don’t need to be constantly reminding yourself why you shouldn’t wait. Just don’t wait.
    • I can think of two more things one does when one doesn’t have much of something: try to get more of it, and savor what one has. Both make sense here.
    • Relentlessly prune bullshit, don’t wait to do things that matter, and savor the time you have. That’s what you do when life is short.


20240705

  • paper: Putting Ideas into Words
  • links: https://paulgraham.com/words.html
  • takeaway:
    • Writing about something, even something you know well, usually shows you that you didn’t know it as well as you thought. Putting ideas into words is a severe test. The first words you choose are usually wrong; you have to rewrite sentences over and over to get them exactly right. And your ideas won’t just be imprecise, but incomplete too. Half the ideas that end up in an essay will be ones you thought of while you were writing it. Indeed, that’s why I write them.
    • Once you publish something, the convention is that whatever you wrote was what you thought before you wrote it. These were your ideas, and now you’ve expressed them. But you know this isn’t true. You know that putting your ideas into words changed them. And not just the ideas you published. Presumably there were others that turned out to be too broken to fix, and those you discarded instead.
    • It’s not just having to commit your ideas to specific words that makes writing so exacting. The real test is reading what you’ve written. You have to pretend to be a neutral reader who knows nothing of what’s in your head, only what you wrote. When he reads what you wrote, does it seem correct? Does it seem complete? If you make an effort, you can read your writing as if you were a complete stranger, and when you do the news is usually bad. It takes me many cycles before I can get an essay past the stranger. But the stranger is rational, so you always can, if you ask him what he needs.
    • You can know a great deal about something without writing about it. Can you ever know so much that you wouldn’t learn more from trying to explain what you know? I don’t think so. I’ve written about at least two subjects I know well — Lisp hacking and startups — and in both cases I learned a lot from writing about them. In both cases there were things I didn’t consciously realize till I had to explain them.
    • And I don’t think my experience was anomalous. A great deal of knowledge is unconscious, and experts have if anything a higher proportion of unconscious knowledge than beginners
    • I’m not saying that writing is the best way to explore all ideas. If you have ideas about architecture, presumably the best way to explore them is to build actual buildings. What I’m saying is that however much you learn from exploring ideas in other ways, you’ll still learn new things from writing about them.
    • If you’re lazy, of course, writing and talking are equally useless. But if you want to push yourself to get things right, writing is the steeper hill.


20240708

  • paper: How to think in writing
  • links: https://www.henrikkarlsson.xyz/p/writing-to-think
  • takeaway:
    • The reason I’ve spent so long establishing this rather obvious point [that writing helps you refine your thinking] is that it leads to another that many people will find shocking. If writing down your ideas always makes them more precise and more complete, then no one who hasn’t written about a topic has fully formed ideas about it. And someone who never writes has no fully formed ideas about anything nontrivial.

      It feels to them as if they do, especially if they’re not in the habit of critically examining their own thinking. Ideas can feel complete. It’s only when you try to put them into words that you discover they’re not. So if you never subject your ideas to that test, you’ll not only never have fully formed ideas, but also never realize it.

    • Good thinking is about pushing past your current understanding and reaching the thought behind the thought.
    • When I write, I get to observe the transition from this fluid mode of thinking to the rigid. As I type, I’m often in a fluid mode—writing at the speed of thought. I feel confident about what I’m saying. But as soon as I stop, the thoughts solidify, rigid on the page, and, as I read what I’ve written, I see cracks spreading through my ideas. What seemed right in my head fell to pieces on the page.
    • And it is only the first step. Once you have made your thoughts definite, clear, concrete, sharp, and rigid, you also want to unfold them.
    • By doing this, I try to continually focus my reading on the goal of forming a bottom-line view, rather than just “gathering information.” I think this makes my investigations more focused and directed, and the results easier to retain. I consider this approach to be probably the single biggest difference-maker between "reading a ton about lots of things, but retaining little" and "efficiently developing a set of views on key topics and retaining the reasoning behind them."

20240709

  • paper: C++ design patterns for low-latency applications including high-frequency trading
  • links: https://arxiv.org/pdf/2309.04259
  • takeaway: Optimization in C++
    • Cache Warming:
    • Compile-time dispatch:
    • Constexpr:
    • Loop Unrolling:
    • Short-circuiting:
    • Signed vs Unsigned Comparisons:
    • Avoid Mixing Float and Doubles
    • Branch Prediction/Reduction:
    • Slowpath Removal:
    • SIMD:
    • Prefetching:
    • Lock-free Programming:
    • Inlining:

ring buffer: lock-free programming (a minimal sketch follows the diagram below)

Cache read/write flow (hit vs. miss):

        +--------------------+
        |   Memory request   |
        +--------------------+
                  |
                  v
        +--------------------+
        |    Request type    |
        +--------------------+
        /                    \
       /                      \
      v                        v
  +-------+                +-------+
  |  Read |                | Write |
  +-------+                +-------+
      |                        |
      v                        v
+------------+            +------------+
| Cache hit? |            | Cache hit? |
+------------+            +------------+
      |                        |
  No  | Yes                No  | Yes
      |                        |
+------------------+      +------------------+
| Locate a cache   |      | Write data into  |
| block to use     |      | cache block      |
+------------------+      +------------------+
      |                        |
      v                        v
+----------------------------+ |
| Read data from lower       | |
| memory into the cache block| |
+----------------------------+ |
      |                        |
      v                        v
+------------------+    +------------------+
|  Return data     |    | Write data into  |
+------------------+    | lower memory     |
      |                 +------------------+
      v                        |
+--------------------+         v
|        Done        |<-------/
+--------------------+
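
The “ring buffer” note above refers to the classic lock-free structure for passing messages between threads. Below is a minimal single-producer/single-consumer sketch of the idea (my own illustration, not code from the paper):

#include <array>
#include <atomic>
#include <cstddef>

// One producer thread calls push(), one consumer thread calls pop().
// Neither side ever takes a lock: each index is written by exactly one thread.
template <typename T, std::size_t N>
class SpscRingBuffer {
    std::array<T, N> buffer_{};
    std::atomic<std::size_t> head_{0};   // next slot to read  (owned by consumer)
    std::atomic<std::size_t> tail_{0};   // next slot to write (owned by producer)

public:
    bool push(const T& item) {           // returns false if the buffer is full
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;
        buffer_[tail] = item;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {                   // returns false if the buffer is empty
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;
        out = buffer_[head];
        head_.store((head + 1) % N, std::memory_order_release);
        return true;
    }
};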

Beyond networking protocols and physical infrastructure, HFT firms invest in specialized hardware. Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) are common choices, as they can execute trading algorithms more efficiently than general-purpose processors

Runtime dispatch, also known as dynamic dispatch, resolves function calls at runtime. This method is primarily associated with inheritance and virtual functions [13]. In such cases, the function that gets executed relies on the object’s type at runtime. Conversely, compile-time dispatch determines the function call during the compilation phase and is frequently used in conjunction with templates and function overloading.
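
A minimal sketch of the two dispatch styles (my own illustration; the handler names are hypothetical, not from the paper):

#include <cstdio>

// Runtime (dynamic) dispatch: the call goes through the vtable.
struct OrderHandler {
    virtual void onOrder() const { std::puts("base handler"); }
    virtual ~OrderHandler() = default;
};
struct FastHandler : OrderHandler {
    void onOrder() const override { std::puts("fast handler"); }
};

// Compile-time dispatch: the call is resolved during compilation via a template,
// so it can be inlined and costs no indirect branch.
struct InlineHandler {
    void onOrder() const { std::puts("inline handler"); }
};

template <typename Handler>
void processOrder(const Handler& h) {
    h.onOrder();                    // resolved at compile time for the concrete type
}

int main() {
    FastHandler fh;
    const OrderHandler& dyn = fh;
    dyn.onOrder();                  // runtime dispatch through the vtable

    processOrder(InlineHandler{});  // compile-time dispatch, no vtable lookup
}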

Bad design:                                                

if (checkForErrorA())
    handleErrorA();
else if (checkForErrorB())
    handleErrorB();
else if (checkForErrorC())
    handleErrorC();
else
    executeHotpath();


Good design:

uint32_t errorFlags;
...
if (errorFlags)
    HandleError(errorFlags);
else
{
    ... hotpath
}

SIMD Array Addition:

ArrayAddition: Takes approximately 20,000 ns. ArrayAddition_SIMD: Takes approximately 12,000 ns, showing improved performance compared to regular array addition.
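
A sketch of what the SIMD variant might look like with AVX intrinsics (my own illustration, assuming an x86 CPU with AVX and a compiler flag such as -mavx; the paper’s actual benchmark code may differ):

#include <immintrin.h>
#include <cstddef>

// Scalar baseline: one float addition per iteration.
void addArrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// AVX version: eight float additions per iteration.
void addArraysSIMD(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)              // scalar tail for the remaining elements
        out[i] = a[i] + b[i];
}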

Lock-Free Programming:

Mutex: Takes approximately 175,000 ns. Atomic: Takes approximately 75,000 ns, demonstrating better performance compared to using mutex.
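
A sketch of the kind of update being compared (my own illustration of mutex-protected vs. atomic increments; the paper’s actual benchmark may differ):

#include <atomic>
#include <mutex>

// Mutex-protected counter: every increment pays for lock/unlock,
// and a contended thread may be put to sleep by the OS.
struct MutexCounter {
    long value = 0;
    std::mutex m;
    void increment() {
        std::lock_guard<std::mutex> lock(m);
        ++value;
    }
};

// Lock-free counter: a single atomic read-modify-write,
// no system calls and no risk of blocking behind a lock holder.
struct AtomicCounter {
    std::atomic<long> value{0};
    void increment() {
        value.fetch_add(1, std::memory_order_relaxed);
    }
};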

kernel bypass: Kernel bypass mitigates these latency issues by facilitating direct communication between user applications and the network interface card (NIC).

Speed Improvement by Optimisation Technique
-------------------------------------------------------------
| Technique                  | Speed Improvement (%)        |
-------------------------------------------------------------
| Cache Warming              | ############################ 90.00% |
| Constexpr                  | ############################ 90.88% |
| Loop unrolling             | ####################### 72.00%      |
| Lock-Free Programming      | ################### 63.00%          |
| Mixing data types          | ################## 52.00%           |
| Short-circuiting           | ################## 50.00%           |
| SIMD Instructions          | ################## 49.00%           |
| Branch reduction           | ########### 36.00%                  |
| Compile-time dispatch      | ########### 26.00%                  |
| Prefetching                | ######## 23.50%                     |
| Inlining                   | ####### 20.50%                      |
| Signed vs unsigned         | #### 12.15%                         |
| Slowpath removal           | #### 12.00%                         |
-------------------------------------------------------------
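
Constexpr sits near the top of the table; the idea is to move work to compile time so the hot path only reads precomputed results. A minimal sketch (my own illustration, assuming C++17; not the paper’s benchmark):

#include <array>
#include <cstddef>

// The table of powers of two is computed entirely at compile time,
// so at runtime powerOfTwo() is just an array lookup.
constexpr std::array<unsigned long long, 64> makePowersOfTwo() {
    std::array<unsigned long long, 64> table{};
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = 1ULL << i;
    return table;
}

constexpr auto kPowersOfTwo = makePowersOfTwo();

unsigned long long powerOfTwo(std::size_t n) {
    return kPowersOfTwo[n];   // no computation on the hot path
}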

20240710

./static/vlm.png


20240711

  • paper: Being a Noob
  • links: https://www.paulgraham.com/noob.html
  • takeaway:
    • It’s not pleasant to feel like a noob. And the word “noob” is certainly not a compliment. And yet today I realized something encouraging about being a noob: the more of a noob you are locally, the less of a noob you are globally.
    • Though it feels unpleasant, and people will sometimes ridicule you for it, the more you feel like a noob, the better.

20240712

  • paper: How to Start Google
  • links: https://www.paulgraham.com/google.html
  • takeaway:
    • The trick is to start your own company. So it’s not a trick for avoiding work, because if you start your own company you’ll work harder than you would if you had an ordinary job. But you will avoid many of the annoying things that come with a job, including a boss telling you what to do
    • All you can know when you start working on a startup is that it seems worth pursuing. You can’t know whether it will turn into a company worth billions or one that goes out of business. So when I say I’m going to tell you how to start Google, I mean I’m going to tell you how to get to the point where you can start a company that has as much chance of being Google as Google had of being Google.
    • You need to be good at some kind of technology, you need an idea for what you’re going to build, and you need cofounders to start the company with.
    • Just work on whatever interests you the most. You’ll work much harder on something you’re interested in than something you’re doing because you think you’re supposed to.
    • Those of you who are taking computer science classes in school may at this point be thinking, ok, we’ve got this sorted. We’re already being taught all about programming. But sorry, this is not enough. You have to be working on your own projects, not just learning stuff in classes. You can do well in computer science classes without ever really learning to program. In fact you can graduate with a degree in computer science from a top university and still not be any good at programming. That’s why tech companies all make you take a coding test before they’ll hire you, regardless of where you went to university or how well you did there. They know grades and exam results prove nothing.
    • Actually it's easy to get startup ideas once you're good at technology. Once you're good at some technology, when you look at the world you see dotted outlines around the things that are missing. You start to be able to see both the things that are missing from the technology itself, and all the broken things that could be fixed using it, and each one of these is a potential startup.
    • So the list of what you need to do to get from here to starting a startup is quite short. You need to get good at technology, and the way to do that is to work on your own projects. And you need to do as well in school as you can, so you can get into a good university, because that’s where the cofounders and the ideas are. That’s it, just two things, build stuff and do well in school.

20240713

  • paper: RT-1: ROBOTICS TRANSFORMER FOR REAL-WORLD CONTROL AT SCALE
  • links: https://robotics-transformer1.github.io/assets/rt1.pdf
  • takeaway:
    • first-pass: Q: what is this paper about? A: This paper proposes a method called Robotics Transformer (RT-1), which aims to solve general real-world robotics problems.

Architecture:

./static/robot-transformer.png


20240715

  • paper: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
  • links: https://arxiv.org/pdf/2307.15818
  • takeaway:
    • first-pass: Q: what is this paper about? A: Key method: tokenizing the actions into text tokens and creating “multimodal sentences” (Driess et al., 2023) that “respond” to robotic instructions paired with camera observations by producing corresponding actions.

Architecture:

./static/vlm2.png


20240716

  • paper: A Survey on Efficient Inference for Large Language Models
  • links: https://arxiv.org/pdf/2404.14294
  • takeaway: The inference process of LLMs can be divided into two stages:

    • Prefilling Stage: The LLM calculates and stores the KV cache of the initial input tokens, and generates the first output token

    • Decoding Stage: The LLM generates the output tokens one by one with the KV cache, and then updates it with the key (K) and value (V) pairs of the newly generated token
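
A toy sketch of the two stages (my own illustration, written in C++ to match the rest of this file; computeKey/computeValue/sampleNextToken are dummy stand-ins for the real projection, attention, and sampling computations):

#include <vector>

// Stand-ins for the real model computations.
std::vector<float> computeKey(int token)   { return {static_cast<float>(token)}; }
std::vector<float> computeValue(int token) { return {static_cast<float>(token)}; }

struct KVCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;
};

int sampleNextToken(const KVCache& cache) {
    // Dummy sampler: a real decoder would attend over all cached K/V pairs.
    return static_cast<int>(cache.keys.size());
}

// Prefilling stage: process the whole prompt at once, populate the KV cache,
// and produce the first output token.
int prefill(const std::vector<int>& prompt, KVCache& cache) {
    for (int token : prompt) {
        cache.keys.push_back(computeKey(token));
        cache.values.push_back(computeValue(token));
    }
    return sampleNextToken(cache);
}

// Decoding stage: generate tokens one by one; each step appends exactly one
// new K/V pair for the token produced in the previous step.
std::vector<int> decode(int firstToken, KVCache& cache, int maxNewTokens) {
    std::vector<int> output{firstToken};
    while (static_cast<int>(output.size()) < maxNewTokens) {
        cache.keys.push_back(computeKey(output.back()));
        cache.values.push_back(computeValue(output.back()));
        output.push_back(sampleNextToken(cache));
    }
    return output;
}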


20240717

  • paper: Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Architecture:

./static/llm-infer.png

Architecture:

./static/llm-infer2.png

Architecture:

./static/llm-infer3.png

Architecture:

./static/llm-infer4.png

Architecture:

./static/llm-infer5.png

Architecture:

./static/llm-infer6.png

Architecture:

./static/llm-infer7.png


20240718

  • paper: Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
  • links: https://arxiv.org/pdf/2407.09468
  • takeaway:
    • As the availability of richly structured, non-Euclidean data grows across application domains, there is an increasing need for machine learning methods that can fully leverage the underlying geometry, topology, and symmetries to extract insights. Driven by this need, a new paradigm of non-Euclidean machine learning is emerging that generalizes classical techniques to curved manifolds, topological spaces, and group-structured data. This paradigm shift echoes the non-Euclidean revolution in mathematics in the 19th century, which radically expanded our notion of geometry and catalyzed significant advancements across the natural sciences.

> graph machine learning

Architecture:

./static/graph-dl.png


Architecture:

./static/overview.jpg


  • paper: The Right Kind of Stubborn
  • links: https://www.paulgraham.com/persistence.html
  • takeaway:
    • The persistent are attached to the goal. The obstinate are attached to their ideas about how to reach it. Worse still, that means they’ll tend to be attached to their first ideas about how to solve a problem, even though these are the least informed by the experience of working on it. So the obstinate aren’t merely attached to details, but disproportionately likely to be attached to wrong ones.
    • That was my initial theory, but on examination it doesn’t hold up. If being obstinate were simply a consequence of being in over one’s head, you could make persistent people become obstinate by making them solve harder problems. But that’s not what happens. If you handed the Collisons an extremely hard problem to solve, they wouldn’t become obstinate. If anything they’d become less obstinate. They’d know they had to be open to anything.
    • Obstinacy is a reflexive resistance to changing one’s ideas. This is not identical with stupidity, but they’re closely related. A reflexive resistance to changing one’s ideas becomes a sort of induced stupidity as contrary evidence mounts. And obstinacy is a form of not giving up that’s easily practiced by the stupid. You don’t have to consider complicated tradeoffs; you just dig in your heels. It even works, up to a point.
    • Merely having energy and imagination is quite rare. But to solve hard problems you need three more qualities: resilience, good judgement, and a focus on some kind of goal.
    • When you look at the internal structure of persistence, it doesn’t resemble obstinacy at all. It’s so much more complex. Five distinct qualities — energy, imagination, resilience, good judgement, and focus on a goal — combine to produce a phenomenon that seems a bit like obstinacy in the sense that it causes you not to give up.
    • The obstinate do sometimes succeed in solving hard problems. One way is through luck: like the stopped clock that’s right twice a day, they seize onto some arbitrary idea, and it turns out to be right. Another is when their obstinacy cancels out some other form of error. For example, if a leader has overcautious subordinates, their estimates of the probability of success will always be off in the same direction. So if he mindlessly says “push ahead regardless” in every borderline case, he’ll usually turn out to be right.

  • paper: What I’ve Learned from Users
  • links: https://www.paulgraham.com/users.html
  • takeaway:
    • Explain what you’ve learned from users. That tests a lot of things: whether you’re paying attention to users, how well you understand them, and even how much they need what you’re making.
    • That’s one advantage of funding large numbers of early stage companies rather than smaller numbers of later-stage ones. You get a lot of data. Not just because you’re looking at more companies, but also because more goes wrong.
    • But knowing (nearly) all the problems startups can encounter doesn’t mean that advising them can be automated, or reduced to a formula. There’s no substitute for individual office hours with a YC partner. Each startup is unique, which means they have to be advised by specific partners who know them well.
    • So the essence of what happens at YC is to figure out which problems matter most, then cook up ideas for solving them — ideally at a resolution of a week or less — and then try those ideas and measure how well they worked. The focus is on action, with measurable, near-term results.
    • Speed defines startups. Focus enables speed. YC improves focus.
    • However good you are, good colleagues make you better. Indeed, very ambitious people probably need colleagues more than anyone else, because they’re so starved for them in everyday life.
    • Between the partners, the alumni, and their batchmates, founders are surrounded by people who want to help them, and can.

  • paper: How to Work Hard
  • links: https://www.paulgraham.com/hwh.html
  • takeaway:
    • One thing I know is that if you want to do great things, you’ll have to work very hard. I wasn’t sure of that as a kid. Schoolwork varied in difficulty; one didn’t always have to work super hard to do well.
    • Bill Gates, for example, was among the smartest people in business in his era, but he was also among the hardest working. “I never took a day off in my twenties,” he said. “Not one.” It was similar with Lionel Messi. He had great natural ability, but when his youth coaches talk about him, what they remember is not his talent but his dedication and his desire to win. P. G. Wodehouse would probably get my vote for best English writer of the 20th century, if I had to choose. Certainly no one ever made it look easier. But no one ever worked harder. At 74, he wrote
    • What I’ve learned since I was a kid is how to work toward goals that are neither clearly defined nor externally imposed. You’ll probably have to learn both if you want to do really great things. The most basic level of which is simply to feel you should be working without anyone telling you to. Now, when I’m not working hard, alarm bells go off. I can’t be sure I’m getting anywhere when I’m working hard, but I can be sure I’m getting nowhere when I’m not, and it feels awful.
    • Once you know the shape of real work, you have to learn how many hours a day to spend on it. You can’t solve this problem by simply working every waking hour, because in many kinds of work there’s a point beyond which the quality of the result will start to decline.
    • The bigger question of what to do with your life is one of these problems with a hard core. There are important problems at the center, which tend to be hard, and less important, easier ones at the edges. So as well as the small, daily adjustments involved in working on a specific problem, you’ll occasionally have to make big, lifetime-scale adjustments about which type of work to do. And the rule is the same: working hard means aiming toward the center — toward the most ambitious problems.
    • So while some people’s lives converge fast, there will be others whose lives never converge. And for these people, figuring out what to work on is not so much a prelude to working hard as an ongoing part of it, like one of a set of simultaneous equations. For these people, the process I described earlier has a third component: along with measuring both how hard you're working and how well you're doing, you have to think about whether you should keep working in this field or switch to another. If you're working hard but not getting good enough results, you should switch. It sounds simple expressed that way, but in practice it’s very difficult.
    • For this test to work, though, you have to be honest with yourself. Indeed, that’s the most striking thing about the whole question of working hard: how at each point it depends on being honest with yourself.

  • paper: The Risk of Discovery
  • links: https://www.paulgraham.com/disc.html
  • takeaway:
    • Maybe the smartness and the craziness were not as separate as we think. Physics seems to us a promising thing to work on, and alchemy and theology obvious wastes of time. But that’s because we know how things turned out. You have to make mistakes to find out which things actually work.

  • paper: The Need to Read
  • links: https://www.paulgraham.com/read.html
  • takeaway:
    • Reading about x doesn’t just teach you about x; it also teaches you how to write.
    • A good writer doesn’t just think, and then write down what he thought, as a sort of transcript. A good writer will almost always discover new things in the process of writing.
    • But even after doing this, you’ll find you still discover new things when you sit down to write. There is a kind of thinking that can only be done by writing.
    • You can’t think well without writing well, and you can’t write well without reading well. And I mean that last “well” in both senses. You have to be good at reading, and read good things.
    • People who just want information may find other ways to get it. But people who want to have ideas can’t afford to.

    ^ Plus an essay (Putting Ideas into Words)…

    • Writing about something, even something you know well, usually shows you that you didn't know it as well as you thought.
    • Half the ideas that end up in an essay will be ones you thought of while you were writing it. Indeed, that’s why I write them.
    • Can you ever know so much that you wouldn’t learn more from trying to explain what you know? I don’t think so.
    • A great deal of knowledge is unconscious, and experts have if anything a higher proportion of unconscious knowledge than beginners.
    • Putting ideas into words doesn’t have to mean writing, of course. You can also do it the old way, by talking. But in my experience, writing is the stricter test. You have to commit to a single, optimal sequence of words. Less can go unsaid when you don’t have tone of voice to carry meaning. And you can focus in a way that would seem excessive in conversation.
    • Putting ideas into words is certainly no guarantee that they’ll be right. Far from it. But though it’s not a sufficient condition, it is a necessary one.

  • paper: The Surprising Power of The Long Game
  • links: https://fs.blog/long-game/
  • takeaway:
    • If you do what everyone else is doing, you shouldn’t be surprised to get the same results everyone else is getting.
    • Different outcomes come from doing different things or doing things differently.
    • Long Game: It’s simpler to win than the short game. Simple but not easy. It requires repeatedly doing hard things today that make tomorrow easier.


  • paper: My programming beliefs as of July 2024
  • links: https://evanhahn.com/programming-beliefs-as-of-july-2024/
  • takeaway:
    • how to approach tasks:
      1. When presented with a difficult task, I ask myself: “what if I didn’t do this at all?”. Most of the time, this is a stupid question, and I have to do the thing. But ~5% of the time, I realize that I can completely skip some work.
      2. If I’m banging my head against a problem without making progress, I should take a break.
      3. Sometimes, I try implementing a feature in the smallest possible amount of time, with awful code, horrible hacks, and lots of TODOs. Once I have something working, I clean it up.
    • how to design software:
      1. Testability is basically the same thing as modularity.
      2. Make invalid states unrepresentable.
    • High level/career
      1. The most important problems are non-technical. “It’s not about technology for its own sake. It’s about being able to implement your own ideas.”
      2. Typing new code tends to be the easiest part of the job. Bigger challenges: reading code, prioritization, communication, team dynamics, etc.
      3. Making useless stuff can be a great way to learn new things.


  • paper: Beyond Smart
  • links: https://www.paulgraham.com/smart.html
  • takeaway:
    • But that wasn’t what was special about Einstein. What was special about him was that he had important new ideas. Being very smart was a necessary precondition for having those ideas, but the two are not identical.
    • This is the first time I’ve posed the question to myself this way, and I think it may take a while to answer. But I wrote recently about one of the most important: an obsessive interest in a particular topic. And this can definitely be cultivated.
    • There are general techniques for having new ideas — for example, for working on your own projects and for overcoming the obstacles you face with early work — and these can all be learned. Some of them can be learned by societies. And there are also collections of techniques for generating specific types of new ideas, like startup ideas and essay topics.
    • One of the most surprising ingredients in having new ideas is writing ability. There’s a class of new ideas that are best discovered by writing essays and books. And that “by” is deliberate: you don’t think of the ideas first, and then merely write them down.

  • paper: How To Become A Hacker
  • links: http://www.catb.org/~esr/faqs/hacker-howto.html
    • hackers build things, crackers break them.
    • As with all creative arts, the most effective way to become a master is to imitate the mind-set of masters — not just intellectually but emotionally as well.
    • Hacker Ethics:
      1. The world is full of fascinating problems waiting to be solved.

        You also have to develop a kind of faith in your own learning capacity — a belief that even though you may not know all of what you need to solve a problem, if you tackle just a piece of it and learn from that, you’ll learn enough to solve the next piece — and so on, until you’re done.

      2. No problem should ever have to be solved twice.

        To behave like a hacker, you have to believe that the thinking time of other hackers is precious — so much so that it’s almost a moral duty for you to share information, solve problems and then give the solutions away just so other hackers can solve new problems instead of having to perpetually re-address old ones.

      3. Boredom and drudgery are evil.

        Hackers (and creative people in general) should never be bored or have to drudge at stupid repetitive work, because when this happens it means they aren’t doing what only they can do — solve new problems. This wastefulness hurts everybody. Therefore boredom and drudgery are not just unpleasant but actually evil.

        To behave like a hacker, you have to believe this enough to want to automate away the boring bits as much as possible, not just for yourself but for everybody else (especially other hackers).

      4. Freedom is good.
      5. Attitude is no substitute for competence.

        To be a hacker, you have to develop some of these attitudes. But copping an attitude alone won’t make you a hacker, any more than it will make you a champion athlete or a rock star. Becoming a hacker will take intelligence, practice, dedication, and hard work.

        Therefore, you have to learn to distrust attitude and respect competence of every kind. Hackers won’t let posers waste their time, but they worship competence — especially competence at hacking, but competence at anything is valued. Competence at demanding skills that few can master is especially good, and competence at demanding skills that involve mental acuteness, craft, and concentration is best.

  • Basic Hacking Skill:
    1. Learn how to program.
    2. Get one of the open-source Unixes and learn to use and run it.
    3. Learn how to use the World Wide Web and write HTML. (this is from 1996)
    4. If you don’t have functional English, learn it.

  • paper: How To Learn Hacking
  • links: http://www.catb.org/~esr/faqs/hacking-howto.html
  • takeaway:
    • Hacking favors scrap-and-rebuild over patch-and-extend. An essential part of hacking is ruthlessly throwing away code that has become overcomplicated or crufty, no matter how much time you have invested in it.
    • incremental-hacking cycle.
      1. First, pick a program that does something you are interested in. Ideally, it should be a program you use regularly and have opinions about. The next best thing is a program you don’t normally use, but that does something you think is interesting. For this learning method to work, you should avoid trying to hack on code that bores you.
      2. If you don’t already know the program, learn how to use it. Read the documentation. Develop a mental model of how it works.
      3. Pick a small feature to change or add.
      4. Search the code until you find the part you need to modify.

        Note: you should specifically not try to read the entire program. You will just exhaust and frustrate yourself if you do that. Instead, use the module structure of the code to zero in on just the part you need to understand. Along the way, you will learn things about how the whole program fits together.

      5. Make, test, debug, and document your change.

        Documenting your change is important. If you develop the habit of doing this early, you’ll produce much higher-quality work.

      6. Send your change as a patch to the program maintainers. See the Software Release Practice HOWTO for tips on how to do this in an effective and polite way.
      7. Now, ask yourself: do I understand this entire program?

  • paper: Why Your Data Stack Won’t Last - And How To Build Data Infrastructure That Will
  • links: https://seattledataguy.substack.com/p/why-your-data-stack-wont-last-and
  • takeaway:
    • How To Avoid Key Dependency Issues
      • Documentation
      • Cross-training(where it makes sense)
    • How To Build With the Business
      • Start with business outcomes
      • Keep the business in the loop

  • paper: Weird Languages
  • links: https://www.paulgraham.com/weird.html
  • takeaway:
    • So if you want to expand your concept of what programming can be, one way to do it is by learning weird languages. Pick a language that most programmers consider weird but whose median user is smart, and then focus on the differences between this language and the intersection of popular languages.

Start to Read 101 Essays That Will Change The Way You Think


  • paper: Make Luck Your Destiny
  • link: https://nav.al/luck-destiny
  • takeaway:
    • Build your character so opportunity finds you
    • One of the things I think that is important to making money, when you want the kind of reputation that makes people do deals through you. I use the example of like, if you’re a great diver then treasure hunters will come and give you a piece of the treasure for your diving skills.
    • If you’re a trusted, reliable, high-integrity, long-term thinking deal maker, then when other people want to do deals but they don’t know how to do them in a trustworthy manner with strangers, they will literally approach you and give you a cut of the deal or offer you a unique deal just because of the integrity and reputation that you have built up.
    • But I would say your character, your reputation, these are things that you can build that then will let you take up advantage of opportunities that other people may characterize as lucky but you know that it wasn’t luck.
    • You have to be a little eccentric to be out on the frontier by yourself
    • There’s a Benjamin Disraeli quote, and this one from Sam Altman: “extreme people get extreme results.” I think that’s pretty nice. And then there’s this other one from Jeffrey Pfeffer, a professor at Stanford: “you can’t be normal and expect abnormal returns.” I’ve always enjoyed that one too.
    • Play stupid games, win stupid prizes.

  • paper: Finding Time to Invest in Yourself
  • link: https://nav.al/finding-time
  • takeaway:
    • A common question we get: “How do I find the time to start investing in myself? I have a job.” You have to rent your time to get started

      ( Coming out of college, Warren Buffett wanted to work for Benjamin Graham to learn to be a value investor. Buffett offered to work for free, and Graham responded, “You’re overpriced.” What that means is you have to make sacrifices to take on an apprenticeship. )

    • Find the part of the job with the steepest learning curve

      ( You want to avoid repetitive drudgery—that’s just biding time until your job is automated away. If you’re a barista at the coffee shop, figure out how to make connections with the customers. Figure out how to innovate the service you offer and delight the customer. Managers, founders and owners will take notice.)

    • Develop a founder mentality

      The hardest thing for any founder is finding employees with a founder mentality. This is a fancy way of saying they care enough.

      People will say, “Well, I’m not the founder. I’m not being paid enough to care.” Actually, you are: The knowledge and skills you gain by developing a founder mentality set you up to be a founder down the line; that’s your compensation.

    • Judgment takes experience. It takes a lot of time to build up. You have to put yourself in positions where you can exercise judgment. That’ll come from taking on accountability.

      Leverage is something that society gives you after you’ve demonstrated judgment. You can get it faster by learning high-leverage skills like coding or working with the media. These are permissionless leverage. This is why I encourage people to learn to code or produce media, even if it’s just nights and weekends.

    • find things that interest you and allow you to take on accountability. Don’t worry about short-term compensation. Compensation comes when you’re tired of waiting for it and have given up on it. This is the way the whole system works.

      _HUGE QUOTE HERE, PLEASE PAY ATTENTION_

      Specific knowledge can be timely or timeless

      There are two forms of specific knowledge: timely and timeless.

      If you become a world-class expert in machine learning just as it takes off and you got there through genuine intellectual interest, you’re going to do really well. But 20 years from now, machine learning may be old hat; the world may have moved on to something else. That’s timely knowledge.

      If you’re good at persuading people, it’s probably a skill you picked up early on in life. It’s always going to apply, because persuading people is always going to be valuable. That’s timeless knowledge.

      Timeless specific knowledge usually can’t be taught, and it sticks with you forever. Timely specific knowledge comes and goes; but it tends to have a fairly long shelf life.

    • Companies don’t know how to measure outputs, so they measure inputs instead. Work in a way that your outputs are visible and measurable. If you don’t have accountability, do something different.

  • paper: Accountability Means Letting People Criticize You
  • link: https://nav.al/accountability
  • takeaway:
    • They think accountability means being successfully accountable. No—it means you have to stick your neck out and fail publicly. You have to be willing to let people criticize you.
    • The most interesting parts should be the ones you disagree with
    • Get the free leverage that’s available in tech
    • Don’t refuse to do things just because others can’t do them
    • Realize your philanthropic vision by running a business

  • paper: Example: From Laborer to Entrepreneur
  • link: https://nav.al/laborer-tech
  • takeaway:
    • General contractors get equity, but they’re also taking risk
    • Property developers pocket the profit by applying capital leverage
    • Architects, large developers and REITs are even higher in the stack
    • Company owners apply the maximum leverage

  • paper: How Did We Get Here? The Tangled History of the Second Law of Thermodynamics
  • link: https://arxiv.org/pdf/2311.10722
  • takeaway:
    • What is heat? A history of discovering what heat is:

      | Time         | Person         | Proposal                                         |
      |--------------+----------------+--------------------------------------------------|
      | ~500 BC      | Heraclitus     | everything is made of fire                       |
      | ~460–370 BC  | Democritus     | everything is made of discrete atoms             |
      | 1623         | Galileo        |                                                  |
      | 1620         | Francis Bacon  | heat itself, its essence and quiddity, is motion |
      | 1660         | Robert Boyle   | Boyle’s Law, PV = constant                       |
      | 1687         | Isaac Newton   |                                                  |
      | 1800s        | Joseph Fourier | heat as a material, “caloric”                    |

      The formation of thermodynamics:

      | 1824         | Sadi Carnot    | heat can be analysed without a heat material     |
      | 1845         | Kelvin         |                                                  |

      The history of gas: where does the name “gas” come from?

      It was actually only in the 1640s that any kind of general notion of gas began to emerge—with the word “gas” being invented by the “anti-Galen” physician Jan Baptista van Helmont (1580–1644), as a Dutch rendering of the Greek word “chaos”, that meant essentially “void”, or primordial formlessness.

Thermodynamic free energy(https://en.wikipedia.org/wiki/Thermodynamic_free_energy)

But there it is: by 1852 the Second Law is out in the open, in at least two different forms. The path to reach it has been circuitous and quite technical. But in the end, stripped of its technical origins, the law seems somehow unsurprising and even obvious. For it’s a matter of common experience that heat flows from hotter bodies to colder ones, and that motion is dissipated by friction into heat. But the point is that it wasn’t until basically 1850 that the overall scientific framework existed to make it useful—or even really possible—to enunciate such observations as a formal scientific law.

In the first half of the 1850s the Second Law had in a sense been presented in two ways. First, as an almost “footnote-style” assumption needed to support the “pure thermodynamics” that had grown out of Carnot’s work. And second, as an explicitly-stated-for-the-first-time—if “obvious”—“everyday” feature of nature, that was now realized as having potentially cosmic significance.

In 1854 Clausius was already beginning this process. Perhaps confusingly, he refers to the Second Law as the “second fundamental theorem [Hauptsatz]” in the “mechanical theory of heat”—suggesting it’s something that is proved, even though it’s really introduced just as an empirical law of nature, or perhaps a theoretical axiom:

He starts off by discussing the “first fundamental theorem”, i.e. the First Law. And he emphasizes that this implies that there’s a quantity U (which we now call “internal energy”) that is a pure “function of state”—so that its value depends only on the state of a system, and not the path by which that state was reached. And as an “application” of this, he then points out that the overall change in U in a cyclic process (like the one executed by Carnot’s heat engine) must be zero.

  • The concept of entropy

    proposed by R. Clausius in “The Mechanical Theory of Heat: With Its Applications to the Steam-Engine” and “On Different Forms of the Fundamental Equations of the Mechanical Theory of Heat”.

https://www.eoht.info/page/Rudolf%20Clausius

https://www.eoht.info/page/Famous%20publications

  • The Concept of Ergodicity

    1877, Boltzmann: in equilibrium all possible microscopic states of a system would be equally probable.

Had Boltzmann’s 1872 H theorem proved the Second Law? Was the Second Law—with its rather downbeat implication about the heat death of the universe—even true?

Boltzmann—and Maxwell before him—had introduced the idea of using probability theory to discuss the emergence of thermodynamics and potentially the Second Law.

  • Radiant Heat, the Second Law and Quantum Mechanics

    Max Planck got interested in thermodynamics, and in 1879 wrote a 61-page PhD thesis entitled “On the Second Law of Mechanical Heat Theory”. It was a traditional (if slightly streamlined) discussion of the Second Law, very much based on Clausius’s approach (and even with the same title as Clausius’s 1867 paper)—and without any mention whatsoever of Boltzmann.

And indeed, Erwin Schrödinger (1887–1961), in his 1944 book What Is Life? talked about “negative entropy” associated with life. But he—and many others since—argue that life doesn’t really violate the Second Law because it’s not operating in a closed environment where one should expect evolution to equilibrium. Instead, it’s constantly being driven away from equilibrium, for example by “organized energy” ultimately coming from the Sun.



  • paper: Ten Proofs of the Generalized Second Law
  • link: https://arxiv.org/pdf/0901.3865
  • takeaway:
    • The Ordinary Second Law (OSL) states that the total thermodynamic entropy of the universe is always nondecreasing with time. In a background-free theory such as General Relativity (GR), a “time” is a complete spatial slice, and a “later time” is a complete slice which is entirely in the future of the earlier time slice. The GSL states that the “generalized entropy” of the universe is nondecreasing with time. This generalized entropy is given by the expression `kA/(4Gℏ) + S_out`, where k is Boltzmann’s constant, c = 1 [1], and A is the sum of the areas of all black hole horizons in the universe, while S_out is the ordinary thermodynamic entropy of the system outside of all event horizons. The first term is called the Bekenstein-Hawking entropy (S_BH). Since the horizon area and the outside entropy are time-dependent quantities, each term is defined (like the ordinary entropy) using a complete spatial slice.
    • Another choice is the “Gibbs entropy”, which assigns an entropy to mixed states. A probability mixture over N states has entropy `S = k ∑_i −p_i ln p_i`. Gibbs entropy is very like Shannon Information Entropy.
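
Written out in LaTeX (my transcription of the two definitions above, with c = 1):

  S_{\mathrm{gen}} = \frac{kA}{4G\hbar} + S_{\mathrm{out}},
  \qquad
  S_{\mathrm{Gibbs}} = -k \sum_i p_i \ln p_i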

Well, this paper is very difficult to read, so I just read the definition of the Generalized Second Law.


  • paper: The Shift from Models to Compound AI Systems
  • link: https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
  • takeaway:
    • state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models.
    • Compound AI System as a system that tackles AI tasks using multiple interacting components, including multiple calls to models, retrievers, or external tools. In contrast, an AI Model is simply a statistical model, e.g., a Transformer that predicts the next token in text.
    • In single model development a la PyTorch, users can easily optimize a model end-to-end because the whole model is differentiable. However, compound AI systems contain non-differentiable components like search engines or code interpreters, and thus require new methods of optimization. Optimizing these compound AI systems is still a new research area; for example, DSPy offers a general optimizer for pipelines of pretrained LLMs and other components, while other systems, like LaMDA, Toolformer and AlphaGeometry, use tool calls during model training to optimize models for those tools.
    • Emerging Paradigms
      • Composition Frameworks and Strategies.
      • Automatically Optimizing Quality: DSPy
      • Optimizing Cost: FrugalGPT and AI Gateways
      • Operation: LLMOps and DataOps.



  • paper: Natural Language Can Help Bridge the Sim2Real Gap
  • link: https://arxiv.org/pdf/2405.10020
  • takeaway:
    • vision imitation learning: leverages visual data (such as videos or images) to infer how to perform a task.
    • sim2real: domain randomization, System identification

  • paper: Evolving Virtual Creatures
  • link: https://www.karlsims.com/papers/siggraph94.pdf
  • takeaway:
    • A classic trade-off in the field of computer graphics and animation is that of complexity vs. control.
    • a system has been described that can generate autonomous three-dimensional virtual creatures without requiring cumbersome user specifications, design efforts, or knowledge of algorithmic details.

  • paper: GPU Utilization is a Misleading Metric
  • link: https://trainy.ai/blog/gpu-utilization-misleading
  • takeaway:
    • MFU, or Model FLOPS (Floating Point Operations Per Second) Utilization, is one of the best metrics for understanding GPU performance.
    • What is GPU Utilization, really?
      • Percent of time over the past sample period during which one or more kernels was executing on the GPU
      • GPU Utilization, is only measuring whether a kernel is executing at a given time.
      • However, understanding these metrics is less straightforward than trying to get SM Efficiency as high as possible. If you’re interested in learning more, I’d recommend taking a look at the Pytorch Profiler blog, DCGM docs, Nsight’s kernel profiling guide, and Nsight docs.
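
For reference, MFU is commonly defined as the ratio of achieved to peak throughput (my paraphrase, not a quote from the post):

  \mathrm{MFU} = \frac{\text{model FLOPs per second actually achieved}}{\text{theoretical peak FLOPs per second of the hardware}}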

  • paper: All Models Are Wrong
  • link: https://fs.blog/all-models-are-wrong/
  • takeaway:
    • what is a model? A model is a simplification which fosters understanding.
    • Scientists generally agree that no theory is 100 percent correct. Thus, the real test of knowledge is not truth, but utility. Science gives us power. The more useful that power, the better the science.
    • The world doesn’t have the luxury of waiting for complete answers before it takes action.
    • How Do We Know If A Model Is Useful?
      • Is it a representation of reality?
      • How long has this model been around?
      • Does this model apply to multiple areas?
      • How did this model originate?

  • paper: Mental Models: The Best Way to Make Intelligent Decisions (~100 Models Explained)
  • link: https://fs.blog/mental-models/
  • takeaway:

    The Core Mental Models

    • The Map is Not the Territory
    • Circle of Competence
    • First Principles Thinking
    • Thought Experiment
    • Second-Order Thinking
    • Probabilistic Thinking
    • Inversion
    • Occam’s Razor
    • Hanlon’s Razor

      Book is Great, Should take a look. links: https://fs.blog/mental-models/


  • paper: Growth: Thinking in systems
  • link: https://kamrn.com/blog/growth-thinking-in-systems/
  • takeaway:
    • “A system is never the sum of its parts, it’s the product of their interaction.” – Russell Ackoff

      ./static/loop.png
    • “We can’t impose our will on a system. We can listen to what the system tells us, and discover how its properties and our values can work together to bring forth something much better than could ever be produced by our will alone.” ― Donella H. Meadows

  • paper: How to Write Usefully
  • link: https://paulgraham.com/useful.html
  • takeaway:
    • it tells people something important, and that at least some of them didn’t already know.
    • Useful writing tells people something true and important that they didn’t already know, and tells them as unequivocally as possible.
    • if you write a bad sentence, you don’t publish it. You delete it and try again. Often you abandon whole branches of four or five paragraphs. Sometimes a whole essay.
    • You can’t ensure that every idea you have is good, but you can ensure that every one you publish is, by simply not publishing the ones that aren’t.
    • Importance has two factors. It’s the number of people something matters to, times how much it matters to them.
    • The way to get novelty is to write about topics you’ve thought about a lot.
    • The fourth component of useful writing, strength, comes from two things: thinking well, and the skillful use of qualification.
    • importance + novelty + correctness + strength

  • paper: The Munger Operating System: How to Live a Life That Really Works
  • link: https://fs.blog/munger-operating-system/
  • takeaway:
    • Munger Operating System:
      1. You want to deliver to the world what you would buy if you were on the other end
      2. there is no love that’s so right as admiration-based love, and that love should include the instructive dead.
      3. you’re hooked for lifetime learning, and without lifetime learning you people are not going to do very well. You are not going to get very far in life based on what you already know. You’re going to advance in life by what you’re going to learn after you leave here…
      4. Another thing I think should be avoided is extremely intense ideology, because it cabbages up one’s mind.
      5. I’m not entitled to have an opinion on this subject unless I can state the arguments against my position better than the people do who are supporting it.
      6. game of life in many respects is getting a lot of practice into the hands of the people that have the most aptitude to learn and the most tendency to be learning machines.
      7. You’ll be most successful where you’re most intensely interested.
      8. Learn the all-important concept of assiduity: Sit down and do it until it’s done.
      9. Use setbacks in life as an opportunity to become a bigger and better person. Don’t wallow.

  • paper: Founder Mode
  • link: https://paulgraham.com/foundermode.html
  • takeaway:
    • founder mode vs manage mode: as soon as the concept of founder mode becomes established, people will start misusing it. Founders who are unable to delegate even things they should will use founder mode as the excuse. Or managers who aren’t founders will decide they should try to act like founders.

  • paper: The Art of Finishing
  • link: https://www.bytedrum.com/posts/art-of-finishing/
  • takeaway:
    • As long as you’re working on something, you feel productive. Jumping from project to project gives you a constant stream of “new project energy,” which can feel more invigorating than the grind of finishing a single project
    • There’s a unique satisfaction in seeing a project through to completion that no amount of starting can match. Moreover, unfinished projects carry a mental weight. They linger in the back of your mind, quietly draining your mental energy and enthusiasm.
    • Each unfinished project can chip away at your confidence. Over time, you might start to doubt your ability to complete anything substantial, creating a self-fulfilling prophecy of incompletion.

      Strategies for finishing projects:

      1. Define “Done” from the Start
      2. Instead of aiming for perfection, I’ll aim for “good enough.”
      3. Time-Box My Projects: I’ll give myself a deadline. It doesn’t have to be short, but it should be finite.
      4. Practice Finishing Small Things
      5. Separate Ideation from Implementation:
      6. Celebrate Completions
      7. Embrace Accountability

      it’s about growing as a developer and creator. Each finished project, no matter how small, is a step towards becoming someone who not only starts with enthusiasm but finishes with satisfaction.

Andrew Ng’s advice for DL/ML industry people:

  1. read enough papers (around 15) and replicate their results
  2. dig into the dirty work; in parallel, read a lot of papers
  3. spend your weekends wisely
  4. keep working hard

  • paper: Loss of plasticity in deep continual learning
  • link: https://www.nature.com/articles/s41586-024-07711-7
  • takeaway:
    • standard deep-learning methods lose their ability to learn with extended training on new data, a phenomenon that we call loss of plasticity.
    • although they lose plasticity with extended learning, a simple change enables them to maintain plasticity indefinitely in both supervised and reinforcement learning.
    • Surprisingly, popular methods such as Adam, Dropout and normalization actually increased loss of plasticity. L2 regularization, on the other hand, reduced loss of plasticity in many cases. L2 regularization stops the weights from becoming too large by moving them towards zero at each step (see the sketch below).
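    To make the L2 mechanism concrete, here is a minimal sketch (my own illustration, not code from the paper) of a single SGD step with L2 regularization; the extra term is what pulls weights back towards zero each step. The learning rate, lambda and tensor shapes are placeholders.

    ```python
    import numpy as np

    # One SGD step with L2 regularization (weight decay). Illustration only;
    # lr, lambda_l2 and the toy shapes are placeholders, not from the paper.
    def sgd_step_with_l2(weights, grad, lr=0.01, lambda_l2=1e-4):
        # The lambda_l2 * weights term nudges every weight towards zero,
        # the mechanism the authors credit with reducing loss of plasticity.
        return weights - lr * (grad + lambda_l2 * weights)

    w = np.random.randn(256, 128)   # current weights
    g = np.random.randn(256, 128)   # gradient of the task loss w.r.t. w
    w = sgd_step_with_l2(w, g)
    ```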

  • paper: You can learn AI later
  • link: https://world.hey.com/jason/you-can-learn-ai-later-08fce896
  • takeaway:
    • the best way to learn something is to need that something. Learning when you don’t really need to is a good way to give up early.
    • Learning when there’s something you truly need to do, but can’t, but could, is the right time to figure something out.
    • necessity is the mother of invention, but it’s really the impetus for learning. The time will come, and you can figure it out then
    • Start curious, stay curious, know what it’s capable of, and, when the necessity strikes, figure it out. Until then, ignore the demands and focus on doing what you’re already good at.

  • paper: How Completely Messed Up Practices Become Normal
  • link: http://danluu.com/wat/
  • takeaway:
    • Knowledge is imperfect and uneven

      People don’t automatically know what should be normal, and when new people are onboarded, they can just as easily learn deviant processes that have become normalized as reasonable processes.

      new person joins
      new person: WTF WTF WTF WTF WTF
      old hands: yeah we know we’re concerned about it
      new person: WTF WTF wTF wtf wtf w…
      new person gets used to it
      new person #2 joins
      new person #2: WTF WTF WTF WTF
      new person: yeah we know. we’re concerned about it.

    • In most company cultures, people feel weird about giving feedback. Everyone has stories about a project that lingered on for months or years after it should have been terminated because no one was willing to offer explicit feedback.
    • just do the right thing yourself and ignore what’s going on around you.

    1. Pay attention to weak signals
    2. Resist the urge to be unreasonably optimistic
    3. Teach employees how to conduct emotionally uncomfortable conversations
    4. System operators need to feel safe in speaking up
    5. Realize that oversight and monitoring are never-ending

    • Startups spend a lot of time thinking about growth, and while they’ll all tell you that they care a lot about engineering culture, revealed preference shows that they don't.
    • targets like how to tell if you’re acculturating people so that they don’t ignore weak signals are softer and harder to determine, but that doesn’t mean they’re any less important. People write a lot about how things like using fancier languages or techniques like TDD or agile will make your teams more productive, but having a strong engineering culture is a much larger force multiplier.


  • paper: Richard Feynman and The Connection Machine
  • link: https://longnow.org/essays/richard-feynman-connection-machine/
  • takeaway:
    • cellular automata can be used to simulate physical systems (why?)
    • Feynman always started by asking very basic questions like, "What is the simplest example?" or "How can you tell if the answer is right?" He asked questions until he reduced the problem to some essential puzzle that he thought he would be able to solve.
    • In retrospect I realize that in almost everything that we worked on together, we were both amateurs. In digital physics, neural networks, even parallel computing, we never really knew what we were doing. But the things that we studied were so new that no one else knew exactly what they were doing either. It was amateurs who made the progress.
    • Actually, I doubt that it was "progress" that most interested Richard. He was always searching for patterns, for connections, for a new way of looking at something, but I suspect his motivation was not so much to understand the world as it was to find new ideas to explain. The act of discovery was not complete for him until he had taught it to someone else.

  • paper: The future of European competitiveness
  • link: https://commission.europa.eu/document/download/97e481fd-2dc3-412d-be4c-f152a8232961_en?filename=The%20future%20of%20European%20competitiveness%20_%20A%20competitiveness%20strategy%20for%20Europe.pdf
  • takeaway:
    • Europe’s exporters managed to capture market shares in faster growing parts of the world, especially Asia.
    • The EU is entering the first period in its recent history in which growth will not be supported by rising populations. By 2040, the workforce is projected to shrink by close to 2 million workers each year. We will have to lean more on productivity to drive growth.
    • The top 3 investors in R&I in Europe have been dominated by automotive companies for the past twenty years. It was the same in the US in the early 2000s, with autos and pharma leading, but now the top 3 are all in tech.
    • we are failing to translate innovation into commercialisation, and innovative companies that want to scale up in Europe are hindered at every stage by inconsistent and restrictive regulations.
    • The EU has a unique opportunity to lower the cost of AI deployment by increasing computational capacity and making available its network of high-performance computers
    • The EU should promote cross-industry coordination and data sharing to accelerate the integration of AI into European industry.

  • paper: Tutorial on Diffusion Models for Imaging and Vision
  • link: https://arxiv.org/pdf/2403.18103
  • takeaway:
    • diffusion: a particular sampling mechanism that has overcome some longstanding shortcomings in previous approaches
    • Diffusion models are incremental updates where the assembly of the whole gives us the encoder-decoder structure
    • VAE is largely a one-step generation — if you give us a latent code z, we ask the neural network fθ(·) to immediately return us the generated signal x ∼ N(x | fθ(z), σ²_dec I), in contrast to diffusion’s incremental updates (see the sketch after this entry).
    • The biggest challenges of diffusion models are the consistency with the physical world, let alone their high computational complexity

      *THIS PAPER IS VERY GOOD. IF, LIKE ME, YOU DON'T KNOW ANYTHING ABOUT DIFFUSION MODELS, I HIGHLY RECOMMEND READING IT.*
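    To see how “incremental updates” differ from a VAE’s one-step decode, here is a generic DDPM-style reverse-sampling skeleton (my own sketch, not the tutorial’s derivation; the linear beta schedule and the denoise_fn noise predictor are placeholders).

    ```python
    import numpy as np

    # Generic DDPM-style reverse (denoising) loop: the sample is built by many
    # small updates instead of a single decoder pass.
    T = 1000
    betas = np.linspace(1e-4, 0.02, T)      # toy linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    def ddpm_sample(denoise_fn, shape):
        x = np.random.randn(*shape)                  # start from pure noise x_T
        for t in reversed(range(T)):                 # T small denoising steps
            eps = denoise_fn(x, t)                   # predicted noise eps_theta(x_t, t)
            mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
            noise = np.random.randn(*shape) if t > 0 else 0.0
            x = mean + np.sqrt(betas[t]) * noise
        return x

    # Usage with a dummy predictor (a real model would be a trained network):
    sample = ddpm_sample(lambda x, t: np.zeros_like(x), (8, 8))
    ```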



  • paper: Notes on OpenAI’s new o1 chain-of-thought models
  • link: https://simonwillison.net/2024/Sep/12/openai-o1/
  • takeaway:
    • There’s a lot to understand about these models—they’re not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved “reasoning” capabilities.
    • Most interesting is the introduction of “reasoning tokens”—tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens.
    • Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.

  • paper: Is God a Strange Loop?
  • link: https://johnhorgan.org/cross-check/is-god-a-strange-loop
  • takeaway:
    • Hofstadter agrees with Dennett that consciousness is “not as deep a mystery as it seems” because it is an “illusion.” By this, Hofstadter apparently means that our conscious thoughts and perceptions are often misleading, and they are trivial compared to all the computations whizzing and whirring below the level of our awareness.
    • Our sense of free will is an illusion, too, according to Hofstadter. He told me that he doesn’t feel as though he has truly made any decisions in his life. “I feel like decisions are made for me by the forces inside my brain.”

  • paper: The Brouhaha Over Consciousness and “Pseudoscience”
  • link: https://johnhorgan.org/cross-check/the-brouhaha-over-consciousness-and-pseudoscience
  • takeaway:
    • Integrated information theory, or IIT, which I’ve tracked for years, holds that consciousness arises in any system whose components exchange information in a certain mathematically defined way.
    • A theory is pseudoscientific if it isn’t formulated rigorously enough to be tested and potentially falsified, or proven wrong. A theory can resist falsification if it is so vaguely defined, or has so many variables, that it can “predict” any observation.
    • Integrated information theory falls into this category. It is a general theory of consciousness, which attempts to explain how consciousness arises not just in the brains of humans but in any physical system, which needn’t even be biological.
    • As evolutionary psychologist Robert Trivers points out, we deceive ourselves at least as effectively as we deceive others.
    • We have devised methods for cultivating self-knowledge and quelling our anxieties, such as meditation and psychotherapy. But these practices strike me as forms of self-brainwashing. When we meditate or see a therapist, we are not solving the solipsism problem. We are merely training ourselves to ignore it, to suppress the horror and despair that it triggers.
    • Science cannot be reduced to a formal, logical system or method, Popper says; a scientific theory is an invention, an act of creation, based more upon a scientist’s intuition than upon pre-existing empirical data. “The history of science is everywhere speculative,” Popper says. “It is a marvelous history. It makes you proud to be a human being.” Framing his face in his outstretched hands, Popper intones, “I believe in the human mind.” (quote from [“The Popper Paradox”](https://johnhorgan.org/cross-check/the-paradox-of-karl-popper))

  • paper: Holding a Program in One’s Head
  • link: https://paulgraham.com/head.html
  • takeaway:
    • initially the most important thing is to be able to change what you’re doing. Not just to solve the problem in a different way, but to change the problem you’re solving.
    • Your code is your understanding of the problem you’re exploring. So it’s only when you have your code in your head that you really understand the problem.
    • Avoid distractions.
    • Work in long stretches: Sometimes when you return to a problem after a rest, you find your unconscious mind has left an answer waiting for you.
    • Use succinct languages: The more succinct the language, the shorter the program, and the easier it is to load and keep in your head.
    • Keep rewriting your program
    • Write rereadable code
    • Work in small groups
    • Don’t have multiple people editing the same piece of code
    • Start small
    • But having ideas is not very parallelizable. And that’s what programs are: ideas.
    • But regardless of what the solution turns out to be, the first step is to realize there’s a problem. There is a contradiction in the very phrase “software company.” The two words are pulling in opposite directions. Any good programmer in a large organization is going to be at odds with it, because organizations are designed to prevent what programmers strive for.
    • So if you’re a little startup, this is the place to attack them. Take on the kind of problems that have to be solved in one big brain.

  • paper: How to Measure Progress in a Software Project
  • link: https://rethinkingsoftware.substack.com/p/how-to-measure-progress-in-a-software
  • takeaway:
    • Agile created a new way of measuring progress. Evaluate the state of your software, code some more (but only for a few weeks), then begin the process again.
    • With a customer, it goes like this: code something, get some feedback, code something else, get more feedback. Continue until the customer is satisfied.

  • paper: What Is a Particle?
  • link: https://www.quantamagazine.org/what-is-a-particle-20201112/
  • takeaway:
    • “At the moment that I detect it, it collapses the wave and becomes a particle. … [The particle is] the collapsed wave function.” —Dimitri Nanopoulos
    • “What is a particle from a physicist’s point of view? It’s a quantum excitation of a field. We write particle physics in a math called quantum field theory. In that, there are a bunch of different fields; each field has different properties and excitations, and they are different depending on the properties, and those excitations we can think of as a particle.” —Helen Quinn
    • “Particles are at a very minimum described by irreducible representations of the Poincaré group.” — Sheldon Glashow
    • “Ever since the fundamental paper of Wigner on the irreducible representations of the Poincaré group, it has been a (perhaps implicit) definition in physics that an elementary particle ‘is’ an irreducible representation of the group, G, of ‘symmetries of nature.’” —Yuval Ne’eman and Shlomo Sternberg
    • “Particles have so many layers.” —Xiao-Gang Wen
    • “What we think of as elementary particles, instead they might be vibrating strings.” —Mary Gaillard
    • “Every particle is a quantized wave. The wave is a deformation of the qubit ocean.” —Xiao-Gang Wen
    • “Particles are what we measure in detectors. … We start slipping into the language of saying that it’s the quantum fields that are real, and particles are excitations. We talk about virtual particles, all this stuff — but it doesn’t go click, click, click in anyone’s detector.” —Nima Arkani-Hamed

  • paper: Averaging is a convenient fiction of neuroscience
  • link: https://www.thetransmitter.org/neural-coding/averaging-is-a-convenient-fiction-of-neuroscience/
  • takeaway:
    • Averaging is ubiquitous—and hides from us how the brain works.
    • But neurons don’t take averages. All a neuron gets to work with is the moment-to-moment spikes sent by its inputs. Those inputs carry few, if any, spikes. Each input likely varies its response to the same significant event. Averaging over time hides what a neuron actually sees, what it actually gets to compute with.



  • paper: If the Universe Is a Hologram, This Long-Forgotten Math Could Decode It
  • link: https://www.quantamagazine.org/if-the-universe-is-a-hologram-this-long-forgotten-math-could-decode-it-20240925/
  • takeaway:
    • Along the way, as a young man in 1932, von Neumann rewrote the rules of quantum mechanics, formulating the strange new theory of particles and their fluctuating, probabilistic behavior in the mathematical language used today. Then he went further. He developed a framework known as “operator algebras” to describe quantum systems in a more powerful but more abstract way. Unlike his earlier work on quantum theory, this framework was hard to understand and did not catch on widely in theoretical physics. It was literally a century ahead of its time.
    • Even before von Neumann did his work, Albert Einstein’s theories of relativity merged space and time into a four-dimensional fabric known as “space-time.” Einstein showed that the force of gravity is generated by curves in this fabric. But physicists know that the fabric can’t be the whole story. Dying stars puncture it, creating intensely warped regions called black holes where the equations of general relativity break down. And even in calmer parts of space-time, when you zoom in to the smallest scales, quantum fluctuations seem to shred it apart.
    • “It’s saying gravity is not something separate from regular quantum theory,” said Josephine Suh, a physicist at the Korea Advanced Institute of Science and Technology. “It’s saying that gravity is just a different description of a quantum theory.”
    • Similarly, in quantum systems, entropy is also a measure of your ignorance. It tells you how much information you can’t access because of the entanglement between your quantum system and the world outside.

  • paper: Critical Mass and Tipping Points: How To Identify Inflection Points Before They Happen
  • link: https://fs.blog/critical-mass/
  • takeaway:
    • Critical mass, which is sometimes referred to as tipping points, is one of the most effective mental models you can use to understand the world. The concept can explain everything from viral cat videos to why changing habits is so hard.
    • In nuclear physics, critical mass is defined as the minimum amount of a fissile material required to create a self-sustaining fission reaction
    • In sociology, a critical mass is a term for a group of people who make a drastic change, altering their behavior, opinions or actions.

      “When enough people (a critical mass) think about and truly consider the plausibility of a concept, it becomes reality.” —Joseph Duda


  • paper: Mark Zuckerberg: “Ship the app”
  • link: https://www.techemails.com/p/mark-zuckerberg-ship-photos-app
  • takeaway: Note: This paper was written in 2011, during a period of rapid growth for Instagram. Around this time, Facebook was internally working on building its own photo-sharing app.
    • your note though was that you care more about fixing the team than shipping the app. I think we need to do both, but I do think it’s a crisis that we don’t have a mobile photos app out and I’d prioritize pushing that out as much as possible. Getting the team to a good state is not a milestone by itself that I care about.
    • I guess my basic point is this: I get that your team has issues, so fix them and ship the app. I don’t see why this should be so hard, and I don’t think we should accept any excuses for not getting this done in a short period of time.

  • paper: How your brain detects patterns in the everyday: without conscious thought
  • link: https://www.nature.com/articles/d41586-024-03116-8
  • takeaway:
    • neurons could also anticipate what images would appear next, suggesting that the brain can learn to predict future events on the basis of learnt patterns.

  • paper: When to do what you love
  • link: https://paulgraham.com/when.html
  • takeaway:
    • People pay you for doing what they want, not what you want. But there’s an obvious exception: when you both want the same thing. For example, if you love football, and you’re good enough at it, you can get paid a lot to play it.
    • This is not to say you shouldn’t try though. It depends how much ability you have and how hard you’re willing to work.
    • If you want to make a really huge amount of money — hundreds of millions or even billions of dollars — it turns out to be very useful to work on what interests you the most. The reason is not the extra motivation you get from doing this, but that the way to make a really large amount of money is to start a startup, and working on what interests you is an excellent way to discover startup ideas.
    • if you want to become moderately rich, you can’t usually afford to; but if you want to become super rich, and you’re young and good at technology, working on what you’re most interested in becomes a good idea again.
    • What do you do in the face of uncertainty? Get more certainty. And probably the best way to do that is to try working on things you’re interested in. That will get you more information about how interested you are in them, how good you are at them, and how much scope they offer for ambition.
    • Don’t wait. Don’t wait till the end of college to figure out what to work on. Don’t even wait for internships during college. You don’t necessarily need a job doing x in order to work on x; often you can just start doing it in some form yourself. And since figuring out what to work on is a problem that could take years to solve, the sooner you start, the better.
    • The other thing you do in the face of uncertainty is to make choices that are uncertainty-proof. The less sure you are about what to do, the more important it is to choose options that give you more options in the future. I call this “staying upwind.” If you’re unsure whether to major in math or economics, for example, choose math; math is upwind of economics in the sense that it will be easier to switch later from math to economics than from economics to math.
    • The root of great work is a sort of ambitious curiosity, and you can’t manufacture that.

  • paper: The Rise of Worse is Better
  • link: https://www.dreamsongs.com/RiseOfWorseIsBetter.html
  • takeaway:
    • To such a designer it is important to get all of the following characteristics right:

      Simplicity – the design must be simple, both in implementation and interface. It is more important for the interface to be simple than the implementation.
      Correctness – the design must be correct in all observable aspects. Incorrectness is simply not allowed.
      Consistency – the design must not be inconsistent. A design is allowed to be slightly less simple and less complete to avoid inconsistency. Consistency is as important as correctness.
      Completeness – the design must cover as many important situations as is practical. All reasonably expected cases must be covered. Simplicity is not allowed to overly reduce completeness.

    • The New Jersey guy said that the right tradeoff has been selected in Unix – namely, implementation simplicity was more important than interface simplicity.
    • big complex system scenario vs. diamond-like jewel scenario
    • Implementing the right thing to run fast is either impossible or beyond the capabilities of most implementors. The two scenarios correspond to Common Lisp and Scheme. The first scenario is also the scenario for classic artificial intelligence software. The right thing is frequently a monolithic piece of software, but for no reason other than that the right thing is often designed monolithically. That is, this characteristic is a happenstance.
    • The lesson to be learned from this is that it is often undesirable to go for the right thing first. It is better to get half of the right thing available so that it spreads like a virus. Once people are hooked on it, take the time to improve it to 90% of the right thing.
    • A wrong lesson is to take the parable literally and to conclude that C is the right vehicle for AI software. The 50% solution has to be basically right, and in this case it isn’t.

  • paper: The Computational View of Time
  • link: https://writings.stephenwolfram.com/2024/10/on-the-nature-of-time/
  • takeaway:
    • The Principle of Computational Equivalence implies that our universe is inevitably full of computational irreducibility, which in effect defines a robust notion of the progress of time.
    • In essence, therefore, we experience time because of the interplay between our computational boundedness as observers, and the computational irreducibility of underlying processes in the universe. If we were not computationally bounded, we could “perceive the whole of the future in one gulp” and we wouldn’t need a notion of time at all. And if there wasn’t underlying computational irreducibility there wouldn’t be the kind of “progressive revealing of the future” that we associate with our experience of time.
    • a robust concept of time depends on us being computationally bounded observers. If we were not, then we’d be able to break the Second Law by doing detailed computations of molecular processes, and we wouldn’t just describe things in terms of randomness and heat. And similarly, we’d be able to break the linear flow of time, either jumping ahead or following different threads of time.
    • time is what progresses when one applies computational rules.

  • paper: Observer Theory
  • link: https://writings.stephenwolfram.com/2023/12/observer-theory/
  • takeaway:
    • Central to what we think of as an observer is the notion that the observer will take the raw complexity of the world and extract from it some reduced representation suitable for a finite mind.
    • There’s in a sense a certain duality between computation and observation. In computation one’s generating new states of a system. In observation, one’s equivalencing together different states.
    • But a key concept of observer theory is that it’s possible to make conclusions about an observer’s impression of the world just by knowing about the capabilities—and assumptions—of the observer, without knowing in detail what the observer is “like inside”.
    • At an informational level we might say that there has to be more information processing going on inside than there is flow of information from the outside. Or, in other words, if we’re going to be meaningful “observers like us” we can’t just be bombarded by input we don’t process; we have to have some capability to “think about what we’re seeing”.
    • We’ve talked about observers operating by compressing the complexities of the world to “inner impressions” suitable for finite minds.
    • But now there’s a problem with computational irreducibility. Yes, the rules determine the pattern. But to get from these rules to the actual pattern can require an irreducible amount of computation. And to “reverse engineer the pattern” to find the rules can require even more computation.
    • In effect, perception and measurement tend to do “lossy compression”; analysis is more about “lossless compression” where the equivalencing is effectively not between possible inputs but between possible generative rules.
    • For in our Physics Project, space is ultimately “made” of a network of relations (or connections) between discrete “atoms of space” —that’s progressively being updated in what ends up being a computationally irreducible way. But we as computationally bounded observers can’t “decode” all the details of what’s happening, and instead we end up with a simple “aggregate” narrative, that turns out to correspond to continuum space operating according to the laws of general relativity.
    • In physical space—whether we’re looking at molecules in a fluid or atoms of space—we can think of us operating as observers who are physically large enough to span many underlying discrete elements, so that what we end up observing is just some kind of aggregate, averaged result.
    • And effectively this is what happens in the transition from quantum to classical behavior. Even though there are many possible detailed (“quantum”) threads of history that an object can follow, what we perceive corresponds to a single consistent “aggregate” (“classical”) sequence of behavior.
    • Our Physics Project in a sense brings ideas about the physical and abstract worlds closer—and the concept of the ruliad ultimately leads to a deep unification between them. For what we now imagine is that the physical universe as we perceive it is just the result of the particular kind of sampling of the ruliad made by us as certain kinds of observers.
    • A central feature of our interaction with the ruliad for physics is that observers like us don’t track the detailed behavior of all the various atoms of space. Instead, we equivalence things to the point where we get descriptions that are reduced enough to “fit in our minds”. And something similar is going on in mathematics.
    • Our tendency as observers is always to believe that we can separate our “inner experience” from what’s going on in the “outside world”. But in the end everything is just part of the ruliad. And at the level of the ruliad we as observers are ultimately “made of the same stuff” as everything else.
    • we perceive the universe to be the way we do because we are the way we are as observers. And the most fundamental aspect of observers like us is that we’re doing lots of equivalencing to reduce the “complexity of the world” to “internal impressions” that “fit into our minds”.
    • In an attempt to formalize the “cost of observation” we’ll inevitably have to make what seem like arbitrary choices, just as we would in setting up a scheme to determine when an ongoing computational process has “generated an answer”. But if we assume a certain boundedness to our choices, we can expect that we’ll be able to draw definite conclusions, and in effect be able to construct an analog of computational complexity theory for processes of observation.

  • paper: Kernighan’s lever
  • link: https://www.linusakesson.net/programming/kernighans-lever/index.php
  • takeaway:
    • Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
    • It is tempting to interpret Kernighan’s aphorism as a warning: Stay away from clever techniques, it seems to say, because if you write clever code, you will never be able to get it to work. But this interpretation is unfortunate, and rests on the false assumption that cleverness is static.
    • Having written code as cleverly as you can, you will suddenly face a problem that you are not clever enough to solve. Certainly, “clever” in this context does not refer to some innate talent, because nobody is born with the ability to write clever code in the first place. The “cleverness” required to write and understand intricate code is an acquired mental skill.
    • Skill is the result of practice, that is, of systematically trying to work slightly beyond one's ability. Quite understandably, most of us don't spend that kind of effort unless we have good reason to. Hence, without motivation we do not practise, but simply cruise along at our current level and never improve any further.
    • The mind is very good at rationalising, and will convince us that our current skills are sufficient, that we are all "good enough"; certainly better than the average programmer anyway. The human brain will do this trick regardless of our actual level of skill. So while we all tend to consider ourselves sufficiently skilled right now, we never regret improving.
    • Kernighan’s witty remarks provide a clue: In programming, as soon as you work at your current level, you will automatically end up in a situation where you have to work beyond your current level. By means of this very fortunate mechanism, you will leverage several basic human drives (honour, pride, stubbornness, curiosity) into providing the motivation necessary for improvement.
    • Kernighan’s lever.: By putting in a small amount of motivation towards the short-term goal of implementing some functionality, you suddenly end up with a much larger amount of motivation towards a long term investment in your own personal growth as a programmer.

  • paper: What Is Consciousness? Some New Perspectives from Our Physics Project
  • link: https://writings.stephenwolfram.com/2021/03/what-is-consciousness-some-new-perspectives-from-our-physics-project/
  • takeaway:
    • consciousness, if thought about in enough generality, is just a feature of computational sophistication, and therefore quite ubiquitous.
    • The universe in our models is full of sophisticated computation, all the way down. At the lowest level it’s just a giant collection of “atoms of space”, whose relationships are continually being updated according to a computational rule. And inevitably much of that process is computationally irreducible, in the sense that there’s no general way to “figure out what’s going to happen” except, in effect, by just running each step.
    • So what about physical space? The traditional view had been that space was something that could to a large extent just be described as a coherent mathematical object. But in our models of physics, space is actually made of an immense number of discrete elements whose pattern of interconnections evolves in a complex and computationally irreducible way.
    • if there’s underlying computational irreducibility—plus causal invariance—then any observer who forms their perception of the universe in a computationally bounded way must inevitably perceive the universe to follow the laws of general relativity.
    • quantum mechanics is again something that emerges as a result of trying to form a coherent perception of the universe.

  • paper: the quiet art of attention
  • link: https://billwear.github.io/art-of-attention.html
  • takeaway:
    • In this quiet observation, we begin to see patterns. The mind leaps from one thing to another, rarely resting. It is caught in a web of habits, most of which we never consciously chose. But, once we notice this, a door opens. There is space, however small, between the thoughts. And in that space, if we are patient, we can decide how to respond rather than being dragged along by every impulse or fear. This is not about control in the traditional sense, but about clarity. To act, not from reflex, but from intent.
    • It is a simple beginning, but one of great consequence. For when we reclaim our attention, even in this small way, we are no longer mere passengers on the journey. We become, in a sense, our own guides.

      As we grow in this practice of attention, something else becomes clear: much of what occupies our thoughts is unnecessary. The mind is cluttered, filled with concerns that seem urgent but, on closer inspection, do little to serve our deeper well-being. Simplification is not just a matter of decluttering our physical surroundings—it is a way of thinking, of living. As we quiet the noise within, we see more clearly what truly matters. We focus, not on everything, but on the essentials. We pare down, not by force, but by choice.

    • This process of simplification is not an escape from complexity. It is, in fact, a way of engaging with it more meaningfully. There are things in life that are intricate, yes, but not everything needs our attention at once. What truly requires our effort can be approached in small steps, in manageable pieces. The mind works best when it is focused on one thing at a time, when it is allowed to give itself fully to the task at hand. In this way, the most complex of undertakings becomes simple, not because it is easy, but because we have allowed it to unfold naturally, one step after the other.
    • But in this process, we must remember something important: life is not meant to be rushed through. It is not a race, nor is it a problem to be solved. It is an experience to be lived, and living well requires presence. To focus on one thing deeply, to give it your full attention, is to experience it fully. And when we do this, something remarkable happens. Time, which so often feels like it is slipping through our fingers, begins to slow. Moments become rich, textured. Even the simplest of tasks takes on a new significance when approached with care, with attention.

      This is the quiet art of living well. It does not demand that we abandon the world, but that we engage with it more mindfully. It asks that we slow down, that we look more closely, that we listen more carefully. For in doing so, we discover that much of what we seek—clarity, peace, even strength—was always within reach. It was simply waiting for us to stop, to pay attention, and to begin again with intention.

    • The mind, like a garden, requires tending. It needs patience, a steady hand, and, above all, consistency. There will be days when it seems unruly, when old habits return, and when focus feels elusive. But these days, too, are part of the process. Each small effort, each moment of renewed attention, builds upon the last. Over time, these moments accumulate, and what was once difficult becomes second nature.
    • And so, the journey to mastery of the mind begins not with grand gestures but with the simplest of practices: the practice of paying attention. Attention to the present, attention to what truly matters, and attention to the quiet spaces in between. In this way, step by step, thought by thought, we move closer to that elusive state of clarity, of peace, and of freedom.
