This contains notes for paper-ish documents that I read.
1 Evolution/Complexity
1.1 READ Networks and history
CUSTOM_ID: bearman2002networks YEAR: 2002 AUTHOR: Bearman, Peter and Moody, James and Faris, Robert
This was a really nice read. I suspect it's not that nice for insiders though. Anyway, this presents a way to create a case (in social sciences, a group of interconnected events specifying causalities of some sort) using a few tricks from network science.
Something a little counterintuitive for me was the idea that only robust nets (let's coin a better term) are important, since they carry on and are the only meaningful things when you look back. I don't know enough, but there are probably two ways of looking at something like a historical event.
- What actually happened. I believe happenstances play roles here but shouldn't drive derivations of general rules of thumb.
- What happens, taken as a general, most probable (?), rule. The paper focuses on this piece by presenting ways of finding non-chancy causes of events.
The problem is, I don't know which is more important to know about.
1.2 READ The causes of evolvability and their evolution
CUSTOM_ID: payne2018causes YEAR: 2018 AUTHOR: Payne, Joshua L and Wagner, Andreas
A very recent review on the subject with a bunch of experimental results. It talks about three "major causes of evolvability":
- Phenotype heterogeneity
- Robustness
- Adaptive landscapes
And
Whether they often evolve because they confer evolvability remains a particularly challenging open question.
1.3 READ Is evolvability evolvable?
CUSTOM_ID: pigliucci2008evolvability YEAR: 2008 AUTHOR: Pigliucci, Massimo
Setting aside the various definitions of evolvability, we look at evolvability as some sort of hyperparametrically derived quantity. These hyperparameters define the phylogenetic space the search will continue on in the future, so there are the usual two arguments for the evolution of evolvability in the selection setting:
- as a side effect
- targeted (sliding towards teleology)
Since this is a survey/opinion, there aren't many technicalities here.
1.4 READ Robustness and evolvability: a paradox resolved
CUSTOM_ID: wagner2007robustness YEAR: 2007 AUTHOR: Wagner, Andreas
Here we put up definitions for robustness and evolvability for sequences (genotypes) and structures (phenotypes). The main conclusion is that these two quantities are negatively correlated for genotypes, but support each other in the case of phenotypes. There are a few quantitative results in the setting of RNA sequences and the structures they form.
The question I am interested in is, how much can this be generalized to arbitrary levels in arbitrary systems? The key property needed to get this working is:
…even though structure robustness increases modestly with structure frequency, this increase is much smaller than the vast increase in the number of different structures accessible found near a much larger neutral network.
which gives
…the populations with the highly robust phenotype are more diverse, and this increased diversity is much greater than the decreased diversity around any one sequence.
1.5 READ Robustness and evolvability
CUSTOM_ID: masel2010robustness YEAR: 2010 AUTHOR: Masel, Joanna and Trotter, Meredith V
A kind of review of the ideas behind evolutionary robustness. I got a few pointers and terminology to follow from this paper.
1.6 READ How learning can guide evolution
CUSTOM_ID: hinton1987learning YEAR: 1987 AUTHOR: Hinton, Geoffrey E and Nowlan, Steven J
Simulation of a minimalistic system for explaining the idea behind the searching power of evolution + learning. Look here for an argument against the specific example taken.
1.7 READ Coevolution to the edge of chaos: coupled fitness landscapes, poised states, and coevolutionary avalanches
CUSTOM_ID: kauffman1991coevolution YEAR: 1991 AUTHOR: Kauffman, Stuart A and Johnsen, Sonke
This one uses the NK model to experiment with coevolution. The main idea is that you can couple one NK landscape to another using a factor similar to K, called C, which defines how much the other affects this guy. Sounds like a reasonable model to represent the essence of coevolving species. An important hint that we get is that if a metadynamics is present to select the value of K, then that moves it to an attractor state where changes in the system cause avalanches resembling the sandpile model from bak1988self.
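To keep the picture in my head, here is a tiny sketch (my own, not code from the paper; the adjacent-neighbour epistasis, the parameter values and all names are assumptions) of an NKC-style fitness evaluation where each locus's contribution depends on K loci of its own genome and C loci of the partner species:
import random

N, K, C = 8, 2, 1  # genome length, intra-genome epistasis, inter-species coupling

def make_contrib_fn(seed):
    """Random fitness contribution per locus, cached so repeated lookups stay consistent."""
    rng = random.Random(seed)
    table = {}
    def contrib(locus, own_bits, partner_bits):
        key = (locus, own_bits, partner_bits)
        if key not in table:
            table[key] = rng.random()  # uniform [0, 1), as in the NK family of models
        return table[key]
    return contrib

def fitness(genome, partner, contrib):
    total = 0.0
    for i in range(N):
        own = tuple(genome[(i + j) % N] for j in range(K + 1))    # self + K neighbours
        other = tuple(partner[(i + j) % N] for j in range(C))     # C loci of the partner
        total += contrib(i, own, other)
    return total / N

contrib_a, contrib_b = make_contrib_fn(0), make_contrib_fn(1)
a = [random.randint(0, 1) for _ in range(N)]
b = [random.randint(0, 1) for _ in range(N)]
print(fitness(a, b, contrib_a), fitness(b, a, contrib_b))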
1.8 READ Computation at the edge of chaos: phase transitions and emergent computation
Custom_ID: langton1990computation AUTHOR: Langton JOURNAL: Physica D: Nonlinear Phenomena YEAR: 1990 VOLUME: 42 PAGES: 12--37
The question here focuses on how to get rules capable of computation in CAs. Specifically, we are looking at environments which characterize rules that allow:
- Storage of information
- Transmission
- Interaction between the above two
Intuitively, as the rule's output entropy increases, we move from a very simple output (more storage) to output with randomness (more transmission). In between these two lies the region with the right amount of signal and noise and very large transients, and this is where most of the interesting events take place.
An interesting idea is the definition of the \(\lambda\) parameter (which helps in categorizing the rules); it is essentially a statistic of the distribution of the rule table's outputs, the fraction of transitions that map to a non-quiescent state.
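To make λ concrete, a small sketch (my own example, not from the paper) computing it for a binary rule table, using elementary CA rule 110 and state 0 as the quiescent state:
def lam(rule_table, quiescent=0):
    """Langton's lambda: fraction of rule-table entries mapping to a non-quiescent state."""
    outputs = list(rule_table.values())
    return sum(1 for o in outputs if o != quiescent) / len(outputs)

# Elementary CA rule 110 as a mapping from 3-cell neighbourhoods to the next state.
rule110 = {(a, b, c): (110 >> (a * 4 + b * 2 + c)) & 1
           for a in (0, 1) for b in (0, 1) for c in (0, 1)}
print(lam(rule110))  # 5 of 8 transitions are non-quiescent -> 0.625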
1.9 Self-organized criticality
Custom_ID: bak1988self AUTHOR: Bak, Tang \& Wiesenfeld JOURNAL: Physical review A YEAR: 1988 VOLUME: 38 PAGES: 364
1.10 READ Revisiting the edge of chaos: Evolving cellular automata to perform computations
Custom_ID: mitchell1993revisiting AUTHOR: Mitchell, Hraber \& Crutchfield JOURNAL: arXiv preprint adap-org/9303003 YEAR: 1993
The edge of chaos idea is pretty popular and used to explain many phenomena. A short article criticizing that is here. This is one of the papers that tried to (kind of) debunk an experiment (packard1988adaptation; this was in my reading list for a long time) which claimed that evolving (in the GA sense) a CA to solve computational problems pushes it towards the edge of chaos.
It's pretty easy to see the issue, since a solution to a specific problem (they took majority classification) is going to have a specific λ, and that's going to be what it is, regardless of where the critical λ lies.
Other than that, this paper has some nice explanations and insights for the results from the GA. One neat trick that I haven't seen much (though I haven't seen much) is keeping the number of elites high and changing the evaluation function on each generation. This looks like a more practical way to use GAs for evaluation over real data sets. I also like the trick where you stop at a variable number of generations to avoid getting a rule which gets the right answer by alternating between 0s and 1s.
1.11 READ Optimization by Self-Organized Criticality
Custom_ID: hoffmann2018optimization AUTHOR: Hoffmann \& Payton JOURNAL: Scientific reports YEAR: 2018 VOLUME: 8 PAGES: 2358
I believe it is not using SoC in the strict sense. The key is the generation of test patterns. Using the sandpile model, we get a reasonable exploration/exploitation trade-off. Also, two avalanches are less likely to occur on overlapping patches (I am going by hunches on this so can be wrong), so it also provides a more coordinate-descent-ish behavior than the regular random patch thing. Not sure if we can say that SoC is specifically helping here.
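To remind myself what the generator actually does, a minimal sketch (my own; the grid size and toppling threshold are arbitrary) of the Abelian sandpile dynamics, where each dropped grain can trigger an avalanche whose footprint would serve as the test patch:
import random

SIZE, THRESHOLD = 16, 4

def drop_and_topple(grid):
    """Drop one grain at a random site and relax; return the set of toppled sites (the 'patch')."""
    r, c = random.randrange(SIZE), random.randrange(SIZE)
    grid[r][c] += 1
    patch = set()
    unstable = [(r, c)] if grid[r][c] >= THRESHOLD else []
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < THRESHOLD:
            continue
        grid[i][j] -= THRESHOLD
        patch.add((i, j))
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < SIZE and 0 <= nj < SIZE:   # grains falling off the edge are lost
                grid[ni][nj] += 1
                if grid[ni][nj] >= THRESHOLD:
                    unstable.append((ni, nj))
    return patch

grid = [[random.randrange(THRESHOLD) for _ in range(SIZE)] for _ in range(SIZE)]
sizes = [len(drop_and_topple(grid)) for _ in range(10000)]
# After a burn-in, the avalanche sizes roughly follow a power law (the 1/f-ish signature).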
There are two things. The first is that this is better than the random approach (consider the random patch variant, since only that is fairly comparable). This probably needs a lot more test cases or some theoretical justification.
The second is about the optimality of the sandpile approach. What about other non-1/f distributions? I don't know which generating mechanisms could be employed to produce the test patterns, but fishing around a bit tells me that this purity of distribution is not that well justified (consider, for example, the recent broido2018scale). The point being: if you fix an annealing schedule for simulated annealing based on some natural observation, that doesn't:
- create a parameter-less solver, and
- justify the natural observation to be the optimal
All said, I liked the thought of a random object (?) generator which does better than the regular approach in the general case. If there indeed is such a generator, this could work as an off-the-shelf technique replacing uniform random search.
1.12 At the edge of chaos: Real-time computations and self-organized criticality in recurrent neural networks
Custom_ID: bertschinger2005edge AUTHOR: Bertschinger, Natschl\"ager \& Legenstein JOURNAL: YEAR: 2005 PAGES: 145--152
2 AI/ML
2.1 READ Speaker diarization with lstm
CUSTOM_ID: wang2018speaker YEAR: 2018 AUTHOR: Wang, Quan and Downey, Carlton and Wan, Li and Mansfield, Philip Andrew and Moreno, Ignacio Lopz
d-vector + spectral clustering.
2.2 READ Utterance-level Aggregation for Speaker Recognition in the Wild
CUSTOM_ID: xie2019utterance YEAR: 2019 AUTHOR: Xie, Weidi and Nagrani, Arsha and Chung, Joon Son and Zisserman, Andrew
Approached from more of a background reading perspective. Main idea is to use NetVLAD, GhostVLAD style (don't know much about these at the moment) aggregation across time instead of regular temporal average pooling.
2.3 READ pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems.
CUSTOM_ID: bredin2017pyannote YEAR: 2017 AUTHOR: Hervé {Bredin}
Useful read for knowing metrics used in segmentation and diarization.
2.4 READ Fully Supervised Speaker Diarization
CUSTOM_ID: zhang2018fully YEAR: 2018 AUTHOR: Aonan {Zhang} and Quan {Wang} and Zhenyao {Zhu} and John {Paisley} and Chong {Wang}
A relatively recent work which has two benefits:
- Allows unspecified number of speakers in an audio
- Learns the clustering over speaker embeddings in a supervised way
I am not totally clear on the parameter estimation part during my first pass but the code, which is here, should help.
2.5 READ Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge.
CUSTOM_ID: sell2018diarization YEAR: 2018 AUTHOR: Sell, Gregory and Snyder, David and McCree, Alan and Garcia-Romero, Daniel and Villalba, Jes{\'u}s and Maciejewski, Matthew and Manohar, Vimal and Dehak, Najim and Povey, Daniel and Watanabe, Shinji and others
I have been looking over this to get started and know about diarization a bit. Although I got a few concepts and terminologies, this assumes you already know your way around. There are probably better pieces if you want to clear up the basics. Here is a nice resource by the way.
2.6 READ On the Cross-lingual Transferability of Monolingual Representations
CUSTOM_ID: mikel2019cross YEAR: 2019 AUTHOR: Mikel Artetxe and Sebastian Ruder and Dani Yogatama
- The idea of learning embeddings to fit in with a set of layers trained for another language can mostly be used in other kinds of models too.
- There was an interesting degradation in Hindi (+ Turkish) with positional embeddings on XQuAD (recoverable with added adapters). I am wondering whether this is because transferring syntactic abstractions is more challenging than transferring semantic abstractions.
2.7 READ Towards end-to-end spoken language understanding
CUSTOM_ID: serdyuk2018towards YEAR: 2018 AUTHOR: Serdyuk, Dmitriy and Wang, Yongqiang and Fuegen, Christian and Kumar, Anuj and Liu, Baiyang and Bengio, Yoshua
Simple results (plain intent-ish classification) on going directly from audio features to intent. An important decision, I believe, is to have bigger semantic chunks to recur on since audio is very sample heavy.
2.8 READ wav2vec: Unsupervised Pre-training for Speech Recognition.
CUSTOM_ID: schneider2019wav2vec YEAR: 2019 AUTHOR: Steffen {Schneider} and Alexei {Baevski} and Ronan {Collobert} and Michael {Auli}
Not exactly what I assumed an x2vec would be with x = audio. Anyway, the idea is to have a language-model-ish system for audio frames which acts as the featurizer for downstream tasks like, here, speech recognition. The gains are decent. There are a few good points covered in between which drive decisions while working with audio. Though I wonder what the reasons were for not going with spectrums [1] if they are replacing the log-mel input in a regular speech recognizer.
2.9 READ Snuba: automating weak supervision to label training data
CUSTOM_ID: varma2018snuba YEAR: 2018 AUTHOR: Varma, Paroma and R{\'e}, Christopher
This is a logical extension of ratner2017snorkel. Instead of users writing heuristics, we go one level farther and just provide primitives (semantically meaningful feature chunks). There are three components:
- A synthesizer that creates heuristics based on a (small) labelled dataset.
- A pruner that picks good heuristics, based on certain definitions and constraints.
- A verifier which closes the loop by deciding when to stop, what to feed to synthesizer etc.
While there are obvious upgrades in how we are doing everything, the general architecture reminds me much of classical rule learning systems like LCS.
2.10 READ Distributed representations of sentences and documents
CUSTOM_ID: le2014distributed YEAR: 2014 AUTHOR: Le, Quoc and Mikolov, Tomas
Document vectorization paper following the general series of word2vec ones.
2.11 READ One neuron is more informative than a deep neural network for aftershock pattern forecasting
CUSTOM_ID: mignan2019one YEAR: 2019 AUTHOR: Mignan, Arnaud and Broccardo, Marco
Got from r/MachineLearning. Title kind of says what is there in the paper. Even then, I would recommend skimming it just because the model is drastically simpler than the neural-net they are comparing to.
2.12 READ Generalized end-to-end loss for speaker verification
CUSTOM_ID: wan2018generalized YEAR: 2018 AUTHOR: Wan, Li and Wang, Quan and Papir, Alan and Moreno, Ignacio Lopez
Leaving the speaker verification part aside, this presents a way to train embeddings with membership constraints so that items for one identity are grouped together and are easy to separate from the rest.
2.13 READ The Secret Sharer: Evaluating and testing unintended memorization in neural networks
CUSTOM_ID: carlini2019secret YEAR: 2019 AUTHOR: Carlini, Nicholas and Liu, Chang and Erlingsson, {\'U}lfar and Kos, Jernej and Song, Dawn
This tries to formalize the problem of unintended memorization in neural networks. Notice the emphasis. There are many different ways to interpret memorization, and the authors here are only concerned with cases where (say) something like a private sequence gets sucked into the memory and Mallory is able to extract such pieces with reasonably common attacks.
The important bit is their metric, called exposure, which quantifies how easy it is to get a memorized piece of information out by playing around with the model API.
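As far as I can tell, exposure is computed from the rank of the inserted canary among all candidate sequences under the model's perplexity; a rough sketch (the log_perplexity scorer is a placeholder I made up):
import math

def exposure(secret, candidates, log_perplexity):
    """exposure(s) = log2 |R| - log2 rank(s), where rank is by model log-perplexity over R."""
    ranked = sorted(candidates, key=log_perplexity)  # lower perplexity means more likely memorized
    rank = ranked.index(secret) + 1                  # 1-indexed rank of the inserted canary
    return math.log2(len(candidates)) - math.log2(rank)

# If the canary ranks first among 10^6 candidate fill-ins, exposure is about log2(1e6) = 19.9 bits.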
2.14 READ Transfer learning from speaker verification to multispeaker text-to-speech synthesis
CUSTOM_ID: jia2018transfer YEAR: 2018 AUTHOR: Jia, Ye and Zhang, Yu and Weiss, Ron and Wang, Quan and Shen, Jonathan and Ren, Fei and Nguyen, Patrick and Pang, Ruoming and Moreno, Ignacio Lopez and Wu, Yonghui and others
Real-Time-Voice-Cloning project pointed me here. Since I don't know much about speech synthesis at the moment, this also was a nice intro to the current modular breakdown. Three components are involved here:
- Discriminative speaker encoder. Trained on US English search voice data.
- Synthesizer. Takes text to spectrogram, conditioned on speaker encoding.
- Vocoder. Takes spectrograms to audio. Wavenet based.
2.15 READ What’s your ML Test Score? A rubric for ML production systems
CUSTOM_ID: breck2016s YEAR: 2016 AUTHOR: Breck, Eric and Cai, Shanqing and Nielsen, Eric and Salib, Michael and Sculley, D
This is a good guide to follow if you work in a production ML setting.
2.16 READ Mixing dirichlet topic models and word embeddings to make lda2vec
CUSTOM_ID: moody2016mixing YEAR: 2016 AUTHOR: Moody, Christopher E
While I like the results, I am wondering which pieces were useful, which were not, and how things compare to other techniques.
2.17 READ Who needs words? lexicon-free speech recognition
CUSTOM_ID: likhomanenko2019needs YEAR: 2019 AUTHOR: Likhomanenko, Tatiana and Synnaeve, Gabriel and Collobert, Ronan
Took it from the wav2letter++ repo. Nothing very specific to comment on. Mostly a results paper. I like how well ConvLM does though.
2.18 READ wav2letter++: The fastest open-source speech recognition system
CUSTOM_ID: pratap2018wav2letter YEAR: 2018 AUTHOR: Pratap, Vineel and Hannun, Awni and Xu, Qiantong and Cai, Jeff and Kahn, Jacob and Synnaeve, Gabriel and Liptchinsky, Vitaliy and Collobert, Ronan
Although it might not be as friendly, I like the focus on architecture and types based guarantees in general. Kaldi just feels annoying at times.
2.19 READ Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces
CUSTOM_ID: coucke2018snips YEAR: 2018 AUTHOR: Coucke, Alice and Saade, Alaa and Ball, Adrien and Bluche, Th{\'e}odore and Caulier, Alexandre and Leroy, David and Doumouro, Cl{\'e}ment and Gisselbrecht, Thibault and Caltagirone, Francesco and Lavril, Thibaut and others
Document on how snips does their SLU in general. A nice thing that I didn't expect was focus on the dynamic LM part. It makes sense to make updates easy and quick on-premise as compared to keeping things mostly frozen.
2.20 READ Supervising strong learners by amplifying weak experts
CUSTOM_ID: christiano2018supervising YEAR: 2018 AUTHOR: Christiano, Paul and Shlegeris, Buck and Amodei, Dario
This is the Iterated Amplification paper. The general idea is that harder problems can't be solved directly in a stable way, so we want to use expert assistance for breaking things down into pieces. Then an interactive process lets the to-be-trained system learn both from the way things are broken down and from the way answers are constructed. Since this is one of the initial works, the overall framework might change a little.
Simple algorithmic examples are provided. It will be interesting to see attempts towards the candidate problems which are beyond human intelligence. Haven't really followed the thread so don't know if there already is something done in this direction.
2.21 READ Frustratingly easy domain adaptation
CUSTOM_ID: daume2009frustratingly YEAR: 2009 AUTHOR: Daum{\'e} III, Hal
Tells you how far good insights go. The paper is really simple to follow so not writing anything here.
2.22 READ Green AI
CUSTOM_ID: schwartz2019green YEAR: 2019 AUTHOR: Schwartz, Roy and Dodge, Jesse and Smith, Noah A. and Etzioni, Oren
A noble idea to push for.
I sometimes feel bad that things which are inherently good need to be pushed out in a gamified way for action to be taken.
2.23 READ Probing Neural Network Comprehension of Natural Language Arguments
CUSTOM_ID: niven2019probing YEAR: 2019 AUTHOR: Niven, Timothy and Kao, Hung-Yu
Was up on r/ml. The abstract is short and clear enough. The idea is that ARCT tasks have simple statistical cues which contribute in a major way to whatever SOTA we are getting. Once you balance them out, even strong models like BERT take big hits and drop to essentially random-ish performance.
2.24 READ Alignment in dialogue
CUSTOM_ID: garrod2007alignment YEAR: 2007 AUTHOR: Garrod, Simon and Pickering, Martin J
Picked up this because I wanted to get general background of alignment as a linguistic term. A few points I pulled out:
- Common ground (stuff believed to be shared) is stricter than alignment which only refers to the information that happens to be shared.
- Ways of alignment:
- via beliefs about one's interlocutor
- via imitation
- via agreements between interlocutors
- via feedback
- via physical co-presence
2.25 READ Statistical user simulation with a hidden agenda
CUSTOM_ID: schatzmann2007statistical YEAR: 2007 AUTHOR: Schatzmann, Jost and Thomson, Blaise and Young, Steve
This was pointed to by papangelis2019collaborative as a way of modeling users. Two good ideas here:
- Stack based agenda and the general decomposition of the process.
- Tractability piece where we try to put assumptions on various factors like transition probabilities etc.
2.26 READ Collaborative Multi-Agent Dialogue Model Training Via Reinforcement Learning
CUSTOM_ID: papangelis2019collaborative YEAR: 2019 AUTHOR: Papangelis, Alexandros and Wang, Yi-Chia and Molino, Piero and Tur, Gokhan
This is the paper which came out with the release of Uber's Plato. Here is what you need to know really:
Using DSTC2 as seed data, we trained NLU and NLG networks for each agent and let the agents interact and learn online optimal dialogue policies depending on their role (seeker or provider).
2.27 READ Building a conversational agent overnight with dialogue self-play
CUSTOM_ID: shah2018building YEAR: 2018 AUTHOR: Shah, Pararth and Hakkani-T{\"u}r, Dilek and T{\"u}r, Gokhan and Rastogi, Abhinav and Bapna, Ankur and Nayak, Neha and Heck, Larry
Nice ideas in here plus insights for practical systems like the following,
Covering complex interactions is important when developing datasets to benchmark research aimed towards building human-level dialogue systems. However, we argue that for consumer-facing chatbots, the primary aim is reliable coverage of critical user interactions.
The generated dataset is here by the way.
2.28 READ Entity-Aware Language Model as an Unsupervised Reranker
CUSTOM_ID: rasooli2018entity YEAR: 2018 AUTHOR: Rasooli, Mohammad Sadegh and Parthasarathy, Sarangarajan
A few nice ideas. One was to autogenerate an n-best list for a given true text using phonetic similarity and subsequent LM reranking. The overall idea is to somehow introduce the relation between potential entities in the text while ranking alternatives. The exact approach looks a little too involved and I would like to know more about how and why the decisions were taken, even though whatever they did sounds intuitive. Backstories, people.
2.29 READ Bayesian learning via stochastic gradient Langevin dynamics
CUSTOM_ID: welling2011bayesian YEAR: 2011 AUTHOR: Welling, Max and Teh, Yee W
The update equation here is a minibatched SGD update with an added Gaussian noise term \(\eta_{t} \sim N(0, \epsilon_{t})\):
\[ \Delta\theta_{t} = \frac{\epsilon_{t}}{2} \left(\nabla \log p (\theta_{t}) + \frac{N}{n} \sum_{i=1}^{n} \nabla \log p (x_{ti} | \theta_{t}) \right) + \eta_{t} \]
Main results involve showing that, when the rate \(\epsilon\) decays following certain properties, the noise due to minibatch dominates in the initial phase giving us normal SGD while the \(\eta\) noise dominates in the later phase which basically lets us sample from the posterior of \(\theta\).
The question of when to say we are in the sampling phase (so that we can start collecting samples, taking the SGD phase as burn-in) is also answered though I am missing some statistical tooling at the moment to appreciate it.
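A toy numpy sketch of the update (my own; the Gaussian-mean model, the prior, and the \(\epsilon_{t} = a(b+t)^{-\gamma}\) schedule constants are choices I made, not values from the paper):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)    # data; posterior of theta should centre near 2
N, n = len(x), 32                                # full data size and minibatch size

def grad_log_prior(theta):       # N(0, 10^2) prior
    return -theta / 100.0

def grad_log_lik(theta, batch):  # unit-variance Gaussian likelihood
    return np.sum(batch - theta)

theta, samples = 0.0, []
a, b, gamma = 1e-3, 10.0, 0.55                   # step sizes eps_t = a * (b + t)^(-gamma)
for t in range(1, 5001):
    eps = a * (b + t) ** (-gamma)
    batch = x[rng.choice(N, size=n, replace=False)]
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik(theta, batch)
    theta += 0.5 * eps * grad + rng.normal(scale=np.sqrt(eps))  # injected noise ~ N(0, eps)
    if t > 2500:                 # crude burn-in cutoff; the paper gives a principled test
        samples.append(theta)
print(np.mean(samples), np.std(samples))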
2.30 READ Learning by analogy: Formulating and generalizing plans from past experience
CUSTOM_ID: carbonell1983learning YEAR: 1983 AUTHOR: Carbonell, Jaime G
The summary overall is solving problems \(\equiv\) learning to solve problems. This is probably in general applicable to all analogy based methods but here the idea is also to apply learnings from one problem/domain to others. Two key components are involved here:
- Apply a generic problem solver (in the classical planning sense) to higher order problems like reducing a past solution to a new solution for another problem.
- A learning system which helps in learning parameters for a memory table which indexes actions based on the effects they produce.
As with many older papers, a lot of the deliberate design choices here are probably now parametrized and learned.
2.31 READ Large-Scale Long-Tailed Recognition in an Open World
CUSTOM_ID: liu2019large YEAR: 2019 AUTHOR: Liu, Ziwei and Miao, Zhongqi and Zhan, Xiaohang and Wang, Jiayun and Gong, Boqing and Yu, Stella X
Here we are grouping the following three tasks, their losses, metrics etc. in one:
- Regular (a little imbalanced) classification
- Few shot classification
- Out of Domain classification
Even though the system is a full network, there are planned components in there. Two are noteworthy:
- A distance metric (reachability in the paper) that tells how different an instance is compared to seen examples. This helps in task 2 vs 3.
- Memorized feature infusion, which comes into the picture in task 1 vs 2. Here we put more weight on features from a memory, which helps in reducing the bias towards regular classes with a large number of training samples.
2.32 READ Parallelizing wfst speech decoders
CUSTOM_ID: mendis2016parallelizing YEAR: 2016 AUTHOR: Mendis, Charith and Droppo, Jasha and Maleki, Saeed and Musuvathi, Madanlal and Mytkowicz, Todd and Zweig, Geoffrey
I didn't get everything here mostly because of the split between AM and LM phase. Will probably look over with more background. Overall, the idea is to parallelize viterbi as you would think but keeping inter thread communication very low by clumping actions which are mostly independent given their thread's other actions. This clumping gains by knowledge of the graph structure which is affected by the domain; in this case using the information about triphones.
2.33 READ Model-based testing without models: the TodoMVC case study
CUSTOM_ID: bainczyk2017model YEAR: 2017 AUTHOR: Bainczyk, Alexander and Schieweck, Alexander and Steffen, Bernhard and Howar, Falk
This is a case study of a general purpose UI testing approach. Here are the general steps:
- You define a set of actions that can be done.
- Learn a Mealy machine model (makes sense for a lot of UIs) based on exploration using those actions (I am not very sure I am using the correct phrasing for this learning)
- Compare with reference, among siblings etc. Probably also fuzz.
Even though there is not much in this specific paper itself, I got a general overview of the scene and references to a few primary sources.
2.34 READ Morphnet: Fast & simple resource-constrained structure learning of deep networks
CUSTOM_ID: gordon2018morphnet YEAR: 2018 AUTHOR: Gordon, Ariel and Eban, Elad and Nachum, Ofir and Chen, Bo and Wu, Hao and Yang, Tien-Ju and Choi, Edward
In a single line (from the appendix) what is happening is:
iterative process of shrinking via a sparsifying regularizer and expanding via a uniform multiplicative factor
The regularizer is an \(L1\) over the batch norm γ parameters for neurons.
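A sketch of what such a regularizer could look like in PyTorch (my own rendition; the paper additionally weights each γ by the FLOP/size cost of its neuron, and the toy model plus the 1e-4 weight here are arbitrary):
import torch
import torch.nn as nn

def gamma_l1(model: nn.Module) -> torch.Tensor:
    """Sum of |gamma| over all batch-norm layers; gammas driven to zero mark prunable neurons."""
    reg = torch.zeros(())
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            reg = reg + m.weight.abs().sum()
    return reg

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU(),
                      nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.ReLU())
x, y = torch.randn(4, 3, 32, 32), torch.randn(4, 32, 28, 28)
loss = nn.functional.mse_loss(model(x), y) + 1e-4 * gamma_l1(model)
loss.backward()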
2.35 READ Learning reductions that really work
CUSTOM_ID: beygelzimer2016learning YEAR: 2016 AUTHOR: Beygelzimer, Alina and Daum{\'e}, Hal and Langford, John and Mineiro, Paul
I was looking into the general ideas behind Vowpal Wabbit and got to this document which probably summarizes the whole concept of learning reductions.
An important question is how general and fundamental this whole idea really is. Of course computational benefits are a major plus, but reductions also feel very elegant.
2.36 READ Unsupervised Grounding of Plannable First-Order Logic Representation from Images
CUSTOM_ID: asai2019unsupervised YEAR: 2019 AUTHOR: Asai, Masataro
This is an attempt to have an interpretable representation from a neural network that can be used with planning systems. The abstract should tell you what problems are getting solved. The key ideas are the following:
- First Order State Auto Encoder (FOSAE) where the latent space represents FOL predicates based on certain input objects and specified hyperparameters.
- Extensive use of Gumbel-Softmax to impose unitary credit assignment.
I am not very sure how different this is from similar recent works since I haven't followed them. But the main difference looks like focusing on discrete representations and planning capabilities of PDDL-ish tools. Interpretability comes as a side effect.
Since the predicates here are anonymous as of now, an interesting piece of future work involves a bit of supervision to put names on things.
2.37 READ Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding
CUSTOM_ID: hundman2018detecting YEAR: 2018 AUTHOR: Hundman, Kyle and Constantinou, Valentino and Laporte, Christopher and Colwell, Ian and Soderstrom, Tom
I picked this paper mostly randomly while looking around for the general scene of anomaly detection. The general framework looks like this:
- Train a model on the sequence
- Predict the future while collecting errors
- Identify anomalies using a non-parametric heuristic
- Post-hoc pruning of false positives based on identified anomalies
Even though their method has fewer knobs to worry about, it still does not feel auto-pilot-ish while reading the document. Well, that is supposed to happen I guess.
2.38 READ Snorkel: Rapid training data creation with weak supervision
CUSTOM_ID: ratner2017snorkel YEAR: 2017 AUTHOR: Ratner, Alexander and Bach, Stephen H and Ehrenberg, Henry and Fries, Jason and Wu, Sen and R{\'e}, Christopher
This one has a bunch of practical upgrades on the original data programming paper. Two major things here involve:
- deciding when to use the generative model (as compared to voting)
- tackling correlation
On a side note, while looking at the results you might find that even majority voting (which is very easy to implement) might not be that bad if you are a little careful.
2.39 READ Bootstrapping Conversational Agents With Weak Supervision
CUSTOM_ID: mallinar2018bootstrapping YEAR: 2018 AUTHOR: Mallinar, Neil and Shah, Abhishek and Ugrani, Rajendra and Gupta, Ayush and Gurusankar, Manikandan and Ho, Tin Kam and Liao, Q Vera and Zhang, Yunfeng and Bellamy, Rachel KE and Yates, Robert and others
I like their method of mass tagging. Other than that, this is a practical implementation of a Snorkel-like system.
2.40 READ Hidden technical debt in machine learning systems
2.41 READ Few-Shot Generalization Across Dialogue Tasks
CUSTOM_ID: vlasov2018few YEAR: 2018 AUTHOR: Vlasov, Vladimir and Drissner-Schmid, Akela and Nichol, Alan
The idea is to put all the pieces involved in a dialogue, i.e. slots, intents and actions, in a space and then match with possible actions to do something. The key idea is to break the items into compositional pieces before embedding so that a new domain can share a lot of items and get along well.
2.42 READ Neural machine translation of rare words with subword units
CUSTOM_ID: sennrich2015neural YEAR: 2015 AUTHOR: Sennrich, Rico and Haddow, Barry and Birch, Alexandra
This is the application of subword units (BPE-based) to NMT. The results mostly show robustness and better learned handling of OOV stuff.
2.43 READ SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
CUSTOM_ID: kudo2018sentencepiece YEAR: 2018 AUTHOR: Kudo, Taku and Richardson, John
A more rewarding read here can be the code in the repository itself since this is just a short documentation on the methods implemented.
2.44 READ Bpemb: Tokenization-free pre-trained subword embeddings in 275 languages
CUSTOM_ID: heinzerling2017bpemb YEAR: 2017 AUTHOR: Heinzerling, Benjamin and Strube, Michael
This is mostly a report on byte-pair embedding results and comparison with other models. Tracing back to (sennrich2015neural) and the original compression paper (gage1994new) should cover the background.
2.45 READ Comparison of grapheme-to-phoneme methods on large pronunciation dictionaries and LVCSR tasks
CUSTOM_ID: hahn2012comparison YEAR: 2012 AUTHOR: Hahn, Stefan and Vozila, Paul and Bisani, Maximilian
This is mostly a comparison of statistical g2p models. I think I have a general idea now but looks like there is a lot more to see if I start looking into individual references. A general thread along all these models was the use of a certain alignment (grapheme to phoneme) algorithm to get what are called graphones and then train an ngrams-ish sequence model on them.
2.46 READ Statistical language modeling for speech disfluencies
CUSTOM_ID: stolcke1996statistical YEAR: 1996 AUTHOR: Stolcke, Andreas and Shriberg, Elizabeth
Got here from srilm's disfluency (DF) LM. The idea is to have a cleanup model which models out certain common DFs, specifically filled pauses, repetitions and deletions. Although there was not much gain, an interesting conclusion comes with filled pauses where the DF model actually increased perplexity. The argument being a filled pause, in most of the cases, linguistically breaks the sentence and so the context behind it is not so useful for what follows.
Since the paper is old and also hints at a bunch of improvements in DF modeling, I guess there might be a more recent reference around.
2.47 READ SRILM-an extensible language modeling toolkit
CUSTOM_ID: stolcke2002srilm YEAR: 2002 AUTHOR: Stolcke, Andreas
This is an early document on SRILM's design and development. If you are looking for something more in-depth, just download the current tarball.
2.48 READ A bit of progress in language modeling
CUSTOM_ID: goodman2001bit YEAR: 2001 AUTHOR: Goodman, Joshua T
This has a lot of nice ideas and intuitions behind tricks employed in statistical language models. I will just write out the general topics since it's a long paper (~73 pages for the extended version):
- Skipping
- Clustering
- Caching
- Sentence Mixture Models
At a higher level we get to know about:
- ways of combining
- approaching analysis
- practical issues
2.49 READ Streaming End-to-end Speech Recognition For Mobile Devices
CUSTOM_ID: he2018streaming YEAR: 2018 AUTHOR: He, Yanzhang and Sainath, Tara N and Prabhavalkar, Rohit and McGraw, Ian and Alvarez, Raziel and Zhao, Ding and Rybach, David and Kannan, Anjuli and Wu, Yonghui and Pang, Ruoming and others
This is Google's recent on-device, character-level speech recognition system. There are a bunch of tricks used in the overall system other than the main model itself. A few are:
- parameter quantization (required, of course, for fast computation on mobile device)
- data augmentation using tts for getting numbers, proper nouns etc. right (instead of doing fancy stuff on the model side)
2.50 READ Rapidly building domain-specific entity-centric language models using semantic web knowledge sources
CUSTOM_ID: akbacak2014rapidly YEAR: 2014 AUTHOR: Akbacak, Murat and Hakkani-T{\"u}r, Dilek and Tur, Gokhan
This is focused on filtering search queries for creating a language model. The filtering that works out for them is to (after identifying a domain) go from queries to clicked links and then back to queries that went to those links. There are a few other pieces involved but the general shape of the narrowing is the same.
2.51 The tradeoffs of large scale learning
CUSTOM_ID: bottou2008tradeoffs YEAR: 2008 AUTHOR: Bottou, L{\'e}on and Bousquet, Olivier
2.52 READ Abstract meaning representation for sembanking
CUSTOM_ID: banarescu2013abstract YEAR: 2013 AUTHOR: Banarescu, Laura and Bonial, Claire and Cai, Shu and Georgescu, Madalina and Griffitt, Kira and Hermjakob, Ulf and Knight, Kevin and Koehn, Philipp and Palmer, Martha and Schneider, Nathan
I found AMR while looking into a way of breaking from the usual intent/entity based NLU. While they are not perfect, the specification tells you about pieces which should (in elaborate situations) be considered at least for practical computational language understanding.
2.53 READ Bootstrapping language models for dialogue systems
CUSTOM_ID: weilhammer2006bootstrapping YEAR: 2006 AUTHOR: Weilhammer, Karl and Stuttle, Matthew N and Young, Steve
This is about quickly getting domain-specific LMs. The idea is not to do a lot of manual (and perfect) text collection but to start with simple grammars and get a seed LM using the generated text. Then, for more refinement, get a large LM and do sentence selection on in-the-wild data to get sentences with a low value of \(PP_{seed} / PP_{large}\). These sentences and the rejected ones then give two more LMs which can be interpolated based on a validation set.
Exact steps aside, the idea (other than SLMs on grammar generated data) is to do some sort of sentence selection to augment the seed LM.
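A sketch of that selection step, assuming two already-trained LMs exposing a hypothetical ppl() method (the interface and the threshold are mine, not the paper's):
def select_sentences(sentences, seed_lm, large_lm, threshold=1.0):
    """Keep in-the-wild sentences that the seed (in-domain) LM likes relative to the large LM."""
    kept, rejected = [], []
    for s in sentences:
        ratio = seed_lm.ppl(s) / large_lm.ppl(s)   # low ratio suggests the sentence is in-domain
        (kept if ratio < threshold else rejected).append(s)
    return kept, rejected

# kept and rejected can then each be turned into an LM and interpolated with the
# seed LM, with interpolation weights tuned on a validation set.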
2.54 READ Developing Production-Level Conversational Interfaces with Shallow Semantic Parsing
CUSTOM_ID: raghuvanshi2018developing YEAR: 2018 AUTHOR: Raghuvanshi, Arushi and Carroll, Lucien and Raghunathan, Karthik
Doc on Mindmeld's NLU system.
2.55 READ Neural text generation from structured data with application to the biography domain
CUSTOM_ID: lebret2016neural YEAR: 2016 AUTHOR: Lebret, R{\'e}mi and Grangier, David and Auli, Michael
From the Wikipedia infobox (a table) for a person, they generate biographical sentences. The way to condition on the table while doing \(P(w_{i} | c_{i-1})\) is just indexing into (learnable) embeddings. I was looking for something more insightful though.
2.56 Generating exact lattices in the WFST framework
CUSTOM_ID: povey2012generating YEAR: 2012 AUTHOR: Povey, Daniel and Hannemann, Mirko and Boulianne, Gilles and Burget, Luk{\'a}{\v{s}} and Ghoshal, Arnab and Janda, Milo{\v{s}} and Karafi{\'a}t, Martin and Kombrink, Stefan and Motl{\'\i}{\v{c}}ek, Petr and Qian, Yanmin and others
2.57 READ Quantifying the value of pronunciation lexicons for keyword search in lowresource languages
CUSTOM_ID: chen2013quantifying YEAR: 2013 AUTHOR: Chen, Guoguo and Khudanpur, Sanjeev and Povey, Daniel and Trmal, Jan and Yarowsky, David and Yilmaz, Oguz
In a single line, while pronunciation dictionary augmentation doesn't help that much in WER of an LVCSR (since the OOV rates are usually low), it helps a lot in Keyword Search.
A few other things to note are the ways to generate pronunciation and two ways to do KWS if you already have an LVCSR system. Not surprisingly, the proxy keyword system doesn't work that well.
2.58 READ State-of-the-art speech recognition with sequence-to-sequence models
CUSTOM_ID: chiu2018state YEAR: 2018 AUTHOR: Chiu, Chung-Cheng and Sainath, Tara N and Wu, Yonghui and Prabhavalkar, Rohit and Nguyen, Patrick and Chen, Zhifeng and Kannan, Anjuli and Weiss, Ron J and Rao, Kanishka and Gonina, Ekaterina and others
Bunch of improvements on top of the LAS architecture. It feels funny that even in end-to-end systems, we still look for modular presence of components like Language Models. Maybe that helps in adding and justifying heuristics.
2.59 Speech recognition with weighted finite-state transducers
CUSTOM_ID: mohri2008speech YEAR: 2008 AUTHOR: Mohri, Mehryar and Pereira, Fernando and Riley, Michael
Partial notes:
- Composition: Transitive-ness.
- Determinization: Removing multiple transitions on same input.
- Minimization: Compressing to the minimal, equivalent automaton. Done by first weight pushing and then running the classical algorithm.
2.60 READ Data programming: Creating large training sets, quickly
CUSTOM_ID: ratner2016data YEAR: 2016 AUTHOR: Ratner, Alexander J and De Sa, Christopher M and Wu, Sen and Selsam, Daniel and R{\'e}, Christopher
The main idea is to focus on creating \(O(1)\) labelling functions over boundless data to get similar asymptotic behavior as in the labeled-data setting.
This same change of focus has another side effect which I agree with:
One of our hopes is that a user without expertise in ML will be more productive iterating on labeling functions than on features.
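To make the labelling-function idea above concrete, a toy sketch (my own; Snorkel/data programming fits a generative model over these votes, while here I only show the functions plus a naive majority vote):
ABSTAIN, SPAM, HAM = 0, 1, -1

def lf_has_link(text):            # heuristic: links are often spam
    return SPAM if "http" in text else ABSTAIN

def lf_short_shouting(text):      # heuristic: short all-caps messages
    return SPAM if text.isupper() and len(text) < 40 else ABSTAIN

def lf_greeting(text):            # heuristic: ordinary greetings are ham
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LFS = [lf_has_link, lf_short_shouting, lf_greeting]

def weak_label(text):
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return SPAM if sum(votes) > 0 else HAM   # ties fall to HAM in this naive version

print(weak_label("hello, meeting at 5?"), weak_label("CLICK http://win.example"))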
2.61 READ Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication
CUSTOM_ID: jaeger2004harnessing YEAR: 2004 AUTHOR: Jaeger, Herbert and Haas, Harald
This is the Echo State Network paper (probably not the original one but sufficiently close). I found it to be a little different from what I had earlier thought, with regard to there being separate inputs and outputs.
2.62 READ Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars
CUSTOM_ID: zettlemoyer2012learning YEAR: 2012 AUTHOR: Zettlemoyer, Luke S and Collins, Michael
Assuming the title clarifies the goal, there are three basic components here:
- A parser which takes a sentence \(S\), a set of categories \(\Lambda\) and weights over features of the derivation (generated from parsing) \(\theta\). This then generates logical forms (\(L\)) with certain probabilities.
- Category generator which takes \(S\) and its expected logical form \(L\) to generate the categories needed to parse it to that form.
- An estimator which, given the training set and a set of categories, updates \(\theta\) to increase the score of the form getting parsed.
The interesting pieces are the representation of the logical form \(L\) (using λ calculus) and category generation and pruning. Although the generated categories can be arbitrary, allowing for wrong grammars and such, I believe, it can be made to work better in noisy settings if we generalize parsing and (maybe) the meaning of the structurally rigid categories like \(S/NP\) using a few tricks.
2.63 READ A very short introduction to CCG
CUSTOM_ID: steedman1996very YEAR: 1996 AUTHOR: Steedman, Mark
A lambda calculus formulation of verb (function) acts in natural text. Not sure if I can figure out exact advantages as compared to other approaches. This definitely has more appeal to it because of the functional forms and the tooling they pull in with themselves.
2.64 READ Swoosh: a generic approach to entity resolution
CUSTOM_ID: benjelloun2009swoosh YEAR: 2009 AUTHOR: Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer
The main products are optimal algorithms to do ER which minimize the number of calls to the black box functions that actually perform the matching and merging. To do this, we first formalize the ER problem using:
- Records and features as the data structures
- Merging and matching functions as the operations
Then we look for certain properties of a particular setting (mostly the effect of the merge and match functions). Based on whether a few of these are satisfied (surprisingly, trivial functions might not do what you expect of them), we can reduce the number of calls to matching.
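A rough sketch of the R-Swoosh-flavoured loop with the black boxes stubbed out (the record representation and the toy match/merge rules are my assumptions; the point is that every comparison is a call to match, which is exactly what the paper tries to minimize):
def match(r1, r2):
    """Black-box matcher: here, records match if they share any (key, value) pair."""
    return bool(set(r1.items()) & set(r2.items()))

def merge(r1, r2):
    """Black-box merger: here, union of the two records' features."""
    return {**r1, **r2}

def er(records):
    """R-Swoosh-like loop: compare each incoming record against the resolved set."""
    pending, resolved = list(records), []
    while pending:
        r = pending.pop()
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            resolved.remove(partner)
            pending.append(merge(r, partner))   # a merged record may now match others
    return resolved

print(er([{"name": "J. Doe"}, {"name": "J. Doe", "mail": "j@x.org"}, {"mail": "j@x.org", "tel": "123"}]))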
2.65 READ How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation
CUSTOM_ID: liu2016not YEAR: 2016 AUTHOR: Liu, Chia-Wei and Lowe, Ryan and Serban, Iulian V and Noseworthy, Michael and Charlin, Laurent and Pineau, Joelle
Other than the usuals, it has decent summaries of a few metrics used for sentence similarity.
2.66 READ Bringing machine learning and compositional semantics together
CUSTOM_ID: liang2015bringing YEAR: 2015 AUTHOR: Liang, Percy and Potts, Christopher
Got pointed to this while going through sippycup. This presents, in a very pedagogical way, a simple framework for ranking semantic parses using supervised learning. The important point is that this framework can be applied to a lot of problems in nlu involving different ways of structuring the logical forms and features.
2.67 A decision-theoretic generalization of on-line learning and an application to boosting
Custom_ID: freund1997decision AUTHOR: Freund \& Schapire JOURNAL: Journal of computer and system sciences YEAR: 1997 VOLUME: 55 PAGES: 119--139
3 Computing/Programming
3.1 READ Metaobject protocols: Why we want them and what else they can do
CUSTOM_ID: kiczales1993metaobject YEAR: 1993 AUTHOR: Kiczales, Gregor and Ashley, J Michael and Rodriguez, Luis and Vahdat, Amin and Bobrow, Daniel G
This provides a wider perspective on MOP. Especially the sections on Scheme extension techniques clarified that MOP is a very general way of creating an extension system for something else.
3.2 READ Reflections on trusting trust
CUSTOM_ID: thompson1984reflections YEAR: 1984 AUTHOR: Thompson, Ken and others
I remember reading this or watching the talk 3-4 years earlier but not understanding what Ken was trying to say. This time it was fine. I like this top blurb:
To what extent should one trust a statement that a program is free of Trojan horses? Perhaps it is more important to trust the people who wrote the software.
3.3 READ Online aggregation
CUSTOM_ID: hellerstein1997online YEAR: 1997 AUTHOR: Hellerstein, Joseph M and Haas, Peter J and Wang, Helen J
I was looking into this while looking for prior works that provide streaming results from a database. The idea is to have always-available results for aggregation queries like SUM, COUNT, etc. along with uncertainty measurements based on the currently sampled tuples. Other than the uncertainty estimation formulations, they presented work on the implementation side of the idea, which involves random sampling and various UX niceties.
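The flavour of estimate being refreshed, as a tiny sketch (my own; a running AVG with a CLT-style confidence interval over randomly ordered tuples, not their exact estimators):
import math, random

def running_avg(stream, z=1.96):
    """Yield (estimate, half-width of ~95% CI) after each sampled tuple."""
    n, mean, m2 = 0, 0.0, 0.0            # Welford's online mean/variance
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            stderr = math.sqrt(m2 / (n - 1)) / math.sqrt(n)
            yield mean, z * stderr

values = [random.gauss(100, 15) for _ in range(10000)]
random.shuffle(values)                   # random sampling order, as online aggregation assumes
for i, (est, half) in enumerate(running_avg(values)):
    if i % 2000 == 0:
        print(f"AVG ~ {est:.1f} +/- {half:.1f}")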
3.4 READ Practical type inference based on success typings
CUSTOM_ID: lindahl2006practical YEAR: 2006 AUTHOR: Lindahl, Tobias and Sagonas, Konstantinos
The general idea is to allow all programs that throw no runtime errors. This is especially useful in languages which are philosophically dynamic. I like this approach towards types, since programming in a dynamic language involves dropping a lot of the so-called writer's 'intention' here and there, which does not adhere to the static type philosophy.
Not sure if this is one of the firsts (the first for functional languages according to the paper), but these days there are many mainstream dynamic languages adopting such soft typing systems in some form.
3.5 READ Dynamically typed languages
CUSTOM_ID: tratt2009dynamically YEAR: 2009 AUTHOR: Tratt, Laurence
A basic and exhaustive intro to dynamically typed languages. Good for beginners.
3.6 READ Growing a language
CUSTOM_ID: steele1999growing YEAR: 1999 AUTHOR: Steele, Guy L
This is originally a talk, I read a pdf version. An interesting thing is the way the talk itself is structured (its vocabulary mostly) exemplifying the same growth mechanism that Guy talks about in relation to languages.
3.7 READ BlinkDB: queries with bounded errors and bounded response times on very large data
CUSTOM_ID: agarwal2013blinkdb YEAR: 2013 AUTHOR: Agarwal, Sameer and Mozafari, Barzan and Panda, Aurojit and Milner, Henry and Madden, Samuel and Stoica, Ion
I find works like this, and other approximate query processing systems, pretty interesting since their general structures are close to machine learning systems with slightly different metrics to be optimized. This, of course, then provides a lot of food for thought.
So, what BlinkDB does is pretty clear from the title. On the how side, they basically create a bunch of samples (table subsets) based on criteria derived from past queries. The 'key' for samples here is the set of columns involved in the queries' WHERE, HAVING, etc. clauses. When asked a query with timing or error requirements, a sample is picked (after some estimation on some data; this is important, since they don't want to put many assumptions on the type of workload) and the query runs on that.
Since they are mostly for high scale use cases, these methods are not 'very' visible unless you are into such things. Although, I believe, similar ideas (I specially liked the Online Aggregation thing from 1997) can be put in more commonplace, smaller, systems (or already are there).
3.8 Type systems as macros
CUSTOM_ID: chang2017type YEAR: 2017 AUTHOR: Chang, Stephen and Knauth, Alex and Greenman, Ben
3.9 Physics, topology, logic and computation: a Rosetta Stone
Custom_ID: baez2010physics AUTHOR: Baez \& Stay YEAR: 2010
3.10 READ The Genuine Sieve of Eratosthenes
Custom_ID: o2009genuine AUTHOR: O'NEILL JOURNAL: Journal of Functional Programming YEAR: 2009 VOLUME: 19 PAGES: 95--106
This talks about a functional implementation of Sieve of Eratosthenes. Specifically it debunks the following incorrect implementation:
primes = sieve [2..]
sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p > 0]
Then we see correct functional implementations with neat tricks made possible due to laziness of Haskell. Although slower, there is a list based implementation by Bird mentioned in the Epilogue which is pretty readable (and elegant) and follows very closely the following description:
primes = [2, 3, ...] \ [[p², p²+p, ...] for p in primes]
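Since that description is essentially "the naturals minus the multiples of every prime", here is a rough Python rendering of the incremental idea (the usual dict-of-upcoming-composites translation; mine, not the Haskell code from the paper):
from itertools import islice

def primes():
    """Incremental sieve: for each prime p, lazily track its multiples starting at p*p."""
    composites = {}            # next composite -> list of primes that generated it
    n = 2
    while True:
        if n not in composites:
            yield n
            composites[n * n] = [n]          # first relevant multiple of a new prime is p^2
        else:
            for p in composites.pop(n):      # reschedule each prime's next multiple
                composites.setdefault(n + p, []).append(p)
        n += 1

print(list(islice(primes(), 10)))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]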
3.11 READ Why functional programming matters
Custom_ID: hughes1989functional AUTHOR: Hughes JOURNAL: The computer journal YEAR: 1989 VOLUME: 32 PAGES: 98--107
This is a famous paper and I wanted to see what it focuses on. It's basically about the following two properties and their effect on modularity in functional programming:
- Higher order functions
- Lazy evaluation
The examples are nice and make this a good read for beginners. Though I suspect there might be better, more recent articles on these topics now.
4 Misc
4.1 READ More is different
CUSTOM_ID: anderson1972more YEAR: 1972 AUTHOR: Anderson, Philip W and others
The basic idea is the following:
The main fallacy in this kind of thinking is that reductionist hypothesis does not by any means imply a "constructionist" one.
We are trying to understand that the reductionist view is not going to explain everything and that fundamental laws at the lowest level are not going to be the fundamental ones for the higher level ("Psychology is not applied biology…"). A little hierarchy is also presented using examples where our movement across levels results in broken symmetry:
- Crystallinity
- Functional structures
- Regular systems with information like DNA
- Ordering in the time dimension for information processing etc.
So it is not true, as a recent article would have it, that we each should "cultivate our own valley, and not attempt to build roads over the mountain ranges … between the sciences." Rather, we should recognize that such roads, while often the quickest shortcut to another part of our own science, are not visible from the viewpoint of one science alone.
4.2 READ Google's hybrid approach to research
CUSTOM_ID: spector2012google YEAR: 2012 AUTHOR: Spector, Alfred and Norvig, Peter and Petrov, Slav
Mostly about the people being researchers and developers and how it affects various aspects of experiments.
4.3 READ Machine learning: The high-interest credit card of technical debt
CUSTOM_ID: sculley2014machine YEAR: 2014 AUTHOR: Sculley, D and Phillips, Todd and Ebner, Dietmar and Chaudhary, Vinay and Young, Michael
4.4 READ Better science through art
CUSTOM_ID: gabriel2010better YEAR: 2010 AUTHOR: Gabriel, Richard P and Sullivan, Kevin J
Here are the last few lines which cover what's common between Science and Art and also summarize the document:
- Explore: wander / defamiliarize
- Discover: guess / abduce
- Understand: validate / ask—did you build the right thing?
4.5 READ Lisp, Jazz, Aikido–Three Expressions of a Single Essence
Custom_ID: verna2018lisp AUTHOR: Verna JOURNAL: arXiv preprint arXiv:1804.00485 YEAR: 2018
Okay, this was up on /r/lisp and felt like not that much effort to read, so I gave it a shot. There are three general aesthetic avenues that the author covers:
- Conformation
- Transgression
- Unification
The general idea is about the similar interplay of these in all the 3 things (Lisp, Jazz & Aikido) and how they end up being a source of pleasure and enlightenment.
From whatever I have felt, things that focus on an act itself (rather than prioritizing the results) end up being like these (well, probably this is obvious).
This paper is a quick read and is not overly philosophical. Maybe that's because one of the focuses is on tools that stay out of your way by staying practical (you can see this when the author talks about Common Lisp specifically). Although I must say that I know next to nothing about both Jazz and Aikido, so I might not have really been able to connect all the pieces.
Bibliography
- [bak1988self] Bak, Tang & Wiesenfeld. 1988. "Self-organized criticality." Physical review A, 38(1), 364. link. doi.
- [packard1988adaptation] Packard. 1988. "Adaptation toward the edge of chaos." Dynamic patterns in complex systems, 212, 293-301. link. doi.
- [broido2018scale] Broido & Clauset. 2018. "Scale-free networks are rare." arXiv preprint arXiv:1801.03400. link. doi.
- [ratner2017snorkel] Ratner, Bach, Ehrenberg, Fries, Wu & R\'e. 2017. "Snorkel: Rapid training data creation with weak supervision." Proceedings of the VLDB Endowment, 11(3), 269-282. link. doi.
- [papangelis2019collaborative] Papangelis, Wang, Molino & Tur. 2019. "Collaborative Multi-Agent Dialogue Model Training Via Reinforcement Learning." arXiv preprint arXiv:1907.05507.
- [sennrich2015neural] Sennrich, Haddow & Birch. 2015. "Neural machine translation of rare words with subword units." arXiv preprint arXiv:1508.07909. link. doi.
- [gage1994new] Gage. 1994. "A new algorithm for data compression." The C Users Journal, 12(2), 23-38. link. doi.
Footnotes:
1. Might have missed it somewhere, or there might be mentions in prior works.