Category Archives: Data Science

Understanding the Sparse Mixture of Experts (SMoE) Layer in Mixtral – Towards Data Science

Let's begin with the idea of an expert in this context. Experts are feed-forward neural networks. We connect them to our main model via gates that route the signal to specific experts. You can think of these experts as simply more complex neurons within a layer of the network.

The problem with a naive implementation of the gates is that it significantly increases the computational complexity of your neural network, potentially making your training costs enormous (especially for LLMs). So how do you get around this?

The problem here is that a neural network is required to calculate the value of a neuron so long as there is any signal going to it, so even the faintest amount of information sent to an expert triggers the whole expert network to be computed. The authors of the paper get around this by creating a gating function, G(x), that forces most low-value signals to exactly zero.
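
The equation in question is the output of the mixture layer as given in the paper [1]: the gate's value for each of the n experts scales that expert's output, and the scaled outputs are summed.

    y = \sum_{i=1}^{n} G(x)_i \, E_i(x)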

In the above equation, G(x) is our gating function, and E(x) is a function representing our expert. As any number times zero is zero, this lets us skip running an expert network entirely whenever our gating function hands it a zero. So how does the gating function determine which experts to compute?

The gating function itself is a rather ingenious way to focus only on the experts that you want. Let's look at the equations below, and then I'll dive into how they all work.
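
As written in the paper [1] (the numbering below follows the paper's), the gate is built in three steps:

    G(x) = \mathrm{Softmax}(\mathrm{KeepTopK}(H(x), k))    (3)

    H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}((x \cdot W_{noise})_i)    (4)

    \mathrm{KeepTopK}(v, k)_i = v_i \ \text{if } v_i \text{ is in the top } k \text{ elements of } v, \ \text{and } -\infty \text{ otherwise}    (5)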

Going from bottom to top, equation 5 is simply a step function. If the input is not within a certain range (here, the top k elements of the list v), it returns negative infinity, thus assuring a perfect 0 when plugged into the Softmax. If the value is not -infinity, then a signal is passed through. This k parameter allows us to decide how many experts we'd like to hear from (k=1 would route to only 1 expert, k=2 to only 2 experts, etc.)

Equation 4 is how we determine what goes into the list from which we select the top k values. We begin by multiplying the input to the gate (the signal x) by some weight Wg. This Wg is what will be trained in each successive round for the neural network. Note that the weight associated with each expert will typically have a distinct value. To help prevent the same experts being chosen every single time, we add in some statistical noise via the second half of the equation. The authors propose distributing this noise along a normal distribution, but the key idea is to add in some randomness to help with expert selection.

Equation 3 simply combines the two and feeds them into a Softmax function, so that we can be sure -infinity gives us 0 and any other value sends a signal through to the expert.

The sparse part of the title comes from sparse matrices, or matrices where most of the values are zero, as this is what we effectively create with our gating function.
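
Putting the pieces together, here is a toy NumPy sketch of the noisy top-k gating and routing logic described above (the shapes, names, and numerical details are my own illustration, not the paper's reference implementation):

```python
import numpy as np

def noisy_top_k_gating(x, w_gate, w_noise, k=2, rng=None):
    """Noisy top-k gating: x is a (d,) input, w_gate and w_noise are (d, n_experts) weights."""
    rng = rng if rng is not None else np.random.default_rng()
    clean_logits = x @ w_gate                              # one logit per expert
    noise_scale = np.log1p(np.exp(x @ w_noise))            # softplus keeps the noise scale positive
    h = clean_logits + rng.standard_normal(clean_logits.shape) * noise_scale
    top_k = np.argsort(h)[-k:]                             # indices of the k largest logits
    masked = np.full_like(h, -np.inf)
    masked[top_k] = h[top_k]                               # everything outside the top k becomes -inf
    g = np.exp(masked - masked[top_k].max())
    g /= g.sum()                                           # softmax; the -inf entries come out as exact zeros
    return g, top_k

def moe_forward(x, experts, w_gate, w_noise, k=2):
    """Only the k selected experts are evaluated, which is where the savings come from."""
    g, top_k = noisy_top_k_gating(x, w_gate, w_noise, k)
    return sum(g[i] * experts[i](x) for i in top_k)
```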

While our noise injection is valuable to reduce expert concentration, the authors found it was not enough to fully overcome the issue. To incentivize the model to use the experts nearly equally, they adjusted the loss function.
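
The definition from the paper [1], roughly: the importance of the experts over a batch X is the batch-wise sum of the gate vectors, and the auxiliary loss is built from its coefficient of variation.

    \mathrm{Importance}(X) = \sum_{x \in X} G(x)    (6)

    L_{importance}(X) = w_{importance} \cdot \mathrm{CV}(\mathrm{Importance}(X))^2

Here CV is the coefficient of variation (standard deviation divided by mean), so the loss is large when a few experts dominate and small when usage is even.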

Equation 6 shows how they define importance in terms of the gate function; this makes sense, as the gate function is ultimately the decider of which expert gets used. Importance here is the sum of all of the experts' gate values over a batch. They define their loss function via the coefficient of variation of the set of importance values. Put simply, this means we are finding a value that represents just how much each expert is used, where reliance on a select few experts creates a big value and even use of all of them creates a small value. The w_importance term is a hyperparameter that can push the model to use more of the experts.

Another training challenge the paper calls out involves getting enough data to each of the experts. As a result of our gating function, the amount of data each expert sees is only a fraction of what a comparable dense neural network would see. Put differently, because each expert will only see part of the training data, it is effectively as if we had taken our training data and hidden most of it from each expert. This makes us more susceptible to overfitting or underfitting.

This is not an easy problem to solve, so the authors suggest the following: leveraging data parallelism, leaning into convolutionality, and applying the Mixture of Experts recurrently (rather than convolutionally). These are dense topics, so to keep this blog post from getting too long I will go into them in later posts if there is interest.

The Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer paper was published in 2017, the same year that the seminal Attention Is All You Need paper came out. Just as it took some years before the architecture described in the attention paper reached the mainstream, it took a few years before we had any models that could successfully implement this sparse architecture.

When Mistral released their Mixtral model in 2024, they showed the world just how powerful this setup can be. With a production-grade LLM built on this architecture, we can study how it actually uses its experts. One of the most fascinating pieces here is that we don't really understand why specialization at the token level is so effective. If you look at how Mixtral assigns experts across subjects, it is clear that, with the exception of mathematics, no one expert is the go-to for any one high-level subject.

Consequently, we are left with an intriguing situation where this new architectural layer is a marked improvement, yet nobody can explain exactly why this is so.

More major players have been following this architecture as well. Following the open release of Grok-1, we now know that Grok is a Sparse Mixture of Experts model with 314 billion parameters. Clearly, this is an architecture people are willing to invest significant capital into, and it will likely be part of the next wave of foundation models as major players push it to new limits.

The Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer paper ends by suggesting that experts created via a recurrent neural network are the natural next step, as recurrent neural networks tend to be even more powerful than feed-forward ones. If this is the case, then the next frontier of foundation models may not be networks with more parameters, but rather models with more complex experts.

In closing, I think this paper highlights two critical questions for future sparse mixture of experts studies to focus on. First, what scaling effects do we see now that we have added more complex nodes into our neural network? Second, does the complexity of an expert give good returns on cost? In other words, what scaling relationship do we see within the expert network, and what are the limits on how complex it should be?

As this architecture is pushed to its limits, it will surely open up many fantastic areas of research as we add complexity in exchange for better results.

[1] N. Shazeer, et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017), arXiv

[2] A. Jiang, et al., Mixtral of Experts (2024), arXiv

[3] A. Vaswani, et al., Attention Is All You Need (2017), arXiv

[4] xAI, Open Release of Grok-1 (2024), xAI website

See more here:

Understanding the Sparse Mixture of Experts (SMoE) Layer in Mixtral - Towards Data Science

Syntax: the language form. How do you know that this is a sentence? | by Dusko Pavlovic | Mar, 2024 – Towards Data Science

Language processing in humans and computers, Part 3: How do you know that this is a sentence?

People speak many languages. People who speak different languages generally don't understand each other. How is it possible to have a general theory of language?

Life is also diversified into many species, and different species generally cannot interbreed. But life is a universal capability of self-reproduction, and biology is a general theory of life.

General linguistics is based on Noam Chomsky's Cartesian assumption: that all languages arise from a universal capability of speech, innate to our species. The claim is that all of our different languages share the same deep structures embedded in our brains. Since different languages assign different words to the same things, the semantic assignments of words to meanings are not a part of these universal deep structures. Chomskian general linguistics is mainly concerned with general syntax. It also studies (or it used to study) the transformations of the deep syntactic structures into the surface structures observable in particular languages, just like biology studies the ways in which the general mechanisms of heredity lead to particular organisms. Oversimplified a little, the Chomskian thesis implied that the deep structures of syntax play the same role for language that the deep structures of heredity play for life.

However, the difference between the pathways from deep structures to surface structures, as studied in linguistics on one hand and in biology on the other, is that

* in biology, the carriers of the deep structures of life are directly observable and empirically studied in genetics, whereas
* in linguistics, the deep structures of syntax are not directly observable but merely postulated, as Chomsky's Cartesian foundations, and the task of finding actual carriers is left to a future science.

This leaves the Cartesian assumption about the universal syntactic structures on shaky ground. The emergence of large language models may be a tectonic shift of that ground. Most of our early interactions with chatbots seem to suggest that the demarcation line between syntax and semantics may not be as clear as traditionally assumed.

To understand a paradigm shift, we need to understand the paradigm. To stand a chance to understand large language models, we need a basic understanding of the language models previously developed in linguistics. In this lecture and in the next one, we parkour through the theories of syntax and of semantics, respectively.

Grammar is trivial in the sense that it was the first part of the trivium. Trivium and quadrivium were the two main parts of medieval schools, partitioning the seven liberal arts that were studied. The trivium consisted of grammar, logic, and rhetorics; the quadrivium of arithmetic, geometry, music, and astronomy. Theology, law, and medicine were not studied as liberal arts because they were controlled by the Pope, the King, and the physicians' guilds, respectively. So grammar was the most trivial part of the trivium. At the entry point of their studies, students were taught to classify words into 8 basic syntactic categories, going back to Dionysios Thrax from the II century BCE: nouns, verbs, participles, articles, pronouns, prepositions, adverbs, and conjunctions. The idea of categories goes back to the first book of Aristotle's Organon. The basic noun-verb scaffolding of Indo-European languages was noted still earlier, but Aristotle spelled out the syntax-semantics conundrum: what do the categories of words in the language say about the classes of things in the world? For a long time, partitioning words into categories remained the entry point of all learning. As understanding of language evolved, its structure became the entry point.

Formal grammars and languages are defined in the next couple of displays. They show how it works. If you don't need the details, skip them and move on to the main idea. The notations are explained among the notes.
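
In the standard presentation (sketched here in the usual notation, which may differ in detail from the author's original displays), a formal grammar is a quadruple G = (N, Σ, P, S): a finite set N of non-terminal symbols, a finite set Σ of terminal symbols (the lexicon), a start symbol S in N, and a finite set P of rewrite rules of the form

    α ::= β    (1)

where α and β are strings over N ∪ Σ and α contains at least one non-terminal. A string w of terminals is derivable if S can be rewritten into w by finitely many rule applications, and the language generated by the grammar is L(G) = { w ∈ Σ* | S ⇒* w }.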

The idea of the phrase structure theory of syntax is to start from a lexicon as the set of terminals and to specify a grammar that generates, as the induced language, a desired set of well-formed sentences.

How grammars generate sentences. The most popular sentences are of the form Subject loves Object. One of the most popular sentences from grammar textbooks is in the next figure on the left:
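
As a sketch of what such a figure contains, here is a minimal phrase-structure grammar for a textbook sentence of this shape, together with the constituent tree it assigns (the use of NLTK and the exact lexicon are illustrative assumptions):

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
VP -> V NP
NP -> 'John' | 'Mary'
V  -> 'loves'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John loves Mary".split()):
    tree.pretty_print()   # draws the tree: S over NP and VP, with the words as leaves
```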

The drawing above the sentence is its constituent tree. The sentence consists of a noun phrase (NP) and a verb phrase (VP), both as simple as possible: the noun phrase is a noun denoting the subject, and the verb phrase is a transitive verb with another noun phrase denoting the object. The subject-object terminology suggests different things to different people, a wide variety of ideas. If even the simplest possible syntax suggests a wide variety of semantical connotations, then there is no such thing as a purely syntactic example. Every sequence of words has a meaning, and meaning is a process, always on the move, always decomposable. To demonstrate the separation of syntax from semantics, Chomsky constructed the (syntactically) well-formed but (semantically) meaningless sentence "Colorless green ideas sleep furiously", illustrated by Dall-E in the above figure on the right. The example is used as evidence that syntactic correctness does not imply semantic interpretability. But there is also a whole tradition of creating poems, stories, and illustrations that assign meanings to this sentence. Dall-E's contribution above is among the simpler ones.

Marxist linguistics and engineering. For a closer look at the demarcation line between syntax and semantics, consider the ambiguity of the sentence "One morning I shot an elephant in my pajamas", delivered by Groucho Marx in the movie Animal Crackers.

The claim is ambiguous because it permits the two syntactic analyses:

both derived using the same grammar:
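
A grammar of the sort involved, with code that recovers both analyses, looks roughly like this (it follows the well-known NLTK rendering of the example; the rule names are incidental):

```python
import nltk

groucho_grammar = nltk.CFG.fromstring("""
S   -> NP VP
PP  -> P NP
NP  -> Det N | Det N PP | 'I'
VP  -> V NP | VP PP
Det -> 'an' | 'my'
N   -> 'elephant' | 'pajamas'
V   -> 'shot'
P   -> 'in'
""")

parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse("I shot an elephant in my pajamas".split()):
    print(tree)   # two trees: the PP attaches either to the VP (I wore the pajamas) or to the NP (the elephant did)
```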

While both analyses are syntactically correct, only one is semantically realistic, whereas the other one is a joke. To plant the joke, Groucho binds his claim to the second interpretation by adding: "How he got into my pajamas I don't know." The joke is the unexpected turn from syntactic ambiguity to semantic impossibility. The sentences about the colorless green ideas and the elephant in my pajamas illustrate the same process of apparent divergence of syntax and semantics, studied in linguistics and used in comedy.

History of formal grammars. The symbol ::= used in formal grammars is a rudiment of the fact that such rules used to be thought of as one-way equations. Rule (1) in the definition of formal grammars above is meant to be interpreted something like: whenever you see the left-hand side, you can rewrite it as the right-hand side, but not the other way around. Algebraic theories presented by systems of such one-way equations were studied by Axel Thue in the early XX century. Emil Post used such systems in his studies of string rewriting in the 1920s, to construct what we would now call programs, more than 10 years before Gödel and Turing spelled out the idea of programming. In the 1940s, Post proved that his string rewriting systems were as powerful as Turing's, Gödel's, and Church's models of computation, which had in the meantime appeared. Noam Chomsky's 1950s proposal of formal grammars as the principal tool of general linguistics was based on Post's work and inspired by the general theory of computation, rapidly expanding and proving some of its deepest results at the time. While usable grammars of natural languages still required a lot of additional work on transformations, side conditions, binding, and so on, the simple formal grammars that Chomsky classified back then have remained the principal tool for specifying programming languages ever since.

Hierarchy of formal grammars and languages. Chomsky defined the nest of languages displayed in the next figure by imposing constraints on the grammatical rules that generate the languages.

The constraints are summarized in the following table.
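
In the standard formulation (summarized here in the usual way; the author's original table may differ in detail), a grammar is called

* unrestricted (type 0) if its rules α ::= β are not constrained any further;
* context-sensitive (type 1) if every rule has the form αAβ ::= αγβ, where A is a non-terminal and γ is non-empty, so that A is rewritten only in the context α…β and derivations never shrink the string;
* context-free (type 2) if every rule has the form A ::= γ, with a single non-terminal on the left;
* regular (type 3) if every rule has the form A ::= aB or A ::= a, with a a terminal and B a non-terminal.

Each family of languages is strictly contained in the one above it, which gives the nest of languages mentioned above.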

Here are some examples from each grammar family, together with typical derivation trees and languages:

Does it really work like this in my head? Scientific models of reality usually do not claim that they are the reality. Physicists don't claim that quantum states consist of the density matrices used to model them. Grammars are just a computational model of language, born in the early days of the theory of computation. The phrase structure grammars were an attempt to explain language in computational terms. Nowadays even programming languages often don't work that way anymore. It's just a model.

However, when it comes to mental models of mental processes, the division between reality and its models becomes subtle. They can reflect and influence each other. A computational model of a computer allows the computer to simulate itself. A language can be modeled within itself, and the model can be similar to the process that it models. How close can it get?

Dependency grammars are a step closer to capturing the process of sentence production. Grammatical dependency is a relation between words in a sentence. It relates a head word and an (ordered!) tuple of dependents. The sentence is produced as the dependents are chosen for the given head words, or the heads for the given dependents. The choices are made in the order in which the words occur. Here is how this works on the example of Groucho's elephant sentence:

Unfolding dependencies. The pronoun "I" occurs first, and it can only form a sentence as a dependent on some verb. The verb "shot" is selected as the head of that dependency as soon as it is uttered. The sentence could then be closed if the verb "shot" is used as intransitive. If it is used as transitive, then the object of action needs to be selected as its other dependent. Groucho selects the noun "elephant". English grammar requires that this noun is also the head of another dependency, with an article as its dependent. Since the article is required to precede the noun, the word "elephant" is not uttered before its dependent "an" or "the" is chosen. After the words "I shot an elephant" are uttered (or received), there are again multiple choices to be made: the sentence can be closed with no further dependents, or a dependent can be added to the head "shot", or else it can be added to the head "elephant". The latter two syntactic choices correspond to the different semantical meanings that create the ambiguity. If the prepositional phrase "in my pajamas" is a syntactic dependent of the head "shot", then the subject "I" wore the pajamas when they shot. If the prepositional phrase is a syntactic dependent of the head "elephant", then the object of the shooting wore the pajamas when they were shot. The two dependency analyses look like this, with the corresponding constituency analyses penciled above them.

The dependent phrase "in my pajamas" is headed by the preposition "in", whose dependent is the noun "pajamas", whose dependent is the possessive "my". After that, the speaker has to choose again whether to close the sentence or to add another dependent phrase, say "while sleeping furiously", which opens up the same two choices of syntactic dependency and semantic ambiguity. To everyone's relief, the speaker chose to close the sentence.
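
Dependency analyses of this kind are what off-the-shelf dependency parsers produce. A quick way to list the head of every word in Groucho's sentence (a sketch assuming spaCy and its small English model are installed; the parser will commit to just one of the two attachments):

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("One morning I shot an elephant in my pajamas.")

for token in doc:
    # Each word points to its head, labelled with the dependency relation.
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")
```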

Is dependency a syntactic or a semantic relation? The requirements that a dependency relation exists are usually syntactic. E.g., to form a sentence, a starting noun is usually a dependent of a verb. But the choice of a particular dependent or head assignment is largely semantical: whether I shot an elephant or a traffic sign. The choice of an article dependent on the "elephant" depends on the context, possibly a remote one: whether a particular elephant has been determined or not. If it has not been determined, then the form of the indefinite article "an" is determined syntactically, and not semantically.

So the answer to the above question seems to suggest that the partition of the relations between words into syntactic and semantic is too simplistic for some situations since the two aspects of language are not independent and can be inseparable.

In programming, type-checking is a basic error-detection mechanism: e.g., the inputs of an arithmetic operation are checked to be of type int, and the birth dates in a database are checked to have a month field of type month, whose terms may be the integers 1, 2, …, 12; if someone's birth month is entered as 101, the error will be caught in type-checking. Types allow the programmer to ensure correct program execution by constraining the data that can be processed.
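
As a minimal illustration of the idea (the Month type below is a made-up example, not a reference to any particular library):

```python
from dataclasses import dataclass

@dataclass
class Month:
    value: int

    def __post_init__(self):
        # The type constrains its terms to the integers 1..12.
        if not 1 <= self.value <= 12:
            raise ValueError(f"{self.value} is not a valid month")

Month(7)                      # fine
try:
    Month(101)                # the error is caught by the type's constraint
except ValueError as err:
    print("caught:", err)
```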

In language processing, the syntactic types are used in a similar process, to restrict the scope of the word choices. Just like the type int restricts the inputs of arithmetic operations to integers, the syntactic type of verbs restricts the predicates in sentences to verbs. If you hear something sounding like "John lo… Mary", then without the type constraints, you have more than 3000 English words starting with lo- to consider as possible completions. With the syntactic constraint that the word that you didn't discern must be a transitive verb in the third person singular, you are down to lobs, locks, logs, maybe loathes, and of course, loves.

The rules of grammar are thus related to the type declarations in programs.

In the grammar listed above, after the two parsings of Groucho's elephant sentence, the terminal rules listed on the left are the basic typing statements, whereas the non-terminal rules on the right are type constructors, building composite types from simpler types. The constituency parse trees thus display the type structures of the parsed sentences. The words of the sentence occur as the leaves, whereas the inner tree nodes are the types. The branching nodes are the composite types and the non-branching nodes are the basic types. Constituency parsing is typing.

Dependency parsings, on the other hand, do a strange thing: having routed the dependencies from a head term to its dependents through the constituent types that connect them, they sidestep the types and directly connect the head with its dependents. This is what the above dependency diagrams show. Dependency parsing reduces syntactic typing to term dependencies.

But only the types that record nothing but term dependencies can be reduced to term dependencies. The two dependency parsings of the elephant sentence look something like this:

The expressions below the two copies of the sentence are the syntactic types captured by the dependency parsings. They are generated by tupling the reference variables x, y, etc., with their overlined left adjoints and underlined right adjoints. Such syntactic types form pregroups, an algebraic structure introduced in the late 1990s by Jim Lambek, as a simplification of his syntactic calculus of categorial grammars. He had introduced categorial grammars in the late 1950s, to explore decision procedures for Ajdukiewicz's syntactic connexions from the 1930s and for Bar-Hillel's quasi-arithmetic from the early 1950s, both based on the reference-based logic of meaning going back to Husserl's Logical Investigations. Categorial grammars have been subsequently studied for many decades. We only touch on pregroups here, and only as a stepping stone.

A pregroup is an ordered monoid with left and right adjoints. An ordered monoid is a monoid where the underlying set is ordered and the monoid product is monotone.
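
Spelled out in the usual notation (a sketch of the standard definition): every element x has a left adjoint x^l and a right adjoint x^r, characterized by the inequalities

    x^l \cdot x \;\leq\; 1 \;\leq\; x \cdot x^l
    x \cdot x^r \;\leq\; 1 \;\leq\; x^r \cdot x

where 1 is the monoid unit and \cdot is the monoid product.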

If you know what this means, you can skip this section. You can also skip it if you don't need to know how it works, since the main idea should transpire as you go anyway. Just in case, here are the details.

It is easy to show that all elements of all pregroups, as ordered monoids with adjoints, satisfy the following claims:
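
The standard claims of this kind (the author's exact list may differ) include:

* adjoints are unique: each x has exactly one left and one right adjoint;
* the unit is self-adjoint: 1^l = 1 = 1^r;
* adjoints reverse products: (x \cdot y)^l = y^l \cdot x^l and (x \cdot y)^r = y^r \cdot x^r;
* the two adjoints cancel each other: x^{lr} = x = x^{rl};
* adjoints reverse the order: x ≤ y implies y^l ≤ x^l and y^r ≤ x^r.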

Examples. The free pregroup, generated by an arbitrary poset, consists of the tuples formed from the poset elements together with their left and right adjoints. The monoid operation is concatenation. The order is lifted from the generating poset pointwise and (most importantly) extended by the order clauses from the definition of the adjoints. For a non-free example, consider the monoid of monotone maps from the integers to the integers. Taken with the pointwise order again, the monotone maps form a pregroup because every bounded set of integers contains its meet and join, and therefore every monotone map must preserve them. This allows constructing the adjoints.

Here is why pregroups help with understanding language.

To check syntactic correctness of a given phrase, each word in the phrase is first assigned a pregroup element as its syntactic type. The type of the phrase is the product of the types of its words, multiplied in the pregroup. The phrase is a well-formed sentence if its syntactic type is bounded from above by the pregroup unit 1. In other words, we compute the syntactic type S of the given phrase, and it is a sentence just when S ≤ 1. The computation can be reduced to drawing arcs that connect each type x with an adjoint, be it left or right, and coupling them so that each linked pair falls below 1. If the arcs are well-nested, then eliminating the adjacent linked pairs, which fall below 1 according to the above definition of adjoints, and replacing them by the unit makes other adjoint pairs adjacent and ready to be eliminated. If the phrase is a sentence, proceeding like this reduces its type to the unit. Since the procedure was nondecreasing, this proves that the original type was bounded by the unit from above. If the types cannot be matched by linking and eliminated in this way, then the phrase is not a sentence. The type actually tells what kind of a phrase it is.

We obviously skipped many details and some of them are significant. In practice, the head of the sentence is annotated by a type variable S that does not get discharged and its wire does not arc to another type in the sentence but points straight out. This wire can be interpreted as a reference to another sentence. By linking the S-variables of pairs of sentences and coupling, for instance, questions and answers, one could implement a pregroup version of discourse syntax. Still further up, by pairing messages and coupling, say, the challenges and the responses in an authentication protocol, one could implement a pregroup version of a protocol formalism. We will get back to this in a moment.

While they are obviously related to the dependency references, the pregroup couplings usually deviate from them. On the sentential level, this is because the words grouped under the same syntactic type in a lexicon are expected to be assigned the same pregroup type. Lambek's idea was that even the phrases of the same type in constituency grammars should receive the same pregroup type. Whether this requirement is justified and advantageous is a matter of contention. The only point that matters here is that syntax is typing.

Why don't we stream words, like network routers stream packets? Why can't we approximate what we want to say by adding more words, just like numbers approximate points in space by adding more digits?

The old answer is: We make sentences to catch a breath. When we complete a sentence, we release the dependency threads between its words. Without that, the dependencies accumulate, and you can only keep so many threads in your mind at a time. Breathing keeps references from knotting.

Exercise. We make long sentences for a variety of reasons and purposes. A sample of a long sentence is provided below. Try to split it into shorter ones. What is gained and what lost by such operations? Ask a chatbot to do it.

Anaphora is a syntactic pattern that occurs within or between sentences. In rhetorics and poetry, it is the figure of speech where the same phrase is repeated to amplify the argument or thread a reference. In ChatGPT's view, it works because the rhythm of the verse echoes the patterns of meaning:

In every word, life's rhythm beats,
In every truth, life's voice speaks.
In every dream, life's vision seeks,
In every curse, life's revenge rears.
In every laugh, life's beat nears,
In every pause, life's sound retreats.

Syntactic partitions reflect the semantic partitions. Sentential syntax is the discipline of charging and discharging syntactic dependencies to transmit semantic references.

The language streams are articulated into words, sentences, paragraphs, sections, chapters, books, libraries, literatures; speakers tell stories, give speeches, maintain conversations, follow conventions, comply with protocols. Computers reduce speech to tweets and expand it to chatbots.

The layering of language articulations is an instance of the stratification of communication channels. Artificial languages evolved the same layering. The internet stack is another instance.

Communication channels are stratified because information carriers are implemented on top of each other. The layered interaction architectures are pervasive: in living organisms, in the communication networks between them, and in all languages developed by humans. The reference coupling mechanisms, similar to the syntactic type structures that we studied, emerge at all levels. The pregroup structure of sentential syntax is echoed in the question-answer structure of simple discourse and in the SYN-ACK pattern of basic network protocols. Closely related structures arise in all kinds of protocols, across the board, whether they are established to regulate network functions, to secure interactions, or to implement social, political, and economic mechanisms. The following figure shows a high-level view of a simple 2-factor authentication protocol, presented as a basic cord space:

And here is the same protocol with the cord interactions viewed as adjoint pairs of types:

The corresponding interactions are marked by the corresponding sequence numbers. The upshot is that sentential syntax, discourse structure, and interaction protocols all

share crucial features. It is tempting to think of them as a product of a high-level deep syntax, shared by all communication processes. Such syntax could conceivably arise from innate capabilities hypothesized in the Chomskian theory, or from physical and logical laws of information processing.

We have seen how syntactic typing supports semantic information transmission. Already Groucho's elephant sentence fed syntactic and semantic ambiguities back into each other.

But if syntactic typing and semantic assignments steer each other, then the generally adopted restriction of syntactic analyses to sentences cannot be justified, since semantic ambiguities cannot be resolved on the level of the sentence. Groucho proved that.

Consider the sentence

John said he was sick and got up to leave.

Adding a context changes its meaning:

Mark collapsed on bed. John said he was sick and got up to leave.

For most people, "he was sick" now refers to Mark. Note that the silent "he" in "[he] got up to leave" remains bound to John. Or take

Few professors came to the party and had a great time.

The meaning does not significantly change if we split the sentence in two and expand it a little:

Since it started late, few professors came to the party. They had a great time.

Like in the John and Mark example, a context changes the semantical binding, this time of "it":

There was a departmental meeting at 5. Since it started late, few professors came to the party. They had a great time.

But this time, adding a first sentence that binds the subject "they" differently may change the meaning of "they" in the last sentence:

They invited professors. There was a departmental meeting at 5. Since it started late, few professors came to the party. They had a great time.

The story is now that the students had a great time, even though the students are never explicitly mentioned! Their presence is only derived from background knowledge about the general context of professorial existence 😉

On the level of sentential syntax of natural languages, as generated by formal grammars, proving context-sensitivity amounts to finding a language that contains some of the patterns known to require a context-sensitive grammar, such as a^n b^n c^n for arbitrary letters a, b, c and any number n, or ww, www, or wwww for an arbitrary word w. Since people are unlikely to go around saying to each other things like a^n b^n c^n, the task boiled down to finding languages which require constructions of repeated words in the form ww, www, etc. The quest for such examples became quite competitive.

Since a language with a finite lexicon has a finite number of words for numbers, at some point you have to say something like "quadrillion quadrillion", assuming that quadrillion is the largest number denoted by a single word. But it was decided by the context-sensitivity competition referees that numbers don't count.

Then someone found that in the Central African language Bambara, the construction that says "any dog" is of the form "dog dog". Then someone else noticed context-sensitive nesting phenomena in Dutch, but not everyone agreed. Eventually, most people settled on Swiss German as a definitely context-sensitive language, and the debate about syntactic context-sensitivity subsided. With hindsight, it had the main hallmarks of a theological debate. The main problem with counting how many angels can stand on the tip of a needle is that angels generally don't hang out on needles. The main problem with syntactic context-sensitivity is that contexts are never purely syntactic.

Chomsky noted that natural language should be construed as context-sensitive as soon as he defined the notion of context-sensitivity. Restricting the language models to syntax, and syntax to sentences, made proving his observation into a conundrum.

But now that the theology of syntactic contexts is behind us, and the language models are in front of us, waiting to be understood, the question arises: how are the contexts really processed? How do we do it, and how do the chatbots do it? Where do we all store big contexts? A novel builds up its context starting from the first sentence, and refers to it 800 pages later. How does a language model find the target of such a reference? It cannot maintain references between everything and everything else. How do you choose what to remember?

Semantic dependencies on remote contexts have been one of the central problems of natural language processing from the outset. The advances in natural language processing that we witness currently arise to a large extent from progress in solving that problem. To get an idea about the challenge, consider the following paragraph:

Unsteadily, Holmes stepped out of the barge. Moriarty was walking away down the towpath and into the fog. Holmes ran after him. "Give it back to me," he shouted. Moriarty turned and laughed. He opened his hand and the small piece of metal fell onto the path. Holmes reached to pick it up but Moriarty was too quick for him. With one slight movement of his foot, he tipped the key into the lock.

If you are having trouble understanding what just happened, you are in good company. Without sufficient hints, the currently available chatbots do not seem to be able to produce a correct interpretation. In the next part, we will see how the contexts are generated, including much larger ones. After that, we will be ready to explain how they are processed.

View original post here:

Syntax: the language form. How do you know that this is a sentence? | by Dusko Pavlovic | Mar, 2024 - Towards Data Science

Maximizing AI Efficiency in Production with Caching: A Cost-Efficient Performance Booster – Towards Data Science

Unlock the Power of Caching to Scale AI Solutions with LangChain Caching: A Comprehensive Overview


Despite the transformative potential of AI applications, approximately 70% never make it to production. The challenges? Cost, performance, security, flexibility, and maintainability. In this article, we address two critical challenges, escalating costs and the need for high performance, and reveal how a caching strategy in AI is THE solution.

Running AI models, especially at scale, can be prohibitively expensive. Take, for example, the GPT-4 model, which costs $30 for processing 1M input tokens and $60 for 1M output tokens. These figures can quickly add up, making widespread adoption a financial challenge for many projects.

To put this into perspective, consider a customer service chatbot that processes an average of 50,000 user queries daily. The query and the response might each average 50 tokens. In a single day, that translates to 2,500,000 input tokens and 2,500,000 output tokens, or about 75 million of each in a month. At GPT-4's pricing, this means the chatbot's owner could be facing about $2,250 in input token costs and $4,500 in output token costs monthly, totaling $6,750 just for processing user queries. What if your application is a huge success, and you have 500,000 user queries or 5 million user queries per day?
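
The arithmetic, spelled out (prices as quoted above, and a 30-day month assumed):

```python
queries_per_day = 50_000
tokens_per_query = 50            # average input tokens per query
tokens_per_response = 50         # average output tokens per response
days_per_month = 30

input_price = 30 / 1_000_000     # $ per input token (GPT-4 pricing quoted above)
output_price = 60 / 1_000_000    # $ per output token

monthly_input_tokens = queries_per_day * tokens_per_query * days_per_month      # 75,000,000
monthly_output_tokens = queries_per_day * tokens_per_response * days_per_month  # 75,000,000

input_cost = monthly_input_tokens * input_price      # $2,250
output_cost = monthly_output_tokens * output_price   # $4,500
print(input_cost, output_cost, input_cost + output_cost)   # 2250.0 4500.0 6750.0
```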

Today's users expect immediate gratification, a demand that traditional machine learning and deep learning approaches struggle to meet. The arrival of Generative AI promises near-real-time responses, transforming user interactions into seamless experiences. But sometimes generative AI may not be fast enough.
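
The caching idea itself is simple: if an equivalent query has already been answered, return the stored response instead of calling the model again, paying the token cost and the latency only on cache misses. A minimal sketch, independent of any particular framework (call_llm is a placeholder, not LangChain's actual API):

```python
import hashlib

_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    # Stand-in for the real (slow, expensive) model call.
    return f"(model response to: {prompt})"

def cached_completion(prompt: str) -> str:
    # Normalize the prompt so trivially different phrasings share a cache key.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)   # only a cache miss costs tokens and time
    return _cache[key]
```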

Consider the same AI-driven chatbot service for customer support, designed to provide instant responses to customer inquiries. Without caching, each query is processed in

Read more:

Maximizing AI Efficiency in Production with Caching: A Cost-Efficient Performance Booster - Towards Data Science

Overlooked Apollo data from the 1970s reveals huge record of ‘hidden’ moonquakes – Livescience.com

The moon is much more seismically active than we realized, a new study shows. A reanalysis of abandoned data from NASA's Apollo missions has uncovered more than 22,000 previously unknown moonquakes, nearly tripling the total number of known seismic events on the moon.

Moonquakes are the lunar equivalent of earthquakes, caused by movement in the moon's interior. Unlike earthquakes, these movements are caused by gradual temperature changes and meteorite impacts, rather than shifting tectonic plates (which the moon does not have, according to NASA). As a result, moonquakes are much weaker than their terrestrial counterparts.

Between 1969 and 1977, seismometers deployed by Apollo astronauts detected around 13,000 moonquakes, which until now were the only such lunar seismic events on record. But in the new study, one researcher spent months painstakingly reanalyzing some of the Apollo records and found an additional 22,000 lunar quakes, bringing the total to 35,000.

The findings were presented at the Lunar and Planetary Science Conference, which was held in Texas between March 13 and March 17, and are in review by the Journal of Geophysical Research.


The newly discovered moonquakes show "that the moon may be more seismically and tectonically active today than we had thought," Jeffrey Andrews-Hanna, a geophysicist at the University of Arizona who was not involved in the research, told Science magazine. "It is incredible that after 50 years we are still finding new surprises in the data."

Apollo astronauts deployed two types of seismometers on the lunar surface: one capable of capturing the 3D motion of seismic waves over long periods; and another that recorded more rapid shaking over short periods.


The 13,000 originally identified moonquakes were all spotted in the long-period data. The short-period data has been largely ignored due to a large amount of interference from temperature swings between the lunar day and night, as well as issues beaming the data back to Earth, which made it extremely difficult to make sense of the numbers.

"Literally no one checked all of the short-period data before," study author Keisuke Onodera, a seismologist at the University of Tokyo, told Science Magazine.

Not only had this data gone unchecked, but it was almost lost forever. After the Apollo missions came to an end, NASA pulled funding from lunar seismometers to support new projects. Although the long-period data was saved, NASA researchers abandoned the short-period data and even lost some of their records. However, Yosio Nakamura, a now-retired geophysicist at the University of Texas in Austin, saved a copy of the data on 12,000 reel-to-reel tapes, which were later digitally converted.

"We thought there must be many, many more [moonquakes in the data]," Nakamura told Science magazine. "But we couldn't find them."

In the new study, Onodera spent three months going back over the digitized records and applying "denoising" techniques to remove the interference in the data. This enabled him to identify 30,000 moonquake candidates, and after further analysis, he found that 22,000 of these were caused by lunar quakes.

Not only do these additional quakes show there was more lunar seismic activity than we realized, but the readings also hint that more of these quakes were triggered at shallower points than expected, suggesting that the mechanisms behind some of these quakes are more fault-orientated than we knew, Onodera said. However, additional data will be needed to confirm these theories.

Recent and future moon missions could soon help scientists to better understand moonquakes. In August 2023, the Vikram lander from India's Chandrayaan-3 mission detected the first moonquake since the Apollo missions on its third day on the lunar surface.

Onodera and Nakamura hope that future NASA lunar seismometers on board commercial lunar landers, such as Intuitive Machines' Odysseus lander, which in February became the first U.S. lander to reach the moon in more than 50 years, will confirm what the new study revealed.

Read the original:

Overlooked Apollo data from the 1970s reveals huge record of 'hidden' moonquakes - Livescience.com

Collection of Guides on Mastering SQL, Python, Data Cleaning, Data Wrangling, and Exploratory Data Analysis – KDnuggets

Data plays a crucial role in driving informed decision-making and enabling Artificial Intelligence based applications. As a result, there is a growing demand for skilled data professionals across various industries. If you are new to data science, this extensive collection of guides is designed to help you develop the essential skills required to extract insights from vast amounts of data.

Link: 7 Steps to Mastering SQL for Data Science

It is a step-by-step approach to mastering SQL, covering the basics of SQL commands, aggregations, grouping, sorting, joins, subqueries, and window functions.

The guide also highlights the significance of using SQL to solve real-world business problems by translating requirements into technical analyses. For practice and for preparing for data science interviews, it recommends working through SQL exercises on online platforms like HackerRank and PGExercises.

Link: 7 Steps to Mastering Python for Data Science

This guide provides a step-by-step roadmap for learning Python programming and developing the necessary skills for a career in data science and analytics. It starts with learning the fundamentals of Python through online courses and coding challenges. Then, it covers Python libraries for data analysis, machine learning, and web scraping.

The career guide highlights the importance of practicing coding through projects and building an online portfolio to showcase your skills. It also offers free and paid resource recommendations for each step.

Link: 7 Steps to Mastering Data Cleaning and Preprocessing Techniques

A step-by-step guide to mastering data cleaning and preprocessing techniques, an essential part of any data science project. The guide covers various topics, including exploratory data analysis, handling missing values, dealing with duplicates and outliers, encoding categorical features, splitting data into training and test sets, feature scaling, and addressing imbalanced data in classification problems.

You will learn the importance of understanding the problem statement and the data, with the help of example code for the various preprocessing tasks using Python libraries such as Pandas and scikit-learn.

Link: 7 Steps to Mastering Data Wrangling with Pandas and Python

It is a comprehensive learning path for mastering data wrangling with pandas. The guide covers prerequisites like learning Python fundamentals, SQL, and web scraping, followed by steps to load data from various sources, select and filter dataframes, explore and clean datasets, perform transformations and aggregations, join dataframes and create pivot tables. Finally, it suggests building an interactive data dashboard using Streamlit to showcase data analysis skills and create a portfolio of projects, essential for aspiring data analysts seeking job opportunities.
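
Those steps map onto a handful of pandas idioms. A compressed sketch (the file names and columns are invented for illustration):

```python
import pandas as pd

# Load data from various sources
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("customers.csv")

# Select, filter, and clean
recent = orders.loc[orders["order_date"] >= "2024-01-01", ["customer_id", "amount", "order_date"]]
recent = recent.dropna(subset=["amount"]).drop_duplicates()

# Join, transform, and aggregate
joined = recent.merge(customers, on="customer_id", how="left")
monthly = (joined
           .assign(month=joined["order_date"].dt.to_period("M"))
           .groupby(["month", "region"], as_index=False)["amount"].sum())

# Pivot table
pivot = monthly.pivot_table(index="month", columns="region", values="amount", aggfunc="sum")
print(pivot.head())
```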

Link: 7 Steps to Mastering Exploratory Data Analysis

The guide outlines the 7 key steps for performing effective Exploratory Data Analysis (EDA) using Python. These steps include data collection, generating statistical summary, preparing data through cleaning and transformations, visualizing data to identify patterns and outliers, conducting univariate, bivariate, and multivariate analysis of variables, analyzing time series data, and dealing with missing values and outliers. EDA is a crucial phase in data analysis, enabling professionals to understand data quality, structure, and relationships, ensuring accurate and insightful analysis in subsequent stages.

To begin your journey in data science, it's recommended to start with mastering SQL. This will allow you to work efficiently with databases. Once you're comfortable with SQL, you can dive into Python programming, which comes with powerful libraries for data analysis. Learning essential techniques like data cleaning is important, as it will help you maintain high-quality datasets.

Then, gain expertise in data wrangling with pandas to reshape and prepare your data. Most importantly, master exploratory data analysis to thoroughly understand datasets and uncover insights.

After following these guidelines, the next step is to work on a project and gain experience. You can start with a simple project and then move on to more complex ones. Write about it on Medium and learn about the latest techniques to improve your skills.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

Original post:

Collection of Guides on Mastering SQL, Python, Data Cleaning, Data Wrangling, and Exploratory Data Analysis - KDnuggets

iTrials: A User-Centric Design Philosophy to Transform Clinical Trial Patient Enrollment – Polsky Center for Entrepreneurship and Innovation

Published on Tuesday, March 19, 2024

iTrials is a participant in Cohort 3 of Transform, a data science and AI startup accelerator powered by Deep Tech Ventures at the University of Chicago's Polsky Center for Entrepreneurship and Innovation in collaboration with the Data Science Institute.

iTrials is an AI platform for clinical trial enrollment that automates participant matching and reduces reliance on IT support and recruitment firms.

Cofounders Nitender Goyal and Vaibhav Saxena initially met during their MBA studies at the University of Chicago Booth School of Business. After graduating in 2021, the idea for iTrials surfaced the following year during a casual gathering of Booth alumni in Washington DC.

Not long after, the duo, recognizing the vital role of clinical operations and frontline impact, added a third cofounder, Manvir Nijar, who brings with her expertise and an understanding of the real-world challenges within clinical trials.

Together the team is addressing the challenge of patient recruitment in clinical trials, which experience significant delays due to inefficient data management and resource constraints. It is a well-documented challenge that results in increased costs and delays in bringing therapies to market.

iTrials' approach to addressing this, however, differs significantly from others, according to the cofounders.

"Unlike many competitors who primarily focus on streamlining workflows or implementing data analytics, our unique market proposition lies in our user-centric design philosophy," said Goyal, a physician-scientist with a background in academic medicine and biotech, particularly focusing on clinical trials and drug development. "Our solution is crafted with the study team as the central focus, ensuring that all technical complexities are hidden from the end user."

Goyal described the system, developed by clinicians, as intuitive and user-friendly, requiring minimal training. "Study coordinators can seamlessly navigate through billions of records and relationships without the need for technical support or specialized knowledge," he added. The approach ensures the platform is versatile enough to meet the varied needs of researchers, clinicians, and patients.

Saxena, who has implemented emerging technologies across various industries over his career, said the application of AI is highly suitable for addressing the challenge of patient recruitment in clinical trials. Notably, a significant portion of patient data (around 80%) is unstructured and therefore underutilized.

AI, particularly natural language processing (NLP) and large language models (LLMs), however, excels at processing and understanding unstructured data, which Saxena said makes it invaluable for extracting meaningful insights from patient records.

"Furthermore, AI-powered solutions can leverage large-scale compute clusters to efficiently handle vast amounts of data. This capability enables our platform to process and analyze patient records at scale, identifying relevant matches for clinical trials with speed and accuracy," he explained. "By automating participant matching and reducing reliance on manual processes, AI minimizes the need for IT support or recruitment firms, streamlining the recruitment process and expediting trial timelines."

Additionally, the platform leverages fine-tuned LLMs and knowledge graphs to simplify decision-making at scale, empowering study teams to make informed decisions based on comprehensive insights derived from both structured and unstructured data, said Saxena.

To ensure the product aligns with market needs, the company has continually sought feedback from industry leaders and customers, including investigators and study coordinators. This has helped guide the design, development, and ongoing evolution of the product.

"Our commitment to listening and responding to customer input has led to exciting opportunities for collaboration. We've been selected to collaborate and co-develop our solution with a prestigious academic research institution, further validating the value and relevance of our platform within the clinical research community," said Nijar. "This recognition underscores our dedication to delivering innovative solutions that address real-world challenges and drive positive outcomes in the healthcare industry."

As part of its participation in the Transform accelerator, Nijar said the team is excited to connect with industry experts and receive mentorship that will guide them through refining the product and accelerate its growth.

"Our aspiration is to emerge as the premier analytics company for clinical research, extending our influence to enhance the utilization of data and generative AI to positively impact drug development," she said.

// Powered by the Polsky Center's Deep Tech Ventures, Transform provides full-spectrum support for the startups accepted into the accelerator, including access to business and technical training, industry mentorship, venture capital connections, and funding opportunities.

Read more:

iTrials: A User-Centric Design Philosophy to Transform Clinical Trial Patient Enrollment - Polsky Center for Entrepreneurship and Innovation

New AI Tool from bioXcelerate Set to Speed Up Drug Discovery – BioPharm International

The health data science division of Optima Partners, bioXcelerate, has released a tool aimed at accelerating the drug discovery process.

The health data science division of Optima Partners Limited, bioXcelerate, announced, in a March 20, 2024 press release, its new artificial intelligence (AI) tool, PleioGraph, which has been designed to speed up the drug discovery process (1).

Using the company's own machine learning algorithms to assess biological data, PleioGraph can help drug developers identify genetic colocalization, the process by which genes, proteins, and cells interact with each other to cause disease. By providing this information, drug developers can gain insights into how diseases develop, which will ultimately aid in speeding up the drug development process.

"The scale of data available is outpacing our ability to analyze it," said Dr. Chris Foley, chief scientist and managing director at bioXcelerate, in the press release (1). "While patterns explaining how diseases occur are hidden within our ever-increasing databanks, AI technologies which quickly and accurately reveal these patterns are in short supply. At bioXcelerate, we are providing a solution to overcome these barriers. Combining our academic expertise with that of our unmatched breadth of health data science knowledge, our PleioGraph technology provides an exciting, industry-leading tool to supercharge early-phase drug discovery and, ultimately, improve patient outcomes."

Total expenditure on R&D across the global pharma industry was reported as $244 billion in 2022 (2). However, according to research from Deloitte, the return on investment being seen from R&D in 2022 fell to its lowest level at 1.2% (3). Additionally, the average time and cost of developing a new drug have both increased (3), and attrition rates remain stubbornly high, with some estimates claiming a failure rate of 90% for clinical drug development (4).

"Drug development costs are on the rise, and coupled with a high failure rate, the development of new treatments has become drawn out and costly," added Dr. Heiko Runz, a scientific partner of Optima, in the press release (1). "By improving efficiencies in drug discovery, we hope to accelerate the emergence of novel medicines that positively impact patients' lives and reach them more rapidly."

An innovative health data division of Optima Partners, bioXcelerate was founded by academics from universities, including Cambridge, Edinburgh, Imperial College, and Oxford, all in the United Kingdom. The company is focused on using data science to enhance the pharmaceutical, biotech, and public health sectors, and is seeking to improve patient outcomes through catalyzing drug discovery and clinical development processes using advanced statistics, machine learning, and software engineering.

1. bioXcelerate. Groundbreaking Health Data Science Tool Set to Make Drug Discovery 100 Times Faster. Press Release, March 20, 2024.
2. Mikulic, M. Total Global Pharmaceutical R&D Spending 2014–2028. Statista, Oct. 6, 2023.
3. Deloitte. Deloitte Pharma Study: Drop-off in Returns on R&D Investments, Sharp Decline in Peak Sales per Asset. Press Release, Jan. 23, 2023.
4. Sun, D.; Gao, W.; Hu, H.; Zhou, S. Why 90% of Clinical Drug Development Fails and How to Improve it? Acta Pharm. Sin. B. 2022, 12 (7), 3049–3062.

Source: bioXcelerate

View post:

New AI Tool from bioXcelerate Set to Speed Up Drug Discovery - BioPharm International

More Studies by Columbia Cancer Researchers Are Retracted – The New York Times

Scientists in a prominent cancer lab at Columbia University have now had four studies retracted and a stern note added to a fifth accusing it of severe abuse of the scientific publishing system, the latest fallout from research misconduct allegations recently leveled against several leading cancer scientists.

A scientific sleuth in Britain last year uncovered discrepancies in data published by the Columbia lab, including the reuse of photos and other images across different papers. The New York Times reported last month that a medical journal in 2022 had quietly taken down a stomach cancer study by the researchers after an internal inquiry by the journal found ethics violations.

Despite that study's removal, the researchers, Dr. Sam Yoon, chief of a cancer surgery division at Columbia University's medical center, and Changhwan Yoon, a more junior biologist there, continued publishing studies with suspicious data. Since 2008, the two scientists have collaborated with other researchers on 26 articles that the sleuth, Sholto David, publicly flagged for misrepresenting experiments' results.

One of those articles was retracted last month after The Times asked publishers about the allegations. In recent weeks, medical journals have retracted three additional studies, which described new strategies for treating cancers of the stomach, head and neck. Other labs had cited the articles in roughly 90 papers.

A major scientific publisher also appended a blunt note to the article that it had originally taken down without explanation in 2022. "This reuse (and in part, misrepresentation) of data without appropriate attribution represents a severe abuse of the scientific publishing system," it said.

Still, those measures addressed only a small fraction of the lab's suspect papers. Experts said the episode illustrated not only the extent of unreliable research by top labs, but also the tendency of scientific publishers to respond slowly, if at all, to significant problems once they are detected. As a result, other labs keep relying on questionable work as they pour federal research money into studies, allowing errors to accumulate in the scientific record.

More here:

More Studies by Columbia Cancer Researchers Are Retracted - The New York Times

Data Science at Carleton: Researchers Use Big Data to Address Real-World Issues – Carleton Newsroom

Data science is like detective work for information. It uses scientific methods to collect, organize, and analyze data, uncovering useful patterns and insights. As a national hub of data science, Carleton University is deeply engaged in data research. With more than 170 data researchers across five faculties and 26 departments, Carleton is addressing real-world issues including social movements, accessibility, finance, human behavior, and health and wellness.

On March 26, Carleton will host its 10th annual Data Day. Jointly hosted by the Faculty of Science and the Carleton University Institute for Data Science (CUIDS), Data Day is an annual conference that celebrates the latest developments in data science and analytics research.

Read how Carleton data science researchers are making an impact below.

Cryptocurrencies, such as Bitcoin and Ethereum, have garnered significant attention from the media due to their innovative technology and the potential for high returns on investment. However, this media attention can distort market perceptions, create instability, and increase the risk for investors.

Mohamed Al Guindy, a leading researcher in business and financial technology, is spearheading this research to shed light on the relationship between media attention and cryptocurrency price fluctuations.

"It's crucial for investors to educate themselves on these issues prior to delving into the unknown. I think this is especially important at a time when there is so much hype, and so little fundamental or useful information," Al Guindy explains.

Al Guindy's research is motivated by the extensive media coverage of cryptocurrencies, particularly Bitcoin, and its potential impact on investor behavior. By analyzing tens of millions of tweets, he has found that heightened media coverage and investor attention may contribute to greater price volatility (instability), posing challenges for small investors in the cryptocurrency market.

"It might appear easy to follow the crowd and jump on the bandwagon, but that may not be the most optimal choice," Al Guindy says.

Al Guindy's work is paving the way for informed decision-making in the cryptocurrency market.

"I believe that many of the emerging and most important research questions in business and finance can be addressed using data science and big data."

Read the original:

Data Science at Carleton: Researchers Use Big Data to Address Real-World Issues - Carleton Newsroom

Must Have Checklist For Data Scientists in 2024 – Analytics Insight

In the dynamic landscape of data science, the role of a data scientist continues to evolve, with new technologies, methodologies, and challenges emerging on a regular basis. To thrive in this ever-changing field, data scientists must possess a diverse set of skills that enable them to navigate complex datasets, extract valuable insights, and drive impactful decisions. As we look ahead to 2024, mastering the following ten essential skills will help data scientists succeed in their roles and make a significant impact in their organizations.

Data scientists must be proficient in programming languages such as Python, R, and SQL to manipulate data, perform analysis, and build models efficiently. A strong foundation in programming enables data scientists to tackle a wide range of data-related tasks effectively.
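
For illustration only (this sketch is not from the article; it uses Python's built-in sqlite3 module and pandas on a small in-memory table), combining SQL and Python for a quick summary might look like:

# Hypothetical example: pull data with SQL, then summarize it in Python with pandas.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 90.5), ("north", 45.0)])

df = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn)
print(df)  # one row per region with its total sales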

A deep understanding of statistical concepts and techniques is crucial for data scientists to derive meaningful insights from data. Proficiency in statistical methods such as hypothesis testing, regression analysis, and Bayesian inference allows data scientists to make informed decisions and draw reliable conclusions.
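
As a small illustration (simulated data, assuming NumPy and SciPy are installed; a sketch rather than anything from the article), a two-sample hypothesis test in Python could look like:

# Hypothetical example: test whether two groups differ on a metric.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=0.12, scale=0.05, size=200)  # simulated metric, group A
group_b = rng.normal(loc=0.14, scale=0.05, size=200)  # simulated metric, group B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # a small p-value suggests a real difference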

Effective data visualization skills are essential for communicating insights and findings to stakeholders. Data scientists should be proficient in using tools like Matplotlib, Seaborn, and Tableau to create clear and compelling visualizations that enhance understanding and drive action.
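
A minimal Matplotlib sketch (simulated data, assuming NumPy and Matplotlib are available; illustrative only) might be:

# Hypothetical example: plot the distribution of a simulated metric.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(1).normal(size=1000)

plt.hist(values, bins=30, edgecolor="black")
plt.title("Distribution of a simulated metric")
plt.xlabel("Value")
plt.ylabel("Count")
plt.tight_layout()
plt.show()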

Data scientists must possess expertise in machine learning algorithms and techniques to build predictive models and make accurate forecasts. Proficiency in algorithms such as linear regression, decision trees, and neural networks enables data scientists to leverage the power of machine learning for solving complex problems.
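
As a rough sketch (using scikit-learn's bundled iris dataset purely for illustration, not an example from the article), a decision tree classifier can be trained and evaluated like this:

# Hypothetical example: train and evaluate a decision tree with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))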

A deep understanding of the domain in which they work is critical for data scientists to interpret data effectively and generate actionable insights. Data scientists should possess domain-specific knowledge that allows them to contextualize data, identify relevant patterns, and make informed recommendations.

Data wrangling, or the process of cleaning and transforming raw data into a usable format, is a fundamental skill for data scientists. Proficiency in data wrangling techniques, such as data cleaning, feature engineering, and data preprocessing, ensures that data scientists can work with high-quality data for analysis and modeling.
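
A short pandas sketch of these steps (made-up data and column names, for illustration only) could be:

# Hypothetical example: basic cleaning and feature engineering with pandas.
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["2024-01-05", "2024-02-11", None],
    "plan": ["basic", "pro", "pro"],
    "monthly_spend": [20.0, None, 55.0],
})

df["signup_date"] = pd.to_datetime(df["signup_date"])  # fix types
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())  # impute missing spend
df = df.dropna(subset=["signup_date"])  # drop rows with no usable date
df["signup_month"] = df["signup_date"].dt.month  # engineer a simple feature
df = pd.get_dummies(df, columns=["plan"])  # one-hot encode the categorical column
print(df)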

Data scientists must be adept problem solvers, capable of approaching complex problems with creativity and analytical rigor. Strong problem-solving skills enable data scientists to identify relevant questions, formulate hypotheses, and devise effective strategies for data analysis and interpretation.

Effective collaboration and communication skills are essential for data scientists to work effectively with cross-functional teams and stakeholders. Data scientists should be able to articulate technical concepts in a clear and concise manner, facilitate discussions, and build consensus around data-driven decisions.

In a rapidly evolving field like data science, continuous learning and adaptability are crucial for staying abreast of new technologies and methodologies. Data scientists should embrace a growth mindset, seek out opportunities for learning and development, and adapt to evolving trends and best practices.

Data scientists must adhere to ethical and responsible data practices to ensure the integrity and privacy of data. They should be familiar with regulations such as GDPR and HIPAA, adhere to ethical guidelines for data collection and usage, and prioritize the ethical implications of their work.

In conclusion, mastering these ten skills is essential for data scientists to succeed in their roles and drive impactful outcomes in 2024 and beyond. By continuously developing and honing these skills, data scientists can position themselves as valuable assets to their organizations, making meaningful contributions to the field of data science and driving innovation in their respective domains.

Link:

Must Have Checklist For Data Scientists in 2024 - Analytics Insight