# Tagged: strings

# Language in Computer Science

The last post was about strings, so now something about **languages**. From the theoretical point of view a *language* is a set of words or more precisely sentences (strings). Possibly and usually an infinite set. Only sentences that are present in the set are part of the language, nothing else. Neat feature of languages is, that they interconnect two separate formal theories/models into one thing. Let me explain.

Language is a **set**. As in mathematical set. So anything that applies to sets applies to languages as well (like union, intersection, complement, Cartesian product). Members of any language are only **strings**. And there are some operations defined over strings (that is concatenation, reversal, sub-strings etc.). We can take advantage of this fact and abstract those, so they work with language as well. There is generally **a lot** you can do with a formal language either from the set or string point of view.

**Definition 1.** A **formal language** *L* over an alphabet Σ is a subset of Σ^{*}. Formally, L ⊆ Σ^{*}. The magical Σ^{*} denotes a set of all existing words and sentences over Σ. Sometimes it’s called an alphabet iteration.

Like I said earlier, all operations defined for sets work with languages as well. For example **language union** L_{1} ∪ L_{2} = {s; s ∈ L_{1} ∨ s ∈ L_{2}} etc. More interesting are the string operations that can be abstracted so they work on languages. There is** language concatenation**:

**Definition 2.** Let L_{1} be a language over Σ_{1} and L_{2 }be a language over Σ_{2}. **Concatenation** of these two L = L_{1}⋅L2 = {xy; x ∈ L_{1}, y ∈ L_{2}} is a language over alphabet Σ = Σ_{1} ∪ Σ_{2}.

Language concatenation as well as the string equivalent is not commutative operation. There is a couple of special cases:

- L ⋅ {ε} = L
- L ⋅ ∅ = ∅

The first, {ε} is a language with a single word — an empty string, in the second case we’re talking about an empty set. Concatenation can be looked at a sort of multiplication between the two languages (remember Cartesian product?). As a result of this thought, we define **language power**.

**Definition 3.** Let *L* be a language over Σ, then *L ^{n}* denotes n-power of language and

- L
^{0}= {ε} - L
^{n+1}=L^{n}⋅L

With language power defined, we can advance to the last operation called **language iteration**. It’s basically an abstraction of language power.

**Definition 4**. L^{+} denotes positive iteration of language L defined accordingly

- L
^{+}= L ∪ L^{2}∪ … ∪ L^{∞}

**Definition 5**. L^{*} denotes iteration of language L defined accordingly

- L
^{*}= {ε} ∪ L ∪ L^{2}∪ … ∪ L^{∞}

## Summary

In this post were introduced some of the most basic and most common operations that are defined on languages. The most important thing to remember is, that language is a **set** of **sentences** (or words) and the operations that we define originate from these two areas.

We have some set operations like **union**, **intersection**, **complement **as for generic sets. And we have abstractions for string operations like **concatenation** (notice, that it’s somewhat similar to cartesian product), **language power** and **iteration**.

Anyway, these couple of operations is nowhere near to the complete list of all existing functions for languages! After all, if you dive into some algebraic structures a little you can easily define your own :-).

# Strings in Computer Science

You have probably stumbled upon a string data type in some programming language. In C it’s `<string.h>`

, in C++ `std::string`

, Python has them even PHP! They’re useful, pretty straight-forward. And the most beautiful thing is, that there is a theoretical foundation for them! These operations e.g. reversation, concatenation, making sub-string and of course the string itself! Let’s have a look.

**Definition 1.** String is a finite ordered sequence of characters. Let Σ be an alphabet, then *s* ∈ Σ^{*} is a string over Σ.

The mysterious Σ^{*} is an alphabet iteration. That is a set of all existing strings over Σ. Which makes it a infinite set, unless the alphabet is empty. You see, all the strings fall within this definition. s = “abcc” might be a string over alphabet Σ = {a, b, c}. One, sort of special case exists — an empty string, that is usually denoted by ε and its length is 0. Now, let’s proceed to some string operations. One of the most important is **concatenation**.

**Definition 2.** Let w and x be strings over Σ. Then string y = wx is a concatenation of these two. It’s also a string over Σ. You see, that concatenation is not commutative. You cannot switch the strings and expect to get the same result. Length of the new string |y| = |w| + |x| i.e. is equal to a sum of lengths of the concatenated strings.

Concatenation is very important string operation. It allows you to connect multiple strings together. When you have two or multiple strings connected, for instance w,x,y,z ∈ Σ^{*} , z = wxy, you can say that w is a **prefix** of z, y is a **suffix** of z and x is a **substring** of z. A **proper prefix** of a string is not equal to the string itself, i.e. xy != ε and equivalently y is a **proper suffix** if wx != ε.

**Definition 3.** Let x = a_{1}a_{2}…a_{n} be a string over Σ. Reversal of the string, y = *reverse*(x) = a_{n…}a_{2}a_{1}.

One rather trivial definition in the end, reverse string is the same sequence of characters backwards. The thing is, why can string reversal be defined this rather intuitive way? It’s because we can be absolutely sure, that each string has a finite number of symbols, i.e. we know the n.

## Note

It’s important to point out, that we’re talking about **theoretical strings**. You can apply this theory to string literals in programming (like I did) and it will work, but the strings in theory are far more abstract than that. It actually depends on what is your alphabet. Let’s say, that a every symbol in Σ represent a single processor instruction. If you create a string over that Σ, you’ll get a sequence of instructions. You can do it multiple times and get programs *x*,* y* ∈ Σ^{*}. Now, you can use concatenation, right? Then *y* =* xz*, where y is also a program in which *x* and *z* are **subroutines**!

## Conclusion

Strings are useful, when you need to model any **sequential behavior**. Computers are sequential machines (well, not that much lately, but in principle) so there is plenty to cover with these couple of definitions!

## Sources

- Češka M., Vojnar T., Smrčka A., Teoretická informatika, Studijní opora.