swizec.com

# Minimum substring cover problem

A major part of my thesis involves finding an algorithm to discover a good substring cover of text in order to properly syllabify said text. But what is the substring cover problem anyway and what does it entail?

The Minimum Substring Cover Problem paper from Hermelin, Rawitz, Rizzi and Vialette dating back to 2007 (judging by the filename) serves as a good entry point into this topic.

There are actually a lot of cover problems, the most famous being the Minimum Set Cover and Minimum Vertex Cover problems. In this type of problem we are faced with two sets of elements and we want to cover one of the sets with the other, using the "least" elements from the covering set. I put "least" in quotes because the definition depends on what we want - maybe we want to use the fewest elements, perhaps we want the shortest elements ... whatever.

For example, consider this:

S = ['a', 'aab', 'aba']

C(S) = ['a', 'b', 'aa', 'ab', 'ba', 'aab', 'aba']

We can easily see that C(S) is the set of all substrings of the strings in S, so it contains every piece a cover could possibly use. Any cover of S is a subset of C(S), and by concatenating strings from a cover we can construct every string in S. This part isn't very difficult to calculate.
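To make "not very difficult to calculate" concrete, here is a minimal Python sketch that builds C(S) by collecting every contiguous substring of every string in S. The function names `substrings` and `candidate_set` are made up for illustration; they don't come from the paper.

```python
def substrings(s):
    """Return every non-empty contiguous substring of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def candidate_set(strings):
    """C(S): the union of the substrings of every string in S."""
    result = set()
    for s in strings:
        result |= substrings(s)
    return result


S = ["a", "aab", "aba"]
C_S = candidate_set(S)

# Sort by length, then alphabetically, to match the listing in the text.
print(sorted(C_S, key=lambda c: (len(c), c)))
# → ['a', 'b', 'aa', 'ab', 'ba', 'aab', 'aba']
```

A string of length m has at most m(m+1)/2 distinct substrings, so this stays polynomial in the size of the input.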

Everything gets slightly hairier when you look for minimum covers:

C_1 = ['a', 'b'] # 3-cover (need 3 strings to cover the longest string in S)

C_2 = ['a', 'ab'] # 2-cover (need 2 strings to cover the longest string in S)

Depending on how you choose the weight, both C_1 and C_2 are minimum substring covers of S. If "least" means the fewest strings, both have weight 2, but if "least" means the total length of the strings, then C_1 is better (1+1 = 2 versus 1+2 = 3).

You could just as easily argue C_2 is better, because it needs the fewest pieces to spell out the whole set S: 1+3+3 = 7 pieces for C_1 and 1+2+2 = 5 for C_2.
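The different notions of "least" can be checked with a few lines of Python. This is only a sketch: `min_pieces` is a hypothetical helper that uses a small dynamic program to find the fewest strings from a cover that concatenate to a given string.

```python
def min_pieces(s, cover):
    """Fewest strings from `cover` whose concatenation equals s, or None."""
    INF = float("inf")
    best = [0] + [INF] * len(s)  # best[i] = fewest pieces spelling s[:i]
    for i in range(1, len(s) + 1):
        for c in cover:
            start = i - len(c)
            if start >= 0 and s[start:i] == c and best[start] + 1 < best[i]:
                best[i] = best[start] + 1
    return None if best[-1] == INF else best[-1]


S = ["a", "aab", "aba"]
C_1 = ["a", "b"]
C_2 = ["a", "ab"]

for name, cover in (("C_1", C_1), ("C_2", C_2)):
    pieces = [min_pieces(s, cover) for s in S]
    print(
        name,
        "strings:", len(cover),                     # cardinality weight
        "total length:", sum(len(c) for c in cover),
        "l:", max(pieces),                          # pieces for the worst string
        "pieces overall:", sum(pieces),
    )
# → C_1 strings: 2 total length: 2 l: 3 pieces overall: 7
# → C_2 strings: 2 total length: 3 l: 2 pieces overall: 5
```

Which cover wins depends entirely on which of these numbers you decide to minimize.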

Ok, so now we know that finding the minimum substring cover of a set of strings depends a whole lot on what you actually want. Always a good sign, having a well-known problem where people can't even agree on what the best solution looks like.

The paper goes on to explain in great theoretical detail that, because this problem is similar to minimum vertex cover, minimum set cover and similar problems, it is NP-hard to approximate. NP-hard means the problem is at least as hard as the hardest problems in NP. Strictly speaking that doesn't prove there is no polynomial solution, but none has been found and none exists unless P = NP.

Even if we constrain some parameters of the problem, it remains APX-hard. APX is the class of problems that have efficient algorithms that can find an answer within some constant factor of the optimal answer. An APX-hard problem is at least as hard as everything in that class, which means you can't get arbitrarily close to the optimum in polynomial time unless P = NP.

The article then proposes two approximation algorithms for finding minimum substring covers of S.

## Local-Ratio Algorithms

This algorithm follows from the local-ratio lemma, which in the case of substring cover means:

Let C be a cover for S, and let w_1 and w_2 be weight functions for C(S). If C is alpha-approximate both with respect to w_1 and with respect to w_2, then C is also alpha-approximate with respect to w_1+w_2.
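The lemma is easier to believe with the one-line argument written out. The notation here is my own assumption, not the paper's: C* is an optimal cover for w = w_1 + w_2, and C*_1, C*_2 are optimal covers for w_1 and w_2 on their own. Weights are non-negative and we are minimizing.

```latex
\begin{aligned}
w(C) &= w_1(C) + w_2(C) \\
     &\le \alpha\, w_1(C^*_1) + \alpha\, w_2(C^*_2)
        && \text{($C$ is $\alpha$-approximate for $w_1$ and for $w_2$)} \\
     &\le \alpha\, w_1(C^*) + \alpha\, w_2(C^*)
        && \text{($C^*_i$ is optimal for $w_i$, so $w_i(C^*_i) \le w_i(C^*)$)} \\
     &= \alpha\, w(C^*)
\end{aligned}
```

So if you can split a weight function into two parts and a cover is good for each part, it is just as good for the sum. Local-ratio algorithms use this by repeatedly peeling off a simple weight function on which any reasonable cover is alpha-approximate.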

```
Data: A set of strings S, a weight function w: C(S) -> Q+, an integer l >= 2
Result: An l-cover C for S (l is the number of substrings covering the longest s in S)
begin

C
```

This algorithm is guaranteed to terminate after a polynomial number of recursive calls and it returns a ((m+1 choose 2) - 1)-approximate l-cover of S.

In sensible terms the algorithm basically does this: it adds everything with zero weight to a partial solution. If that isn't yet a solution, it selects an uncovered string in S and tries to cover it by examining all substrings in C_s.
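To get a feel for the local-ratio idea, here is a sketch of the technique on a different, simpler problem: the classic local-ratio 2-approximation for weighted vertex cover. This is not the paper's substring cover algorithm. It only shows the same pattern of subtracting weight until some elements hit zero and then taking the zero-weight elements. The function name and the example graph are made up for illustration.

```python
def local_ratio_vertex_cover(edges, weights):
    """Local-ratio 2-approximation for minimum weight vertex cover.

    For each edge, subtract the smaller residual weight from both
    endpoints. Afterwards every edge has an endpoint with residual
    weight zero, and those zero-weight vertices form the cover.
    """
    residual = dict(weights)
    for u, v in edges:
        delta = min(residual[u], residual[v])
        residual[u] -= delta
        residual[v] -= delta
    return {v for v, w in residual.items() if w == 0}


# Hypothetical example: a path a - b - c - d with integer weights.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
weights = {"a": 2, "b": 1, "c": 3, "d": 1}

cover = local_ratio_vertex_cover(edges, weights)
print(sorted(cover), sum(weights[v] for v in cover))
# → ['b', 'd'] 2
```

Each subtraction step is a tiny weight function on which any vertex cover is 2-approximate, so by the lemma the final cover is 2-approximate for the original weights. Integer weights keep the `== 0` test exact.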

## Linear Programming Rounding

Originally the linear programming rounding algorithm was developed by Hajiaghayi et al. for the Minimum Multicolored Subgraph problem when l=2. The paper extends it to any constant value of l.

This section is extremely light on practical results and just shows a bunch of mathematics that supposedly proves how the algorithm can be extended and that the final result is an O(log^(1/l) n * m^((l-1)^2/l))-approximation algorithm.

From what I can understand this algorithm approaches the problem with the idea that they are basically looking for l-factorizations of strings.

According to this section, the minimum substring cover can be formulated using the following integer linear program:

$$
\begin{aligned}
\min\ & \sum_{c \in C(S)} w(c)\, x_c \\
\text{s.t.}\ & \sum_{f \in F_l(s)} y_f \ge 1 && \text{for every } s \in S \\
& \sum_{f \in F_l(s),\ c \in f} y_f \ \dots
\end{aligned}
$$
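One way to read this program: pick at least one l-factorization for every string, and pay w(c) for every distinct substring those factorizations use. Under that reading (my assumption, and with F_l(s) taken to mean splits into at most l pieces), a brute-force search over the integral choices looks like the sketch below. It is exponential and only meant for toy inputs; the function names are hypothetical and this is not the paper's rounding algorithm.

```python
from itertools import combinations, product


def factorizations(s, l):
    """All ways to split s into at most l non-empty contiguous pieces."""
    result = []
    for k in range(1, min(l, len(s)) + 1):
        # Choose k - 1 cut positions strictly inside the string.
        for cuts in combinations(range(1, len(s)), k - 1):
            bounds = (0, *cuts, len(s))
            result.append(tuple(s[a:b] for a, b in zip(bounds, bounds[1:])))
    return result


def brute_force_min_cover(strings, l, weight=lambda c: 1):
    """Try one factorization per string; keep the cheapest union of pieces."""
    best_cover, best_weight = None, float("inf")
    for choice in product(*(factorizations(s, l) for s in strings)):
        cover = {piece for f in choice for piece in f}
        total = sum(weight(c) for c in cover)
        if total < best_weight:
            best_cover, best_weight = cover, total
    return best_cover, best_weight


print(factorizations("aab", 2))
# → [('aab',), ('a', 'ab'), ('aa', 'b')]

cover, total = brute_force_min_cover(["a", "aab", "aba"], l=2)
print(sorted(cover), total)
# → ['a', 'ab'] 2
```

For the running example with l = 2 and every substring weighing 1, the search lands on the same 2-cover ['a', 'ab'] from earlier. The LP rounding approach exists precisely because this enumeration blows up on anything bigger.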

Then there are a bunch of proofs that this algorithm works and is indeed very awesome ... but by this time my eyes started glazing over and the September deadline for my thesis started looking very near.

Published on January 11th, 2012 in Algorithm, Approximation algorithm, Combinatorics, Linear programming, Math, NP-hard, Substring, Uncategorized

Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.

Created by Swizec with ❤️