Swizec Teller - a geek with a

Minimum substring cover problem

A major part of my thesis involves finding an algorithm to discover a good substring cover of a text in order to properly syllabify it. But what is the substring cover problem anyway, and what does it entail?


The Minimum Substring Cover Problem paper from Hermelin, Rawitz, Rizzi and Vialette dating back to 2007 (judging by the filename) serves as a good entry point into this topic.

There are actually a lot of cover problems, the most famous being the Minimum Set Cover and Minimum Vertex Cover problems. In this type of problem we are faced with two sets of elements and we want to cover one of the sets with the other, using the "least" elements from the covering set. I put "least" in quotes because the definition depends on what we want - maybe we want to use the fewest elements, perhaps we want the shortest elements ... whatever.

For an example consider this:

S = ['a', 'aab', 'aba']
C(S) = ['a', 'b', 'aa', 'ab', 'ba', 'aab', 'aba']

We can easily see that C(S) is the set of all substrings of the strings in S, that is, every candidate we could put into a cover. Using a combination of strings from C(S) we can construct every string in S. This part isn't very difficult to calculate.
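To show how little work this step is, here is a minimal sketch that builds C(S) for the example above. The function names `substrings` and `candidate_set` are mine, not the paper's.

```python
def substrings(s):
    """Return the set of all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def candidate_set(S):
    """Union of the substrings of every string in S, i.e. C(S)."""
    result = set()
    for s in S:
        result |= substrings(s)
    return result


S = ['a', 'aab', 'aba']
# Sort by length, then alphabetically, to match the listing in the post.
C_S = sorted(candidate_set(S), key=lambda c: (len(c), c))
print(C_S)  # ['a', 'b', 'aa', 'ab', 'ba', 'aab', 'aba']
```

A string of length m has at most m(m+1)/2 distinct substrings, so C(S) stays polynomial in the size of the input.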

Everything gets slightly hairier when you look for minimum covers:

C_1 = ['a', 'b'] # 3-cover (need 3 strings to cover the longest string in S)
C_2 = ['a', 'ab'] # 2-cover (need 2 strings to cover the longest string in S)

Depending on how you choose the weight, both C_1 and C_2 are minimum substring covers of S. If "least" means the fewest strings, then both have weight 2, but if "least" means the smallest total length of the strings, then C_1 is better (2 versus 3).
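The two readings of "least" are just two different weight functions. A small sketch, with function names I made up for illustration:

```python
def weight_by_count(C):
    """Weight of a cover = how many strings it contains."""
    return len(C)


def weight_by_total_length(C):
    """Weight of a cover = sum of the lengths of its strings."""
    return sum(len(c) for c in C)


C_1 = ['a', 'b']
C_2 = ['a', 'ab']

for name, C in [('C_1', C_1), ('C_2', C_2)]:
    print(name, weight_by_count(C), weight_by_total_length(C))
# C_1 2 2
# C_2 2 3
```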

You could just as easily argue C_2 is better, because it needs fewer pieces to spell out the whole set S: 1+3+3 = 7 for C_1 and 1+2+2 = 5 for C_2.
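Those piece counts can be checked mechanically. This is a sketch of a small dynamic program that finds the fewest strings from a cover needed to spell a given string; `min_pieces` is my own helper, not something from the paper.

```python
def min_pieces(s, C):
    """Fewest strings from C whose concatenation is s, or None if impossible."""
    INF = float('inf')
    # best[i] = fewest pieces from C that spell the prefix s[:i]
    best = [0] + [INF] * len(s)
    for i in range(1, len(s) + 1):
        for c in C:
            k = len(c)
            if 0 < k <= i and s[i - k:i] == c and best[i - k] + 1 < best[i]:
                best[i] = best[i - k] + 1
    return None if best[len(s)] == INF else best[len(s)]


S = ['a', 'aab', 'aba']
for C in (['a', 'b'], ['a', 'ab']):
    counts = [min_pieces(s, C) for s in S]
    # The largest count is the l for which C is an l-cover.
    print(C, counts, sum(counts), max(counts))
# ['a', 'b'] [1, 3, 3] 7 3
# ['a', 'ab'] [1, 2, 2] 5 2
```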

Ok, so now we know that finding the minimum substring cover of a set of strings depends a whole lot on what you actually want. Always a good sign, having a well-known problem where people can't even agree on what the best solution looks like.

The paper goes on to explain in great theoretical detail that, because this problem is similar to minimum vertex cover, minimum set cover and similar problems, it is NP-hard to approximate. This means that even getting provably close to the best answer is at least as hard as the hardest problems in NP. It doesn't strictly mean that there is no polynomial solution, but finding one would imply P = NP, and nobody has managed that yet.

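Even if we constrain some parameters of the problem, it remains APX-hard. APX is the class of problems that have efficient algorithms for finding an answer within some constant factor of the optimal answer. APX-hard means at least as hard as everything in that class, so unless P = NP we can't get arbitrarily close to the optimum in polynomial time.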

The article then proposes two approximation algorithms for finding minimum substring covers of S.

Local-Ratio Algorithms

This algorithm follows from the local-ratio lemma, which in the case of substring cover says:

Let C be a cover for S, and let w_1 and w_2 be weight functions for C(S). If C is alpha-approximate both with respect to w_1 and with respect to w_2, then C is also alpha-approximate with respect to w_1+w_2.

Data: A set of strings S, a weight function w:C(S) -> Q+, an integer l >= 2
Result: An l-cover C for S (l is the number of substrings covering the longest s in S)

This algorithm is guaranteed to terminate after a polynomial number of recursive calls and it returns a ((m+1 choose 2) - 1)-approximate l-cover of S.
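To get a feel for how loose that guarantee is, here is the arithmetic for the running example. The post never defines m; I'm assuming it is the length of the longest string in S, and the helper name `local_ratio_bound` is mine.

```python
from math import comb


def local_ratio_bound(m):
    """The approximation factor (m+1 choose 2) - 1 quoted above."""
    return comb(m + 1, 2) - 1


S = ['a', 'aab', 'aba']
m = max(len(s) for s in S)  # ASSUMPTION: m = length of the longest string in S
print(m, local_ratio_bound(m))  # 3 5
```

So for strings of length 3 the returned cover may weigh up to 5 times the optimum, and the factor grows quadratically with m.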

In sensible terms the algorithm basically does this: add everything with zero weight to a partial solution. If that isn't yet a cover, select an uncovered string in S and try to cover it by examining all substrings in C_s.
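I won't try to reproduce the local-ratio algorithm itself, but for tiny inputs an exhaustive search with the same inputs and outputs makes a handy baseline to compare any approximation against. This is a brute-force sketch, NOT the paper's algorithm, and it is exponential in the size of C(S). All function names are my own.

```python
from itertools import combinations


def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def min_pieces(s, C):
    """Fewest strings from C whose concatenation is s, or None if impossible."""
    INF = float('inf')
    best = [0] + [INF] * len(s)  # best[i] = fewest pieces spelling s[:i]
    for i in range(1, len(s) + 1):
        for c in C:
            k = len(c)
            if 0 < k <= i and s[i - k:i] == c and best[i - k] + 1 < best[i]:
                best[i] = best[i - k] + 1
    return None if best[len(s)] == INF else best[len(s)]


def is_l_cover(C, S, l):
    """True if every string in S is a concatenation of at most l strings from C."""
    for s in S:
        k = min_pieces(s, C)
        if k is None or k > l:
            return False
    return True


def brute_force_l_cover(S, l, weight=len):
    """Cheapest l-cover by trying every subset of C(S).

    weight maps one substring to a non-negative number; the weight of a
    cover is the sum over its members (default: total length).
    """
    candidates = sorted(set().union(*(substrings(s) for s in S)))
    best, best_w = None, None
    for size in range(1, len(candidates) + 1):
        for C in combinations(candidates, size):
            if is_l_cover(C, S, l):
                w = sum(weight(c) for c in C)
                if best_w is None or w < best_w:
                    best, best_w = list(C), w
    return best, best_w


S = ['a', 'aab', 'aba']
print(brute_force_l_cover(S, l=3))  # (['a', 'b'], 2)
print(brute_force_l_cover(S, l=2))  # (['a', 'ab'], 3)
```

With total length as the weight, the search recovers C_1 as the best 3-cover and C_2 as the best 2-cover from the example earlier in the post.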

Linear Programming Rounding

Originally the linear programming rounding algorithm was developed by Hajiaghayi et al. for the Minimum Multicolored Subgraph problem when l=2. It has now been extended to any constant value of l.

This section is extremely light on practical results and just shows a bunch of mathematics that supposedly proves how the algorithm can be extended and that the final result is an O(log^(1/l) n * m^((l-1)^2/l))-approximation algorithm.

From what I can understand, this algorithm approaches the problem as a search for l-factorizations of the strings in S.
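To make "l-factorization" concrete, here is a minimal sketch that enumerates them. I'm assuming an l-factorization of s is any way of writing s as a concatenation of at most l non-empty substrings; the paper's exact definition may differ, and the function name `factorizations` is mine.

```python
def factorizations(s, l):
    """Yield every way to split s into at most l non-empty consecutive pieces.

    ASSUMPTION: "at most l" rather than "exactly l" pieces.
    """
    if not s:
        yield ()  # nothing left to split
        return
    if l == 0:
        return  # pieces used up but text remains: dead end
    for i in range(1, len(s) + 1):
        for rest in factorizations(s[i:], l - 1):
            yield (s[:i],) + rest


for f in factorizations('aab', 2):
    print(f)
# ('a', 'ab')
# ('aa', 'b')
# ('aab',)

# A set C is an l-cover exactly when every string has at least one
# factorization whose pieces all come from C.
C = {'a', 'ab'}
print(any(all(p in C for p in f) for f in factorizations('aab', 2)))  # True
```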

According to this section, the minimum substring cover can be formulated using the following integer linear program:

min  SUM_(c in C(S)) w(c) * x_c
s.t. SUM_(f in F_l(s)) y_f >= 1             for every s in S
     SUM_(f in F_l(s), c in f) y_f <= x_c   for every s in S and c in C(s)

Then there are a bunch of proofs that this algorithm works and is indeed very awesome ... but by this time my eyes started glazing over and the September deadline for my thesis started looking very near.


Published on January 11th, 2012 in Algorithm, Approximation algorithm, Combinatorics, Linear programming, Math, NP-hard, Substring, Uncategorized
