Swizec Teller - a geek with a hatswizec.com

    Minimum substring cover problem

    A major part of my thesisinvolves finding an algorithm to discover a good substring cover of text in order to properly syllabify said text. But what is the substring cover problem anyway and what does it entail?

    algorithms doodle

    The Minimum Substring Cover Problem paper from Hermelin, Rawitz, Rizzi and Vialette dating back to 2007 (judging by the filename) serves as a good entry point into this topic.

    There are actually a lot of cover problems, the most famous being Minimum Set Cover and Minimum Vertex Cover problems. In this type of problems we are faced with two sets of elements and we want to cover one of the sets with another, by using the "least" elements from the covering set. I put "least" in quotes because the definition depends on what we want - maybe we want to use the least number of elements, perhaps we want the shortest elements ... whatever.

    For an example consider this:

    S = ['a', 'aab', 'aba'] C(S) = ['a', 'b', 'aa', 'ab', 'ba', 'aab', 'aba']

    We can easily see that C(S) is a set of all the possible coverings of S - using a combination of strings from C we can construct every string in S. This part isn't very difficult to calculate.

    Everything gets slightly hairier when you look for minimum covers:

    C_1 = ['a', 'b'] # 3-cover (need 3 strings to cover the longest string in S) C_2 = ['a', 'ab'] # 2-cover (need 2 strings to cover the longest string in S)

    Depending on how you choose the weight, both C1 and C_2 are _minimum substring covers of S. Considering "least" to mean least amount of strings then both are of weight 2, but if you consider "least" to mean the total length of strings then C_1 is better.

    You could easily argue C_2 is better, because it uses the least amount of elements to cover the whole set S. 1+3+3 = 7 for C_1 and 1+2+2 = 5 for C_2.

    Ok, so now we know that finding the minimum substring cover of a set of strings depends a whole lot on what you actually want. Always a good sign, having a well-known problem where people can't even agree on what the best solution looks like.

    The paper goes on to explain in great theoretical detail that, because this problem is similar to minimum vertex cover, minimum set cover and similar problems, it is NP-hard to approximate. This means that the problem is at least as hard as the hardest problems in NP, but it doesn't necessarily mean that there is no polynomial solution - it just hasn't been found yet.

    Luckily, if we constrain some parameters of the problem, it becomes/remains APX-hard - problems in this class have efficient algorithms that can find an answer within some fixed percentage of the optimal answer.

    The article then proposes two approximation algorithms for finding minimum substring covers of S.

    Local-Ratio Algorithms

    This algorithm follows from the local-ratio lemma, which in the case of substring cover means

    Let C be a cover for S, and let w_1 and w_2 be weight functions for C(S). If C is an alpha-approximate, both with respect to w_1 and with respect to w_2, then C is also alpha-approximate with respect to w_1+w_2.

    Data: A set of strings S, a weight function w:C(S) -> Q+, an integer l >= 2
    Result: An l-cover C for S (l is the number of substrings covering the longest s in S)

    This algorithm is guaranteed to terminate after a polynomial amount of recursive calls and it returns a (((m+1) binomial 2) - 1)-approximate l-cover of S.

    In sensible terms the algorithm basically does this: Add everything with zero weight to a partial solution, if this isn't the solution, it selects an uncovered substring in S and tries to cover it by examining all substrings in C_s.

    Linear Programming Rounding

    Originally the linear programming rounding algorithm was developed by Hajiaghayi et all. for the Minimum Multicolored Subgraph problem when l=2. It has now been expanded for any constant value of l.

    This section is extremely light on practical results and just shows a bunch of mathematics that supposedly prove how the algorithm can be extended and that the final result is an O(log^(1/l) n * m^((l/1)^2/l))-approximate algorithm.

    From what I can understand this algorithm approaches the problem with the idea that they are basically looking for l-factorizations of strings.

    According to this section, the minimum substring cover can be formulated using the following integer linear program:

    min SUM(c in C(s)) w(c)x_c s.t. SUM(f in Fl(s)) y_f >= 1 every s in S SUM(c in f in F_l(s)) y_f

    Then there are a bunch of proofs that this algorithm works and is indeed very awesome ... but by this time my eyes started glazing over and the September deadline for my thesis started looking very near.

    Enhanced by Zemanta

    Did you enjoy this article?

    Published on January 11th, 2012 in Algorithm, Approximation algorithm, Combinatorics, Linear programming, Math, NP-hard, Substring, Uncategorized

    Learned something new?
    Want to become an expert?

    Here's how it works 👇

    Leave your email and I'll send you thoughtfully written emails every week about React, JavaScript, and your career. Lessons learned over 20 years in the industry working with companies ranging from tiny startups to Fortune5 behemoths.

    Join Swizec's Newsletter

    And get thoughtful letters 💌 on mindsets, tactics, and technical skills for your career. Real lessons from building production software. No bullshit.

    "Man, love your simple writing! Yours is the only newsletter I open and only blog that I give a fuck to read & scroll till the end. And wow always take away lessons with me. Inspiring! And very relatable. 👌"

    ~ Ashish Kumar

    Join over 14,000 engineers just like you already improving their careers with my letters, workshops, courses, and talks. ✌️

    Have a burning question that you think I can answer? I don't have all of the answers, but I have some! Hit me up on twitter or book a 30min ama for in-depth help.

    Ready to Stop copy pasting D3 examples and create data visualizations of your own?  Learn how to build scalable dataviz components your whole team can understand with React for Data Visualization

    Curious about Serverless and the modern backend? Check out Serverless Handbook, modern backend for the frontend engineer.

    Ready to learn how it all fits together and build a modern webapp from scratch? Learn how to launch a webapp and make your first 💰 on the side with ServerlessReact.Dev

    Want to brush up on your modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

    By the way, just in case no one has told you it yet today: I love and appreciate you for who you are ❤️

    Created by Swizec with ❤️