Related to AI "self learning" or "auto classification" engines.
Suppose we have a data-set of factors where the values are normalized between 0.0 and 1.0. For illustration, we'll assume it has factors A thru Z. A print-out may resemble:
 ID    A     B     C    Etc.   Z
 1   0.500 0.998 0.003
 2   0.734 0.034 0.973
 3   0.646 0.218 0.824
 Etc...

(For the first pass, let's assume there is no Null.)
Now suppose we wanted to do QueryByExample-like queries on it to find closest matches. An example query may resemble:
 D=0.762, G=0.008, Y=1.000

What we want to do is find the closest N Euclidean-distance "matches" in the database for this query. Factors not given in the query are not included. (This example query has 3 factors, but it may be one or 26.)
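For concreteness, here is a minimal sketch (Python) of the metric being described: a plain Euclidean distance computed only over the factors present in the query. The dict representation, the sample values, and the names distance/closest_n are illustrative assumptions, not anything prescribed above.

 import math

 # Records and queries assumed to be plain dicts of factor -> normalized value (sample values made up).
 record = {"A": 0.500, "B": 0.998, "C": 0.003, "D": 0.770, "G": 0.010, "Y": 0.990}
 query  = {"D": 0.762, "G": 0.008, "Y": 1.000}

 def distance(query, record):
     """Euclidean distance over only the factors named in the query."""
     return math.sqrt(sum((query[f] - record[f]) ** 2 for f in query))

 def closest_n(query, records, n):
     """Brute-force reference: the n closest records by this metric."""
     return sorted(records, key=lambda r: distance(query, r))[:n]

closest_n is exactly the sequential search the question wants to avoid; it is shown only to pin down what "closest" means.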
How could this be implemented such that an entire sequential search is not necessary? I don't have a clue how to index such a thing. Any ideas? Such a tool may be useful for AI.
For version 2.0, we want weighting factors on each query factor. Thus, the above query reworked may resemble:
 factor  value  weight
 ---------------------
 D       0.762  0.500
 G       0.008  0.850
 Y       1.000  0.245

The weights are also normalized to the range 0.0 to 1.0.
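One plausible reading of the weighting (a sketch only; the page doesn't pin down the exact rule) is to scale each per-dimension difference by its weight before squaring, which is also the interpretation the BSP discussion below appears to use:

 import math

 # Sketch: query maps factor -> (value, weight); the weighting rule itself is an assumption.
 weighted_query = {"D": (0.762, 0.500), "G": (0.008, 0.850), "Y": (1.000, 0.245)}

 def weighted_distance(query, record):
     return math.sqrt(sum((w * (v - record[f])) ** 2 for f, (v, w) in query.items()))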
Well, my first approach would be BinarySpacePartitioning? (BSP). To be clear, BSP can be performed on a space with any number of dimensions - 2D, 3D, 4D, or even 26D as you suggest. The 'binary' in 'binary space partitioning' refers to breaking the space in two on a dimension with each partition. I am ill-equipped to explain BSP on WikiWiki (pictures help), so I'll point you to external resources: http://en.wikipedia.org/wiki/Binary_space_partitioning, http://en.wikipedia.org/wiki/Kd-tree.
A query for "nearest" points to another given point or projection (line, region) will require narrowing down the partitions that might contain the desired target, then brute-forcing the 'closest' from among the points in the selected partition. Partitions 'nearby' another partition may also be searched, e.g. if the point searched for is near the edge of its partition. This narrowing is fairly simple in nature: if a split separates G<0.5 from the rest, you obviously need to search the 'G<0.5' partition (the G query value was 0.008, which is <0.5), but you also know that the minimum weighted distance to any point in the other partition is about 0.418 (= |0.008 - 0.50| * 0.850) in the G direction. Thus, when searching for the K nearest points, you can avoid a backtracking search into the G>=0.5 partition unless you can't find K points closer than 0.418 in the G<0.5 partition.
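A rough sketch of that idea in Python - a k-d-tree-flavored BSP that rotates through the dimensions and applies the pruning bound just described. Node, build_tree, nearest, and the query format (factor -> (value, weight)) are invented for illustration, not an existing API:

 import heapq
 import math

 FACTORS = [chr(c) for c in range(ord('A'), ord('Z') + 1)]

 class Node:
     def __init__(self, factor=None, split=None, left=None, right=None, points=None):
         self.factor, self.split = factor, split
         self.left, self.right = left, right
         self.points = points                              # only leaf nodes hold points

 def build_tree(points, depth=0, leaf_size=8):
     """Split the space in two on one dimension per level, rotating through the dimensions."""
     if len(points) <= leaf_size:
         return Node(points=points)
     factor = FACTORS[depth % len(FACTORS)]
     points = sorted(points, key=lambda p: p[factor])
     mid = len(points) // 2
     split = points[mid][factor]
     return Node(factor, split,
                 build_tree(points[:mid], depth + 1, leaf_size),
                 build_tree(points[mid:], depth + 1, leaf_size))

 def weighted_distance(query, point):
     # query maps factor -> (value, weight); use weight 1.0 for an unweighted query
     return math.sqrt(sum((w * (v - point[f])) ** 2 for f, (v, w) in query.items()))

 def nearest(node, query, k, best=None):
     """Collect the k nearest points, skipping partitions that cannot beat the current k-th best."""
     if best is None:
         best = []                                         # max-heap of (-distance, id, point)
     if node.points is not None:                           # leaf: brute force
         for p in node.points:
             heapq.heappush(best, (-weighted_distance(query, p), id(p), p))
             if len(best) > k:
                 heapq.heappop(best)
         return best
     if node.factor in query:
         value, weight = query[node.factor]
         near, far = (node.left, node.right) if value < node.split else (node.right, node.left)
         gap = abs(value - node.split) * weight            # e.g. |0.008 - 0.50| * 0.850 = ~0.418
     else:
         near, far, gap = node.left, node.right, 0.0       # unqueried dimension: no pruning possible
     nearest(near, query, k, best)
     if len(best) < k or gap < -best[0][0]:                # backtrack only if the far side could help
         nearest(far, query, k, best)
     return best

 # Usage: best = nearest(build_tree(records), {"D": (0.762, 0.5), "G": (0.008, 0.85), "Y": (1.0, 0.245)}, k=5)

Note that a split on a dimension the query doesn't mention gives no pruning at all, which is part of why the data requirements below grow so quickly with the number of dimensions.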
Weighted-distance searches are easy. A bit more complex is searching on complex lines (diagonals, curves) and other shapes (spheres, for example). It's all doable, though. BSP is one of the more flexible indexing techniques. And if you can produce a total-order function on a space ('<'), or have other ways of breaking spaces up in arbitrary and automated manners (necessary, for example, for spheres and toroids - there is no total-order in the dimensions measured in radians or degrees), BSP will work just fine. (OTOH, more complex partitioning functions also means more complex region-intersect tests and such, so stick with simple functions wherever feasible!)
BSP supports "persistent" data-structures (by which I mean CopyOnWrite with logarithmic time and space complexity), and high-performance region processing (i.e. grab all points that match certain criterion, or replace all points in a given region with another set). I'm very much of the opinion that programming languages that deal with large amounts of data ought to provide intrinsic support for BSP instead of simple maps, arrays, and hashes! BSP is very suitable for scene-graphs, collision detection, and so on.
Unless there is a reason to do otherwise, each partitioning of the multi-dimensional space simply rotates through the dimensions, so at 3 search-steps down from the root of the BSP tree you'll generally be able to narrow distances on three different dimensions. Whether those are the ideal dimensions for your query would depend on many factors, of course, but overall this results in a logarithmic time search (for a sufficiently large number of points).
It is worth mentioning, however, that it takes a veritable ton of data to take advantage of BSP in 26 dimensions. If you wish to break the search space on any given dimension into eighths or sixteenths, the number of datapoints required is on the order of at least 8^D or 16^D. For D=26, that is 10^23 or 10^31 points respectively. And that's not even including ID as a dimension. Suffice to say, BSP by itself doesn't scale to large numbers of dimensions.
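A quick check of those magnitudes:

 print(f"{8**26:.1e}")     # ~3.0e+23
 print(f"{16**26:.1e}")    # ~2.0e+31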
Alternative:
Given the large number of dimensions and a relatively small number of points ('relatively small' being less than, say, 16^D) you might wish to instead keep many indexes. The primitive form, in this case, would be one index per dimension on the same data. So I'll consider that first.
The region-selection search technique using one index per dimension:
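One plausible reading of that technique (a sketch with invented names; records is assumed to be a dict of id -> {factor: value}): each factor gets its own sorted index; a query pulls, from each queried factor's index, the IDs whose value lies within some radius of the query value, intersects those candidate sets, and computes exact distances on the survivors.

 import bisect
 from collections import defaultdict

 def build_indexes(records):
     """One sorted index per factor: parallel (values, ids) lists."""
     by_factor = defaultdict(list)
     for rid, rec in records.items():
         for factor, value in rec.items():
             by_factor[factor].append((value, rid))
     indexes = {}
     for factor, pairs in by_factor.items():
         pairs.sort(key=lambda p: p[0])
         indexes[factor] = ([v for v, _ in pairs], [r for _, r in pairs])
     return indexes

 def candidates(indexes, query, radius=0.1):
     """IDs whose value is within +/- radius of the query value on every queried factor."""
     result = None
     for factor, value in query.items():
         values, ids = indexes[factor]
         lo = bisect.bisect_left(values, value - radius)
         hi = bisect.bisect_right(values, value + radius)
         found = set(ids[lo:hi])
         result = found if result is None else (result & found)
     return result or set()

Exact distances are then computed only for the surviving IDs, and the radius can be widened if too few candidates come back.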
You can trade off a bit between the designs by using, say, four or five dimensions per index (i.e. the first index is on A..E, the second index is on F..J, the third index is on K..O, etc.) and doing BSP on each set of five dimensions. This hybrid approach could offer the necessary flexibility for smaller numbers of points in combination with greater narrowing of broadly multi-dimensional searches (i.e. when the dimensions queried share the same index). The cost of this performance, however, is complexity - greater difficulty in developing query optimizers and such.
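A hypothetical sketch of the grouping, just to make the layout concrete (the routing rule - send the query to whichever group covers the most of its factors - is an assumption):

 FACTORS = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
 GROUPS = [FACTORS[i:i + 5] for i in range(0, len(FACTORS), 5)]    # A..E, F..J, K..O, P..T, U..Y, Z

 def pick_group(query_factors, groups=GROUPS):
     return max(groups, key=lambda g: len(set(g) & set(query_factors)))

 print(pick_group({"D", "G", "Y"}))    # picks a group covering one of the queried factors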
I suspect this latter multi-index design may be more suitable for you, since it allows you to easily scale up the number of dimensions without also needing to exponentially scale up the number of datapoints. I seriously doubt you have anywhere near 16^D datapoints! I suggest avoiding the hybrid approach until you've tried both plain-old index-per-dimension and plain-old BSP and have benchmark numbers for both. Keep it simple until the complexity can be justified.
Speeding Sequential Search
Rather than avoid sequential search, perhaps speed it up instead. This may give us something with a high power-to-complexity ratio, which is always fun to explore. At least simple approaches are easier to test-drive than complicated ones.
One possible way to simplify sequential search would be a "compressed index" (CI). This approach assumes that new variables don't come along very often. (If they do, then perhaps slots can be reserved in chunks of say 5 or 10. See below for periodic cleaning process.)
Once you get an approximate list of best candidates from the compressed index, you then sift the actual values of the match candidates to refine the results further. Note that the existence of a compressed index doesn't preclude the existence of a full index(s).
As an example, use 2 binary digits to represent four states (using normalized ranges):
 00 - 0% to 24% (rounded)
 01 - 25% to 49%
 10 - 50% to 74%
 11 - 75% to 100%

Using this search template:
 Var A = 35%
 Var B = 4%
 Var C = 82%

we'd get the binary "string": 010011 (A=01, B=00, C=11)
We'd then calculate the total difference between the search template and the values in the compressed index by scanning and subtracting each corresponding pair of 2-bit groups. Rather than store each match score, perhaps a threshold can be set via manual configuration or by using a random sample first to estimate the distribution. (It may even be theoretically possible to guarantee we catch all possible matches past a certain threshold. I haven't done the math.) Perhaps a relatively small threshold could be used initially, and then a larger one if the first pass returns an insufficient quantity of candidates.
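A minimal sketch of that scan (Python, using the 2-bit buckets from the table above; the names encode/ci_scan and the threshold handling are assumptions):

 def encode(record, factors):
     """Quantize each normalized value (0.0..1.0) into a bucket 0..3."""
     return [min(int(record[f] * 4), 3) for f in factors]      # e.g. 0.35 -> 1, 0.04 -> 0, 0.82 -> 3

 def ci_scan(ci, query_codes, query_slots, threshold):
     """Candidate ids whose total bucket difference over the queried slots is within threshold."""
     hits = []
     for rid, codes in ci.items():
         diff = sum(abs(codes[slot] - qc) for slot, qc in zip(query_slots, query_codes))
         if diff <= threshold:
             hits.append(rid)
     return hits

The hits then go through the fine-tune pass against the actual values, as described above.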
Four variables would fit per byte, although there's the overhead of a record key. However, splitting the bytes may be the bottleneck here, not the size of the compressed index. Thus, perhaps just use a letter for each variable, giving us 26 value steps instead of 4 (and 128 if we want to use the entire Ascii range). If we have 40 variables, then each value in the compressed index is 40 bytes (assuming ASCII characters), plus the record identifier/key.
If we write the CI scanner in C instead of using a database to gain speed, keeping the index updated may involve a periodic cleaning process. If a record is changed in the master database, then a new record value is appended to the CI and the old one is left in place until the next cleaning period (re-indexing). If an outdated CI record shows up in the candidate list, it is simply ignored during the fine-tune pass through the actual data, because the final score depends on the actual values, not the CI scan. Leaving the old entries in only produces false positives in the candidate list, never false negatives, because the record's current entry is also scanned.
I'm sure I'm not the first to propose this kind of thing.
Or just parallelize the sequential search by hashing the nodes/records out to multiple CPUs or machines.
- t
What you describe is actually a crude manual version of BinarySpacePartitioning?. The binary in BSP exactly means to use one bit of the data at a time. The leading bit of all dimensions first, then the next and so on...
Wrong, for three reasons:
TopMind's approach does have a killer flaw. He didn't look very hard at his assumptions when he said: "then calculate the total difference between the search template and values in the compressed index by scanning and subtracting each pair". This is not nearly as trivial as TopMind suggests. For instance, consider:
 A = 26%  |  A = 24%
 B = 51%  |  B = 49%
 C = 73%  |  C = 76%

 011010 | 000111   (TopMind's strings), subtracted: 010011 (or 010111 w/o carry across dimensions)
 011100 | 001011   (bit-at-a-time), subtracted: 010001

 Cartesian distance (root of summed squared differences): 4.12%

These points are very close together, but nothing about the subtracted strings suggests they should even be in the candidate sets! Not even rearranging the strings bit-at-a-time helps, nor does subtracting on a per-dimension basis. While speeding up a sequential search can be useful, the only true thing TopMind said about his particular proposal is: "I haven't done the math."
By "pairs" I meant a "horizontal" group of 2 binary digits, not both strings being subtracted from each other. I apologize for not making that clear. Also note that we can in theory choose any number of bits to represent our compressed values. I just chose 2 and later 7 (Ascii) for my example. - top
 00 = 0   01 = 1   10 = 2   11 = 3

 011010 -> 01,10,10 -> 1,2,2
 000111 -> 00,01,11 -> 0,1,3

   1, 2, 2
   0, 1, 3
   -------
   1, 1,-1   (diff of each "group")

But I see your point that a worst-case coincidental set could score relatively low. Obviously, more precision may help, along with a wider "near" return set, but one has to be careful if it's an app where occasional coincidental "misses" are detrimental. I'll leave it as a reader exercise to find the minimum precision and/or minimum "near" threshold to guarantee we don't miss one. - t
I had covered both interpretations, actually. The "w/o carry across dimensions" was the pairwise approach as you describe it here, albeit with overflow. And, yes, more precision will help (if you get enough precision, you'll be representing A, B, C directly!). But recognizing this "wide 'near' return set" is non-trivial. When is one list of numbers 'near' another list? Of course, you could find a sum-of-squares... that would work, but the savings may turn out to be quite marginal.
That depends on what algorithm it's being compared to and how much accuracy is needed, which of course depends on the nature of the problem (the domain). Note that the sum of the absolute values of the differences is also a calculation option. I'll agree that this algorithm makes it complicated to know how often we are getting the very best match. But competitors, such as neural nets, have similar limitations. Also, an API user has the option of scanning the actual data if thoroughness is required. At least there are parameters/options to choose from, including the initial match threshold. Internally, it could look something like:
 matches = scan(threshold=small)
 if insufficient matches and not timeLimitReached {
    matches = scan(threshold=big)
    if insufficient matches and not timeLimitReached {
       matches = directScan(timeLimit)
    }
 }

(Details not shown: we'd probably want to keep the early matches in case we find nothing during the deeper search and need to present what we have found so far if the total time-limit is reached.)
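A runnable rendering of that control flow (Python; scan_ci, scan_actual, and the threshold values are stand-ins for whatever the real implementation provides):

 import time

 def find_matches(query, want, time_limit, scan_ci, scan_actual,
                  small_threshold=2, big_threshold=5):
     """Escalate from a cheap compressed-index scan to a direct scan, keeping earlier results."""
     deadline = time.monotonic() + time_limit
     matches = scan_ci(query, threshold=small_threshold)
     if len(matches) < want and time.monotonic() < deadline:
         matches = scan_ci(query, threshold=big_threshold) or matches
     if len(matches) < want and time.monotonic() < deadline:
         matches = scan_actual(query, deadline) or matches
     return matches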
Things to consider include but are not limited to:
I've been kicking around the idea of using something like this to match similar images, such as all photos taken at approximately the same time and place on a vacation. Possible factors to use:
The problem of clustering similar images is very well studied. Compared to state of the art approaches, any original approach you hack together is likely to be fragile and inefficient. Go learn how other people solve the problem. Example: http://homepage.tudelft.nl/19j49/t-SNE.html --AnonymousDonor
I'm not saying it's the only or best technique for image matching/clustering. My example is more a lab toy or a quickie example of how one may apply a EuclideanProximitySearchEngine rather than anything meant for production. My apologies for not making that clear up front like I should have.
See Also: FuzzyPredicate