Avoid repeating query results by implementing a Bloom filter in cloud code

Hello,

I am trying to figure out a way to query documents only once. Imagine a situation where a user queries suggested profiles based on multiple criteria; once he has been served about 50 profiles, these should not be included in any later query.

I was able to get indexes working well. For example, out of 50,000 documents a query examines only 300 index keys, reads 40 documents and returns 29 results.

The problem is that running the query again the next day will, logically, scan and return the same documents. Estimating roughly that a user is served 50 profiles per day, the “skip list” would grow to about 50 × 365 ≈ 18,000 objectIds in one year and keep growing.

As false positives are acceptable in this case, I am considering a Bloom filter. But I got stuck on how to actually implement it so that the query still uses indexes and the documents/indexes themselves do not become huge. I came up with two possible scenarios:

1) bloom filter (an array of 32,000 Booleans) stored in the profile document

An array of 32,000 Bools should add only about 4 kB to the document size, which is not much. But each array position would need to be indexed for a query like .equalTo("bloom.x", false), and that seems unbearable considering that this bloom filter array would need to be updated every time the profile is fetched by any other user.

2) bloom filter (an array of n Integers, growing up to 32,000 Integers) stored in the user document

As the user document is already available in the cloud code function, I could load the array of “available bloom filter positions” as const bloomArray = request.user.get("bloom") and then use this array of Integers in the same cloud code function with query.containedIn("bloomHash", bloomArray). This way the bloom filter does not need to be indexed and the query should still use an index scan. Or do I misunderstand the query here, and would it still have to filter the results for the containedIn? The obvious disadvantage is that storing an array of 32,000 integers takes some disk space. Also, I am not sure how performant that query would be once the array grows close to 32,000 elements.
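
To make scenario 2 concrete, this is roughly the shape I have in mind (just a sketch; the function name is a placeholder, "bloom" and "bloomHash" are the field names used above):

    // Minimal sketch of scenario 2; other criteria (age, distance, ...) omitted
    Parse.Cloud.define("newProfilesSketch", async (request) => {
        // array of still "available" bloom positions kept on the requesting user
        const bloomArray = request.user.get("bloom");
        const query = new Parse.Query("PrsProfile");
        // constrain on the profiles' hashed position; should stay an index scan on "bloomHash"
        query.containedIn("bloomHash", bloomArray);
        return query.find({ sessionToken: request.user.getSessionToken() });
    });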

Therefore I was wondering if anyone has experience with storing the bloom filter as a bitmap (in that user document, not indexed) and then generating the array of integers in cloud code before it gets passed to the query?

Or is there a better way to avoid repeating query results? Any inputs/references/comments are appreciated!

EDIT1:

Investigating the second scenario further, it seems that it indeed uses an index scan, as I successfully skipped all the previously served documents:
→ totalKeysExamined 302
→ totalDocsExamined 0
→ nReturned 0
→ executionTimeMillis 13
But even though it shows 13 ms execution time, I can see a much larger response time when calling the cloud code function from the iOS client, going from approx. 200 ms → 2 s. Although I am happy that this solution successfully skips the document scanning, it seems that the bloom filter array is a burden for the cloud function:

Parse.Cloud.define("newProfiles", async (request) => {

    const token = request.user.getSessionToken();
    //get requesting user's profile
    const usrProfQuery = new Parse.Query("PrsProfile");
    usrProfQuery.equalTo("objectId", request.user.id);
    const usrProfile = await usrProfQuery.first({ sessionToken: token });
    if (!usrProfile) {
        throw "current user profile not found";
    }    

    //basic query criteria common for all specific queries
    const minAge = request.user.get("minA");
    const maxAge = request.user.get("maxA");
    const ori = usrProfile.get("o");
    const partnerOri = usrProfile.get("po");
    //Gender: 0 - woman, 1 - notDefined, 2 - man
    const usrGender = usrProfile.get("gn");
    if (usrGender == 1) {
        throw "current user gender is not defined";
    } 
    //max distance
    const maxDistance = request.user.get("maxD");
    const currentLoc = usrProfile.get("gp");

    //search type queries
    const specificQueries = [];

    // ------------ dating specific query ------------ 
    const usrDate = usrProfile.get("da");
    //const usrOns = usrProfile.get("on");
    if (usrDate != null && maxDistance != null) {
        const dateQuery = new Parse.Query("PrsProfile");
        //distance in km
        dateQuery.withinKilometers("gp", currentLoc, maxDistance, false);
        
        //basic data
        dateQuery.lessThanOrEqualTo("ag", maxAge);
        dateQuery.greaterThanOrEqualTo("ag", minAge);
        
        if (partnerOri != null) {
            dateQuery.equalTo("o", partnerOri);
        }
        //dateQuery.equalTo("po", ori);
        dateQuery.containedIn("po", [null, ori]);
        //genders are defined for each specific query separately
        if (usrGender == 0) {
            //search for users looking for notDefined or women
            dateQuery.lessThanOrEqualTo("da", 1);
        } else if (usrGender == 2) {
            //search for users looking for notDefined or men
            dateQuery.greaterThanOrEqualTo("da", 1);
        }
        if (usrDate != 1) {
            //if current user defines specific gender, add it to constraints
            dateQuery.equalTo("gn", usrDate);
        } 
        if (usrProfile.get("on") == true) {
            dateQuery.equalTo("on", true);
        }
        //bloom filter
        //Inverted values - Array of "available Integers"
        const bloomFilter = request.user.get("bloom"); 
        dateQuery.containedIn("bf", bloomFilter);
        //explain() used here only to check index usage (removed again in EDIT2)
        dateQuery.explain();
        specificQueries.push(dateQuery);
        const results = await specificQueries[0].find({ sessionToken: token });
        
        return results;
    } else {
        throw "not enough inputs to run a query";
    }
});

Any idea how to make it less resource-consuming?

EDIT2:

When I remove .explain() from the query and actually fetch only the results themselves, the execution time observed on the client side is back down to about 220 ms (500 ms non-cached), which seems like a success to me. Now only one question remains: how much space does such an inverted Bloom filter take (a large array of integers in a document that is read only by the user himself)?
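
For a rough feel, a back-of-the-envelope estimate (my own numbers, assuming the positions end up serialized as BSON int32 and ignoring any Parse overhead): each array element costs one type byte, the element index as a C-string key, and four value bytes.

    // Rough BSON size of an array of n int32 positions (assumption: values stored as int32)
    function bsonIntArrayBytes(n) {
        let total = 4 + 1;                           // embedded document length prefix + terminator
        for (let i = 0; i < n; i++) {
            total += 1 + (String(i).length + 1) + 4; // type byte + index key as C string + int32 value
        }
        return total;
    }
    // bsonIntArrayBytes(32000) ≈ 340 kB, bsonIntArrayBytes(64000) ≈ 690 kB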

I don’t have any experience with bloom filters, but I’ve seen many people having a hard time with big arrays in MongoDB, especially when they need to modify the arrays frequently: Avoid Unbounded Arrays — MongoDB Atlas. So if you have any way to store this data other than in an array, I’d go for that.

Thank you for your prompt response. I am evaluating whether this would be a smart way:

  1. instead of the array of “available bits” (array of n Integers, growing up to 32,000 Integers) I could store the raw bloom filter bitmap as data, which should be negligible in size (about 5 kB)
  2. in a cloud function I would need to serialise/deserialise it into an array of “available bits” so the query can still use indexing
  3. when the results come back to the client, the raw bitmap could get recalculated and updated (either on the client device or directly in that cloud function, depending on how expensive it would be)

As I am a total beginner with JavaScript and this project is all self-taught in my free time, just thinking about writing such a function already makes me want to cry. :smiley: Anyway, does that sound reasonable? I often hear that you should not recompute something that you can precompute (like storing the array of available bits), but if this only needs a little CPU, perhaps it is the way to go. Any thoughts?

  • For documentation purposes: I also tried to get input from the MongoDB forum here

Another option you can consider is using the Bytes data type.

Hello Antonio,

thank you for your input regarding bytes. I believe I found a pretty robust solution thanks to that. I described it here, but let me repeat it in this thread as well, in case the reference gets moved.

  1. as each user is responsible for his own bloom filter, he prepares an array of bools: he hashes the UID of each served user into one Integer and, based on those “seeds”, updates his bloom filter. That array of bools is then encoded to base64. For testing and dummy bloom filter generation in the Swift client I used this snippet:

    enum Bit {
        case zero, one
        func asInt() -> Int {
            return (self == .one) ? 1 : 0
        }
    }

    func generateBloomFilterData() {
        // fill a dummy filter with 64,000 random bits
        var randomlyFilledBloomFilter: [Bit] = []
        for _ in 0..<64000 {
            randomlyFilledBloomFilter.append(arc4random_uniform(2) == 0 ? .zero : .one)
        }

        // pack the bits into bytes, most significant bit first
        let numBytes = 1 + (randomlyFilledBloomFilter.count - 1) / 8
        var bytes = [UInt8](repeating: 0, count: numBytes)
        for (index, bit) in randomlyFilledBloomFilter.enumerated() {
            if bit == .one {
                bytes[index / 8] += UInt8(1 << (7 - index % 8))
            }
        }

        let data = Data(bytes)
        let base64String = data.base64EncodedString()
        // base64String is saved directly into the User document in the database
    }
    

That way the bloom filter takes only about 5 kB instead of the previous approx. 300 kB. As I am using 4 separate filters per user, this is a huge improvement in data transfer speed. Also, when the cloud function query fires, the user’s document gets loaded much faster.
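
One detail the snippet above leaves out is the hash that maps a served profile’s UID to a single bit position. Purely as an illustrative sketch (FNV-1a and the 64,000 modulus are arbitrary choices here, not a prescription), any deterministic hash that the profiles’ “bf” field and the client agree on would work, e.g. computed in cloud code when the profile’s “bf” value is written:

    // Illustrative only: map an objectId string to a bloom position in [0, 64000)
    function bloomPosition(objectId, m = 64000) {
        let h = 0x811c9dc5;                       // FNV-1a 32-bit offset basis
        for (let i = 0; i < objectId.length; i++) {
            h ^= objectId.charCodeAt(i);
            h = Math.imul(h, 0x01000193);         // FNV-1a 32-bit prime
        }
        return (h >>> 0) % m;                     // force unsigned 32-bit, reduce to filter size
    }
    // the same value is stored as "bf" on the profile and is the bit index
    // that the serving user flips in his bitmap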

  2. the second step is to decode the bloom filter from base64 into the “array of available bits.” As Parse Server uses JavaScript in the cloud code, I used the following part of the query code:

     const buf = Buffer.from(request.user.get("daBlm"), 'base64');

     const blm = [];

     for (var i = 0; i < buf.length; i++) {
         //iterate through the filter byte by byte; padStart keeps leading zero bits
         const byte = buf[i].toString(2).padStart(8, '0');
         for (var l = 0; l < byte.length; l++) {
             //push an int for every 0 bit in the bloom filter - a not-yet-used position
             if (byte[l] == "0") {
                 blm.push(i * 8 + l);
             }
         }
     }

     dateQuery.containedIn("bf", blm);
    

To my huge surprise this does not take much computing effort, even for a dummy bloom filter with no seeds (an empty one), where the loop above has to generate an array of 64,000 integers. With this, the user is able to query only documents whose hashed UID is not yet present in his bloom filter. After he gets served new profile documents on his client device, he recalculates his bloom filter and saves it back into his User document in the MongoDB database. The next query then takes this updated bloom filter, so even if he triggered the next query from another device, he would not get served the same profiles again.
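
For completeness, the recalculation could also happen on the server right after the query instead of on the client. A minimal sketch (my own assumption, not what I run in production; it reuses the MSB-first bit order of the Swift encoder and the field names from the earlier snippets):

    // Sketch: flip the bit for each served profile's "bf" position and re-encode to base64
    function updateBloomBase64(base64Filter, servedPositions) {
        const buf = Buffer.from(base64Filter, 'base64');
        for (const pos of servedPositions) {
            buf[Math.floor(pos / 8)] |= 1 << (7 - (pos % 8)); // same MSB-first order as the Swift encoder
        }
        return buf.toString('base64');
    }

    // e.g. after the find():
    // const served = results.map(p => p.get("bf"));
    // request.user.set("daBlm", updateBloomBase64(request.user.get("daBlm"), served));
    // await request.user.save(null, { sessionToken: token });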

I believe this is a pretty robust and scalable solution. I hope it helps anyone else in the future…

Thank you!
