Hello,
I am trying to figure out a way to query documents only once. Imagine a situation where a user queries suggested profiles based on multiple criteria; once he has been served about 50 profiles, these should not be included in any later query.
I was able to get indexes working well. For example, out of 50,000 documents a query examines only 300 index keys, reads 40 documents and returns 29 results.
The problem is that running the query again on the next day will, logically, scan and return the same documents. Estimating roughly that a user would be served 50 profiles per day, the “skip list” would grow to about 18,000 objectIds in one year and keep growing.
As false positives are acceptable in this case, I am considering a Bloom filter. But I got stuck on how to actually implement it so that the query still uses indexes and the documents/indexes themselves do not become huge. I came up with two possible scenarios:
1) Bloom filter (array of 32,000 Booleans) stored in the profile document
An array of 32,000 Bools, packed as a bitmap, should add only about 4 kB to the document size, which is not much. But each array field would need to be indexed for a query like .equalTo("bloom.x", false), and this seems unworkable, considering that this Bloom filter array would need to be updated every time the profile gets fetched by any other user.
2) Bloom filter (array of n integers, growing up to 32,000 integers) stored in the user document
Since the user document is already available in the cloud code function, I could load the array of “available Bloom filter positions” as const bloomArray = request.user.get("bloom")
and then use this array of integers in the same cloud code function: query.containedIn("bloomHash", bloomArray). This way the Bloom filter does not need to be indexed and the query should still use an index scan. Or have I misunderstood the query here, and would it still filter the results for containedIn? The obvious disadvantage is that storing an array of 32,000 integers takes some disk space. I am also not sure how performant the query would be once the array grows close to 32,000 elements (see the maintenance sketch below).
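A minimal sketch of how scenario 2 could be maintained, assuming each PrsProfile stores a single hashed position in bf and the user's bloom array starts out holding all 32,000 positions. The bloomPosition helper, the md5 hash choice, and the markServed function are my own assumptions for illustration, not existing Parse API:

const crypto = require("crypto");

// Hash a profile objectId to one position in [0, m).
// Hypothetical helper; hash choice and single-hash design are assumptions.
function bloomPosition(objectId, m = 32000) {
  const digest = crypto.createHash("md5").update(objectId).digest();
  return digest.readUInt32BE(0) % m;
}

// After serving a batch of profiles, remove their positions from the
// user's "bloom" array so later containedIn("bf", bloom) queries skip them.
async function markServed(user, results, sessionToken) {
  for (const profile of results) {
    user.remove("bloom", profile.get("bf"));
  }
  await user.save(null, { sessionToken });
}

Removing a served position also hides any other profile that hashes to the same value, which is exactly the acceptable false positive mentioned above.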
Therefore I was wondering if anyone has experience with storing the Bloom filter as a bitmap (in that user document, not indexed) and then generating the array of integers in cloud code before it gets passed to the query function? A sketch of that conversion follows.
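For illustration, here is that conversion under the assumption that the bitmap is kept as a base64 string in a hypothetical bloomBits field (bit set = position already consumed):

// Decode a 4 kB bitmap into the array of still-available positions
// expected by query.containedIn("bf", ...). "bloomBits" is an assumed field.
function availablePositions(user, m = 32000) {
  const bits = Buffer.from(user.get("bloomBits"), "base64");
  const available = [];
  for (let i = 0; i < m; i++) {
    // bit i lives in byte i >> 3 at bit offset i & 7
    if (((bits[i >> 3] >> (i & 7)) & 1) === 0) {
      available.push(i);
    }
  }
  return available;
}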
Or is there a better way to avoid repeating query results? Any inputs/references/comments are appreciated!
EDIT1:
Investigating the second scenario further, it seems that it indeed uses an index scan, as I successfully skipped all the documents:
→ totalKeysExamined 302
→ totalDocsExamined 0
→ nReturned 0
→ executionTimeMillis 13
But even though it shows 13 ms execution time, I can see a much larger response time when calling the cloud code function from the iOS client, going from approx. 200 ms → 2 s. Although I am happy that this solution successfully skips the document scanning, it seems that the Bloom filter array is a burden for the cloud function:
Parse.Cloud.define("newProfiles", async (request) => {
  const token = request.user.getSessionToken();
  // get requesting user's profile
  const usrProfQuery = new Parse.Query("PrsProfile");
  usrProfQuery.equalTo("objectId", request.user.id);
  const usrProfile = await usrProfQuery.first({ sessionToken: token });
  if (!usrProfile) {
    throw "current user profile not found";
  }
  // basic query criteria common for all specific queries
  const minAge = request.user.get("minA");
  const maxAge = request.user.get("maxA");
  const ori = usrProfile.get("o");
  const partnerOri = usrProfile.get("po");
  // Gender: 0 - woman, 1 - notDefined, 2 - man
  const usrGender = usrProfile.get("gn");
  if (usrGender == 1) {
    throw "current user gender is not defined";
  }
  // max distance
  const maxDistance = request.user.get("maxD");
  const currentLoc = usrProfile.get("gp");
  // search type queries
  const specificQueries = [];
  // ------------ dating specific query ------------
  const usrDate = usrProfile.get("da");
  //const usrOns = usrProfile.get("on");
  if (usrDate != null && maxDistance != null) {
    const dateQuery = new Parse.Query("PrsProfile");
    // distance in km
    dateQuery.withinKilometers("gp", currentLoc, maxDistance, false);
    // basic data
    dateQuery.lessThanOrEqualTo("ag", maxAge);
    dateQuery.greaterThanOrEqualTo("ag", minAge);
    if (partnerOri != null) {
      dateQuery.equalTo("o", partnerOri);
    }
    //dateQuery.equalTo("po", ori);
    dateQuery.containedIn("po", [null, ori]);
    // genders are defined for each specific query separately
    if (usrGender == 0) {
      // search for users looking for notDefined or women
      dateQuery.lessThanOrEqualTo("da", 1);
    } else if (usrGender == 2) {
      // search for users looking for notDefined or men
      dateQuery.greaterThanOrEqualTo("da", 1);
    }
    if (usrDate != 1) {
      // if current user defines a specific gender, add it to the constraints
      dateQuery.equalTo("gn", usrDate);
    }
    if (usrProfile.get("on") == true) {
      dateQuery.equalTo("on", true);
    }
    // bloom filter
    // inverted values - array of "available" integers
    const bloomFilter = request.user.get("bloom");
    dateQuery.containedIn("bf", bloomFilter);
    dateQuery.explain();
    // the query must be pushed into the list before it is read back out below
    specificQueries.push(dateQuery);
    const results = await specificQueries[0].find({ sessionToken: token });
    return results;
  } else {
    throw "not enough inputs to run a query";
  }
});
Any idea how to make it less resource-hungry?
EDIT2:
When I remove .explain() from the query and actually fetch only the results themselves, the execution time observed on the client side goes back down to 220 ms (500 ms non-cached), which seems like a success to me. Now the only remaining question is how much space such an inverted Bloom filter takes (a large array of integers in a document that is read only by the user himself).
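For a rough estimate, assuming MongoDB's BSON encoding where each array element carries a 1-byte type tag, its index as a null-terminated string key, and 4 bytes of int32 payload:

// Back-of-envelope BSON size of an array of n int32 elements.
// Assumption: element = 1 type byte + index-as-string key (cstring) + int32,
// plus 5 bytes of document framing (4-byte length header + trailing null).
function bsonIntArraySize(n) {
  let bytes = 5;
  for (let i = 0; i < n; i++) {
    bytes += 1 + (String(i).length + 1) + 4;
  }
  return bytes;
}

console.log(bsonIntArraySize(32000)); // ~340,895 bytes, i.e. about 340 kB

That would put the full 32,000-element array at roughly 340 kB, well under MongoDB's 16 MB document limit, though it still has to be read from the database on every cloud code call.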