[translation] A Simple Recommendation Query With Neo4j
Recommending colleagues
Previous article follow-up
The previous exercise, temporarily left to the curiosity of the reader, consisted of finding professional contact recommendations, as specified by the following question:
“find the contacts of my contacts who know (are in contact with) someone I have already worked with (with whom I am NOT already in contact)”
A reminder of the graph:
- users are labeled CONTACT
- companies are labeled COMPANY
- user nodes have a name property (holding both first and last name, for simplicity's sake)
- being in contact with someone is modeled as follows:
(:CONTACT)-[:IN_CONTACT_WITH]-(:CONTACT)
- working for a company is modeled as follows:
(:CONTACT)-[:WORKED_IN]->(:COMPANY)
Now, let’s start solving this problem by decomposing it into smaller sub-tasks.
Finding former colleagues
A first simple query could look like this:
MATCH (me:CONTACT)-[:WORKED_IN]->(:COMPANY)<-[:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name}
RETURN me, colleagues
Not bad, but I don’t want to see suggested colleagues I’m already in contact with.
To avoid this, let’s verify we are not already connected:
MATCH (me:CONTACT)-[:WORKED_IN]->(:COMPANY)<-[:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
RETURN me, colleagues
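To make the logic of this step concrete outside of Neo4j, here is a minimal Python sketch of the same filtering over an in-memory graph. The names (Alice, Bob, Carol, Acme, Initech) and the two data structures are hypothetical and only serve the illustration; this is not how Neo4j evaluates the query.

```python
# Hypothetical data: who worked where, and who is in contact with whom.
WORKED_IN = {
    "Florent": {"Acme"},
    "Alice": {"Acme"},
    "Bob": {"Acme", "Initech"},
    "Carol": {"Initech"},
}
IN_CONTACT_WITH = {("Florent", "Alice")}


def in_contact(a, b):
    """IN_CONTACT_WITH is matched without direction, so check both orders."""
    return (a, b) in IN_CONTACT_WITH or (b, a) in IN_CONTACT_WITH


def former_colleagues(me):
    """People sharing at least one company with `me`, minus existing contacts."""
    mine = WORKED_IN[me]
    return sorted(
        other
        for other, companies in WORKED_IN.items()
        if other != me and mine & companies and not in_contact(me, other)
    )


print(former_colleagues("Florent"))  # ['Bob']
```

Alice shares a company with Florent but is already a contact, so only Bob remains.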
Let’s go even further: we want to include people who worked at a company only if they were there at the same time as you. In other words, a colleague is included if they left after you started (or are still there) and started before you left (if you left).
However, you’ll notice the specs are not complete. After a long and tedious meeting, it has been decided that WORKED_IN will carry 2 new timestamp properties: beginning and end (end is optional: when it is missing, the person is still working there). Isn’t it convenient to have full-fledged relationships? They can have attributes, too!
Going back to our query, two WORKED_IN relationships must be captured, as they respectively hold the dates of your stay and of your colleague’s stay. Let’s also split the MATCH clause, given that it is going to grow slightly, as you can see:
MATCH (me:CONTACT)-[myStay:WORKED_IN]->(company:COMPANY),
      (company)<-[theirStay:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
RETURN me, colleagues
Let’s now restrict the subgraph with the aforementioned overlap constraints:
MATCH (me:CONTACT)-[myStay:WORKED_IN]->(company:COMPANY),
      (company)<-[theirStay:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
  AND myStay.beginning < theirStay.end
  AND theirStay.beginning < myStay.end
RETURN me, colleagues
And let’s make sure that people still working in their company are properly handled (end property won’t be set in that case).
MATCH (me:CONTACT)-[myStay:WORKED_IN]->(company:COMPANY),
      (company)<-[theirStay:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
  AND (theirStay.end IS NULL OR myStay.beginning < theirStay.end)
  AND (myStay.end IS NULL OR theirStay.beginning < myStay.end)
RETURN me, colleagues
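The overlap test is easy to get wrong, so here is a small Python sketch of the same predicate, with None standing in for a missing end property. The function name and the sample values (plain years instead of real timestamps) are assumptions made for readability.

```python
from typing import Optional


def stays_overlap(my_begin: int, my_end: Optional[int],
                  their_begin: int, their_end: Optional[int]) -> bool:
    """Mirror the two conditions of the WHERE clause; None means 'still there'."""
    they_left_after_i_started = their_end is None or my_begin < their_end
    they_started_before_i_left = my_end is None or their_begin < my_end
    return they_left_after_i_started and they_started_before_i_left


# I stayed 2008-2011, they joined in 2010 and are still there.
print(stays_overlap(2008, 2011, 2010, None))  # True
# I stayed 2008-2011, they joined in 2012.
print(stays_overlap(2008, 2011, 2012, None))  # False
# I joined in 2008 and am still there, they stayed 2005-2007.
print(stays_overlap(2008, None, 2005, 2007))  # False
# I joined in 2008 and am still there, they stayed 2005-2009.
print(stays_overlap(2008, None, 2005, 2009))  # True
```

Each missing end simply switches off the comparison it would have taken part in, which is what the two parenthesized OR conditions of the query express.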
Filtering further
As you can see, the query has grown quite a bit, but we’re still not done. Two strategies lie ahead:
- we can insert more and more patterns in MATCH and more filtering in WHERE
- we can chain the previous query with another one
You got it, let’s go with the 2nd option. What’s more, this is the perfect excuse to introduce the WITH clause, which acts like a UNIX pipe: only the identifiers it lists are passed on to the next part of the query.
Not much has to be done to allow query chaining. In our simple case, let’s just replace RETURN with WITH and write the last filtering step:
MATCH (me:CONTACT)-[myStay:WORKED_IN]->(company:COMPANY),
      (company)<-[theirStay:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
  AND (theirStay.end IS NULL OR myStay.beginning < theirStay.end)
  AND (myStay.end IS NULL OR theirStay.beginning < myStay.end)
WITH me, colleagues
WHERE (me)-[:IN_CONTACT_WITH]-(:CONTACT)-[:IN_CONTACT_WITH]-(colleagues)
RETURN me, colleagues
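The pipe analogy can be sketched in Python with two chained generators: the first stage produces (me, colleague) rows, the second one filters them. The data and function names below are hypothetical, and the first stage is reduced to a lookup table so that the sketch stays focused on the chaining.

```python
# Hypothetical contact links, stored in both directions since the
# relationship is matched without direction.
CONTACTS = {
    "Florent": {"Alice"},
    "Alice": {"Florent", "Bob"},
    "Bob": {"Alice"},
    "Carol": set(),
}
# Hypothetical output of the first part of the query: former colleagues
# who are not direct contacts yet.
FORMER_COLLEAGUES = {"Florent": ["Bob", "Carol"]}


def first_stage(me):
    """Plays the role of MATCH ... WHERE ... WITH me, colleagues."""
    for colleague in FORMER_COLLEAGUES.get(me, []):
        yield me, colleague


def second_stage(rows):
    """Plays the role of the WHERE after WITH: keep rows sharing a contact."""
    for me, colleague in rows:
        if CONTACTS[me] & CONTACTS[colleague]:
            yield me, colleague


def recommend(me):
    return list(second_stage(first_stage(me)))


print(recommend("Florent"))  # [('Florent', 'Bob')]
```

Bob is kept because Alice is a contact of both Florent and Bob; Carol shares no contact with Florent and is dropped by the second stage.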
Pretty cool, right?
The cherry on the cake: instead of returning n one-to-one associations, it has been decided that a single one-to-many association should be returned, with the aggregated set of colleagues sorted by name.
A tiny subtlety here: the order of colleagues is absolutely not guaranteed by default. We’re going to use the quite fitting ORDER BY, just after the existing WITH clause, to make sure colleagues are properly ordered (the subsequent filtering won’t alter this invariant anyway).
MATCH (me:CONTACT)-[myStay:WORKED_IN]->(company:COMPANY),
      (company)<-[theirStay:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
  AND (theirStay.end IS NULL OR myStay.beginning < theirStay.end)
  AND (myStay.end IS NULL OR theirStay.beginning < myStay.end)
WITH me, colleagues
ORDER BY colleagues.name
WHERE (me)-[:IN_CONTACT_WITH]-(:CONTACT)-[:IN_CONTACT_WITH]-(colleagues)
RETURN me, colleagues
After sorting, let’s aggregate!
MATCH (me:CONTACT)-[myStay:WORKED_IN]->(company:COMPANY),
      (company)<-[theirStay:WORKED_IN]-(colleagues:CONTACT)
WHERE me.name = {name} AND NOT (me)-[:IN_CONTACT_WITH]-(colleagues)
  AND (theirStay.end IS NULL OR myStay.beginning < theirStay.end)
  AND (myStay.end IS NULL OR theirStay.beginning < myStay.end)
WITH me, colleagues
ORDER BY colleagues.name
WHERE (me)-[:IN_CONTACT_WITH]-(:CONTACT)-[:IN_CONTACT_WITH]-(colleagues)
RETURN me, COLLECT(colleagues)
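As a rough illustration of what sorting followed by aggregation produces, here is a Python sketch that turns n (me, colleague) rows into a single entry per contact. The function name and the sample rows are hypothetical.

```python
def collect_by_me(rows):
    """Sort rows by colleague name, then group them per contact,
    in the spirit of ORDER BY followed by COLLECT."""
    grouped = {}
    for me, colleague in sorted(rows, key=lambda row: row[1]):
        grouped.setdefault(me, []).append(colleague)
    return grouped


rows = [("Florent", "Zoe"), ("Florent", "Bob"), ("Florent", "Marc")]
print(collect_by_me(rows))  # {'Florent': ['Bob', 'Marc', 'Zoe']}
```

Because the rows are sorted before being grouped, each aggregated list comes out already ordered by name.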
And that’s it!
Reasoning about and writing the query like this comes with several benefits:
- the query, split as such, is much more readable
- it is also arguably more maintainable: 1 subquery = 1 responsibility
- last but not least, the result is self-sufficient: no extra context is needed to interpret it (both the original contact and its suggestions are returned)
The grand finale
1.5 years ago or so, Cypher was so tiny. It is like watching a newborn grow: more and more capable and still amazing. It started as a spare-time idea and is now almost Turing-complete :-) I cannot help but be enthusiastic about Cypher: easy on the eyes, a low barrier to entry for newcomers and still incredibly powerful!
Be ready for it: Cypher is soon going to be the #1 way to query data in Neo4j. This is quite logical: we all expect a database to come with a query language.
What’s still missing, maybe, is a protocol for communicating with remote Neo4j instances with less overhead than the standard REST API, as Sébastien Deleuze suggested while we were chatting at Soft-Shake.
Postscript: a test dataset
Open http://console.neo4j.org and try out the final query and its intermediate steps (do not forget to replace {name} with ‘Florent’).