▲Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)github.com

237 points by GeorgeCurtis 126 days ago | 112 comments

rohanrao123 126 days ago [-]

Congrats on the launch! I'm one of the authors of that paper you cited, glad it was useful and inspiring to building this :) Let me know if we can support in any way!

GeorgeCurtis 126 days ago [-]

Wow! I enjoyed reading it a lot and it was definitely inspiring for this project!

Would love to talk to you about it and make sure we capture all of the pain points if you're open to it? :)

rohanrao123 126 days ago [-]

Absolutely, will DM you on X!

quantike 125 days ago [-]

I spent a bit of time reading up on the internals and had a question about a small design choice (I am new to DB internals, specifically as they relate to vector DBs).

I notice that in your core vector type (`HVector`), you choose to store the vector data as a `Vec<f64>`. Given what I have seen from most embedding endpoints, they return `f32`s. Is there a particular reason for picking `f64` vs `f32` here? Is the additional precision a way to avoid headaches down the line or is it something I am missing context for?

Really cool project, gonna keep reading the code.

xavcochran 125 days ago [-]

thanks for the question! we chose f64 as a default for now as just to cover all cases and we believed that basic vector operations would not be our bottleneck initially. As we optimize our HNSW implementation, we are going to add support for f32 and binary vectors and drop using Vec<f64/f32> and instead use [f64/f32; {num_dimensions}] to avoid unnecessary heap allocation!

quantike 125 days ago [-]

I appreciate the reply! Yeah that sounds like the correct path forward is swapping out the type for some enum of numeric types you want to cover.

I'd be curious if there's some benefit to the runtime-memory utilization to baking in the precision of the vector if it's known at comptime/runtime. In my own usage of vector DBs I've only ever used a single-precision (f32), and often have a single, known dimension. But if Helix is aiming for something more general purpose, then it makes sense to offer the mixing of precision and dimension in the internals.

Cheers

xavcochran 125 days ago [-]

The benefit of baking in the dimension and size of individual elements (the precision) is the fact that the size will be known at compile time meaning it can be allocated on the stack instead of being heap allocated.

srameshc 126 days ago [-]

I was thinking about intertwining Vector and Graph, because I have one specific usecase that required this combination. But I am not courageos or competent enough to build such a DB. So I am very excited to see this project and I am certainly going to use it. One question is what kind of hardware do you think this would require ? I am asking it because from what I understand Graph database performance is directly proportional to the amount of RAM it has and Vectors also needs persistence and computational resources .

GeorgeCurtis 125 days ago [-]

The fortunate thing about our vector DB, like I mentioned in the post, is that we store the HNSW on disk. So, it is much less intense on your memory. Similar thing to what turbo puffer has done.

With regard to the graph db, we mostly use our laptops to test it and haven't run into an issue with performance yet on any size dataset.

If you wanna chat DM me on X :)

UltraSane 126 days ago [-]

Neo4j supports vector indexes

GeorgeCurtis 125 days ago [-]

Neo4j first of all is very slow for vectors, so if performance is something that matters for your user experience they definitely aren't a viable option. This is probably why Neo4j themselves have released guides on how to build that middleman software I mentioned with Qdrant for viable performance.

Furthermore, the vectors is capped at 4k dimensions which although may be enough most of the time, is a problem for some of the users we've spoken to. Also, they don't allow pre filtering which is a problem for a few people we've spoken to including Zep AI. They are on the right track, but there are a lot of holes that we are hoping to fill :)

Edit: AND, it is super memory intensive. People have had problems using extremely small datasets and have had memory overflows.

mauvo59 125 days ago [-]

Hey, want to correct some of your statements here. :-)

Neo4j's vector index uses Lucene's HNSW implementation. So, the performance of vector search is the same as that of Lucene. It's worth noting that performance suffers when configured without sufficient memory, like all HNSW vector indexes.

>> This is probably why Neo4j themselves have released guides on how to build that middleman software I mentioned with Qdrant for viable performance.

No, this is about supporting our customers. Combining graphs and vectors in a single database is the best solution for many users - integration brings convenience, consistency, and performance. But we also recognise that customers might already have invested in a dedicated vector database, need additional vector search features we don't support, or benefit from separating graph and vector resources. Generally, integrating well with the broader data ecosystem helps people succeed.

>> Furthermore, the vectors is capped at 4k dimensions

We occasionally get asked about support for 8k vectors. But so far, whenever I've followed up with users, there doesn't seem to be a case for them. At ~32kb per embedding, they're often not practical in production. Happy to hear about use cases I've missed.

>> Also, they don't allow pre filtering which is a problem for a few people we've spoken to including Zep AI.

We support pre- and post-filtering. We're currently implementing metadata filtering, which may be what you're referring to.

>> AND, it is super memory intensive.

It's no more memory-intensive than other similar implementations. I get that different approaches have different hardware requirements. But in all cases, a misconfigured system will perform poorly.

acefaceZ 112 days ago [-]

Neo4j performance is horrendous unless you have huge amounts of memory. I would wager that anyone who has used Neo4j for anything related to graphrag or used its vector features knows it’s not a great solution. Anyone can verify this quite easily.

GeorgeCurtis 124 days ago [-]

Thanks for clearing up some of these points. Really pleased the post has reached the industry leaders, and I genuinely appreciate your response :)

With regards to the prefiltering, I was referring to filtering during the neighbor search in the HNSW. If you wanted 10 vectors, but with specific conditions, you'd have to retrieve surplus vectors and then perform the filter, hoping you were left with enough. Does that sound right? I suppose that is metadata filtering.

I should've been more specific about the memory issue. Not tying to slate you here, just that a lot of complaints I've read online were about issues with memory overflows using the vectors. But of course, a misconfigured system would definitely perform poorly :)

Thanks again for the response!

125 days ago [-]

hbcondo714 126 days ago [-]

Congrats! Any chance Helixdb can be run in the browser too, maybe via WASM? I'm looking for a vector db that can be pre-populated on the server and then be searched on the client so user queries (chat) stay on-device for privacy / compliance reasons.

GeorgeCurtis 126 days ago [-]

Interesting, we've had a few people ask about this. So essentially you'd call the server to retrieve the HNSW and then store it in the browser and use WASM to query it?

Currently the road block for that is the LMDB storage engine. We have on our own storage engine on our roadmap, which we want to include WASM support with. If you wanna talk about it reach out to my twitter: https://x.com/georgecurtiss

xavcochran 125 days ago [-]

to add to George's reply, for helix to run on the browser with WASM the storage engine has to be completely in memory. At the moment we use LMDB which uses file based storage so that does't work with the browser. As George said, we plan on making our own storage engine and as part of that we aim to have an in-memory implementation.

hansworst 125 days ago [-]

Not entirely sure if you could use it, but wondering if you’ve heard about the origin private file system feature of modern browsers? https://developer.mozilla.org/en-US/docs/Web/API/File_System...

xavcochran 125 days ago [-]

very interesting, will look into this. I know for a fact that you cannot compile the likes of LMDB and RocksDB to work with WASM but this looks promising for our custom storage engine to be able to make it work with the browser. Thanks for this!

huevosabio 126 days ago [-]

Can I run this as an embedded DB like sqlite?

Can I sidestep the DSL? I want my LLMs to generate queries and using a new language is going to make that hard or expensive.

GeorgeCurtis 126 days ago [-]

Currently you can't run us embedded and I'm not sure how you could sidestep the DSL :/

We're working on putting our grammar in llama's cpp code so that it only outputs grammatically correct HQL. But, even without that it shouldn't be hard or expensive to do. I wrote a Claude wrapper that had our docs in its context window, it did a good job of writing queries most of the time.

126 days ago [-]

tmpfs 126 days ago [-]

This is very interesting, are there any examples of interacting with LLMs? If the queries are compiled and loaded into the database ahead of time the pattern of asking an LLM to generate a query from a natural language request seems difficult because current LLMs aren't going to know your query language yet and compiling each query for each prompt would add unnecessary overhead.

GeorgeCurtis 126 days ago [-]

This is definitely a problem we want to work on fixing quickly. We're currently planning an MCP tool that can traverse the graph and decide for itself at each step where to go to next. As opposed to having to generate actual text written queries.

I mentioned in another comment that you can provide a grammar with constrained decoding to force the LLM to generate tokens that comply with the grammar. This ensures that only valid syntactic constructs are produced.

esafak 126 days ago [-]

How does it compare with https://kuzudb.com/ ?

GeorgeCurtis 126 days ago [-]

Kuzu don't support incremental indexing on the vectors. The vector index is completely separate and decoupled from the graph.

I.e: You have to re-index all of the vectors when you make an update to them.

wontonaroo 124 days ago [-]

Firstly congratulations on your effort.

How does the graph component of your database perform compared to Kuzu? Do you have any benchmarks.

For RAG I've tried Qdrant, Meilisearch, and Kuzu. At the moment I wouldn't consider HelixDB because of HelixQL. Wondering why you didn't use OpenCypher?

At the moment you have this system which is aimed to support AI/LLM systems but by creating HelixQL you do not have an AI coding friendly query language.

With OpenCypher even older cheap models can generate queries. Or maybe some GraphQL layer.

GeorgeCurtis 123 days ago [-]

Thanks for the support :)

We're currently working on benchmarks so nothing exact on Kuzu right now with regards to performance. We've had quite a few requests for benchmark comparisons against different databases, so they should take a good few days. Will return here when they are ready

When we've used Cypher in the past we didn't get on with the methodology of the language that well. A functional approach, like gremlin, suited our minds better. But, Gremlin's syntax is awful (in our opinion), and the amount of boilerplate code you need we felt was unnecessary.

We wanted to make something that was easier to read than Gremlin, like Cypher, but also have functional aspect that just made traversals feel so much more intuitive.

Another note, we're more fond of type-safe languages, and it didn't make much sense to us that out of all the programming languages that exist, query languages were the non-type-safe ones.

We know it's a pain learning a new language, but we really believe that our approach will pave the way for a better development experience and a better paradigm.

Onto the AI stuff, you're right, it isn't ideal (right now). We did make a gpt wrapper that did a pretty good job of writing queries based on a condensed version of our docs, but this isn't ideal. So, the next thing on our road map is a graph traversal MCP tool. Instead of the agent having to generate text written queries, it can use the traversal tools and determine where it should hop to at each step.

We know we're being quite ambitious here, but we think there's a lot we can improve on over existing solutions.

Thanks again :)

youdont 126 days ago [-]

Looks very interesting, but I've seen these kind of multi-paradigm databases like Gel, Helix and Surreal and I'm not sure that any of them quite hit the graph spot.

Does Helix support much of the graph algorithm world? For things like GrapgRAG.

Either way, I'd be all over it if there was a python SDK witch worked with the generated types!

BlooIt 125 days ago [-]

Shameless plug: If you're exploring graph+vector databases, check out https://github.com/Pometry/Raphtory/ — with a full Python SDK and built-in support for most common graph algorithms.

It’s built in Rust with native vector support. The open-source version is in-memory, but the commercial version supports disk-based scaling (we tested it with a 3TB graph on an M1 MacBook + insert all 100x faster than existing GraphDBs).

mbrinman 117 days ago [-]

When are you planning on releasing your commercial version? I couldn't find any information online with regard to pricing, etc.

xavcochran 125 days ago [-]

Looking at your benchmarks you say for inserting 1k edges its around 500,000 ns/iteration. Is this 500,000 ns/per edge insertion or for all 1k of them?

BlooIt 120 days ago [-]

Hello. These benchmarks are a bit outdated, we’re currently updating them this sprint.

The open-source in-memory version loads around 3 million edges/second, while the on-disk version handles does about 2 million edges/second with a WAL batch size of 100, and 3m with no WAL.

GeorgeCurtis 125 days ago [-]

We started as a graph database, so that's definitely the main thing we want to get right and we wan't to prioritise capturing all that functionality.

We have a python SDK already! What do you mean by generated types though?

Onawa 126 days ago [-]

I have been happily using Gel (formerly EdgeDB) for a few projects. I'm curious what you think it is missing in regards to hitting the "graph spot"?

GeorgeCurtis 125 days ago [-]

gel is a relational database, have you been building with it under a graph type philosophy?

iannacl 125 days ago [-]

Looks really interesting. A couple of questions: Can you explain how helix handles writes? What are you using for keys? UUIDs? I'm curious if you've done, or are thinking about, any optimizations here.

Feel free to point me to docs / code if these are lazy questions :)

xavcochran 125 days ago [-]

We utilize some of LMDB's optimizations such as the APPEND put flags. We also make use of LMDB handling duplicates as a one-to-many key instead of duplicating keys. This means we can get all values for one key in one call rather than a call for each duplicate.

For keys we are using UUIDs, but using the v6 timestamped uuids so that they are easily lexicographically ordered at creation time. This means keys inserted into LMDB are inserted using the APPEND flag, meaning LMDB shortcuts to the rightmost leaf in its B-Tree (rather than starting at the root) and appends the new record. It can do this because the records are ordered by creation time meaning each new record is guaranteed to be larger (in terms of big-endian byte order) than the previous record.

We also store the UUIDs as u128 values for two reasons. The first is that a u128 takes up 16 bytes where as a string UUID takes up 36 bytes. This means we store 56% less data and LMDB has to decode 56% less bytes when doing code accesses.

For the outgoing/incoming edges for nodes, we store them as fixed sizes which means LMDB packs them in, removing the 8 byte header per Key-Value pair.

In the future, we are also going to separate the properties from the stored value as empty property objects still take up 8 bytes of space. We will also make it so nothing is inserted if the properties are empty.

You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...

Attummm 126 days ago [-]

It sounds very intriguing indeed. However, the README makes some claims. Are there any benchmarks to support them?

> Built for performance we're currently 1000x faster than Neo4j, 100x faster than TigerGraph

GeorgeCurtis 126 days ago [-]

Those were actual benchmarks that we run, we didn't get a chance to write them out properly before posting. I'll get on it now and notify by replying to this comment when they're on the readme :)

sitkack 126 days ago [-]

Excellent work. Very exited to test this out. What are the limits or gotchas we should be aware of, or how do you want it pushed?

What other papers did you get inspiration from?

xavcochran 125 days ago [-]

Thanks for the kind words! At the moment the query language transpilation is quite unstable but we are in the process of a large remodel which we aim to finish in the next day or so. This will make the query language compilation far more robust, and will return helpful error messages (like the rust compiler). The other thing is the core traversals are currently single threaded, so aggregating huge lists of graph items can take a bit of a hit. Note however, that we are also implementing parallel LMDB iterators with the help of the meilisearch guys to make aggregation of large results much faster.

ckugblenu 125 days ago [-]

Very interesting project. Would be curious of a comparison with memgraph. Will definitely give it to try for my knowledge graph use case.

GeorgeCurtis 125 days ago [-]

I'll add memgraph to our benchmarking list! Make sure you join our discord. would love to help in any way we can and hear about any issues you run in to

iamdanieljohns 125 days ago [-]

How does it compare to SurrealDB and ChromaDB?

GeorgeCurtis 125 days ago [-]

We're working on precise benchmarks, but we are much faster than surreal is right now. Chroma is a standalone vector DB so harder to compare exactly, but for vectors we're on par with them for insertions and reads.

Again, working on benchmarks so will put them here when we're done :)

dietr1ch 126 days ago [-]

Graph DB OOMing 101. Can it do Erdős/Bacon numbers?

Graph DBs have been plagued with exploding complexity of queries as doing things like allowing recursion or counting paths isn't as trivial as it may sound. Do you have benchmarks and comparisons against other engines and query languages?

GeorgeCurtis 125 days ago [-]

No, we are in the process of writing up some proper benchmarks. Our first user used us to build MuskMap and TrumpMap, which went viral on twitter. Not sure how it compared to other graph DBs at the time (bear in mind this was v1 and very bear bones), but it got latency of using Postgres >5s down to 50ms with us.

grounder 120 days ago [-]

What are MuskMap and TrumpMap (I'm kind of afraid to ask), and can you link to more info about how they used your database?

GeorgeCurtis 116 days ago [-]

They were two viral web apps that blew up on twitter. They had approx 25,000 users at their peak.

Originally they were built on Postgres, so we helped move over to us. Their graph had about 50,000 user nodes and 25 million edges (follower connections). This then made it a lot more optimised to handle the highly interconnected users to find shortest paths between one user and Elon Must / Donald Trump.

So to sum it up, they stored clones of all the users and how they were interconnected by follower relationships, and then used our query language to super easily calculate the shortest paths.

bogzz 125 days ago [-]

This is very cool, and right up my alley. Hesitant to try it out because of the bespoke query language for now.

I wonder if you'd like to share your thoughts on GQL becoming an ISO standard? Also, have you looked into how Neptune Analytics handles vector embeddings?

GeorgeCurtis 125 days ago [-]

Thanks for the kind words:)

I like what GQL has done. We've definitely taken inspiration from their methodologies in building our own language.

no1youknowz 125 days ago [-]

> In-house graph-vector storage engine (to replace LMDB)

Not sure if it's possible. But why not use fjall, if it is? [0]

[0]: https://github.com/fjall-rs/fjall/

GeorgeCurtis 125 days ago [-]

We went with LMDB because it was a lot faster. But will definitely look over this before we work on our own engine

J_Shelby_J 126 days ago [-]

How do you think about building the graph relationships? Any special approaches you use?

GeorgeCurtis 126 days ago [-]

Pretty much the same way you would with any graph DB, with the added benefit of being able to treat a vector as a node by creating those explicit relationships between them.

Does that answer your question properly?

carlhjerpe 126 days ago [-]

Nice "I'll have this name" when there's already the helix editor :)

GeorgeCurtis 126 days ago [-]

First I'm hearing from it. The Beatles must've been super pissed when Apple took their name :(

carlhjerpe 126 days ago [-]

https://crates.io/search?q=Helix

I'm surprised none in the team searched crates.io once before picking the name. Good luck!

itishappy 126 days ago [-]

I don't think `helix-editor` is even on crates.io, just placeholders.

https://github.com/helix-editor/helix/discussions/7038

That being said, when I saw `helix-db` I was thrown too. "What's a text editor doing writing a vector-graph database, I thought they were working on plugins?"

GeorgeCurtis 126 days ago [-]

we just started off as a side project and thought the name fitted well. With the strands, graph type structure, connections...

We didn't think of getting people to use it until we found it was solving a real pain point for people, so weren't worried about trademarks or names. There was no other helix db so that was good enough for us at the time.

tavianator 126 days ago [-]

> There was no other helix db

https://en.wikipedia.org/wiki/Helix_(database)

GeorgeCurtis 126 days ago [-]

There was no active one. We saw this and thought it would be a nice nod to history. We've actually spoken to some developers at apple who thought this was really neat :)

carlhjerpe 126 days ago [-]

It's not the end of the world, just me being a bit grumpy. I mean it when I say good luck! :)

GeorgeCurtis 126 days ago [-]

Thank you :)

bbatsell 126 days ago [-]

I can't tell if this is droll sarcasm, but just in case not...

https://en.wikipedia.org/wiki/Apple_Corps_v_Apple_Computer

cormullion 126 days ago [-]

perhaps it’s a homage to the famous Helix database (see Wikipedia)

GeorgeCurtis 126 days ago [-]

well noted

SchwKatze 126 days ago [-]

Super cool!!! I'll try it this week and go back to give a feedback.

GeorgeCurtis 126 days ago [-]

I look forward to it :)

rationably 125 days ago [-]

The fact that it's "backed by NVIDIA" and licensed under AGPL-3.0 makes me wonder about the cost(s) of using it in production.

Could you share any information on the pricing model?

GeorgeCurtis 125 days ago [-]

We are open-source, so you can use and self host us for free. Our plan is to create a managed service (so long as all goes well) which shouldn't be priced any differently from other databases in the space.

We chose AGPL to make sure someone can't make a cloud hosted version of our product, think MongoDB on AWS a few years back.

rationably 124 days ago [-]

I can use it for personal needs, sure. Bringing AGPL in a closed-source project is a no go for obvious reasons.

GeorgeCurtis 124 days ago [-]

We built this to help people make software that was previously harder to make. If people want to build software and share it with everyone, they are welcome to do that for free, and if someone wants to close-source their project to make lots of money, then they can support the community they rely on by paying a license. :)

javierluraschi 126 days ago [-]

What is the max number of dimensions supported for a vector?

GeorgeCurtis 126 days ago [-]

There is currently no cap. We will probably impose a similar cap to Qdrant or Pinecone some time soon ~64k. There's obviously a performance trade off as you go up, but we hope to massively offset this by doing binary quantisation within the next couple of months.

xavcochran 125 days ago [-]

there is also the fact that the more dimensions you have for embedded data the more diluted the embedding becomes so it is unusual to go anywhere near the limits of vector length!

anonymousDan 125 days ago [-]

What would be a typical/recommended server setup for using this for RAG? Would you typically have a separate server for the GPUs and the DB itself?

xavcochran 125 days ago [-]

Assuming you are using GPUs for model inference, the best way to set it up would have the DB and a separate server to send inference requests. Note that we plan on support custom model endpoints and on the database side so you probably won't need the inference server in the future!

michaelsbradley 125 days ago [-]

Can you do a compare/contrast with CozoDB?

https://github.com/cozodb/cozo

xavcochran 125 days ago [-]

apart from the fact Cozo seems to be pretty dead, we use a different storage engine which makes our reads much faster. based on their benchmarks I estimate our most of our reads to be 10x faster. I think our query language is much simpler, and easy to understand than Datalog which is what they use.

riku_iki 126 days ago [-]

How scalable is your DB in your tests? Could it be performent on graphs with 1B/10B/100B connections?

GeorgeCurtis 125 days ago [-]

So far, we've tested it for up to ~10B connections and 50 odd million nodes. We didn't run in to any problems with it yet.

brene 125 days ago [-]

How does this scale horizontally across multiple regions. Is this something on your roadmap?

GeorgeCurtis 125 days ago [-]

It’s definitely on our roadmap, but not a priority because no one using us needs it. Is this something that would be useful to you?

wiradikusuma 125 days ago [-]

"faster than Neo4j" How does it compare to Dgraph?

GeorgeCurtis 125 days ago [-]

We don't have any benchmarks against them but from what I've just read about there bench marks, we should be just as good as them.

That is just heresy though, am interested myself now and will run some proper benchmarks

elpalek 126 days ago [-]

What method/model are you using for sparse search?

GeorgeCurtis 126 days ago [-]

We're going to use BM25. Currently it is just dense search. Coming very soon

elpalek 126 days ago [-]

have you thought about SPALDE models? ex: https://arxiv.org/abs/2109.10086

GeorgeCurtis 126 days ago [-]

Looks really interesting, I'll have a proper read. What would be your reasoning to incorporate this if we already have vector functionality and semantic search?

elpalek 126 days ago [-]

my project deals w/ non-english text, bm25 performance is middeling. Language specific sparse model helps.

xavcochran 125 days ago [-]

We will definitely look into it. The SPLADE models look promising!

xavcochran 125 days ago [-]

SPALDE*

basonjourne 126 days ago [-]

why not surrealdb?

GeorgeCurtis 126 days ago [-]

General consensus is it's really slow, I like the concept of surreal though. Our first, and extremely bare bones, version of the graph db was 1-2 orders of magnitude faster than surreal (we haven't run benchmarks against surreal recently, but I'll put them here when we're done)

datastorydesign 125 days ago [-]

Hey George, Alexander from SurrealDB here.

Congratulations on the launch! This is a very exciting space, and it's great to see your take on it.

Running fair benchmarks, not benchmarketing, is a significant effort and we recently put in this effort to make things as fair and transparent as possible across a range of databases.

You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...

We'd be very interested in seeing the benchmarks you'd run and how we compare :)

You can sacrifice many things for faster performance, such as security, consistency levels or referential integrity.

I'm genuinely curious to learn what design decisions you will make as you continue building the database. There are so many options, each with its pros and cons.

If you would like to have a chat where we can exchange ideas, happy to do that :)

riku_iki 123 days ago [-]

> You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...

page says your benchmark runs on 5M of records only. Is it incredibly small dataset in current world, and is it more micro-benchmarking?

Also, count(*) query having 5s latency on 5m records is very underwhelming if I understand your tables correctly.

GeorgeCurtis 125 days ago [-]

I've only just seen this! Thanks so much for the response, would definitely love to chat with you guys.

You're definitely right btw, those weren't concrete benchmarks and I'm excited to see how we compare now :)

mattturck 125 days ago [-]

I don't think that's all the general consensus

GeorgeCurtis 125 days ago [-]

I was just going off of what we've heard from users and people in the space, I admit that "general consensus might've been the wrong terminology.

lennertjansen 126 days ago [-]

how did you get it 3 OOMs faster than neo4j?

GeorgeCurtis 126 days ago [-]

Partly because they're working with a monolith that I imagine is difficult to iterate on and it's written in Java. We've had the benefit of working on this in Rust which lets us get really nitty and gritty with different optimisations.

My friend who I worked on this with is putting together a technical blog on those graph optimisations so I'll link it here when he's done

xpe 126 days ago [-]

On comparable benchmarks with comparable guarantees? Comparable persistence levels? I’m very skeptical.

GeorgeCurtis 125 days ago [-]

Looking forward to putting you at ease :) Working on some proper benchmarks over the next few days.

126 days ago [-]

raufakdemir 126 days ago [-]

How can I migrate neo4j to this?

GeorgeCurtis 126 days ago [-]

We can build an ingestion engine for you :)

We've built SQL and PGVector ones already, just waiting for someone who could make use of other ones before we build them.

Let us know! Twitter in my bio

lleymrl651 125 days ago [-]

Congrats on the launch!

xavcochran 125 days ago [-]

thank you! any feedback would be much appreciated

sync 126 days ago [-]

Looks nice! Are you looking to compete with https://www.falkordb.com or do something a bit different?

GeorgeCurtis 126 days ago [-]

Pretty much, our biggest focus is on Graph and Hybrid RAG. They seem to have really honed in on Graph RAG since the last time I checked their website.

One of the problems I know people experience with them is that they're super slow at bulk reading.

Oh also, they aren't built in Rust haha

dandan7 121 days ago [-]

[dead]

mdaniel 126 days ago [-]

> so much easier that it’s worth a bit of a learning curve

I think you misspelled "vendor lock in"

GeorgeCurtis 126 days ago [-]

You can literally use us for free haha. There's not a language that properly encapsulates graph and vector functionality, so we needed to make our own. Also, we thought it was dumb that query languages weren't type-safe... So we changed that

Loading comments...

rohanrao123 126 days ago [-]

Congrats on the launch! I'm one of the authors of that paper you cited, glad it was useful and inspiring to building this :) Let me know if we can support in any way!

GeorgeCurtis 126 days ago [-]

Wow! I enjoyed reading it a lot and it was definitely inspiring for this project!

Would love to talk to you about it and make sure we capture all of the pain points if you're open to it? :)

rohanrao123 126 days ago [-]

Absolutely, will DM you on X!

quantike 125 days ago [-]

I spent a bit of time reading up on the internals and had a question about a small design choice (I am new to DB internals, specifically as they relate to vector DBs).

Really cool project, gonna keep reading the code.

xavcochran 125 days ago [-]

quantike 125 days ago [-]

I appreciate the reply! Yeah that sounds like the correct path forward is swapping out the type for some enum of numeric types you want to cover.

Cheers

xavcochran 125 days ago [-]

srameshc 126 days ago [-]

GeorgeCurtis 125 days ago [-]

The fortunate thing about our vector DB, like I mentioned in the post, is that we store the HNSW on disk. So, it is much less intense on your memory. Similar thing to what turbo puffer has done.

With regard to the graph db, we mostly use our laptops to test it and haven't run into an issue with performance yet on any size dataset.

If you wanna chat DM me on X :)

UltraSane 126 days ago [-]

Neo4j supports vector indexes

GeorgeCurtis 125 days ago [-]

Edit: AND, it is super memory intensive. People have had problems using extremely small datasets and have had memory overflows.

mauvo59 125 days ago [-]

Hey, want to correct some of your statements here. :-)

>> This is probably why Neo4j themselves have released guides on how to build that middleman software I mentioned with Qdrant for viable performance.

>> Furthermore, the vectors is capped at 4k dimensions

>> Also, they don't allow pre filtering which is a problem for a few people we've spoken to including Zep AI.

We support pre- and post-filtering. We're currently implementing metadata filtering, which may be what you're referring to.

>> AND, it is super memory intensive.

It's no more memory-intensive than other similar implementations. I get that different approaches have different hardware requirements. But in all cases, a misconfigured system will perform poorly.

acefaceZ 112 days ago [-]

GeorgeCurtis 124 days ago [-]

Thanks for clearing up some of these points. Really pleased the post has reached the industry leaders, and I genuinely appreciate your response :)

Thanks again for the response!

125 days ago [-]

hbcondo714 126 days ago [-]

GeorgeCurtis 126 days ago [-]

Interesting, we've had a few people ask about this. So essentially you'd call the server to retrieve the HNSW and then store it in the browser and use WASM to query it?

xavcochran 125 days ago [-]

hansworst 125 days ago [-]

Not entirely sure if you could use it, but wondering if you’ve heard about the origin private file system feature of modern browsers? https://developer.mozilla.org/en-US/docs/Web/API/File_System...

xavcochran 125 days ago [-]

huevosabio 126 days ago [-]

Can I run this as an embedded DB like sqlite?

Can I sidestep the DSL? I want my LLMs to generate queries and using a new language is going to make that hard or expensive.

GeorgeCurtis 126 days ago [-]

Currently you can't run us embedded and I'm not sure how you could sidestep the DSL :/

126 days ago [-]

tmpfs 126 days ago [-]

GeorgeCurtis 126 days ago [-]

esafak 126 days ago [-]

How does it compare with https://kuzudb.com/ ?

GeorgeCurtis 126 days ago [-]

Kuzu don't support incremental indexing on the vectors. The vector index is completely separate and decoupled from the graph.

I.e: You have to re-index all of the vectors when you make an update to them.

wontonaroo 124 days ago [-]

Firstly congratulations on your effort.

How does the graph component of your database perform compared to Kuzu? Do you have any benchmarks.

For RAG I've tried Qdrant, Meilisearch, and Kuzu. At the moment I wouldn't consider HelixDB because of HelixQL. Wondering why you didn't use OpenCypher?

At the moment you have this system which is aimed to support AI/LLM systems but by creating HelixQL you do not have an AI coding friendly query language.

With OpenCypher even older cheap models can generate queries. Or maybe some GraphQL layer.

GeorgeCurtis 123 days ago [-]

Thanks for the support :)

We wanted to make something that was easier to read than Gremlin, like Cypher, but also have functional aspect that just made traversals feel so much more intuitive.

Another note, we're more fond of type-safe languages, and it didn't make much sense to us that out of all the programming languages that exist, query languages were the non-type-safe ones.

We know it's a pain learning a new language, but we really believe that our approach will pave the way for a better development experience and a better paradigm.

We know we're being quite ambitious here, but we think there's a lot we can improve on over existing solutions.

Thanks again :)

youdont 126 days ago [-]

Looks very interesting, but I've seen these kind of multi-paradigm databases like Gel, Helix and Surreal and I'm not sure that any of them quite hit the graph spot.

Does Helix support much of the graph algorithm world? For things like GrapgRAG.

Either way, I'd be all over it if there was a python SDK witch worked with the generated types!

BlooIt 125 days ago [-]

Shameless plug: If you're exploring graph+vector databases, check out https://github.com/Pometry/Raphtory/ — with a full Python SDK and built-in support for most common graph algorithms.

mbrinman 117 days ago [-]

When are you planning on releasing your commercial version? I couldn't find any information online with regard to pricing, etc.

xavcochran 125 days ago [-]

Looking at your benchmarks you say for inserting 1k edges its around 500,000 ns/iteration. Is this 500,000 ns/per edge insertion or for all 1k of them?

BlooIt 120 days ago [-]

Hello. These benchmarks are a bit outdated, we’re currently updating them this sprint.

The open-source in-memory version loads around 3 million edges/second, while the on-disk version handles does about 2 million edges/second with a WAL batch size of 100, and 3m with no WAL.

GeorgeCurtis 125 days ago [-]

We started as a graph database, so that's definitely the main thing we want to get right and we wan't to prioritise capturing all that functionality.

We have a python SDK already! What do you mean by generated types though?

Onawa 126 days ago [-]

I have been happily using Gel (formerly EdgeDB) for a few projects. I'm curious what you think it is missing in regards to hitting the "graph spot"?

GeorgeCurtis 125 days ago [-]

gel is a relational database, have you been building with it under a graph type philosophy?

iannacl 125 days ago [-]

Feel free to point me to docs / code if these are lazy questions :)

xavcochran 125 days ago [-]

For the outgoing/incoming edges for nodes, we store them as fixed sizes which means LMDB packs them in, removing the 8 byte header per Key-Value pair.

You can see most of this in action in the storage core file: https://github.com/HelixDB/helix-db/blob/main/helixdb/src/he...

Attummm 126 days ago [-]

It sounds very intriguing indeed. However, the README makes some claims. Are there any benchmarks to support them?

> Built for performance we're currently 1000x faster than Neo4j, 100x faster than TigerGraph

GeorgeCurtis 126 days ago [-]

Those were actual benchmarks that we run, we didn't get a chance to write them out properly before posting. I'll get on it now and notify by replying to this comment when they're on the readme :)

sitkack 126 days ago [-]

Excellent work. Very exited to test this out. What are the limits or gotchas we should be aware of, or how do you want it pushed?

What other papers did you get inspiration from?

xavcochran 125 days ago [-]

ckugblenu 125 days ago [-]

Very interesting project. Would be curious of a comparison with memgraph. Will definitely give it to try for my knowledge graph use case.

GeorgeCurtis 125 days ago [-]

I'll add memgraph to our benchmarking list! Make sure you join our discord. would love to help in any way we can and hear about any issues you run in to

iamdanieljohns 125 days ago [-]

How does it compare to SurrealDB and ChromaDB?

GeorgeCurtis 125 days ago [-]

Again, working on benchmarks so will put them here when we're done :)

dietr1ch 126 days ago [-]

Graph DB OOMing 101. Can it do Erdős/Bacon numbers?

GeorgeCurtis 125 days ago [-]

grounder 120 days ago [-]

What are MuskMap and TrumpMap (I'm kind of afraid to ask), and can you link to more info about how they used your database?

GeorgeCurtis 116 days ago [-]

They were two viral web apps that blew up on twitter. They had approx 25,000 users at their peak.

So to sum it up, they stored clones of all the users and how they were interconnected by follower relationships, and then used our query language to super easily calculate the shortest paths.

bogzz 125 days ago [-]

This is very cool, and right up my alley. Hesitant to try it out because of the bespoke query language for now.

I wonder if you'd like to share your thoughts on GQL becoming an ISO standard? Also, have you looked into how Neptune Analytics handles vector embeddings?

GeorgeCurtis 125 days ago [-]

Thanks for the kind words:)

I like what GQL has done. We've definitely taken inspiration from their methodologies in building our own language.

no1youknowz 125 days ago [-]

> In-house graph-vector storage engine (to replace LMDB)

Not sure if it's possible. But why not use fjall, if it is? [0]

[0]: https://github.com/fjall-rs/fjall/

GeorgeCurtis 125 days ago [-]

We went with LMDB because it was a lot faster. But will definitely look over this before we work on our own engine

J_Shelby_J 126 days ago [-]

How do you think about building the graph relationships? Any special approaches you use?

GeorgeCurtis 126 days ago [-]

Pretty much the same way you would with any graph DB, with the added benefit of being able to treat a vector as a node by creating those explicit relationships between them.

Does that answer your question properly?

carlhjerpe 126 days ago [-]

Nice "I'll have this name" when there's already the helix editor :)

GeorgeCurtis 126 days ago [-]

First I'm hearing from it. The Beatles must've been super pissed when Apple took their name :(

carlhjerpe 126 days ago [-]

https://crates.io/search?q=Helix

I'm surprised none in the team searched crates.io once before picking the name. Good luck!

itishappy 126 days ago [-]

I don't think `helix-editor` is even on crates.io, just placeholders.

https://github.com/helix-editor/helix/discussions/7038

That being said, when I saw `helix-db` I was thrown too. "What's a text editor doing writing a vector-graph database, I thought they were working on plugins?"

GeorgeCurtis 126 days ago [-]

we just started off as a side project and thought the name fitted well. With the strands, graph type structure, connections...

tavianator 126 days ago [-]

> There was no other helix db

https://en.wikipedia.org/wiki/Helix_(database)

GeorgeCurtis 126 days ago [-]

There was no active one. We saw this and thought it would be a nice nod to history. We've actually spoken to some developers at apple who thought this was really neat :)

carlhjerpe 126 days ago [-]

It's not the end of the world, just me being a bit grumpy. I mean it when I say good luck! :)

GeorgeCurtis 126 days ago [-]

Thank you :)

bbatsell 126 days ago [-]

I can't tell if this is droll sarcasm, but just in case not...

https://en.wikipedia.org/wiki/Apple_Corps_v_Apple_Computer

cormullion 126 days ago [-]

perhaps it’s a homage to the famous Helix database (see Wikipedia)

GeorgeCurtis 126 days ago [-]

well noted

SchwKatze 126 days ago [-]

Super cool!!! I'll try it this week and go back to give a feedback.

GeorgeCurtis 126 days ago [-]

I look forward to it :)

rationably 125 days ago [-]

The fact that it's "backed by NVIDIA" and licensed under AGPL-3.0 makes me wonder about the cost(s) of using it in production.

Could you share any information on the pricing model?

GeorgeCurtis 125 days ago [-]

We chose AGPL to make sure someone can't make a cloud hosted version of our product, think MongoDB on AWS a few years back.

rationably 124 days ago [-]

I can use it for personal needs, sure. Bringing AGPL in a closed-source project is a no go for obvious reasons.

GeorgeCurtis 124 days ago [-]

javierluraschi 126 days ago [-]

What is the max number of dimensions supported for a vector?

GeorgeCurtis 126 days ago [-]

xavcochran 125 days ago [-]

there is also the fact that the more dimensions you have for embedded data the more diluted the embedding becomes so it is unusual to go anywhere near the limits of vector length!

anonymousDan 125 days ago [-]

What would be a typical/recommended server setup for using this for RAG? Would you typically have a separate server for the GPUs and the DB itself?

xavcochran 125 days ago [-]

michaelsbradley 125 days ago [-]

Can you do a compare/contrast with CozoDB?

https://github.com/cozodb/cozo

xavcochran 125 days ago [-]

riku_iki 126 days ago [-]

How scalable is your DB in your tests? Could it be performent on graphs with 1B/10B/100B connections?

GeorgeCurtis 125 days ago [-]

So far, we've tested it for up to ~10B connections and 50 odd million nodes. We didn't run in to any problems with it yet.

brene 125 days ago [-]

How does this scale horizontally across multiple regions. Is this something on your roadmap?

GeorgeCurtis 125 days ago [-]

It’s definitely on our roadmap, but not a priority because no one using us needs it. Is this something that would be useful to you?

wiradikusuma 125 days ago [-]

"faster than Neo4j" How does it compare to Dgraph?

GeorgeCurtis 125 days ago [-]

We don't have any benchmarks against them but from what I've just read about there bench marks, we should be just as good as them.

That is just heresy though, am interested myself now and will run some proper benchmarks

elpalek 126 days ago [-]

What method/model are you using for sparse search?

GeorgeCurtis 126 days ago [-]

We're going to use BM25. Currently it is just dense search. Coming very soon

elpalek 126 days ago [-]

have you thought about SPALDE models? ex: https://arxiv.org/abs/2109.10086

GeorgeCurtis 126 days ago [-]

Looks really interesting, I'll have a proper read. What would be your reasoning to incorporate this if we already have vector functionality and semantic search?

elpalek 126 days ago [-]

my project deals w/ non-english text, bm25 performance is middeling. Language specific sparse model helps.

xavcochran 125 days ago [-]

We will definitely look into it. The SPLADE models look promising!

xavcochran 125 days ago [-]

SPALDE*

basonjourne 126 days ago [-]

why not surrealdb?

GeorgeCurtis 126 days ago [-]

datastorydesign 125 days ago [-]

Hey George, Alexander from SurrealDB here.

Congratulations on the launch! This is a very exciting space, and it's great to see your take on it.

Running fair benchmarks, not benchmarketing, is a significant effort and we recently put in this effort to make things as fair and transparent as possible across a range of databases.

You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...

We'd be very interested in seeing the benchmarks you'd run and how we compare :)

You can sacrifice many things for faster performance, such as security, consistency levels or referential integrity.

I'm genuinely curious to learn what design decisions you will make as you continue building the database. There are so many options, each with its pros and cons.

If you would like to have a chat where we can exchange ideas, happy to do that :)

riku_iki 123 days ago [-]

> You can see the results and links to our code in the write-up here: https://surrealdb.com/blog/beginning-our-benchmarking-journe...

page says your benchmark runs on 5M of records only. Is it incredibly small dataset in current world, and is it more micro-benchmarking?

Also, count(*) query having 5s latency on 5m records is very underwhelming if I understand your tables correctly.

GeorgeCurtis 125 days ago [-]

I've only just seen this! Thanks so much for the response, would definitely love to chat with you guys.

You're definitely right btw, those weren't concrete benchmarks and I'm excited to see how we compare now :)

mattturck 125 days ago [-]

I don't think that's all the general consensus

GeorgeCurtis 125 days ago [-]

I was just going off of what we've heard from users and people in the space, I admit that "general consensus might've been the wrong terminology.

lennertjansen 126 days ago [-]

how did you get it 3 OOMs faster than neo4j?

GeorgeCurtis 126 days ago [-]

My friend who I worked on this with is putting together a technical blog on those graph optimisations so I'll link it here when he's done

xpe 126 days ago [-]

On comparable benchmarks with comparable guarantees? Comparable persistence levels? I’m very skeptical.

GeorgeCurtis 125 days ago [-]

Looking forward to putting you at ease :) Working on some proper benchmarks over the next few days.

126 days ago [-]

raufakdemir 126 days ago [-]

How can I migrate neo4j to this?

GeorgeCurtis 126 days ago [-]

We can build an ingestion engine for you :)

We've built SQL and PGVector ones already, just waiting for someone who could make use of other ones before we build them.

Let us know! Twitter in my bio

lleymrl651 125 days ago [-]

Congrats on the launch!

xavcochran 125 days ago [-]

thank you! any feedback would be much appreciated

sync 126 days ago [-]

Looks nice! Are you looking to compete with https://www.falkordb.com or do something a bit different?

GeorgeCurtis 126 days ago [-]

Pretty much, our biggest focus is on Graph and Hybrid RAG. They seem to have really honed in on Graph RAG since the last time I checked their website.

One of the problems I know people experience with them is that they're super slow at bulk reading.

Oh also, they aren't built in Rust haha

dandan7 121 days ago [-]

[dead]

mdaniel 126 days ago [-]

> so much easier that it’s worth a bit of a learning curve

I think you misspelled "vendor lock in"

GeorgeCurtis 126 days ago [-]