{"id":1350,"date":"2025-06-11T20:51:05","date_gmt":"2025-06-11T20:51:05","guid":{"rendered":"https:\/\/thinkcolorful.org\/?p=1350"},"modified":"2025-06-11T21:10:35","modified_gmt":"2025-06-11T21:10:35","slug":"running-microsofts-1-bit-bitnet-llm-on-my-dell-r7910-a-self-hosting-adventure","status":"publish","type":"post","link":"https:\/\/thinkcolorful.org\/?p=1350","title":{"rendered":"Running Microsoft&#8217;s 1-Bit BitNet LLM on My Dell R7910 &#8211; A Self-Hosting Adventure"},"content":{"rendered":"\n<p>So Microsoft dropped this 1-bit LLM called <a href=\"https:\/\/github.com\/microsoft\/BitNet?tab=readme-ov-file#build-from-source\" data-type=\"link\" data-id=\"https:\/\/github.com\/microsoft\/BitNet?tab=readme-ov-file#build-from-source\">BitNet<\/a>, and I couldn&#8217;t resist trying to get it running on my new homelab server. Spoiler alert: it actually works incredibly well, and now I have a pretty capable AI assistant running entirely on CPU power! For the demo click on the plus button on the bottom right ;)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">My Setup (And Why This Matters)<\/h2>\n\n\n\n<p>I&#8217;m running this on my Dell Precision Rack 7910 &#8211; yeah, it&#8217;s basically a workstation crammed into a rack case, but hey, it works! Here&#8217;s what I&#8217;m working with:<\/p>\n\n\n\n<p><strong>My Dell R7910:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Dual Xeon E5-2690V4 processors (28 cores total)<\/li>\n\n\n\n<li>64GB ECC RAM<\/li>\n\n\n\n<li>Running Proxmox VE<\/li>\n\n\n\n<li>Already hosting Nextcloud, Jellyfin, and WordPress<\/li>\n<\/ul>\n\n\n\n<p>The cool thing about BitNet is that it doesn&#8217;t need fancy GPU hardware. While I&#8217;m running it on dual Xeons, you could probably get away with much less.<\/p>\n\n\n\n<p><strong>Minimum specs you&#8217;d probably want:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Any modern 4+ core CPU<\/li>\n\n\n\n<li>8GB RAM (though 16GB+ is better)<\/li>\n\n\n\n<li>50GB storage space<\/li>\n\n\n\n<li>That&#8217;s literally it &#8211; no GPU required!<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">What the Heck is BitNet Anyway?<\/h2>\n\n\n\n<p>Before we dive in, let me explain why I got excited about this. Most AI models use 32-bit or 16-bit numbers for their &#8220;weights&#8221; (basically the model&#8217;s learned knowledge). BitNet uses just three values: -1, 0, and +1.<\/p>\n\n\n\n<p>Sounds crazy, right? But somehow it works! The 2 billion parameter BitNet model:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses only ~400MB of RAM (my Llama models use 4-8GB+)<\/li>\n\n\n\n<li>Runs 2-6x faster than similar models<\/li>\n\n\n\n<li>Uses way less power<\/li>\n\n\n\n<li>Still gives pretty decent responses<\/li>\n<\/ul>\n\n\n\n<p>I mean, when I first heard &#8220;1-bit AI,&#8221; I thought it would be terrible, but Microsoft&#8217;s research team clearly knew what they were doing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Journey: Setting This Thing Up<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Creating a Container for BitNet<\/h3>\n\n\n\n<p>Since I&#8217;m already running a bunch of services running on Proxmox on my R7910, I decided to give BitNet its own LXC container. 
### Step 2: Getting the Environment Ready

First things first – we need the right tools. BitNet is picky about its build environment:

```bash
# Basic stuff (the clang package ships clang++ too; there's no separate clang++ package)
apt update && apt upgrade -y
apt install -y curl wget git build-essential cmake clang libomp-dev
```

Now here's where I made my first mistake – I tried to use Python's venv initially, but [BitNet's instructions specifically mention conda](https://github.com/microsoft/BitNet?tab=readme-ov-file#build-from-source), and there's a good reason for that. Just install Miniconda:

```bash
cd /tmp
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
```

### Step 3: The BitNet Installation Saga

This is where things got interesting. The GitHub instructions look straightforward, but there are some gotchas:

```bash
mkdir -p /opt/bitnet && cd /opt/bitnet

# This --recursive flag is CRUCIAL - don't skip it!
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# Create the conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
pip install huggingface_hub
```

Now for the fun part – downloading the model and building everything:

```bash
# Download the official Microsoft model
mkdir -p models
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T

# Build the whole thing
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s
```

**Pro tip:** This build step takes a while. On my dual Xeons, it was about 10 minutes of heavy CPU usage. Grab a coffee – it's compiling a ton of optimized C++ code.

### Step 4: Testing My New AI

Once the build finished, I had to try it out:

```bash
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Hello, how are you?" -cnv
```

And it worked! The responses weren't GPT-4 quality, but they were coherent and surprisingly good for something running entirely on CPU. My partner thought I was connected to a service like OpenAI because the responses were so fast and the resource usage was so low ^_^

But I didn't want to just run it in a terminal. I wanted to integrate it with the AnythingLLM instance I already had running.

### Step 5: Making BitNet Play Nice with AnythingLLM

Here's the cool part – BitNet comes with a built-in API server:

```bash
python run_inference_server.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p 8080 --host 0.0.0.0
```
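Before wiring AnythingLLM into this, it's worth sanity-checking the endpoint with curl. I'm assuming the server exposes the llama.cpp-style OpenAI-compatible `/v1/chat/completions` route – that's the shape AnythingLLM's "Generic OpenAI" provider expects:

```bash
# Quick smoke test of the chat endpoint (swap in your container's IP)
curl http://[my_container_ip]:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "BitNet-b1.58-2B-4T",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```

If you get JSON back with a `choices` array, the server side is good to go.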
Then in AnythingLLM, I just added it as a "Generic OpenAI" provider:

- **API Endpoint:** `http://[my_container_ip]:8080`
- **Model Name:** `BitNet-b1.58-2B-4T`
- **Token Context Window:** 4096 (can be adjusted)
- **Max Tokens:** 1024 (can be adjusted)

And boom – I had BitNet responding to queries through AnythingLLM's nice web interface!

### Step 6: Making It Actually Reliable

Running things manually is fun for testing, but I wanted this to be a proper service. So I created a systemd service:

```bash
sudo nano /etc/systemd/system/bitnet.service
```

```ini
[Unit]
Description=BitNet LLM API Server
After=network.target
Wants=network.target

[Service]
Type=simple
User=root
Group=root
WorkingDirectory=/opt/bitnet/BitNet
Environment=PATH=/root/miniconda3/envs/bitnet-cpp/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
ExecStart=/root/miniconda3/envs/bitnet-cpp/bin/python run_inference_server.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p 8080 --host 0.0.0.0
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl daemon-reload
sudo systemctl enable bitnet.service
sudo systemctl start bitnet.service
```

Now BitNet automatically starts when the container boots, and if it crashes, systemd brings it right back up.
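Nothing BitNet-specific is needed to verify the service – the usual systemd commands cover it:

```bash
# Did the last start succeed, and is it running now?
sudo systemctl status bitnet.service

# Follow the server logs live (handy when a request seems to hang)
sudo journalctl -u bitnet.service -f

# Kill the process and watch systemd bring it back after RestartSec
sudo systemctl kill bitnet.service
sleep 15 && sudo systemctl status bitnet.service
```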
## The Results: How Does It Actually Perform?

I've been running this setup for a while now, and I'm honestly impressed. On my R7910:

- **Response time:** Usually 1-3 seconds for short responses when running in the container itself.
  - Through AnythingLLM, with the context window and max tokens settings above, it averaged about 7-8 seconds before it started responding, but then output the rest of the response relatively quickly.
- **Memory usage:** Steady ~400MB, as advertised
- **CPU usage:** Spikes during inference, then drops to almost nothing
- **Quality:** Good enough for basic tasks, coding help, and general questions

It's not going to replace GPT-4 for complex reasoning, but for a lot of everyday AI tasks, it's surprisingly capable. And the fact that it's running entirely on my own hardware with no API calls or subscriptions? That's pretty sweet.

## Demo!!

Click the plus button on the bottom right! Under the hood it's just AnythingLLM's embeddable chat widget (style and config options are documented at https://github.com/Mintplex-Labs/anything-llm/tree/master/embed/README.md):

```html
<script
  data-embed-id="82b9709e-8a8f-40bd-8d12-f94a02c803f4"
  data-base-api-url="https://anythingllm.thinkcolorful.org/api/embed"
  src="https://anythingllm.thinkcolorful.org/embed/anythingllm-chat-widget.min.js">
</script>
```

## Lessons Learned and Gotchas

**Things that tripped me up:**

1. **Forgetting the `--recursive` flag** when cloning – this downloads necessary submodules
2. **Not installing clang** – the error messages weren't super clear about this
3. **Trying to use venv instead of conda** – just follow their instructions!
4. **Container permissions** – those LXC config additions are crucial

**Performance tips:**

- Give it plenty of CPU cores if you can
- 32GB RAM is probably overkill, but it's nice to have headroom
- The `i2_s` quantization seems to be the sweet spot for quality vs. speed

## What's Next?

I'm planning to experiment with:

- Different quantization types (`tl1` vs `i2_s`)
- Running multiple model variants simultaneously (see the sketch after this list)
- Maybe trying some fine-tuning if Microsoft releases tools for that
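That second bullet is mostly a matter of port assignment, since each server process loads its own copy of the weights. Here's a minimal sketch, assuming a second quantization has been built with `setup_env.py` – the `tl1` model path is hypothetical:

```bash
# Instance 1: the i2_s build from earlier, on port 8080
python run_inference_server.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p 8080 --host 0.0.0.0 &

# Instance 2: a hypothetical tl1 build of the same model, on its own port
python run_inference_server.py -m models/BitNet-b1.58-2B-4T/ggml-model-tl1.gguf \
  -p 8081 --host 0.0.0.0 &
```

At ~400MB of weights per instance, RAM isn't the bottleneck – the two processes will mostly be competing for CPU cores.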
The self-hosting AI space is moving fast, and BitNet feels like a real game-changer for those of us who want capable AI without needing a mortgage-sized GPU budget.

## Wrapping Up

Setting up BitNet on my Dell R7910 turned out to be way more straightforward than I expected, once I figured out the few gotchas. If you've got a decent CPU and some spare RAM, I'd definitely recommend giving it a shot.

Having a capable AI assistant running entirely on your own hardware is pretty liberating. No API keys, no usage limits, no privacy concerns about your data leaving your network. Just pure, self-hosted AI goodness.

Plus, there's something satisfying about telling people your AI assistant is running on a 1-bit model that uses less RAM than Chrome with a few tabs open!