Today we're going to take a quick look at an amusing and terrifying cheat sheet of evil practices for FreeNAS storage and ZFS filesystems.
This is just my personal view, but it's based on solid experience, so you're all invited to tell me about yours.
1) No replication
How often does the customer refuse to allocate the budget for a replication target? Quite often. That's fine until the motherboard of the only FreeNAS server dies or multiple disks fail at once. RAID is not a backup, and it's not a cluster either.
Please always set up replication. You don't need a big server for it; you just need the replicated snapshots to work correctly.
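To make the idea concrete, here is a minimal sketch of what one incremental replication step boils down to: a `zfs send -i` piped into `zfs receive` on the replica over SSH. The dataset, snapshot and host names are made up for illustration, and FreeNAS normally drives this through its own replication tasks, so take it as a picture of the mechanism and not as what the GUI literally runs.

```python
import shlex

def replication_pipeline(dataset, prev_snap, new_snap, target_host, target_dataset):
    """Build the shell pipeline for one incremental ZFS replication step.

    All names passed in are hypothetical examples; nothing is executed here.
    """
    # Send only the blocks that changed between the two snapshots.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # Receive on the replica, rolling the target back to its latest snapshot first (-F).
    recv = ["zfs", "receive", "-F", target_dataset]
    remote = shlex.join(recv)
    return f"{shlex.join(send)} | ssh {shlex.quote(target_host)} {shlex.quote(remote)}"

if __name__ == "__main__":
    print(replication_pipeline("tank/vm", "auto-1", "auto-2",
                               "replica.example.com", "backup/vm"))
```

The function only builds the command string, so you can review it before running anything against a real pool.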
2) Bad replication setup
Getting replication tasks set up properly requires a lot of tuning. Don't just verify once that it works; carefully monitor resource consumption, especially on mission-critical systems. Pay particular attention to limiting the transfer bandwidth, to avoid throughput bottlenecks and iSCSI timeouts.
This is the result of setting the transfer bandwidth too high:
testnas04.betatechnologies.com kernel log messages:
> WARNING: 192.168.8.2 (iqn.2017-01.com.betatechnologies:testxen02): no ping reply (NOP-Out) after 5 seconds; dropping connection
Maybe you're also familiar with these messages:
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 22 29 00 00 08 00
> (10:4:0/1): Tag: 0x0056, type 1
> (10:4:0/1): WRITE(10). CDB: 2a 00 41 d8 47 9b 00 04 00 00
> (10:4:0/1): ctl_datamove: 254 seconds
> (10:4:0/1): Tag: 0x0001, type 1
> ctl_datamove: tag 0x0056 on (10:4:0) aborted
> (10:4:0/1): ctl_process_done: 261 seconds
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 21 e9 00 00 08 00
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 22 29 00 00 08 00
> (10:4:0/1): Tag: 0x0067, type 1
> (10:4:0/1): Tag: 0x0056, type 1
> (10:4:0/1): ctl_datamove: 254 seconds
> ctl_datamove: tag 0x0067 on (10:4:0) aborted
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 4b 11 00 00 08 00
> (10:4:0/1): Tag: 0x0005, type 1
> (10:4:0/1): ctl_datamove: 254 seconds
> ctl_datamove: tag 0x0005 on (10:4:0) aborted
This happens when the server is so overloaded that I/O operations time out and get aborted.
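If you want to keep an eye on these messages, a small script can count them for you. The sketch below assumes the kernel log lines look exactly like the excerpts above; the regular expressions are my own guess at the format and not an official parser, so adjust them to what your system actually prints.

```python
import re
from collections import Counter

# Patterns modelled on the log excerpts above (an assumption, not a spec).
DATAMOVE_RE = re.compile(r"ctl_datamove: (\d+) seconds")
ABORT_RE = re.compile(r"ctl_datamove: tag (0x[0-9a-fA-F]+) on \(([\d:]+)\) aborted")

def summarize_ctl_log(lines):
    """Count stalled and aborted data moves found in kernel log lines."""
    stalls = []
    aborted = Counter()
    for line in lines:
        abort = ABORT_RE.search(line)
        if abort:
            # Group 2 is the path printed in parentheses, e.g. "10:4:0".
            aborted[abort.group(2)] += 1
            continue
        stall = DATAMOVE_RE.search(line)
        if stall:
            stalls.append(int(stall.group(1)))
    return {
        "stalls": len(stalls),
        "max_stall_seconds": max(stalls, default=0),
        "aborted_per_path": dict(aborted),
    }

if __name__ == "__main__":
    sample = [
        "(10:4:0/1): WRITE(10). CDB: 2a 00 28 16 22 29 00 00 08 00",
        "(10:4:0/1): ctl_datamove: 254 seconds",
        "ctl_datamove: tag 0x0056 on (10:4:0) aborted",
    ]
    print(summarize_ctl_log(sample))
```

A count that keeps growing while replication is running is a good hint that the bandwidth limit is still too generous.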
3) Using slow disks while expecting high performance
Have you ever wondered why iXsystems builds all-flash FreeNAS arrays? I think they understood that today's IT infrastructures require low latency and high performance.
Have you ever seen this kernel error message in your FreeNAS output?
> (10:4:0/1): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
What happens when FreeNAS fails the WRITE operation? It keeps track of the failure and requeues the task in the ZIL.
The rock-solid architecture of ZFS makes it easy to overcome this kind of problem, but you should monitor the WRITE failures and solve the root cause by replacing slow disks (or slow RAID layouts, such as RAIDZ2) with better-performing ones.
Pay attention, because the virtual machines do not know they need to wait and assume the writes really failed. This can damage your VM filesystem, as the operating system could remount the filesystem read-only and lose all the pending writes.
On Citrix XenServer hosts, Windows Server virtual machines (with the proper virtualization drivers installed) wait patiently at 100% CPU usage until the disk completes the writes successfully (within a reasonable timeout of 5 minutes). This handles pending writes better.
4) No snapshots
You need explanations? Really?
5) Too many snapshots
You just don't need all these snapshots. It's just STUPID!
Just keep about 100 snapshots per dataset and ensure they're constantly replicated.
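As a rough sketch of that retention rule, the function below picks which snapshots fall outside the newest 100 per dataset. It assumes you feed it (name, creation time) pairs, for example parsed from `zfs list -H -p -t snapshot -o name,creation`; the names in the example are invented, and it only returns a list, it never destroys anything.

```python
from collections import defaultdict

def snapshots_to_prune(snapshots, keep=100):
    """Return snapshot names beyond the newest `keep` per dataset.

    `snapshots` is an iterable of (name, creation_timestamp) pairs,
    where name has the usual "dataset@snapshot" form.
    """
    by_dataset = defaultdict(list)
    for name, created in snapshots:
        dataset, _, _ = name.partition("@")
        by_dataset[dataset].append((created, name))
    doomed = []
    for dataset in sorted(by_dataset):
        newest_first = sorted(by_dataset[dataset], reverse=True)
        # Everything after the first `keep` entries is a pruning candidate.
        doomed.extend(name for _, name in newest_first[keep:])
    return doomed

if __name__ == "__main__":
    snaps = [("tank/vm@a", 1), ("tank/vm@b", 2), ("tank/vm@c", 3), ("tank/db@x", 5)]
    print(snapshots_to_prune(snaps, keep=2))
```

Before pruning anything for real, check that the snapshots you are about to remove are not the ones your replication task still needs as an incremental base.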
6) LAGG abuse and misuse
Instead of abusing LAGG interfaces, buy better network cards.
7) Using cheap network cards
Don't ever use cheap network cards. Use Intel for 1G or Mellanox for 10G. Follow the FreeBSD guidelines.
8) Bad or missing UPS configuration
FreeNAS embeds NUT. You must configure and test it. Even if your datacenter is protected by power generators, you should always set up UPS communication, just in case of unexpected electrical or power failures.
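One simple way to test the UPS communication is to read what NUT reports and check the status flags. The sketch below parses the "key: value" text that the `upsc` client prints and looks for the `OB` (on battery) flag in `ups.status`; the sample text is invented, and how you collect the output from your own UPS is left to you.

```python
def parse_upsc(output):
    """Turn "key: value" lines, as printed by NUT's upsc, into a dict."""
    values = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            values[key.strip()] = value.strip()
    return values

def is_on_battery(values):
    """True when the ups.status flags include OB (running on battery)."""
    return "OB" in values.get("ups.status", "").split()

if __name__ == "__main__":
    # Invented sample, shaped like upsc output.
    sample = "battery.charge: 87\nups.status: OB LB\n"
    status = parse_upsc(sample)
    print(status["battery.charge"], is_on_battery(status))
```

Pull the plug on the UPS during a maintenance window and make sure the status really changes; a configuration that was never tested is not much better than no configuration.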
9) Using RAID-Z for virtualization
You should avoid RAID-Z setups in virtualized environments because of their poor performance. The FreeNAS wizard tries to warn you about this too.
10) Configuring too big zpools
Avoid configuring zpools that are bigger than necessary. Resilvering and performance will degrade A LOT. Keep in mind that ZFS fragmentation keeps growing. You may need to move your data between different zpools from time to time.
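To get a feeling for why size hurts, here is a back-of-the-envelope estimate of resilver time. It assumes the rebuild is limited only by a constant sustained throughput, which is optimistic: fragmentation and production workload will make the real figure worse. The numbers in the example are illustrative assumptions, not measurements.

```python
def resilver_hours(bytes_to_rebuild, sustained_bytes_per_second):
    """Best-case resilver duration in hours at a constant throughput."""
    if sustained_bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return bytes_to_rebuild / sustained_bytes_per_second / 3600

if __name__ == "__main__":
    # Assumed figures: 4 TB and 12 TB to rebuild at a sustained 100 MB/s.
    for size in (4e12, 12e12):
        print(f"{size / 1e12:.0f} TB: {resilver_hours(size, 100e6):.1f} h")
```

With these assumed figures, 4 TB takes about 11 hours and 12 TB about 33 hours in the best case, and the pool runs with reduced redundancy for that whole time.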
11) Misconfigured or unconfigured email reporting system
This should be considered one of the most important things to set up correctly when deploying a FreeNAS system. You'll receive a lot of diagnostic information every day.
12) Using a single IP Address
Always use multiple IP addresses located in different subnets to split the traffic and separate the clients.
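A quick sanity check for this is to verify that no two interfaces ended up in the same subnet. The sketch below uses Python's standard `ipaddress` module; the interface names and addresses are hypothetical examples, not a recommended layout.

```python
import ipaddress
from itertools import combinations

def overlapping_interfaces(interfaces):
    """Return pairs of interface names whose subnets overlap.

    `interfaces` maps a name to an "address/prefix" string.
    """
    nets = {name: ipaddress.ip_interface(cidr).network
            for name, cidr in interfaces.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]

if __name__ == "__main__":
    # Hypothetical plan: igb0 and igb2 share 192.168.8.0/24 by mistake.
    plan = {
        "igb0": "192.168.8.10/24",
        "igb1": "192.168.9.10/24",
        "igb2": "192.168.8.20/24",
    }
    print(overlapping_interfaces(plan))
```

An empty list means every interface sits in its own subnet, which is what you want here.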
13) Using a single VLAN
Same as point 12 but with a security flavour.
14) The missing crash plan
Still did not prepare yours?
What... if...?
15) No documentation
Many sysadmins do not write adequate documentation about their deployments. This will eventually turn into a catastrophe when you're on vacation or unreachable.