Today we're going to quickly run through an amusing and terrifying cheatsheet of evil practices around FreeNAS storage and ZFS filesystems.

This is just my personal view, but it's grounded in hard-won experience, so you're all invited to tell me yours.

 

 

1) No replication

How many times does the customer not allocate the budget to build a replication target? Quite often. That's fine until the motherboard of the only FreeNAS server dies or multiple disks fail at once. RAID is neither a backup nor a cluster.

Please always set up replication. The target doesn't need to be a big server; you just need the replicated snapshots to work correctly.
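If you want to see how little is actually needed, here's a minimal Python sketch of the snapshot-and-send loop a replication task boils down to. Everything in it is a placeholder: the dataset tank/vms, the target host replica.example.com and the target dataset backup/vms are assumptions, and the real FreeNAS replication tasks do the same job with far more care.

```python
#!/usr/bin/env python3
"""Minimal zfs send/receive replication loop (sketch; all names are hypothetical)."""
import subprocess
from datetime import datetime

SRC_DATASET = "tank/vms"                  # assumption: the local dataset to protect
DST_HOST = "root@replica.example.com"     # assumption: the replication target
DST_DATASET = "backup/vms"                # assumption: the dataset on the target pool

def zfs(*args):
    """Run a local zfs command and return its stdout."""
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout

def latest_snapshot(dataset):
    """Newest snapshot of the dataset itself, or None if there is none."""
    out = zfs("list", "-H", "-t", "snapshot", "-o", "name",
              "-s", "creation", "-d", "1", dataset)
    snaps = out.splitlines()
    return snaps[-1] if snaps else None

def replicate():
    previous = latest_snapshot(SRC_DATASET)
    new_snap = f"{SRC_DATASET}@repl-{datetime.now():%Y%m%d-%H%M%S}"
    zfs("snapshot", new_snap)

    # Incremental send if an older snapshot exists (and is already present
    # on the target), full send otherwise.
    send_cmd = (["zfs", "send", "-i", previous, new_snap]
                if previous else ["zfs", "send", new_snap])
    recv_cmd = ["ssh", DST_HOST, "zfs", "receive", "-F", DST_DATASET]

    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    recv = subprocess.run(recv_cmd, stdin=send.stdout)
    send.stdout.close()
    if send.wait() != 0 or recv.returncode != 0:
        raise RuntimeError("replication failed; keep the old snapshots around")

if __name__ == "__main__":
    replicate()
```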

 

2) Bad replication setup

Getting replication tasks set up perfectly requires a lot of tuning. Don't just verify once that it works: keep monitoring resource consumption, especially on mission-critical systems. Pay particular attention to limiting the transfer bandwidth to avoid throughput bottlenecks and iSCSI timeouts.

This is the result of setting the transfer bandwidth too high:

testnas04.betatechnologies.com kernel log messages:

> WARNING: 192.168.8.2 (iqn.2017-01.com.betatechnologies:testxen02): no ping reply (NOP-Out) after 5 seconds; dropping connection

 

Maybe you're also familiar with messages like these:

> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 22 29 00 00 08 00
> (10:4:0/1): Tag: 0x0056, type 1
> (10:4:0/1): WRITE(10). CDB: 2a 00 41 d8 47 9b 00 04 00 00
> (10:4:0/1): ctl_datamove: 254 seconds
> (10:4:0/1): Tag: 0x0001, type 1
> ctl_datamove: tag 0x0056 on (10:4:0) aborted
> (10:4:0/1): ctl_process_done: 261 seconds
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 21 e9 00 00 08 00
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 22 29 00 00 08 00
> (10:4:0/1): Tag: 0x0067, type 1
> (10:4:0/1): Tag: 0x0056, type 1
> (10:4:0/1): ctl_datamove: 254 seconds
> ctl_datamove: tag 0x0067 on (10:4:0) aborted
> (10:4:0/1): WRITE(10). CDB: 2a 00 28 16 4b 11 00 00 08 00
> (10:4:0/1): Tag: 0x0005, type 1
> (10:4:0/1): ctl_datamove: 254 seconds
> ctl_datamove: tag 0x0005 on (10:4:0) aborted

This is what happens when the server workload gets too high: the write operations take too long, time out and get aborted.
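Those warnings are easy to catch automatically before the initiators start dropping connections. Here's a rough sketch of a log watcher, assuming the usual /var/log/messages location; the patterns and the alert threshold are arbitrary:

```python
#!/usr/bin/env python3
"""Scan the kernel log for iSCSI/CTL distress signals (sketch; threshold is arbitrary)."""
import re
import sys

LOG_FILE = "/var/log/messages"                   # standard syslog location on FreeBSD/FreeNAS
PATTERNS = [
    re.compile(r"no ping reply \(NOP-Out\)"),    # initiator connection about to drop
    re.compile(r"ctl_datamove: \d+ seconds"),    # data move taking far too long
    re.compile(r"aborted"),                      # aborted CTL tags
]
ALERT_THRESHOLD = 5                              # assumption: hits before we complain

def count_hits():
    hits = 0
    with open(LOG_FILE, errors="replace") as log:
        for line in log:
            if any(p.search(line) for p in PATTERNS):
                hits += 1
    return hits

if __name__ == "__main__":
    hits = count_hits()
    if hits >= ALERT_THRESHOLD:
        print(f"WARNING: {hits} iSCSI/CTL warnings found, "
              "lower the replication bandwidth limit", file=sys.stderr)
        sys.exit(1)
    print(f"OK: {hits} warnings found")
```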

 

3) Using slow disks while expecting high performance

Have you ever wondered why iXSystems is building all-flash FreeNAS arrays? I think they understood that today's IT infrastructures require low latency and high performance.

Have you ever seen this kernel error message in your FreeNAS output?

> (10:4:0/1): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00

What happens when FreeNAS fails a WRITE operation? It keeps track of the failure and requeues the task in the ZIL.

The rock-solid architecture of ZFS makes it easy to survive this kind of problem, but you should monitor the WRITE failures and intervene, replacing slow disks (or slow RAID layouts, such as RAIDZ2) with better-performing ones.
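If watching the console isn't your thing, the error counters in zpool status tell the same story. A minimal sketch, assuming a hypothetical pool named tank (note that very large counters are abbreviated by zpool and ignored here):

```python
#!/usr/bin/env python3
"""List devices with non-zero READ/WRITE/CKSUM counters (sketch; pool name is an assumption)."""
import subprocess

POOL = "tank"   # assumption: your pool name

def devices_with_errors(pool):
    out = subprocess.run(["zpool", "status", pool], check=True,
                         capture_output=True, text=True).stdout
    bad = []
    for line in out.splitlines():
        fields = line.split()
        # Device/vdev lines look like: NAME STATE READ WRITE CKSUM
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            if any(int(f) > 0 for f in fields[2:]):
                bad.append(fields[0])
    return bad

if __name__ == "__main__":
    failing = devices_with_errors(POOL)
    if failing:
        print("Devices reporting errors:", ", ".join(failing))
    else:
        print("No error counters raised on", POOL)
```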

Be careful: the virtual machines don't know they need to wait and assume the writes really failed. This can damage your VM filesystem, as the guest operating system may remount the filesystem read-only and lose all the pending writes.

On Citrix XenServer hosts, Windows Server virtual machines (with the proper virtualization drivers installed) wait patiently at 100% CPU usage until the disk completes the writes successfully (within a reasonable timeout of 5 minutes). This gives better handling of pending-write events.

 

4) No snapshots

You need explanations? Really?

 

5) Too many snapshots

You don't need all those snapshots. It's just STUPID!

Just keep about 100 snapshots per dataset and ensure they're constantly replicated.
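A pruning job along those lines fits in a few lines of Python. Here's a sketch, assuming a hypothetical dataset tank/vms and, crucially, that replication has already caught up (never destroy a snapshot your replica still needs):

```python
#!/usr/bin/env python3
"""Keep only the newest N snapshots of a dataset (sketch; names and limit are assumptions)."""
import subprocess

DATASET = "tank/vms"   # assumption: the dataset to prune
KEEP = 100             # the ~100 snapshots suggested above

def list_snapshots(dataset):
    """Snapshots of the dataset itself, oldest first."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name",
         "-s", "creation", "-d", "1", dataset],
        check=True, capture_output=True, text=True).stdout
    return out.splitlines()

def prune(dataset, keep):
    snaps = list_snapshots(dataset)
    excess = snaps[:-keep] if len(snaps) > keep else []
    for snap in excess:
        # Only destroy snapshots that have already been replicated!
        subprocess.run(["zfs", "destroy", snap], check=True)
        print("destroyed", snap)

if __name__ == "__main__":
    prune(DATASET, KEEP)
```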

 

6) Abusing and misusing LAGG

Instead of abusing LAGG interfaces, buy better network cards.

 

7) Using cheap network cards

Don't ever use cheap network cards. Use Intel for 1G or Mellanox for 10G, and follow the FreeBSD hardware guidelines.

 

8) Bad or missing UPS configuration

FreeNAS embeds NUT (Network UPS Tools). You must configure and test it. Even if your datacenter is protected by power generators, you should always set up UPS communication, just in case of unexpected electrical failures.
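"Test it" means more than pulling the plug once. NUT's upsc client makes the UPS state scriptable, so you can keep checking it. A tiny sketch, assuming the UPS was registered in the FreeNAS UPS service under the (hypothetical) identifier ups:

```python
#!/usr/bin/env python3
"""Query NUT for UPS status via upsc (sketch; the UPS identifier is an assumption)."""
import subprocess
import sys

UPS = "ups@localhost"   # assumption: identifier configured in the FreeNAS UPS service

def ups_variables(ups):
    """Return the key/value pairs reported by `upsc`."""
    out = subprocess.run(["upsc", ups], check=True,
                         capture_output=True, text=True).stdout
    pairs = (line.split(":", 1) for line in out.splitlines() if ":" in line)
    return {k.strip(): v.strip() for k, v in pairs}

if __name__ == "__main__":
    values = ups_variables(UPS)
    status = values.get("ups.status", "UNKNOWN")
    charge = values.get("battery.charge", "?")
    print(f"status={status} charge={charge}%")
    if "OL" not in status:          # OL = running on line power
        print("UPS is not on line power!", file=sys.stderr)
        sys.exit(1)
```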

 

9) Using RAID-Z for virtualization

You should avoid RAID-Z setups in virtualized environments because of their poor performance. The FreeNAS wizard even tries to warn you about it.

 

10) Configuring oversized zpools

Avoid configuring oversized zpools unless you really need them. Resilvering times and performance will degrade A LOT. Also keep in mind that ZFS fragmentation keeps growing over time: you may need to move your data between different zpools from time to time.
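Fragmentation and capacity are both one zpool list away, so there's no excuse for being surprised. A small sketch with arbitrary warning thresholds:

```python
#!/usr/bin/env python3
"""Report zpool fragmentation and capacity (sketch; thresholds are arbitrary)."""
import subprocess

FRAG_WARN = 50   # assumption: % fragmentation worth worrying about
CAP_WARN = 80    # assumption: % capacity where performance starts to hurt

def pools():
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,frag,cap"],
        check=True, capture_output=True, text=True).stdout
    for line in out.splitlines():
        name, frag, cap = line.split("\t")
        frag = int(frag.rstrip("%")) if frag != "-" else 0
        cap = int(cap.rstrip("%"))
        yield name, frag, cap

if __name__ == "__main__":
    for name, frag, cap in pools():
        flag = " <-- consider moving data" if frag >= FRAG_WARN or cap >= CAP_WARN else ""
        print(f"{name}: fragmentation {frag}%, capacity {cap}%{flag}")
```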

 

11) Misconfigured or unconfigured email reporting system

This should be considered one of the most important things to set up correctly when deploying a FreeNAS system. You'll receive a lot of diagnostic information every day.
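Besides clicking the test-mail button in the GUI, it's worth proving the SMTP path end to end from a script once in a while. A hedged sketch; the relay and the addresses are hypothetical and should match whatever FreeNAS is configured to use:

```python
#!/usr/bin/env python3
"""Send a test message through your SMTP relay (sketch; server and addresses are hypothetical)."""
import smtplib
from email.message import EmailMessage

SMTP_HOST = "mail.example.com"      # assumption: the relay FreeNAS is configured to use
SMTP_PORT = 25                      # assumption: adjust for submission/TLS setups
FROM_ADDR = "freenas@example.com"   # assumption
TO_ADDR = "admin@example.com"       # assumption

def send_test_mail():
    msg = EmailMessage()
    msg["Subject"] = "FreeNAS email reporting test"
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    msg.set_content("If you can read this, the reporting mail path works.")
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    send_test_mail()
```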

 

12) Using a single IP Address

Always use multiple IP addresses in different subnets to split the traffic and separate the clients.

 

13) Using a single VLAN

Same as point 12 but with a security flavour.

 

14) The missing crash plan

Still haven't prepared yours?

What... if...?

 

15) No documentation

Many sysadmins don't write adequate documentation about their deployments. This eventually turns into a catastrophe the moment you're on vacation or unreachable.